Extracts a random subset from a CSV data file while preserving the header (unless you don't want that).
RcsvExtract counts the lines in a given CSV data file, extracts a subset of a chosen size and optionally saves the generated subset to a new file.
Calling RcsvExtract with a path as the only parameter will tell you the line count and quit:
./ExtractRandomSubSet.rb some/file.csv
Note: If ExtractRandomSubSet.rb is not executable on your system, either make it executable with chmod +x ExtractRandomSubSet.rb or call ruby ExtractRandomSubSet.rb some/file.csv instead.
Adding a number as the second parameter generates a random subset of that size and prints it to your console:
./ExtractRandomSubSet.rb some/file.csv 3000
If you would rather save your subset to a new CSV file, just add a third parameter specifying its name like this:
./ExtractRandomSubSet.rb some/file.csv 3000 my/subset.csv
RcsvExtract expects to find a CSV header in the first line and by default preserves this line in your subset. If your data does not have a header, add the no_header flag to have RcsvExtract ignore the first line in any case.
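To illustrate the header handling described above, here is a minimal sketch of drawing a random subset while keeping the first line. It is not taken from ExtractRandomSubSet.rb: the helper name sample_csv_lines and its header: and random: parameters are hypothetical, and the real script may work quite differently.

```ruby
# Hypothetical helper, not part of ExtractRandomSubSet.rb.
# Returns the header line (if requested) followed by up to `size` randomly
# chosen data lines, kept in their original file order.
def sample_csv_lines(lines, size, header: true, random: Random.new)
  # With a header, line 0 is reserved and sampling starts at line 1.
  first_data = header ? 1 : 0
  candidates = (first_data...lines.length).to_a

  # Array#sample picks distinct elements; sorting restores file order.
  chosen = candidates.sample(size, random: random).sort
  subset = chosen.map { |i| lines[i] }

  header && !lines.empty? ? [lines.first] + subset : subset
end

lines = ["id,name\n", "1,a\n", "2,b\n", "3,c\n", "4,d\n"]
puts sample_csv_lines(lines, 2)
```

In this sketch the header is never a sampling candidate, so a subset of size 2 always comes back as three lines: the header plus two data rows.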
no_header parameter set: generate_indices needs to include line 0 instead of ignoring it. Move the no_header switch from generate_subset to generate_indices to make it happen.
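One way the change described above might look, as a sketch only: the actual signature of generate_indices is not shown in this README, so the method name generate_indices_sketch and the parameters line_count, subset_size and no_header are assumptions made for illustration.

```ruby
# Sketch under assumptions: the real generate_indices may take different
# arguments. Here the no_header switch decides whether line 0 can be drawn.
def generate_indices_sketch(line_count, subset_size, no_header = false)
  # Without a header, line 0 is ordinary data and becomes a candidate.
  first = no_header ? 0 : 1
  (first...line_count).to_a.sample(subset_size).sort
end

p generate_indices_sketch(10, 3)        # never contains 0
p generate_indices_sketch(10, 3, true)  # may contain 0
```

Handling the switch at index generation means the later subset-building step would not need to know whether a header exists.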
Copyright (c) 2010 Julian Schrader, http://julianschrader.com
See LICENSE for details.