csvm is a multithreaded csv manipulation tool written in C++.
- Needs a C++17 compiler.
- Currently works only on ASCII input.
- Provides a simple high-level scripting language for defining manipulations, implemented using the PEGTL framework.
- Has partial unit test coverage using the Catch2 framework.
- Tested on Fedora and Manjaro. It should run on most Linux-based systems, but might need some changes to run on other operating systems.
- Currently needs explicit to_num and to_str conversions for numeric comparisons or numeric sorts. This will be improved later.
- Some benchmarks here.
It can only do a few things on csv files currently, but adding more features should be straightforward. The commands it currently has are:
- cols: rearranges the columns in the file.
- select: keeps the rows that match some criteria and filters out the rest.
- sort: sorts the file based on some conditions. Uses tmp files for very large files.
You can provide multiple commands in a single run of csvm. It currently expects the first line of the file to be a header line; a command to specify the header might be added later.
- checkout using:
git clone https://github.com/shsms/csvm.git && cd csvm
- init submodules (currently CLI11 for parsing cli args and PEGTL for parsing commands and csv files.)
make init
- build using
make
- install to local bin directory. (depends on systemd)
make install
- Read from stdin, keep only 3 columns in given order, drop the rest, write to stdout:
cat input.csv | csvm "cols(id, fieldA, countZ)"
- Read from stdin, drop these columns, keep the rest, write to file:
cat input.csv | csvm -o output.csv "!cols(fieldA)"
- Read from file, keep only rows that match criteria, write to stdout:
csvm -f input.csv "select(fieldA == 't' && countZ != '0')"
- For numeric comparisons:
csvm -f input.csv "to_num(countA, countZ); select(fieldA == 't' && (countZ > 0 || countA > 0)); to_str(countA, countZ);"
- Filter by a field, then drop that field:
csvm -f input.csv "select(fieldA == 't'); !cols(fieldA)"
- Filter by a field, then sort forward by fieldA and in reverse by fieldB:
csvm -f input.csv "select(fieldA != 't'); sort(fieldA, fieldB:r)"
- Numeric filter and numeric reverse sort:
csvm -f input.csv "to_num(countA); select(countA > 0); sort(countA:r); to_str(countA)"
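The explicit conversions matter because, without to_num, field values are presumably compared as strings, so "10" orders before "9". The Python sketch below is not csvm code. It only mimics, on made-up in-memory rows, what the last example above is described as doing: convert, filter, then reverse numeric sort. The sample data and the use of float for numbers are assumptions made for illustration, and the final to_str step is left out.

```python
import csv
import io

# Hypothetical input; csvm itself would read this from a file or stdin.
raw = "id,countA\na,9\nb,10\nc,0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Without conversion, values are strings and compare lexicographically.
as_strings = sorted((r["countA"] for r in rows), reverse=True)

# Rough analogue of: to_num(countA); select(countA > 0); sort(countA:r)
for r in rows:
    r["countA"] = float(r["countA"])                 # to_num (float is an assumption)
kept = [r for r in rows if r["countA"] > 0]          # select(countA > 0)
kept.sort(key=lambda r: r["countA"], reverse=True)   # sort(countA:r)
as_numbers = [r["id"] for r in kept]

print(as_strings)  # ['9', '10', '0']
print(as_numbers)  # ['b', 'a']
```

In the string ordering "9" comes out ahead of "10", which is why the numeric examples wrap select and sort in to_num and to_str.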
At the moment, asking csvm to use additional threads is straightforward only when you are not using sort. For example,
csvm -n 4 -f input.csv "select(fieldA == 't'); !cols(fieldA)"
would use 4 threads to do the actual work. (There are also the input and output threads. They don't use a lot of CPU and are there just to synchronize the worker threads.)
sort commands run in their own separate threads. When you add -n 4, sort creates 4 new threads for sorting and, for large input files (>32MB), uses 4 additional threads to save to and retrieve from tmp files.
So the command below would have 12 active threads:
csvm -n 4 -f input.csv "cols(fieldA); sort(fieldB);"
When using just the sort command, csvm would still use 12 threads; the first 4 would be used only for parsing the input csv into the internal representation.
This will change: optimizing for the given number of threads will come later.
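The thread counts above can be written out as simple arithmetic. This is a sketch of the description only, not of csvm's scheduler. The function name and parameters are hypothetical, and the count appears to exclude the input and output threads, as the figure of 12 does.

```python
def active_threads(n, uses_sort, large_input=True):
    """Estimate active threads from the counts described in the text.

    Hypothetical helper: excludes the input and output threads.
    """
    threads = n              # threads for non-sort commands / csv parsing
    if uses_sort:
        threads += n         # threads created by sort for sorting
        if large_input:      # per the text, inputs larger than 32MB
            threads += n     # threads that save to / retrieve from tmp files
    return threads

print(active_threads(4, uses_sort=False))  # 4
print(active_threads(4, uses_sort=True))   # 12
```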
The --print-engine argument displays the stages csvm would use. For example, this command:
bin/csvm --print-engine -n 4 -f input.csv 'to_num(colA); sort(colA); select(colB == "t"); to_str(colA);'
would print:
stage: 1 (exec_order: 0)
1.1 to_num:
5 : colA
stage: 2 (exec_order: 2)
2.1 sort:
5 : colA
stage: 3 (exec_order: 0)
3.1 select:
colB t ==
3.2 to_str:
5 : colA
(5 is the position of colA in the input file.)
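Judging only from the output above, consecutive non-sort commands seem to share a stage while each sort gets a stage of its own. The Python sketch below reproduces that grouping for the example command. It is an inference from this one output, not csvm's actual planning logic; the function name is hypothetical and the exec_order values are not modelled.

```python
def split_into_stages(commands):
    """Group consecutive non-sort commands; put each sort in its own stage."""
    stages, current = [], []
    for cmd in commands:
        if cmd == "sort":
            if current:              # close the stage collected so far
                stages.append(current)
                current = []
            stages.append([cmd])     # sort stands alone
        else:
            current.append(cmd)
    if current:
        stages.append(current)
    return stages

print(split_into_stages(["to_num", "sort", "select", "to_str"]))
# [['to_num'], ['sort'], ['select', 'to_str']]
```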