Skip to content

Latest commit

 

History

History
112 lines (95 loc) · 4.28 KB

README.org

File metadata and controls

112 lines (95 loc) · 4.28 KB

csvm

csvm is a multithreaded csv manipulation tool written in C++.

Current status

  • Needs a C++17 compiler.
  • Currently works only on ASCII input.
  • Provides a simple high level scripting language for defining manipulations, implemented using the PEGTL framework.
  • Has partial unit test coverage using the Catch2 framework.
  • Tested on Fedora and Manjaro, but should run on most Linux based OS, might need some changes to be able to run on other OS.
  • Currently needs explicit to_num and to_str conversions for numeric comparisons or numeric sort, will be improved later.
  • Some benchmarks here.

Features

It only a few things it can do on csv files currently, but adding additional features should be straight forward. The commands it has currently are:

cols
to rearrange the columns in the file.
select
to keep some of the rows based on some criteria and filter out the rest.
sort
to sort the file based on some conditions. uses tmp files for very large files.

You can provide multiple commands in a single run of csvm. It currently expects the first line of the file to be a header line. Later there might be a command to specify it.

Build instructions

  • checkout using:
git clone https://github.com/shsms/csvm.git && cd csvm
  • init submodules (currently CLI11 for parsing cli args and PEGTL for parsing commands and csv files.)
make init
  • build using
make
  • install to local bin directory. (depends on systemd)
make install

Usage examples

  • Read from stdin, keep only 3 columns in given order, drop the rest, write to stdout:
cat input.csv | csvm "cols(id, fieldA, countZ)"
  • Read from stdin, drop these columns, keep the rest, write to file:
cat input.csv | csvm -o output.csv "!cols(fieldA)"
  • Read from file, keep only rows that match criteria, write to stdout:
csvm -f input.csv "select(fieldA == 't' && countZ != '0')"
  • For numeric comparisons:
csvm -f input.csv "to_num(countA, countZ); select(fieldA == 't' && (countZ > 0 || countA > 0)); to_str(countA, countZ);"
  • Filter by a field, then drop that field:
csvm -f input.csv "select(fieldA == 't'); !cols(fieldA)"
  • filter by a field, forward sort by ‘fieldA’, reverse sort by ‘fieldB’:
csvm -f input.csv "select(fieldA != 't'); sort(fieldA, fieldB:r)"
  • numeric filter and numeric reverse sort:
csvm -f input.csv "to_num(countA); select(countA > 0); sort(countA:r); to_str(countA)"

Threading

At the moment, asking csvm to use additional threads is straight forward only when you are not using sort. For example,

csvm -n 4 -f input.csv "select(fieldA == 't'); !cols(fieldA)"

would use 4 threads to do the actual work. (there’s also the input and output threads - those don’t use a lot of CPU, they are there just to synchronize the worker threads.)

sort commands run in their own separate threads. When you add -n 4, sort creates 4 new threads to sort and for large input files(>32MB), it uses 4 additional threads to save to/retrieve from tmp files.

So the below command would have 12 active threads:

csvm -n 4 -f input.csv "cols(fieldA); sort(fieldB);"

When using just the sort command, csvm would still use 12 threads, the first 4 will be used just for parsing the input csv into internal representation.

This will change, optimizing for the given number of threads will come later.

The --print-engine argument would display the stages csvm would use. For example, this command:

bin/csvm --print-engine -n 4 -f input.csv 'to_num(colA); sort(colA); select(colB == "t"); to_str(colA);'

would print:

stage: 1 (exec_order: 0)
1.1 to_num:
        5 : colA

stage: 2 (exec_order: 2)
2.1 sort:
        5 : colA

stage: 3 (exec_order: 0)
3.1 select:
        colB t ==

3.2 to_str:
        5 : colA

(5 is the position of colA in the input file.)