csvm

csvm is a multithreaded csv manipulation tool written in C++.

Current status

Needs a C++17 compiler.
Currently works only on ASCII input.
Provides a simple high-level scripting language for defining manipulations, implemented using the PEGTL framework.
Has partial unit test coverage using the Catch2 framework.
Tested on Fedora and Manjaro. It should run on most Linux-based systems and might need some changes to run on other operating systems.
Numeric comparisons and numeric sorts currently need explicit to_num and to_str conversions; this will be improved later.
Some benchmarks are in benchmarks.org.
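
The need for to_num mentioned above comes down to text ordering versus numeric ordering. The following is a minimal sketch, not csvm's code; less_as_text and less_as_number are hypothetical helpers that compare two fields first as text and then as numbers:

```cpp
#include <string>

// Hypothetical helper, not part of csvm: compares two csv fields as text.
bool less_as_text(const std::string& a, const std::string& b) {
    return a < b;  // lexicographic, character by character
}

// Hypothetical helper, not part of csvm: compares two csv fields after
// converting them to numbers. std::stod throws if a field is not numeric.
bool less_as_number(const std::string& a, const std::string& b) {
    return std::stod(a) < std::stod(b);
}
```

As text, "10" sorts before "9" because '1' comes before '9'. As numbers, 9 comes before 10, which is the behaviour a numeric select or sort needs.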

Features

csvm can currently do only a few things with csv files, but adding features should be straightforward. The current commands are:

cols: to rearrange the columns in the file.
select: to keep some of the rows based on some criteria and filter out the rest.
sort: to sort the file based on some conditions. Uses tmp files for very large files.

You can provide multiple commands in a single run of csvm. It currently expects the first line of the file to be a header line. Later there might be a command to specify it.
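
Since commands refer to columns by name, the header line has to be turned into a name-to-position lookup at some point. The sketch below shows one way that could look; header_index is a hypothetical helper, it assumes zero-based positions, and it splits on commas only, without handling quoted fields:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper, not csvm's actual code: maps each column name in the
// header line to its zero-based position. Splits on ',' only, so quoted
// fields that contain commas are not supported in this sketch.
std::unordered_map<std::string, std::size_t> header_index(const std::string& header) {
    std::unordered_map<std::string, std::size_t> index;
    std::istringstream in(header);
    std::string name;
    std::size_t pos = 0;
    while (std::getline(in, name, ',')) {
        index[name] = pos++;
    }
    return index;
}
```

For a header line id,fieldA,countZ this gives id at position 0, fieldA at 1, and countZ at 2.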

Build instructions

checkout using:

git clone https://github.com/shsms/csvm.git && cd csvm

init submodules (currently CLI11 for parsing cli args and PEGTL for parsing commands and csv files):

make init

build using:

make

install to local bin directory (depends on systemd):

make install

Usage examples

Read from stdin, keep only 3 columns in given order, drop the rest, write to stdout:

cat input.csv | csvm "cols(id, fieldA, countZ)"

Read from stdin, drop these columns, keep the rest, write to file:

cat input.csv | csvm -o output.csv "!cols(fieldA)"
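
The difference between cols(...) and !cols(...) can be sketched on a single row as follows. This is an illustration only; project is a hypothetical helper that takes column positions rather than names, and it is not taken from csvm's source:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical helper: with invert == false, returns the listed columns in
// the listed order (like cols). With invert == true, drops the listed
// columns and keeps the rest in their original order (like !cols).
Row project(const Row& row, const std::vector<std::size_t>& cols, bool invert) {
    Row out;
    if (!invert) {
        for (std::size_t c : cols) out.push_back(row[c]);
        return out;
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (std::find(cols.begin(), cols.end(), i) == cols.end()) {
            out.push_back(row[i]);
        }
    }
    return out;
}
```

With a row of four fields, project(row, {3, 0}, false) returns the fourth field followed by the first, while project(row, {1}, true) returns every field except the second.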

Read from file, keep only rows that match criteria, write to stdout:

csvm -f input.csv "select(fieldA == 't' && countZ != '0')"

For numeric comparisons:

csvm -f input.csv "to_num(countA, countZ); select(fieldA == 't' && (countZ > 0 || countA > 0)); to_str(countA, countZ);"

Filter by a field, then drop that field:

csvm -f input.csv "select(fieldA == 't'); !cols(fieldA)"

Filter by a field, forward sort by ‘fieldA’, reverse sort by ‘fieldB’:

csvm -f input.csv "select(fieldA != 't'); sort(fieldA, fieldB:r)"
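
A sort with several keys, some of them reversed, can be pictured as a single comparator that walks the keys in order. The sketch below assumes text comparison and in-memory rows; SortKey and sort_rows are hypothetical names, and csvm's own sort (which also uses tmp files for large inputs) may work differently:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical sort key: a column position plus a flag for reverse order,
// roughly what "fieldB:r" expresses in a sort command.
struct SortKey {
    std::size_t column;
    bool reverse;
};

// Sorts rows by each key in turn; later keys only break ties on earlier ones.
void sort_rows(std::vector<Row>& rows, const std::vector<SortKey>& keys) {
    std::stable_sort(rows.begin(), rows.end(), [&keys](const Row& a, const Row& b) {
        for (const SortKey& k : keys) {
            const int c = a[k.column].compare(b[k.column]);
            if (c != 0) {
                return k.reverse ? c > 0 : c < 0;
            }
        }
        return false;  // rows are equal on all keys
    });
}
```

Calling sort_rows(rows, {{0, false}, {1, true}}) orders rows by the first column ascending and, within equal values, by the second column descending.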

Numeric filter and numeric reverse sort:

csvm -f input.csv "to_num(countA); select(countA > 0); sort(countA:r); to_str(countA)"

Threading

At the moment, asking csvm to use additional threads is straightforward only when you are not using sort. For example,

csvm -n 4 -f input.csv "select(fieldA == 't'); !cols(fieldA)"

would use 4 threads to do the actual work. (There are also input and output threads; they don’t use much CPU and exist only to synchronize the worker threads.)
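
Row-wise commands such as select parallelize easily because each row can be handled independently. The following is a rough sketch of that idea and not csvm's implementation: parallel_select is a hypothetical function that gives each of n workers one contiguous slice of the rows and joins the results in input order.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical sketch: filters rows with n worker threads. Each worker reads
// its own slice and writes to its own bucket, so no locking is needed.
std::vector<Row> parallel_select(const std::vector<Row>& rows, std::size_t n,
                                 const std::function<bool(const Row&)>& keep) {
    n = std::max<std::size_t>(n, 1);
    std::vector<std::vector<Row>> buckets(n);
    std::vector<std::thread> workers;
    const std::size_t chunk = (rows.size() + n - 1) / n;  // ceiling division
    for (std::size_t w = 0; w < n; ++w) {
        workers.emplace_back([&, w] {
            const std::size_t begin = std::min(w * chunk, rows.size());
            const std::size_t end = std::min(begin + chunk, rows.size());
            for (std::size_t i = begin; i < end; ++i) {
                if (keep(rows[i])) buckets[w].push_back(rows[i]);
            }
        });
    }
    for (std::thread& t : workers) t.join();
    // Concatenate the buckets in worker order to preserve the input order.
    std::vector<Row> out;
    for (const auto& b : buckets) out.insert(out.end(), b.begin(), b.end());
    return out;
}
```

A sort cannot be split this way, because its output depends on all rows at once, which is in line with sort getting separate threads of its own.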

sort commands run in their own separate threads. When you add -n 4, sort creates 4 new threads to sort and, for large input files (>32MB), uses 4 additional threads to save to and retrieve from tmp files.

So the command below would have 12 active threads:

csvm -n 4 -f input.csv "cols(fieldA); sort(fieldB);"

When using just the sort command, csvm would still use 12 threads; the first 4 would be used just for parsing the input csv into the internal representation.

This will change; optimizing for the given number of threads will come later.
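
The thread counts described in this section can be restated as a small calculation. active_threads is a hypothetical helper, and it assumes the figure of 12 refers to a large input (>32MB) and leaves out the input and output threads:

```cpp
#include <cstddef>

// Hypothetical helper restating the counts from the text: n worker threads,
// plus n sort threads when a sort command is present, plus n tmp-file
// threads when the input is large. Input and output threads are not counted.
std::size_t active_threads(std::size_t n, bool has_sort, bool large_input) {
    std::size_t total = n;
    if (has_sort) {
        total += n;
        if (large_input) total += n;
    }
    return total;
}
```

With -n 4, a pipeline without sort gives 4, and a pipeline with sort on a large file gives 4 + 4 + 4 = 12.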

The --print-engine argument displays the stages csvm would use. For example, this command:

bin/csvm --print-engine -n 4 -f input.csv 'to_num(colA); sort(colA); select(colB == "t"); to_str(colA);'

would print:

stage: 1 (exec_order: 0)
1.1 to_num:
        5 : colA

stage: 2 (exec_order: 2)
2.1 sort:
        5 : colA

stage: 3 (exec_order: 0)
3.1 select:
        colB t ==

3.2 to_str:
        5 : colA

(5 is the position of colA in the input file.)
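
The select line in the output above, colB t ==, reads like the expression in postfix order (operands first, operator last). That is an interpretation of the printed output and not something the README states. Under that assumption, evaluating such an expression against one record could be sketched with a stack; Token, Record and eval_postfix are hypothetical names:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical token type for this sketch; csvm's real representation is
// not shown in the README.
struct Token {
    enum Kind { Column, Literal, Op } kind;
    std::string text;
};

using Record = std::unordered_map<std::string, std::string>;

// Evaluates a postfix expression such as {colB, t, ==} against one record.
// Supports ==, !=, && and ||. Boolean results are kept on the stack as "1"
// or "0", which is a simplification made for this sketch. The expression is
// assumed to be well formed: every operator finds two operands on the stack.
bool eval_postfix(const std::vector<Token>& expr, const Record& rec) {
    std::vector<std::string> stack;
    for (const Token& tok : expr) {
        if (tok.kind == Token::Column) {
            stack.push_back(rec.at(tok.text));  // look up the field value
        } else if (tok.kind == Token::Literal) {
            stack.push_back(tok.text);
        } else {
            const std::string rhs = stack.back(); stack.pop_back();
            const std::string lhs = stack.back(); stack.pop_back();
            bool result = false;
            if (tok.text == "==") result = lhs == rhs;
            else if (tok.text == "!=") result = lhs != rhs;
            else if (tok.text == "&&") result = lhs == "1" && rhs == "1";
            else if (tok.text == "||") result = lhs == "1" || rhs == "1";
            stack.push_back(result ? "1" : "0");
        }
    }
    return !stack.empty() && stack.back() == "1";
}
```

For a record where colB is t, the expression colB t == evaluates to true, so the row would be kept.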
