𝌠 μDSV

A faster CSV parser in 5KB (min) (MIT Licensed)


Introduction

uDSV is a fast JS library for parsing well-formed CSV strings, either from memory or incrementally from disk or network. It is mostly RFC 4180 compliant, with support for quoted values containing commas, escaped quotes, and line breaks¹. The aim of this project is to handle the 99.5% use-case without adding complexity and performance trade-offs to support the remaining 0.5%.

¹ Line breaks (\n, \r, \r\n) within quoted values must match the row separator.
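
For example, a quoted value containing a comma, an escaped ("") quote, and a line break parses as a single field. A minimal sketch, using the inferSchema/initParser API covered in Basic Usage below:

import { inferSchema, initParser } from 'udsv';

// quoted value with an embedded comma, a doubled-quote escape, and a line break
let csvStr = 'id,note\n1,"a, ""b""\nc"';

let parser = initParser(inferSchema(csvStr));

parser.stringArrs(csvStr); // [ ['1', 'a, "b"\nc'] ]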


Features

What does uDSV pack into 5KB?

  • RFC 4180 compliant
  • Incremental or full parsing, with optional accumulation
  • Auto-detection and customization of delimiters (rows, columns, quotes, escapes)
  • Schema inference and value typing: string, number, boolean, date, json
  • Defined handling of '', 'null', 'NaN'
  • Whitespace trimming of values & skipping empty lines
  • Multi-row header skipping and column renaming
  • Multiple outputs: arrays (tuples), objects, nested objects, columnar arrays

Of course, most of these are table stakes for CSV parsers :)


Performance

Is it Lightning Fast™ or Blazing Fast™?

No, those are too slow! uDSV has Ludicrous Speed™; it's faster than the parsers you recognize and faster than those you've never heard of.

Most CSV parsers have one happy/fast path -- the one without quoted values, without value typing, and only when using the default settings & output format. Once you're off that path, you can generally throw any self-promoting benchmarks in the trash. In contrast, uDSV remains fast with any dataset and all options; its happy path is every path.

On a Ryzen 7 ThinkPad, Linux v6.13.3, and NodeJS v22.14.0, a diverse set of benchmarks shows a 1x-5x performance boost relative to the popular and proven-fast Papa Parse.

For way too many synthetic and real-world benchmarks, head over to /bench...and don't forget your coffee!

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ customers-100000.csv (17 MB, 12 cols x 100K rows)                        (parsing to strings) │
├────────────────────────┬────────┬─────────────────────────────────────────────────────────────┤
│ Name                   │ Rows/s │ Throughput (MiB/s)                                          │
├────────────────────────┼────────┼─────────────────────────────────────────────────────────────┤
│ csv-simple-parser      │ 1.45M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 240 │
│ uDSV                   │ 1.39M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 230   │
│ PapaParse              │ 1.13M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 187             │
│ tiddlycsv              │ 1.09M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 180              │
│ ACsv                   │ 1.07M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 176               │
│ but-csv                │ 1.05M  │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 174                │
│ d3-dsv                 │ 987K   │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 163                  │
│ csv-rex                │ 887K   │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 147                      │
│ csv42                  │ 781K   │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 129                          │
│ achilles-csv-parser    │ 687K   │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░ 114                             │
│ arquero                │ 567K   │ ░░░░░░░░░░░░░░░░░░░░░░░ 93.6                                │
│ comma-separated-values │ 545K   │ ░░░░░░░░░░░░░░░░░░░░░ 90                                    │
│ node-csvtojson         │ 456K   │ ░░░░░░░░░░░░░░░░░░ 75.3                                     │
│ @vanillaes/csv         │ 427K   │ ░░░░░░░░░░░░░░░░░ 70.5                                      │
│ SheetJS                │ 415K   │ ░░░░░░░░░░░░░░░░ 68.5                                       │
│ csv-parser (neat-csv)  │ 307K   │ ░░░░░░░░░░░░ 50.7                                           │
│ CSVtoJSON              │ 297K   │ ░░░░░░░░░░░░ 49.1                                           │
│ dekkai                 │ 221K   │ ░░░░░░░░░ 36.5                                              │
│ csv-js                 │ 206K   │ ░░░░░░░░ 34.1                                               │
│ @gregoranders/csv      │ 202K   │ ░░░░░░░░ 33.3                                               │
│ csv-parse/sync         │ 177K   │ ░░░░░░░ 29.3                                                │
│ jquery-csv             │ 155K   │ ░░░░░░ 25.6                                                 │
│ @fast-csv/parse        │ 114K   │ ░░░░░ 18.9                                                  │
│ utils-dsv-base-parse   │ 74.3K  │ ░░░ 12.3                                                    │
└────────────────────────┴────────┴─────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ customers-100000.csv (17 MB, 12 cols x 100K rows)                        (parsing with types) │
├────────────────────────┬────────┬────────────────────────────────────────┬────────────────────┤
│ Name                   │ Rows/s │ Throughput (MiB/s)                     │ Types              │
├────────────────────────┼────────┼────────────────────────────────────────┼────────────────────┤
│ uDSV                   │ 993K   │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 164 │ date,number,string │
│ csv42                  │ 686K   │ ░░░░░░░░░░░░░░░░░░░░░░░░ 113           │ number,string      │
│ csv-simple-parser      │ 666K   │ ░░░░░░░░░░░░░░░░░░░░░░░ 110            │ date,number,string │
│ csv-rex                │ 627K   │ ░░░░░░░░░░░░░░░░░░░░░░ 104             │ number,string      │
│ comma-separated-values │ 536K   │ ░░░░░░░░░░░░░░░░░░░ 88.5               │ number,string      │
│ achilles-csv-parser    │ 517K   │ ░░░░░░░░░░░░░░░░░░ 85.3                │ number,string      │
│ arquero                │ 478K   │ ░░░░░░░░░░░░░░░░░ 79                   │ date,number,string │
│ PapaParse              │ 463K   │ ░░░░░░░░░░░░░░░░ 76.4                  │ number,string      │
│ d3-dsv                 │ 389K   │ ░░░░░░░░░░░░░░ 64.3                    │ date,number,string │
│ @vanillaes/csv         │ 312K   │ ░░░░░░░░░░░ 51.5                       │ NaN,number,string  │
│ CSVtoJSON              │ 284K   │ ░░░░░░░░░░ 46.8                        │ number,string      │
│ csv-parser (neat-csv)  │ 265K   │ ░░░░░░░░░░ 43.7                        │ number,string      │
│ csv-js                 │ 211K   │ ░░░░░░░░ 34.8                          │ number,string      │
│ dekkai                 │ 209K   │ ░░░░░░░░ 34.6                          │ number,string      │
│ csv-parse/sync         │ 101K   │ ░░░░ 16.7                              │ date,number,string │
│ SheetJS                │ 64.5K  │ ░░░ 10.7                               │ number,string      │
└────────────────────────┴────────┴────────────────────────────────────────┴────────────────────┘

Installation

npm i udsv

or

<script src="./dist/uDSV.iife.min.js"></script>

API

A 150 LoC uDSV.d.ts TypeScript def.


Basic Usage

import { inferSchema, initParser } from 'udsv';

let csvStr = 'a,b,c\n1,2,3\n4,5,6';

let schema = inferSchema(csvStr);
let parser = initParser(schema);

// native format (fastest)
let stringArrs = parser.stringArrs(csvStr); // [ ['1','2','3'], ['4','5','6'] ]

// typed formats (internally converted from native)
let typedArrs  = parser.typedArrs(csvStr);  // [ [1, 2, 3], [4, 5, 6] ]
let typedObjs  = parser.typedObjs(csvStr);  // [ {a: 1, b: 2, c: 3}, {a: 4, b: 5, c: 6} ]
let typedCols  = parser.typedCols(csvStr);  // [ [1, 4], [2, 5], [3, 6] ]

let stringObjs = parser.stringObjs(csvStr); // [ {a: '1', b: '2', c: '3'}, {a: '4', b: '5', c: '6'} ]
let stringCols = parser.stringCols(csvStr); // [ ['1', '4'], ['2', '5'], ['3', '6'] ]

Nested/deep objects can be reconstructed from column naming via .typedDeep():

// deep/nested objects (from column naming)
let csvStr2 = `
_type,name,description,location.city,location.street,location.geo[0],location.geo[1],speed,heading,size[0],size[1],size[2]
item,Item 0,Item 0 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,3.4,5.1,0.9
`.trim();

let schema2 = inferSchema(csvStr2);
let parser2 = initParser(schema2);

let typedDeep = parser2.typedDeep(csvStr2);

/*
[
  {
    _type: 'item',
    name: 'Item 0',
    description: 'Item 0 description in text',
    location: {
      city: 'Rotterdam',
      street: 'Main street',
      geo: [ 51.9280712, 4.4207888 ]
    },
    speed: 5.4,
    heading: 128.3,
    size: [ 3.4, 5.1, 0.9 ],
  }
]
*/

CSP Note:

uDSV uses dynamically-generated functions (via new Function()) for its .typed*() methods. These functions are lazily generated and guard against code injection via JSON.stringify(), so the risk should be minimal. Nevertheless, if you have strict CSP headers without unsafe-eval, you won't be able to take advantage of the typed methods and will have to do the type conversion from the string tuples yourself.
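
In that case, a hand-rolled conversion over the string tuples might look like this -- a minimal sketch that assumes the column types are known up front rather than read from the inferred schema:

import { inferSchema, initParser } from 'udsv';

let csvStr = 'a,b,c\n1,2,3\n4,5,6';
let parser = initParser(inferSchema(csvStr));

// parse to string tuples first...
let stringArrs = parser.stringArrs(csvStr);

// ...then convert manually (here, all columns are assumed numeric)
let typedArrs = stringArrs.map(row => [Number(row[0]), Number(row[1]), Number(row[2])]);
// [ [1, 2, 3], [4, 5, 6] ]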


Incremental / Streaming

uDSV has no inherent knowledge of streams. Instead, it exposes a generic incremental parsing API to which you can pass sequential chunks. These chunks can come from various sources, such as a Web Stream or Node stream via fetch() or fs, a WebSocket, etc.

Here's what it looks like with Node's fs.createReadStream():

let stream = fs.createReadStream(filePath);

let parser = null;
let result = null;

stream.on('data', (chunk) => {
  // convert from Buffer
  let strChunk = chunk.toString();
  // on first chunk, infer schema and init parser
  parser ??= initParser(inferSchema(strChunk));
  // incremental parse to string arrays
  parser.chunk(strChunk, parser.stringArrs);
});

stream.on('end', () => {
  result = parser.end();
});

...and Web streams in Node, or Fetch's Response.body:

let stream = fs.createReadStream(filePath);

let webStream = Stream.Readable.toWeb(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();
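
The same pattern works with fetch() directly -- a rough sketch, assuming a runtime where ReadableStream is async-iterable (e.g. Node) and a hypothetical csvUrl:

let res = await fetch(csvUrl);
let textStream = res.body.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();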

The above examples show accumulating parsers -- they will buffer the full result into memory. This may not be something you need (or want), for example with huge datasets where you're looking to get the sum of a single column, or want to filter only a small subset of rows. To bypass this auto-accumulation behavior, simply pass your own handler as the third argument to parser.chunk():

// ...same as above

let sum = 0;

// sums fourth column
let reducer = (row) => {
  sum += row[3];
};

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.typedArrs, reducer); // typedArrs + reducer
}

parser.end();

Building on the non-accumulating example, a Node Transform stream looks something like this:

import { Transform } from "stream";

class ParseCSVTransform extends Transform {
  #parser = null;
  #push   = null;

  constructor() {
    super({ objectMode: true });

    this.#push = parsed => {
      this.push(parsed);
    };
  }

  _transform(chunk, encoding, callback) {
    let strChunk = chunk.toString();
    this.#parser ??= initParser(inferSchema(strChunk));
    this.#parser.chunk(strChunk, this.#parser.typedArrs, this.#push);
    callback();
  }

  _flush(callback) {
    this.#parser.end();
    callback();
  }
}
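
Using it is then just a matter of piping -- a sketch with a hypothetical filePath; since the stream is in objectMode, each 'data' event is one parsed, typed row (tuple):

import fs from "fs";

fs.createReadStream(filePath)
  .pipe(new ParseCSVTransform())
  .on("data", (row) => {
    // handle a single typed row here
    console.log(row);
  });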

TODO?

  • handle #comment rows
  • emit empty-row and #comment events?