Make the benchmark unit test pass in the presence of quoted fields. #52

MarkPflug · 2023-06-15T18:43:34Z

The NCsvPerf benchmark PackageAssets.csv file doesn't contain quotes, but all existing parsers can be configured to produce consistent results in the presence of a quoted field. This PR adds the required configurations for a few odd libraries that don't do this be default. Note, this PR doesn't change the CSV file, so the benchmark still doesn't test correctness. To test this, I modified the PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row; adding quotes and a comma in the middle. I exclude string.Split from the unit test, since the naive implementation is expected to produce incorrect results.

This PR is a bit reactionary to @nietras submission of Sep #51, as Sep would be the only library that produces inconsistent results, since it doesn't trim or escape quoted values and I don't see any configuration to enable it. Personally, I see this as fundamental functionality that a CSV library should provide, as without it a user would have to implement their own mechanism to process values. It feels a bit misleading that parser claiming to be the fastest would be producing results that disagrees with every other CSV library. I think most users would be surprised by this behavior, even though it is clearly stated in the docs.

The benchmark data file doesn't contain quotes, but all existing parser can be configured to produce identical results in the presence of quoted fields. To test this, I modified the PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row.

MarkPflug · 2023-06-15T23:18:31Z

I added a few unit test cases to compare behavior of certain parser edge-cases.
This covers the following scenarios:
Quoted: "a"
QuotedComma: "a,b"
QuotedQuote: "a""b"
QuotedNewLine: "a\r\nb"

QuotedNewLine is the only one that currently fails, with some libraries failing or producing differing results.
Most libraries produce the correct, expected result a\r\nb.
The "DSV" library turns the "\r\n" into "\n" in the output: a\nb

The following parsers throw an exception when processing this file:

Csv
CsvTools
FluentCSV
Microsoft_ML
RecordParser
ServiceStack_Text
TinyCsvReader
TxtCsvHelper

I don't know if this test will survive git, which has a tendency to modify line ends in files. This might partially satisfy issue #42.

Edit: The newline did not survive git, and was replaced with a \n, which is what I worried might happen.

joelverhagen · 2023-10-27T00:01:42Z

The results in this PR are very interesting. If I find some time, I think it would be good to assess quote handling and newline handling per library and add it as a ✔️ or ✖️ to the results table in my blog post. I think I could use this PR as a starting point.

MarkPflug added 2 commits June 15, 2023 11:02

Add a few unit tests that compare behavior of parser "edge cases".

0d0482a

leandromoh mentioned this pull request Oct 30, 2023

update/record-parser2.1.0 MarkPflug/Benchmarks#16

Merged

MarkPflug closed this by deleting the head repository May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the benchmark unit test pass in the presence of quoted fields. #52

Make the benchmark unit test pass in the presence of quoted fields. #52

MarkPflug commented Jun 15, 2023

MarkPflug commented Jun 15, 2023 •

edited

Loading

joelverhagen commented Oct 27, 2023

Make the benchmark unit test pass in the presence of quoted fields. #52

Make the benchmark unit test pass in the presence of quoted fields. #52

Conversation

MarkPflug commented Jun 15, 2023

MarkPflug commented Jun 15, 2023 • edited Loading

joelverhagen commented Oct 27, 2023

MarkPflug commented Jun 15, 2023 •

edited

Loading