Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the benchmark unit test pass in the presence of quoted fields. #52

Closed
wants to merge 2 commits into from
Closed

Conversation

MarkPflug
Copy link
Contributor

The NCsvPerf benchmark PackageAssets.csv file doesn't contain quotes, but all existing parsers can be configured to produce consistent results in the presence of a quoted field. This PR adds the required configurations for a few odd libraries that don't do this be default. Note, this PR doesn't change the CSV file, so the benchmark still doesn't test correctness. To test this, I modified the PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row; adding quotes and a comma in the middle. I exclude string.Split from the unit test, since the naive implementation is expected to produce incorrect results.

This PR is a bit reactionary to @nietras submission of Sep #51, as Sep would be the only library that produces inconsistent results, since it doesn't trim or escape quoted values and I don't see any configuration to enable it. Personally, I see this as fundamental functionality that a CSV library should provide, as without it a user would have to implement their own mechanism to process values. It feels a bit misleading that parser claiming to be the fastest would be producing results that disagrees with every other CSV library. I think most users would be surprised by this behavior, even though it is clearly stated in the docs.

The benchmark data file doesn't contain quotes, but all existing parser can be configured to produce identical results in the presence of quoted fields.
To test this, I modified the PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row.
@MarkPflug
Copy link
Contributor Author

MarkPflug commented Jun 15, 2023

I added a few unit test cases to compare behavior of certain parser edge-cases.
This covers the following scenarios:
Quoted: "a"
QuotedComma: "a,b"
QuotedQuote: "a""b"
QuotedNewLine: "a\r\nb"

QuotedNewLine is the only one that currently fails, with some libraries failing or producing differing results.
Most libraries produce the correct, expected result a\r\nb.
The "DSV" library turns the "\r\n" into "\n" in the output: a\nb

The following parsers throw an exception when processing this file:

  • Csv
  • CsvTools
  • FluentCSV
  • Microsoft_ML
  • RecordParser
  • ServiceStack_Text
  • TinyCsvReader
  • TxtCsvHelper

I don't know if this test will survive git, which has a tendency to modify line ends in files. This might partially satisfy issue #42.

Edit: The newline did not survive git, and was replaced with a \n, which is what I worried might happen.

@joelverhagen
Copy link
Owner

The results in this PR are very interesting. If I find some time, I think it would be good to assess quote handling and newline handling per library and add it as a ✔️ or ✖️ to the results table in my blog post. I think I could use this PR as a starting point.

@MarkPflug MarkPflug closed this by deleting the head repository May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants