Add quoted values to the test CSV and possibly some larger blocks. #42
I might be looking at the wrong CSV for your tests; if so, please close.
You're right, the current tests don't include multi-line data or even quote-escaped fields. I don't remember my original motivation for this (perhaps it was just a coincidence), but I think the likelihood of diverging behavior and support across the various libraries goes up as more esoteric CSV features are used.

Aside from the performance angle, it would be cool to test various libraries' handling of some different nasty CSV files. I have been thinking about a big table with "CSV features" as columns and "CSV libraries" as rows; each cell would be a red x or a green check. A data set we could use is https://github.com/maxogden/csv-spectrum.

Great idea :). If you're interested in adding a benchmark for these additional cases, you're more than welcome to it.
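To make the feature-matrix idea concrete, here is a minimal sketch of what such a conformance harness could look like. The project under discussion is .NET, but the CSV semantics are language-agnostic, so this uses Python's stdlib `csv` as a stand-in "library under test"; the feature names, samples, and `check`/`stdlib_parse` helpers are all hypothetical, loosely inspired by csv-spectrum.

```python
import csv
import io

# Hypothetical feature matrix: each entry is a nasty CSV sample plus the
# rows a conforming parser should produce (inspired by csv-spectrum).
FEATURES = {
    "quoted_field":      ('a,"b",c\r\n',            [["a", "b", "c"]]),
    "escaped_quote":     ('a,"say ""hi""",c\r\n',   [["a", 'say "hi"', "c"]]),
    "newline_in_quotes": ('a,"line1\nline2",c\r\n', [["a", "line1\nline2", "c"]]),
    "empty_fields":      ('a,,c\r\n',               [["a", "", "c"]]),
}

def check(parse):
    """Run one parser over every feature sample; return {feature: passed?}."""
    results = {}
    for name, (sample, expected) in FEATURES.items():
        try:
            results[name] = parse(sample) == expected
        except Exception:
            results[name] = False  # a crash counts as "unsupported"
    return results

# Stand-in "library under test": Python's stdlib csv reader.
def stdlib_parse(text):
    return [row for row in csv.reader(io.StringIO(text))]

matrix = {"stdlib-csv": check(stdlib_parse)}
for lib, feats in matrix.items():
    print(lib, " ".join(f"{k}:{'pass' if v else 'FAIL'}" for k, v in feats.items()))
```

Each additional library would just contribute another `parse` callable, and the printed rows become the red-x/green-check table.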
I think it's important to show the performance differences, as you cannot simply capture a …
This is true, but one can use the ArrayPoolBufferWriter<> class to reconstruct the result and eliminate most, if not all, allocations.
Related to this is benchmarking OpenSerDe definitions. I hate Amazon Athena / DynamoDB serialization, but it is what it is.
@Kittoes0124 I'm curious how you are using |
@electricessence Like so. The idea is that, when possible, we're always yielding a window into the original buffer. When "stitching" is required, we write into the … As the kids are saying these days, it uh... slaps.
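The "window vs. stitch" idea described above can be sketched roughly as follows. This is an illustrative Python analogue, not the project's C# code (which writes the stitched bytes into an `ArrayPoolBufferWriter<T>`); here a plain `bytearray` plays the scratch-buffer role, and the `fields` function and sample row are invented for the example.

```python
def fields(buf: bytes):
    """Yield each comma-separated field of one record.

    Unquoted fields are yielded as zero-copy memoryview "windows" into
    `buf`; quoted fields containing escaped quotes ("") must be unescaped,
    so they are "stitched" into a fresh scratch buffer instead.
    """
    view = memoryview(buf)
    i = 0
    while i <= len(buf):
        if i < len(buf) and buf[i : i + 1] == b'"':       # quoted field
            scratch = bytearray()                          # stitch path
            j = i + 1
            while True:
                if buf[j : j + 2] == b'""':                # escaped quote
                    scratch += b'"'
                    j += 2
                elif buf[j : j + 1] == b'"':               # closing quote
                    j += 1
                    break
                else:
                    scratch += buf[j : j + 1]
                    j += 1
            yield bytes(scratch)
            i = j + 1                                      # skip the comma
        else:                                              # window path
            j = buf.find(b",", i)
            end = len(buf) if j < 0 else j
            yield view[i:end]                              # zero-copy slice
            i = end + 1

row = b'plain,"say ""hi""",tail'
print([bytes(f) for f in fields(row)])  # [b'plain', b'say "hi"', b'tail']
```

The point of the design is that the common case (no escapes) never copies; only fields that genuinely need rewriting pay for a scratch buffer, which in the C# version comes from a pooled writer rather than a fresh allocation.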
Also, adding massive CSVs with degenerate properties might be beneficial. I use the latest version of the NPPES downloadable file to really stress test my stuff, because everything about it is unreasonable: it's 8 GB unzipped, has 330 fields per row, every single field is quoted, and most of them are sparsely populated. Oh, and it's real-world data, which is a nice bonus!
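For tests that can't ship an 8 GB file, a small generator with the same degenerate shape (every field quoted, most fields empty) could stand in. This is a hypothetical sketch in Python, not part of the benchmark suite; the `degenerate_csv` name and its parameters are invented, and `rows` would be scaled up for a real stress run.

```python
import csv
import io
import random

def degenerate_csv(rows=100, cols=330, fill=0.1, seed=42):
    """Build an NPPES-style stress sample: every field quoted,
    most fields empty (sparsely populated). Deterministic via `seed`."""
    rng = random.Random(seed)
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # quote every field
    writer.writerow([f"col_{i}" for i in range(cols)])
    for r in range(rows):
        writer.writerow(
            [f"v{r}_{c}" if rng.random() < fill else "" for c in range(cols)]
        )
    return out.getvalue()

sample = degenerate_csv(rows=3, cols=5, fill=0.5)
print(sample)
```

Writing to a real file instead of a `StringIO` (and cranking `rows` into the millions) gives a reproducible multi-gigabyte input without redistributing the original data set.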
@Kittoes0124 where's this |
Maybe we should continue this discussion somewhere else. That said... Pipeline overhead is not zero: |
To add insult to injury: the overhead is not zero, and it's maybe not great even in ideal conditions.
Thanks mate, that means a lot, as I did indeed write them myself! I also feel the same way about their lack of inclusion in the base lib. Will continue our discussion in your Reddit post.
Just a suggestion, but having a column with quoted values and quote escapes would round out the tests.
I know it might break someone's code, but I realized when diving in that a 'row' is not a 'line' in the file.
A 'row' can span multiple lines if the line break falls within a quoted value, so having a newline character in a quoted value somewhere would be a good addition to the tests.