update/record-parser2.1.0 #16

Merged · 1 commit into MarkPflug:main on Nov 6, 2023

Conversation

leandromoh
Contributor

No description provided.

@MarkPflug
Owner

I'm not convinced that it's possible for a general-purpose, standards-compliant CSV parser to benefit from parallelism. Are there any documented limitations around your parallelism implementation? Can you describe the basic strategy that you've employed to parallelize?

@leandromoh
Contributor Author

leandromoh commented Oct 30, 2023

I'm not convinced that it's possible for a general-purpose, standards-compliant CSV parser to benefit from parallelism.

Your Sylvan library uses SIMD, right? That is a different kind of parallelism, but still parallel computing.
Could you please merge the PR? Once it is merged we can run the benchmark and check the results to see whether it benefits or not. In my personal tests, the result is positive.

Are there any documented limitations around your parallelism implementation?

No, there are no limitations, and there are tests covering the quoted scenarios. Btw, can you update this PR by rebasing on joelverhagen:main? The "QuotedNewLine" scenario was fixed.

Can you describe the basic strategy that you've employed to parallelize?

This is the million-dollar question 🤣 It is a bit hard to explain.
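
Roughly speaking, the idea most parallel CSV readers share is to split the input at newlines that are provably outside quoted fields, then hand each chunk to an ordinary sequential parser. A minimal sketch of that general idea (illustrative only, not RecordParser's actual code):

```csharp
// Illustrative sketch only -- NOT RecordParser's code. Find "safe" split
// points (newlines preceded by an even number of quotes, i.e. outside any
// quoted field), then parse each chunk independently in parallel.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelSplitSketch
{
    public static string[][] Parse(string text, int chunks)
    {
        var splits = new List<int> { 0 };
        int quotes = 0, step = Math.Max(1, text.Length / chunks), next = step;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '"') quotes++;
            // A newline is a record boundary only when we are outside quotes.
            else if (text[i] == '\n' && quotes % 2 == 0 && i >= next)
            {
                splits.Add(i + 1);
                next = i + step;
            }
        }
        splits.Add(text.Length);

        var results = new string[splits.Count - 1][];
        Parallel.For(0, splits.Count - 1, c =>
        {
            // Each chunk begins and ends on a record boundary, so it can be
            // parsed in isolation (a naive line split stands in for the real
            // record parser here).
            results[c] = text[splits[c]..splits[c + 1]]
                .Split('\n', StringSplitOptions.RemoveEmptyEntries);
        });
        return results;
    }
}
```

The hard part, and why the question resists a short answer, is that the quote-parity scan above is itself sequential; real implementations attack that with techniques like SIMD quote counting or speculative chunk boundaries that are validated afterwards.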

@MarkPflug
Owner

The reason I haven't merged it is that I'm not convinced your implementation is robust. Here is a gist where your parser is the only one that fails to parse: https://gist.github.com/MarkPflug/14123a1eb62e3151c6d8b6bfc5b94b6f
I thought your parallelism implementation might be flawed in the way it finds quoted fields with newlines, but it fails regardless of parallelism. Yes, this is an extreme example, but the other tested parsers handle it. Maybe I'm using it wrong?

Correctness comes before speed. I feel very strongly that if you're claiming to have the fastest CSV parser, you must first prove that it is correct and robust. @nietras's Sep is exceptionally fast, and also appears to be correct, at least in this scenario. I was a bit surprised by the 16MB record limit, but the exception makes it clear that the limitation exists.

I plan on updating these benchmarks when .NET 8 is officially released, at which point I might need to concede that Sylvan isn't the fastest anymore.

@nietras

nietras commented Oct 30, 2023

@MarkPflug you are right that Sep has a row length limit of 16MB. There are a couple of reasons for this, some historical, and also to avoid misuse or DDoS scenarios. It could easily be lifted if I wanted to, or if anyone had a real-world use case that exceeded it. Stress testing is fine, but limits can be there for valid reasons. Another case is a runaway quote, that is, a quote that never ends; this can result in GBs being allocated for the row buffer. So I feel very much that this limit is there for a reason.

Yes, I could add an option for setting it, but I have been trying to avoid option explosion. 😅
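
As a rough illustration of the kind of guard such a limit implies (hypothetical code, not Sep's internals):

```csharp
// Hypothetical guard, not Sep's internals: cap row-buffer growth so a
// runaway quote fails fast instead of allocating gigabytes.
using System;

static class RowBufferGuard
{
    const int MaxRowLength = 16 * 1024 * 1024; // 16MB, as in Sep

    public static void EnsureRowCapacity(ref char[] buffer, int needed)
    {
        if (needed <= buffer.Length) return;
        if (needed > MaxRowLength)
            throw new InvalidOperationException(
                $"Row exceeds {MaxRowLength} chars; possibly an unterminated quote.");
        // Double the buffer (at least to `needed`), never past the cap.
        int newSize = Math.Min(Math.Max(needed, buffer.Length * 2), MaxRowLength);
        Array.Resize(ref buffer, newSize);
    }
}
```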

@MarkPflug
Owner

@nietras I should clarify that I was surprised only because the limitation wasn't mentioned in your documentation (that I could see). A 16MB limit is very generous. Sylvan has a similar limitation (with the default config) for the same reason: DDoS protection. I wanted to make sure it was safe in a web environment where users might send malicious payloads. My default limit is only 16KB, and even that is more than ample in most realistic scenarios. I make the user explicitly opt in to unbounded (int.MaxValue) buffer growth.
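
A sketch of what that opt-in looks like from the consumer side (the property names here are assumed from this thread; verify them against Sylvan.Data.Csv's documentation):

```csharp
// Property names assumed -- verify against Sylvan.Data.Csv's documentation.
using Sylvan.Data.Csv;

var opts = new CsvDataReaderOptions
{
    // The default buffer is deliberately small (16KB) as DDoS protection;
    // unbounded growth must be opted into explicitly.
    MaxBufferSize = int.MaxValue,
};
using var csv = CsvDataReader.Create("data.csv", opts);
while (csv.Read()) { /* process fields */ }
```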

@nietras

nietras commented Oct 30, 2023

In machine learning there are often examples of CSV files with 100k columns, hence my generosity. You are right, I should document it better. 😊

@leandromoh
Contributor Author

leandromoh commented Oct 30, 2023

@MarkPflug I totally agree with you that correctness comes before speed, and I ensure that with a lot of unit tests checking that the implementation works correctly, including tests for quoted-field scenarios, for example here.

About your gist: you used the lib correctly. I tested changing the for-loop condition to 900k instead of 1 million and everything works as expected. The problem is probably related to the internal buffer size, which was not large enough to hold the single record. That is why it fails regardless of parallelism.

To fix the problem I can see two alternatives:

  • throw an exception with an appropriate message, or
  • allow the user to configure the buffer size

Since both alternatives can be implemented, I will think about what to do.
Btw, thanks for pointing out this exceptional scenario.
IMHO, I don't think this unusual extreme case is enough to say that the implementation is not robust.
If you have more of these corner cases, please let me know. Otherwise, since the problem is not related to the algorithm but to sizing (it works for 900k), I think it would be OK to proceed.


@MarkPflug
Owner

MarkPflug commented Oct 30, 2023

@leandromoh There is still something wrong with your parser that is not related to buffer size. I've modified the gist. Your library throws when parsing the following:

A,B,C,D
"x
y",2,3,4

This happens regardless of parallelization.

@leandromoh
Contributor Author

I reproduced it and indeed this is a bug. According to my analysis, the bug happens only when the first field of the record is a quoted field; when it is not the first field, everything works fine (dotnetfiddle). Fortunately, it does not look hard to fix.

I will add more unit tests for quoted scenarios, and I will come back here when the new version covering these corner cases is published.

If you find new bugs, please let me know. I am working hard to cover every scenario I can think of with unit tests, but of course some corner cases will only be caught by other users.
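
For reference, the failing input can be captured as a regression test. The parser below is a tiny RFC 4180-style state machine written only to make the test self-contained; it is not RecordParser's code:

```csharp
// Runnable sketch (not RecordParser's code): a minimal RFC 4180-style
// state machine that handles quoted newlines, plus an xunit test for
// the exact input from the gist.
using System.Collections.Generic;
using Xunit;

public static class MiniCsv
{
    public static List<List<string>> Parse(string text)
    {
        var rows = new List<List<string>>();
        var row = new List<string>();
        var field = new System.Text.StringBuilder();
        bool quoted = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (quoted)
            {
                // "" inside quotes is an escaped quote; a lone " ends the field.
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"') { field.Append('"'); i++; }
                else if (c == '"') quoted = false;
                else field.Append(c); // newlines inside quotes are data
            }
            else if (c == '"') quoted = true;
            else if (c == ',') { row.Add(field.ToString()); field.Clear(); }
            else if (c == '\n')
            {
                row.Add(field.ToString()); field.Clear();
                rows.Add(row); row = new List<string>();
            }
            else if (c != '\r') field.Append(c);
        }
        row.Add(field.ToString());
        rows.Add(row);
        return rows;
    }
}

public class QuotedFirstFieldTests
{
    [Fact]
    public void Quoted_newline_in_first_field()
    {
        var rows = MiniCsv.Parse("A,B,C,D\n\"x\ny\",2,3,4");
        Assert.Equal(new[] { "x\ny", "2", "3", "4" }, rows[1]);
    }
}
```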

@MarkPflug MarkPflug merged commit 18f59b1 into MarkPflug:main Nov 6, 2023
@MarkPflug
Owner

@leandromoh I studied your implementation enough to convince myself that your bugs don't appear to be due to "shortcuts" in the parallel implementation, which is what I'd originally thought. I still plan on updating all benchmarks sometime after .NET 8 is released in the next couple of weeks.

@leandromoh leandromoh deleted the update/record-parser2.1.0 branch November 8, 2023 12:40
@leandromoh
Contributor Author

leandromoh commented Nov 9, 2023

@leandromoh I studied your implementation enough to convince myself that your bugs don't appear to be due to "shortcuts" in the parallel implementation, which is what I'd originally thought.

Since the points you mentioned happened regardless of the parallelization configuration, I was sure the two were not related.
Glad to know that you are now convinced and comfortable merging the PR.

I still plan on updating all benchmarks sometime after .NET 8 is released in the next couple weeks.

No problem.

@leandromoh
Contributor Author

leandromoh commented Nov 13, 2023

FYI: RecordParser v2.2.1 was released, including:

  • a bugfix for when the first column is quoted (your example now works, see it on dotnetfiddle)
  • RecordTooLargeException when a field is too large (previously a plain Exception with the wrong message)

@MarkPflug
Owner

@leandromoh The benchmarks are updated. Nice work on your parallel implementation. Maybe I can still claim the most "efficient" (single-threaded) implementation.

When updating the "CsvSum" benchmarks, I noticed that you hard-coded the ordinal of the column (13) rather than looking it up by name. Can you adjust that method? I didn't see an obvious way to get the ordinal from the column header. Every other impl gets the column ordinal by name.

@leandromoh
Contributor Author

leandromoh commented Nov 20, 2023

Nice work on your parallel implementation.

Thank you @MarkPflug 😀, you also did really great work on Sylvan.

Maybe I can still claim the most "efficient" (single-threaded) implementation.

IMO that is quite possible.

When updating the "CsvSum" benchmarks I noticed that you hard-coded the ordinal of the column (13) rather than looking it up by name. Can you adjust that method?

Currently RecordParser just discards the header row; for now, the only way to map columns is by index. Maybe it will support mapping columns by name in a future version, but it is not in the plans for now.
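
In the meantime, a caller can bridge the gap by resolving the ordinal from the header line before configuring the index-based mapping. A rough sketch (the final mapping step is a stand-in, not RecordParser's actual API, and "price" is a placeholder column name):

```csharp
using System;
using System.IO;

// Naive header split: assumes the header row itself contains no quoted names.
using var reader = new StreamReader("file.csv");
string header = reader.ReadLine() ?? throw new InvalidOperationException("empty file");
int ordinal = Array.IndexOf(header.Split(','), "price"); // "price" is a placeholder
// ... pass `ordinal` to the index-based mapping instead of the literal 13
```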
