update/record-parser2.1.0 #16

Merged · 1 commit into MarkPflug:main on Nov 6, 2023

Conversation

leandromoh
Contributor

No description provided.

@MarkPflug
Owner

I'm not convinced that it's possible for a general-purpose, standards-compliant CSV parser to benefit from parallelism. Are there any documented limitations around your parallelism implementation? Can you describe the basic strategy that you've employed to parallelize?

@leandromoh
Contributor Author

leandromoh commented Oct 30, 2023

I'm not convinced that it's possible for a general-purpose, standards-compliant CSV parser to benefit from parallelism.

Your Sylvan library uses SIMD, right? That is a different kind of parallelism, but still parallel computing.
Could you please merge the PR? Once it is merged we can run the benchmark and check the results to see whether it benefits or not. In my personal tests, the result is positive.

Are there any documented limitations around your parallelism implementation?

No, there are no limitations, and there are tests covering the quoted scenarios. Btw, can you update this PR by rebasing on joelverhagen:main? The "QuotedNewLine" scenario was fixed.

Can you describe the basic strategy that you've employed to parallelize?

This is the million-dollar question 🤣 It is a bit hard to explain.
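
Roughly speaking, the idea most parallel CSV readers share is to split the input at newlines that are provably outside quoted fields, then hand each chunk to an ordinary sequential parser. A minimal sketch of that general idea (illustrative only, not RecordParser's actual code):

```csharp
// Illustrative sketch only -- NOT RecordParser's code. Find "safe" split
// points (newlines preceded by an even number of quotes, i.e. outside any
// quoted field), then parse each chunk independently in parallel.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelSplitSketch
{
    public static string[][] Parse(string text, int chunks)
    {
        var splits = new List<int> { 0 };
        int quotes = 0, step = Math.Max(1, text.Length / chunks), next = step;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '"') quotes++;
            // A newline is a record boundary only when we are outside quotes.
            else if (text[i] == '\n' && quotes % 2 == 0 && i >= next)
            {
                splits.Add(i + 1);
                next = i + step;
            }
        }
        splits.Add(text.Length);

        var results = new string[splits.Count - 1][];
        Parallel.For(0, splits.Count - 1, c =>
        {
            // Each chunk begins and ends on a record boundary, so it can be
            // parsed in isolation (a naive line split stands in for the real
            // record parser here).
            results[c] = text[splits[c]..splits[c + 1]]
                .Split('\n', StringSplitOptions.RemoveEmptyEntries);
        });
        return results;
    }
}
```

The hard part, and why the question resists a short answer, is that the quote-parity scan above is itself sequential; real implementations attack that with techniques like SIMD quote counting or speculative chunk boundaries that are validated afterwards.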

@MarkPflug
Owner

The reason I haven't merged it is that I'm not convinced your implementation is robust. Here is a gist where your parser is the only one that fails to parse: https://gist.github.com/MarkPflug/14123a1eb62e3151c6d8b6bfc5b94b6f
I thought your parallelism implementation might be flawed in the way it finds quoted fields with newlines, but it fails regardless of parallelism. Yes, this is an extreme example, but the other tested parsers handle it. Maybe I'm using it wrong?

Correctness comes before speed. I feel very strongly that if you're claiming to have the fastest CSV parser, you must first prove that it is correct and robust. @nietras's Sep is exceptionally fast, and also appears to be correct, at least in this scenario. I was a bit surprised by the 16MB record limit, but the exception makes it clear that the limitation exists.

I plan on updating these benchmarks when .NET 8 is officially released, at which point I might need to concede that Sylvan isn't the fastest anymore.

@nietras

nietras commented Oct 30, 2023

@MarkPflug you are right that Sep has a row length limit of 16MB. There are a couple of reasons for this, some historical, and also to avoid misuse or DDoS scenarios. It could easily be lifted if I wanted to, or if anyone had a real-world use case that exceeded it. Stress testing is fine, but limits can be there for valid reasons. Another case is a runaway quote, that is, a quote that never ends; this can result in GBs being allocated for the row buffer. So I feel very much that this limit is there for a reason.

Yes, I could add an option for setting it, but I have been trying to avoid option explosion. 😅
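
As a rough illustration of the kind of guard such a limit implies (hypothetical code, not Sep's internals):

```csharp
// Hypothetical guard, not Sep's internals: cap row-buffer growth so a
// runaway quote fails fast instead of allocating gigabytes.
using System;

static class RowBufferGuard
{
    const int MaxRowLength = 16 * 1024 * 1024; // 16MB, as in Sep

    public static void EnsureRowCapacity(ref char[] buffer, int needed)
    {
        if (needed <= buffer.Length) return;
        if (needed > MaxRowLength)
            throw new InvalidOperationException(
                $"Row exceeds {MaxRowLength} chars; possibly an unterminated quote.");
        // Double the buffer (at least to `needed`), never past the cap.
        int newSize = Math.Min(Math.Max(needed, buffer.Length * 2), MaxRowLength);
        Array.Resize(ref buffer, newSize);
    }
}
```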

@MarkPflug
Owner

@nietras I should clarify that I was surprised only because the limitation wasn't mentioned in your documentation (that I could see). A 16MB limit is very generous. Sylvan has a similar limitation (with the default config) for the same reason: DDoS protection. I wanted to make sure it was safe in a web environment where users might send malicious payloads. My default limit is only 16KB, and even that is more than ample in most realistic scenarios. I make the user explicitly opt in to unbounded (int.MaxValue) buffer growth.
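
A sketch of what that opt-in looks like from the consumer side (the property names here are assumed from this thread; verify them against Sylvan.Data.Csv's documentation):

```csharp
// Property names assumed -- verify against Sylvan.Data.Csv's documentation.
using Sylvan.Data.Csv;

var opts = new CsvDataReaderOptions
{
    // The default buffer is deliberately small (16KB) as DDoS protection;
    // unbounded growth must be opted into explicitly.
    MaxBufferSize = int.MaxValue,
};
using var csv = CsvDataReader.Create("data.csv", opts);
while (csv.Read()) { /* process fields */ }
```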

@nietras

nietras commented Oct 30, 2023

In machine learning there are often examples of CSV files with 100k columns, hence my generosity. You are right, I should document it better. 😊

@leandromoh
Contributor Author

leandromoh commented Oct 30, 2023

@MarkPflug I totally agree with you that correctness comes before speed, and I ensure that with a lot of unit tests checking that the implementation works correctly, including tests for quoted-field scenarios, for example here.

About your gist: you used the lib correctly. I tested changing the for-loop condition to 900k instead of 1 million and everything works as expected. The problem is probably related to the internal buffer size, which was not large enough to hold the single record. That is why it fails regardless of parallelism.

To fix the problem I can see two alternatives:

  • throw an exception with an appropriate message, or
  • allow the user to configure the buffer size

Since both alternatives can be implemented, I will think about what to do.
Btw, thanks for pointing out this exceptional scenario.
IMHO, I don't think this unusual extreme case is enough to say that the implementation is not robust.
If you have more of these corner cases, please let me know. Otherwise, since the problem is not related to the algorithm but to sizing (it works for 900k), I think it would be OK to proceed.


@MarkPflug
Owner

MarkPflug commented Oct 30, 2023

@leandromoh There is still something wrong with your parser that is not related to buffer size. I've modified the gist. Your library throws when parsing the following:

A,B,C,D
"x
y",2,3,4

This happens regardless of parallelization.

@leandromoh
Contributor Author

I reproduced it and indeed this is a bug. According to my analysis, the bug happens only when the first field of the record is a quoted field; when it is not the first field, everything works fine (dotnetfiddle). Fortunately, it does not look hard to fix.

I will add more unit tests for quoted scenarios, and I will come back here when the new version covering these corner cases is published.

If you find new bugs, please let me know. I am working hard to cover every scenario I can think of with unit tests, but of course some corner cases will only be caught by other users.
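
For reference, the failing input can be captured as a regression test. The parser below is a tiny RFC 4180-style state machine written only to make the test self-contained; it is not RecordParser's code:

```csharp
// Runnable sketch (not RecordParser's code): a minimal RFC 4180-style
// state machine that handles quoted newlines, plus an xunit test for
// the exact input from the gist.
using System.Collections.Generic;
using Xunit;

public static class MiniCsv
{
    public static List<List<string>> Parse(string text)
    {
        var rows = new List<List<string>>();
        var row = new List<string>();
        var field = new System.Text.StringBuilder();
        bool quoted = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (quoted)
            {
                // "" inside quotes is an escaped quote; a lone " ends the field.
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"') { field.Append('"'); i++; }
                else if (c == '"') quoted = false;
                else field.Append(c); // newlines inside quotes are data
            }
            else if (c == '"') quoted = true;
            else if (c == ',') { row.Add(field.ToString()); field.Clear(); }
            else if (c == '\n')
            {
                row.Add(field.ToString()); field.Clear();
                rows.Add(row); row = new List<string>();
            }
            else if (c != '\r') field.Append(c);
        }
        row.Add(field.ToString());
        rows.Add(row);
        return rows;
    }
}

public class QuotedFirstFieldTests
{
    [Fact]
    public void Quoted_newline_in_first_field()
    {
        var rows = MiniCsv.Parse("A,B,C,D\n\"x\ny\",2,3,4");
        Assert.Equal(new[] { "x\ny", "2", "3", "4" }, rows[1]);
    }
}
```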

@MarkPflug MarkPflug merged commit 18f59b1 into MarkPflug:main Nov 6, 2023
@MarkPflug
Owner

@leandromoh I studied your implementation enough to convince myself that your bugs don't appear to be due to "shortcuts" in the parallel implementation, which is what I'd originally thought. I still plan on updating all benchmarks sometime after .NET 8 is released in the next couple of weeks.

@leandromoh leandromoh deleted the update/record-parser2.1.0 branch November 8, 2023 12:40
@leandromoh
Contributor Author

leandromoh commented Nov 9, 2023

@leandromoh I studied your implementation enough to convince myself that your bugs don't appear to be due to "shortcuts" in the parallel implementation, which is what I'd originally thought.

Since the points you mentioned happened regardless of the parallelization configuration, I was sure the two were not related.
Glad to know that you are now convinced and comfortable merging the PR.

I still plan on updating all benchmarks sometime after .NET 8 is released in the next couple weeks.

No problem.

@leandromoh
Contributor Author

leandromoh commented Nov 13, 2023

FYI: RecordParser v2.2.1 was released, including:

  • a bugfix for when the first column is quoted (your example now works, see it on dotnetfiddle)
  • RecordTooLargeException when a field is too large (previously a plain Exception with the wrong message)

@MarkPflug
Owner

@leandromoh The benchmarks are updated. Nice work on your parallel implementation. Maybe I can still claim the most "efficient" (single-threaded) implementation.

When updating the "CsvSum" benchmarks, I noticed that you hard-coded the ordinal of the column (13) rather than looking it up by name. Can you adjust that method? I didn't see an obvious way to get the ordinal from the column header. Every other impl gets the column ordinal by name.

@leandromoh
Contributor Author

leandromoh commented Nov 20, 2023

Nice work on your parallel implementation.

Thank you @MarkPflug 😀, you also did really great work on Sylvan.

Maybe I can still claim the most "efficient" (single-threaded) implementation.

IMO that is quite possible.

When updating the "CsvSum" benchmarks I noticed that you hard-coded the ordinal of the column (13) rather than looking it up by name. Can you adjust that method?

Currently RecordParser just discards the header row; for now, the only way to map columns is by index. Maybe it will support mapping columns by name in a future version, but it is not in the plans for now.
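
In the meantime, a caller can bridge the gap by resolving the ordinal from the header line before configuring the index-based mapping. A rough sketch (the final mapping step is a stand-in, not RecordParser's actual API, and "price" is a placeholder column name):

```csharp
using System;
using System.IO;

// Naive header split: assumes the header row itself contains no quoted names.
using var reader = new StreamReader("file.csv");
string header = reader.ReadLine() ?? throw new InvalidOperationException("empty file");
int ordinal = Array.IndexOf(header.Split(','), "price"); // "price" is a placeholder
// ... pass `ordinal` to the index-based mapping instead of the literal 13
```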
