update/record-parser2.1.0 #16
Conversation
I'm not convinced that it's possible for a general purpose, standards-compliant CSV parser to benefit from parallelism. Are there any documented limitations around your parallelism implementation? Can you describe the basic strategy that you've employed to parallelize?
Your Sylvan library uses SIMD, right? It is another kind of parallelism, but still parallel computing.
No, there are no limitations; there are tests covering the quoted scenarios. BTW, can you update this PR by rebasing it on
This is the million-dollar question 🤣 It is a bit hard to explain.
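For readers following the thread: the textbook difficulty with parallel CSV parsing is that record boundaries cannot be found without tracking quotes, since a newline inside a quoted field is not a row terminator. Below is a minimal sketch of the common chunk-and-parse approach, purely illustrative and not RecordParser's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ChunkedCsv
{
    // Split the raw text into chunks that end on a real record boundary.
    // Toggling a flag on every quote tells us whether a newline is inside
    // a quoted field (flag set) or a genuine end of row. Escaped quotes
    // ("") toggle twice, so the flag still ends up correct.
    public static IEnumerable<string> SplitAtRecordBoundaries(string csv, int targetChunkSize)
    {
        int start = 0;
        bool inQuotes = false;
        for (int i = 0; i < csv.Length; i++)
        {
            if (csv[i] == '"') inQuotes = !inQuotes;
            else if (csv[i] == '\n' && !inQuotes && i - start >= targetChunkSize)
            {
                yield return csv.Substring(start, i - start + 1);
                start = i + 1;
            }
        }
        if (start < csv.Length) yield return csv.Substring(start);
    }

    // The boundary scan above is sequential; only per-chunk parsing runs in
    // parallel. The inner Split is deliberately naive (no quote handling,
    // LF line endings) to keep the sketch short.
    public static string[][] ParseParallel(string csv) =>
        SplitAtRecordBoundaries(csv, targetChunkSize: 1 << 20)
            .AsParallel().AsOrdered()
            .SelectMany(chunk => chunk
                .Split('\n', StringSplitOptions.RemoveEmptyEntries)
                .Select(line => line.Split(',')))
            .ToArray();
}
```

The sequential quote scan is exactly why the question is interesting: finding boundaries is inherently serial, so any speedup has to come from doing the heavier field parsing and conversion on the chunks concurrently.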
The reason I haven't merged it is that I'm not convinced your implementation is robust. Here is a gist where your parser is the only one that fails to parse: https://gist.github.com/MarkPflug/14123a1eb62e3151c6d8b6bfc5b94b6f Correctness comes before speed. I feel very strongly that if you're claiming to have the fastest CSV parser, you must first prove that it is correct and robust. @nietras Sep is exceptionally fast, and also appears to be correct, at least in this scenario. I was a bit surprised by the 16 MB record limit, but the exception makes it clear that the limitation exists. I plan on updating these benchmarks when .NET 8 is officially released, at which point I might need to concede that Sylvan isn't the fastest anymore.
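The gist itself isn't reproduced in the thread; the following is a hypothetical reconstruction of the kind of stress input being discussed (a single record that straddles a 16 MB row limit), not the gist's actual code:

```csharp
using System.Text;

// Hypothetical reconstruction of the stress input: one record made of
// ~1,000,000 fixed-width fields. At 17 bytes per field including the comma,
// 1M fields is ~17 MB and 900k is ~15.3 MB, which straddles a 16 MB
// row-buffer limit.
var sb = new StringBuilder();
for (int i = 0; i < 1_000_000; i++)
{
    if (i > 0) sb.Append(',');
    sb.Append(i.ToString("D16")); // zero-padded, fixed-width field
}
string oneHugeRow = sb.ToString();
```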
@MarkPflug You are right that Sep has a row length limit of 16 MB. There are a couple of reasons for this, some historical, and also to avoid misuse or DDoS scenarios. It could easily be lifted if I wanted to, or if anyone had a real-world use case that exceeded it. Stress testing is fine, but limits can be there for valid reasons. Another case is a runaway quote, that is, a quote that never ends. This can result in GBs being allocated for the row buffer. So I feel very much that this limit is there for a reason. Yes, I could add an option for setting it, but I have been trying to avoid option explosion. 😅
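The runaway-quote scenario is easy to illustrate: an opening quote that never closes makes the rest of the stream look like one field, so an unbounded row buffer grows with the input. A tiny illustration:

```csharp
// Malformed (or malicious) payload: the opening quote never closes, so per
// RFC 4180 every later newline is still inside the field. A parser that
// grows its row buffer until the row ends will try to buffer all of it.
string runaway = "\"no closing quote,then,"
               + new string('x', 100_000_000) // ~100M chars of "one field"
               + "\n";
```

A fixed cap like Sep's 16 MB turns this from an unbounded allocation into a clean exception.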
@nietras I should clarify that I was surprised only because the limitation wasn't mentioned in your documentation (that I could see). A 16 MB limit is very generous. Sylvan has a similar limitation (with the default config) for the same reason: DDoS protection. I wanted to make sure it was safe in a web environment where users might send malicious payloads. My default limit is only 16 KB, and even that is more than ample in most realistic scenarios. I make the user explicitly opt in to unbounded (int.MaxValue) buffer growth.
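A sketch of the opt-in described above, assuming Sylvan's CsvDataReaderOptions exposes BufferSize and MaxBufferSize properties as the behavior suggests (check the Sylvan.Data.Csv documentation for the exact names; "big.csv" is a placeholder path):

```csharp
using System.IO;
using Sylvan.Data.Csv;

// Opting in to buffer growth for very long records. The option names here
// are assumptions based on the behavior described above.
var options = new CsvDataReaderOptions
{
    BufferSize = 0x4000,          // 16 KB initial buffer (the stated default)
    MaxBufferSize = int.MaxValue, // explicit opt-in to unbounded growth
};
using var reader = CsvDataReader.Create(File.OpenText("big.csv"), options);
while (reader.Read())
{
    // consume fields, e.g. reader.GetString(0)
}
```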
In machine learning there are often examples of CSV files with 100k columns, hence my generosity. You are right, I should document it better. 😊
@MarkPflug I totally agree with you that correctness comes before speed, and I ensure that with a lot of unit tests checking that the implementation works correctly, including tests for quoted-field scenarios, for example here. About your gist: you used the lib correctly. I tested changing the for-loop condition to 900k instead of 1M and everything works as expected. The problem is probably related to the internal buffer size, which was not enough to hold the single record. That is why it fails regardless of parallelism. To fix the problem I can see two alternatives.
Since both alternatives can be implemented, I will think about what to do.
@leandromoh There is still something wrong with your parser that is not related to buffer size. I've modified the gist. Your library throws when parsing the input in the modified gist. This happens regardless of parallelization.
I reproduced it, and indeed this is a bug. According to my analysis, the bug happens only when the first field of the record is a quoted field; when it is not the first field, everything works fine (dotnetfiddle). Fortunately, it does not look hard to fix. I will add more unit tests for quoted scenarios, and I will come back here when a new version covering these corner cases is published. If you find new bugs, please let me know. I am working hard to cover every scenario I can think of with unit tests, but of course some corner cases will only be caught by other users.
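A minimal repro of the failing shape, assuming RecordParser's VariableLengthReaderBuilder API as shown in its README (the exact overloads here are an assumption):

```csharp
using RecordParser.Builders.Reader;

// Same data, quoted field in first vs. second position. Per the analysis
// above, only the first-position case triggered the bug.
const string failing = "\"quoted, with comma\",plain";
const string working = "plain,\"quoted, with comma\"";

var reader = new VariableLengthReaderBuilder<(string A, string B)>()
    .Map(x => x.A, 0)
    .Map(x => x.B, 1)
    .Build(",");

var ok  = reader.Parse(working); // fine
var bad = reader.Parse(failing); // threw before the fix described below
```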
@leandromoh I've studied your implementation enough to convince myself that your bugs don't appear to be due to "shortcuts" in the parallel implementation, which is what I'd originally thought. I still plan on updating all benchmarks sometime after .NET 8 is released in the next couple of weeks.
Since the points you mentioned happened regardless of the parallelization configuration, I was sure the two were not related.
No problem.
FYI: RecordParser v2.2.1 was released, including fixes for the quoted-field corner cases discussed above.
@leandromoh The benchmarks are updated. Nice work on your parallel implementation. Maybe I can still claim the most "efficient" (single-threaded) implementation. When updating the "CsvSum" benchmarks, I noticed that you hard-coded the ordinal of the column (13) rather than looking it up by name. Can you adjust that method? I didn't see an obvious way to get the ordinal from the column header. Every other implementation gets the column ordinal by name.
Thank you @MarkPflug 😀, you also did really great work on Sylvan.
IMO that is quite possible.
Currently RecordParser just discards the header row, so for now the only way to map columns is by index. Maybe it will support mapping columns by name in a future version, but it is not in the plans for now.
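In the meantime, a caller can resolve the ordinal from the header row themselves before wiring up the index-based mapping. A small sketch using plain string handling (no RecordParser-specific API; it assumes an unquoted header row, and the file and column names in the usage line are placeholders):

```csharp
using System;
using System.IO;

// Resolve a column's ordinal from the header line so the index passed to
// .Map(...) isn't hard-coded. Assumes an unquoted header row; quoted
// header fields would need a real CSV split.
static int GetOrdinal(string csvPath, string columnName)
{
    using var stream = new StreamReader(csvPath);
    string header = stream.ReadLine()
        ?? throw new InvalidDataException("empty file");
    int ordinal = Array.IndexOf(header.Split(','), columnName);
    return ordinal >= 0
        ? ordinal
        : throw new ArgumentException($"column '{columnName}' not found");
}

// e.g. int scoreIndex = GetOrdinal("data.csv", "Score"); // instead of 13
```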