-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbindlist(l, use.names=TRUE) handle different encodings for column names #5453
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5453 +/- ##
==========================================
- Coverage 98.61% 98.61% -0.01%
==========================================
Files 79 79
Lines 14536 14534 -2
==========================================
- Hits 14335 14333 -2
Misses 201 201 ☔ View full report in Codecov by Sentry. |
Generated via commit f85272b Download link for the artifact containing the test results: ↓ atime-results.zip
|
Could you write out in the PR description what are the three call sites being updated? Is there any possibility of using a helper instead to avoid relying on updating 3 places to maintain similar changes in the future? Also please ensure there are tests for each of the 3 call sites (I guess IIUC, UTF-8 was chosen as the "winning" encoding, it seems like even if none of the inputs are UTF-8. If that's right, I think it is fine, though it might be a breaking change, right? Please describe the behavior more explicitly in the NEWS item, and perhaps in |
Saw my earlier comment. At least I'm consistent :) |
Co-authored-by: Michael Chirico <[email protected]>
…e into rbindlist_encodings
Updated the description and also the code comments! |
Closes #5452
As also now document in the code, our rbindlist mapping follows a 3 pass approach:
1st pass - gather unique column names
2nd pass - count duplicates
3rd pass - create final column mapping
We actually do not change the underlying encodings here, but convert the read value for comparisons which seems safer since we do not want to alter the input tables.