-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix LightGBM models locale sensitivity and improve R/W performance. #3
Conversation
When Java is used, the default C++ locale is broken. This is true for Java providers that use the C API or even Python models that require JEP. This patch solves that issue making the model reads/writes insensitive to such settings. To achieve it, within the model read/write codebase: - C++ streams are imbued with the classic locale - Calls to functions that are dependent on the locale are replaced - The default locale is not changed! This approach means: - The user's locale is never tampered with, avoiding issues such as microsoft#2979 with the previous approach microsoft#2891 - Datasets can still be read according the user's locale - The model file has a single format independent of locale Changes: - Add CommonC namespace which provides faster locale-independent versions of Common's methods - Model code makes conversions through CommonC - Cleanup unused Common methods - Performance improvements. Use fast libraries for locale-agnostic conversion: - value->string: https://github.com/fmtlib/fmt - string->double: https://github.com/lemire/fast_double_parser (10x faster double parsing according to their benchmark) Bugfixes: - microsoft#2500 - microsoft#2890 - ninia/jep#205 (as it is related to LGBM as well)
This is our internal PR to fix the model locale. The PR to the Microsoft codebase is independent and lives at microsoft#3405. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify, I looked at the code, basically, you added submodules fmt and fast double parser with its functionalities as you described at the description. And then you change the code that every line that uses common to commonC parser that you created. I believe that it was already discussed and I didn't see anything wrong. Yet try to understand why did travis build fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good job :)
@shenggwang can you approve? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
As discussed, this branch will still receive more commits to make it acceptable for Microsoft's mainline code. As such, a tag |
When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.
This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:
This approach means:
[CRITICAL BUG][Python] cannot wrire() UTF-8 strings by UnicodeEncodeError microsoft/LightGBM#2979 with the previous
approach Fix Booster read/write locale dependency microsoft/LightGBM#2891
Changes:
faster double parsing according to their benchmark)
Bugfixes: