[fix] Locale independent model load/save #3267
Speaking about float-to-string conversion when saving models: XGBoost recently adopted the Ryū algorithm (dmlc/xgboost#5772) and NumPy uses the Dragon4 algorithm (numpy/numpy#9941).
Before I commit to working on this, can someone direct me to the files where the model load/save code lives?

Searching for the code (read only if at a loss): the fix we posted in the past, https://github.com/microsoft/LightGBM/pull/2891/files, required changes in LGBM_BoosterCreateFromModelfile, LGBM_BoosterSaveModel, LGBM_BoosterSaveModelToString and LGBM_BoosterDumpModel in the C API. This new implementation will need to change the core code, which means the fixes must be done in the functions called from those entry points. Reading model files seems to be implemented in:
Conclusion and questions

I got lost on the virtual function pointers, though; I was expecting to see DART and the other boosters implement model read/write. Can I thus assume model text file reads/writes are implemented in
+1, and a ping for @guolinke since you wrote most of that code. Thank you all :)
Gentle ping @guolinke about the two questions above on the code location.
Sorry for missing this thread...
Thanks so much! I will start working on it ;)
By the way @StrikerRUS, regarding such libraries, I think I'll go with Ryū; I've been reading up a bit on it. Both came after David Gay's implementation, and Ryū came after Dragon4. Dragon4 has a fallback algorithm for some numbers, which is slower. I was thinking of using this C++ implementation of Ryū, https://github.com/expnkx/fast_io, as it provides a single header to include, but it requires C++20 while LightGBM targets C++11. This leaves the original implementation with a C interface: https://github.com/ulfjack/ryu. I hope it's easy (and possible) to add to our build chain; I still have to figure that one out.
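For reference, a minimal sketch of what calling ryu's C API from C++ could look like (assuming the library builds into our chain; this is not committed LightGBM code):

```cpp
#include <cstdio>
#include <cstdlib>

#include "ryu/ryu.h"  // from https://github.com/ulfjack/ryu

int main() {
  // d2s() returns a malloc()'d shortest round-trip decimal string for the
  // double, always using '.' regardless of the process locale.
  char* s = d2s(0.1);
  std::printf("%s\n", s);
  std::free(s);
  return 0;
}
```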
@AlbertoEAF BTW, would it be possible to have a two-stage solution? Like a script to locale/de-locale the model?
Hello @guolinke, I'm not sure what you mean. Why would we want to transform the model content to different locales? I also believe it would be even trickier: to read a model correctly you would have to know precisely the locale with which it was written, which can be hard to determine in some corner cases. Finally, as LightGBM itself does not allow changing the locale during its execution, we would have to create an external tool, and that tool would need to parse the model file structure to read it and then rewrite it with exactly the same structure - lots of code duplication and complexity.
@AlbertoEAF I misunderstood the locale problem.
Hello @StrikerRUS @guolinke, I have two requests for input:
Thanks!
@AlbertoEAF I'm totally OK with using
Hah, just noticed that
Yeah @StrikerRUS, even integer-to-string conversions might benefit from it, not only floating-point conversions with shortest representation. LightGBM already has a custom
@guolinke @StrikerRUS I've hit an issue with the LightGBM model (the same as the one above with the image) which I don't understand and really need your help with. Any ideas about what might be causing this are greatly appreciated!

Issue

When I switch the call that turns floating-point parameters into strings in the model write, I sometimes get numbers with a redundant trailing 0 in the scientific notation, which shouldn't matter - I just posted a bug at fmtlib: fmtlib/fmt#1873; see it to understand what I mean. The big issue, though, is that for each of the two trees where one parameter changed, that tree's tree_size in the model grew by +1 as well! Any ideas?!

Diffs in the output model files produced by LightGBM: each image reports the line difference between the "unpatched"/normal LightGBM code (in red) and the patch (in green), followed by the differences in items from those lines, also in red/green. In this case the model written with the patched code has two trees where one parameter gained a trailing 0, which results in these tree_size changes, and that's what I don't understand. Why do the tree sizes change because of the written floating-point parameters?

Source code references

If you want to take a look at the code difference that led to the issue reported here and in fmtlib/fmt#1873, the change is this one: feedzai@da3d8f2. The images above were generated with the two commits:
PRs in progress (still in very early development):
Thank you! Any input is appreciated!
To further reinforce the "bug" origin, I added code to strip away trailing 0s from the decimal digits of numbers, and it already works properly - i.e., there are no differences in the output models! (This is the temporary hack I proposed in fmtlib/fmt#1873.) This commit, feedzai@c34c43d, adds that code (not a permanent solution, as it's a hack that reduces speed and isn't pretty).
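For illustration, a sketch of what such a trailing-zero-stripping hack could look like (hypothetical helper, not the committed code):

```cpp
#include <string>

// Remove redundant trailing zeros from the fractional part of a formatted
// number, e.g. "2.90e-05" -> "2.9e-05" and "1.200" -> "1.2"; keeps at least
// one digit after the '.' so "2.0" stays "2.0".
std::string StripTrailingZeros(std::string s) {
  const std::size_t e = s.find('e');  // start of the exponent, if any
  const std::size_t end = (e == std::string::npos) ? s.size() : e;
  const std::size_t dot = s.find('.');
  if (dot == std::string::npos || dot >= end) return s;  // no fractional part
  std::size_t last = end;  // one past the last digit to keep
  while (last > dot + 2 && s[last - 1] == '0') --last;
  s.erase(last, end - last);
  return s;
}
```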
I was already able to replace all model write calls except
@AlbertoEAF the tree_size is the string size of the tree model content. We use it for multi-threaded loading of the tree models.
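(For anyone reading along, a tiny illustration with made-up parameter strings: since tree_size is the byte length of the serialized tree text, which lets each loader thread jump straight to its tree, a one-character-longer float representation bumps it by exactly 1.)

```cpp
#include <iostream>
#include <string>

int main() {
  std::string unpatched = "leaf_value=2.9e-05";   // hypothetical model line
  std::string patched   = "leaf_value=2.90e-05";  // redundant trailing zero
  // The sizes differ by 1, so the recorded tree_size moves by +1 as well.
  std::cout << unpatched.size() << " vs " << patched.size() << '\n';
  return 0;
}
```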
Ah! So it's no bug! I get it now; it shouldn't be a problem then, thanks!
Hello @guolinke @StrikerRUS, I have good news: I've completed the code changes and they pass all the tests!

Instead of relying on Ryū, I actually used fmt to write values to strings, as it provides similar performance and its API is much more in line with the rest of the LightGBM codebase. For string->double parsing I am using this library: https://github.com/lemire/fast_double_parser, which provides maximal floating-point string parsing resolution and matches the standard library's parsing output bit by bit. Hence, transforming data to strings and string arrays should now be faster. Parsing strings to doubles with

I'd like to streamline the code more by merging Common's

And seems to have less resolution:

As such I had to leave both
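A rough sketch of the fmt + fast_double_parser combination described above (illustrative only, not the exact LightGBM changes):

```cpp
#include <cassert>
#include <string>

#include <fmt/format.h>
#include "fast_double_parser.h"  // https://github.com/lemire/fast_double_parser

int main() {
  const double value = 0.000247;
  // fmt's default "{}" formatting produces the shortest string that
  // round-trips, always with '.' as the decimal separator.
  const std::string s = fmt::format("{}", value);

  double parsed = 0.0;
  // parse_number() returns a pointer past the parsed number, or nullptr on
  // failure; it matches "C"-locale strtod() output bit for bit.
  const char* end = fast_double_parser::parse_number(s.c_str(), &parsed);
  assert(end != nullptr && parsed == value);  // exact round trip
  return 0;
}
```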
Looks like the difference is in the 18th+ digit in all examples, and according to IEEE 754, double gives only 15-17 significant digits. So I believe it is OK to have inconsistent digits after the 17th place.
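A quick way to check that claim (sketch, standard C++ only):

```cpp
#include <cassert>
#include <cstdio>
#include <limits>

int main() {
  // max_digits10 == 17: printing 17 significant digits is always enough to
  // recover the exact same double on parse.
  static_assert(std::numeric_limits<double>::max_digits10 == 17, "IEEE double");
  const double x = 0.1234567890123456789;  // more digits than a double holds
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.17g", x);
  double y = 0.0;
  std::sscanf(buf, "%lg", &y);
  assert(x == y);  // bit-exact round trip through 17 digits
  return 0;
}
```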
Hmm, that's interesting; in that case we could replace

Thanks Nikita! I'll run some tests tomorrow and check that the scores don't change.
I remember that
Fixed via #3405.
Thanks a lot for all the help @StrikerRUS @jameslamb ;)
Hello,
Issue
With the revert of its PR, this issue has again become a concern in Java: #2890.
At the moment LightGBM uses strtod() and C++ streams to read/write the floating-point numbers in the model file.
The issue is that Java overrides the standard "C" locale at initialization (see #2891). Current evidence shows this is true for Java providers that use the C API through SWIG (I suspect MMLSpark's LightGBM might be affected by this too), and even for users of the LightGBM Python API whose code is wrapped by JEP.
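A minimal repro of the failure mode (assuming the de_DE locale is installed on the machine):

```cpp
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
  const char* text = "0.123";
  std::printf("before: %g\n", std::strtod(text, nullptr));  // 0.123
  // Something like the JVM flipping the process locale behind our back:
  std::setlocale(LC_ALL, "de_DE.UTF-8");
  // German uses ',' as the decimal separator, so strtod() now stops at the
  // '.' and silently returns 0 -- exactly the score-corruption bug above.
  std::printf("after: %g\n", std::strtod(text, nullptr));
  return 0;
}
```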
For production scenarios, ensuring the correctness of scores is of paramount importance, and right now there are no such guarantees for LightGBM providers that use Java (see the issue above).
Request for Change
After assessing alternatives, and seeing that the last fix was reverted, the robust solution is to make the model read/write of floating-point numbers use process-locale-independent methods.
Implementation proposal
Replace floating-point reads/writes in model save/load with calls to C++ streams (we already use them anyway for some parameters). We imbue such streams with the "C" locale to force the same locale for reads and writes, which fixes our bug.
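A minimal sketch of the idea (not the final patch):

```cpp
#include <locale>
#include <sstream>

// Serialize and re-parse a double through a stream pinned to the classic
// "C" locale, so '.' is always the decimal separator no matter what the
// process locale was changed to.
double RoundTrip(double value) {
  std::stringstream ss;
  ss.imbue(std::locale::classic());  // "C" locale, process-independent
  ss.precision(17);                  // max_digits10: exact round trip
  ss << value;
  double parsed = 0.0;
  ss >> parsed;
  return parsed;
}
```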
It might be slightly slower to read/write the model, but it will be correct every time, independently of the process locale settings.
According to the benchmark https://tinodidriksen.com/2011/05/cpp-convert-string-to-double-speed/ it would result in a ~3x slower floating-point conversion (reusing streams). I doubt it's actually 3x slower (see tricks to make it faster: http://cplusplus.bordoon.com/speeding_up_string_conversions.html). Besides, we're already using streams for some numbers, so the model read/write slowdown wouldn't reach that value.
Is that an issue?
Can I get your thoughts on this? Good, bad? Secondary effects I might be missing? Would such a PR be welcome? @guolinke @StrikerRUS @imatiach-msft et al.
Thank you!
Not enough? Future - Alternative libraries
In the future, if needed, we can upgrade to different libraries like Boost.Spirit or David M. Gay's library, which is more accurate than glibc (the current implementation). If we do so, we can have rounding differences (which in principle are small).
Side note on floating-point conversion precision and David Gay's library
GCC builds of LightGBM use glibc, and according to these links:
David Gay's implementation of string/floating-point conversion, https://www.ampl.com/netlib/fp/dtoa.c, is even more accurate than the one we're using now. It is a single file that can be compiled and configured through preprocessor flags, in our case to ignore locale processing altogether (resulting in the "C" locale behaviour for both reads and writes).