Fix LightGBM models locale sensitivity and improve R/W performance. #3

AlbertoEAF · 2020-09-23T19:20:46Z

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API and even for Python models that require JEP.

This patch solves that issue by making model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:

- C++ streams are imbued with the classic locale
- Calls to functions that are dependent on the locale are replaced
- The default locale is not changed!
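
As a rough illustration of the first and last points, here is a minimal sketch of imbuing a single stream with the classic locale while leaving the global locale alone. It is not code from this patch: `CommaDecimal`, `FormatWithGlobalLocale`, `FormatClassic` and `ParseClassic` are hypothetical names, and the custom facet only simulates a default locale whose decimal separator is a comma.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet, used only to simulate a default locale whose decimal
// separator is ',' without requiring any system locale to be installed.
struct CommaDecimal : std::numpunct<char> {
  char do_decimal_point() const override { return ','; }
};

// Formats through a stream that inherits the global locale at construction,
// i.e. the locale-sensitive behaviour.
std::string FormatWithGlobalLocale(double value) {
  std::ostringstream out;
  out << value;
  return out.str();
}

// Formats through a stream imbued with the classic "C" locale.
// Only this stream is affected; the global locale is never modified.
std::string FormatClassic(double value) {
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out << value;
  return out.str();
}

// Parses a double through a stream imbued with the classic "C" locale,
// so '.' is always the decimal separator regardless of the global locale.
double ParseClassic(const std::string& text) {
  std::istringstream in(text);
  in.imbue(std::locale::classic());
  double value = 0.0;
  in >> value;
  return value;
}
```

If `std::locale::global(std::locale(std::locale::classic(), new CommaDecimal))` is installed first, `FormatWithGlobalLocale(1.5)` yields `1,5`, while `FormatClassic(1.5)` still yields `1.5` and `ParseClassic("2.25")` still reads `2.25`.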

This approach means:

- The user's locale is never tampered with, avoiding issues such as microsoft/LightGBM#2979 ("[CRITICAL BUG][Python] cannot wrire() UTF-8 strings by UnicodeEncodeError"), which arose with the previous approach, microsoft/LightGBM#2891 ("Fix Booster read/write locale dependency")
- Datasets can still be read according to the user's locale
- The model file has a single format independent of locale

Changes:

- Add CommonC namespace which provides faster locale-independent versions of Common's methods
- Model code makes conversions through CommonC
- Clean up unused Common methods
- Performance improvements. Use fast libraries for locale-agnostic conversion:
  - value->string: https://github.com/fmtlib/fmt
  - string->double: https://github.com/lemire/fast_double_parser (10x
    faster double parsing according to their benchmark)

Bugfixes:

- microsoft#2500
- microsoft#2890
- ninia/jep#205 (as it is related to LGBM as well)

AlbertoEAF · 2020-09-23T19:21:25Z

This is our internal PR to fix the model locale. The PR to the Microsoft codebase is independent and lives at microsoft#3405.

shengwangsw

To summarize my reading of the code: you added the fmt and fast_double_parser submodules, with the functionality you described in the description, and then changed every line that used Common to use the CommonC functions you created. I believe this was already discussed, and I didn't see anything wrong. Still, try to understand why the Travis build failed.

include/LightGBM/utils/common.h

paulojrp

good job :)

AlbertoEAF · 2020-09-29T11:19:00Z

@shenggwang can you approve?
Ignore the checks. The Java build works, which is the one we're interested in.

shengwangsw

LGTM

AlbertoEAF · 2020-09-29T13:45:59Z

As discussed, this branch will still receive more commits to make it acceptable for Microsoft's mainline code.

As such, a tag v3.0.0-with_model_locale_fix_for_java was added at this point and will be used in the provider (feedzai/feedzai-openml-java#53).

shengwangsw reviewed Sep 23, 2020

include/LightGBM/utils/common.h (resolved)

include/LightGBM/utils/common.h (outdated, resolved)

AlbertoEAF added 2 commits September 24, 2020 10:44

Align CommonC namespace

13f1de5

Add new external_libs/ to python setup

c53bbbe

paulojrp approved these changes Sep 28, 2020

shengwangsw approved these changes Sep 29, 2020

AlbertoEAF closed this Sep 29, 2020
