Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LightGBM models locale sensitivity and improve R/W performance. #3

Closed
wants to merge 3 commits into from

Conversation

AlbertoEAF
Copy link

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:

  • C++ streams are imbued with the classic locale
  • Calls to functions that are dependent on the locale are replaced
  • The default locale is not changed!

This approach means:

Changes:

  • Add CommonC namespace which provides faster locale-independent versions of Common's methods
  • Model code makes conversions through CommonC
  • Cleanup unused Common methods
  • Performance improvements. Use fast libraries for locale-agnostic conversion:

Bugfixes:

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:
 - C++ streams are imbued with the classic locale
 - Calls to functions that are dependent on the locale are replaced
 - The default locale is not changed!

This approach means:
 - The user's locale is never tampered with, avoiding issues such as
    microsoft#2979 with the previous
    approach microsoft#2891
 - Datasets can still be read according the user's locale
 - The model file has a single format independent of locale

Changes:
 - Add CommonC namespace which provides faster locale-independent versions of Common's methods
 - Model code makes conversions through CommonC
 - Cleanup unused Common methods
 - Performance improvements. Use fast libraries for locale-agnostic conversion:
   - value->string: https://github.com/fmtlib/fmt
   - string->double: https://github.com/lemire/fast_double_parser (10x
      faster double parsing according to their benchmark)

Bugfixes:
 - microsoft#2500
 - microsoft#2890
 - ninia/jep#205 (as it is related to LGBM as well)
@AlbertoEAF
Copy link
Author

AlbertoEAF commented Sep 23, 2020

This is our internal PR to fix the model locale. The PR to the Microsoft codebase is independent and lives at microsoft#3405.

Copy link

@shengwangsw shengwangsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify, I looked at the code, basically, you added submodules fmt and fast double parser with its functionalities as you described at the description. And then you change the code that every line that uses common to commonC parser that you created. I believe that it was already discussed and I didn't see anything wrong. Yet try to understand why did travis build fail

include/LightGBM/utils/common.h Show resolved Hide resolved
include/LightGBM/utils/common.h Outdated Show resolved Hide resolved
Copy link

@paulojrp paulojrp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good job :)

@AlbertoEAF
Copy link
Author

@shenggwang can you approve?
Ignore the checks. The java build works which is the one we're interested in.

Copy link

@shengwangsw shengwangsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AlbertoEAF
Copy link
Author

As discussed, this branch will still receive more commits to make it acceptable for Microsoft's mainline code.

As such, a tag v3.0.0-with_model_locale_fix_for_java was added at this point and will be used in the provider (feedzai/feedzai-openml-java#53).

@AlbertoEAF AlbertoEAF closed this Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants