Fix model locale issue and improve model R/W performance. #3405

AlbertoEAF · 2020-09-23T18:12:47Z

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:

C++ streams are imbued with the classic locale
Calls to functions that are dependent on the locale are replaced
The default locale is not changed!

This approach means:

The user's locale is never tampered with, avoiding issues such as
[CRITICAL BUG][Python] cannot wrire() UTF-8 strings by UnicodeEncodeError #2979 with the previous approach Fix Booster read/write locale dependency #2891
Datasets can still be read according the user's locale
The model file has a single format independent of locale

Changes:

Performance improvements: More than 35% faster model read/writes. Use fast libraries for locale-agnostic conversion:
- value->string: https://github.com/fmtlib/fmt
- string->double: https://github.com/lemire/fast_double_parser (10x faster double parsing according to their benchmark)
Add CommonC namespace which provides faster locale-independent versions of Common's methods as if using the "C" locale
Model code makes conversions through CommonC
Cleanup unused Common methods

Bugfixes:

AlbertoEAF · 2020-09-23T18:48:07Z

Hello @guolinke @StrikerRUS @henry0312 @imatiach-msft I think finally the fix is in #3267! :)

Running the integration tests

We are using a Java provider for LightGBM focused on low-latency: https://github.com/feedzai/feedzai-openml-java/tree/master/openml-lightgbm

The Java provider PR where the new tests and LightGBM patch live can be found here: feedzai/feedzai-openml-java#53.

To run the tests just clone it and do mvn clean install -DskipTests once on the full repo, and then you can just run mvn test inside the openml-lightgbm folder to test.

This will fetch the requested LightGBM codebase, build it in a docker container and then start building the Java provider which uses LightGBM's .so's and .jar. Finally it will run the tests. As I am using a tool to compare the output model files you need python >=3.6 and termcolor installed for the time being.

Integration tests and results

To prove the patch works, there are many tests. The main ones for this issue are 2:

Compare scores to reference scores and ensure they remain the same as in v3.0.0.
Compare round-trip reference model file read followed by a write to a file. The output model should be the same with the v3.0.0 code as with the patch.

When locale is forced to be "C" both v3.0.0 and this custom version work, but when another locale is used and C is not enforced (pt_UTF-8 in my case)

(openml-lightgbm/lightgbm-provider/pom.xml)

the standard v3.0.0 starts failing the tests:

(openml-lightgbm/lightgbm-builder/pom.xml)

with differences in the output round-trip model read/write with the C locale:

and failed tests including wrong scores:

With the patched version however, all outputs match the behaviour of the v3.0.0 codebase with the C locale even when using other locales:

AlbertoEAF · 2020-09-23T18:53:18Z

Any tips on what's causing so many failed builds? In my system it works.
I added 2 new libraries, but they should be compatible with gcc 4.8.4.

guolinke · 2020-09-24T04:19:06Z

@AlbertoEAF did you test the model loading speed of This PR?
I remember these deleted codes are integrated for (large) model loading performance.

henry0312 · 2020-09-24T06:06:58Z

Hello, Alberto.
I'll also check this PR.

AlbertoEAF · 2020-09-24T09:22:10Z

@AlbertoEAF did you test the model loading speed of This PR?
I remember these deleted codes are integrated for (large) model loading performance.

Hello @guolinke no I didn't. Any hints on what's a large model?
I guess I can try manually modifying one of my LGBM model .txt files to have 5000 trees or something by copying+pasting.

Also I have another question, I guess we should have tests for this right? I'm not sure how we should proceed though. Could we have a regression test with a reference model file in the repo so we could read and re-write it to disk and compare, just like we do on the Java provider? (~10MB model.txt preferrably in git-lfs to avoid weighing down on the repo).

AlbertoEAF · 2020-09-24T10:58:54Z

All failing R-package tests are not cloning the new git submodules in external_libs/ which results in things like:

CMake Error at CMakeLists.txt:448 (add_subdirectory):
  add_subdirectory given source "external_libs/fmt" which is not an existing
  directory.

As for Python I didn't find the cause but could it be the same thing? For instance the compute submodule is explicitly copied in setup.py. Do the two new submodules under external_libs need the same handling?

Not used to Travis nor this build system, would appreciate help to fix the failing builds.

StrikerRUS · 2020-09-24T14:08:59Z

@AlbertoEAF

All failing R-package tests ...

Please don't care about them. There are a lot of maintenance burden to get new files worked with R CRAN installation process. We will help with it later.

As for Python ... Do the two new submodules under external_libs need the same handling?

Yes, I guess two new folders should be copied unconditionally just like src and include:

LightGBM/python-package/setup.py

Lines 44 to 45 in 2870490

    
           copy_files_helper('include') 
        
           copy_files_helper('src')

AlbertoEAF · 2020-09-24T16:19:51Z

@guolinke Regarding model read/write speed, I ran first all the tests I had before with both versions.

Regular tests. 3 runs per LightGBM version. Both display similar performance even though the patched version seems slightly faster (not statistically significant with only these runs):

v3.0.0: {3.8s, 5.2s, 4.4s}
patch: {3.4s, 2.9s, 3.5s}

Then I took a model file with 200 trees and 8.8MB, wrote a script to sample more trees from it and add it to the file and generated a model with 50200 trees and 2.2GB. I ran all the tests as before as well as read/re-write that model to disk with both versions of LightGBM:

v3.0.0: {1m 41s, 1m 48s, 1m 46s}
patch: {1m 09s, 1m 07s, 1m 09s}

The new version is significantly faster, by around 35% on round-trip model read+writes.

AlbertoEAF · 2020-09-24T17:51:19Z

Ok, added that to the Python setup @StrikerRUS but some of the builds still fail.

For instance, this one fails building fast_double_parser: https://travis-ci.org/github/microsoft/LightGBM/jobs/730024056 because it lacks the non-standard strtod_l for instance (check line 704, and 708 - which misses the locale_t type).

I'm not sure I understand all this environments, what do they differ by? I guess it's related to the compiler. With the pretty old Ubuntu 14.04 and gcc 4.8.4 this compiles though.

guolinke · 2020-09-25T02:19:48Z

@AlbertoEAF great! happy to see the performance gain!

lemire · 2020-09-25T17:16:59Z

For instance, this one fails building fast_double_parser: https://travis-ci.org/github/microsoft/LightGBM/jobs/730024056 because it lacks the non-standard strtod_l for instance (check line 704, and 708 - which misses the locale_t type).

Let me comment here as well as in the related issue you just opened. I do not think that the support is missing. The issue, I believe, is that the xlocale.h file does not get included whereas it should get included. It is almost surely my fault.

See my comment. I think we can fix this easily with a few lines of code.

jameslamb · 2020-09-26T21:05:05Z

Please let me know when this is ready for help on the R side, and I'll make a PR into this one.

CRAN (the official package manager for R) has very strict requirements for portability of compiled code, so if those checks turn up anything that would make the R package not work on CRAN's set of environments (https://cran.r-project.org/web/checks/check_results_lightgbm.html), we'll have to talk about how to make this change in a way that doesn't violate their requirements.

AlbertoEAF · 2020-09-29T17:31:39Z

Hello,

Seeing some Python builds still where I suspect the setup.py is not copying the external_libs as it still complains of not finding all the dependencies. That was the error I had in every single Python build before adding

LightGBM/python-package/setup.py

Line 46 in 32d5533

copy_files_helper('external_libs')

I still get that error in some Python builds though. Example: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=7444&view=logs&j=02a2c3ba-81f8-54e3-0767-5d5adbb0daa9&t=720ee3fa-96d4-5b47-dbf4-01607b74ade2&l=1269).

Any tips? Should I declare that copy outside the "if"? As this issue seems to happen for "dist" "sdist" "bdist" builds.

However that doesn't make sense either, as looking at the full log the compute folder is being copied but external_libs are not. However, copying compute is conditional on passing through those copies first, as well as src and include folders. So I wonder what's going on 🤔 . Is it fetching a different setup.py from somewhere else?

I ask that as in the full log I see:

...
copying compile/compute/...
...
copying compile/src/...
...

but I don't see copying external_libs or include. Also, looking at setup.py I was expecting to see copies of include, src, external_libs and optionally compute, in this order:

LightGBM/python-package/setup.py

Lines 43 to 66 in 32d5533

    
           if not os.path.isfile(os.path.join(CURRENT_DIR, '_IS_SOURCE_PACKAGE.txt')): 
        
               copy_files_helper('include') 
        
               copy_files_helper('src') 
        
               copy_files_helper('external_libs') 
        
               if not os.path.exists(os.path.join(CURRENT_DIR, "compile", "windows")): 
        
                   os.makedirs(os.path.join(CURRENT_DIR, "compile", "windows")) 
        
               copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "windows", "LightGBM.sln"), 
        
                         os.path.join(CURRENT_DIR, "compile", "windows", "LightGBM.sln"), 
        
                         verbose=0) 
        
               copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "windows", "LightGBM.vcxproj"), 
        
                         os.path.join(CURRENT_DIR, "compile", "windows", "LightGBM.vcxproj"), 
        
                         verbose=0) 
        
               copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "LICENSE"), 
        
                         os.path.join(CURRENT_DIR, "LICENSE"), 
        
                         verbose=0) 
        
               copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "CMakeLists.txt"), 
        
                         os.path.join(CURRENT_DIR, "compile", "CMakeLists.txt"), 
        
                         verbose=0) 
        
               if integrated_opencl: 
        
                   copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "CMakeIntegratedOpenCL.cmake"), 
        
                             os.path.join(CURRENT_DIR, "compile", "CMakeIntegratedOpenCL.cmake"), 
        
                             verbose=0) 
        
               if use_gpu: 
        
                   copy_files_helper('compute')

Any ideas @StrikerRUS ?

AlbertoEAF · 2020-09-29T18:00:47Z

I took a look at all failing Travis builds. I listed what I believe are 4 types of build errors which might require changes:

Disable pylint over the external_libs:

~~https://travis-ci.org/github/microsoft/LightGBM/jobs/731344752~~ <- Fixed: Now it's regular linting issues I have to fix.

~~R builds failing compilation without -fPIC (@jameslamb). Any reason not to compile shared libs with -fPIC? Did I inadvertently mess something in the build?~~ ^UPDATE: I misread the log, thanks @jameslamb. In C++ with USE_MPI=ON, fmt is being compiled without -fPIC, don't know how to force it:

https://travis-ci.org/github/microsoft/LightGBM/jobs/731344754

Some Python builds where I believe setup.py is not adding external_libs and I don't know why:

My noobiness with CMakeLists. Somehow TARGET_LINK_LIBRARIES is not being properly used by me?

~~https://travis-ci.org/github/microsoft/LightGBM/jobs/731344756#L793-L811~~
~~https://travis-ci.org/github/microsoft/LightGBM/jobs/731344762#L775-L791~~
Fixed both above by removing PUBLIC from target_link_libraries(... fmt)

I'd appreciate input over any of the issues @StrikerRUS @guolinke @jameslamb :) Thank you!

jameslamb · 2020-09-29T20:10:46Z

R builds failing compilation without -fPIC (@jameslamb). Any reason not to compile shared libs with -fPIC? Did I inadvertently mess something in the build?
https://travis-ci.org/github/microsoft/LightGBM/jobs/731344754

This is not an R build. All of the R builds are on GitHub Actions.

This is a C++ test, checking code compiled with USE_MPI=ON:

Sorry, I don't really understand why it's failing.

Does add_subdirectory(external_libs/fmt) mean "also compile fmt, using fmt/CMakeLists.txt"?

If so, I believe that opens us up to a lot of weird issues, where values like CXXFLAGS set in LightGBM's CMakeLists.txt could conflict with those in fmt/CMakeLists.txt. Maybe that is why you're seeing these issues.

Co-authored-by: James Lamb <[email protected]>

AlbertoEAF · 2020-11-24T20:57:07Z

@jameslamb by rebasing against the latest master now there are 2 failing jobs in Travis: python 2 & mpi build

I thought python2 support was being dropped

The other error was "connection reset by peer".

Any ideas?

jameslamb · 2020-11-24T21:25:41Z

there are 2 failing jobs in Travis: python 2 & mpi build

The two test failures look unrelated to your code. I'll just restart them.

by rebasing against the latest master now

By the way, we squash commits on merge in this project, so you don't have to rebase + force push every time. You can just git merge master and push merge commits.

I like doing that because then git will raise an error if your branch isn't up to date with the remote, which usually means you accepted suggestions in the GitHub UI but then forgot to pull those changes to your local branch. If you rebase + force push, you can end up accidentally force-pushing over those changes, which results in more review churn.

Not a big deal though, totally up to your preference.

I thought python2 support was being dropped

Officially, 3.1.x is the final minor release to support Python 2. 3.2.x will not support it. But we haven't removed the support yet, because if we decide to release a bugfix-only release next (3.1.1), we have to keep that support.

You can follow #3581 for the current status of Python 2 in this project.

AlbertoEAF · 2020-11-25T09:54:16Z

@jameslamb @StrikerRUS finally green tests! Anything left to merge? :D

Thank you so much for all the help ;)

StrikerRUS · 2020-11-25T12:22:58Z

@guolinke Could you please review this?

guolinke · 2020-11-26T03:40:33Z

I will review it today

.gitmodules

include/LightGBM/utils/common.h

jameslamb · 2020-11-28T02:33:29Z

Now that we have all these approvals (!!!), I think we need to wait to merge until we decide if the next release will be 3.1.1 or 3.2.0.

StrikerRUS · 2020-11-28T06:37:28Z

Yeah, I agree that this "sensitive" PR should go into 3.2.0.

guolinke · 2020-11-29T01:21:54Z

okay, let us wait for 3.1.1 release first

AlbertoEAF · 2020-12-08T12:09:00Z

I saw 3.1.1 is out! 🚀

@guolinke do you think we can merge now?

guolinke · 2020-12-08T13:36:10Z

@AlbertoEAF yeah, this should be included in v3.2.0

github-actions · 2023-08-24T04:01:15Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

AlbertoEAF requested review from btrotta, chivee, guolinke, henry0312, huanzhang12, jameslamb, Laurae2, StrikerRUS and wxchan as code owners September 23, 2020 18:12

AlbertoEAF changed the title ~~Fix LightGBM models locale sensitivity and improve R/W performance.~~ Fix locale sensitivity and improve R/W performance. Sep 23, 2020

AlbertoEAF changed the title ~~Fix locale sensitivity and improve R/W performance.~~ Fix model locale sensitivity and improve R/W performance. Sep 23, 2020

AlbertoEAF changed the title ~~Fix model locale sensitivity and improve R/W performance.~~ Fix model locale issue and improve R/W performance. Sep 23, 2020

AlbertoEAF mentioned this pull request Sep 23, 2020

Fix LightGBM models locale sensitivity and improve R/W performance. feedzai/LightGBM#3

Closed

This was referenced Sep 25, 2020

Noob question: Compiling issues - missing strtod_l and locale_t fmtlib/fmt#1901

Closed

Compilation issue: Missing strtod_l and locale_t lemire/fast_double_parser#30

Closed

AlbertoEAF and others added 7 commits November 24, 2020 14:07

Update R-package/.Rbuildignore

eda4759

Co-authored-by: James Lamb <[email protected]>

Upgrade to fast_double_parser v0.3.0 - Solaris support

082c1d4

Use legacy code only in Solaris

cc19daf

Fix lint issues

69d0488

Fix comment

1411133

Address StrikerRUS's comments (solaris ifdef).

ba28c0f

Change header guards

7e17729

AlbertoEAF force-pushed the model-locale-fix branch from 9aeba44 to 7e17729 Compare November 24, 2020 14:07

guolinke reviewed Nov 26, 2020

View reviewed changes

.gitmodules Show resolved Hide resolved

guolinke reviewed Nov 26, 2020

View reviewed changes

include/LightGBM/utils/common.h Show resolved Hide resolved

guolinke approved these changes Nov 28, 2020

View reviewed changes

jameslamb mentioned this pull request Nov 29, 2020

release 3.1.1 #3611

Merged

16 tasks

guolinke merged commit 792c930 into microsoft:master Dec 8, 2020

StrikerRUS mentioned this pull request Dec 8, 2020

[fix] Locale independent model load/save #3267

Closed

This was referenced Dec 10, 2020

[python-package] remove unnecessary files to reduce sdist size #3639

Merged

[R-package] add support for non-ASCII feature names (fixes #2983) #3647

Merged

jameslamb mentioned this pull request Dec 18, 2020

[docs] fix R documentation builds (fixes #3655) #3656

Merged

jameslamb mentioned this pull request Dec 17, 2021

[RFC] Unify model format customize string or Json #4887

Open

github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix model locale issue and improve model R/W performance. #3405

Fix model locale issue and improve model R/W performance. #3405

AlbertoEAF commented Sep 23, 2020 •

edited

Loading

AlbertoEAF commented Sep 23, 2020 •

edited

Loading

AlbertoEAF commented Sep 23, 2020

guolinke commented Sep 24, 2020

henry0312 commented Sep 24, 2020

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

StrikerRUS commented Sep 24, 2020

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

guolinke commented Sep 25, 2020 •

edited

Loading

lemire commented Sep 25, 2020 •

edited

Loading

jameslamb commented Sep 26, 2020

AlbertoEAF commented Sep 29, 2020 •

edited

Loading

AlbertoEAF commented Sep 29, 2020 •

edited

Loading

jameslamb commented Sep 29, 2020 •

edited

Loading

AlbertoEAF commented Nov 24, 2020

jameslamb commented Nov 24, 2020

AlbertoEAF commented Nov 25, 2020 •

edited

Loading

StrikerRUS commented Nov 25, 2020

guolinke commented Nov 26, 2020

jameslamb commented Nov 28, 2020

StrikerRUS commented Nov 28, 2020

guolinke commented Nov 29, 2020

AlbertoEAF commented Dec 8, 2020

guolinke commented Dec 8, 2020

github-actions bot commented Aug 24, 2023

Fix model locale issue and improve model R/W performance. #3405

Fix model locale issue and improve model R/W performance. #3405

Conversation

AlbertoEAF commented Sep 23, 2020 • edited Loading

AlbertoEAF commented Sep 23, 2020 • edited Loading

Running the integration tests

Integration tests and results

AlbertoEAF commented Sep 23, 2020

guolinke commented Sep 24, 2020

henry0312 commented Sep 24, 2020

AlbertoEAF commented Sep 24, 2020 • edited Loading

AlbertoEAF commented Sep 24, 2020 • edited Loading

StrikerRUS commented Sep 24, 2020

AlbertoEAF commented Sep 24, 2020 • edited Loading

AlbertoEAF commented Sep 24, 2020 • edited Loading

guolinke commented Sep 25, 2020 • edited Loading

lemire commented Sep 25, 2020 • edited Loading

jameslamb commented Sep 26, 2020

AlbertoEAF commented Sep 29, 2020 • edited Loading

AlbertoEAF commented Sep 29, 2020 • edited Loading

jameslamb commented Sep 29, 2020 • edited Loading

AlbertoEAF commented Nov 24, 2020

jameslamb commented Nov 24, 2020

AlbertoEAF commented Nov 25, 2020 • edited Loading

StrikerRUS commented Nov 25, 2020

guolinke commented Nov 26, 2020

jameslamb commented Nov 28, 2020

StrikerRUS commented Nov 28, 2020

guolinke commented Nov 29, 2020

AlbertoEAF commented Dec 8, 2020

guolinke commented Dec 8, 2020

github-actions bot commented Aug 24, 2023

AlbertoEAF commented Sep 23, 2020 •

edited

Loading

AlbertoEAF commented Sep 23, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

AlbertoEAF commented Sep 24, 2020 •

edited

Loading

guolinke commented Sep 25, 2020 •

edited

Loading

lemire commented Sep 25, 2020 •

edited

Loading

AlbertoEAF commented Sep 29, 2020 •

edited

Loading

AlbertoEAF commented Sep 29, 2020 •

edited

Loading

jameslamb commented Sep 29, 2020 •

edited

Loading

AlbertoEAF commented Nov 25, 2020 •

edited

Loading