LightGBM does not like NUMA (large performance impact on servers) #1441
Comments
@Laurae2 BTW, the multi-threading is based on OpenMP, so maybe we need additional settings to improve OpenMP's performance on NUMA: https://stackoverflow.com/questions/11959906/openmp-and-numa-relation http://prace.it4i.cz/sites/prace.it4i.cz/files/files/advancedopenmptutorial_2.pdf
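A minimal first-touch sketch of one technique from those OpenMP/NUMA references (illustrative only, not LightGBM code): on Linux, a page is physically placed on the NUMA node of the thread that first writes it, so initializing data in parallel with the same schedule as the later compute loop keeps accesses node-local.

```cpp
// Compile with: g++ -O2 -fopenmp first_touch.cpp
#include <cstddef>

int main() {
  const std::ptrdiff_t n = 1 << 26;
  double* data = new double[n];  // pages not yet placed on any node

  // First touch: each thread writes the pages it will later use, so the
  // OS places those pages on that thread's NUMA node.
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    data[i] = 0.0;

  // A compute loop with the identical schedule then hits mostly local pages.
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    data[i] += 1.0;

  delete[] data;
  return 0;
}
```

Combining this with the standard OpenMP environment variables `OMP_PROC_BIND=close` and `OMP_PLACES=cores` keeps threads from migrating away from the pages they touched.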
I also observed a similar phenomenon before. Ideally, LightGBM should be made NUMA-aware. BTW, a few years ago I wrote a paper on optimizing SGD on NUMA machines which uses similar techniques: https://ieeexplore.ieee.org/document/7837887/ (HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent). Because we use static work scheduling, the slowest CPU will determine the running time, which makes our situation even worse. We can first try dynamic OpenMP scheduling (change `schedule(static)` to `schedule(dynamic)` in the relevant `#pragma omp` directives).
Thanks so much, @huanzhang12. That is very helpful.
@Laurae2 any chance to try @huanzhang12's suggestions?
@guolinke Is it only this line to change? https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L429 to: `#pragma omp parallel for schedule(dynamic, 1024) if (num_features_ >= 2048)` Or should I change all `static` schedules to `dynamic`?
@Laurae2 I guess there are many others, but this one should be the most important one if the data is large and dense (like Higgs). You can try it first.
@Laurae2 Oh sorry, I actually misread the code. The major computation loop is actually in `ConstructHistograms` (there are some different cases there, so you probably need to change all the `schedule(static)` clauses in that function). A sketch of the change is shown below.
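A minimal sketch of the suggested change (illustrative shape only, not the actual LightGBM loop); the `schedule` clause is the only difference from the original code:

```cpp
// Compile with: g++ -O2 -fopenmp dynamic_sched.cpp
#include <cstdio>
#include <vector>

int main() {
  const int num_features = 4096;
  std::vector<double> work(num_features, 0.0);

  // schedule(static) gives each thread one fixed block of features, so a
  // thread stalled by remote NUMA accesses gates the whole loop.
  // schedule(dynamic, 1024) hands out 1024-feature chunks on demand instead,
  // at the cost of extra scheduling overhead.
  #pragma omp parallel for schedule(dynamic, 1024)
  for (int fid = 0; fid < num_features; ++fid) {
    for (int it = 0; it < 100000; ++it)  // placeholder per-feature work
      work[fid] += 1.0;
  }

  std::printf("%f\n", work[0]);
  return 0;
}
```

The `if (num_features_ >= 2048)` clause quoted earlier additionally makes OpenMP run the loop serially when there are too few features to amortize the threading overhead.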
@guolinke @huanzhang12 Here are some results with different schedulers on the function you quoted (I changed all static schedulers in ConstructHistograms), using a dual Xeon Gold 6130 (3.7 GHz single-thread, 2.8 GHz all-core). Timings, average of 5 runs:
Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:
This is the CPU behavior observed during training:
@Laurae2 So it seems the scheduler choice has a large impact. Which dataset are these results from?
@huanzhang12 Higgs (11M observations, https://archive.ics.uci.edu/ml/datasets/HIGGS), with 2-way interactions (multiplication) of all features. That makes a total of 11M observations × 408 features.
@Laurae2 Thanks for the clarification!
@huanzhang12 It seems my dynamic scheduler results are incorrect because I changed the wrong pragmas (not in the right function; I'm using commit 3f54429, where the line numbers differ from the master branch). I'll come back later today/tomorrow with updated results.
@Laurae2 Thanks for the clarification! Looking forward to the updated results 🙂
@huanzhang12 New & correct results below. Higgs with two-way interactions using multiplication (11M x 408 features). Speed, average of 5 runs:
Efficiencies vs 1 thread (1 thread is 132% efficient on my CPU because of turbo boost), average of 5 runs:
It seems using the dynamic scheduler always made LightGBM slower. On smaller datasets it was up to 8x slower when using all 64 threads (probably the worst-case scenario, because the optimal number of threads there was 2). @huanzhang12 Do you have any other tentative solutions in mind?
@Laurae2 Thanks for the detailed benchmarking! It seems the dynamic scheduler has significant overhead and cannot help in our case, so this problem cannot be easily fixed. You can try a more detailed profiling run with Intel VTune Amplifier to see where the slowdown comes from; that would be very helpful. Ideally, we need to redesign the histogram construction to be NUMA-aware.
Does changing a relevant `sysctl` setting help here?
@huanzhang12 Sorry for the late response. Yeah, the ideal solution is to make it NUMA-aware.
@Laurae2 could you help with documentation on how to pin cores when calling LightGBM?
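Pending proper docs, a hedged sketch of what pinning means at the code level on Linux (in practice the same effect is usually achieved without code changes by setting the standard OpenMP variables `OMP_PROC_BIND=close` and `OMP_PLACES=cores` before launching LightGBM):

```cpp
// Compile with: g++ -O2 -fopenmp pin_threads.cpp
#include <sched.h>   // sched_setaffinity (glibc/Linux)
#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel
  {
    // Pin each OpenMP thread to one logical CPU so it cannot migrate to a
    // core on another NUMA node (naive mapping: thread i -> CPU i).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(omp_get_thread_num(), &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
      std::perror("sched_setaffinity");
  }
  std::printf("threads pinned\n");
  return 0;
}
```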
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
New results: szilard/GBM-perf#29 (comment).
Environment info
By default, servers ship with NUMA enabled. However, with UMA (Node Interleaving), LightGBM performance increases significantly (sometimes by 80%+). On dual-processor systems, the impact can be over 30%.
This issue is not limited to R; it also affects Python and the CLI.
I also tested the impact on an 8x Xeon Platinum 8180 system (768GB RAM, all RAM banks populated evenly): it was 200%+ slower with Node Interleaving off (NUMA on).
xgboost is also affected by this issue, but less severely than LightGBM.
Related issues:
Untested: `numactl` was not used because, first, it is not available on Windows (Windows pins data to the correct RAM banks only if affinity is specified), and second, it does not make sense when using all available threads (a small libnuma sketch below shows the same interleaving requested from code).
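For completeness, a hedged sketch of what BIOS-level Node Interleaving does, requested per-allocation via libnuma (Linux only; illustrative, not something LightGBM does today):

```cpp
// Compile with: g++ -O2 interleave.cpp -lnuma
#include <numa.h>    // libnuma
#include <cstddef>
#include <cstdio>

int main() {
  if (numa_available() < 0) {
    std::fprintf(stderr, "libnuma: NUMA not available\n");
    return 1;
  }
  // Spread the pages of this buffer round-robin across all NUMA nodes,
  // mimicking BIOS Node Interleaving for just this allocation.
  const std::size_t bytes = 1ULL << 30;  // 1 GiB stand-in for a big buffer
  double* data = static_cast<double*>(numa_alloc_interleaved(bytes));
  if (data == nullptr) return 1;
  data[0] = 1.0;  // touch the buffer
  numa_free(data, bytes);
  return 0;
}
```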
Reproducible examples
I used a private dataset (takes 250GB+ RAM), but the issue is easily reproducible using the HIGGS dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS
See the results on a dual Xeon Gold 6130 and 384GB RAM here: https://public.tableau.com/views/NodeInterleavingv1/Dashboard?:bootstrapWhenNotified=true&:display_count=y&:display_static_image=y&:embed=y&:showVizHome=no&publish=yes
Steps to reproduce
Code from #542 can be used through the CLI.