Fix single row prediction performance in a multi-threaded environment #6024

Ten0 · 2023-08-06T23:08:39Z

Store all resources that are reused across single-row predictions in the dedicated SingleRowPredictor (aka FastConfig)
Use that instead of resources in the Booster when doing Fast™ single row predictions to avoid having to lock the Booster exclusively.

Fixes microsoft#6021

- Store all resources that are reused across single-row predictions in the dedicated `SingleRowPredictor` (aka `FastConfig`)
- Use that instead of resources in the `Booster` when doing single row predictions to avoid having to lock the `Booster` exclusively.
- A `FastConfig` being alive now takes a shared lock on the booster (it was likely very incorrect to mutate the booster while this object was already built anyway)
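The locking shape described above can be illustrated with a minimal, self-contained C++17 sketch. It is not LightGBM's code: `ToyBooster`, `ToyPredictor` and `parallel_sum` are hypothetical names, and the "model" is just a weight vector. It only shows that when each predictor owns its scratch buffer, a prediction needs nothing stronger than a shared lock on the booster.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the Booster: read-mostly model state plus a lock.
struct ToyBooster {
  std::vector<double> weights;
  mutable std::shared_mutex mutex;
};

// Hypothetical stand-in for SingleRowPredictor / FastConfig.
// The scratch buffer lives here, not in the booster, so concurrent
// predictions never write to shared state.
class ToyPredictor {
 public:
  explicit ToyPredictor(const ToyBooster& booster)
      : booster_(booster), scratch_(booster.weights.size()) {}

  double Predict(const std::vector<double>& row) {
    // Shared (reader) lock: many predictors may hold it at the same time.
    std::shared_lock<std::shared_mutex> lock(booster_.mutex);
    assert(row.size() == booster_.weights.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < row.size(); ++i) {
      scratch_[i] = row[i] * booster_.weights[i];  // private buffer
      sum += scratch_[i];
    }
    return sum;
  }

 private:
  const ToyBooster& booster_;
  std::vector<double> scratch_;
};

// Runs `num_threads` workers, each with its own predictor on the shared
// booster, and returns the sum of all predictions.
double parallel_sum(const ToyBooster& booster, int num_threads,
                    int rows_per_thread) {
  std::vector<double> partial(num_threads, 0.0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&booster, &partial, t, rows_per_thread] {
      ToyPredictor predictor(booster);  // one predictor per thread
      const std::vector<double> row(booster.weights.size(), 1.0);
      for (int r = 0; r < rows_per_thread; ++r) {
        partial[t] += predictor.Predict(row);
      }
    });
  }
  for (std::thread& w : workers) {
    w.join();
  }
  double total = 0.0;
  for (double p : partial) {
    total += p;
  }
  return total;
}
```

With an exclusive lock in `Predict`, the workers would serialize on the booster; with the shared lock they only contend with writers.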

Ten0 · 2023-08-06T23:19:42Z

A workflow run would help :) (I'm not sure what currently uses this and should be tested.)

Ten0 · 2023-08-07T11:27:17Z

@microsoft-github-policy-service agree

jameslamb · 2023-08-07T16:25:48Z

You don't need to force-push here. All commits will be squashed on merge.

Ten0 · 2023-08-07T22:08:56Z

> You don't need to force-push here. All commits will be squashed on merge.

Thanks! :) That was just me organizing my work though (I wanted this commit specifically to be a single one and working in case it would end up being reverted)

jameslamb · 2023-09-03T21:54:10Z

@shiyu1994 or @guolinke could you please help with a review on this one?

And @AlbertoEAF if you have time?

src/c_api.cpp

AlbertoEAF · 2023-09-10T14:04:02Z

@Ten0 thanks for contributing! Since we're adding capability for it, can you add a test to our C++ test suite showing multiple predictors using the fast API and sharing the booster?

Ten0 · 2023-09-18T17:39:47Z

So I've finished my benchmarks and it turns out:

- I can see the expected performance improvement of x N_CPUs.
- The sub-optimal locking has no impact.

I was afraid that memory barriers due to the read lock might be noticeable, so I thought it a bit unfortunate that we would have to take the shared read lock for every prediction. I was originally considering holding the shared lock for the entire lifetime of the `SingleRowPredictor`, but as it turned out, some implementations using the C library (notably R) rely on being able to update the `Booster` after constructing a `SingleRowPredictor`, so that would cause a deadlock (f41755b). This shows no impact in my benchmarks, so I'm going to be reasonably happy with leaving it as is (that is, with taking a shared lock for every single row prediction, as is currently implemented).
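The trade-off between the two lock scopes can be sketched in a few lines of C++17. This is an illustration under assumptions, not the PR's implementation: `ToyModel`, `PerCallLockPredictor` and `predict_update_predict` are hypothetical names. The predictor takes the shared lock inside each call and releases it on return, so a writer can take the exclusive lock between two predictions.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical stand-in for a Booster that can still be updated.
struct ToyModel {
  std::vector<double> weights;
  mutable std::shared_mutex mutex;

  // Writer path: needs exclusive access, like updating a Booster.
  void Scale(double factor) {
    std::unique_lock<std::shared_mutex> lock(mutex);
    for (double& w : weights) {
      w *= factor;
    }
  }
};

// Hypothetical predictor that locks per call, not for its whole lifetime.
class PerCallLockPredictor {
 public:
  explicit PerCallLockPredictor(const ToyModel& model) : model_(model) {}

  double Predict(const std::vector<double>& row) const {
    // Shared lock is acquired here and released when the call returns.
    std::shared_lock<std::shared_mutex> lock(model_.mutex);
    assert(row.size() == model_.weights.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < row.size(); ++i) {
      sum += row[i] * model_.weights[i];
    }
    return sum;
  }

 private:
  const ToyModel& model_;
};

// Predict, update the model, predict again, all while the predictor is alive.
std::vector<double> predict_update_predict() {
  ToyModel model;
  model.weights = {1.0, 2.0};
  PerCallLockPredictor predictor(model);
  const std::vector<double> row = {1.0, 1.0};

  const double before = predictor.Predict(row);
  // If the predictor held a shared lock for its entire lifetime, this
  // exclusive lock could never be acquired while the predictor exists.
  model.Scale(2.0);
  const double after = predictor.Predict(row);
  return {before, after};
}
```

The cost of this choice is one shared-lock acquisition per prediction, which is the overhead the benchmarks above were checking for.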

stonebrakert6 · 2023-10-11T05:21:44Z

Thanks a lot @Ten0 for this commit. I am a new user of LightGBM (I need to mention this so that you don't confuse me with a veteran).
So, IIRC, the usage now would be something similar to this:

1. `BoosterHandle` is created once and is then shared among all threads.
2. Each thread has its own copy of `FastConfigHandle` and calls `LGBM_BoosterPredictForMatSingleRowFastInit` to initialize its private (non-shared) `FastConfigHandle` using the `BoosterHandle` created in step 1.
3. Each thread can now call `LGBM_BoosterPredictForMatSingleRowFast` (passing in its `FastConfigHandle`) to make predictions.

Is this understanding correct?

Ten0 · 2023-10-11T08:29:52Z

> Each thread has its own copy of FastConfigHandle

Of BoosterHandle but I think that's what you meant. Otherwise yes that's exactly it.

stonebrakert6 · 2023-10-11T10:38:04Z

> Of BoosterHandle but I think that's what you meant.

Sorry, but I didn't mean that. I really meant FastConfigHandle. Quoting you on #6021

> This way, when processing on multiple threads, each thread can make its call to LGBM_BoosterPredictForMatSingleRowFastInit, then work without contention:

Here is the pseudo-code, would need to fill ... to make it compile. Also, it has memory/resource leakage.

```cpp
#include <sys/types.h>  // ssize_t

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

#include "LightGBM/c_api.h"

// # of features
const int kFeatures = 13;
// # of threads which call predict
const int nworkers = 10;
// Shared booster
BoosterHandle booster_handle;

// Every thread executes this function
void predict(ssize_t beg, ssize_t end) {
  // Each thread has its own FastConfigHandle
  FastConfigHandle config;
  int rc = LGBM_BoosterPredictForMatSingleRowFastInit(
      booster_handle, C_API_PREDICT_NORMAL, 0, 0, C_API_DTYPE_FLOAT64,
      kFeatures, "", &config);
  if (rc != 0) {
    std::abort();
  }
  for (ssize_t i = beg; i < end; i++) {
    int64_t len = 0;
    rc = LGBM_BoosterPredictForMatSingleRowFast(config, ...);
  }
}

int main(int argc, char* argv[]) {
  int num_iterations;
  int rc = LGBM_BoosterCreateFromModelfile(argv[1], &num_iterations,
                                           &booster_handle);
  if (rc != 0) {
    std::cout << "LGBM_BoosterCreateFromModelfile() returned " << rc << '\n';
    return 1;
  }
  std::vector<std::thread> workers(nworkers);
  for (ssize_t i = 0; i < nworkers; i++) {
    if (i != nworkers - 1) {
      workers[i] = std::thread(predict, ...);
    }
  }
  for (std::thread& t : workers) {
    // Only join threads that were actually started: joining a
    // default-constructed std::thread throws std::system_error.
    if (t.joinable()) {
      t.join();
    }
  }
  return 0;
}
```

If each thread creates its own Booster, then memory usage for a single model would be x N, where N is the number of threads. So I want to share the booster with all threads, and during prediction I don't want to acquire the write lock (that's why I am using your fork).

Let me know if my understanding of API usage is correct or not. I want to use your fork (because I need scalable predict/inference performance).

Ten0 · 2023-10-11T12:36:49Z

> Let me know if my understanding of API usage is correct or not

Your understanding is correct, it seems to be just a matter of definitions that caused our misunderstanding. (Copy, handle, globals...)

> I want to use your fork (because I need scalable predict/inference performance)

Happy to help :)
If you feel like helping to get this merged, so that you don't have to deal with the complications of using a fork, I'd happily accept contributions to the PR for the "please showcase usage in the C++ test suite" review comment. I'm not sure exactly when I'll have time to finalize this.

Ten0 · 2024-01-12T00:21:11Z

TL;DR: All comments resolved.

> can you add a test to our C++ test suite showing multiple predictors using the fast API and sharing the booster?

That took a while to implement because a similar test didn't already exist (the C API was only tested through the Python wrapper and the other implementations binding to it), but it's finally done! (1fcbc3f)

I've also added a workaround for #6142. The state of this issue is that, with the workaround, users would not hit issues any more with this API than they would with the current existing APIs around both single row and matrix prediction.

AFAICT this is ready for merge.

Side note: we've been using this in production for a few months without any issue (with the workaround implemented in the wrapper)

Ten0 · 2024-01-22T18:09:12Z

@guolinke Static analysis complains that the TODO tag (#6024 (comment)) is not assigned. Who should I assign it to?

jameslamb · 2024-01-22T18:19:46Z

> TODO

We prefer not using TODO comments at all, and instead using the Issues backlog to track future work. You won't find many in this repo.

Could you please put the content of that comment into a new issue at https://github.com/microsoft/LightGBM/issues with as much detail as necessary for someone else to pick it up (and linking to this PR)?

guolinke

Thank you!

jameslamb · 2024-03-18T15:21:18Z

Sorry, I'd missed that @guolinke approved this back in January!

I've just updated it to latest master. Will merge this today assuming that CI passes.

jameslamb · 2024-03-18T21:46:18Z

Thanks very much!

Ten0 requested review from guolinke, jameslamb, shiyu1994 and jmoralez as code owners August 6, 2023 23:08

jameslamb added the awaiting review label Aug 6, 2023

Ten0 added 2 commits August 7, 2023 01:47

fix missing file change

fe31d4e

fix lint

feaf3dc

jameslamb added in progress efficiency labels Aug 7, 2023

Ten0 force-pushed the 6021-fix_single_row_contention branch from 846dc01 to f0e4227 Compare August 7, 2023 16:19

check whether freeze is due to booster shared lock being held by the SingleRowPredictor

f41755b

Ten0 force-pushed the 6021-fix_single_row_contention branch from f0e4227 to f41755b Compare August 7, 2023 22:08

Ten0 and others added 2 commits August 8, 2023 00:46

what you get for having everything as an int

c52a7d7

Merge branch 'master' into 6021-fix_single_row_contention

601316b

Ten0 mentioned this pull request Aug 17, 2023

feat: Change locking strategy of Booster, allow for share and unique locks #2760

Merged

Merge branch 'master' into 6021-fix_single_row_contention

f4cf79d

Merge branch 'master' into 6021-fix_single_row_contention

c7e9d2e

guolinke reviewed Sep 8, 2023

Merge branch 'master' into 6021-fix_single_row_contention

6db33b3

Ten0 added a commit to Ten0/LightGBM that referenced this pull request Jan 12, 2024

Add TODO about FastConfig naming needing to be updated

993f8f5

As discussed here: microsoft#6024 (comment)

Ten0 requested a review from borchero as a code owner January 12, 2024 00:08

Ten0 force-pushed the 6021-fix_single_row_contention branch from f54a6e2 to 6db33b3 Compare January 12, 2024 00:10

add more cleanup comment

49d7217

Ten0 force-pushed the 6021-fix_single_row_contention branch from 2022132 to b3848a5 Compare January 12, 2024 00:30

Ten0 added 2 commits January 12, 2024 01:58

cleanup some unnecessary includes in the test

1a4f1aa

Merge branch '6021-fix_single_row_contention_v4' into 6021-fix_single_row_contention

bf5fc5f

Ten0 force-pushed the 6021-fix_single_row_contention branch from b3848a5 to bf5fc5f Compare January 12, 2024 00:59

Ten0 added 4 commits January 12, 2024 11:19

make windows cpp compiler happy

e00191a

Merge branch '6021-fix_single_row_contention_v4' into 6021-fix_single_row_contention

46ca8e6

hopefully make static analysis happy

bfefc20

Merge branch '6021-fix_single_row_contention_v4' into 6021-fix_single_row_contention

e0d7913

Ten0 added 2 commits January 24, 2024 23:32

Turn TODO into a regular comment

ff613ad

Merge branch '6021-fix_single_row_contention_v4' into 6021-fix_single_row_contention

063008a

Ten0 mentioned this pull request Jan 24, 2024

C public API: Rename FastConfig to SingleRowPredictor #6286

Open

Merge remote-tracking branch 'upstream/master' into 6021-fix_single_row_contention

a01916a

guolinke approved these changes Jan 26, 2024

jameslamb added 2 commits February 2, 2024 23:47

Merge branch 'master' into 6021-fix_single_row_contention

99140fa

Merge branch 'master' into 6021-fix_single_row_contention

4e29833

jameslamb removed in progress awaiting review labels Mar 18, 2024

jameslamb approved these changes Mar 18, 2024

jameslamb merged commit 0a3e1a5 into microsoft:master Mar 18, 2024
40 of 42 checks passed

shuttie mentioned this pull request Jun 27, 2024

Performance improvements metarank/lightgbm4j#81

Closed
