Is predict(..., pred_contrib=True) thread safe? #5482

Open
zyxue opened this issue Sep 13, 2022 · 3 comments

@zyxue
Contributor

zyxue commented Sep 13, 2022

Description

I'm using the model behind a gRPC prediction service. The service predicts one example at a time.

I can reproduce the bug locally with code like the following:

from concurrent.futures import ThreadPoolExecutor

# `model` is a trained lightgbm.Booster and `df` a pandas DataFrame
# (see the full reproducible example below)
data = [df.loc[:1][model.feature_name()]] * 1_000

def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)

with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error looks like

free(): invalid next size (normal)

or

double free or corruption (!prev)

or

malloc(): corrupted top size

depending on the run.

Reproducible example

I can produce a different error, also related to threading, using the following code:

from concurrent.futures import ThreadPoolExecutor

import sklearn.datasets
import lightgbm

df = (
    sklearn.datasets.load_iris(as_frame=True)["frame"]
    .sample(99, random_state=123)
    .rename(
        columns={
            "sepal length (cm)": "sepal_length",
            "sepal width (cm)": "sepal_width",
            "petal length (cm)": "petal_length",
            "petal width (cm)": "petal_width",
        }
    )
    .assign(sepal_length_cat=lambda df: (df.sepal_length > 1).astype(str).astype('category'))
    .reset_index(drop=True)
)

X, y = df.drop(columns="target"), df["target"]

regressor = lightgbm.LGBMRegressor(n_estimators=100, max_depth=7, objective="mse").fit(
    X, y
)

model = regressor.booster_

print(f'{df.dtypes=:}')

data = [df.loc[:1][model.feature_name()]] * 10_000


def _predict(df_one_row):
    return model.predict(df_one_row, pred_contrib=True)


with ThreadPoolExecutor(max_workers=32) as exc:
    exc.map(_predict, data)

The error looks like

corrupted size vs. prev_size
corrupted size vs. prev_size

or

python3: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

Environment info

LightGBM version or commit hash:

lightgbm==3.3.2

Command(s) you used to install LightGBM

I'm using lightgbm inside a monorepo with Bazel, but I think under the hood it's equivalent to python -m pip install lightgbm

Additional Comments

@shuttie
Contributor

shuttie commented Jun 28, 2024

This seems to be an issue with the pred_contrib=True handling in the C library itself: in the lightgbm4j library, concurrent prediction crashes the whole JVM process because the call is not thread-safe. See issue metarank/lightgbm4j#88 for details and a reproducer.
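For anyone hitting this on an affected version, a minimal workaround sketch (assuming the crashes really do come from concurrent calls into the same Booster, as described above) is to serialize the predict calls with a lock; model here is the Booster from the reproducer in this issue:

import threading

_predict_lock = threading.Lock()

def _predict(df_one_row):
    # Only one thread at a time enters the shared Booster's predict();
    # this trades throughput for safety while the call is not thread-safe.
    with _predict_lock:
        return model.predict(df_one_row, pred_contrib=True)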

@AndreyOrb

I found a bug in Predictor related to incorrect usage of omp_get_thread_num.

int tid = omp_get_thread_num();

When the calling code is multi-threaded (threads created outside of OpenMP), omp_get_thread_num() can return tid=0 in every thread, because each caller is outside an OpenMP parallel region.
https://stackoverflow.com/questions/68484180/is-it-safe-to-use-omp-get-thread-num-to-index-a-global-vector
https://stackoverflow.com/questions/4087852/omp-set-num-threads-always-returns-0-and-im-unable-to-get-thread-num-with-omp-ge
https://stackoverflow.com/questions/43952078/openmp-always-working-on-the-same-thread

So, different threads (predict_fun_ lambdas) can write to the same memory block.
There's another issue mentioning the same problem: #3751

I think the same problem exists in all 4 prediction types.
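Until a fix lands, one possible mitigation on the Python side (a sketch, assuming the corruption comes from caller threads sharing one Booster's internal prediction buffers) is to give every worker thread its own Booster, rebuilt from the serialized model; model here is the Booster from the reproducer above:

import threading

import lightgbm

_thread_local = threading.local()
_model_str = model.model_to_string()

def _predict(df_one_row):
    # Each thread lazily builds its own Booster from the model string,
    # so no internal prediction buffers are shared across threads.
    if not hasattr(_thread_local, "booster"):
        _thread_local.booster = lightgbm.Booster(model_str=_model_str)
    return _thread_local.booster.predict(df_one_row, pred_contrib=True)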

@jameslamb
Collaborator

Thanks @AndreyOrb. Are you interested in proposing a fix? We'd welcome it!

It is probably worth looking at these other places, to see if they're safe:

git grep omp_get_thread_num
