Prediction of 2.1.1 compared to 1.7.6 is significantly slower #10882

Open
Raemi opened this issue Oct 11, 2024 · 4 comments


Raemi commented Oct 11, 2024

We are currently using xgboost 1.6.2 and are trying to upgrade to 2.1.1. On the way through the versions, we observed the following prediction time averages:

1.6.2: 15ms
1.7.6: 17ms
2.0.3: 43ms
2.1.1: 110ms

As you can see, there is a big jump from 1.7 to 2.0, and then an even bigger jump from 2.0 to 2.1. Unfortunately, it is not easy for me to share the model, but I found this related bug report and adapted its scripts to my use case: #8865

import time

import numpy as np
import pandas as pd
import xgboost
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

MODEL_NAME = '/tmp/model.model'


def train_model():
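    # Train a small model on the digits dataset and save it for the prediction benchmarks below.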
    data = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=.2)
    dtrain = xgboost.DMatrix(X_train, label=y_train)
    params = {'max_depth':3, 'eta':1, 'objective':'reg:linear', 'eval_metric':'rmse'}
    bst = xgboost.train(params, dtrain, 10, [(dtrain, 'train')])
    bst.save_model(MODEL_NAME)

def predict_np_array():
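    # Benchmark single-row Booster.inplace_predict() with a plain numpy array (1 thread).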
    bst = xgboost.Booster()
    bst.set_param({"nthread": 1})
    bst.load_model(fname=MODEL_NAME)
    times = []
    np.random.seed(7)
    iterations = 10000
    for _ in range(iterations):
        sample = np.random.uniform(-1, 10, size=(1, 64))
        start = time.time()
        bst.inplace_predict(sample)
        times.append(time.time() - start)
    iter_time = sum(times[iterations // 2:]) / iterations / 2
    print("np.array iter_time: ", iter_time * 1000, "ms")

def predict_sklearn():
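    # Benchmark single-row XGBClassifier.predict_proba() with a one-row pd.DataFrame (1 thread).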
    xgb = XGBClassifier()
    xgb.set_params(n_jobs=1, nthread=1)
    xgb.load_model(fname=MODEL_NAME)
    times = []
    np.random.seed(7)
    iterations = 500
    attrs = {f"{i}" for i in range(64)}
    for _ in range(iterations):
        sample = pd.DataFrame({ind: [np.random.uniform(-1, 10)] for ind in attrs})
        start = time.time()
        xgb.predict_proba(sample)
        times.append(time.time() - start)
    iter_time = sum(times[iterations // 2:]) / iterations / 2
    print("DataFrame iter_time: ", iter_time * 1000, "ms")

if __name__ == "__main__":
    train_model()
    for i in range(10):
        predict_np_array()
        predict_sklearn()

I get the following times when they stabilize:

1.7.6:
np.array iter_time:  0.012594342231750488 ms
DataFrame iter_time:  0.3071410655975342 ms

2.1.1
np.array iter_time:  0.03231525421142578 ms
DataFrame iter_time:  1.8953888416290283 ms

While not as severe for this artificial model, it still looks like a significant performance degradation. I see now that using pd.DataFrame is a lot worse than np.array, so I think I can work around my issue. But it is still surprising to me that the performance regressed that significantly.
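The workaround I have in mind is roughly the following (a sketch, not an official recommendation; it reuses the model file written by the script above): convert the one-row DataFrame to a contiguous float array and hand that to Booster.inplace_predict, which skips the DataFrame handling entirely.

import numpy as np
import pandas as pd
import xgboost

# Workaround sketch: predict from a plain float64 array instead of a DataFrame.
bst = xgboost.Booster()
bst.load_model("/tmp/model.model")

sample = pd.DataFrame({str(i): [np.random.uniform(-1, 10)] for i in range(64)})
# Column order must match the feature order the model was trained with.
arr = np.ascontiguousarray(sample.to_numpy(dtype=np.float64))
preds = bst.inplace_predict(arr)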

Additional context

Our production model has the following attributes (extracted from the model.json, in case that is helpful):

   "scikit_learn": "{\"_estimator_type\": \"classifier\"}",
   "feature_names": [],
   "feature_types": [],
   "gradient_booster": {
      "model": {
        "gbtree_model_param": {
          "num_parallel_tree": "1",
          "num_trees": "272"
        },
      "name": "gbtree",
    "learner_model_param": {
      "base_score": "5E-1",
      "boost_from_average": "1",
      "num_class": "2",
      "num_feature": "310",
      "num_target": "1"
    },
    "objective": {
      "name": "multi:softprob",
      "softmax_multiclass_param": {
        "num_class": "2"
      }
    }
  },
  "version": [
    2,
    1,
    1
  ]

The model was originally trained with xgboost 1.5.2 and then re-saved with 2.1.1.
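(The re-save amounts to roughly the following; a sketch with placeholder file names:)

import xgboost

# Load the model produced under 1.5.2 using the 2.1.1 runtime, then write it
# back out so it is stored in the current JSON format.
bst = xgboost.Booster()
bst.load_model("model_from_1.5.2.json")
bst.save_model("model_2.1.1.json")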

The requirements.lock file

I used these version locks when measuring the numbers above; the only change to the file was xgboost==1.7.6 when testing that version.

All tests were run on Ubuntu 24.04.1 with an 11th Gen Intel(R) Core(TM) i7-11800H.

requirements_dev.zip

@trivialfis
Member

Thank you for opening the issue. Yes, we added some more inspection for pd.DataFrame in order to support its extension types. But this performance degradation looks bad; I will do some profiling.
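For reference, a quick way to see where the time goes in the DataFrame path (a sketch against the reproduction script above, using the standard-library cProfile; the model path and 64-column layout are taken from that script):

import cProfile

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Profile 500 single-row predict_proba calls on a one-row DataFrame and sort by
# cumulative time to surface the per-call data-inspection overhead.
xgb = XGBClassifier()
xgb.load_model("/tmp/model.model")
sample = pd.DataFrame({str(i): [np.random.uniform(-1, 10)] for i in range(64)})

cProfile.run("for _ in range(500): xgb.predict_proba(sample)", sort="cumulative")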

@jankogasic

Hello. I was not sure whether I should open a new issue, but I have the same problem.

I am working on binary classification on 220M with an imbalanced dataset. The training was incremental with 5 batches of data. Version 2.1.2 was really slow, and after that, I tried 1.6.2, which was much faster.
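For context, "incremental with 5 batches" presumably means continuing one booster across the batches, roughly as below (a minimal sketch with synthetic data; the parameters and data loading are assumptions, not the actual setup):

import numpy as np
import xgboost

params = {"objective": "binary:logistic", "tree_method": "hist"}
booster = None
rng = np.random.default_rng(0)

# Synthetic stand-in for the 5 data batches; each round continues the previous
# model by passing it back in via xgb_model.
for _ in range(5):
    X = rng.uniform(size=(10_000, 20))
    y = (X[:, 0] > 0.5).astype(np.int32)
    dtrain = xgboost.DMatrix(X, label=y)
    booster = xgboost.train(params, dtrain, num_boost_round=50, xgb_model=booster)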

@trivialfis
Member

@jankogasic Could you please share the training parameters?

@trivialfis
Member

As for the original issue, here are the two functions that take up the bulk of the time:

I'm sure some people understand the internals of pandas much better than I do. If you have suggestions for how to optimize this, please share.
