Monotonicity of 0's different result than no monotonicity set #4936

Open · pseudotensor opened this issue Jan 7, 2022 · 9 comments
pseudotensor commented Jan 7, 2022

Attached data: df.csv

import lightgbm as lgb
import pandas as pd

df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample': 0.7,
    'subsample_freq': 1,
    'random_state': 1234,
    'deterministic': True,
    'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

This is an MRE (minimal reproducible example), but more complicated setups show the same behavior and lead to arbitrarily different predictions.

[15.14082331 17.42164852 17.27809565 24.66228158 31.7912921  25.07472208
 38.66674383 30.14650357 24.10321141 19.28946953 23.59302071 19.06902673
 13.8462058  50.01765005 22.09789499 15.48053459 21.67382484 23.59418642
 16.14454625 20.21541095]
[15.11855937 17.458745   17.22814776 24.65581646 31.72203189 25.0968243
 38.65681328 30.14518504 24.03192297 19.31654484 23.60164724 19.06489953
 13.85319787 50.03306803 22.07789009 15.45374806 21.75793041 23.61585482
 16.14733807 20.18282579]
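For scale, a quick numpy check of the two prediction vectors printed above (values copied from this report) shows the differences are small but clearly nonzero:

```python
import numpy as np

# First 20 predictions with all-zero monotone_constraints (from this report)
with_constraints = np.array([
    15.14082331, 17.42164852, 17.27809565, 24.66228158, 31.7912921, 25.07472208,
    38.66674383, 30.14650357, 24.10321141, 19.28946953, 23.59302071, 19.06902673,
    13.8462058, 50.01765005, 22.09789499, 15.48053459, 21.67382484, 23.59418642,
    16.14454625, 20.21541095])
# First 20 predictions with monotone_constraints removed (from this report)
without_constraints = np.array([
    15.11855937, 17.458745, 17.22814776, 24.65581646, 31.72203189, 25.0968243,
    38.65681328, 30.14518504, 24.03192297, 19.31654484, 23.60164724, 19.06489953,
    13.85319787, 50.03306803, 22.07789009, 15.45374806, 21.75793041, 23.61585482,
    16.14733807, 20.18282579])

diff = np.abs(with_constraints - without_constraints)
print(f"max abs diff:  {diff.max():.6f}")
print(f"mean abs diff: {diff.mean():.6f}")
```

The gap is well above float noise, so the two runs trained genuinely different trees.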

jameslamb commented Jan 8, 2022

Thanks for this write-up and reproducible example! I ran this code tonight and can confirm I got the same results you reported.

Experimenting with this, I found that any of the following individual changes result in the predictions being identical:

  • removing parameter subsample
  • adding "bagging_seed": 708 to params
  • removing monotone_constraints
Test code:
import lightgbm as lgb
import pandas as pd
import numpy as np

data_url = "https://github.com/microsoft/LightGBM/files/7831557/df.csv"
df = pd.read_csv(data_url)

y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample_freq': 1,
    'random_state': 1234,
    'deterministic': True,
    'monotone_constraints': [0] * X.shape[1],
}

model = model_class(**params)
model.fit(X, y)
preds1 = model.predict(X)
print(preds1[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds2 = model.predict(X)
print(preds2[0:20])

assert np.allclose(preds1, preds2)

This definitely looks like a bug, but it is more specific than "passing all 0s for monotone_constraints results in a different model".

I think it looks like:

Providing all 0s for monotone_constraints results in a different model than if monotone_constraints are not provided, if also using bagging and not setting bagging_seed.
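The bagging angle can be pictured with a hypothetical numpy sketch (this is NOT LightGBM's actual sampling code, and `bagging_mask` is an invented helper): if the effective seed used for row subsampling ends up derived differently depending on which parameters are present, the bagged row subsets, and therefore the trees, diverge even with a fixed random_state.

```python
import numpy as np

# Hypothetical sketch only -- NOT LightGBM's actual sampling internals.
def bagging_mask(seed, n_rows, fraction):
    """Pick the random row subset one bagging iteration would train on."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_rows * fraction)
    return np.sort(rng.choice(n_rows, size=n_keep, replace=False))

n_rows = 253  # dataset size reported in this issue's training logs

# Same user-facing random_state, but suppose the derived bagging seed shifts
# when monotone_constraints is present (purely illustrative seeds):
mask_a = bagging_mask(seed=1234, n_rows=n_rows, fraction=0.7)
mask_b = bagging_mask(seed=1235, n_rows=n_rows, fraction=0.7)

print("identical row subsets:", np.array_equal(mask_a, mask_b))
```

If the all-zero monotone_constraints entry perturbs seed derivation or parameter handling somewhere, an effect like this would explain the symptom.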

@jameslamb added the bug label Jan 8, 2022
pseudotensor commented Jan 8, 2022

I don't find the same result as you. Adding bagging_seed doesn't help:

import lightgbm as lgb
import pandas as pd

df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample': 0.7,
    'subsample_freq': 1,
    'random_state': 1234,
    'bagging_seed': 708,
    'deterministic': True,
    'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

gives

[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]

Also note that in making the MRE I originally had bagging_seed set to 1236 and determined it didn't matter/affect the outcome, which is why I removed it. It is not relevant: you can keep it if you wish and the same problem I described still happens.

And again, just because bagging triggers it here doesn't mean it is the only trigger. That is simply where my MRE reduction landed: one specific case that happens to show it.

@jameslamb (Collaborator)

What version of lightgbm are you on, and how did you install it? I tested this on latest master (305369d).

@pseudotensor (Author)

>>> lgb.__version__
'3.2.1.99'

I can try on master.

But are you aware of a specific fix?

@jameslamb (Collaborator)

> But are you aware of a specific fix?

No, I didn't know this issue existed until this bug report. Just trying to help narrow it down further.

pseudotensor commented Jan 8, 2022

Same result with the latest release from PyPI, on a different machine, after a clean install:

virtualenv jon
source jon/bin/activate
pip install numpy pandas scikit-learn lightgbm

and the same script with bagging_seed set:

import lightgbm as lgb
import pandas as pd

df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample': 0.7,
    'subsample_freq': 1,
    'random_state': 1234,
    'bagging_seed': 708,
    'deterministic': True,
    'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

gives:

[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]

version:

>>> lgb.__version__
'3.3.2'

@jameslamb (Collaborator)

3.3.2 is a special release that doesn't include most of the changes currently on master. It's just 3.3.1 + one small patch requested by the CRAN maintainers (see discussion in #4923).

To install from latest master:

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install

@pseudotensor (Author)

Same result:

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000261 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]
>>> import lightgbm as lgb
>>> print(lgb.__version__)
3.3.2.99

@jameslamb (Collaborator)

Ah yeah, you're right: I just tried again and got the same result you did. I crossed out the suggestion about bagging_seed in #4936 (comment); I must have accidentally changed two things at once when I thought I was testing only that.
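Until the root cause is found, one practical workaround (a sketch, not anything official in LightGBM; `drop_trivial_monotone_constraints` is an invented helper) is to strip an all-zero monotone_constraints entry before constructing the estimator, since a vector of zeros constrains nothing anyway:

```python
def drop_trivial_monotone_constraints(params):
    """Return a copy of params without monotone_constraints when every
    entry is 0, since an all-zero vector imposes no constraint anyway."""
    cleaned = dict(params)
    constraints = cleaned.get('monotone_constraints')
    if constraints is not None and all(c == 0 for c in constraints):
        cleaned.pop('monotone_constraints')
    return cleaned

params = {'min_child_samples': 1, 'monotone_constraints': [0] * 13}
print(drop_trivial_monotone_constraints(params))
# mixed constraints are preserved:
print(drop_trivial_monotone_constraints({'monotone_constraints': [0, 1, -1]}))
```

Passing the cleaned dict to LGBMRegressor should sidestep the divergence reported here, at the cost of silently dropping a parameter the caller supplied.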
