
[python-package] Introduce refit_tree_manual to Booster class. #6617

Open
wants to merge 13 commits into base: master

Conversation

neNasko1 (Contributor)

This PR strives to introduce a way to refit a tree manually, for example during a callback. Currently, the only related functionality is set_leaf_output; however, this does not update the predictions underneath.

Enabling this will allow users to implement regularisation methods that are out of scope for the library, e.g. honest splits. The provided test also includes a debiasing callback, which creates a model whose mean matches the dataset mean.

I am open to discussion of this feature and I am wondering if there is already some work in a related direction!

@jameslamb (Collaborator) left a comment

Thanks for your continued interest in LightGBM.

At first glance, I'm -1 on this proposal.

  1. I don't understand what you mean when you say that set_leaf_output() does not update "the predictions underneath". Can you elaborate?
  2. Couldn't this same behavior be achieved by using a custom objective function? The tree output values are in terms of the loss function.

@neNasko1 (Contributor, Author) commented Aug 16, 2024

Thank you for the fast response!

Maybe I am missing something but:

  1. I don't understand what you mean when you say that set_leaf_output() does not update "the predictions underneath". Can you elaborate?

If you try to refit the tree using set_leaf_output calls from a callback, those updates are essentially ignored in the next boosting rounds, because they do not update the score (i.e., via AddScore); see the sketch at the end of this comment.

  2. Couldn't this same behavior be achieved by using a custom objective function? The tree output values are in terms of the loss function.

For the case of debiasing it may indeed be possible (I am not sure how), but for honest splits (splitting on one dataset and computing the leaf values on another) I do not think it is remotely possible.

All in all, I think this change introduces a clean way to manage the training of the tree with a minimal user-facing API. There is also precedent in the existence of rollback_one_iter.
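
To make the difference concrete, here is a minimal sketch of the two behaviours inside a training callback (the scaffolding and variable names are illustrative; refit_tree_manual is the API proposed in this PR):

import numpy as np
import lightgbm as lgb

def patch_last_tree(env, new_leaf_values: np.ndarray) -> None:
    # `env` is the callback environment passed by LightGBM during training;
    # the two calls below are shown side by side only for comparison.
    booster: lgb.Booster = env.model
    tree_id = env.iteration

    # Existing API: only the stored leaf values change; the cached training
    # scores are untouched, so the next boosting round still computes its
    # gradients against the old predictions.
    for leaf_id, value in enumerate(new_leaf_values):
        booster.set_leaf_output(tree_id, leaf_id, value)

    # Proposed API: the leaf values are overwritten and the change is also
    # propagated into the training scores (AddScore on the C++ side), so the
    # next boosting round fits against the updated predictions.
    booster.refit_tree_manual(tree_id, new_leaf_values)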

@lbittarello

I can provide some context for this PR. :)

As @neNasko1 mentioned, we are interested in implementing debiasing and honest splits.

With certain objective functions, predictions can be biased. For example, the "gamma" and "tweedie" inbuilt objectives typically undershoot. That occurs because they apply the exponential link function to the raw scores (which is not canonical in a GLM sense). In business contexts, bias can be unacceptable (say, you shouldn't systematically underestimate risk in insurance). While we could patch leaf values in one go at the end of training, the tree structure will have been contaminated by the bias in each iteration. We could also use a different objective function, but that has drawbacks too (poorer fit overall, loss of multiplicative structure, etc.).
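
(As a quick way to see this effect in practice, one can compare the aggregate prediction of a fitted model with the observed mean on a held-out sample; model, X_test and y_test below are illustrative names:)

import numpy as np

# Aggregate-bias check: with the built-in "gamma" or "tweedie" objectives this
# ratio typically comes out below 1, i.e. the model undershoots on average.
bias_ratio = model.predict(X_test).mean() / np.asarray(y_test).mean()
print(f"predicted mean / observed mean = {bias_ratio:.3f}")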

Honest splits involve using one data set to determine the splits and another to compute the leaf values. It is a form of regularization: if a tree overfits and produces a spurious split, the leaf values should still end up similar, because outcomes in the second data set shouldn't differ across the spurious split. Honest splits also produce interesting statistical properties for researchers (see Wager and Athey (2018)): the resulting predictions can be shown to be asymptotically consistent, unbiased and normally distributed. Here, again, we could refit leaf values after training, but the tree structure will have been contaminated.
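
(For reference, the post-hoc variant mentioned above is already possible with the existing Booster.refit(), which recomputes all leaf values on new data while keeping the learned structure; booster, X_honest and y_honest below are illustrative names:)

# Post-hoc honest refit with the existing API: leaf values are recomputed on a
# held-out sample, but the (possibly contaminated) tree structure is unchanged.
# decay_rate=0.0 makes the new leaf values fully replace the old ones.
refitted_booster = booster.refit(X_honest, y_honest, decay_rate=0.0)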

In both cases, we need to adjust the leaf values of each tree after its construction. We then need to compute the raw scores, accounting for the updated values, before constructing the next tree. We're planning on performing the adjustment in downstream callbacks, minimising the disruption in LightGBM itself. However, we need to be able to modify leaf values after each iteration and to have those modifications reflected in the raw scores that LightGBM will use in the following iteration (hence this PR).

If LightGBM already offers similar functionality, we'd obviously prefer to use that. :)

@lbittarello

@lorentzenchr might be interested too.

@jameslamb (Collaborator)

Thanks very much for the detailed explanation.

To set the right expectation, I personally will need to do some significant reading on these topics to be able to provide a thoughtful review. We are very cautious here with expansions of the public API (both in the Python package and the C API)... such expansions are relatively easy to make, but relatively difficult to take back (because removing a part of the public API is a user-facing breaking change). That same comment applies to the proposal in #6568 as well.

Hopefully some other reviewer who's more knowledgeable about these topics will be able to offer a review.

@lorentzenchr (Contributor)

@lbittarello I read https://arxiv.org/abs/1510.04342 and see the asymptotic results as valid for random forests, not so much for boosted trees.
The bias of, e.g., the Gamma deviance with a log link indeed stems from the fact that the log link is non-canonical for the Gamma deviance. IMHO, this won't change with honest trees:
even if the leaf values are unbiased, they live on the log scale, and exponentiating them introduces the bias (the same for gradient boosting as for GLMs).

I have had good experiences with a multiplicative (one-parameter) recalibration for non-canonical losses with a log link; see scikit-learn/scikit-learn#26311.
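
(A minimal sketch of that kind of recalibration, with illustrative names; the exact estimator of the factor depends on the loss being recalibrated, and the ratio of means is just the simplest choice:)

import numpy as np

# One-parameter multiplicative recalibration: estimate a single factor on a
# calibration sample and rescale all predictions with it. With a log link this
# is equivalent to adding a constant offset to the raw scores.
factor = np.asarray(y_cal).mean() / model.predict(X_cal).mean()
recalibrated = factor * model.predict(X_new)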

@neNasko1 (Contributor, Author) commented Aug 27, 2024

@lorentzenchr

Maybe I did not explain in enough detail. This change enables both honest splits and debiasing (i.e. in different implementations) to be implemented as callbacks to training. Debiasing in the case of the gamma loss (implemented as a test) can be as simple as offsetting the raw scores after tree training.

The honest-split use case is not connected to debiasing; it is rather a way of reducing overfitting, and it can be applied to all types of losses.

Again, it is worth noting that if a similar mechanism already exists (being able to change the leaf outputs during training and have the scores for the training data updated efficiently), we will obviously prefer to use it instead of contributing changes.
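
For concreteness, here is a rough sketch of what such a debiasing callback could look like on top of the proposed refit_tree_manual (it is only illustrative, not the exact callback from the PR's test; X_cal and y_cal stand for a calibration sample):

import numpy as np

def make_debias_callback(X_cal, y_cal):

    def _callback(env):
        booster = env.model

        # Leaf assignment and raw contribution of the tree trained in this iteration.
        leaf_idx = booster.predict(
            X_cal, pred_leaf=True, start_iteration=env.iteration, num_iteration=1
        )
        contrib = booster.predict(
            X_cal, raw_score=True, start_iteration=env.iteration, num_iteration=1
        )

        # Recover the current leaf values from the per-sample contributions.
        # Caveat: leaves never visited by the calibration sample stay at 0 here;
        # a complete implementation would read their values from dump_model().
        num_leaves = booster.dump_model()['tree_info'][env.iteration]['num_leaves']
        leaf_values = np.zeros(num_leaves)
        leaf_values[leaf_idx] = contrib

        # With a log link, adding a constant to every leaf rescales the predictions
        # multiplicatively, so this offset matches the mean prediction on the
        # calibration sample to the observed mean.
        offset = np.log(np.asarray(y_cal).mean() / booster.predict(X_cal).mean())

        # Overwrite the leaf values and commit the change to the training scores,
        # so the next boosting round starts from the debiased predictions.
        booster.refit_tree_manual(env.iteration, leaf_values + offset)

    return _callback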

@lbittarello

the asymptotic results as valid for random forests, not so much for boosted trees.

True. LightGBM also supports random forests, though, and honest splits are still useful as a regularisation strategy for GBMs. :)

The bias of, e.g., the Gamma deviance with a log link indeed stems from the fact that the log link is non-canonical for the Gamma deviance. IMHO, this won't change with honest trees

As @neNasko1 explained, this PR also enables debiasing, which is independent of honest splitting.

@neNasko1 (Contributor, Author) commented Aug 28, 2024

Since everything written above is a bit abstract, let me illustrate by example the utility honest splits can bring to a trained model.
This example uses the French Motor Claims datasets. Here is sample code that implements honest-splits training, using a callback to refit the tree after each training step:

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder

class HonestSplitCallback():
    """Refit each newly trained tree on a held-out ("honest") sample."""

    def __init__(self, data, labels):
        self.score = 0  # raw scores of the refitted model on the honest sample
        self.data = data
        self.labels = labels

    def _init(self, env):
        self.learning_rate = env.model.params["learning_rate"]

    def _get_gradients(self):
        # Gradient and hessian of the gamma deviance with a log link,
        # evaluated at the current raw scores on the honest sample.
        # Note that `mu` here is exp(-score), i.e. the inverse of the predicted mean.
        y = self.labels
        mu = np.exp(-self.score)

        gradient = 1 - y * mu
        hessian = y * mu

        return gradient, hessian

    def _internal_refit(self, env):
        booster = env.model

        gradient, hessian = self._get_gradients()

        # Leaf assignment of the honest sample in the tree trained this iteration.
        predicted_leaves = booster.predict(
            self.data,
            pred_leaf=True,
            start_iteration=env.iteration,
            num_iteration=1,
        )

        # One Newton step per leaf, computed on the honest sample and shrunk by
        # the learning rate; leaves not visited by the honest sample get 0.
        sums = pd.DataFrame(
            {'grad': gradient, 'hess': hessian, 'leaf': predicted_leaves}
        ).groupby('leaf').sum()
        sums['refitted'] = -sums['grad'] / sums['hess'] * self.learning_rate
        refitted_leaves = np.zeros(env.model.dump_model()['tree_info'][env.iteration]['num_leaves'])
        refitted_leaves[sums.index.to_numpy()] = sums['refitted'].to_numpy()

        # Overwrite the leaf values and propagate the change into the training scores.
        booster.refit_tree_manual(
            env.iteration,
            refitted_leaves
        )
        self.score += refitted_leaves[predicted_leaves]

    def __call__(self, env):
        if env.iteration == env.begin_iteration:
            self._init(env)

        self._internal_refit(env)

df = pd.read_csv("https://www.openml.org/data/get_csv/20649148/freMTPL2freq.arff", quotechar="'")
df_sev = pd.read_csv("https://www.openml.org/data/get_csv/20649149/freMTPL2sev.arff", index_col=0)

df.rename(lambda x: x.replace('"', ''), axis='columns', inplace=True)
df['IDpol'] = df['IDpol'].astype(np.int64)
df.set_index('IDpol', inplace=True)

df = df.join(df_sev.groupby(level=0).sum(), how='left')
df.fillna(value={'ClaimAmount': 0, 'ClaimAmountCut': 0}, inplace=True)

labelencoder = LabelEncoder()

df['Area'] = labelencoder.fit_transform(df['Area'])
df['VehBrand'] = labelencoder.fit_transform(df['VehBrand'])
df['VehGas'] = labelencoder.fit_transform(df['VehGas'])
df['Region'] = labelencoder.fit_transform(df['Region'])

df = df[df['ClaimAmount'] > 0]

df_train = df[['Area', 'VehPower', 'VehAge', 'DrivAge', 'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region']]
# Three-way split: the tree structure is learned on 'train', leaf values are
# refitted on 'hsplit', and 'test' is held out for evaluation.
sample_column = np.random.choice(['train', 'hsplit', 'test'], len(df))

honest_split_callback = HonestSplitCallback(
    df_train[sample_column == 'hsplit'],
    df['ClaimAmount'][sample_column == 'hsplit']
)

bst = lgb.LGBMRegressor(
    learning_rate=0.05,
    n_estimators=500,
    num_leaves=31,
    objective='gamma'
).fit(
    df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train'],
    eval_set=[
        (df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train']),
        (df_train[sample_column == 'hsplit'], df['ClaimAmount'][sample_column == 'hsplit']),
        (df_train[sample_column == 'test'], df['ClaimAmount'][sample_column == 'test']),
    ],
    eval_names=["train", "hsplit", "test"],
    callbacks=[
        honest_split_callback,
        lgb.log_evaluation(period=50),
    ],
    eval_metric='gamma',
)

The example is rudimentary and only for illustration purposes, but by training with and without the callback and examining the "test" dataset we can see the presence of overfitting in the vanilla training and its absence in the honest-splits run:
log_evaluation(period=50) without honest splits:

[50]	train's gamma: 8.47428	hsplit's gamma: 8.75784	test's gamma: 8.79983
[100]	train's gamma: 8.39185	hsplit's gamma: 8.78157	test's gamma: 8.86788
[150]	train's gamma: 8.35027	hsplit's gamma: 8.79986	test's gamma: 8.91633
[200]	train's gamma: 8.31865	hsplit's gamma: 8.82251	test's gamma: 8.93402
[250]	train's gamma: 8.29276	hsplit's gamma: 8.83342	test's gamma: 8.96648
[300]	train's gamma: 8.2702	hsplit's gamma: 8.84719	test's gamma: 8.9856
[350]	train's gamma: 8.2532	hsplit's gamma: 8.86033	test's gamma: 9.02807
[400]	train's gamma: 8.23536	hsplit's gamma: 8.87122	test's gamma: 9.05208
[450]	train's gamma: 8.22125	hsplit's gamma: 8.87913	test's gamma: 9.06393
[500]	train's gamma: 8.20966	hsplit's gamma: 8.88448	test's gamma: 9.07705

log_evaluation(period=50) with honest splits:

[50]	train's gamma: 219.634	hsplit's gamma: 172.385	test's gamma: 210.213
[100]	train's gamma: 23.926	hsplit's gamma: 19.7791	test's gamma: 23.0972
[150]	train's gamma: 9.61772	hsplit's gamma: 8.99064	test's gamma: 9.47934
[200]	train's gamma: 8.96602	hsplit's gamma: 8.60529	test's gamma: 8.86329
[250]	train's gamma: 8.94747	hsplit's gamma: 8.59725	test's gamma: 8.83827
[300]	train's gamma: 8.94775	hsplit's gamma: 8.59631	test's gamma: 8.83784
[350]	train's gamma: 8.94703	hsplit's gamma: 8.59619	test's gamma: 8.83751
[400]	train's gamma: 8.94677	hsplit's gamma: 8.59619	test's gamma: 8.83749
[450]	train's gamma: 8.947	hsplit's gamma: 8.59616	test's gamma: 8.83751
[500]	train's gamma: 8.94747	hsplit's gamma: 8.59613	test's gamma: 8.83755

Moreover, examining the partial dependence plots, we get "smoother" results:

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(bst, df_train, df_train.columns)
plt.show()

Without honest splits: [partial dependence plots]
With honest splits: [partial dependence plots]

Note: it seems like the name refit_tree_manual may be a bit misleading, as we are essentially overwriting the leaves and committing the changes to the training procedure (again, if this is already possible with the existing codebase, we will be happy to close the PR).

@borchero (Collaborator) left a comment

@jameslamb I'm generally in favor of adding this feature. It does not add a new complex algorithm to this package but rather enables implementing more complex algorithms (such as honest splits) on top of LightGBM.

To this end, we are only exposing a little more of LightGBM's internals here. However, I would even argue that we're not exposing any "implementation details", as the new "hook" to properly modify outputs after each boosting iteration is an "algorithmic detail", i.e. a "semantic hook" rather than something odd that LightGBM introduces (not sure if it is clear what I mean 😄)

@@ -4912,6 +4912,37 @@ def refit(
new_booster._network = self._network
return new_booster

def refit_tree_manual(self, tree_id: int, values: np.ndarray) -> "Booster":
A collaborator commented:

As a user it would not yet be clear to me how this function differs from set_leaf_output, i.e. why is this not just called set_leaf_outputs?

@neNasko1 (Contributor, Author) commented Sep 2, 2024

Do you propose changing the name of the function, or just writing better docs? I am a bit unsure; calling this function set_leaf_outputs might be strange, as it does more than just update the leaf values.
