No split found when target/output of RandomForestRegressor is very low. #191

Open

Hugemiler opened this issue Sep 28, 2022 · 3 comments

Hugemiler commented Sep 28, 2022

Hello! I'm running into an issue when I use the RandomForestRegressor on a dataset with very low output values (Y), in the range of 1e-3, 1e-4 and lower. When the values fall into that range, each DecisionTree consists of nothing but a single leaf, which always predicts a constant value: the mean of the desired output. Curiously, simply multiplying the output vector by 100 or 1000 seems to solve the problem, and valid splits can then be found.

I'll try to provide a shareable MWE in the next few days. It might take a while, because the feature set is large and the data is protected, and I will have to do some masking. However, going through the source code, I'm wondering if the culprit is not

|| tsum * node.label > -1e-7 * wsum + tssq)

because the cases where the issue happens coincide exactly with the cases where sum(y) * mean(y) > -1e-7 * length(y) + sum(y .^ 2) evaluates to true. Needless to say, when I multiply everything by a small power of 10, this expression becomes false, and valid splits can then be found.
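Just to illustrate with made-up numbers (not my data), evaluating that expression directly should show how it flips with the scale of the targets:

```julia
using Statistics: mean

y = 1e-4 .* randn(1000)   # made-up targets in the problematic range

# hypothetical helper, just wrapping the condition quoted above
becomes_leaf(y) = sum(y) * mean(y) > -1e-7 * length(y) + sum(y .^ 2)

@show becomes_leaf(y)          # expected: true  -> no split is attempted
@show becomes_leaf(1000 .* y)  # expected: false -> valid splits can be found
```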

Now, it is my understanding that, to a certain extent, decision trees should be insensitive to any kind of normalization. I expect valid splits to be found even when the outputs are very, very low. Also, sweeping through Breiman's original C code (available here), I have not found a test equivalent to the specific line I tagged (but that could be down to my poor C reading skills?).

Can you please clarify the purpose of this line, confirm whether this is intended behavior, and see what can be done to mitigate the issue? (I know the lack of an MWE makes things harder; however, I just wanted to leave the issue registered for now.)
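In the meantime, here is a synthetic sketch of the kind of setup I mean (random data, not the protected dataset), using the native DecisionTree.jl API; I would expect the first tree to collapse to a single leaf and the rescaled copy to split normally:

```julia
using DecisionTree, Random

Random.seed!(1)
X = randn(500, 5)                           # synthetic features
y = 1e-4 .* (X[:, 1] .+ 0.1 .* randn(500))  # targets in the ~1e-4 range

tree_small  = build_tree(y, X)              # regression tree on the tiny targets
tree_scaled = build_tree(1000 .* y, X)      # same targets multiplied by 1000

@show depth(tree_small)    # expected: 0 (a single leaf predicting mean(y))
@show depth(tree_scaled)   # expected: > 0 (valid splits are found)
```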

ablaom commented Sep 28, 2022

@bensadeghi Perhaps you care to comment on this one?

ablaom commented Oct 4, 2022

The test in question is equivalent to node_impurity/wsum > 1e-7 and it's a little odd that it isn't written that way, because the impurity is already computed. Anyway...
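For the record, here is the rearrangement I have in mind, sketched in my own notation rather than copied from the source (and assuming unit sample weights, so that node.label == tsum / wsum):

```julia
# The stopping test as written:
#     tsum * node.label > -1e-7 * wsum + tssq
# with node.label == tsum / wsum rearranges to
#     (tssq - tsum^2 / wsum) / wsum < 1e-7
# and tssq - tsum^2 / wsum is the (weighted) sum of squared deviations about the
# mean, i.e. the node impurity, so the node is forced to a leaf whenever the
# per-sample impurity of the targets drops below the hard-coded 1e-7.
y     = 1e-4 .* randn(1000)   # illustrative small-scale targets, unit weights
wsum  = float(length(y))
tsum  = sum(y)
tssq  = sum(abs2, y)
label = tsum / wsum

@show tsum * label > -1e-7 * wsum + tssq     # the test as written
@show (tssq - tsum^2 / wsum) / wsum < 1e-7   # equivalent "variance below 1e-7" form
```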

I suppose it makes sense not to split if the impurity is negligible, but as impurity itself is not scale invariant, the test, as written, isn't either. This mightn't be such an issue but for the fact that the (absolute) threshold cannot be changed by the user (it is not exposed as a hyperparameter). In this sense I agree the current implementation is "incorrect".

The user does have control over min_purity_increase, but this test comes later, after optimising the feature-threshold split point (so more compute time): if there isn't a sufficient improvement in purity, the split candidate is abandoned. This later test would eventually kick in if the other were removed.

Some options to proceed:

  1. Drop the minimum impurity test
  2. Compute a scale from the training target (the square of its mean value, say) and replace the absolute test with a relative one using the pre-computed scale (a rough sketch follows this list).
  3. Expose a new user-specified parameter minimum_impurity_split to use instead of 1e-7 in the existing test.
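To make 2 a little more concrete, a minimal sketch (my names, not a concrete API proposal) could look like this:

```julia
using Statistics: mean

# One possible scale, as suggested above: the square of the mean target value.
# (Any pre-computed, target-derived quantity with the units of y^2 would do.)
purity_scale(y) = mean(y)^2

# The absolute test `impurity / wsum < 1e-7` would then become relative:
#     impurity / wsum < 1e-7 * purity_scale(y)
# Both sides scale by the same factor when y is rescaled, so the behaviour
# reported above would no longer depend on the magnitude of the targets.
impurity_per_sample(y) = sum(abs2, y .- mean(y)) / length(y)

y = 1e-4 .* randn(1000)
@show impurity_per_sample(y) < 1e-7                    # current absolute test
@show impurity_per_sample(y) < 1e-7 * purity_scale(y)  # relative alternative
```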

My vote would be for 1.

It would be good to see what happens in the C library that scikit-learn uses. I expect the Breiman code is less optimised. The file linked by @Hugemiler above is a port from there. Perhaps @Eight1911, who I now see carried out the port, would care to comment.

Hugemiler commented Oct 6, 2022

I second the thought that 1 might be the better option. I also understand that when @ablaom says "The user does have control over min_purity_increase, but this test comes later, after optimising the feature-threshold split point (so more compute time)", he might be referring to

|| best_purity - tsum * node.label < min_purity_increase * wsum)

and I understand that this does cost more compute time.

However, right now, I cannot determine the extent of the impact of this increase in compute time.

My understanding is that such an increase will happen only on the very last attempt to compute a split at a given node: the case where, right now, the first test dismisses the split attempt (saving us the computation of a split trial), but where, in the future, a split trial would have to be computed and reach the second min_purity_increase test before it could be dismissed.

The impact of that computation depends, as I see it, on (A) the generally expected depth of the tree; (B) the combination of min_samples_leaf and the number of samples you have; and (C) the number of splits that, as things stand, pass the first test and fail the second. Especially (C), in my opinion. How often does a node become a leaf because of the first test, compared to the second? Intuitively (by all means, just an intuition), for the vast majority of models the dismissal of a split probably happens because the increase in best purity is too small, whether on line 202 or on line 159

if purity > best_purity && !isapprox(purity, best_purity)

If someone feels like pursuing what @ablaom suggested in 2 or 3, I would say that maybe it's possible to get the best of both worlds... expose an optional argument that defaults to computing a scale when nothing is passed? However, this is just an idea, and I'm already much more supportive of 1.
