-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No split found when target/output of RandomForestRegressor is very low. #191
Comments
@bensadeghi Perhaps you care to comment on this one? |
The test in question is equivalent to I suppose it makes sense not to split if the impurity is negligible, but as impurity itself is not scale invariant, the test, as written isn't either. This mightn't be such an issue but for the fact that the (absolute) threshold cannot be changed by the user (it is not exposed as hyperparameter). In this sense I agree the current implementation is "incorrect". The user does have control over Some options to proceed:
My vote would be for 1. Would be good to see what happens in the C library that sckitlearn uses. I expect the Breiman code is less optimised. The file linked by @Hugemiler above is a port from there. Perhaps @Eight1911, who I now see carried out the port, would care to comment. |
I second the thought that 1 might be the better option. I also understand that when @ablaom says "The user does have control over min_purity_increase, but this test comes later after optimising the feature threshold split point (so more compute time)", he might be referring to DecisionTree.jl/src/regression/tree.jl Line 202 in 9dab9c1
and I understand that this does cause more compute time. However, right now, I cannot determine the length of the impact of this increase in computer time. My understanding is: such increase will happen only on the very last attempt to compute a split at a given node, the case where (right now) the first test dismisses the split attempt - saving us from the computation of a split trial -, where (in the future) a split trial will have to be computed, and reach the second The impact of such computation depends, as I see, on (A) the generally expected depth of the tree; (B) the combination of DecisionTree.jl/src/regression/tree.jl Line 159 in 9dab9c1
If someone feels like pursuing what @ablaom suggested in 2 or 3, I would say that maybe its possible to get the best of both worlds... expose an optional argument which defaults to computing a scale when nothing is passed? However this is just an idea, and I'm already way more supportive of 1. |
Hello! I'm running into an issue when I use the
RandomForestRegressor
on a dataset with very low (in the range of 1e-3, 1e-4 and lower) values as outputs (Y
). When the values fall into that range, theDecisionTree
s are made up entirely of a single layer ofleaf
, which predict always a constant value, the mean of the desired output. Curiously, simply multiplying the output vector for100
or1000
seems to solve the problem, and valid splits can then be found.I'll try to provide a shareable MWE in the next few days. It might take a while, because the feature set is large and the data is protected, and I will have to do some masking. However, going through the source code, I'm wondering if the cuplrit is not
DecisionTree.jl/src/regression/tree.jl
Line 90 in 9dab9c1
because the cases where the issue happens coincide with the true values of
sum(y) * mean(y) > -1e-7 * length(y) + sum(y .^ 2)
. Needless to say, when I multiply everything for a small power of 10, the value of this expression becomes false, and valid splits can then be found.Now, it is my understanding that, to a certain extent, Decision Trees should be irrespective of any kind of normalization. I expect that valid splits can be found even when the outputs are very, very low. Also, sweeping through Breiman's original C code, - avaiable here - I have not found an equivalent test to this specific line I tagged (but that could be my poor C reading skills?).
Can you please clarify the purpose of such line, confirm whether this is intended behavior, and see what can be done to mitigate this issue? (I know, the lack of a MWE makes things harder, however, I just wanted to leave the issue registered for now)
The text was updated successfully, but these errors were encountered: