
Bug #15707

Closed
profPlum opened this issue Aug 21, 2023 · 1 comment
profPlum commented Aug 21, 2023

H2O version, Operating System and Environment
H2O: v3.42.0.1, Mac OS w/ M1 chip, R v4.3.1, no containers.

Actual behavior
h2o.r2() differs substantially from a manually computed R^2 (when used on an AutoML model).
e.g. for iris: R^2 reported by H2O: 0.9868298; R^2 computed manually: -1.779279

Expected behavior
They should match, of course.

Steps to reproduce
See my Stack Overflow question.

Screenshots
N/A, use code to reproduce.

Additional context
I used nfolds=0 to ensure that there was no cross-validation and no train/test split.
Then I reproduced R^2 manually using the entire (iris) dataset.
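The exact script is in the linked SO question; a minimal sketch of this kind of check might look like the following. This is a hypothetical illustration, not the original repro: it assumes a local H2O cluster, a `Petal.Width` response column, and arbitrary `max_models`/`seed` values.

```r
# Hypothetical repro sketch (the actual script is in the linked SO question).
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)
# nfolds = 0 disables cross-validation, as described above
aml <- h2o.automl(y = "Petal.Width", training_frame = iris_h2o,
                  nfolds = 0, max_models = 5, seed = 1)
leader <- aml@leader

h2o.r2(leader)                         # R^2 reported by H2O
pred <- h2o.predict(leader, iris_h2o)  # predictions for a manual R^2 check
```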

@tomasfryda
Copy link
Contributor

@profPlum There's a bug in the manual $R^2$ computation - only the first value of the R numeric vector gets broadcast across the h2o column by the operation.

To see what's happening, I tried the following:

> Y_true <- Y[,1]
> Y_pred <- Y_pred[,1]
> (as.numeric(Y_true)-as.numeric(Y_pred))
       predict
1 -0.085572158
2 -0.090655887
3  0.090492763
4  0.127857837
5  0.005002167
6 -0.335313060

[150 rows x 1 column] 
> (as.numeric(Y_true)-as.numeric(Y_pred))+as.numeric(Y_pred) 
  predict
1     1.4
2     1.4
3     1.4
4     1.4
5     1.4
6     1.4

[150 rows x 1 column] 
> Y_true
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4
 [39] 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4
 [77] 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0
[115] 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1

You can either convert everything to plain R vectors:

R2 = function(Y_pred, Y_true) {
-  MSE = mean((as.numeric(Y_true)-as.numeric(Y_pred))**2)
+  MSE = mean((as.numeric(as.vector(Y_true))-as.numeric(as.vector(Y_pred)))**2)
  R2 = 1-MSE/var(Y_true)
  return(R2)
}
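As a sanity check on plain R vectors (no H2O involved), the corrected function can be compared against the R^2 that `lm` reports. The model formula below is just an illustration, not the AutoML model from the issue. Note one subtlety: `var()` divides by n-1 while `mean()` divides by n, so this manual R^2 matches 1 - SS_res/SS_tot only up to a factor of (n-1)/n on the residual term.

```r
# Corrected manual R^2 (as above), applied to plain R vectors.
R2 <- function(Y_pred, Y_true) {
  MSE <- mean((as.numeric(as.vector(Y_true)) - as.numeric(as.vector(Y_pred)))**2)
  1 - MSE / var(Y_true)
}

# Illustrative linear model on iris (not the AutoML leader from the issue).
fit <- lm(Petal.Width ~ Sepal.Length + Sepal.Width, data = iris)
r2_manual <- R2(fitted(fit), iris$Petal.Width)
r2_lm     <- summary(fit)$r.squared

# The two agree closely; the small gap comes from var()'s n-1 denominator.
abs(r2_manual - r2_lm) < 0.01
```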

Or convert it to H2O:

R2 = function(Y_pred, Y_true) {
-  MSE = mean((as.numeric(Y_true)-as.numeric(Y_pred))**2)
+  MSE = mean((as.numeric(as.h2o(Y_true))-as.numeric(as.h2o(Y_pred)))**2)
  R2 = 1-MSE/var(Y_true)
  return(R2)
}

NOTE: Even with this fix you might notice that the manually calculated $R^2$ and H2O's $R^2$ still differ slightly. I believe this is caused by the train/test split that AutoML uses when nfolds=0 (e.g. for early stopping). (Difference between the manual R^2 and H2O's reported R^2: 0.0005075975.)

NOTE 2: The as.numeric is probably unnecessary, but I kept it there in case your data needs it.
