Small error in the H-L-C calculation #1

Open
prockenschaub opened this issue Apr 29, 2020 · 0 comments
Thanks for your paper and providing this code, very useful!

I just quickly wanted to flag a small error in the H-L-C calculation when the dataset is split into equal chunks. If the dataset size isn't a multiple of g, the nrow(data) %% g rows with the highest predicted scores are grouped with the lowest decile group, because rep() starts recycling the group indices from 1 once the full chunks are used up. The issue is in the following line:

split_mtx = split(mtx, rep(1:ceiling(nr/n), each=n, length.out=nr))

Minimal example

g = 3
y = rep(c(0, 1), each = 5)
prob = seq(0, 1, length.out = 10)

mtx = cbind(y, y_not = 1- y, prob, prob_not = 1-prob)
mtx = as.data.frame(mtx)
mtx = mtx[order(mtx$prob),]
n <- length(prob)/g
nr <- nrow(mtx)

split(mtx, rep(1:ceiling(nr/n), each=n, length.out=nr))
#> $`1`
#>    y y_not      prob  prob_not
#> 1  0     1 0.0000000 1.0000000
#> 2  0     1 0.1111111 0.8888889
#> 3  0     1 0.2222222 0.7777778
#> 10 1     0 1.0000000 0.0000000    <-- wrongly put in lowest class
#> 
#> $`2`
#>   y y_not      prob  prob_not
#> 4 0     1 0.3333333 0.6666667
#> 5 0     1 0.4444444 0.5555556
#> 6 1     0 0.5555556 0.4444444
#> 
#> $`3`
#>   y y_not      prob  prob_not
#> 7 1     0 0.6666667 0.3333333
#> 8 1     0 0.7777778 0.2222222
#> 9 1     0 0.8888889 0.1111111
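
Looking at the grouping vector on its own may make the wraparound easier to see. This is only a sketch using the same numbers as the minimal example (g = 3, 10 rows); it assumes, as the output above suggests, that rep() truncates the non-integer `each` value to 3 and then recycles to reach `length.out`.

```r
# Same quantities as in the minimal example
g  <- 3
nr <- 10
n  <- nr / g   # 3.33..., not an integer

# Group index for each row, in order of increasing predicted score
idx <- rep(1:ceiling(nr / n), each = n, length.out = nr)
print(idx)
# The 10th (highest-scoring) row receives index 1 again
```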

Easy fix

An easy fix could be to use dplyr::ntile instead of rep.
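
As a rough illustration of what a fix might look like, here is a base R sketch that avoids the dplyr dependency. The rank-based formula `ceiling(seq_len(nr) * g / nr)` is my own stand-in and not code from the repository; `dplyr::ntile(mtx$prob, g)` should serve the same purpose, though the group sizes may be distributed differently.

```r
# Rebuild the data from the minimal example
g    <- 3
y    <- rep(c(0, 1), each = 5)
prob <- seq(0, 1, length.out = 10)
mtx  <- data.frame(y, y_not = 1 - y, prob, prob_not = 1 - prob)
mtx  <- mtx[order(mtx$prob), ]
nr   <- nrow(mtx)

# Assign each sorted row to one of g groups by its rank.
# The indices never decrease, so the highest scores cannot
# wrap around into group 1.
# (Illustrative stand-in; dplyr::ntile(mtx$prob, g) is the suggested variant.)
grp <- ceiling(seq_len(nr) * g / nr)
split_mtx <- split(mtx, grp)

# Group sizes: the leftover row now lands in the highest group
print(sapply(split_mtx, nrow))
```

With these inputs the groups have 3, 3 and 4 rows, and the row with prob = 1 ends up in group 3.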
