Excessive memory usage #123

Open
CameronBieganek opened this issue Jun 29, 2020 · 2 comments

@CameronBieganek

I have a data set of dimensions (87390, 243). Most of the columns are categorical variables that have been one-hot encoded. The size of the data set in memory is ~160 MB. I compared the memory usage for DecisionTree.jl and R's ranger package.

DecisionTree.jl

using CSV, DataFrames
using DecisionTree

df = CSV.read("rf_training_data.csv", DataFrame)

y = string.(df.y)
X = Matrix(df[:, 2:end])

n_subfeatures = 15
n_trees = 600

# Default values:
# partial_sampling = 0.7
# max_depth = -1
# min_samples_leaf = 1

rf = build_forest(y, X, n_subfeatures, n_trees)

Memory consumption:

julia> varinfo(r"rf")
  name      size summary                 
  –––– ––––––––– ––––––––––––––––––––––––
  rf   1.417 GiB Ensemble{Float64,String}
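
As an aside (not part of the original report), Base.summarysize reports the same recursively-summed byte count that varinfo relies on, so it can serve as a cross-check:

julia> Base.summarysize(rf) / 2^30    # total size in GiB; should agree with the varinfo figure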

ranger

library(readr)
library(ranger)

df <- read_csv('rf_training_data.csv')
df$y <- factor(df$y)

rf <- ranger(
    y ~ .,
    data = df,
    num.trees = 600,
    mtry = 15,
    min.node.size = 1,
    replace = FALSE,
    sample.fraction = 0.7
)

Memory consumption:

> print(object.size(rf), units = "MB")
585.2 Mb

Conclusion

Thus, it appears that DecisionTree.jl uses roughly 2.4x as much memory as ranger for this model (1.417 GiB vs. 585.2 MB). Is it possible to reduce the memory footprint of DecisionTree.jl? I can provide a scrubbed version of my data set if that helps.

@CameronBieganek changed the title from "Excessive memory usage?" to "Excessive memory usage" on Jun 29, 2020
@bensadeghi (Member) commented Jun 30, 2020

You could cast the features to a concrete element type (i.e., X = Int.(X)) rather than using Any, which is quite heavy. That should help a little bit.
Otherwise, we need a new implementation of the Leaf type (see #90), which requires a significant amount of work.
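
For illustration, here is a minimal, self-contained sketch of that cast, using a small made-up Matrix{Any} of one-hot features (the data, dimensions, and hyperparameters below are hypothetical, not from the report above):

using DecisionTree

# A Matrix{Any} stores a boxed pointer per entry, which is what makes it heavy.
X_any = Any[1 0 1; 0 1 0; 1 1 0; 0 0 1]
y = ["a", "b", "a", "b"]

eltype(X_any)    # Any

# Cast to a concrete element type so entries are stored inline.
X = Int.(X_any)
eltype(X)        # Int64

# Same build_forest call as in the original example, just on the concretely typed matrix.
rf = build_forest(y, X, 2, 10)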

@CameronBieganek (Author) commented
> You could cast the features to a concrete element type (i.e., X = Int.(X)) rather than using Any, which is quite heavy. That should help a little bit.

The features matrix in my example had typeof(X) == Array{Float64,2}, so I think I dodged that bullet.
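
For anyone hitting this issue with their own data, a quick way to verify that the feature matrix is concretely typed before training (a generic check, not part of the original exchange):

julia> eltype(X)
Float64

julia> isconcretetype(eltype(X))
true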
