
Project: prunr package #10

Open
camroach87 opened this issue Mar 29, 2017 · 4 comments

@camroach87

I'm interested in creating a package that automatically prunes large and unneeded components from objects, to help with caching performance. Details and motivation are in this gist:

https://gist.github.com/camroach87/12b658afdd9f2d051721ad21311a960a

Thoughts, suggestions, and package name improvements are all welcome. Also, if someone has already created something like this, please let me know, because I will use it all the time :)
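To see the problem concretely, a quick way to find the heavy components is to measure each element of a fitted model. This is just an illustration using base R; the exact sizes will vary by machine:

```r
set.seed(1)
n <- 1e5
d <- data.frame(x = rnorm(n), y = rnorm(n))
fit <- lm(y ~ x, data = d)

# Bytes used by each component of the lm object, largest first
sizes <- sort(
  sapply(fit, function(cmp) unclass(object.size(cmp))),
  decreasing = TRUE
)
head(sizes)
```

Components like `qr`, `model`, `residuals`, and `fitted.values` scale with the number of rows, which is what makes cached model objects so big.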

@MilesMcBain
Member

Yes. I've hit this and I know @jonocarroll has as well.

I note that lm, glm, and other ML packages like xgboost and ranger have options to turn off some of the weighty pieces. I think it's probably better to use those options where possible, to avoid a memory spike during fitting that means we never even get to the pruning stage.
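For example, lm's own arguments can suppress the stored copies of the model frame, design matrix, and response (glm has the same model/x/y arguments):

```r
set.seed(1)
n <- 1e5
d <- data.frame(x = rnorm(n), y = rnorm(n))

full <- lm(y ~ x, data = d)
lean <- lm(y ~ x, data = d, model = FALSE, x = FALSE, y = FALSE)

# The fits are identical; only the stored copies of the data differ
all.equal(coef(full), coef(lean))
print(object.size(full), units = "Mb")
print(object.size(lean), units = "Mb")
```

Note this only removes the data copies, not things like `residuals` or `qr`, which is presumably why the savings are smaller than one might hope.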

Maybe there is some kind of wrapper function that could enable all said options? I'm thinking along the lines of this package: https://github.com/rbertolusso/intubate

Bonus evil Hadley: https://twitter.com/hadleywickham/status/759412516539600896

@MilesMcBain
Member

Okay, so after reading your links I see the options are not as effective as one might hope. I still reckon that general architecture might be the right way to go, though. It could intercept the model object and set references to NULL as you suggest. We'd just have to somehow ensure garbage collection happens often enough.
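A minimal sketch of that interception idea. The function name and the default `keep` set are hypothetical, and dropping components like `qr` will break `predict()` and `summary()`, so a real keep list would need care per class:

```r
# Hypothetical wrapper: fit, immediately NULL out the heavy components,
# then call gc() so the freed memory is actually reclaimed.
fit_pruned <- function(formula, data,
                       keep = c("coefficients", "terms", "call", "rank")) {
  fit <- lm(formula, data = data, model = FALSE, x = FALSE, y = FALSE)
  fit[setdiff(names(fit), keep)] <- NULL  # drop everything not in `keep`
  gc()
  fit
}

small <- fit_pruned(y ~ x, data = data.frame(x = rnorm(100), y = rnorm(100)))
names(small)  # only the kept components remain
coef(small)   # still works
```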

@camroach87
Author

Ha, I like the Hadley comment :) and intubate looks interesting - going to look into it a bit more. Yeah, I've played around with some of the trim options for those packages and didn't have much luck :(

By general architecture do you mean that we have a function that first looks at the object type and then removes components based on the type? i.e., there will be a set of rules for an lm object, a different set of rules for an xgboost object, etc.? So we'll end up with a list of supported packages. I quite like this as well and was considering it - it's definitely more robust than the other approach.
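That per-class rule set maps naturally onto S3 dispatch. A sketch, where the method bodies are illustrative rather than a vetted list of safe-to-drop components:

```r
# One generic, one method per supported model class
prune <- function(object, ...) UseMethod("prune")

prune.lm <- function(object, ...) {
  # Stored copies of the data; droppable if you only need coefficients
  object$model <- NULL
  object$x <- NULL
  object$y <- NULL
  object
}

prune.default <- function(object, ...) {
  warning("No pruning rules for class: ",
          paste(class(object), collapse = "/"))
  object
}

fit <- lm(mpg ~ wt, data = mtcars)
pruned <- prune(fit)
is.null(pruned$model)  # the model frame is gone
```

Adding support for a new package then just means contributing a new `prune.<class>` method.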

@jonocarroll

I've been down this road a few times and (as the WinVector blog shows) there's lots of data that comes along for the ride and ends up in the final, very complex, model output object. I tend to find, though, that different analyses require different parts of that object, so I doubt that all of the components are redundant and can be removed across the board.

A few points I will note as I wait for my flight to BNE:

  • Actual copies of the data are, as @MilesMcBain has shown elsewhere, merely references to the original (thank you, R, and your copy-on-modify semantics) -- try creating a tibble full of "copies" of an object, models of that object, and extractions of the same, and you'll find they all point to the same memory address. For a model fit to large data, it's the residuals, qr, and other components that are *not* mere copies that make the model object large.

  • The best approach I've found, which also addresses the "naming things is hard" issue, is to do all of this within the tibble/purrr approach: data, model, and the thing I want (extracted from the model column within the tibble) as mapped columns, then drop the model column. Sure, it uses a lot of memory while generating it, but you'd need that memory to create the standard model object anyway. This relies on you knowing what you want to keep, but that's part and parcel of doing this either way.

  • The last alternative would be to create a new glm (or whatever model) function which creates a leaner output. This can be domain-specific since your prune would drop things anyway. This way the object never becomes larger than you want rather than getting too big then pruning back.
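The tibble/purrr workflow in the second point might look like this (a sketch assuming dplyr, purrr, and tibble are installed; the grouping and extraction are just examples):

```r
library(dplyr)
library(purrr)
library(tibble)

results <- tibble(cyl = unique(mtcars$cyl)) %>%
  mutate(
    data  = map(cyl, ~ filter(mtcars, cyl == .x)),
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    coefs = map(model, coef)      # the piece we actually want
  ) %>%
  select(-model, -data)           # drop the heavy columns

results
```

Once the heavy `model` and `data` columns are dropped, only the extracted pieces survive to be cached.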

Food for thought, but an interesting project for sure. I'll be keen to hear what comes of it.
