-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
152 lines (108 loc) · 4.07 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
---
title: "README"
author: "Eric Graves"
date: "July 10, 2017"
output: md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rulefit)
data(titanic, package="binnr")
```
## Installation
```{r install, eval=FALSE}
devtools::install_git("https://[email protected]/minneapolis-r-packages/rulefit.git")
```
## Usage
### Creating a RuleFit Model
A RuleFit model uses a tree ensemble to generate its rules. As such, a tree
ensemble model must be provided to the rulefit function. This funciton returns
a RuleFit object which can be used to mine rules and train rule ensembles.
```{r constructor, warning=FALSE}
mod <- gbm.fit(titanic[-1], titanic$Survived, distribution="bernoulli",
interaction.depth=3, shrinkage=0.1, verbose = FALSE)
rf <- rulefit(mod, n.trees=100)
print(rf)
```
the `rulefit` function wraps a gbm model in a class that manages rule
construction and model fitting. The rules are generated immediately but the
model is not fit until the `train` function is called.
```{r rules}
head(rf$rules)
```
For ease of programming *every* internal node is generated -- even the root
node. That is why the first rule listed above is empty. Root nodes are not
splits. This was a design decision and does not affect how the package is used
in practice.
### Training
Training a RuleFit model is as easy as calling the train method. The train
method uses the `cv.glmnet` function from the `glmnet` package and accepts all
of the same arguments.
##### Common Arguments
Argument | Purpose
---|---
x | Dataset of predictors that should match what was used for training the ensemble.
y | Target variable to train against.
family | What is the distribution of the target? Binomial for 0/1 variables.
alpha | Penatly mixing parameter. LASSO regression uses the default of 0.
nfolds | How many k-folds to train the model with. Defaults to 5.
dfmax | How many variables should the final model have?
parallel | TRUE/FALSE to build kfold models in parallel. Requires a backend.
```{r train}
fit <- train(rf, titanic[-1], y = titanic$Survived, family="binomial")
```
### Bagging
Training the model on repeated, random samples with replacement can generate
better parameter estimates. This is known as bagging.
```{r, bagging}
library(doSNOW)
cl <- makeCluster(3)
registerDoSNOW(cl)
fit <- train(rf, titanic[-1], y = titanic$Survived, bag = 20, parallel = TRUE,
family="binomial")
stopCluster(cl)
```
### Predicting
Once a RuleFit model is trained. Predictions can be produced by calling the
predict method. As with the train function, `predict` also takes arguments
accepted by `predict.cv.glmnet`. The most important of which is the lambda
parameter, `s`. The default is to use `s="lambda.min"` which minimizes the
out-of-fold error.
Both a score as well as a sparse matrix of rules can be predicted.
```{r, predict-basics, warning=FALSE}
p_rf <- predict(fit, newx = titanic[-1], s="lambda.1se")
head(p_rf)
```
The out-of-fold predictions can also be extracted if the model was trained with
`keep=TRUE`. Again, this is working with the `cv.glmnet` API. There is nothing
magical going on here:
```{r, oof}
p_val <- fit$fit$fit.preval[,match(fit$fit$lambda.1se, fit$fit$lambda)]
```
#### Comparing RuleFit dev & val to GBM
```{r predict, warning=FALSE}
p_gbm <- predict(mod, titanic[-1], n.trees = gbm.perf(mod, plot.it = F))
roc_rf <- pROC::roc(titanic$Survived, -p_rf)
roc_val <- pROC::roc(titanic$Survived, -p_val)
roc_gbm <- pROC::roc(titanic$Survived, -p_gbm)
plot(roc_rf)
par(new=TRUE)
plot(roc_val, col="blue")
par(new=TRUE)
plot(roc_gbm, col="red")
```
### Rule Summary
RuleFit also provides a summary method to inspect and measure the coverage of
fitted rules.
```{r, summary}
fit_summary <- summary(fit, s="lambda.1se", dedup=TRUE)
head(fit_summary)
```
### Variable Importance
Like other tree ensemble techniques, variable importance can be calculated. This
is different than the **rule** importance. Variable importance corresponds to
the input variables used to generate the rules.
```{r, importance}
imp <- importance(fit, titanic[-1], s="lambda.1se")
plot(imp)
```