---
title: "README"
author: "Eric Graves"
date: "July 10, 2017"
output: md_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rulefit)
data(titanic, package="binnr")
```

## Installation

```{r install, eval=FALSE}
devtools::install_git("https://GravesEE@gitlab.ins.risk.regn.net/minneapolis-r-packages/rulefit.git")
```

## Usage

### Creating a RuleFit Model

A RuleFit model uses a tree ensemble to generate its rules. As such, a tree
ensemble model must be provided to the `rulefit` function. This function returns
a RuleFit object which can be used to mine rules and train rule ensembles.

```{r constructor, warning=FALSE}

mod <- gbm.fit(titanic[-1], titanic$Survived, distribution="bernoulli",
  interaction.depth=3, shrinkage=0.1, verbose = FALSE)

rf <- rulefit(mod, n.trees=100)

print(rf)
```

The `rulefit` function wraps a gbm model in a class that manages rule
construction and model fitting. The rules are generated immediately, but the
model is not fit until the `train` function is called.

```{r rules}
head(rf$rules)
```

For ease of programming, *every* internal node is generated, even the root
node. That is why the first rule listed above is empty: root nodes are not
splits. This was a design decision and does not affect how the package is used
in practice.
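
To illustrate why the root node yields an empty rule, here is a minimal base-R sketch. It does not use the `rulefit` internals: the `tree` data frame, the `rule_for` helper, and the column names in the conditions are all hypothetical. A rule is built as the conjunction of the split conditions on the path from the root to a node, so the root, which has no path, produces an empty string.

```r
# Hypothetical toy tree (not the rulefit data structure): each node stores its
# parent and the split condition on the edge leading into it (NA for the root).
tree <- data.frame(
  node   = 1:5,
  parent = c(NA, 1, 1, 2, 2),
  cond   = c(NA, "Sex == 'female'", "Sex == 'male'", "Pclass <= 2", "Pclass > 2"),
  stringsAsFactors = FALSE
)

# Walk from a node up to the root, collecting conditions in root-to-node order.
rule_for <- function(node) {
  conds <- character(0)
  while (!is.na(node)) {
    cond <- tree$cond[tree$node == node]
    if (!is.na(cond)) conds <- c(cond, conds)
    node <- tree$parent[tree$node == node]
  }
  paste(conds, collapse = " & ")
}

rules <- vapply(tree$node, rule_for, character(1))

# The root contributes an empty rule; it can be filtered out if needed.
non_empty <- rules[nzchar(rules)]
```

Here `rules[1]` is the empty string for the root, while `rules[4]` combines both conditions on its path.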

### Training

Training a RuleFit model is as easy as calling the `train` method. The `train`
method uses the `cv.glmnet` function from the `glmnet` package and accepts all
of the same arguments.

##### Common Arguments

Argument | Purpose
---|---
x | Dataset of predictors. Should match what was used for training the ensemble.
y | Target variable to train against.
family | Distribution of the target. Use `"binomial"` for 0/1 variables.
alpha | Penalty mixing parameter. LASSO regression corresponds to `alpha = 1`, the `glmnet` default; `alpha = 0` gives ridge regression.
nfolds | Number of cross-validation folds to train the model with. Defaults to 5.
dfmax | Maximum number of variables allowed in the final model.
parallel | TRUE/FALSE to build the k-fold models in parallel. Requires a registered parallel backend.

```{r train}
fit <- train(rf, titanic[-1], y = titanic$Survived, family="binomial")
```
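
The effect of the `alpha` mixing parameter can be seen by calling `glmnet` directly. The sketch below runs on simulated data, not the Titanic data or a RuleFit object, and the penalty value `s = 0.05` is an arbitrary choice for illustration. With `alpha = 1` (LASSO) many coefficients are shrunk exactly to zero, while `alpha = 0` (ridge) keeps every variable in the model.

```r
library(glmnet)

# Simulated data: only the first two of 20 predictors carry signal.
set.seed(1)
x <- matrix(rnorm(200 * 20), nrow = 200)
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))

lasso <- glmnet(x, y, family = "binomial", alpha = 1)  # LASSO penalty
ridge <- glmnet(x, y, family = "binomial", alpha = 0)  # ridge penalty

# Count non-zero coefficients (excluding the intercept) at a given penalty.
n_nonzero <- function(fit, s) {
  beta <- as.numeric(as.matrix(coef(fit, s = s)))[-1]
  sum(beta != 0)
}

n_lasso <- n_nonzero(lasso, s = 0.05)
n_ridge <- n_nonzero(ridge, s = 0.05)
```

Because a RuleFit model typically has many more candidate rules than useful ones, the sparsity induced by the LASSO penalty is what keeps the final rule set small.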

### Bagging

Training the model on repeated, random samples with replacement can generate
better parameter estimates. This is known as bagging.

```{r, bagging}
library(doSNOW)

cl <- makeCluster(3)
registerDoSNOW(cl)

fit <- train(rf, titanic[-1], y = titanic$Survived, bag = 20, parallel = TRUE, 
  family="binomial")

stopCluster(cl)
```
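
The general idea of bagging can be sketched in base R. This is an illustration only: it uses `lm` on the built-in `mtcars` data, and averaging coefficients is an assumption made for the example. How `train` combines its bagged fits internally is not described here.

```r
set.seed(42)
n_bags <- 20

# Fit the same model on n_bags bootstrap resamples (rows drawn with replacement).
coefs <- replicate(n_bags, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
})

# Each column of coefs holds one resample's coefficients; average across them.
bagged <- rowMeans(coefs)

# The spread across resamples shows how stable each estimate is.
spread <- apply(coefs, 1, sd)
```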

### Predicting

Once a RuleFit model is trained, predictions can be produced by calling the
`predict` method. As with the `train` function, `predict` also takes the
arguments accepted by `predict.cv.glmnet`, the most important of which is the
lambda parameter, `s`. The default is `s="lambda.min"`, which minimizes the
out-of-fold error.

Both a score and a sparse matrix of rules can be predicted.

```{r, predict-basics, warning=FALSE}
p_rf <- predict(fit, newx = titanic[-1], s="lambda.1se")

head(p_rf)
```

The out-of-fold predictions can also be extracted if the model was trained with
`keep=TRUE`. Again, this uses the `cv.glmnet` API directly; there is nothing
magical going on here:

```{r, oof}
p_val <- fit$fit$fit.preval[,match(fit$fit$lambda.1se, fit$fit$lambda)]
```
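
For readers unfamiliar with the underlying API, the following sketch calls `cv.glmnet` directly on simulated data (not a RuleFit object) to show where `lambda.min`, `lambda.1se`, and `fit.preval` come from. The data and object names are made up for the example.

```r
library(glmnet)

# Simulated binary outcome with 10 predictors.
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))

# keep = TRUE stores the out-of-fold (prevalidated) fits in fit.preval.
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 5, keep = TRUE)

# lambda.1se is the largest lambda within one standard error of the best
# cross-validated error, so it is never smaller than lambda.min.
lambdas <- c(min = cvfit$lambda.min, one_se = cvfit$lambda.1se)

# In-sample predictions at the chosen lambda.
p_in <- predict(cvfit, newx = x, s = "lambda.1se")

# Out-of-fold predictions: pick the fit.preval column for the same lambda.
idx <- match(cvfit$lambda.1se, cvfit$lambda)
p_oof <- cvfit$fit.preval[, idx]
```

Each row of `fit.preval` was predicted by a model that did not see that row during fitting, which is why these values are suited to a validation-style comparison.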

#### Comparing RuleFit Development and Validation Performance to GBM

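```{r predict, warning=FALSE}
# GBM predictions at the number of trees selected by gbm.perf
p_gbm <- predict(mod, titanic[-1], n.trees = gbm.perf(mod, plot.it = FALSE))

# ROC curves for in-sample RuleFit, out-of-fold RuleFit, and GBM scores
roc_rf <- pROC::roc(titanic$Survived, -p_rf)
roc_val <- pROC::roc(titanic$Survived, -p_val)
roc_gbm <- pROC::roc(titanic$Survived, -p_gbm)

# Overlay the three curves on a single set of axes
plot(roc_rf)
plot(roc_val, add = TRUE, col = "blue")
plot(roc_gbm, add = TRUE, col = "red")
```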

### Rule Summary

RuleFit also provides a summary method to inspect and measure the coverage of
fitted rules.

```{r, summary}

fit_summary <- summary(fit, s="lambda.1se", dedup=TRUE)
head(fit_summary)

```
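
As a rough illustration of what rule coverage means, the sketch below assumes coverage is the share of observations that satisfy a rule. The exact columns returned by `summary` are not shown here, and the `passengers` data frame and the rule itself are made up for the example.

```r
# Hypothetical data: five passengers with two predictors.
passengers <- data.frame(
  Sex    = c("female", "male", "female", "male", "female"),
  Pclass = c(1, 3, 3, 2, 2)
)

# A rule is a logical condition; evaluating it flags the rows it covers.
covered <- with(passengers, Sex == "female" & Pclass <= 2)

# Coverage: fraction of rows for which the rule fires.
coverage <- mean(covered)
```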

### Variable Importance

As with other tree ensemble techniques, variable importance can be calculated.
This is different from the **rule** importance. Variable importance corresponds
to the input variables used to generate the rules.

```{r, importance}
imp <- importance(fit, titanic[-1], s="lambda.1se")
plot(imp)
```