Commit 1dd48e0

Merge pull request #195 from badge/docs

Added Sphinx-based documentation, updated requirements.txt

dswah authored Sep 20, 2018
2 parents 1391150 + 3ccdc86
Showing 42 changed files with 2,760 additions and 588 deletions.
9 changes: 8 additions & 1 deletion .gitignore
@@ -1,7 +1,6 @@
# Compiled source #
###################
*.pyc
*.ipynb

# Packages #
############
@@ -36,3 +35,11 @@ Thumbs.db
dist
.cache

# Documentation
###############
_build/
.ipynb_checkpoints/

# PyCharm
#########
.idea/
355 changes: 1 addition & 354 deletions README.md
@@ -11,7 +11,7 @@ Generalized Additive Models in Python.

<img src=imgs/pygam_tensor.png>

## Documentation
[pyGAM: Getting started with Generalized Additive Models in Python](https://medium.com/@jpoberhauser/pygam-getting-started-with-generalized-additive-models-in-python-457df5b4705f)

## Installation
@@ -72,359 +72,6 @@ GAMs extend generalized linear models by allowing non-linear functions of features

The result is a very flexible model, where it is easy to incorporate prior knowledge and control overfitting.


## Regression
For **regression** problems, we can use a **linear GAM** which models:

![alt tag](http://latex.codecogs.com/svg.latex?\mathbb{E}[y|X]=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

gam = LinearGAM(s(0) + s(1) + f(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['year', 'age', 'education']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])
```
<img src=imgs/pygam_wage_data_linear.png>

Even though we allow **n_splines=20** per numerical feature, the **smoothing penalty** reduces the model to roughly 19 **effective degrees of freedom**:

```
gam.summary()
LinearGAM
=============================================== ==========================================================
Distribution: NormalDist Effective DoF: 19.2602
Link Function: IdentityLink Log Likelihood: -24116.7451
Number of Samples: 3000 AIC: 48274.0107
AICc: 48274.2999
GCV: 1250.3656
Scale: 1235.9245
Pseudo R-Squared: 0.2945
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
s(0) [15.8489] 20 6.9 5.52e-03 **
s(1) [15.8489] 20 8.5 1.11e-16 ***
f(2) [15.8489] 5 3.8 1.11e-16 ***
intercept 0 1 0.0 1.11e-16 ***
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
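If we want to control this trade-off directly, the spline count and smoothing strength can be set per term; a minimal sketch, reusing `X, y` from above (the `lam` grid values are illustrative):

```python
import numpy as np
from pygam import LinearGAM, s, f

# a sketch: fix n_splines per term and search lam over a log-spaced grid
gam = LinearGAM(s(0, n_splines=20) + s(1, n_splines=20) + f(2))
gam.gridsearch(X, y, lam=np.logspace(-3, 3, 11))
```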


With **LinearGAMs**, we can also check the **prediction intervals**:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)

gam = LinearGAM().gridsearch(X, y)
XX = gam.generate_X_grid(term=0)

plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')

plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.title('95% prediction interval')
```
<img src=imgs/pygam_mcycle_data_linear.png>

And simulate from the posterior:

```python
# continuing last example with the mcycle dataset
for response in gam.sample(X, y, quantity='y', n_draws=50, sample_at_X=XX):
    plt.scatter(XX, response, alpha=.03, color='k')
plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')
plt.title('draw samples from the posterior of the coefficients')
```

<img src=imgs/pygam_mcycle_data_linear_sample_from_posterior.png>

## Classification
For **binary classification** problems, we can use a **logistic GAM** which models:

![alt tag](http://latex.codecogs.com/svg.latex?log\left(\frac{P(y=1|X)}{P(y=0|X)}\right)=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LogisticGAM, s, f
from pygam.datasets import default

X, y = default(return_X_y=True)

gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['student', 'balance', 'income']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])

# and check the accuracy
gam.accuracy(X, y)
```
<img src=imgs/pygam_default_data_logistic.png>

Since the **scale** of the **Binomial distribution** is known, our gridsearch minimizes the **Unbiased Risk Estimator** (UBRE) objective:

```
gam.summary()
LogisticGAM
=============================================== ==========================================================
Distribution: BinomialDist Effective DoF: 3.8047
Link Function: LogitLink Log Likelihood: -788.877
Number of Samples: 10000 AIC: 1585.3634
AICc: 1585.369
UBRE: 2.1588
Scale: 1.0
Pseudo R-Squared: 0.4598
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
f(0) [1000.] 2 1.7 4.61e-03 **
s(1) [1000.] 20 1.2 0.00e+00 ***
s(2) [1000.] 20 0.8 3.29e-02 *
intercept 0 1 0.0 0.00e+00 ***
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```


## Poisson and Histogram Smoothing
We can perform **histogram smoothing** intuitively by modeling the counts in each bin
as Poisson-distributed, via **PoissonGAM**.

```python
import matplotlib.pyplot as plt

from pygam import PoissonGAM
from pygam.datasets import faithful

X, y = faithful(return_X_y=True)

gam = PoissonGAM().gridsearch(X, y)

plt.hist(faithful(return_X_y=False)['eruptions'], bins=200, color='k');
plt.plot(X, gam.predict(X), color='r')
plt.title('Best Lambda: {0:.2f}'.format(gam.lam[0][0]));
```
<img src=imgs/pygam_poisson.png>
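If the data arrive as raw observations rather than pre-binned counts, the binning step is straightforward; a minimal sketch (the sample data and bin count are illustrative):

```python
import numpy as np
from pygam import PoissonGAM

# a sketch: bin raw observations, then smooth the bin counts as Poisson
raw = np.random.randn(1000)                  # illustrative raw samples
counts, edges = np.histogram(raw, bins=50)   # histogram the data
centers = (edges[:-1] + edges[1:]) / 2       # bin midpoints as the feature
gam = PoissonGAM().gridsearch(centers.reshape(-1, 1), counts)
```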

## Terms and Interactions

pyGAM can also fit interactions using tensor products via `te()`
```python
from pygam import PoissonGAM, s, te
from pygam.datasets import chicago

X, y = chicago(return_X_y=True)

gam = PoissonGAM(s(0, n_splines=200) + te(3, 1) + s(2)).fit(X, y)
```

and plot a 3D surface:

```python
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

# the tensor term te(3, 1) is the second term in the model, index 1
XX = gam.generate_X_grid(term=1, meshgrid=True)
Z = gam.partial_dependence(term=1, X=XX, meshgrid=True)

ax = plt.axes(projection='3d')
ax.plot_surface(XX[0], XX[1], Z, cmap='viridis')
```

<img src=imgs/pygam_chicago_tensor.png>

For simple interactions it is sometimes useful to add a by-variable to a term

```python
from pygam import LinearGAM, s
from pygam.datasets import toy_interaction

X, y = toy_interaction(return_X_y=True)

gam = LinearGAM(s(0, by=1)).fit(X, y)
gam.summary()
```

#### Available Terms
- `l()` linear terms
- `s()` spline terms
- `f()` factor terms
- `te()` tensor products
- `intercept`
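Terms are combined with `+`; a minimal sketch (the feature indices are illustrative, assuming a four-column `X`):

```python
from pygam import LinearGAM, l, s, f, te

# a sketch: one of each term type; an intercept is included by default
gam = LinearGAM(l(0) + s(1) + f(2) + te(0, 1))
```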

## Custom Models
It's also easy to build custom models, by using the base **GAM** class and specifying the **distribution** and the **link function**.

```python
import matplotlib.pyplot as plt

from pygam import GAM
from pygam.datasets import trees

X, y = trees(return_X_y=True)

gam = GAM(distribution='gamma', link='log')
gam.gridsearch(X, y)

plt.scatter(y, gam.predict(X))
plt.xlabel('true volume')
plt.ylabel('predicted volume')
```
<img src=imgs/pygam_custom.png>

We can check the quality of the fit by looking at the `Pseudo R-Squared`:

```
gam.summary()
GAM
=============================================== ==========================================================
Distribution: GammaDist Effective DoF: 25.3616
Link Function: LogLink Log Likelihood: -26.1673
Number of Samples: 31 AIC: 105.0579
AICc: 501.5549
GCV: 0.0088
Scale: 0.001
Pseudo R-Squared: 0.9993
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
s(0) [0.001] 20 2.04e-08 ***
s(1) [0.001] 20 7.36e-06 ***
intercept 0 1 4.39e-13 ***
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Penalties / Constraints
With GAMs we can encode **prior knowledge** and **control overfitting** by using penalties and constraints.

#### Available penalties:
- second derivative smoothing (default on numerical features)
- L2 smoothing (default on categorical features)
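Penalties can also be chosen per term; a minimal sketch (assuming the term-level `penalties` argument, with 'derivative' and 'l2' as the penalty names):

```python
from pygam import LinearGAM, s, f

# a sketch: second-derivative smoothing on a numerical feature,
# L2 smoothing on a categorical one
gam = LinearGAM(s(0, penalties='derivative') + f(1, penalties='l2'))
```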

#### Available constraints:
- monotonic increasing/decreasing smoothing
- convex/concave smoothing
- periodic smoothing [soon...]


We can inject our intuition into our model by using **monotonic** and **concave** constraints:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s
from pygam.datasets import hepatitis

X, y = hepatitis(return_X_y=True)

gam1 = LinearGAM(s(0, constraints='monotonic_inc')).fit(X, y)
gam2 = LinearGAM(s(0, constraints='concave')).fit(X, y)

fig, ax = plt.subplots(1, 2)
ax[0].plot(X, y, label='data')
ax[0].plot(X, gam1.predict(X), label='monotonic fit')
ax[0].legend()

ax[1].plot(X, y, label='data')
ax[1].plot(X, gam2.predict(X), label='concave fit')
ax[1].legend()
```
<img src=imgs/pygam_constraints.png>

## API
pyGAM is intuitive, modular, and adheres to a familiar API:

```python
from pygam import LogisticGAM, s, f
from pygam.datasets import toy_classification

X, y = toy_classification(return_X_y=True)

gam = LogisticGAM(s(0) + s(1) + s(2) + s(3) + s(4) + f(5))
gam.fit(X, y)
```

Since GAMs are additive, it is also super easy to visualize each individual **feature function**, `f_i(X_i)`. These feature functions describe the effect of each `X_i` on `y` individually while marginalizing out all other predictors:

```python
# plot each feature function on one axes
for i in range(6):
    XX = gam.generate_X_grid(term=i)
    plt.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
```
<img src=imgs/pygam_multi_pdep.png>

## Current Features
### Models
pyGAM comes with many models out-of-the-box:

- GAM (base class for constructing custom models)
- LinearGAM
- LogisticGAM
- GammaGAM
- PoissonGAM
- InvGaussGAM
- ExpectileGAM

You can mix and match distributions with link functions to create custom models!

```python
gam = GAM(distribution='gamma', link='inverse')
```

### Distributions

- Normal
- Binomial
- Gamma
- Poisson
- Inverse Gaussian

### Link Functions
Link functions take the distribution mean to the linear prediction. These are the canonical link functions for the above distributions:

- Identity
- Logit
- Inverse
- Log
- Inverse-squared

### Callbacks
Callbacks are performed during each optimization iteration. It's also easy to write your own.

- deviance - model deviance
- diffs - differences of coefficient norm
- accuracy - model accuracy for LogisticGAM
- coef - coefficient logging
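Callbacks are specified by name when the model is constructed; a minimal sketch (assuming the `callbacks` constructor argument, matching the names above):

```python
from pygam import LogisticGAM

# a sketch: request deviance, diffs, and accuracy logging during fitting
gam = LogisticGAM(callbacks=['deviance', 'diffs', 'accuracy'])
```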

You can check a callback by inspecting:

```python
plt.plot(gam.logs_['deviance'])
```
<img src=imgs/pygam_multi_deviance.png>

### Linear Extrapolation
<img src=imgs/pygam_mcycle_data_extrapolation.png>
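Spline terms extrapolate linearly beyond the training domain; a minimal sketch on the mcycle data (the grid bounds are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from pygam import LinearGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)
gam = LinearGAM().gridsearch(X, y)

# predict on a grid extending 10 units beyond the data on each side
XX = np.linspace(X.min() - 10, X.max() + 10, 500).reshape(-1, 1)
plt.plot(XX, gam.predict(XX), 'r--')
plt.scatter(X, y, facecolor='gray', edgecolors='none')
```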

## Citing pyGAM
Please consider citing pyGAM if it has helped you in your research or work:
