Document refactor
change badge
tqchen committed Aug 3, 2015
1 parent c43fee5 commit e8de5da
Showing 20 changed files with 287 additions and 185 deletions.
15 changes: 12 additions & 3 deletions CONTRIBUTORS.md
@@ -1,9 +1,9 @@
Contributors of DMLC/XGBoost
=======
============================
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute, which is a great way to make the project better and more accessible to more users.

Committers
=======
---------
Committers are people who have made substantial contributions to the project and have been granted write access to the project.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a PhD student working on large-scale machine learning; he is the creator of the project.
@@ -14,8 +14,17 @@ Committers are people who have made substantial contribution to the project and
* [Michael Benesty](https://github.com/pommedeterresautee)
- Michael is a lawyer and data scientist in France; he is the creator of xgboost's interactive analysis module in R.

Become a Committer
-----------------
XGBoost is an open-source project and we are actively looking for new committers who are willing to help maintain and lead the project.
Committers come from contributors who:
* Have made substantial contributions to the project.
* Are willing to spend time maintaining and leading the project.

New committers will be proposed by current committers, with support from more than two current committers.

List of Contributors
=======
--------------------
* [Full List of Contributors](https://github.com/dmlc/xgboost/graphs/contributors)
- To contributors: please add your name to the list when you submit a patch to the project:)
* [Kailong Chen](https://github.com/kalenhaha)
4 changes: 2 additions & 2 deletions R-package/DESCRIPTION
@@ -1,8 +1,8 @@
Package: xgboost
Type: Package
Title: Extreme Gradient Boosting
Version: 0.4-1
Date: 2015-05-11
Version: 0.4-2
Date: 2015-08-01
Author: Tianqi Chen <[email protected]>, Tong He <[email protected]>, Michael Benesty <[email protected]>
Maintainer: Tong He <[email protected]>
Description: Extreme Gradient Boosting, which is an
8 changes: 7 additions & 1 deletion R-package/README.md
@@ -4,7 +4,13 @@ R package for xgboost
Installation
------------

We are [on CRAN](https://cran.r-project.org/web/packages/xgboost/index.html) now. For the stable, pre-compiled version (for Windows and OS X), please install from CRAN:

```r
install.packages('xgboost')
```

For the up-to-date development version, please install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.

```r
devtools::install_github('dmlc/xgboost',subdir='R-package')
46 changes: 23 additions & 23 deletions R-package/vignettes/xgboostPresentation.Rmd
@@ -1,6 +1,6 @@
---
title: "Xgboost presentation"
output:
rmarkdown::html_vignette:
css: vignette.css
number_sections: yes
@@ -16,7 +16,7 @@ vignette: >
Introduction
============

**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.

The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions.

Expand All @@ -25,9 +25,9 @@ It is an efficient and scalable implementation of gradient boosting framework by
- *linear* model ;
- *tree learning* algorithm.

It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extensible, so that users can easily define their own objective functions.
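Because custom objectives are mentioned above but not demonstrated here, below is a minimal sketch (not part of the original vignette) of what one can look like: the function receives the raw predictions and the training `xgb.DMatrix`, and returns the first- and second-order gradients. It assumes the `dtrain` object created later in this vignette.

```r
# Sketch of a user-defined logistic objective, passed through the `obj`
# argument of xgb.train(); `dtrain` is the xgb.DMatrix built further below.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds  <- 1 / (1 + exp(-preds))   # raw scores -> probabilities
  grad   <- preds - labels          # first-order gradient
  hess   <- preds * (1 - preds)     # second-order gradient
  list(grad = grad, hess = hess)
}

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2,
                 nround = 2, obj = logregobj)
```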

It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.

It has several features:

@@ -64,7 +64,7 @@ Formerly available versions can be obtained from the CRAN [archive](http://cran.
Learning
========

For the purpose of this tutorial we will load the **XGBoost** package.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
@@ -73,7 +73,7 @@ require(xgboost)
Dataset presentation
--------------------

In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as what you will use in your everyday life :-).

Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.

@@ -85,7 +85,7 @@ We will load the `agaricus` datasets embedded with the package and will link the
The datasets are already split in:

* `train`: will be used to build the model ;
* `test`: will be used to assess the quality of our model.

Why *split* the dataset in two parts?

@@ -115,7 +115,7 @@ dim(train$data)
dim(test$data)
```

This dataset is very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.

As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and the `label` is a `numeric` vector (`{0,1}`):

Expand All @@ -124,7 +124,7 @@ class(train$data)[1]
class(train$label)
```

Basic Training using XGBoost
----------------------------

This step is the most critical part of the process for the quality of our model.
@@ -160,7 +160,7 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth

#### xgb.DMatrix

**XGBoost** offers a way to group them in an `xgb.DMatrix`. You can even add other metadata to it. This will be useful for the more advanced features we will discover later.

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
@@ -169,7 +169,7 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround

#### Verbose option

**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model's quality.

One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).

@@ -188,7 +188,7 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, o
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
```

Basic prediction using XGBoost
==============================

Perform the prediction
@@ -211,7 +211,7 @@ These numbers doesn't look like *binary classification* `{0,1}`. We need to perf
Transform the regression into a binary classification
---------------------------------------------------

The only thing that **XGBoost** does is *regression*. **XGBoost** uses the `label` vector to build its *regression* model.

How can we use a *regression* model to perform a binary classification?

@@ -240,7 +240,7 @@ Steps explanation:
2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the computed probabilities ;
3. `mean(vectorOfErrors)` computes the *average error* itself.

The most important thing to remember is that **to do a classification, you just do a regression on the** `label` **and then apply a threshold**.

*Multiclass* classification works in a similar way.
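To make the thresholding step above concrete, here is a short sketch (not taken verbatim from the original vignette) that assumes `pred` holds the predictions computed in the earlier prediction step:

```r
# Sketch: turn predicted probabilities into 0/1 classes with a 0.5 threshold,
# then compute the average classification error on the test labels.
prediction <- as.numeric(pred > 0.5)
err <- mean(prediction != test$label)
print(paste("test-error=", err))
```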

@@ -269,7 +269,7 @@ Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you to avoid overfitting and to optimize the learning time by stopping it as soon as possible.

One way to measure progress in the learning of a model is to provide **XGBoost** with a second dataset that is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

> In some way it is similar to what we have done above with the average error. The main difference is that above we measured the error after building the model, whereas here we measure it during the construction.
@@ -281,7 +281,7 @@ watchlist <- list(train=dtrain, test=dtest)
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

**XGBoost** has computed at each round the same average error metric as the one seen above (we set `nround` to 2, which is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
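If you want to follow metrics other than the default one, the same watchlist mechanism accepts explicit `eval.metric` entries. A hedged sketch (mirroring the linear boosting call shown further below) could look like this:

```r
# Sketch: track both the classification error and the log loss on the
# watchlist defined above, at every boosting round.
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
                 watchlist = watchlist,
                 eval.metric = "error", eval.metric = "logloss",
                 objective = "binary:logistic")
```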

@@ -298,13 +298,13 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli
Linear boosting
---------------

Until now, all the learning we have performed was based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

In this specific case, *linear boosting* gets slightly better performance metrics than the decision tree based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and the outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

@@ -340,7 +340,7 @@ print(paste("test-error=", err))
View feature importance/influence from the learnt model
-------------------------------------------------------

Feature importance is similar to the R gbm package's relative influence (rel.inf).

```
importance_matrix <- xgb.importance(model = bst)
@@ -370,7 +370,7 @@ Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

Fortunately for you, **XGBoost** implements such functions.

```{r saveModel, message=F, warning=F}
# save model to binary local file
@@ -397,7 +397,7 @@ file.remove("./xgboost.model")

> result is `0`? We are good!
In some very specific cases, like when you want to pilot **XGBoost** from the `caret` package, you will want to save the model as an *R* binary vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
@@ -412,9 +412,9 @@ pred3 <- predict(bst3, test$data)
# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

> Again `0`? It seems that `XGBoost` works pretty well!
References
==========
9 changes: 5 additions & 4 deletions README.md
@@ -1,7 +1,8 @@
DMLC/XGBoost
=======

[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost) [![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
<img src=https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png width=100/> eXtreme Gradient Boosting
===========
[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost)
[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org)
[![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.

10 changes: 5 additions & 5 deletions demo/README.md
@@ -1,12 +1,12 @@
XGBoost Examples
====
XGBoost Code Examples
=====================
This folder contains all the code examples using xgboost.

* Contributions of examples and benchmarks are more than welcome!
* If you would like to share how you use xgboost to solve your problem, send a pull request :)

Features Walkthrough
====
--------------------
This is a list of short code examples introducing the different functionalities of the xgboost packages.
* Basic walkthrough of packages
[python](guide-python/basic_walkthrough.py)
@@ -37,7 +37,7 @@ This is a list of short codes introducing different functionalities of xgboost p
[R](../R-package/demo/predict_leaf_indices.R)

Basic Examples by Tasks
====
-----------------------
Most of the examples in this section are based on the CLI or Python version.
However, the parameter settings can be applied to all versions
* [Binary classification](binary_classification)
@@ -46,7 +46,7 @@ However, the parameter settings can be applied to all versions
* [Learning to Rank](rank)

Benchmarks
====
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

4 changes: 2 additions & 2 deletions demo/guide-python/README.md
@@ -1,6 +1,6 @@
XGBoost Python Feature Walkthrough
====
==================================
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Customize loss function, and evaluation metric](custom_objective.py)
* [Boosting from existing prediction](boost_from_prediction.py)
* [Predicting using first n trees](predict_first_ntree.py)
14 changes: 7 additions & 7 deletions demo/kaggle-otto/understandingXGBoostModel.Rmd
@@ -1,7 +1,7 @@
---
title: "Understanding XGBoost Model on Otto Dataset"
author: "Michaël Benesty"
output:
rmarkdown::html_vignette:
css: ../../R-package/vignettes/vignette.css
number_sections: yes
@@ -54,7 +54,7 @@ test[1:6,1:5, with =F]
Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.

Obviously the first column (`ID`) doesn't contain any useful information.

To let the algorithm focus on real stuff, we will delete it.
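As a hedged illustration (this snippet is not shown in the excerpt above), deleting the column with `data.table` syntax could look like this, assuming the column is indeed named `ID` as described:

```r
# Sketch: drop the uninformative ID column from both data.tables in place.
train[, ID := NULL]
test[, ID := NULL]
```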

@@ -124,7 +124,7 @@ param <- list("objective" = "multi:softprob",
cv.nround <- 5
cv.nfold <- 3
bst.cv = xgb.cv(param=param, data = trainMatrix, label = y,
nfold = cv.nfold, nrounds = cv.nround)
```
> As we can see, the error rate is low on the test dataset (for a model trained in about 5 minutes).
@@ -144,7 +144,7 @@ Feature importance

So far, we have built a model made of **`r nround`** trees.

To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).

Each division operation is called a *split*.

@@ -158,7 +158,7 @@ In the same way, in Boosting we try to optimize the missclassification at each r

The improvement brought by each *split* can be measured; it is called the *gain*.
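For reference (this formula is not part of the original text; it follows the XGBoost model documentation), the gain of a candidate split can be written as

$$
Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma,
$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of first- and second-order gradients in the left and right branches, and $\lambda$, $\gamma$ are regularization parameters.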

Each *split* is done on one feature only at one value.

Let's see what the model looks like.

@@ -168,7 +168,7 @@ model[1:10]
```
> For convenience, we are displaying the first 10 lines of the model only.
Clearly, it is not easy to understand what it means.

Basically, each line represents a *branch*; there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, or when the value for this feature is N/A).

@@ -217,7 +217,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)

We are just displaying the first two trees here.

On simple models, the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
Besides, **XGBoost** generates `k` trees at each round for a `k`-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.

Going deeper
5 changes: 5 additions & 0 deletions doc/README
@@ -0,0 +1,5 @@
The documentation of xgboost is generated with recommonmark and sphinx.

You can build it locally by typing "make html" in this folder.
- You will need to rerun the recommonmark script for readthedocs in sphinx_util.
- This was a hack to get the customized parser into readthedocs, hopefully to be removed in the future.