diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index 71f48a1666df..36ccc9d5daaf 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -1,9 +1,9 @@ Contributors of DMLC/XGBoost -======= +============================ XGBoost has been developed and used by a group of active community. Everyone is more than welcomed to is a great way to make the project better and more accessible to more users. Comitters -======= +--------- Committers are people who have made substantial contribution to the project and granted write access to the project. * [Tianqi Chen](https://github.com/tqchen), University of Washington - Tianqi is a PhD working on large-scale machine learning, he is the creator of the project. @@ -14,8 +14,17 @@ Committers are people who have made substantial contribution to the project and * [Michael Benesty](https://github.com/pommedeterresautee) - Micheal is a lawyer, data scientist in France, he is the creator of xgboost interactive analysis module in R. +Become a Committer +------------------ +XGBoost is an open source project and we are actively looking for new committers who are willing to help maintain and lead the project. +Committers come from contributors who: +* Have made substantial contributions to the project. +* Are willing to spend time on maintaining and leading the project. + +New committers will be proposed by current committers, with support from more than two of the current committers.
+ List of Contributors -======= +-------------------- * [Full List of Contributors](https://github.com/dmlc/xgboost/graphs/contributors) - To contributors: please add your name to the list when you submit a patch to the project:) * [Kailong Chen](https://github.com/kalenhaha) diff --git a/R-package/DESCRIPTION b/R-package/DESCRIPTION index 4560971e2589..19410d65a44a 100644 --- a/R-package/DESCRIPTION +++ b/R-package/DESCRIPTION @@ -1,8 +1,8 @@ Package: xgboost Type: Package Title: Extreme Gradient Boosting -Version: 0.4-1 -Date: 2015-05-11 +Version: 0.4-2 +Date: 2015-08-01 Author: Tianqi Chen , Tong He , Michael Benesty Maintainer: Tong He Description: Extreme Gradient Boosting, which is an diff --git a/R-package/README.md b/R-package/README.md index 96113c39162c..294691416349 100644 --- a/R-package/README.md +++ b/R-package/README.md @@ -4,7 +4,13 @@ R package for xgboost Installation ------------ -For up-to-date version (which is recommended), please install from github. Windows user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first. +We are [on CRAN](https://cran.r-project.org/web/packages/xgboost/index.html) now. For the stable, pre-compiled (for Windows and OS X) version, please install from CRAN: + +```r +install.packages('xgboost') +``` + +For the up-to-date version, please install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
```r devtools::install_github('dmlc/xgboost',subdir='R-package') diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd index 39ab819f7037..89d27fb45dc2 100644 --- a/R-package/vignettes/xgboostPresentation.Rmd +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -1,6 +1,6 @@ --- title: "Xgboost presentation" -output: +output: rmarkdown::html_vignette: css: vignette.css number_sections: yes @@ -16,7 +16,7 @@ vignette: > Introduction ============ -**Xgboost** is short for e**X**treme **G**radient **Boost**ing package. +**Xgboost** is short for e**X**treme **G**radient **Boost**ing package. The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions. @@ -25,9 +25,9 @@ It is an efficient and scalable implementation of gradient boosting framework by - *linear* model ; - *tree learning* algorithm. -It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objective functions easily. +It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objective functions easily. -It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions. +It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions. It has several features: @@ -64,7 +64,7 @@ Formerly available versions can be obtained from the CRAN [archive](http://cran. Learning ======== -For the purpose of this tutorial we will load **Xgboost** package. +For the purpose of this tutorial we will load **XGBoost** package. 
```{r libLoading, results='hold', message=F, warning=F} require(xgboost) @@ -73,7 +73,7 @@ require(xgboost) Dataset presentation -------------------- -In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the the same as you will use on in your every day life :-). +In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the the same as you will use on in your every day life :-). Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013. @@ -85,7 +85,7 @@ We will load the `agaricus` datasets embedded with the package and will link the The datasets are already split in: * `train`: will be used to build the model ; -* `test`: will be used to assess the quality of our model. +* `test`: will be used to assess the quality of our model. Why *split* the dataset in two parts? @@ -115,7 +115,7 @@ dim(train$data) dim(test$data) ``` -This dataset is very small to not make the **R** package too heavy, however **Xgboost** is built to manage huge dataset very efficiently. +This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge dataset very efficiently. As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and `label` vector is a `numeric` vector (`{0,1}`): @@ -124,7 +124,7 @@ class(train$data)[1] class(train$label) ``` -Basic Training using Xgboost +Basic Training using XGBoost ---------------------------- This step is the most critical part of the process for the quality of our model. @@ -160,7 +160,7 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth #### xgb.DMatrix -**Xgboost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be usefull for the most advanced features we will discover later. +**XGBoost** offers a way to group them in a `xgb.DMatrix`. 
You can even add other meta data in it. It will be usefull for the most advanced features we will discover later. ```{r trainingDmatrix, message=F, warning=F} dtrain <- xgb.DMatrix(data = train$data, label = train$label) @@ -169,7 +169,7 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround #### Verbose option -**Xgboost** has severa features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality. +**XGBoost** has severa features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality. One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics). @@ -188,7 +188,7 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, o bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2) ``` -Basic prediction using Xgboost +Basic prediction using XGBoost ============================== Perform the prediction @@ -211,7 +211,7 @@ These numbers doesn't look like *binary classification* `{0,1}`. We need to perf Transform the regression in a binary classification --------------------------------------------------- -The only thing that **Xgboost** does is a *regression*. **Xgboost** is using `label` vector to build its *regression* model. +The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model. How can we use a *regression* model to perform a binary classification? @@ -240,7 +240,7 @@ Steps explanation: 2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of error between true data and computed probabilities ; 3. `mean(vectorOfErrors)` computes the *average error* itself. 
-The most important thing to remember is that **to do a classification, you just do a regression to the** `label` **and then apply a threshold**. +The most important thing to remember is that **to do a classification, you just do a regression to the** `label` **and then apply a threshold**. *Multiclass* classification works in a similar way. @@ -269,7 +269,7 @@ Both `xgboost` (simple) and `xgb.train` (advanced) functions train models. One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following technics will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible. -One way to measure progress in learning of a model is to provide to **Xgboost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning. +One way to measure progress in learning of a model is to provide to **XGBoost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning. > in some way it is similar to what we have done above with the average error. The main difference is that below it was after building the model, and now it is during the construction that we measure errors. @@ -281,7 +281,7 @@ watchlist <- list(train=dtrain, test=dtest) bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, objective = "binary:logistic") ``` -**Xgboost** has computed at each round the same average error metric than seen above (we set `nround` to 2, that is why we have two lines). 
Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset. +**XGBoost** has computed at each round the same average error metric than seen above (we set `nround` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset. Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset. @@ -298,13 +298,13 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli Linear boosting --------------- -Until know, all the learnings we have performed were based on boosting trees. **Xgboost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter). +Until know, all the learnings we have performed were based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter). ```{r linearBoosting, message=F, warning=F} bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic") ``` -In this specific case, *linear boosting* gets sligtly better performance metrics than decision trees based algorithm. +In this specific case, *linear boosting* gets sligtly better performance metrics than decision trees based algorithm. In simple cases, it will happem because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. 
Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use. @@ -340,7 +340,7 @@ print(paste("test-error=", err)) View feature importance/influence from the learnt model ------------------------------------------------------- -Feature importance is similar to R gbm package's relative influence (rel.inf). +Feature importance is similar to R gbm package's relative influence (rel.inf). ``` importance_matrix <- xgb.importance(model = bst) @@ -370,7 +370,7 @@ Save and load models May be your dataset is big, and it takes time to train a model on it? May be you are not a big fan of loosing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required. -Hopefully for you, **Xgboost** implements such functions. +Hopefully for you, **XGBoost** implements such functions. ```{r saveModel, message=F, warning=F} # save model to binary local file @@ -397,7 +397,7 @@ file.remove("./xgboost.model") > result is `0`? We are good! -In some very specific cases, like when you want to pilot **Xgboost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it. +In some very specific cases, like when you want to pilot **XGBoost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it. ```{r saveLoadRBinVectorModel, message=F, warning=F} # save model to R's raw vector @@ -412,9 +412,9 @@ pred3 <- predict(bst3, test$data) # pred2 should be identical to pred print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred)))) -``` +``` -> Again `0`? It seems that `Xgboost` works pretty well! +> Again `0`? It seems that `XGBoost` works pretty well! 
References ========== diff --git a/README.md b/README.md index be93e99fda76..ac29ef7eb52b 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,8 @@ -DMLC/XGBoost -======= - -[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost) [![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) + eXtreme Gradient Boosting +=========== +[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost) +[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org) +[![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version. diff --git a/demo/README.md b/demo/README.md index fcfaa8434e64..d6f061484962 100644 --- a/demo/README.md +++ b/demo/README.md @@ -1,12 +1,12 @@ -XGBoost Examples -==== +XGBoost Code Examples +===================== This folder contains all the code examples using xgboost. * Contribution of examples, benchmarks is more than welcome! * If you like to share how you use xgboost to solve your problem, send a pull request:) Features Walkthrough -==== +-------------------- This is a list of short codes introducing different functionalities of xgboost packages. 
* Basic walkthrough of packages [python](guide-python/basic_walkthrough.py) @@ -37,7 +37,7 @@ This is a list of short codes introducing different functionalities of xgboost p [R](../R-package/demo/predict_leaf_indices.R) Basic Examples by Tasks -==== +----------------------- Most of examples in this section are based on CLI or python version. However, the parameter settings can be applied to all versions * [Binary classification](binary_classification) @@ -46,7 +46,7 @@ However, the parameter settings can be applied to all versions * [Learning to Rank](rank) Benchmarks -==== +---------- * [Starter script for Kaggle Higgs Boson](kaggle-higgs) * [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution) diff --git a/demo/guide-python/README.md b/demo/guide-python/README.md index 32d0290ab7e6..ff1f98ad0d9d 100644 --- a/demo/guide-python/README.md +++ b/demo/guide-python/README.md @@ -1,6 +1,6 @@ XGBoost Python Feature Walkthrough -==== -* [Basic walkthrough of wrappers](basic_walkthrough.py) +================================== +* [Basic walkthrough of wrappers](basic_walkthrough.py) * [Cutomize loss function, and evaluation metric](custom_objective.py) * [Boosting from existing prediction](boost_from_prediction.py) * [Predicting using first n trees](predict_first_ntree.py) diff --git a/demo/kaggle-otto/understandingXGBoostModel.Rmd b/demo/kaggle-otto/understandingXGBoostModel.Rmd index 6bd64401d206..e04277d4ee20 100644 --- a/demo/kaggle-otto/understandingXGBoostModel.Rmd +++ b/demo/kaggle-otto/understandingXGBoostModel.Rmd @@ -1,7 +1,7 @@ --- title: "Understanding XGBoost Model on Otto Dataset" author: "Michaël Benesty" -output: +output: rmarkdown::html_vignette: css: ../../R-package/vignettes/vignette.css number_sections: yes @@ -54,7 +54,7 @@ test[1:6,1:5, with =F] Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.
-Obviously the first column (`ID`) doesn't contain any useful information. +Obviously the first column (`ID`) doesn't contain any useful information. To let the algorithm focus on real stuff, we will delete it. @@ -124,7 +124,7 @@ param <- list("objective" = "multi:softprob", cv.nround <- 5 cv.nfold <- 3 -bst.cv = xgb.cv(param=param, data = trainMatrix, label = y, +bst.cv = xgb.cv(param=param, data = trainMatrix, label = y, nfold = cv.nfold, nrounds = cv.nround) ``` > As we can see the error rate is low on the test dataset (for a 5mn trained model). @@ -144,7 +144,7 @@ Feature importance So far, we have built a model made of **`r nround`** trees. -To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products). +To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products). Each division operation is called a *split*. @@ -158,7 +158,7 @@ In the same way, in Boosting we try to optimize the missclassification at each r The improvement brought by each *split* can be measured, it is the *gain*. -Each *split* is done on one feature only at one value. +Each *split* is done on one feature only at one value. Let's see what the model looks like. @@ -168,7 +168,7 @@ model[1:10] ``` > For convenience, we are displaying the first 10 lines of the model only. -Clearly, it is not easy to understand what it means. +Clearly, it is not easy to understand what it means. Basically each line represents a *branch*, there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, when the row for this feature is N/A). @@ -217,7 +217,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2) We are just displaying the first two trees here. 
-On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the intersaction between features is complicated. +On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the intersaction between features is complicated. Besides, **XGBoost** generate `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes. Going deeper diff --git a/doc/README b/doc/README new file mode 100644 index 000000000000..a14ad800b1fb --- /dev/null +++ b/doc/README @@ -0,0 +1,5 @@ +The document of xgboost is generated with recommonmark and sphinx. + +You can build it locally by typing "make html" in this folder. +- You will need to rerun the recommonmark script for readthedocs in sphinx_util. +- This was a hack to get the customized parser into readthedocs, hopefully to be removed in future. diff --git a/doc/build.md b/doc/build.md index 7b8ee96aaadd..b97237bcbac3 100644 --- a/doc/build.md +++ b/doc/build.md @@ -1,5 +1,5 @@ Build XGBoost -==== +============= * Run ```bash build.sh``` (you can also type make) * If you have C++11 compiler, it is recommended to type ```make cxx11=1``` - C++11 is not used by default @@ -12,19 +12,19 @@ Build XGBoost * OS X with multi-threading support: see [next section](#openmp-for-os-x) Build XGBoost in OS X with OpenMP -==== +--------------------------------- Here is the complete solution to use OpenMp-enabled compilers to install XGBoost. 1. Obtain gcc with openmp support by `brew install gcc --without-multilib` **or** clang with openmp by `brew install clang-omp`. The clang one is recommended because the first method requires us compiling gcc inside the machine (more than an hour in mine)! (BTW, `brew` is the de facto standard of `apt-get` on OS X. 
So installing [HPC](http://hpc.sourceforge.net/) separately is not recommended, but it should work.) -2. **if you are planing to use clang-omp** - in step 3 and/or 4, change line 9 in `xgboost/src/utils/omp.h` to +2. **if you are planing to use clang-omp** - in step 3 and/or 4, change line 9 in `xgboost/src/utils/omp.h` to ```C++ - #include /* instead of #include */` + #include /* instead of #include */` ``` - to make it work, otherwise you might get this error - + to make it work, otherwise you might get this error + `src/tree/../utils/omp.h:9:10: error: 'omp.h' file not found...` @@ -43,11 +43,11 @@ Here is the complete solution to use OpenMp-enabled compilers to install XGBoost export CXX = clang-omp++ ``` - Remember to change `header` (mentioned in step 2) if using clang-omp. - + Remember to change `header` (mentioned in step 2) if using clang-omp. + Then `cd xgboost` then `bash build.sh` to compile XGBoost. And go to `wrapper` sub-folder to install python version. -4. Set the `Makevars` file in highest piority for R. +4. Set the `Makevars` file in highest piority for R. The point is, there are three `Makevars` : `~/.R/Makevars`, `xgboost/R-package/src/Makevars`, and `/usr/local/Cellar/r/3.2.0/R.framework/Resources/etc/Makeconf` (the last one obtained by running `file.path(R.home("etc"), "Makeconf")` in R), and `SHLIB_OPENMP_CXXFLAGS` is not set by default!! After trying, it seems that the first one has highest piority (surprise!). @@ -75,21 +75,21 @@ Here is the complete solution to use OpenMp-enabled compilers to install XGBoost Again, remember to change `header` if using clang-omp. - Then inside R, run + Then inside R, run ```R install.packages('xgboost/R-package/', repos=NULL, type='source') ``` - + Or - + ```R devtools::install_local('xgboost/', subdir = 'R-package') # you may use devtools ``` Build with HDFS and S3 Support -===== +------------------------------ * To build xgboost use with HDFS/S3 support and distributed learnig. 
It is recommended to build with dmlc, with the following steps - ```git clone https://github.com/dmlc/dmlc-core``` - Follow instruction in dmlc-core/make/config.mk to compile libdmlc.a diff --git a/doc/conf.py b/doc/conf.py index b08f495f58ae..05e1e91babf0 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -22,7 +22,13 @@ sys.path.insert(0, libpath) sys.path.insert(0, curr_path) -from sphinx_util import MarkdownParser +from sphinx_util import MarkdownParser, AutoStructify + +# -- mock out modules +import mock +MOCK_MODULES = ['numpy', 'scipy', 'scipy.sparse', 'sklearn', 'matplotlib'] +for mod_name in MOCK_MODULES: + sys.modules[mod_name] = mock.Mock() # -- General configuration ------------------------------------------------ @@ -155,4 +161,7 @@ def setup(app): # Add hook for building doxygen xml when needed # no c++ API for now # app.connect("builder-inited", generate_doxygen_xml) - pass + app.add_config_value('recommonmark_config', { + 'url_resolver': lambda url: github_doc_root + url, + }, True) + app.add_transform(AutoStructify) diff --git a/doc/dev-guide/contribute.md b/doc/dev-guide/contribute.md new file mode 100644 index 000000000000..5d8f7c26cbee --- /dev/null +++ b/doc/dev-guide/contribute.md @@ -0,0 +1,13 @@ +Developer Guide +=============== +This page contains a guide for developers of xgboost. XGBoost has been developed and used by an active community. +Everyone is more than welcome to contribute; it is a great way to make the project better. +The project is maintained by a committee of [committers](../../CONTRIBUTORS.md#comitters) who will review and merge pull requests from contributors. + +Contributing Code +================= +* The C++ code follows Google C++ style +* We follow numpy style to document our python module +* Tools to precheck codestyle + - clone https://github.com/dmlc/dmlc-core into root directory + - type ```make lint``` and fix possible errors.
diff --git a/doc/faq.md b/doc/faq.md new file mode 100644 index 000000000000..5c985182af19 --- /dev/null +++ b/doc/faq.md @@ -0,0 +1,61 @@ +Frequently Asked Questions +========================== +This document contains frequently asked questions about xgboost. + +How to tune parameters +---------------------- +See [Parameter Tuning Guide](param_tuning.md) + + +I have a big dataset +-------------------- +XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fits into your memory +(this usually means millions of instances). +If you are running out of memory, check out the [external memory version](external_memory.md) or +[distributed version](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) of xgboost. + + +Running xgboost on Platform X (Hadoop/Yarn, Mesos) +-------------------------------------------------- +The distributed version of XGBoost is designed to be portable to various environments. +Distributed XGBoost can be ported to any platform that supports [rabit](https://github.com/dmlc/rabit). +You can directly run xgboost on Yarn. In theory, Mesos and other resource allocation engines can be easily supported as well. + + +Why not implement distributed xgboost on top of X (Spark, Hadoop) +----------------------------------------------------------------- +The first fact we need to know is that going distributed does not necessarily solve all the problems. +Instead, it creates more problems, such as communication overhead and fault tolerance. +The ultimate question still comes back to how to push the limit of each computation node +and use fewer resources to complete the task (thus with less communication and less chance of failure). + +To achieve this, we decided to reuse the optimizations in the single node xgboost and build the distributed version on top of it. +The communication demand in machine learning is rather simple, in the sense that we can depend on a limited set of APIs (in our case rabit).
+Such a design allows us to reuse most of the code while being portable to major platforms such as Hadoop/Yarn, MPI, SGE. +Most importantly, it pushes the limit of the computation resources we can use. + + +How can I port the model to my own system +----------------------------------------- +The model and data formats of XGBoost are exchangeable, +which means a model trained in one language can be loaded in another. +This means you can train the model using R, while running prediction using +Java or C++, which are more common in production systems. +You can also train the model using the distributed version, +and load it from python to do some interactive analysis. + + +Do you support LambdaMART +------------------------- +Yes, xgboost implements LambdaMART. Check out the objective section in [parameters](parameter.md) + + +How to deal with Missing Value +------------------------------ +xgboost supports missing values by default + + +Slightly different result between runs +-------------------------------------- +This could happen, due to non-determinism in floating point summation order and multi-threading. +The general accuracy will usually remain the same, though. \ No newline at end of file diff --git a/doc/index.md b/doc/index.md index 5d8d5b26f647..7c41d15e240d 100644 --- a/doc/index.md +++ b/doc/index.md @@ -1,28 +1,45 @@ XGBoost Documentation ===================== +This is the documentation of the xgboost library. +XGBoost is short for eXtreme gradient boosting. It is a library designed and optimized for boosted (tree) algorithms. +The goal of this library is to push the computation limits of machines to the extreme to provide a ***scalable***, ***portable*** and ***accurate*** library +for large scale tree boosting.
- -* [Using XGBoost in Python](python/python_intro.md) -* [Using XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd) -* [Learning to use xgboost by example](../demo) -* [External Memory Version](external_memory.md) -* [Text input format](input_format.md) -* [Build Instruction](build.md) -* [Notes on the Code](../src) -* List of all parameters and their usage: [Parameters](parameter.md) - - [Notes on Parameter Tunning](param_tuning.md) -* Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) - +This document is hosted at http://xgboost.readthedocs.org/. You can also browse most of the documents on GitHub directly. How to Get Started ------------------ -* Try to read the [binary classification example](../demo/binary_classification) for getting started example -* Find the guide specific language guide above for the language you like to use -* [Learning to use xgboost by example](../demo) contains lots of useful examples +The best way to get started with xgboost is through examples. There are three types of examples you can find in xgboost. +* [Tutorials](#tutorials) are self-contained tutorials on complete data science tasks. +* [XGBoost Code Examples](../demo/) are collections of code and benchmarks of xgboost. + - It contains a walkthrough section that walks you through specific API features. +* [Highlight Solutions](#highlight-solutions) are presentations using xgboost to solve real world problems. + - These examples are usually more advanced. You can usually find state-of-the-art solutions to many problems and challenges here. + +After you get familiar with the interface, check out the following additional resources +* [Frequently Asked Questions](faq.md) +* [Learning what is Behind: Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) +* [User Guide](#user-guide) contains a comprehensive list of xgboost documents.
+* [Developer Guide](dev-guide/contribute.md) + +Tutorials +--------- +Tutorials are self-contained materials that teach you how to achieve a complete data science task with xgboost; these +are great resources to learn xgboost by real examples. If you think you have something that belongs here, send a pull request. +* [Binary classification using XGBoost Command Line](../demo/binary_classification/) (CLI) + - This tutorial introduces the basic usage of the CLI version of xgboost +* [Introduction to XGBoost in Python](python/python_intro.md) (python) + - This tutorial introduces the python package of xgboost +* [Introduction to XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd) (R package) + - This is a general presentation about xgboost in R. +* [Discover your data with XGBoost in R](../R-package/vignettes/discoverYourData.Rmd) (R package) + - This tutorial explains feature analysis in xgboost. +* [Understanding XGBoost Model on Otto Dataset](../demo/kaggle-otto/understandingXGBoostModel.Rmd) (R package) + - This tutorial teaches you how to use xgboost to compete in the Kaggle Otto challenge. -Example Highlight Links ------------------------ +Highlight Solutions +------------------- This section is about blogposts, presentation and videos discussing how to use xgboost to solve your interesting problem. If you think something belongs to here, send a pull request.
* [Kaggle CrowdFlower winner's solution by Chenglong Chen](https://github.com/ChenglongChen/Kaggle_CrowdFlower) * [Kaggle Malware Prediction winner's solution](https://github.com/xiaozhouwang/kaggle_Microsoft_Malware) @@ -31,14 +48,25 @@ This section is about blogposts, presentation and videos discussing how to use x * Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y) * [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/) +User Guide +---------- +* [Frequently Asked Questions](faq.md) +* [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) +* [Using XGBoost in Python](python/python_intro.md) +* [Using XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd) +* [Learning to use XGBoost by Example](../demo) +* [External Memory Version](external_memory.md) +* [Text input format](input_format.md) +* [Build Instruction](build.md) +* [Parameters](parameter.md) +* [Notes on Parameter Tunning](param_tuning.md) + + +Developer Guide +--------------- +* [Developer Guide](dev-guide/contribute.md) + API Reference ------------- - * [Python API Reference](python/python_api.rst) - -Contribution ------------- -Contribution of documents and use-cases are welcomed! -* This package use Google C++ style -* Check tool of codestyle - - clone https://github.com/dmlc/dmlc-core into root directory - - type ```make lint``` and fix possible errors. 
+* [Python API Reference](python/python_api.rst) + diff --git a/doc/param_tuning.md b/doc/param_tuning.md index 78263a6a859a..c5848f6024d6 100644 --- a/doc/param_tuning.md +++ b/doc/param_tuning.md @@ -1,5 +1,5 @@ Notes on Parameter Tuning -==== +========================= Parameter tuning is a dark art in machine learning, the optimal parameters of a model can depend on many scenarios. So it is impossible to create a comprehensive guide for doing so. @@ -8,7 +8,7 @@ This document tries to provide some guideline for parameters in xgboost. Understanding Bias-Variance Tradeoff -==== +------------------------------------ If you take a machine learning or statistics course, this is likely to be one of the most important concepts. When we allow the model to get more complicated (e.g. more depth), the model @@ -22,7 +22,7 @@ will make the model more conservative or not. This can be used to help you turn the knob between complicated model and simple model. Control Overfitting -==== +------------------- When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem. There are in general two ways that you can control overfitting in xgboost @@ -31,9 +31,9 @@ There are in general two ways that you can control overfitting in xgboost * The second way is to add randomness to make training robust to noise - This include ```subsample```, ```colsample_bytree``` - You can also reduce stepsize ```eta```, but needs to remember to increase ```num_round``` when you do so. - -Handle Imbalanced Dataset -=== + +Handle Imbalanced Dataset +------------------------- For common cases such as ads clickthrough log, the dataset is extremely imbalanced. This can affect the training of xgboost model, and there are two ways to improve it. 
* If you care only about the ranking order (AUC) of your prediction diff --git a/doc/parameter.md b/doc/parameter.md index 53cdd806f2b9..4e0f365bf3db 100644 --- a/doc/parameter.md +++ b/doc/parameter.md @@ -3,13 +3,15 @@ XGBoost Parameters Before running XGboost, we must set three types of parameters, general parameters, booster parameters and task parameters: - General parameters relates to which booster we are using to do boosting, commonly tree or linear model - Booster parameters depends on which booster you have chosen -- Task parameters that decides on the learning scenario, for example, regression tasks may use different parameters with ranking tasks. -- In addition to these parameters, there can be console parameters that relates to behavior of console version of xgboost(e.g. when to save model) +- Learning Task parameters that decides on the learning scenario, for example, regression tasks may use different parameters with ranking tasks. +- Command line parameters that relates to behavior of CLI version of xgboost. -### Parameters in R Package +Parameters in R Package +----------------------- In R-package, you can use .(dot) to replace under score in the parameters, for example, you can use max.depth as max_depth. The underscore parameters are also valid in R. -### General Parameters +General Parameters +------------------ * booster [default=gbtree] - which booster to use, can be gbtree or gblinear. gbtree uses tree based model while gblinear uses linear function. * silent [default=0] @@ -21,10 +23,8 @@ In R-package, you can use .(dot) to replace under score in the parameters, for e * num_feature [set automatically by xgboost, no need to be set by user] - feature dimension used in boosting, set to maximum dimension of the feature -### Booster Parameters -From xgboost-unity, the ```bst:``` prefix is no longer needed for booster parameters. Parameter with or without bst: prefix will be equivalent(i.e. both bst:eta and eta will be valid parameter setting) . 
- -#### Parameter for Tree Booster +Parameters for Tree Booster +--------------------------- * eta [default=0.3] - step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinks the feature weights to make the boosting process more conservative. - range: [0,1] @@ -47,7 +47,8 @@ From xgboost-unity, the ```bst:``` prefix is no longer needed for booster parame - subsample ratio of columns when constructing each tree. - range: (0,1] -#### Parameter for Linear Booster +Parameters for Linear Booster +----------------------------- * lambda [default=0] - L2 regularization term on weights * alpha [default=0] @@ -55,7 +56,8 @@ From xgboost-unity, the ```bst:``` prefix is no longer needed for booster parame * lambda_bias - L2 regularization term on bias, default 0(no L1 reg on bias because it is not important) -### Task Parameters +Learning Task Parameters +------------------------ * objective [ default=reg:linear ] - specify the learning task and the corresponding learning objective, and the objective options are below: - "reg:linear" --linear regression @@ -87,7 +89,8 @@ training repeatively * seed [ default=0 ] - random number seed. -### Console Parameters +Command Line Parameters +----------------------- The following parameters are only used in the console version of xgboost * use_buffer [ default=1 ] - whether create binary buffer for text input, this normally will speedup loading when do diff --git a/doc/python-requirements.txt b/doc/python-requirements.txt new file mode 100644 index 000000000000..1a041d154156 --- /dev/null +++ b/doc/python-requirements.txt @@ -0,0 +1,2 @@ +commonmark + diff --git a/doc/python/python_api.rst b/doc/python/python_api.rst index e665efe84a2d..85249cbc4ead 100644 --- a/doc/python/python_api.rst +++ b/doc/python/python_api.rst @@ -1,6 +1,8 @@ Python API Reference ==================== -This page gives the Python API reference of xgboost. 
+This page gives the Python API reference of xgboost, please also refer to Python Package Introduction for more information about python package. + +The document in this page is automatically generated by sphinx. The content do not render at github, you can view it at http://xgboost.readthedocs.org/en/latest/python/python_api.html Core Data Structure ------------------- @@ -33,4 +35,3 @@ Scikit-Learn API .. autoclass:: xgboost.XGBClassifier :members: :show-inheritance: - diff --git a/doc/python/python_intro.md b/doc/python/python_intro.md index 2acb73b3c340..2b670a053924 100644 --- a/doc/python/python_intro.md +++ b/doc/python/python_intro.md @@ -1,32 +1,27 @@ -XGBoost Python Module -===================== +Python Package Introduction +=========================== +This document gives a basic walkthrough of xgboost python package. -This page will introduce XGBoost Python module, including: -* [Building and Import](#building-and-import) -* [Data Interface](#data-interface) -* [Setting Parameters](#setting-parameters) -* [Train Model](#training-model) -* [Early Stopping](#early-stopping) -* [Prediction](#prediction) -* [API Reference](python_api.md) +***List of other Helpful Links*** +* [Python walkthrough code collections](https://github.com/tqchen/xgboost/blob/master/demo/guide-python) +* [Python API Reference](python_api.rst) -A [walk through python example](https://github.com/tqchen/xgboost/blob/master/demo/guide-python) for UCI Mushroom dataset is provided. - -= -#### Install - -To install XGBoost, you need to run `make` in the root directory of the project and then in the `python-package` directory run +Install XGBoost +--------------- +To install XGBoost, do the following steps. 
+* You need to run `make` in the root directory of the project +* In the `python-package` directory run ```shell python setup.py install ``` -Then import the module in Python as usual + ```python import xgboost as xgb ``` -= -#### Data Interface +Data Interface +-------------- XGBoost python module is able to loading from libsvm txt format file, Numpy 2D array and xgboost binary buffer file. The data will be store in ```DMatrix``` object. * To load libsvm text format file and XGBoost binary file into ```DMatrix```, the usage is like @@ -42,8 +37,8 @@ dtrain = xgb.DMatrix( data, label=label) ``` * Build ```DMatrix``` from ```scipy.sparse``` ```python -csr = scipy.sparse.csr_matrix( (dat, (row,col)) ) -dtrain = xgb.DMatrix( csr ) +csr = scipy.sparse.csr_matrix((dat, (row, col))) +dtrain = xgb.DMatrix(csr) ``` * Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time. The usage is like: ```python @@ -52,18 +47,17 @@ dtrain.save_binary("train.buffer") ``` * To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` like: ```python -dtrain = xgb.DMatrix( data, label=label, missing = -999.0) +dtrain = xgb.DMatrix(data, label=label, missing = -999.0) ``` * Weight can be set when needed, like ```python -w = np.random.rand(5,1) -dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=w) +w = np.random.rand(5, 1) +dtrain = xgb.DMatrix(data, label=label, missing = -999.0, weight=w) ``` - -= -#### Setting Parameters -XGBoost use list of pair to save [parameters](parameter.md). Eg +Setting Parameters +------------------ +XGBoost use list of pair to save [parameters](../parameter.md). Eg * Booster parameters ```python param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' } @@ -77,8 +71,9 @@ plst += [('eval_metric', 'ams@0')] evallist = [(dtest,'eval'), (dtrain,'train')] ``` -= -#### Training Model +Training +-------- + With parameter list and data, you are able to train a model. 
* Training ```python @@ -104,10 +99,11 @@ After you save your model, you can load model file at anytime by using bst = xgb.Booster({'nthread':4}) #init model bst.load_model("model.bin") # load data ``` -= -#### Early stopping -If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in `evals`. If there's more than one, it will use the last. +Early Stopping +-------------- +If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. +Early stopping requires at least one set in `evals`. If there's more than one, it will use the last. `train(..., evals=evals, early_stopping_rounds=10)` @@ -117,13 +113,14 @@ If early stopping occurs, the model will have two additional fields: `bst.best_s This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). -= -#### Prediction +Prediction +---------- After you training/loading a model and preparing the data, you can start to do prediction. ```python -data = np.random.rand(7,10) # 7 entities, each contains 10 features -dtest = xgb.DMatrix( data, missing = -999.0 ) -ypred = bst.predict( xgmat ) +# 7 entities, each contains 10 features +data = np.random.rand(7, 10) +dtest = xgb.DMatrix(data) +ypred = bst.predict(xgmat) ``` If early stopping is enabled during training, you can predict with the best iteration. 
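Note on the early-stopping text in the `python_intro.md` hunk above: the rule it describes (watch the last set in `evals`, stop when the metric has not improved for `early_stopping_rounds` rounds, for metrics that are either minimized or maximized) can be sketched in plain Python. This is only an illustration of that rule under stated assumptions, not xgboost's own implementation; the function name `early_stop` and the parameter `patience` are hypothetical.

```python
def early_stop(scores, patience, maximize=False):
    """Illustrative early-stopping rule (hypothetical helper, not xgboost code).

    scores   -- metric value on the watched evaluation set, one per boosting round
    patience -- plays the role of early_stopping_rounds
    Returns (best_iteration, best_score, last_round_run).
    """
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best_score if maximize else score < best_score
        if i == 0 or improved:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_iter, best_score, i
    return best_iter, best_score, len(scores) - 1


# A minimized metric (e.g. RMSE) that stops improving after round 2.
print(early_stop([0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.5], patience=3))
# prints (2, 0.6, 5)
```

With `maximize=True` the same loop covers metrics such as AUC, where larger is better.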
diff --git a/doc/sphinx_util.py b/doc/sphinx_util.py
index 33c98d3815bc..0b51786301f6 100644
--- a/doc/sphinx_util.py
+++ b/doc/sphinx_util.py
@@ -1,50 +1,17 @@
 # -*- coding: utf-8 -*-
-"""Helper hacking utilty function for customization."""
+"""Helper utilty function for customization."""
 import sys
 import os
+import docutils
 import subprocess
 
-# TODO: make less hacky way than this one
 if os.environ.get('READTHEDOCS', None) == 'True':
-    subprocess.call('cd ..; rm -rf recommonmark;' +
+    subprocess.call('cd ..; rm -rf recommonmark recom;' +
                     'git clone https://github.com/tqchen/recommonmark;' +
-                    'cp recommonmark/recommonmark/parser.py doc/parser', shell=True)
+                    'mv recommonmark/recommonmark recom', shell=True)
 
 sys.path.insert(0, os.path.abspath('..'))
-import parser
+from recom import parser, transform
 
-class MarkdownParser(parser.CommonMarkParser):
-    github_doc_root = None
-    doc_suffix = set(['md', 'rst'])
-
-    @staticmethod
-    def remap_url(url):
-        if MarkdownParser.github_doc_root is None or url is None:
-            return url
-        if url.startswith('#'):
-            return url
-        arr = url.split('#', 1)
-        ssuffix = arr[0].rsplit('.', 1)
-
-        if len(ssuffix) == 2 and (ssuffix[-1] in MarkdownParser.doc_suffix
-                                  and arr[0].find('://') == -1):
-            arr[0] = ssuffix[0] + '.html'
-            return '#'.join(arr)
-        else:
-            if arr[0].find('://') == -1:
-                return MarkdownParser.github_doc_root + url
-            else:
-                return url
-
-    def reference(self, block):
-        block.destination = remap_url(block.destination)
-        return super(MarkdownParser, self).reference(block)
-
-# inplace modify the function in recommonmark module to allow link remap
-old_ref = parser.reference
-
-def reference(block):
-    block.destination = MarkdownParser.remap_url(block.destination)
-    return old_ref(block)
-
-parser.reference = reference
+MarkdownParser = parser.CommonMarkParser
+AutoStructify = transform.AutoStructify
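Note on the `doc/sphinx_util.py` hunk above: the link remapping that the removed `MarkdownParser.remap_url` performed can be restated as a standalone function, which may help when checking that the replacement parser still resolves the relative links used in the docs. This is a sketch that mirrors the removed lines. Passing `github_doc_root` as a parameter (the original read it from a class attribute) and the example root URL are assumptions made for illustration only.

```python
def remap_url(url, github_doc_root=None, doc_suffix=("md", "rst")):
    """Standalone restatement of the removed remap_url logic (illustrative)."""
    if github_doc_root is None or url is None:
        return url
    if url.startswith("#"):
        # Pure in-page anchors are left untouched.
        return url
    path, sep, fragment = url.partition("#")
    is_external = "://" in path
    parts = path.rsplit(".", 1)
    if len(parts) == 2 and parts[1] in doc_suffix and not is_external:
        # Relative .md/.rst documents become .html pages in the built docs.
        return parts[0] + ".html" + sep + fragment
    if not is_external:
        # Other relative paths (e.g. ../demo) are pointed back at the repository.
        return github_doc_root + url
    return url


# Hypothetical root, used only to make the example concrete.
root = "https://github.com/dmlc/xgboost/tree/master/doc/"
print(remap_url("python/python_intro.md", root))  # prints python/python_intro.html
print(remap_url("../demo", root))
```

Relative `.md` and `.rst` links are rewritten to `.html`, other relative paths are prefixed with the repository root, and absolute URLs and bare `#anchor` links pass through unchanged.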