Commit

move documentation to repo
tqchen committed Apr 19, 2015
1 parent 5b04269 commit c6c8684
Showing 11 changed files with 281 additions and 46 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -15,7 +15,7 @@ Distributed Version: [Distributed XGBoost](multi-node)

Notes on the Code: [Code Guide](src)

Documentation: https://github.com/dmlc/xgboost/doc
[Documentation](https://github.com/dmlc/xgboost/doc)

Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
* These slides were made by Tianqi Chen to introduce gradient boosting from a statistical view.
2 changes: 2 additions & 0 deletions demo/README.md
@@ -38,6 +38,8 @@ This is a list of short codes introducing different functionalities of xgboost a

Basic Examples by Tasks
====
Most of the examples in this section are based on the CLI or Python version.
However, the parameter settings can be applied to all versions.
* [Binary classification](binary_classification)
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
14 changes: 0 additions & 14 deletions demo/binary_classification/README

This file was deleted.

174 changes: 174 additions & 0 deletions demo/binary_classification/README.md
@@ -0,0 +1,174 @@
Binary Classification
====
This is the quick start tutorial for the xgboost CLI version. You can also check out [../../doc/README.md](../../doc/README.md) for links to tutorials in Python or R.

Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```.

The binary classification demo lives at [demo/binary_classification](../blob/master/demo/binary_classification), and the script runexp.sh can be used to run the demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI Machine Learning Repository.

### Tutorial
#### Generate Input Data
XGBoost takes LibSVM format. An example of fake input data is below:
```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' is used to indicate positive samples and '0' to indicate negative samples. We also support probability values in [0,1] as labels, indicating the probability of the instance being positive.
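
For illustration only, here is a minimal Python sketch (not part of the demo scripts) that writes sparse instances in this format; the instances and the file name demo.txt are made up:
```python
# Minimal sketch: write sparse instances in LibSVM format.
# The instances and the output file name are hypothetical.
instances = [
    (1, {101: 1.2, 102: 0.03}),
    (0, {1: 2.1, 10001: 300, 10002: 400}),
]

with open("demo.txt", "w") as f:
    for label, features in instances:
        feats = " ".join("%d:%g" % (i, v) for i, v in sorted(features.items()))
        f.write("%d %s\n" % (label, feats))
```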


First we will transform the dataset into classic LibSVM format and split the data into a training set and a test set by running:
```
python mapfeat.py
python mknfold.py agaricus.txt 1
```
The two files, 'agaricus.txt.train' and 'agaricus.txt.test', will be used as the training set and test set.
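
As a rough illustration of what this step does, the sketch below randomly holds out lines of a LibSVM file as a test set. It is a simplified stand-in, not the actual mknfold.py; the 80/20 split ratio and the seed are assumptions:
```python
# Simplified sketch of a random train/test split over a LibSVM file.
# Not the real mknfold.py; the 80/20 ratio and the seed are arbitrary choices.
import random

random.seed(10)
with open("agaricus.txt") as fin, \
     open("agaricus.txt.train", "w") as ftr, \
     open("agaricus.txt.test", "w") as fte:
    for line in fin:
        (fte if random.random() < 0.2 else ftr).write(line)
```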

#### Training
Then we can run the training process:
```
../../xgboost mushroom.conf
```

mushroom.conf is the configuration file for both training and testing. Each line contains an [attribute]=[value] pair:

```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of rounds for boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
We use the [tree booster](https://github.com/tqchen/xgboost/wiki/Tree-Booster) and the logistic regression objective in our setting. This means we accomplish our task using classic gradient boosted regression trees (GBRT), a promising method for binary classification.

The parameters shown in the example are the most common ones needed to use xgboost.
If you are interested in more parameter settings, the complete parameter list with detailed descriptions is [here](https://github.com/tqchen/xgboost/wiki/Parameters). Besides putting the parameters in the configuration file, we can set them by passing them as command-line arguments:

```
../../xgboost mushroom.conf max_depth=6
```
This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When you use the command line, make sure max_depth=6 is passed as a single argument, i.e. it must not contain spaces. When a parameter is set both on the command line and in the config file, the command-line setting overrides the one in the config file.

In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster instead, you can keep all the parameters except booster and the tree booster parameters, changed as below:
```conf
# General Parameters
# choose the linear booster
booster = gblinear
...
# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01
# Regression Parameters
...
```

#### Get Predictions
After training, we can use the output model to get the prediction of the test data:
```
../../xgboost mushroom.conf task=pred model_in=0003.model
```
For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability of the label being positive.
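
If you need hard 0/1 labels, you can threshold these scores yourself. A minimal sketch, assuming the predictions were written one score per line to a file named pred.txt (treat the file name as an assumption):
```python
# Sketch: turn probability scores into hard 0/1 labels by thresholding at 0.5.
# Assumes one score per line in pred.txt; the file name is an assumption.
with open("pred.txt") as f:
    scores = [float(line) for line in f]

labels = [1 if s > 0.5 else 0 for s in scores]
print("%d of %d instances predicted positive" % (sum(labels), len(labels)))
```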

#### Dump Model
This is a preliminary feature, and so far only tree models support text dump. XGBoost can display the tree models in text files, which makes it easy to inspect the model:
```
../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt
```

In this demo, the tree boosters obtained will be printed in dump.raw.txt and dump.nice.txt; the latter is easier to understand because it uses the feature mapping featmap.txt.

The format of ```featmap.txt``` is ```<featureid> <featurename> <q or i or int>\n```:
- Feature ids must run from 0 to the number of features, in sorted order.
- i means the feature is a binary indicator feature
- q means the feature is a quantitative value, such as age or time, and can be missing
- int means the feature is an integer value (when int is hinted, the decision boundary will be integer)
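
For example, a minimal sketch that generates such a featmap.txt from a list of feature names, assuming for illustration that every feature is a binary indicator:
```python
# Sketch: generate featmap.txt from feature names.
# Assumes all features are binary indicators ("i"); the names are made up.
feature_names = ["cap-shape=bell", "cap-shape=conical", "odor=almond"]

with open("featmap.txt", "w") as f:
    for i, name in enumerate(feature_names):
        f.write("%d\t%s\ti\n" % (i, name))
```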

#### Monitoring Progress
When you run training, you will see messages displayed on screen:
```
tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139
boosting round 1, 0 sec elapsed
tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000
```
The messages for evaluation are printed to stderr, so if you only want to log the evaluation progress, simply type
```
../../xgboost mushroom.conf 2>log.txt
```
Then you can find the following content in log.txt
```
[0] test-error:0.016139
[1] test-error:0.000000
```
We can also monitor both training and test statistics by adding the following lines to the configuration:
```conf
eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"
```
Run the command again, and the log file becomes:
```
[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228
```
The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated each round.

xgboost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training; simply add ```eval_metric=logloss``` to the configuration. Run again, and the log file becomes:
```
[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457
```
#### Saving Progress Models
If you want to save the model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder for models, add model_dir=foldername. By default xgboost saves the model from the last round.

#### Continue from Existing Model
If you want to continue boosting from an existing model, say 0002.model, use
```
../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model
```
xgboost will load from 0002.model, continue boosting for 2 more rounds, and save the output to continue.model. Note, however, that the training and evaluation data specified in mushroom.conf should not change when you use this function.

#### Use Multi-Threading
When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded; to set the number of parallel threads to 10, add ```nthread=10``` to your configuration.

#### Additional Notes
* What are ```agaricus.txt.test.buffer``` and ```agaricus.txt.train.buffer``` generated during runexp.sh?
  - By default xgboost will automatically generate a binary-format buffer of the input data, with the suffix ```buffer```. The next time you run xgboost, it detects if ```agaricus.txt.test.buffer``` exists and automatically loads from the binary buffer if possible; this can speed up training when you train many times. You can disable this behavior by setting ```use_buffer=0```.
  - The buffer file can also be used as standalone input, i.e. if the buffer file exists but the original agaricus.txt.test was removed, xgboost will still run.
* Deviation from LibSVM input format: xgboost is compatible with LibSVM format, with the following minor differences:
  - xgboost allows the feature index to start from 0
  - for binary classification, the label is 1 for positive and 0 for negative, instead of +1/-1
  - the feature indices in each line *do not* need to be sorted


13 changes: 0 additions & 13 deletions demo/rank/README

This file was deleted.

22 changes: 22 additions & 0 deletions demo/rank/README.md
@@ -0,0 +1,22 @@
Learning to rank
====
XGBoost supports ranking tasks. In ranking scenarios, data are often grouped, and we need the [group information file](../../doc/input_format.md#group-input-format) to specify ranking tasks. The model used in XGBoost for ranking is LambdaRank; this feature is not yet complete. Currently, we provide pairwise rank.

### Parameters
The configuration is similar to the regression and binary classification settings, except that the user needs to specify the objective:

```
...
objective="rank:pairwise"
...
```
For more usage details, please refer to the [binary classification demo](../binary_classification).

Instructions
====
The dataset for the ranking demo is from LETOR04 MQ2008 fold1.
You can use the following commands to run the example:

* Get the data: ```./wgetdata.sh```
* Run the example: ```./runexp.sh```

13 changes: 0 additions & 13 deletions demo/regression/README

This file was deleted.

17 changes: 17 additions & 0 deletions demo/regression/README.md
@@ -0,0 +1,17 @@
Regression
====
Using XGBoost for regression is very similar to using it for binary classification. We suggest that you refer to the [binary classification demo](../binary_classification) first. In XGBoost, if we use the negative log-likelihood as the loss function for regression, the training procedure is the same as training a binary classifier with XGBoost.

### Tutorial
The dataset we use is the [computer hardware dataset from the UCI repository](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware). The demo for regression is almost the same as the [binary classification demo](../binary_classification), except for a small difference in the general parameters:
```
# General parameter
# this is the only difference with classification; use reg:linear to do linear regression
# when labels are in [0,1] we can also use reg:logistic
objective = reg:linear
...
```

The input format is the same as for binary classification, except that the label is now the target regression value. We use linear regression here; if we want to use logistic regression (objective = reg:logistic), the labels need to be pre-scaled into [0,1].
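
If you do want to try reg:logistic, a minimal sketch of the required pre-scaling (min-max scaling the targets into [0,1]; the target values below are made up) is:
```python
# Sketch: min-max scale regression targets into [0,1] for reg:logistic.
labels = [199.0, 253.0, 132.0, 290.0]  # hypothetical target values

lo, hi = min(labels), max(labels)
scaled = [(y - lo) / (hi - lo) for y in labels]
print(scaled)
```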

17 changes: 13 additions & 4 deletions doc/README.md
@@ -6,15 +6,24 @@ List of Documentations
====
* [Parameters](parameter.md)
* [Using XGBoost in Python](python.md)
* [Using XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd)
* [Learning to use xgboost by example](../demo)
* [External Memory Version](external_memory.md)
* [Text input format](input_format.md)

How to get started
====
* Read the [binary classification example](../demo/binary_classification) as a getting-started example
* Find the language-specific guide above for the language you would like to use
* [Learning to use xgboost by example](../demo) contains lots of useful examples

Highlight Links
====
This section collects blog posts, presentations, and videos discussing how to use xgboost to solve your problems. If you think something belongs here, send a pull request.
* Blogpost by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* Presention of a real use case of XGBoost to prepare tax audit in France: [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
* [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

Contribution
====
51 changes: 51 additions & 0 deletions doc/input_format.md
@@ -0,0 +1,51 @@
Input Format
====
## Basic Input Format
As we have mentioned, XGBoost takes LibSVM format. For training or predicting, XGBoost takes an instance file in the format below:

train.txt
```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3
```
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' is used to indicate positive samples and '0' to indicate negative samples. We also support probability values in [0,1] as labels, indicating the probability of the instance being positive.
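
For reference, a minimal sketch that parses one such line into a label and a sparse {feature index: value} mapping:
```python
# Sketch: parse a LibSVM-format line into (label, {feature_index: value}).
def parse_libsvm_line(line):
    parts = line.split()
    features = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        features[int(idx)] = float(val)
    return float(parts[0]), features

print(parse_libsvm_line("1 101:1.2 102:0.03"))
# -> (1.0, {101: 1.2, 102: 0.03})
```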

## Group Input Format
Because XGBoost supports [ranking tasks](https://github.com/tqchen/xgboost/wiki/Ranking), we support the group input format. In ranking tasks, instances are categorized into different groups in real-world scenarios; for example, in the learning-to-rank web pages scenario, web page instances are grouped by their queries. Besides the instance file mentioned above, XGBoost needs a file indicating the group information. For example, if the instance file is the "train.txt" shown above,
the group file is as below:

train.txt.group
```
2
3
```
This means that the data set contains 5 instances: the first two instances are in one group and the other three are in another group. The numbers in the group file indicate the number of instances in each group, in the order they appear in the instance file.
When configuring, you do not have to specify the path of the group file. If the instance file name is "xxx", XGBoost will check whether a file named "xxx.group" exists in the same directory and, if it does, read the data in group input format.
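
As an illustration, the sketch below derives a group file from a list of per-instance query ids (assuming instances sharing a query id are stored contiguously in the instance file); on the ids shown it reproduces the group file above:
```python
# Sketch: write train.txt.group from per-instance query ids.
# Assumes instances with the same query id are contiguous in train.txt.
from itertools import groupby

query_ids = ["q1", "q1", "q2", "q2", "q2"]  # one id per instance, in file order

with open("train.txt.group", "w") as f:
    for _, grp in groupby(query_ids):
        f.write("%d\n" % len(list(grp)))
```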

## Instance Weight File
XGBoost supports assigning each instance a weight to differentiate the importance of instances. For example, if we provide an instance weight file for the "train.txt" file in the example as below:

train.txt.weight
```
1
0.5
0.5
1
0.5
```
This means that XGBoost will put more emphasis on the first and fourth instances (that is, the positive instances) while training.
The configuration is similar to configuring the group information. If the instance file name is "xxx", XGBoost will check whether a file named "xxx.weight" exists in the same directory and, if it does, use the weights while training the model. The weights are included in the "xxx.buffer" file that XGBoost creates automatically, so if you want to update the weights, you need to delete the "xxx.buffer" file before relaunching XGBoost.
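
For example, the weight file above can be produced from the labels of "train.txt", giving positive instances twice the weight of negative ones (the 2:1 ratio is the choice made in the example, not a requirement):
```python
# Sketch: write the train.txt.weight shown above from the instance labels,
# weighting positives 1.0 and negatives 0.5.
labels = [1, 0, 0, 1, 0]  # labels of train.txt, in file order

with open("train.txt.weight", "w") as f:
    for y in labels:
        f.write("%s\n" % (1.0 if y == 1 else 0.5))
```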

## Initial Margin File
XGBoost supports providing each instance an initial margin prediction. For example, if we have an initial prediction from logistic regression for the "train.txt" file, we can create the following file:

train.txt.base_margin
```
-0.4
1.0
3.4
```
XGBoost will take these values as initial margin predictions and boost from them. An important note about base_margin is that it should be the margin prediction before transformation; so if you are doing logistic loss, you will need to supply the value before the logistic transformation. If you are using the XGBoost predictor, use pred_margin=1 to output margin values.
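
For example, with logistic loss the margin for a prior probability p is the log-odds log(p / (1 - p)). A minimal sketch converting prior probabilities into a base_margin file (the probabilities are made up, chosen to reproduce the margins above):
```python
# Sketch: convert prior probabilities into pre-transformation margins
# (log-odds) for logistic loss and write them to train.txt.base_margin.
import math

prior_probs = [0.40, 0.73, 0.968]  # hypothetical first-model predictions

with open("train.txt.base_margin", "w") as f:
    for p in prior_probs:
        f.write("%.1f\n" % math.log(p / (1.0 - p)))
```
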
2 changes: 1 addition & 1 deletion doc/parameter.md
@@ -11,7 +11,7 @@ In R-package, you can use .(dot) to replace under score in the parameters, for e

### General Parameters
* booster [default=gbtree]
- which booster to use, can be gbtree or gblinear. The details about different boosters are described [here](https://github.com/dmlc/xgboost/wiki/Boosters).
- which booster to use, can be gbtree or gblinear. gbtree uses a tree-based model while gblinear uses a linear function.
* silent [default=0]
- 0 means printing running messages, 1 means silent mode.
* nthread [default to maximum number of threads available if not set]
