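A Simple XGBoost Tutorial Using the Iris Dataset

Original: www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html

By Ieva Zarina, Software Developer, Nordigen.

I had the opportunity to start using the xgboost machine learning algorithm; it is fast and shows good results. Here I will use the Iris dataset from scikit-learn for multiclass prediction.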

XGBoost algorithm (source).

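Installing Anaconda and xgboost

To work with the data, I need various scientific Python libraries. I have found that the best way to manage them is Anaconda: it installs all the libraries in one step and makes it easy to add new ones. You can download the installer for Windows, but if you want to install it on a Linux server, just copy-paste the following into the terminal: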

wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
bash

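After this, use conda to install pip, which you will need for installing xgboost. It is important to install it with Anaconda (into Anaconda's directory), so that pip installs other libraries there as well: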

conda install -y pip

Now, a very important step: install the xgboost Python package dependencies beforehand. From experience, I install these:

sudo apt-get install -y make g++ build-essential gfortran libatlas-base-dev liblapacke-dev python-dev python-setuptools libsm6 libxrender1

I upgraded virtualenv to avoid trouble with Python versions:

pip install --upgrade virtualenv

And finally I can install xgboost with pip (keep your fingers crossed):

pip install xgboost

This command installs the latest xgboost version, but if you want an earlier one, just specify it:

pip install xgboost==0.4a30

Now test that everything went well: type python in the terminal and try to import xgboost:

import xgboost as xgb

If you see no errors, perfect.
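If you would rather check the installation from a script than from the interactive prompt, a minimal sketch along these lines could work. The helper name `is_installed` is hypothetical; it relies only on the standard library's `importlib.util.find_spec`, which returns `None` when a top-level package cannot be found.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Report the packages this tutorial relies on.
for name in ("numpy", "sklearn", "xgboost"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Locating a package does not prove that it imports cleanly, so the `import xgboost as xgb` test is still worth running.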

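Xgboost Demo with the Iris Dataset

Here I will use the Iris dataset to show a simple example of how to use Xgboost.

First, load the dataset from sklearn, where X is the data and y is the class labels: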

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

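Then split the data into train and test sets with an 80-20% split:

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split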

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
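To make the 80-20% split concrete, here is a rough sketch of the same idea using only numpy. The arrays `X_demo` and `y_demo` are stand-ins with the Iris shapes (150 samples, 4 features), not the real data, and the shuffle will not match the rows that `train_test_split` picks for `random_state=42`.

```python
import numpy as np

# Stand-in for the Iris arrays: 150 samples, 4 features, 3 classes.
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

rng = np.random.default_rng(42)          # fixed seed, so the split is reproducible
order = rng.permutation(len(X_demo))     # shuffled row indices
n_test = int(round(0.2 * len(X_demo)))   # 20% of the rows go to the test set

test_idx, train_idx = order[:n_test], order[n_test:]
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
```

Shuffling before cutting matters here because the Iris rows are ordered by class; a plain slice would leave some classes out of the test set.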

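Next you need to create the Xgboost-specific **DMatrix** data format from the numpy arrays. Xgboost can work with numpy arrays directly, load data from svmlight files, and read other formats. Here is how to work with **numpy arrays**: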

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

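If you want to use svmlight for lower memory consumption, first **dump** the numpy arrays into **svmlight** format and then just pass the file name to DMatrix (recent xgboost releases may require the format to be named in the path, as in 'dtrain.svm?format=libsvm'):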

import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')
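For readers who have not seen the svmlight format, each line holds a label followed by `index:value` pairs. The sketch below only illustrates that layout with zero-based indices; the helper `to_svmlight_line` is hypothetical and does not reproduce the exact number formatting that `dump_svmlight_file` writes.

```python
def to_svmlight_line(label, features):
    """Format one sample as 'label index:value ...' with zero-based indices.

    Zero-valued features are skipped, which is where the format saves
    space on sparse data.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features) if v != 0]
    return " ".join([str(label)] + pairs)


print(to_svmlight_line(0, [5.1, 3.5, 1.4, 0.2]))  # 0 0:5.1 1:3.5 2:1.4 3:0.2
print(to_svmlight_line(1, [6.0, 0.0, 4.5, 1.5]))  # 1 0:6 2:4.5 3:1.5
```

Because zeros are omitted, the saving is largest on sparse data; the Iris features are dense, so the benefit there is small.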

Now, for Xgboost to work, you need to set the **parameters**:

param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # learning rate: shrinks the contribution of each new tree
    'verbosity': 0,  # logging mode - quiet (older releases used 'silent': 1)
    'objective': 'multi:softprob',  # multiclass objective that outputs per-class probabilities
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of training iterations

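Different datasets perform differently with different parameters. One parameter combination can give poor results while another gives very good ones. You can look at this Kaggle script to see how to search for the best parameters. As a starting point, try eta of 0.1, 0.2, or 0.3, max_depth in the range 2 to 10, and num_round around a few hundred.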
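One simple way to enumerate the suggested ranges is a plain grid, sketched below with the standard library only. The names `base_param` and `param_grid` are illustrative, and the sketch only builds the candidate dictionaries; it does not train or score anything.

```python
from itertools import product

base_param = {
    'objective': 'multi:softprob',
    'num_class': 3,
}

eta_values = [0.1, 0.2, 0.3]
max_depth_values = range(2, 11)  # 2 to 10 inclusive

# One parameter dict per (eta, max_depth) combination.
param_grid = [
    dict(base_param, eta=eta, max_depth=depth)
    for eta, depth in product(eta_values, max_depth_values)
]

print(len(param_grid))  # 27
print(param_grid[0])
```

Each candidate could then be passed to `xgb.train` (or `xgb.cv`) and compared on held-out data, ideally a validation split rather than the final test set.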

Train

Finally, the training can begin. You just type:

bst = xgb.train(param, dtrain, num_round)

To see how the model looks, you can also dump it in human-readable form:

bst.dump_model('dump.raw.txt')

It looks something like this (f0 through f3 are the four features):

booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=0.426036
    2:leaf=-0.218845
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=-0.213018
    2:[f3<1.75] yes=3,no=4,missing=3
        3:[f2<4.95] yes=5,no=6,missing=5
            5:leaf=0.409091
            6:leaf=-9.75349e-009
        4:[f2<4.85] yes=7,no=8,missing=7
            7:leaf=-7.66345e-009
            8:leaf=-0.210219
....

You can see that each tree is no deeper than the 3 levels set in the params.
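The depth claim can also be checked mechanically. The sketch below parses the excerpt shown above and reports the deepest level of each tree, assuming one indentation step (a tab or four spaces) per level; `tree_depths` is a hypothetical helper, not part of xgboost.

```python
dump_excerpt = """booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=0.426036
    2:leaf=-0.218845
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=-0.213018
    2:[f3<1.75] yes=3,no=4,missing=3
        3:[f2<4.95] yes=5,no=6,missing=5
            5:leaf=0.409091
            6:leaf=-9.75349e-009
        4:[f2<4.85] yes=7,no=8,missing=7
            7:leaf=-7.66345e-009
            8:leaf=-0.210219
"""


def tree_depths(dump_text, indent_width=4):
    """Return the depth of each tree, where a lone root has depth 0."""
    depths = []
    for line in dump_text.splitlines():
        if line.startswith("booster["):
            depths.append(0)  # start a new tree
        elif line.strip() and depths:
            expanded = line.expandtabs(indent_width)
            level = (len(expanded) - len(expanded.lstrip())) // indent_width
            depths[-1] = max(depths[-1], level)
    return depths


print(tree_depths(dump_excerpt))  # [1, 3]
```

To check a whole model, the same function could be applied to the contents of `dump.raw.txt`.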

Use the model to predict classes for the test set:

preds = bst.predict(dtest)

But the predictions look something like this:

[[ 0.00563804 0.97755206 0.01680986]
 [ 0.98254657 0.01395847 0.00349498]
 [ 0.0036375 0.00615226 0.99021029]
 [ 0.00564738 0.97917044 0.0151822 ]
 [ 0.00540075 0.93640935 0.0581899 ]
....

Here each column represents class 0, 1, or 2. For each row, you need to select the column with the highest probability:

import numpy as np
best_preds = np.asarray([np.argmax(line) for line in preds])

Now you get a nice list of predicted classes:

[1, 0, 2, 1, 1, ...]
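The same selection can be written as a single vectorized call. The sketch below applies it to the five probability rows printed above (copied into a stand-in array, `preds_demo`) and also confirms that each row sums to roughly 1, as expected for softmax probabilities.

```python
import numpy as np

# The five rows of class probabilities shown above.
preds_demo = np.array([
    [0.00563804, 0.97755206, 0.01680986],
    [0.98254657, 0.01395847, 0.00349498],
    [0.0036375, 0.00615226, 0.99021029],
    [0.00564738, 0.97917044, 0.0151822],
    [0.00540075, 0.93640935, 0.0581899],
])

best_preds_demo = preds_demo.argmax(axis=1)  # index of the largest value in each row
print(best_preds_demo.tolist())              # [1, 0, 2, 1, 1]
print(preds_demo.sum(axis=1))                # each row sums to roughly 1
```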

Determine the precision of this prediction:

from sklearn.metrics import precision_score

print(precision_score(y_test, best_preds, average='macro'))
# >> 1.0
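Macro precision is the unweighted mean of the per-class precisions, which is not the same thing as accuracy. The sketch below computes it by hand on a small made-up example; `macro_precision` is a hypothetical helper, and counting a never-predicted class as 0 is an assumption of this sketch.

```python
import numpy as np


def macro_precision(y_true, y_pred):
    """Average the per-class precision over the classes present in either array."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    for cls in np.union1d(y_true, y_pred):
        predicted_as_cls = y_pred == cls
        if predicted_as_cls.sum() == 0:
            per_class.append(0.0)  # class never predicted: counted as 0 in this sketch
        else:
            per_class.append(np.mean(y_true[predicted_as_cls] == cls))
    return float(np.mean(per_class))


y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 2, 2, 2, 1]

print(macro_precision(y_true_demo, y_pred_demo))                    # about 0.889
print(np.mean(np.asarray(y_true_demo) == np.asarray(y_pred_demo)))  # accuracy: 0.8
```

On the made-up labels the two metrics differ (roughly 0.889 versus 0.8); they coincide at 1.0 only when every prediction is correct, as in the Iris run above.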

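Perfect! Now **save** the model for later use:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package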

joblib.dump(bst, 'bst_model.pkl', compress=True)
# bst = joblib.load('bst_model.pkl') # load it later

Now you have a working model saved for later use, and ready for more predictions.

The full code is available on GitHub.

Bio: Ieva Zarina is a Software Developer at Nordigen.

Original. Reposted with permission.

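Related:

7 More Steps to Mastering Machine Learning With Python
What I Learned Implementing a Classifier from Scratch in Python
XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink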
