
Commit

more changes
jameslamb committed Feb 18, 2021
1 parent 7f9b56c commit e76c0a3
Showing 21 changed files with 58 additions and 61 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,4 +1,4 @@
OPTION(USE_MPI "Enable MPI-based parallel learning" OFF)
OPTION(USE_MPI "Enable MPI-based distributed learning" OFF)
OPTION(USE_OPENMP "Enable OpenMP" ON)
OPTION(USE_GPU "Enable GPU-accelerated training" OFF)
OPTION(USE_SWIG "Enable SWIG to generate Java API" OFF)
2 changes: 1 addition & 1 deletion README.md
@@ -40,7 +40,7 @@ Next you may want to read:
- [**Examples**](https://github.com/microsoft/LightGBM/tree/master/examples) showing command line usage of common tasks.
- [**Features**](https://github.com/microsoft/LightGBM/blob/master/docs/Features.rst) and algorithms supported by LightGBM.
- [**Parameters**](https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst) is an exhaustive list of customization you can make.
- [**Parallel Learning**](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst) and [**GPU Learning**](https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst) can speed up computation.
- [**Distributed Learning**](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst) and [**GPU Learning**](https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst) can speed up computation.
- [**Laurae++ interactive documentation**](https://sites.google.com/view/lauraepp/parameters) is a detailed guide for hyperparameters.
- [**Optuna Hyperparameter Tuner**](https://medium.com/optuna/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization-8b7095e99258) provides automated tuning for LightGBM hyperparameters ([code examples](https://github.com/optuna/optuna/blob/master/examples/)).

2 changes: 1 addition & 1 deletion docs/Experiments.rst
@@ -239,7 +239,7 @@ Results
| 16 | 42 s | 11GB |
+----------+---------------+---------------------------+

The results show that LightGBM achieves a linear speedup with parallel learning.
The results show that LightGBM achieves a linear speedup with distributed learning.

GPU Experiments
---------------
10 changes: 5 additions & 5 deletions docs/Features.rst
@@ -28,7 +28,7 @@ LightGBM uses histogram-based algorithms\ `[4, 5, 6] <#references>`__, which buc

- No need to store additional information for pre-sorting feature values

- **Reduce communication cost for parallel learning**
- **Reduce communication cost for distributed learning**

Sparse Optimization
-------------------
@@ -68,14 +68,14 @@ More specifically, LightGBM sorts the histogram (for a categorical feature) acco
Optimization in Network Communication
-------------------------------------

It only needs to use some collective communication algorithms, like "All reduce", "All gather" and "Reduce scatter", in parallel learning of LightGBM.
It only needs to use some collective communication algorithms, like "All reduce", "All gather" and "Reduce scatter", in distributed learning of LightGBM.
LightGBM implements state-of-art algorithms\ `[9] <#references>`__.
These collective communication algorithms can provide much better performance than point-to-point communication.

Optimization in Parallel Learning
---------------------------------
Optimization in Distributed Learning
------------------------------------

LightGBM provides the following parallel learning algorithms.
LightGBM provides the following distributed learning algorithms.

Feature Parallel
~~~~~~~~~~~~~~~~
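
As a quick illustration of the learners named in the Features.rst hunk above, here is a minimal sketch (not part of this commit) of how the corresponding ``tree_learner`` values might be passed through the Python package; the dataset is synthetic, and the single-machine fallback behaviour noted in the comment is an assumption rather than something this diff documents.

```python
import numpy as np
import lightgbm as lgb

# Synthetic data, purely for illustration.
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)

# The distributed learners map to these tree_learner values
# ("feature", "data", "voting"); "serial" is the single-machine default.
params = {
    "objective": "binary",
    "tree_learner": "data",  # data-parallel learner
    "verbose": -1,
}

# Assumption: with num_machines left at 1, LightGBM is expected to fall back
# to serial training rather than attempt any network setup.
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
```
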
2 changes: 1 addition & 1 deletion docs/Installation-Guide.rst
@@ -348,7 +348,7 @@ Build MPI Version
The default build version of LightGBM is based on socket. LightGBM also supports MPI.
`MPI`_ is a high performance communication approach with `RDMA`_ support.

If you need to run a parallel learning application with high performance communication, you can build the LightGBM with MPI support.
If you need to run a distributed learning application with high performance communication, you can build the LightGBM with MPI support.

Windows
^^^^^^^
21 changes: 9 additions & 12 deletions docs/Parallel-Learning-Guide.rst
@@ -13,7 +13,7 @@ This section describes how distributed learning in LightGBM works. To learn how
Choose Appropriate Parallel Algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LightGBM provides 3 parallel learning algorithms.
LightGBM provides 3 distributed learning algorithms.

+--------------------+---------------------------+
| Parallel Algorithm | How to Use |
@@ -35,7 +35,7 @@ These algorithms are suited for different scenarios, which is listed in the foll
| **#feature is large** | Feature Parallel | Voting Parallel |
+-------------------------+-------------------+-----------------+

More details about these parallel algorithms can be found in `optimization in parallel learning <./Features.rst#optimization-in-parallel-learning>`__.
More details about these parallel algorithms can be found in `optimization in distributed learning <./Features.rst#optimization-in-distributed-learning>`__.

Integrations
------------
@@ -68,20 +68,17 @@ Kubeflow integrations for LightGBM are not maintained by LightGBM's maintainers.
LightGBM CLI
^^^^^^^^^^^^

Build Parallel Version
''''''''''''''''''''''
Preparation
'''''''''''

Default build version support parallel learning based on the socket.
By default, distributed learning with LightGBM uses socket-based communication.

If you need to build parallel version with MPI support, please refer to `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__.

Preparation
'''''''''''

Socket Version
**************

It needs to collect IP of all machines that want to run parallel learning in and allocate one TCP port (assume 12345 here) for all machines,
It needs to collect IP of all machines that want to run distributed learning in and allocate one TCP port (assume 12345 here) for all machines,
and change firewall rules to allow income of this port (12345). Then write these IP and ports in one file (assume ``mlist.txt``), like following:

.. code::
@@ -92,7 +89,7 @@ and change firewall rules to allow income of this port (12345). Then write these
MPI Version
***********

It needs to collect IP (or hostname) of all machines that want to run parallel learning in.
It needs to collect IP (or hostname) of all machines that want to run distributed learning in.
Then write these IP in one file (assume ``mlist.txt``) like following:

.. code::
@@ -102,8 +99,8 @@ Then write these IP in one file (assume ``mlist.txt``) like following:
**Note**: For Windows users, need to start "smpd" to start MPI service. More details can be found `here`_.

Run Parallel Learning
'''''''''''''''''''''
Run Distributed Learning
''''''''''''''''''''''''

Socket Version
**************
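
The socket preparation steps above end with a machine list file. As a hedged sketch, not taken from this commit, the ``ip port`` per-line format could be generated like this; the addresses are placeholders and 12345 is the example port used in the guide.

```python
# Write the machine list described in the Preparation section above:
# one "ip port" pair per line. The IPs below are placeholders.
machine_ips = ["192.168.0.1", "192.168.0.2"]
port = 12345

with open("mlist.txt", "w") as f:
    for ip in machine_ips:
        f.write(f"{ip} {port}\n")
```
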
10 changes: 5 additions & 5 deletions docs/Parameters.rst
@@ -181,7 +181,7 @@ Core Parameters

- ``voting``, voting parallel tree learner, aliases: ``voting_parallel``

- refer to `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ to get more details
- refer to `Distributed Learning Guide <./Parallel-Learning-Guide.rst>`__ to get more details

- ``num_threads`` :raw-html:`<a id="num_threads" title="Permalink to this parameter" href="#num_threads">&#x1F517;&#xFE0E;</a>`, default = ``0``, type = int, aliases: ``num_thread``, ``nthread``, ``nthreads``, ``n_jobs``

@@ -195,7 +195,7 @@ Core Parameters

- be aware a task manager or any similar CPU monitoring tool might report that cores not being fully utilized. **This is normal**

- for parallel learning, do not use all CPU cores because this will cause poor performance for the network communication
- for distributed learning, do not use all CPU cores because this will cause poor performance for the network communication

- **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors

@@ -714,7 +714,7 @@ Dataset Parameters

- ``pre_partition`` :raw-html:`<a id="pre_partition" title="Permalink to this parameter" href="#pre_partition">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool, aliases: ``is_pre_partition``

- used for parallel learning (excluding the ``feature_parallel`` mode)
- used for distributed learning (excluding the ``feature_parallel`` mode)

- ``true`` if training data are pre-partitioned, and different machines use different partitions

@@ -1133,7 +1133,7 @@ Network Parameters

- ``num_machines`` :raw-html:`<a id="num_machines" title="Permalink to this parameter" href="#num_machines">&#x1F517;&#xFE0E;</a>`, default = ``1``, type = int, aliases: ``num_machine``, constraints: ``num_machines > 0``

- the number of machines for parallel learning application
- the number of machines for distributed learning application

- this parameter is needed to be set in both **socket** and **mpi** versions

@@ -1149,7 +1149,7 @@ Network Parameters

- ``machine_list_filename`` :raw-html:`<a id="machine_list_filename" title="Permalink to this parameter" href="#machine_list_filename">&#x1F517;&#xFE0E;</a>`, default = ``""``, type = string, aliases: ``machine_list_file``, ``machine_list``, ``mlist``

- path of file that lists machines for this parallel learning application
- path of file that lists machines for this distributed learning application

- each line contains one IP and one port for one machine. The format is ``ip port`` (space as a separator)

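
For readers following the parameter descriptions above, here is a minimal, non-authoritative sketch of how the network parameters could be combined in a Python training call. The paths and addresses are placeholders, and supplying the peer list through a ``machines`` string is an assumption not shown in this diff, since ``machine_list_filename`` is documented above as CLI-only.

```python
import lightgbm as lgb

# A sketch of socket-based distributed training settings from the hunks above.
# Assumption: the "machines" parameter (not shown in this diff) accepts a
# comma-separated "ip:port" list.
params = {
    "objective": "regression",
    "tree_learner": "data",
    "num_threads": 8,               # leave some cores free for network communication
    "num_machines": 2,
    "local_listen_port": 12400,
    "machines": "192.168.0.1:12400,192.168.0.2:12400",  # placeholder addresses
    "pre_partition": True,          # each machine loads its own data partition
}

# Placeholder path: on each machine this would point at that machine's partition.
train_set = lgb.Dataset("train_partition_on_this_machine.txt")

# Note: this call is expected to block until all listed machines connect.
booster = lgb.train(params, train_set, num_boost_round=100)
```
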
6 changes: 3 additions & 3 deletions examples/binary_classification/train.conf
@@ -98,13 +98,13 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt

# force splits
6 changes: 3 additions & 3 deletions examples/binary_classification/train_linear.conf
@@ -100,13 +100,13 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt

# force splits
6 changes: 3 additions & 3 deletions examples/lambdarank/train.conf
@@ -103,11 +103,11 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt
10 changes: 5 additions & 5 deletions examples/parallel_learning/README.md
@@ -1,7 +1,7 @@
Parallel Learning Example
=========================
Distributed Learning Example
============================

Here is an example for LightGBM to perform parallel learning for 2 machines.
Here is an example for LightGBM to perform distributed learning for 2 machines.

1. Edit [mlist.txt](./mlist.txt): write the ip of these 2 machines that you want to run application on.

@@ -16,6 +16,6 @@ Here is an example for LightGBM to perform parallel learning for 2 machines.

```"./lightgbm" config=train.conf```

This parallel learning example is based on socket. LightGBM also supports parallel learning based on mpi.
This distributed learning example is based on socket. LightGBM also supports distributed learning based on mpi.

For more details about the usage of parallel learning, please refer to [this](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst).
For more details about the usage of distributed learning, please refer to [this](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst).
6 changes: 3 additions & 3 deletions examples/parallel_learning/train.conf
@@ -98,11 +98,11 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 2

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt
6 changes: 3 additions & 3 deletions examples/regression/train.conf
@@ -101,11 +101,11 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt
6 changes: 3 additions & 3 deletions examples/xendcg/train.conf
@@ -104,11 +104,11 @@ output_model = LightGBM_model.txt
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
# number of machines in distributed training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
# local listening port in distributed training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
# machines list file for distributed training, alias: mlist
machine_list_file = mlist.txt
10 changes: 5 additions & 5 deletions include/LightGBM/config.h
@@ -200,7 +200,7 @@ struct Config {
// desc = ``feature``, feature parallel tree learner, aliases: ``feature_parallel``
// desc = ``data``, data parallel tree learner, aliases: ``data_parallel``
// desc = ``voting``, voting parallel tree learner, aliases: ``voting_parallel``
// desc = refer to `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ to get more details
// desc = refer to `Distributed Learning Guide <./Parallel-Learning-Guide.rst>`__ to get more details
std::string tree_learner = "serial";

// alias = num_thread, nthread, nthreads, n_jobs
Expand All @@ -209,7 +209,7 @@ struct Config {
// desc = for the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use `hyper-threading <https://en.wikipedia.org/wiki/Hyper-threading>`__ to generate 2 threads per CPU core)
// desc = do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
// desc = be aware a task manager or any similar CPU monitoring tool might report that cores not being fully utilized. **This is normal**
// desc = for parallel learning, do not use all CPU cores because this will cause poor performance for the network communication
// desc = for distributed learning, do not use all CPU cores because this will cause poor performance for the network communication
// desc = **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors
int num_threads = 0;

@@ -634,7 +634,7 @@ struct Config {
bool feature_pre_filter = true;

// alias = is_pre_partition
// desc = used for parallel learning (excluding the ``feature_parallel`` mode)
// desc = used for distributed learning (excluding the ``feature_parallel`` mode)
// desc = ``true`` if training data are pre-partitioned, and different machines use different partitions
bool pre_partition = false;

@@ -961,7 +961,7 @@ struct Config {

// check = >0
// alias = num_machine
// desc = the number of machines for parallel learning application
// desc = the number of machines for distributed learning application
// desc = this parameter is needed to be set in both **socket** and **mpi** versions
int num_machines = 1;

@@ -976,7 +976,7 @@ struct Config {
int time_out = 120;

// alias = machine_list_file, machine_list, mlist
// desc = path of file that lists machines for this parallel learning application
// desc = path of file that lists machines for this distributed learning application
// desc = each line contains one IP and one port for one machine. The format is ``ip port`` (space as a separator)
// desc = **Note**: can be used only in CLI version
std::string machine_list_filename = "";
2 changes: 1 addition & 1 deletion include/LightGBM/dataset.h
@@ -80,7 +80,7 @@ class Metadata {

/*!
* \brief Partition meta data according to local used indices if need
* \param num_all_data Number of total training data, including other machines' data on parallel learning
* \param num_all_data Number of total training data, including other machines' data on distributed learning
* \param used_data_indices Indices of local used training data
*/
void CheckOrPartition(data_size_t num_all_data,
2 changes: 1 addition & 1 deletion python-package/lightgbm/basic.py
@@ -2333,7 +2333,7 @@ def set_network(self, machines, local_listen_port=12400,
listen_time_out : int, optional (default=120)
Socket time-out in minutes.
num_machines : int, optional (default=1)
The number of machines for parallel learning application.
The number of machines for distributed learning application.
Returns
-------
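
A brief sketch of calling the method whose docstring is updated above. The keyword names match the signature shown in the hunk, while the model path, the addresses, and the comma-separated ``ip:port`` format for ``machines`` are assumptions for illustration only.

```python
import lightgbm as lgb

# Load an existing model (placeholder path) and attach network settings.
booster = lgb.Booster(model_file="LightGBM_model.txt")

# Keyword names follow the signature shown in the diff; the format of the
# machines string is an assumption.
booster.set_network(
    machines="192.168.0.1:12400,192.168.0.2:12400",
    local_listen_port=12400,
    listen_time_out=120,
    num_machines=2,
)

# free_network() releases the connections once distributed work is done.
booster.free_network()
```
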
2 changes: 1 addition & 1 deletion src/application/application.cpp
@@ -105,7 +105,7 @@ void Application::LoadData() {
config_.num_class, config_.data.c_str());
// load Training data
if (config_.is_data_based_parallel) {
// load data for parallel training
// load data for distributed training
train_data_.reset(dataset_loader.LoadFromFile(config_.data.c_str(),
Network::rank(), Network::num_machines()));
} else {
2 changes: 1 addition & 1 deletion src/io/config.cpp
@@ -374,7 +374,7 @@ void Config::CheckParamConflict() {
}
if (is_parallel && (monotone_constraints_method == std::string("intermediate") || monotone_constraints_method == std::string("advanced"))) {
// In distributed mode, local node doesn't have histograms on all features, cannot perform "intermediate" monotone constraints.
Log::Warning("Cannot use \"intermediate\" or \"advanced\" monotone constraints in parallel learning, auto set to \"basic\" method.");
Log::Warning("Cannot use \"intermediate\" or \"advanced\" monotone constraints in distributed learning, auto set to \"basic\" method.");
monotone_constraints_method = "basic";
}
if (feature_fraction_bynode != 1.0 && (monotone_constraints_method == std::string("intermediate") || monotone_constraints_method == std::string("advanced"))) {
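
To make the warning above concrete, here is a hedged Python sketch of the parameter combination it guards against. The data is synthetic, and whether the warning actually fires on a single machine (where the distributed learner itself may be reset to serial) is not something this diff establishes.

```python
import numpy as np
import lightgbm as lgb

# Synthetic data: feature 0 should push predictions up, feature 1 down.
X = np.random.rand(400, 3)
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * np.random.rand(400)

params = {
    "objective": "regression",
    "tree_learner": "data",                         # a distributed learner
    "monotone_constraints": [1, -1, 0],             # per-feature constraints
    "monotone_constraints_method": "intermediate",  # reset to "basic" when distributed
    "verbose": -1,
}

booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```
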
2 changes: 1 addition & 1 deletion src/io/dataset_loader.cpp
@@ -183,7 +183,7 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
// don't support query id in data file when training in parallel
if (num_machines > 1 && !config_.pre_partition) {
if (group_idx_ > 0) {
Log::Fatal("Using a query id without pre-partitioning the data file is not supported for parallel training.\n"
Log::Fatal("Using a query id without pre-partitioning the data file is not supported for distributed training.\n"
"Please use an additional query file or pre-partition the data");
}
}
4 changes: 2 additions & 2 deletions src/io/metadata.cpp
@@ -22,7 +22,7 @@ Metadata::Metadata() {

void Metadata::Init(const char* data_filename) {
data_filename_ = data_filename;
// for lambdarank, it needs query data for partition data in parallel learning
// for lambdarank, it needs query data for partition data in distributed learning
LoadQueryBoundaries();
LoadWeights();
LoadQueryWeights();
@@ -187,7 +187,7 @@ void Metadata::CheckOrPartition(data_size_t num_all_data, const std::vector<data
}
} else {
if (!queries_.empty()) {
Log::Fatal("Cannot used query_id for parallel training");
Log::Fatal("Cannot used query_id for distributed training");
}
data_size_t num_used_data = static_cast<data_size_t>(used_data_indices.size());
// check weights
