Predefined bin thresholds #2325

Merged: 57 commits, Sep 28, 2019
Changes from 47 commits
Commits
fe5c8e2
Fix bug where small values of max_bin cause crash.
btrotta Jul 31, 2019
439bcfd
Revert "Fix bug where small values of max_bin cause crash."
btrotta Jul 31, 2019
34e72c8
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
5b21573
Fix style issues.
btrotta Aug 14, 2019
2be599a
Use stable sort.
btrotta Aug 14, 2019
6a098f0
Minor style and doc fixes.
btrotta Aug 15, 2019
0cd4abc
Merge remote-tracking branch 'upstream/master'
btrotta Aug 16, 2019
8f73636
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
6c2d048
Fix style issues.
btrotta Aug 14, 2019
feb861f
Use stable sort.
btrotta Aug 14, 2019
873fa64
Minor style and doc fixes.
btrotta Aug 15, 2019
050f57b
Merge branch 'force-bin' of https://github.com/btrotta/lightgbm into …
btrotta Aug 16, 2019
4cd89e4
Change binning behavior to be same as PR #2342.
btrotta Aug 20, 2019
698d9db
Merge remote-tracking branch 'upstream/master'
btrotta Aug 20, 2019
9d22071
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
3178609
Fix style issues.
btrotta Aug 14, 2019
934b305
Use stable sort.
btrotta Aug 14, 2019
dc45bd1
Minor style and doc fixes.
btrotta Aug 15, 2019
018182c
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
7a4df51
Fix style issues.
btrotta Aug 14, 2019
6095148
Use stable sort.
btrotta Aug 14, 2019
8b57a56
Minor style and doc fixes.
btrotta Aug 15, 2019
de83a69
Change binning behavior to be same as PR #2342.
btrotta Aug 20, 2019
01f18fd
Merge branch 'force-bin' of https://github.com/btrotta/lightgbm into …
btrotta Aug 20, 2019
360eacf
Merge remote-tracking branch 'upstream/master'
btrotta Sep 10, 2019
c478775
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
e3f1835
Fix style issues.
btrotta Aug 14, 2019
2280c56
Minor style and doc fixes.
btrotta Aug 15, 2019
76fa4cc
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
93d92eb
Fix style issues.
btrotta Aug 14, 2019
fec30a5
Minor style and doc fixes.
btrotta Aug 15, 2019
503e7b4
Change binning behavior to be same as PR #2342.
btrotta Aug 20, 2019
eecb80c
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
a02b3a3
Fix style issues.
btrotta Aug 14, 2019
cb12379
Use stable sort.
btrotta Aug 14, 2019
abe95d7
Minor style and doc fixes.
btrotta Aug 15, 2019
7aed689
Add functionality to force bin thresholds.
btrotta Aug 13, 2019
35ce38b
Fix style issues.
btrotta Aug 14, 2019
28c0462
Use stable sort.
btrotta Aug 14, 2019
23dbb29
Minor style and doc fixes.
btrotta Aug 15, 2019
9ed04a3
Change binning behavior to be same as PR #2342.
btrotta Aug 20, 2019
7cdc732
Fix merge conflict.
btrotta Sep 10, 2019
51e93a9
Use different bin finding function for predefined bounds.
btrotta Sep 11, 2019
4e3355a
Fix style issues.
btrotta Sep 12, 2019
821b2ab
Minor refactoring, overload FindBinWithZeroAsOneBin.
btrotta Sep 12, 2019
8a52444
Fix style issues.
btrotta Sep 13, 2019
c591e7b
Fix bug and add new test.
btrotta Sep 17, 2019
9c767ae
Add warning when using categorical features with forced bins.
btrotta Sep 21, 2019
cf0afd4
Pass forced_upper_bounds by reference.
btrotta Sep 21, 2019
25387ec
Pass container types by const reference.
btrotta Sep 21, 2019
cc249f0
Get categorical features using FeatureBinMapper.
btrotta Sep 23, 2019
0e26e9f
Fix bug for small max_bin.
btrotta Sep 23, 2019
feeb163
Merge remote-tracking branch 'upstream/master'
btrotta Sep 26, 2019
50ff73b
Fix merge conflicts.
btrotta Sep 26, 2019
b5752ec
Move GetForcedBins to DatasetLoader.
btrotta Sep 27, 2019
58d86aa
Find forced bins in dataset_loader.
btrotta Sep 28, 2019
3e81b94
Minor fixes.
btrotta Sep 28, 2019
8 changes: 8 additions & 0 deletions docs/Parameters.rst
@@ -412,6 +412,14 @@ Learning Control Parameters

- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification/forced_splits.json>`__ as an example

- ``forcedbins_filename`` :raw-html:`<a id="forcedbins_filename" title="Permalink to this parameter" href="#forcedbins_filename">&#x1F517;&#xFE0E;</a>`, default = ``""``, type = string

- path to a ``.json`` file that specifies bin upper bounds for some or all features

- ``.json`` file should contain an array of objects, each containing the word ``feature`` (integer feature index) and ``bin_upper_bound`` (array of thresholds for binning)

- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json>`__ as an example

- ``refit_decay_rate`` :raw-html:`<a id="refit_decay_rate" title="Permalink to this parameter" href="#refit_decay_rate">&#x1F517;&#xFE0E;</a>`, default = ``0.9``, type = double, constraints: ``0.0 <= refit_decay_rate <= 1.0``

- decay rate of ``refit`` task, will use ``leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output`` to refit trees
10 changes: 10 additions & 0 deletions examples/regression/forced_bins.json
@@ -0,0 +1,10 @@
[
{
"feature": 0,
"bin_upper_bound": [ 0.3, 0.35, 0.4 ]
},
{
"feature": 1,
"bin_upper_bound": [ -0.1, -0.15, -0.2 ]
}
]
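Note: a minimal sketch, not code from this PR, of how entries like the ones in this file could be arranged into the per-feature table that the `GetForcedBins` declaration in dataset.h suggests (`std::vector<std::vector<double>>` sized by `num_total_features`). The `ForcedBinEntry` struct and `BuildForcedBinTable` function are hypothetical names, JSON parsing is left out, and both the ascending sort and the skipping of out-of-range feature indices are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical in-memory form of one object from the .json array.
struct ForcedBinEntry {
  int feature;                          // integer feature index
  std::vector<double> bin_upper_bound;  // thresholds for that feature
};

// Hypothetical helper: returns one (possibly empty) list of forced bounds
// per feature. Entries whose index is out of range are skipped (assumption).
std::vector<std::vector<double>> BuildForcedBinTable(
    const std::vector<ForcedBinEntry>& entries, int num_total_features) {
  std::vector<std::vector<double>> table(num_total_features);
  for (const auto& e : entries) {
    if (e.feature < 0 || e.feature >= num_total_features) continue;
    auto& bounds = table[e.feature];
    bounds.insert(bounds.end(), e.bin_upper_bound.begin(), e.bin_upper_bound.end());
    // Sorting here is an assumption; the file lists feature 1 in descending order.
    std::sort(bounds.begin(), bounds.end());
  }
  return table;
}
```

Features that do not appear in the file keep an empty list, which matches the "some or all features" wording in the parameter description.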
6 changes: 6 additions & 0 deletions examples/regression/forced_bins2.json
@@ -0,0 +1,6 @@
[
{
"feature": 0,
"bin_upper_bound": [ 0.19, 0.39, 0.59, 0.79 ]
}
]
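Note: a small sketch, not code from this PR, of what a list of bin upper bounds means for a single value. It assumes a value falls into the first bin whose upper bound is not smaller than the value, and that the bound list ends with infinity (as `FindBinWithPredefinedBin` in this PR appends one). `BinIndexFor` and `ExampleBounds` are hypothetical names, and the example ignores the zero bounds and the extra thresholds the algorithm may add when `max_bin` allows.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical helper: index of the first upper bound that is >= value.
// The bounds must be sorted in ascending order.
int BinIndexFor(const std::vector<double>& sorted_upper_bounds, double value) {
  auto it = std::lower_bound(sorted_upper_bounds.begin(), sorted_upper_bounds.end(), value);
  return static_cast<int>(it - sorted_upper_bounds.begin());
}

// The four forced thresholds from forced_bins2.json plus a closing infinity bound.
std::vector<double> ExampleBounds() {
  return {0.19, 0.39, 0.59, 0.79, std::numeric_limits<double>::infinity()};
}
```

Under these assumptions the four thresholds split feature 0 into five bins, with every value above 0.79 landing in the last one.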
3 changes: 3 additions & 0 deletions examples/regression/train.conf
@@ -29,6 +29,9 @@ is_training_metric = true
# number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy.
max_bin = 255

# forced bin thresholds
# forcedbins_filename = forced_bins.json

# training data
# if exsting weight file, should name to "regression.train.weight"
# alias: train_data, train
3 changes: 2 additions & 1 deletion include/LightGBM/bin.h
@@ -146,9 +146,10 @@ class BinMapper {
* \param bin_type Type of this bin
* \param use_missing True to enable missing value handle
* \param zero_as_missing True to use zero as missing value
* \param forced_upper_bounds Vector of split points that must be used (if this has size less than max_bin, remaining splits are found by the algorithm)
*/
void FindBin(double* values, int num_values, size_t total_sample_cnt, int max_bin, int min_data_in_bin, int min_split_data, BinType bin_type,
- bool use_missing, bool zero_as_missing);
+ bool use_missing, bool zero_as_missing, std::vector<double> forced_upper_bounds);
Collaborator

Please use const T& for container objects everywhere.


/*!
* \brief Use specific number of bin to calculate the size of this class
5 changes: 5 additions & 0 deletions include/LightGBM/config.h
@@ -408,6 +408,11 @@ struct Config {
// desc = see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification/forced_splits.json>`__ as an example
std::string forcedsplits_filename = "";

// desc = path to a ``.json`` file that specifies bin upper bounds for some or all features
// desc = ``.json`` file should contain an array of objects, each containing the word ``feature`` (integer feature index) and ``bin_upper_bound`` (array of thresholds for binning)
// desc = see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json>`__ as an example
std::string forcedbins_filename = "";

// check = >=0.0
// check = <=1.0
// desc = decay rate of ``refit`` task, will use ``leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output`` to refit trees
3 changes: 3 additions & 0 deletions include/LightGBM/dataset.h
@@ -596,6 +596,8 @@ class Dataset {

void addFeaturesFrom(Dataset* other);

static std::vector<std::vector<double>> GetForcedBins(std::string forced_bins_path, int num_total_features);

private:
std::string data_filename_;
/*! \brief Store used features */
@@ -630,6 +632,7 @@ class Dataset {
bool is_finish_load_;
int max_bin_;
std::vector<int32_t> max_bin_by_feature_;
std::vector<std::vector<double>> forced_bin_bounds_;
int bin_construct_sample_cnt_;
int min_data_in_bin_;
bool use_missing_;
127 changes: 120 additions & 7 deletions src/io/bin.cpp
@@ -71,7 +71,7 @@ namespace LightGBM {
return true;
}

- std::vector<double> GreedyFindBin(const double* distinct_values, const int* counts,
+ std::vector<double> GreedyFindBin(const double* distinct_values, const int* counts,
int num_distinct_values, int max_bin, size_t total_cnt, int min_data_in_bin) {
std::vector<double> bin_upper_bound;
CHECK(max_bin > 0);
@@ -149,8 +149,107 @@ namespace LightGBM {
return bin_upper_bound;
}

- std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts,
- int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin) {
+ std::vector<double> FindBinWithPredefinedBin(const double* distinct_values, const int* counts,
+ int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double> forced_upper_bounds) {
Collaborator

Could we add an independent function, FindBinWithPredefinedBin, to hold these changes? I am afraid these changes may introduce some bugs.

Collaborator Author

@guolinke Yes, I can make that change.

Collaborator
guolinke, Sep 12, 2019

Is it possible to add the predefined bins based on the results of FindBinWithZeroAsOneBin?
If so, the duplicated code could be removed.
You could return the data count of each bin (or any other data you need) from FindBinWithZeroAsOneBin.

Collaborator

ping @btrotta

Collaborator Author

@guolinke I think that would result in a sub-optimal choice of bins. Currently my code takes the forced bins into account when choosing the remaining ones, so that we can find evenly sized bins (i.e. it will not choose thresholds too close to the forced thresholds). If we instead used FindBinWithZeroAsOneBin to create the non-forced thresholds, it might choose thresholds very close to the forced ones, in which case the bins would not be evenly sized.
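Note: the point about evenly sized bins can be illustrated with a simplified sketch of the allocation step in the "find remaining bounds" loop of `FindBinWithPredefinedBin`. It is not the PR's code: `AllocateSubBins` is a hypothetical name, and the sketch assumes every interval uses all the extra thresholds it is offered, whereas in the PR `GreedyFindBin` may return fewer. Each interval between forced bounds is offered free thresholds in proportion to the share of samples it holds, and the last interval takes whatever is left.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: given the sample count in each interval delimited by
// the forced bounds and the number of free thresholds, return how many
// sub-bins each interval is split into (extra thresholds + 1).
std::vector<int> AllocateSubBins(const std::vector<int>& cnt_per_interval, int free_bins) {
  const long long total =
      std::accumulate(cnt_per_interval.begin(), cnt_per_interval.end(), 0LL);
  std::vector<int> sub_bins;
  int used = 0;  // extra thresholds handed out so far
  for (std::size_t i = 0; i < cnt_per_interval.size(); ++i) {
    const int remaining = free_bins - used;
    int extra = 0;
    if (total > 0) {
      // Proportional share of the free thresholds, rounded to nearest.
      extra = static_cast<int>(
          std::lround(static_cast<double>(cnt_per_interval[i]) * free_bins / total));
    }
    extra = std::min(extra, remaining);
    if (i + 1 == cnt_per_interval.size()) {
      extra = remaining;  // the last interval absorbs what is left
    }
    sub_bins.push_back(extra + 1);
    used += extra;
  }
  return sub_bins;
}
```

With sample counts of 20, 60 and 20 and five free thresholds, the dense middle interval is split into four sub-bins and the outer ones into two each, so a forced threshold does not end up next to a redundant automatic one.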

std::vector<double> bin_upper_bound;

// get list of distinct values
int left_cnt_data = 0;
int cnt_zero = 0;
int right_cnt_data = 0;
for (int i = 0; i < num_distinct_values; ++i) {
if (distinct_values[i] <= -kZeroThreshold) {
left_cnt_data += counts[i];
} else if (distinct_values[i] > kZeroThreshold) {
right_cnt_data += counts[i];
} else {
cnt_zero += counts[i];
}
}

// get number of positive and negative distinct values
int left_cnt = -1;
for (int i = 0; i < num_distinct_values; ++i) {
if (distinct_values[i] > -kZeroThreshold) {
left_cnt = i;
break;
}
}
if (left_cnt < 0) {
left_cnt = num_distinct_values;
}
int right_start = -1;
for (int i = left_cnt; i < num_distinct_values; ++i) {
if (distinct_values[i] > kZeroThreshold) {
right_start = i;
break;
}
}

// include zero bounds and infinity bound
if (max_bin == 2) {
if (left_cnt == 0) {
bin_upper_bound.push_back(kZeroThreshold);
} else {
bin_upper_bound.push_back(-kZeroThreshold);
}
} else if (max_bin >= 3) {
if (left_cnt > 0) {
bin_upper_bound.push_back(-kZeroThreshold);
}
if (right_start >= 0) {
bin_upper_bound.push_back(kZeroThreshold);
}
}
bin_upper_bound.push_back(std::numeric_limits<double>::infinity());

// add forced bounds, excluding zeros since we have already added zero bounds
size_t i = 0;
while (i < forced_upper_bounds.size()) {
if (std::fabs(forced_upper_bounds[i]) <= kZeroThreshold) {
forced_upper_bounds.erase(forced_upper_bounds.begin() + i);
} else {
++i;
}
}
int max_to_insert = max_bin - static_cast<int>(bin_upper_bound.size());
int num_to_insert = std::min(max_to_insert, static_cast<int>(forced_upper_bounds.size()));
if (num_to_insert > 0) {
bin_upper_bound.insert(bin_upper_bound.end(), forced_upper_bounds.begin(), forced_upper_bounds.begin() + num_to_insert);
}
std::stable_sort(bin_upper_bound.begin(), bin_upper_bound.end());

// find remaining bounds
int free_bins = max_bin - static_cast<int>(bin_upper_bound.size());
std::vector<double> bounds_to_add;
int value_ind = 0;
for (size_t i = 0; i < bin_upper_bound.size(); ++i) {
int cnt_in_bin = 0;
int distinct_cnt_in_bin = 0;
int bin_start = value_ind;
while ((value_ind < num_distinct_values) && (distinct_values[value_ind] < bin_upper_bound[i])) {
cnt_in_bin += counts[value_ind];
++distinct_cnt_in_bin;
++value_ind;
}
int bins_remaining = max_bin - static_cast<int>(bin_upper_bound.size()) - static_cast<int>(bounds_to_add.size());
int num_sub_bins = static_cast<int>(std::lround((static_cast<double>(cnt_in_bin) * free_bins / total_sample_cnt)));
num_sub_bins = std::min(num_sub_bins, bins_remaining) + 1;
if (i == bin_upper_bound.size() - 1) {
num_sub_bins = bins_remaining + 1;
}
std::vector<double> new_upper_bounds = GreedyFindBin(distinct_values + bin_start, counts + bin_start, distinct_cnt_in_bin,
num_sub_bins, cnt_in_bin, min_data_in_bin);
bounds_to_add.insert(bounds_to_add.end(), new_upper_bounds.begin(), new_upper_bounds.end() - 1); // last bound is infinity
}
bin_upper_bound.insert(bin_upper_bound.end(), bounds_to_add.begin(), bounds_to_add.end());
std::stable_sort(bin_upper_bound.begin(), bin_upper_bound.end());
CHECK(bin_upper_bound.size() <= static_cast<size_t>(max_bin));
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin) {
std::vector<double> bin_upper_bound;
int left_cnt_data = 0;
int cnt_zero = 0;
@@ -207,8 +306,19 @@ namespace LightGBM {
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double> forced_upper_bounds) {
Collaborator

Use const std::vector<double>& forced_upper_bounds everywhere.
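Note: one consequence of this request is that the in-place `erase` of near-zero bounds shown in `FindBinWithPredefinedBin` cannot operate on a const reference. A possible way around it, given only as a sketch and not as the change actually made in the PR, is to filter into a local copy. `DropNearZeroBounds` is a hypothetical name, and `kZeroThresholdSketch` is an assumed stand-in for LightGBM's `kZeroThreshold`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <iterator>
#include <vector>

// Assumed stand-in for LightGBM's kZeroThreshold; the real constant is
// defined in the library headers.
constexpr double kZeroThresholdSketch = 1e-35;

// Hypothetical helper: takes the forced bounds by const reference and
// returns a filtered copy without the entries that count as zero.
std::vector<double> DropNearZeroBounds(const std::vector<double>& forced_upper_bounds) {
  std::vector<double> kept;
  kept.reserve(forced_upper_bounds.size());
  std::copy_if(forced_upper_bounds.begin(), forced_upper_bounds.end(),
               std::back_inserter(kept),
               [](double b) { return std::fabs(b) > kZeroThresholdSketch; });
  return kept;
}
```

The caller's vector is left untouched, and the single pass avoids the repeated element shifting of erasing inside a loop.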

if (forced_upper_bounds.empty()) {
return FindBinWithZeroAsOneBin(distinct_values, counts, num_distinct_values, max_bin, total_sample_cnt, min_data_in_bin);
} else {
return FindBinWithPredefinedBin(distinct_values, counts, num_distinct_values, max_bin, total_sample_cnt, min_data_in_bin,
forced_upper_bounds);
}
}

void BinMapper::FindBin(double* values, int num_sample_values, size_t total_sample_cnt,
- int max_bin, int min_data_in_bin, int min_split_data, BinType bin_type, bool use_missing, bool zero_as_missing) {
+ int max_bin, int min_data_in_bin, int min_split_data, BinType bin_type, bool use_missing, bool zero_as_missing,
+ std::vector<double> forced_upper_bounds) {
int na_cnt = 0;
int tmp_num_sample_values = 0;
for (int i = 0; i < num_sample_values; ++i) {
@@ -276,14 +386,17 @@ namespace LightGBM {
int num_distinct_values = static_cast<int>(distinct_values.size());
if (bin_type_ == BinType::NumericalBin) {
if (missing_type_ == MissingType::Zero) {
- bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin, total_sample_cnt, min_data_in_bin);
+ bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin, total_sample_cnt,
+ min_data_in_bin, forced_upper_bounds);
if (bin_upper_bound_.size() == 2) {
missing_type_ = MissingType::None;
}
} else if (missing_type_ == MissingType::None) {
- bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin, total_sample_cnt, min_data_in_bin);
+ bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin, total_sample_cnt,
+ min_data_in_bin, forced_upper_bounds);
} else {
- bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin - 1, total_sample_cnt - na_cnt, min_data_in_bin);
+ bin_upper_bound_ = FindBinWithZeroAsOneBin(distinct_values.data(), counts.data(), num_distinct_values, max_bin - 1, total_sample_cnt - na_cnt,
+ min_data_in_bin, forced_upper_bounds);
bin_upper_bound_.push_back(NaN);
}
num_bin_ = static_cast<int>(bin_upper_bound_.size());
4 changes: 4 additions & 0 deletions src/io/config_auto.cpp
@@ -214,6 +214,7 @@ std::unordered_set<std::string> Config::parameter_set({
"monotone_constraints",
"feature_contri",
"forcedsplits_filename",
"forcedbins_filename",
"refit_decay_rate",
"cegb_tradeoff",
"cegb_penalty_split",
@@ -402,6 +403,8 @@ void Config::GetMembersFromString(const std::unordered_map<std::string, std::str

GetString(params, "forcedsplits_filename", &forcedsplits_filename);

GetString(params, "forcedbins_filename", &forcedbins_filename);

GetDouble(params, "refit_decay_rate", &refit_decay_rate);
CHECK(refit_decay_rate >=0.0);
CHECK(refit_decay_rate <=1.0);
@@ -617,6 +620,7 @@ std::string Config::SaveMembersToString() const {
str_buf << "[monotone_constraints: " << Common::Join(Common::ArrayCast<int8_t, int>(monotone_constraints), ",") << "]\n";
str_buf << "[feature_contri: " << Common::Join(feature_contri, ",") << "]\n";
str_buf << "[forcedsplits_filename: " << forcedsplits_filename << "]\n";
str_buf << "[forcedbins_filename: " << forcedbins_filename << "]\n";
str_buf << "[refit_decay_rate: " << refit_decay_rate << "]\n";
str_buf << "[cegb_tradeoff: " << cegb_tradeoff << "]\n";
str_buf << "[cegb_penalty_split: " << cegb_penalty_split << "]\n";