
feat: [LightGBM] support dataset rows more then max(int32_t) #1684

Open
junpeng0715 wants to merge 2 commits into master

Conversation

junpeng0715

Related Issues/PRs

microsoft/LightGBM#5540

What changes are proposed in this pull request?

Allow datasets with more than max(int32_t) rows to be used with LightGBM via SynapseML.

How is this patch tested?

  • I tested in our Databricks environments using datasets with more than max(int32) rows.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following below steps.
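For context on the limit this PR lifts: narrowing a 64-bit row count to a 32-bit int past max(int32_t) wraps around silently in JVM languages. A minimal standalone Java illustration (not SynapseML code):

```java
public class Int32RowLimit {
    public static void main(String[] args) {
        long maxInt32 = Integer.MAX_VALUE;  // 2_147_483_647 rows
        long numRows = maxInt32 + 1;        // one row past the limit
        int narrowed = (int) numRows;       // unchecked cast wraps around
        System.out.println(narrowed);       // prints -2147483648
    }
}
```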

@junpeng0715 junpeng0715 requested a review from svotaw as a code owner October 20, 2022 05:09
@github-actions

Hey @junpeng0715 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@junpeng0715 junpeng0715 changed the title LGBM: support dataset rows more then max(int32_t) feat: [LightGBM] support dataset rows more then max(int32_t) Oct 20, 2022
@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Oct 20, 2022

Codecov Report

Merging #1684 (45e427b) into master (448f6b7) will decrease coverage by 59.98%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #1684       +/-   ##
===========================================
- Coverage   80.63%   20.65%   -59.99%     
===========================================
  Files         272      244       -28     
  Lines       14224    11971     -2253     
  Branches      732      570      -162     
===========================================
- Hits        11470     2473     -8997     
- Misses       2754     9498     +6744     
Impacted Files Coverage Δ
...rg/apache/spark/injections/BlockManagerUtils.scala 0.00% <0.00%> (-100.00%) ⬇️
.../microsoft/azure/synapse/ml/io/binary/Binary.scala 0.00% <0.00%> (-100.00%) ⬇️
...icrosoft/azure/synapse/ml/explainers/Sampler.scala 0.00% <0.00%> (-100.00%) ⬇️
...soft/azure/synapse/ml/core/contracts/Metrics.scala 0.00% <0.00%> (-100.00%) ⬇️
...soft/azure/synapse/ml/explainers/LIMESampler.scala 0.00% <0.00%> (-100.00%) ⬇️
...ft/azure/synapse/ml/explainers/TextExplainer.scala 0.00% <0.00%> (-100.00%) ⬇️
...spark/sql/types/injections/MetadataUtilities.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/automl/DefaultHyperparams.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/core/schema/SparkBindings.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/explainers/ImageExplainer.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 212 more


@svotaw
Collaborator

svotaw commented Oct 20, 2022

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
  2. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

@junpeng0715
Author

Hi @svotaw, thank you!

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Is there an issue ticket to refer to?

  1. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

Yes, that is the PR I made in LightGBM repo. microsoft/LightGBM#5540

@junpeng0715
Author

Hi @svotaw

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Will that support dataset rows over max(int32)?

@svotaw
Collaborator

svotaw commented Oct 23, 2022

@junpeng0715
#1580 is the PR that will enable using "streaming" mode for lightgbm. It is currently waiting for a bug fix to sparse data that is pending in this LightGBM PR: microsoft/LightGBM#5551.

I have not specifically looked into int64 support for the new code. We try to make it so, but not all the LightGBM APIs support it as you know.

@svotaw
Collaborator

svotaw commented Oct 23, 2022

You have errors in the pipeline:

[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found

@junpeng0715
Author

You have errors in the pipeline:

[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found

Yes, the dependency change is microsoft/LightGBM#5540.

@junpeng0715
Author

Hi @svotaw
Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

mhamilton723 commented Oct 26, 2022

Compilation failed, please confirm this PR compiles locally

[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala:532:35: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_void
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.int_to_voidp_ptr(indexes.array),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:29:61: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.LGBM_DatasetGetField(datasetPtr, fieldName, tmpOutLenPtr, outArray, outTypePtr)
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:57:75: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found
[error] (lightgbm / Compile / compileIncremental) Compilation failed
[error] Total time: 17 s, completed Oct 26, 2022 4:08:19 PM
##[error]Script failed with exit code: 1
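The Long-vs-Int mismatches in the log above come from passing a 64-bit row count into native entry points that still take 32-bit counts. One way to bridge them safely (a hedged Java sketch with a hypothetical helper, not the actual SynapseML fix) is checked narrowing, which fails loudly instead of wrapping:

```java
public class CheckedNarrowing {
    // Hypothetical helper: narrow a 64-bit row count to the 32-bit count
    // older native APIs expect, throwing instead of silently wrapping.
    static int toInt32RowCount(long numRows) {
        return Math.toIntExact(numRows); // ArithmeticException past 2^31 - 1
    }

    public static void main(String[] args) {
        System.out.println(toInt32RowCount(1_000_000L)); // fits in int32

        try {
            toInt32RowCount(1L << 31); // max(int32_t) + 1 rows
        } catch (ArithmeticException e) {
            System.out.println("row count exceeds int32; needs an int64 API");
        }
    }
}
```

Row counts beyond this check are exactly the case that requires int64 variants of the LightGBM dataset APIs.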

@svotaw
Collaborator

svotaw commented Oct 26, 2022

Hi @svotaw Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM (microsoft/LightGBM#5551). That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm maven package from the top of the microsoft/lightgbm repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new synapseml lightgbm package made, before we could even build this?

@junpeng0715
Author

Hi @mhamilton723
Thank you for running pipeline.
This PR depends on microsoft/LightGBM#5540, so the error is known.

@junpeng0715
Author

Hi @svotaw Thank you very much.
As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM (microsoft/LightGBM#5551). That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm maven package from the top of the microsoft/lightgbm repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new synapseml lightgbm package made, before we could even build this?

I'm very glad that you can take a look at this PR, since the "bulk" mode will always stay in the system for users to choose.

I can't state the plan clearly; there is also an issue blocking CI jobs in the LGBM repo.

@guolinke

Hi @svotaw, we will push the PR as long as the CI passes, and with @shiyu1994's review.
