
feat: [LightGBM] support dataset rows more then max(int32_t) #1684

Open
junpeng0715 wants to merge 2 commits into master

Conversation

junpeng0715

Related Issues/PRs

microsoft/LightGBM#5540

What changes are proposed in this pull request?

Allow datasets with more than max(int32_t) rows to be used with LightGBM via SynapseML.

How is this patch tested?

  • I tested in our Databricks environments using datasets with more than max(int32) rows.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following below steps.
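For context on the limit this PR lifts: narrowing a 64-bit row count to a 32-bit int past max(int32_t) wraps around silently in JVM languages. A minimal standalone Java illustration (not SynapseML code):

```java
public class Int32RowLimit {
    public static void main(String[] args) {
        long maxInt32 = Integer.MAX_VALUE;  // 2_147_483_647 rows
        long numRows = maxInt32 + 1;        // one row past the limit
        int narrowed = (int) numRows;       // unchecked cast wraps around
        System.out.println(narrowed);       // prints -2147483648
    }
}
```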

@junpeng0715 junpeng0715 requested a review from svotaw as a code owner October 20, 2022 05:09
@github-actions

Hey @junpeng0715 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@junpeng0715 junpeng0715 changed the title LGBM: support dataset rows more then max(int32_t) feat: [LightGBM] support dataset rows more then max(int32_t) Oct 20, 2022
@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Oct 20, 2022

Codecov Report

Merging #1684 (45e427b) into master (448f6b7) will decrease coverage by 59.98%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #1684       +/-   ##
===========================================
- Coverage   80.63%   20.65%   -59.99%     
===========================================
  Files         272      244       -28     
  Lines       14224    11971     -2253     
  Branches      732      570      -162     
===========================================
- Hits        11470     2473     -8997     
- Misses       2754     9498     +6744     
Impacted Files Coverage Δ
...rg/apache/spark/injections/BlockManagerUtils.scala 0.00% <0.00%> (-100.00%) ⬇️
.../microsoft/azure/synapse/ml/io/binary/Binary.scala 0.00% <0.00%> (-100.00%) ⬇️
...icrosoft/azure/synapse/ml/explainers/Sampler.scala 0.00% <0.00%> (-100.00%) ⬇️
...soft/azure/synapse/ml/core/contracts/Metrics.scala 0.00% <0.00%> (-100.00%) ⬇️
...soft/azure/synapse/ml/explainers/LIMESampler.scala 0.00% <0.00%> (-100.00%) ⬇️
...ft/azure/synapse/ml/explainers/TextExplainer.scala 0.00% <0.00%> (-100.00%) ⬇️
...spark/sql/types/injections/MetadataUtilities.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/automl/DefaultHyperparams.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/core/schema/SparkBindings.scala 0.00% <0.00%> (-100.00%) ⬇️
...t/azure/synapse/ml/explainers/ImageExplainer.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 212 more


@svotaw
Collaborator

svotaw commented Oct 20, 2022

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
  2. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

@junpeng0715
Author

Hi @svotaw, thank you!

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Is there an issue ticket to refer to?

  1. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

Yes, that is the PR I made in LightGBM repo. microsoft/LightGBM#5540

@junpeng0715
Author

Hi @svotaw

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Will that support dataset rows over max(int32)?

@svotaw
Collaborator

svotaw commented Oct 23, 2022

@junpeng0715
#1580 is the PR that will enable using "streaming" mode for lightgbm. It is currently waiting for a bug fix to sparse data that is pending in this LightGBM PR: microsoft/LightGBM#5551.

I have not specifically looked into int64 support for the new code. We try to make it so, but not all the LightGBM APIs support it as you know.

@svotaw
Collaborator

svotaw commented Oct 23, 2022

You have errors in the pipeline:

[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found

@junpeng0715
Author

You have errors in the pipeline:

[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found

Yes, the dependency change is microsoft/LightGBM#5540.

@junpeng0715
Author

Hi @svotaw
Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

mhamilton723 commented Oct 26, 2022

Compilation failed, please confirm this PR compiles locally

[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala:532:35: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_void
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.int_to_voidp_ptr(indexes.array),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:29:61: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.LGBM_DatasetGetField(datasetPtr, fieldName, tmpOutLenPtr, outArray, outTypePtr)
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:57:75: type mismatch;
[error] found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error] found : Long
[error] required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found
[error] (lightgbm / Compile / compileIncremental) Compilation failed
[error] Total time: 17 s, completed Oct 26, 2022 4:08:19 PM
##[error]Script failed with exit code: 1
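The Long-vs-Int mismatches in the log above come from passing a 64-bit row count into native entry points that still take 32-bit counts. One way to bridge them safely (a hedged Java sketch with a hypothetical helper, not the actual SynapseML fix) is checked narrowing, which fails loudly instead of wrapping:

```java
public class CheckedNarrowing {
    // Hypothetical helper: narrow a 64-bit row count to the 32-bit count
    // older native APIs expect, throwing instead of silently wrapping.
    static int toInt32RowCount(long numRows) {
        return Math.toIntExact(numRows); // ArithmeticException past 2^31 - 1
    }

    public static void main(String[] args) {
        System.out.println(toInt32RowCount(1_000_000L)); // fits in int32

        try {
            toInt32RowCount(1L << 31); // max(int32_t) + 1 rows
        } catch (ArithmeticException e) {
            System.out.println("row count exceeds int32; needs an int64 API");
        }
    }
}
```

Row counts beyond this check are exactly the case that requires int64 variants of the LightGBM dataset APIs.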

@svotaw
Collaborator

svotaw commented Oct 26, 2022

Hi @svotaw Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM (microsoft/LightGBM#5551). That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm maven package from the top of the microsoft/lightgbm repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new synapseml lightgbm package made, before we could even build this?

@junpeng0715
Author

Hi @mhamilton723
Thank you for running pipeline.
This PR depends on microsoft/LightGBM#5540, so the error is known.

@junpeng0715
Author

Hi @svotaw Thank you very much.
As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM (microsoft/LightGBM#5551). That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm maven package from the top of the microsoft/lightgbm repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new synapseml lightgbm package made, before we could even build this?

I'm very glad that you can take a look at this PR, since the "bulk" mode will always stay in the system for users to choose.

I can't state the plan clearly; there is also an issue blocking CI jobs in the LGBM repo.

@guolinke

Hi @svotaw, we will push the PR as long as the CI passes, and with @shiyu1994's review.
