-
Notifications
You must be signed in to change notification settings - Fork 842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: [LightGBM] support dataset rows more then max(int32_t) #1684
base: master
Are you sure you want to change the base?
Conversation
Hey @junpeng0715 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guild on building from source. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Codecov Report
@@ Coverage Diff @@
## master #1684 +/- ##
===========================================
- Coverage 80.63% 20.65% -59.99%
===========================================
Files 272 244 -28
Lines 14224 11971 -2253
Branches 732 570 -162
===========================================
- Hits 11470 2473 -8997
- Misses 2754 9498 +6744
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
hello. thanks for this. I will review when I get a chance, but a couple of points:
|
Hi @svotaw ,thank you!
Is there any issue ticket refer to?
Yes, that is the PR I made in LightGBM repo. microsoft/LightGBM#5540 |
Hi @svotaw
Is that will support dataset rows over max(int32)? |
@junpeng0715 I have not specifically looked into int64 support for the new code. We try to make it so, but not all the LightGBM APIs support it as you know. |
You have errors in the pipeline: [error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData") |
Yes the dependency change is microsoft/LightGBM#5540. |
Hi @svotaw As you said, the "streaming" mode will be use in the future, do we still need to move forward with this PR? |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Compilation failed, please confirm this PR compiles locally [error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala:532:35: type mismatch; |
There will be 2 modes of execution, the older "bulk" mode which uses DatasetAggregator, and the newer "streaming" mode which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM microsoft/LightGBM#5551. That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves. Also, what is the plan for your LightGBM PR? For SynapseMl, we make our own ilghtgbm maven package from the top of microsoft/lightgbm repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in and we make a new synapseml lightgbm package before we could even build this? |
Hi @mhamilton723 |
I'm very glad that you can take a look for this PR, since the "bulk" mode will always stay in the system for user to choose. I can't say the plan clearly, there also hava issue block CI jobs in LGBM repo. |
Hi @svotaw , we will push the PR, as long as the CI pass, and with @shiyu1994 's review. |
Related Issues/PRs
#LightGBM: 5540
What changes are proposed in this pull request?
Allow more than MAX(int32_t) of dataset rows to use LightGBM via SynapseML.
How is this patch tested?
Does this PR change any dependencies?
Does this PR add a new feature? If so, have you added samples on website?