-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[c++] Fix memory leak in Arrow table implementation #6314
Conversation
FYI, if anyone has a good idea how to write good tests for this, please let me know 👀 once this is merged, we should probably also cut a new release soonish to allow people to confidently use the Arrow interface. |
~ArrowTable() { | ||
// As consumer of the Arrow array, the Arrow table must release all Arrow arrays it receives | ||
// as well as the schema. As per the specification, children arrays are released by the | ||
// producer. See: https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-consumers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for including this link, super helpful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll defer to @guolinke or @shiyu1994 for a review on this.
Just want to add that if/when we add support for Arrow data via the R-package here, we should be able to catch such things through the valgrind
checks we run on the R package: https://github.com/microsoft/LightGBM/tree/master/R-package#valgrind
Would be great! I myself, however, have close to no experience with R, unfortunately 😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the excellent fix.
I've approved this PR. Could you update the branch to sync with master so that it can be merged? |
@jameslamb can you tell me how to retry jobs run on Azure? I can't seem to find a button that allows me to do this. |
It's a Microsoft-controlled account so I suspect @shiyu1994 would have to add explicitly add you. I believe @guolinke did that for me a few years ago while still at Microsoft. Message @shiyu1994 in the maintainer slack and hopefully he can get you set up. Until then, |
@borchero Could you register an Azure DevOps account with the same email as your Github? Then, I'll add you to the Azure DevOps Team for our CI jobs. |
Done @shiyu1994, thanks a lot! :) |
Motivation
When trying to integrate LightGBM v4.2 into our codebase, we noticed an unexpected increase in memory consumption when using LightGBM's new Arrow interface under certain circumstances.
It turns out that the Arrow implementation causes memory leaks as it does not properly release the Arrow arrays being passed to it. Unfortunately, this issue only really surfaces if the data being passed to Arrow is not in memory during the entire program execution (which is frequently the case) and the leakage really only becomes an issue if different data is passed to the LightGBM Arrow interface multiple times (this is where we discovered it).
The fix is simple and the reasoning behind it is outlined clearly in the Arrow C Data interface specification.
Huge shoutout to @delsner who helped me debug this memory leak and helped me learn a lot more about Arrow, allocators and memory debugging 😁
Sorry for not already including this in #6034 and its sibling PRs... I still had some gaps in my understanding of the Arrow C Data interface 👀