[BUG] OutOfMemorySparkException only when including a validationIndicatorCol - LightGBMClassifier #2294
Comments
Facing the same issue.
This may be because the validation dataset is loaded into memory on every executor, so a large validation dataset can cause out-of-memory errors.
Yes, I think that's almost certainly right. I would regard this as a bug, or at least a significant enough drawback to warrant a feature request that it not work this way (see, for example, dask-xgboost). If you want to reap the benefits of training on large data, your validation sets are going to need to scale somewhat as well.
Looking at the source code, validateData is broadcast to each executor without any compression, which consumes a significant amount of memory. For now, the only way to avoid the problem is to reduce the size of validateData so that it does not take up too much memory; a sketch of that workaround follows below.
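A minimal, untested sketch of that workaround, assuming a DataFrame `df` with a boolean `EvalFlag` indicator column (the column name comes from the report; the sampling fraction and seed are arbitrary):

```python
from pyspark.sql import functions as F

# Keep all training rows, but downsample the validation rows so the broadcast
# validation dataset stays small enough to fit in executor memory.
train_df = df.filter(~F.col("EvalFlag"))
val_df = df.filter(F.col("EvalFlag")).sample(fraction=0.1, seed=42)

# Union back into a single DataFrame; LightGBMClassifier expects one input
# with the validation rows flagged by the indicator column.
slim_df = train_df.unionByName(val_df)
```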
This may be because the LightGBM C++ library does not support streaming validation datasets, even though it does support streaming training datasets. So the room for improvement in SynapseML is limited: at most it could replace the broadcast with a more space-efficient way of loading the validation data and release it immediately after loading. The LightGBM C++ library still has to load the full validation dataset, so the potential gains are limited.
SynapseML version
System information
Describe the problem
I have a dataset written to S3 which was created using pyspark.ml's OneHotEncoder and VectorAssembler. So the dataset written to S3 has 3 columns:
When I don't set a validationIndicatorCol in the classifier's constructor, training succeeds. However, as soon as I set
validationIndicatorCol='EvalFlag'
in the constructor, I get the error:
org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (60.0 GiB). The average row size was 626.0 B, with 29.0 GiB used for temporary buffers. [shuffleId: None]
Note that I had already increased
spark.driver.maxResultSize
from its default to 60 GiB. Even at its default value (which I gather is much smaller), training without the evaluation flag worked just fine. So something about including an evaluation set has massively increased the requirements on
spark.driver.maxResultSize.
Code to reproduce issue
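As a rough illustration of the setup described above, here is a minimal, hypothetical sketch. The S3 path and the features/label column names are assumptions; only EvalFlag and the 60 GiB maxResultSize setting come from the report.

```python
from pyspark.sql import SparkSession
from synapse.ml.lightgbm import LightGBMClassifier

spark = (
    SparkSession.builder
    .appName("lightgbm-validation-oom")
    # The reporter raised this limit to 60 GiB (default is much smaller).
    .config("spark.driver.maxResultSize", "60g")
    .getOrCreate()
)

# Dataset previously prepared with OneHotEncoder + VectorAssembler and written to S3
# (hypothetical path and column names).
df = spark.read.parquet("s3://my-bucket/assembled-dataset/")

# Training without a validation indicator column succeeds.
clf = LightGBMClassifier(featuresCol="features", labelCol="label")
model = clf.fit(df)

# Setting validationIndicatorCol reportedly triggers OutOfMemorySparkException,
# apparently because the validation rows are collected and broadcast to every executor.
clf_val = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    validationIndicatorCol="EvalFlag",
)
model_val = clf_val.fit(df)
```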
Other info / logs
No response
What component(s) does this bug affect?
- area/cognitive: Cognitive project
- area/core: Core project
- area/deep-learning: DeepLearning project
- area/lightgbm: Lightgbm project
- area/opencv: Opencv project
- area/vw: VW project
- area/website: Website
- area/build: Project build system
- area/notebooks: Samples under notebooks folder
- area/docker: Docker usage
- area/models: models related issue

What language(s) does this bug affect?
- language/scala: Scala source code
- language/python: Pyspark APIs
- language/r: R APIs
- language/csharp: .NET APIs
- language/new: Proposals for new client languages

What integration(s) does this bug affect?
- integrations/synapse: Azure Synapse integrations
- integrations/azureml: Azure ML integrations
- integrations/databricks: Databricks integrations