
OutOfMemoryError When Reading Large COG File with mosaic.read() in Databricks #496

Open
Thimm opened this issue Jan 3, 2024 · 1 comment

Thimm commented Jan 3, 2024

Describe the bug
I am encountering an OutOfMemoryError when attempting to read a large Cloud Optimized GeoTIFF (COG) file (2.4 GB) using the mosaic.read() method in an Azure Databricks environment. The error occurs during the execution of df.show() after reading the file.

To Reproduce

  1. Download the file to DBFS
  2. Run
from databricks import mosaic
mosaic.enable_mosaic(spark, dbutils)
file_path = "[path to file]"
df = mosaic.read().format("raster_to_grid")\
    .option("driverName", "GTiff")\
    .option("fileExtension", "*.tif")\
    .load(f"file://{file_path}")
df.show()

Expected behavior
The expectation is to successfully read the COG file into a DataFrame and display it using df.show() without encountering memory issues.

Additional Context

  • The COG file being read is 2.4 GB.
  • The issue occurs consistently with files of this size.
  • The issue persists on a node with 128 GB of memory.

Environment
Databricks Runtime Version: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
Cluster Configuration: Standard_D32ads_v5 128 GB Memory, 32 Cores
Language: Python

Traceback.txt

@milos-colic
Contributor

@Thimm Thank you for reporting this issue.
It will be resolved in the next release.
There was a bug that caused retiling of a large file to happen at a deferred stage instead of immediately on read.
Spark buffers do not support binaries larger than 2 GB, so on read we have to retile the file into tiles smaller than 2 GB and then perform transformations on those.
I will be opening a PR today, and the fix will be part of the next release.
With the new fix, I ran the provided file on my local machine without any issues, using Docker with Rosetta translation since I am on a Mac M1; even with those constraints it runs now.
The next release should be out within a couple of weeks.
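To make the constraint concrete: a tile's uncompressed payload must fit in a Spark binary buffer, which is capped at roughly 2 GB. The sketch below is not Mosaic's actual retiling code; it is a minimal, hypothetical illustration of how one could plan a tile grid for a raster so that each tile's uncompressed size stays under that limit. The function name `plan_tiles` and its parameters are assumptions for illustration only.

```python
import math

def plan_tiles(width, height, bands, bytes_per_pixel, max_tile_bytes=2 * 1024**3):
    """Plan a near-square grid of pixel windows for a raster so each
    window's uncompressed size stays under max_tile_bytes (the ~2 GB
    Spark binary-buffer limit mentioned above).

    Returns a list of (x_offset, y_offset, width, height) windows
    that together cover the full raster without overlap.
    """
    total_bytes = width * height * bands * bytes_per_pixel
    # Minimum number of tiles needed, then a near-square grid of them.
    n_tiles = max(1, math.ceil(total_bytes / max_tile_bytes))
    cols = math.ceil(math.sqrt(n_tiles))
    rows = math.ceil(n_tiles / cols)
    tile_w = math.ceil(width / cols)
    tile_h = math.ceil(height / rows)
    windows = []
    for row in range(rows):
        for col in range(cols):
            x0, y0 = col * tile_w, row * tile_h
            # Clamp the last column/row so windows never run past the edge.
            w = min(tile_w, width - x0)
            h = min(tile_h, height - y0)
            if w > 0 and h > 0:
                windows.append((x0, y0, w, h))
    return windows

# Example: a 60000 x 60000 single-band float32 raster (~14.4 GB raw)
# is split into 20000 x 20000 windows of ~1.6 GB each.
windows = plan_tiles(60000, 60000, bands=1, bytes_per_pixel=4)
```

Each window could then be read and written out as its own GeoTIFF (e.g. with GDAL windowed reads) before handing the tiles to Spark, which is the general shape of the retile-on-read behavior described in this fix.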
