
High memory usage reading Parquet files with many struct fields #21031

Open
2 tasks done
adamreeve opened this issue Jan 31, 2025 · 3 comments
Labels: bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@adamreeve (Contributor) commented Jan 31, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import resource
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

num_fields = 1000
num_rows = 1
rng = np.random.default_rng(0)

use_struct = True

arrays = [pa.array(rng.uniform(0.0, 1.0, num_rows), type=pa.float32()) for _ in range(num_fields)]
names = [f'f{i}' for i in range(num_fields)]

if use_struct:
    table = pa.Table.from_pydict({
        'struct': pa.StructArray.from_arrays(arrays=arrays, names=names)
    })
else:
    table = pa.Table.from_arrays(arrays=arrays, names=names)

pq.write_table(table, 'data.parquet')

rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

num_copies = 10
scans = [pl.scan_parquet("data.parquet") for _ in range(num_copies)]
df = pl.concat(scans).collect()

rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

usage = rss_after - rss_before  # KB on Linux but can be different units on other OSs
print(f'Memory usage = {usage / 1000} MB')
print(f'Estimated size = {df.estimated_size("mb")} MB')

Log output

found multiple sources; run comm_subplan_elim
UNION: `parallel=false` union is run sequentially
parquet scan with parallel = None
CACHE SET: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
Memory usage = 2152.76 MB
Estimated size = 0.03814697265625 MB

Issue description

The memory usage is much higher than expected (2152 MB): the Parquet file is only 236 KB and the data itself is tiny. For comparison, writing similarly sized data with a flat schema (setting use_struct to False in the example code) uses only 38 MB of memory.

Expected behavior

Memory usage should be similar to using a flat schema.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.12.9-200.fc41.x86_64-x86_64-with-glibc2.40
Python:              3.13.1 (main, Dec  9 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@adamreeve added labels bug, needs triage, python on Jan 31, 2025
@ritchie46 (Member) commented Feb 1, 2025

Observation: the memory usage is the memory of the plan, not of the data. Not collecting to a df shows the same memory usage, but only if I call explain. That's why the caches have no influence.

I believe the parquet metadata is large in this example.

@adamreeve (Contributor, Author) commented:

> I believe the parquet metadata is large in this example.

Increasing the length of the struct field names significantly increases memory usage, supporting this idea. E.g. changing the names from f{i} to f{i}_012345678901234567890123456789 in my example increases memory usage from 2.1 GB to 3.1 GB, while the Parquet file size only grows by about 100 kB.

@coastalwhite (Collaborator) commented:

I need to finish one thing for the new streaming engine; afterwards I will look into this. There are generally still a lot of wins to be had in loading the Parquet metadata.

@coastalwhite coastalwhite reopened this Feb 2, 2025
@coastalwhite coastalwhite self-assigned this Feb 2, 2025