
High memory usage reading Parquet files with many struct fields #21031

Open
2 tasks done
adamreeve opened this issue Jan 31, 2025 · 3 comments
Labels: bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@adamreeve (Contributor) commented Jan 31, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import resource
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

num_fields = 1000
num_rows = 1
rng = np.random.default_rng(0)

use_struct = True

arrays = [pa.array(rng.uniform(0.0, 1.0, num_rows), type=pa.float32()) for _ in range(num_fields)]
names = [f'f{i}' for i in range(num_fields)]

if use_struct:
    table = pa.Table.from_pydict({
        'struct': pa.StructArray.from_arrays(arrays=arrays, names=names)
    })
else:
    table = pa.Table.from_arrays(arrays=arrays, names=names)

pq.write_table(table, 'data.parquet')

rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

num_copies = 10
scans = [pl.scan_parquet("data.parquet") for _ in range(num_copies)]
df = pl.concat(scans).collect()

rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

usage = rss_after - rss_before  # KB on Linux but can be different units on other OSs
print(f'Memory usage = {usage / 1000} MB')
print(f'Estimated size = {df.estimated_size("mb")} MB')

Log output

found multiple sources; run comm_subplan_elim
UNION: `parallel=false` union is run sequentially
parquet scan with parallel = None
CACHE SET: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
CACHE HIT: cache id: 0
Memory usage = 2152.76 MB
Estimated size = 0.03814697265625 MB

Issue description

The memory usage is much higher than expected (2152 MB): the Parquet file is only 236 KB and the data itself is tiny. For comparison, writing similarly sized data with a flat schema (setting use_struct to False in the example code) uses only 38 MB of memory.

Expected behavior

Memory usage should be similar to using a flat schema.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.12.9-200.fc41.x86_64-x86_64-with-glibc2.40
Python:              3.13.1 (main, Dec  9 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@adamreeve added labels bug, needs triage, python on Jan 31, 2025
@ritchie46 (Member) commented Feb 1, 2025

Observation: the memory usage is the memory of the plan, not of the data. Not collecting to a df shows the same memory usage, but only if I call explain. That's why the caches have no influence.

I believe the parquet metadata is large in this example.

@adamreeve (Contributor, Author) commented:

> I believe the parquet metadata is large in this example.

Increasing the length of the struct field names significantly increases memory usage, supporting this idea. E.g. changing the names from f{i} to f{i}_012345678901234567890123456789 in my example increases memory usage from 2.1 GB to 3.1 GB, while the Parquet file size only grows by about 100 kB.

@coastalwhite (Collaborator) commented:

I need to finish one thing for the new streaming engine; afterwards I will look into this. There are generally still a lot of wins to be had in loading the Parquet metadata.

@coastalwhite coastalwhite reopened this Feb 2, 2025
@coastalwhite coastalwhite self-assigned this Feb 2, 2025