
Add Memory Profiling Functionality #14510

Open · Tracked by #14514
berkaysynnada opened this issue Feb 5, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@berkaysynnada (Contributor)

Is your feature request related to a problem or challenge?

When executing long-running batch processes or streaming queries, it is difficult to diagnose and monitor memory usage. Users need an easy way to inspect memory consumption to understand resource usage and troubleshoot potential OOM issues.

Describe the solution you'd like

Implement a built-in memory profiling feature, similar to DuckDB's duckdb_memory() and duckdb_temporary_files() functions:

FROM duckdb_memory();
┌──────────────────┬────────────────────┬─────────────────────────┐
│       tag        │ memory_usage_bytes │ temporary_storage_bytes │
│     varchar      │       int64        │          int64          │
├──────────────────┼────────────────────┼─────────────────────────┤
│ BASE_TABLE       │          168558592 │                       0 │
│ HASH_TABLE       │                  0 │                       0 │
│ PARQUET_READER   │                  0 │                       0 │
│ CSV_READER       │                  0 │                       0 │
│ ORDER_BY         │                  0 │                       0 │
│ ART_INDEX        │                  0 │                       0 │
│ COLUMN_DATA      │                  0 │                       0 │
│ METADATA         │                  0 │                       0 │
│ OVERFLOW_STRINGS │                  0 │                       0 │
│ IN_MEMORY_TABLE  │                  0 │                       0 │
│ ALLOCATOR        │                  0 │                       0 │
│ EXTENSION        │                  0 │                       0 │
├──────────────────┴────────────────────┴─────────────────────────┤
│ 12 rows                                               3 columns │
└─────────────────────────────────────────────────────────────────┘

FROM duckdb_temporary_files();
┌────────────────────────────────┬───────────┐
│              path              │   size    │
│            varchar             │   int64   │
├────────────────────────────────┼───────────┤
│ .tmp/duckdb_temp_storage-0.tmp │ 967049216 │
└────────────────────────────────┴───────────┘

that would allow querying (rough sketch below):

  • The memory usage of different streams or even other objects (e.g., hash tables, intermediate results, buffers).
  • Current temporary file usage, if applicable.
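
A rough sketch of the building blocks DataFusion already has for this (the pool size, consumer name, and reservation amount below are made up for illustration): the MemoryPool can report its total reserved bytes today, but a duckdb_memory()-style per-tag breakdown would need additional per-consumer accounting on top.

use std::sync::Arc;
use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() -> datafusion::error::Result<()> {
    // Pool with an illustrative 1 GiB limit; operators register consumers against it.
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024 * 1024 * 1024));

    // An operator (e.g. a sort) registers a named consumer and grows its reservation.
    let mut sort_reservation = MemoryConsumer::new("SortExec[partition 0]").register(&pool);
    sort_reservation.try_grow(64 * 1024 * 1024)?;

    // A profiling function could at least expose the pool-wide total today;
    // a per-tag table like DuckDB's would need new bookkeeping inside the pool.
    println!("total reserved bytes: {}", pool.reserved());
    Ok(())
}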

Describe alternatives you've considered

No response

Additional context

No response

@comphead (Contributor) commented Feb 5, 2025

That is nice @berkaysynnada, I love it.

To implement this we need to use memory reservations in most memory-intensive operations (sorting, joins, hashing, scans, etc.). Right now only a few operations support memory reservation, and even fewer support spilling.

To break this down into manageable pieces we need to:

  • figure out which parts of the execution plan we want to monitor
  • create a sub-issue to integrate memory reservation for each part of the execution plan
  • display it back to the user. We already have a metrics builder on top of ExecutionPlan, and metrics can be accessed through EXPLAIN ANALYZE, so perhaps this could substitute for the built-in function? Should we output per partition or aggregated, max value or mean? (A metrics sketch follows this list.)
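
A minimal sketch of that metrics route, assuming memory is exposed as a per-partition gauge (the metric name "mem_used" and the partition index here are illustrative, not an existing convention):

use datafusion::physical_plan::metrics::{ExecutionPlanMetricsSet, Gauge, MetricBuilder};

// An operator registers a per-partition memory gauge so that EXPLAIN ANALYZE
// can surface it alongside the existing metrics (output_rows, elapsed_compute, ...).
fn register_mem_gauge(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Gauge {
    MetricBuilder::new(metrics).gauge("mem_used", partition)
}

fn main() {
    let metrics = ExecutionPlanMetricsSet::new();
    let mem_used = register_mem_gauge(&metrics, 0);
    // The operator would keep this in sync with its MemoryReservation.
    mem_used.set(64 * 1024 * 1024);
    println!("mem_used = {} bytes", mem_used.value());
}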

@berkaysynnada (Contributor, Author)

Thank you @comphead for the great elaboration! I don't have a clear answer to your question yet, but since this is the final part of our work list, I believe someone can complete the first two items before deciding on how to collect and display.

@PokIsemaine (Contributor)

I am interested in this function.

Tag memory that is allocated through the buffer manager, and add duckdb_memory() function by Mytherin · Pull Request #10496 · duckdb/duckdb
DuckDB adds memory tags when the buffer manager allocates memory, making it easier to trace memory usage.

Does DataFusion also need to implement it in a similar way?

I have just started to learn DataFusion’s memory management and I noticed:
https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html

Rather than tracking all allocations, DataFusion takes a pragmatic approach: Intermediate memory used as data streams through the system is not accounted (it assumed to be “small”) but the large consumers of memory must register and constrain their use. This design trades off the additional code complexity of memory tracking with limiting resource usage.

Do we need to trace all memory allocations or just focus on the part managed by MemoryPool?
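
To make the registration pattern those docs describe concrete, here is a minimal sketch of the consumer-side flow (the consumer name, pool size, and allocation size are made up; the point is only that a large consumer asks the pool before growing, and must spill or error when refused):

use std::sync::Arc;
use datafusion::execution::memory_pool::{FairSpillPool, MemoryConsumer, MemoryPool};

fn main() {
    // Only large consumers register; small per-batch allocations go untracked.
    let pool: Arc<dyn MemoryPool> = Arc::new(FairSpillPool::new(100 * 1024 * 1024));
    let mut reservation = MemoryConsumer::new("GroupByHashTable")
        .with_can_spill(true)
        .register(&pool);

    // Before growing its hash table, the operator asks the pool for room.
    if reservation.try_grow(200 * 1024 * 1024).is_err() {
        // Over budget: spill to disk, or return a resources-exhausted error.
        println!("would spill; currently holding {} bytes", reservation.size());
    }
}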

@comphead (Contributor)

Hi @PokIsemaine, I'm planning to experiment with this.
One thing you mentioned that is absolutely correct is to use MemoryReservation, which is the helper used when memory is allocated through the pool; its values can be read in metrics using an EXPLAIN ANALYZE query statement. The challenge is that our memory pool coverage needs to be improved so that many more transformations are covered.

Another dirty trick which might work is to track the max process memory per transformation using https://crates.io/crates/sysinfo

This approach shows process memory usage globally, not per specific node of the physical plan, but having a max-memory fingerprint would make it easy to find trends and heavyweight operations. (A sysinfo sketch follows.)
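
A minimal sketch of that trick (the exact API differs between sysinfo versions; this roughly matches sysinfo 0.30, and where the samples are taken is whatever transformation boundary we choose):

use sysinfo::System;

fn current_process_memory_bytes() -> u64 {
    // Process-wide RSS, not per plan node; refresh only our own process.
    let pid = sysinfo::get_current_pid().expect("current pid");
    let mut sys = System::new();
    sys.refresh_process(pid);
    sys.process(pid).map(|p| p.memory()).unwrap_or(0)
}

fn main() {
    let before = current_process_memory_bytes();
    // ... run one transformation / plan node here ...
    let after = current_process_memory_bytes();
    println!("memory delta: {} bytes", after.saturating_sub(before));
}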

@2010YOUY01 (Contributor)

I am interested in this function.

Tag memory that is allocated through the buffer manager, and add duckdb_memory() function by Mytherin · Pull Request #10496 · duckdb/duckdb DuckDB adds memory tags when the buffer manager allocates memory, making it easier to trace memory usage.

Does DataFusion also need to implement it in a similar way?

No, they're different. I believe DuckDB's memory pool is a textbook buffer pool, which manages memory spilling and reading it back automatically for the operators.
For DataFusion, MemoryPool is perhaps a misleading name; it's actually a 'memory tracker': operators have to use this tracker to check whether they have exceeded the memory limit. If so, they either spill explicitly or fail with a user-friendly error message.

I have just started to learn DataFusion’s memory management and I noticed: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html

Rather than tracking all allocations, DataFusion takes a pragmatic approach: Intermediate memory used as data streams through the system is not accounted (it assumed to be “small”) but the large consumers of memory must register and constrain their use. This design trades off the additional code complexity of memory tracking with limiting resource usage.

Do we need to trace all memory allocations or just focus on the part managed by MemoryPool?

Regarding tracking those small allocations with the internal MemoryPool: maybe not. However, if we can use a system memory profiler, or internal metrics from a memory allocator (e.g. mimalloc), it would be great to integrate them as well.
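
As an illustration of the allocator-metrics idea (using jemalloc's control interface here only because its stats API is well known; mimalloc exposes similar counters through its own API):

// Assumes the process runs under jemalloc via the tikv-jemallocator crate.
use tikv_jemalloc_ctl::{epoch, stats};

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    // jemalloc caches its stats; advancing the epoch refreshes them.
    epoch::advance().unwrap();
    let allocated = stats::allocated::read().unwrap();
    let resident = stats::resident::read().unwrap();
    println!("allocated={allocated} bytes, resident={resident} bytes");
}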

@comphead (Contributor)

The PR in arrow-rs to avoid memory overcounting for shared buffers: apache/arrow-rs#7303
