
Tracking issue for the new streaming engine #20947

Open
38 of 56 tasks
coastalwhite opened this issue Jan 28, 2025 · 14 comments
Labels
new-streaming Features for or dependent on the new streaming engine

Comments

@coastalwhite
Collaborator

coastalwhite commented Jan 28, 2025

Note

TLDR
Starting with version 1.23.0, the old streaming engine will no longer be maintained and may become less usable over time. People currently using it should pin their Polars version to 1.22.0 or earlier.

The old streaming engine is being deprecated, and the new streaming engine is coming to replace it soon™. The new streaming engine is far along, but does not yet have 100% feature parity with the old one. At the same time, the burden of maintaining both engines is becoming too large. Therefore, we are deprecating the old streaming engine starting with Polars 1.23.0 and advising its users to pin their Polars version to an earlier release.

This issue tracks the progress of the new streaming engine. Unless mentioned otherwise, all unsupported features fall back to the in-memory engine.

The streaming engine can already be used via .collect(new_streaming=True) or by setting the environment variable POLARS_FORCE_NEW_STREAMING=1. The physical plan can be visualized as a Graphviz dot graph by setting POLARS_VISUALIZE_PHYSICAL_PLAN=filename.dot.

Sources

Sinks

Out-of-core

  • Group-by
  • Equi-join
  • Sort

Streaming Nodes

Aggregates

  • Sum
  • Mean
  • Min/max
  • Last/first
  • Var/std
  • NUnique/count
  • Implode
  • Median/quantile

Plan translation to streaming

  • Literal series in selections
  • Aggregates in selections
  • Sorts in selections
  • Filters in selections
  • .over() to group-by + join
  • .replace() to map (small replacement dictionary) or join (large replacement dictionary)

Datatypes

  • Everything

Other

  • Show plan with explain
  • Add support for profile
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jan 28, 2025
Information on the new streaming engine: pola-rs#20947.
@coastalwhite coastalwhite changed the title Tracking issue for new streaming engine Tracking issue for the new streaming engine Jan 28, 2025
@orlp orlp added the new-streaming Features for or dependent on the new streaming engine label Jan 28, 2025
@deanm0000
Collaborator

What's the difference between the out-of-core group-by/equi-join and the streaming-node group-by/equi-join?

@orlp
Collaborator

orlp commented Jan 28, 2025

@deanm0000 Out-of-core operators will automatically spill to disk when the data you're processing is larger than available memory.

@daviskirk
Contributor

non-supported (unless mentioned otherwise) features fall back to the in-memory engine

What happens to the other non-supported features like the broken categoricals? Is there any way to opt out of streaming as soon as they are encountered?

@orlp
Collaborator

orlp commented Jan 28, 2025

@daviskirk Categoricals are currently the only thing that is simply broken on the new streaming engine (to my knowledge). Almost all unsupported operations automatically fall back to the in-memory engine (just for the parts of the computation that aren't supported); some give a hard error.

My plan is to fix categoricals on the new streaming engine soon.


@deanm0000
Collaborator

I saw that var/std are supported in the new streaming engine, so I tried:

import os
os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"

import numpy as np
import polars as pl

df = pl.DataFrame({"a": np.random.random(100)})
print(df.lazy().select(pl.col("a").std()).explain(streaming=True))

 SELECT [col("a").std()] FROM
  STREAMING:
    DF ["a"]; PROJECT 1/1 COLUMNS

The std operation is outside of the streaming section. I'm not sure how to tell whether that's because explain doesn't show the new_streaming query plan or because something is awry with the new_streaming var/std.

@coastalwhite
Collaborator Author

Explain does not yet reflect the new streaming engine. To see what it is actually doing, use POLARS_VISUALIZE_PHYSICAL_PLAN=/path/to/graph.dot. I added explain to the list of things that need to be done.

@ion-elgreco
Contributor

Sinks should also be possible with deltalake since v0.25 if you pass a RecordBatchStream to the writer.

@jayceslesar

Any insight as to when this will be available with sink_* methods?

@ritchie46
Member

Any insight as to when this will be available with sink_* methods?

Next release the sinks will point to the new streaming engine.

@MPJansen

MPJansen commented Mar 6, 2025

Any insight as to when this will be available with sink_* methods?

Next release the sinks will point to the new streaming engine.

Hi!

Just checking, does that imply that writing partitioned lazyframes will be supported by the sink_ methods?
Or is that out of scope? If so, is that something Polars is interested in supporting?

KR

@coastalwhite
Collaborator Author

coastalwhite commented Mar 6, 2025

Just checking, does that imply that writing partitioned lazyframes will be supported by the sink_ methods? Or is that out of scope? If so, is that something Polars is interested in supporting?

I am unsure what you mean. If you mean partitioned sinking, yes. We already have #21573, but will support other variants as well. If you mean multiple sinks in separate lazy frames, not yet, but that will be supported eventually.

@gdementen

Will the new streaming engine improve the out-of-core/larger-than-RAM capabilities of Polars? I will soon (within 3 months) have to decide which solution to use for a new project with larger-than-RAM datasets (100-200 GB) and would love to use Polars, which I already use with great results on smaller datasets. Your benchmarks usually involve smaller datasets, and looking at the benchmarks at https://duckdblabs.github.io/db-benchmark/, I got the impression that Polars currently has trouble with those dataset sizes, but I am hopeful the new streaming engine will change that. Are my hopes unfounded? (Sorry if this is not the correct place to ask such a question.)

@coastalwhite
Collaborator Author

Currently, the new streaming engine is not really out-of-core yet. This will change in the near future and yes, we should get a lot better at out-of-core.
