Tracking issue for the new streaming engine #20947

Open
26 of 42 tasks
coastalwhite opened this issue Jan 28, 2025 · 5 comments
Labels
new-streaming Features for or dependent on the new streaming engine

Comments

@coastalwhite
Collaborator

coastalwhite commented Jan 28, 2025

Note

TLDR
Starting with Polars 1.23.0, the old streaming engine will no longer be maintained and may become less usable over time. People currently using it should pin their version to 1.22.0 or earlier.

The old streaming engine is being deprecated, and the new streaming engine is coming to replace it soon™. The new streaming engine is far along, but does not yet have 100% feature parity with the old one. At the same time, the burden of maintaining both engines is becoming too large. Therefore, we are deprecating the old streaming engine starting with Polars 1.23.0 and advising people who depend on it to pin Polars to a version before 1.23.0.

This issue tracks the progress on the new streaming engine. All non-supported (unless mentioned otherwise) features fall back to the in-memory engine.

The new streaming engine can already be used by calling .collect(new_streaming=True) or by setting the environment variable POLARS_FORCE_NEW_STREAMING=1. The physical plan can be visualized as a dot graph by setting POLARS_VISUALIZE_PHYSICAL_PLAN=filename.dot.
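
A minimal sketch of both ways to opt in, assuming a Polars version that still accepts the `new_streaming` argument mentioned above. The helper names `enable_new_streaming` and `run_example` are hypothetical, and setting the variables before the query runs is an assumption about when Polars reads them.

```python
import os


def enable_new_streaming(plan_path=None):
    """Set the environment variables named in this issue (hypothetical helper)."""
    # Force queries onto the new streaming engine.
    os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"
    if plan_path is not None:
        # Ask Polars to write the physical plan as a dot graph to this file.
        os.environ["POLARS_VISUALIZE_PHYSICAL_PLAN"] = plan_path


def run_example():
    """Per-query opt-in via the collect() argument described in this issue."""
    import polars as pl  # imported lazily so the helper above works without Polars

    lf = pl.LazyFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
    query = lf.group_by("key").agg(pl.col("value").sum())
    # Assumption: this Polars version still accepts new_streaming=True.
    return query.collect(new_streaming=True)


enable_new_streaming("plan.dot")
```

If a dot file is written, it can be rendered with Graphviz, for example `dot -Tpng plan.dot -o plan.png`.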

Sources

Sinks

Out-of-core

  • Group-by
  • Equi-join
  • Sort

Streaming Nodes

Aggregates

  • Sum
  • Mean
  • Min/max
  • Last/first
  • Var/std
  • NUnique/count
  • Implode
  • Median/quantile

Plan translation to streaming

  • Literal series in selections
  • Aggregates in selections
  • Sorts in selections
  • Filters in selections
  • .over() to group-by + join
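
For illustration only: the last item refers to rewriting a window expression such as `pl.col("value").sum().over("key")` as a group-by followed by a join. The plain-Python sketch below shows that idea on a list of dicts. It is not how the engine implements the translation, and `sum_over` is a hypothetical helper.

```python
from collections import defaultdict


def sum_over(rows, key, value):
    """Emulate a windowed sum as a group-by followed by a join back onto the rows."""
    # Step 1: group-by + aggregate, producing one total per key.
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    # Step 2: join the per-key totals back onto every input row,
    # keeping the original row order and row count.
    return [{**row, f"{value}_sum": totals[row[key]]} for row in rows]


rows = [
    {"key": "a", "value": 1},
    {"key": "b", "value": 2},
    {"key": "a", "value": 3},
]
result = sum_over(rows, "key", "value")
```

The point of the rewrite is that a group-by and an equi-join are both operations the streaming engine tracks above, so a window expression can reuse them instead of needing its own operator.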

Datatypes

  • Categorical (currently no fallback - simply broken)
  • Everything else
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jan 28, 2025
Information on the new streaming engine: pola-rs#20947.
@coastalwhite coastalwhite changed the title from "Tracking issue for new streaming engine" to "Tracking issue for the new streaming engine" Jan 28, 2025
@orlp orlp added the new-streaming Features for or dependent on the new streaming engine label Jan 28, 2025
@deanm0000
Collaborator

What's the difference between out-of-core group_by/equi-join and streaming nodes group_by/equi-join?

@orlp
Collaborator

orlp commented Jan 28, 2025

@deanm0000 Out-of-core operators will automatically spill to disk when the data you're processing is larger than available memory.

@daviskirk
Contributor

non-supported (unless mentioned otherwise) features fall back to the in-memory engine

What happens to the other non-supported features like the broken categoricals? Is there any way to opt out of streaming as soon as they are encountered?

@orlp
Collaborator

orlp commented Jan 28, 2025

@daviskirk Categoricals are currently the only thing that is simply broken on the new streaming engine (to my knowledge). Almost all unsupported things automatically fall back to the eager engine (just for the parts of the computation that aren't supported); some give a hard error.

My plan is to fix categoricals on the new streaming engine soon.

@coastalwhite coastalwhite pinned this issue Jan 31, 2025