-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: Support for lateral joins #10048
Comments
Are correlated sub-queries a prerequisite? #6140 |
@alamb is there still interest in adding this feature? If nobody else is actively working on this I will take a stab at sketching an implementation next week and report back if I encounter any blockers. IIUC the right way to approach the problem here is to implement better (ideally full) query decorrelation as described in this paper. Alternatively the physical layer can be extended with support for correlated plan fragments (which will effectively result in a nested loop evaluation strategy with an outer and inner plan). This will also not play well with parallel execution / Ballista. |
Hi @aalexandrov -- no one is actively working on this as far as I know. Thanks for the offer.
I am not familar enough with lateral joins to be sure without some more research However, to the best of my knowledge current industrial best practice for subquery execution is to decorrelate them to joins (and not actually try and rerun them). Thus if supporting lateral joins in datafusion means supporting more decorrelation rules I think that would be a good design |
The Footnotes |
Thanks -- I have had that paper on my reading list for a while. I think it is time to bump it to the top |
I opened a PR (#11456) for the planner extensions required for this. |
BTW I finally got around to reading this paper the other week. DataFusion does not implement the fullly general decorrelation alluded to in that paper, but instead implements the "basic TPCH style" decorrelations for most subqueries (aka the basic correlations). Thus I suspect some of the lateral queries might "just work" once a plan is created for them |
There is some more discussion about this here: #13659 (comment) Summarizing: Postgres appears to plan this type of query as a NestedLoops join: postgres=# create table unnest_test (c1 int[]);
CREATE TABLE
postgres=# insert into unnest_test values (ARRAY [1,2,3]);
INSERT 0 1
postgres=# select * from unnest_test u, unnest(u.c1);
c1 | unnest
---------+--------
{1,2,3} | 1
{1,2,3} | 2
{1,2,3} | 3
(3 rows) Nested Loop (cost=0.00..295.60 rows=13600 width=36)
Output: u.c1, unnest.unnest
-> Seq Scan on public.unnest_test u (cost=0.00..23.60 rows=1360 width=32)
Output: u.c1
-> Function Scan on pg_catalog.unnest (cost=0.00..0.10 rows=10 width=4)
Output: unnest.unnest
Function Call: unnest(u.c1) 🤔 I think the postgres join operator re-evaluates the right input ( The current DataFusion join operators don't run the input over and over again -- they normally run each input and then combine them (they don't restart inputs) So in a query like select * from t1, unnest(t1.c1) I think we would need a (new) type of JoinExec that actually does run the outer query (aka |
PostgreSQL is able to decorrelate some queries, but isn't necessarily the best at it (at least last time i checked some ... few years ago when adding correlated IN support to Trino).
Or have Unnest relational operator do this. |
Yeah, DataFusion can only run joins that are decorrelated (it has no operator to run the classic subquery where it re-evaluates the subquery on each row) |
This is a clever idea |
Is your feature request related to a problem or challenge?
Here is DuckDB's documentation.
The first example given in the above DuckDB documentation is
Which gives
If I try the same in datafusion-cli 37.0:
(Note that I had to wrap
range(3)
in aunnest
, as Datafusion does not yet have arange
table function.)This gives
Schema error: No field named i.
I would have expected that Datafusion would let
(SELECT i + 1) t2(j)
refer toi
from the first table.Describe the solution you'd like
Support for lateral joins.
DuckDB mentions that the
LATERAL
keyword is optional.The first implementation in Datafusion would not need this.
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: