Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Support for lateral joins #10048

Open
simonvandel opened this issue Apr 11, 2024 · 11 comments
Open

Feature request: Support for lateral joins #10048

simonvandel opened this issue Apr 11, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@simonvandel
Copy link
Contributor

Is your feature request related to a problem or challenge?

Here is DuckDB's documentation.

The first example given in the above DuckDB documentation is

SELECT *
FROM range(3) t(i), LATERAL (SELECT i + 1) t2(j);

Which gives

i j
0 1
2 3
1 2

If I try the same in datafusion-cli 37.0:

SELECT * FROM unnest(range(3)) t(i), LATERAL (SELECT i + 1) t2(j);

(Note that I had to wrap range(3) in a unnest, as Datafusion does not yet have a range table function.)

This gives
Schema error: No field named i.

I would have expected that Datafusion would let (SELECT i + 1) t2(j) refer to i from the first table.

Describe the solution you'd like

Support for lateral joins.

DuckDB mentions that the LATERAL keyword is optional.
The first implementation in Datafusion would not need this.

Describe alternatives you've considered

No response

Additional context

No response

@simonvandel simonvandel added the enhancement New feature or request label Apr 11, 2024
@simonvandel
Copy link
Contributor Author

Are correlated sub-queries a prerequisite? #6140

@aalexandrov
Copy link
Contributor

aalexandrov commented Jul 12, 2024

@alamb is there still interest in adding this feature?

If nobody else is actively working on this I will take a stab at sketching an implementation next week and report back if I encounter any blockers. IIUC the right way to approach the problem here is to implement better (ideally full) query decorrelation as described in this paper.

Alternatively the physical layer can be extended with support for correlated plan fragments (which will effectively result in a nested loop evaluation strategy with an outer and inner plan). This will also not play well with parallel execution / Ballista.

@alamb
Copy link
Contributor

alamb commented Jul 12, 2024

Hi @aalexandrov -- no one is actively working on this as far as I know. Thanks for the offer.

IIUC the right way to approach the problem here is to implement better (ideally full) query decorrelation as described in this paper.

I am not familar enough with lateral joins to be sure without some more research

However, to the best of my knowledge current industrial best practice for subquery execution is to decorrelate them to joins (and not actually try and rerun them). Thus if supporting lateral joins in datafusion means supporting more decorrelation rules I think that would be a good design

@aalexandrov
Copy link
Contributor

aalexandrov commented Jul 12, 2024

I am not familar enough with lateral joins to be sure without some more research

The LATERAL join syntax from Postges1 (which is adopted by DuckDB) effectively is a syntactic counterpart of the dependent join from the Neumann and Kemper paper cited above. I think the first step is fixing the SqlToRel to make earlier FROM bindings available to the LATERAL input.

Footnotes

  1. https://www.postgresql.org/docs/16/sql-select.html

@alamb
Copy link
Contributor

alamb commented Jul 12, 2024

I am not familar enough with lateral joins to be sure without some more research

The LATERAL join syntax from Postges1 (which is adopted by DuckDB) effectively is a syntactic counterpart of the dependent join from the Neumann and Kemper paper cited above. I think the first step is fixing the SqlToRel to make earlier FROM bindings available to the LATERAL input.

Footnotes

  1. https://www.postgresql.org/docs/16/sql-select.html

Thanks -- I have had that paper on my reading list for a while. I think it is time to bump it to the top

@aalexandrov
Copy link
Contributor

I opened a PR (#11456) for the planner extensions required for this.

@alamb
Copy link
Contributor

alamb commented Aug 17, 2024

IIUC the right way to approach the problem here is to implement better (ideally full) query decorrelation as described in this paper.

BTW I finally got around to reading this paper the other week. DataFusion does not implement the fullly general decorrelation alluded to in that paper, but instead implements the "basic TPCH style" decorrelations for most subqueries (aka the basic correlations).

Thus I suspect some of the lateral queries might "just work" once a plan is created for them

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

There is some more discussion about this here: #13659 (comment)

Summarizing: Postgres appears to plan this type of query as a NestedLoops join:

postgres=# create table unnest_test (c1 int[]);
CREATE TABLE
postgres=# insert into unnest_test values (ARRAY [1,2,3]);
INSERT 0 1
postgres=# select * from unnest_test u, unnest(u.c1);
   c1    | unnest 
---------+--------
 {1,2,3} |      1
 {1,2,3} |      2
 {1,2,3} |      3
(3 rows)
 Nested Loop  (cost=0.00..295.60 rows=13600 width=36)
   Output: u.c1, unnest.unnest
   ->  Seq Scan on public.unnest_test u  (cost=0.00..23.60 rows=1360 width=32)
         Output: u.c1
   ->  Function Scan on pg_catalog.unnest  (cost=0.00..0.10 rows=10 width=4)
         Output: unnest.unnest
         Function Call: unnest(u.c1)

🤔

I think the postgres join operator re-evaluates the right input (pg_catalog.unnest) on each input row to the join which is how the lateral joins can refer to the outer query

The current DataFusion join operators don't run the input over and over again -- they normally run each input and then combine them (they don't restart inputs)

So in a query like

select * from t1, unnest(t1.c1)

I think we would need a (new) type of JoinExec that actually does run the outer query (aka unnest(t1.c1) for each input row of t1) 🤔

@findepi
Copy link
Member

findepi commented Dec 13, 2024

Summarizing: Postgres appears to plan this type of query as a NestedLoops join:

PostgreSQL is able to decorrelate some queries, but isn't necessarily the best at it (at least last time i checked some ... few years ago when adding correlated IN support to Trino).

I think we would need a (new) type of JoinExec that actually does run the outer query (aka unnest(t1.c1) for each input row of t1)

Or have Unnest relational operator do this.
This is how Trino does this. Because it doesn't support "join operator re-evaluates the right input", it has to model things differently, which happens to be more performant (as long as you are able to decorrelate).

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

PostgreSQL is able to decorrelate some queries, but isn't necessarily the best at it (at least last time i checked some ... few years ago when adding correlated IN support to Trino).

Yeah, DataFusion can only run joins that are decorrelated (it has no operator to run the classic subquery where it re-evaluates the subquery on each row)

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

Or have Unnest relational operator do this.

This is a clever idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants