Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with join on st_contains #435

Open
rvo1994 opened this issue Sep 26, 2023 · 1 comment
Open

Performance issue with join on st_contains #435

rvo1994 opened this issue Sep 26, 2023 · 1 comment

Comments

@rvo1994
Copy link

rvo1994 commented Sep 26, 2023

I am trying to inner join delta table g (contains ~9M records but only 1 file of 80MB as it only consists of 2 columns: lat and lon) with delta table v (contains 40 complex polygons - biggest polygons has 2800 boundaries) on a st_contains condition (if polygon from table v contains point from table g):

SELECT vvr.lat
     , vvr.lon
     , vvr.shape_name as vvr
FROM
(
    SELECT /*+ BROADCAST(v) */ v.shape_name
         , v.area
         , MIN(v.area) OVER (PARTITION BY v.shape_name, g.lat, g.lon) AS min_area
         , v.hole
         , g.lat
         , g.lon
    FROM  v
    INNER JOIN g
    ON st_contains(st_point(CAST(g.lon as DOUBLE), CAST(g.lat AS DOUBLE)), st_geomfromwkt(v.polygon))
) AS vvr
WHERE vvr.area = vvr.min_area AND vvr.hole = false;

I cancelled the job after 15 hours as it has only read ~1.5M records from table g (as observed from the SQL / Dataframe tab in the Spark UI) as it was still busy with "Filtering files for query". There is also no parallelization happening as there was only 1 task running. FYI, I am using DBR 11.3 LTS and Standard_DS3_v2 driver/worker with 4 nodes per executor.

Any suggestions on how to optimize this query? Thanks!

@sllynn
Copy link
Contributor

sllynn commented Sep 28, 2023

@rvo1994 For a point in poly join like this one, you might want to try using a grid index to speed the process up.

Take a look at the quickstart notebook to see how you can index your points and polygons using H3 then perform the join directly on the index for points within the so called 'core' cell ids (those wholly contained within the polygons) or with st_contains for cell ids that sit on the boundary of the polys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants