I am trying to inner join delta table g (~9M records, but only a single 80 MB file since it consists of just 2 columns: lat and lon) with delta table v (40 complex polygons; the biggest polygon has 2,800 boundary points) on an st_contains condition (a row matches if the polygon from table v contains the point from table g):
SELECT vvr.lat
     , vvr.lon
     , vvr.shape_name AS vvr
FROM
(
    SELECT /*+ BROADCAST(v) */ v.shape_name
         , v.area
         -- keep the smallest containing polygon per point, so partition by the point only
         , MIN(v.area) OVER (PARTITION BY g.lat, g.lon) AS min_area
         , v.hole
         , g.lat
         , g.lon
    FROM v
    INNER JOIN g
        -- the polygon is the container, so it goes first in st_contains
        ON st_contains(st_geomfromwkt(v.polygon), st_point(CAST(g.lon AS DOUBLE), CAST(g.lat AS DOUBLE)))
) AS vvr
WHERE vvr.area = vvr.min_area AND vvr.hole = false;
I cancelled the job after 15 hours; it had only read ~1.5M records from table g (as observed in the SQL / DataFrame tab of the Spark UI) and was still busy with "Filtering files for query". There was also no parallelization: only 1 task was running. FYI, I am using DBR 11.3 LTS with a Standard_DS3_v2 driver and 4 Standard_DS3_v2 worker nodes.
Any suggestions on how to optimize this query? Thanks!
@rvo1994 For a point-in-polygon join like this one, you might want to try using a grid index to speed things up.
Take a look at the quickstart notebook to see how you can index your points and polygons using H3, then perform the join directly on the index: an equi-join on cell id for points that fall in so-called 'core' cells (those wholly contained within a polygon), with st_contains applied only for cells that sit on a polygon's boundary. A sketch of that pattern follows.
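Here is a minimal sketch of that pattern in Spark SQL, assuming the Mosaic grid functions (grid_tessellateexplode, grid_longlatascellid, st_geomfromwkb; exact names may differ by Mosaic version) and an H3 resolution of 8 picked arbitrarily for illustration, not tuned for your data:

WITH chips AS (
    -- explode each polygon into H3 chips; each chip carries index_id, is_core and its wkb geometry
    SELECT v.shape_name
         , v.area
         , v.hole
         , grid_tessellateexplode(st_geomfromwkt(v.polygon), 8) AS chip
    FROM v
),
points AS (
    -- index every point to its H3 cell at the same resolution
    SELECT g.lat
         , g.lon
         , grid_longlatascellid(CAST(g.lon AS DOUBLE), CAST(g.lat AS DOUBLE), 8) AS cell_id
    FROM g
)
SELECT p.lat
     , p.lon
     , c.shape_name
FROM points p
INNER JOIN chips c
    ON p.cell_id = c.chip.index_id  -- cheap, parallelizable equi-join on the cell id
WHERE c.chip.is_core  -- core cell: wholly inside the polygon, no geometry test needed
   OR st_contains(st_geomfromwkb(c.chip.wkb),  -- boundary cell: exact test against the small chip only
                  st_point(CAST(p.lon AS DOUBLE), CAST(p.lat AS DOUBLE)));

The expensive st_contains now only runs for points whose cell touches a polygon boundary, and each test is against a small chip rather than a 2,800-point polygon. The min_area / hole filtering from your original query can be layered on top of this result, and the resolution can be tuned to balance chip count against the share of boundary cells.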