
Performance Tuning Docs #415

Closed

Conversation

mjohns-databricks (Contributor)

Initial tuning docs for Mosaic.


codecov bot commented Jul 5, 2023

Codecov Report

Merging #415 (5eff086) into main (19f2e9f) will increase coverage by 0.01%.
Report is 160 commits behind head on main.
The diff coverage is 97.50%.

❗ Current head 5eff086 differs from pull request most recent head a9aed81. Consider uploading reports for the commit a9aed81 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
+ Coverage   95.01%   95.02%   +0.01%     
==========================================
  Files         198      198              
  Lines        5658     5692      +34     
  Branches      178      175       -3     
==========================================
+ Hits         5376     5409      +33     
- Misses        282      283       +1     
Files Coverage Δ
...abricks/labs/mosaic/core/index/H3IndexSystem.scala 97.59% <97.50%> (-0.37%) ⬇️

milos-colic (Contributor) left a comment


I had a few minor suggestions and a couple of questions.
Nothing major.

Data Engineering
################

It is important for customers to have Data Engineering (DE) tables tuned for performance. Mosaic has been designed to operate directly on standard interchange formats, which may appear in various DataFrame columns; currently WKT, WKB, and GeoJSON are supported, with WKB offering the best properties for storage and querying. Further, while capabilities are always advancing, Mosaic does not expose geometry types as defined in `OGC Simple Features for SQL <https://www.ogc.org/standard/sfs/>`__, making investments in DE all the more beneficial. Additionally, for best tuning, pre-standardize all your geospatial data to the same SRID; for most applications this should be 4326, see `CRS Docs <https://databrickslabs.github.io/mosaic/usage/grid-indexes-bng.html#coordinate-reference-system>`__. *As of the Mosaic 0.3 series, customers need to prepare tables for performant queries, though we are exploring options to improve this experience going forward.*
milos-colic (Contributor)


@mjohns-databricks We highlight here that WKB should be used (WKT/WKB do not carry SRIDs) while also mentioning conversion to 4326. I would avoid that framing since it isn't really true in general; it only makes sense because of H3. SRIDs are only supported in GeoJSON, so a format conversion is required in the current version.
We can easily add support in the future for handling spatial reference projections for WKT and WKB if both source and destination SRIDs are provided.

mjohns-databricks (Contributor Author)


I am trying to convey that for standardized DE, customers want to consider SRIDs. What we often see is Shapefiles or GDBs that need conversion so all data layers line up with other data, e.g. with GeoPandas:

  1. Customers might invoke GeoDataFrame.to_crs(crs=4326)[[source]](https://github.com/geopandas/geopandas/blob/main/geopandas/geodataframe.py#L1344-L1427) to standardize
  2. Then customers store the geometry column in Delta Lake as WKT or WKB (agreed, no SRID there)

We can add a small callout for BNG if you like, with H3 (which uses 4326) as the primary path. Also, bear in mind this is for DE practices that allow spatial SQL queries to be ready elsewhere in our platform (DBR, eventually DBSQL). If you have a recommended sentence or two to better clarify, please call it out.
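
A minimal sketch of that flow, for illustration only (path, table, and column names are made up; assumes a notebook where `spark` is available):

```py
import geopandas as gpd

# 1. Read a source layer (e.g. Shapefile/GDB) and standardize the CRS to 4326
gdf = gpd.read_file("/dbfs/tmp/parcels.shp").to_crs(crs=4326)

# 2. Serialize geometries to WKB (bytes) so they survive the trip into Spark
gdf["geom_wkb"] = gdf.geometry.apply(lambda g: g.wkb)
pdf = gdf.drop(columns="geometry")

# 3. Store as a Delta table with the geometry as a binary WKB column
spark.createDataFrame(pdf).write.format("delta").saveAsTable("geo_catalog.parcels_raw")
```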

milos-colic (Contributor)


Reprojection is a very costly operation and introduces conversion errors, so I would only advise it when that conversion is absolutely needed. I am just conscious that people coming from a strong GIS background will lose trust in the tool if we openly argue that 4326 is where all data should sit.
To your point, we have BNG and the Custom Grid, so for smaller estates this may not be necessary at all.
I am on board with this being the main advice, but we should list the caveats openly; otherwise it will be incomplete advice, since in some cases it really depends on other factors.

docs/source/usage/performance-tuning.rst (outdated; resolved)

1. Call Mosaic's `grid_tessellateexplode <https://databrickslabs.github.io/mosaic/api/spatial-indexing.html#grid-tessellateexplode>`__ on your geometries at a given resolution. In addition to exploding to one row per cell, this adds an ``index`` column: a struct with the fields ``is_core``, ``index_id``, and ``wkb``. You essentially now have the result of an H3 cover of the geometries, knowledge of which cells are fully contained within each geometry, and WKB chips for the boundaries as well as the core cells. This is the fundamental call for performance.
2. Call something like ``select index.*, * except(index)`` to pull the struct fields into top-level columns, listing them first so they are part of the table statistics. You can be more selective based on your data, e.g. the WKB chips do not need to be in the statistics.
3. Save the table as Delta Lake and call ``optimize <table or delta path> zorder (index_id)`` to get a proper layout.
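
For instance, a rough Python sketch of steps 1-3 (table and column names are placeholders; assumes Mosaic has been enabled so its functions are registered):

.. code-block:: py

    from pyspark.sql import functions as F

    # 1. tessellate at resolution 9: one row per cell plus the `index` struct
    tessellated = df.select("*", F.expr("grid_tessellateexplode(geom_wkb, 9) AS index"))

    # 2. promote the struct fields to top-level columns, listed first so they
    #    land in the table statistics, then drop the original struct
    flattened = tessellated.select("index.*", "*").drop("index")

    # 3. write to Delta Lake and Z-ORDER on the cell id
    flattened.write.format("delta").saveAsTable("geo_catalog.boundaries_tessellated")
    spark.sql("OPTIMIZE geo_catalog.boundaries_tessellated ZORDER BY (index_id)")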
milos-colic (Contributor)


Maybe we add a sentence on timestamp columns here as well for proper spatio-temporal considerations: if you Z-ORDER by cell + timestamp, in that order, you get better data fetching when constructing a timeseries for each cell.

mjohns-databricks (Contributor Author)


That is a good callout that often comes up! Another point I want to emphasize is that H3 cell ids need to be numbers, not strings (not a problem with Mosaic calls, but customers need to be aware of the performance difference and standardize on numbers everywhere).
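
For instance (table and column names made up), the combined guidance could look like:

```py
# `index_id` comes back from Mosaic as a numeric (BIGINT) H3 cell id; keep it
# numeric everywhere rather than converting to hex strings.
# Z-ORDER by cell id first, then timestamp, so reconstructing a timeseries
# for a given cell touches as few files as possible.
spark.sql("OPTIMIZE geo_catalog.events ZORDER BY (index_id, event_ts)")
```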


(b) when you are interested in precise results, test boundary chip information

.. code-block:: sql
milos-colic (Contributor)


Can we add python examples as well?

mjohns-databricks (Contributor Author)


yes, ok.
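
e.g. a rough Python equivalent of the boundary-chip test in (b), with placeholder table and column names:

```py
# Accept core cells outright; only run the exact ST_Contains test against the
# boundary chips, where the geometry actually crosses the cell.
matched = spark.sql("""
  SELECT p.*, t.poly_id
  FROM points p
  JOIN boundaries_tessellated t
    ON p.cell_id = t.index_id
  WHERE t.is_core OR st_contains(t.wkb, p.geom_wkb)
""")
```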


sql(f"""OPTIMIZE {tbl_fqn} ZORDER BY ({cell_col})""")

This pattern lets us level out the polygon "worst case" maximum areas, e.g. resolution 5 and then down to resolution 9 for large-area boundaries. Notice that only ``is_core`` is checked at resolution 5, which is cheap, while resolution 9 has both ``is_core`` and ``st_contains`` in the clause. That query pattern then looks like (showing SQL):
milos-colic (Contributor)


This needs a bit more explanation of why we use only the core match at 5. I can see in the code that you filter out cells based on the is_core flag, but to a first-time reader this may not be obvious.

mjohns-databricks (Contributor Author)


I can expand a little more here.
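
For example, an illustrative (not verbatim) sketch of the two-resolution pattern, assuming the resolution 9 tessellation only covers the areas that were not already core at resolution 5:

```py
# Resolution 5: cells flagged is_core are fully inside the polygon, so the
# cheap cell-id equi-join is already exact for most of the area.
# Resolution 9: only the remaining boundary region is tessellated finer, and
# there we need is_core OR the exact ST_Contains check against the chip.
hits = spark.sql("""
  SELECT p.* FROM points p
  JOIN boundaries_res5 c5 ON p.cell_id_5 = c5.index_id AND c5.is_core
  UNION ALL
  SELECT p.* FROM points p
  JOIN boundaries_res9 c9 ON p.cell_id_9 = c9.index_id
  WHERE c9.is_core OR st_contains(c9.wkb, p.geom_wkb)
""")
```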

Tuning Tips
###########

Spark `Adaptive Query Execution (AQE) <https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution>`__ is tuned for data-heavy processing where each row has a cost. It likes those rows to be even and has a hard time reasoning about the hidden compute costs incurred when ``ST_`` functions, e.g. ``ST_Contains``, are invoked. For more control, AQE can be turned off completely:
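
For example, the standard Spark switch for this:

.. code-block:: py

    spark.conf.set("spark.sql.adaptive.enabled", False) # <- default is True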
milos-colic (Contributor)


I would say here "each row has the same cost to process"

mjohns-databricks (Contributor Author)


Sure, will add. I think the next sentence is a clear expansion: "It likes those rows to be even and has a hard time reasoning about hidden compute costs..."


.. code-block:: py

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False
milos-colic (Contributor)


I would say this should always be false for spatial use cases; it should only sometimes be turned back on, and only if both tables are point tables.

mjohns-databricks (Contributor Author) commented Jul 7, 2023


It depends on the query pattern. Setting it to False can slow down queries, but it can also help, e.g. when you are relying on shuffle partitions alone to tune your workload.

milos-colic (Contributor)


This is the config that has personally given me the most pain. I almost exclusively turn it off and haven't seen many cases where I needed to turn it back on (except when data is over-fragmented on disk).


spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", False) # <- default is True (respect size)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "24MB") # <- default is 64MB
milos-colic (Contributor)


I would go even lower here, but it is very hard to say what the correct number is.

mjohns-databricks (Contributor Author) commented Jul 7, 2023


I have experimented with other values (16MB on the lower end, but also 36MB and 48MB) and have frankly landed on 24MB as a good goldilocks zone that works for both smaller and larger data.


.. code-block:: py

spark.conf.set("spark.sql.shuffle.partitions", 256) # <- default is 200
milos-colic (Contributor)


I think 256 is too close to 200 to demonstrate the point; I would go with 500 or something like that here.

mjohns-databricks (Contributor Author)


Good point. I pulled this from some ongoing auto-tuning work, where too many shuffle partitions can significantly affect query times for smaller data, but it makes sense to bump it up to 512 to make clear that it is a real adjustment.
