
Performance Tuning Docs #415

Closed

Conversation

mjohns-databricks (Contributor)

Initial tuning docs for Mosaic.


codecov bot commented Jul 5, 2023

Codecov Report

Merging #415 (5eff086) into main (19f2e9f) will increase coverage by 0.01%.
Report is 160 commits behind head on main.
The diff coverage is 97.50%.

❗ Current head 5eff086 differs from pull request most recent head a9aed81. Consider uploading reports for the commit a9aed81 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
+ Coverage   95.01%   95.02%   +0.01%     
==========================================
  Files         198      198              
  Lines        5658     5692      +34     
  Branches      178      175       -3     
==========================================
+ Hits         5376     5409      +33     
- Misses        282      283       +1     
Files Coverage Δ
...abricks/labs/mosaic/core/index/H3IndexSystem.scala 97.59% <97.50%> (-0.37%) ⬇️

milos-colic (Contributor) left a comment


I had a few minor suggestions and a couple of questions.
Nothing major.

Data Engineering
################

It is important for customers to have Data Engineering (DE) tables tuned for performance. Mosaic has been designed to operate directly on standard interchange formats, which may appear in various DataFrame columns; currently WKT, WKB, and GeoJSON are supported, with WKB offering the best properties for storage and querying. Further, while capabilities are always advancing, Mosaic does not expose geometry types as defined in `OGC Simple Features for SQL <https://www.ogc.org/standard/sfs/>`__, making investments in DE all the more beneficial. Additionally, for best tuning, pre-standardize all your geospatial data to the same SRID; for most applications this should be 4326, see `CRS Docs <https://databrickslabs.github.io/mosaic/usage/grid-indexes-bng.html#coordinate-reference-system>`__. *As of the Mosaic 0.3 series, customers need to prepare tables for performant queries, though we are exploring options to improve this experience going forward.*
milos-colic (Contributor)


@mjohns-databricks We highlight here that WKB should be used (WKT/WKB do not carry SRIDs) while also mentioning conversion to 4326. I would avoid that framing since it isn't really true in general; it only makes sense because of H3. SRIDs are only supported in GeoJSON, so a format conversion is required in the current version.
We can easily add support in the future for handling spatial reference projections for WKT and WKB if both source and destination SRIDs are provided.

mjohns-databricks (Contributor Author)


I am trying to convey that for standardized DE, customers want to consider SRIDs. What we often see is Shapefiles or GDBs that need conversion so all data layers line up with other data, e.g. with GeoPandas:

  1. Customers might invoke GeoDataFrame.to_crs(crs=4326)[[source]](https://github.com/geopandas/geopandas/blob/main/geopandas/geodataframe.py#L1344-L1427) to standardize
  2. Then customers store the geometry column in Delta Lake as WKT or WKB (agreed, no SRID there)

We can add a small callout for BNG if you like, with H3 (which uses 4326) as the primary path. Also, bear in mind this is for DE practices that allow spatial SQL queries to be ready elsewhere in our platform (DBR, eventually DBSQL). If you have a recommended sentence or two to better clarify, please call it out.
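
A minimal sketch of that flow, for illustration only (path, table, and column names are made up; assumes a notebook where `spark` is available):

```py
import geopandas as gpd

# 1. Read a source layer (e.g. Shapefile/GDB) and standardize the CRS to 4326
gdf = gpd.read_file("/dbfs/tmp/parcels.shp").to_crs(crs=4326)

# 2. Serialize geometries to WKB (bytes) so they survive the trip into Spark
gdf["geom_wkb"] = gdf.geometry.apply(lambda g: g.wkb)
pdf = gdf.drop(columns="geometry")

# 3. Store as a Delta table with the geometry as a binary WKB column
spark.createDataFrame(pdf).write.format("delta").saveAsTable("geo_catalog.parcels_raw")
```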

milos-colic (Contributor)


Reprojection is a very costly operation and introduces conversion errors, so I would only advise it when that conversion is absolutely needed. I am just conscious that people coming from a strong GIS background will lose trust in the tool if we openly argue that 4326 is where all data should sit.
To your point, we have BNG and the Custom Grid, so for smaller estates this may not be necessary at all.
I am on board with this being the main advice, but we should list the caveats openly; otherwise it will be incomplete advice, since in some cases it really depends on other factors.

docs/source/usage/performance-tuning.rst (outdated; resolved)

1. Call Mosaic's `grid_tessellateexplode <https://databrickslabs.github.io/mosaic/api/spatial-indexing.html#grid-tessellateexplode>`__ on your geometries at a given resolution. In addition to exploding to one row per cell, this adds an ``index`` column: a struct with the fields ``is_core``, ``index_id``, and ``wkb``. You essentially now have the result of an H3 cover of the geometries, knowledge of which cells are fully contained within each geometry, and WKB chips for the boundaries as well as the core cells. This is the fundamental call for performance.
2. Call something like ``select index.*, * except(index)`` to pull the struct fields into top-level columns, listing them first so they are part of the table statistics. You can be more selective based on your data, e.g. the WKB chips do not need to be in the statistics.
3. Save the table as Delta Lake and call ``optimize <table or delta path> zorder (index_id)`` to get a proper layout.
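
For instance, a rough Python sketch of steps 1-3 (table and column names are placeholders; assumes Mosaic has been enabled so its functions are registered):

.. code-block:: py

    from pyspark.sql import functions as F

    # 1. tessellate at resolution 9: one row per cell plus the `index` struct
    tessellated = df.select("*", F.expr("grid_tessellateexplode(geom_wkb, 9) AS index"))

    # 2. promote the struct fields to top-level columns, listed first so they
    #    land in the table statistics, then drop the original struct
    flattened = tessellated.select("index.*", "*").drop("index")

    # 3. write to Delta Lake and Z-ORDER on the cell id
    flattened.write.format("delta").saveAsTable("geo_catalog.boundaries_tessellated")
    spark.sql("OPTIMIZE geo_catalog.boundaries_tessellated ZORDER BY (index_id)")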
milos-colic (Contributor)


Maybe we add a sentence on timestamp columns here as well for proper spatio-temporal considerations: if you Z-ORDER by cell + timestamp, in that order, you get better data fetching when constructing a timeseries for each cell.

mjohns-databricks (Contributor Author)


That is a good callout that often comes up! Another point I want to emphasize is that H3 cell ids need to be numbers, not strings (not a problem with Mosaic calls, but customers need to be aware of the performance difference and standardize on numbers everywhere).
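
For instance (table and column names made up), the combined guidance could look like:

```py
# `index_id` comes back from Mosaic as a numeric (BIGINT) H3 cell id; keep it
# numeric everywhere rather than converting to hex strings.
# Z-ORDER by cell id first, then timestamp, so reconstructing a timeseries
# for a given cell touches as few files as possible.
spark.sql("OPTIMIZE geo_catalog.events ZORDER BY (index_id, event_ts)")
```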


(b) when you are interested in precise results, test boundary chip information

.. code-block:: sql
milos-colic (Contributor)


Can we add python examples as well?

mjohns-databricks (Contributor Author)


yes, ok.
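
e.g. a rough Python equivalent of the boundary-chip test in (b), with placeholder table and column names:

```py
# Accept core cells outright; only run the exact ST_Contains test against the
# boundary chips, where the geometry actually crosses the cell.
matched = spark.sql("""
  SELECT p.*, t.poly_id
  FROM points p
  JOIN boundaries_tessellated t
    ON p.cell_id = t.index_id
  WHERE t.is_core OR st_contains(t.wkb, p.geom_wkb)
""")
```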


sql(f"""OPTIMIZE {tbl_fqn} ZORDER BY ({cell_col})""")

This pattern lets us level out the polygon "worst case" maximum areas, e.g. resolution 5 and then down to resolution 9 for large-area boundaries. Notice that only ``is_core`` is checked at resolution 5, which is cheap, while resolution 9 has both ``is_core`` and ``st_contains`` in the clause. That query pattern then looks like (showing SQL):
milos-colic (Contributor)


This needs a bit more explanation of why we use only the core match at 5. I can see in the code that you filter out cells based on the is_core flag, but to a first-time reader this may not be obvious.

mjohns-databricks (Contributor Author)


I can expand a little more here.
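
For example, an illustrative (not verbatim) sketch of the two-resolution pattern, assuming the resolution 9 tessellation only covers the areas that were not already core at resolution 5:

```py
# Resolution 5: cells flagged is_core are fully inside the polygon, so the
# cheap cell-id equi-join is already exact for most of the area.
# Resolution 9: only the remaining boundary region is tessellated finer, and
# there we need is_core OR the exact ST_Contains check against the chip.
hits = spark.sql("""
  SELECT p.* FROM points p
  JOIN boundaries_res5 c5 ON p.cell_id_5 = c5.index_id AND c5.is_core
  UNION ALL
  SELECT p.* FROM points p
  JOIN boundaries_res9 c9 ON p.cell_id_9 = c9.index_id
  WHERE c9.is_core OR st_contains(c9.wkb, p.geom_wkb)
""")
```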

Tuning Tips
###########

Spark `Adaptive Query Execution (AQE) <https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution>`__ is tuned for data-heavy processing where each row has a cost. It likes those rows to be even and has a hard time reasoning about the hidden compute costs incurred when ``ST_`` functions, e.g. ``ST_Contains``, are invoked. For more control, AQE can be turned off completely:
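
For example, the standard Spark switch for this:

.. code-block:: py

    spark.conf.set("spark.sql.adaptive.enabled", False) # <- default is True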
milos-colic (Contributor)


I would say here "each row has the same cost to process"

mjohns-databricks (Contributor Author)


Sure, will add. I think the next sentence is a clear expansion: "It likes those rows to be even and has a hard time reasoning about hidden compute costs..."


.. code-block:: py

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False
milos-colic (Contributor)


I would say this should always be false for spatial use cases; it should only sometimes be turned back on, and only if both tables are point tables.

mjohns-databricks (Contributor Author) commented Jul 7, 2023


It depends on the query pattern. Setting it to False can slow down queries, but it can also help, e.g. when you are relying on shuffle partitions alone to tune your workload.

milos-colic (Contributor)


This is the config that has personally given me the most pain. I almost exclusively turn it off and haven't seen many cases where I needed to turn it back on (except when data is over-fragmented on disk).


spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True) # <- default is True, may want False
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", False) # <- default is True (respect size)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "24MB") # <- default is 64MB
milos-colic (Contributor)


I would go even lower here, but it is very hard to say what the correct number is.

mjohns-databricks (Contributor Author) commented Jul 7, 2023


I have experimented with other values (16MB on the lower end, but also 36MB and 48MB) and have frankly landed on 24MB as a good goldilocks zone that works for both smaller and larger data.


.. code-block:: py

spark.conf.set("spark.sql.shuffle.partitions", 256) # <- default is 200
milos-colic (Contributor)


I think 256 is too close to 200 to demonstrate the point; I would go with 500 or something like that here.

mjohns-databricks (Contributor Author)


Good point. I pulled this from some ongoing auto-tuning work, where too many shuffle partitions can significantly affect query times for smaller data, but it makes sense to bump it up to 512 to make clear that it is a real adjustment.
