Initial geoparquet support #888

Merged
merged 24 commits into from
May 22, 2024

Conversation

msbarry
Contributor

@msbarry msbarry commented May 19, 2024

Add initial geoparquet support to planetiler for reading datasets like overture maps.

Planetiler will attempt to read geoparquet metadata from the "geo" file metadata field to determine which field contains the default geometry for each feature and how to deserialize it (including geoarrow geometries). If that's missing, it will fall back to a geometry, wkb_geometry, or wkt_geometry field (similar to gdal).
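The lookup order described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the class and method names are hypothetical, and it assumes the primary geometry column has already been extracted from the "geo" metadata (or is absent).

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of planetiler's API.
final class GeometryColumnResolver {
  // Fallback column names from the description, tried in the order listed there.
  static final List<String> FALLBACK_COLUMNS = List.of("geometry", "wkb_geometry", "wkt_geometry");

  /**
   * Returns the column to read geometries from: the one named by the "geo" metadata
   * when present, otherwise the first fallback name that exists in the file schema.
   */
  static Optional<String> resolve(Optional<String> columnFromGeoMetadata, Set<String> schemaColumns) {
    if (columnFromGeoMetadata.isPresent()) {
      return columnFromGeoMetadata;
    }
    return FALLBACK_COLUMNS.stream().filter(schemaColumns::contains).findFirst();
  }
}
```

An empty result means no geometry column could be identified; what a reader does in that case is left open here.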

Parquet supports structured attributes like maps and lists. For now the SourceFeature API is unchanged, so feature.getTag(name) may return a structured value such as a Map<String, List<Object>>. A future PR will add a more convenient API for working with structured tags.
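Until that API exists, a profile would need to unpack nested values itself. A minimal sketch of one way to do that, assuming the tag value is built from plain java.util.Map and java.util.List instances; the NestedTags helper and the tag shape in the usage note are hypothetical, not something this PR provides:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for walking a nested tag value, not part of planetiler's API.
final class NestedTags {
  /**
   * Follows a path of map keys (String) and list indexes (Integer) into a nested value.
   * Returns null when a step does not match the shape of the data.
   */
  static Object get(Object value, Object... path) {
    Object current = value;
    for (Object step : path) {
      if (current instanceof Map<?, ?> map && step instanceof String key) {
        current = map.get(key);
      } else if (current instanceof List<?> list && step instanceof Integer index
        && index >= 0 && index < list.size()) {
        current = list.get(index);
      } else {
        return null;
      }
    }
    return current;
  }
}
```

For example, if a tag held a map with a "names" list of maps, NestedTags.get(tag, "names", 0, "value") would return the "value" entry of the first element, or null if any level is missing.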

The --bounds bbox argument gets converted to a push-down predicate that lets planetiler avoid reading entire files, row groups, and records that fall outside the bounding box. For example, since overture data is sorted roughly geographically, if you specify a bounding box for a city like Boston, planetiler can select and process all the features in less than 5 seconds.
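The idea behind the push-down can be illustrated with a simplified sketch. It assumes each row group exposes a bounding box derived from its column statistics; the Bounds and RowGroup types are hypothetical stand-ins rather than the classes used in this PR, and the same intersection test would apply at the file and record level.

```java
import java.util.List;

// Hypothetical types illustrating bbox-based skipping, not planetiler's implementation.
final class BboxPushdown {
  record Bounds(double minX, double minY, double maxX, double maxY) {
    /** True when the two boxes overlap or touch. */
    boolean intersects(Bounds other) {
      return minX <= other.maxX && maxX >= other.minX
        && minY <= other.maxY && maxY >= other.minY;
    }
  }

  /** A row group together with the extent of the geometries it contains. */
  record RowGroup(String name, Bounds extent) {}

  /** Keeps only the row groups whose extent can contain features inside the query box. */
  static List<RowGroup> rowGroupsToRead(List<RowGroup> groups, Bounds query) {
    return groups.stream().filter(group -> group.extent().intersects(query)).toList();
  }
}
```

Because the data is roughly sorted geographically, most row groups have a small extent, so a city-sized query box rules out nearly all of them before any records are decoded.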

The apache java parquet reader is tightly coupled to the rest of hadoop and cannot easily be used on its own (see https://issues.apache.org/jira/browse/PARQUET-1126). To avoid pulling in many MB of dependencies, this PR uses the parquet-floor project, which uses a minimal set of dependencies and stubs out the rest, so the jar size only goes up from 70 to 84 MB.

Planned for followup PRs:

  • add better support for reading structured properties. For now feature.getTag can return a nested map or list on parquet features
  • automatically download geoparquet files
  • add overture-specific shortcut utilities on top of default parquet support
  • add putAttrBetween utility for breaking a linestring up into segments based on attributes that apply only to part of the line
  • make the example overture profile more full-featured
  • let users specify row and column filters for a geoparquet source
  • let users send elements from a geoparquet source to a specific method on a profile
  • geoarrow improvements: construct fewer intermediate objects while parsing, and make the bbox predicate use raw x/y values


github-actions bot commented May 19, 2024

This Branch cd1bff2 Base bcaee68
0:01:09 DEB [archive] - Tile stats:
0:01:09 DEB [archive] - Biggest tiles (gzipped)
1. 14/4942/6092 (154k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.40015 (poi:83k)
2. 9/154/190 (149k) https://onthegomap.github.io/planetiler-demo/#9.5/41.77078/-71.36719 (landcover:85k)
3. 10/308/380 (138k) https://onthegomap.github.io/planetiler-demo/#10.5/41.90214/-71.54297 (landcover:66k)
4. 10/308/381 (136k) https://onthegomap.github.io/planetiler-demo/#10.5/41.63994/-71.54297 (landcover:72k)
5. 14/4941/6092 (111k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.42212 (poi:64k)
6. 14/4941/6093 (110k) https://onthegomap.github.io/planetiler-demo/#14.5/41.81227/-71.42212 (building:62k)
7. 14/4940/6092 (99k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.44409 (building:92k)
8. 11/616/762 (98k) https://onthegomap.github.io/planetiler-demo/#11.5/41.7057/-71.63086 (landcover:71k)
9. 14/4942/6091 (96k) https://onthegomap.github.io/planetiler-demo/#14.5/41.84501/-71.40015 (building:79k)
10. 11/616/761 (96k) https://onthegomap.github.io/planetiler-demo/#11.5/41.83679/-71.63086 (landcover:72k)
0:01:09 DEB [archive] - Max tile sizes
                      z0    z1    z2    z3    z4    z5    z6    z7    z8    z9   z10   z11   z12   z13   z14   all
           boundary  154   374   443   583   938   339   433   548   773  1.6k  2.1k  7.2k  6.4k  5.8k  4.5k  7.2k
              water 7.7k  3.7k  8.6k  5.5k  2.6k  5.1k   15k   18k   16k   26k   15k   13k   17k   15k   12k   26k
              place    0     0   441   441   441   639   712    1k  1.5k  3.1k  5.6k  3.3k  1.7k   795   936  5.6k
            landuse    0     0     0     0   548   694  1.6k  6.8k   17k   44k   59k   50k   38k   19k   12k   59k
     transportation    0     0     0     0   243   782  1.2k  5.9k    8k   24k   17k   19k   65k   48k   33k   65k
           waterway    0     0     0     0   111   118     0     0     0  3.1k  2.4k  2.1k  2.1k  4.9k  2.4k  4.9k
               park    0     0     0     0     0     0    1k  3.7k  9.7k   19k   13k  8.2k  4.3k  3.4k  4.4k   19k
transportation_name    0     0     0     0     0     0   369   464  1.2k  1.8k  5.4k  4.6k  3.9k  3.4k   18k   18k
          landcover    0     0     0     0     0     0     0  9.5k   29k   85k   72k   81k   53k   30k   24k   85k
      mountain_peak    0     0     0     0     0     0     0  1.1k  1.8k  3.4k  4.3k  2.8k  1.4k  1.4k   869  4.3k
         water_name    0     0     0     0     0     0     0     0     0   486   461   433   452  1.2k  1.5k  1.5k
    aerodrome_label    0     0     0     0     0     0     0     0     0     0   664   327   273   220   220   664
            aeroway    0     0     0     0     0     0     0     0     0     0  1.6k  2.1k    3k  3.4k  2.7k  3.4k
                poi    0     0     0     0     0     0     0     0     0     0     0     0   501   498   83k   83k
           building    0     0     0     0     0     0     0     0     0     0     0     0     0   59k   92k   92k
        housenumber    0     0     0     0     0     0     0     0     0     0     0     0     0     0   35k   35k
          full tile 7.9k    4k  9.5k  6.5k  3.7k    6k   20k   42k   85k  203k  185k  135k  114k  128k  244k  244k
            gzipped 6.2k  3.5k  7.1k  5.2k  3.1k  4.8k   14k   29k   60k  149k  138k   98k   83k   91k  154k  154k
0:01:09 DEB [archive] -    Max tile: 244k (gzipped: 154k)
0:01:09 DEB [archive] -    Avg tile: 5.4k (gzipped: 4k) using weighted average based on OSM traffic
0:01:09 DEB [archive] -     # tiles: 4,115,012
0:01:09 DEB [archive] -  # features: 5,484,360
0:01:09 INF [archive] - Finished in 19s cpu:1m8s avg:3.7
0:01:09 INF [archive] -   read    1x(3% 0.6s wait:17s done:1s)
0:01:09 INF [archive] -   encode  4x(55% 10s wait:2s done:1s)
0:01:09 INF [archive] -   write   1x(22% 4s wait:12s done:1s)
0:01:09 INF [archive] - Finished in 1m10s cpu:3m30s gc:1s avg:3
0:01:09 INF [archive] - FINISHED!
0:01:09 INF [archive] - 
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - data errors:
0:01:09 INF [archive] - 	render_snap_fix_input	16,639
0:01:09 INF [archive] - 	osm_multipolygon_missing_way	389
0:01:09 INF [archive] - 	osm_boundary_missing_way	73
0:01:09 INF [archive] - 	merge_snap_fix_input	12
0:01:09 INF [archive] - 	osm_boundary_duplicate_member	2
0:01:09 INF [archive] - 	feature_centroid_if_convex_osm_invalid_multipolygon_empty_after_fix	2
0:01:09 INF [archive] - 	feature_polygon_osm_invalid_multipolygon_empty_after_fix	2
0:01:09 INF [archive] - 	omt_park_area_osm_invalid_multipolygon_empty_after_fix	1
0:01:09 INF [archive] - 	omt_fix_water_before_ne_intersect	1
0:01:09 INF [archive] - 	feature_point_on_surface_osm_invalid_multipolygon_empty_after_fix	1
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - 	overall          1m10s cpu:3m30s gc:1s avg:3
0:01:09 INF [archive] - 	lake_centerlines 3s cpu:6s avg:1.9
0:01:09 INF [archive] - 	  read     1x(14% 0.5s done:3s)
0:01:09 INF [archive] - 	  process  4x(0% 0s done:3s)
0:01:09 INF [archive] - 	  write    1x(0% 0s done:3s)
0:01:09 INF [archive] - 	water_polygons   15s cpu:39s avg:2.7
0:01:09 INF [archive] - 	  read     1x(42% 6s done:7s)
0:01:09 INF [archive] - 	  process  4x(25% 4s wait:4s done:5s)
0:01:09 INF [archive] - 	  write    1x(4% 0.5s wait:9s done:5s)
0:01:09 INF [archive] - 	natural_earth    12s cpu:18s avg:1.5
0:01:09 INF [archive] - 	  read     1x(52% 6s done:6s)
0:01:09 INF [archive] - 	  process  4x(7% 0.8s wait:6s done:6s)
0:01:09 INF [archive] - 	  write    1x(0% 0s wait:6s done:6s)
0:01:09 INF [archive] - 	osm_pass1        2s cpu:6s avg:3.2
0:01:09 INF [archive] - 	  read     1x(2% 0s wait:2s)
0:01:09 INF [archive] - 	  parse    4x(35% 0.6s)
0:01:09 INF [archive] - 	  process  1x(67% 1s)
0:01:09 INF [archive] - 	osm_pass2        17s cpu:1m7s avg:3.9
0:01:09 INF [archive] - 	  read     1x(0% 0s wait:10s done:7s)
0:01:09 INF [archive] - 	  process  4x(76% 13s)
0:01:09 INF [archive] - 	  write    1x(3% 0.4s wait:17s)
0:01:09 INF [archive] - 	ne_lakes         0s cpu:0s avg:14.6
0:01:09 INF [archive] - 	boundaries       0s cpu:0s avg:1.3
0:01:09 INF [archive] - 	agg_stop         0s cpu:0s avg:0
0:01:09 INF [archive] - 	sort             1s cpu:4s avg:2.7
0:01:09 INF [archive] - 	  worker  1x(49% 0.7s)
0:01:09 INF [archive] - 	archive          19s cpu:1m8s avg:3.7
0:01:09 INF [archive] - 	  read    1x(3% 0.6s wait:17s done:1s)
0:01:09 INF [archive] - 	  encode  4x(55% 10s wait:2s done:1s)
0:01:09 INF [archive] - 	  write   1x(22% 4s wait:12s done:1s)
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - 	archive	108MB
0:01:09 INF [archive] - 	features	281MB
0:01:03 DEB [archive] - Tile stats:
0:01:03 DEB [archive] - Biggest tiles (gzipped)
1. 14/4942/6092 (154k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.40015 (poi:83k)
2. 9/154/190 (149k) https://onthegomap.github.io/planetiler-demo/#9.5/41.77078/-71.36719 (landcover:85k)
3. 10/308/380 (138k) https://onthegomap.github.io/planetiler-demo/#10.5/41.90214/-71.54297 (landcover:66k)
4. 10/308/381 (136k) https://onthegomap.github.io/planetiler-demo/#10.5/41.63994/-71.54297 (landcover:72k)
5. 14/4941/6092 (111k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.42212 (poi:64k)
6. 14/4941/6093 (110k) https://onthegomap.github.io/planetiler-demo/#14.5/41.81227/-71.42212 (building:62k)
7. 14/4940/6092 (99k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.44409 (building:92k)
8. 11/616/762 (98k) https://onthegomap.github.io/planetiler-demo/#11.5/41.7057/-71.63086 (landcover:71k)
9. 14/4942/6091 (96k) https://onthegomap.github.io/planetiler-demo/#14.5/41.84501/-71.40015 (building:79k)
10. 11/616/761 (96k) https://onthegomap.github.io/planetiler-demo/#11.5/41.83679/-71.63086 (landcover:72k)
0:01:03 DEB [archive] - Max tile sizes
                      z0    z1    z2    z3    z4    z5    z6    z7    z8    z9   z10   z11   z12   z13   z14   all
           boundary  154   374   443   583   938   339   433   548   773  1.6k  2.1k  7.2k  6.4k  5.8k  4.5k  7.2k
              water 7.7k  3.7k  8.6k  5.5k  2.6k  5.1k   15k   18k   16k   26k   15k   13k   17k   15k   12k   26k
              place    0     0   441   441   441   639   712    1k  1.5k  3.1k  5.6k  3.3k  1.7k   795   936  5.6k
            landuse    0     0     0     0   548   694  1.6k  6.8k   17k   44k   59k   50k   38k   19k   12k   59k
     transportation    0     0     0     0   243   782  1.2k  5.9k    8k   24k   17k   19k   65k   48k   33k   65k
           waterway    0     0     0     0   111   118     0     0     0  3.1k  2.4k  2.1k  2.1k  4.9k  2.4k  4.9k
               park    0     0     0     0     0     0    1k  3.7k  9.7k   19k   13k  8.2k  4.3k  3.4k  4.4k   19k
transportation_name    0     0     0     0     0     0   369   464  1.2k  1.8k  5.4k  4.6k  3.9k  3.4k   18k   18k
          landcover    0     0     0     0     0     0     0  9.5k   29k   85k   72k   81k   53k   30k   24k   85k
      mountain_peak    0     0     0     0     0     0     0  1.1k  1.8k  3.4k  4.3k  2.8k  1.4k  1.4k   869  4.3k
         water_name    0     0     0     0     0     0     0     0     0   486   461   433   452  1.2k  1.5k  1.5k
    aerodrome_label    0     0     0     0     0     0     0     0     0     0   664   327   273   220   220   664
            aeroway    0     0     0     0     0     0     0     0     0     0  1.6k  2.1k    3k  3.4k  2.7k  3.4k
                poi    0     0     0     0     0     0     0     0     0     0     0     0   501   498   83k   83k
           building    0     0     0     0     0     0     0     0     0     0     0     0     0   59k   92k   92k
        housenumber    0     0     0     0     0     0     0     0     0     0     0     0     0     0   35k   35k
          full tile 7.9k    4k  9.5k  6.5k  3.7k    6k   20k   42k   85k  203k  185k  135k  114k  128k  244k  244k
            gzipped 6.2k  3.5k  7.1k  5.2k  3.1k  4.8k   14k   29k   60k  149k  138k   98k   83k   91k  154k  154k
0:01:03 DEB [archive] -    Max tile: 244k (gzipped: 154k)
0:01:03 DEB [archive] -    Avg tile: 5.4k (gzipped: 4k) using weighted average based on OSM traffic
0:01:03 DEB [archive] -     # tiles: 4,115,012
0:01:03 DEB [archive] -  # features: 5,484,360
0:01:03 INF [archive] - Finished in 18s cpu:1m7s avg:3.6
0:01:03 INF [archive] -   read    1x(3% 0.6s wait:17s done:1s)
0:01:03 INF [archive] -   encode  4x(55% 10s wait:2s done:1s)
0:01:03 INF [archive] -   write   1x(22% 4s wait:12s done:1s)
0:01:03 INF - Finished in 1m3s cpu:3m23s gc:1s avg:3.2
0:01:03 INF - FINISHED!
0:01:03 INF - 
0:01:03 INF - ----------------------------------------
0:01:03 INF - data errors:
0:01:03 INF - 	render_snap_fix_input	16,639
0:01:03 INF - 	osm_multipolygon_missing_way	389
0:01:03 INF - 	osm_boundary_missing_way	73
0:01:03 INF - 	merge_snap_fix_input	12
0:01:03 INF - 	osm_boundary_duplicate_member	2
0:01:03 INF - 	feature_centroid_if_convex_osm_invalid_multipolygon_empty_after_fix	2
0:01:03 INF - 	feature_polygon_osm_invalid_multipolygon_empty_after_fix	2
0:01:03 INF - 	omt_park_area_osm_invalid_multipolygon_empty_after_fix	1
0:01:03 INF - 	omt_fix_water_before_ne_intersect	1
0:01:03 INF - 	feature_point_on_surface_osm_invalid_multipolygon_empty_after_fix	1
0:01:03 INF - ----------------------------------------
0:01:03 INF - 	overall          1m3s cpu:3m23s gc:1s avg:3.2
0:01:03 INF - 	lake_centerlines 2s cpu:5s avg:2.3
0:01:03 INF - 	  read     1x(20% 0.5s done:2s)
0:01:03 INF - 	  process  4x(0% 0s done:2s)
0:01:03 INF - 	  write    1x(0% 0s done:2s)
0:01:03 INF - 	water_polygons   15s cpu:39s avg:2.7
0:01:03 INF - 	  read     1x(43% 6s done:7s)
0:01:03 INF - 	  process  4x(26% 4s wait:4s done:5s)
0:01:03 INF - 	  write    1x(4% 0.5s wait:9s done:5s)
0:01:03 INF - 	natural_earth    6s cpu:12s avg:1.9
0:01:03 INF - 	  read     1x(95% 6s)
0:01:03 INF - 	  process  4x(13% 0.8s wait:6s)
0:01:03 INF - 	  write    1x(0% 0s wait:6s)
0:01:03 INF - 	osm_pass1        2s cpu:7s avg:3.3
0:01:03 INF - 	  read     1x(2% 0s wait:2s)
0:01:03 INF - 	  parse    4x(32% 0.6s wait:1s)
0:01:03 INF - 	  process  1x(70% 1s)
0:01:03 INF - 	osm_pass2        17s cpu:1m9s avg:3.9
0:01:03 INF - 	  read     1x(0% 0s wait:10s done:8s)
0:01:03 INF - 	  process  4x(74% 13s)
0:01:03 INF - 	  write    1x(2% 0.4s wait:17s)
0:01:03 INF - 	ne_lakes         0s cpu:0s avg:0
0:01:03 INF - 	boundaries       0s cpu:0s avg:2.8
0:01:03 INF - 	agg_stop         0s cpu:0s avg:0
0:01:03 INF - 	sort             1s cpu:3s avg:2.5
0:01:03 INF - 	  worker  1x(54% 0.7s)
0:01:03 INF - 	archive          18s cpu:1m7s avg:3.6
0:01:03 INF - 	  read    1x(3% 0.6s wait:17s done:1s)
0:01:03 INF - 	  encode  4x(55% 10s wait:2s done:1s)
0:01:03 INF - 	  write   1x(22% 4s wait:12s done:1s)
0:01:03 INF - ----------------------------------------
0:01:03 INF - 	archive	108MB
0:01:03 INF - 	features	281MB

Full logs: https://github.com/onthegomap/planetiler/actions/runs/9189106518

@msbarry msbarry linked an issue May 22, 2024 that may be closed by this pull request
@msbarry msbarry removed a link to an issue May 22, 2024

@msbarry msbarry merged commit fb1d0e3 into main May 22, 2024
11 of 12 checks passed
@msbarry msbarry deleted the geoparquet branch May 22, 2024 09:56