forked from edzer/OGH22
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NOTES.arrow
50 lines (39 loc) · 1.56 KB
/
NOTES.arrow
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
https://observablehq.com/@kylebarron/geoarrow-and-geoparquet-in-deck-gl
https://nbviewer.org/gist/jorisvandenbossche/00e5c4a54f7b94375ccc6921c07825a0
Point or Polygon files can get very large, think:
- all fixes of a GPS collection system
- all building outlines of a city, state or country
- all agricultural parcels over EU
LIDAR data is a special topic, see e.g. COPC.io
Why not a regular database, like PostGIS?
Because:
* very large databases are hard to manage (dump, upgrade software, restore)
* databases require software to be running
* running cloud instances is more expensive that storage
* administrative burden to set it up scalable (storage, parallel)
Google BigQuery: stateless DB
* store data in big files (e.g. collections of csv)
* run query involves: read data, process, write output, quit
* only storage cost involved, and runtime of the query only
* can read & write be done faster?
More efficient than .csv:
* binary
* indexed
* columnar storage model (SAP HANA, DuckDB, R, Python, ...)
* analysis by variable, rather than by record
Parquet:
* binary, columnar storage
* indexed, compressed, chunked
* meant for storage (write once, read often)
Arrow:
* binary, columnar,
* focus on memory layout,
* meant for performance, not (primarily) for storage
* VC backed (Voltrondata)
Both are Apache projects (open source, cloud oriented)
Both have
* GEO extensions (geometry column, simple features+)
* (bleeding edge) GDAL drivers (>= 3.5), + GDAL columnar read interface
How to leverage parallel computing, distributing over
* cores
* machines