Parquet-based initialisation (#1379)
* migrated the aliasing logic to Parquet

* formatting fix

* reverted the data update to use JSON again

* finalised the first steps of the Parquet-based initialisation

* updated the docs
CommanderStorm authored Aug 4, 2024
1 parent 928975e commit 5f2724a
Showing 9 changed files with 1,118 additions and 199 deletions.
15 changes: 6 additions & 9 deletions README.md
@@ -64,12 +64,7 @@ cd Navigatum
### Data Processing

In case you do not want to work on the data processing, you can instead
download the latest compiled files by running the server.

Otherwise, you can follow the steps in the [data documentation](data/README.md).

@@ -84,9 +79,11 @@ docker compose -f docker-compose.local.yml up --build
```

> [!NOTE]
> While most of the setup is simple, we need to download data (only Oberbayern is needed) for the initial setup. This takes 1-2 minutes.
> Please first bring up a [postgis](https://postgis.net/) instance (for example via `docker compose -f docker-compose.local.yml up --build`) and then run:
>
> ```bash
> wget -O data.pbf https://download.geofabrik.de/europe/germany/bayern/oberbayern-latest.osm.pbf
> docker run -it -v $(pwd):/data -e PGPASSWORD=CHANGE_ME --network="host" iboates/osm2pgsql:latest osm2pgsql --create --slim --database postgres --user postgres --host 127.0.0.1 --port 5432 /data/data.pbf --hstore --hstore-add-index --hstore-column raw
> ```
75 changes: 50 additions & 25 deletions data/README.md
@@ -9,16 +9,21 @@ This folder contains:
The code to retrieve external data, as well as the externally retrieved data itself, is located under `external`.

> [!WARNING]
> A lot of this code is more a work-in-progress than finished.
> Especially features such as POIs, custom maps or other data types such as events are drafted but not yet fully implemented.
>
> New external data might break the scripts from time to time, as either
> - rooms or buildings are removed,
> - the external data has errors,
> - or we make assumptions here that turn out to be wrong.

## Getting started

### Prerequisites

To get started, there are some system dependencies which you will need.
Please follow the [system dependencies docs](/resources/documentation/Dependencys.md) before trying to run this part of our project.

### Dependencies

@@ -63,7 +68,8 @@ python3 tumonline.py
python3 compile.py
```

The exported datasets will be stored in `output/` as [JSON](https://www.json.org/json-de.html)/[Parquet](https://wikipedia.org/wiki/Apache_Parquet) files.
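
To inspect a Parquet export, a minimal sketch (assuming `pandas` with a Parquet engine such as `pyarrow` is installed; the file name `api_data.parquet` is an assumption, check `output/` for the actual artifacts produced by `compile.py`):

```python
# Minimal sketch: inspect a compiled Parquet export.
# `output/api_data.parquet` is an assumed file name, not confirmed here.
import pandas as pd

df = pd.read_parquet("output/api_data.parquet")
print(df.columns.tolist())  # which fields were exported
print(df.head())            # a few example entries
```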

### Directory structure

@@ -92,37 +98,52 @@ data

```json
{
"entry-id": {
"id": "entry-id",
"type": "room",
... data as specified in `data-format.yaml`
},
... all other entries in the same form
"entry-id": {
"id": "entry-id",
"type": "room",
...
data
as
specified
in
`
data-format.yaml
`
},
...
all
other
entries
in
the
same
form
}
```
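
A minimal sketch of reading this structure (the file name `output/api_data.json` is an assumption):

```python
# Walk the exported entry collection: a dict keyed by entry id,
# where every value repeats its id and carries a type such as "room".
import json

with open("output/api_data.json", encoding="utf-8") as f:
    data = json.load(f)

for entry_id, entry in data.items():
    assert entry["id"] == entry_id
    print(entry_id, entry["type"])
```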

## Compilation process

The data compilation is made up of individual processing steps, where each step adds new data or modifies the current data.
The basic structure of the data, however, stays the same from the beginning and is specified in `data-format_*.yaml`.
A sketch of what a single step can look like follows the list below.

- **Step 00**: The first step reads the base root node, areas, buildings etc. from the
`sources/00_areatree` file and creates an object collection (python dictionary)
with the data format as mentioned above.
- **Steps 01-29**: Within these steps, new rooms or POIs might be added, but no
  new areas or buildings, since all areas and buildings have to be defined in the
  _areatree_. After these steps, no new entries are added to the data.
  - **Steps 0x**: Supplement the base data with extended custom data.
  - **Steps 1x**: Import rooms and building information from external sources
  - **Steps 2x**: Import POIs
- **Steps 30-89**: Later steps are intended to augment the entries with even more
  information and to ensure a consistent format. After these steps, no new (external or custom)
  information should be added to the data.
  - **Steps 3x**: Make data more coherent & structural stuff
  - **Steps 4x**: Coordinates and maps
  - **Steps 5x**: Add images
  - **Steps 6x**: -
  - **Steps 7x**: -
  - **Steps 8x**: Generate properties and sections (such as overview sections)
- **Steps 90-99**: Process and export for search.
- **Step 100**: Export final data (for use in the API). Some temporary data fields might be removed at this point.
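
To make the pattern concrete, a hypothetical step could look roughly like this (the function name and fields are illustrative, not the project's actual API):

```python
# Hypothetical sketch of one processing step (illustrative only):
# each step receives the shared entry collection (a dict keyed by
# entry id) and augments or modifies it in place.
def step_4x_assign_coordinates(data: dict) -> None:
    for entry in data.values():
        if "coords" in entry:
            continue  # an earlier step already provided coordinates
        parent = data.get(entry.get("parent", ""))
        if parent is not None and "coords" in parent:
            # fall back to the parent's (e.g. the building's) coordinates
            entry["coords"] = dict(parent["coords"])
```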

@@ -136,12 +157,16 @@ Details about the formatting are given at the head of the file.

## License

The source data (i.e. all files located in `sources/` that are not images) is made available under the Open Database License: <https://opendatacommons.org/licenses/odbl/1.0/>.
Any rights in individual contents of the database are licensed under the Database Contents License: <http://opendatacommons.org/licenses/dbcl/1.0/>.

> [!WARNING]
> The images in `sources/img/` are subject to their own licensing terms, which are stated in the file `sources/img/img-sources.yaml`.
> The compiled database may contain contents from external sources (i.e. all files in `external/`) that have different license terms.

---
