Feature/update table #94

Merged
merged 25 commits into from
Nov 22, 2024
768ad48
Refactor h3_utils to use h3ronpy
zacdezgeo Oct 3, 2024
4283cc8
Update lib.py for use of h3ronpy
zacdezgeo Oct 3, 2024
66e3aea
Update ValueError for invalid fields
zacdezgeo Oct 3, 2024
1507512
Add documentation on bug
zacdezgeo Oct 3, 2024
a32650d
Fix issue with Point generation
zacdezgeo Oct 3, 2024
536d36f
Integrate release from h3ronpy that solves cells_to_wkb_points
zacdezgeo Oct 4, 2024
5299454
Remove looping on geometries
zacdezgeo Oct 4, 2024
ac4fdd3
Update user documentation
zacdezgeo Oct 4, 2024
0206fe1
Merge branch 'main' into feature/h3ronpy
zacdezgeo Oct 7, 2024
9e8c7d7
Update user doc notebooks (#76)
zacdezgeo Oct 11, 2024
3f722e6
Update ingest mechanics
zacdezgeo Nov 5, 2024
7e59db3
Update ingestion logic to support update via a new file
zacdezgeo Nov 5, 2024
b2896bc
wip on updating data
zacdezgeo Nov 5, 2024
232046f
Use arrow for merging data
zacdezgeo Nov 5, 2024
525fab3
Refacto approach to using database for join
zacdezgeo Nov 6, 2024
958262e
Add check for existing column names and update docker database settings
zacdezgeo Nov 6, 2024
36c1843
Add rollback logic if update fails and more tests
zacdezgeo Nov 6, 2024
e8fe907
Update database documentation for testing with update process
zacdezgeo Nov 6, 2024
ad87bf4
Add notebook dependency group and update python library example
zacdezgeo Nov 15, 2024
886493f
Merge pull request #89 from worldbank/bug/88
bpstewar Nov 15, 2024
c23af15
Update notebook based on nbqa
zacdezgeo Nov 21, 2024
798b7fe
Remove unused dependencies in core
zacdezgeo Nov 21, 2024
a87b8a9
Merge branch 'feature/h3ronpy' into feature/update-table
zacdezgeo Nov 21, 2024
d05bc2a
Update poetry lock
zacdezgeo Nov 21, 2024
484e3f4
Merge main
zacdezgeo Nov 22, 2024
24 changes: 19 additions & 5 deletions docker-compose.yaml
@@ -2,9 +2,6 @@ version: '3'

services:
database:
# at time of writing this, ARM64 is not supported so we make sure to use
# a supported platform: https://github.com/postgis/docker-postgis/issues/216
# Could possibly switch to https://github.com/vincentsarago/containers
platform: linux/amd64
image: postgis/postgis:15-3.4
environment:
@@ -13,6 +10,23 @@ services:
- POSTGRES_DB=postgis
ports:
- 5439:5432
command: postgres -N 500
command: >
postgres -N 500
-c checkpoint_timeout=30min
-c synchronous_commit=off
-c max_wal_senders=0
-c max_connections=8
-c shared_buffers=2GB
-c effective_cache_size=6GB
-c maintenance_work_mem=512MB
-c checkpoint_completion_target=0.9
-c wal_buffers=16MB
-c default_statistics_target=100
-c random_page_cost=1.1
-c effective_io_concurrency=200
-c work_mem=256MB
-c huge_pages=off
-c min_wal_size=1GB
-c max_wal_size=4GB
volumes:
- ./.pgdata:/var/lib/postgresql/data
- ./.pgdata:/var/lib/postgresql/data
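A note on the memory settings above: `work_mem` is a per-operation limit, so each connection can use at least that much for a single sort or hash, and a complex query may use it several times over. The sketch below parses a subset of the `-c` flags and computes that lower-bound estimate. The helper names (`parse_c_flags`, `to_bytes`) are hypothetical and not part of this PR. Note also that `-N 500` and `-c max_connections=8` set the same parameter; the sketch reads only the `-c` form, and running `SHOW max_connections;` against the container would confirm which value the server actually uses.

```python
import shlex

# Subset of the flags from the compose file above.
COMMAND = """
postgres -N 500
  -c max_connections=8
  -c shared_buffers=2GB
  -c effective_cache_size=6GB
  -c maintenance_work_mem=512MB
  -c work_mem=256MB
"""

# PostgreSQL memory units are powers of 1024.
UNIT_BYTES = {"kB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_c_flags(command):
    """Return the `-c name=value` pairs of a postgres command line as a dict."""
    tokens = shlex.split(command)
    settings = {}
    for flag, argument in zip(tokens, tokens[1:]):
        if flag == "-c":
            name, _, value = argument.partition("=")
            settings[name] = value
    return settings


def to_bytes(value):
    """Convert a value such as '256MB' to bytes; bare numbers pass through."""
    for suffix, factor in UNIT_BYTES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


settings = parse_c_flags(COMMAND)
connections = int(settings["max_connections"])
work_mem = to_bytes(settings["work_mem"])

# One work_mem-sized operation per connection; real queries may use several.
estimate = connections * work_mem
print(f"{connections} connections x {settings['work_mem']} = {estimate / 1024**3:.1f} GB")
```

With the values in this compose file, that estimate is the same size as `shared_buffers`, which is worth keeping in mind when sizing the Docker host.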
50 changes: 29 additions & 21 deletions docs/acceptance/db.md
@@ -54,32 +54,15 @@ You can use the CLI tool for data ingestion. First, ensure you have the required
poetry install
```

To download the Parquet file from S3 and load it into the database, run the following command:
To load a Parquet file into the database, run the following command:

```bash
poetry run space2stats-ingest download-and-load \
"s3://<bucket>/space2stats.parquet" \
poetry run space2stats-ingest load \
"postgresql://username:password@localhost:5439/postgres" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
"<item_path>" \
"local.parquet"
```
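
If the command is assembled from a script, one thing to watch is that a password containing characters such as `@` or `/` must be percent-encoded inside a `postgresql://` URI. The following is a minimal sketch, assuming the argument order shown above (connection string, item path, Parquet file). The helper names are hypothetical and not part of `space2stats-ingest`.

```python
from urllib.parse import quote


def build_connection_string(user, password, host="localhost", port=5439, database="postgres"):
    """Build a postgresql:// URI, percent-encoding the user and password."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def build_load_command(connection_string, item_path, parquet_file):
    """Return the CLI invocation as an argument list (suitable for subprocess.run)."""
    return [
        "poetry", "run", "space2stats-ingest", "load",
        connection_string, item_path, parquet_file,
    ]


connection_string = build_connection_string("username", "p@ss/word")
# "<item_path>" is the placeholder from the documentation, left unfilled here.
command = build_load_command(connection_string, "<item_path>", "local.parquet")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell-quoting problems with the connection string.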

Alternatively, you can run the `download` and `load` commands separately:

1. **Download the Parquet file**:
```bash
poetry run space2stats-ingest download "s3://<bucket>/space2stats.parquet" --local-path "local.parquet"
```

2. **Load the Parquet file into the database**:
```bash
poetry run space2stats-ingest download-and-load \
"s3://<bucket>/space2stats.parquet" \
"postgresql://username:password@localhost:5439/postgres" \
"<path>/space2stats.json" \
--parquet-file "local.parquet"
```

### Database Configuration

Once connected to the database via `psql` or a PostgreSQL client (e.g., `pgAdmin`), execute the following SQL command to create an index on the `space2stats` table:
@@ -110,3 +93,28 @@ SELECT sum_pop_2020 FROM space2stats WHERE hex_id IN ('86beabd8fffffff', '86beab
### Conclusion

Ensure all steps are followed to verify the ETL process, database setup, and data ingestion pipeline. Reach out to the development team for any further assistance or troubleshooting.


#### Updating test

- Spin up database with docker:
```
docker-compose up
```
- Download initial dataset:
```
aws s3 cp s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet .
download: s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet to ./space2stats.parquet
```
- Upload initial dataset:
```
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_population_2020.json space2stats.parquet
```
- Generate second dataset:
```
python space2stats_ingest/METADATA/generate_test_data.py
```
- Upload second dataset:
```
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_reupload_test.json space2stats_test.parquet
```
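
The commit messages in this PR describe the update flow as a join done in the database, a check for existing column names, and a rollback if the update fails. The sketch below illustrates that general pattern only. It is not the project's implementation: it uses SQLite from the standard library in place of PostgreSQL so that it runs without a server, and the function name `add_columns` is hypothetical.

```python
import sqlite3


def add_columns(conn, new_rows, new_columns):
    """Add new_columns to space2stats, matching rows on hex_id.

    new_rows: tuples of (hex_id, value_1, ..., value_n)
    new_columns: trusted column names, one per value
    """
    # Refuse to overwrite columns that are already in the table.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(space2stats)")}
    clashes = existing.intersection(new_columns)
    if clashes:
        raise ValueError(f"Columns already exist: {sorted(clashes)}")

    # Schema change and data update share one transaction.
    conn.execute("BEGIN")
    try:
        for position, name in enumerate(new_columns, start=1):
            conn.execute(f'ALTER TABLE space2stats ADD COLUMN "{name}" REAL')
            conn.executemany(
                f'UPDATE space2stats SET "{name}" = ? WHERE hex_id = ?',
                [(row[position], row[0]) for row in new_rows],
            )
        conn.execute("COMMIT")
    except Exception:
        # Any failure leaves the table exactly as it was.
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the explicit BEGIN/COMMIT/ROLLBACK above take effect.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE space2stats (hex_id TEXT PRIMARY KEY, sum_pop_2020 REAL)")
conn.executemany(
    "INSERT INTO space2stats VALUES (?, ?)",
    [("86beabd8fffffff", 10.0), ("hypothetical_hex_id", 20.0)],
)

add_columns(
    conn,
    [("86beabd8fffffff", 0.25), ("hypothetical_hex_id", 0.75)],
    ["test_column"],
)
print(conn.execute("SELECT * FROM space2stats ORDER BY hex_id").fetchall())
```

Uploading the same column twice raises `ValueError` before any change is made, and a failure midway through the update leaves the table without the partially added column.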
302 changes: 151 additions & 151 deletions space2stats_api/src/poetry.lock

Large diffs are not rendered by default.

@@ -0,0 +1,23 @@
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Load the original Parquet file
input_file = "space2stats.parquet"
table = pq.read_table(input_file)

# Select only the 'hex_id' column
table = table.select(["hex_id"])

# Create the new 'test_column' with random values
num_rows = table.num_rows
test_column = pa.array(np.random.random(size=num_rows), type=pa.float64())

# Add 'test_column' to the table
table = table.append_column("test_column", test_column)

# Save the modified table to a new Parquet file
output_file = "space2stats_test.parquet"
pq.write_table(table, output_file)

print(f"Modified Parquet file saved as {output_file}")