Commit: nutrient assimilation capacity indicator
Showing 4 changed files with 235 additions and 6 deletions.

data/preprocessing/nutrient_assimilation_capacity/Makefile: 25 additions & 0 deletions

```makefile
# Makefile for downloading, processing, and uploading data

# Variables
DATA_DIR=data/
checksums_dir=../../../../h3_data_importer/data_checksums
AWS_S3_BUCKET_URL=s3://landgriffon-raw-data

# Targets (none of these produce a file named after the target)
.PHONY: all unzip-limiting-nutrient process-limiting-nutrients upload_results write_checksum

all: unzip-limiting-nutrient

# First you need to download the data manually from
# https://figshare.com/articles/figure/DRP_NO3_TN_TP_rasters/14527638/1?file=31154728
# and save it in nutrient_assimilation_capacity/data
unzip-limiting-nutrient:
	unzip -q -u $(DATA_DIR)/hybas_l03_v1c_Cases.zip -d $(DATA_DIR)/

# Preprocess the data before ingesting instead of performing these calculations on the database
process-limiting-nutrients:
	python process_data.py $(DATA_DIR)/hybas_l03_v1c_Cases

upload_results:
	aws s3 cp $(DATA_DIR)/hybas_l03_v1c_Cases/nutrient_assimilation_capacity.shp $(AWS_S3_BUCKET_URL)/processed/nutrients_assimilation_capacity/

write_checksum:
	cd $(DATA_DIR)/hybas_l03_v1c_Cases && sha256sum nutrient_assimilation_capacity.shp > $(checksums_dir)/nutrient_assimilation_capacity
```
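The download itself is a manual step. If you would rather script it, a sketch along these lines may work; the `ndownloader` URL is an assumption based on figshare's usual direct-download pattern for the file id in the link above, so verify it before relying on it:

```bash
# Hypothetical download helper: fetch the archive under the name the
# Makefile expects. Verify the figshare URL before use.
mkdir -p data/
curl -L -o data/hybas_l03_v1c_Cases.zip \
    "https://figshare.com/ndownloader/files/31154728"
```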

README: 65 additions & 0 deletions

# Data Processing Pipeline README

This repository contains a data processing pipeline implemented using a Makefile and a Python script to download, preprocess, upload, and generate checksums for data files. The pipeline is designed to work with geospatial data related to nutrient assimilation capacity.

## Prerequisites

Before running the pipeline, ensure you have the following prerequisites in place:

1. **Data Download**: You need to manually download the data from [here](https://figshare.com/articles/figure/DRP_NO3_TN_TP_rasters/14527638/1?file=31154728) and save it in the `data/` directory.
2. **Python Dependencies**: The preprocessing script requires Python and the following packages (see the install sketch after this list):
   - `geopandas`
   - Other dependencies as specified in your `process_data.py` script.
3. **AWS Credentials**: To upload results to an AWS S3 bucket, you should have AWS credentials configured on your machine.
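A minimal setup sketch, assuming Python 3 and the AWS CLI are already installed (`geopandas` is the only package named here; check the imports in `process_data.py` for the full list):

```bash
# Install the preprocessing dependency
pip install geopandas

# Configure AWS credentials for the upload step (interactive prompts)
aws configure
```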

## Usage

### 1. Download and Unzip Data

After manually downloading the archive (see Prerequisites), unzip it with:

```bash
make unzip-limiting-nutrient
```

This command unzips the archive into the `data/` directory; the download itself is the manual step described above.

### 2. Preprocess Data

Before ingesting the data into your database, preprocess it using the Python script. Run the following command:

```bash
make process-limiting-nutrients
```

This command executes the `process_data.py` script, which performs the preprocessing, including reprojection to EPSG:4326 and calculation of the nutrient reduction percentages.

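For reference, the reduction percentage computed per basin in `process_data.py` has the following form, where $C$ is the concentration of the limiting nutrient and $C_{\text{target}}$ is the good-condition threshold (0.046 for phosphorus-limited basins, 0.7 for nitrogen-limited ones):

$$
\text{perc\_reduc} = \frac{C - C_{\text{target}}}{C} \times 100
$$
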
### 3. Upload Processed Data

To upload the processed data to an AWS S3 bucket, use the following command:

```bash
make upload_results
```

Make sure you have AWS credentials configured to access the specified S3 bucket.

### 4. Generate Checksum

Generate a SHA-256 checksum for the processed data by running the following command:

```bash
make write_checksum
```

This command will calculate the checksum and save it in the `data_checksums/` directory.

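To verify the shapefile against the stored checksum later, something like the following should work; `sha256sum -c` needs to run from the directory containing the shapefile because the checksum file records a relative filename (the path mirrors the Makefile's `checksums_dir`):

```bash
cd data/hybas_l03_v1c_Cases
sha256sum -c ../../../../h3_data_importer/data_checksums/nutrient_assimilation_capacity
```
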
## Configuration

You can configure the pipeline by modifying the variables at the top of the Makefile (or by overriding them per invocation, as shown after this list):

- `DATA_DIR`: Specify the directory where data files are stored.
- `checksums_dir`: Define the directory where checksum files will be saved.
- `AWS_S3_BUCKET_URL`: Set the AWS S3 bucket URL for uploading results.
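Make also accepts variable overrides on the command line, so you can redirect a single run without editing the file (hypothetical bucket name):

```bash
make upload_results AWS_S3_BUCKET_URL=s3://my-own-bucket
```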

Feel free to adapt this pipeline to suit your specific data processing needs and directory structure.

**Note**: Make sure you have the necessary permissions and access to the data sources and AWS resources mentioned in this README before running the pipeline.

data/preprocessing/nutrient_assimilation_capacity/process_data.py: 90 additions & 0 deletions

```python
""" Reads the limiting nutrients equal area vector file, reporjects the file to EPSG4326 and estimates the percentage of reduction needed to meet a good water quality conditions. | ||
Usage: | ||
process_data.py <folder> | ||
Arguments: | ||
<folder> Folder containing the limiting nutrients shapefile | ||
""" | ||
import os | ||
import logging | ||
from pathlib import Path | ||
import argparse | ||
|
||
import geopandas as gpd | ||
|
||
logging.basicConfig(level=logging.INFO) | ||
log = logging.getLogger("preprocessing_limiting_nutrients_file") | ||
|
||
def check_and_reproject_to_4326(gdf): | ||
""" | ||
Checks if a GeoDataFrame is in CRS 4326 (WGS84) and reprojects it if not. | ||
Parameters: | ||
- gdf: GeoDataFrame to check and reproject if needed. | ||
Returns: | ||
- Reprojected GeoDataFrame (if reprojected) or the original GeoDataFrame (if already in 4326). | ||
""" | ||
if gdf.crs is None or gdf.crs.to_epsg() != 4326: | ||
log.info("Reprojecting GeoDataFrame to EPSG:4326 (WGS84)...") | ||
try: | ||
# Reproject to EPSG:4326 | ||
gdf = gdf.to_crs(epsg=4326) | ||
log.info("Reprojection successful.") | ||
except: | ||
log.error("Reprojection failed with error") | ||
else: | ||
log.info("GeoDataFrame is already in EPSG:4326 (WGS84).") | ||
|
||
return gdf | ||
|
||
# Define the function to calculate perc_reduction | ||
def calculate_perc_reduction(row): | ||
if row['Cases_v2_1'] == 4 and row['TP_con_V2_']: | ||
return ((row['TP_con_V2_'] - 0.046) / row['TP_con_V2_']) * 100 | ||
elif row['Cases_v2_1'] == 2 and row['TN_con_V2_']: | ||
return ((row['TN_con_V2_'] - 0.7) / row['TN_con_V2_']) * 100 | ||
else: | ||
return 0 | ||
|
||
def process_folder(folder): | ||
vec_extensions = "gdb gpkg shp json geojson".split() | ||
path = Path(folder) | ||
vectors = [] | ||
for ext in vec_extensions: | ||
vectors.extend(path.glob(f"*.{ext}")) | ||
if not vectors: | ||
log.error(f"No vectors with extension {vec_extensions} found in {folder}") | ||
return | ||
if len(vectors) == 1: #folder just contains one vector file | ||
# Read the shapefile | ||
gdf = gpd.read_file(vectors[0]) | ||
# Check and reproject to EPSG:4326 | ||
gdf = check_and_reproject_to_4326(gdf) | ||
# Calculate perc_reduction and add it as a new column | ||
gdf['perc_reduc'] = gdf.apply(calculate_perc_reduction, axis=1) | ||
# Save the processed data to a new shapefile | ||
gdf = gdf[['Cases_v2_1', 'perc_reduc', 'geometry']] | ||
output_file = os.path.join(folder, 'nutrient_assimilation_capacity.shp') | ||
log.info(f"Saving preprocessed file to {output_file}") | ||
gdf.to_file(output_file) | ||
else: | ||
mssg = ( | ||
f"Found more than one vector file in {folder}." | ||
f" For now we only support folders with just one vector file." | ||
) | ||
logging.error(mssg) | ||
return | ||
|
||
def main(): | ||
# Parse command-line arguments | ||
parser = argparse.ArgumentParser(description="Process limiting nutrients vector files.") | ||
parser.add_argument("folder", type=str, help="Path to the folder containing vector files") | ||
args = parser.parse_args() | ||
|
||
# Process the specified folder | ||
process_folder(args.folder) | ||
|
||
if __name__ == "__main__": | ||
main() |
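Run outside of Make, the script takes the data folder as its only argument, matching the `process-limiting-nutrients` target:

```bash
python process_data.py data/hybas_l03_v1c_Cases
```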

Review comment: "You are already doing the nutrient load download, extract, and convert above. We need to remove the three duplicated commands below."