# Data Processing Pipeline README

This repository contains a data processing pipeline, implemented with a Makefile and a Python script, that downloads, preprocesses, and uploads data files and generates checksums for them. The pipeline is designed to work with geospatial data for the unsustainable water use indicator.

## Prerequisites

Before running the pipeline, make sure the following prerequisites are in place:

1. **Python Dependencies**: The preprocessing script requires Python and the following Python packages:
   - `geopandas`
   - Any other dependencies imported by the `preprocess_data.py` script.

2. **AWS Credentials**: To upload results to an AWS S3 bucket, AWS credentials must be configured on your machine.

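The two prerequisites above can be checked before running any Makefile target. The following is a minimal sketch, not part of the repository: the helper names `has_package` and `has_aws_credentials` are hypothetical, and the credential check only covers the standard environment variables and the shared credentials file, so other sources (for example SSO or instance roles) are not detected.

```python
import importlib.util
import os
from pathlib import Path


def has_package(name: str) -> bool:
    """Return True if the named top-level package can be located."""
    return importlib.util.find_spec(name) is not None


def has_aws_credentials(env=None, home=None) -> bool:
    """Return True if AWS credentials appear to be configured.

    Assumption: only the standard environment variables and the shared
    credentials file (default ~/.aws/credentials) are checked.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return True
    cred_file = env.get("AWS_SHARED_CREDENTIALS_FILE") or home / ".aws" / "credentials"
    return Path(cred_file).is_file()


if __name__ == "__main__":
    print("geopandas installed:", has_package("geopandas"))
    print("AWS credentials found:", has_aws_credentials())
```

Passing `env` and `home` explicitly makes the check easy to exercise without touching the real environment.
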
## Usage

### 1. Download and Unzip Data

Use the following commands to download and unzip the data:

```bash
make download-aqueduct
make extract-aqueduct
```
These commands download the data, unzip it, and place it in the `data/` directory.

### 2. Preprocess Data

Before ingesting the data into your database, preprocess it with the Python script by running:

```bash
make process-aqueduct
```
This command runs the `preprocess_data.py` script, which preprocesses the data, including reprojecting it and calculating the excess of water withdrawals.

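To illustrate the reprojection step, here is a minimal sketch using `geopandas`. It is not the contents of `preprocess_data.py`: the function name `reproject` and the target CRS `EPSG:4326` are assumptions chosen for the example, and the excess withdrawal calculation is left out because this README does not describe its formula.

```python
import geopandas as gpd
from shapely.geometry import Point


def reproject(gdf: gpd.GeoDataFrame, target_crs: str = "EPSG:4326") -> gpd.GeoDataFrame:
    """Return a copy of gdf with its geometries transformed to target_crs."""
    if gdf.crs is None:
        # to_crs cannot transform data whose source CRS is unknown.
        raise ValueError("Input data has no CRS; set one before reprojecting.")
    return gdf.to_crs(target_crs)


if __name__ == "__main__":
    # Tiny in-memory example: one point at the Web Mercator origin.
    gdf = gpd.GeoDataFrame(
        {"name": ["origin"]}, geometry=[Point(0, 0)], crs="EPSG:3857"
    )
    print(reproject(gdf).crs)
```

In a real run the input would come from `gpd.read_file(...)` on the extracted data rather than from an in-memory point.
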
### 3. Upload Processed Data

To upload the processed data to an AWS S3 bucket, use the following command:

```bash
make upload_results
```
Make sure you have AWS credentials configured to access the specified S3 bucket.

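This README does not show how the Makefile performs the upload. One common approach is the AWS CLI's `aws s3 cp` command; the sketch below builds and runs such a command from Python. The function names, the file name `results.gpkg`, and the bucket URL are hypothetical and only serve the example.

```python
import subprocess
from pathlib import Path


def build_upload_command(local_path, bucket_url: str) -> list:
    """Build an `aws s3 cp` command that copies local_path under bucket_url."""
    if not bucket_url.startswith("s3://"):
        raise ValueError("bucket_url must start with s3://")
    # Keep the local file name as the object name under the bucket prefix.
    destination = bucket_url.rstrip("/") + "/" + Path(local_path).name
    return ["aws", "s3", "cp", str(local_path), destination]


def upload(local_path, bucket_url: str) -> None:
    """Run the upload; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(build_upload_command(local_path, bucket_url), check=True)


if __name__ == "__main__":
    # Hypothetical file and bucket, printed rather than executed.
    print(build_upload_command("data/results.gpkg", "s3://example-bucket/results/"))
```

Separating command construction from execution lets the destination path be inspected before anything is sent to S3.
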
### 4. Generate Checksum

Generate a SHA-256 checksum for the processed data by running the following command:

```bash
make write_checksum
```
This command calculates the checksum and saves it in the `data_checksums/` directory.

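For reference, an equivalent checksum can be computed in Python with the standard library. This is a sketch under assumptions, not the Makefile's implementation: the function names and the `.sha256` file suffix are hypothetical, and the output line follows the `<digest>  <file name>` layout printed by `sha256sum`.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat for large data files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(data_file, checksums_dir="data_checksums") -> Path:
    """Write '<digest>  <file name>' to checksums_dir and return the path."""
    data_file = Path(data_file)
    out_dir = Path(checksums_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (data_file.name + ".sha256")
    out_path.write_text(f"{sha256_of_file(data_file)}  {data_file.name}\n")
    return out_path
```

Comparing the stored digest with a freshly computed one is a quick way to confirm that a downloaded or uploaded copy of the data is intact.
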
## Configuration

You can configure the pipeline by modifying the variables at the top of the Makefile:

- `DATA_DIR`: The directory where data files are stored.
- `checksums_dir`: The directory where checksum files are saved.
- `AWS_S3_BUCKET_URL`: The AWS S3 bucket URL that results are uploaded to.

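With GNU Make, variables like these can usually also be overridden per run by passing `NAME=value` on the command line, unless the Makefile sets them with the `override` directive. The sketch below wraps that in Python; the helper names and the example paths are hypothetical.

```python
import subprocess


def make_command(target: str, **overrides: str) -> list:
    """Build a `make` invocation with NAME=value variable overrides."""
    assignments = [f"{name}={value}" for name, value in overrides.items()]
    return ["make", target, *assignments]


def run_target(target: str, **overrides: str) -> None:
    """Run a Makefile target; raises CalledProcessError if make fails."""
    subprocess.run(make_command(target, **overrides), check=True)


if __name__ == "__main__":
    # Hypothetical override values, printed rather than executed.
    print(make_command("write_checksum", DATA_DIR="/tmp/data", checksums_dir="/tmp/sums"))
```

This leaves the Makefile untouched, which helps when the same checkout is used with several data directories.
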
Feel free to adapt this pipeline to suit your specific data processing needs and directory structure.

**Note**: Make sure you have the necessary permissions and access to the data sources and AWS resources mentioned in this README before running the pipeline.