Add readme for preprocessing data

Vizzuality · Sep 27, 2023 · 6312058
1 parent 45b7238
commit 6312058
Showing 1 changed file with 67 additions and 0 deletions.
diff --git a/data/preprocessing/unsustainable_water_use/README.MD b/data/preprocessing/unsustainable_water_use/README.MD
@@ -0,0 +1,67 @@
+# Data Processing Pipeline README
+
+This directory contains a data processing pipeline, implemented with a Makefile and a Python script, that downloads, preprocesses, and uploads data files and generates checksums for them. The pipeline is designed to work with geospatial data for the unsustainable water use indicator.
+
+## Prerequisites
+
+Before running the pipeline, ensure you have the following prerequisites in place:
+
+1. **Python Dependencies**: The preprocessing script requires Python and the following Python packages:
+   - `geopandas`
+   - Other dependencies as specified in your `preprocess_data.py` script.
+
+2. **AWS Credentials**: To upload results to an AWS S3 bucket, you need AWS credentials configured on your machine.
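
The prerequisites above can be checked before running any `make` target. The following is a minimal sketch, not part of the pipeline itself: the helper names `missing_packages` and `aws_credentials_present` are hypothetical, and the credentials check is only a heuristic that looks at the `AWS_ACCESS_KEY_ID` environment variable and the default `~/.aws/credentials` file. Credentials supplied another way (for example SSO or an instance role) will not be detected.

```python
import importlib.util
import os
from pathlib import Path


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def aws_credentials_present():
    """Heuristic: environment variable set, or default shared credentials file exists."""
    if os.environ.get("AWS_ACCESS_KEY_ID"):
        return True
    return (Path.home() / ".aws" / "credentials").is_file()


if __name__ == "__main__":
    # geopandas is the only package this README names explicitly.
    missing = missing_packages(["geopandas"])
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    if not aws_credentials_present():
        print("No AWS credentials found in the environment or ~/.aws/credentials")
```

Extend the package list with whatever `preprocess_data.py` actually imports.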
+
+## Usage
+
+### 1. Download and Unzip Data
+
+Use the following commands to download and unzip the data:
+
+```bash
+make download-aqueduct
+```
+```bash
+make extract-aqueduct
+```
+These commands download the data archive and extract it into the `data/` directory.
+
+### 2. Preprocess Data
+
+Before ingesting the data into your database, preprocess it using the Python script. Run the following command:
+
+```bash
+make process-aqueduct
+```
+This command runs the `preprocess_data.py` script, which preprocesses the data, including reprojection and calculation of excess water withdrawals.
+
+### 3. Upload Processed Data
+
+To upload the processed data to an AWS S3 bucket, use the following command:
+
+```bash
+make upload_results
+```
+Make sure you have AWS credentials configured to access the specified S3 bucket.
+
+### 4. Generate Checksum
+
+Generate a SHA-256 checksum for the processed data by running the following command:
+
+```bash
+make write_checksum
+```
+This command calculates the checksum and saves it in the `data_checksums/` directory.
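
To verify a checksum independently of the Makefile, the same digest can be computed with Python's standard library. This is a sketch under assumptions: the function names are hypothetical, and the output layout (`<filename>.sha256` containing the digest followed by the file name) is an illustrative choice that may differ from what `make write_checksum` actually writes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(data_file, checksums_dir):
    """Write '<digest>  <filename>' to <checksums_dir>/<filename>.sha256 (assumed layout)."""
    data_file = Path(data_file)
    out_dir = Path(checksums_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (data_file.name + ".sha256")
    out_path.write_text(f"{sha256_of_file(data_file)}  {data_file.name}\n")
    return out_path
```

Comparing the value returned by `sha256_of_file` with the stored checksum shows whether a processed file has changed since the checksum was written.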
+
+## Configuration
+
+You can configure the pipeline by modifying the variables at the top of the Makefile:
+
+- `DATA_DIR`: Specify the directory where data files are stored.
+- `checksums_dir`: Define the directory where checksum files will be saved.
+- `AWS_S3_BUCKET_URL`: Set the AWS S3 bucket URL for uploading results.
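
The variables above can also be overridden per run without editing the Makefile, because `make` accepts `NAME=value` arguments on the command line and these take precedence over ordinary assignments in the Makefile. The sketch below builds such invocations from Python; `make_command` and `run_pipeline` are hypothetical helpers, the `/tmp/aqueduct` path is a placeholder, and it assumes the Makefile does not protect its variables with the `override` directive.

```python
import subprocess


def make_command(target, **overrides):
    """Build a make invocation; NAME=value arguments override Makefile variables."""
    return ["make", target] + [f"{name}={value}" for name, value in overrides.items()]


def run_pipeline(steps, dry_run=True, **overrides):
    """Run each target in order, stopping at the first failure; dry_run only prints."""
    for target in steps:
        command = make_command(target, **overrides)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Targets are taken from the Usage section; the path is a placeholder.
    run_pipeline(
        ["download-aqueduct", "extract-aqueduct", "process-aqueduct"],
        DATA_DIR="/tmp/aqueduct",
    )
```

With `dry_run=True` the commands are only printed, which is a cheap way to confirm the overrides before running the real steps.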
+
+Feel free to adapt this pipeline to suit your specific data processing needs and directory structure.
+
+**Note**: Make sure you have the necessary permissions and access to the data sources and AWS resources mentioned in this README before running the pipeline.