diff --git a/data/preprocessing/unsustainable_water_use/README.MD b/data/preprocessing/unsustainable_water_use/README.MD
new file mode 100644
index 0000000000..fb4f0bddc2
--- /dev/null
+++ b/data/preprocessing/unsustainable_water_use/README.MD
@@ -0,0 +1,67 @@
+# Data Processing Pipeline README
+
+This repository contains a data processing pipeline, implemented with a Makefile and a Python script, that downloads, preprocesses, and uploads data files and generates checksums for them. The pipeline is designed to work with geospatial data related to the unsustainable water use indicator.
+
+## Prerequisites
+
+Before running the pipeline, ensure you have the following prerequisites in place:
+
+
+1. **Python Dependencies**: The preprocessing script requires Python and the following Python packages:
+   - `geopandas`
+   - Other dependencies as specified in your `preprocess_data.py` script.
+
+2. **AWS Credentials**: To upload results to an AWS S3 bucket, you should have AWS credentials configured on your machine.
+
+## Usage
+
+### 1. Download and Unzip Data
+
+Use the following commands to download and unzip the data:
+
+```bash
+make download-aqueduct
+```
+```bash
+make extract-aqueduct
+```
+These commands download the data and place it in the `data/` directory.
+
+### 2. Preprocess Data
+
+Before ingesting the data into your database, preprocess it using the Python script. Run the following command:
+
+```bash
+make process-aqueduct
+```
+This command executes the `preprocess_data.py` script, which preprocesses the data, including reprojection and the calculation of excess water withdrawals.
+
+### 3. Upload Processed Data
+
+To upload the processed data to an AWS S3 bucket, use the following command:
+
+```bash
+make upload_results
+```
+Make sure you have AWS credentials configured to access the specified S3 bucket.
+
+### 4. Generate Checksum
+
+Generate a SHA-256 checksum for the processed data by running the following command:
+
+```bash
+make write_checksum
+```
+This command calculates the checksum and saves it in the `data_checksums/` directory.
+
+## Configuration
+
+You can configure the pipeline by modifying the variables at the top of the Makefile:
+
+- `DATA_DIR`: Specify the directory where data files are stored.
+- `checksums_dir`: Define the directory where checksum files will be saved.
+- `AWS_S3_BUCKET_URL`: Set the AWS S3 bucket URL for uploading results.
+
+Feel free to adapt this pipeline to suit your specific data processing needs and directory structure.
+
+**Note**: Make sure you have the necessary permissions and access to the data sources and AWS resources mentioned in this README before running the pipeline.
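
For reference, the checksum step the README describes (a SHA-256 digest of the processed data, saved under a checksums directory) could look roughly like the Python sketch below. The diff does not show how the `write_checksum` target is actually implemented, so everything here is an assumption: the function names `sha256_of_file` and `write_checksum` are hypothetical, and so is the `<filename>.sha256` output naming. The output line mirrors the `<digest>  <filename>` layout that `sha256sum` prints.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large geospatial files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(data_file, checksums_dir):
    """Write '<digest>  <filename>' to <checksums_dir>/<filename>.sha256.

    Hypothetical helper: the output file naming is an assumption, not
    something taken from the Makefile.
    """
    data_file = Path(data_file)
    checksums_dir = Path(checksums_dir)
    checksums_dir.mkdir(parents=True, exist_ok=True)
    out_path = checksums_dir / (data_file.name + ".sha256")
    out_path.write_text(f"{sha256_of_file(data_file)}  {data_file.name}\n")
    return out_path
```

Calling `write_checksum("data/result.shp", "data_checksums")` (both paths are placeholders) would write `data_checksums/result.shp.sha256`. Reading in chunks rather than all at once keeps memory use flat regardless of file size.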