This project automates the extraction, comparison, and storage of updated/inserted OECD dataset information, leveraging APIs and structured workflows.
- Data Fetching: Downloads dataflows from the OECD API (see the sketch after this list).
- Data Comparison: Compares the new dataset with the previous instance to detect changes (inserts and deletions).
- Data Management: Saves dataset and metadata for future processing.
- Change Archival: Archives previous datasets and records changes for future reference.
- Error Logging: Logs detailed information about the execution process.
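To make the fetching step concrete, here is a minimal sketch of downloading the dataflow catalogue and flattening it to CSV. The endpoint URL, XML namespaces, and column layout are assumptions for illustration; the project's actual endpoint and file names come from config.yaml.

```python
import csv
import requests
import xml.etree.ElementTree as ET

# Assumed SDMX structure endpoint and namespaces; the project reads its real
# endpoint and file paths from config.yaml.
DATAFLOW_URL = "https://sdmx.oecd.org/public/rest/dataflow/all"
STR = "{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}"
COM = "{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common}"

def fetch_dataflows(path="data/all_dataflows_new.csv"):
    """Download the dataflow catalogue and flatten it to an id/agency/version/name CSV."""
    resp = requests.get(DATAFLOW_URL, timeout=60)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    rows = []
    for flow in root.iter(STR + "Dataflow"):
        name = flow.find(COM + "Name")
        rows.append([flow.get("id"), flow.get("agencyID"), flow.get("version"),
                     name.text if name is not None else ""])
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "agency", "version", "name"])
        writer.writerows(rows)
```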
- Functions/: Contains the core utility scripts:
  - data_fetcher.py: Fetches dataflows and saves them to CSV (all_dataflows_new.csv, all_dataflows_previous.csv). Designed for the first-time dataset download as well as subsequent downloads.
  - data_comparator.py: Compares the old (all_dataflows_previous.csv) and new (all_dataflows_new.csv) datasets and identifies changes (data_changes.csv).
  - api_downloader.py: Downloads datasets and metadata for new entries.
  - logger.py: Configures logging for the project.
- base_run.py: Initializes and manages the data fetching workflow for the first run, creating the base dataset.
- main.py: Invokes the regular workflow, including data fetching, comparison, and metadata updates. Intended to be scheduled and run at regular intervals.
- config.yaml: Configuration file with API endpoints, file names, and file paths.
- requirements.txt: Lists Python dependencies.
- linux_cron_setup.txt: Scheduling setup to run the main.py job every two weeks, on Tuesday at 7 AM, on the Linux server (see the crontab sketch after this list).
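Standard cron has no native "every two weeks" schedule, so a common workaround is a weekly Tuesday 07:00 trigger guarded by a week-parity test. The entry below is a sketch of that approach, not necessarily the contents of linux_cron_setup.txt; the project path is a placeholder, and note that % must be escaped in crontab entries.

```cron
# Fires every Tuesday at 07:00; the parity test on the epoch week number
# lets the job through only every other week.
0 7 * * 2 [ $(( $(date +\%s) / 604800 \% 2 )) -eq 0 ] && cd /path/to/project && python main.py >> logs/cron.log 2>&1
```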
Ensure the following folders exist in the project directory before running the scripts:
- logs/: Stores log files generated by the scripts.
- output/: Stores downloaded datasets and metadata files.
- data/: Stores the main dataset files (e.g., all_dataflows_new.csv, all_dataflows_previous.csv, data_changes.csv).
- data/archive/: Archives old datasets and backups (a sketch for creating these folders follows this list).
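A minimal sketch for creating this layout programmatically, matching the folders listed above:

```python
from pathlib import Path

# Create the expected folder layout; exist_ok makes reruns harmless.
for folder in ("logs", "output", "data", "data/archive"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```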
- Initial Setup:
  - Run base_run.py to fetch and save the baseline dataset:
    python base_run.py
  - This needs to run only once; it creates the first version of the dataset (all_dataflows_previous.csv) and sets up the workspace for subsequent runs (see the sketch below).
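As a rough sketch of what the baseline run amounts to (the import path and function name are hypothetical, not the actual API of data_fetcher.py):

```python
# Hypothetical sketch of base_run.py; names are assumptions for illustration.
from Functions.data_fetcher import fetch_dataflows  # assumed helper

def main():
    # On the very first run, the fresh catalogue becomes the "previous"
    # snapshot that later main.py runs will compare against.
    fetch_dataflows(path="data/all_dataflows_previous.csv")

if __name__ == "__main__":
    main()
```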
- Regular Workflow:
  - Run main.py for periodic execution:
    python main.py
  - Once moved to the Linux server, this job is scheduled to run every two weeks, on Tuesday at 7 AM.
  - This will:
    - Fetch the latest datasets from the OECD API.
    - Compare the new dataset with the existing one to identify changes (see the pandas sketch after this list).
    - Save detected changes to data_changes.csv.
    - Download additional data and metadata for new records.
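The comparison step is essentially a set difference on a key column. A minimal pandas sketch, assuming a column named id uniquely identifies each dataflow (the real key and change categories in data_comparator.py may differ):

```python
import pandas as pd

def compare(prev_path="data/all_dataflows_previous.csv",
            new_path="data/all_dataflows_new.csv",
            out_path="data/data_changes.csv"):
    """Flag rows present only in the new file (inserts) or only in the old (deletions)."""
    prev = pd.read_csv(prev_path)
    new = pd.read_csv(new_path)
    # indicator=True adds a _merge column telling us which side each row came from.
    merged = new.merge(prev, on="id", how="outer",
                       indicator=True, suffixes=("", "_prev"))
    changes = merged[merged["_merge"] != "both"].copy()
    changes["change"] = changes["_merge"].map(
        {"left_only": "insert", "right_only": "delete"})
    changes.drop(columns="_merge").to_csv(out_path, index=False)
    return changes
```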
- Output:
  - Logs: Found in the logs/ folder (see the logging sketch after this list).
  - Change Summary: Saved as data_changes.csv in the data/ directory.
  - Downloaded Data and Metadata: Saved in the output/ directory.
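logger.py most likely wires Python's standard logging module to the logs/ folder; a minimal sketch of that kind of setup (the file name and format are assumptions):

```python
import logging
from pathlib import Path

def setup_logger(name="oecd_pipeline"):
    """Log to both logs/run.log and the console."""
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[logging.FileHandler("logs/run.log", encoding="utf-8"),
                  logging.StreamHandler()],
    )
    return logging.getLogger(name)
```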
For a detailed explanation of each script, its role, and how they work together, refer to the Confluence page.