This project automates the extraction, comparison, and storage of updated/inserted OECD dataset information, leveraging APIs and structured workflows.
- Data Fetching: Downloads dataflows from the OECD API (see the sketch after this list).
- Data Comparison: Compares the new dataset with the previous instance to detect changes (inserts and deletions).
- Data Management: Saves dataset and metadata for future processing.
- Change Archival: Archives previous datasets and records changes for future reference.
- Error Logging: Logs detailed information about the execution process.
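To make the fetching step concrete, here is a minimal sketch of downloading the dataflow catalogue and flattening it to CSV. The endpoint URL, XML namespaces, and column layout are assumptions for illustration; the project's actual endpoint and file names come from config.yaml.

```python
import csv
import requests
import xml.etree.ElementTree as ET

# Assumed SDMX structure endpoint and namespaces; the project reads its real
# endpoint and file paths from config.yaml.
DATAFLOW_URL = "https://sdmx.oecd.org/public/rest/dataflow/all"
STR = "{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}"
COM = "{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common}"

def fetch_dataflows(path="data/all_dataflows_new.csv"):
    """Download the dataflow catalogue and flatten it to an id/agency/version/name CSV."""
    resp = requests.get(DATAFLOW_URL, timeout=60)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    rows = []
    for flow in root.iter(STR + "Dataflow"):
        name = flow.find(COM + "Name")
        rows.append([flow.get("id"), flow.get("agencyID"), flow.get("version"),
                     name.text if name is not None else ""])
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "agency", "version", "name"])
        writer.writerows(rows)
```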
- Functions/: Contains the core utility scripts:
  - data_fetcher.py: Fetches dataflows and saves them to CSV (all_dataflows_new.csv, all_dataflows_previous.csv). Designed for the first-time dataset download as well as subsequent downloads.
  - data_comparator.py: Compares the old (all_dataflows_previous.csv) and new (all_dataflows_new.csv) datasets and identifies changes (data_changes.csv).
  - api_downloader.py: Downloads datasets and metadata for new entries.
  - logger.py: Configures logging for the project.
- base_run.py: Initializes and manages the data fetching workflow for the first run, creating the base dataset.
- main.py: Invokes the regular workflow, including data fetching, comparison, and metadata updates. Intended to be scheduled and run at regular intervals.
- config.yaml: Configuration file with API endpoints, file names, and file paths.
- requirements.txt: Lists Python dependencies.
- linux_cron_setup.txt: Scheduling setup to run the main.py job every two weeks, on Tuesday at 7 AM, on the Linux server (see the crontab sketch after this list).
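Standard cron has no native "every two weeks" schedule, so a common workaround is a weekly Tuesday 07:00 trigger guarded by a week-parity test. The entry below is a sketch of that approach, not necessarily the contents of linux_cron_setup.txt; the project path is a placeholder, and note that % must be escaped in crontab entries.

```cron
# Fires every Tuesday at 07:00; the parity test on the epoch week number
# lets the job through only every other week.
0 7 * * 2 [ $(( $(date +\%s) / 604800 \% 2 )) -eq 0 ] && cd /path/to/project && python main.py >> logs/cron.log 2>&1
```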
Ensure the following folders exist in the project directory before running the scripts:
- logs/: Stores log files generated by the scripts.
- output/: Stores downloaded datasets and metadata files.
- data/: Stores the main dataset files (e.g., all_dataflows_new.csv, all_dataflows_previous.csv, data_changes.csv).
- data/archive/: Archives old datasets and backups (a sketch for creating these folders follows this list).
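A minimal sketch for creating this layout programmatically, matching the folders listed above:

```python
from pathlib import Path

# Create the expected folder layout; exist_ok makes reruns harmless.
for folder in ("logs", "output", "data", "data/archive"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```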
- Initial Setup:
  - Run base_run.py to fetch and save the baseline dataset:
    python base_run.py
  - This needs to run only once; it creates the first version of the dataset (all_dataflows_previous.csv) and sets up the workspace for subsequent runs (see the sketch below).
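As a rough sketch of what the baseline run amounts to (the import path and function name are hypothetical, not the actual API of data_fetcher.py):

```python
# Hypothetical sketch of base_run.py; names are assumptions for illustration.
from Functions.data_fetcher import fetch_dataflows  # assumed helper

def main():
    # On the very first run, the fresh catalogue becomes the "previous"
    # snapshot that later main.py runs will compare against.
    fetch_dataflows(path="data/all_dataflows_previous.csv")

if __name__ == "__main__":
    main()
```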
- Regular Workflow:
  - Run main.py for periodic execution:
    python main.py
  - Once moved to the Linux server, this job is scheduled to run every two weeks, on Tuesday at 7 AM.
  - This will:
    - Fetch the latest datasets from the OECD API.
    - Compare the new dataset with the existing one to identify changes (see the pandas sketch after this list).
    - Save detected changes to data_changes.csv.
    - Download additional data and metadata for new records.
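The comparison step is essentially a set difference on a key column. A minimal pandas sketch, assuming a column named id uniquely identifies each dataflow (the real key and change categories in data_comparator.py may differ):

```python
import pandas as pd

def compare(prev_path="data/all_dataflows_previous.csv",
            new_path="data/all_dataflows_new.csv",
            out_path="data/data_changes.csv"):
    """Flag rows present only in the new file (inserts) or only in the old (deletions)."""
    prev = pd.read_csv(prev_path)
    new = pd.read_csv(new_path)
    # indicator=True adds a _merge column telling us which side each row came from.
    merged = new.merge(prev, on="id", how="outer",
                       indicator=True, suffixes=("", "_prev"))
    changes = merged[merged["_merge"] != "both"].copy()
    changes["change"] = changes["_merge"].map(
        {"left_only": "insert", "right_only": "delete"})
    changes.drop(columns="_merge").to_csv(out_path, index=False)
    return changes
```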
- Output:
  - Logs: Found in the logs/ folder (see the logging sketch after this list).
  - Change Summary: Saved as data_changes.csv in the data/ directory.
  - Downloaded Data and Metadata: Saved in the output/ directory.
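logger.py most likely wires Python's standard logging module to the logs/ folder; a minimal sketch of that kind of setup (the file name and format are assumptions):

```python
import logging
from pathlib import Path

def setup_logger(name="oecd_pipeline"):
    """Log to both logs/run.log and the console."""
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[logging.FileHandler("logs/run.log", encoding="utf-8"),
                  logging.StreamHandler()],
    )
    return logging.getLogger(name)
```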
For a detailed explanation of each script, its role, and how they work together, refer to the Confluence page.