OECD Dataset Download Automation

This project automates the extraction, comparison, and storage of updated and newly added OECD dataset information, using the OECD API and a structured fetch-compare-save workflow.

Features

  • Data Fetching: Downloads dataflows from the OECD API.
  • Data Comparison: Compares each new dataset with the previous instance to detect changes (new entries and deletions).
  • Data Management: Saves dataset and metadata for future processing.
  • Change Archival: Archives previous datasets and records changes for future reference.
  • Error Logging: Logs detailed information about the execution process.

File Structure

  • Functions/: Contains core utility scripts:
    • data_fetcher.py: Fetches dataflows from the OECD API and saves them to CSV. Handles both the first-time dataset download and subsequent refreshes (all_dataflows_new.csv, all_dataflows_previous.csv); see the sketch after this list.
    • data_comparator.py: Compares the previous dataset (all_dataflows_previous.csv) with the new one (all_dataflows_new.csv) and writes detected changes to data_changes.csv; a sketch also follows this list.
    • api_downloader.py: Downloads datasets and metadata for new entries.
    • logger.py: Configures logging for the project.
  • base_run.py: Initializes and runs the data-fetching workflow once to create the baseline dataset.
  • main.py: Runs the regular workflow: data fetching, comparison, and metadata updates. Intended to be scheduled and invoked at regular intervals.
  • config.yaml: Configuration file with API endpoints, file names, and file paths; a hypothetical layout follows this list.
  • requirements.txt: Lists Python dependencies.
  • linux_cron_setup.txt: Cron setup that runs the main.py job every two weeks, on Tuesday at 7 am, on a Linux server.
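
As a rough illustration of the fetch step in data_fetcher.py, the sketch below downloads the dataflow catalogue and writes it to CSV. The endpoint URL, the XML handling, and the CSV columns are assumptions for illustration, not taken from the project code:

    # Minimal sketch of the dataflow fetch; endpoint and CSV schema are assumed.
    import csv
    import requests
    import xml.etree.ElementTree as ET

    DATAFLOW_URL = "https://sdmx.oecd.org/public/rest/dataflow/all"  # assumed endpoint

    def fetch_dataflows(out_path="data/all_dataflows_new.csv"):
        response = requests.get(DATAFLOW_URL, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        with open(out_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["id", "agency", "version", "name"])  # assumed columns
            # "{*}" matches any XML namespace (Python 3.8+), so the SDMX
            # namespace URIs do not need to be hard-coded.
            for flow in root.iter("{*}Dataflow"):
                name = flow.find("{*}Name")
                writer.writerow([
                    flow.get("id"),
                    flow.get("agencyID"),
                    flow.get("version"),
                    name.text if name is not None else "",
                ])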
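
The comparison step in data_comparator.py can be pictured as a pandas outer merge; the file names follow this README, but the join key ("id") and the change_type labels are assumptions:

    # Sketch of the change detection between the previous and new CSVs.
    import pandas as pd

    def compare_dataflows(previous_path="data/all_dataflows_previous.csv",
                          new_path="data/all_dataflows_new.csv",
                          changes_path="data/data_changes.csv"):
        previous = pd.read_csv(previous_path)
        new = pd.read_csv(new_path)
        merged = new.merge(previous, on="id", how="outer",
                           suffixes=("", "_old"), indicator=True)
        # left_only  -> only in the new file (an insert)
        # right_only -> only in the previous file (a deletion)
        changes = merged[merged["_merge"] != "both"].copy()
        changes["change_type"] = changes["_merge"].map(
            {"left_only": "insert", "right_only": "delete"})
        changes.drop(columns="_merge").to_csv(changes_path, index=False)
        return changes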
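
The keys in config.yaml are not documented here; a hypothetical layout consistent with the file names above might look like:

    # Hypothetical config.yaml layout; the actual keys may differ.
    api:
      dataflow_url: https://sdmx.oecd.org/public/rest/dataflow/all
    paths:
      data_dir: data/
      archive_dir: data/archive/
      output_dir: output/
      log_dir: logs/
    files:
      new: all_dataflows_new.csv
      previous: all_dataflows_previous.csv
      changes: data_changes.csv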

Folder Requirements

Ensure the following folders exist in the project directory before running the scripts (a helper snippet follows the list):

  1. logs/:
    • Used for storing log files generated by the scripts.
  2. output/:
    • Used for saving downloaded datasets and metadata files.
  3. data/:
    • Used for storing the main dataset files (e.g., all_dataflows_new.csv, all_dataflows_previous.csv, data_changes.csv).
  4. data/archive/:
    • Folder for archiving old datasets or backups.
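
A one-off snippet run from the project root can create all of these; this helper is not part of the repository:

    # Creates the required folders; parents=True also creates data/ on the
    # way to data/archive/, and exist_ok=True makes the snippet re-runnable.
    from pathlib import Path

    for folder in ("logs", "output", "data/archive"):
        Path(folder).mkdir(parents=True, exist_ok=True)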

How to Execute

  1. Initial Setup:

    • Run base_run.py to fetch and save the baseline dataset:
      python base_run.py
    • This needs to run only once; it creates the first version of the dataset (all_dataflows_previous.csv) and sets up the workspace for subsequent runs.
  2. Regular Workflow:

    • Run main.py for periodic execution:
      python main.py
    • This will:
      • Run every two weeks, on Tuesday at 7 am, once scheduled on the Linux server (see the cron sketch after this list).
      • Fetch the latest datasets from the OECD API.
      • Compare the new dataset with the existing one to identify changes.
      • Save detected changes to data_changes.csv.
      • Download additional data and metadata for new records.
  3. Output:

    • Logs: Found in the logs/ folder.
    • Change Summary: Saved as data_changes.csv in the data/ directory.
    • Downloaded Data and Metadata: Saved in the output/ directory.
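
The actual schedule lives in linux_cron_setup.txt. Since cron has no native "every two weeks" field, such setups typically guard on week parity; the entry below is an illustrative guess with a placeholder path, not the repository's own crontab:

    # 07:00 every Tuesday; the guard divides epoch seconds into whole weeks
    # and runs main.py only in even-numbered weeks, i.e. every two weeks.
    # Inside a crontab, % must be escaped as \%.
    0 7 * * 2 [ $(( $(date +\%s) / 604800 \% 2 )) -eq 0 ] && cd /path/to/OECD && python main.py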

For a detailed explanation of each script, its role, and how the scripts work together, refer to the Confluence page.
