augment and streamline export functionality #347

Open
wants to merge 5 commits into
base: main

coussens
Contributor

Resolves #338

Explanation

  • Created a local exports folder within Airflow (and updated .gitignore and docker_compose.yml accordingly) so that data marts are exported locally, in addition to S3 (if a destination bucket URI is specified).
  • Added s3fs library to requirements.txt to allow for more streamlined writing of files to S3 (described below)
  • Leveraged the built-in functionality of pandas (when s3fs is installed) to write files (CSV, Parquet, etc.) directly from a DataFrame to an S3 bucket (no need to write to local file then copy to bucket).
  • export_marts DAG writes to .parquet format by default (but I left commented-out lines for writing to CSV for easy editing, if desired).
  • Relied on the default lookup of AWS credentials from environment variables, handled by s3fs/botocore underneath pandas (no need to explicitly pull the keys from the environment and then pass them to boto/pandas/etc.).
    • Amended docker_compose.yml file to remove unnecessary references to AWS environment variables. Simply naming them appropriately in the .env file (example below) is all that's required.
# example .env file
AIRFLOW_UID=your_uid
UMLS_API=your_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_DEST_BUCKET=s3://bucket_name/folder_name
AWS_REGION=your_aws_region
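
The export flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `export_targets` and `export_mart` and the default folder name `exports` are hypothetical. It assumes the mart is already a pandas DataFrame and that s3fs is installed, so that `to_parquet` accepts an `s3://` URI and credentials are picked up from the environment.

```python
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pandas is only needed for type hints here
    import pandas as pd


def export_targets(mart_name: str, local_dir: str,
                   dest_bucket: str | None = None) -> list[str]:
    """Build the list of destinations for one mart (hypothetical helper).

    The local path is always included; the S3 URI is added only when a
    destination bucket URI (e.g. the value of AWS_DEST_BUCKET) is given.
    """
    filename = f"{mart_name}.parquet"
    targets = [str(Path(local_dir) / filename)]
    if dest_bucket:
        targets.append(f"{dest_bucket.rstrip('/')}/{filename}")
    return targets


def export_mart(df: "pd.DataFrame", mart_name: str,
                local_dir: str = "exports",
                dest_bucket: str | None = None) -> list[str]:
    """Write a mart to the local folder and, optionally, straight to S3."""
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    targets = export_targets(mart_name, local_dir, dest_bucket)
    for target in targets:
        # With s3fs installed, pandas writes to s3:// URIs directly;
        # no temporary local file or explicit key handling is needed.
        df.to_parquet(target, index=False)
        # CSV alternative (assumed equivalent): df.to_csv(target, index=False)
    return targets
```

A DAG task could then call something like `export_mart(df, "some_mart", dest_bucket=os.environ.get("AWS_DEST_BUCKET"))`, where `some_mart` is a placeholder name; when the variable is unset, only the local file is written.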

Tests

Successfully ran the export_marts DAG in Airflow and confirmed that the files were written properly to both the local exports directory and an S3 bucket.

Development

Successfully merging this pull request may close these issues.

Export Data Marts to Parquet