Great-Expectations-Data
This Action allows you to validate and profile your data with Great Expectations. From the docs:
Great Expectations is a leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams.
This Action provides the following features:
- Run Expectation Suites to validate your data as part of your continuous integration workflow.
- Generate Data Docs and Profiling, and serve them on a static site host like GitHub Pages or a platform like Netlify.
More information on how you can use this Action can be found in the Use Cases below.
Use Cases
- CI for your data.
  - GitHub is a natural platform to discuss fixes as issues arise. This Action can speed up conversations by presenting Data Docs reports that make it easy to see how your data has changed.
- MLOps: retraining ML models.
  - Great Expectations can be used to detect when your live prediction data has drifted from your training data. Not only does this protect you from misbehaving models, it can also be used to determine when models need to be retrained.
  - In addition to checking model input, Great Expectations can be used to check model output.
- Integration testing with static data fixtures.
  - Many data pipeline and ML projects use static data fixtures for unit or integration tests. These test suites can be expressed as Great Expectations suites. Then, each time you submit a PR, you will receive not only a pass/fail CI check but also a visual data report on how your tests performed.
- Run on code change.
  - Use this Action as CI/CD for data and ML pipelines that runs when PRs are submitted.
- A lightweight DAG runner.
  - This Action is not limited to running in response to code changes or other activity on GitHub. You can also run this Action manually or on a schedule. As long as your data sources are configured and accessible, you can get the benefits of data quality testing without integrating Great Expectations directly into your pipelines.
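As a minimal sketch of the "manually or on a schedule" use case, the workflow below runs the Action once a day and also exposes a manual trigger. The workflow name, job name, cron time, and checkpoint name `my_checkpoint` are hypothetical placeholders, not values taken from this project; the `CHECKPOINTS` input is the one documented for this Action.

```yaml
# Sketch only: names and schedule are illustrative assumptions.
name: Scheduled Data Quality Check
on:
  schedule:
    # Every day at 06:00 UTC (GitHub evaluates cron schedules in UTC).
    - cron: "0 6 * * *"
  # Adds a "Run workflow" button so the check can also be started manually.
  workflow_dispatch:
jobs:
  scheduled_validation:
    runs-on: ubuntu-latest
    steps:
      # The checkpoint configuration lives in the repository, so clone it first.
      - name: Copy Repository Contents
        uses: actions/checkout@main
      - name: Run Great Expectation Checkpoints
        uses: great-expectations/great_expectations_action@main
        with:
          # Hypothetical checkpoint name; replace with one defined in your project.
          CHECKPOINTS: "my_checkpoint"
```

Because no pull request exists in this setting, a failed checkpoint surfaces as a failed workflow run rather than as a PR comment.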
Example 1
This example triggers Great Expectations to run every time a pull request is opened or reopened, or a push is made to a pull request. Furthermore, if a checkpoint fails, a comment with a link to the Data Docs hosted on Netlify is posted on the pull request.
Note: This example will not work on pull requests from forks. This is to protect repositories from malicious actors. To trigger a GitHub Action with sufficient permissions to comment on a pull request from a fork, you must trigger the Action via another event, such as a comment or a label. This is demonstrated in Example 2 below.
# Automatically runs Great Expectations checkpoints on every push to a PR, and provides links to hosted Data Docs if there is an error.
name: PR Push
on: pull_request
env: # credentials to your development database (this is illustrative, can be another data source)
  DB_HOST: ${{ secrets.DB_HOST }}
  DB_PASS: ${{ secrets.DB_PASS }}
  DB_USER: ${{ secrets.DB_USER }}
jobs:
  great_expectations_validation:
    runs-on: ubuntu-latest
    steps:
      # Clone the contents of the repository
      - name: Copy Repository Contents
        uses: actions/checkout@main
      # Execute your data pipeline on development infrastructure. This is a simplified example where
      # we run a local sql file against a remote Postgres development database as the "pipeline". We
      # then test the materialized data from the pipeline with this action.
      - name: run sql query
        run: |
          PGPASSWORD=${DB_PASS} psql -h $DB_HOST -d demo -U $DB_USER -f location_frequency.sql
      # Run Great Expectations and deploy Data Docs to Netlify
      - name: Run Great Expectation Checkpoints
        id: ge
        uses: great-expectations/great_expectations_action@main
        with:
          CHECKPOINTS: "locations.rds.chk"
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
      # Comment on PR with link to deployed Data Docs if there is a failed checkpoint, otherwise don't comment.
      - name: Comment on PR
        if: ${{ always() }}
        uses: actions/github-script@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            if (process.env.FAILURE_FLAG == 1 ) {
              const msg = `Failed Great Expectations checkpoint(s) \`${process.env.FAILED_CHECKPOINTS}\` detected for: ${process.env.SHA}. Corresponding Data Docs have been generated and can be viewed [here](${process.env.URL}).`;
              console.log(`Message to be emitted: ${msg}`);
              github.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: msg
              });
            }
        env:
          URL: "${{ steps.ge.outputs.netlify_docs_url }}"
          FAILED_CHECKPOINTS: ${{ steps.ge.outputs.failing_checkpoints }}
          SHA: ${{ github.sha }}
          FAILURE_FLAG: ${{ steps.ge.outputs.checkpoint_failure_flag }}
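The bare `on: pull_request` trigger in the workflow above relies on GitHub's default activity types, which is why it fires when a pull request is opened, reopened, or pushed to. Written out explicitly, the same trigger looks roughly like the fragment below; this is a sketch for readers who want to narrow or extend the list, not a change the workflow requires.

```yaml
# Explicit form of the default pull_request activity types.
on:
  pull_request:
    types:
      - opened        # a new pull request is created
      - reopened      # a closed pull request is reopened
      - synchronize   # new commits are pushed to the pull request branch
```

Removing `synchronize`, for instance, would limit validation to the moment a pull request is opened or reopened.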
Example 2
The example below checks pull request comments for the presence of a special command: `/data-docs`. If this command is present, the following steps occur:
- The HEAD SHA for the pull request commented on is retrieved.
- The contents for the repository are fetched at the HEAD SHA of the branch of the pull request.
- Great Expectations checkpoints are run, and Data Docs are deployed to Netlify.
- The Netlify URL is provided as a comment on the pull request.
# Allows repo owners to view Data Docs hosted on Netlify for a PR with the command "/data-docs" as a comment in a PR.
name: PR Comment
on: [issue_comment]
jobs:
  demo-pr:
    # Check that a comment containing '/data-docs' is made on a pull request (not an issue).
    if: |
      (github.event.issue.pull_request != null) &&
      contains(github.event.comment.body, '/data-docs')
    runs-on: ubuntu-latest
    steps:
      # Get the HEAD SHA of the pull request that has been commented on.
      - name: Fetch context about the PR that has been commented on
        id: chatops
        uses: actions/github-script@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            // Look up the pull request that was commented on
            github.pulls.get({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.payload.issue.number
            }).then( (pr) => {
              // Get latest SHA of the pull request branch and expose it as a step output
              var SHA = pr.data.head.sha
              console.log(`::set-output name=SHA::${SHA}`)
            })
      # Clone the contents of the repository at the SHA fetched in the previous step
      - name: Copy The PR's Branch Repository Contents
        uses: actions/checkout@main
        with:
          ref: ${{ steps.chatops.outputs.SHA }}
      - name: run data pipeline on dev server
        run: # <Put your code here> run your data pipeline in development so you can test the results with Great Expectations
      # Run Great Expectation checkpoints and deploy Data Docs to Netlify
      - name: Run Great Expectation Checkpoints
        id: ge
        uses: great-expectations/great_expectations_action@main
        with:
          # Note: `matrix.checkpoints` only resolves if this job defines a `strategy.matrix` with a
          # `checkpoints` entry; otherwise pass a literal list such as "locations.rds.chk".
          CHECKPOINTS: ${{ matrix.checkpoints }}
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
      # Comment on PR with link to deployed Data Docs on Netlify. In this example, we comment
      # on the PR with a message for both failed and successful checks.
      - name: Comment on PR
        if: ${{ always() }}
        uses: actions/github-script@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            let msg;
            if (process.env.FAILURE_FLAG == 1 ) {
              msg = `Failed Great Expectations checkpoint(s) \`${process.env.FAILED_CHECKPOINTS}\` detected for: ${process.env.SHA}. Corresponding Data Docs have been generated and can be viewed [here](${process.env.URL}).`;
            } else {
              msg = `All Checkpoints for: ${process.env.SHA} have passed. Corresponding Data Docs have been generated and can be viewed [here](${process.env.URL}).`;
            }
            console.log(`Message to be emitted: ${msg}`);
            github.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: msg
            })
        env:
          URL: "${{ steps.ge.outputs.netlify_docs_url }}"
          FAILED_CHECKPOINTS: ${{ steps.ge.outputs.failing_checkpoints }}
          SHA: ${{ steps.chatops.outputs.SHA }}
          FAILURE_FLAG: ${{ steps.ge.outputs.checkpoint_failure_flag }}
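Example 2 reads its checkpoint list from `matrix.checkpoints`, but the workflow as shown does not define a matrix. One possible way to supply it is sketched below, assuming you want one job per checkpoint. The second checkpoint name (`another.chk`) is a hypothetical placeholder, and only the parts of the job that change are shown.

```yaml
# Sketch only: fragment of the `demo-pr` job with a matrix added.
jobs:
  demo-pr:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining checkpoint jobs finish even if one of them fails.
      fail-fast: false
      matrix:
        # One job is created per entry; `another.chk` is a made-up name.
        checkpoints: ["locations.rds.chk", "another.chk"]
    steps:
      - name: Run Great Expectation Checkpoints
        id: ge
        uses: great-expectations/great_expectations_action@main
        with:
          # Resolves to a single checkpoint name in each matrix job.
          CHECKPOINTS: ${{ matrix.checkpoints }}
```

With this layout each matrix job posts its own pull request comment. If a single comment is preferred, passing a comma-separated list directly to `CHECKPOINTS` avoids the matrix altogether.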
Inputs
- `CHECKPOINTS`: A comma-separated list of checkpoint names to execute. Example: "checkpoint1,checkpoint2"
- `NETLIFY_AUTH_TOKEN`: A personal access token for Netlify.
- `NETLIFY_SITE_ID`: A Netlify site ID.
- `DEBUG`: Setting this input to any value allows the Action to exit with a status code of 0 even if a checkpoint fails. This is used by maintainers of this Action for testing and debugging.
Outputs
- `ACTION_DOCS_LOCATION`: The absolute path where the Data Docs generated by the Action run are located. This is useful if you want to deploy the Data Docs to an external service.
- `FAILING_CHECKPOINTS`: A comma-delimited list of failed checkpoints.
- `PASSING_CHECKPOINTS`: A comma-delimited list of passing checkpoints.
- `NETLIFY_DOCS_URL`: The URL of the generated Data Docs on Netlify. This output is emitted only if the input parameters `NETLIFY_AUTH_TOKEN` and `NETLIFY_SITE_ID` are provided.
- `CHECKPOINT_FAILURE_FLAG`: Returns 0 if there are no checkpoint failures and 1 if there are one or more checkpoint failures.
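To illustrate how the docs location output might be consumed without Netlify, the sketch below uploads the generated Data Docs as a workflow artifact. It rests on several assumptions: that the output is addressable in lowercase as `action_docs_location` (mirroring how the examples reference `netlify_docs_url`), that the reported path is readable by later steps on the runner, and that `actions/upload-artifact@v4` is an acceptable target. The workflow, job, and artifact names are hypothetical.

```yaml
# Sketch only: output casing and path visibility are assumptions.
name: Data Docs Artifact
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Copy Repository Contents
        uses: actions/checkout@main
      - name: Run Great Expectation Checkpoints
        id: ge
        uses: great-expectations/great_expectations_action@main
        with:
          CHECKPOINTS: "checkpoint1,checkpoint2"
      # Runs even when a checkpoint fails, so the report is kept for failed runs too.
      - name: Upload Data Docs
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: data-docs
          path: ${{ steps.ge.outputs.action_docs_location }}
```

The artifact can then be downloaded from the workflow run page, or a later step could push the same directory to another hosting service.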
Development
This section is for those who wish to develop this GitHub Action. Users of this Action can ignore this section.
- Run `pip install -r requirements.txt`.
- Run `great_expectations init` to set up missing directories.
Run these commands from the repo root.
- To see a checkpoint pass, run `great_expectations checkpoint run passing_checkpoint`.
- To see a checkpoint fail, run `great_expectations checkpoint run failing_checkpoint`.
- If a cloud-based `ValidationStore` is in use, we may need to disable it so that the built docs focus only on what is being validated in the Action, without other side effects.