tannerbeam/gx-databricks-bigquery-public

gx-databricks-bigquery-public

How to leverage the power of Databricks notebooks and GX data quality checks to create validated data workflows

Full instructions are available on the Great Expectations blog.

Requirements

Because this is a fairly detailed integration of several tools, some working knowledge of Python, SQL, Git, and Databricks is assumed. Prior experience with Great Expectations (GX) is useful but not strictly required.

Repo organization

/ (top level)

  • Directories
  • dotfiles (e.g. .gitignore)
  • GCP Service Account credentials JSON file
  • repo config YAML file
  • Pandas DataFrame PKL file with some sample data

/src

Databricks notebooks with executable code for scheduled orchestration

/utils

Python files (not Databricks notebooks!) to be imported

/great_expectations

Anything pertaining to data validation with GX

Repo configuration file

The default contents of the config.yml file are shown below:

# assumed directory structure is: /Workspace/Repos/{repo_directory}/{repo_name}

# {repo_directory} in assumed directory structure
repo_directory: "dev"

# {repo_name} in assumed directory structure
repo_name: "gx-databricks-bigquery-public"

# relative path of BigQuery service account credentials file
bigquery_creds_file: ".bigquery_service_account_creds.json"

# relative path of great expectations directory
gx_dir: "great_expectations"

# provide a name to help identify the GX data connector type
gx_connector_name: "pandas_fluent"
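
A file in this flat form can be read without a YAML library. The sketch below is an illustration, not the repo's own loader: `load_flat_config` is a hypothetical helper, and it assumes every setting sits on its own line as `key: "value"`, with `#` starting a comment line.

```python
from pathlib import Path


def load_flat_config(path):
    """Parse a flat `key: "value"` file into a dict (hypothetical helper)."""
    config = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw_line!r}")
        # Drop surrounding whitespace and optional quotes around the value.
        config[key.strip()] = value.strip().strip("'\"")
    return config
```

In a project that already depends on PyYAML, `yaml.safe_load` would likely be the more robust choice, since it also handles nested values and inline comments.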

To avoid FileNotFoundError and related problems:

  • Keep config.yml in the top-level directory, and do not rename it.
  • Use only one key: 'value' pair per line.
  • Maintain the directory structure /Workspace/Repos/{repo_directory}/{repo_name}. If you deviate from this pattern, the helper functions in the repo may not be able to locate files correctly.
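
The assumed layout can be turned into absolute paths in a few lines. This is a sketch under stated assumptions rather than the repo's helper code: `resolve_repo_paths` is a hypothetical name, and the `config` dict is assumed to hold the keys from the default config.yml.

```python
from pathlib import PurePosixPath

# Assumed root for Databricks repos, taken from the directory pattern above.
WORKSPACE_REPOS = PurePosixPath("/Workspace/Repos")


def resolve_repo_paths(config):
    """Build absolute paths from config.yml values (hypothetical helper)."""
    repo_root = WORKSPACE_REPOS / config["repo_directory"] / config["repo_name"]
    return {
        "repo_root": repo_root,
        "config_file": repo_root / "config.yml",
        "bigquery_creds_file": repo_root / config["bigquery_creds_file"],
        "gx_dir": repo_root / config["gx_dir"],
    }


# Values copied from the default config.yml.
config = {
    "repo_directory": "dev",
    "repo_name": "gx-databricks-bigquery-public",
    "bigquery_creds_file": ".bigquery_service_account_creds.json",
    "gx_dir": "great_expectations",
}
paths = resolve_repo_paths(config)
print(paths["gx_dir"])
# prints /Workspace/Repos/dev/gx-databricks-bigquery-public/great_expectations
```

Because every path is derived from the same root, moving or renaming the repo directory without updating config.yml would leave all of them pointing at locations that do not exist.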
