The sample demonstrate how DevOps principles can be applied end to end Data Pipeline Solution built according to the Modern Data Warehouse (MDW) pattern.
The solution pulls near realtime Melbourne Parking Sensor data from a publicly available REST api endpoint and saves this to Azure Data Lake Gen2. It then validates, cleanses, and transforms the data to a known schema using Azure Databricks. A second Azure Databricks job then transforms these into a Star Schema which are then loaded into Azure Synapse Analytics (formerly SQLDW) using Polybase. The entire pipeline is orchestrated with Azure Data Factory.
The following shows the overall architecture of the solution.
Sample PowerBI report
The following shows the overall CI/CD process end to end.
See here for details.
It makes use of the following azure services:
- Azure Data Factory
- Azure Databricks
- Azure Data Lake Gen2
- Azure Synapse Analytics (formerly SQLDW)
- Azure DevOps
- Application Insights
- PowerBI
For a detailed walk-through of the solution and key concepts, watch the following video recording:
The following summarizes key learnings and best practices demonstrated by this sample solution:
- Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
- Bronze - This is a landing area for your raw datasets with no to minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store.
- Silver - These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by Data Scientists.
- Gold - These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard Fact and Dimension tables.
- Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This also can potentially prevent data pipeline failures in cases of unexpected changes to the input data.
- Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purpose.
- It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can used to replay the data pipeline for both testing (ei. testing data validation logic) and data recovery purposes (ei. data corruption is introduced due to a bug in the data transformation code and thus pipeline needs to be replayed).
- Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state through deployment of code fix and replaying the data pipelines.
- Idempotency also ensures data-duplication is mitigated when replaying your data pipelines.
- Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
- While it is possible to run tests against notebooks, by shifting tests left you increase developer productivity by increasing the speed of the feedback cycle.
- This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc), reference/application data, data pipeline definitions, and data validation and transformation logic.
- There should also be a safe, repeatable process to move changes through dev, test and finally production.
- Maintain a central, secure location for sensitive configuration such as database connection strings that can be access by the appropriate services within the specific environment.
- Any example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration.
- A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store.
The Build and Release Pipelines definitions can be found here.
- Sandbox and Dev- the DEV resource group is used by developers to build and test their solutions. It contains two logical environments - (1) a Sandbox environment per developer so each developer can make and test their changes in isolation prior committing to
master
, and (2) a shared Dev environment for integrating changes from the entire development team. "Isolated" sandbox environment are accomplish through a number of practices depending on the Azure Service.- Databricks - developers use their dedicated Workspace folder to author and save notebooks. Developers can choose to spin up their own dedicated clusters or share a High-concurrency cluster.
- DataLake Gen2 - a "sandbox" file system is created. Each developer creates their own folder within this Sandbox filesystem.
- AzureSQL or SQLDW - A transient database (restored from DEV) is spun up per developer on demand.
- Data Factory - git integration allows them to make changes to their own branches and debug runs independently.
- Stage - the STG resource group is used to test deployments prior to going to production in a production-like environment. Integration tests are run in this environment.
- Production - the PROD resource group is the final Production environment.
There are eight numbered orange boxes describing the sequence from sandbox development to target environments:
- Developers develop in their own Sandbox environments within the DEV resource group and commit changes into their own short-lived git branches. (i.e. <developer_name>/<branch_name>)
- When changes are complete, developers raise a PR to master for review. This automatically kicks-off the PR validation pipeline which runs the unit tests, linting and DACPAC builds.
- On PR completion, the commit to master will trigger a Build pipeline -- publishing all necessary Build Artifacts.
- The completion of a successful Build pipeline will trigger the first stage of the Release pipeline. This deploys the publish build artifacts into the DEV environment, with the exception of Azure Data Factory*.
- Developers perform a Manual Publish to the DEV ADF from the collaboration branch (
master
). This updates the ARM templates in in theadf_publish
branch. - On the successful completion of the first stage, this triggers an Manual Approval Gate**. On Approval, the release pipeline continues with the second stage -- deploying changes to the Staging environment.
- Integration tests are run to test changes in the Staging environment.
- ***On the successful completion of the second stage, this triggers a second Manual Approval Gate. On Approval, the release pipeline continues with the third stage -- deploying changes to the Production environment.
Notes:
- This is a simplified Build and Release process for demo purposes based on Trunk-based development practices.
- *A manual publish is required -- currently, this cannot be automated.
- **The solution deployment script does not configure Approval Gates at the moment. See Known Issues, Limitations and Workarounds
- ***Many organization use dedicated Release Branches (including Microsoft) instead of deploying from
master
. See Release Flow.
More resources:
- Continuous Integration & Continuous Delivery with Databricks
- Continuous integration and delivery in Azure Data Factory
- Devops for AzureSQL
-
Unit Testing - These test small pieces of functionality within your code. Data transformation code should have unit tests and can be accomplished by abstracting Data Transformation logic into packages. Unit tests along with linting are automatically run when a PR is raised to
master
.- See here for unit tests within the solution and the corresponding QA Pipeline that executes the unit tests on every PR.
-
Integration Testing - These are run to ensure integration points of the solution function as expected. In this demo solution, an actual Data Factory Pipeline run is automatically triggered and its output verified as part of the Release to the Staging Environment.
- See here for the integration tests and the corresponding Release Pipeline Job Definition for running them as part of the Release pipeline.
More resources:
- pytest-adf - Pytest helper plugin for integration testing Azure Data Factory
- nutter testing framework - Testing framework for Databricks notebooks.
- Monitoring Azure Databricks with Azure Monitor
- Monitoring Azure Databricks Jobs with Application Insights
- Github account
- Azure Account
- Permissions needed: ability to create and deploy to an azure resource group, a service principal, and grant the collaborator role to the service principal over the resource group.
- Azure DevOps Project
- Permissions needed: ability to create service connections, pipelines and variable groups.
- For Windows users, Windows Subsystem For Linux
- az cli 2.6+
- az cli - application insights extension
- To install, run
az extension add --name application-insights
- To install, run
- Azure DevOps CLI
- To install, run
az extension add --name azure-devops
- To install, run
- Python 3+
- databricks-cli
- jq
- Rename
.envtemplate
todevcontainer.env
and update the values as explained in Initial Setup section. - Open the project inside the vscode dev container (see details here)
cd
into thee2e_samples/parking_sensors
folder and run./deploy.sh
inside the dev container terminal.
IMPORTANT NOTE: As with all Azure Deployments, this will incur associated costs. Remember to teardown all related resources after use to avoid unnecessary costs. See here for list of deployed resources. NOTE: This deployment was tested using WSL 2 (Ubuntu 18.04) and Debian GNU/Linux 9.9 (stretch)
-
Initial Setup
-
Ensure that:
- You are logged in to the Azure CLI. To login, run
az login
. - Azure CLI is targeting the Azure Subscription you want to deploy the resources to.
- To set target Azure Subscription, run
az account set -s <AZURE_SUBSCRIPTION_ID>
- To set target Azure Subscription, run
- Azure CLI is targeting the Azure DevOps organization and project you want to deploy the pipelines to.
- To set target Azure DevOps project, run
az devops configure --defaults organization=https://dev.azure.com/<MY_ORG>/ project=<MY_PROJECT>
- To set target Azure DevOps project, run
- You are logged in to the Azure CLI. To login, run
-
Import this repository into a new Github repo. See here on how to import a github repo. Importing is necessary for setting up git integration with Azure Data Factory.
-
Set the following required environment variables:
- GITHUB_REPO - Name of your imported github repo in this form
<my_github_handle>/<repo>
. (ei. "devlace/mdw-dataops-import") - GITHUB_PAT_TOKEN - a Github PAT token. Generate them here. This requires "repo" scope.
Optionally, set the following environment variables:
- RESOURCE_GROUP_LOCATION - Azure location to deploy resources. Default:
westus
. - AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default azure subscription. To see your default, run
az account list
. - RESOURCE_GROUP_NAME_PREFIX - name of the resource group. This will automatically be appended with the environment name. For example:
RESOURCE_GROUP_NAME_PREFIX-dev-rg
. Default: mdwdo-park-${DEPLOYMENT_ID}. - DEPLOYMENT_ID - string appended to all resource names. This is to ensure uniqueness of azure resource names. Default: random five character string.
- AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: master.
- AZURESQL_SERVER_PASSWORD - Password of the SQL Server instance. Default: semi-random string.
To further customize the solution, set parameters in
arm.parameters
files located in theinfrastructure
folder. - GITHUB_REPO - Name of your imported github repo in this form
-
-
Deploy Azure resources
- Clone locally the imported Github Repo, then
cd
into thee2e_samples/parking_sensors
folder of the repo - Run
./deploy.sh
.- This may take around ~30mins or more to run end to end. So grab yourself a cup of coffee... ☕
- After a successful deployment, you will find
.env.{environment_name}
files containing essential configuration information per environment. See here for list of deployed resources.
- As part of the deployment script, this updated the Azure DevOps Release Pipeline YAML definition to point to your Github repository. Commit and push up these changes.
- This will trigger a Build and Release which will fail due to a lacking
adf_publish
branch -- this is expected. This branch will be created once you've setup git integration with your DEV Data Factory and publish a change.
- This will trigger a Build and Release which will fail due to a lacking
- Clone locally the imported Github Repo, then
-
Setup ADF git integration in DEV Data Factory
- In the Azure Portal, navigate to the Data Factory in the DEV environment.
- Click "Author & Monitor" to launch the Data Factory portal.
- On the landing page, select "Set up code repository". For more information, see here.
- Fill in the repository settings with the following:
- Repository type: Github
- Github Account: your_Github_account
- Git repository name: imported Github repository
- Collaboration branch: master
- Root folder: /e2e_samples/parking_sensors/adf
- Import Existing Data Factory resource to repository: Selected
- Branch to import resource into: Use Collaboration
- When prompted to select a working branch, select master
IMPORTANT NOTE: Only the DEV Data Factory should be setup with Git integration. Do NOT setup git integration in the STG and PROD Data Factories.
-
Trigger an initial Release
- In the DEV Data Factory portal, navigate to "Manage > Triggers". Select the
T_Sched
trigger and activate it by clicking on the "Play" icon next to it. ClickPublish
to publish changes.- Publishing a change is required to generate the
adf_publish
branch which is required in the Release pipelines.
- Publishing a change is required to generate the
- In Azure DevOps, notice a new run of the Build Pipeline (mdw-park-ci-artifacts) off
master
. This will build the Python package and SQL DACPAC, then publish these as Pipeline Artifacts. - After completion, this should automatically trigger the Release Pipeline (mdw-park-cd-release). This will deploy the artifacts across environments.
- Optional. Trigger the Data Factory Pipelines per environment.
- In the Data Factory portal of each environment, navigate to "Author", then select the
P_Ingest_MelbParkingData
. - Select "Trigger > Trigger Now".
- To monitor the run, go to "Monitor > Pipeline runs".
- Currently, the data pipeline is configured to use "on-demand" databricks clusters so it takes a few minutes to spin up. That said, it is not uncommon to change these to point to "existing" running clusters in Development for faster data pipeline runs.
- In the Data Factory portal of each environment, navigate to "Author", then select the
- In the DEV Data Factory portal, navigate to "Manage > Triggers". Select the
Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines. For next steps, we recommend watching this presentation for a detailed walk-through of the running solution. If you've encountered any issues, please file a Github issue with the relevant error message and replication steps.
After a successful deployment, you should have the following resources:
- In Azure, three (3) Resource Groups (one per environment) each with the following Azure resources.
- Data Factory - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
- Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
- Databricks workspace
- notebooks uploaded at
/notebooks
folder in the workspace - SparkSQL tables created
- ADLS Gen2 mounted at
dbfs:/mnt/datalake
using the Storage Service Principal. - Databricks secrets created*
- notebooks uploaded at
- Azure Synapse (formerly SQLDW) - currently, empty. The Release Pipeline will deploy the SQL Database objects.
- Application Insights
- KeyVault with all relevant secrets stored.
- In Azure DevOps
- Four (4) Azure Pipelines
- mdwdo-park-cd-release - Release Pipeline
- mdwdo-park-ci-artifacts - Build Pipeline
- mdwdo-park-ci-qa-python - "QA" pipeline runs on PR to master
- mdwdo-park-ci-qa-sql - "QA" pipeline runs on PR to master
- Three (6) Variables Groups - two per environment
- mdwdo-park-release-dev
- mdwdo-park-release-secrets-dev**
- mdwdo-park-release-stg
- mdwdo-park-release-secrets-stg**
- mdwdo-park-release-prod
- mdwdo-park-release-secrets-prod**
- Four (4) Service Connections
- Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
- mdwdo-park-serviceconnection-dev
- mdwdo-park-serviceconnection-stg
- mdwdo-park-serviceconnection-prod
- Github Service Connection for retrieving code from Github
- mdwdo-park-github
- Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
- Four (4) Azure Pipelines
Notes:
-
*This secret-scope is currently not deployed as a KeyVault-backed secret scope due to limitations of creating it programmatically.
-
**These variable groups are currently not linked to KeyVault due to limitations of creating these programmatically. See Known Issues, Limitations and Workarounds
ADLS Gen2 is structured as the following:
datalake <- filesystem
/sys/databricks/libs <- contains all libs, jars, wheels needed for processing
/data
/lnd <- Bronze - landing folder where all data files are ingested into.
/interim <- Silver - interim (cleansed) tables
/dw <- Gold - final tables
The following lists some limitations of the solution and associated deployment script:
- Databricks KeyVault-backed secrets scopes can only be create via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution.
- Workaround: Deployment uses normal Databricks secrets with the downside of duplicated information. If you wish, you many manually convert these to KeyVault-back secret scopes. See here for more information.
- Azure DevOps Variable Groups linked to KeyVault can only be created via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution.
- Workaround: Deployment add sensitive configuration as "secrets" in Variable Groups with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with required variables in the Azure DevOps pipelines. See here for more information.
- Azure DevOps Environment and Approval Gates can only be managed via the UI, cannot be managed programmatically and was not incorporated in the automated deployment of the solution.
- Workaround: Approval Gates can be easily configured manually. See here for more information.