Skip to content

Latest commit

 

History

History
 
 

parking_sensors

DataOps - Parking Sensor Demo

The sample demonstrate how DevOps principles can be applied end to end Data Pipeline Solution built according to the Modern Data Warehouse (MDW) pattern.

Contents


Solution Overview

The solution pulls near realtime Melbourne Parking Sensor data from a publicly available REST api endpoint and saves this to Azure Data Lake Gen2. It then validates, cleanses, and transforms the data to a known schema using Azure Databricks. A second Azure Databricks job then transforms these into a Star Schema which are then loaded into Azure Synapse Analytics (formerly SQLDW) using Polybase. The entire pipeline is orchestrated with Azure Data Factory.

Architecture

The following shows the overall architecture of the solution.

Architecture

Sample PowerBI report

PowerBI report

Continuous Integration and Continuous Delivery (CI/CD)

The following shows the overall CI/CD process end to end.

CI/CD

See here for details.

Technologies used

It makes use of the following azure services:

For a detailed walk-through of the solution and key concepts, watch the following video recording:

IMAGE ALT TEXT HERE

Key Learnings

The following summarizes key learnings and best practices demonstrated by this sample solution:

1. Use Data Tiering in your Data Lake

  • Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
    1. Bronze - This is a landing area for your raw datasets with no to minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store.
    2. Silver - These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by Data Scientists.
    3. Gold - These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard Fact and Dimension tables.

2. Validate data early in your pipeline

  • Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This also can potentially prevent data pipeline failures in cases of unexpected changes to the input data.
  • Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purpose.
  • It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can used to replay the data pipeline for both testing (ei. testing data validation logic) and data recovery purposes (ei. data corruption is introduced due to a bug in the data transformation code and thus pipeline needs to be replayed).

3. Make your data pipelines replayable and idempotent

  • Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state through deployment of code fix and replaying the data pipelines.
  • Idempotency also ensures data-duplication is mitigated when replaying your data pipelines.

4. Ensure data transformation code is testable

  • Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
  • While it is possible to run tests against notebooks, by shifting tests left you increase developer productivity by increasing the speed of the feedback cycle.

5. Have a CI/CD pipeline

  • This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc), reference/application data, data pipeline definitions, and data validation and transformation logic.
  • There should also be a safe, repeatable process to move changes through dev, test and finally production.

6. Secure and centralize configuration

  • Maintain a central, secure location for sensitive configuration such as database connection strings that can be access by the appropriate services within the specific environment.
  • Any example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration.

7. Monitor infrastructure, pipelines and data

  • A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store.

Key Concepts

Build and Release Pipeline

The Build and Release Pipelines definitions can be found here.

Environments

  1. Sandbox and Dev- the DEV resource group is used by developers to build and test their solutions. It contains two logical environments - (1) a Sandbox environment per developer so each developer can make and test their changes in isolation prior committing to master, and (2) a shared Dev environment for integrating changes from the entire development team. "Isolated" sandbox environment are accomplish through a number of practices depending on the Azure Service.
    • Databricks - developers use their dedicated Workspace folder to author and save notebooks. Developers can choose to spin up their own dedicated clusters or share a High-concurrency cluster.
    • DataLake Gen2 - a "sandbox" file system is created. Each developer creates their own folder within this Sandbox filesystem.
    • AzureSQL or SQLDW - A transient database (restored from DEV) is spun up per developer on demand.
    • Data Factory - git integration allows them to make changes to their own branches and debug runs independently.
  2. Stage - the STG resource group is used to test deployments prior to going to production in a production-like environment. Integration tests are run in this environment.
  3. Production - the PROD resource group is the final Production environment.

Build and Release Sequence

There are eight numbered orange boxes describing the sequence from sandbox development to target environments:

CI/CD

  1. Developers develop in their own Sandbox environments within the DEV resource group and commit changes into their own short-lived git branches. (i.e. <developer_name>/<branch_name>)
  2. When changes are complete, developers raise a PR to master for review. This automatically kicks-off the PR validation pipeline which runs the unit tests, linting and DACPAC builds.
  3. On PR completion, the commit to master will trigger a Build pipeline -- publishing all necessary Build Artifacts.
  4. The completion of a successful Build pipeline will trigger the first stage of the Release pipeline. This deploys the publish build artifacts into the DEV environment, with the exception of Azure Data Factory*.
  5. Developers perform a Manual Publish to the DEV ADF from the collaboration branch (master). This updates the ARM templates in in the adf_publish branch.
  6. On the successful completion of the first stage, this triggers an Manual Approval Gate**. On Approval, the release pipeline continues with the second stage -- deploying changes to the Staging environment.
  7. Integration tests are run to test changes in the Staging environment.
  8. ***On the successful completion of the second stage, this triggers a second Manual Approval Gate. On Approval, the release pipeline continues with the third stage -- deploying changes to the Production environment.

Notes:

  • This is a simplified Build and Release process for demo purposes based on Trunk-based development practices.
  • *A manual publish is required -- currently, this cannot be automated.
  • **The solution deployment script does not configure Approval Gates at the moment. See Known Issues, Limitations and Workarounds
  • ***Many organization use dedicated Release Branches (including Microsoft) instead of deploying from master. See Release Flow.

More resources:

Testing

  • Unit Testing - These test small pieces of functionality within your code. Data transformation code should have unit tests and can be accomplished by abstracting Data Transformation logic into packages. Unit tests along with linting are automatically run when a PR is raised to master.

    • See here for unit tests within the solution and the corresponding QA Pipeline that executes the unit tests on every PR.
  • Integration Testing - These are run to ensure integration points of the solution function as expected. In this demo solution, an actual Data Factory Pipeline run is automatically triggered and its output verified as part of the Release to the Staging Environment.

More resources:

Observability / Monitoring

Databricks

Data Factory

How to use the sample

Prerequisites

  1. Github account
  2. Azure Account
  3. Azure DevOps Project

Software pre-requisites if you don't use dev container

Software pre-requisites if you use dev container

  1. Rename .envtemplate to devcontainer.env and update the values as explained in Initial Setup section.
  2. Open the project inside the vscode dev container (see details here)
  3. cd into the e2e_samples/parking_sensors folder and run ./deploy.sh inside the dev container terminal.

Setup and Deployment

IMPORTANT NOTE: As with all Azure Deployments, this will incur associated costs. Remember to teardown all related resources after use to avoid unnecessary costs. See here for list of deployed resources. NOTE: This deployment was tested using WSL 2 (Ubuntu 18.04) and Debian GNU/Linux 9.9 (stretch)

  1. Initial Setup

    1. Ensure that:

      • You are logged in to the Azure CLI. To login, run az login.
      • Azure CLI is targeting the Azure Subscription you want to deploy the resources to.
        • To set target Azure Subscription, run az account set -s <AZURE_SUBSCRIPTION_ID>
      • Azure CLI is targeting the Azure DevOps organization and project you want to deploy the pipelines to.
        • To set target Azure DevOps project, run az devops configure --defaults organization=https://dev.azure.com/<MY_ORG>/ project=<MY_PROJECT>
    2. Import this repository into a new Github repo. See here on how to import a github repo. Importing is necessary for setting up git integration with Azure Data Factory.

    3. Set the following required environment variables:

      • GITHUB_REPO - Name of your imported github repo in this form <my_github_handle>/<repo>. (ei. "devlace/mdw-dataops-import")
      • GITHUB_PAT_TOKEN - a Github PAT token. Generate them here. This requires "repo" scope.

      Optionally, set the following environment variables:

      • RESOURCE_GROUP_LOCATION - Azure location to deploy resources. Default: westus.
      • AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default azure subscription. To see your default, run az account list.
      • RESOURCE_GROUP_NAME_PREFIX - name of the resource group. This will automatically be appended with the environment name. For example: RESOURCE_GROUP_NAME_PREFIX-dev-rg. Default: mdwdo-park-${DEPLOYMENT_ID}.
      • DEPLOYMENT_ID - string appended to all resource names. This is to ensure uniqueness of azure resource names. Default: random five character string.
      • AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: master.
      • AZURESQL_SERVER_PASSWORD - Password of the SQL Server instance. Default: semi-random string.

      To further customize the solution, set parameters in arm.parameters files located in the infrastructure folder.

  2. Deploy Azure resources

    1. Clone locally the imported Github Repo, then cd into the e2e_samples/parking_sensors folder of the repo
    2. Run ./deploy.sh.
      • This may take around ~30mins or more to run end to end. So grab yourself a cup of coffee... ☕
      • After a successful deployment, you will find .env.{environment_name} files containing essential configuration information per environment. See here for list of deployed resources.
    3. As part of the deployment script, this updated the Azure DevOps Release Pipeline YAML definition to point to your Github repository. Commit and push up these changes.
      • This will trigger a Build and Release which will fail due to a lacking adf_publish branch -- this is expected. This branch will be created once you've setup git integration with your DEV Data Factory and publish a change.
  3. Setup ADF git integration in DEV Data Factory

    1. In the Azure Portal, navigate to the Data Factory in the DEV environment.
    2. Click "Author & Monitor" to launch the Data Factory portal.
    3. On the landing page, select "Set up code repository". For more information, see here.
    4. Fill in the repository settings with the following:
      • Repository type: Github
      • Github Account: your_Github_account
      • Git repository name: imported Github repository
      • Collaboration branch: master
      • Root folder: /e2e_samples/parking_sensors/adf
      • Import Existing Data Factory resource to repository: Selected
      • Branch to import resource into: Use Collaboration
    5. When prompted to select a working branch, select master

    IMPORTANT NOTE: Only the DEV Data Factory should be setup with Git integration. Do NOT setup git integration in the STG and PROD Data Factories.

  4. Trigger an initial Release

    1. In the DEV Data Factory portal, navigate to "Manage > Triggers". Select the T_Sched trigger and activate it by clicking on the "Play" icon next to it. Click Publish to publish changes.
      • Publishing a change is required to generate the adf_publish branch which is required in the Release pipelines.
    2. In Azure DevOps, notice a new run of the Build Pipeline (mdw-park-ci-artifacts) off master. This will build the Python package and SQL DACPAC, then publish these as Pipeline Artifacts.
    3. After completion, this should automatically trigger the Release Pipeline (mdw-park-cd-release). This will deploy the artifacts across environments.
      • You may need to authorize the Pipelines initially to use the Service Connection for the first time. Release Pipeline
    4. Optional. Trigger the Data Factory Pipelines per environment.
      1. In the Data Factory portal of each environment, navigate to "Author", then select the P_Ingest_MelbParkingData.
      2. Select "Trigger > Trigger Now".
      3. To monitor the run, go to "Monitor > Pipeline runs". Data Factory Run
      • Currently, the data pipeline is configured to use "on-demand" databricks clusters so it takes a few minutes to spin up. That said, it is not uncommon to change these to point to "existing" running clusters in Development for faster data pipeline runs.

Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines. For next steps, we recommend watching this presentation for a detailed walk-through of the running solution. If you've encountered any issues, please file a Github issue with the relevant error message and replication steps.

Deployed Resources

After a successful deployment, you should have the following resources:

  • In Azure, three (3) Resource Groups (one per environment) each with the following Azure resources.
    • Data Factory - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
    • Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
    • Databricks workspace
      • notebooks uploaded at /notebooks folder in the workspace
      • SparkSQL tables created
      • ADLS Gen2 mounted at dbfs:/mnt/datalake using the Storage Service Principal.
      • Databricks secrets created*
    • Azure Synapse (formerly SQLDW) - currently, empty. The Release Pipeline will deploy the SQL Database objects.
    • Application Insights
    • KeyVault with all relevant secrets stored.
  • In Azure DevOps
    • Four (4) Azure Pipelines
      • mdwdo-park-cd-release - Release Pipeline
      • mdwdo-park-ci-artifacts - Build Pipeline
      • mdwdo-park-ci-qa-python - "QA" pipeline runs on PR to master
      • mdwdo-park-ci-qa-sql - "QA" pipeline runs on PR to master
    • Three (6) Variables Groups - two per environment
      • mdwdo-park-release-dev
      • mdwdo-park-release-secrets-dev**
      • mdwdo-park-release-stg
      • mdwdo-park-release-secrets-stg**
      • mdwdo-park-release-prod
      • mdwdo-park-release-secrets-prod**
    • Four (4) Service Connections
      • Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
        • mdwdo-park-serviceconnection-dev
        • mdwdo-park-serviceconnection-stg
        • mdwdo-park-serviceconnection-prod
      • Github Service Connection for retrieving code from Github
        • mdwdo-park-github

Notes:

  • *This secret-scope is currently not deployed as a KeyVault-backed secret scope due to limitations of creating it programmatically.

  • **These variable groups are currently not linked to KeyVault due to limitations of creating these programmatically. See Known Issues, Limitations and Workarounds

Data Lake Physical layout

ADLS Gen2 is structured as the following:


datalake                    <- filesystem
    /sys/databricks/libs    <- contains all libs, jars, wheels needed for processing
    /data
        /lnd                <- Bronze - landing folder where all data files are ingested into.
        /interim            <- Silver - interim (cleansed) tables
        /dw                 <- Gold - final tables 

Known Issues, Limitations and Workarounds

The following lists some limitations of the solution and associated deployment script:

  • Databricks KeyVault-backed secrets scopes can only be create via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution.
    • Workaround: Deployment uses normal Databricks secrets with the downside of duplicated information. If you wish, you many manually convert these to KeyVault-back secret scopes. See here for more information.
  • Azure DevOps Variable Groups linked to KeyVault can only be created via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution.
    • Workaround: Deployment add sensitive configuration as "secrets" in Variable Groups with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with required variables in the Azure DevOps pipelines. See here for more information.
  • Azure DevOps Environment and Approval Gates can only be managed via the UI, cannot be managed programmatically and was not incorporated in the automated deployment of the solution.
    • Workaround: Approval Gates can be easily configured manually. See here for more information.