
Azure Data Factory and Databricks Platform

This project deploys Azure data engineering services, using Terraform and an Azure DevOps CI/CD pipeline to provision, configure and maintain them. The following technologies are used within this solution:

  • Azure Data Lake
  • Azure Data Factory
  • Azure Databricks

It also includes the IaaS components to enhance the architecture for security, governance and best practice:

  • Azure IaaS Network
  • Key Vault

At the time of writing, the Terraform in this project was written and tested with the following providers pinned to their respective versions:

  • hashicorp/azurerm 3.10.0
  • databrickslabs/databricks 0.6.0
  • hashicorp/azuread 2.23.0
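
The exact constraint syntax in the modules may differ; a minimal sketch of pinning these providers in a terraform block looks like this:

    terraform {
      required_providers {
        azurerm = {
          source  = "hashicorp/azurerm"
          version = "3.10.0"
        }
        databricks = {
          source  = "databrickslabs/databricks"
          version = "0.6.0"
        }
        azuread = {
          source  = "hashicorp/azuread"
          version = "2.23.0"
        }
      }
    }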

Project Structure

The solution is designed to be deployed initially as a single unit; once deployed, the configuration and permissions modules can be reused for general maintenance and housekeeping. The deployment is split into three sections:

  • Infrastructure set up:
    • Resource Group

image

  • Storage and data lake
  • Network
  • Key Vault

image

  • Data engineering service set up for Data Factory and Databricks:
    • Azure Data Factory
    • Azure Databricks workspace

image

Once the base infrastructure and data engineering services have been deployed, the solution moves on to provisioning the resources in Databricks. The resources are separated into two categories:

  • Independent: resources that can be deployed to the Databricks workspace without needing any other resources to exist first. This provisions the Databricks cluster, workspace folders and workspace groups.
  • Dependent: resources that require the independent Databricks resources to already exist so they can be associated with them. This provisions the AD group users, notebooks and local workspace users.

image

Resources in the adb-provision directory are required for the adb-maintenance and adb-permissions modules to work.
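
As an illustration of the independent category handled by adb-provision, a minimal sketch (all names are hypothetical, not taken from the modules):

    # Independent: a cluster, workspace group and workspace folder need no other
    # workspace resources to exist first.
    resource "databricks_cluster" "shared" {
      cluster_name            = "shared-cluster"       # hypothetical name
      spark_version           = "10.4.x-scala2.12"
      node_type_id            = "Standard_DS3_v2"
      autotermination_minutes = 30
      num_workers             = 1
    }

    resource "databricks_group" "engineers" {
      display_name = "data-engineers"                  # hypothetical group
    }

    resource "databricks_directory" "project" {
      path = "/Shared/project"                         # hypothetical folder
    }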

The next phase of the deployment is deploying the Databricks maintenance resources. This module configures the Databricks cluster, and adds users to the groups created in the Databricks provision phase:

image
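
A minimal sketch of the dependent pattern this module relies on, assuming hypothetical names (the real user and group names live in the module variables):

    # A local workspace user and their membership in a group created by adb-provision.
    resource "databricks_user" "analyst" {
      user_name = "analyst@example.com"                # hypothetical user
    }

    resource "databricks_group_member" "analyst_membership" {
      group_id  = databricks_group.engineers.id        # group from the provision phase
      member_id = databricks_user.analyst.id
    }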

The next phase of the deployment is adding permissions to resources created in the Databricks provision phase. This module adds permissions to the Databricks cluster, jobs and workspace folders:

image
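
As a sketch of the pattern (hypothetical names), the databricks_permissions resource attaches access-control entries to an existing cluster, job or workspace folder:

    # Grant the workspace group restart rights on the cluster from the provision phase.
    resource "databricks_permissions" "cluster_usage" {
      cluster_id = databricks_cluster.shared.id

      access_control {
        group_name       = databricks_group.engineers.display_name
        permission_level = "CAN_RESTART"
      }
    }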

The final phase of the deployment is adding the Azure Data Factory linked services. Linked services are set up for the Data Lake, the Databricks workspace and Key Vault.

image
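
A hedged sketch of two of these linked services using the azurerm provider (resource names and referenced resources are hypothetical; the Databricks linked service is covered separately under Known issues):

    # Key Vault linked service.
    resource "azurerm_data_factory_linked_service_key_vault" "kv" {
      name            = "ls_keyvault"
      data_factory_id = azurerm_data_factory.adf.id
      key_vault_id    = azurerm_key_vault.kv.id
    }

    # Data Lake Storage Gen2 linked service.
    resource "azurerm_data_factory_linked_service_data_lake_storage_gen2" "adls" {
      name                 = "ls_datalake"
      data_factory_id      = azurerm_data_factory.adf.id
      url                  = "https://examplelake.dfs.core.windows.net"   # hypothetical account
      use_managed_identity = true
    }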

Known issues

At the time of testing with the provider versions listed above, adding users from Azure AD to the Databricks workspace and creating the private network for the Databricks workspace both fail. Testing is ongoing to confirm when these features can be enabled in the project. Provisioning a linked service connection for Databricks using the managed service identity is not supported in this version of the solution; a guide on how to configure this manually is included below until the feature has been added. The corresponding Terraform resources have been commented out in the main.tf parent configuration files of the following modules:

  • adb-provision
  • adb-maintenance
  • adb-permissions
  • terraform-infra/network
  • linkedservice

Prepare to use

To use the solution, the following resources need to be available in an Azure subscription before deploying via an Azure DevOps pipeline:

  • Service Principal
  • Azure DevOps project to upload the code to and create an Azure DevOps pipeline from
  • Resource group to store the Terraform remote state

The service principal will need the following API permissions set in Azure AD to deploy via the Azure DevOps pipeline:

image

The resource group will need to host a storage account, set up with containers to store the Terraform state. The Terraform is designed to execute as individual modules, so a container is needed for each module:

image
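
Each module's backend configuration then points at its own container; a minimal sketch with hypothetical names:

    terraform {
      backend "azurerm" {
        resource_group_name  = "rg-terraform-state"    # hypothetical state resource group
        storage_account_name = "sttfstate"             # hypothetical storage account
        container_name       = "adb-provision"         # one container per module
        key                  = "terraform.tfstate"
      }
    }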

The XXX-build.yml and XXX-release.yml files need the resource group, storage account and container name set to reflect what is in the resource group used for the remote state. The service principal will need the following permissions applied in advance, to ensure it can read and write the Terraform remote state:

image

Ensure in both the XXX-build.yml and XXX-release.yml YAML files that the resource group, storage account and container have been set with the same matching values:

For the XXX-build.yml file from lines 13 - 15

image

For the XXX-release.yml file from lines 30 - 32

image

The service principal name will need to be changed in those files: line 12 of XXX-build.yml and line 29 of XXX-release.yml. To pass the service principal secret and related details to the Azure DevOps pipeline, the values are stored in an Azure DevOps variable group (located under Pipelines --> Library) with the following names:

image

  • TF_VAR_CLIENT_ID (service principal ID)
  • TF_VAR_SECRET (service principal secret)
  • TF_VAR_SUB (Azure subscription ID where the solution will be deployed to)
  • TF_VAR_TENANT_ID (Azure tenant ID where the solution will be deployed to)

The capitalisation of these variable names is required for the Azure DevOps pipeline to pass the values through to the Terraform configuration files. Do not amend the capitalisation; only set the necessary values.
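
This matters because Terraform maps an environment variable named TF_VAR_<name> onto the Terraform variable with exactly that (case-sensitive) name. A sketch of how the pipeline values could reach the provider, assuming the variable names in the configuration match the pipeline variables:

    variable "CLIENT_ID" {
      type      = string
      sensitive = true
    }

    variable "SECRET" {
      type      = string
      sensitive = true
    }

    variable "SUB" {
      type = string
    }

    variable "TENANT_ID" {
      type = string
    }

    provider "azurerm" {
      features {}

      client_id       = var.CLIENT_ID
      client_secret   = var.SECRET
      subscription_id = var.SUB
      tenant_id       = var.TENANT_ID
    }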

The values from the Azure DevOps variable group are pulled into the Azure-pipelines.yml file using the variables --> group YAML property. The name of the variable group can be changed, but the change needs to be reflected both in the Azure-pipelines.yml file and in the variable group name under Pipelines --> Library.

image

In the Azure-Pipeline.yml file, Checkov is used to test and review the configuration. A number of checks have been skipped, as they were reviewed and deemed not relevant to this solution. Add or remove the checks you feel are necessary, understanding that doing so is at your own risk and that the configuration must still pass the Checkov assessment.

image

Set the email address that should be notified when the Checkov assessment completes, so the report can be manually reviewed. Set this at line 32, adding as many other email addresses as you require, each on a separate line with the same indentation.

To configure the Terraform, set the desired values in the terraform.tfvars file. The following example is what is used for the resource group:

image
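
As an illustration only (the actual variable names are defined in the module's variables files), a terraform.tfvars entry for the resource group might look like:

    resource_group_name = "rg-data-platform"   # hypothetical value
    location            = "uksouth"            # hypothetical value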

How to use

The following example uses Azure DevOps as the repository to store the code and execute the CI/CD pipeline. The pipeline uses the Azure-pipeline.yml file to execute the deployment. The following steps set up the pipeline:

  • Click on Pipelines on the left hand side

image

  • Click on New pipeline on the right hand side

image

  • Choose Azure Repos Git YML

image

  • Choose the repo that you have uploaded the code to

image

  • Choose Existing Azure Pipeline YAML file

image

  • Choose the respective branch (if different from the main branch) and select the Azure-Pipeline.yml file, and click on Continue

image

  • Click on Run

image

  • The pipeline will begin to run

image

  • The Checkov task requires the engineer to approve the Checkov report; click on the stage to review the report output

image

  • Click on Bash to expose the report to the verbose screen on the right hand side

image

  • Review the report to ensure you're happy with the results before proceeding

image

  • Once the review is complete, click on ManualValidation under the check_checkov_results nesting

image

  • Click on Review on the right hand side

image

  • Enter a comment to reflect the checkov report assessment, and click on Resume

image

  • The pipeline will continue to run; it will take approximately 30 minutes to complete. Afterwards, go into the Azure Portal to start using the solution

image

Once the deployment is complete, additional configuration is required to set up the Databricks mounts to the data lake and the linked service. The following steps cover what is required. First, create the secret scope for the Databricks workspace so that the Databricks mount script works. For the secret scope name, use the name of the key vault that you created; the mount script in the notebooks module derives this value dynamically, so using a different name will cause the mount script execution to fail. The following guide from Microsoft details how to do this using a key vault-backed scope, which is what this solution has been configured to use:

https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes
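
If you prefer to manage the scope in Terraform rather than through the guide above, the databricks provider exposes a key vault-backed secret scope; a hedged sketch with hypothetical references, worth verifying against the pinned provider version:

    # The scope name must equal the key vault name, because the mount notebook
    # derives the scope name from it.
    resource "databricks_secret_scope" "kv" {
      name = azurerm_key_vault.kv.name          # hypothetical key vault reference

      keyvault_metadata {
        resource_id = azurerm_key_vault.kv.id
        dns_name    = azurerm_key_vault.kv.vault_uri
      }
    }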

Once this is done, to configure the Databricks linked service in Azure Data Factory complete the following steps:

  • Select Azure Data Factory in the Azure Portal from the resource group blade

image

image

  • Click on Managed Identities on the left hand side

image

  • Ensure that System Assigned is selected, and change the status to On and Save

image

  • Confirm enabling the MSI feature

image

  • Click on Azure Role Assignments

image

  • Ensure the correct subscription is selected and click on the Add Role Assignment plus sign

image

  • Select Resource Group as the scope, choosing the desired resource group. For the role, choose Contributor and Save:

image

image

image

  • Open up the Azure Data Factory Studio instance, and go to Manage to create a new linked service for Databricks

image

  • Click on the plus sign next to New, select Compute on the right hand side blade, choose Azure Databricks and then click Continue

image

  • Name the new linked service, and choose the Azure subscription for the Azure selection method

image

  • The Databricks workspace should auto-fill, but choose the appropriate one if multiple already exist. Select Existing Interactive Cluster

image

  • Select Managed Service Identity for the Authentication type

image

  • Select the existing cluster created as part of the solution, test the connection, and click on Create

image

  • Click on Publish All to save the changes

image

image
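
For reference, once the Terraform support mentioned under Known issues is available, the same MSI-authenticated linked service could be expressed in the azurerm provider roughly as follows (a sketch with hypothetical resource references, not part of the current solution):

    resource "azurerm_data_factory_linked_service_azure_databricks" "adb_msi" {
      name                       = "ls_databricks_msi"
      data_factory_id            = azurerm_data_factory.adf.id
      adb_domain                 = "https://${azurerm_databricks_workspace.adb.workspace_url}"
      msi_work_space_resource_id = azurerm_databricks_workspace.adb.id
      existing_cluster_id        = databricks_cluster.shared.id
    }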
