This project delivers Azure data engineering services, using Terraform and an Azure DevOps CI/CD pipeline to provision, configure and maintain them. The following technologies are used within this solution:
- Azure Data Lake
- Azure Data Factory
- Azure Databricks

It also includes the IaaS components that enhance the architecture for security, governance and best practice:
- Azure IaaS Network
- Key Vault
At the time of writing, the Terraform in this project was written and tested using the following providers, pinned to these versions:
- hashicorp/azurerm 3.10.0
- databrickslabs/databricks 0.6.0
- hashicorp/azuread 2.23.0
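If you need to recreate the provider pin, a minimal required_providers block matching those versions might look like the sketch below (the surrounding terraform block layout is an assumption; only the sources and versions come from the list above):

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.10.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "0.6.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "2.23.0"
    }
  }
}
```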
The solution is designed to be deployed initially as a single unit; however, once deployed, the configuration and permissions modules can be reused for general maintenance and housekeeping. The deployment is split into three sections:
- Infrastructure set up:
- Resource Group
- Storage and data lake
- Network
- Key Vault
- Data engineering service set up for Data Factory and Databricks:
- Azure Data Factory
- Azure Databricks workspace
Once the base infrastructure and data engineering services have been deployed, the solution moves on to provisioning the resources in Databricks. The resources are separated into two categories:
- Independent (resources that can be deployed to the Databricks workspace without depending on any other resources already existing). This provisions the Databricks cluster, workspace folders and workspace groups (see the sketch below).
- Dependent (resources that require the independent Databricks resources to already be deployed so they can be associated with them). This provisions the Azure AD group users, notebooks and local workspace users.
Resources in the adb-provision directory are required for the adb-maintenance and adb-permissions modules to work.
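For illustration, the kind of independent resources the adb-provision module creates, a cluster and a workspace group, might be sketched as follows (all names, node sizes and runtime versions here are placeholders rather than the solution's actual values):

```hcl
# Independent resources: nothing else in the workspace needs to exist first
resource "databricks_cluster" "shared" {
  cluster_name            = "shared-engineering-cluster" # placeholder name
  spark_version           = "10.4.x-scala2.12"           # placeholder runtime version
  node_type_id            = "Standard_DS3_v2"            # placeholder VM size
  autotermination_minutes = 30

  autoscale {
    min_workers = 1
    max_workers = 3
  }
}

resource "databricks_group" "engineers" {
  display_name = "data-engineers" # placeholder workspace group name
}
```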
The next phase of the deployment deploys the Databricks maintenance resources. This module configures the Databricks cluster and adds users to the groups created in the Databricks provision phase.
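A rough sketch of what this looks like, adding a workspace user to a group created in the provision phase (the user is a placeholder and the group reference assumes the group from the earlier sketch):

```hcl
resource "databricks_user" "engineer" {
  user_name = "jane.doe@example.com" # placeholder workspace user
}

resource "databricks_group_member" "engineer_membership" {
  group_id  = databricks_group.engineers.id
  member_id = databricks_user.engineer.id
}
```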
The next phase adds permissions to the resources created in the Databricks provision phase. This module adds permissions to the Databricks cluster, jobs and workspace folders.
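For example, granting a workspace group restart rights on the cluster might be expressed roughly like this (the resource references assume the earlier sketches and the permission level is illustrative):

```hcl
resource "databricks_permissions" "cluster_usage" {
  cluster_id = databricks_cluster.shared.id # assumes the cluster from the earlier sketch

  access_control {
    group_name       = databricks_group.engineers.display_name
    permission_level = "CAN_RESTART"
  }
}
```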
The final phase of the deployment adds the Azure Data Factory linked services. Linked services are set up for the Data Lake, the Databricks workspace and Key Vault.
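As an illustration of the shape of these resources, a Key Vault linked service in the linkedservice module might look like the sketch below (the names and resource references are assumptions, not the module's exact values):

```hcl
resource "azurerm_data_factory_linked_service_key_vault" "kv" {
  name            = "ls-keyvault"               # placeholder linked service name
  data_factory_id = azurerm_data_factory.adf.id # assumed data factory reference
  key_vault_id    = azurerm_key_vault.kv.id     # assumed key vault reference
}
```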
At the time of testing with the provider versions listed above, both adding users from Azure AD to the Databricks workspace and creating the private network for the Databricks workspace fail. Testing is ongoing to confirm when these features can be enabled in the project. Provisioning a linked service connection for Databricks using the managed service identity is also not supported in this version of the solution; a guide on how to configure this manually is included below until the feature is added. The affected Terraform resources have been commented out in the main.tf parent configuration files of the following modules:
- adb-provision
- adb-maintenance
- adb-permissions
- terraform-infra/network
- linkedservice
To use the solution, the following resources need to be available in an Azure subscription in order to deploy via an Azure DevOps pipeline:
- Service Principal
- Azure DevOps project to upload the code to and create an Azure DevOps pipeline from
- Resource group to store the Terraform remote state

The service principal will need the following API permissions set in Azure AD to deploy via the Azure DevOps pipeline:
The resource group will need to host a storage account, set up with containers to store the Terraform state. The Terraform is designed to execute as individual modules, so a container is needed for each module.
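Each module therefore points its backend at its own container. A typical backend block might look like the following sketch (the resource group, storage account and container names are placeholders to replace with your own):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state" # placeholder
    storage_account_name = "stterraformstate"   # placeholder
    container_name       = "adb-provision"      # one container per module
    key                  = "terraform.tfstate"
  }
}
```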
The XXX-build.yml and XXX-release.yml files need the resource group, storage account and container names set to reflect what is in the resource group used to store the remote state. The service principal will need the following permissions applied in advance, to ensure it can read and write the Terraform remote state file:
Ensure that the resource group, storage account and container in the following YAML files are set to the same matching values:
- XXX-build.yml: lines 13 - 15
- XXX-release.yml: lines 30 - 32
The service principal name will also need to be changed in those files: line 12 of XXX-build.yml and line 29 of XXX-release.yml. To pass the service principal secret and related details to the Azure DevOps pipeline, the values are stored in an Azure DevOps variable group (located under Pipelines --> Library) with the following names:
- TF_VAR_CLIENT_ID (service principal ID)
- TF_VAR_SECRET (service principal secret)
- TF_VAR_SUB (Azure subscription ID where the solution will be deployed to)
- TF_VAR_TENANT_ID (Azure tenant ID where the solution will be deployed to)
The capitalisation of these variable names is required for the Azure DevOps pipeline to pass the values to the Terraform configuration files. Do not amend the capitalisation; only set the necessary values.
The values from the Azure DevOps variable group are pulled into the Azure-pipelines.yml file using the variables --> group YAML property. The name of the variable group can be changed, but the change needs to be reflected both in the Azure-pipelines.yml file and in the variable group name under Pipelines --> Library.
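For reference, the group is pulled into Azure-pipelines.yml roughly like this (the group name shown is a placeholder):

```yaml
variables:
  - group: terraform-variables # placeholder: must match the group name under Pipelines --> Library
```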
In the Azure-pipelines.yml file, checkov is used to test and review the configuration. A number of checks are skipped as part of the solution, as they were reviewed as not relevant to this solution. Add or remove skipped checks as you feel necessary, understanding that doing so is at your own risk and that the configuration must still pass the checkov assessment.
Set the email address that should be notified when the checkov assessment completes, so that the report can be manually reviewed. Set this at line 32, adding as many additional email addresses as you require, each on a separate line with the same indentation.
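The notification itself comes from the ManualValidation task used later in the pipeline run; its inputs typically look like the sketch below (the email addresses and instructions text are placeholders, and the exact layout in Azure-pipelines.yml may differ):

```yaml
# Placeholder addresses below - one email per line, same indentation
- task: ManualValidation@0
  inputs:
    notifyUsers: |
      first.reviewer@example.com
      second.reviewer@example.com
    instructions: 'Review the checkov report before resuming the pipeline'
```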
To set up the Terraform configuration, set the desired values within the terraform.tfvars file; the resource group values are a typical example.
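A minimal sketch of what that might look like, assuming hypothetical variable names (check the module's variables.tf for the actual names):

```hcl
# terraform.tfvars - illustrative only; variable names are assumptions
resource_group_name = "rg-data-engineering"
location            = "uksouth"
```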
The following example uses Azure DevOps as the repository to store the code and execute the CI/CD pipeline. The pipeline will use the Azure-pipelines.yml file to execute the deployment. Use the following steps to set up the pipeline:
- Click on Pipelines on the left hand side
- Click on New pipeline on the right hand side
- Choose Azure Repos Git (YAML)
- Choose the repo that you have uploaded the code to
- Choose Existing Azure Pipelines YAML file
- Choose the respective branch (if different from the main branch), select the Azure-pipelines.yml file and click on Continue
- Click on Run
- The pipeline will begin to run
- The checkov task will require the engineer to approve the checkov report; click on the stage to review the report output
- Click on the Bash task to expose the report in the verbose output pane on the right hand side
- Review the report to ensure you're happy with the results before proceeding
- Once the review is complete, click on ManualValidation under the check_checkov_results nesting
- Click on Review on the right hand side
- Enter a comment to reflect the checkov report assessment, and click on Resume
- The pipeline will continue to run and will take approximately 30 minutes to complete. Afterwards, go into the Azure Portal to start using the solution
Once the deployment is complete, additional configuration is required to set up the Databricks mounts to the Data Lake and the Databricks linked service. The following steps go through what is required to complete this. First, create the secret scope for the Databricks workspace so that the Databricks mount script works. For the secret scope name, use the name of the Key Vault that you created; the mount script in the notebooks module sets this value dynamically, so using a different name for the scope will cause the mount script execution to fail. The following guide from Microsoft details how to do this using a Key Vault-backed scope, which is what this solution has been configured to use:
https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes
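As an alternative to the manual steps in that guide, the databrickslabs provider can also create a Key Vault-backed scope directly; a sketch is shown below (it assumes the provider is authenticated with Azure AD credentials and that the Key Vault resource reference exists in your configuration):

```hcl
resource "databricks_secret_scope" "kv" {
  name = azurerm_key_vault.kv.name # the scope name must match the key vault name for the mount script

  keyvault_metadata {
    resource_id = azurerm_key_vault.kv.id
    dns_name    = azurerm_key_vault.kv.vault_uri
  }
}
```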
Once this is done, to configure the Databricks linked service in Azure Data Factory complete the following steps:
- Select Azure Data Factory in the Azure Portal from the resource group blade
- Click on Managed Identities on the left hand side
- Ensure that System Assigned is selected, and change the status to On and Save
- Confirm enabling the MSI feature
- Click on Azure Role Assignments
- Ensure the correct subscription is selected and click on the Add role assignment plus sign
- Select Resource group as the scope, choosing the desired resource group. For Role, choose Contributor and Save
- Open up the Azure Data Factory Studio instance, and go to Manage to create a new linked service for Databricks
- Click on the plus sign next to New, select Compute in the blade on the right hand side, choose Azure Databricks and then click Continue
- Name the new linked service, and choose the Azure subscription for the Azure selection method
- The Databricks workspace should auto-fill, but choose the appropriate one if multiple already exist. Select Existing interactive cluster
- Select Managed Service Identity for the Authentication type
- Select the existing cluster created as part of the solution, test the connection and click on Create
- Click on Publish All to save the changes