No additional prerequisites are necessary, as the demo environment will be set up for you, including Azure Databricks, Purview, ADLS, and example data sources and notebooks.
From the Azure Portal
- At the top of the page, click the Cloud Shell icon.
- Make sure "Bash" is selected from the dropdown menu located at the left corner of the terminal.
  a. Click "Confirm" if the "Switch to Bash in Cloud Shell" pop-up appears.
- Use `az account set --subscription "<SubscriptionID>"` to select the Azure subscription you want to use.
  Note: If your Cloud Shell disconnects, you will need to rerun this command to ensure the correct subscription is selected.
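  If you are not sure which subscription ID to use, the following commands (shown with an illustrative placeholder ID) list the subscriptions available to your signed-in account and confirm the active one:
  ```bash
  # List all subscriptions visible to your signed-in account (name, ID, and state)
  az account list --output table

  # Select the subscription to deploy into (replace the placeholder with your own subscription ID)
  az account set --subscription "00000000-0000-0000-0000-000000000000"

  # Confirm which subscription is currently active
  az account show --output table
  ```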
- Create a resource group for the demo deployment by using:
  ```bash
  az group create --location <ResourceGroupLocation> --resource-group <ResourceGroupName>
  ```
  Note: Save the name of this resource group for use later.
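  For example (the location and name below are illustrative only; substitute your own values):
  ```bash
  # Example: create a resource group named "purview-adb-demo-rg" in the East US region
  az group create --location eastus --resource-group purview-adb-demo-rg
  ```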
- Change directory to the cloud storage directory (clouddrive):
  ```bash
  cd clouddrive
  ```
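  As an optional sanity check, you can confirm you are working inside the persisted Cloud Shell file share:
  ```bash
  # The working directory should end in .../clouddrive, which is backed by your Cloud Shell file share
  pwd
  ls
  ```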
- Clone this repository into the clouddrive directory using the latest release tag (i.e. `2.x.x`):
  ```bash
  git clone -b <release_tag> https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git
  ```
  Note: We highly recommend cloning from the release tags listed here. Clone the main branch only when using nightly builds. By using a nightly build (i.e. the latest commit on main), you gain access to newer / experimental features; however, those features may change before the next official release. If you are testing a deployment for production, please clone using release tags.
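  If you want to check which release tags exist before cloning, one option is to list them from the remote (the tag in the clone command below is a placeholder; pick a real tag from the output):
  ```bash
  # List the release tags published on the remote repository
  git ls-remote --tags https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git

  # Clone a specific release tag (replace <release_tag> with one of the tags listed above)
  git clone -b <release_tag> https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git
  ```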
- After the clone, click the "Upload/download" icon and select "Manage file share".
- Navigate to `Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/settings.sh`, click "…", and select "edit".
- Input values for (see the illustrative sketch below):
  - Resource group
  - Prefix (this is added to service names)
  - Client ID & Secret (from the App ID required as a prerequisite)
  - Tenant ID
  - Purview location
  - Resource Tags (optional, in the following format: `{"Name":"Value","Name2":"Value2"}`)
    - NOTE: Resource Tags are optional. If you are not using any Resource Tags, input an empty set of double quotes ("").
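  The sketch below is purely hypothetical: the actual variable names are defined by the repository's settings.sh and may differ, so treat it only as an indication of the kind of value each setting expects:
  ```bash
  # Hypothetical settings.sh values -- the real variable names come from the repository's file
  RESOURCE_GROUP="purview-adb-demo-rg"               # the resource group created earlier
  PREFIX="pvdemo"                                    # short prefix added to service names
  CLIENT_ID="<APP_REGISTRATION_CLIENT_ID>"           # from the prerequisite App Registration
  CLIENT_SECRET="<APP_REGISTRATION_CLIENT_SECRET>"
  TENANT_ID="<AZURE_TENANT_ID>"
  PURVIEW_LOCATION="eastus"                          # region for the Purview account
  RESOURCE_TAGS='{"Environment":"Demo","Owner":"DataTeam"}'   # or "" if you are not using tags
  ```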
- Push the Save icon to save your changes.
  Note: Running this script will create all the services noted above, including Azure Databricks and an Azure Databricks cluster, which will start after deployment. This cluster is configured to auto-terminate after 15 minutes, but some Azure charges will accrue.
- Navigate to the `deployment/infra` directory:
  ```bash
  cd clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra
  ```
  Note: If your organization requires private endpoints for Azure Storage and Azure Event Hubs, you may need to follow the private endpoint guidance and modify the provided ARM template.
- Run the deployment script:
  ```bash
  ./openlineage-deployment.sh
  ```
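  If the shell reports "Permission denied" when launching the script (a generic bash issue, not a step the accelerator requires), mark it executable first:
  ```bash
  # Only needed if ./openlineage-deployment.sh fails with "Permission denied"
  chmod +x openlineage-deployment.sh
  ./openlineage-deployment.sh
  ```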
- (Manual Configuration) After the initial deployment, the script will stop and ask you to add the service principal to the Data Curator role in the Purview resource. Follow this documentation to Set up Authentication using Service Principal, using the Application Identity you created as a prerequisite to installation.
- Once your service principal is added, go back to the Bash terminal and hit "Enter".
- The Purview types will be deployed and the deployment will finish.
  Note: At this point, you should confirm the resources deployed successfully. In particular, check the Azure Function: in its Functions tab, you should see an OpenLineageIn and a PurviewOut function. If you see an error like `Microsoft.Azure.WebJobs.Extensions.FunctionMetadataLoader: The file 'C:\home\site\wwwroot\worker.config.json' was not found.`, please restart (or start and stop) the function to resolve the issue. Lastly, check the Azure Function Configuration tab and confirm that all the Key Vault referenced app settings have a green checkmark. If not, wait an additional 2-5 minutes and refresh the screen. If the Key Vault references are not all green, check that the Key Vault has an access policy referencing the Azure Function.
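  If you prefer to restart the Function App from the command line rather than the portal, the following Azure CLI commands (with placeholder names) will do it:
  ```bash
  # Restart the Function App (substitute your own function app and resource group names)
  az functionapp restart --name <FunctionAppName> --resource-group <ResourceGroupName>

  # Or stop and start it explicitly
  az functionapp stop --name <FunctionAppName> --resource-group <ResourceGroupName>
  az functionapp start --name <FunctionAppName> --resource-group <ResourceGroupName>
  ```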
- Finally, run the Databricks notebook provided in your new workspace and observe lineage in Microsoft Purview once the Databricks notebook has finished running all cells.
- If you do not see any lineage, please follow the steps in the troubleshooting guide.
- If you are interested in demonstrating lineage from Databricks jobs, please follow the steps in the connector only deployment.

Note: If your original Bash shell gets closed or goes away while you are completing the manual installation steps above, you can manually run the final part of the installation by running the following from a Cloud Shell Bash session in the same subscription context:
```bash
# Purview endpoint and the service principal credentials used for authentication
purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"
TENANT_ID="<TENANT_ID>"
CLIENT_ID="<CLIENT_ID>"
CLIENT_SECRET="<CLIENT_SECRET>"

# Request an access token for the Purview data plane using the service principal
acc_purview_token=$(curl https://login.microsoftonline.com/$TENANT_ID/oauth2/token --data "resource=https://purview.azure.net&client_id=$CLIENT_ID&client_secret=$CLIENT_SECRET&grant_type=client_credentials" -H Metadata:true -s | jq -r '.access_token')

# Upload the custom Atlas type definitions (run from the directory containing Custom_Types.json)
purview_type_resp_custom_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
  -H "Authorization: Bearer $acc_purview_token" \
  -H "Content-Type: application/json" \
  -d @Custom_Types.json )

# Print the response so you can confirm the types were created
echo $purview_type_resp_custom_type
```
If you need a PowerShell alternative, see the docs.

You should now be able to run your demo notebook and receive lineage.