This Power Skill uses the DBSCAN unsupervised clustering algorithm alongside VGG16 to extract visual features and cluster images.
This skill is ideal for:
- Exploring your data to identify clusters based on visual features during your data exploration phase
- Using in conjunction with Custom Vision Classification to further cluster your images, for example if you need a hierarchical classification structure.
- Auto-labelling your images based on the clusters identified and the labels you associate with those clusters. Note that Azure Machine Learning already has an auto-labelling feature; use this Power Skill if that feature is not suitable
See the data folder for the sample images used in the skill.
In addition to the common requirements described in the root README.md file, this Power Skill requires access to a Custom Vision resource. This process will use object detection and augment it with cluster labels.
To run this PowerSkill you will need:
- Docker
- An Azure Blob storage container
- A provisioned Azure Cognitive Search (ACS) instance
- A provisioned Azure Container Registry
- A Cognitive Services key in the region you deploy ACS to
Below is a full working example that you can get working end to end on sample data.
- The first step in the process is to extract VGG16 embeddings from the images and train the DBSCAN model on the extracted features.
- To better understand the algorithm itself, use the explanatory notebook, which contains a local example of the process.
- Training: For simple local training use the local training cell in Local Training
- As with any (especially, unsupervised) machine learning solution, inspecting the clusters generated and playing with the algorithm hyperparameters will be required.
- To explore the generated clusters and create the labels dictionary required for the custom skill, use the labeling notebook. These labels are what will be indexed to retrieve the images.
- A clusters report is also available under the registered model in the Azure Machine Learning portal.
- Deploy the skill and add the endpoint to your skillset file using the deploy notebook
- Run your indexer [deployment/azuresearch/create_indexer.json]
- Investigate your indexed data and compare the effect of using the Image Clustering Power Skill with that of the Computer Vision service, using the Azure Search notebook.
This section describes how to get this working on sample data and how it can be amended for your data.
The first step is to extract the sample data files (the train data and the test data) into the existing data folder.
Open the Detect Similar Images notebook.
This notebook demonstrates the idea behind the ImageClusteringSkill using a small dataset of open and closed books and bookshelves.
Basically, the PowerSkill consists of the following two steps:
- Extract VGG16 embeddings
- Cluster embeddings using DBSCAN
Run all the cells on the sample dataset to get an idea of how the data is clustered. The notebook will load the sample book data from the train folder. When using your own data, experiment with the epsilon (eps) parameter, as this influences the number of clusters detected in the data. Visually inspect the resulting clusters until the grouping makes sense.
A pre-trained VGG16 model (vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5) will be used to extract the features from the images.
The last cells display the data that have been clustered as similar.
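To make the effect of eps more concrete, here is a minimal sketch. It assumes scikit-learn's DBSCAN implementation and uses small synthetic vectors in place of real VGG16 embeddings, so the data, the `summarize_clusters` helper, and the chosen eps values are illustrative only and are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for image embeddings (an assumption for illustration):
# three tight groups of 20 four-dimensional vectors each.
rng = np.random.default_rng(0)
centers = [0.0, 10.0, 20.0]
features = np.vstack(
    [center + rng.normal(scale=0.1, size=(20, 4)) for center in centers]
)


def summarize_clusters(features, eps, min_samples=3):
    """Return (number of clusters, number of unclustered points) for one eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    # DBSCAN assigns the label -1 to points it could not place in any cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# A very small eps leaves every point unclustered, a very large one merges
# everything, and a value in between separates the groups.
for eps in (0.001, 1.0, 50.0):
    n_clusters, n_noise = summarize_clusters(features, eps)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} unclustered points")
```

On real embeddings the useful range of eps depends on the scale and dimensionality of the features, so a sweep like this is only a starting point for the visual inspection.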
Open the Training the model notebook.
This notebook shows how the model can be trained on the sample data for inference.
Run the "Local Training" cell. If you are running on your own data, you can experiment with the DBSCAN parameters here. If you are running on the book sample data, leave them as they are. Refer to the DBSCAN documentation for more information.
Once complete, this saves a model to the models directory. We will deploy this model to our API for inference later.
Now that we have identified the clusters in our data, we want to go and label them with our search terms that will help users easily find them. In our sample data, we have books that are open and closed and we also have bookshelves.
Now open the label and deploy notebook. The cell there trains a model on the data and shows the clusters. Any data point with a cluster value of -1 could not be clustered; all other numbers represent the cluster id.
Here you will see that we labelled the books with a dictionary that allows multiple labels per cluster:
dict = {0 : ['book cover', 'closed book'], 1 : ['open book', 'double spread'], 2: ['book shelf', 'library']}
Each key of the dictionary is a discovered cluster id. Double-check that the labels match the images in each cluster, in case the cluster ids have changed.
We will deploy our generated label file with our docker image.
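As a rough illustration of how such a dictionary can be used, the sketch below looks up the labels for a cluster id and treats -1 as "no labels". The `labels_for_cluster` helper is hypothetical, and the pickle round trip assumes that the label file is a pickled dictionary, which the .pkl extension suggests but the notebook should confirm.

```python
import pickle

# Cluster id -> search labels, as in the sample dictionary above.
cluster_labels = {
    0: ["book cover", "closed book"],
    1: ["open book", "double spread"],
    2: ["book shelf", "library"],
}

NOISE_CLUSTER_ID = -1  # DBSCAN marks points it could not cluster with -1


def labels_for_cluster(cluster_id, label_map):
    """Return the labels for a cluster id, or an empty list for noise or unknown ids."""
    if cluster_id == NOISE_CLUSTER_ID:
        return []
    return label_map.get(cluster_id, [])


# Round trip through pickle (assumption: the label file is a pickled dict).
serialized = pickle.dumps(cluster_labels)
restored = pickle.loads(serialized)

print(labels_for_cluster(1, restored))
print(labels_for_cluster(-1, restored))
```

Returning an empty list for unclustered images keeps the lookup from failing on them; whether the deployed skill behaves this way is not stated here.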
For this step you will need docker running so that we can build and test our inference API locally. You will also need a container registry for the build.
Run the following command to build the inference API container image:
docker build -t [container_registry_name].azurecr.io/clusterextractor:[your_tag] .
The container requires the following variables to be set at runtime:
KEY=[YourSecretKeyCanBeAnything] # This is a secret key - only requests with this key will be allowed
DEBUG=True # This enables verbose logging
DBSCAN_MODEL=books.pkl # This is the name of the cluster model created from training
CLUSTER_LABELS= # This is the labels file we created to label our clusters
See the file sample_env for the .env format
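Before starting the container, it can help to check that these variables are all present. The following is a hypothetical pre-flight check, not the skill's own configuration code: the `load_settings` helper, the choice of which variables are required, and the treatment of DEBUG as optional are all assumptions.

```python
import os

# Assumption: these three variables must be set, while DEBUG is optional.
REQUIRED_VARIABLES = ("KEY", "DBSCAN_MODEL", "CLUSTER_LABELS")


def load_settings(env=None):
    """Collect the container settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return {
        "key": env["KEY"],
        # Interpret "True"/"true" as enabled; anything else disables verbose logging.
        "debug": env.get("DEBUG", "False").strip().lower() == "true",
        "dbscan_model": env["DBSCAN_MODEL"],
        "cluster_labels": env["CLUSTER_LABELS"],
    }


example = {
    "KEY": "example-secret",
    "DEBUG": "True",
    "DBSCAN_MODEL": "books.pkl",
    "CLUSTER_LABELS": "labels.pkl",
}
print(load_settings(example))
```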
Now we can test the container by running it locally with our variables:
docker run -it --rm -p 5000:5000 -e KEY=[YourSecretKeyCanBeAnything] -e DEBUG=True \
    -e DBSCAN_MODEL=books.pkl -e CLUSTER_LABELS=labels.pkl \
    [container_registry_name].azurecr.io/clusterextractor:[your_tag]
Upon starting, you will see a few TensorFlow warnings and the download of the VGG16 model will begin. See below:
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
58892288/58889256 [==============================] - 16s 0us/step
You should also see the following:
INFO:uvicorn.error:Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
We are now ready to send a request.
The deploy notebook contains a cell, "Test the deployed inference API Web App", that enables you to test the Web App.
Alternatively you can also use Postman, see below:
Use Postman to issue a test request to your local inference API. As we are emulating what Azure Cognitive Search will send to a PowerSkill, we need to base64 encode an image as a string.
Issue the request with the following settings, using the contents of the file postman_request.json as the body:
URI: http://0.0.0.0:5000/api/extraction
Headers:
Ocp-Apim-Subscription-Key: [YourSecretKeyCanBeAnything]
Content-Type: application/json
Body: Copy the contents of the file ../data/postman_request.json
After issuing the above request you should get the following response:
{
"values": [
{
"recordId": "0",
"errors": "",
"data": {
"label": [
"open book",
"double spread"
]
},
"warnings": ""
}
]
}
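The request and response above can also be handled from Python instead of Postman. The sketch below uses only the standard library. The `values`/`recordId`/`data` envelope follows the response shown above, but the input field name `image` is an assumption made for illustration; check postman_request.json for the field names the skill actually expects. The helper names are hypothetical, and `send_request` is defined but not called here.

```python
import base64
import json
import urllib.request


def build_skill_request(image_bytes, record_id="0"):
    """Wrap a base64-encoded image in a custom-skill style request body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # "image" is a hypothetical field name; see postman_request.json for the real one.
    return {"values": [{"recordId": record_id, "data": {"image": encoded}}]}


def extract_labels(response_body):
    """Map each recordId in a skill response to its list of cluster labels."""
    return {
        value["recordId"]: value["data"].get("label", [])
        for value in response_body["values"]
    }


def send_request(url, key, payload):
    """POST the payload to the inference API and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Parse the sample response shown above.
sample_response = {
    "values": [
        {
            "recordId": "0",
            "errors": "",
            "data": {"label": ["open book", "double spread"]},
            "warnings": "",
        }
    ]
}
print(extract_labels(sample_response))
```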
We are now ready to deploy our inference API. We will deploy this as an Azure App Service Web App running a container.
First we need to push our newly built image to our container registry.
Run the following command:
docker push [container_registry_name].azurecr.io/clusterextractor:[your_tag]
In the deployment folder are two Terraform files to deploy the inference API to an App Service Web App for Linux.
The simplest approach is to open a Cloud Shell and upload the main and variables files to your Cloud Shell storage, as this avoids the need for any installation.
Set the following values in the main file:
backend "azurerm" {
storage_account_name = "[your storage account name]"
container_name = "[your storage container name]"
key = "[your storage account key]"
resource_group_name = "[your storage account resource group name]"
}
Set the following values in the variables file:
variable "app_service_sku" {
description = "The SKU (size - cpu/mem) of the app plan hosting the container. See: https://azure.microsoft.com/en-us/pricing/details/app-service/linux/"
default = "P2V2"
}
variable "docker_registry_url" {
description = "[your container registry].azurecr.io"
default = ""
}
variable "docker_registry_username" {
description = "[your container registry username]"
default = ""
}
variable "docker_registry_password" {
description = "[your container registry password]"
default = ""
}
variable "docker_image" {
description = "[your docker image name]:[your tag]"
default = ""
}
variable "dbscan_model" {
description = "Set this to books.pkl (if using demo value)"
default = "books.pkl"
}
variable "resource_group" {
description = "This is the name of an existing resource group to deploy to"
default = ""
}
variable "location" {
description = "This is the region of an existing resource group you want to deploy to"
default = "eastus2"
}
variable "debug" {
description = "API logging - set to True for verbose logging"
default = false
}
variable "cluster_labels" {
description = "Set this to labels.pkl (if using demo value)"
default = "labels.pkl"
}
Navigate to the directory containing the files and enter:
terraform init
Then enter:
terraform apply
You will be prompted with:
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Type yes
Once deployed, copy the Azure Web App URL, which can be found in the Overview section of the portal, as we will need it to plug into Azure Search.
We are now ready to plug the Clustering PowerSkill into our ACS pipeline and test it.
Note that you need an already deployed ACS instance in the same region as your Cognitive Services instance, because we want to compare what our clustering provides in addition to the Custom Vision services. The aim is to augment what we can extract using Custom Vision with our clustering model.
You will need your ACS API Key and the URL for your ACS instance.
Navigate to and execute the deploy PowerSkill to ACS cell to deploy our PowerSkill. Alternatively, populate the values within the deployment JSON files and use Postman.
The first step is to upload the data files to a container in Azure blob storage and get the connection values to create the ACS data source.
- Next create the index by running the create index cell
- Next create the skillset by running the create skillset cell
- Next create the indexer by running the create indexer cell
The indexer will automatically run and you should see requests coming in if you look at the Web App logs.
We are now in a position to search our cluster-labelled data. Navigate to the test search cell to search on our clustered images.
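For reference, a search like this can be sketched with the Azure Cognitive Search REST API and the standard library. The service URL, index name, and key below are placeholders, the `build_search_request` helper is hypothetical, and the API version is one stable version that may differ from the one used in the notebook. The request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(service_url, index_name, api_key, search_text,
                         api_version="2020-06-30"):
    """Build a POST request for the Search Documents REST operation."""
    url = (
        f"{service_url.rstrip('/')}/indexes/{urllib.parse.quote(index_name)}"
        f"/docs/search?api-version={api_version}"
    )
    body = json.dumps({"search": search_text, "top": 10}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values (assumptions): replace with your ACS URL, index name, and key.
request = build_search_request(
    "https://example.search.windows.net",
    "example-index",
    "example-api-key",
    "open book",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the matching documents are returned in the `value` array of the JSON response.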