Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN

This directory contains code samples that demonstrate how to implement a low-latency item-to-item recommendation solution by training and serving embeddings that you can use to enable real-time similarity matching. The foundations of the solution are BigQuery and ScaNN, which is an open source library for efficient vector similarity search at scale.

The series is for data scientists and ML engineers who want to build an embedding training and serving system for item-to-item recommendation use cases. It assumes that you have experience with Google Cloud, BigQuery, AI Platform, Dataflow, Datastore, and with TensorFlow and TFX Pipelines.

Solution variants

There are two variants of the solution:

  • The first variant utilizes generally available releases of BigQuery and AI Platform together with open source components including ScaNN and Kubeflow Pipelines. To use this variant, follow the instructions in the Production variant section.
  • The second variant is a fully managed solution that uses the experimental releases of AI Platform Pipelines and the ANN service. To use this variant, follow the instructions in the Experimental variant section.

Dataset

We use the public bigquery-samples.playlists BigQuery dataset to demonstrate the solutions. We use the playlist data to learn embeddings for songs based on their co-occurrences in different playlists. The learned embeddings can be used to match and recommend relevant songs to a given song or playlist.

Production variant

At a high level, the solution works as follows:

  1. Computes pointwise mutual information (PMI) between items based on their co-occurrences.
  2. Trains item embeddings using BigQuery ML Matrix Factorization, with item PMI as implicit feedback.
  3. Using Cloud Dataflow, post-processes the embeddings into CSV files and exports them from the BigQuery ML model to Cloud Storage.
  4. Implements an embedding lookup model using TensorFlow Keras, and then deploys it to AI Platform Prediction.
  5. Serves the embeddings as an approximate nearest neighbor index on AI Platform Prediction for real-time similar items matching.
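
As a rough illustration of step 1, the following sketch computes PMI for item pairs from a handful of toy playlists. It is not the code used by the notebooks, which do this work in BigQuery. The function name `compute_pmi`, the toy data, and the choice to estimate probabilities as fractions of playlists are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def compute_pmi(groups):
    """Return {(item_a, item_b): pmi} for every item pair that co-occurs.

    groups: iterable of item collections (for example, playlists of song IDs).
    Probabilities are estimated as fractions of groups; this is an assumption
    of the sketch, not necessarily what the solution's stored procedures do.
    """
    groups = [set(g) for g in groups]
    n = len(groups)
    # Number of groups that contain each item.
    item_counts = Counter(item for g in groups for item in g)
    # Number of groups that contain each unordered pair of items.
    pair_counts = Counter(
        pair for g in groups for pair in combinations(sorted(g), 2)
    )
    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n
        p_a = item_counts[a] / n
        p_b = item_counts[b] / n
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy data standing in for the playlist dataset.
playlists = [
    ["song_a", "song_b"],
    ["song_a", "song_b", "song_c"],
    ["song_c", "song_d"],
    ["song_a", "song_d"],
]
scores = compute_pmi(playlists)
```

A positive score means two songs appear together more often than their individual frequencies would suggest, a score near zero means they co-occur about as often as chance, and a negative score means they co-occur less often than chance.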

Diagram showing the architecture of the item embedding solution.

For a detailed description of the solution architecture, see Architecture of a machine learning system for item matching.
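
To make the serving path in steps 4 and 5 more concrete, here is a minimal sketch of an embedding lookup followed by an exact nearest neighbor search. It is a brute-force baseline, not ScaNN: an approximate index trades a small amount of accuracy for much lower latency on large catalogs. The embedding values, the function names, and the use of cosine similarity are assumptions for illustration; the deployed index may use a different similarity measure, such as dot product.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, k=2):
    """Return the k item IDs whose embeddings are closest to the query item."""
    query = embeddings[query_id]  # the "embedding lookup" step
    scored = [
        (cosine_similarity(query, vector), item_id)
        for item_id, vector in embeddings.items()
        if item_id != query_id
    ]
    # Exact search: score every candidate, then keep the top k.
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical two-dimensional embeddings; real ones have many more dimensions.
embeddings = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],
    "song_c": [0.0, 1.0],
    "song_d": [-1.0, 0.0],
}
neighbors = most_similar("song_a", embeddings, k=2)
```

Exact search scores every item for every query, so its cost grows linearly with the catalog size. That is the cost an approximate nearest neighbor index is meant to avoid.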

Cost

The solution uses the following billable components of Google Cloud:

  • AI Platform Notebooks
  • AI Platform Pipelines
  • AI Platform Prediction
  • AI Platform Training
  • Artifact Registry
  • BigQuery
  • Cloud Build
  • Cloud Storage
  • Dataflow
  • Datastore

To learn about Google Cloud pricing, use the Pricing Calculator to generate a cost estimate based on your projected usage.

Running the solution

You can run the solution step-by-step, or you can run it by using a TFX pipeline.

Run the solution step-by-step

  1. Complete the steps in Set up the GCP environment.
  2. Complete the steps in Set up the AI Platform Notebooks environment.
  3. In the JupyterLab environment of the embeddings-notebooks instance, open the file browser pane and navigate to the analytics-componentized-patterns/retail/recommendation-system/bqml-scann directory.
  4. Run the 00_prep_bq_and_datastore.ipynb notebook to import the playlist dataset, create the vw_item_groups view with song and playlist data, and export song title and artist information to Datastore.
  5. Run the 00_prep_bq_procedures.ipynb notebook to create stored procedures needed by the solution.
  6. Run the 01_train_bqml_mf_pmi.ipynb notebook. This covers computing item co-occurrences using PMI, and then training a BigQuery ML matrix factorization model to generate item embeddings.
  7. Run the 02_export_bqml_mf_embeddings.ipynb notebook. This covers using Dataflow to request the embeddings from the matrix factorization model, format them as CSV files, and export them to Cloud Storage.
  8. Run the 03_create_embedding_lookup_model.ipynb notebook. This covers creating a TensorFlow Keras model to wrap the item embeddings, exporting that model as a SavedModel, and deploying that SavedModel to act as an item-embedding lookup.
  9. Run the 04_build_embeddings_scann.ipynb notebook. This covers building an approximate nearest neighbor index for the embeddings using ScaNN and AI Platform Training, then exporting the ScaNN index to Cloud Storage.
  10. Run the 05_deploy_lookup_and_scann_caip.ipynb notebook. This covers deploying the embedding lookup model and ScaNN index (wrapped in a Flask app to add functionality) created by the solution.
  11. If you don't want to keep the resources you created for this solution, complete the steps in Delete the GCP resources.
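
For orientation, the training in step 6 comes down to a BigQuery ML `CREATE MODEL` statement with the matrix factorization model type and implicit feedback. The sketch below only builds such a statement as a string. The notebooks wrap the real logic in stored procedures, so treat the dataset, model, table, and column names here (`recommendations.item_matching_model`, `recommendations.item_cooc`, `item1_Id`, `item2_Id`, `score`) and the number of factors as hypothetical placeholders.

```python
def build_create_model_sql(model, source_table, num_factors=50):
    """Build a BigQuery ML statement that trains an implicit-feedback
    matrix factorization model on item-pair scores.

    All table and column names are hypothetical placeholders.
    """
    return f"""
CREATE OR REPLACE MODEL `{model}`
OPTIONS (
  MODEL_TYPE = 'MATRIX_FACTORIZATION',
  FEEDBACK_TYPE = 'IMPLICIT',
  USER_COL = 'item1_Id',
  ITEM_COL = 'item2_Id',
  RATING_COL = 'score',
  NUM_FACTORS = {int(num_factors)}
) AS
SELECT item1_Id, item2_Id, score
FROM `{source_table}`
""".strip()


sql = build_create_model_sql(
    model="recommendations.item_matching_model",  # hypothetical name
    source_table="recommendations.item_cooc",     # hypothetical name
    num_factors=50,                               # hypothetical value
)
print(sql)
```

Because both sides of each pair are items, the "user" and "item" columns of the factorization each hold an item ID, and the PMI score plays the role of the rating. You could paste the printed statement into the BigQuery console or submit it through a BigQuery client.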

Run the solution by using a TFX pipeline

In addition to the manual steps outlined above, we provide a TFX pipeline that automates the process of building and deploying the solution. To run the solution by using the TFX pipeline, follow these steps:

  1. Complete the steps in Set up the GCP environment.
  2. Complete the steps in Set up the AI Platform Notebooks environment.
  3. In the JupyterLab environment of the embeddings-notebooks instance, open the file browser pane and navigate to the analytics-componentized-patterns/retail/recommendation-system/bqml-scann directory.
  4. Run the 00_prep_bq_and_datastore.ipynb notebook to import the playlist dataset, create the vw_item_groups view with song and playlist data, and export song title and artist information to Datastore.
  5. Run the 00_prep_bq_procedures.ipynb notebook to create stored procedures needed by the solution.
  6. Run the tfx01_interactive.ipynb notebook. This covers creating and running a TFX pipeline that runs the solution, which includes all of the tasks mentioned in the step-by-step notebooks above.
  7. Run the tfx02_deploy_run.ipynb notebook. This covers deploying the TFX pipeline, including building a Docker container image, compiling the pipeline, and deploying the pipeline to AI Platform Pipelines.
  8. Run the 05_deploy_lookup_and_scann_caip.ipynb notebook. This covers deploying the embedding lookup model and ScaNN index (wrapped in a Flask app to add functionality) created by the solution.
  9. If you don't want to keep the resources you created for this solution, complete the steps in Delete the GCP resources.

Set up the GCP environment

Before running the solution, you must complete the following steps to prepare an appropriate environment:

  1. Create and configure a GCP project.

  2. Create the GCP resources you need.

    Before creating the resources, consider what regions you want to use. Creating resources in the same region or multi-region (like US or EU) can reduce latency and improve performance.

  3. Clone this repo to the AI Platform notebook environment.

  4. Install the solution requirements on the notebook environment.

  5. Add the sample dataset and some stored procedures to BigQuery.

Set up the GCP project

  1. In the Cloud Console, on the project selector page, select or create a Cloud project.
  2. Make sure that billing is enabled for your Cloud project.
  3. Enable the Compute Engine, Dataflow, Datastore, AI Platform, AI Platform Notebooks, Artifact Registry, Identity and Access Management, Cloud Build, BigQuery, and BigQuery Reservations APIs.

Create a BigQuery reservation

If you use on-demand pricing for BigQuery, you must purchase flex slots and then create reservations and assignments for them in order to train a matrix factorization model. You can skip this section if you use flat-rate pricing with BigQuery.

You must have the bigquery.reservations.create permission in order to purchase flex slots. This permission is granted to the project owner, and also to the bigquery.admin and bigquery.resourceAdmin predefined Identity and Access Management roles.

  1. In the BigQuery console, click Reservations.

  2. On the Reservations page, click Buy Slots.

  3. On the Buy Slots page, set the options as follows:

    1. In Commitment duration, choose Flex.

    2. In Location, choose the region you want to use for BigQuery. Depending on the region you choose, you may have to request additional slot quota.

    3. In Number of slots, choose 500.

    4. Click Next.

    5. In Purchase confirmation, type CONFIRM.

      Note: The console displays an estimated monthly cost of $14,600.00. You will delete the unused slots at the end of this tutorial, so you will only pay for the slots you use to train the model. Training the model takes approximately 2 hours.

  4. Click Purchase.

  5. Click View Slot Commitments.

  6. Allow up to 20 minutes for the capacity to be provisioned. After the capacity is provisioned, the slot commitment status turns green and shows a checkmark.

  7. Click Create Reservation.

  8. On the Create Reservation page, set the options as follows:

    1. In Reservation name, type model.
    2. In Location, choose the region in which you purchased the flex slots.
    3. In Number of slots, type 500.
    4. Click Save. This returns you to the Reservations page.

  9. Select the Assignments tab.

  10. In Select an organization, folder, or project, click Browse.

  11. Type the name of the project you are using.

  12. Click Select.

  13. In Reservation, choose the model reservation you created.

  14. Click Create.
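
To put the estimated monthly cost shown during purchase in perspective, the following back-of-the-envelope calculation converts it to an hourly rate and to the approximate cost of a 2-hour training run. The 730 hours per month figure is an assumption of this sketch, and actual prices depend on your region and on current BigQuery pricing.

```python
# Figures taken from the purchase note above.
monthly_cost_usd = 14_600.00
slots = 500
training_hours = 2

# Assumption: the monthly estimate is based on an average month of 730 hours.
HOURS_PER_MONTH = 730

hourly_cost_usd = monthly_cost_usd / HOURS_PER_MONTH
cost_per_slot_hour_usd = hourly_cost_usd / slots
training_cost_usd = hourly_cost_usd * training_hours

print(f"Hourly cost for {slots} slots: ${hourly_cost_usd:.2f}")
print(f"Cost per slot-hour: ${cost_per_slot_hour_usd:.2f}")
print(f"Approximate cost of {training_hours} hours: ${training_cost_usd:.2f}")
```

Under these assumptions the training run costs roughly $40. The commitment is charged for as long as it exists, so any time it stays in place before or after training adds to that amount.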

Create a Firestore in Datastore Mode database instance

Create a Firestore in Datastore Mode database instance to store song title and artist information for lookup.

  1. Open the Datastore console.
  2. Click Select Datastore Mode.
  3. For Select a location, choose the region you want to use for Datastore.
  4. Click Create Database.

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store the following objects:

  • The SavedModel files for the models created in the solution.
  • The temp files created by the Dataflow pipeline that processes the song embeddings.
  • The CSV files for the processed embeddings.

  1. Open the Cloud Storage console.
  2. Click Create Bucket.
  3. For Name your bucket, type a bucket name. The name must be globally unique.
  4. For Choose where to store your data, select Region and then choose the region you want to use for Cloud Storage.
  5. Click Create.

Create an AI Platform Notebooks instance

Create an AI Platform Notebooks instance to run the notebooks that walk you through using the solution.

  1. Open the AI Platform Notebooks console.
  2. Click New Instance.
  3. Choose TensorFlow Enterprise 2.3, Without GPUs.
  4. For Instance name, type embeddings-notebooks.
  5. For Region, choose the region you want to use for the AI Platform Notebooks instance.
  6. Click Create. It takes a few minutes for the notebook instance to be created.

Give the Cloud Build service account permissions to interact with Compute Engine

  1. Open the Cloud Build settings page.
  2. In the service account list, find the row for Compute Engine and change the Status column value to Enabled.

Update the Compute Engine service account permissions

Grant the Security Admin IAM role to the Compute Engine service account. This is required so that the account can later set up other service accounts needed by the solution.

  1. Open the IAM permissions page.
  2. In the members list, find the row for <projectNumber>-compute@developer.gserviceaccount.com and click Edit.
  3. Click Add another role.
  4. In Select a role, choose IAM and then choose Security Admin.
  5. Click Save.

Create an AI Platform pipeline

Create an AI Platform Pipelines instance to run the TensorFlow Extended (TFX) pipeline that automates the solution workflow. You can skip this step if you are running the solution using the step-by-step notebooks.

Create a Cloud SQL instance

Create a Cloud SQL instance to provide managed storage for the pipeline.

  1. Open the Cloud SQL console.
  2. Click Create Instance.
  3. On the MySQL card, click Choose MySQL.
  4. For Instance ID, type pipeline-db.
  5. For Root Password, type in the password you want to use for the root user.
  6. For Region, choose the region you want to use for the database instance.
  7. Click Create.

Create the pipeline

  1. Open the AI Platform Pipelines console.

  2. In the AI Platform Pipelines toolbar, click New instance. Kubeflow Pipelines opens in Google Cloud Marketplace.

  3. Click Configure. The Deploy Kubeflow Pipelines form opens.

  4. For Cluster zone, choose a zone in the region you want to use for AI Platform Pipelines.

  5. Check Allow access to the following Cloud APIs to grant applications that run on your GKE cluster access to Google Cloud resources. By checking this box, you are granting your cluster access to the https://www.googleapis.com/auth/cloud-platform access scope. This access scope provides full access to the Google Cloud resources that you have enabled in your project. Granting your cluster access to Google Cloud resources in this manner saves you the effort of creating and managing a service account or creating a Kubernetes secret.

  6. Click Create cluster. This step may take several minutes.

  7. Select Create a namespace in the Namespace drop-down list. Type kubeflow-pipelines in New namespace name.

    To learn more about namespaces, read a blog post about organizing Kubernetes with namespaces.

  8. In the App instance name box, type kubeflow-pipelines.

  9. Select Use managed storage and supply the following information:

    • Artifact storage Cloud Storage bucket: Specify the name of the bucket you created in the "Create a Cloud Storage bucket" procedure.
    • Cloud SQL instance connection name: Specify the connection name for the Cloud SQL instance you created in the "Create a Cloud SQL instance" procedure. The instance connection name can be found on the instance detail page in the Cloud SQL console.
    • Database username: Leave this field empty to default to root.
    • Database password: Specify the root user password for the Cloud SQL instance you created in the "Create a Cloud SQL instance" procedure.
    • Database name prefix: Type embeddings.
  10. Click Deploy. This step may take several minutes.

Set up the AI Platform Notebooks environment

You use notebooks to complete the prerequisites and then run the solution. To use the notebooks, you must clone the solution's GitHub repo to your AI Platform Notebooks JupyterLab instance.

  1. Open the AI Platform Notebooks console.

  2. Click Open JupyterLab for the embeddings-notebooks instance.

  3. In the Other section of the JupyterLab Launcher, click Terminal.

  4. In the terminal, run the following command to clone the analytics-componentized-patterns GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/analytics-componentized-patterns.git
    
  5. In the terminal, run the following command to install packages required by the solution:

    pip install -r analytics-componentized-patterns/retail/recommendation-system/bqml-scann/requirements.txt
    

Delete the GCP resources

Unless you plan to continue using the resources you created in this solution, delete them to avoid incurring charges to your GCP account. You can either delete the project containing the resources, or keep the project and delete only those resources. The following sections describe both options.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the solution.

  1. In the Cloud Console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, delete the billable components of the solution. These can include:

  1. A BigQuery assignment, reservation, and remaining flex slots (if you chose to use flex slots to train the matrix factorization model)
  2. A BigQuery dataset
  3. Several Cloud Storage buckets
  4. Datastore entities
  5. An AI Platform Notebooks instance
  6. AI Platform models
  7. A Kubernetes Engine cluster (if you used a pipeline for automation)
  8. An AI Platform pipeline (if you used a pipeline for automation)
  9. A Cloud SQL instance (if you used a pipeline for automation)
  10. A Container Registry image (if you used a pipeline for automation)

Experimental variant

The experimental variant of the solution uses the new AI Platform and AI Platform (Unified) Pipelines services. Both services are currently in the Experimental stage, and the provided examples may have to be updated when the services move to Preview and eventually to General Availability. Setting up the managed ANN service is described in the ann_setup.md file.

Note: To use the Experimental releases of AI Platform Pipelines and ANN services, you need to allow-list your project and user account. Please contact your Google representative for more information and support.

Experimental variant workflow

  1. Compute pointwise mutual information (PMI) between items based on their co-occurrences.
  2. Train item embeddings using BigQuery ML Matrix Factorization, with item PMI as implicit feedback.
  3. Post-process the embeddings and export them from the BigQuery ML Matrix Factorization model to JSONL-formatted files in Cloud Storage.
  4. Create an approximate nearest neighbor search index using the ANN service and the exported embedding files.
  5. Deploy the index to an ANN service endpoint.

Note that the first two steps are the same as the ScaNN library based solution.
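
As an illustration of the export in step 3, the sketch below writes embeddings as JSON Lines, with one JSON object per item. The record layout, with an `id` field and an `embedding` field, is an assumption made for this example. The input format expected by the Experimental ANN service is defined in its own documentation (see ann_setup.md) and may differ.

```python
import io
import json


def write_embeddings_jsonl(embeddings, fileobj):
    """Write one JSON object per line: {"id": ..., "embedding": [...]}.

    The field names are assumptions for this sketch, not a documented schema.
    """
    for item_id, vector in embeddings.items():
        record = {
            "id": str(item_id),
            "embedding": [float(x) for x in vector],
        }
        fileobj.write(json.dumps(record) + "\n")


# Hypothetical embeddings; an in-memory buffer stands in for a file
# that would later be copied to Cloud Storage.
buffer = io.StringIO()
write_embeddings_jsonl({"song_a": [0.1, 0.2], "song_b": [0.3, 0.4]}, buffer)
lines = buffer.getvalue().splitlines()
```

Each line is a complete JSON document, which lets large exports be split across files and processed in parallel without parsing a single large array.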

Diagram showing the workflow of the ANN service variant.

We provide an example TFX pipeline that automates the process of training the embeddings and deploying the index.
The pipeline is designed to run on AI Platform (Unified) Pipelines and relies on features introduced in v0.25 of TFX. Each step of the pipeline is implemented as a TFX Custom Python function component. All steps and their inputs and outputs are tracked in the AI Platform (Unified) ML Metadata service.

Diagram showing the TFX pipeline for the ANN service variant.

Run the experimental variant with notebooks

  1. ann01_create_index.ipynb
    • This notebook walks you through creating an ANN index, creating an ANN endpoint, and deploying the index to the endpoint. It also shows how to call the interfaces exposed by the deployed index.
  2. ann02_run_pipeline.ipynb
    • This notebook demonstrates how to create and test the TFX pipeline and how to submit pipeline runs to AI Platform (Unified) Pipelines.

Before experimenting with the notebooks, make sure that you have prepared the BigQuery environment and trained and extracted item embeddings using the procedures described in the ScaNN library based solution.

Questions? Feedback?

If you have any questions or feedback, please open up a new issue.

License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.

This is not an official Google product, but sample code provided for educational purposes.