
AI/ML Recipes for Vertex AI, BigQuery, and Spark on Dataproc

The AI/ML Recipes for Vertex AI, BigQuery, and Spark on Dataproc open-source project is an effort to jumpstart your development of data processing and machine learning notebooks using Vertex AI, BigQuery, and Dataproc's distributed processing capabilities.

We are releasing a set of machine-learning-focused notebooks for you to adapt, extend, and use to solve your use cases with your own data.
You can easily clone the repo and start executing the notebooks right away, using your Dataproc cluster or Dataproc Serverless runtime for the PySpark notebooks, and any environment for the BigQuery DataFrames (BigFrames) notebooks.

Open in Cloud Shell

Notebooks

Please refer to each notebook folder's documentation for more information:

BigFrames (BigQuery DataFrames)

PySpark

Public Datasets

The notebooks read datasets from our public GCS bucket, which contains several publicly available datasets.

In this doc you can see the list of available datasets, which are located in gs://dataproc-metastore-public-binaries.
That documentation has details about each dataset and links to their original pages, including their licenses.
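For example, a PySpark notebook can read one of these datasets directly from the bucket. Here is a minimal sketch, assuming a Dataproc cluster or Dataproc Serverless kernel (which has the GCS connector preinstalled); the dataset name and file format below are placeholders, so check the datasets doc above for actual paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("public-datasets-example").getOrCreate()

    # <dataset_name> is a placeholder; see the public datasets doc
    # for real paths and file formats.
    df = spark.read.parquet("gs://dataproc-metastore-public-binaries/<dataset_name>/")
    df.printSchema()
    df.show(5, truncate=False)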

Usage in Vertex AI Workbench notebooks

These notebooks are available from within the Vertex AI Workbench environment.
Navigate to the JupyterLab home screen and click on Notebooks to see the list of notebooks, along with a button to download/copy them into your environment.

(Screenshots: Vertex Notebooks Templates and Vertex Notebooks Templates List)

Usage in your local environment

  1. Install the gcloud CLI
  2. Run gcloud init to set up your default GCP configuration
  3. Clone this repository by running
    git clone https://github.com/GoogleCloudPlatform/dataproc-ml-quickstart-notebooks.git
  4. Install requirements by running pip install -r requirements.txt
  5. For the PySpark notebooks, use one of the following approaches with the Dataproc Jupyter Plugin:
    • 5.1) [Recommended] Create Dataproc Serverless Notebooks: first create a Runtime Template with your desired Dataproc config, then use it as a Jupyter kernel when executing the notebooks
      • Do not forget to ensure the correct network configuration (for example, you need a Cloud NAT to be able to install packages from the public PyPI)
    • 5.2) Create a Dataproc cluster with your desired Dataproc config, and use it as a Jupyter kernel when executing the notebooks
  6. For the BigFrames notebooks, you do not need PySpark; any kernel/environment will do, and the processing runs in BigQuery in your GCP project (see the sketch after this list)
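To illustrate step 6, here is a minimal BigFrames sketch. It assumes any Python environment with the requirements installed and a GCP project with the BigQuery API enabled; the project ID is a placeholder, and the public Shakespeare sample table is just an example:

    import bigframes.pandas as bpd

    # Placeholder: replace with your own GCP project ID.
    bpd.options.bigquery.project = "your-gcp-project"

    # Read a public BigQuery table into a BigQuery DataFrame;
    # the computation runs in BigQuery, not on the local kernel.
    df = bpd.read_gbq("bigquery-public-data.samples.shakespeare")
    print(df.head())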

BigQuery Jupyter Plugin

We recommend leveraging the BigQuery Jupyter Plugin, which becomes available in your local environment simply by installing the dependencies with pip install -r requirements.txt. It enables you to:

  • Connect your JupyterLab notebooks from anywhere to Dataproc
  • Develop in Python, SQL, Java/Scala, and R
  • Manage Dataproc clusters and jobs
  • Run notebooks in your favorite IDE that supports Jupyter, using Dataproc as the kernel
  • Deploy a notebook as a recurring job
  • View Cloud and Spark logs inside JupyterLab
  • View your BigQuery dataset schemas inside JupyterLab
  • Manage your files on Google Cloud Storage (GCS)

Contributing

See the contributing instructions to get started contributing.

Acknowledgments: Nilo Resende, Dana Soltani, Oscar Pulido, James Fu, Neha Sharma, Tanya Warrier, Anish Sarangi, Diogo Kato, André Sousa, Shashank Agarwal, Samuel Schmidt, Eduardo Hruschka, Hitesh Hasija

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

Contact

Questions, issues, and comments can be raised via GitHub issues.
