diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index 48074c34a..8a45b86ff 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -64,14 +64,20 @@
"id": "3c5e92a74e64"
},
"source": [
- "The Gemini API can run inference on images and videos passed to it. When passed an image, a series of images, or a video, Gemini can:\n",
+ "The Gemini API can process images and videos, enabling a wide range of\n",
+ "developer use cases. Some of Gemini's vision capabilities include the\n",
+ "ability to:\n",
"\n",
- "* Describe or answer questions about the content\n",
- "* Summarize the content\n",
- "* Extrapolate from the content\n",
+ "* Caption and answer questions about images\n",
+ "* Transcribe and reason over PDFs, including long documents, up to the 2-million-token context window\n",
+ "* Describe, segment, and extract information from videos,\n",
+ "including both visual frames and audio, up to 90 minutes long\n",
+ "* Detect objects in an image and return bounding box coordinates for them\n",
"\n",
"This tutorial demonstrates some possible ways to prompt the Gemini API with\n",
- "images and video input. All output is text-only."
+ "images and video input, provides code examples,\n",
+ "and outlines prompting best practices for multimodal vision capabilities.\n",
+ "All output is text-only."
]
},
{
@@ -199,191 +205,304 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "rsdNkDszLBmQ"
+ "id": "2fa34d5c0db8"
},
"source": [
- "### Upload an image file using the File API\n",
- "\n",
- "Use the File API to upload an image of any size. (Always use the File API when the combination of files and system instructions that you intend to send is larger than 20 MB.)\n",
+ "## Image input\n",
"\n",
- "**NOTE**: The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available.\n",
+ "For a total image payload size of less than 20 MB, it's recommended to either\n",
+ "pass Base64 encoded images inline or directly upload locally stored image files."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8336e412da3e"
+ },
+ "source": [
+ "### Base64 encoded images\n",
"\n",
- "Start by downloading this [sketch of a jetpack](https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg)."
+ "You can pass images from public URLs by fetching them and encoding the\n",
+ "image bytes as Base64 payloads. You can use the httpx library to fetch\n",
+ "the images from their URLs. The following code example shows how to do this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "lC6sS6DnmGmi"
+ "id": "aa9a0e452544"
},
"outputs": [],
"source": [
- "!curl -o jetpack.jpg https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg"
+ "import httpx\n",
+ "import base64\n",
+ "\n",
+ "# Retrieve an image\n",
+ "image_path = \"https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg/2560px-Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg\"\n",
+ "image = httpx.get(image_path)\n",
+ "\n",
+ "# Choose a Gemini model\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro\")\n",
+ "\n",
+ "# Create a prompt\n",
+ "prompt = \"Caption this image.\"\n",
+ "response = model.generate_content(\n",
+ " [\n",
+ " {\n",
+ " \"mime_type\": \"image/jpeg\",\n",
+ " \"data\": base64.b64encode(image.content).decode(\"utf-8\"),\n",
+ " },\n",
+ " prompt,\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "Markdown(\">\" + response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "qfa2VSqEsulq"
+ "id": "f47333dabe62"
},
"source": [
- "Upload the image using [`media.upload`](https://ai.google.dev/api/rest/v1beta/media/upload) and print the URI, which is used as a reference in Gemini API calls."
+ "### Multiple images\n",
+ "\n",
+ "To prompt with multiple images in Base64 encoded format, you can do the\n",
+ "following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "N9NxXGZKKusG"
+ "id": "2816ea3d2d91"
},
"outputs": [],
"source": [
- "# Upload the file and print a confirmation.\n",
- "sample_file = genai.upload_file(path=\"jetpack.jpg\",\n",
- " display_name=\"Jetpack drawing\")\n",
+ "import httpx\n",
+ "import base64\n",
"\n",
- "print(f\"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}\")"
+ "# Retrieve two images\n",
+ "image_path_1 = \"https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg/2560px-Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg\"\n",
+ "image_path_2 = \"https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg\"\n",
+ "\n",
+ "image_1 = httpx.get(image_path_1)\n",
+ "image_2 = httpx.get(image_path_2)\n",
+ "\n",
+ "# Create a prompt\n",
+ "prompt = \"Generate a list of all the objects contained in both images.\"\n",
+ "\n",
+ "response = model.generate_content([\n",
+ "    {'mime_type': 'image/jpeg', 'data': base64.b64encode(image_1.content).decode('utf-8')},\n",
+ "    {'mime_type': 'image/jpeg', 'data': base64.b64encode(image_2.content).decode('utf-8')},\n",
+ "    prompt])\n",
+ "\n",
+ "Markdown(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "cto22vhKOvGQ"
+ "id": "Lm862F3zthiI"
},
"source": [
- "The `response` shows that the File API stored the specified `display_name` for the uploaded file and a `uri` to reference the file in Gemini API calls. Use `response` to track how uploaded files are mapped to URIs.\n",
+ "### Upload one or more locally stored image files\n",
"\n",
- "Depending on your use case, you can also store the URIs in structures such as a `dict` or a database."
+ "Alternatively, you can upload one or more locally stored image files.\n",
+ "\n",
+ "You can download and use our drawings of [piranha-infested waters](https://storage.googleapis.com/generativeai-downloads/images/piranha.jpg) and a [firefighter with a cat](https://storage.googleapis.com/generativeai-downloads/images/firefighter.jpg). First, save these files to your local directory.\n",
+ "\n",
+ "Then click **Files** on the left sidebar. For each file, click the **Upload** button, then navigate to that file's location and upload it:\n",
+ "\n",
+ "\n",
+ "\n",
+ "When the combination of files and system instructions that you intend to send is larger than 20 MB, use the File API to upload those files. Smaller files can instead be passed inline to the Gemini API:\n"
]
},
{
- "cell_type": "markdown",
+ "cell_type": "code",
+ "execution_count": null,
"metadata": {
- "id": "ds5iJlaembWe"
+ "id": "XzMhQ8MWub5_"
},
+ "outputs": [],
"source": [
- "### Verify image file upload and get metadata\n",
+ "import PIL.Image\n",
"\n",
- "You can verify the API successfully stored the uploaded file and get its metadata by calling [`files.get`](https://ai.google.dev/api/rest/v1beta/files/get) through the SDK. Only the `name` (and by extension, the `uri`) are unique. Use `display_name` to identify files only if you manage uniqueness yourself."
+ "sample_file_2 = PIL.Image.open('piranha.jpg')\n",
+ "sample_file_3 = PIL.Image.open('firefighter.jpg')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "kLFsVLFHOWSV"
+ "id": "da11223550a9"
},
"outputs": [],
"source": [
- "file = genai.get_file(name=sample_file.name)\n",
- "print(f\"Retrieved file '{file.display_name}' as: {sample_file.uri}\")"
+ "import google.generativeai as genai\n",
+ "\n",
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "# Create a prompt.\n",
+ "prompt = \"Write an advertising jingle based on the items in both images.\"\n",
+ "\n",
+ "response = model.generate_content([sample_file_2, sample_file_3, prompt])\n",
+ "\n",
+ "Markdown(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "BqzIGKBmnFoJ"
+ "id": "736c83de95a1"
},
"source": [
- "Depending on your use case, you can store the URIs in structures, such as a `dict` or a database."
+ "Note that these inline data calls don't include many of the features available\n",
+ "through the File API, such as getting file metadata,\n",
+ "[listing](https://ai.google.dev/gemini-api/docs/vision?lang=python#list-files),\n",
+ "or [deleting files](https://ai.google.dev/gemini-api/docs/vision?lang=python#delete-files)."
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "EPPOECHzsIGJ"
+ "id": "0d6f7af7c2ff"
+ },
+ "source": [
+ "### Large image payloads"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rsdNkDszLBmQ"
},
"source": [
- "### Prompt with the uploaded image and text\n",
+ "#### Upload an image file using the File API\n",
"\n",
- "After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded image."
+ "When the combination of files and system instructions that you intend to send is larger than 20 MB, use the File API to upload those files.\n",
+ "\n",
+ "**NOTE**: The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qfa2VSqEsulq"
+ },
+ "source": [
+ "Upload the image using [`media.upload`](https://ai.google.dev/api/rest/v1beta/media/upload) and print the URI, which is used as a reference in Gemini API calls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "ZYVFqmLkl5nE"
+ "id": "2e9f9469b337"
},
"outputs": [],
"source": [
- "# Choose a Gemini API model.\n",
- "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
- "\n",
- "# Prompt the model with text and the previously uploaded image.\n",
- "response = model.generate_content([sample_file, \"Describe how this product might be manufactured.\"])\n",
+ "!curl -o jetpack.jpg https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "N9NxXGZKKusG"
+ },
+ "outputs": [],
+ "source": [
+ "# Upload the file and print a confirmation.\n",
+ "sample_file = genai.upload_file(path=\"jetpack.jpg\",\n",
+ " display_name=\"Jetpack drawing\")\n",
"\n",
- "Markdown(\">\" + response.text)"
+ "print(f\"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "Lm862F3zthiI"
+ "id": "cto22vhKOvGQ"
},
"source": [
- "### Upload one or more locally stored image files\n",
- "\n",
- "Alternatively, you can upload your own files. You can download and use our drawings of [piranha-infested waters](https://storage.googleapis.com/generativeai-downloads/images/piranha.jpg) and a [firefighter with a cat](https://storage.googleapis.com/generativeai-downloads/images/firefighter.jpg). First, save these files to your local directory.\n",
- "\n",
- "Then click **Files** on the left sidebar. For each file, click the **Upload** button, then navigate to that file's location and upload it:\n",
+ "The `response` shows that the File API stored the specified `display_name` for the uploaded file and a `uri` to reference the file in Gemini API calls. Use `response` to track how uploaded files are mapped to URIs.\n",
"\n",
- "\n",
+ "Depending on your use case, you can also store the URIs in structures such as a `dict` or a database."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ds5iJlaembWe"
+ },
+ "source": [
+ "#### Verify image file upload and get metadata\n",
"\n",
- "When the combination of files and system instructions that you intend to send is larger than 20 MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API:\n"
+ "You can verify the API successfully stored the uploaded file and get its metadata by calling [`files.get`](https://ai.google.dev/api/rest/v1beta/files/get) through the SDK. Only the `name` (and by extension, the `uri`) are unique. Use `display_name` to identify files only if you manage uniqueness yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "XzMhQ8MWub5_"
+ "id": "kLFsVLFHOWSV"
},
"outputs": [],
"source": [
- "import PIL.Image\n",
- "\n",
- "sample_file_2 = PIL.Image.open('piranha.jpg')\n",
- "sample_file_3 = PIL.Image.open('firefighter.jpg')"
+ "file = genai.get_file(name=sample_file.name)\n",
+ "print(f\"Retrieved file '{file.display_name}' as: {sample_file.uri}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "F2N5bLR7wlqL"
+ "id": "BqzIGKBmnFoJ"
},
"source": [
- "Note that these inline data calls don't include many of the features available via the File API, such as getting file metadata, [listing](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=VosrkvAyrx-v&line=3&uniqifier=1), or [deleting](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=diCy9BgjLqeS&line=1&uniqifier=1) files."
+ "Depending on your use case, you can store the URIs in structures, such as a `dict` or a database."
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "X3pl7mWgwt6Q"
+ "id": "EPPOECHzsIGJ"
},
"source": [
- "### Prompt with multiple images\n",
+ "#### Prompt with the uploaded image and text\n",
"\n",
- "You can provide the Gemini API with any combination of images and text that fit within the model's context window. This example provides one short text prompt and the three images previously uploaded."
+ "After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "Ou5IVsybcOys"
+ "id": "ZYVFqmLkl5nE"
},
"outputs": [],
"source": [
"# Choose a Gemini model.\n",
"model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
"\n",
- "prompt = \"Write an advertising jingle showing how the product in the first image could solve the problems shown in the second two images.\"\n",
+ "# Prompt the model with text and the previously uploaded image.\n",
+ "response = model.generate_content([sample_file, \"Describe how this product might be manufactured.\"])\n",
"\n",
- "response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])\n",
+ "Markdown(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c63c1a7a8e32"
+ },
+ "source": [
+ "## Capabilities\n",
"\n",
- "Markdown(\">\" + response.text)"
+ "This section outlines specific vision capabilities of the Gemini model, including object detection and bounding box coordinates."
]
},
{
@@ -394,9 +513,7 @@
"source": [
"### Get bounding boxes\n",
"\n",
- "You can ask the model for the coordinates of bounding boxes for objects in images. For object detection, the Gemini model has been trained to provide\n",
- "these coordinates as relative widths or heights in range `[0,1]`, scaled by 1000 and converted to an integer. Effectively, the coordinates given are for a\n",
- "1000x1000 version of the original image, and need to be converted back to the dimensions of the original image."
+ "Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you'll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes."
]
},
{
@@ -410,10 +527,11 @@
"# Choose a Gemini model.\n",
"model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
"\n",
- "prompt = \"Return a bounding box for the piranha. \\n [ymin, xmin, ymax, xmax]\"\n",
+ "# Create a prompt to detect bounding boxes.\n",
+ "prompt = \"Return a bounding box for each of the objects in this image in [ymin, xmin, ymax, xmax] format.\"\n",
"response = model.generate_content([sample_file_2, prompt])\n",
"\n",
- "print(response.text)"
+ "Markdown(response.text)"
]
},
{
@@ -422,11 +540,16 @@
"id": "b8e422c55df2"
},
"source": [
- "To convert these coordinates to the dimensions of the original image:\n",
+ "The model returns bounding box coordinates in the format\n",
+ "`[ymin, xmin, ymax, xmax]`. To convert these normalized coordinates\n",
+ "to the pixel coordinates of your original image, follow these steps:\n",
"\n",
"1. Divide each output coordinate by 1000.\n",
"1. Multiply the x-coordinates by the original image width.\n",
- "1. Multiply the y-coordinates by the original image height."
+ "1. Multiply the y-coordinates by the original image height.\n",
+ "\n",
+ "To explore more detailed examples of generating bounding box coordinates and\n",
+ "visualizing them on images, review our [Object Detection cookbook example](https://github.com/google-gemini/cookbook/blob/main/examples/Object_detection.ipynb)."
]
},
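The three conversion steps above can be sketched as a small helper. This is a minimal sketch; the function name and the sample box and image dimensions are illustrative:

```python
# Hypothetical helper: convert a [ymin, xmin, ymax, xmax] box returned on the
# 1000x1000 scale back to pixel coordinates of the original image.
def to_pixel_box(box_1000, image_width, image_height):
    ymin, xmin, ymax, xmax = box_1000
    return (
        int(ymin / 1000 * image_height),  # divide by 1000, scale by height
        int(xmin / 1000 * image_width),   # divide by 1000, scale by width
        int(ymax / 1000 * image_height),
        int(xmax / 1000 * image_width),
    )

# For a 1280x720 image:
print(to_pixel_box([200, 100, 800, 900], 1280, 720))  # (144, 128, 576, 1152)
```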
{
@@ -446,7 +569,7 @@
"id": "nDN32NDPxXGX"
},
"source": [
- "## Technical details (video)\n",
+ "### Technical details (video)\n",
"\n",
"Gemini 1.5 Pro and Flash support up to approximately an hour of video data.\n",
"\n",
@@ -623,7 +746,7 @@
"print(\"Making LLM inference request...\")\n",
"response = model.generate_content([prompt, video_file],\n",
" request_options={\"timeout\": 600})\n",
- "print(response.text)"
+ "Markdown(response.text)"
]
},
{
@@ -634,7 +757,11 @@
"source": [
"### Transcribe video and provide visual descriptions\n",
"\n",
- "If the video is not fast-paced (given that frames are sampled at 1 per second), it's possible to transcribe the video with visual descriptions for each shot."
+ "The Gemini models can transcribe and provide visual descriptions of video content\n",
+ "by processing both the audio track and visual frames.\n",
+ "For visual descriptions, the model samples the video at a rate of **1 frame\n",
+ "per second**. This sampling rate may affect the level of detail in the\n",
+ "descriptions, particularly for videos with rapidly changing visuals."
]
},
{
@@ -646,16 +773,16 @@
"outputs": [],
"source": [
"# Create the prompt.\n",
- "prompt = \"Transcribe the audio, giving timestamps. Also provide visual descriptions.\"\n",
+ "prompt = \"Transcribe the audio from this video, giving timestamps for salient events in the video. Also provide visual descriptions.\"\n",
"\n",
"# Choose a Gemini model.\n",
"model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
"\n",
"# Make the LLM request.\n",
"print(\"Making LLM inference request...\")\n",
- "response = model.generate_content([prompt, video_file],\n",
+ "response = model.generate_content([video_file, prompt],\n",
" request_options={\"timeout\": 600})\n",
- "print(response.text)"
+ "Markdown(response.text)"
]
},
{