Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

PawelPeczek-Roboflow · 2024-09-30T12:59:38Z

Llama 3.2 Vision in Workflows

Are you ready to make a difference this Hacktoberfest? We’re excited to invite you to contribute by integrating LLama 3.2 Vision into our Workflows ecosystem! This new block for image generation will be a fantastic addition, broadening the horizons of what our platform can achieve.

Join us in enhancing our capabilities and empowering users to harness the power of vision technology. Whether you're a seasoned developer or just starting your journey in open source, your contributions will play a vital role in shaping the future of our ecosystem. Let’s collaborate and bring this innovative functionality to life!

Task description

The task is to integrate the new Llama 3.2 Vision into workflows
We haven't discover the model capabilities yet - that is also the part of the task 🥳
We prefer light integration to REST API through requests library - we've found that OpenRouter provides REST API access (see this) - but if you find a better option - feel free to discuss
We imagine the model can be implemented in similar way as other VLMs:
- OpenAI GPT
- Gemini
- Claude
- Florence 2
please raise any issues with the task in the discussion below

Cheatsheet

The text was updated successfully, but these errors were encountered:

AHB102 · 2024-11-14T06:13:57Z

@PawelPeczek-Roboflow Can I have a go at it ? And can you tell me what is expected, are we talking about complete integration end to end or breaking down this issue into sub issues which can be tackled.

PawelPeczek-Roboflow · 2024-11-14T10:48:45Z

Hi @AHB102, thanks for engaging into the issue.

Sure, you can pick up the task - so the point is we would like to:
a) find a suitable hosted version of Llama vision model such that we can integrate via making HTTP requests
b) once this is agreed - we need to create Workflow blocks similar to https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py wrapping up the model prompting for various tasks - that would require a little bit of exploration of model capabilities

PawelPeczek-Roboflow · 2024-11-14T10:51:29Z

First step would definitely be agreeing on API that host llama
Options I see:

but was not investigating all of the options, which would be good to do.

I would try to find cheap and reliable third party

AHB102 · 2024-11-14T13:38:03Z

@PawelPeczek-Roboflow I looked into hosted Llama 3.2 Vision APIs and found a few options: Together.ai (https://api.together.xyz/models) , Google Vertex AI (https://cloud.google.com/blog/products/ai-machine-learning/llama-3-2-metas-new-generation-models-vertex-ai) , Azure (https://techcommunity.microsoft.com/blog/machinelearningblog/meta%E2%80%99s-new-llama-3-2-slms-and-image-reasoning-models-now-available-on-azure-ai-m/4255167) and AWS Bedrock(https://aws.amazon.com/blogs/machine-learning/vision-use-cases-with-llama-3-2-11b-and-90b-models-from-meta/).Except Together.ai all of the other options have massive scale , it would be reliable and cheap. Hugging face also has a offering for inferencing. I checked out OpenRouter's API limits. The Llama 3.2 11B model is currently free, and the usage rates are pretty good. I think 20 requests per minute should be plenty for most things Any thoughts ?

PawelPeczek-Roboflow · 2024-11-14T13:48:32Z

I do not have particular bias towards any of the vendor - I even see that the decision which is most handy for people to use is strictly related to individual preferences of the consumer.
I believe AWS / Google / MS would be the "stable" choice, whereas I expect other third parties to be more attractive cost-wise.
One thing to keep in mind is also how easy it is to integrate - I remember that at least part of google API clients are bulky and require setting API key at process-level, not for individual invocation (which is 🔴 flag for multi-tennant deployments which we do with workflows).

I see the construction of the block in the following way:

we support parameters required to deal with model
and on the "orthogonal" axis - we do support 2 parameters - api_key and provider - which will choose backend to use. This approach do also have cons, but at least we do not need multiple blocks to handle the same model from different providers
we could start easy with one provider, ensuring extensibility for the future
wdyt?

AHB102 · 2024-11-14T14:35:30Z

That sounds great! This approach provides a solid foundation for future scalability and flexibility. By not committing to a single vendor upfront, we can adapt to evolving needs and avoid potential vendor lock-in.

To start, I suggest we explore OpenRouter. It offers free API usage for Llama 2.3 11B, making it ideal for initial testing and development. Additionally, its compatibility with familiar libraries like requests and openai can streamline the integration process and minimize security risks.

Once we have a robust core structure in place, we can easily pivot to other providers. wdyt ?

PawelPeczek-Roboflow · 2024-11-15T09:22:29Z

yeah, that sounds right

AHB102 · 2024-11-15T13:09:01Z

I've been diving into the Workflow Block (https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py) and feel comfortable with the workings of the OpenAIBlockV2 class. I'm about to start writing code. Any advice for getting the most out of it? When modifying the inference core, I understand that I need to include test cases, right?

PawelPeczek-Roboflow · 2024-11-15T13:59:00Z

yeah, tests are recommended.

Here is our block creation guide: https://inference.roboflow.com/workflows/create_workflow_block/

PawelPeczek-Roboflow · 2024-11-15T14:00:09Z

you can find information how to run development smoothly

to test remote apis - we usually create unit-tests agains mocks - and place some integration test skipped if API key not provided

PawelPeczek-Roboflow · 2024-11-15T14:01:34Z

example integration test: https://github.com/roboflow/inference/blob/main/tests/workflows/integration_tests/execution/test_workflow_with_open_ai_models.py

and unit test for the same block: https://github.com/roboflow/inference/blob/main/tests/workflows/unit_tests/core_steps/models/foundation/test_openai.py

AHB102 · 2024-11-15T14:28:21Z

@PawelPeczek-Roboflow Let's get v1.py for Llama working, I'll be focusing on getting it functional before tackling test cases. I'll definitely ask for your input and help along the way, and I'll keep you updated on how it's going. Thanks for the docs 😁

PawelPeczek-Roboflow · 2024-11-21T17:34:04Z

hi there :) anything I can help you with?

AHB102 · 2024-11-22T06:10:12Z

@PawelPeczek-Roboflow Hi, sorry for the late reply.
So, I’ve been diving into the workflow block documentation. It was a lot to take in, but I’ve managed to work through it. I’ve also been experimenting with the OpenRouter Llama 3.2 vision model locally.
I’ve started building the first version of the block. I defined a BlockManifest class for Llama 3.2 vision, referencing the workflow block docs to understand the mechanics and looking at the OpenAI and Claude VLM implementations to see how it’s done in practice.
I had a couple of questions based on my observations:

Claude seems to have a specific resolution it scales input images to, but I couldn’t find anything similar in the Llama documentation.
OpenAI has image resolution settings (low, auto, high), but again, Llama doesn’t seem to support this.
Both implementations have a response limit of around 450 tokens. However, Llama’s token limits vary significantly based on the task:
Object detection: 10-100 objects with 10-50 attributes each (100-5000 tokens)
Image classification: 1-10 class labels with 10-50 attributes each (10-500 tokens)

AHB102 · 2024-11-22T06:14:29Z

Image detail (resolution) ref : https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py#L162

Max image size ref :
https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/anthropic_claude/v1.py#L178

Max token Number ref :
https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py#L169

PawelPeczek-Roboflow · 2024-11-22T11:40:43Z

Do not worry to much if the API for all VLM blocks cannot be identical, we strive for similar experience regarding blocks integration, not 100% the same config parameters.

I do not see the list of all params that open-router APIs support, they name it recommended, and use openai client
in Python - so I expect that majority of params work as in openai client, it's just not reported - worth verifyng the ones that can limit the clients spendings on API calls (if paid version in use) - mainly the token limits

AHB102 · 2024-11-22T16:41:37Z

For now I'll dive into the details of max_tokens and top_p to keep our responses concise and cost-effective. We can also explore other tricks like choosing the right model and batching requests. I will update you once I have something, and concurrently keep working on the manifest block. 😁

PawelPeczek-Roboflow · 2024-11-22T16:42:50Z

👍

AHB102 · 2024-11-28T17:51:43Z

@PawelPeczek-Roboflow I experimented with the tiktoken library to determine the maximum token count and top-p values used by OpenRouter Llama 3.2 Vision. For tasks like object detection, image captioning, and OCR, I found that the maximum number of tokens rarely exceeds 200. I also tested different top-p values, ranging from 0.7 to the default of 1.0, and observed a decrease in the number of tokens required as the top-p value increased.
Llama 3.2 Vision is able to perform all the tasks that OpenAI models can, I don’t think there will be change in terms of prompt functions like prepare_caption_prompt() , right ?

PawelPeczek-Roboflow · 2024-11-29T10:16:04Z

I guess so - in this case (contrary to popular approach) I suggest just copy-pasting the functions into your block module. We do not follow DRY rule for blocks, in practice it's easier to manage changes for each blocks separately

AHB102 · 2024-12-06T05:43:24Z

@PawelPeczek-Roboflow Hi 🖐️, I’ve nearly completed the workflow block, which now consists of approximately 600 lines of code. To ensure thorough understanding, I’ve been analyzing the function of each individual component, which has slowed progress. I’ve constructed the block by referencing Anthropic and OpenAI workflow implementations. However, I haven’t tested anything it yet. What should be the next steps toward testing and integration?

PS: I haven’t worked on a codebase of this size before, so I’ve learned a lot in the process. 😅

PawelPeczek-Roboflow · 2024-12-06T12:49:11Z

cool - submit the pr even if not 100% ready, we will figure out the way forward
I do really appreciate the effort

AHB102 · 2024-12-07T18:01:14Z

@PawelPeczek-Roboflow Nice !, I will make a PR 😁

PawelPeczek-Roboflow · 2025-01-07T14:27:29Z

contribution accepted, thanks a lot
we will highlight in next release notes

PawelPeczek-Roboflow added the Hacktoberfest 2024 label Sep 30, 2024

PawelPeczek-Roboflow closed this as completed Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

PawelPeczek-Roboflow commented Sep 30, 2024

AHB102 commented Nov 14, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 14, 2024

AHB102 commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 14, 2024 •

edited

Loading

AHB102 commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

AHB102 commented Nov 15, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

AHB102 commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 21, 2024

AHB102 commented Nov 22, 2024 •

edited

Loading

AHB102 commented Nov 22, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 22, 2024

AHB102 commented Nov 22, 2024

PawelPeczek-Roboflow commented Nov 22, 2024

AHB102 commented Nov 28, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 29, 2024 •

edited

Loading

AHB102 commented Dec 6, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Dec 6, 2024

AHB102 commented Dec 7, 2024

PawelPeczek-Roboflow commented Jan 7, 2025

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

Comments

PawelPeczek-Roboflow commented Sep 30, 2024

Llama 3.2 Vision in Workflows

Task description

Cheatsheet

AHB102 commented Nov 14, 2024 • edited Loading

PawelPeczek-Roboflow commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 14, 2024

AHB102 commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 14, 2024 • edited Loading

AHB102 commented Nov 14, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

AHB102 commented Nov 15, 2024 • edited Loading

PawelPeczek-Roboflow commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 15, 2024

AHB102 commented Nov 15, 2024

PawelPeczek-Roboflow commented Nov 21, 2024

AHB102 commented Nov 22, 2024 • edited Loading

AHB102 commented Nov 22, 2024 • edited Loading

PawelPeczek-Roboflow commented Nov 22, 2024

AHB102 commented Nov 22, 2024

PawelPeczek-Roboflow commented Nov 22, 2024

AHB102 commented Nov 28, 2024 • edited Loading

PawelPeczek-Roboflow commented Nov 29, 2024 • edited Loading

AHB102 commented Dec 6, 2024 • edited Loading

PawelPeczek-Roboflow commented Dec 6, 2024

AHB102 commented Dec 7, 2024

PawelPeczek-Roboflow commented Jan 7, 2025

AHB102 commented Nov 14, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 14, 2024 •

edited

Loading

AHB102 commented Nov 15, 2024 •

edited

Loading

AHB102 commented Nov 22, 2024 •

edited

Loading

AHB102 commented Nov 22, 2024 •

edited

Loading

AHB102 commented Nov 28, 2024 •

edited

Loading

PawelPeczek-Roboflow commented Nov 29, 2024 •

edited

Loading

AHB102 commented Dec 6, 2024 •

edited

Loading