
Serve Florence-2


Overview

Application key points:

  • Deploy on GPU
  • Select one of 4 pretrained models
  • Automatic deployment of the Florence-2-large model
  • The initial version uses only the Caption to Phrase Grounding task prompt to detect objects

Florence-2 is a foundation model designed for multimodal vision tasks, enabling unified handling of image analysis and text interaction. It employs a seq2seq transformer architecture to handle diverse tasks such as object detection, segmentation, image captioning and visual grounding. The model introduces a unified approach to vision-language tasks, where textual prompts guide the model to produce task-specific output.

Florence-2 processes visual data using a vision encoder that converts images into token embeddings. These embeddings are combined with textual prompts and passed through a multimodal encoder-decoder to generate outputs.
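As an illustration of this prompt-driven flow, below is a minimal sketch of a Caption to Phrase Grounding call with the Florence-2-large checkpoint via the Hugging Face transformers API. This is not the serving app's own code; the image URL and grounding phrase are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the Florence-2-large checkpoint (the model repo ships its own custom code)
model_id = "microsoft/Florence-2-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Task token plus a free-form caption: the model grounds each phrase with boxes
task = "<CAPTION_TO_PHRASE_GROUNDING>"
caption = "A green car parked in front of a yellow building."  # illustrative phrase
image = Image.open(requests.get("https://example.com/car.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=task + caption, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the generated token sequence into labels and pixel-space bounding boxes
parsed = processor.post_process_generation(raw_text, task=task, image_size=(image.width, image.height))
print(parsed[task])  # {'bboxes': [...], 'labels': [...]}
```

The parsed result contains a bounding box and label for each grounded phrase, which is how this task prompt turns a caption into detected objects.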

Figure: Florence-2 operating principle

The model is trained on FLD-5B, a large-scale dataset of 5.4 billion visual annotations on 126 million images, with detailed annotations for global, regional, and pixel-level tasks.

Florence-2 serves as a versatile tool capable of performing tasks such as image captioning, object detection, and segmentation through a single, unified architecture.

Figure: tasks supported by Florence-2

How To Run

Step 1. Select a pretrained model architecture and press the Serve button.


Step 2. Wait for the model to deploy.

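Once the model is deployed, it can also be queried from code. The sketch below assumes the Supervisely Python SDK's inference Session API; the task ID and image ID are placeholders for your own serving session and image.

```python
import supervisely as sly

# Authenticate with the Supervisely instance (reads SERVER_ADDRESS / API_TOKEN from the environment)
api = sly.Api.from_env()

# task_id is the ID of the running "Serve Florence-2" app session (placeholder value)
session = sly.nn.inference.Session(api, task_id=12345)
print(session.get_session_info())

# Run inference on an image already stored in Supervisely (placeholder image ID)
prediction = session.inference_image_id(987654)
print(prediction)  # annotation with the predicted objects
```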