Application key points:
- Deploy on GPU
- Select one of 4 pretrained models
- Automatic deployment of the Florence-2-large model
- The initial version only uses the Caption to Phrase Grounding task prompt to detect objects
Florence-2 is a foundation model designed for multimodal vision tasks, enabling unified handling of image analysis and text interaction. It employs a seq2seq transformer architecture to handle diverse tasks such as object detection, segmentation, image captioning and visual grounding. The model introduces a unified approach to vision-language tasks, where textual prompts guide the model to produce task-specific output.
Florence-2 processes visual data using a vision encoder that converts images into token embeddings. These embeddings are combined with textual prompts and passed through a multimodal encoder-decoder to generate outputs.
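As a rough illustration of that flow, the sketch below uses the public Hugging Face transformers implementation of Florence-2 (model ID `microsoft/Florence-2-large`). The image path, task prompt, and generation settings are placeholders for illustration, not part of this app:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Florence-2 ships its modeling code with the checkpoint, hence trust_remote_code=True.
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder image path

# The task token selects the behavior; plain image captioning here.
prompt = "<CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

# pixel_values are turned into visual token embeddings by the vision encoder,
# then fused with the text tokens inside the multimodal encoder-decoder.
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    generated_text, task=prompt, image_size=(image.width, image.height)
)
print(result)  # e.g. {'<CAPTION>': 'A photo of ...'}
```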
The model is trained on FLD-5B, a large dataset of 5.4 billion visual annotations across 126 million images, covering global, regional and pixel-level tasks.
Overall, Florence-2 serves as a versatile tool: a single, unified architecture covers image captioning, object detection, grounding and segmentation, with the task prompt alone selecting the behavior.
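Since the app's initial version relies solely on the Caption to Phrase Grounding task, a minimal sketch of that prompt may help: the `<CAPTION_TO_PHRASE_GROUNDING>` token is followed by free-form text, and the parsed output contains a bounding box and label per grounded phrase. The helper below is hypothetical and reuses the `model` and `processor` objects loaded in the earlier sketch:

```python
import torch

def ground_phrases(model, processor, image, caption):
    # Hypothetical helper: localize each phrase of `caption` in `image`
    # using Florence-2's Caption to Phrase Grounding task prompt.
    task = "<CAPTION_TO_PHRASE_GROUNDING>"
    inputs = processor(text=task + caption, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )
    # For this task the parser returns {'<CAPTION_TO_PHRASE_GROUNDING>':
    #   {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['phrase', ...]}}
    return parsed[task]

# Example: detections = ground_phrases(model, processor, image, "a cat on a sofa")
```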
Step 1. Select the pretrained model architecture and press the Serve button.
Step 2. Wait for the model to deploy.