From 766dd04396c7b77921f5e1929188f8f54d4d79a6 Mon Sep 17 00:00:00 2001 From: Mikhail Kim Date: Tue, 22 Aug 2023 14:41:01 +0400 Subject: [PATCH 1/5] Docs: update citation information --- CITATION.cff | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/CITATION.cff b/CITATION.cff index 69cdb0f..36dc871 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -1,17 +1,21 @@ cff-version: 1.2.0 message: "If you use this software, please cite it as below." authors: -- family-names: "Vardanian" - given-names: "Ash" - orcid: "https://orcid.org/0000-0002-4882-1815" -- family-names: "Orshulevich" - given-names: "Vladimir" - orcid: "https://orcid.org/0009-0007-8961-6969" - family-names: "Kim" given-names: "Mikhail" orcid: "https://orcid.org/0009-0003-8413-3221" +- family-names: "Orshulevich" + given-names: "Vladimir" + orcid: "https://orcid.org/0009-0007-8961-6969" +- family-names: "Vardanian" + given-names: "Ash" + orcid: "https://orcid.org/0000-0002-4882-1815" title: "UForm by Unum Cloud" -version: 0.2.0 +version: 0.4.2 +keywords: +- "text-to-image retrieval" +- "multimodal" +- "visual-language pre-training" doi: 10.5281/zenodo.7951497 date-released: 2023-01-03 url: "https://github.com/unum-cloud/uform" From a32cfb3fffffd26425ac2ae91561a45fb5c801ac Mon Sep 17 00:00:00 2001 From: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com> Date: Fri, 1 Sep 2023 09:59:00 +0400 Subject: [PATCH 2/5] Docs: New intro --- README.md | 170 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 100 insertions(+), 70 deletions(-) diff --git a/README.md b/README.md index e4062eb..7fb3fd0 100755 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@

UForm

-Multi-Modal Transformers Library
-For Semantic Search Applications
+Pocket-Sized Multi-Modal Transformers Library
+For Semantic Search & Recommendation Systems


@@ -19,55 +19,82 @@ For Semantic Search Applications
--- -UForm is a Multi-Modal Modal inference library designed to encode Multi-Lingual Texts, Images, and, soon, *Audio, Video, and Documents*, into a shared vector space! -It comes with a family of homonymous pre-trained networks, so tiny and efficient you can run them anywhere from large servers to mobile phones... -[All available on HuggingFace](https://huggingface.co/unum-cloud) 🤗 +![UForm + USearch + UCall Demo](https://github.com/ashvardanian/usearch-images/raw/main/assets/usearch-images-slow.gif) -## Three Kinds of Multi-Modal Encoding +Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. +Imagine encoding text, images, and soon, audio, video, and documents into a shared digital playground. +All of this, with models so compact, they can run anywhere—from your server farm down to your smartphone. 📱💻 +[Check them out on HuggingFace!](https://huggingface.co/unum-cloud) 🤗 -![Early, Mid and Late Fusion Transformer Models](https://raw.githubusercontent.com/unum-cloud/uform/main/assets/model_types_bg.png) +## 🌟 Key Features -__Late-fusion models__ encode each modality independently but into one shared vector space. -Due to independent encoding, late-fusion models are good at capturing coarse-grained features but often neglect fine-grained ones. -This type of model is well-suited for retrieval in extensive collections. -The most famous example of such models is CLIP by OpenAI. +### 🚀 Speed & Efficiency -__Early-fusion models__ encode both modalities jointly so they can take into account fine-grained features. -Usually, these models are used for re-ranking relatively small retrieval results. +- __Tiny Embeddings__: With just 256 dimensions, our embeddings are lean and fast, making your search operations 1.5-3x quicker compared to other CLIP-like models with 512-1024 dimensions. + +- __Quantization Magic__: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy. The result? Smaller search indexes and blazing-fast performance, especially on IoT devices. -__Mid-fusion models__ are the golden midpoint between the previous two types. -Mid-fusion models consist of two parts – unimodal and multimodal. -The unimodal part allows encoding each modality separately as late-fusion models do. -The multimodal part takes unimodal features from the unimodal part as input and enhances them with a cross-attention mechanism. +### 🌍 Global Reach -This tiny package will help you deal with the last! +- __Balanced Training__: Our models are cosmopolitan, trained on a balanced diet of English and other languages. This gives us an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic. -## Performance +### 🛠 Versatility + +- __Mid-Fusion Tech__: Our models use mid-fusion to align multiple transformer towers, enabling database-like operations on multi-modal data. + +- __Mixed Features__: Thanks to mid-fusion, our models can produce mixed vision+language features, perfect for recommendation systems. + +- __Hardware Friendly__: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered. + +## 🎓 Architectural Innovation + +Inspired by the ALBEF paper by Salesforce, we've pushed the boundaries of pre-training objectives to squeeze more language-vision understanding into smaller models. 
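The "Quantization Magic" bullet above boils down to embeddings that survive an `f32` to `i8` downcast. A minimal sketch of such a downcast, assuming a plain symmetric scheme over L2-normalized vectors (the exact scheme UForm and USearch apply internally may differ):

```python
import numpy as np

# A batch of L2-normalized f32 embeddings, stand-ins for 256-dimensional UForm vectors.
embeddings = np.random.randn(8, 256).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Symmetric downcast: normalized components live in [-1, 1], map them onto [-127, 127].
quantized = np.clip(np.round(embeddings * 127.0), -127, 127).astype(np.int8)

# Cosine similarity survives the downcast with only a small rounding error.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(np.float32), b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]), cosine(quantized[0], quantized[1]))
```

Storing `i8` components also cuts the raw vector payload to a quarter of its `f32` size, which is where the smaller search indexes come from.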
+Some UForm models were trained on just 4 million samples across 10 GPUs — a __100x reduction in both dataset size and compute budget compared to OpenAI's CLIP__. +While they may not be suited for zero-shot classification tasks, they are your __go-to choice for processing large image datasets or even petabytes of video frame-by-frame__. + +### 🤝 Mid-Fusion: The Best of Both Worlds + +![Fusion Models](https://raw.githubusercontent.com/unum-cloud/uform/main/assets/model_types_bg.png) + +- __Late-Fusion Models__: Great for capturing the big picture but might miss the details. Ideal for large-scale retrieval. OpenAI CLIP is one of those. +- __Early-Fusion Models__: These are your detail-oriented models, capturing fine-grained features. They're usually employed for re-ranking smaller retrieval results. +- __Mid-Fusion Models__: The balanced diet of models. They offer both a unimodal and a multimodal part, capturing both the forest and the trees. The multimodal part enhances the unimodal features with a cross-attention mechanism. + +So, if you're looking to navigate the complex world of multi-modal data, UForm is the tiny but mighty companion you've been searching for! 🌈🔍 + +### New Training Objectives + +_Coming soon_ ## Installation +Install UForm via pip: + ```bash pip install uform ``` -UForm v0.3.0 and below depend on `transformers` and `timm` libraries. -All newer versions depend only on PyTorch and utility libraries. -For the best performance, PyTorch v2.0.0 and above is recommended. +> Note: For versions below 0.3.0, dependencies include transformers and timm. +> Newer versions only require PyTorch and utility libraries. +> For optimal performance, use PyTorch v2.0.0 or above. -## Usage +## Quick Start -To load the model: +### Loading a Model ```python import uform -model = uform.get_model('unum-cloud/uform-vl-english') -model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') +model = uform.get_model('unum-cloud/uform-vl-english') # Just English +model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages ``` -You can also load your own Mid-fusion model. Just upload it on HuggingFace and pass the model name to `get_model`. +The Multi-Lingual model is much heavier due to 10x larger vocabulary. +So if you only expect English data, take the former for efficiency. +You can also load your own Mid-fusion model. +Just upload it on HuggingFace and pass the model name to `get_model`. -To encode data: +### Encoding Data ```python from PIL import Image @@ -83,14 +110,15 @@ text_embedding = model.encode_text(text_data) joint_embedding = model.encode_multimodal(image=image_data, text=text_data) ``` -Retrieving features is also trivial: +### Retrieving Features ```python image_features, image_embedding = model.encode_image(image_data, return_features=True) text_features, text_embedding = model.encode_text(text_data, return_features=True) ``` -These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped: +These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped. +Those might be useful for both re-ranking of search results, and recommendation systems. ```python joint_embedding = model.encode_multimodal( @@ -100,21 +128,6 @@ joint_embedding = model.encode_multimodal( ) ``` -### Remote Procedure Calls for Cloud Deployments - -You can also use our larger, faster, better proprietary models deployed in optimized cloud environments. 
-For that, please, choose the cloud of liking, search the marketplace for "Unum UForm" and reinstall UForm with optional dependencies: - -```bash -pip install uform[remote] -``` - -The only thing that changes after that is calling `get_client` with the IP address of your instance instead of using `get_model` for local usage. - -```python -model = uform.get_client('0.0.0.0:7000') -``` - ### Graphcore IPU Inference First, you will need to setup PopTorch for Graphcore IPUs. @@ -142,6 +155,23 @@ text_data = {k: v.repeat(4, 1) for k,v in text_data.items()} image_features, text_features = model(image_data, text_data) ``` +### Remote Procedure Calls for Cloud Deployments + +You can also use our larger, faster, better proprietary models deployed in optimized cloud environments. +For that, please, choose the cloud of liking, search the marketplace for "Unum UForm" and reinstall UForm with optional dependencies: + +```bash +pip install uform[remote] +``` + +The only thing that changes after that is calling `get_client` with the IP address of your instance instead of using `get_model` for local usage. + +```python +model = uform.get_client('0.0.0.0:7000') +``` + +__[Please, join our Discord for early access!](https://discord.gg/jsMURnSFM2)__ + ## Models ### Architecture @@ -166,32 +196,32 @@ Check out the [`unum-cloud/coco-sm`](https://github.com/unum-cloud/coco-sm) for | Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers | | :------------------- | -----------: | -----------: | -----------: | -----------: | ------------: | -----------: | -------: | -| Arabic 🇸🇦 | 22.7 | **31.7** | 44.9 | **57.8** | 55.8 | **69.2** | 274 M | -| Armenian 🇦🇲 | 5.6 | **22.0** | 14.3 | **44.7** | 20.2 | **56.0** | 4 M | -| Chinese 🇨🇳 | 27.3 | **32.2** | 51.3 | **59.0** | 62.1 | **70.5** | 1'118 M | -| English 🇺🇸 | **37.8** | 37.7 | 63.5 | **65.0** | 73.5 | **75.9** | 1'452 M | -| French 🇫🇷 | 31.3 | **35.4** | 56.5 | **62.6** | 67.4 | **73.3** | 274 M | -| German 🇩🇪 | 31.7 | **35.1** | 56.9 | **62.2** | 67.4 | **73.3** | 134 M | -| Hebrew 🇮🇱 | 23.7 | **26.7** | 46.3 | **51.8** | 57.0 | **63.5** | 9 M | -| Hindi 🇮🇳 | 20.7 | **31.3** | 42.5 | **57.9** | 53.7 | **69.6** | 602 M | -| Indonesian 🇮🇩 | 26.9 | **30.7** | 51.4 | **57.0** | 62.7 | **68.6** | 199 M | -| Italian 🇮🇹 | 31.3 | **34.9** | 56.7 | **62.1** | 67.1 | **73.1** | 67 M | -| Japanese 🇯🇵 | 27.4 | **32.6** | 51.5 | **59.2** | 62.6 | **70.6** | 125 M | -| Korean 🇰🇷 | 24.4 | **31.5** | 48.1 | **57.8** | 59.2 | **69.2** | 81 M | -| Persian 🇮🇷 | 24.0 | **28.8** | 47.0 | **54.6** | 57.8 | **66.2** | 77 M | -| Polish 🇵🇱 | 29.2 | **33.6** | 53.9 | **60.1** | 64.7 | **71.3** | 41 M | -| Portuguese 🇵🇹 | 31.6 | **32.7** | 57.1 | **59.6** | 67.9 | **71.0** | 257 M | -| Russian 🇷🇺 | 29.9 | **33.9** | 54.8 | **60.9** | 65.8 | **72.0** | 258 M | -| Spanish 🇪🇸 | 32.6 | **35.6** | 58.0 | **62.8** | 68.8 | **73.7** | 548 M | -| Thai 🇹🇭 | 21.5 | **28.7** | 43.0 | **54.6** | 53.7 | **66.0** | 61 M | -| Turkish 🇹🇷 | 25.5 | **33.0** | 49.1 | **59.6** | 60.3 | **70.8** | 88 M | -| Ukranian 🇺🇦 | 26.0 | **30.6** | 49.9 | **56.7** | 60.9 | **68.1** | 41 M | -| Vietnamese 🇻🇳 | 25.4 | **28.3** | 49.2 | **53.9** | 60.3 | **65.5** | 85 M | +| Arabic 🇸🇦 | 22.7 | __31.7__ | 44.9 | __57.8__ | 55.8 | __69.2__ | 274 M | +| Armenian 🇦🇲 | 5.6 | __22.0__ | 14.3 | __44.7__ | 20.2 | __56.0__ | 4 M | +| Chinese 🇨🇳 | 27.3 | __32.2__ | 51.3 | __59.0__ | 62.1 | __70.5__ | 1'118 M | +| English 🇺🇸 | __37.8__ | 37.7 | 63.5 | __65.0__ | 73.5 | __75.9__ | 1'452 
M | +| French 🇫🇷 | 31.3 | __35.4__ | 56.5 | __62.6__ | 67.4 | __73.3__ | 274 M | +| German 🇩🇪 | 31.7 | __35.1__ | 56.9 | __62.2__ | 67.4 | __73.3__ | 134 M | +| Hebrew 🇮🇱 | 23.7 | __26.7__ | 46.3 | __51.8__ | 57.0 | __63.5__ | 9 M | +| Hindi 🇮🇳 | 20.7 | __31.3__ | 42.5 | __57.9__ | 53.7 | __69.6__ | 602 M | +| Indonesian 🇮🇩 | 26.9 | __30.7__ | 51.4 | __57.0__ | 62.7 | __68.6__ | 199 M | +| Italian 🇮🇹 | 31.3 | __34.9__ | 56.7 | __62.1__ | 67.1 | __73.1__ | 67 M | +| Japanese 🇯🇵 | 27.4 | __32.6__ | 51.5 | __59.2__ | 62.6 | __70.6__ | 125 M | +| Korean 🇰🇷 | 24.4 | __31.5__ | 48.1 | __57.8__ | 59.2 | __69.2__ | 81 M | +| Persian 🇮🇷 | 24.0 | __28.8__ | 47.0 | __54.6__ | 57.8 | __66.2__ | 77 M | +| Polish 🇵🇱 | 29.2 | __33.6__ | 53.9 | __60.1__ | 64.7 | __71.3__ | 41 M | +| Portuguese 🇵🇹 | 31.6 | __32.7__ | 57.1 | __59.6__ | 67.9 | __71.0__ | 257 M | +| Russian 🇷🇺 | 29.9 | __33.9__ | 54.8 | __60.9__ | 65.8 | __72.0__ | 258 M | +| Spanish 🇪🇸 | 32.6 | __35.6__ | 58.0 | __62.8__ | 68.8 | __73.7__ | 548 M | +| Thai 🇹🇭 | 21.5 | __28.7__ | 43.0 | __54.6__ | 53.7 | __66.0__ | 61 M | +| Turkish 🇹🇷 | 25.5 | __33.0__ | 49.1 | __59.6__ | 60.3 | __70.8__ | 88 M | +| Ukranian 🇺🇦 | 26.0 | __30.6__ | 49.9 | __56.7__ | 60.9 | __68.1__ | 41 M | +| Vietnamese 🇻🇳 | 25.4 | __28.3__ | 49.2 | __53.9__ | 60.3 | __65.5__ | 85 M | | | | | | | | | | -| Mean | 26.5±6.4 | **31.8±3.5** | 49.8±9.8 | **58.1±4.5** | 60.4±10.6 | **69.4±4.3** | - | -| Google Translate | 27.4±6.3 | **31.5±3.5** | 51.1±9.5 | **57.8±4.4** | 61.7±10.3 | **69.1±4.3** | - | -| Microsoft Translator | 27.2±6.4 | **31.4±3.6** | 50.8±9.8 | **57.7±4.7** | 61.4±10.6 | **68.9±4.6** | - | -| Meta NLLB | 24.9±6.7 | **32.4±3.5** | 47.5±10.3 | **58.9±4.5** | 58.2±11.2 | **70.2±4.3** | - | +| Mean | 26.5±6.4 | __31.8±3.5__ | 49.8±9.8 | __58.1±4.5__ | 60.4±10.6 | __69.4±4.3__ | - | +| Google Translate | 27.4±6.3 | __31.5±3.5__ | 51.1±9.5 | __57.8±4.4__ | 61.7±10.3 | __69.1±4.3__ | - | +| Microsoft Translator | 27.2±6.4 | __31.4±3.6__ | 50.8±9.8 | __57.7±4.7__ | 61.4±10.6 | __68.9±4.6__ | - | +| Meta NLLB | 24.9±6.7 | __32.4±3.5__ | 47.5±10.3 | __58.9±4.5__ | 58.2±11.2 | __70.2±4.3__ | - | ### Performance From d52225c2d4ad38ba8580d8dc19a5580298e9f5fd Mon Sep 17 00:00:00 2001 From: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com> Date: Fri, 1 Sep 2023 10:22:32 +0400 Subject: [PATCH 3/5] Docs: Improved intro --- README.md | 60 ++++++++++++++++++++++++++++--------------------------- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/README.md b/README.md index 7fb3fd0..fc9fe89 100755 --- a/README.md +++ b/README.md @@ -22,42 +22,46 @@ For Semantic Search & Recommendation Systems
![UForm + USearch + UCall Demo](https://github.com/ashvardanian/usearch-images/raw/main/assets/usearch-images-slow.gif) Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. -Imagine encoding text, images, and soon, audio, video, and documents into a shared digital playground. +Imagine encoding text, images, and soon, audio, video, and documents into a shared Semantic Vector Space. All of this, with models so compact, they can run anywhere—from your server farm down to your smartphone. 📱💻 [Check them out on HuggingFace!](https://huggingface.co/unum-cloud) 🤗 ## 🌟 Key Features -### 🚀 Speed & Efficiency +### ⚡ Speed & Efficiency -- __Tiny Embeddings__: With just 256 dimensions, our embeddings are lean and fast, making your search operations 1.5-3x quicker compared to other CLIP-like models with 512-1024 dimensions. - -- __Quantization Magic__: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy. The result? Smaller search indexes and blazing-fast performance, especially on IoT devices. +- __Tiny Embeddings__: With just 256 dimensions, our embeddings are lean and fast to work with, making your search operations 1.5-3x quicker compared to other CLIP-like models with 512-1024 dimensions. + +- __Quantization Magic__: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy. Supported by __[USearch](https://github.com/unum-cloud/usearch)__, this leads to further 3x reduction in index size and up to 5x higher performance, especially on IoT devices with low floating-point performance. ### 🌍 Global Reach -- __Balanced Training__: Our models are cosmopolitan, trained on a balanced diet of English and other languages. This gives us an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic. +- __Balanced Training__: Our models are cosmopolitan, trained on a balanced diet of English and other languages. This gives us [an edge in languages often overlooked by other models, from Hebrew and Armenian to Hindi and Arabic](#accuracy). -### 🛠 Versatility +### 🎛 Versatility - __Mid-Fusion Tech__: Our models use mid-fusion to align multiple transformer towers, enabling database-like operations on multi-modal data. -- __Mixed Features__: Thanks to mid-fusion, our models can produce mixed vision+language features, perfect for recommendation systems. +- __Mixed-Modality Features__: Thanks to mid-fusion, our models can produce mixed vision+language features, perfect for recommendation systems. + +- __Cheap Inference__: All of our models have under 1 Billion parameters, meaning substantially [higher throughput and lower inference costs](#speed) than even tiny models, like the famous `distilbert`. -- __Hardware Friendly__: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered. +- __Hardware Friendly__: Whether it's [CoreML, ONNX](https://huggingface.co/unum-cloud/uform-coreml-onnx), or specialized AI hardware like [Graphcore IPUs](#graphcore-ipus), we've got you covered. -## 🎓 Architectural Innovation +## 🎓 Architectural Improvements Inspired by the ALBEF paper by Salesforce, we've pushed the boundaries of pre-training objectives to squeeze more language-vision understanding into smaller models. Some UForm models were trained on just 4 million samples across 10 GPUs — a __100x reduction in both dataset size and compute budget compared to OpenAI's CLIP__. 
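A rough sketch of the USearch pairing mentioned under "Quantization Magic", assuming USearch's Python `Index` class with `ndim`, `metric`, and `dtype` arguments and batch `add`/`search` calls; treat the signatures as approximate and defer to the USearch documentation:

```python
import numpy as np
from usearch.index import Index

# A cosine index over 256-dimensional vectors, stored as i8 to shrink the index.
index = Index(ndim=256, metric='cos', dtype='i8')

# Stand-ins for UForm embeddings; in practice these would come from model.encode_image(...).
vectors = np.random.randn(1000, 256).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index.add(np.arange(len(vectors)), vectors)  # the downcast to i8 happens on insertion
matches = index.search(vectors[0], 10)       # approximate top-10 nearest neighbors
print(matches.keys)
```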
While they may not be suited for zero-shot classification tasks, they are your __go-to choice for processing large image datasets or even petabytes of video frame-by-frame__. -### 🤝 Mid-Fusion: The Best of Both Worlds +### Mid-Fusion ![Fusion Models](https://raw.githubusercontent.com/unum-cloud/uform/main/assets/model_types_bg.png) - __Late-Fusion Models__: Great for capturing the big picture but might miss the details. Ideal for large-scale retrieval. OpenAI CLIP is one of those. + - __Early-Fusion Models__: These are your detail-oriented models, capturing fine-grained features. They're usually employed for re-ranking smaller retrieval results. + - __Mid-Fusion Models__: The balanced diet of models. They offer both a unimodal and a multimodal part, capturing both the forest and the trees. The multimodal part enhances the unimodal features with a cross-attention mechanism. So, if you're looking to navigate the complex world of multi-modal data, UForm is the tiny but mighty companion you've been searching for! 🌈🔍 @@ -66,7 +70,7 @@ So, if you're looking to navigate the complex world of multi-modal data, UForm i _Coming soon_ -## Installation +## 🛠 Installation Install UForm via pip: @@ -78,7 +82,7 @@ pip install uform > Newer versions only require PyTorch and utility libraries. > For optimal performance, use PyTorch v2.0.0 or above. -## Quick Start +## 🚀 Quick Start ### Loading a Model @@ -89,7 +93,7 @@ model = uform.get_model('unum-cloud/uform-vl-english') # Just English model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages ``` -The Multi-Lingual model is much heavier due to 10x larger vocabulary. +The multi-lingual model is much heavier due to 10x larger vocabulary. So if you only expect English data, take the former for efficiency. You can also load your own Mid-fusion model. Just upload it on HuggingFace and pass the model name to `get_model`. @@ -128,10 +132,11 @@ joint_embedding = model.encode_multimodal( ) ``` -### Graphcore IPU Inference +### Graphcore IPUs -First, you will need to setup PopTorch for Graphcore IPUs. -Follow the user [guide](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/intro.html). +To run on Graphcore IPUs, you will need to setup PopTorch first. +Follow the [user guide](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/intro.html) on their website. +Once complete, our example would need a couple of adjustments to best leverage available data- and model-parallelism of the Graphcore platform. ```python import poptorch @@ -155,24 +160,22 @@ text_data = {k: v.repeat(4, 1) for k,v in text_data.items()} image_features, text_features = model(image_data, text_data) ``` -### Remote Procedure Calls for Cloud Deployments +### Cloud API You can also use our larger, faster, better proprietary models deployed in optimized cloud environments. For that, please, choose the cloud of liking, search the marketplace for "Unum UForm" and reinstall UForm with optional dependencies: -```bash -pip install uform[remote] -``` - -The only thing that changes after that is calling `get_client` with the IP address of your instance instead of using `get_model` for local usage. - ```python +$ pip install uform[remote] + model = uform.get_client('0.0.0.0:7000') ``` +The only thing that changes after that is calling `get_client` with the IP address of your instance instead of using `get_model` for local usage. 
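Since only the entry point changes between local and remote usage, that choice can be isolated in a tiny factory, and the encoding code never has to care. A sketch assuming nothing beyond the `get_model` and `get_client` calls shown above; the helper and environment variable names are arbitrary:

```python
import os

import uform

def load_uform(remote_address=None):
    """Return a remote client when an address is given, otherwise a local model."""
    if remote_address:  # e.g. '0.0.0.0:7000' from your cloud marketplace instance
        return uform.get_client(remote_address)
    return uform.get_model('unum-cloud/uform-vl-multilingual-v2')

model = load_uform(os.environ.get('UFORM_REMOTE_ADDRESS'))
```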
+ __[Please, join our Discord for early access!](https://discord.gg/jsMURnSFM2)__ -## Models +## 📊 Models ### Architecture @@ -189,7 +192,7 @@ For pre-training, we translated captions with [NLLB](https://github.com/facebook [weights-m]: https://huggingface.co/unum-cloud/uform-vl-multilingual/resolve/main/torch_weight.pt [weights-m-v2]: https://huggingface.co/unum-cloud/uform-vl-multilingual-v2/resolve/main/torch_weight.pt -### Evaluation +### Accuracy Evaluating the `unum-cloud/uform-vl-multilingual-v2` model, one can expect the following metrics for text-to-image search, compared against `xlm-roberta-base-ViT-B-32` [OpenCLIP](https://github.com/mlfoundations/open_clip) model. Check out the [`unum-cloud/coco-sm`](https://github.com/unum-cloud/coco-sm) for details. @@ -223,7 +226,7 @@ Check out the [`unum-cloud/coco-sm`](https://github.com/unum-cloud/coco-sm) for | Microsoft Translator | 27.2±6.4 | __31.4±3.6__ | 50.8±9.8 | __57.7±4.7__ | 61.4±10.6 | __68.9±4.6__ | - | | Meta NLLB | 24.9±6.7 | __32.4±3.5__ | 47.5±10.3 | __58.9±4.5__ | 58.2±11.2 | __70.2±4.3__ | - | -### Performance +### Speed On RTX 3090, the following performance is expected from `uform` on text encoding. @@ -236,7 +239,7 @@ On RTX 3090, the following performance is expected from `uform` on text encoding | | | | | | `unum-cloud/uform-vl-multilingual` | Yes | 6'809 | __x 4.22__ | -## Additional Tooling +## 🧰 Additional Tooling There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score). @@ -278,4 +281,3 @@ __Cons__: - Resource-intensive. - Not suitable for retrieval in large collections. - From 708ab38ee3dc5839eedc3c7ade2151ef215b9406 Mon Sep 17 00:00:00 2001 From: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com> Date: Fri, 1 Sep 2023 07:38:46 +0100 Subject: [PATCH 4/5] Docs: Intro Accents --- README.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index fc9fe89..da75019 100755 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@

UForm

-Pocket-Sized Multi-Modal Transformers Library
+Pocket-Sized Multi-Modal AI
For Semantic Search & Recommendation Systems


@@ -23,7 +23,7 @@ For Semantic Search & Recommendation Systems
Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. Imagine encoding text, images, and soon, audio, video, and documents into a shared Semantic Vector Space. -All of this, with models so compact, they can run anywhere—from your server farm down to your smartphone. 📱💻 +With compact __custom pre-trained transformer models__, all of this can run anywhere—from your server farm down to your smartphone. 📱💻 [Check them out on HuggingFace!](https://huggingface.co/unum-cloud) 🤗 ## 🌟 Key Features @@ -32,7 +32,7 @@ All of this, with models so compact, they can run anywhere—from your server fa - __Tiny Embeddings__: With just 256 dimensions, our embeddings are lean and fast to work with, making your search operations 1.5-3x quicker compared to other CLIP-like models with 512-1024 dimensions. -- __Quantization Magic__: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy. Supported by __[USearch](https://github.com/unum-cloud/usearch)__, this leads to further 3x reduction in index size and up to 5x higher performance, especially on IoT devices with low floating-point performance. +- __Quantization Magic__: Our models are trained to be quantization-aware, letting you downcast embeddings from `f32` to `i8` without losing much accuracy. Supported by __[USearch](https://github.com/unum-cloud/usearch)__, this leads to a further 3x reduction in index size and up to a 5x higher performance, especially on IoT devices with low floating-point performance. ### 🌍 Global Reach @@ -60,11 +60,11 @@ While they may not be suited for zero-shot classification tasks, they are your _ - __Late-Fusion Models__: Great for capturing the big picture but might miss the details. Ideal for large-scale retrieval. OpenAI CLIP is one of those. -- __Early-Fusion Models__: These are your detail-oriented models, capturing fine-grained features. They're usually employed for re-ranking smaller retrieval results. +- __Early-Fusion Models__: These are detail-oriented models that capture fine-grained features. They're usually employed for re-ranking smaller retrieval results. -- __Mid-Fusion Models__: The balanced diet of models. They offer both a unimodal and a multimodal part, capturing both the forest and the trees. The multimodal part enhances the unimodal features with a cross-attention mechanism. +- __Mid-Fusion Models__: The balanced diet of models. They offer an unimodal and a multimodal part, capturing both the forest and the trees. The multimodal part enhances the unimodal features with a cross-attention mechanism. -So, if you're looking to navigate the complex world of multi-modal data, UForm is the tiny but mighty companion you've been searching for! 🌈🔍 +So, if you're looking to navigate the complex world of multi-modal data, UForm is the tiny but mighty companion you've been searching for! ### New Training Objectives @@ -93,9 +93,9 @@ model = uform.get_model('unum-cloud/uform-vl-english') # Just English model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages ``` -The multi-lingual model is much heavier due to 10x larger vocabulary. -So if you only expect English data, take the former for efficiency. -You can also load your own Mid-fusion model. +The multi-lingual model is much heavier due to a 10x more extensive vocabulary. +So, if you only expect English data, take the former for efficiency. +You can also load your Mid-fusion model. Just upload it on HuggingFace and pass the model name to `get_model`. 
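The split between cheap unimodal embeddings and the joint multimodal head suggests a two-stage search: recall a shortlist by cosine similarity, then re-rank it with matching scores. A sketch along those lines, reusing the `encode_*` and `get_matching_scores` calls shown in this README; the `preprocess_*` helpers, the file names, and the assumption that `get_matching_scores` returns one score per joint embedding are illustrative:

```python
import torch
import torch.nn.functional as F
import uform
from PIL import Image

model = uform.get_model('unum-cloud/uform-vl-english')

# Placeholder collection and query.
paths = ['red_panda.jpg', 'black_bear.jpg', 'raccoon.jpg']
images = [model.preprocess_image(Image.open(path)) for path in paths]
text_data = model.preprocess_text('a small red panda hiding in the trees')

# Stage 1, recall: rank the whole collection by cosine similarity of unimodal embeddings.
text_embedding = model.encode_text(text_data)
image_embeddings = torch.cat([model.encode_image(image) for image in images])
recall_scores = F.cosine_similarity(image_embeddings, text_embedding)
shortlist = recall_scores.topk(k=min(2, len(paths))).indices.tolist()

# Stage 2, precision: re-rank only the shortlist with the joint multimodal encoder.
reranked = sorted(
    shortlist,
    key=lambda i: float(model.get_matching_scores(
        model.encode_multimodal(image=images[i], text=text_data))),
    reverse=True,
)
print([paths[i] for i in reranked])
```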
### Encoding Data @@ -122,7 +122,7 @@ text_features, text_embedding = model.encode_text(text_data, return_features=Tru ``` These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped. -Those might be useful for both re-ranking of search results, and recommendation systems. +Those might be useful for re-ranking search results, and recommendation systems. ```python joint_embedding = model.encode_multimodal( @@ -134,9 +134,9 @@ joint_embedding = model.encode_multimodal( ### Graphcore IPUs -To run on Graphcore IPUs, you will need to setup PopTorch first. +To run on Graphcore IPUs, you must set up PopTorch first. Follow the [user guide](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/intro.html) on their website. -Once complete, our example would need a couple of adjustments to best leverage available data- and model-parallelism of the Graphcore platform. +Once complete, our example would need a couple of adjustments to best leverage the Graphcore platform's available data and model-parallelism. ```python import poptorch @@ -163,7 +163,7 @@ image_features, text_features = model(image_data, text_data) ### Cloud API You can also use our larger, faster, better proprietary models deployed in optimized cloud environments. -For that, please, choose the cloud of liking, search the marketplace for "Unum UForm" and reinstall UForm with optional dependencies: +For that, please choose the cloud of liking, search the marketplace for "Unum UForm", and reinstall UForm with optional dependencies: ```python $ pip install uform[remote] @@ -275,7 +275,7 @@ score = model.get_matching_scores(joint_embedding) __Pros__: - Joint embedding captures fine-grained features. -- Suitable for re-ranking - sorting retrieval result. +- Suitable for re-ranking - sorting retrieval results. __Cons__: From 45c29765fdd9f71eb65e7e0b4c5ecbc9f9ec4dca Mon Sep 17 00:00:00 2001 From: Gurgen Yegoryan <21982202+gurgenyegoryan@users.noreply.github.com> Date: Fri, 1 Sep 2023 15:32:12 +0400 Subject: [PATCH 5/5] Make: Add rebase action --- .github/workflows/release.yml | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 1e11ce9..c535899 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -34,6 +34,31 @@ jobs: - run: cp .github/workflows/package.json . && npm install && npx semantic-release + rebase: + name: Rebase Dev. Branch + needs: versioning + runs-on: ubuntu-latest + steps: + - name: Checkout the latest code + uses: actions/checkout@v3 + with: + fetch-depth: 0 + + - name: Perform rebase + run: | + git fetch origin main + git checkout main-dev + git rebase origin/main + + - name: Push changes + uses: CasperWA/push-protected@v2 + with: + token: ${{ secrets.SEMANTIC_RELEASE_TOKEN }} + branch: main-dev + unprotect_reviews: True + force: True + + pypi_publish: name: Publish to PyPi runs-on: ubuntu-latest