Hello, and thanks for your interest in our work! We're EMMA, the Heriot-Watt University team that was one of the finalists in the Amazon SimBot Challenge, a university competition organised by Amazon to advance the state of the art in Conversational Embodied AI. We created this GitHub organisation to bring together all the code and artefacts we are releasing publicly as part of our effort to create the first V+L model for Embodied AI tasks.
- [4/12] 📣 We have released the checkpoints on Hugging Face! 🤗
- [1/12] 📣 We're going to be at EMNLP 2023 to present our paper _Multitask Multimodal Prompted Training for Interactive Embodied Task Completion_. We'll be releasing everything soon, so follow the organisation for updates!
There are multiple repos, which can be confusing at first, so here's how all the pieces connect at a high level (a rough end-to-end sketch follows the list).
- Datasets efficiently merges all annotations from all the datasets without duplicating images.
- Perception extracts visual features from every single image.
- Policy takes each instance and the visual features, creates the pretraining and fine-tuning datasets, and trains the EMMA models. It also evaluates checkpoints on the image-based tasks.
- Experience Hub brings all the models together for inference: it takes a request and an observation and predicts the next action for the environment.
- Arena Evaluation sends observations and instructions from the Alexa Arena to the Experience Hub, and returns the predicted actions for execution in the environment.
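To make that flow concrete, here is the rough end-to-end sketch mentioned above. It is purely illustrative: none of these functions exist in our repos, they are placeholder stubs that only mirror the responsibilities of Datasets, Perception, and Policy.

```python
# Purely illustrative: placeholder stubs mirroring the responsibilities of
# Datasets, Perception, and Policy. None of these are real APIs from our repos.

def merge_annotations(raw_datasets: list[list[dict]]) -> dict[str, dict]:
    """Datasets: merge annotations from every dataset, keyed by image id."""
    merged: dict[str, dict] = {}
    for dataset in raw_datasets:
        for example in dataset:
            merged.setdefault(example["image_id"], {}).update(example)
    return merged


def extract_features(image_id: str) -> list[float]:
    """Perception: extract visual features for one image (stubbed here)."""
    return [0.0] * 2048


def train_policy(instances: dict[str, dict], features: dict[str, list[float]]) -> None:
    """Policy: build the training data and train EMMA (stubbed here)."""
    for image_id, annotations in instances.items():
        _ = (annotations, features[image_id])  # a real training step would go here


instances = merge_annotations([[{"image_id": "img_0", "caption": "a red mug on a table"}]])
features = {image_id: extract_features(image_id) for image_id in instances}
train_policy(instances, features)
```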
We have multiple image-based datasets where every example from the raw data can be turned into a single example consisting of an image and the annotations attached to it. Since we mix all datasets together for training, the data must live in a single common structure. Additionally, because each dataset contributes different tasks, many datasets introduce new tasks and annotations over the same images. For example, COCO provides image captions, while VisualGenome provides scene graphs and region descriptions for many of the same images.
The purpose of Datasets is to convert every dataset from its raw form (as released) into a single structure, to merge multiple datasets together and remove duplicates, to store all this information in a lightweight format that allows quick querying, and to do it all as fast as possible. We merge all annotations from multiple datasets without duplicating images, validate the fields using Pydantic, and store everything in a simple SQLite table called a "DatasetDb" (or "Db"). Importantly, Datasets only handles the metadata for a given image: everything except the actual image itself, for which it keeps a reference back to the original file.
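The real DatasetDb code lives in the Datasets repo, but as a rough illustration of the idea (a Pydantic-validated instance serialised as JSON into a single SQLite table that can be queried quickly), something like the sketch below works. The `Instance` fields and the table layout here are assumptions for illustration, not the actual schema, and the snippet assumes Pydantic v2.

```python
import sqlite3

from pydantic import BaseModel


class Instance(BaseModel):
    """Illustrative instance schema; the real fields live in the Datasets repo."""

    image_id: str
    image_path: str  # a reference to the original image, not the image itself
    captions: list[str] = []
    regions: list[dict] = []


# One simple table: a row per instance, with the metadata stored as JSON.
connection = sqlite3.connect("instances.db")
connection.execute("CREATE TABLE IF NOT EXISTS instances (id TEXT PRIMARY KEY, data TEXT)")

instance = Instance(
    image_id="coco_391895",
    image_path="coco/train2017/000000391895.jpg",
    captions=["A man riding a motorbike down a dirt road."],
)
connection.execute(
    "INSERT OR REPLACE INTO instances VALUES (?, ?)",
    (instance.image_id, instance.model_dump_json()),
)
connection.commit()

# Quick querying: load a single instance and re-validate it.
row = connection.execute("SELECT data FROM instances WHERE id = ?", ("coco_391895",)).fetchone()
loaded = Instance.model_validate_json(row[0])
print(loaded.captions)
```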
The other part of providing input to the model is extracting visual features for every image. As mentioned in the paper, we use a feature extractor that is trained separately. To train the model efficiently, we extract the visual features for every image up front: Perception uses VinVL to extract features for every image in every dataset.
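The actual extraction code lives in Perception; the sketch below only illustrates the pre-extraction pattern (run the extractor once per image, cache the output to disk, and reuse it during training). `extract_vinvl_features` is a placeholder stub, not a real function from Perception, and the directory layout is made up.

```python
from pathlib import Path

import torch


def extract_vinvl_features(image_path: Path) -> dict[str, torch.Tensor]:
    """Placeholder for the VinVL extractor; the real one lives in Perception."""
    return {
        "bbox_features": torch.zeros(36, 2048),  # one feature vector per detected region
        "bbox_coords": torch.zeros(36, 4),
    }


def cache_features(image_dir: Path, cache_dir: Path) -> None:
    """Extract features once per image and store them for reuse during training."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(image_dir.glob("*.jpg")):
        output_path = cache_dir / f"{image_path.stem}.pt"
        if output_path.exists():
            continue  # already extracted, so skip it
        torch.save(extract_vinvl_features(image_path), output_path)


cache_features(Path("images/coco"), Path("features/coco"))
```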
Policy holds the main EMMA model. Using the DatasetDbs and the extracted features, we train the models to perform the various tasks. Policy also contains the evaluation for each of the image-based tasks.
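As a rough picture of how the DatasetDb metadata and the cached features come together at training time, here is a toy PyTorch dataset that follows on from the illustrative snippets above. The real dataset, model, and training code live in Policy; the class, table, and file layout here are assumptions.

```python
import json
import sqlite3
from pathlib import Path

import torch
from torch.utils.data import Dataset


class ToyEmmaDataset(Dataset):
    """Pairs DatasetDb metadata with pre-extracted visual features (illustrative only)."""

    def __init__(self, db_path: str, features_dir: Path) -> None:
        connection = sqlite3.connect(db_path)
        self.rows = connection.execute("SELECT id, data FROM instances").fetchall()
        self.features_dir = features_dir

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> dict:
        instance_id, data = self.rows[index]
        features = torch.load(self.features_dir / f"{instance_id}.pt")
        return {"metadata": json.loads(data), "features": features}


dataset = ToyEmmaDataset("instances.db", Path("features/coco"))
```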
The Experience Hub contains all the logic for integrating the models for inference: taking in a request from a user, processing the observation with Perception, predicting the next action with Policy, and returning the action to the environment.
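Assuming the Experience Hub is exposed as a small HTTP service (FastAPI is used here purely for illustration; the actual interface, endpoint, and payload fields may differ), the request handling described above boils down to something like this:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # run with e.g. `uvicorn sketch:app` if this file is saved as sketch.py


class StepRequest(BaseModel):
    """Illustrative payload: a user utterance plus the current observation's features."""

    utterance: str
    image_features: list[list[float]]  # in practice these come from Perception


class StepResponse(BaseModel):
    action: str


def predict_next_action(utterance: str, features: list[list[float]]) -> str:
    """Placeholder for the call into the Policy model."""
    return "Move Forward"


@app.post("/predict", response_model=StepResponse)
def predict(request: StepRequest) -> StepResponse:
    # Process the observation, predict the next action with Policy, and return it.
    return StepResponse(action=predict_next_action(request.utterance, request.image_features))
```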
Arena Evaluation is how we send requests and observations from the Alexa Arena to the Experience Hub and return the predicted actions to the Alexa Arena for execution in the environment.
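The client side of that loop, again with an entirely illustrative endpoint, payload, and placeholder helpers standing in for the Alexa Arena side, could be as simple as:

```python
import requests

EXPERIENCE_HUB_URL = "http://localhost:8000/predict"  # illustrative address, not the real one


def get_arena_observation() -> list[list[float]]:
    """Placeholder for grabbing the current observation from the Alexa Arena."""
    return [[0.0] * 2048]


def execute_in_arena(action: str) -> None:
    """Placeholder for sending the predicted action back to the Alexa Arena."""
    print(f"Executing in the Arena: {action}")


def run_instruction(instruction: str) -> None:
    """Forward one instruction and the current observation to the Experience Hub."""
    response = requests.post(
        EXPERIENCE_HUB_URL,
        json={"utterance": instruction, "image_features": get_arena_observation()},
        timeout=30,
    )
    execute_in_arena(response.json()["action"])


run_instruction("pick up the mug on the counter")
```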
We provide the pretrained/fine-tuned checkpoints, dataset features, and DatasetDbs required to reproduce our setup. All the material is available on Hugging Face! 🤗
The checkpoints can be accessed from this link: https://huggingface.co/emma-heriot-watt/emma_models
The features for pretraining and downstream evaluation are available here: https://huggingface.co/emma-heriot-watt/emma_features
The DatasetDbs for pretraining and downstream evaluation are available here: https://huggingface.co/emma-heriot-watt/emma_datasets
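For example, each repo can be pulled locally with `huggingface_hub` (the local directory names below are just examples; if a repo is hosted as a dataset repo, add `repo_type="dataset"`):

```python
from huggingface_hub import snapshot_download

# Download the checkpoints, the extracted features, and the DatasetDbs.
snapshot_download(repo_id="emma-heriot-watt/emma_models", local_dir="checkpoints")
snapshot_download(repo_id="emma-heriot-watt/emma_features", local_dir="features")
snapshot_download(repo_id="emma-heriot-watt/emma_datasets", local_dir="dbs")
```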
Results on the downstream image-based tasks:

Model | COCO Captioning (CIDEr) | VQAv2 (Acc.) | RefCOCOg (Acc.) | NLVR2 (Acc.) |
---|---|---|---|---|
VL-T5 | 116.5 | 70.3 | 71.3 | 73.6 |
VL-BART | 116.6 | 71.3 | 22.4 | 70.3 |
UniTAB | 119.8 | 71.0 | 84.5 | — |
OFA-base | 138.2 | 78.1 | 82.3 | — |
EMMA | 122.3 | 73.2 | 80.3 | 70.3 |
Results on the Alexa Arena (MSR: Mission Success Rate; NRA: Number of Robot Actions; the QA column indicates whether the model makes use of clarification questions):

Model | MSR (↑) | NRA (↓) | QA |
---|---|---|---|
**Leaderboard** | | | |
GauchoAI | 36.47 | — | — |
SEAGULL | 30.98 | — | — |
Kingfisher | 22.37 | — | — |
**Baselines** | | | |
NS | 19.32 | 11.73 | ❌ |
NS | 22.80 | 12.73 | ✅ |
VL | 18.19 | 11.82 | ❌ |
VL | 34.20 | 18.82 | ✅ |
**EMMA** | | | |
EMMA-modular | 33.76 | 18.91 | ❌ |
EMMA-modular | 33.95 | 19.05 | CR |
EMMA-modular | 35.16 | 18.92 | ✅ |
EMMA-unified | 33.26 | 18.79 | ❌ |
EMMA-unified | 33.59 | 18.89 | CR |
EMMA-unified | 36.81 | 18.69 | ✅ |