2023-04-24-alazraki23a.md

---
title: How (not) to ensemble LVLMs for VQA
abstract: 'This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment (Fig. 1) shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?'
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: alazraki23a
month: 0
tex_title: How (not) to ensemble LVLMs for VQA
firstpage: 1
lastpage: 20
page: 1-20
order: 1
cycles: false
bibtex_author: Alazraki, Lisa and Castrejon, Lluis and Dehghani, Mostafa and Huot, Fantine and Uijlings, Jasper and Mensink, Thomas
author:
- given: Lisa
  family: Alazraki
- given: Lluis
  family: Castrejon
- given: Mostafa
  family: Dehghani
- given: Fantine
  family: Huot
- given: Jasper
  family: Uijlings
- given: Thomas
  family: Mensink
date: 2023-04-24
address:
container-title: 'Proceedings on "I Can''t Believe It''s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops'
volume: 239
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 4
  - 24
pdf:
extras:
---