| Field | Value |
|---|---|
| title | How (not) to ensemble LVLMs for VQA |
| abstract | This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment (Fig. 1) shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it? |
| layout | inproceedings |
| series | Proceedings of Machine Learning Research |
| publisher | PMLR |
| issn | 2640-3498 |
| id | alazraki23a |
| month | 0 |
| tex_title | How (not) to ensemble LVLMs for VQA |
| firstpage | 1 |
| lastpage | 20 |
| page | 1-20 |
| order | 1 |
| cycles | false |
| bibtex_author | Alazraki, Lisa and Castrejon, Lluis and Dehghani, Mostafa and Huot, Fantine and Uijlings, Jasper and Mensink, Thomas |
| author | |
| date | 2023-04-24 |
| address | |
| container-title | Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops |
| volume | 239 |
| genre | inproceedings |
| issued | |
| extras | |
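For context on the oracle experiment mentioned in the abstract: an oracle ensemble counts an example as solved if *any* member model answers it correctly, which gives an upper bound on what a real ensemble could achieve. Below is a minimal Python sketch of that upper-bound computation; the data, model count, and accuracy levels are all hypothetical, chosen only to illustrate the calculation.

```python
import numpy as np

# Hypothetical per-example correctness matrix:
# correct[i, j] is True if model j answered example i correctly.
rng = np.random.default_rng(0)
n_examples, n_models = 1000, 3
correct = rng.random((n_examples, n_models)) < 0.45  # placeholder data

# Accuracy of each individual model.
single_model_acc = correct.mean(axis=0)

# Oracle upper bound: an example counts if at least one model is correct.
oracle_acc = correct.any(axis=1).mean()

print(f"best single model: {single_model_acc.max():.1%}")
print(f"oracle ensemble:   {oracle_acc:.1%}")
```

With complementary models (errors falling on different examples), the oracle bound can sit far above the best single model, which is exactly the gap between 48.8% and 67% that the paper investigates.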