VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

1 The Hebrew University of Jerusalem,   2 Google Research,   3 University of California, Los Angeles,   4 Allen Institute for AI,   5 University of Washington,   6 University of California, Santa Barbara,   7 Stanford University,   8 LAION
*Equal Contribution
arXiv | Code | 🤗 Dataset | 🤗 Leaderboard | LAION Blog

VisIT-Bench is a new vision-language instruction-following benchmark inspired by real-world use cases. Covering 70 diverse "wish-list" skills and paired with an automated ranking system, it supports ongoing, quantitative assessment of multimodal chatbot performance.
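For a quick programmatic look at the benchmark, the sketch below loads instances with the Hugging Face datasets library. The dataset ID, split, and field name are assumptions; check the 🤗 Dataset link above for the exact schema.

# Minimal sketch: browsing VisIT-Bench with the Hugging Face `datasets` library.
# The dataset ID, split, and field name below are assumptions -- verify them
# against the 🤗 Dataset link above.
from datasets import load_dataset

ds = load_dataset("mlfoundations/VisIT-Bench", split="test")  # assumed ID and split
for example in ds.select(range(3)):                           # first three instances
    print(example["instruction"])                             # assumed field name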


Why VisIT-Bench 🤔?

Though recent VLMs have shown promise in following instructions, their evaluation on real-world human-chatbot instructions is often limited. Typically, VLMs are compared through qualitative inspection of outputs, which makes it challenging to quantify progress and pinpoint shortcomings. VisIT-Bench helps address this problem by offering a comprehensive testbed for measuring model performance across a diverse set of instruction-following tasks inspired by real-world scenarios. 🌍
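To make the automated ranking concrete: rankings can be derived from pairwise comparisons between model outputs, judged automatically. Below is a minimal Elo-style sketch of that idea; the judge function and model names are hypothetical placeholders, not the paper's exact procedure.

# Minimal sketch: Elo-style ratings from pairwise preference judgments.
# `judge` and the model names are hypothetical placeholders; this illustrates
# the ranking idea, not the paper's exact evaluation pipeline.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, a: str, b: str, a_won: bool, k: float = 32.0) -> None:
    """Update both ratings in place after a single A-vs-B comparison."""
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if a_won else 0.0
    ratings[a] += k * (s_a - e_a)
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}  # hypothetical models
# For each benchmark instance, an automated judge picks the better response:
# for instruction, resp_a, resp_b in matchups:
#     update_elo(ratings, "model_a", "model_b", judge(instruction, resp_a, resp_b))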

Dataset Viewer

Browse the benchmark instances interactively via the 🤗 Dataset link above.

Leaderboard

To submit your results to the leaderboard, add a "predictions" column to this CSV and send the completed file to this email address.
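As a convenience, here is a minimal pandas sketch of preparing a submission. The input file name, the instruction column, and my_model are hypothetical placeholders; only the "predictions" column name comes from the instructions above.

# Minimal sketch: filling in the `predictions` column for a submission.
# The input file name, `instruction` column, and `my_model` are hypothetical;
# only the `predictions` column name is specified by the instructions above.
import pandas as pd

def my_model(instruction: str) -> str:
    """Hypothetical placeholder: replace with your model's inference call."""
    return "model output"

df = pd.read_csv("visit_bench_instances.csv")          # the CSV linked above (assumed name)
df["predictions"] = df["instruction"].map(my_model)    # one prediction per row
df.to_csv("visit_bench_predictions.csv", index=False)  # attach this file to the email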

BibTeX

@misc{bitton2023visitbench,
      title={VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use}, 
      author={Yonatan Bitton and Hritik Bansal and Jack Hessel and Rulin Shao and Wanrong Zhu and Anas Awadalla and Josh Gardner and Rohan Taori and Ludwig Schmidt},
      year={2023},
      eprint={2308.06595},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}