VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

1 The Hebrew University of Jerusalem,   2 Google Research,   3 University of California Los Angeles,   4 Allen Institute for AI,   5 University of Washington,   6 University of California, Santa Barbara,   7 Stanford University   8 LAION
*Equal Contribution





VisIT-Bench is a new vision-language instruction-following benchmark inspired by real-world use cases. Covering 70 diverse “wish-list” skills and paired with an automated ranking system, it provides an ongoing, quantitative assessment of multimodal chatbot performance.

Why VisIT-Bench 🤔?

Though recent VLMs have shown promise in following instructions, their evaluation on real-world human-chatbot instructions is often limited. Typically, VLMs are evaluated through qualitative comparison of outputs, which makes it challenging to quantify progress and pinpoint shortcomings. VisIT-Bench helps address this problem by offering a comprehensive testbed for measuring model performance across a diverse set of instruction-following tasks inspired by real-world scenarios. 🌍

Dataset Viewer


To submit your results to the leaderboard, please add a "predictions" column to this CSV and send it to this email address.
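A minimal sketch of preparing such a submission with pandas. The only requirement stated above is the new "predictions" column; the instance columns (`instruction`, `image_url`), the placeholder outputs, and the file name are illustrative assumptions, not the benchmark's actual schema.

```python
import pandas as pd

# Stand-in for the benchmark CSV; in practice, load the real file with
# pd.read_csv(...). Column names here are assumptions for illustration.
df = pd.DataFrame({
    "instruction": ["Describe the image.", "What is unusual about this scene?"],
    "image_url": ["img_0.png", "img_1.png"],
})

# Add one model output per benchmark row, keeping row order aligned.
# Replace these placeholder strings with your model's generations.
df["predictions"] = ["A dog running on a beach.", "The car is parked on the roof."]

# Write the augmented CSV to attach to the submission email.
df.to_csv("visit_bench_predictions.csv", index=False)
```

The key constraint is alignment: the predictions must be in the same row order as the benchmark instances, one output per row.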


@article{visitbench2023,
      title={VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use},
      author={Yonatan Bitton and Hritik Bansal and Jack Hessel and Rulin Shao and Wanrong Zhu and Anas Awadalla and Josh Gardner and Rohan Taori and Ludwig Schmidt},
      journal={arXiv preprint},
      year={2023},
}