VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

1 The Hebrew University of Jerusalem,   2 Google Research,   3 University of California Los Angeles,   4 Allen Institute for AI,   5 University of Washington,   6 University of California, Santa Barbara,   7 Stanford University,   8 LAION
*Equal Contribution

NeurIPS 2023, Datasets and Benchmarks





VisIT-Bench is a new vision-language instruction-following benchmark inspired by real-world use cases. Covering 70 diverse “wish-list” skills and paired with an automated ranking system, it supports ongoing assessment of multimodal chatbot performance.

Why VisIT-Bench 🤔?

Though recent VLMs have shown promise in instruction following, their evaluation on real-world human-chatbot instructions is often limited. Typically, VLMs are compared through qualitative inspection of outputs, which makes it hard to quantify progress and pinpoint shortcomings. VisIT-Bench addresses this problem by offering a comprehensive testbed that measures model performance across a diverse set of instruction-following tasks inspired by real-world scenarios. 🌍

Dataset Viewer


To submit your results to the leaderboard, run our auto-evaluation code, following the instructions here. Once you are happy with the results, send them to this email address. Please include in your email 1) a name for your model, 2) your team name (including your affiliation), and, optionally, 3) a GitHub repo or paper link. Please also attach your predictions: you can add a "predictions" column to this csv.
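As a minimal sketch of the last step, the snippet below attaches model outputs as a "predictions" column to a copy of the benchmark CSV. The file names and the helper name `attach_predictions` are placeholders, not part of the official evaluation code; it assumes one prediction per benchmark row, in row order.

```python
import csv

def attach_predictions(bench_csv, predictions, out_csv):
    """Write a copy of bench_csv with a 'predictions' column appended.

    predictions: list of model output strings, one per benchmark row,
    in the same order as the rows of bench_csv.
    """
    with open(bench_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    if len(predictions) != len(rows):
        raise ValueError(f"expected {len(rows)} predictions, got {len(predictions)}")
    for row, pred in zip(rows, predictions):
        row["predictions"] = pred
    fieldnames = list(rows[0].keys())  # original columns plus 'predictions'
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the original columns untouched and only appending a new one makes it easy for the auto-evaluation to align your predictions with the benchmark instances.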


Citation

@inproceedings{visitbench,
      title={VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use},
      author={Yonatan Bitton and Hritik Bansal and Jack Hessel and Rulin Shao and Wanrong Zhu and Anas Awadalla and Josh Gardner and Rohan Taori and Ludwig Schmidt},
      booktitle={NeurIPS Datasets and Benchmarks},
      year={2023}
}