An overview of multimodal evaluation benchmarks

Introduction


With major advances in the field of large language models (LLMs), models that can process multimodal inputs have recently come to the fore. These models can take both text and images as input, and sometimes other modalities, such as video or speech.

Multimodal models present unique challenges in assessment. In this blog post, we'll take a look at a few multimodal datasets that can be used to evaluate the performance of such models, mostly focused on Visual Question Answering (VQA), where information from an image is needed to answer a question.

The landscape of multimodal datasets is huge and ever-growing, with benchmarks focused on different cognitive and reasoning capabilities, data sources, and applications. The list of datasets here is by no means exhaustive. We will briefly describe the key features of ten multimodal datasets and benchmarks and outline some key trends in the space.

Multimodal datasets

TextVQA

There are a variety of vision-language tasks that a generalist multimodal language model can be evaluated on. One such task is optical character recognition (OCR) and answering questions based on text in an image. TextVQA, a dataset released in 2019 by Singh et al., evaluates exactly these capabilities.


Two examples from TextVQA (Singh et al., 2019)

Because the dataset focuses on text in images, many of the images show things like billboards, whiteboards, or traffic signs. In total, there are 28,408 images from the Open Images dataset and 45,336 associated questions, which require reading and reasoning about the text in the images. For each question, there are 10 ground-truth answers provided by annotators.
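Datasets with several reference answers per question are typically scored with the soft VQA accuracy metric, where a prediction counts as fully correct if at least three of the ten annotators gave that same answer. Below is a minimal sketch of this idea; the official TextVQA implementation additionally averages over leave-one-out subsets of the answers and applies more elaborate answer normalization.

```python
def vqa_accuracy(prediction: str, ground_truths: list[str]) -> float:
    """Simplified soft VQA accuracy: min(#matching annotators / 3, 1)."""
    normalize = lambda s: s.strip().lower()
    matches = sum(normalize(gt) == normalize(prediction) for gt in ground_truths)
    return min(matches / 3.0, 1.0)

# Toy example: 4 of 10 annotators answered "stop", the rest "stop sign"
answers = ["stop"] * 4 + ["stop sign"] * 6
print(vqa_accuracy("stop", answers))   # 1.0
print(vqa_accuracy("yield", answers))  # 0.0
```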

DocVQA

Similarly to TextVQA, DocVQA involves text-based reasoning about images, but it is more specific: in DocVQA, the images are documents, including tables, forms, and lists, drawn from industry sources such as the chemical and fossil fuel industries. There are 12,767 images from 6,071 documents, with 50,000 questions associated with these images. The authors also provide a random split of the data into train (80%), validation (10%), and test (10%) sets.


An example question-answer pair from DocVQA (Mathew et al., 2020)

OCRBench

The above two datasets are far from being the only ones available for OCR-related tasks. If one wants to comprehensively evaluate a model, performing the evaluation on all available testing data can be expensive and time-consuming. Because of this, samples from several related datasets are sometimes combined into a single benchmark that is smaller than the sum of all the individual datasets, and more diverse than any single source dataset.

For OCR-related tasks, one such benchmark is OCRBench by Liu et al. It consists of 1,000 manually verified question-answer pairs drawn from 18 datasets (including TextVQA and DocVQA described above). The benchmark covers five main tasks: text recognition, scene text-based VQA, document-based VQA, key information extraction, and handwritten mathematical expression recognition.


Examples of the text recognition (a), handwritten mathematical expression recognition (b), and scene text-based VQA (c) tasks in OCRBench (Liu et al., 2023)
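To illustrate the general idea of building such a compact benchmark, here is a rough sketch that samples a fixed number of question-answer pairs from each source dataset and shuffles the result. The dataset names and sizes below are placeholders; OCRBench's actual construction also involved manual verification of every pair.

```python
import random

def build_combined_benchmark(sources, per_source=50, seed=0):
    """Draw a small, balanced sample of QA pairs from each source dataset."""
    rng = random.Random(seed)
    benchmark = []
    for name, samples in sources.items():
        for item in rng.sample(samples, min(per_source, len(samples))):
            benchmark.append({"source": name, **item})
    rng.shuffle(benchmark)
    return benchmark

# Toy stand-ins for real source datasets
sources = {
    "TextVQA": [{"question": f"q{i}", "answer": f"a{i}"} for i in range(200)],
    "DocVQA":  [{"question": f"q{i}", "answer": f"a{i}"} for i in range(300)],
}
print(len(build_combined_benchmark(sources)))  # 100
```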

MathVista

There are also compilations of multiple datasets for other specific sets of tasks. For example, MathVista by Lu et al. focuses on mathematical reasoning. It contains 6,141 examples drawn from 31 multimodal datasets that involve mathematical reasoning (28 pre-existing datasets and 3 newly created ones).


Examples from datasets annotated for MathVista (Lu et al., 2023)

The dataset is divided into two parts: testmini (1,000 examples) for evaluation with limited resources, and test (the remaining 5,141 examples). To combat model overfitting, the answers for the test split are not released publicly.

LogicVista

Another relatively specialized skill that can be assessed in a multimodal LLM is logical reasoning. LogicVista, a dataset by Xiao et al. that aims to do exactly this, was released recently. It consists of 448 multiple-choice questions covering 5 logical reasoning tasks and 9 capabilities. The examples are collected and annotated from licensed intelligence test sources. Two examples from the dataset are shown in the figure below.


Examples from the LogicVista dataset (Xiao et al., 2024)

RealWorldQA

As opposed to narrowly defined tasks such as OCR or tasks involving arithmetic, some datasets cover broader and less restricted purposes and domains. For example, RealWorldQA is a dataset of over 700 real-world images, with one question for each image. Although most of the images are taken from vehicles and depict driving situations, some show more general scenes with multiple objects. The questions vary: some have multiple-choice options, while others are open-ended and include instructions such as "Please answer directly with a word or number".


An example image, question, and answer from RealWorldQA
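Open-ended questions of this kind are commonly scored by normalizing the model's short answer and comparing it against the reference. The snippet below is a generic sketch of such exact-match scoring, not RealWorldQA's official evaluation code.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s.]", "", answer)         # keep digits, letters, decimal points
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)  # drop articles
    return re.sub(r"\s+", " ", answer).strip()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match("The left lane.", "left lane"))  # True
print(exact_match("Two", "2"))                     # False: no word-to-digit mapping here
```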

MMBench

In a situation where different models compete for the best scores on fixed benchmarks, overfitting to those benchmarks becomes a concern. When a model overfits, it shows very good results on a particular dataset even though this strong performance does not generalize to other data. To combat this, there is a recent trend to publicly release only the benchmark questions, but not the answers. For example, the MMBench dataset is divided into dev and test subsets, and while the dev subset is released with its answers, the test answers are withheld. The dataset consists of 3,217 multiple-choice image-based questions covering 20 distinct abilities, which the authors group into broad categories of perception (e.g., object localization, image quality) and reasoning (e.g., future prediction, social relation).


Results of eight vision-language models on the 20 capabilities defined in MMBench, evaluated on the test subset (Liu et al., 2023)

An interesting feature of the dataset is that, unlike most other datasets where all questions are in English, MMBench is bilingual, with the English questions also translated into Chinese (the translations were done automatically using GPT-4 and then verified).

To verify the consistency of the models' performance and reduce the chance of a model giving the correct answer by accident, the MMBench authors ask the models the same question multiple times with the multiple-choice options presented in different orders.
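The MMBench paper calls this strategy CircularEval. A rough sketch of the idea is below: the answer options are rotated, the question is asked once per rotation, and the model only gets credit if it picks the correct option every time. `ask_model` is a placeholder for whatever inference call you actually use.

```python
LETTERS = "ABCD"

def circular_eval(ask_model, question, options, correct_idx):
    """Ask the question once per rotation of the options; pass only if the
    model chooses the correct option under every rotation."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        prompt = question + "\n" + "\n".join(
            f"{LETTERS[i]}. {opt}" for i, opt in enumerate(rotated)
        )
        predicted = ask_model(prompt)                # e.g. returns "A", "B", ...
        target = LETTERS[(correct_idx - shift) % n]  # correct option's new position
        if predicted.strip().upper() != target:
            return False
    return True

# Toy usage with a fake model that always answers "A"
always_a = lambda prompt: "A"
print(circular_eval(always_a, "What is shown?", ["cat", "dog", "car", "tree"], 0))  # False
```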

MME

Another benchmark for comprehensive assessment of multimodal capabilities is MME by Fu et al. It includes 14 subtasks related to cognitive and perceptual abilities. Some of the images in MME come from existing datasets, and some are novel and were taken manually by the authors. MME differs from most of the datasets described here in how its questions are posed: all of them require a "yes" or "no" answer. To test the models more rigorously, two questions are created for each image, such that the answer to one is "yes" and to the other "no", and to earn a point the model has to answer both correctly. The dataset is intended for academic research purposes only.


Examples from the MME benchmark (Fu et al., 2023)
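A small sketch of this scoring scheme, following the description above: a per-question accuracy plus a stricter per-image accuracy that requires both questions about an image to be answered correctly, with the two combined into a single subtask score (as the MME paper roughly does).

```python
def mme_style_score(results):
    """results: one (yes_question_correct, no_question_correct) bool pair per image.
    Returns per-question accuracy, per-image accuracy, and their sum (in %)."""
    total_questions = 2 * len(results)
    correct_questions = sum(a + b for a, b in results)
    correct_pairs = sum(a and b for a, b in results)
    acc = 100.0 * correct_questions / total_questions
    acc_plus = 100.0 * correct_pairs / len(results)
    return acc, acc_plus, acc + acc_plus

# 3 images: both questions right for two of them, one miss on the third
print(mme_style_score([(True, True), (True, True), (True, False)]))
# (83.33..., 66.66..., 150.0)
```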

MMMU

While most of the datasets described above evaluate multimodal models on tasks that most humans could perform, some datasets focus instead on expert knowledge. One such benchmark is MMMU by Yue et al.

Questions in the MMMU require college-level subject knowledge and cover 6 core subjects: Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering. In total, there are more than 11,000 questions from college textbooks, quizzes, and exams. Image types include diagrams, maps, chemical structures, etc.


MMMU examples from two subjects (Yue et al., 2023)

TVQA

The benchmarks mentioned so far cover two data modalities: text and images. Although this combination is the most widespread, more modalities such as video and speech are increasingly being incorporated into large multimodal models. For an example of a multimodal dataset that includes video, we can look at the TVQA dataset by Lei et al., which was created in 2018. In this dataset, questions are asked about 60-90 second video clips from six popular TV shows. For some questions, using only the subtitles or only the video is sufficient, while others require both modalities.

Examples of TVQA (Lei et al., 2018)

Multimodal inputs on Clarifai

With the Clarifai platform, you can easily process multimodal inputs. In this example notebook, you can see how the Gemini Pro Vision model can be used to answer an image-based question from the RealWorldQA benchmark.
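For reference, a minimal sketch of what such a call can look like with the Clarifai Python SDK is shown below. The model URL, helper names, prompt, and image URL are assumptions based on the SDK's general multimodal prediction pattern rather than a copy of the linked notebook, so treat the notebook and the current SDK docs as authoritative.

```python
# Sketch only: based on the Clarifai Python SDK's multimodal prediction pattern.
# The model URL, image URL, and prompt below are placeholders/assumptions.
from clarifai.client.model import Model
from clarifai.client.input import Inputs

PROMPT = "How many traffic lights are visible? Please answer directly with a number."
IMAGE_URL = "https://example.com/realworldqa_sample.jpg"  # placeholder image

model = Model(
    url="https://clarifai.com/gcp/generate/models/gemini-pro-vision",
    pat="YOUR_PAT",  # personal access token from your Clarifai account settings
)

response = model.predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=IMAGE_URL, raw_text=PROMPT)],
)
print(response.outputs[0].data.text.raw)
```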

Key trends in multimodal evaluation benchmarks

We've noticed a few trends related to multimodal benchmarks:

  • In the era of smaller models that specialized in a particular task, datasets typically included both training and test data (as in TextVQA). With the growing popularity of generalist models pre-trained on large-scale data, we see more and more datasets intended for model evaluation only.
  • As the number of available datasets grows and models become larger and more expensive to evaluate, the trend is to create curated collections of samples from several datasets for smaller-scale but more comprehensive evaluations.
  • For some datasets, the answers, or in some cases even the questions, are not released publicly. This is intended to combat overfitting of models to specific benchmarks, where good scores on a benchmark do not necessarily indicate strong performance in general.

Conclusion

In this blog post, we briefly described a few datasets that can be used to test the multimodal capabilities of vision-language models. It should be noted that many other existing benchmarks are not mentioned here. The variety among benchmarks is very broad: some datasets focus on narrow tasks, such as OCR or math, while others aim to be more comprehensive and reflective of the real world. Some require general knowledge, others highly specialized knowledge. Questions may require yes/no, multiple-choice, or open-ended responses.
