In the last couple of years, we have seen amazing advances in AI systems when it comes to recognizing and analyzing the content of complex images. But a new paper highlights how state-of-the-art vision language models (VLMs) often fail at simple, low-level visual analysis tasks that are trivially easy for a human.
In a preprint paper with the provocative title "Vision Language Models Are Blind" (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta created eight simple visual acuity tests with objectively correct answers. These range from identifying how many times two colored lines intersect, to identifying which letter in a long word has been circled, to counting how many nested shapes appear in an image (representative examples and results can be found on the research team's webpage).
Importantly, these tests are generated by custom code and do not rely on pre-existing images or tests that could be found on the public Internet, thereby "minimiz[ing] the chance that VLMs can solve by memorization," according to the researchers. The tests also "require minimal to zero world knowledge" beyond basic 2D shapes, making it difficult for the correct answer to be inferred from the "text-only question and choices" (which has been identified as a problem for some other visual AI benchmarks).
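The paper's own generation code isn't reproduced here, but a minimal sketch of what this kind of procedural test creation might look like for the line-intersection task is shown below. The function and file names are invented for illustration; the key point is that the ground-truth answer is computed from the same geometry used to draw the image, so no human labeling or web-scraped pictures are involved.

```python
import json
import numpy as np
import matplotlib.pyplot as plt

def make_line_intersection_test(seed, n_points=5):
    """Generate one synthetic test image: two colored polylines with a
    known number of intersections (the ground-truth answer)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n_points)
    y1, y2 = rng.uniform(0, 1, (2, n_points))

    # Count intersections by checking where the difference between the two
    # polylines changes sign between consecutive sample points.
    diff = y1 - y2
    crossings = int(np.sum(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0))

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot(x, y1, color="tab:blue", linewidth=2)
    ax.plot(x, y2, color="tab:red", linewidth=2)
    ax.axis("off")
    fig.savefig(f"line_test_{seed}.png", dpi=150, bbox_inches="tight")
    plt.close(fig)

    return {"image": f"line_test_{seed}.png", "answer": crossings}

# Build a small benchmark with objectively correct labels, no web images needed.
tests = [make_line_intersection_test(seed) for seed in range(100)]
with open("line_tests.json", "w") as f:
    json.dump(tests, f, indent=2)
```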
Are you smarter than a fifth grader?
After running multiple tests across four different visual models—GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5—the researchers found that all four fell well short of the 100 percent accuracy you might expect for such simple visual analysis tasks (and which most observant humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for example, the best-performing model got the answer right less than 60 percent of the time. Gemini-1.5 Pro, on the other hand, hit about 93 percent accuracy in identifying circled letters, approaching human-level performance.
Even small changes to the tasks could lead to big changes in results. While all four tested models were able to correctly identify five overlapping hollow circles, the accuracy of every model dropped below 50 percent when six to nine circles were involved. "This suggests that VLMs are biased toward the well-known Olympic logo, which has five circles," the researchers speculate. In other cases, the models occasionally produced nonsensical responses, such as guessing "9," "n," or "©" as the circled letter in the word "Subdermatoglyphic."
Overall, the results highlight how AI models that perform well at high-level visual reasoning have some significant "blind spots" (sorry) when it comes to low-level abstract images. It's all somewhat reminiscent of a similar capability gap we often see in modern large language models, which can produce remarkably cogent summaries of lengthy texts while failing at the most basic math and spelling questions.
This gap in VLM capabilities may come down to the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using images drawn specifically from one of their tasks (the "are the two circles touching?" test), that model showed only modest improvement, going from 17 percent to around 37 percent accuracy. "The loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize," the researchers write.
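As a purely illustrative sketch (not code from the paper), one way to see that memorization-versus-generalization distinction is to compare a fine-tuned model's accuracy on the exact images it was trained on against its accuracy on freshly generated images of the same task. The `query_vlm` helper below is a hypothetical placeholder for whatever model API is being tested.

```python
def accuracy(answer_fn, tests):
    """Fraction of generated tests where the model's answer matches ground truth."""
    correct = sum(1 for t in tests if answer_fn(t["image"]) == t["answer"])
    return correct / len(tests)

def query_vlm(image_path: str) -> int:
    # Stand-in for a call to the vision language model under test.
    raise NotImplementedError("plug in the model being evaluated here")

# train_tests: the images used for fine-tuning; fresh_tests: newly generated
# images of the same task. High accuracy on train_tests but low accuracy on
# fresh_tests (alongside near-zero training loss) is the signature of a model
# that has memorized its training set rather than learned the underlying skill.
# train_acc = accuracy(query_vlm, train_tests)
# fresh_acc = accuracy(query_vlm, fresh_tests)
```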
The researchers propose that the gap in VLM abilities may be related to the so-called "late fusion" of vision encoders onto pre-trained large language models. An "early fusion" training approach that integrates visual encoding alongside language training could lead to better results on these low-level tasks, the researchers suggest (without providing any analysis of that question).
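To make that distinction concrete, here is a heavily simplified, hypothetical PyTorch sketch; it is not the architecture of any of the models tested, and the class names and dimensions are invented. In late fusion, a separately trained vision encoder's output is merely projected into a pre-trained language model's input space; in early fusion, image and text tokens flow through one jointly trained stack from the start.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512  # shared embedding width, chosen arbitrarily for illustration

class LateFusionVLM(nn.Module):
    """Vision encoder bolted onto an already-trained language model."""
    def __init__(self, language_model: nn.Module, vision_encoder: nn.Module):
        super().__init__()
        self.language_model = language_model   # pretrained on text only
        self.vision_encoder = vision_encoder   # pretrained separately on images
        self.projector = nn.Linear(EMBED_DIM, EMBED_DIM)  # the only new piece

    def forward(self, image_feats, text_embeds):
        # Image features are projected into the LLM's embedding space and
        # prepended to the text; the LLM itself never saw pixels in pretraining.
        vis_tokens = self.projector(self.vision_encoder(image_feats))
        return self.language_model(torch.cat([vis_tokens, text_embeds], dim=1))

class EarlyFusionVLM(nn.Module):
    """One backbone trained jointly on image and text tokens from the start."""
    def __init__(self, backbone: nn.Module, patch_embed: nn.Module, text_embed: nn.Module):
        super().__init__()
        self.patch_embed = patch_embed  # image patches -> tokens
        self.text_embed = text_embed    # token ids -> tokens
        self.backbone = backbone        # single stack sees both modalities

    def forward(self, patches, token_ids):
        tokens = torch.cat([self.patch_embed(patches), self.text_embed(token_ids)], dim=1)
        return self.backbone(tokens)
```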