Seeing vs. Understanding: The Blind Spots of Multimodal AI Models

Sep 10, 2024 | Trends


As the world enters a remarkable era of technological innovation, artificial intelligence (AI) stands at the forefront, particularly with the emergence of multimodal models like GPT-4o and Gemini 1.5 Pro. Touted for their ability to “understand” images and audio in addition to text, these models are being marketed as the next step in AI evolution. However, a recent study reveals an unsettling truth about their visual processing capabilities: they may not actually “see” in the human sense. Below, we delve into what this means for the future of AI and what we should take away from these findings.

The Human Perspective: Can Computers Really See?

To set the stage, it’s worth clarifying that AI creators have rarely, if ever, claimed that their models “see” like humans do. Yet their marketing, laden with terms like “vision capabilities” and “visual understanding,” implies that the technology operates on a level comparable to human vision. A recent study suggests otherwise.

The Study: Simple Tests, Unexpected Results

Researchers from Auburn University and the University of Alberta recently conducted a systematic study that put these AI models to the test with basic visual tasks. The goals were straightforward: determine whether shapes overlap or count simple figures in an image. While humans would breeze through these tasks, AI models struggled significantly, raising questions about their visual capabilities.

In one test, two circles were presented either touching, overlapping, or spaced apart. While GPT-4o succeeded more than 95% of the time with distant circles, it dropped to just 18% accuracy with overlapping ones.
Counting tasks proved even more bewildering—Gemini struggled with six interlocking circles to the point where it couldn’t get a single answer right.
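To make the circle test concrete, here is a minimal sketch of how the correct answer for a two-circle image could be computed from geometry alone. It is an illustration, not the researchers' actual code: the function name `circle_relation`, the `(x, y)` center format, and the tolerance `tol` are all assumptions made for this example.

```python
import math

def circle_relation(c1, r1, c2, r2, tol=1e-9):
    """Classify two circles as 'overlapping', 'touching', or 'apart'.

    Hypothetical helper: c1 and c2 are (x, y) centers, r1 and r2 are radii.
    """
    # Distance between the two centers.
    d = math.dist(c1, c2)
    # The outlines touch externally when the distance equals the sum of radii.
    if abs(d - (r1 + r2)) <= tol:
        return "touching"
    # Closer than the sum of radii means the discs share some area.
    return "overlapping" if d < r1 + r2 else "apart"

print(circle_relation((0, 0), 1, (3, 0), 1))  # apart
print(circle_relation((0, 0), 1, (2, 0), 1))  # touching
print(circle_relation((0, 0), 1, (1, 0), 1))  # overlapping
```

The point of the sketch is that the ground truth is a single distance comparison, which is what makes the models' low accuracy on overlapping circles so striking.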

As co-author Anh Nguyen astutely noted, “Our message is, ‘Look, these best models are STILL failing.’” The implications were profound; these models, despite being sophisticated, lacked the consistency and reliability expected when interpreting visual data.

Model Limitations: Abstraction Without Insight

The study suggests that the problem runs deeper than occasional mistakes: the models do not appear to “see” in any literal sense. Their “understanding” seems closely tied to their training data, which lacks diverse visual scenarios. For example, the famous Olympic Rings likely skewed their responses because of repeated exposure during training, while less common configurations, such as six interlocking rings, left them at a loss.

There is currently no technology that lets us visualize exactly what an AI model “sees,” so the conclusion has to rest on the models' behavior: these advanced systems may be receiving visual input but lack the ability to interpret it meaningfully.

Interpreting the Findings: The Future of AI Vision

So, what does this mean for the future of such multimodal models? Are they rendered useless due to these shortcomings? Not at all. While they struggle with fundamental reasoning tasks, these AI models excel in certain environments, particularly those involving recognizable human actions or familiar everyday objects. Their unique strengths could find applications in fields like autonomous vehicles or customer service, where understanding patterns in human behavior or common situations matters more than intricate visual processing.

Conclusion: Rethinking AI Capabilities

In a world that increasingly relies on artificial intelligence, this study highlights the need for transparency about what these models can and cannot do. While they can perform some visual tasks, it’s crucial to remember that they do so without the inherent understanding that humans possess. As technology progresses, we need to remain mindful of these limitations and continue researching ways to expand the potential of AI.

At **[fxis.ai](https://fxis.ai)**, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

For more insights, updates, or to collaborate on AI development projects, stay connected with **[fxis.ai](https://fxis.ai)**.
