Exclusive: Gemini’s data analysis capabilities aren’t as good as Google claims


One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demonstrations, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” such as summarizing documents several hundred pages long or searching within scenes of movie footage.

But new research suggests that the models are not, in fact, very effective in these areas.

Two separate studies looked at how well Google’s Gemini and other models do at understanding massive amounts of data (think War and Peace-length works). Both studies found that Gemini 1.5 Pro and 1.5 Flash struggle to correctly answer questions about large data sets; in a series of document-based tests, the models gave the correct answer only 40 to 50 percent of the time.

“While models like Gemini 1.5 Pro can technically handle long contexts, we’ve seen many cases where the models don’t actually ‘understand’ the content,” Marzena Karpinska, a postdoctoral fellow at UMass Amherst and co-author of one of the studies, told TechCrunch.

Gemini’s context window falls short

A model’s context, or context window, refers to the input data (e.g., text) that the model considers before generating an output (e.g., additional text). A simple question—“Who won the 2020 US presidential election?”—can serve as context, as can a movie script, TV show, or audio clip. And as the context windows get larger, so do the documents inserted into them.

The latest versions of Gemini can accept more than 2 million tokens as context. (“Tokens” are subdivided chunks of raw data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s about 1.4 million words, two hours of video, or 22 hours of audio — the largest context of any commercially available model.
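For a sense of what a long-context call looks like in practice, here is a minimal sketch assuming the google-generativeai Python SDK; the file path, question, and environment variable are placeholders rather than anything from the studies described below.

```python
# A rough sketch of a long-context request, assuming the google-generativeai
# Python SDK (pip install google-generativeai). The model name and the
# ~2-million-token ceiling come from the article; the document and question
# are placeholders.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

# Load a long document (e.g., a several-hundred-page transcript) as plain text.
with open("long_document.txt", encoding="utf-8") as f:
    document = f.read()

# Count tokens before sending: the current ceiling is roughly 2 million
# tokens, or about 1.4 million words.
token_count = model.count_tokens(document).total_tokens
print(f"Document occupies ~{token_count:,} tokens of the context window")

# The entire document is passed as context alongside the question.
response = model.generate_content(
    [document, "Summarize the main events described in this document."]
)
print(response.text)
```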

At a press conference earlier this year, Google showed off several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast (about 402 pages) for quotes containing jokes, and then find a scene in the broadcast that resembled a pencil sketch.

Google DeepMind VP of Research Oriol Vinyals, who led the briefing, called the model “magical.”

“(1.5 Pro) does these kinds of reasoning tasks on every page, every word,” he said.

Perhaps that was an exaggeration.

In one of the aforementioned studies assessing these abilities, Karpinska, working with researchers at the Allen Institute for AI and Princeton, asked models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models could not “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement like “Using his Apoth skills, Nusis is able to reverse engineer the type of portal opened by the Reagents Key found in Rona’s wooden chest”, Gemini 1.5 Pro and 1.5 Flash – after ingesting the corresponding book – had to say whether the statement was true or false and explain their reasoning.

Image credits: University of Massachusetts Amherst

Testing on a book of approximately 260,000 words (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. In other words, a coin flip would answer questions about the book more accurately than Google’s latest machine learning models. Averaging across all the benchmark results, neither model achieved better-than-chance accuracy at answering the questions.
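At its core, the setup is a binary claim-verification benchmark: the model reads the entire book plus one claim, answers true or false, and the answers are scored against gold labels. As a rough illustration (not the authors’ actual code: the prompt wording, answer parsing, and example claims are assumptions), the scoring might look like this:

```python
# Hypothetical sketch of scoring true/false claim verification over a book.
# This is not the researchers' evaluation code; the prompt format, answer
# parsing, and example claims are illustrative assumptions.
import re

def build_prompt(book_text: str, claim: str) -> str:
    """Place the full book in context, then ask for a True/False verdict."""
    return (
        f"{book_text}\n\n"
        f"Claim: {claim}\n"
        "Is this claim true or false? Answer 'True' or 'False', then explain."
    )

def parse_verdict(model_output: str) -> bool | None:
    """Pull the first True/False verdict out of the model's reply."""
    match = re.search(r"\b(true|false)\b", model_output, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "true"

def accuracy(predictions: list[bool | None], gold: list[bool]) -> float:
    """Fraction of claims answered correctly; unparseable answers count as wrong."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Example: three claims with gold labels and (hypothetical) model replies.
gold_labels = [True, False, True]
model_replies = [
    "True. Nusis reverse engineers the portal in chapter 12.",
    "True, this is stated early in the book.",       # wrong
    "False - the chest belonged to someone else.",   # wrong
]
preds = [parse_verdict(r) for r in model_replies]
print(f"Accuracy: {accuracy(preds, gold_labels):.1%}")  # 33.3%
```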

“We noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models have difficulty verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason” about videos—that is, to search for and answer questions about their content.

The co-authors created a dataset of images (for example, a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (for example, “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like sequences.

Flash didn’t perform very well. In a test that asked the model to transcribe six handwritten digits from a 25-image “slideshow,” Flash got about 50 percent of the transcriptions correct. Accuracy dropped to about 30 percent with eight digits.
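Conceptually, the probe resembles a visual needle in a haystack: a few digit frames are scattered among distractor frames, and the model must read the digits back in order. The sketch below shows how such a slideshow and its gold transcription could be assembled; the file names, digit values, and exact-match scoring are illustrative assumptions, not the study’s actual setup.

```python
# Hypothetical sketch of the "slideshow" probe described above: handwritten-
# digit images are scattered among distractor images, and the model must
# transcribe the digits in the order they appear. File names, digit values,
# and slideshow length are illustrative assumptions.
import random

def build_slideshow(digit_images: list[tuple[str, int]],
                    distractor_images: list[str],
                    total_frames: int = 25,
                    seed: int = 0) -> tuple[list[str], str]:
    """Interleave digit frames with distractors; return the frame order and
    the gold transcription (the digits in the order they appear)."""
    rng = random.Random(seed)
    frames = [path for path, _ in digit_images]
    frames += rng.sample(distractor_images, total_frames - len(frames))
    rng.shuffle(frames)
    digit_by_path = dict(digit_images)
    gold = "".join(str(digit_by_path[f]) for f in frames if f in digit_by_path)
    return frames, gold

digits = [(f"digit_{i}.png", d) for i, d in enumerate([3, 1, 4, 1, 5, 9])]
distractors = [f"photo_{i}.png" for i in range(40)]

frames, gold = build_slideshow(digits, distractors)
print(f"{len(frames)} frames, expected transcription: {gold}")
# Scoring could then be exact match on the transcription,
# e.g. model_answer.strip() == gold
```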

“On real-world image question-and-answer tasks, this seems to be particularly difficult for all the models we tested,” Michael Saxon, a doctoral student at UC Santa Barbara and one of the co-authors of the study, told TechCrunch. “This small amount of reasoning – recognizing that a number is in a frame and reading it – could be what is breaking the model.”

Google promises too much with Gemini

Neither of the studies has been peer-reviewed, and neither examined the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t supposed to be as good as Pro in terms of performance; Google presents it as a low-cost alternative.

Still, both add fuel to the fire that Google has overpromised – and underdelivered – with Gemini from the start. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that gives the context window top billing in its advertising.

“There’s nothing wrong with simply stating that ‘our model can accommodate X number of tokens’ based on objective technical details,” Saxon said. “But the question is, what useful thing can be done with it?”

Generative AI, as a whole, is coming under increasing scrutiny as companies (and investors) become increasingly frustrated with the technology’s limitations.

In two recent Boston Consulting Group surveys, about half of respondents, all senior executives, said they did not expect generative AI to drive substantial productivity gains and that they were concerned about the potential for mistakes and data compromises from generative AI-powered tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, falling 76% from its Q3 2023 peak.

With meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that essentially amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up with its generative AI rivals, has been desperate to make Gemini’s context one of those differentiators.

But the bet was premature, it seems.

“We haven’t yet found a way to demonstrate that ‘reasoning’ or ‘understanding’ over long documents actually happens, and each group that publishes these models is developing its own ad hoc evaluations to support these claims,” Karpinska said. “Without knowing how long-context processing is implemented (and companies don’t share these details), it’s hard to say how realistic these claims are.”

Google did not respond to a request for comment.

Saxon and Karpinska believe the antidotes to exaggerated claims around generative AI are better benchmarks and, along the same lines, a greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context, “needle in the haystack” (frequently cited by Google in its marketing materials), only measures a model’s ability to retrieve particular pieces of information, like names and numbers, from data sets – not to answer complex questions about that information.
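For contrast, a typical needle-in-a-haystack test of the kind Saxon describes looks roughly like the following: plant one factual sentence at some depth inside a long stretch of filler text and check whether the model can recall it. The filler, needle, and pass criterion here are illustrative assumptions about how such tests are commonly constructed, not any vendor’s actual benchmark.

```python
# A minimal sketch of a "needle in a haystack" retrieval test: bury one
# factual sentence in a long span of filler text and ask the model to recall
# it. The filler, needle, and prompt wording are illustrative assumptions.
def build_haystack(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    position = int(len(filler_paragraphs) * depth)
    paragraphs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs)

filler = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit."] * 500
needle = "The secret passcode for the vault is 7261."

haystack = build_haystack(filler, needle, depth=0.65)
question = "What is the secret passcode for the vault?"
# The model "passes" if its answer contains "7261": pure retrieval of a
# planted fact, which is exactly why Saxon argues the test says little about
# answering complex questions over a whole document.
```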

“All scientists and most engineers who use these models agree that our current benchmarking culture is broken,” Saxon said. “So it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a huge grain of salt.”
