
The AI model race has suddenly gotten a lot closer, say Stanford scholars

Apr 09, 2025 Hi-network.com

(Image: Stanford University, AI Index 2025 report)

The competition to create the world's top artificial intelligence models has become something of a scrimmage: a pile of worthy contenders stacked on top of one another, with less and less of a clear victory for anyone.

According to scholars at Stanford University's Institute for Human-Centered Artificial Intelligence, the number of contenders in "frontier" or "foundation" models has expanded substantially in recent years, but the difference between the best and the weakest has also narrowed substantially.

In 2024, "the Elo score difference between the top and 10th-ranked model on the Chatbot Arena Leaderboard was 11.9%. By early 2025, this gap had narrowed to just 5.4%," write Rishi Bommasani and team in "The AI Index 2025 Annual Report."
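The report expresses the leaderboard gap as a percentage difference in Elo scores. To get a feel for what an Elo gap means in practice, the standard Elo formula converts a rating difference into an expected head-to-head win probability. The sketch below is illustrative only: the ratings are made-up numbers, not the leaderboard's actual scores, and Chatbot Arena's exact methodology may differ.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings: a 1,300-point leader vs. a model rated 5.4% lower.
top = 1300.0
tenth = top * (1 - 0.054)

# A ~70-point gap translates to roughly a 60/40 head-to-head edge.
print(round(elo_win_probability(top, tenth), 3))
```

The point of the exercise: a single-digit percentage gap in Elo scores corresponds to only a modest edge in any individual matchup, which is what "the race has gotten closer" means in concrete terms.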

Also: Is OpenAI doomed? Open-source models may crush it, warns expert

In the chapter on technical performance, Bommasani and colleagues relate that in 2022, when ChatGPT first emerged, the field of top large language models was dominated by OpenAI and Google. That field now includes China's DeepSeek AI, Elon Musk's xAI, Anthropic, Meta Platforms' Meta AI, and Mistral AI.

"The AI landscape is becoming increasingly competitive, with high-quality models now available from a growing number of developers," they write. 

The gap between OpenAI and Google has narrowed even more, with the GPT family and Gemini having a performance difference of just 0.7%, down from 4.9% in 2023. 

A concurrent trend, according to Bommasani, is the rise of "open-weight" AI models, such as Meta Platforms's Llama, which can, in some cases, equal the top "closed" models, such as GPT. 

(Image: Stanford University, AI Index 2025 report)

Open-weight models are those whose trained neural-network weights, the heart of their ability to transform input into output, are made available for download. Anyone can inspect and replicate the model without access to its actual source code. Closed models, such as GPT and Gemini, do not provide public access to their weights, and so remain something of a black box.

"In early January 2024, the leading closed-weight model outperformed the top open-weight model by 8.0%. By February 2025, this gap had narrowed to 1.7%," write Bommasani and team.

Also: Gemini Pro 2.5 is a stunningly capable coding assistant - and a big threat to ChatGPT

Since 2023, when "closed-weight models consistently outperformed open-weight counterparts on nearly every major benchmark," they relate, the gap between closed and open has narrowed from 15.9 points to "just 0.1 percentage point" at the end of 2024, largely a result of Meta's Llama 3.1.

Another thread running alongside open-weight models is the surprising achievement of smaller large language models. AI models are typically classified by the number of weights they use, with the biggest publicly disclosed at the moment, Meta's Llama 4, using two trillion weights.

(Image: Stanford University, AI Index 2025 report)

"2024 was a breakthrough year for smaller AI models," write Bommasani and team. "Nearly every major AI developer released compact, high-performing models, including GPT-4o mini, o1-mini, Gemini 2.0 Flash, Llama 3.1 8B, and Mistral Small 3.5."

Bommasani and team don't make any predictions about what happens next in the crowded field, but they do see a very pressing concern for the benchmark tests used to evaluate large language models. 

Those tests are becoming saturated -- even some of the most demanding, such as the HumanEval benchmark created in 2021 by OpenAI to test models' coding skills. That affirms a feeling seen throughout the industry these days: It's becoming harder to accurately and rigorously compare new AI models.

Also: With AI models clobbering every benchmark, it's time for human evaluation

In response, note the authors, the field has developed new ways to construct benchmark tests, such as Humanity's Last Exam, which has human-curated questions formulated by subject-matter experts; and Arena-Hard-Auto, a test created by the non-profit Large Model Systems Organization (LMSYS), using crowd-sourced prompts that are automatically curated for difficulty.

The authors note that one of the more challenging tests is the ARC-AGI test for finding visual patterns. It remains a hard test, even though OpenAI's o3 performed well on it in December.

The benchmark's difficulty is pushing AI models in a productive direction, they write: "This year's improvements [by o3] suggest a shift in focus toward more meaningful advancements in generalization and search capabilities" among AI models.

The authors note that creating benchmarks is not simple. For one, there is the problem of "contamination," where neural networks are trained on data that later turns up as test questions, like a student who has access to the answers ahead of an exam.
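The contamination problem can be made concrete with a simple overlap check: if long word sequences from a benchmark question also appear verbatim in a training document, the question may have leaked into training. This is a minimal illustrative sketch of the idea, not how the AI Index authors or any lab actually screens training data; the function names and threshold are my own.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word-level n-grams, lowercased, for verbatim-overlap checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(train_doc: str, test_question: str, n: int = 8) -> bool:
    """Flag a test question if any long n-gram also appears in the training document."""
    return bool(ngrams(test_question, n) & ngrams(train_doc, n))

corpus_doc = "The mitochondria is the powerhouse of the cell and regulates metabolism."
benchmark_q = ("Complete the sentence: the mitochondria is the powerhouse "
               "of the cell and regulates what?")
print(looks_contaminated(corpus_doc, benchmark_q, n=6))  # prints True
```

Real contamination auditing is far harder than this, since paraphrased or translated test material evades exact n-gram matching, which is part of why the authors treat benchmark construction as an open problem.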

Also: 'Humanity's Last Exam' benchmark is stumping top AI models - can you do any better?

And many benchmarks are just badly constructed, they write. "Despite widespread use, benchmarks like MMLU demonstrated poor adherence to quality standards, while others, such as GPQA, performed significantly better," according to a broad research study at Stanford called BetterBench.

Bommasani and team conclude that standardizing across benchmarks is essential going forward. "These findings underscore the need for standardized benchmarking to ensure reliable AI evaluation and to prevent misleading conclusions about model performance," they write. "Benchmarks have the potential to shape policy decisions and influence procurement decisions within organizations, highlighting the importance of consistency and rigor in evaluation."


