Benchmarks & Milestones
37 trackedTrack the latest AI achievements, model releases, and industry milestones
February 2026
Claude Opus 4.6 — 1M context, agent teams
Features a 1-million-token context window and introduces agent teams — groups of agents that can split larger tasks into segmented jobs. Outperformed GPT-5.2 on several benchmarks.
September 2025
Claude Sonnet 4.5 — autonomous agent benchmark leader
State-of-the-art on SWE-bench Verified and OSWorld (61.4%). Can work autonomously for 30+ hours (up from 7 hours with Opus 4). Launched alongside the Claude Agent SDK.
May 2025
Claude Sonnet 4 and Opus 4 launch
Claude Sonnet 4 delivers superior coding and reasoning as a hybrid model. Claude Opus 4 is declared the world's best coding model, leading SWE-bench (72.5%) and Terminal-bench (43.2%).
April 2025
o3 — OpenAI's smartest reasoning model with tool use
OpenAI releases o3, its most capable reasoning model, making 20% fewer major errors than o1. The first reasoning model that can agentically use all ChatGPT tools (web search, Python, image generation) simultaneously.
Llama 4 Scout and Maverick — MoE + 10M context
First Llama models with native multimodality and mixture-of-experts architecture. Scout supports a 10M-token context window on a single H100. Maverick (128 experts) beats GPT-4o and Gemini 2.0 Flash.
March 2025
Gemini 2.5 Pro — Google's first thinking model
Google's first 'thinking model' that reasons through steps before responding. Debuted at #1 on LMArena by a significant margin. Described as Google's most intelligent AI model.
February 2025
Claude 3.7 Sonnet — first hybrid reasoning model
The first hybrid reasoning model on the market, able to produce near-instant responses or engage in visible extended step-by-step thinking. Launched alongside Claude Code, an agentic coding CLI tool.
Grok 3 — trained on 200K-GPU Colossus supercluster
Trained with 10x the compute of Grok-2 on the 200,000-GPU Colossus supercluster. Features a 1M-token context window and large-scale reinforcement learning for reasoning.
January 2025
o3-mini — fast, affordable reasoning
OpenAI releases o3-mini, a cost-efficient reasoning model optimized for STEM. Excels in science, math, and coding, showing that reasoning capabilities can be delivered in lightweight form factors.
DeepSeek R1 — open-source reasoning model shakes markets
An open-source reasoning model rivaling OpenAI o1. Its mobile app briefly surpassed ChatGPT as #1 free app on the US App Store. The release caused an 18% drop in Nvidia's share price.
December 2024
DeepSeek V3 — trained for $6M, rivals GPT-4o
A 671B MoE model (37B active per token) trained for only ~$6 million. Competitive with GPT-4o and Claude 3.5 Sonnet on most benchmarks, demonstrating radical training cost efficiency.
Gemini 2.0 Flash — built for the agentic era
Outperforms Gemini 1.5 Pro on key benchmarks at twice the speed. First Gemini with multimodal output (native image generation + text-to-speech). Positioned for the 'agentic era.'
Llama 3.3 70B — 405B performance at 70B cost
A 70B text-only model delivering performance comparable to the much larger Llama 3.1 405B at a fraction of the serving cost, demonstrating significant efficiency gains through improved training.
o1 full release with ChatGPT Pro ($200/mo)
Full release of o1 with 34% fewer major mistakes and 50% faster than the preview, now multimodal. Launched alongside the $200/month ChatGPT Pro subscription tier featuring o1-pro mode.
November 2024
Claude 3.5 Haiku — Opus performance at Haiku speed
Claude 3.5 Haiku matches the performance of the prior Claude 3 Opus at the speed and cost of the smaller Haiku tier. Scored 40.6% on SWE-bench Verified, outperforming many agents using larger models.
October 2024
Upgraded Claude 3.5 Sonnet + computer use beta
Upgraded Claude 3.5 Sonnet with across-the-board improvements, especially in coding. Introduces computer use in public beta — the first frontier AI to control a computer (cursor, clicks, typing) via API.
September 2024
Llama 3.2 — first multimodal + edge models
Meta's first multimodal open models (11B and 90B vision LLMs) plus lightweight models (1B and 3B) for edge and mobile devices. Announced at Meta Connect 2024.
o1-preview — OpenAI's first reasoning model
OpenAI launches o1-preview, its first model trained with reinforcement learning to think step-by-step before answering. Ranked 89th percentile on Codeforces and placed among top 500 US students on AIME.
August 2024
Grok 2 — xAI enters the frontier race
xAI's first competitive frontier model, achieving performance on par with leading models on graduate-level science (GPQA), general knowledge (MMLU), and math. Includes image generation via Flux.
July 2024
Llama 3.1 405B — first open frontier model
The first openly available frontier-level model at 405 billion parameters, rivaling top proprietary models in general knowledge, math, tool use, and multilingual translation. 128K context.
GPT-4o mini — cost-efficient intelligence
OpenAI releases GPT-4o mini, a smaller and significantly cheaper variant that outperforms GPT-4 Turbo on many benchmarks while costing a fraction to run.
June 2024
Claude 3.5 Sonnet — faster than Opus, cheaper too
Claude 3.5 Sonnet outperforms Claude 3 Opus on key benchmarks at 2x the speed and 1/5 the cost. Introduces Artifacts, an interactive workspace for AI-generated content alongside conversations.
May 2024
Gemini 1.5 Flash — speed-optimized with 1M context
A speed-optimized Gemini model for high-frequency tasks where latency matters most, while still supporting 1 million tokens of multimodal context (text, images, audio, video).
GPT-4o launches — 2x faster, half the price
OpenAI releases GPT-4o ('omni'), which natively processes and generates text, audio, and images. 2x faster and 50% cheaper than GPT-4 Turbo. Made free to all ChatGPT users.
DeepSeek V2 — MoE with radical cost efficiency
A MoE model with 236B total parameters but only 21B active, pioneering efficient inference with Multi-head Latent Attention. Dramatically undercut competitor pricing.
April 2024
Llama 3 — most capable open LLM at launch
The most capable openly available LLM at launch, with significantly improved reasoning, coding, and instruction-following over Llama 2. Available in 8B and 70B parameter sizes.
March 2024
Claude 3 family — Opus, Sonnet, Haiku
Anthropic launches three models: Opus (most intelligent), Sonnet (balanced), and Haiku (fastest). Claude 3 Opus outperforms GPT-4 on most benchmarks. Anthropic's first model with vision capabilities.
February 2024
Mistral Large — European frontier alternative
Mistral's first true flagship model competing with GPT-4. Features 32K context, specialized RAG support, and best-in-class multilingual performance across French, German, Spanish, and Italian.
Gemini 1.5 Pro — 1 million token context window
Gemini 1.5 Pro achieves a breakthrough 1-million-token context window, capable of processing 1 hour of video, 11 hours of audio, or 700K+ words in a single prompt.
December 2023
Mixtral 8x7B — open MoE beats Llama 2 70B
A pioneering open-weight sparse mixture-of-experts model that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Released under the permissive Apache 2.0 license.
Google launches Gemini 1.0
Google DeepMind releases Gemini 1.0 in Ultra, Pro, and Nano sizes. The Ultra model is the first to claim human-expert performance on MMLU (90.0%). Natively multimodal from training.
November 2023
GPT-4 Turbo announced at DevDay with 128K context
OpenAI announces GPT-4 Turbo at its first DevDay conference, featuring a 128K context window, JSON mode, and significantly lower pricing — 3x cheaper input, 2x cheaper output vs GPT-4.
July 2023
Meta open-sources Llama 2
Meta releases Llama 2 in partnership with Microsoft — one of the first major open-weight foundation models for commercial use. Available in 7B, 13B, and 70B parameter sizes.
Claude 2 launches with 100K context to the public
Anthropic releases Claude 2 with a 100K token context window, allowing users to process entire books in a single conversation. The first Anthropic model available to the general public via claude.ai.
March 2023
GPT-4 released with multimodal capabilities
OpenAI launches GPT-4, a large multimodal model that accepts image and text inputs. It scores in the 90th percentile on the bar exam and demonstrates significant reasoning improvements over GPT-3.5.
Anthropic introduces Claude
Anthropic launches its first public AI assistant, positioned as helpful, honest, and harmless. Available in Claude and Claude Instant variants, initially via API only.
November 2022
ChatGPT launches and reaches 100M users
OpenAI releases ChatGPT, a conversational AI assistant based on GPT-3.5. It becomes the fastest-growing consumer app in history, reaching 100 million users in two months.