AI Model Picker
An opinionated capability matrix of the frontier LLMs I actually deploy. Filter by use case and compare context windows and per-1M-token pricing side by side. Editorial, not exhaustive.
| Model | Context | $ Input /1M | $ Output /1M | Latency | Capabilities | Cutoff | Best for |
|---|---|---|---|---|---|---|---|
| Llama 4 Scout (Meta) | 10M | $0.20 | $0.60 | Fast | Vision, Tools | 2025-03 | Cheap bulk, Long context |
| DeepSeek V3 (DeepSeek) | 128K | $0.27 | $1.10 | Fast | Vision, Tools | 2024-07 | Cheap bulk, Coding |
| Gemini 2.5 Flash (Google) | 1M | $0.30 | $1.20 | Fast | Vision, Tools | 2025-04 | Cheap bulk, Long context, Vision |
| GPT-5 mini (OpenAI) | 200K | $0.50 | $2 | Fast | Vision, Tools | 2025-10 | Cheap bulk, Vision |
| Llama 4 Maverick (Meta) | 1M | $0.50 | $1.50 | Medium | Vision, Tools | 2025-03 | Cheap bulk, Long context, Vision |
| DeepSeek R1 (DeepSeek) | 128K | $0.55 | $2.19 | Medium | Tools | 2024-12 | Reasoning, Coding, Cheap bulk |
| Claude Haiku 4.5 (Anthropic) | 200K | $1 | $5 | Fast | Vision, Tools | 2025-08 | Cheap bulk, Vision |
| Mistral Large 2 (Mistral) | 128K | $2 | $6 | Medium | Vision, Tools | 2024-07 | Coding |
| Gemini 2.5 Pro (Google) | 2M | $2.50 | $15 | Medium | Vision, Tools | 2025-04 | Long context, Vision, Reasoning |
| Claude Sonnet 4.6 (Anthropic) | 1M | $3 | $15 | Medium | Vision, Tools | 2025-08 | Coding, Long context, Vision |
| Grok 4 (xAI) | 256K | $5 | $15 | Medium | Vision, Tools | 2025-06 | Reasoning, Coding, Vision |
| GPT-5 (OpenAI) | 400K | $10 | $30 | Medium | Vision, Tools | 2025-10 | Coding, Reasoning, Vision |
| Claude Opus 4.7 (Anthropic) | 1M | $15 | $75 | Slow | Vision, Tools | 2026-01 | Coding, Long context, Reasoning, Vision |
| o3 (OpenAI) | 200K | $20 | $80 | Slow | Vision, Tools | 2024-10 | Reasoning, Coding |
- Llama 4 Scout: 10M context window; niche but unmatched for chat-over-huge-corpus prototypes.
- DeepSeek V3: General-purpose chat. Excellent value, weaker than R1 on hard reasoning.
- Gemini 2.5 Flash: Cheapest long-context option on the market. Good for RAG over large corpora.
- GPT-5 mini: Budget tier for OpenAI. Good cost/quality trade-off for classification and extraction.
- Llama 4 Maverick: Open weights, multimodal, MoE. Self-host or use via Together / Groq / Bedrock for low $/M.
- DeepSeek R1: Open-weight reasoning model. Strong math/code benchmarks at very low API cost. No vision.
- Claude Haiku 4.5: Fast, cheap, surprisingly strong. Right pick for high-throughput pipelines.
- Mistral Large 2: European hosting options matter for some regulated workloads.
- Gemini 2.5 Pro: Largest production context window (2M). Strong multimodal grounding.
- Claude Sonnet 4.6: Default workhorse for most production loads. Excellent coding scores at a fraction of Opus cost.
- Grok 4: X integration is the differentiator. Reasoning competitive with mid-tier frontier.
- GPT-5: OpenAI flagship. Reliable generalist with strong reasoning mode.
- Claude Opus 4.7: Anthropic flagship. Strong on agentic coding, long-context reasoning, and tool use. 1M context tier.
- o3: Reasoning model. Best when you can pay for many hidden thinking tokens.
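
If you want the same filter-and-compare view outside the page, the matrix can be held as plain data. The sketch below is a minimal illustration, not how this page is built: the `Model` fields and the `pick` helper are hypothetical names, only five rows are copied from the table, and context sizes assume 1K = 1,000 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int        # assumes 1K = 1,000 and 1M = 1,000,000 tokens
    input_usd_per_m: float     # list price per 1M input tokens
    output_usd_per_m: float    # list price per 1M output tokens
    latency: str
    best_for: frozenset


# A handful of rows copied from the table above; extend as needed.
MODELS = [
    Model("Llama 4 Scout", 10_000_000, 0.20, 0.60, "Fast",
          frozenset({"Cheap bulk", "Long context"})),
    Model("Gemini 2.5 Flash", 1_000_000, 0.30, 1.20, "Fast",
          frozenset({"Cheap bulk", "Long context", "Vision"})),
    Model("Claude Sonnet 4.6", 1_000_000, 3.00, 15.00, "Medium",
          frozenset({"Coding", "Long context", "Vision"})),
    Model("GPT-5", 400_000, 10.00, 30.00, "Medium",
          frozenset({"Coding", "Reasoning", "Vision"})),
    Model("o3", 200_000, 20.00, 80.00, "Slow",
          frozenset({"Reasoning", "Coding"})),
]


def pick(models, use_case, min_context=0):
    """Return models tagged with use_case that have at least min_context
    tokens of context, cheapest input price first."""
    hits = [m for m in models
            if use_case in m.best_for and m.context_tokens >= min_context]
    return sorted(hits, key=lambda m: m.input_usd_per_m)


if __name__ == "__main__":
    for m in pick(MODELS, "Long context", min_context=1_000_000):
        print(f"{m.name}: ${m.input_usd_per_m:.2f} in / "
              f"${m.output_usd_per_m:.2f} out per 1M tokens")
```

Sorting on input price rather than output price is a deliberate choice here, in line with the point that input tends to drive the bill for RAG and agent workloads.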
How I'd pick
After deploying these in production across several teams, my heuristic is simple: pick two models, not one. Use a workhorse for the default 80% of work and a cheap, fast model for high-volume, low-stakes tasks. Reserve the flagship tier (Opus 4.7, GPT-5, o3, Gemini 2.5 Pro) for the hard 5% where reasoning and reliability matter more than cost.
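
One way to picture that heuristic is a tiny router. This is a sketch under stated assumptions: the `Task` flags and the `route` function are hypothetical, the caller is assumed to know whether a task is hard or low-stakes, and the three model names are just example picks from the table (Sonnet 4.6 as workhorse, Haiku 4.5 as the cheap, fast tier, Opus 4.7 as flagship).

```python
from dataclasses import dataclass

# Example picks only; swap in whichever models fill each tier for you.
WORKHORSE = "Claude Sonnet 4.6"
CHEAP_FAST = "Claude Haiku 4.5"
FLAGSHIP = "Claude Opus 4.7"


@dataclass(frozen=True)
class Task:
    high_volume: bool = False  # runs many times, e.g. bulk classification
    low_stakes: bool = False   # an occasional wrong answer is tolerable
    hard: bool = False         # reasoning and reliability outweigh cost


def route(task: Task) -> str:
    """Map a task to a tier: flagship for the hard cases, the cheap model
    for high-volume low-stakes work, the workhorse for everything else."""
    if task.hard:
        return FLAGSHIP
    if task.high_volume and task.low_stakes:
        return CHEAP_FAST
    return WORKHORSE


if __name__ == "__main__":
    print(route(Task()))
    print(route(Task(high_volume=True, low_stakes=True)))
    print(route(Task(hard=True)))
```

The `hard` check comes first so that a difficult task is never downgraded just because it also happens to be high volume.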
What I actually look at
- Input price dominates real-world bills. RAG and agent loops read 5–20× more tokens than they write.
- Context window matters when it matters, but most workloads stay under 32K tokens. Don't overpay for headroom you won't use.
- Tool use quality ≠ tool-use checkmark. Test the actual behaviour under your prompts before committing.
- Latency tier is a UX decision. Chat needs fast first-token; batch jobs can wait.
- Knowledge cutoff only matters if you don't ground with search or RAG.
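
To put numbers on the input-price point, here is a small worked calculation. It is a sketch, not a billing tool: the function names are hypothetical, it uses the Sonnet 4.6 list prices from the table ($3 in, $15 out per 1M tokens), and it ignores caching discounts, batch pricing, and hidden reasoning tokens.

```python
def request_cost(input_tokens, output_tokens, input_usd_per_m, output_usd_per_m):
    """Return (input_cost, output_cost) in USD for one request."""
    input_cost = input_tokens / 1_000_000 * input_usd_per_m
    output_cost = output_tokens / 1_000_000 * output_usd_per_m
    return input_cost, output_cost


def input_share(read_write_ratio, input_usd_per_m, output_usd_per_m):
    """Fraction of the bill spent on input when a request reads
    read_write_ratio tokens for every token it writes."""
    input_cost, output_cost = request_cost(
        read_write_ratio * 1000, 1000, input_usd_per_m, output_usd_per_m
    )
    return input_cost / (input_cost + output_cost)


if __name__ == "__main__":
    # Sonnet 4.6 list prices from the table: $3 in, $15 out per 1M tokens.
    for ratio in (5, 10, 20):
        share = input_share(ratio, 3.0, 15.0)
        print(f"{ratio}x read/write: input is {share:.0%} of the bill")
```

With these prices, input is about 50% of the bill at a 5× read/write ratio, 67% at 10×, and 80% at 20×. In other words, input starts to dominate once the read/write ratio exceeds the output-to-input price ratio (5× for this model), so it is worth running the same arithmetic with your own ratio and prices.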
Caveats
Pricing and model availability change often. This page is hand-curated and refreshed quarterly — verify on the vendor's page before committing to anything contractual. Open-weight models (Llama, DeepSeek) are priced at common hosting providers; self-hosting changes the cost equation entirely.
