Developer Tools

AI Model Picker

An opinionated capability matrix of the frontier LLMs I actually deploy with. Filter by use case, compare context windows and per-token pricing side by side. Editorial, not exhaustive.

EditorialLast updated 2026-05-31
Use case
Vendors
14 of 14 models
Llama 4 Scout
Meta · Llama 4
Fast
Context
10M
In /1M
$0.20
Out /1M
$0.60
VisionTools 2025-03
Cheap bulkLong context

10M context window — niche but unmatched for chat-over-huge-corpus prototypes.

DeepSeek V3
DeepSeek · DeepSeek
Fast
Context
128K
In /1M
$0.27
Out /1M
$1.10
VisionTools 2024-07
Cheap bulkCoding

General-purpose chat. Excellent value, weaker than R1 on hard reasoning.

Gemini 2.5 Flash
Google · Gemini 2.5
Fast
Context
1M
In /1M
$0.30
Out /1M
$1.20
VisionTools 2025-04
Cheap bulkLong contextVision

Cheapest long-context option on the market. Good for RAG over large corpora.

GPT-5 mini
OpenAI · GPT-5
Fast
Context
200K
In /1M
$0.50
Out /1M
$2
VisionTools 2025-10
Cheap bulkVision

Budget tier for OpenAI. Good cost/quality trade-off for classification and extraction.

Llama 4 Maverick
Meta · Llama 4
Medium
Context
1M
In /1M
$0.50
Out /1M
$1.50
VisionTools 2025-03
Cheap bulkLong contextVision

Open weights, multimodal, MoE. Self-host or use via Together / Groq / Bedrock for low $/M.

DeepSeek R1
DeepSeek · DeepSeek
Medium
Context
128K
In /1M
$0.55
Out /1M
$2.19
VisionTools 2024-12
ReasoningCodingCheap bulk

Open-weight reasoning model. Strong math/code benchmarks at very low API cost. No vision.

Claude Haiku 4.5
Anthropic · Claude 4
Fast
Context
200K
In /1M
$1
Out /1M
$5
VisionTools 2025-08
Cheap bulkVision

Fast, cheap, surprisingly strong. Right pick for high-throughput pipelines.

Mistral Large 2
Mistral · Mistral
Medium
Context
128K
In /1M
$2
Out /1M
$6
VisionTools 2024-07
Coding

European hosting options matter for some regulated workloads.

Gemini 2.5 Pro
Google · Gemini 2.5
Medium
Context
2M
In /1M
$2.50
Out /1M
$15
VisionTools 2025-04
Long contextVisionReasoning

Largest production context window (2M). Strong multimodal grounding.

Claude Sonnet 4.6
Anthropic · Claude 4
Medium
Context
1M
In /1M
$3
Out /1M
$15
VisionTools 2025-08
CodingLong contextVision

Default workhorse for most production loads. Excellent coding scores at a fraction of Opus cost.

Grok 4
xAI · Grok
Medium
Context
256K
In /1M
$5
Out /1M
$15
VisionTools 2025-06
ReasoningCodingVision

X integration is the differentiator. Reasoning competitive with mid-tier frontier.

GPT-5
OpenAI · GPT-5
Medium
Context
400K
In /1M
$10
Out /1M
$30
VisionTools 2025-10
CodingReasoningVision

OpenAI flagship. Reliable generalist with strong reasoning mode.

Claude Opus 4.7
Anthropic · Claude 4
Slow
Context
1M
In /1M
$15
Out /1M
$75
VisionTools 2026-01
CodingLong contextReasoningVision

Anthropic flagship. Strong on agentic coding, long-context reasoning, and tool use. 1M context tier.

o3
OpenAI · o-series
Slow
Context
200K
In /1M
$20
Out /1M
$80
VisionTools 2024-10
ReasoningCoding

Reasoning model. Best when you can pay for many hidden thinking tokens.

How I'd pick

After deploying these in production across several teams, my heuristic is simple: pick two models, not one. A workhorse for the default 80 % of work, and a cheap-fast model for high-volume, low-stakes tasks. Reserve the flagship tier (Opus 4.7, GPT-5, o3, Gemini 2.5 Pro) for the hard 5 % where reasoning and reliability matter more than cost.

What I actually look at

  • Input price dominates real-world bills. RAG and agent loops read 5–20× more tokens than they write.
  • Context window matters when it matters — but most workloads stay under 32 K. Don't overpay for headroom you won't use.
  • Tool use quality ≠ tool-use checkmark. Test the actual behaviour under your prompts before committing.
  • Latency tier is a UX decision. Chat needs fast first-token; batch jobs can wait.
  • Knowledge cutoff only matters if you don't ground with search or RAG.

Caveats

Pricing and model availability change often. This page is hand-curated and refreshed quarterly — verify on the vendor's page before committing to anything contractual. Open-weight models (Llama, DeepSeek) are priced at common hosting providers; self-hosting changes the cost equation entirely.

Frequently Asked Questions

Which LLM should I pick for production?
Pick two: a workhorse (Sonnet 4.6 / GPT-5 / Gemini 2.5 Pro) and a cheap fast model (Haiku 4.5 / GPT-5 mini / Gemini Flash). Reserve flagship Opus / o3 / Gemini Pro for the hard 5 %.
How accurate is the pricing here?
Hand-curated, refreshed quarterly. Always cross-check with the vendor pricing page before signing anything.
Why is input price the most important number?
Most production workloads (RAG, agents, code assistants) read 5–20× more tokens than they write. Input price usually dominates the bill.
Are open-weight models included?
Yes — Llama 4 and DeepSeek, priced at common API hosts (Together, Fireworks, Bedrock). Self-hosting changes the maths; use a dedicated GPU-cost comparator for that.