As of 2026-05-15
Most LLM comparisons you read online either repeat published benchmarks (which leak into training data) or vibe-rank models with no method. Neither is what a working reader needs. This page is the public methodology for every dated head-to-head on this site.
What we test
Each comparison is task-shaped: a small, named scenario with a fixed input distribution. Examples:
- Code-review battery (12 prompts). Real diffs from open-source repos, each with one planted bug. We score how many bugs each model finds and how many spurious comments it adds.
- Summarization battery (10 prompts). News articles + technical write-ups + meeting transcripts. Rubric: faithfulness, key-point coverage, length adherence, hallucination count.
- Long-context retrieval (8 prompts). Needle-in-a-haystack at 16k, 64k, 128k, and 200k tokens. We measure successful retrievals versus fabricated answers.
- Hard-reasoning battery (15 prompts). Problems with traceable answers (logic puzzles, arithmetic chains, planning). We measure correctness, not chain-of-thought style.
Each battery lives in source control; the prompts are reproducible.
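To make "task-shaped" concrete, here is a minimal sketch of how a battery could be declared in the repo. The class names, fields, and the example entry are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """One fixed input in a battery."""
    prompt_id: str         # stable id so scores can be traced back to raw outputs
    text: str              # the exact text sent to every model
    expected: str | None   # ground truth where one exists; None for open-ended tasks


@dataclass(frozen=True)
class Battery:
    """A small, named scenario with a fixed input distribution."""
    name: str
    prompts: tuple[Prompt, ...]  # immutable: the battery is pinned in source control


# Hypothetical entry; the real batteries are larger and live in their own files.
CODE_REVIEW = Battery(
    name="code-review",
    prompts=(
        Prompt(
            prompt_id="cr-001",
            text="Review this diff and list any bugs you find:\n<diff omitted>",
            expected="off-by-one in loop bound",
        ),
    ),
)
```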
Fixed parameters
For every model in every test:
- Temperature: 0 unless the task explicitly tests sampling behavior.
- Top-p: 1.
- Random seed: fixed where the API exposes it (most local runtimes do; some hosted APIs do).
- Max tokens: identical for the task; disclosed.
- System prompt: identical for the task; disclosed.
- Few-shot examples: identical or none; disclosed.
- Model version: pinned to a specific snapshot or build date. No "auto-route" endpoints.
If any of those vary, the test is invalid.
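As a rough illustration of how those constraints hold in practice, here is a sketch of a run helper. The `call_model` argument is a placeholder for whatever client actually talks to the API, and the parameter names are assumptions; real SDKs name these differently.

```python
# Fixed run configuration: identical for every model within a task.
# Field names are illustrative, not tied to any specific vendor SDK.
RUN_CONFIG = {
    "temperature": 0,    # unless the task explicitly tests sampling behavior
    "top_p": 1,
    "seed": 1234,        # applied only where the API exposes a seed
    "max_tokens": 1024,  # identical for the task and disclosed in the post
}

SYSTEM_PROMPT = "You are reviewing a code diff. List bugs only."  # disclosed per task


def run_prompt(call_model, model_snapshot: str, prompt_text: str) -> str:
    """Send one prompt with a pinned snapshot and the fixed parameters.

    `model_snapshot` must be a pinned version string, never an auto-routing alias.
    """
    if "latest" in model_snapshot or model_snapshot.endswith("-auto"):
        raise ValueError("refusing to run against an unpinned or auto-routed endpoint")
    return call_model(
        model=model_snapshot,
        system=SYSTEM_PROMPT,
        prompt=prompt_text,
        **RUN_CONFIG,
    )
```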
Rubric for open-ended tasks
For writing and analysis tasks where there is no right answer, we score against a four-part rubric:
- Faithfulness. Does the output actually match the source / prompt? Count hallucinations, not vibes.
- Structure. Does the output follow the requested shape? (Length, format, label discipline.)
- Voice. Does it read like what was asked for, or does it read like "AI"?
- Hard errors. Count of objectively wrong factual claims.
Each output gets a score on each rubric item. Raw outputs are linked. You are free to disagree with our scoring — the data is public.
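A minimal sketch of how a single open-ended output could be recorded against this rubric. The field names and the 0-3 scale are assumptions for illustration, not our published schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Score for one model output on one open-ended prompt."""
    prompt_id: str
    model_snapshot: str
    faithfulness: int    # assumed 0-3: does the output match the source / prompt?
    structure: int       # assumed 0-3: length, format, label discipline
    voice: int           # assumed 0-3: reads like what was asked for, or like "AI"?
    hard_errors: int     # raw count of objectively wrong factual claims
    raw_output_url: str  # link to the unedited output so readers can re-score it


# Hypothetical record; real scores are published alongside each post.
example = RubricScore(
    prompt_id="sum-004",
    model_snapshot="example-model-2026-04-01",
    faithfulness=2,
    structure=3,
    voice=1,
    hard_errors=2,
    raw_output_url="https://example.com/raw/sum-004/example-model",
)
```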
What we report
Every comparison post includes:
- Models tested, with version pins.
- Date of the run.
- Task batteries and prompts, linked.
- Per-task scores, with the rubric where applicable.
- A short editorial verdict, marked as opinion.
- Caveats — what we did not test, what surprised us, where we doubt our own scoring.
If the verdict and the numbers disagree, we say so.
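For a sense of what that disclosure looks like in one place, here is a hypothetical front-matter block for a comparison post; the keys and values are illustrative, not a published format.

```python
# Illustrative metadata for one comparison post; keys are assumptions.
POST_METADATA = {
    "as_of": "2026-05-15",
    "models": [
        {"name": "example-model-a", "snapshot": "2026-04-01"},
        {"name": "example-model-b", "snapshot": "2026-03-20"},
    ],
    "batteries": ["code-review", "summarization", "long-context", "hard-reasoning"],
    "scores_url": "https://example.com/runs/2026-05-15/scores",  # per-task tables, linked prompts
    "verdict": "opinion",  # the editorial verdict is marked as opinion, separate from the numbers
    "caveats": ["what we did not test", "where we doubt our own scoring"],
}
```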
What we do not do
- No leaderboards across all tasks. Aggregate scores hide what matters. Each task has its own table.
- No "best model" headlines. "Best for what" is the question; the headline reflects the task.
- No undated snapshots. Every benchmark article carries an "as of" line. Old snapshots stay live; they are not silently updated.
- No paid placement. Nothing in our methodology gives any model a structural advantage. If a vendor sent us hardware, we would say so.
Why this approach beats vibes
Vibe-ranking models is fun and worthless. Standardized benchmarks are rigorous and misleading. The middle path — small disclosed task batteries with fixed parameters and a visible rubric — is what actual engineers do when they are picking a model for a project. We just write it down.