As of 2026-05-15
Most LLM comparisons you read online either repeat published benchmarks (which leak into training data) or vibe-rank models with no method. Neither is what a working reader needs. This page is the public methodology for every dated head-to-head on this site.
What we test
Each comparison is task-shaped: a small, named scenario with a fixed input distribution. Examples:
- Code-review battery (12 prompts). Real diffs from open-source repos, each with one planted bug. We score how many bugs each model finds and how many spurious comments it adds.
- Summarization battery (10 prompts). News articles + technical write-ups + meeting transcripts. Rubric: faithfulness, key-point coverage, length adherence, hallucination count.
- Long-context retrieval (8 prompts). Needle-in-a-haystack at 16k, 64k, 128k, and 200k tokens. We measure successful retrievals versus fabricated answers.
- Hard-reasoning battery (15 prompts). Problems with traceable answers (logic puzzles, arithmetic chains, planning). We measure correctness, not chain-of-thought style.
Each battery lives in source control; the prompts are reproducible.
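To make "task-shaped" concrete, here is a minimal sketch of how a battery could be declared in the repo. The class names, fields, and the example entry are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """One fixed input in a battery."""
    prompt_id: str         # stable id so scores can be traced back to raw outputs
    text: str              # the exact text sent to every model
    expected: str | None   # ground truth where one exists; None for open-ended tasks


@dataclass(frozen=True)
class Battery:
    """A small, named scenario with a fixed input distribution."""
    name: str
    prompts: tuple[Prompt, ...]  # immutable: the battery is pinned in source control


# Hypothetical entry; the real batteries are larger and live in their own files.
CODE_REVIEW = Battery(
    name="code-review",
    prompts=(
        Prompt(
            prompt_id="cr-001",
            text="Review this diff and list any bugs you find:\n<diff omitted>",
            expected="off-by-one in loop bound",
        ),
    ),
)
```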
Fixed parameters
For every model in every test:
- Temperature: 0 unless the task explicitly tests sampling behavior.
- Top-p: 1.
- Random seed: fixed where the API exposes it (most local runtimes do; some hosted APIs do).
- Max tokens: identical for the task; disclosed.
- System prompt: identical for the task; disclosed.
- Few-shot examples: identical or none; disclosed.
- Model version: pinned to a specific snapshot or build date. No "auto-route" endpoints.
If any of those vary, the test is invalid.
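As a rough illustration of how those constraints hold in practice, here is a sketch of a run helper. The `call_model` argument is a placeholder for whatever client actually talks to the API, and the parameter names are assumptions; real SDKs name these differently.

```python
# Fixed run configuration: identical for every model within a task.
# Field names are illustrative, not tied to any specific vendor SDK.
RUN_CONFIG = {
    "temperature": 0,    # unless the task explicitly tests sampling behavior
    "top_p": 1,
    "seed": 1234,        # applied only where the API exposes a seed
    "max_tokens": 1024,  # identical for the task and disclosed in the post
}

SYSTEM_PROMPT = "You are reviewing a code diff. List bugs only."  # disclosed per task


def run_prompt(call_model, model_snapshot: str, prompt_text: str) -> str:
    """Send one prompt with a pinned snapshot and the fixed parameters.

    `model_snapshot` must be a pinned version string, never an auto-routing alias.
    """
    if "latest" in model_snapshot or model_snapshot.endswith("-auto"):
        raise ValueError("refusing to run against an unpinned or auto-routed endpoint")
    return call_model(
        model=model_snapshot,
        system=SYSTEM_PROMPT,
        prompt=prompt_text,
        **RUN_CONFIG,
    )
```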
Rubric for open-ended tasks
For writing and analysis tasks where there is no right answer, we score against a four-part rubric:
- Faithfulness. Does the output actually match the source / prompt? Count hallucinations, not vibes.
- Structure. Does the output follow the requested shape? (Length, format, label discipline.)
- Voice. Does it read like what was asked for, or does it read like "AI"?
- Hard errors. Count of objectively wrong factual claims.
Each output gets a score on each rubric item. Raw outputs are linked. You are free to disagree with our scoring — the data is public.
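A minimal sketch of how a single open-ended output could be recorded against this rubric. The field names and the 0-3 scale are assumptions for illustration, not our published schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Score for one model output on one open-ended prompt."""
    prompt_id: str
    model_snapshot: str
    faithfulness: int    # assumed 0-3: does the output match the source / prompt?
    structure: int       # assumed 0-3: length, format, label discipline
    voice: int           # assumed 0-3: reads like what was asked for, or like "AI"?
    hard_errors: int     # raw count of objectively wrong factual claims
    raw_output_url: str  # link to the unedited output so readers can re-score it


# Hypothetical record; real scores are published alongside each post.
example = RubricScore(
    prompt_id="sum-004",
    model_snapshot="example-model-2026-04-01",
    faithfulness=2,
    structure=3,
    voice=1,
    hard_errors=2,
    raw_output_url="https://example.com/raw/sum-004/example-model",
)
```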
What we report
Every comparison post includes:
- Models tested, with version pins.
- Date of the run.
- Task batteries and prompts, linked.
- Per-task scores, with the rubric where applicable.
- A short editorial verdict, marked as opinion.
- Caveats — what we did not test, what surprised us, where we doubt our own scoring.
If the verdict and the numbers disagree, we say so.
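For a sense of what that disclosure looks like in one place, here is a hypothetical front-matter block for a comparison post; the keys and values are illustrative, not a published format.

```python
# Illustrative metadata for one comparison post; keys are assumptions.
POST_METADATA = {
    "as_of": "2026-05-15",
    "models": [
        {"name": "example-model-a", "snapshot": "2026-04-01"},
        {"name": "example-model-b", "snapshot": "2026-03-20"},
    ],
    "batteries": ["code-review", "summarization", "long-context", "hard-reasoning"],
    "scores_url": "https://example.com/runs/2026-05-15/scores",  # per-task tables, linked prompts
    "verdict": "opinion",  # the editorial verdict is marked as opinion, separate from the numbers
    "caveats": ["what we did not test", "where we doubt our own scoring"],
}
```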
What we do not do
- No leaderboards across all tasks. Aggregate scores hide what matters. Each task has its own table.
- No "best model" headlines. "Best for what" is the question; the headline reflects the task.
- No undated snapshots. Every benchmark article carries an "as of" line. Old snapshots stay live; they are not silently updated.
- No paid placement. Nothing in our methodology gives any model a structural advantage. If a vendor sent us hardware, we would say so.
Why this approach beats vibes
Vibe-ranking models is fun and worthless. Standardized benchmarks are rigorous and misleading. The middle path — small disclosed task batteries with fixed parameters and a visible rubric — is what actual engineers do when they are picking a model for a project. We just write it down.