What this is
ipbr-rank is a public-LLM coding-role scoreboard. It pulls model results from public benchmarks, normalizes them onto a common 0-100 scale, and produces four role scores: Idea, Plan, Build, Review.
All inputs come from public, verifiable sources. Weights and aggregation rules are explicit and versioned. A small number of vendor-published metrics that haven't yet appeared on public leaderboards are recorded as overrides. There is no manual reranking.
The four roles
- Idea — open-ended creativity, general intelligence, breadth. Driven by LM Arena Text, AI Stupid Level idea-shaped axes, and reasoning blends.
- Plan — structured reasoning, function-calling, multi-step task decomposition. Driven by Terminal-Bench, tau2-bench, IFBench, MCP-Atlas, and AISL plan axes.
- Build — actually writing code that runs. Driven by SWE-bench (Verified + Multilingual + Pro), SWE-rebench, LiveCodeBench, Sonar code quality, and AISL build axes.
- Review — judging code quality, correctness, and preference. Driven by LM Arena, Sonar issue density, and AISL review axes. Review has no adjusted variant — it is the source of the penalty applied to the other three.
Raw vs adjusted
The raw score is the benchmark composite, normalized. The adjusted score subtracts a reviewer-reservation penalty: when one vendor's models dominate Review, the size of that lead is discounted from that vendor's other role scores.
L_v = max(0, max(R_all) - max(R_outside_v))
I_adj = I_raw - 0.08 × L_v
P_adj = P_raw - 0.18 × L_v
B_adj = B_raw - 0.32 × L_v
Coefficients reflect how easy each role is to game with biased preference evaluations. Build is hardest hit; Idea is barely touched; Plan sits in between.
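The penalty above can be sketched directly from the formulas. This is a minimal illustration, not the project's code: the function names are made up, and matching models to a vendor by an id prefix like `acme/` is an assumption for the example.

```python
# Illustrative sketch of the reviewer-reservation penalty.
# L_v = max(0, best Review score overall - best Review score outside vendor v).

def review_lead(review_scores: dict[str, float], vendor: str) -> float:
    """Compute vendor v's Review lead over the best non-v model."""
    all_best = max(review_scores.values())
    outside = [s for model, s in review_scores.items()
               if not model.startswith(vendor + "/")]  # prefix match is an assumption
    outside_best = max(outside) if outside else 0.0
    return max(0.0, all_best - outside_best)

def adjust(raw: dict[str, float], lead: float) -> dict[str, float]:
    """Apply the role-specific discounts from the text: Idea 0.08, Plan 0.18, Build 0.32."""
    coeff = {"idea": 0.08, "plan": 0.18, "build": 0.32}
    return {role: raw[role] - coeff[role] * lead for role in coeff}

scores = {"acme/m1": 92.0, "acme/m2": 88.0, "other/m3": 85.0}
lead = review_lead(scores, "acme")  # 92 - 85 = 7
adjusted = adjust({"idea": 80.0, "plan": 78.0, "build": 83.0}, lead)
# idea ≈ 79.44, plan ≈ 76.74, build ≈ 80.76
```

Note that a vendor with no lead (another vendor holds the top Review score) gets `L_v = 0`, so raw and adjusted coincide.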
How scores are built
- Normalize — each metric is percentile-mapped within the active model population (5th/95th boundaries; log-scaled for cost/speed/latency). Operational metrics use a tail-penalty curve — the top 80% maps into 70-100 with mild differentiation; only the bottom 20% drops sharply.
- Aggregate — metrics roll up into groups (CRE, GEN, PLAN, BUILD, LM_ARENA_REVIEW_PROXY, OPS_*, A_I/A_P/A_B/A_R). If at least 70% of a group's weight is present, the score is the present-weight mean. Below that threshold, it shrinks toward 50 proportional to missing weight.
- Combine — each role score is a weighted average of groups. AISL's role-shaped perspective (A_*) carries a weight of 0.35 in every role formula. Operational metrics carry 0.08 — fast-enough models cluster within a 1-2 point spread, but genuinely slow models lose 4-6 points.
- Canary health — AISL canary drift is a penalty-only signal. Healthy or missing canary data adds nothing; degraded canary data can subtract up to 6 points from raw role scores.
- Synthesize last — when a known sibling pair has a metric on one model but not the other, the missing field is filled from the sibling and pulled 15% toward 50 so it reads as a weaker signal.
Sources
| source | status | rows | matched | unmatched |
|---|---|---|---|---|
| aistupidlevel | verified | 24 | 39 | 1 |
| arc_agi | verified | 149 | 50 | 112 |
| artificial_analysis | verified | 500 | 92 | 424 |
| livecodebench | verified | 28 | 4 | 24 |
| lmarena | verified | 387 | 90 | 313 |
| mcp_atlas | verified | 19 | 26 | 7 |
| openrouter | verified | 367 | 64 | 319 |
| overrides | verified | 23 | 37 | 0 |
| sonar | verified | 58 | 37 | 35 |
| swebench | verified | 194 | 33 | 169 |
| swebench_pro | verified | 24 | 22 | 14 |
| swerebench | verified | 91 | 30 | 73 |
| terminal_bench | verified | 124 | 85 | 53 |
Glossary
- Synthesized — a metric value filled from a known sibling model when the source did not directly cover this model.
- Trust threshold — the 70% group-coverage cutoff above which the present-weight mean is trusted directly.
- Composite — a metric that is itself a weighted blend of related sub-metrics (currently SWEComposite).
- A_* perspective — AISL's 17 capability axes weighted four ways (one weighting per role). Canary health is separate and penalty-only.
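The "Synthesized" entry above implies a simple fill rule: copy the sibling's value and pull it 15% toward 50. A one-line sketch, assuming the softening is linear (the function name is illustrative):

```python
def synthesize_from_sibling(sibling_value: float) -> float:
    """Fill a missing metric from a known sibling model,
    softened 15% toward the neutral score of 50."""
    return 50.0 + 0.85 * (sibling_value - 50.0)

# A sibling score of 80 fills in as 75.5; a sibling at exactly 50 is unchanged.
```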