$ipbr-rank · live llm coding-role score
refreshed · 14 sources · updated frequently — models drift and degrade
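[ idea ]
1 · gemini-3.1-pro-preview · raw 93.7 · adjusted 93.7
2 · claude-opus-4.6 · raw 92.9 · adjusted 92.9
3 · claude-opus-4.7 · raw 89.8 · adjusted 89.4
[ plan ]
1 · gemini-3.1-pro-preview · raw 87.4 · adjusted 87.4
2 · gpt-5.5 · raw 84.0 · adjusted 84.0
3 · claude-opus-4.7 · raw 77.9 · adjusted 77.0
[ build ]
1 · claude-opus-4.6 · raw 84.7 · adjusted 84.7
2 · gpt-5.5 · raw 81.1 · adjusted 81.1
3 · claude-opus-4.7 · raw 80.2 · adjusted 78.6
[ review ]
1 · gemini-3.1-pro-preview · raw 85.9 · adjusted 85.9
2 · kimi-k2.6 · raw 84.1 · adjusted 84.1
3 · claude-opus-4.7 · raw 82.5 · adjusted 82.5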
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the direct LM Arena search/document review proxy, that proxy lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted.
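
A minimal Python sketch of the adjustment, assuming the penalty is simply the role coefficient times the model's proxy lead in points. The function name `adjusted_score`, the `proxy_lead` parameter, and the linear form are illustrative assumptions, not the site's published math. The lead of 5.0 used below is also an assumed value, chosen because it reproduces the claude-opus-4.7 scores listed on this page.

```python
# Penalty coefficients per role, as stated above. Review is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def adjusted_score(raw: float, role: str, proxy_lead: float) -> float:
    """Subtract a role-weighted share of the review-proxy lead from a raw score.

    ASSUMPTION: the penalty is linear in proxy_lead, and proxy_lead is zero
    (or negative) for a model that does not lead the review proxy.
    """
    penalty = PENALTY[role] * max(proxy_lead, 0.0)
    return round(raw - penalty, 1)


# Raw role scores for claude-opus-4.7 as listed on this page.
raw_scores = {"idea": 89.8, "plan": 77.9, "build": 80.2, "review": 82.5}

# proxy_lead=5.0 is a hypothetical input, not a published figure.
adjusted = {
    role: adjusted_score(raw, role, proxy_lead=5.0)
    for role, raw in raw_scores.items()
}
print(adjusted)
```

With a lead of zero the function returns the raw score unchanged, which matches the rows on this page where raw and adjusted are identical.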

missing data

If a model is missing some metrics within a group, the group score depends on how much of the group is covered. Between 60% and 80% group coverage, the score blends from a shrink-to-50 estimate toward the present-weight mean of the metrics that are present. At 80% coverage and above, the present-weight mean is trusted directly.
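
One way to picture the coverage rule is the Python sketch below. It is a sketch under stated assumptions only: the linear ramp between 60% and 80%, the specific shrink-to-50 formula, measuring coverage by weight share, and the names `group_score`, `values`, and `weights` are all hypothetical and are not taken from the site's full math.

```python
def group_score(values: dict, weights: dict) -> float:
    """Blend a group's score according to how much of the group is present.

    values  -- metric name -> 0-100 score, only for metrics that are present
    weights -- metric name -> weight, for every metric in the group

    ASSUMPTIONS: coverage is the weight share of present metrics, the
    shrink-to-50 estimate pulls the mean toward 50 in proportion to the
    missing share, and the blend ramps linearly from 60% to 80% coverage.
    """
    present = [name for name in weights if name in values]
    present_weight = sum(weights[name] for name in present)
    if present_weight == 0:
        return 50.0  # nothing observed: fall back to the neutral midpoint

    coverage = present_weight / sum(weights.values())
    mean = sum(values[name] * weights[name] for name in present) / present_weight

    # Shrink-to-50 estimate (assumed form).
    shrunk = 50.0 + coverage * (mean - 50.0)

    # 0.0 at or below 60% coverage, 1.0 at or above 80%.
    trust = min(1.0, max(0.0, (coverage - 0.6) / 0.2))
    return trust * mean + (1.0 - trust) * shrunk


# Ten equally weighted hypothetical metrics, seven of them present at 90.0.
weights = {f"m{i}": 1.0 for i in range(10)}
values = {f"m{i}": 90.0 for i in range(7)}
print(round(group_score(values, weights), 1))
```

At 70% coverage the sketch lands halfway between the shrunk estimate and the present-weight mean. At 80% or more it returns the mean itself.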

Full math, role definitions, and source list →

claude-opus-4.6 · anthropic · idea 92.9 raw, 92.9 adjusted · plan 76.2 raw, 76.2 adjusted · build 84.7 raw, 84.7 adjusted · review 74.6

group breakdown

A_B 84.6 (4/24) · A_I 86.6 (5/24) · A_P 65.4 (9/24) · A_R 83.0 (7/24) · BUILD 86.8 (2/24) · CRE 100.0 (1/24) · GEN 90.0 (4/24) · LM_ARENA_REVIEW_PROXY 33.6 (12/24) · OPS_long 78.8 (17/24) · OPS_precision 77.4 (14/24) · OPS_review 79.6 (14/24) · PLAN 73.1 (7/24)

metrics

AI_canary_health 84.2 (5/7) · AI_code 100.0 (1/22) · AI_complexity 88.9 (3/22) · AI_context_awareness 8.4 (8/24) · AI_correctness 100.0 (1/22) · AI_edge_cases 100.0 (1/22) · AI_efficiency 71.1 (3/22) · AI_hallucination_resistance 0.0 (22/24) · AI_memory_retention 0.0 (13/24) · AI_parameter_accuracy 100.0 (2/24) · AI_plan_coherence 0.0 (23/24) · AI_recovery 100.0 (1/22) · AI_refusal 100.0 (1/22) · AI_spec 100.0 (1/22) · AI_stability 100.0 (1/22) · AI_task_completion 96.7 (3/24) · AI_tool_selection 99.9 (2/24) · ARC_AGI_2 90.9 (4/22) · ArtificialAnalysisCoding 76.7 (5/23) · ArtificialAnalysisIntelligence 85.3 (6/23) · ArtificialAnalysisReasoning 86.3 (5/23) · BlendedCost 61.9 (21/24) · ContextWindow 99.3 (7/24) · CopilotArenaOrLMArenaCode 99.8 (2/24) · GDPval 82.2 (7/24) · GPQA_HLE_Reasoning 86.3 (5/23) · GSO 75.3 (3/16) · IFBench 30.4 (18/23) · LMArenaCreativeOrOpenEnded 100.0 (1/24) · LMArenaSearchDocument 33.6 (10/22) · LMArenaText 100.0 (1/24) · LongContextRecall 90.2 (5/23) · OutputSpeed 76.9 (18/23) · SWEBenchMultilingual 90.9 (8/19) · SWEBenchPro 100.0 (1/21) · SWEBenchVerified 99.7 (2/23) · SWEComposite 95.7 (1/24) · SWERebench 91.6 (4/23) · SciCode 85.8 (5/23) · SonarBugDensity 59.5 (12/22) · SonarComposite 70.5 (7/24) · SonarFunctionalSkill 92.2 (4/22) · SonarIssueDensity 46.8 (9/22) · SonarVulnerabilityDensity 66.6 (11/22) · TTFT 77.4 (14/23) · Tau2Bench 87.6 (7/23) · TerminalBench 64.2 (7/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
gpt-5.5 · openai · idea 83.4 raw, 83.4 adjusted · plan 84.0 raw, 84.0 adjusted · build 81.1 raw, 81.1 adjusted · review 79.4

group breakdown

A_B 62.7 (13/24) · A_I 75.1 (16/24) · A_P 62.5 (12/24) · A_R 84.8 (6/24) · BUILD 87.4 (1/24) · CRE 82.6 (7/24) · GEN 94.4 (3/24) · LM_ARENA_REVIEW_PROXY 27.5 (13/24) · OPS_long 82.1 (10/24) · OPS_precision 79.5 (13/24) · OPS_review 81.0 (13/24) · PLAN 90.6 (2/24)

metrics

AI_code 22.4 (16/22) · AI_complexity 28.3 (19/22) · AI_context_awareness 0.0 (20/24) · AI_correctness 94.1 (14/22) · AI_edge_cases 86.5 (16/22) · AI_efficiency 42.2 (17/22) · AI_hallucination_resistance 80.0 (11/24) · AI_memory_retention 0.0 (24/24) · AI_parameter_accuracy 94.1 (6/24) · AI_plan_coherence 8.3 (20/24) · AI_recovery 98.7 (12/22) · AI_refusal 100.0 (13/22) · AI_spec 100.0 (13/22) · AI_stability 89.9 (12/22) · AI_task_completion 92.6 (10/24) · AI_tool_selection 100.0 (1/24) · ARC_AGI_2 96.7 (2/22) · ArtificialAnalysisCoding 100.0 (2/23) · ArtificialAnalysisIntelligence 98.2 (3/23) · ArtificialAnalysisReasoning 100.0 (2/23) · BlendedCost 50.6 (23/24) · ContextWindow 100.0 (2/24) · CopilotArenaOrLMArenaCode 71.7 (10/24) · GDPval 95.0 (2/24) · GPQA_HLE_Reasoning 100.0 (2/23) · GSO 94.0 (2/16) · IFBench 78.1 (7/23) · LMArenaCreativeOrOpenEnded 82.6 (7/24) · LMArenaSearchDocument 27.5 (11/22) · LMArenaText 82.6 (7/24) · LongContextRecall 98.0 (3/23) · MCPAtlas 72.8 (7/16) · OutputSpeed 81.7 (12/23) · SWEBenchPro 95.0 (6/21) · SWEBenchVerified 95.0 (8/23) · SWEComposite 89.9 (5/24) · SWERebench 83.5 (8/23) · SciCode 94.5 (4/23) · SonarBugDensity 94.5 (2/22) · SonarComposite 65.5 (8/24) · SonarFunctionalSkill 46.5 (18/22) · SonarIssueDensity 52.7 (7/22) · SonarVulnerabilityDensity 99.2 (2/22) · TTFT 85.4 (8/23) · Tau2Bench 86.9 (8/23) · TerminalBench 100.0 (1/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, openrouter, overrides, sonar, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.7 · anthropic · idea 89.8 raw, 89.4 adjusted · plan 77.9 raw, 77.0 adjusted · build 80.2 raw, 78.6 adjusted · review 82.5

group breakdown

A_B 69.4 (7/24) · A_I 82.2 (8/24) · A_P 59.7 (15/24) · A_R 76.5 (16/24) · BUILD 86.5 (3/24) · CRE 93.6 (4/24) · GEN 96.5 (2/24) · LM_ARENA_REVIEW_PROXY 100.0 (1/24) · OPS_long 72.2 (19/24) · OPS_precision 65.0 (20/24) · OPS_review 70.9 (20/24) · PLAN 78.8 (4/24)

metrics

AI_code 51.0 (6/22) · AI_complexity 50.2 (8/22) · AI_context_awareness 8.7 (7/24) · AI_correctness 100.0 (2/22) · AI_edge_cases 100.0 (2/22) · AI_efficiency 68.3 (4/22) · AI_hallucination_resistance 0.0 (23/24) · AI_memory_retention 23.8 (9/24) · AI_parameter_accuracy 37.3 (16/24) · AI_plan_coherence 22.2 (10/24) · AI_recovery 100.0 (2/22) · AI_refusal 100.0 (2/22) · AI_spec 100.0 (2/22) · AI_stability 84.1 (16/22) · AI_task_completion 84.2 (14/24) · AI_tool_selection 0.0 (20/24) · ARC_AGI_2 92.7 (3/22) · ArtificialAnalysisCoding 90.6 (3/23) · ArtificialAnalysisIntelligence 100.0 (1/23) · ArtificialAnalysisReasoning 95.6 (3/23) · BlendedCost 61.9 (22/24) · ContextWindow 99.3 (8/24) · CopilotArenaOrLMArenaCode 100.0 (1/24) · GDPval 95.0 (1/24) · GPQA_HLE_Reasoning 95.6 (3/23) · GSO 100.0 (1/16) · IFBench 45.1 (12/23) · LMArenaCreativeOrOpenEnded 93.6 (4/24) · LMArenaSearchDocument 100.0 (1/22) · LMArenaText 93.6 (4/24) · LongContextRecall 88.2 (7/23) · OutputSpeed 78.0 (16/23) · SWEBenchMultilingual 95.0 (3/19) · SWEBenchPro 95.0 (2/21) · SWEBenchVerified 95.0 (3/23) · SWEComposite 91.1 (4/24) · SWERebench 85.3 (6/23) · SciCode 100.0 (1/23) · SonarBugDensity 50.1 (19/22) · SonarComposite 51.4 (18/24) · SonarFunctionalSkill 93.9 (2/22) · SonarIssueDensity 0.0 (22/22) · SonarVulnerabilityDensity 25.3 (19/22) · TTFT 41.1 (20/23) · Tau2Bench 79.8 (10/23) · TerminalBench 78.2 (4/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
kimi-k2.6 · moonshot · idea 75.2 raw, 75.2 adjusted · plan 76.4 raw, 76.4 adjusted · build 77.8 raw, 77.8 adjusted · review 84.1

group breakdown

A_B 59.4 (15/24) · A_I 75.4 (14/24) · A_P 60.2 (14/24) · A_R 77.8 (15/24) · BUILD 84.4 (4/24) · CRE 78.2 (8/24) · GEN 74.3 (5/24) · LM_ARENA_REVIEW_PROXY 94.7 (2/24) · OPS_long 60.4 (22/24) · OPS_precision 73.9 (15/24) · OPS_review 72.4 (19/24) · PLAN 87.8 (3/24)

metrics

AI_code 22.4 (14/22) · AI_complexity 28.3 (15/22) · AI_context_awareness 0.0 (16/24) · AI_correctness 94.1 (10/22) · AI_edge_cases 86.5 (12/22) · AI_efficiency 41.6 (18/22) · AI_hallucination_resistance 40.0 (18/24) · AI_memory_retention 0.0 (20/24) · AI_parameter_accuracy 86.4 (9/24) · AI_plan_coherence 13.9 (19/24) · AI_recovery 98.7 (8/22) · AI_refusal 100.0 (9/22) · AI_spec 100.0 (9/22) · AI_stability 88.4 (13/22) · AI_task_completion 77.2 (17/24) · AI_tool_selection 77.6 (11/24) · ARC_AGI_2 11.9 (14/22) · ArtificialAnalysisCoding 73.6 (7/23) · ArtificialAnalysisIntelligence 88.4 (4/23) · ArtificialAnalysisReasoning 87.6 (4/23) · BlendedCost 89.1 (9/24) · ContextWindow 78.8 (14/24) · CopilotArenaOrLMArenaCode 94.6 (5/24) · GDPval 68.5 (11/24) · GPQA_HLE_Reasoning 87.6 (4/23) · IFBench 91.5 (6/23) · LMArenaCreativeOrOpenEnded 78.2 (8/24) · LMArenaSearchDocument 94.7 (2/22) · LMArenaText 78.2 (8/24) · LongContextRecall 85.3 (8/23) · MCPAtlas 92.5 (4/16) · OutputSpeed 37.8 (22/23) · SWEBenchMultilingual 95.0 (5/19) · SWEBenchPro 95.0 (4/21) · SWEBenchVerified 95.0 (6/23) · SWEComposite 86.2 (9/24) · SWERebench 73.1 (14/23) · SciCode 94.5 (3/23) · SonarBugDensity 92.5 (5/22) · SonarComposite 80.6 (6/24) · SonarFunctionalSkill 66.8 (17/22) · SonarIssueDensity 92.5 (4/22) · SonarVulnerabilityDensity 81.6 (9/22) · TTFT 94.2 (5/23) · Tau2Bench 96.0 (3/23) · TerminalBench 74.6 (5/24)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO
gemini-3.1-pro-preview · google · idea 93.7 raw, 93.7 adjusted · plan 87.4 raw, 87.4 adjusted · build 77.3 raw, 77.3 adjusted · review 85.9

group breakdown

A_B 64.1 (12/24) · A_I 80.5 (12/24) · A_P 70.7 (5/24) · A_R 82.5 (10/24) · BUILD 82.1 (5/24) · CRE 100.0 (2/24) · GEN 100.0 (1/24) · LM_ARENA_REVIEW_PROXY 92.3 (4/24) · OPS_long 79.8 (16/24) · OPS_precision 68.2 (19/24) · OPS_review 75.5 (16/24) · PLAN 92.6 (1/24)

metrics

AI_code 24.0 (13/22) · AI_complexity 37.2 (11/22) · AI_context_awareness 28.3 (5/24) · AI_correctness 92.5 (17/22) · AI_edge_cases 89.6 (10/22) · AI_efficiency 58.6 (11/22) · AI_hallucination_resistance 92.5 (9/24) · AI_memory_retention 29.7 (7/24) · AI_parameter_accuracy 26.7 (19/24) · AI_plan_coherence 92.5 (8/24) · AI_recovery 92.5 (16/22) · AI_refusal 70.1 (20/22) · AI_spec 72.7 (20/22) · AI_stability 92.5 (10/22) · AI_task_completion 88.3 (13/24) · AI_tool_selection 18.6 (18/24) · ARC_AGI_2 100.0 (1/22) · ArtificialAnalysisCoding 100.0 (1/23) · ArtificialAnalysisIntelligence 100.0 (2/23) · ArtificialAnalysisReasoning 100.0 (1/23) · BlendedCost 77.3 (13/24) · ContextWindow 100.0 (6/24) · CopilotArenaOrLMArenaCode 73.5 (9/24) · GDPval 50.2 (15/24) · GPQA_HLE_Reasoning 100.0 (1/23) · GSO 51.3 (9/16) · IFBench 94.4 (4/23) · LMArenaCreativeOrOpenEnded 100.0 (2/24) · LMArenaSearchDocument 92.3 (4/22) · LMArenaText 100.0 (2/24) · LongContextRecall 100.0 (2/23) · MCPAtlas 71.1 (9/16) · OutputSpeed 93.7 (5/23) · SWEBenchMultilingual 36.0 (12/19) · SWEBenchPro 89.1 (10/21) · SWEBenchVerified 95.0 (5/23) · SWEComposite 88.9 (6/24) · SWERebench 99.8 (2/23) · SciCode 100.0 (2/23) · SonarBugDensity 52.7 (17/22) · SonarComposite 54.2 (17/24) · SonarFunctionalSkill 78.9 (10/22) · SonarIssueDensity 13.2 (17/22) · SonarVulnerabilityDensity 58.2 (16/22) · TTFT 27.7 (22/23) · Tau2Bench 95.3 (5/23) · TerminalBench 89.4 (3/24)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: none
glm-5.1 · zai · idea 78.7 raw, 78.7 adjusted · plan 70.9 raw, 70.9 adjusted · build 76.3 raw, 76.3 adjusted · review 79.6

group breakdown

A_B 58.0 (18/24) · A_I 71.6 (18/24) · A_P 58.6 (17/24) · A_R 73.7 (18/24) · BUILD 81.9 (6/24) · CRE 87.0 (5/24) · GEN 67.4 (8/24) · LM_ARENA_REVIEW_PROXY 88.0 (7/24) · OPS_long 83.8 (9/24) · OPS_precision 88.1 (6/24) · OPS_review 85.8 (6/24) · PLAN 76.9 (6/24)

metrics

AI_code 26.5 (10/22) · AI_complexity 31.5 (13/22) · AI_context_awareness 7.5 (10/24) · AI_correctness 87.5 (18/22) · AI_edge_cases 81.0 (19/22) · AI_efficiency 42.8 (16/22) · AI_hallucination_resistance 41.5 (16/24) · AI_memory_retention 7.5 (10/24) · AI_parameter_accuracy 81.0 (13/24) · AI_plan_coherence 19.3 (15/24) · AI_recovery 91.4 (17/22) · AI_refusal 92.5 (16/22) · AI_spec 92.5 (16/22) · AI_stability 82.6 (18/22) · AI_task_completion 73.1 (18/24) · AI_tool_selection 73.5 (13/24) · ARC_AGI_2 5.2 (16/22) · ArtificialAnalysisCoding 61.9 (10/23) · ArtificialAnalysisIntelligence 79.6 (8/23) · ArtificialAnalysisReasoning 63.3 (11/23) · BlendedCost 93.0 (6/24) · ContextWindow 74.9 (19/24) · CopilotArenaOrLMArenaCode 97.1 (3/24) · GDPval 73.4 (10/24) · GPQA_HLE_Reasoning 63.3 (11/23) · IFBench 92.3 (5/23) · LMArenaCreativeOrOpenEnded 87.0 (5/24) · LMArenaSearchDocument 88.0 (7/22) · LMArenaText 87.0 (5/24) · LongContextRecall 49.0 (17/23) · MCPAtlas 100.0 (1/16) · OutputSpeed 79.1 (15/23) · SWEBenchMultilingual 50.9 (10/19) · SWEBenchPro 95.0 (7/21) · SWEBenchVerified 91.9 (12/23) · SWEComposite 92.1 (3/24) · SWERebench 100.0 (1/23) · SciCode 41.5 (16/23) · SonarBugDensity 100.0 (1/22) · SonarComposite 86.0 (2/24) · SonarFunctionalSkill 69.8 (12/22) · SonarIssueDensity 100.0 (1/22) · SonarVulnerabilityDensity 87.2 (5/22) · TTFT 98.8 (4/23) · Tau2Bench 100.0 (2/23) · TerminalBench 55.8 (11/24)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/GSO
claude-sonnet-4.6 · anthropic · idea 73.5 raw, 73.5 adjusted · plan 61.8 raw, 61.8 adjusted · build 71.7 raw, 71.7 adjusted · review 63.6

group breakdown

A_B 67.6 (8/24) · A_I 82.2 (9/24) · A_P 64.4 (11/24) · A_R 75.9 (17/24) · BUILD 76.9 (8/24) · CRE 73.8 (10/24) · GEN 66.3 (9/24) · LM_ARENA_REVIEW_PROXY 23.2 (14/24) · OPS_long 66.2 (21/24) · OPS_precision 53.7 (23/24) · OPS_review 63.6 (21/24) · PLAN 58.8 (12/24)

metrics

AI_canary_health 88.2 (4/7) · AI_code 48.5 (7/22) · AI_complexity 55.5 (7/22) · AI_context_awareness 7.8 (9/24) · AI_correctness 100.0 (5/22) · AI_edge_cases 89.1 (11/22) · AI_efficiency 58.0 (13/22) · AI_hallucination_resistance 0.0 (24/24) · AI_memory_retention 0.0 (16/24) · AI_parameter_accuracy 100.0 (4/24) · AI_plan_coherence 16.7 (16/24) · AI_recovery 100.0 (5/22) · AI_refusal 100.0 (5/22) · AI_spec 100.0 (5/22) · AI_stability 94.8 (7/22) · AI_task_completion 95.9 (4/24) · AI_tool_selection 51.6 (15/24) · ARC_AGI_2 10.6 (15/22) · ArtificialAnalysisCoding 85.5 (4/23) · ArtificialAnalysisIntelligence 80.7 (7/23) · ArtificialAnalysisReasoning 68.7 (9/23) · BlendedCost 74.4 (18/24) · ContextWindow 99.3 (11/24) · CopilotArenaOrLMArenaCode 95.0 (4/24) · GDPval 86.1 (6/24) · GPQA_HLE_Reasoning 68.7 (9/23) · GSO 30.7 (11/16) · IFBench 39.7 (16/23) · LMArenaCreativeOrOpenEnded 73.8 (10/24) · LMArenaSearchDocument 23.2 (12/22) · LMArenaText 73.8 (10/24) · LongContextRecall 90.2 (6/23) · MCPAtlas 69.8 (10/16) · OutputSpeed 79.7 (14/23) · SWEBenchMultilingual 95.0 (4/19) · SWEBenchPro 76.5 (16/21) · SWEBenchVerified 90.3 (13/23) · SWEComposite 88.1 (8/24) · SWERebench 95.7 (3/23) · SciCode 57.9 (10/23) · SonarBugDensity 65.8 (10/22) · SonarComposite 55.8 (12/24) · SonarFunctionalSkill 84.5 (5/22) · SonarIssueDensity 22.3 (13/22) · SonarVulnerabilityDensity 21.8 (20/22) · TTFT 0.0 (23/23) · Tau2Bench 51.2 (14/23) · TerminalBench 47.4 (15/24)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench
missing: none
gpt-5.3-codex · openai · idea 69.2 raw, 69.2 adjusted · plan 51.1 raw, 51.1 adjusted · build 70.8 raw, 70.8 adjusted · review 70.3

group breakdown

A_B 62.1 (14/24) · A_I 75.3 (15/24) · A_P 57.2 (18/24) · A_R 81.0 (13/24) · BUILD 75.3 (9/24) · CRE 72.6 (12/24) · GEN 49.7 (15/24) · LM_ARENA_REVIEW_PROXY 92.5 (3/24) · OPS_long 85.6 (8/24) · OPS_precision 82.6 (10/24) · OPS_review 83.2 (10/24) · PLAN 42.2 (17/24)

metrics

AI_code 22.4 (15/22) · AI_complexity 28.3 (17/22) · AI_context_awareness 0.0 (18/24) · AI_correctness 94.1 (12/22) · AI_edge_cases 86.5 (14/22) · AI_efficiency 58.4 (12/22) · AI_hallucination_resistance 60.0 (14/24) · AI_memory_retention 0.0 (22/24) · AI_parameter_accuracy 86.4 (10/24) · AI_plan_coherence 2.8 (22/24) · AI_recovery 98.7 (10/22) · AI_refusal 100.0 (11/22) · AI_spec 100.0 (11/22) · AI_stability 86.2 (15/22) · AI_task_completion 61.8 (19/24) · AI_tool_selection 79.0 (10/24) · ARC_AGI_2 71.9 (8/22) · ArtificialAnalysisCoding 44.4 (15/23) · ArtificialAnalysisIntelligence 34.3 (17/23) · ArtificialAnalysisReasoning 35.3 (16/23) · BlendedCost 76.6 (14/24) · ContextWindow 85.3 (13/24) · CopilotArenaOrLMArenaCode 60.1 (17/24) · GDPval 68.0 (12/24) · GPQA_HLE_Reasoning 35.3 (16/23) · GSO 53.4 (8/16) · IFBench 59.9 (11/23) · LMArenaCreativeOrOpenEnded 72.6 (12/24) · LMArenaSearchDocument 92.5 (3/22) · LMArenaText 72.6 (12/24) · LongContextRecall 45.0 (19/23) · OutputSpeed 90.0 (8/23) · SWEBenchPro 95.0 (5/21) · SWEBenchVerified 92.5 (9/23) · SWEComposite 92.1 (2/24) · SWERebench 89.5 (5/23) · SciCode 44.7 (15/23) · SonarBugDensity 80.8 (7/22) · SonarComposite 60.9 (9/24) · SonarFunctionalSkill 72.3 (11/22) · SonarIssueDensity 7.5 (18/22) · SonarVulnerabilityDensity 92.5 (3/22) · TTFT 78.4 (13/23) · Tau2Bench 7.5 (20/23) · TerminalBench 74.3 (6/24)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
gpt-5.4 · openai · idea 71.2 raw, 71.2 adjusted · plan 52.4 raw, 52.4 adjusted · build 69.8 raw, 69.8 adjusted · review 60.7

group breakdown

A_B 58.5 (16/24) · A_I 75.5 (13/24) · A_P 62.0 (13/24) · A_R 79.9 (14/24) · BUILD 74.2 (10/24) · CRE 76.6 (9/24) · GEN 47.2 (16/24) · LM_ARENA_REVIEW_PROXY 17.1 (19/24) · OPS_long 93.0 (3/24) · OPS_precision 89.3 (5/24) · OPS_review 90.7 (3/24) · PLAN 43.1 (16/24)

metrics

AI_code 7.6 (21/22) · AI_complexity 28.3 (18/22) · AI_context_awareness 0.0 (19/24) · AI_correctness 94.1 (13/22) · AI_edge_cases 86.5 (15/22) · AI_efficiency 50.3 (14/22) · AI_hallucination_resistance 60.0 (15/24) · AI_memory_retention 0.0 (23/24) · AI_parameter_accuracy 87.3 (8/24) · AI_plan_coherence 5.6 (21/24) · AI_recovery 98.7 (11/22) · AI_refusal 100.0 (12/22) · AI_spec 100.0 (12/22) · AI_stability 89.9 (11/22) · AI_task_completion 92.6 (9/24) · AI_tool_selection 99.8 (4/24) · ARC_AGI_2 75.8 (7/22) · ArtificialAnalysisCoding 35.5 (17/23) · ArtificialAnalysisIntelligence 33.0 (18/23) · ArtificialAnalysisReasoning 15.5 (19/23) · BlendedCost 75.0 (15/24) · ContextWindow 100.0 (1/24) · CopilotArenaOrLMArenaCode 68.9 (13/24) · GDPval 87.9 (4/24) · GPQA_HLE_Reasoning 15.5 (19/23) · GSO 54.0 (7/16) · IFBench 60.5 (10/23) · LMArenaCreativeOrOpenEnded 76.6 (9/24) · LMArenaSearchDocument 17.1 (17/22) · LMArenaText 76.6 (9/24) · LongContextRecall 24.5 (20/23) · MCPAtlas 72.8 (6/16) · OutputSpeed 96.6 (4/23) · SWEBenchPro 92.5 (9/21) · SWEBenchVerified 95.0 (7/23) · SWEComposite 88.9 (7/24) · SWERebench 83.5 (7/23) · SciCode 12.0 (20/23) · SonarBugDensity 84.7 (6/22) · SonarComposite 60.4 (10/24) · SonarFunctionalSkill 66.8 (14/22) · SonarIssueDensity 6.8 (20/22) · SonarVulnerabilityDensity 100.0 (1/22) · TTFT 86.8 (7/23) · Tau2Bench 0.0 (23/23) · TerminalBench 100.0 (2/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.5 · anthropic · idea 67.9 raw, 67.9 adjusted · plan 66.0 raw, 66.0 adjusted · build 68.6 raw, 68.6 adjusted · review 60.6

group breakdown

A_B 36.4 (21/24) · A_I 49.3 (21/24) · A_P 53.0 (19/24) · A_R 53.3 (21/24) · BUILD 80.6 (7/24) · CRE 73.5 (11/24) · GEN 73.5 (6/24) · LM_ARENA_REVIEW_PROXY 10.8 (21/24) · OPS_long 76.5 (18/24) · OPS_precision 73.5 (17/24) · OPS_review 73.6 (18/24) · PLAN 67.1 (9/24)

metrics

AI_canary_health 88.5 (3/7) · AI_code 11.1 (18/22) · AI_complexity 3.4 (20/22) · AI_context_awareness 75.4 (2/24) · AI_correctness 63.7 (19/22) · AI_edge_cases 82.2 (18/22) · AI_efficiency 31.0 (20/22) · AI_hallucination_resistance 20.0 (19/24) · AI_memory_retention 0.0 (12/24) · AI_parameter_accuracy 100.0 (1/24) · AI_plan_coherence 13.9 (18/24) · AI_recovery 87.5 (18/22) · AI_refusal 17.9 (21/22) · AI_spec 17.9 (21/22) · AI_stability 83.3 (17/22) · AI_task_completion 98.8 (2/24) · AI_tool_selection 74.7 (12/24) · ARC_AGI_2 84.8 (5/22) · ArtificialAnalysisCoding 75.8 (6/23) · ArtificialAnalysisIntelligence 73.7 (9/23) · ArtificialAnalysisReasoning 63.7 (10/23) · BlendedCost 61.9 (20/24) · ContextWindow 74.7 (21/24) · CopilotArenaOrLMArenaCode 77.7 (8/24) · GDPval 80.4 (9/24) · GPQA_HLE_Reasoning 63.7 (10/23) · GSO 59.3 (5/16) · IFBench 43.5 (14/23) · LMArenaCreativeOrOpenEnded 73.5 (11/24) · LMArenaSearchDocument 10.8 (19/22) · LMArenaText 73.5 (11/24) · LongContextRecall 100.0 (1/23) · OutputSpeed 80.7 (13/23) · SWEBenchMultilingual 95.0 (2/19) · SWEBenchPro 88.4 (11/21) · SWEBenchVerified 92.2 (10/23) · SWEComposite 84.9 (10/24) · SWERebench 76.5 (9/23) · SciCode 72.7 (7/23) · SonarBugDensity 73.7 (8/22) · SonarComposite 87.1 (1/24) · SonarFunctionalSkill 100.0 (1/22) · SonarIssueDensity 77.2 (5/22) · SonarVulnerabilityDensity 87.2 (4/22) · TTFT 73.3 (17/23) · Tau2Bench 81.8 (9/23) · TerminalBench 54.8 (12/24)
sources: aistupidlevel, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
gemini-3-pro · google · idea 81.0 raw, 81.0 adjusted · plan 60.5 raw, 60.5 adjusted · build 64.3 raw, 64.3 adjusted · review 60.6

group breakdown

A_B 66.6 (9/24) · A_I 85.9 (6/24) · A_P 74.4 (2/24) · A_R 88.2 (4/24) · BUILD 66.3 (13/24) · CRE 94.8 (3/24) · GEN 60.0 (12/24) · LM_ARENA_REVIEW_PROXY 19.9 (16/24) · OPS_long 45.2 (23/24) · OPS_precision 48.0 (24/24) · OPS_review 43.0 (24/24) · PLAN 55.2 (14/24)

metrics

AI_code 19.5 (17/22) · AI_complexity 35.0 (12/22) · AI_context_awareness 24.5 (6/24) · AI_correctness 100.0 (7/22) · AI_edge_cases 96.6 (6/22) · AI_efficiency 60.2 (8/22) · AI_hallucination_resistance 100.0 (1/24) · AI_memory_retention 26.2 (8/24) · AI_parameter_accuracy 22.6 (20/24) · AI_plan_coherence 100.0 (1/24) · AI_recovery 100.0 (6/22) · AI_refusal 73.7 (17/22) · AI_spec 76.8 (17/22) · AI_stability 100.0 (5/22) · AI_task_completion 95.1 (5/24) · AI_tool_selection 13.1 (19/24) · ARC_AGI_2 41.9 (9/22) · BlendedCost 77.3 (12/24) · ContextWindow 0.0 (24/24) · CopilotArenaOrLMArenaCode 69.2 (12/24) · GDPval 37.2 (19/24) · GSO 40.7 (10/16) · LMArenaCreativeOrOpenEnded 94.8 (3/24) · LMArenaSearchDocument 19.9 (14/22) · LMArenaText 94.8 (3/24) · MCPAtlas 74.9 (5/16) · SWEBenchMultilingual 33.5 (13/19) · SWEBenchPro 80.3 (14/21) · SWEBenchVerified 82.9 (16/23) · SWEComposite 72.1 (15/24) · SWERebench 70.6 (16/23) · SonarBugDensity 53.2 (13/22) · SonarComposite 54.9 (13/24) · SonarFunctionalSkill 84.1 (6/22) · SonarIssueDensity 6.7 (21/22) · SonarVulnerabilityDensity 59.7 (12/22) · TerminalBench 61.2 (8/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/LongContextRecall, BUILD/SciCode, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench
gemini-3-flash · google · idea 80.5 raw, 80.5 adjusted · plan 67.8 raw, 67.8 adjusted · build 64.2 raw, 64.2 adjusted · review 64.2

group breakdown

A_B 64.1 (11/24) · A_I 80.5 (11/24) · A_P 70.7 (4/24) · A_R 82.5 (9/24) · BUILD 60.7 (14/24) · CRE 86.3 (6/24) · GEN 63.0 (11/24) · LM_ARENA_REVIEW_PROXY 19.2 (17/24) · OPS_long 95.1 (1/24) · OPS_precision 91.6 (2/24) · OPS_review 93.4 (1/24) · PLAN 64.5 (10/24)

metrics

AI_code 24.0 (12/22) · AI_complexity 37.2 (10/22) · AI_context_awareness 28.3 (4/24) · AI_correctness 92.5 (16/22) · AI_edge_cases 89.6 (9/22) · AI_efficiency 58.6 (10/22) · AI_hallucination_resistance 92.5 (8/24) · AI_memory_retention 29.7 (6/24) · AI_parameter_accuracy 26.7 (18/24) · AI_plan_coherence 92.5 (7/24) · AI_recovery 92.5 (15/22) · AI_refusal 70.1 (19/22) · AI_spec 72.7 (19/22) · AI_stability 92.5 (9/22) · AI_task_completion 88.3 (12/24) · AI_tool_selection 18.6 (17/24) · ARC_AGI_2 3.1 (19/22) · ArtificialAnalysisCoding 59.4 (11/23) · ArtificialAnalysisIntelligence 62.1 (13/23) · ArtificialAnalysisReasoning 82.7 (7/23) · BlendedCost 91.5 (8/24) · ContextWindow 100.0 (5/24) · CopilotArenaOrLMArenaCode 68.8 (14/24) · GDPval 39.1 (17/24) · GPQA_HLE_Reasoning 82.7 (7/23) · GSO 14.0 (14/16) · IFBench 96.8 (3/23) · LMArenaCreativeOrOpenEnded 86.3 (6/24) · LMArenaSearchDocument 19.2 (15/22) · LMArenaText 86.3 (6/24) · LongContextRecall 68.6 (9/23) · MCPAtlas 22.4 (12/16) · OutputSpeed 99.4 (2/23) · SWEBenchMultilingual 100.0 (1/19) · SWEBenchPro 53.0 (18/21) · SWEBenchVerified 100.0 (1/23) · SWEComposite 74.1 (12/24) · SWERebench 76.3 (10/23) · SciCode 78.7 (6/23) · SonarBugDensity 52.7 (16/22) · SonarComposite 54.2 (16/24) · SonarFunctionalSkill 78.9 (9/22) · SonarIssueDensity 13.2 (16/22) · SonarVulnerabilityDensity 58.2 (15/22) · TTFT 81.4 (9/23) · Tau2Bench 61.6 (12/23) · TerminalBench 48.3 (13/24)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
deepseek-v4-flash · deepseek · idea 52.6 raw, 52.6 adjusted · plan 63.6 raw, 63.6 adjusted · build 63.5 raw, 63.5 adjusted · review 66.4

group breakdown

A_B 23.9 (23/24) · A_I 25.1 (24/24) · A_P 36.8 (23/24) · A_R 25.0 (24/24) · BUILD 73.8 (11/24) · CRE 58.8 (16/24) · GEN 56.5 (13/24) · LM_ARENA_REVIEW_PROXY 88.0 (5/24) · OPS_long 88.0 (5/24) · OPS_precision 91.5 (3/24) · OPS_review 88.6 (5/24) · PLAN 78.4 (5/24)

metrics

AI_canary_health 83.4 (6/7) · AI_code 7.6 (19/22) · AI_complexity 0.0 (22/22) · AI_context_awareness 0.0 (14/24) · AI_correctness 0.0 (21/22) · AI_edge_cases 0.0 (21/22) · AI_efficiency 67.7 (5/22) · AI_hallucination_resistance 60.0 (12/24) · AI_memory_retention 0.0 (17/24) · AI_parameter_accuracy 86.3 (11/24) · AI_plan_coherence 16.7 (17/24) · AI_recovery 0.0 (21/22) · AI_refusal 100.0 (6/22) · AI_spec 100.0 (6/22) · AI_stability 0.0 (22/22) · AI_task_completion 77.2 (16/24) · AI_tool_selection 98.4 (5/24) · ARC_AGI_2 11.9 (12/22) · ArtificialAnalysisCoding 47.2 (13/23) · ArtificialAnalysisIntelligence 62.5 (12/23) · ArtificialAnalysisReasoning 76.7 (8/23) · BlendedCost 100.0 (1/24) · ContextWindow 71.6 (22/24) · CopilotArenaOrLMArenaCode 87.9 (6/24) · GDPval 67.4 (13/24) · GPQA_HLE_Reasoning 76.7 (8/23) · IFBench 100.0 (1/23) · LMArenaCreativeOrOpenEnded 58.8 (16/24) · LMArenaSearchDocument 88.0 (5/22) · LMArenaText 58.8 (16/24) · LongContextRecall 52.5 (16/23) · OutputSpeed 85.9 (11/23) · SWEBenchMultilingual 58.6 (9/19) · SWEBenchPro 95.0 (3/21) · SWEBenchVerified 95.0 (4/23) · SWEComposite 82.6 (11/24) · SWERebench 73.1 (12/23) · SciCode 47.5 (13/23) · SonarBugDensity 92.5 (3/22) · SonarComposite 80.6 (4/24) · SonarFunctionalSkill 66.8 (15/22) · SonarIssueDensity 92.5 (2/22) · SonarVulnerabilityDensity 81.6 (7/22) · TTFT 99.9 (2/23) · Tau2Bench 94.0 (6/23) · TerminalBench 60.9 (9/24)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas
claude-sonnet-4.5 · anthropic · idea 66.1 raw, 66.1 adjusted · plan 51.5 raw, 51.5 adjusted · build 59.4 raw, 59.4 adjusted · review 52.2

group breakdown

A_B 73.9 (6/24) · A_I 84.9 (7/24) · A_P 68.6 (6/24) · A_R 81.6 (11/24) · BUILD 52.9 (16/24) · CRE 64.3 (14/24) · GEN 44.1 (17/24) · LM_ARENA_REVIEW_PROXY 2.3 (22/24) · OPS_long 80.2 (14/24) · OPS_precision 80.3 (11/24) · OPS_review 82.4 (11/24) · PLAN 40.9 (18/24)

metrics

AI_canary_health 89.1 (1/7) · AI_code 61.9 (5/22) · AI_complexity 60.9 (6/22) · AI_context_awareness 0.0 (13/24) · AI_correctness 100.0 (4/22) · AI_edge_cases 92.5 (7/22) · AI_efficiency 64.3 (6/22) · AI_hallucination_resistance 20.0 (21/24) · AI_memory_retention 0.0 (15/24) · AI_parameter_accuracy 100.0 (3/24) · AI_plan_coherence 22.2 (11/24) · AI_recovery 100.0 (4/22) · AI_refusal 100.0 (4/22) · AI_spec 100.0 (4/22) · AI_stability 100.0 (3/22) · AI_task_completion 100.0 (1/24) · AI_tool_selection 99.8 (3/24) · ARC_AGI_2 3.7 (17/22) · ArtificialAnalysisCoding 46.9 (14/23) · ArtificialAnalysisIntelligence 50.2 (14/23) · ArtificialAnalysisReasoning 35.3 (17/23) · BlendedCost 74.4 (17/24) · ContextWindow 99.3 (10/24) · CopilotArenaOrLMArenaCode 54.1 (18/24) · GDPval 88.2 (3/24) · GPQA_HLE_Reasoning 35.3 (17/23) · GSO 27.3 (12/16) · IFBench 41.6 (15/23) · LMArenaCreativeOrOpenEnded 64.3 (14/24) · LMArenaSearchDocument 2.3 (20/22) · LMArenaText 64.3 (14/24) · LongContextRecall 65.7 (11/23) · MCPAtlas 6.6 (15/16) · OutputSpeed 76.6 (19/23) · SWEBenchMultilingual 3.9 (18/19) · SWEBenchPro 81.2 (13/21) · SWEBenchVerified 85.7 (15/23) · SWEComposite 71.6 (16/24) · SWERebench 74.9 (11/23) · SciCode 46.4 (14/23) · SonarBugDensity 2.8 (21/22) · SonarComposite 15.6 (23/24) · SonarFunctionalSkill 17.2 (20/22) · SonarIssueDensity 30.0 (12/22) · SonarVulnerabilityDensity 4.6 (21/22) · TTFT 78.8 (11/23) · Tau2Bench 56.5 (13/23) · TerminalBench 37.4 (17/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
claude-sonnet-4 · anthropic · idea 32.0 raw, 32.0 adjusted · plan 39.2 raw, 39.2 adjusted · build 58.9 raw, 58.9 adjusted · review 59.6

group breakdown

A_B 89.1 (1/24) · A_I 91.7 (1/24) · A_P 67.7 (7/24) · A_R 86.3 (5/24) · BUILD 47.3 (18/24) · CRE 0.0 (23/24) · GEN 16.3 (21/24) · LM_ARENA_REVIEW_PROXY 86.2 (8/24) · OPS_long 80.1 (15/24) · OPS_precision 79.8 (12/24) · OPS_review 82.0 (12/24) · PLAN 29.7 (20/24)

metrics

AI_code 98.9 (2/22) · AI_complexity 100.0 (1/22) · AI_context_awareness 0.0 (12/24) · AI_correctness 100.0 (3/22) · AI_edge_cases 100.0 (3/22) · AI_efficiency 97.0 (2/22) · AI_hallucination_resistance 20.0 (20/24) · AI_memory_retention 0.0 (14/24) · AI_parameter_accuracy 81.0 (12/24) · AI_plan_coherence 19.4 (14/24) · AI_recovery 100.0 (3/22) · AI_refusal 100.0 (3/22) · AI_spec 100.0 (3/22) · AI_stability 100.0 (2/22) · AI_task_completion 92.6 (6/24) · AI_tool_selection 97.0 (6/24) · ARC_AGI_2 0.2 (21/22) · ArtificialAnalysisCoding 32.7 (18/23) · ArtificialAnalysisIntelligence 35.1 (16/23) · ArtificialAnalysisReasoning 8.6 (21/23) · BlendedCost 74.4 (16/24) · ContextWindow 99.3 (9/24) · CopilotArenaOrLMArenaCode 53.5 (20/24) · GDPval 86.1 (5/24) · GPQA_HLE_Reasoning 8.6 (21/23) · GSO 6.0 (15/16) · IFBench 34.7 (17/23) · LMArenaCreativeOrOpenEnded 0.0 (23/24) · LMArenaSearchDocument 86.2 (8/22) · LMArenaText 0.0 (23/24) · LiveCodeBench 0.0 (2/2) · LongContextRecall 60.8 (12/23) · MCPAtlas 13.1 (13/16) · OutputSpeed 77.1 (17/23) · SWEBenchMultilingual 10.8 (14/19) · SWEBenchPro 78.4 (15/21) · SWEBenchVerified 69.9 (21/23) · SWEComposite 61.0 (17/24) · SWERebench 55.1 (17/23) · SciCode 20.8 (18/23) · SonarBugDensity 0.0 (22/22) · SonarComposite 19.5 (22/24) · SonarFunctionalSkill 26.4 (19/22) · SonarIssueDensity 35.8 (10/22) · SonarVulnerabilityDensity 0.0 (22/22) · TTFT 76.8 (15/23) · Tau2Bench 26.6 (19/23) · TerminalBench 47.4 (14/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: none
claude-opus-4.1 · anthropic · idea 52.1 raw, 52.1 adjusted · plan 56.3 raw, 56.3 adjusted · build 58.5 raw, 58.5 adjusted · review 51.7

group breakdown

A_B 22.8 (24/24) · A_I 32.4 (22/24) · A_P 33.1 (24/24) · A_R 41.5 (22/24) · BUILD 71.9 (12/24) · CRE 53.1 (17/24) · GEN 66.3 (10/24) · LM_ARENA_REVIEW_PROXY 0.1 (23/24) · OPS_long 67.0 (20/24) · OPS_precision 58.5 (22/24) · OPS_review 59.0 (22/24) · PLAN 63.0 (11/24)

metrics

AI_canary_health 68.1 (7/7) · AI_code 0.0 (22/22) · AI_complexity 0.3 (21/22) · AI_context_awareness 0.0 (11/24) · AI_correctness 28.1 (20/22) · AI_edge_cases 67.6 (20/22) · AI_efficiency 6.7 (21/22) · AI_hallucination_resistance 40.0 (17/24) · AI_memory_retention 0.0 (11/24) · AI_parameter_accuracy 63.7 (15/24) · AI_plan_coherence 19.4 (13/24) · AI_recovery 82.8 (19/22) · AI_refusal 0.0 (22/22) · AI_spec 0.0 (22/22) · AI_stability 59.3 (20/22) · AI_task_completion 77.2 (15/24) · AI_tool_selection 90.1 (8/24) · ARC_AGI_2 82.8 (6/22) · ArtificialAnalysisCoding 71.9 (8/23) · ArtificialAnalysisIntelligence 70.1 (10/23) · ArtificialAnalysisReasoning 61.7 (12/23) · BlendedCost 0.0 (24/24) · ContextWindow 74.7 (20/24) · CopilotArenaOrLMArenaCode 53.8 (19/24) · GDPval 80.4 (8/24) · GPQA_HLE_Reasoning 61.7 (12/23) · GSO 57.9 (6/16) · IFBench 44.4 (13/23) · LMArenaCreativeOrOpenEnded 53.1 (17/24) · LMArenaSearchDocument 0.1 (21/22) · LMArenaText 53.1 (17/24) · LongContextRecall 92.5 (4/23) · MCPAtlas 92.5 (2/16) · OutputSpeed 76.1 (20/23) · SWEBenchMultilingual 92.5 (6/19) · SWEBenchPro 82.6 (12/21) · SWEBenchVerified 92.0 (11/23) · SWEComposite 72.9 (14/24) · SWERebench 52.3 (18/23) · SciCode 69.3 (8/23) · SonarBugDensity 70.1 (9/22) · SonarComposite 81.5 (3/24) · SonarFunctionalSkill 92.5 (3/22) · SonarIssueDensity 73.1 (6/22) · SonarVulnerabilityDensity 81.6 (6/22) · TTFT 69.8 (18/23) · Tau2Bench 77.0 (11/23) · TerminalBench 29.4 (18/24)
sources: aistupidlevel, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: none
grok-4-latest · xai · idea 70.4 raw, 70.4 adjusted · plan 69.1 raw, 69.1 adjusted · build 58.3 raw, 58.3 adjusted · review 63.5

group breakdown

A_B 88.5 (3/24) · A_I 89.9 (2/24) · A_P 65.0 (10/24) · A_R 97.6 (1/24) · BUILD 43.8 (20/24) · CRE 58.9 (15/24) · GEN 69.0 (7/24) · LM_ARENA_REVIEW_PROXY 18.4 (18/24) · OPS_long 81.4 (12/24) · OPS_precision 70.1 (18/24) · OPS_review 74.0 (17/24) · PLAN 71.2 (8/24)

metrics

AI_code 89.2 (4/22) · AI_complexity 99.4 (2/22) · AI_context_awareness 0.0 (21/24) · AI_correctness 100.0 (8/22) · AI_edge_cases 100.0 (5/22) · AI_efficiency 0.0 (22/22) · AI_hallucination_resistance 100.0 (2/24) · AI_memory_retention 100.0 (1/24) · AI_parameter_accuracy 0.0 (21/24) · AI_plan_coherence 100.0 (2/24) · AI_recovery 100.0 (7/22) · AI_refusal 100.0 (14/22) · AI_spec 100.0 (14/22) · AI_stability 87.1 (14/22) · AI_task_completion 0.0 (21/24) · AI_tool_selection 0.0 (21/24) · ARC_AGI_2 20.7 (11/22) · ArtificialAnalysisCoding 54.4 (12/23) · ArtificialAnalysisIntelligence 86.0 (5/23) · ArtificialAnalysisReasoning 83.9 (6/23) · BlendedCost 74.4 (19/24) · ContextWindow 78.4 (15/24) · CopilotArenaOrLMArenaCode 60.6 (16/24) · GDPval 15.6 (22/24) · GPQA_HLE_Reasoning 83.9 (6/23) · IFBench 100.0 (2/23) · LMArenaCreativeOrOpenEnded 58.9 (15/24) · LMArenaSearchDocument 18.4 (16/22) · LMArenaText 58.9 (15/24) · LongContextRecall 58.8 (13/23) · OutputSpeed 98.8 (3/23) · SWEComposite 45.6 (20/24) · SWERebench 39.1 (19/23) · SciCode 60.7 (9/23) · SonarComposite 50.0 (19/24) · TTFT 39.4 (21/23) · Tau2Bench 100.0 (1/23) · TerminalBench 11.8 (21/24)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gpt-5.2 · openai · idea 67.0 raw, 67.0 adjusted · plan 58.0 raw, 58.0 adjusted · build 56.1 raw, 56.1 adjusted · review 58.4

group breakdown

A_B 58.4 (17/24) · A_I 71.7 (17/24) · A_P 58.9 (16/24) · A_R 81.4 (12/24) · BUILD 51.7 (17/24) · CRE 67.8 (13/24) · GEN 53.5 (14/24) · LM_ARENA_REVIEW_PROXY 20.8 (15/24) · OPS_long 86.0 (6/24) · OPS_precision 83.2 (9/24) · OPS_review 83.9 (9/24) · PLAN 55.5 (13/24)

metrics

AI_code 7.6 (20/22) · AI_complexity 28.3 (16/22) · AI_context_awareness 0.0 (17/24) · AI_correctness 94.1 (11/22) · AI_edge_cases 86.5 (13/22) · AI_efficiency 47.7 (15/22) · AI_hallucination_resistance 80.0 (10/24) · AI_memory_retention 0.0 (21/24) · AI_parameter_accuracy 89.0 (7/24) · AI_plan_coherence 0.0 (24/24) · AI_recovery 98.7 (9/22) · AI_refusal 100.0 (10/22) · AI_spec 100.0 (10/22) · AI_stability 71.1 (19/22) · AI_task_completion 92.6 (8/24) · AI_tool_selection 90.1 (7/24) · ARC_AGI_2 0.0 (22/22) · ArtificialAnalysisCoding 64.5 (9/23) · ArtificialAnalysisIntelligence 62.8 (11/23) · ArtificialAnalysisReasoning 56.4 (13/23) · BlendedCost 80.1 (11/24) · ContextWindow 85.3 (12/24) · CopilotArenaOrLMArenaCode 39.2 (22/24) · GDPval 66.3 (14/24) · GPQA_HLE_Reasoning 56.4 (13/23) · GSO 64.7 (4/16) · IFBench 62.7 (9/23) · LMArenaCreativeOrOpenEnded 67.8 (13/24) · LMArenaSearchDocument 20.8 (13/22) · LMArenaText 67.8 (13/24) · LongContextRecall 53.9 (15/23) · OutputSpeed 90.0 (7/23) · SWEBenchMultilingual 0.0 (19/19) · SWEBenchPro 38.2 (20/21) · SWEBenchVerified 81.3 (18/23) · SWEComposite 45.6 (21/24) · SciCode 54.6 (11/23) · SonarBugDensity 64.2 (11/22) · SonarComposite 59.7 (11/24) · SonarFunctionalSkill 67.2 (13/22) · SonarIssueDensity 35.7 (11/22) · SonarVulnerabilityDensity 73.4 (10/22) · TTFT 78.4 (12/23) · Tau2Bench 48.1 (16/23) · TerminalBench 58.2 (10/24)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWERebench
glm-4.7 · zai · idea 33.3 raw, 33.3 adjusted · plan 51.2 raw, 51.2 adjusted · build 52.2 raw, 52.2 adjusted · review 55.1

group breakdown

A_B 56.0 (20/24) · A_I 55.0 (20/24) · A_P 47.0 (21/24) · A_R 58.5 (20/24) · BUILD 45.3 (19/24) · CRE 10.0 (22/24) · GEN 38.0 (18/24) · LM_ARENA_REVIEW_PROXY 50.0 (11/24) · OPS_long 89.4 (4/24) · OPS_precision 92.0 (1/24) · OPS_review 89.4 (4/24) · PLAN 54.3 (15/24)

metrics

AI_context_awareness 0.0 (24/24) · AI_hallucination_resistance 100.0 (5/24) · AI_memory_retention 100.0 (4/24) · AI_parameter_accuracy 0.0 (24/24) · AI_plan_coherence 100.0 (5/24) · AI_task_completion 0.0 (24/24) · AI_tool_selection 0.0 (24/24) · ArtificialAnalysisCoding 39.6 (16/23) · ArtificialAnalysisIntelligence 47.0 (15/23) · ArtificialAnalysisReasoning 55.8 (14/23) · BlendedCost 96.1 (3/24) · ContextWindow 74.9 (18/24) · CopilotArenaOrLMArenaCode 69.7 (11/24) · GDPval 36.8 (20/24) · GPQA_HLE_Reasoning 55.8 (14/23) · IFBench 69.9 (8/23) · LMArenaCreativeOrOpenEnded 10.0 (22/24) · LMArenaText 10.0 (22/24) · LongContextRecall 57.4 (14/23) · MCPAtlas 0.0 (16/16) · OutputSpeed 88.4 (9/23) · SWEBenchMultilingual 5.0 (17/19) · SWEBenchVerified 90.2 (14/23) · SWEComposite 60.7 (18/24) · SWERebench 70.9 (15/23) · SciCode 48.6 (12/23) · SonarBugDensity 51.6 (18/22) · SonarComposite 27.3 (21/24) · SonarFunctionalSkill 0.0 (22/22) · SonarIssueDensity 50.8 (8/22) · SonarVulnerabilityDensity 28.7 (18/22) · TTFT 100.0 (1/23) · Tau2Bench 96.0 (4/23) · TerminalBench 27.1 (19/24)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchPro
kimi-k2-0905 · moonshot · idea 25.0 / 25.0 · plan 29.4 / 29.4 · build 51.4 / 51.4 · review 47.1

group breakdown

A_B 32.2 (22 / 24) · A_I 28.4 (23 / 24) · A_P 38.1 (22 / 24) · A_R 27.7 (23 / 24) · BUILD 59.9 (15 / 24) · CRE 27.6 (20 / 24) · GEN 11.9 (24 / 24) · LM_ARENA_REVIEW_PROXY 88.0 (6 / 24) · OPS_long 35.6 (24 / 24) · OPS_precision 58.7 (21 / 24) · OPS_review 54.8 (23 / 24) · PLAN 30.2 (19 / 24)

metrics

AI_canary_health 88.9 (2 / 7) · AI_code 33.5 (9 / 22) · AI_complexity 28.3 (14 / 22) · AI_context_awareness 0.0 (15 / 24) · AI_correctness 0.0 (22 / 22) · AI_edge_cases 0.0 (22 / 22) · AI_efficiency 64.0 (7 / 22) · AI_hallucination_resistance 60.0 (13 / 24) · AI_memory_retention 0.0 (19 / 24) · AI_parameter_accuracy 79.3 (14 / 24) · AI_plan_coherence 22.2 (12 / 24) · AI_recovery 0.0 (22 / 22) · AI_refusal 100.0 (8 / 22) · AI_spec 100.0 (8 / 22) · AI_stability 1.7 (21 / 22) · AI_task_completion 92.6 (7 / 24) · AI_tool_selection 83.2 (9 / 24) · ARC_AGI_2 11.9 (13 / 22) · ArtificialAnalysisCoding 6.9 (21 / 23) · ArtificialAnalysisIntelligence 7.7 (21 / 23) · ArtificialAnalysisReasoning 0.0 (22 / 23) · BlendedCost 92.7 (7 / 24) · ContextWindow 53.4 (23 / 24) · CopilotArenaOrLMArenaCode 87.9 (7 / 24) · GDPval 5.0 (23 / 24) · GPQA_HLE_Reasoning 0.0 (22 / 23) · IFBench 0.0 (22 / 23) · LMArenaCreativeOrOpenEnded 27.6 (20 / 24) · LMArenaSearchDocument 88.0 (6 / 22) · LMArenaText 27.6 (20 / 24) · LongContextRecall 0.0 (22 / 23) · MCPAtlas 92.5 (3 / 16) · OutputSpeed 0.0 (23 / 23) · SWEBenchMultilingual 5.0 (15 / 19) · SWEBenchPro 92.5 (8 / 21) · SWEBenchVerified 78.6 (20 / 23) · SWEComposite 73.9 (13 / 24) · SWERebench 73.1 (13 / 23) · SciCode 0.0 (22 / 23) · SonarBugDensity 92.5 (4 / 22) · SonarComposite 80.6 (5 / 24) · SonarFunctionalSkill 66.8 (16 / 22) · SonarIssueDensity 92.5 (3 / 22) · SonarVulnerabilityDensity 81.6 (8 / 22) · TTFT 91.8 (6 / 23) · Tau2Bench 46.1 (17 / 23) · TerminalBench 44.6 (16 / 24)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · overrides
missing BUILD/GSO
gemini-2.5-flash · google · idea 53.3 / 53.3 · plan 34.9 / 34.9 · build 47.1 / 47.1 · review 52.0

group breakdown

A_B 88.7 (2 / 24) · A_I 89.4 (4 / 24) · A_P 75.4 (1 / 24) · A_R 94.5 (2 / 24) · BUILD 28.7 (23 / 24) · CRE 46.0 (19 / 24) · GEN 14.2 (23 / 24) · LM_ARENA_REVIEW_PROXY 78.8 (9 / 24) · OPS_long 94.1 (2 / 24) · OPS_precision 89.6 (4 / 24) · OPS_review 92.2 (2 / 24) · PLAN 14.2 (23 / 24)

metrics

AI_code 90.7 (3 / 22) · AI_complexity 63.5 (5 / 22) · AI_context_awareness 100.0 (1 / 24) · AI_correctness 100.0 (6 / 22) · AI_edge_cases 100.0 (4 / 22) · AI_efficiency 100.0 (1 / 22) · AI_hallucination_resistance 95.5 (6 / 24) · AI_memory_retention 0.0 (18 / 24) · AI_parameter_accuracy 100.0 (5 / 24) · AI_plan_coherence 56.0 (9 / 24) · AI_recovery 78.9 (20 / 22) · AI_refusal 100.0 (7 / 22) · AI_spec 100.0 (7 / 22) · AI_stability 100.0 (4 / 22) · AI_task_completion 25.0 (20 / 24) · AI_tool_selection 67.8 (14 / 24) · ARC_AGI_2 0.8 (20 / 22) · ArtificialAnalysisCoding 0.0 (22 / 23) · ArtificialAnalysisIntelligence 0.0 (22 / 23) · ArtificialAnalysisReasoning 14.1 (20 / 23) · BlendedCost 94.4 (5 / 24) · ContextWindow 100.0 (3 / 24) · CopilotArenaOrLMArenaCode 66.0 (15 / 24) · GDPval 39.7 (16 / 24) · GPQA_HLE_Reasoning 14.1 (20 / 23) · GSO 19.4 (13 / 16) · IFBench 22.9 (19 / 23) · LMArenaCreativeOrOpenEnded 46.0 (19 / 24) · LMArenaSearchDocument 78.8 (9 / 22) · LMArenaText 46.0 (19 / 24) · LiveCodeBench 100.0 (1 / 2) · LongContextRecall 46.1 (18 / 23) · MCPAtlas 26.6 (11 / 16) · OutputSpeed 100.0 (1 / 23) · SWEBenchMultilingual 92.5 (7 / 19) · SWEBenchPro 52.5 (19 / 21) · SWEBenchVerified 0.0 (23 / 23) · SWEComposite 27.6 (24 / 24) · SWERebench 0.0 (23 / 23) · SciCode 17.5 (19 / 23) · SonarBugDensity 52.7 (14 / 22) · SonarComposite 54.2 (14 / 24) · SonarFunctionalSkill 78.9 (7 / 22) · SonarIssueDensity 13.2 (14 / 22) · SonarVulnerabilityDensity 58.2 (13 / 22) · TTFT 73.4 (16 / 23) · Tau2Bench 0.0 (22 / 23) · TerminalBench 0.3 (23 / 24)
sources aistupidlevel · arc_agi · artificial_analysis · livecodebench · lmarena · openrouter · swebench · swerebench · terminal_bench
missing none
gemini-2.5-pro · google · idea 29.7 / 29.7 · plan 39.3 / 39.3 · build 46.3 / 46.3 · review 44.3

group breakdown

A_B 64.1 (10 / 24) · A_I 80.5 (10 / 24) · A_P 70.7 (3 / 24) · A_R 82.5 (8 / 24) · BUILD 37.5 (21 / 24) · CRE 0.0 (24 / 24) · GEN 17.3 (19 / 24) · LM_ARENA_REVIEW_PROXY 0.0 (24 / 24) · OPS_long 81.9 (11 / 24) · OPS_precision 73.5 (16 / 24) · OPS_review 79.2 (15 / 24) · PLAN 28.8 (21 / 24)

metrics

AI_code 24.0 (11 / 22) · AI_complexity 37.2 (9 / 22) · AI_context_awareness 28.3 (3 / 24) · AI_correctness 92.5 (15 / 22) · AI_edge_cases 89.6 (8 / 22) · AI_efficiency 58.6 (9 / 22) · AI_hallucination_resistance 92.5 (7 / 24) · AI_memory_retention 29.7 (5 / 24) · AI_parameter_accuracy 26.7 (17 / 24) · AI_plan_coherence 92.5 (6 / 24) · AI_recovery 92.5 (14 / 22) · AI_refusal 70.1 (18 / 22) · AI_spec 72.7 (18 / 22) · AI_stability 92.5 (8 / 22) · AI_task_completion 88.3 (11 / 24) · AI_tool_selection 18.6 (16 / 24) · ARC_AGI_2 3.7 (18 / 22) · ArtificialAnalysisCoding 25.8 (19 / 23) · ArtificialAnalysisIntelligence 20.7 (19 / 23) · ArtificialAnalysisReasoning 44.8 (15 / 23) · BlendedCost 80.1 (10 / 24) · ContextWindow 100.0 (4 / 24) · CopilotArenaOrLMArenaCode 0.9 (23 / 24) · GDPval 37.9 (18 / 24) · GPQA_HLE_Reasoning 44.8 (15 / 23) · GSO 0.0 (16 / 16) · IFBench 18.7 (20 / 23) · LMArenaCreativeOrOpenEnded 0.0 (24 / 24) · LMArenaSearchDocument 0.0 (22 / 22) · LMArenaText 0.0 (24 / 24) · LongContextRecall 67.2 (10 / 23) · MCPAtlas 71.1 (8 / 16) · OutputSpeed 91.3 (6 / 23) · SWEBenchMultilingual 36.0 (11 / 19) · SWEBenchPro 75.7 (17 / 21) · SWEBenchVerified 38.2 (22 / 23) · SWEComposite 36.5 (22 / 24) · SWERebench 1.8 (22 / 23) · SciCode 36.1 (17 / 23) · SonarBugDensity 52.7 (15 / 22) · SonarComposite 54.2 (15 / 24) · SonarFunctionalSkill 78.9 (8 / 22) · SonarIssueDensity 13.2 (15 / 22) · SonarVulnerabilityDensity 58.2 (14 / 22) · TTFT 43.3 (19 / 23) · Tau2Bench 3.3 (21 / 23) · TerminalBench 1.8 (22 / 24)
sources arc_agi · artificial_analysis · gso · lmarena · openrouter · swebench · swerebench · terminal_bench
missing none
grok-code-fast-1 · xai · idea 54.1 / 54.1 · plan 32.6 / 32.6 · build 44.8 / 44.8 · review 42.6

group breakdown

A_B 78.5 (5 / 24) · A_I 89.9 (3 / 24) · A_P 67.4 (8 / 24) · A_R 92.2 (3 / 24) · BUILD 29.5 (22 / 24) · CRE 48.2 (18 / 24) · GEN 15.8 (22 / 24) · LM_ARENA_REVIEW_PROXY 15.7 (20 / 24) · OPS_long 85.8 (7 / 24) · OPS_precision 85.7 (8 / 24) · OPS_review 85.6 (7 / 24) · PLAN 12.8 (24 / 24)

metrics

AI_code 47.6 (8 / 22) · AI_complexity 64.6 (4 / 22) · AI_context_awareness 0.0 (22 / 24) · AI_correctness 100.0 (9 / 22) · AI_edge_cases 83.3 (17 / 22) · AI_efficiency 37.6 (19 / 22) · AI_hallucination_resistance 100.0 (3 / 24) · AI_memory_retention 100.0 (2 / 24) · AI_parameter_accuracy 0.0 (22 / 24) · AI_plan_coherence 100.0 (3 / 24) · AI_recovery 97.8 (13 / 22) · AI_refusal 100.0 (15 / 22) · AI_spec 100.0 (15 / 22) · AI_stability 100.0 (6 / 22) · AI_task_completion 0.0 (22 / 24) · AI_tool_selection 0.0 (22 / 24) · ARC_AGI_2 25.1 (10 / 22) · ArtificialAnalysisCoding 0.0 (23 / 23) · ArtificialAnalysisIntelligence 0.0 (23 / 23) · ArtificialAnalysisReasoning 0.0 (23 / 23) · BlendedCost 99.3 (2 / 24) · ContextWindow 78.4 (16 / 24) · CopilotArenaOrLMArenaCode 0.0 (24 / 24) · GDPval 5.0 (24 / 24) · GPQA_HLE_Reasoning 0.0 (23 / 23) · IFBench 0.0 (23 / 23) · LMArenaCreativeOrOpenEnded 48.2 (18 / 24) · LMArenaSearchDocument 15.7 (18 / 22) · LMArenaText 48.2 (18 / 24) · LongContextRecall 0.0 (23 / 23) · OutputSpeed 87.8 (10 / 23) · SWEBenchVerified 82.7 (17 / 23) · SWEComposite 46.1 (19 / 24) · SWERebench 27.9 (21 / 23) · SciCode 0.0 (23 / 23) · SonarComposite 50.0 (20 / 24) · TTFT 79.4 (10 / 23) · Tau2Bench 51.2 (15 / 23) · TerminalBench 0.0 (24 / 24)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · swerebench · terminal_bench
missing BUILD/GSO · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
glm-4.6 · zai · idea 34.7 / 34.7 · plan 30.1 / 30.1 · build 34.3 / 34.3 · review 37.8

group breakdown

A_B 56.0 (19 / 24) · A_I 55.0 (19 / 24) · A_P 47.0 (20 / 24) · A_R 58.5 (19 / 24) · BUILD 20.8 (24 / 24) · CRE 24.5 (21 / 24) · GEN 17.3 (20 / 24) · LM_ARENA_REVIEW_PROXY 50.0 (10 / 24) · OPS_long 80.2 (13 / 24) · OPS_precision 86.7 (7 / 24) · OPS_review 84.3 (8 / 24) · PLAN 17.7 (22 / 24)

metrics

AI_context_awareness 0.0 (23 / 24) · AI_hallucination_resistance 100.0 (4 / 24) · AI_memory_retention 100.0 (3 / 24) · AI_parameter_accuracy 0.0 (23 / 24) · AI_plan_coherence 100.0 (4 / 24) · AI_task_completion 0.0 (23 / 24) · AI_tool_selection 0.0 (23 / 24) · ArtificialAnalysisCoding 18.2 (20 / 23) · ArtificialAnalysisIntelligence 13.3 (20 / 23) · ArtificialAnalysisReasoning 16.5 (18 / 23) · BlendedCost 95.4 (4 / 24) · ContextWindow 75.0 (17 / 24) · CopilotArenaOrLMArenaCode 45.0 (21 / 24) · GDPval 20.5 (21 / 24) · GPQA_HLE_Reasoning 16.5 (18 / 23) · IFBench 4.5 (21 / 23) · LMArenaCreativeOrOpenEnded 24.5 (21 / 24) · LMArenaText 24.5 (21 / 24) · LongContextRecall 9.8 (21 / 23) · MCPAtlas 7.5 (14 / 16) · OutputSpeed 71.9 (21 / 23) · SWEBenchMultilingual 5.0 (16 / 19) · SWEBenchPro 0.0 (21 / 21) · SWEBenchVerified 79.0 (19 / 23) · SWEComposite 27.7 (23 / 24) · SWERebench 38.4 (20 / 23) · SciCode 12.0 (21 / 23) · SonarBugDensity 7.5 (20 / 22) · SonarComposite 10.7 (24 / 24) · SonarFunctionalSkill 7.5 (21 / 22) · SonarIssueDensity 7.5 (19 / 22) · SonarVulnerabilityDensity 29.0 (17 / 22) · TTFT 99.4 (3 / 23) · Tau2Bench 39.7 (18 / 23) · TerminalBench 13.9 (20 / 24)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · swebench · swebench_pro · swerebench · terminal_bench
missing A_B/AI_code · A_B/AI_complexity · A_B/AI_correctness · A_B/AI_edge_cases · A_B/AI_efficiency · A_B/AI_recovery · A_B/AI_spec · A_B/AI_stability · A_I/AI_complexity · A_I/AI_correctness · A_I/AI_edge_cases · A_I/AI_efficiency · A_I/AI_recovery · A_I/AI_spec · A_I/AI_stability · A_P/AI_correctness · A_P/AI_efficiency · A_P/AI_recovery · A_P/AI_spec · A_P/AI_stability · A_R/AI_code · A_R/AI_correctness · A_R/AI_edge_cases · A_R/AI_recovery · A_R/AI_spec · A_R/AI_stability · BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument