$ipbr-rank · live llm coding-role score
refreshed · 14 sources · updated frequently — models drift and degrade
[ idea ]
rank  model  raw  adjusted
1  gemini-3.1-pro-preview  92.5  92.5
2  claude-opus-4.6  88.3  88.3
3  claude-opus-4.7  87.2  86.8
[ plan ]
rank  model  raw  adjusted
1  gpt-5.5  83.5  83.5
2  gemini-3.1-pro-preview  82.0  82.0
3  claude-opus-4.7  77.0  76.0
[ build ]
rank  model  raw  adjusted
1  gpt-5.5  81.5  81.5
2  gemini-3.1-pro-preview  80.6  80.6
3  claude-opus-4.6  76.9  76.9
[ review ]
rank  model  raw  adjusted
1  gemini-3.1-pro-preview  86.4  86.4
2  kimi-k2.6  82.4  82.4
3  gpt-5.5  79.6  79.6
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the LM Arena search/document review proxy, that lead is discounted, scaled by a per-role coefficient, from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted.
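
The adjustment described above can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: only the three coefficients come from the text. The definition of the proxy lead (the model's review-proxy score minus the best review-proxy score from any other vendor, floored at zero) and the function names `proxy_lead` and `adjusted_score` are assumptions made for the example.

```python
# Penalty coefficients per role, as stated above; Review is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def proxy_lead(model_proxy: float, best_other_vendor_proxy: float) -> float:
    """ASSUMED definition of the proxy lead: how far the model's review-proxy
    score sits above the best score from any other vendor, floored at zero."""
    return max(0.0, model_proxy - best_other_vendor_proxy)


def adjusted_score(role: str, raw: float, lead: float) -> float:
    """Subtract the role's share of the proxy lead from the raw score."""
    return raw - PENALTY[role] * lead


# Illustrative inputs: a model with proxy score 100.0 when the best
# other-vendor proxy score is 94.8, and four raw role scores.
lead = proxy_lead(100.0, 94.8)
for role, raw in [("idea", 87.2), ("plan", 77.0), ("build", 75.6), ("review", 78.9)]:
    print(role, round(adjusted_score(role, raw, lead), 2))
```

Under that assumed lead definition, the rounded figures listed on this page for claude-opus-4.7 come out within about 0.1 of its listed adjusted scores. Models that do not lead the proxy get a lead of zero and keep their raw scores.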

missing data

If a model is missing some metrics within a group, the group score depends on how much of the group is covered. Between 60% and 80% group coverage, the score blends from shrink-to-50 toward trusting the present metrics. At 80% coverage and above, the present-weight mean is trusted directly.
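
A rough sketch of this coverage rule follows. Several details are assumptions, because the text only fixes the 60% and 80% thresholds: coverage is measured by metric weight, "shrink-to-50" pulls the present-weight mean toward 50 in proportion to the missing coverage, the blend between the two thresholds is linear, and coverage below 60% is treated as fully shrunk. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def present_weight_mean(
    scores: Dict[str, Optional[float]], weights: Dict[str, float]
) -> Tuple[float, float]:
    """Weighted mean over the metrics that are present, plus group coverage.

    Coverage is ASSUMED to be the share of total weight that is present."""
    present = {k: v for k, v in scores.items() if v is not None}
    present_weight = sum(weights[k] for k in present)
    if present_weight == 0:
        return 50.0, 0.0  # nothing to trust: fall back to the neutral 50
    mean = sum(weights[k] * v for k, v in present.items()) / present_weight
    return mean, present_weight / sum(weights.values())


def group_score(scores: Dict[str, Optional[float]], weights: Dict[str, float]) -> float:
    mean, coverage = present_weight_mean(scores, weights)
    # ASSUMED shrink-to-50: pull the mean toward 50 by the missing share.
    shrunk = 50.0 + coverage * (mean - 50.0)
    # ASSUMED linear ramp: 0 at 60% coverage, 1 at 80% coverage and above.
    trust = min(1.0, max(0.0, (coverage - 0.6) / 0.2))
    return trust * mean + (1.0 - trust) * shrunk


weights = {f"m{i}": 1.0 for i in range(10)}
scores = {f"m{i}": (80.0 if i < 7 else None) for i in range(10)}
print(round(group_score(scores, weights), 2))  # 70% coverage, halfway blend
```

With full coverage the function returns the plain weighted mean, which matches the stated behaviour at 80% coverage and above.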

Full math, role definitions, and source list →

gpt-5.5 · openai · idea 81.1 / 81.1 · plan 83.5 / 83.5 · build 81.5 / 81.5 · review 79.6 (raw / adjusted)

group breakdown

group  score  rank
A_B  59.6  10 / 24
A_I  68.3  16 / 24
A_P  56.4  11 / 24
A_R  80.5  10 / 24
BUILD  89.1  1 / 24
CRE  81.4  7 / 24
GEN  94.1  3 / 24
LM_ARENA_REVIEW_PROXY  28.2  14 / 24
OPS_long  82.2  9 / 24
OPS_precision  79.8  11 / 24
OPS_review  81.2  11 / 24
PLAN  93.0  1 / 24

metrics

metric  score  rank
AI_code  15.3  19 / 22
AI_complexity  37.3  20 / 22
AI_context_awareness  0.0  20 / 24
AI_correctness  94.1  16 / 22
AI_edge_cases  86.5  19 / 22
AI_efficiency  61.9  15 / 22
AI_hallucination_resistance  100.0  9 / 24
AI_memory_retention  7.9  5 / 24
AI_parameter_accuracy  99.1  3 / 24
AI_plan_coherence  5.1  22 / 24
AI_recovery  98.7  14 / 22
AI_refusal  50.0  19 / 22
AI_spec  50.0  19 / 22
AI_stability  89.8  13 / 22
AI_task_completion  100.0  8 / 24
AI_tool_selection  88.4  6 / 24
ARC_AGI_2  96.7  2 / 17
ArtificialAnalysisCoding  100.0  2 / 21
ArtificialAnalysisIntelligence  98.1  3 / 21
ArtificialAnalysisReasoning  100.0  2 / 21
BlendedCost  50.6  23 / 24
ContextWindow  100.0  2 / 24
CopilotArenaOrLMArenaCode  71.9  8 / 22
GDPval  95.0  1 / 16
GPQA_HLE_Reasoning  100.0  2 / 21
GSO  94.0  2 / 15
IFBench  78.1  7 / 21
LMArenaCreativeOrOpenEnded  81.4  7 / 24
LMArenaSearchDocument  28.2  9 / 19
LMArenaText  81.4  7 / 24
LongContextRecall  98.0  3 / 21
OutputSpeed  81.8  10 / 20
SWEBenchPro  95.0  3 / 15
SWEBenchVerified  95.0  7 / 18
SWEComposite  89.9  4 / 24
SWERebench  83.5  8 / 21
SciCode  94.5  4 / 21
SonarBugDensity  94.5  2 / 17
SonarComposite  65.5  5 / 24
SonarFunctionalSkill  46.5  13 / 17
SonarIssueDensity  52.7  4 / 17
SonarVulnerabilityDensity  99.2  2 / 17
TTFT  86.1  7 / 20
Tau2Bench  86.9  8 / 21
TerminalBench  100.0  1 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, openrouter, overrides, sonar, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
gemini-3.1-pro-preview · google · idea 92.5 / 92.5 · plan 82.0 / 82.0 · build 80.6 / 80.6 · review 86.4 (raw / adjusted)

group breakdown

group  score  rank
A_B  74.9  6 / 24
A_I  75.9  7 / 24
A_P  48.9  19 / 24
A_R  83.5  5 / 24
BUILD  83.5  4 / 24
CRE  100.0  2 / 24
GEN  100.0  1 / 24
LM_ARENA_REVIEW_PROXY  92.3  4 / 24
OPS_long  78.2  15 / 24
OPS_precision  66.0  17 / 24
OPS_review  73.8  16 / 24
PLAN  92.6  2 / 24

metrics

metric  score  rank
AI_code  62.2  6 / 22
AI_complexity  72.3  6 / 22
AI_context_awareness  7.5  5 / 24
AI_correctness  92.5  19 / 22
AI_edge_cases  92.5  7 / 22
AI_efficiency  87.9  5 / 22
AI_hallucination_resistance  92.5  16 / 24
AI_memory_retention  7.5  8 / 24
AI_parameter_accuracy  78.1  18 / 24
AI_plan_coherence  26.5  11 / 24
AI_recovery  92.5  17 / 22
AI_refusal  50.0  13 / 22
AI_spec  50.0  13 / 22
AI_stability  92.5  6 / 22
AI_task_completion  27.4  19 / 24
AI_tool_selection  11.6  19 / 24
ARC_AGI_2  100.0  1 / 17
ArtificialAnalysisCoding  100.0  1 / 21
ArtificialAnalysisIntelligence  100.0  2 / 21
ArtificialAnalysisReasoning  100.0  1 / 21
BlendedCost  77.3  13 / 24
ContextWindow  100.0  6 / 24
CopilotArenaOrLMArenaCode  73.4  7 / 22
GDPval  24.7  12 / 16
GPQA_HLE_Reasoning  100.0  1 / 21
GSO  51.3  8 / 15
IFBench  94.4  4 / 21
LMArenaCreativeOrOpenEnded  100.0  2 / 24
LMArenaSearchDocument  92.3  4 / 19
LMArenaText  100.0  2 / 24
LongContextRecall  100.0  2 / 21
MCPAtlas  71.1  6 / 13
OutputSpeed  92.9  5 / 20
SWEBenchPro  89.1  5 / 15
SWEBenchVerified  95.0  4 / 18
SWEComposite  94.8  2 / 24
SWERebench  99.8  2 / 21
SciCode  100.0  2 / 21
SonarBugDensity  52.7  12 / 17
SonarComposite  54.2  12 / 24
SonarFunctionalSkill  78.9  8 / 17
SonarIssueDensity  13.2  13 / 17
SonarVulnerabilityDensity  58.2  11 / 17
TTFT  21.8  19 / 20
Tau2Bench  95.3  5 / 21
TerminalBench  89.4  3 / 22

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.6 · anthropic · idea 88.3 / 88.3 · plan 73.1 / 73.1 · build 76.9 / 76.9 · review 69.9 (raw / adjusted)

group breakdown

group  score  rank
A_B  53.8  18 / 24
A_I  68.0  17 / 24
A_P  53.6  13 / 24
A_R  64.7  16 / 24
BUILD  86.2  2 / 24
CRE  100.0  1 / 24
GEN  89.5  4 / 24
LM_ARENA_REVIEW_PROXY  33.1  13 / 24
OPS_long  79.0  14 / 24
OPS_precision  76.2  13 / 24
OPS_review  78.9  12 / 24
PLAN  73.1  6 / 24

metrics

metric  score  rank
AI_canary_health  83.3  5 / 7
AI_code  24.9  8 / 22
AI_complexity  37.3  11 / 22
AI_context_awareness  0.0  9 / 24
AI_correctness  94.1  7 / 22
AI_edge_cases  86.5  10 / 22
AI_efficiency  65.4  13 / 22
AI_hallucination_resistance  1.8  19 / 24
AI_memory_retention  0.0  14 / 24
AI_parameter_accuracy  85.1  12 / 24
AI_plan_coherence  0.0  23 / 24
AI_recovery  98.7  5 / 22
AI_refusal  50.0  3 / 22
AI_spec  50.0  3 / 22
AI_stability  89.8  8 / 22
AI_task_completion  83.3  10 / 24
AI_tool_selection  99.9  2 / 24
ARC_AGI_2  90.9  4 / 17
ArtificialAnalysisCoding  76.1  5 / 21
ArtificialAnalysisIntelligence  84.0  6 / 21
ArtificialAnalysisReasoning  86.3  5 / 21
BlendedCost  61.9  21 / 24
ContextWindow  99.3  7 / 24
CopilotArenaOrLMArenaCode  99.8  2 / 22
GDPval  72.8  7 / 16
GPQA_HLE_Reasoning  86.3  5 / 21
GSO  75.3  3 / 15
IFBench  30.4  16 / 21
LMArenaCreativeOrOpenEnded  100.0  1 / 24
LMArenaSearchDocument  33.1  8 / 19
LMArenaText  100.0  1 / 24
LongContextRecall  90.2  4 / 21
OutputSpeed  79.0  14 / 20
SWEBenchMultilingual  90.9  2 / 6
SWEBenchPro  100.0  1 / 15
SWEBenchVerified  99.7  2 / 18
SWEComposite  95.7  1 / 24
SWERebench  91.6  4 / 21
SciCode  85.8  5 / 21
SonarBugDensity  59.5  8 / 17
SonarComposite  70.5  4 / 24
SonarFunctionalSkill  92.2  3 / 17
SonarIssueDensity  46.8  6 / 17
SonarVulnerabilityDensity  66.6  7 / 17
TTFT  72.2  15 / 20
Tau2Bench  87.6  7 / 21
TerminalBench  64.2  7 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
claude-opus-4.7 · anthropic · idea 87.2 / 86.8 · plan 77.0 / 76.0 · build 75.6 / 73.9 · review 78.9 (raw / adjusted)

group breakdown

group  score  rank
A_B  52.4  20 / 24
A_I  70.5  12 / 24
A_P  57.4  10 / 24
A_R  63.4  19 / 24
BUILD  86.2  3 / 24
CRE  94.4  4 / 24
GEN  96.7  2 / 24
LM_ARENA_REVIEW_PROXY  100.0  1 / 24
OPS_long  69.5  18 / 24
OPS_precision  59.7  20 / 24
OPS_review  67.2  18 / 24
PLAN  78.8  4 / 24

metrics

metric  score  rank
AI_code  15.3  15 / 22
AI_complexity  37.3  12 / 22
AI_context_awareness  0.0  10 / 24
AI_correctness  94.1  8 / 22
AI_edge_cases  86.5  11 / 22
AI_efficiency  78.2  6 / 22
AI_hallucination_resistance  1.8  20 / 24
AI_memory_retention  0.0  15 / 24
AI_parameter_accuracy  88.5  10 / 24
AI_plan_coherence  20.5  15 / 24
AI_recovery  98.7  6 / 22
AI_refusal  50.0  4 / 22
AI_spec  50.0  4 / 22
AI_stability  86.1  15 / 22
AI_task_completion  100.0  2 / 24
AI_tool_selection  67.9  13 / 24
ARC_AGI_2  92.7  3 / 17
ArtificialAnalysisCoding  90.3  3 / 21
ArtificialAnalysisIntelligence  100.0  1 / 21
ArtificialAnalysisReasoning  95.6  3 / 21
BlendedCost  61.9  22 / 24
ContextWindow  99.3  8 / 24
CopilotArenaOrLMArenaCode  100.0  1 / 22
GDPval  93.9  2 / 16
GPQA_HLE_Reasoning  95.6  3 / 21
GSO  100.0  1 / 15
IFBench  45.1  11 / 21
LMArenaCreativeOrOpenEnded  94.4  4 / 24
LMArenaSearchDocument  100.0  1 / 19
LMArenaText  94.4  4 / 24
LongContextRecall  88.2  6 / 21
OutputSpeed  79.0  13 / 20
SWEBenchPro  95.0  2 / 15
SWEBenchVerified  95.0  3 / 18
SWEComposite  90.7  3 / 24
SWERebench  85.3  6 / 21
SciCode  100.0  1 / 21
SonarBugDensity  50.1  14 / 17
SonarComposite  51.4  13 / 24
SonarFunctionalSkill  93.9  2 / 17
SonarIssueDensity  0.0  17 / 17
SonarVulnerabilityDensity  25.3  14 / 17
TTFT  25.0  18 / 20
Tau2Bench  79.8  10 / 21
TerminalBench  78.2  4 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
claude-opus-4.5 · anthropic · idea 72.8 / 72.8 · plan 67.7 / 67.7 · build 74.0 / 74.0 · review 67.1 (raw / adjusted)

group breakdown

group  score  rank
A_B  61.8  8 / 24
A_I  72.5  8 / 24
A_P  63.7  2 / 24
A_R  81.1  6 / 24
BUILD  79.5  5 / 24
CRE  73.4  12 / 24
GEN  70.4  7 / 24
LM_ARENA_REVIEW_PROXY  11.2  21 / 24
OPS_long  76.7  16 / 24
OPS_precision  73.4  15 / 24
OPS_review  73.6  17 / 24
PLAN  67.1  9 / 24

metrics

metric  score  rank
AI_canary_health  88.5  2 / 7
AI_code  24.9  7 / 22
AI_complexity  37.3  10 / 22
AI_context_awareness  0.0  8 / 24
AI_correctness  94.1  6 / 22
AI_edge_cases  86.5  9 / 22
AI_efficiency  71.3  9 / 22
AI_hallucination_resistance  100.0  1 / 24
AI_memory_retention  0.0  13 / 24
AI_parameter_accuracy  99.8  2 / 24
AI_plan_coherence  46.2  6 / 24
AI_recovery  98.7  4 / 22
AI_refusal  50.0  2 / 22
AI_spec  50.0  2 / 22
AI_stability  86.1  14 / 22
AI_task_completion  100.0  1 / 24
AI_tool_selection  92.2  5 / 24
ArtificialAnalysisCoding  75.1  6 / 21
ArtificialAnalysisIntelligence  71.5  8 / 21
ArtificialAnalysisReasoning  63.7  10 / 21
BlendedCost  61.9  20 / 24
ContextWindow  74.7  21 / 24
CopilotArenaOrLMArenaCode  76.8  6 / 22
GDPval  71.5  8 / 16
GPQA_HLE_Reasoning  63.7  10 / 21
GSO  59.3  5 / 15
IFBench  43.5  12 / 21
LMArenaCreativeOrOpenEnded  73.4  12 / 24
LMArenaSearchDocument  11.2  16 / 19
LMArenaText  73.4  12 / 24
LongContextRecall  100.0  1 / 21
OutputSpeed  81.3  12 / 20
SWEBenchPro  88.4  6 / 15
SWEBenchVerified  92.2  9 / 18
SWEComposite  83.8  7 / 24
SWERebench  76.5  9 / 21
SciCode  72.7  7 / 21
SonarBugDensity  73.7  5 / 17
SonarComposite  87.1  1 / 24
SonarFunctionalSkill  100.0  1 / 17
SonarIssueDensity  77.2  3 / 17
SonarVulnerabilityDensity  87.2  3 / 17
TTFT  72.7  14 / 20
Tau2Bench  81.8  9 / 21
TerminalBench  54.8  11 / 22

sources: aistupidlevel, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, GEN/ARC_AGI_2, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
kimi-k2.6 · moonshot · idea 74.4 / 74.4 · plan 74.7 / 74.7 · build 71.8 / 71.8 · review 82.4 (raw / adjusted)

group breakdown

group  score  rank
A_B  59.1  13 / 24
A_I  66.6  19 / 24
A_P  51.4  14 / 24
A_R  80.0  11 / 24
BUILD  74.2  7 / 24
CRE  79.3  8 / 24
GEN  74.1  5 / 24
LM_ARENA_REVIEW_PROXY  94.8  2 / 24
OPS_long  70.7  17 / 24
OPS_precision  79.2  12 / 24
OPS_review  77.8  13 / 24
PLAN  87.8  3 / 24

metrics

metric  score  rank
AI_code  20.1  13 / 22
AI_complexity  37.3  16 / 22
AI_context_awareness  0.0  16 / 24
AI_correctness  94.1  12 / 22
AI_edge_cases  86.5  15 / 22
AI_efficiency  56.5  17 / 22
AI_hallucination_resistance  100.0  5 / 24
AI_memory_retention  0.0  23 / 24
AI_parameter_accuracy  78.9  15 / 24
AI_plan_coherence  7.7  20 / 24
AI_recovery  98.7  10 / 22
AI_refusal  50.0  15 / 22
AI_spec  50.0  15 / 22
AI_stability  80.4  17 / 22
AI_task_completion  83.3  13 / 24
AI_tool_selection  61.5  14 / 24
ARC_AGI_2  11.9  9 / 17
ArtificialAnalysisCoding  72.8  7 / 21
ArtificialAnalysisIntelligence  87.5  4 / 21
ArtificialAnalysisReasoning  87.6  4 / 21
BlendedCost  89.1  9 / 24
ContextWindow  78.8  14 / 24
CopilotArenaOrLMArenaCode  94.4  4 / 22
GDPval  52.1  10 / 16
GPQA_HLE_Reasoning  87.6  4 / 21
IFBench  91.5  5 / 21
LMArenaCreativeOrOpenEnded  79.3  8 / 24
LMArenaSearchDocument  94.8  2 / 19
LMArenaText  79.3  8 / 24
LongContextRecall  85.3  7 / 21
MCPAtlas  92.5  2 / 13
OutputSpeed  57.1  19 / 20
SWEBenchVerified  95.0  5 / 18
SWEComposite  66.0  14 / 24
SWERebench  73.1  12 / 21
SciCode  94.5  3 / 21
SonarBugDensity  92.5  3 / 17
SonarComposite  80.6  3 / 24
SonarFunctionalSkill  66.8  12 / 17
SonarIssueDensity  92.5  2 / 17
SonarVulnerabilityDensity  81.6  5 / 17
TTFT  92.8  5 / 20
Tau2Bench  96.0  3 / 21
TerminalBench  74.6  5 / 22

sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
glm-5.1 · zai · idea 74.7 / 74.7 · plan 64.8 / 64.8 · build 70.8 / 70.8 · review 76.7 (raw / adjusted)

group breakdown

group  score  rank
A_B  57.8  14 / 24
A_I  64.1  20 / 24
A_P  51.1  15 / 24
A_R  75.5  14 / 24
BUILD  73.3  9 / 24
CRE  86.8  5 / 24
GEN  57.6  12 / 24
LM_ARENA_REVIEW_PROXY  88.0  5 / 24
OPS_long  83.8  8 / 24
OPS_precision  88.5  5 / 24
OPS_review  85.9  6 / 24
PLAN  72.9  7 / 24

metrics

metric  score  rank
AI_code  24.6  10 / 22
AI_complexity  39.2  8 / 22
AI_context_awareness  7.5  6 / 24
AI_correctness  87.5  20 / 22
AI_edge_cases  81.0  20 / 22
AI_efficiency  55.6  19 / 22
AI_hallucination_resistance  92.5  17 / 24
AI_memory_retention  7.5  9 / 24
AI_parameter_accuracy  74.5  19 / 24
AI_plan_coherence  14.0  19 / 24
AI_recovery  91.4  18 / 22
AI_refusal  50.0  22 / 22
AI_spec  50.0  22 / 22
AI_stability  75.8  19 / 22
AI_task_completion  78.3  14 / 24
AI_tool_selection  59.7  15 / 24
ARC_AGI_2  5.2  11 / 17
ArtificialAnalysisCoding  39.5  13 / 21
ArtificialAnalysisIntelligence  60.5  9 / 21
ArtificialAnalysisReasoning  54.0  13 / 21
BlendedCost  93.0  6 / 24
ContextWindow  74.9  19 / 24
CopilotArenaOrLMArenaCode  95.9  3 / 22
GDPval  59.5  9 / 16
GPQA_HLE_Reasoning  54.0  13 / 21
IFBench  84.0  6 / 21
LMArenaCreativeOrOpenEnded  86.8  5 / 24
LMArenaSearchDocument  88.0  5 / 19
LMArenaText  86.8  5 / 24
LongContextRecall  41.2  17 / 21
MCPAtlas  100.0  1 / 13
OutputSpeed  78.7  15 / 20
SWEBenchMultilingual  50.9  3 / 6
SWEBenchVerified  91.9  10 / 18
SWEComposite  78.6  8 / 24
SWERebench  100.0  1 / 21
SciCode  40.4  14 / 21
SonarBugDensity  100.0  1 / 17
SonarComposite  86.0  2 / 24
SonarFunctionalSkill  69.8  9 / 17
SonarIssueDensity  100.0  1 / 17
SonarVulnerabilityDensity  87.2  4 / 17
TTFT  100.0  1 / 20
Tau2Bench  100.0  2 / 21
TerminalBench  55.8  10 / 22

sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/GSO, SWEComposite/SWEBenchPro
gpt-5.4 · openai · idea 68.9 / 68.9 · plan 50.7 / 50.7 · build 69.6 / 69.6 · review 60.5 (raw / adjusted)

group breakdown

group  score  rank
A_B  59.3  12 / 24
A_I  68.8  14 / 24
A_P  57.8  7 / 24
A_R  79.5  12 / 24
BUILD  73.7  8 / 24
CRE  76.5  9 / 24
GEN  44.8  16 / 24
LM_ARENA_REVIEW_PROXY  17.6  20 / 24
OPS_long  92.1  3 / 24
OPS_precision  88.1  6 / 24
OPS_review  89.8  4 / 24
PLAN  43.1  17 / 24

metrics

metric  score  rank
AI_code  15.3  18 / 22
AI_complexity  37.3  19 / 22
AI_context_awareness  0.0  19 / 24
AI_correctness  94.1  15 / 22
AI_edge_cases  86.5  18 / 22
AI_efficiency  70.9  10 / 22
AI_hallucination_resistance  100.0  8 / 24
AI_memory_retention  2.1  10 / 24
AI_parameter_accuracy  96.6  5 / 24
AI_plan_coherence  17.9  18 / 24
AI_recovery  98.7  13 / 22
AI_refusal  50.0  18 / 22
AI_spec  50.0  18 / 22
AI_stability  80.4  18 / 22
AI_task_completion  100.0  7 / 24
AI_tool_selection  87.1  8 / 24
ARC_AGI_2  75.8  5 / 17
ArtificialAnalysisCoding  33.7  15 / 21
ArtificialAnalysisIntelligence  27.4  16 / 21
ArtificialAnalysisReasoning  15.5  18 / 21
BlendedCost  75.0  15 / 24
ContextWindow  100.0  1 / 24
CopilotArenaOrLMArenaCode  68.0  11 / 22
GDPval  81.4  4 / 16
GPQA_HLE_Reasoning  15.5  18 / 21
GSO  54.0  6 / 15
IFBench  60.5  10 / 21
LMArenaCreativeOrOpenEnded  76.5  9 / 24
LMArenaSearchDocument  17.6  15 / 19
LMArenaText  76.5  9 / 24
LongContextRecall  24.5  18 / 21
MCPAtlas  72.8  4 / 13
OutputSpeed  96.1  4 / 20
SWEBenchPro  92.5  4 / 15
SWEBenchVerified  95.0  6 / 18
SWEComposite  88.9  5 / 24
SWERebench  83.5  7 / 21
SciCode  12.0  18 / 21
SonarBugDensity  84.7  4 / 17
SonarComposite  60.4  6 / 24
SonarFunctionalSkill  66.8  11 / 17
SonarIssueDensity  6.8  15 / 17
SonarVulnerabilityDensity  100.0  1 / 17
TTFT  83.8  8 / 20
Tau2Bench  0.0  21 / 21
TerminalBench  100.0  2 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-sonnet-4.6 · anthropic · idea 70.8 / 70.8 · plan 60.4 / 60.4 · build 67.9 / 67.9 · review 60.6 (raw / adjusted)

group breakdown

group  score  rank
A_B  53.5  19 / 24
A_I  70.8  11 / 24
A_P  59.3  4 / 24
A_R  64.3  17 / 24
BUILD  76.2  6 / 24
CRE  73.9  11 / 24
GEN  65.7  8 / 24
LM_ARENA_REVIEW_PROXY  23.3  15 / 24
OPS_long  67.1  19 / 24
OPS_precision  54.2  22 / 24
OPS_review  64.1  21 / 24
PLAN  58.8  11 / 24

metrics

metric  score  rank
AI_canary_health  88.2  3 / 7
AI_code  20.1  12 / 22
AI_complexity  37.3  15 / 22
AI_context_awareness  0.0  13 / 24
AI_correctness  94.1  11 / 22
AI_edge_cases  86.5  14 / 22
AI_efficiency  74.5  8 / 22
AI_hallucination_resistance  1.8  22 / 24
AI_memory_retention  0.0  18 / 24
AI_parameter_accuracy  90.7  8 / 24
AI_plan_coherence  20.5  16 / 24
AI_recovery  98.7  9 / 22
AI_refusal  50.0  7 / 22
AI_spec  50.0  7 / 22
AI_stability  89.8  11 / 22
AI_task_completion  100.0  5 / 24
AI_tool_selection  96.0  3 / 24
ARC_AGI_2  10.6  10 / 17
ArtificialAnalysisCoding  85.1  4 / 21
ArtificialAnalysisIntelligence  79.1  7 / 21
ArtificialAnalysisReasoning  68.7  9 / 21
BlendedCost  74.4  18 / 24
ContextWindow  99.3  11 / 24
CopilotArenaOrLMArenaCode  93.2  5 / 22
GDPval  80.1  6 / 16
GPQA_HLE_Reasoning  68.7  9 / 21
GSO  30.7  10 / 15
IFBench  39.7  14 / 21
LMArenaCreativeOrOpenEnded  73.9  11 / 24
LMArenaSearchDocument  23.3  10 / 19
LMArenaText  73.9  11 / 24
LongContextRecall  90.2  5 / 21
MCPAtlas  69.8  7 / 13
OutputSpeed  81.5  11 / 20
SWEBenchPro  76.5  10 / 15
SWEBenchVerified  90.3  11 / 18
SWEComposite  87.4  6 / 24
SWERebench  95.7  3 / 21
SciCode  57.9  9 / 21
SonarBugDensity  65.8  6 / 17
SonarComposite  55.8  8 / 24
SonarFunctionalSkill  84.5  4 / 17
SonarIssueDensity  22.3  10 / 17
SonarVulnerabilityDensity  21.8  15 / 17
TTFT  0.0  20 / 20
Tau2Bench  51.2  13 / 21
TerminalBench  47.4  14 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench
missing: SWEComposite/SWEBenchMultilingual
gemini-3-pro · google · idea 79.7 / 79.7 · plan 54.4 / 54.4 · build 66.1 / 66.1 · review 60.3 (raw / adjusted)

group breakdown

group  score  rank
A_B  79.3  3 / 24
A_I  80.4  2 / 24
A_P  48.8  20 / 24
A_R  89.4  1 / 24
BUILD  64.4  11 / 24
CRE  94.7  3 / 24
GEN  59.9  11 / 24
LM_ARENA_REVIEW_PROXY  19.9  18 / 24
OPS_long  45.2  23 / 24
OPS_precision  48.0  23 / 24
OPS_review  43.0  24 / 24
PLAN  55.2  13 / 24

metrics

metric  score  rank
AI_code  64.3  3 / 22
AI_complexity  76.3  3 / 22
AI_context_awareness  0.0  14 / 24
AI_correctness  100.0  2 / 22
AI_edge_cases  100.0  2 / 22
AI_efficiency  94.5  2 / 22
AI_hallucination_resistance  100.0  4 / 24
AI_memory_retention  0.0  21 / 24
AI_parameter_accuracy  83.0  13 / 24
AI_plan_coherence  22.3  14 / 24
AI_recovery  100.0  1 / 22
AI_refusal  50.0  12 / 22
AI_spec  50.0  12 / 22
AI_stability  100.0  2 / 22
AI_task_completion  23.4  20 / 24
AI_tool_selection  4.8  20 / 24
ARC_AGI_2  41.9  6 / 17
BlendedCost  77.3  12 / 24
ContextWindow  0.0  24 / 24
CopilotArenaOrLMArenaCode  68.4  10 / 22
GDPval  5.0  16 / 16
GSO  40.7  9 / 15
LMArenaCreativeOrOpenEnded  94.7  3 / 24
LMArenaSearchDocument  19.9  13 / 19
LMArenaText  94.7  3 / 24
MCPAtlas  74.9  3 / 13
SWEBenchMultilingual  33.5  4 / 6
SWEBenchPro  80.3  8 / 15
SWEBenchVerified  82.9  13 / 18
SWEComposite  72.1  11 / 24
SWERebench  70.6  14 / 21
SonarBugDensity  53.2  9 / 17
SonarComposite  54.9  9 / 24
SonarFunctionalSkill  84.1  5 / 17
SonarIssueDensity  6.7  16 / 17
SonarVulnerabilityDensity  59.7  8 / 17
TerminalBench  61.2  8 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/LongContextRecall, BUILD/SciCode, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench
gemini-3-flash · google · idea 78.9 / 78.9 · plan 62.1 / 62.1 · build 65.7 / 65.7 · review 64.1 (raw / adjusted)

group breakdown

group  score  rank
A_B  74.9  5 / 24
A_I  75.9  6 / 24
A_P  48.9  18 / 24
A_R  83.5  4 / 24
BUILD  59.0  12 / 24
CRE  86.1  6 / 24
GEN  61.6  10 / 24
LM_ARENA_REVIEW_PROXY  20.0  17 / 24
OPS_long  94.0  2 / 24
OPS_precision  90.5  3 / 24
OPS_review  92.6  1 / 24
PLAN  64.5  10 / 24

metrics

metric  score  rank
AI_code  62.2  5 / 22
AI_complexity  72.3  5 / 22
AI_context_awareness  7.5  4 / 24
AI_correctness  92.5  18 / 22
AI_edge_cases  92.5  6 / 22
AI_efficiency  87.9  4 / 22
AI_hallucination_resistance  92.5  15 / 24
AI_memory_retention  7.5  7 / 24
AI_parameter_accuracy  78.1  17 / 24
AI_plan_coherence  26.5  10 / 24
AI_recovery  92.5  16 / 22
AI_refusal  50.0  11 / 22
AI_spec  50.0  11 / 22
AI_stability  92.5  5 / 22
AI_task_completion  27.4  18 / 24
AI_tool_selection  11.6  18 / 24
ARC_AGI_2  3.1  14 / 17
ArtificialAnalysisCoding  58.3  9 / 21
ArtificialAnalysisIntelligence  58.9  12 / 21
ArtificialAnalysisReasoning  82.7  7 / 21
BlendedCost  91.5  8 / 24
ContextWindow  100.0  5 / 24
CopilotArenaOrLMArenaCode  68.0  12 / 22
GDPval  8.0  14 / 16
GPQA_HLE_Reasoning  82.7  7 / 21
GSO  14.0  13 / 15
IFBench  96.8  3 / 21
LMArenaCreativeOrOpenEnded  86.1  6 / 24
LMArenaSearchDocument  20.0  12 / 19
LMArenaText  86.1  6 / 24
LongContextRecall  68.6  8 / 21
MCPAtlas  22.4  9 / 13
OutputSpeed  98.1  3 / 20
SWEBenchMultilingual  100.0  1 / 6
SWEBenchPro  53.0  12 / 15
SWEBenchVerified  100.0  1 / 18
SWEComposite  74.1  9 / 24
SWERebench  76.3  10 / 21
SciCode  78.7  6 / 21
SonarBugDensity  52.7  11 / 17
SonarComposite  54.2  11 / 24
SonarFunctionalSkill  78.9  7 / 17
SonarIssueDensity  13.2  12 / 17
SonarVulnerabilityDensity  58.2  10 / 17
TTFT  79.5  9 / 20
Tau2Bench  61.6  11 / 21
TerminalBench  48.3  12 / 22

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
gpt-5.3-codex · openai · idea 66.5 / 66.5 · plan 54.4 / 54.4 · build 63.7 / 63.7 · review 69.7 (raw / adjusted)

group breakdown

group  score  rank
A_B  60.6  9 / 24
A_I  67.7  18 / 24
A_P  50.2  16 / 24
A_R  80.6  8 / 24
BUILD  66.2  10 / 24
CRE  72.5  13 / 24
GEN  55.6  13 / 24
LM_ARENA_REVIEW_PROXY  92.5  3 / 24
OPS_long  58.0  21 / 24
OPS_precision  60.6  19 / 24
OPS_review  64.1  20 / 24
PLAN  54.9  14 / 24

metrics

metric  score  rank
AI_code  20.1  14 / 22
AI_complexity  37.3  18 / 22
AI_context_awareness  0.0  18 / 24
AI_correctness  94.1  14 / 22
AI_edge_cases  86.5  17 / 22
AI_efficiency  68.7  12 / 22
AI_hallucination_resistance  100.0  7 / 24
AI_memory_retention  0.0  24 / 24
AI_parameter_accuracy  91.4  6 / 24
AI_plan_coherence  0.0  24 / 24
AI_recovery  98.7  12 / 22
AI_refusal  50.0  17 / 22
AI_spec  50.0  17 / 22
AI_stability  86.1  16 / 22
AI_task_completion  66.7  15 / 24
AI_tool_selection  69.1  12 / 24
BlendedCost  76.6  14 / 24
ContextWindow  85.3  13 / 24
CopilotArenaOrLMArenaCode  59.3  14 / 22
GDPval  51.5  11 / 16
GSO  53.4  7 / 15
LMArenaCreativeOrOpenEnded  72.5  13 / 24
LMArenaSearchDocument  92.5  3 / 19
LMArenaText  72.5  13 / 24
SWEBenchVerified  92.5  8 / 18
SWEComposite  72.2  10 / 24
SWERebench  89.5  5 / 21
SonarComposite  50.0  18 / 24
TerminalBench  74.3  6 / 22

sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, GEN/ARC_AGI_2, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
grok-4-latest · xai · idea 76.9 / 76.9 · plan 68.6 / 68.6 · build 58.4 / 58.4 · review 62.0 (raw / adjusted)

group breakdown

group  score  rank
A_B  83.5  1 / 24
A_I  79.9  3 / 24
A_P  57.4  9 / 24
A_R  87.9  2 / 24
BUILD  45.2  18 / 24
CRE  76.0  10 / 24
GEN  72.8  6 / 24
LM_ARENA_REVIEW_PROXY  19.2  19 / 24
OPS_long  84.6  7 / 24
OPS_precision  74.8  14 / 24
OPS_review  77.4  14 / 24
PLAN  71.2  8 / 24

metrics

metric  score  rank
AI_code  95.9  2 / 22
AI_complexity  100.0  1 / 22
AI_context_awareness  0.0  21 / 24
AI_correctness  100.0  3 / 22
AI_edge_cases  100.0  3 / 22
AI_efficiency  0.0  22 / 22
AI_hallucination_resistance  100.0  10 / 24
AI_memory_retention  100.0  1 / 24
AI_parameter_accuracy  0.0  21 / 24
AI_plan_coherence  100.0  1 / 24
AI_recovery  74.1  20 / 22
AI_refusal  50.0  20 / 22
AI_spec  50.0  20 / 22
AI_stability  100.0  3 / 22
AI_task_completion  0.0  21 / 24
AI_tool_selection  0.0  21 / 24
ARC_AGI_2  20.7  8 / 17
ArtificialAnalysisCoding  53.1  10 / 21
ArtificialAnalysisIntelligence  84.8  5 / 21
ArtificialAnalysisReasoning  83.9  6 / 21
BlendedCost  74.4  19 / 24
ContextWindow  78.4  15 / 24
CopilotArenaOrLMArenaCode  58.0  15 / 22
GPQA_HLE_Reasoning  83.9  6 / 21
IFBench  100.0  2 / 21
LMArenaCreativeOrOpenEnded  76.0  10 / 24
LMArenaSearchDocument  19.2  14 / 19
LMArenaText  76.0  10 / 24
LongContextRecall  58.8  13 / 21
OutputSpeed  100.0  1 / 20
SWEComposite  45.6  19 / 24
SWERebench  39.1  17 / 21
SciCode  60.7  8 / 21
SonarComposite  50.0  19 / 24
TTFT  51.8  16 / 20
Tau2Bench  100.0  1 / 21
TerminalBench  11.8  19 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
claude-sonnet-4.5 · anthropic · idea 62.6 / 62.6 · plan 49.4 / 49.4 · build 53.9 / 53.9 · review 47.7 (raw / adjusted)

group breakdown

group  score  rank
A_B  52.0  21 / 24
A_I  72.2  9 / 24
A_P  61.7  3 / 24
A_R  63.8  18 / 24
BUILD  52.5  13 / 24
CRE  64.1  15 / 24
GEN  42.2  17 / 24
LM_ARENA_REVIEW_PROXY  1.8  22 / 24
OPS_long  80.7  11 / 24
OPS_precision  80.0  9 / 24
OPS_review  82.3  9 / 24
PLAN  40.9  18 / 24

metrics

metric  score  rank
AI_canary_health  80.3  6 / 7
AI_code  15.3  17 / 22
AI_complexity  37.3  14 / 22
AI_context_awareness  0.0  12 / 24
AI_correctness  94.1  10 / 22
AI_edge_cases  86.5  13 / 22
AI_efficiency  69.9  11 / 22
AI_hallucination_resistance  1.8  21 / 24
AI_memory_retention  0.0  17 / 24
AI_parameter_accuracy  85.7  11 / 24
AI_plan_coherence  38.5  7 / 24
AI_recovery  98.7  8 / 22
AI_refusal  50.0  6 / 22
AI_spec  50.0  6 / 22
AI_stability  89.8  10 / 22
AI_task_completion  100.0  4 / 24
AI_tool_selection  87.1  7 / 24
ARC_AGI_2  3.7  12 / 17
ArtificialAnalysisCoding  45.3  12 / 21
ArtificialAnalysisIntelligence  46.0  13 / 21
ArtificialAnalysisReasoning  35.3  15 / 21
BlendedCost  74.4  17 / 24
ContextWindow  99.3  10 / 24
CopilotArenaOrLMArenaCode  53.4  16 / 22
GDPval  81.9  3 / 16
GPQA_HLE_Reasoning  35.3  15 / 21
GSO  27.3  11 / 15
IFBench  41.6  13 / 21
LMArenaCreativeOrOpenEnded  64.1  15 / 24
LMArenaSearchDocument  1.8  17 / 19
LMArenaText  64.1  15 / 24
LongContextRecall  65.7  10 / 21
MCPAtlas  6.6  12 / 13
OutputSpeed  78.3  17 / 20
SWEBenchMultilingual  3.9  5 / 6
SWEBenchPro  81.2  7 / 15
SWEBenchVerified  85.7  12 / 18
SWEComposite  71.6  12 / 24
SWERebench  74.9  11 / 21
SciCode  46.4  13 / 21
SonarBugDensity  2.8  16 / 17
SonarComposite  15.6  23 / 24
SonarFunctionalSkill  17.2  15 / 17
SonarIssueDensity  30.0  9 / 17
SonarVulnerabilityDensity  4.6  16 / 17
TTFT  76.3  11 / 20
Tau2Bench  56.5  12 / 21
TerminalBench  37.4  15 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
gpt-5.2 · openai · idea 63.9 / 63.9 · plan 55.3 / 55.3 · build 53.2 / 53.2 · review 56.1 (raw / adjusted)

group breakdown

group  score  rank
A_B  56.1  15 / 24
A_I  68.5  15 / 24
A_P  56.4  12 / 24
A_R  79.0  13 / 24
BUILD  50.8  14 / 24
CRE  67.9  14 / 24
GEN  52.2  14 / 24
LM_ARENA_REVIEW_PROXY  21.2  16 / 24
OPS_long  58.3  20 / 24
OPS_precision  61.3  18 / 24
OPS_review  64.8  19 / 24
PLAN  55.5  12 / 24

metrics

metric  score  rank
AI_code  0.7  21 / 22
AI_complexity  37.3  17 / 22
AI_context_awareness  0.0  17 / 24
AI_correctness  94.1  13 / 22
AI_edge_cases  86.5  16 / 22
AI_efficiency  61.9  16 / 22
AI_hallucination_resistance  100.0  6 / 24
AI_memory_retention  1.5  11 / 24
AI_parameter_accuracy  91.1  7 / 24
AI_plan_coherence  7.7  21 / 24
AI_recovery  98.7  11 / 22
AI_refusal  50.0  16 / 22
AI_spec  50.0  16 / 22
AI_stability  89.8  12 / 22
AI_task_completion  100.0  6 / 24
AI_tool_selection  92.2  4 / 24
ARC_AGI_2  0.0  17 / 17
ArtificialAnalysisCoding  63.4  8 / 21
ArtificialAnalysisIntelligence  59.7  10 / 21
ArtificialAnalysisReasoning  56.4  11 / 21
BlendedCost  80.1  11 / 24
ContextWindow  85.3  12 / 24
CopilotArenaOrLMArenaCode  38.7  20 / 22
GPQA_HLE_Reasoning  56.4  11 / 21
GSO  64.7  4 / 15
IFBench  62.7  9 / 21
LMArenaCreativeOrOpenEnded  67.9  14 / 24
LMArenaSearchDocument  21.2  11 / 19
LMArenaText  67.9  14 / 24
LongContextRecall  53.9  15 / 21
SWEBenchMultilingual  0.0  6 / 6
SWEBenchPro  38.2  14 / 15
SWEBenchVerified  81.3  14 / 18
SWEComposite  45.6  20 / 24
SciCode  54.6  10 / 21
SonarBugDensity  64.2  7 / 17
SonarComposite  59.7  7 / 24
SonarFunctionalSkill  67.2  10 / 17
SonarIssueDensity  35.7  8 / 17
SonarVulnerabilityDensity  73.4  6 / 17
Tau2Bench  48.1  15 / 21
TerminalBench  58.2  9 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/GDPval, BUILD/MCPAtlas, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/MCPAtlas, SWEComposite/SWERebench
claude-sonnet-4 · anthropic · idea 26.5 / 26.5 · plan 36.6 / 36.6 · build 53.0 / 53.0 · review 58.7 (raw / adjusted)

group breakdown

group  score  rank
A_B  59.4  11 / 24
A_I  70.5  13 / 24
A_P  59.2  5 / 24
A_R  80.5  9 / 24
BUILD  49.4  16 / 24
CRE  0.0  23 / 24
GEN  14.0  22 / 24
LM_ARENA_REVIEW_PROXY  86.2  6 / 24
OPS_long  80.7  10 / 24
OPS_precision  79.8  10 / 24
OPS_review  82.2  10 / 24
PLAN  29.7  19 / 24

metrics

metric  score  rank
AI_code  15.3  16 / 22
AI_complexity  37.3  13 / 22
AI_context_awareness  0.0  11 / 24
AI_correctness  94.1  9 / 22
AI_edge_cases  86.5  12 / 22
AI_efficiency  63.9  14 / 22
AI_hallucination_resistance  100.0  2 / 24
AI_memory_retention  0.0  16 / 24
AI_parameter_accuracy  89.7  9 / 24
AI_plan_coherence  25.6  12 / 24
AI_recovery  98.7  7 / 22
AI_refusal  50.0  5 / 22
AI_spec  50.0  5 / 22
AI_stability  89.8  9 / 22
AI_task_completion  100.0  3 / 24
AI_tool_selection  85.8  9 / 24
ARC_AGI_2  0.2  16 / 17
ArtificialAnalysisCoding  30.7  16 / 21
ArtificialAnalysisIntelligence  29.7  15 / 21
ArtificialAnalysisReasoning  8.6  19 / 21
BlendedCost  74.4  16 / 24
ContextWindow  99.3  9 / 24
CopilotArenaOrLMArenaCode  52.9  18 / 22
GDPval  80.1  5 / 16
GPQA_HLE_Reasoning  8.6  19 / 21
GSO  6.0  14 / 15
IFBench  34.7  15 / 21
LMArenaCreativeOrOpenEnded  0.0  23 / 24
LMArenaSearchDocument  86.2  6 / 19
LMArenaText  0.0  23 / 24
LiveCodeBench  0.0  2 / 2
LongContextRecall  60.8  11 / 21
MCPAtlas  13.1  10 / 13
OutputSpeed  78.6  16 / 20
SWEBenchPro  78.4  9 / 15
SWEBenchVerified  69.9  16 / 18
SWEComposite  66.6  13 / 24
SWERebench  55.1  15 / 21
SciCode  20.8  17 / 21
SonarBugDensity  0.0  17 / 17
SonarComposite  19.5  22 / 24
SonarFunctionalSkill  26.4  14 / 17
SonarIssueDensity  35.8  7 / 17
SonarVulnerabilityDensity  0.0  17 / 17
TTFT  75.6  13 / 20
Tau2Bench  26.6  18 / 21
TerminalBench  47.4  13 / 22

sources: aistupidlevel, arc_agi, artificial_analysis, gso, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: SWEComposite/SWEBenchMultilingual
glm-4.7 · zai · idea 33.0 / 33.0 · plan 50.6 / 50.6 · build 51.7 / 51.7 · review 55.0 (raw / adjusted)

group breakdown

group  score  rank
A_B  56.0  17 / 24
A_I  55.0  22 / 24
A_P  47.0  22 / 24
A_R  58.5  22 / 24
BUILD  44.5  19 / 24
CRE  10.2  22 / 24
GEN  35.8  18 / 24
LM_ARENA_REVIEW_PROXY  50.0  12 / 24
OPS_long  90.4  4 / 24
OPS_precision  92.2  1 / 24
OPS_review  89.8  3 / 24
PLAN  54.3  15 / 24

metrics

metric  score  rank
AI_context_awareness  0.0  24 / 24
AI_hallucination_resistance  100.0  13 / 24
AI_memory_retention  100.0  4 / 24
AI_parameter_accuracy  0.0  24 / 24
AI_plan_coherence  100.0  4 / 24
AI_task_completion  0.0  24 / 24
AI_tool_selection  0.0  24 / 24
ArtificialAnalysisCoding  37.9  14 / 21
ArtificialAnalysisIntelligence  42.6  14 / 21
ArtificialAnalysisReasoning  55.8  12 / 21
BlendedCost  96.1  3 / 24
ContextWindow  74.9  18 / 24
CopilotArenaOrLMArenaCode  68.8  9 / 22
GPQA_HLE_Reasoning  55.8  12 / 21
IFBench  69.9  8 / 21
LMArenaCreativeOrOpenEnded  10.2  22 / 24
LMArenaText  10.2  22 / 24
LongContextRecall  57.4  14 / 21
MCPAtlas  0.0  13 / 13
OutputSpeed  90.5  7 / 20
SWEComposite  58.4  15 / 24
SWERebench  70.9  13 / 21
SciCode  48.6  11 / 21
SonarBugDensity  51.6  13 / 17
SonarComposite  27.3  21 / 24
SonarFunctionalSkill  0.0  17 / 17
SonarIssueDensity  50.8  5 / 17
SonarVulnerabilityDensity  28.7  13 / 17
TTFT  99.0  3 / 20
Tau2Bench  96.0  4 / 21
TerminalBench  27.1  17 / 22

sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
claude-opus-4.1 · anthropic · idea 56.5 / 56.5 · plan 50.0 / 50.0 · build 48.3 / 48.3 · review 44.8 (raw / adjusted)

group breakdown

group  score  rank
A_B  49.9  22 / 24
A_I  70.9  10 / 24
A_P  58.2  6 / 24
A_R  63.3  20 / 24
BUILD  48.5  17 / 24
CRE  53.0  17 / 24
GEN  50.8  15 / 24
LM_ARENA_REVIEW_PROXY  0.0  23 / 24
OPS_long  48.7  22 / 24
OPS_precision  43.7  24 / 24
OPS_review  46.2  23 / 24
PLAN  45.9  16 / 24

metrics

metric  score  rank
AI_canary_health  68.1  7 / 7
AI_code  10.4  20 / 22
AI_complexity  37.3  9 / 22
AI_context_awareness  0.0  7 / 24
AI_correctness  94.1  5 / 22
AI_edge_cases  86.5  8 / 22
AI_efficiency  56.0  18 / 22
AI_hallucination_resistance  1.8  18 / 24
AI_memory_retention  0.0  12 / 24
AI_parameter_accuracy  71.0  20 / 24
AI_plan_coherence  35.9  8 / 24
AI_recovery  98.7  3 / 22
AI_refusal  50.0  1 / 22
AI_spec  50.0  1 / 22
AI_stability  89.8  7 / 22
AI_task_completion  83.3  9 / 24
AI_tool_selection  83.2  11 / 24
BlendedCost  0.0  24 / 24
ContextWindow  74.7  20 / 24
CopilotArenaOrLMArenaCode  53.2  17 / 22
LMArenaCreativeOrOpenEnded  53.0  17 / 24
LMArenaSearchDocument  0.0  18 / 19
LMArenaText  53.0  17 / 24
SWEComposite  50.9  16 / 24
SWERebench  52.3  16 / 21
SonarComposite  50.0  14 / 24
TerminalBench  29.4  16 / 22

sources: aistupidlevel, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/GSO, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, GEN/ARC_AGI_2, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
deepseek-v4-flashdeepseek52.052.064.764.747.647.654.2

group breakdown

A_B20.223 / 24A_I17.823 / 24A_P37.723 / 24A_R24.023 / 24BUILD49.715 / 24CRE58.516 / 24GEN62.89 / 24LM_ARENA_REVIEW_PROXY50.08 / 24OPS_long87.85 / 24OPS_precision91.32 / 24OPS_review88.55 / 24PLAN76.65 / 24

metrics

AI_canary_health83.44 / 7AI_code0.022 / 22AI_complexity0.021 / 22AI_context_awareness47.92 / 24AI_correctness0.021 / 22AI_edge_cases0.021 / 22AI_efficiency77.77 / 22AI_hallucination_resistance100.03 / 24AI_memory_retention0.019 / 24AI_parameter_accuracy97.14 / 24AI_plan_coherence25.613 / 24AI_recovery0.021 / 22AI_refusal50.08 / 22AI_spec50.08 / 22AI_stability0.022 / 22AI_task_completion83.311 / 24AI_tool_selection100.01 / 24ArtificialAnalysisCoding45.611 / 21ArtificialAnalysisIntelligence59.311 / 21ArtificialAnalysisReasoning76.78 / 21BlendedCost100.01 / 24ContextWindow71.622 / 24GPQA_HLE_Reasoning76.78 / 21IFBench100.01 / 21LMArenaCreativeOrOpenEnded58.516 / 24LMArenaText58.516 / 24LongContextRecall52.516 / 21OutputSpeed85.89 / 20SWEComposite50.017 / 24SciCode47.512 / 21SonarComposite50.015 / 24TTFT99.52 / 20Tau2Bench94.06 / 21
sources: aistupidlevel, artificial_analysis, lmarena, openrouter
missing: BUILD/CopilotArenaOrLMArenaCode, BUILD/GDPval, BUILD/GSO, BUILD/MCPAtlas, BUILD/TerminalBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, PLAN/TerminalBench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SWEComposite/SWERebench, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gemini-2.5-pro · google · idea 27.7 / 27.7 · plan 33.0 / 33.0 · build 47.5 / 47.5 · review 43.9

group breakdown

A_B 74.9 (4 / 24)
A_I 75.9 (5 / 24)
A_P 48.9 (17 / 24)
A_R 83.5 (3 / 24)
BUILD 35.9 (21 / 24)
CRE 0.0 (24 / 24)
GEN 14.5 (21 / 24)
LM_ARENA_REVIEW_PROXY 0.0 (24 / 24)
OPS_long 79.2 (13 / 24)
OPS_precision 68.9 (16 / 24)
OPS_review 75.9 (15 / 24)
PLAN 28.8 (20 / 24)

metrics

AI_code 62.2 (4 / 22)
AI_complexity 72.3 (4 / 22)
AI_context_awareness 7.5 (3 / 24)
AI_correctness 92.5 (17 / 22)
AI_edge_cases 92.5 (5 / 22)
AI_efficiency 87.9 (3 / 22)
AI_hallucination_resistance 92.5 (14 / 24)
AI_memory_retention 7.5 (6 / 24)
AI_parameter_accuracy 78.1 (16 / 24)
AI_plan_coherence 26.5 (9 / 24)
AI_recovery 92.5 (15 / 22)
AI_refusal 50.0 (10 / 22)
AI_spec 50.0 (10 / 22)
AI_stability 92.5 (4 / 22)
AI_task_completion 27.4 (17 / 24)
AI_tool_selection 11.6 (17 / 24)
ARC_AGI_2 3.7 (13 / 17)
ArtificialAnalysisCoding 23.6 (17 / 21)
ArtificialAnalysisIntelligence 14.1 (17 / 21)
ArtificialAnalysisReasoning 44.8 (14 / 21)
BlendedCost 80.1 (10 / 24)
ContextWindow 100.0 (4 / 24)
CopilotArenaOrLMArenaCode 0.9 (21 / 22)
GDPval 7.5 (15 / 16)
GPQA_HLE_Reasoning 44.8 (14 / 21)
GSO 0.0 (15 / 15)
IFBench 18.7 (18 / 21)
LMArenaCreativeOrOpenEnded 0.0 (24 / 24)
LMArenaSearchDocument 0.0 (19 / 19)
LMArenaText 0.0 (24 / 24)
LongContextRecall 67.2 (9 / 21)
MCPAtlas 71.1 (5 / 13)
OutputSpeed 91.1 (6 / 20)
SWEBenchPro 75.7 (11 / 15)
SWEBenchVerified 38.2 (17 / 18)
SWEComposite 36.6 (22 / 24)
SWERebench 1.8 (20 / 21)
SciCode 36.1 (15 / 21)
SonarBugDensity 52.7 (10 / 17)
SonarComposite 54.2 (10 / 24)
SonarFunctionalSkill 78.9 (6 / 17)
SonarIssueDensity 13.2 (11 / 17)
SonarVulnerabilityDensity 58.2 (9 / 17)
TTFT 30.2 (17 / 20)
Tau2Bench 3.3 (19 / 21)
TerminalBench 1.8 (20 / 22)
sources: arc_agi, artificial_analysis, gso, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
gemini-2.5-flash · google · idea 52.3 / 52.3 · plan 34.6 / 34.6 · build 42.9 / 42.9 · review 46.9

group breakdown

A_B 81.2 (2 / 24)
A_I 84.7 (1 / 24)
A_P 68.0 (1 / 24)
A_R 74.3 (15 / 24)
BUILD 24.5 (23 / 24)
CRE 46.0 (19 / 24)
GEN 15.2 (20 / 24)
LM_ARENA_REVIEW_PROXY 78.8 (7 / 24)
OPS_long 94.2 (1 / 24)
OPS_precision 90.1 (4 / 24)
OPS_review 92.6 (2 / 24)
PLAN 17.0 (23 / 24)

metrics

AI_code 100.0 (1 / 22)
AI_complexity 97.7 (2 / 22)
AI_context_awareness 100.0 (1 / 24)
AI_correctness 100.0 (1 / 22)
AI_edge_cases 100.0 (1 / 22)
AI_efficiency 100.0 (1 / 22)
AI_hallucination_resistance 0.0 (24 / 24)
AI_memory_retention 0.0 (20 / 24)
AI_parameter_accuracy 100.0 (1 / 24)
AI_plan_coherence 50.2 (5 / 24)
AI_recovery 90.7 (19 / 22)
AI_refusal 50.0 (9 / 22)
AI_spec 50.0 (9 / 22)
AI_stability 100.0 (1 / 22)
AI_task_completion 29.7 (16 / 24)
AI_tool_selection 59.6 (16 / 24)
ARC_AGI_2 0.8 (15 / 17)
ArtificialAnalysisCoding 0.0 (20 / 21)
ArtificialAnalysisIntelligence 0.8 (19 / 21)
ArtificialAnalysisReasoning 17.9 (16 / 21)
BlendedCost 94.4 (5 / 24)
ContextWindow 100.0 (3 / 24)
CopilotArenaOrLMArenaCode 65.3 (13 / 22)
GDPval 10.3 (13 / 16)
GPQA_HLE_Reasoning 17.9 (16 / 21)
GSO 19.4 (12 / 15)
IFBench 28.3 (17 / 21)
LMArenaCreativeOrOpenEnded 46.0 (19 / 24)
LMArenaSearchDocument 78.8 (7 / 19)
LMArenaText 46.0 (19 / 24)
LiveCodeBench 100.0 (1 / 2)
LongContextRecall 58.8 (12 / 21)
MCPAtlas 26.6 (8 / 13)
OutputSpeed 99.3 (2 / 20)
SWEBenchPro 52.5 (13 / 15)
SWEBenchVerified 0.0 (18 / 18)
SWEComposite 20.4 (24 / 24)
SWERebench 0.0 (21 / 21)
SciCode 23.5 (16 / 21)
SonarComposite 50.0 (16 / 24)
TTFT 75.6 (12 / 20)
Tau2Bench 0.0 (20 / 21)
TerminalBench 0.3 (21 / 22)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: SWEComposite/SWEBenchMultilingual, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
grok-code-fast-1 · xai · idea 50.7 / 50.7 · plan 30.2 / 30.2 · build 40.7 / 40.7 · review 43.9

group breakdown

A_B 64.9 (7 / 24)
A_I 76.2 (4 / 24)
A_P 57.8 (8 / 24)
A_R 80.7 (7 / 24)
BUILD 28.3 (22 / 24)
CRE 48.2 (18 / 24)
GEN 15.8 (19 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (10 / 24)
OPS_long 85.2 (6 / 24)
OPS_precision 85.0 (8 / 24)
OPS_review 85.1 (7 / 24)
PLAN 12.8 (24 / 24)

metrics

AI_code 21.1 (11 / 22)
AI_complexity 60.0 (7 / 22)
AI_context_awareness 0.0 (22 / 24)
AI_correctness 100.0 (4 / 22)
AI_edge_cases 95.8 (4 / 22)
AI_efficiency 46.0 (20 / 22)
AI_hallucination_resistance 100.0 (11 / 24)
AI_memory_retention 100.0 (2 / 24)
AI_parameter_accuracy 0.0 (22 / 24)
AI_plan_coherence 100.0 (2 / 24)
AI_recovery 100.0 (2 / 22)
AI_refusal 50.0 (21 / 22)
AI_spec 50.0 (21 / 22)
AI_stability 61.6 (20 / 22)
AI_task_completion 0.0 (22 / 24)
AI_tool_selection 0.0 (22 / 24)
ARC_AGI_2 25.1 (7 / 17)
ArtificialAnalysisCoding 0.0 (21 / 21)
ArtificialAnalysisIntelligence 0.0 (21 / 21)
ArtificialAnalysisReasoning 0.0 (21 / 21)
BlendedCost 99.3 (2 / 24)
ContextWindow 78.4 (16 / 24)
CopilotArenaOrLMArenaCode 0.0 (22 / 22)
GPQA_HLE_Reasoning 0.0 (21 / 21)
IFBench 0.0 (21 / 21)
LMArenaCreativeOrOpenEnded 48.2 (18 / 24)
LMArenaText 48.2 (18 / 24)
LongContextRecall 0.0 (21 / 21)
OutputSpeed 87.1 (8 / 20)
SWEComposite 41.2 (21 / 24)
SWERebench 27.9 (19 / 21)
SciCode 0.0 (21 / 21)
SonarComposite 50.0 (20 / 24)
TTFT 77.9 (10 / 20)
Tau2Bench 51.2 (14 / 21)
TerminalBench 0.0 (22 / 22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/GSO, BUILD/MCPAtlas, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
glm-4.6 · zai · idea 33.8 / 33.8 · plan 29.1 / 29.1 · build 34.9 / 34.9 · review 38.1

group breakdown

A_B 56.0 (16 / 24)
A_I 55.0 (21 / 24)
A_P 47.0 (21 / 24)
A_R 58.5 (21 / 24)
BUILD 21.9 (24 / 24)
CRE 24.5 (21 / 24)
GEN 13.7 (23 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (11 / 24)
OPS_long 79.3 (12 / 24)
OPS_precision 85.8 (7 / 24)
OPS_review 83.5 (8 / 24)
PLAN 17.7 (22 / 24)

metrics

AI_context_awareness 0.0 (23 / 24)
AI_hallucination_resistance 100.0 (12 / 24)
AI_memory_retention 100.0 (3 / 24)
AI_parameter_accuracy 0.0 (23 / 24)
AI_plan_coherence 100.0 (3 / 24)
AI_task_completion 0.0 (23 / 24)
AI_tool_selection 0.0 (23 / 24)
ArtificialAnalysisCoding 15.9 (18 / 21)
ArtificialAnalysisIntelligence 6.1 (18 / 21)
ArtificialAnalysisReasoning 16.5 (17 / 21)
BlendedCost 95.4 (4 / 24)
ContextWindow 75.0 (17 / 24)
CopilotArenaOrLMArenaCode 44.4 (19 / 22)
GPQA_HLE_Reasoning 16.5 (17 / 21)
IFBench 4.5 (19 / 21)
LMArenaCreativeOrOpenEnded 24.5 (21 / 24)
LMArenaText 24.5 (21 / 24)
LongContextRecall 9.8 (19 / 21)
MCPAtlas 7.5 (11 / 13)
OutputSpeed 70.8 (18 / 20)
SWEBenchPro 0.0 (15 / 15)
SWEBenchVerified 79.0 (15 / 18)
SWEComposite 30.2 (23 / 24)
SWERebench 38.4 (18 / 21)
SciCode 12.0 (19 / 21)
SonarBugDensity 7.5 (15 / 17)
SonarComposite 10.7 (24 / 24)
SonarFunctionalSkill 7.5 (16 / 17)
SonarIssueDensity 7.5 (14 / 17)
SonarVulnerabilityDensity 29.0 (12 / 17)
TTFT 97.7 (4 / 20)
Tau2Bench 39.7 (17 / 21)
TerminalBench 13.9 (18 / 22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, swebench, swebench_pro, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchMultilingual
kimi-k2-0905 · moonshot · idea 19.9 / 19.9 · plan 22.4 / 22.4 · build 34.4 / 34.4 · review 30.6

group breakdown

A_B 11.9 (24 / 24)
A_I 11.2 (24 / 24)
A_P 27.0 (24 / 24)
A_R 9.9 (24 / 24)
BUILD 41.1 (20 / 24)
CRE 27.5 (20 / 24)
GEN 8.1 (24 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (9 / 24)
OPS_long 35.3 (24 / 24)
OPS_precision 58.1 (21 / 24)
OPS_review 54.4 (22 / 24)
PLAN 22.2 (21 / 24)

metrics

AI_canary_health 88.9 (1 / 7)
AI_code 24.9 (9 / 22)
AI_complexity 0.0 (22 / 22)
AI_context_awareness 0.0 (15 / 24)
AI_correctness 0.0 (22 / 22)
AI_edge_cases 0.0 (22 / 22)
AI_efficiency 3.0 (21 / 22)
AI_hallucination_resistance 1.8 (23 / 24)
AI_memory_retention 0.0 (22 / 24)
AI_parameter_accuracy 82.5 (14 / 24)
AI_plan_coherence 17.9 (17 / 24)
AI_recovery 0.0 (22 / 22)
AI_refusal 50.0 (14 / 22)
AI_spec 50.0 (14 / 22)
AI_stability 0.9 (21 / 22)
AI_task_completion 83.3 (12 / 24)
AI_tool_selection 83.2 (10 / 24)
ArtificialAnalysisCoding 4.2 (19 / 21)
ArtificialAnalysisIntelligence 0.0 (20 / 21)
ArtificialAnalysisReasoning 0.0 (20 / 21)
BlendedCost 92.7 (7 / 24)
ContextWindow 53.4 (23 / 24)
GPQA_HLE_Reasoning 0.0 (20 / 21)
IFBench 0.0 (20 / 21)
LMArenaCreativeOrOpenEnded 27.5 (20 / 24)
LMArenaText 27.5 (20 / 24)
LongContextRecall 0.0 (20 / 21)
OutputSpeed 0.0 (20 / 20)
SWEComposite 50.0 (18 / 24)
SciCode 0.0 (20 / 21)
SonarComposite 50.0 (17 / 24)
TTFT 90.1 (6 / 20)
Tau2Bench 46.1 (16 / 21)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter
missing: BUILD/CopilotArenaOrLMArenaCode, BUILD/GDPval, BUILD/GSO, BUILD/MCPAtlas, BUILD/TerminalBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, PLAN/TerminalBench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SWEComposite/SWERebench, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity