how scoring works
Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted — it is the source of the penalty.
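The adjustment can be sketched in a few lines. This is a minimal sketch, not the site's implementation: the coefficients are the ones stated above, but the function name `adjusted_scores`, the `review_lead` parameter, and the assumption that the penalty is simply the coefficient times the Review lead are all hypothetical. The raw values are taken from the gemini-3.1-pro-preview row, and the lead of 3.9 points is assumed to be that model's Review score (87.0) minus the next-highest Review score in the table (83.1).

```python
# Coefficients as stated in the text; everything else here is an assumption.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def adjusted_scores(raw, review_lead):
    """Return adjusted role scores for one model.

    raw: dict of raw "idea", "plan", "build", and "review" scores (0-100).
    review_lead: hypothetical number of points by which the model leads
        Review; zero or negative values mean no penalty is applied.
    """
    lead = max(review_lead, 0.0)
    adjusted = {role: raw[role] - coef * lead for role, coef in PENALTY.items()}
    adjusted["review"] = raw["review"]  # Review is never adjusted
    return adjusted


raw = {"idea": 93.9, "plan": 88.0, "build": 82.9, "review": 87.0}
adj = adjusted_scores(raw, review_lead=3.9)
print({role: round(score, 1) for role, score in adj.items()})
# {'idea': 93.6, 'plan': 87.3, 'build': 81.7, 'review': 87.0}
```

Under these assumptions the rounded results agree with the adjusted columns of that row, though the source does not spell out the exact formula.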
missing data
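If a model is missing some metrics within a group, the group score depends on coverage, the share of the group's metrics that are present. Between 60% and 80% coverage, the score blends from a shrink-to-50 estimate toward the weighted mean of the present metrics. At 80% coverage and above, the weighted mean of the present metrics is trusted directly.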
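A small sketch of one way such a coverage blend could work. The 60% and 80% thresholds and the neutral value of 50 come from the text; the linear ramp between the thresholds, the exact form of the shrink-to-50 estimate, the behavior below 60%, and the name `group_score` are assumptions made for illustration.

```python
def group_score(present_mean, coverage, lo=0.60, hi=0.80, prior=50.0):
    """Blend a shrunk estimate with the mean of the present metrics.

    present_mean: weighted mean of the metrics that are present (0-100).
    coverage: fraction of the group's metrics that are present (0.0-1.0).
    """
    # Assumed shrink-to-50 form: missing metrics count as the neutral prior.
    shrunk = coverage * present_mean + (1.0 - coverage) * prior
    # Assumed linear ramp: 0 at or below `lo`, 1 at or above `hi`.
    trust = min(max((coverage - lo) / (hi - lo), 0.0), 1.0)
    return trust * present_mean + (1.0 - trust) * shrunk


for cov in (0.5, 0.7, 0.8, 1.0):
    print(cov, round(group_score(80.0, cov), 1))
```

With a present-metric mean of 80, this sketch gives the fully shrunk value at 50% coverage, a value partway between at 70%, and exactly 80 from 80% coverage upward.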
| model | vendor | Idea raw | Idea adjusted | Plan raw | Plan adjusted | Build raw | Build adjusted | Review |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude-opus-4.6 | anthropic | 92.4 | 92.4 | 74.9 | 74.9 | 84.1 | 84.1 | 74.5 |
group breakdown (score, rank): A_B 83.4 (5/24), A_I 86.9 (4/24), A_P 64.6 (9/24), A_R 82.5 (12/24), BUILD 86.3 (2/24), CRE 100.0 (1/24), GEN 89.5 (4/24), LM_ARENA_REVIEW_PROXY 33.1 (13/24), OPS_long 79.4 (14/24), OPS_precision 78.0 (15/24), OPS_review 76.0 (15/24), PLAN 72.7 (5/24)
metrics (score, rank): AI_canary_health 83.3 (4/7), AI_code 95.0 (4/22), AI_complexity 85.6 (3/22), AI_context_awareness 9.0 (9/24), AI_correctness 100.0 (1/22), AI_edge_cases 100.0 (1/22), AI_efficiency 74.8 (7/22), AI_hallucination_resistance 0.0 (22/24), AI_memory_retention 0.0 (14/24), AI_parameter_accuracy 71.8 (15/24), AI_plan_coherence 3.2 (20/24), AI_recovery 100.0 (1/22), AI_refusal 100.0 (1/22), AI_spec 100.0 (1/22), AI_stability 100.0 (2/22), AI_task_completion 92.9 (4/24), AI_tool_selection 100.0 (1/24), ARC_AGI_2 90.9 (4/17), ArtificialAnalysisCoding 76.1 (5/21), ArtificialAnalysisIntelligence 84.0 (5/21), ArtificialAnalysisReasoning 86.3 (5/21), BlendedCost 61.9 (21/24), ContextWindow 99.3 (7/24), CopilotArenaOrLMArenaCode 99.8 (2/22), GDPval 72.8 (7/16), GPQA_HLE_Reasoning 86.3 (5/21), IFBench 31.4 (16/21), LMArenaCreativeOrOpenEnded 100.0 (1/24), LMArenaSearchDocument 33.1 (8/19), LMArenaText 100.0 (1/24), LongContextRecall 90.2 (4/21), OutputSpeed 78.8 (15/19), SWEBenchMultilingual 90.9 (2/6), SWEBenchPro 100.0 (1/15), SWEBenchVerified 99.7 (2/18), SWEComposite 96.5 (1/24), SWERebench 91.6 (4/21), SciCode 85.8 (5/21), SonarBugDensity 59.5 (8/17), SonarFunctionalSkill 92.2 (3/17), SonarIssueDensity 46.8 (6/17), SonarVulnerabilityDensity 66.6 (7/17), TTFT 75.1 (15/19), Tau2Bench 91.2 (6/21), TerminalBench 64.2 (7/22)
| gemini-3.1-pro-preview | google | 93.9 | 93.6 | 88.0 | 87.3 | 82.9 | 81.7 | 87.0 |
group breakdown (score, rank): A_B 81.6 (8/24), A_I 83.3 (9/24), A_P 72.5 (5/24), A_R 84.7 (10/24), BUILD 83.0 (4/24), CRE 100.0 (2/24), GEN 100.0 (1/24), LM_ARENA_REVIEW_PROXY 92.3 (4/24), OPS_long 85.9 (8/24), OPS_precision 80.0 (13/24), OPS_review 77.1 (14/24), PLAN 94.2 (2/24)
metrics (score, rank): AI_code 73.4 (8/22), AI_complexity 68.9 (8/22), AI_context_awareness 14.1 (8/24), AI_correctness 92.5 (15/22), AI_edge_cases 64.7 (19/22), AI_efficiency 80.8 (6/22), AI_hallucination_resistance 92.5 (8/24), AI_memory_retention 92.5 (8/24), AI_parameter_accuracy 92.5 (7/24), AI_plan_coherence 79.1 (8/24), AI_recovery 84.5 (17/22), AI_refusal 92.5 (17/22), AI_spec 92.5 (17/22), AI_stability 83.8 (14/22), AI_task_completion 61.2 (18/24), AI_tool_selection 10.3 (19/24), ARC_AGI_2 100.0 (1/17), ArtificialAnalysisCoding 100.0 (1/21), ArtificialAnalysisIntelligence 100.0 (2/21), ArtificialAnalysisReasoning 100.0 (1/21), BlendedCost 77.3 (13/24), ContextWindow 100.0 (6/24), CopilotArenaOrLMArenaCode 73.4 (7/22), GDPval 24.7 (12/16), GPQA_HLE_Reasoning 100.0 (1/21), IFBench 97.5 (3/21), LMArenaCreativeOrOpenEnded 100.0 (2/24), LMArenaSearchDocument 92.3 (4/19), LMArenaText 100.0 (2/24), LongContextRecall 100.0 (2/21), MCPAtlas 71.1 (6/13), OutputSpeed 93.4 (5/19), SWEBenchPro 89.1 (5/15), SWEBenchVerified 95.0 (4/18), SWEComposite 94.3 (2/24), SWERebench 99.8 (2/21), SciCode 100.0 (2/21), SonarBugDensity 52.7 (12/17), SonarFunctionalSkill 78.9 (8/17), SonarIssueDensity 13.2 (13/17), SonarVulnerabilityDensity 58.2 (11/17), TTFT 59.1 (18/19), Tau2Bench 99.3 (4/21), TerminalBench 89.4 (3/22)
| gpt-5.5 | openai | 82.4 | 82.4 | 83.7 | 83.7 | 81.9 | 81.9 | 80.7 |
group breakdown (score, rank): A_B 65.4 (13/24), A_I 76.0 (11/24), A_P 61.5 (11/24), A_R 82.9 (11/24), BUILD 89.5 (1/24), CRE 81.7 (7/24), GEN 94.1 (3/24), LM_ARENA_REVIEW_PROXY 28.2 (14/24), OPS_long 82.7 (10/24), OPS_precision 80.9 (9/24), OPS_review 78.3 (13/24), PLAN 95.4 (1/24)
metrics (score, rank): AI_code 37.7 (18/22), AI_complexity 29.2 (13/22), AI_context_awareness 0.0 (20/24), AI_correctness 94.1 (12/22), AI_edge_cases 86.5 (13/22), AI_efficiency 52.4 (14/22), AI_hallucination_resistance 60.0 (14/24), AI_memory_retention 0.0 (24/24), AI_parameter_accuracy 95.8 (3/24), AI_plan_coherence 8.7 (19/24), AI_recovery 98.7 (11/22), AI_refusal 100.0 (13/22), AI_spec 100.0 (13/22), AI_stability 89.5 (9/22), AI_task_completion 86.7 (9/24), AI_tool_selection 83.8 (6/24), ARC_AGI_2 96.7 (2/17), ArtificialAnalysisCoding 100.0 (2/21), ArtificialAnalysisIntelligence 98.1 (3/21), ArtificialAnalysisReasoning 100.0 (2/21), BlendedCost 50.6 (23/24), ContextWindow 100.0 (2/24), CopilotArenaOrLMArenaCode 71.9 (8/22), GDPval 95.0 (1/16), GPQA_HLE_Reasoning 100.0 (2/21), IFBench 80.7 (6/21), LMArenaCreativeOrOpenEnded 81.7 (7/24), LMArenaSearchDocument 28.2 (9/19), LMArenaText 81.7 (7/24), LongContextRecall 98.0 (3/21), OutputSpeed 84.0 (9/19), SWEBenchPro 95.0 (3/15), SWEBenchVerified 95.0 (7/18), SWEComposite 91.2 (4/24), SWERebench 83.5 (8/21), SciCode 94.5 (4/21), SonarBugDensity 94.5 (2/17), SonarFunctionalSkill 46.5 (13/17), SonarIssueDensity 52.7 (4/17), SonarVulnerabilityDensity 99.2 (2/17), TTFT 82.6 (8/19), Tau2Bench 90.5 (7/21), TerminalBench 100.0 (1/22)
| claude-opus-4.7 | anthropic | 90.5 | 90.5 | 79.9 | 79.9 | 81.1 | 81.1 | 83.1 |
group breakdown (score, rank): A_B 73.1 (10/24), A_I 84.0 (6/24), A_P 67.2 (7/24), A_R 79.3 (14/24), BUILD 86.2 (3/24), CRE 94.5 (4/24), GEN 96.7 (2/24), LM_ARENA_REVIEW_PROXY 100.0 (1/24), OPS_long 78.8 (15/24), OPS_precision 75.8 (16/24), OPS_review 73.2 (17/24), PLAN 79.8 (4/24)
metrics (score, rank): AI_canary_health 88.2 (3/7), AI_code 63.3 (10/22), AI_complexity 41.7 (10/22), AI_context_awareness 14.1 (5/24), AI_correctness 100.0 (2/22), AI_edge_cases 100.0 (2/22), AI_efficiency 70.1 (10/22), AI_hallucination_resistance 0.0 (23/24), AI_memory_retention 35.4 (10/24), AI_parameter_accuracy 35.2 (20/24), AI_plan_coherence 22.5 (12/24), AI_recovery 100.0 (2/22), AI_refusal 100.0 (2/22), AI_spec 100.0 (2/22), AI_stability 100.0 (3/22), AI_task_completion 100.0 (1/24), AI_tool_selection 56.9 (13/24), ARC_AGI_2 92.7 (3/17), ArtificialAnalysisCoding 90.3 (3/21), ArtificialAnalysisIntelligence 100.0 (1/21), ArtificialAnalysisReasoning 95.6 (3/21), BlendedCost 61.9 (22/24), ContextWindow 99.3 (8/24), CopilotArenaOrLMArenaCode 100.0 (1/22), GDPval 93.9 (2/16), GPQA_HLE_Reasoning 95.6 (3/21), IFBench 46.6 (10/21), LMArenaCreativeOrOpenEnded 94.5 (4/24), LMArenaSearchDocument 100.0 (1/19), LMArenaText 94.5 (4/24), LongContextRecall 88.2 (6/21), OutputSpeed 80.7 (12/19), SWEBenchPro 95.0 (2/15), SWEBenchVerified 95.0 (3/18), SWEComposite 91.8 (3/24), SWERebench 85.3 (6/21), SciCode 100.0 (1/21), SonarBugDensity 50.1 (14/17), SonarFunctionalSkill 93.9 (2/17), SonarIssueDensity 0.0 (17/17), SonarVulnerabilityDensity 25.3 (14/17), TTFT 66.7 (16/19), Tau2Bench 83.1 (9/21), TerminalBench 78.2 (4/22)
| gpt-5.4 | openai | 71.7 | 71.7 | 55.4 | 55.4 | 71.5 | 71.5 | 65.2 |
group breakdown (score, rank): A_B 65.2 (14/24), A_I 75.0 (13/24), A_P 59.9 (12/24), A_R 82.5 (13/24), BUILD 73.8 (8/24), CRE 77.3 (9/24), GEN 45.0 (16/24), LM_ARENA_REVIEW_PROXY 17.6 (20/24), OPS_long 93.0 (3/24), OPS_precision 91.3 (2/24), OPS_review 89.8 (5/24), PLAN 50.9 (15/24)
metrics (score, rank): AI_code 37.7 (17/22), AI_complexity 29.2 (18/22), AI_context_awareness 0.0 (19/24), AI_correctness 94.1 (11/22), AI_edge_cases 86.5 (12/22), AI_efficiency 54.0 (13/22), AI_hallucination_resistance 60.0 (13/24), AI_memory_retention 0.0 (23/24), AI_parameter_accuracy 91.0 (10/24), AI_plan_coherence 3.2 (22/24), AI_recovery 98.7 (10/22), AI_refusal 100.0 (12/22), AI_spec 100.0 (12/22), AI_stability 85.7 (10/22), AI_task_completion 86.7 (8/24), AI_tool_selection 82.6 (7/24), ARC_AGI_2 75.8 (5/17), ArtificialAnalysisCoding 33.7 (15/21), ArtificialAnalysisIntelligence 27.4 (16/21), ArtificialAnalysisReasoning 15.5 (18/21), BlendedCost 75.0 (15/24), ContextWindow 100.0 (1/24), CopilotArenaOrLMArenaCode 68.0 (11/22), GDPval 81.4 (4/16), GPQA_HLE_Reasoning 15.5 (18/21), IFBench 62.5 (9/21), LMArenaCreativeOrOpenEnded 77.3 (9/24), LMArenaSearchDocument 17.6 (15/19), LMArenaText 77.3 (9/24), LongContextRecall 24.5 (18/21), MCPAtlas 72.8 (4/13), OutputSpeed 95.2 (3/19), SWEBenchPro 92.5 (4/15), SWEBenchVerified 95.0 (6/18), SWEComposite 90.2 (5/24), SWERebench 83.5 (7/21), SciCode 12.0 (18/21), SonarBugDensity 84.7 (4/17), SonarFunctionalSkill 66.8 (11/17), SonarIssueDensity 6.8 (15/17), SonarVulnerabilityDensity 100.0 (1/17), TTFT 90.6 (6/19), Tau2Bench 0.0 (21/21), TerminalBench 100.0 (2/22)
| kimi-k2.6 | moonshot | 74.2 | 74.2 | 72.1 | 72.1 | 71.0 | 71.0 | 79.4 |
group breakdown (score, rank): A_B 61.8 (15/24), A_I 74.2 (14/24), A_P 56.0 (15/24), A_R 75.4 (15/24), BUILD 76.0 (7/24), CRE 77.4 (8/24), GEN 73.7 (5/24), LM_ARENA_REVIEW_PROXY 94.8 (2/24), OPS_long 58.0 (20/24), OPS_precision 59.8 (18/24), OPS_review 60.3 (18/24), PLAN 86.6 (3/24)
metrics (score, rank): AI_code 40.6 (14/22), AI_complexity 29.2 (15/22), AI_context_awareness 0.0 (16/24), AI_correctness 94.1 (8/22), AI_edge_cases 86.5 (9/22), AI_efficiency 49.4 (16/22), AI_hallucination_resistance 20.0 (21/24), AI_memory_retention 0.0 (20/24), AI_parameter_accuracy 66.7 (17/24), AI_plan_coherence 8.7 (18/24), AI_recovery 98.7 (7/22), AI_refusal 100.0 (9/22), AI_spec 100.0 (9/22), AI_stability 79.7 (16/22), AI_task_completion 72.3 (11/24), AI_tool_selection 52.2 (14/24), ARC_AGI_2 11.9 (9/17), ArtificialAnalysisCoding 72.8 (7/21), ArtificialAnalysisIntelligence 87.5 (4/21), ArtificialAnalysisReasoning 87.6 (4/21), BlendedCost 87.1 (9/24), ContextWindow 78.4 (14/24), CopilotArenaOrLMArenaCode 94.4 (4/22), GDPval 52.1 (10/16), GPQA_HLE_Reasoning 87.6 (4/21), IFBench 94.5 (4/21), LMArenaCreativeOrOpenEnded 77.4 (8/24), LMArenaSearchDocument 94.8 (2/19), LMArenaText 77.4 (8/24), LongContextRecall 85.3 (7/21), MCPAtlas 92.5 (2/13), SWEBenchVerified 95.0 (5/18), SWEComposite 68.2 (14/24), SWERebench 73.1 (12/21), SciCode 94.5 (3/21), SonarBugDensity 92.5 (3/17), SonarFunctionalSkill 66.8 (12/17), SonarIssueDensity 92.5 (2/17), SonarVulnerabilityDensity 81.6 (5/17), Tau2Bench 100.0 (1/21), TerminalBench 74.6 (5/22)
| glm-5.1 | zai | 75.9 | 75.9 | 63.6 | 63.6 | 70.2 | 70.2 | 74.9 |
group breakdown (score, rank): A_B 60.0 (16/24), A_I 70.6 (16/24), A_P 55.1 (16/24), A_R 71.6 (16/24), BUILD 73.2 (9/24), CRE 86.3 (5/24), GEN 57.5 (11/24), LM_ARENA_REVIEW_PROXY 88.0 (5/24), OPS_long 83.2 (9/24), OPS_precision 87.3 (7/24), OPS_review 89.3 (7/24), PLAN 69.5 (7/24)
metrics (score, rank): AI_code 42.0 (12/22), AI_complexity 32.3 (11/22), AI_context_awareness 7.5 (11/24), AI_correctness 87.5 (16/22), AI_edge_cases 81.0 (14/22), AI_efficiency 49.5 (15/22), AI_hallucination_resistance 24.5 (18/24), AI_memory_retention 7.5 (12/24), AI_parameter_accuracy 64.2 (18/24), AI_plan_coherence 14.9 (16/24), AI_recovery 91.4 (12/22), AI_refusal 92.5 (18/22), AI_spec 92.5 (18/22), AI_stability 75.3 (18/22), AI_task_completion 68.9 (12/24), AI_tool_selection 51.9 (15/24), ARC_AGI_2 5.2 (11/17), ArtificialAnalysisCoding 39.5 (13/21), ArtificialAnalysisIntelligence 60.5 (8/21), ArtificialAnalysisReasoning 54.0 (13/21), BlendedCost 93.0 (6/24), ContextWindow 74.9 (19/24), CopilotArenaOrLMArenaCode 95.9 (3/22), GDPval 59.5 (9/16), GPQA_HLE_Reasoning 54.0 (13/21), IFBench 86.8 (5/21), LMArenaCreativeOrOpenEnded 86.3 (5/24), LMArenaSearchDocument 88.0 (5/19), LMArenaText 86.3 (5/24), LongContextRecall 41.2 (17/21), MCPAtlas 100.0 (1/13), OutputSpeed 77.5 (17/19), SWEBenchMultilingual 50.9 (3/6), SWEBenchVerified 91.9 (10/18), SWEComposite 77.7 (8/24), SWERebench 100.0 (1/21), SciCode 40.4 (14/21), SonarBugDensity 100.0 (1/17), SonarFunctionalSkill 69.8 (9/17), SonarIssueDensity 100.0 (1/17), SonarVulnerabilityDensity 87.2 (4/17), TTFT 100.0 (1/19), Tau2Bench 100.0 (3/21), TerminalBench 55.8 (10/22)
| gemini-3-pro | google | 82.4 | 82.4 | 62.1 | 62.1 | 69.9 | 69.9 | 63.8 |
group breakdown (score, rank): A_B 87.2 (4/24), A_I 89.2 (3/24), A_P 76.5 (2/24), A_R 90.8 (2/24), BUILD 65.4 (10/24), CRE 94.5 (3/24), GEN 59.9 (10/24), LM_ARENA_REVIEW_PROXY 19.9 (18/24), OPS_long 45.2 (23/24), OPS_precision 46.6 (23/24), OPS_review 50.5 (22/24), PLAN 55.3 (13/24)
metrics (score, rank): AI_code 77.5 (5/22), AI_complexity 72.2 (5/22), AI_context_awareness 7.7 (10/24), AI_correctness 100.0 (6/22), AI_edge_cases 67.3 (16/22), AI_efficiency 86.2 (3/22), AI_hallucination_resistance 100.0 (1/24), AI_memory_retention 100.0 (1/24), AI_parameter_accuracy 100.0 (2/24), AI_plan_coherence 84.2 (5/24), AI_recovery 90.5 (13/22), AI_refusal 100.0 (7/22), AI_spec 100.0 (7/22), AI_stability 89.8 (7/22), AI_task_completion 63.2 (15/24), AI_tool_selection 3.2 (20/24), ARC_AGI_2 41.9 (6/17), BlendedCost 77.3 (12/24), ContextWindow 0.0 (24/24), CopilotArenaOrLMArenaCode 68.4 (10/22), GDPval 5.0 (16/16), LMArenaCreativeOrOpenEnded 94.5 (3/24), LMArenaSearchDocument 19.9 (13/19), LMArenaText 94.5 (3/24), MCPAtlas 74.9 (3/13), SWEBenchMultilingual 33.5 (4/6), SWEBenchPro 80.3 (8/15), SWEBenchVerified 82.9 (13/18), SWEComposite 73.3 (10/24), SWERebench 70.6 (14/21), SonarBugDensity 53.2 (9/17), SonarFunctionalSkill 84.1 (5/17), SonarIssueDensity 6.7 (16/17), SonarVulnerabilityDensity 59.7 (8/17), TerminalBench 61.2 (8/22)
| gemini-3-flash | google | 81.2 | 81.2 | 68.4 | 68.4 | 68.9 | 68.9 | 66.1 |
group breakdown (score, rank): A_B 81.6 (7/24), A_I 83.3 (8/24), A_P 72.5 (4/24), A_R 84.7 (9/24), BUILD 59.4 (12/24), CRE 85.8 (6/24), GEN 61.6 (9/24), LM_ARENA_REVIEW_PROXY 20.0 (17/24), OPS_long 94.9 (1/24), OPS_precision 91.8 (1/24), OPS_review 90.5 (3/24), PLAN 64.6 (9/24)
metrics (score, rank): AI_code 73.4 (7/22), AI_complexity 68.9 (7/22), AI_context_awareness 14.1 (7/24), AI_correctness 92.5 (14/22), AI_edge_cases 64.7 (18/22), AI_efficiency 80.8 (5/22), AI_hallucination_resistance 92.5 (7/24), AI_memory_retention 92.5 (7/24), AI_parameter_accuracy 92.5 (6/24), AI_plan_coherence 79.1 (7/24), AI_recovery 84.5 (16/22), AI_refusal 92.5 (16/22), AI_spec 92.5 (16/22), AI_stability 83.8 (13/22), AI_task_completion 61.2 (17/24), AI_tool_selection 10.3 (18/24), ARC_AGI_2 3.1 (14/17), ArtificialAnalysisCoding 58.3 (9/21), ArtificialAnalysisIntelligence 58.9 (11/21), ArtificialAnalysisReasoning 82.7 (6/21), BlendedCost 91.5 (8/24), ContextWindow 100.0 (5/24), CopilotArenaOrLMArenaCode 68.0 (12/22), GDPval 8.0 (14/16), GPQA_HLE_Reasoning 82.7 (6/21), IFBench 100.0 (2/21), LMArenaCreativeOrOpenEnded 85.8 (6/24), LMArenaSearchDocument 20.0 (12/19), LMArenaText 85.8 (6/24), LongContextRecall 68.6 (9/21), MCPAtlas 22.4 (9/13), OutputSpeed 99.2 (2/19), SWEBenchMultilingual 100.0 (1/6), SWEBenchPro 53.0 (12/15), SWEBenchVerified 100.0 (1/18), SWEComposite 76.4 (9/24), SWERebench 76.3 (10/21), SciCode 78.7 (6/21), SonarBugDensity 52.7 (11/17), SonarFunctionalSkill 78.9 (7/17), SonarIssueDensity 13.2 (12/17), SonarVulnerabilityDensity 58.2 (10/17), TTFT 81.1 (9/19), Tau2Bench 64.2 (10/21), TerminalBench 48.3 (12/22)
| gpt-5.3-codex | openai | 69.5 | 69.5 | 57.1 | 57.1 | 63.0 | 63.0 | 71.7 |
group breakdown (score, rank): A_B 67.5 (11/24), A_I 75.9 (12/24), A_P 56.3 (14/24), A_R 86.3 (5/24), BUILD 61.6 (11/24), CRE 73.2 (12/24), GEN 55.8 (12/24), LM_ARENA_REVIEW_PROXY 92.5 (3/24), OPS_long 58.0 (21/24), OPS_precision 59.3 (20/24), OPS_review 58.8 (21/24), PLAN 58.3 (10/24)
metrics (score, rank): AI_code 37.7 (16/22), AI_complexity 29.2 (17/22), AI_context_awareness 0.0 (18/24), AI_correctness 94.1 (10/22), AI_edge_cases 86.5 (11/22), AI_efficiency 58.3 (12/22), AI_hallucination_resistance 80.0 (10/24), AI_memory_retention 0.0 (22/24), AI_parameter_accuracy 85.9 (11/24), AI_plan_coherence 3.2 (21/24), AI_recovery 98.7 (9/22), AI_refusal 100.0 (11/22), AI_spec 100.0 (11/22), AI_stability 89.5 (8/22), AI_task_completion 57.8 (19/24), AI_tool_selection 65.5 (12/24), BlendedCost 76.6 (14/24), ContextWindow 85.3 (13/24), CopilotArenaOrLMArenaCode 59.3 (14/22), GDPval 51.5 (11/16), LMArenaCreativeOrOpenEnded 73.2 (12/24), LMArenaSearchDocument 92.5 (3/19), LMArenaText 73.2 (12/24), SWEBenchVerified 92.5 (8/18), SWEComposite 72.5 (12/24), SWERebench 89.5 (5/21), TerminalBench 74.3 (6/22)
| claude-opus-4.5 | anthropic | 61.5 | 61.5 | 59.7 | 59.7 | 62.3 | 62.3 | 55.0 |
group breakdown (score, rank): A_B 23.5 (24/24), A_I 34.3 (22/24), A_P 39.2 (21/24), A_R 37.0 (22/24), BUILD 80.6 (5/24), CRE 73.4 (11/24), GEN 70.4 (6/24), LM_ARENA_REVIEW_PROXY 11.2 (21/24), OPS_long 77.7 (17/24), OPS_precision 75.7 (17/24), OPS_review 74.8 (16/24), PLAN 65.7 (8/24)
metrics (score, rank): AI_canary_health 88.5 (2/7), AI_code 0.0 (22/22), AI_complexity 0.0 (22/22), AI_context_awareness 50.8 (3/24), AI_correctness 20.0 (20/22), AI_edge_cases 68.7 (15/22), AI_efficiency 27.5 (20/22), AI_hallucination_resistance 20.0 (19/24), AI_memory_retention 11.8 (11/24), AI_parameter_accuracy 100.0 (1/24), AI_plan_coherence 11.5 (17/24), AI_recovery 70.8 (19/22), AI_refusal 0.0 (22/22), AI_spec 0.0 (22/22), AI_stability 83.4 (15/22), AI_task_completion 65.1 (14/24), AI_tool_selection 99.6 (2/24), ArtificialAnalysisCoding 75.1 (6/21), ArtificialAnalysisIntelligence 71.5 (7/21), ArtificialAnalysisReasoning 63.7 (9/21), BlendedCost 61.9 (20/24), ContextWindow 74.7 (21/24), CopilotArenaOrLMArenaCode 76.8 (6/22), GDPval 71.5 (8/16), GPQA_HLE_Reasoning 63.7 (9/21), IFBench 44.9 (11/21), LMArenaCreativeOrOpenEnded 73.4 (11/24), LMArenaSearchDocument 11.2 (16/19), LMArenaText 73.4 (11/24), LongContextRecall 100.0 (1/21), OutputSpeed 82.0 (11/19), SWEBenchPro 88.4 (6/15), SWEBenchVerified 92.2 (9/18), SWEComposite 85.5 (7/24), SWERebench 76.5 (9/21), SciCode 72.7 (7/21), SonarBugDensity 73.7 (5/17), SonarFunctionalSkill 100.0 (1/17), SonarIssueDensity 77.2 (3/17), SonarVulnerabilityDensity 87.2 (3/17), TTFT 75.9 (14/19), Tau2Bench 85.2 (8/21), TerminalBench 54.8 (11/22)
| grok-4-latest | xai | 76.0 | 76.0 | 52.5 | 52.5 | 62.1 | 62.1 | 59.7 |
group breakdown (score, rank): A_B 91.2 (1/24), A_I 92.0 (1/24), A_P 65.9 (8/24), A_R 99.7 (1/24), BUILD 46.6 (18/24), CRE 76.1 (10/24), GEN 49.3 (15/24), LM_ARENA_REVIEW_PROXY 19.2 (19/24), OPS_long 78.8 (16/24), OPS_precision 78.6 (14/24), OPS_review 78.4 (12/24), PLAN 38.1 (18/24)
metrics (score, rank): AI_code 96.6 (3/22), AI_complexity 100.0 (1/22), AI_context_awareness 0.0 (21/24), AI_correctness 100.0 (7/22), AI_edge_cases 100.0 (6/22), AI_efficiency 0.0 (22/22), AI_hallucination_resistance 100.0 (2/24), AI_memory_retention 99.2 (2/24), AI_parameter_accuracy 0.0 (21/24), AI_plan_coherence 100.0 (1/24), AI_recovery 100.0 (6/22), AI_refusal 100.0 (14/22), AI_spec 100.0 (14/22), AI_stability 100.0 (5/22), AI_task_completion 0.0 (21/24), AI_tool_selection 0.0 (21/24), ARC_AGI_2 20.7 (8/17), ArtificialAnalysisCoding 51.5 (10/21), ArtificialAnalysisIntelligence 40.3 (14/21), ArtificialAnalysisReasoning 57.0 (10/21), BlendedCost 74.4 (19/24), ContextWindow 78.4 (15/24), CopilotArenaOrLMArenaCode 58.0 (15/22), GPQA_HLE_Reasoning 57.0 (10/21), IFBench 33.1 (15/21), LMArenaCreativeOrOpenEnded 76.1 (10/24), LMArenaSearchDocument 19.2 (14/19), LMArenaText 76.1 (10/24), LongContextRecall 77.0 (8/21), OutputSpeed 79.4 (14/19), SWEComposite 46.7 (20/24), SWERebench 39.1 (17/21), SciCode 51.9 (10/21), TTFT 79.5 (10/19), Tau2Bench 51.5 (14/21), TerminalBench 11.8 (19/22)
| claude-sonnet-4 | anthropic | 36.7 | 36.7 | 42.8 | 42.8 | 61.6 | 61.6 | 62.6 |
group breakdown (score, rank): A_B 89.1 (3/24), A_I 91.7 (2/24), A_P 68.4 (6/24), A_R 86.3 (4/24), BUILD 46.8 (17/24), CRE 0.0 (23/24), GEN 14.0 (22/24), LM_ARENA_REVIEW_PROXY 86.2 (6/24), OPS_long 81.4 (11/24), OPS_precision 80.7 (11/24), OPS_review 79.3 (10/24), PLAN 33.3 (19/24)
metrics (score, rank): AI_code 99.0 (2/22), AI_complexity 99.7 (2/22), AI_context_awareness 0.0 (13/24), AI_correctness 100.0 (3/22), AI_edge_cases 100.0 (3/22), AI_efficiency 97.3 (2/22), AI_hallucination_resistance 20.0 (20/24), AI_memory_retention 0.0 (15/24), AI_parameter_accuracy 91.9 (8/24), AI_plan_coherence 19.8 (15/24), AI_recovery 100.0 (3/22), AI_refusal 100.0 (3/22), AI_spec 100.0 (3/22), AI_stability 100.0 (4/22), AI_task_completion 99.4 (3/24), AI_tool_selection 88.8 (4/24), ARC_AGI_2 0.2 (16/17), ArtificialAnalysisCoding 30.7 (16/21), ArtificialAnalysisIntelligence 29.7 (15/21), ArtificialAnalysisReasoning 8.6 (19/21), BlendedCost 74.4 (16/24), ContextWindow 99.3 (9/24), CopilotArenaOrLMArenaCode 52.9 (18/22), GDPval 80.1 (5/16), GPQA_HLE_Reasoning 8.6 (19/21), IFBench 35.8 (14/21), LMArenaCreativeOrOpenEnded 0.0 (23/24), LMArenaSearchDocument 86.2 (6/19), LMArenaText 0.0 (23/24), LiveCodeBench 0.0 (2/2), LongContextRecall 60.8 (12/21), MCPAtlas 13.1 (10/13), OutputSpeed 79.5 (13/19), SWEBenchPro 78.4 (9/15), SWEBenchVerified 69.9 (16/18), SWEComposite 68.2 (13/24), SWERebench 55.1 (15/21), SciCode 20.8 (17/21), SonarBugDensity 0.0 (17/17), SonarFunctionalSkill 26.4 (14/17), SonarIssueDensity 35.8 (7/17), SonarVulnerabilityDensity 0.0 (17/17), TTFT 76.7 (13/19), Tau2Bench 27.7 (18/21), TerminalBench 47.4 (13/22)
| claude-sonnet-4.5 | anthropic | 68.0 | 68.0 | 55.9 | 55.9 | 61.3 | 61.3 | 55.8 |
group breakdown (score, rank): A_B 78.0 (9/24), A_I 86.6 (5/24), A_P 77.3 (1/24), A_R 85.9 (6/24), BUILD 51.5 (14/24), CRE 64.0 (15/24), GEN 42.2 (17/24), LM_ARENA_REVIEW_PROXY 1.8 (22/24), OPS_long 81.1 (12/24), OPS_precision 80.7 (12/24), OPS_review 79.4 (9/24), PLAN 42.3 (17/24)
metrics (score, rank): AI_canary_health 78.1 (7/7), AI_code 67.0 (9/22), AI_complexity 65.8 (9/22), AI_context_awareness 99.8 (2/24), AI_correctness 100.0 (4/22), AI_edge_cases 100.0 (4/22), AI_efficiency 74.4 (8/22), AI_hallucination_resistance 40.0 (16/24), AI_memory_retention 0.0 (16/24), AI_parameter_accuracy 61.3 (19/24), AI_plan_coherence 30.8 (9/24), AI_recovery 100.0 (4/22), AI_refusal 100.0 (4/22), AI_spec 100.0 (4/22), AI_stability 93.6 (6/22), AI_task_completion 99.8 (2/24), AI_tool_selection 82.0 (8/24), ARC_AGI_2 3.7 (12/17), ArtificialAnalysisCoding 45.3 (12/21), ArtificialAnalysisIntelligence 46.0 (12/21), ArtificialAnalysisReasoning 35.3 (15/21), BlendedCost 74.4 (17/24), ContextWindow 99.3 (10/24), CopilotArenaOrLMArenaCode 53.4 (16/22), GDPval 81.9 (3/16), GPQA_HLE_Reasoning 35.3 (15/21), IFBench 43.0 (12/21), LMArenaCreativeOrOpenEnded 64.0 (15/24), LMArenaSearchDocument 1.8 (17/19), LMArenaText 64.0 (15/24), LongContextRecall 65.7 (11/21), MCPAtlas 6.6 (12/13), OutputSpeed 78.6 (16/19), SWEBenchMultilingual 3.9 (5/6), SWEBenchPro 81.2 (7/15), SWEBenchVerified 85.7 (12/18), SWEComposite 72.7 (11/24), SWERebench 74.9 (11/21), SciCode 46.4 (13/21), SonarBugDensity 2.8 (16/17), SonarFunctionalSkill 17.2 (15/17), SonarIssueDensity 30.0 (9/17), SonarVulnerabilityDensity 4.6 (16/17), TTFT 77.5 (11/19), Tau2Bench 58.9 (11/21), TerminalBench 37.4 (15/22)
| claude-sonnet-4.6 | anthropic | 63.1 | 63.1 | 54.4 | 54.4 | 60.0 | 60.0 | 53.0 |
group breakdown (score, rank): A_B 31.3 (22/24), A_I 46.3 (20/24), A_P 41.9 (20/24), A_R 43.4 (21/24), BUILD 76.1 (6/24), CRE 73.0 (13/24), GEN 65.4 (7/24), LM_ARENA_REVIEW_PROXY 23.3 (15/24), OPS_long 67.6 (18/24), OPS_precision 54.9 (21/24), OPS_review 49.5 (23/24), PLAN 56.9 (11/24)
metrics (score, rank): AI_code 7.3 (20/22), AI_complexity 16.2 (20/22), AI_context_awareness 14.5 (4/24), AI_correctness 27.4 (18/22), AI_edge_cases 87.0 (8/22), AI_efficiency 42.3 (18/22), AI_hallucination_resistance 0.0 (24/24), AI_memory_retention 0.0 (17/24), AI_parameter_accuracy 94.4 (4/24), AI_plan_coherence 25.3 (10/24), AI_recovery 87.6 (14/22), AI_refusal 14.9 (20/22), AI_spec 14.9 (20/22), AI_stability 85.3 (11/22), AI_task_completion 67.7 (13/24), AI_tool_selection 93.6 (3/24), ARC_AGI_2 10.6 (10/17), ArtificialAnalysisCoding 85.1 (4/21), ArtificialAnalysisIntelligence 79.1 (6/21), ArtificialAnalysisReasoning 68.7 (8/21), BlendedCost 74.4 (18/24), ContextWindow 99.3 (11/24), CopilotArenaOrLMArenaCode 93.2 (5/22), GDPval 80.1 (6/16), GPQA_HLE_Reasoning 68.7 (8/21), IFBench 41.0 (13/21), LMArenaCreativeOrOpenEnded 73.0 (13/24), LMArenaSearchDocument 23.3 (10/19), LMArenaText 73.0 (13/24), LongContextRecall 90.2 (5/21), MCPAtlas 69.8 (7/13), OutputSpeed 82.4 (10/19), SWEBenchPro 76.5 (10/15), SWEBenchVerified 90.3 (11/18), SWEComposite 86.8 (6/24), SWERebench 95.7 (3/21), SciCode 57.9 (8/21), SonarBugDensity 65.8 (6/17), SonarFunctionalSkill 84.5 (4/17), SonarIssueDensity 22.3 (10/17), SonarVulnerabilityDensity 21.8 (15/17), TTFT 0.0 (19/19), Tau2Bench 53.3 (12/21), TerminalBench 47.4 (14/22)
| gpt-5.2 | openai | 66.0 | 66.0 | 56.3 | 56.3 | 57.6 | 57.6 | 60.2 |
group breakdown (score, rank): A_B 67.4 (12/24), A_I 74.2 (15/24), A_P 58.5 (13/24), A_R 85.6 (7/24), BUILD 52.2 (13/24), CRE 67.8 (14/24), GEN 52.2 (13/24), LM_ARENA_REVIEW_PROXY 21.2 (16/24), OPS_long 58.3 (19/24), OPS_precision 59.8 (19/24), OPS_review 59.5 (20/24), PLAN 56.6 (12/24)
metrics (score, rank): AI_code 40.6 (15/22), AI_complexity 29.2 (16/22), AI_context_awareness 0.0 (17/24), AI_correctness 94.1 (9/22), AI_edge_cases 86.5 (10/22), AI_efficiency 59.2 (11/22), AI_hallucination_resistance 80.0 (9/24), AI_memory_retention 0.0 (21/24), AI_parameter_accuracy 85.2 (12/24), AI_plan_coherence 0.4 (23/24), AI_recovery 98.7 (8/22), AI_refusal 100.0 (10/22), AI_spec 100.0 (10/22), AI_stability 79.7 (17/22), AI_task_completion 86.7 (7/24), AI_tool_selection 75.3 (9/24), ARC_AGI_2 0.0 (17/17), ArtificialAnalysisCoding 63.4 (8/21), ArtificialAnalysisIntelligence 59.7 (9/21), ArtificialAnalysisReasoning 56.4 (11/21), BlendedCost 80.1 (11/24), ContextWindow 85.3 (12/24), CopilotArenaOrLMArenaCode 38.7 (20/22), GPQA_HLE_Reasoning 56.4 (11/21), IFBench 64.7 (8/21), LMArenaCreativeOrOpenEnded 67.8 (14/24), LMArenaSearchDocument 21.2 (11/19), LMArenaText 67.8 (14/24), LongContextRecall 53.9 (15/21), SWEBenchMultilingual 0.0 (6/6), SWEBenchPro 38.2 (14/15), SWEBenchVerified 81.3 (14/18), SWEComposite 48.4 (19/24), SciCode 54.6 (9/21), SonarBugDensity 64.2 (7/17), SonarFunctionalSkill 67.2 (10/17), SonarIssueDensity 35.7 (8/17), SonarVulnerabilityDensity 73.4 (6/17), Tau2Bench 50.1 (15/21), TerminalBench 58.2 (9/22)
| gemini-2.5-pro | google | 34.6 | 34.6 | 40.1 | 40.1 | 54.3 | 54.3 | 47.0 |
group breakdown (score, rank): A_B 81.6 (6/24), A_I 83.3 (7/24), A_P 72.5 (3/24), A_R 84.7 (8/24), BUILD 39.1 (21/24), CRE 0.0 (24/24), GEN 14.5 (21/24), LM_ARENA_REVIEW_PROXY 0.0 (24/24), OPS_long 86.0 (7/24), OPS_precision 80.9 (10/24), OPS_review 78.4 (11/24), PLAN 22.2 (21/24)
metrics (score, rank): AI_code 73.4 (6/22), AI_complexity 68.9 (6/22), AI_context_awareness 14.1 (6/24), AI_correctness 92.5 (13/22), AI_edge_cases 64.7 (17/22), AI_efficiency 80.8 (4/22), AI_hallucination_resistance 92.5 (6/24), AI_memory_retention 92.5 (6/24), AI_parameter_accuracy 92.5 (5/24), AI_plan_coherence 79.1 (6/24), AI_recovery 84.5 (15/22), AI_refusal 92.5 (15/22), AI_spec 92.5 (15/22), AI_stability 83.8 (12/22), AI_task_completion 61.2 (16/24), AI_tool_selection 10.3 (17/24), ARC_AGI_2 3.7 (13/17), ArtificialAnalysisCoding 23.6 (17/21), ArtificialAnalysisIntelligence 14.1 (17/21), ArtificialAnalysisReasoning 44.8 (14/21), BlendedCost 80.1 (10/24), ContextWindow 100.0 (4/24), CopilotArenaOrLMArenaCode 0.9 (21/22), GDPval 7.5 (15/16), GPQA_HLE_Reasoning 44.8 (14/21), IFBench 19.3 (18/21), LMArenaCreativeOrOpenEnded 0.0 (24/24), LMArenaSearchDocument 0.0 (19/19), LMArenaText 0.0 (24/24), LongContextRecall 67.2 (10/21), MCPAtlas 71.1 (5/13), OutputSpeed 92.0 (6/19), SWEBenchPro 75.7 (11/15), SWEBenchVerified 38.2 (17/18), SWEComposite 40.7 (22/24), SWERebench 1.8 (20/21), SciCode 36.1 (15/21), SonarBugDensity 52.7 (10/17), SonarFunctionalSkill 78.9 (6/17), SonarIssueDensity 13.2 (11/17), SonarVulnerabilityDensity 58.2 (9/17), TTFT 61.9 (17/19), Tau2Bench 3.5 (19/21), TerminalBench 1.8 (20/22)
| glm-4.7 | zai | 34.7 | 34.7 | 50.1 | 50.1 | 50.5 | 50.5 | 54.8 |
group breakdown (score, rank): A_B 56.0 (18/24), A_I 55.0 (19/24), A_P 46.9 (19/24), A_R 58.5 (19/24), BUILD 41.8 (20/24), CRE 10.1 (22/24), GEN 35.8 (18/24), LM_ARENA_REVIEW_PROXY 50.0 (12/24), OPS_long 88.1 (5/24), OPS_precision 90.4 (4/24), OPS_review 92.1 (1/24), PLAN 53.6 (14/24)
metrics (score, rank): AI_context_awareness 0.0 (24/24), AI_hallucination_resistance 100.0 (5/24), AI_memory_retention 99.2 (5/24), AI_parameter_accuracy 0.0 (24/24), AI_plan_coherence 100.0 (4/24), AI_task_completion 0.0 (24/24), AI_tool_selection 0.0 (24/24), ArtificialAnalysisCoding 37.9 (14/21), ArtificialAnalysisIntelligence 42.6 (13/21), ArtificialAnalysisReasoning 55.8 (12/21), BlendedCost 96.1 (3/24), ContextWindow 74.9 (18/24), CopilotArenaOrLMArenaCode 68.8 (9/22), GPQA_HLE_Reasoning 55.8 (12/21), IFBench 72.2 (7/21), LMArenaCreativeOrOpenEnded 10.1 (22/24), LMArenaText 10.1 (22/24), LongContextRecall 57.4 (14/21), MCPAtlas 0.0 (13/13), OutputSpeed 86.6 (7/19), SWEComposite 56.3 (15/24), SWERebench 70.9 (13/21), SciCode 48.6 (11/21), SonarBugDensity 51.6 (13/17), SonarFunctionalSkill 0.0 (17/17), SonarIssueDensity 50.8 (5/17), SonarVulnerabilityDensity 28.7 (13/17), TTFT 98.5 (2/19), Tau2Bench 100.0 (2/21), TerminalBench 27.1 (17/22)
| gemini-2.5-flash | google | 54.4 | 54.4 | 35.1 | 35.1 | 49.9 | 49.9 | 53.6 |
group breakdown (score, rank): A_B 89.7 (2/24), A_I 80.9 (10/24), A_P 63.5 (10/24), A_R 90.2 (3/24), BUILD 26.3 (23/24), CRE 45.8 (19/24), GEN 15.1 (20/24), LM_ARENA_REVIEW_PROXY 78.8 (7/24), OPS_long 94.9 (2/24), OPS_precision 91.1 (3/24), OPS_review 89.7 (6/24), PLAN 13.5 (23/24)
metrics (score, rank): AI_code 100.0 (1/22), AI_complexity 84.1 (4/22), AI_context_awareness 100.0 (1/24), AI_correctness 100.0 (5/22), AI_edge_cases 100.0 (5/22), AI_efficiency 100.0 (1/22), AI_hallucination_resistance 69.9 (11/24), AI_memory_retention 37.0 (9/24), AI_parameter_accuracy 73.3 (14/24), AI_plan_coherence 0.0 (24/24), AI_recovery 100.0 (5/22), AI_refusal 100.0 (6/22), AI_spec 100.0 (6/22), AI_stability 53.2 (20/22), AI_task_completion 44.2 (20/24), AI_tool_selection 27.2 (16/24), ARC_AGI_2 0.8 (15/17), ArtificialAnalysisCoding 0.0 (20/21), ArtificialAnalysisIntelligence 0.8 (19/21), ArtificialAnalysisReasoning 17.9 (16/21), BlendedCost 94.4 (5/24), ContextWindow 100.0 (3/24), CopilotArenaOrLMArenaCode 65.3 (13/22), GDPval 10.3 (13/16), GPQA_HLE_Reasoning 17.9 (16/21), IFBench 29.2 (17/21), LMArenaCreativeOrOpenEnded 45.8 (19/24), LMArenaSearchDocument 78.8 (7/19), LMArenaText 45.8 (19/24), LiveCodeBench 100.0 (1/2), LongContextRecall 58.8 (13/21), MCPAtlas 26.6 (8/13), OutputSpeed 100.0 (1/19), SWEBenchPro 52.5 (13/15), SWEBenchVerified 0.0 (18/18), SWEComposite 20.4 (24/24), SWERebench 0.0 (21/21), SciCode 23.5 (16/21), TTFT 77.1 (12/19), Tau2Bench 0.0 (20/21), TerminalBench 0.3 (21/22)
| deepseek-v4-flash | deepseek | 52.9 | 52.9 | 60.9 | 60.9 | 49.0 | 49.0 | 51.0 |
group breakdown (score, rank): A_B 32.7 (21/24), A_I 29.1 (23/24), A_P 38.9 (22/24), A_R 24.9 (24/24), BUILD 49.8 (15/24), CRE 58.8 (16/24), GEN 62.8 (8/24), LM_ARENA_REVIEW_PROXY 50.0 (8/24), OPS_long 87.2 (6/24), OPS_precision 89.9 (5/24), OPS_review 92.0 (2/24), PLAN 71.1 (6/24)
metrics (score, rank): AI_canary_health 78.4 (6/7), AI_code 40.6 (13/22), AI_complexity 29.2 (14/22), AI_context_awareness 0.0 (14/24), AI_correctness 0.0 (21/22), AI_edge_cases 0.0 (21/22), AI_efficiency 71.1 (9/22), AI_hallucination_resistance 40.0 (17/24), AI_memory_retention 0.0 (18/24), AI_parameter_accuracy 91.6 (9/24), AI_plan_coherence 25.3 (11/24), AI_recovery 0.0 (21/22), AI_refusal 100.0 (5/22), AI_spec 100.0 (5/22), AI_stability 0.0 (22/22), AI_task_completion 86.7 (5/24), AI_tool_selection 86.2 (5/24), ArtificialAnalysisCoding 45.6 (11/21), ArtificialAnalysisIntelligence 59.3 (10/21), ArtificialAnalysisReasoning 76.7 (7/21), BlendedCost 100.0 (1/24), ContextWindow 71.6 (22/24), GPQA_HLE_Reasoning 76.7 (7/21), IFBench 100.0 (1/21), LMArenaCreativeOrOpenEnded 58.8 (16/24), LMArenaText 58.8 (16/24), LongContextRecall 52.5 (16/21), OutputSpeed 85.0 (8/19), SWEComposite 50.0 (17/24), SciCode 47.5 (12/21), TTFT 98.5 (3/19), Tau2Bench 97.9 (5/21)
| claude-opus-4.1 | anthropic | 54.8 | 54.8 | 47.3 | 47.3 | 48.0 | 48.0 | 45.5 |
group breakdown (score, rank): A_B 48.3 (19/24), A_I 61.9 (17/24), A_P 49.9 (17/24), A_R 64.0 (17/24), BUILD 48.5 (16/24), CRE 52.9 (17/24), GEN 50.7 (14/24), LM_ARENA_REVIEW_PROXY 0.0 (23/24), OPS_long 48.7 (22/24), OPS_precision 46.2 (24/24), OPS_review 42.5 (24/24), PLAN 43.0 (16/24)
metrics (score, rank): AI_canary_health 79.2 (5/7), AI_code 18.1 (19/22), AI_complexity 25.7 (19/22), AI_context_awareness 0.0 (12/24), AI_correctness 59.5 (17/22), AI_edge_cases 89.0 (7/22), AI_efficiency 39.5 (19/22), AI_hallucination_resistance 40.0 (15/24), AI_memory_retention 0.0 (13/24), AI_parameter_accuracy 71.5 (16/24), AI_plan_coherence 19.8 (14/24), AI_recovery 84.2 (18/22), AI_refusal 56.9 (19/22), AI_spec 56.9 (19/22), AI_stability 100.0 (1/22), AI_task_completion 83.1 (10/24), AI_tool_selection 71.0 (11/24), BlendedCost 0.0 (24/24), ContextWindow 74.7 (20/24), CopilotArenaOrLMArenaCode 53.2 (17/22), LMArenaCreativeOrOpenEnded 52.9 (17/24), LMArenaSearchDocument 0.0 (18/19), LMArenaText 52.9 (17/24), SWEComposite 50.7 (16/24), SWERebench 52.3 (16/21), TerminalBench 29.4 (16/22)
| kimi-k2-0905 | moonshot | 24.3 | 24.3 | 28.1 | 28.1 | 40.0 | 40.0 | 37.5 |
group breakdown (score, rank): A_B 33.0 (20/24), A_I 27.0 (24/24), A_P 36.2 (24/24), A_R 28.7 (23/24), BUILD 42.7 (19/24), CRE 27.6 (20/24), GEN 8.1 (24/24), LM_ARENA_REVIEW_PROXY 50.0 (9/24), OPS_long 35.4 (24/24), OPS_precision 53.7 (22/24), OPS_review 60.2 (19/24), PLAN 29.6 (20/24)
metrics (score, rank): AI_canary_health 88.9 (1/7), AI_code 43.6 (11/22), AI_complexity 29.2 (12/22), AI_context_awareness 0.0 (15/24), AI_correctness 0.0 (22/22), AI_edge_cases 0.0 (22/22), AI_efficiency 44.4 (17/22), AI_hallucination_resistance 60.0 (12/24), AI_memory_retention 0.0 (19/24), AI_parameter_accuracy 81.1 (13/24), AI_plan_coherence 22.5 (13/24), AI_recovery 0.0 (22/22), AI_refusal 100.0 (8/22), AI_spec 100.0 (8/22), AI_stability 1.5 (21/22), AI_task_completion 86.7 (6/24), AI_tool_selection 72.8 (10/24), ArtificialAnalysisCoding 4.2 (19/21), ArtificialAnalysisIntelligence 0.0 (20/21), ArtificialAnalysisReasoning 0.0 (20/21), BlendedCost 92.7 (7/24), ContextWindow 53.4 (23/24), GPQA_HLE_Reasoning 0.0 (20/21), IFBench 0.0 (20/21), LMArenaCreativeOrOpenEnded 27.6 (20/24), LMArenaText 27.6 (20/24), LongContextRecall 0.0 (20/21), OutputSpeed 0.0 (19/19), SWEComposite 50.0 (18/24), SciCode 0.0 (20/21), TTFT 90.9 (5/19), Tau2Bench 48.0 (16/21)
| glm-4.6 | zai | 36.0 | 36.0 | 31.0 | 31.0 | 37.7 | 37.7 | 40.9 |
group breakdown (score, rank): A_B 56.0 (17/24), A_I 55.0 (18/24), A_P 46.9 (18/24), A_R 58.5 (18/24), BUILD 23.1 (24/24), CRE 24.3 (21/24), GEN 13.6 (23/24), LM_ARENA_REVIEW_PROXY 50.0 (11/24), OPS_long 80.3 (13/24), OPS_precision 85.3 (8/24), OPS_review 87.6 (8/24), PLAN 18.1 (22/24)
metrics (score, rank): AI_context_awareness 0.0 (23/24), AI_hallucination_resistance 100.0 (4/24), AI_memory_retention 99.2 (4/24), AI_parameter_accuracy 0.0 (23/24), AI_plan_coherence 100.0 (3/24), AI_task_completion 0.0 (23/24), AI_tool_selection 0.0 (23/24), ArtificialAnalysisCoding 15.9 (18/21), ArtificialAnalysisIntelligence 6.1 (18/21), ArtificialAnalysisReasoning 16.5 (17/21), BlendedCost 95.4 (4/24), ContextWindow 75.0 (17/24), CopilotArenaOrLMArenaCode 44.4 (19/22), GPQA_HLE_Reasoning 16.5 (17/21), IFBench 4.7 (19/21), LMArenaCreativeOrOpenEnded 24.3 (21/24), LMArenaText 24.3 (21/24), LongContextRecall 9.8 (19/21), MCPAtlas 7.5 (11/13), OutputSpeed 72.5 (18/19), SWEBenchPro 0.0 (15/15), SWEBenchVerified 79.0 (15/18), SWEComposite 34.7 (23/24), SWERebench 38.4 (18/21), SciCode 12.0 (19/21), SonarBugDensity 7.5 (15/17), SonarFunctionalSkill 7.5 (16/17), SonarIssueDensity 7.5 (14/17), SonarVulnerabilityDensity 29.0 (12/17), TTFT 98.3 (4/19), Tau2Bench 41.3 (17/21), TerminalBench 13.9 (18/22)
| grok-code-fast-1 | xai | 42.4 | 42.4 | 26.3 | 26.3 | 34.2 | 34.2 | 38.1 |
group breakdown (score, rank): A_B 30.0 (23/24), A_I 38.3 (21/24), A_P 36.7 (23/24), A_R 47.9 (20/24), BUILD 30.6 (22/24), CRE 48.1 (18/24), GEN 15.8 (19/24), LM_ARENA_REVIEW_PROXY 50.0 (10/24), OPS_long 90.5 (4/24), OPS_precision 89.5 (6/24), OPS_review 90.1 (4/24), PLAN 11.4 (24/24)
metrics (score, rank): AI_code 0.7 (21/22), AI_complexity 1.4 (21/22), AI_context_awareness 0.0 (22/24), AI_correctness 21.5 (19/22), AI_edge_cases 62.1 (20/22), AI_efficiency 1.9 (21/22), AI_hallucination_resistance 100.0 (3/24), AI_memory_retention 99.2 (3/24), AI_parameter_accuracy 0.0 (22/24), AI_plan_coherence 100.0 (2/24), AI_recovery 65.6 (20/22), AI_refusal 2.1 (21/22), AI_spec 2.1 (21/22), AI_stability 68.3 (19/22), AI_task_completion 0.0 (22/24), AI_tool_selection 0.0 (22/24), ARC_AGI_2 25.1 (7/17), ArtificialAnalysisCoding 0.0 (21/21), ArtificialAnalysisIntelligence 0.0 (21/21), ArtificialAnalysisReasoning 0.0 (21/21), BlendedCost 99.3 (2/24), ContextWindow 78.4 (16/24), CopilotArenaOrLMArenaCode 0.0 (22/22), GPQA_HLE_Reasoning 0.0 (21/21), IFBench 0.0 (21/21), LMArenaCreativeOrOpenEnded 48.1 (18/24), LMArenaText 48.1 (18/24), LongContextRecall 0.0 (21/21), OutputSpeed 94.1 (4/19), SWEComposite 43.4 (21/24), SWERebench 27.9 (19/21), SciCode 0.0 (21/21), TTFT 85.4 (7/19), Tau2Bench 53.3 (13/21), TerminalBench 0.0 (22/22)