how scoring works
Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill (SWE-bench, LiveCodeBench, terminal tasks). Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the direct LM Arena search/document review proxy, that proxy lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted.
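The adjustment can be sketched as follows. This is a minimal illustration, not the site's actual implementation: it assumes the penalty is the role coefficient multiplied by the vendor's lead on the review proxy (in points), which the text does not state explicitly. The names `PENALTY`, `adjusted_score`, and `proxy_lead` are hypothetical.

```python
# Penalty coefficients quoted in the text; Review is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def adjusted_score(raw: float, role: str, proxy_lead: float) -> float:
    """Subtract the reviewer-reservation penalty from a raw 0-100 role score.

    ASSUMPTION: penalty = coefficient * proxy_lead, where proxy_lead is the
    number of points by which the vendor leads the review proxy (0 if the
    vendor does not lead). The exact formula is not given in the source.
    """
    penalty = PENALTY[role] * max(proxy_lead, 0.0)
    return max(raw - penalty, 0.0)


if __name__ == "__main__":
    # Illustrative inputs only: a raw Build score and a hypothetical 5-point lead.
    print(round(adjusted_score(77.6, "build", 5.0), 1))
    # Review scores pass through unchanged regardless of the lead.
    print(adjusted_score(81.4, "review", 5.0))
```

Under this assumption, a vendor that does not lead the proxy has a lead of zero and its raw and adjusted scores are identical.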
missing data
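If a model is missing some metrics within a group, the group score depends on how much of the group is covered. Between 60% and 80% group coverage, the score blends from a shrink-to-50 estimate toward the weighted mean of the metrics that are present. At 80% coverage and above, the weighted mean of the present metrics is trusted directly.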
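One way to read the coverage rule is sketched below. Several details are assumptions that the text does not confirm: coverage is taken as the weight share of the present metrics, the shrink-to-50 estimate pulls the present mean toward 50 in proportion to the missing weight, the blend between 60% and 80% is linear, and below 60% the shrunk estimate is used as-is. The function name `group_score` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def group_score(
    values: Sequence[Optional[float]],
    weights: Sequence[float],
    lo: float = 0.60,
    hi: float = 0.80,
    prior: float = 50.0,
) -> float:
    """Combine a group's metrics, where None marks a missing metric.

    ASSUMPTIONS (not stated in the source): coverage is weight-based, the
    shrink target is mixed in proportionally to the missing weight, and the
    blend between `lo` and `hi` coverage is linear.
    """
    present = [(v, w) for v, w in zip(values, weights) if v is not None]
    present_w = sum(w for _, w in present)
    if present_w == 0:
        return prior  # nothing observed: fall back to the neutral score

    mean = sum(v * w for v, w in present) / present_w
    coverage = present_w / sum(weights)

    # Shrink-to-50 estimate: missing weight is filled with the prior.
    shrunk = coverage * mean + (1.0 - coverage) * prior

    # Trust rises linearly from 0 at `lo` coverage to 1 at `hi` coverage.
    trust = min(max((coverage - lo) / (hi - lo), 0.0), 1.0)
    return trust * mean + (1.0 - trust) * shrunk


if __name__ == "__main__":
    # 7 of 10 equally weighted metrics present, all scoring 90.
    vals = [90.0] * 7 + [None] * 3
    print(round(group_score(vals, [1.0] * 10), 1))
```

With these assumptions, a group with no present metrics scores exactly 50, and a fully covered group scores its plain weighted mean.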
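| Model | Vendor | Idea (raw) | Idea (adj) | Plan (raw) | Plan (adj) | Build (raw) | Build (adj) | Review | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-3.1-pro-preview | google | 95.4 | 95.4 | 88.2 | 88.2 | 82.5 | 82.5 | 84.7 | ▸ |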
group breakdown (score, rank/total): A_B 82.1 (6/24), A_I 89.1 (6/24), A_P 71.9 (5/24), A_R 74.2 (18/24), BUILD 83.5 (4/24), CRE 99.4 (2/24), GEN 99.9 (1/24), LM_ARENA_REVIEW_PROXY 92.3 (4/24), OPS_long 79.5 (12/24), OPS_precision 67.8 (18/24), OPS_review 75.2 (17/24), PLAN 93.9 (2/24)
metrics (score, rank/total): AI_code 90.6 (5/22), AI_complexity 87.6 (6/22), AI_context_awareness 7.5 (4/24), AI_correctness 92.5 (18/22), AI_edge_cases 60.1 (19/22), AI_efficiency 88.1 (5/22), AI_hallucination_resistance 10.9 (22/24), AI_memory_retention 92.5 (8/24), AI_parameter_accuracy 78.1 (18/24), AI_plan_coherence 92.5 (8/24), AI_recovery 92.5 (18/22), AI_refusal 92.5 (20/22), AI_spec 92.5 (20/22), AI_stability 92.5 (5/22), AI_task_completion 27.4 (19/24), AI_tool_selection 11.8 (19/24), ARC_AGI_2 100.0 (1/17), ArtificialAnalysisCoding 100.0 (1/21), ArtificialAnalysisIntelligence 100.0 (2/21), ArtificialAnalysisReasoning 100.0 (1/21), BlendedCost 77.3 (13/24), ContextWindow 100.0 (6/24), CopilotArenaOrLMArenaCode 73.4 (7/22), GDPval 24.7 (12/16), GPQA_HLE_Reasoning 100.0 (1/21), GSO 51.3 (8/15), IFBench 97.5 (3/21), LMArenaCreativeOrOpenEnded 99.4 (2/23), LMArenaSearchDocument 92.3 (4/19), LMArenaText 99.4 (2/23), LongContextRecall 100.0 (2/21), MCPAtlas 71.1 (6/13), OutputSpeed 93.6 (4/20), SWEBenchPro 89.1 (5/15), SWEBenchVerified 95.0 (4/18), SWEComposite 94.8 (2/24), SWERebench 99.8 (2/21), SciCode 100.0 (2/21), SonarBugDensity 52.7 (12/17), SonarComposite 54.2 (12/24), SonarFunctionalSkill 78.9 (8/17), SonarIssueDensity 13.2 (13/17), SonarVulnerabilityDensity 58.2 (11/17), TTFT 26.5 (19/20), Tau2Bench 99.3 (4/21), TerminalBench 89.4 (3/22)
| gpt-5.5 | openai | 70.8 | 70.8 | 83.7 | 83.7 | 81.6 | 81.6 | 81.1 | ▸ |
group breakdown (score, rank/total): A_B 60.4 (18/24), A_I 72.8 (18/24), A_P 62.0 (10/24), A_R 85.9 (7/24), BUILD 89.1 (1/24), CRE 59.4 (8/24), GEN 88.6 (4/24), LM_ARENA_REVIEW_PROXY 28.2 (14/24), OPS_long 81.3 (8/24), OPS_precision 78.2 (12/24), OPS_review 80.0 (11/24), PLAN 94.3 (1/24)
metrics (score, rank/total): AI_code 10.0 (21/22), AI_complexity 8.2 (21/22), AI_context_awareness 0.0 (20/24), AI_correctness 94.5 (15/22), AI_edge_cases 80.5 (14/22), AI_efficiency 60.7 (16/22), AI_hallucination_resistance 100.0 (14/24), AI_memory_retention 0.0 (24/24), AI_parameter_accuracy 99.1 (3/24), AI_plan_coherence 0.1 (23/24), AI_recovery 98.2 (15/22), AI_refusal 100.0 (16/22), AI_spec 100.0 (16/22), AI_stability 87.1 (10/22), AI_task_completion 100.0 (9/24), AI_tool_selection 91.8 (5/24), ARC_AGI_2 96.7 (2/17), ArtificialAnalysisCoding 100.0 (2/21), ArtificialAnalysisIntelligence 98.1 (3/21), ArtificialAnalysisReasoning 100.0 (2/21), BlendedCost 50.6 (23/24), ContextWindow 100.0 (2/24), CopilotArenaOrLMArenaCode 71.9 (8/22), GDPval 95.0 (1/16), GPQA_HLE_Reasoning 100.0 (2/21), GSO 94.0 (2/15), IFBench 80.7 (6/21), LMArenaCreativeOrOpenEnded 59.4 (8/23), LMArenaSearchDocument 28.2 (9/19), LMArenaText 59.4 (8/23), LongContextRecall 98.0 (3/21), OutputSpeed 81.6 (10/20), SWEBenchPro 95.0 (3/15), SWEBenchVerified 95.0 (7/18), SWEComposite 89.9 (4/24), SWERebench 83.5 (8/21), SciCode 94.5 (4/21), SonarBugDensity 94.5 (2/17), SonarComposite 65.5 (5/24), SonarFunctionalSkill 46.5 (13/17), SonarIssueDensity 52.7 (4/17), SonarVulnerabilityDensity 99.2 (2/17), TTFT 81.6 (7/20), Tau2Bench 90.5 (7/21), TerminalBench 100.0 (1/22)
| claude-opus-4.6 | anthropic | 90.1 | 90.1 | 75.0 | 75.0 | 79.3 | 79.3 | 75.3 | ▸ |
group breakdown (score, rank/total): A_B 64.4 (11/24), A_I 75.8 (13/24), A_P 60.3 (15/24), A_R 86.3 (4/24), BUILD 86.2 (2/24), CRE 100.0 (1/24), GEN 89.5 (3/24), LM_ARENA_REVIEW_PROXY 33.1 (13/24), OPS_long 78.1 (13/24), OPS_precision 74.6 (14/24), OPS_review 77.8 (13/24), PLAN 74.2 (6/24)
metrics (score, rank/total): AI_canary_health 83.3 (5/7), AI_code 14.5 (12/22), AI_complexity 36.8 (10/22), AI_context_awareness 0.0 (8/24), AI_correctness 94.5 (6/22), AI_edge_cases 80.5 (5/22), AI_efficiency 62.5 (14/22), AI_hallucination_resistance 100.0 (3/24), AI_memory_retention 0.0 (13/24), AI_parameter_accuracy 85.1 (12/24), AI_plan_coherence 0.0 (24/24), AI_recovery 98.2 (6/22), AI_refusal 100.0 (3/22), AI_spec 100.0 (3/22), AI_stability 87.1 (7/22), AI_task_completion 83.3 (11/24), AI_tool_selection 100.0 (1/24), ARC_AGI_2 90.9 (4/17), ArtificialAnalysisCoding 76.1 (5/21), ArtificialAnalysisIntelligence 84.0 (5/21), ArtificialAnalysisReasoning 86.3 (5/21), BlendedCost 61.9 (21/24), ContextWindow 99.3 (7/24), CopilotArenaOrLMArenaCode 99.8 (2/22), GDPval 72.8 (7/16), GPQA_HLE_Reasoning 86.3 (5/21), GSO 75.3 (3/15), IFBench 31.4 (16/21), LMArenaCreativeOrOpenEnded 100.0 (1/23), LMArenaSearchDocument 33.1 (8/19), LMArenaText 100.0 (1/23), LongContextRecall 90.2 (4/21), OutputSpeed 79.1 (15/20), SWEBenchMultilingual 90.9 (2/6), SWEBenchPro 100.0 (1/15), SWEBenchVerified 99.7 (2/18), SWEComposite 95.7 (1/24), SWERebench 91.6 (4/21), SciCode 85.8 (5/21), SonarBugDensity 59.5 (8/17), SonarComposite 70.5 (4/24), SonarFunctionalSkill 92.2 (3/17), SonarIssueDensity 46.8 (6/17), SonarVulnerabilityDensity 66.6 (7/17), TTFT 67.5 (16/20), Tau2Bench 91.2 (6/21), TerminalBench 64.2 (7/22)
| claude-opus-4.7 | anthropic | 87.0 | 86.5 | 79.3 | 78.3 | 77.6 | 76.0 | 81.4 | ▸ |
group breakdown (score, rank/total): A_B 56.1 (19/24), A_I 77.1 (7/24), A_P 61.7 (11/24), A_R 68.9 (19/24), BUILD 86.2 (3/24), CRE 89.6 (3/24), GEN 95.5 (2/24), LM_ARENA_REVIEW_PROXY 100.0 (1/24), OPS_long 77.9 (15/24), OPS_precision 73.8 (15/24), OPS_review 77.3 (14/24), PLAN 79.9 (4/24)
metrics (score, rank/total): AI_code 10.0 (16/22), AI_complexity 36.8 (11/22), AI_context_awareness 0.0 (9/24), AI_correctness 94.5 (7/22), AI_edge_cases 80.5 (6/22), AI_efficiency 71.4 (7/22), AI_hallucination_resistance 0.0 (24/24), AI_memory_retention 0.0 (14/24), AI_parameter_accuracy 88.5 (10/24), AI_plan_coherence 5.4 (14/24), AI_recovery 98.2 (7/22), AI_refusal 100.0 (4/22), AI_spec 100.0 (4/22), AI_stability 87.1 (8/22), AI_task_completion 100.0 (2/24), AI_tool_selection 70.5 (13/24), ARC_AGI_2 92.7 (3/17), ArtificialAnalysisCoding 90.3 (3/21), ArtificialAnalysisIntelligence 100.0 (1/21), ArtificialAnalysisReasoning 95.6 (3/21), BlendedCost 61.9 (22/24), ContextWindow 99.3 (8/24), CopilotArenaOrLMArenaCode 100.0 (1/22), GDPval 93.9 (2/16), GPQA_HLE_Reasoning 95.6 (3/21), GSO 100.0 (1/15), IFBench 46.6 (10/21), LMArenaCreativeOrOpenEnded 89.6 (3/23), LMArenaSearchDocument 100.0 (1/19), LMArenaText 89.6 (3/23), LongContextRecall 88.2 (6/21), OutputSpeed 79.7 (12/20), SWEBenchPro 95.0 (2/15), SWEBenchVerified 95.0 (3/18), SWEComposite 90.7 (3/24), SWERebench 85.3 (6/21), SciCode 100.0 (1/21), SonarBugDensity 50.1 (14/17), SonarComposite 51.4 (13/24), SonarFunctionalSkill 93.9 (2/17), SonarIssueDensity 0.0 (17/17), SonarVulnerabilityDensity 25.3 (14/17), TTFT 64.6 (17/20), Tau2Bench 83.1 (9/21), TerminalBench 78.2 (4/22)
| claude-opus-4.5 | anthropic | 58.1 | 58.1 | 65.6 | 65.6 | 75.5 | 75.5 | 69.0 | ▸ |
group breakdown (score, rank/total): A_B 67.6 (7/24), A_I 76.4 (8/24), A_P 62.9 (8/24), A_R 87.7 (3/24), BUILD 79.5 (5/24), CRE 43.6 (15/24), GEN 61.6 (6/24), LM_ARENA_REVIEW_PROXY 11.2 (21/24), OPS_long 76.9 (17/24), OPS_precision 73.7 (16/24), OPS_review 73.8 (18/24), PLAN 68.2 (8/24)
metrics (score, rank/total): AI_canary_health 88.5 (2/7), AI_code 28.1 (7/22), AI_complexity 36.8 (9/22), AI_context_awareness 0.0 (7/24), AI_correctness 94.5 (5/22), AI_edge_cases 80.5 (4/22), AI_efficiency 65.7 (10/22), AI_hallucination_resistance 100.0 (2/24), AI_memory_retention 0.0 (12/24), AI_parameter_accuracy 99.8 (2/24), AI_plan_coherence 2.8 (17/24), AI_recovery 98.2 (5/22), AI_refusal 100.0 (2/22), AI_spec 100.0 (2/22), AI_stability 87.1 (6/22), AI_task_completion 100.0 (1/24), AI_tool_selection 95.8 (4/24), ArtificialAnalysisCoding 75.1 (6/21), ArtificialAnalysisIntelligence 71.5 (7/21), ArtificialAnalysisReasoning 63.7 (9/21), BlendedCost 61.9 (20/24), ContextWindow 74.7 (21/24), CopilotArenaOrLMArenaCode 76.8 (6/22), GDPval 71.5 (8/16), GPQA_HLE_Reasoning 63.7 (9/21), GSO 59.3 (5/15), IFBench 44.9 (11/21), LMArenaCreativeOrOpenEnded 43.6 (14/23), LMArenaSearchDocument 11.2 (16/19), LMArenaText 43.6 (14/23), LongContextRecall 100.0 (1/21), OutputSpeed 81.4 (11/20), SWEBenchPro 88.4 (6/15), SWEBenchVerified 92.2 (9/18), SWEComposite 83.8 (7/24), SWERebench 76.5 (9/21), SciCode 72.7 (7/21), SonarBugDensity 73.7 (5/17), SonarComposite 87.1 (1/24), SonarFunctionalSkill 100.0 (1/17), SonarIssueDensity 77.2 (3/17), SonarVulnerabilityDensity 87.2 (3/17), TTFT 73.5 (15/20), Tau2Bench 85.2 (8/21), TerminalBench 54.8 (11/22)
| kimi-k2.6 | moonshot | 63.8 | 63.8 | 75.7 | 75.7 | 72.8 | 72.8 | 84.1 | ▸ |
group breakdown (score, rank/total): A_B 62.6 (15/24), A_I 76.3 (9/24), A_P 60.2 (16/24), A_R 85.6 (9/24), BUILD 74.2 (7/24), CRE 53.8 (9/24), GEN 67.7 (5/24), LM_ARENA_REVIEW_PROXY 94.8 (2/24), OPS_long 72.7 (18/24), OPS_precision 80.2 (9/24), OPS_review 78.9 (12/24), PLAN 89.1 (3/24)
metrics (score, rank/total): AI_code 10.0 (19/22), AI_complexity 36.8 (17/22), AI_context_awareness 0.0 (16/24), AI_correctness 94.5 (11/22), AI_edge_cases 80.5 (10/22), AI_efficiency 54.8 (17/22), AI_hallucination_resistance 100.0 (10/24), AI_memory_retention 0.0 (20/24), AI_parameter_accuracy 78.9 (15/24), AI_plan_coherence 15.9 (12/24), AI_recovery 98.2 (11/22), AI_refusal 100.0 (12/22), AI_spec 100.0 (12/22), AI_stability 84.2 (11/22), AI_task_completion 83.3 (13/24), AI_tool_selection 63.9 (14/24), ARC_AGI_2 11.9 (9/17), ArtificialAnalysisCoding 72.8 (7/21), ArtificialAnalysisIntelligence 87.5 (4/21), ArtificialAnalysisReasoning 87.6 (4/21), BlendedCost 89.0 (9/24), ContextWindow 78.8 (14/24), CopilotArenaOrLMArenaCode 94.4 (4/22), GDPval 52.1 (10/16), GPQA_HLE_Reasoning 87.6 (4/21), IFBench 94.5 (4/21), LMArenaCreativeOrOpenEnded 53.8 (9/23), LMArenaSearchDocument 94.8 (2/19), LMArenaText 53.8 (9/23), LongContextRecall 85.3 (7/21), MCPAtlas 92.5 (2/13), OutputSpeed 61.0 (19/20), SWEBenchVerified 95.0 (5/18), SWEComposite 66.0 (14/24), SWERebench 73.1 (12/21), SciCode 94.5 (3/21), SonarBugDensity 92.5 (3/17), SonarComposite 80.6 (3/24), SonarFunctionalSkill 66.8 (12/17), SonarIssueDensity 92.5 (2/17), SonarVulnerabilityDensity 81.6 (5/17), TTFT 92.4 (5/20), Tau2Bench 100.0 (1/21), TerminalBench 74.6 (5/22)
| glm-5.1 | zai | 67.7 | 67.7 | 65.6 | 65.6 | 71.5 | 71.5 | 77.9 | ▸ |
group breakdown (score, rank/total): A_B 60.7 (17/24), A_I 72.4 (19/24), A_P 58.7 (17/24), A_R 80.2 (14/24), BUILD 73.3 (9/24), CRE 69.4 (6/24), GEN 53.2 (10/24), LM_ARENA_REVIEW_PROXY 88.0 (5/24), OPS_long 83.6 (7/24), OPS_precision 88.3 (5/24), OPS_review 85.8 (7/24), PLAN 73.3 (7/24)
metrics (score, rank/total): AI_code 16.0 (11/22), AI_complexity 38.8 (7/22), AI_context_awareness 7.5 (5/24), AI_correctness 87.8 (19/22), AI_edge_cases 76.0 (15/22), AI_efficiency 54.1 (18/22), AI_hallucination_resistance 92.5 (19/24), AI_memory_retention 7.5 (10/24), AI_parameter_accuracy 74.5 (19/24), AI_plan_coherence 21.0 (11/24), AI_recovery 91.0 (19/22), AI_refusal 92.5 (21/22), AI_spec 92.5 (21/22), AI_stability 79.0 (16/22), AI_task_completion 78.3 (14/24), AI_tool_selection 61.8 (16/24), ARC_AGI_2 5.2 (11/17), ArtificialAnalysisCoding 39.5 (13/21), ArtificialAnalysisIntelligence 60.5 (8/21), ArtificialAnalysisReasoning 54.0 (13/21), BlendedCost 93.0 (6/24), ContextWindow 74.9 (19/24), CopilotArenaOrLMArenaCode 95.9 (3/22), GDPval 59.5 (9/16), GPQA_HLE_Reasoning 54.0 (13/21), IFBench 86.8 (5/21), LMArenaCreativeOrOpenEnded 69.4 (6/23), LMArenaSearchDocument 88.0 (5/19), LMArenaText 69.4 (6/23), LongContextRecall 41.2 (17/21), MCPAtlas 100.0 (1/13), OutputSpeed 78.2 (17/20), SWEBenchMultilingual 50.9 (3/6), SWEBenchVerified 91.9 (10/18), SWEComposite 78.6 (8/24), SWERebench 100.0 (1/21), SciCode 40.4 (14/21), SonarBugDensity 100.0 (1/17), SonarComposite 86.0 (2/24), SonarFunctionalSkill 69.8 (9/17), SonarIssueDensity 100.0 (1/17), SonarVulnerabilityDensity 87.2 (4/17), TTFT 100.0 (1/20), Tau2Bench 100.0 (3/21), TerminalBench 55.8 (10/22)
| claude-sonnet-4.6 | anthropic | 56.1 | 56.1 | 59.2 | 59.2 | 70.9 | 70.9 | 66.1 | ▸ |
group breakdown (score, rank/total): A_B 65.6 (9/24), A_I 74.1 (17/24), A_P 61.4 (14/24), A_R 86.0 (6/24), BUILD 76.2 (6/24), CRE 43.6 (14/24), GEN 58.1 (8/24), LM_ARENA_REVIEW_PROXY 23.3 (15/24), OPS_long 68.5 (19/24), OPS_precision 55.0 (22/24), OPS_review 64.9 (19/24), PLAN 59.5 (10/24)
metrics (score, rank/total): AI_canary_health 88.2 (3/7), AI_code 23.6 (9/22), AI_complexity 36.8 (14/22), AI_context_awareness 0.0 (12/24), AI_correctness 94.5 (9/22), AI_edge_cases 80.5 (8/22), AI_efficiency 64.7 (12/22), AI_hallucination_resistance 100.0 (6/24), AI_memory_retention 0.0 (17/24), AI_parameter_accuracy 90.7 (7/24), AI_plan_coherence 0.1 (21/24), AI_recovery 98.2 (9/22), AI_refusal 100.0 (7/22), AI_spec 100.0 (7/22), AI_stability 75.1 (17/22), AI_task_completion 100.0 (5/24), AI_tool_selection 99.8 (2/24), ARC_AGI_2 10.6 (10/17), ArtificialAnalysisCoding 85.1 (4/21), ArtificialAnalysisIntelligence 79.1 (6/21), ArtificialAnalysisReasoning 68.7 (8/21), BlendedCost 74.4 (18/24), ContextWindow 99.3 (11/24), CopilotArenaOrLMArenaCode 93.2 (5/22), GDPval 80.1 (6/16), GPQA_HLE_Reasoning 68.7 (8/21), GSO 30.7 (10/15), IFBench 41.0 (13/21), LMArenaCreativeOrOpenEnded 43.6 (13/23), LMArenaSearchDocument 23.3 (10/19), LMArenaText 43.6 (13/23), LongContextRecall 90.2 (5/21), MCPAtlas 69.8 (7/13), OutputSpeed 84.0 (9/20), SWEBenchPro 76.5 (10/15), SWEBenchVerified 90.3 (11/18), SWEComposite 87.4 (6/24), SWERebench 95.7 (3/21), SciCode 57.9 (8/21), SonarBugDensity 65.8 (6/17), SonarComposite 55.8 (8/24), SonarFunctionalSkill 84.5 (4/17), SonarIssueDensity 22.3 (10/17), SonarVulnerabilityDensity 21.8 (15/17), TTFT 0.0 (20/20), Tau2Bench 53.3 (12/21), TerminalBench 47.4 (14/22)
| gpt-5.4 | openai | 57.0 | 57.0 | 50.1 | 50.1 | 70.5 | 70.5 | 62.0 | ▸ |
group breakdown (score, rank/total): A_B 63.2 (14/24), A_I 75.8 (14/24), A_P 62.6 (9/24), A_R 85.4 (12/24), BUILD 73.7 (8/24), CRE 50.1 (10/24), GEN 38.2 (16/24), LM_ARENA_REVIEW_PROXY 17.6 (20/24), OPS_long 92.1 (3/24), OPS_precision 87.3 (6/24), OPS_review 89.3 (4/24), PLAN 43.3 (17/24)
metrics (score, rank/total): AI_code 10.0 (20/22), AI_complexity 36.8 (20/22), AI_context_awareness 0.0 (19/24), AI_correctness 94.5 (14/22), AI_edge_cases 80.5 (13/22), AI_efficiency 64.9 (11/22), AI_hallucination_resistance 100.0 (13/24), AI_memory_retention 0.0 (23/24), AI_parameter_accuracy 96.6 (4/24), AI_plan_coherence 5.4 (16/24), AI_recovery 98.2 (14/22), AI_refusal 100.0 (15/22), AI_spec 100.0 (15/22), AI_stability 82.4 (14/22), AI_task_completion 100.0 (8/24), AI_tool_selection 90.5 (7/24), ARC_AGI_2 75.8 (5/17), ArtificialAnalysisCoding 33.7 (15/21), ArtificialAnalysisIntelligence 27.4 (16/21), ArtificialAnalysisReasoning 15.5 (18/21), BlendedCost 75.0 (15/24), ContextWindow 100.0 (1/24), CopilotArenaOrLMArenaCode 68.0 (11/22), GDPval 81.4 (4/16), GPQA_HLE_Reasoning 15.5 (18/21), GSO 54.0 (6/15), IFBench 62.5 (9/21), LMArenaCreativeOrOpenEnded 50.1 (10/23), LMArenaSearchDocument 17.6 (15/19), LMArenaText 50.1 (10/23), LongContextRecall 24.5 (18/21), MCPAtlas 72.8 (4/13), OutputSpeed 97.3 (3/20), SWEBenchPro 92.5 (4/15), SWEBenchVerified 95.0 (6/18), SWEComposite 88.9 (5/24), SWERebench 83.5 (7/21), SciCode 12.0 (18/21), SonarBugDensity 84.7 (4/17), SonarComposite 60.4 (6/24), SonarFunctionalSkill 66.8 (11/17), SonarIssueDensity 6.8 (15/17), SonarVulnerabilityDensity 100.0 (1/17), TTFT 80.4 (8/20), Tau2Bench 0.0 (21/21), TerminalBench 100.0 (2/22)
| gemini-3-pro | google | 79.8 | 79.8 | 60.4 | 60.4 | 68.1 | 68.1 | 57.7 | ▸ |
group breakdown (score, rank/total): A_B 87.8 (3/24), A_I 96.0 (1/24), A_P 75.8 (2/24), A_R 78.5 (15/24), BUILD 64.4 (11/24), CRE 87.6 (4/24), GEN 58.2 (7/24), LM_ARENA_REVIEW_PROXY 19.9 (18/24), OPS_long 45.2 (23/24), OPS_precision 48.0 (23/24), OPS_review 43.0 (24/24), PLAN 55.2 (13/24)
metrics (score, rank/total): AI_code 97.7 (2/22), AI_complexity 94.2 (3/22), AI_context_awareness 0.0 (14/24), AI_correctness 100.0 (2/22), AI_edge_cases 61.8 (16/22), AI_efficiency 94.8 (2/22), AI_hallucination_resistance 4.0 (23/24), AI_memory_retention 100.0 (1/24), AI_parameter_accuracy 83.0 (13/24), AI_plan_coherence 100.0 (1/24), AI_recovery 100.0 (2/22), AI_refusal 100.0 (10/22), AI_spec 100.0 (10/22), AI_stability 100.0 (1/22), AI_task_completion 23.4 (20/24), AI_tool_selection 5.0 (20/24), ARC_AGI_2 41.9 (6/17), BlendedCost 77.3 (12/24), ContextWindow 0.0 (24/24), CopilotArenaOrLMArenaCode 68.4 (10/22), GDPval 5.0 (16/16), GSO 40.7 (9/15), LMArenaCreativeOrOpenEnded 87.6 (4/23), LMArenaSearchDocument 19.9 (13/19), LMArenaText 87.6 (4/23), MCPAtlas 74.9 (3/13), SWEBenchMultilingual 33.5 (4/6), SWEBenchPro 80.3 (8/15), SWEBenchVerified 82.9 (13/18), SWEComposite 72.1 (11/24), SWERebench 70.6 (14/21), SonarBugDensity 53.2 (9/17), SonarComposite 54.9 (9/24), SonarFunctionalSkill 84.1 (5/17), SonarIssueDensity 6.7 (16/17), SonarVulnerabilityDensity 59.7 (8/17), TerminalBench 61.2 (8/22)
| gemini-3-flash | google | 73.7 | 73.7 | 66.9 | 66.9 | 67.5 | 67.5 | 62.1 | ▸ |
group breakdown (score, rank/total): A_B 82.1 (5/24), A_I 89.1 (5/24), A_P 71.9 (4/24), A_R 74.2 (17/24), BUILD 59.0 (12/24), CRE 69.8 (5/24), GEN 57.5 (9/24), LM_ARENA_REVIEW_PROXY 20.0 (17/24), OPS_long 94.5 (1/24), OPS_precision 90.6 (3/24), OPS_review 92.7 (1/24), PLAN 65.5 (9/24)
metrics (score, rank/total): AI_code 90.6 (4/22), AI_complexity 87.6 (5/22), AI_context_awareness 7.5 (3/24), AI_correctness 92.5 (17/22), AI_edge_cases 60.1 (18/22), AI_efficiency 88.1 (4/22), AI_hallucination_resistance 10.9 (21/24), AI_memory_retention 92.5 (7/24), AI_parameter_accuracy 78.1 (17/24), AI_plan_coherence 92.5 (7/24), AI_recovery 92.5 (17/22), AI_refusal 92.5 (19/22), AI_spec 92.5 (19/22), AI_stability 92.5 (4/22), AI_task_completion 27.4 (18/24), AI_tool_selection 11.8 (18/24), ARC_AGI_2 3.1 (14/17), ArtificialAnalysisCoding 58.3 (9/21), ArtificialAnalysisIntelligence 58.9 (11/21), ArtificialAnalysisReasoning 82.7 (6/21), BlendedCost 91.5 (8/24), ContextWindow 100.0 (5/24), CopilotArenaOrLMArenaCode 68.0 (12/22), GDPval 8.0 (14/16), GPQA_HLE_Reasoning 82.7 (6/21), GSO 14.0 (13/15), IFBench 100.0 (2/21), LMArenaCreativeOrOpenEnded 69.8 (5/23), LMArenaSearchDocument 20.0 (12/19), LMArenaText 69.8 (5/23), LongContextRecall 68.6 (9/21), MCPAtlas 22.4 (9/13), OutputSpeed 99.2 (2/20), SWEBenchMultilingual 100.0 (1/6), SWEBenchPro 53.0 (12/15), SWEBenchVerified 100.0 (1/18), SWEComposite 74.1 (9/24), SWERebench 76.3 (10/21), SciCode 78.7 (6/21), SonarBugDensity 52.7 (11/17), SonarComposite 54.2 (11/24), SonarFunctionalSkill 78.9 (7/17), SonarIssueDensity 13.2 (12/17), SonarVulnerabilityDensity 58.2 (10/17), TTFT 78.7 (10/20), Tau2Bench 64.2 (10/21), TerminalBench 48.3 (12/22)
| gpt-5.3-codex | openai | 56.9 | 56.9 | 54.6 | 54.6 | 64.7 | 64.7 | 71.1 | ▸ |
group breakdown (score, rank/total): A_B 64.5 (10/24), A_I 76.0 (12/24), A_P 57.3 (19/24), A_R 86.3 (5/24), BUILD 66.2 (10/24), CRE 50.1 (11/24), GEN 50.0 (11/24), LM_ARENA_REVIEW_PROXY 92.5 (3/24), OPS_long 58.0 (21/24), OPS_precision 60.6 (20/24), OPS_review 64.1 (21/24), PLAN 54.9 (14/24)
metrics (score, rank/total): AI_code 14.5 (14/22), AI_complexity 36.8 (19/22), AI_context_awareness 0.0 (18/24), AI_correctness 94.5 (13/22), AI_edge_cases 80.5 (12/22), AI_efficiency 63.9 (13/22), AI_hallucination_resistance 100.0 (12/24), AI_memory_retention 0.0 (22/24), AI_parameter_accuracy 91.4 (5/24), AI_plan_coherence 0.1 (22/24), AI_recovery 98.2 (13/22), AI_refusal 100.0 (14/22), AI_spec 100.0 (14/22), AI_stability 87.1 (9/22), AI_task_completion 66.7 (15/24), AI_tool_selection 71.8 (12/24), BlendedCost 76.6 (14/24), ContextWindow 85.3 (13/24), CopilotArenaOrLMArenaCode 59.3 (14/22), GDPval 51.5 (11/16), GSO 53.4 (7/15), LMArenaCreativeOrOpenEnded 50.1 (11/23), LMArenaSearchDocument 92.5 (3/19), LMArenaText 50.1 (11/23), SWEBenchVerified 92.5 (8/18), SWEComposite 72.2 (10/24), SWERebench 89.5 (5/21), SonarComposite 50.0 (18/24), TerminalBench 74.3 (6/22)
| deepseek-v4-flash | deepseek | 41.6 | 41.6 | 67.2 | 67.2 | 58.8 | 58.8 | 69.3 | ▸ |
group breakdown (score, rank/total): A_B 66.5 (8/24), A_I 74.2 (15/24), A_P 61.4 (13/24), A_R 85.7 (8/24), BUILD 49.7 (15/24), CRE 12.7 (21/24), GEN 49.3 (12/24), LM_ARENA_REVIEW_PROXY 50.0 (8/24), OPS_long 88.1 (5/24), OPS_precision 91.4 (2/24), OPS_review 88.6 (5/24), PLAN 77.6 (5/24)
metrics (score, rank/total): AI_canary_health 84.4 (4/7), AI_code 28.1 (8/22), AI_complexity 36.8 (15/22), AI_context_awareness 0.0 (13/24), AI_correctness 94.5 (10/22), AI_edge_cases 80.5 (9/22), AI_efficiency 70.8 (8/22), AI_hallucination_resistance 100.0 (7/24), AI_memory_retention 0.0 (18/24), AI_parameter_accuracy 88.9 (9/24), AI_plan_coherence 8.0 (13/24), AI_recovery 98.2 (10/22), AI_refusal 100.0 (8/22), AI_spec 100.0 (8/22), AI_stability 67.9 (19/22), AI_task_completion 100.0 (6/24), AI_tool_selection 83.8 (11/24), ArtificialAnalysisCoding 45.6 (11/21), ArtificialAnalysisIntelligence 59.3 (10/21), ArtificialAnalysisReasoning 76.7 (7/21), BlendedCost 100.0 (1/24), ContextWindow 71.6 (22/24), GPQA_HLE_Reasoning 76.7 (7/21), IFBench 100.0 (1/21), LMArenaCreativeOrOpenEnded 12.7 (20/23), LMArenaText 12.7 (20/23), LongContextRecall 52.5 (16/21), OutputSpeed 86.4 (8/20), SWEComposite 50.0 (17/24), SciCode 47.5 (12/21), SonarComposite 50.0 (15/24), TTFT 99.1 (2/20), Tau2Bench 97.9 (5/21)
| claude-sonnet-4.5 | anthropic | 42.8 | 42.8 | 46.9 | 46.9 | 56.8 | 56.8 | 53.1 | ▸ |
group breakdown (score, rank/total): A_B 64.1 (12/24), A_I 76.2 (11/24), A_P 61.6 (12/24), A_R 85.4 (11/24), BUILD 52.5 (13/24), CRE 23.9 (20/24), GEN 32.2 (18/24), LM_ARENA_REVIEW_PROXY 1.8 (22/24), OPS_long 80.5 (10/24), OPS_precision 79.6 (10/24), OPS_review 82.0 (10/24), PLAN 41.6 (18/24)
metrics (score, rank/total): AI_canary_health 80.3 (6/7), AI_code 10.0 (18/22), AI_complexity 36.8 (13/22), AI_context_awareness 0.0 (11/24), AI_correctness 94.5 (8/22), AI_edge_cases 80.5 (7/22), AI_efficiency 76.1 (6/22), AI_hallucination_resistance 100.0 (5/24), AI_memory_retention 0.0 (16/24), AI_parameter_accuracy 85.7 (11/24), AI_plan_coherence 0.1 (20/24), AI_recovery 98.2 (8/22), AI_refusal 100.0 (6/22), AI_spec 100.0 (6/22), AI_stability 82.4 (13/22), AI_task_completion 100.0 (4/24), AI_tool_selection 90.5 (6/24), ARC_AGI_2 3.7 (12/17), ArtificialAnalysisCoding 45.3 (12/21), ArtificialAnalysisIntelligence 46.0 (12/21), ArtificialAnalysisReasoning 35.3 (15/21), BlendedCost 74.4 (17/24), ContextWindow 99.3 (10/24), CopilotArenaOrLMArenaCode 53.4 (16/22), GDPval 81.9 (3/16), GPQA_HLE_Reasoning 35.3 (15/21), GSO 27.3 (11/15), IFBench 43.0 (12/21), LMArenaCreativeOrOpenEnded 23.9 (19/23), LMArenaSearchDocument 1.8 (17/19), LMArenaText 23.9 (19/23), LongContextRecall 65.7 (11/21), MCPAtlas 6.6 (12/13), OutputSpeed 78.4 (16/20), SWEBenchMultilingual 3.9 (5/6), SWEBenchPro 81.2 (7/15), SWEBenchVerified 85.7 (12/18), SWEComposite 71.6 (12/24), SWERebench 74.9 (11/21), SciCode 46.4 (13/21), SonarBugDensity 2.8 (16/17), SonarComposite 15.6 (23/24), SonarFunctionalSkill 17.2 (15/17), SonarIssueDensity 30.0 (9/17), SonarVulnerabilityDensity 4.6 (16/17), TTFT 75.3 (11/20), Tau2Bench 58.9 (11/21), TerminalBench 37.4 (15/22)
| gpt-5.2 | openai | 46.9 | 46.9 | 55.3 | 55.3 | 55.0 | 55.0 | 57.8 | ▸ |
group breakdown (score, rank/total): A_B 63.4 (13/24), A_I 76.2 (10/24), A_P 65.3 (7/24), A_R 85.1 (13/24), BUILD 50.8 (14/24), CRE 31.5 (17/24), GEN 43.1 (13/24), LM_ARENA_REVIEW_PROXY 21.2 (16/24), OPS_long 58.3 (20/24), OPS_precision 61.3 (19/24), OPS_review 64.8 (20/24), PLAN 56.3 (11/24)
metrics (score, rank/total): AI_code 14.5 (13/22), AI_complexity 36.8 (18/22), AI_context_awareness 0.0 (17/24), AI_correctness 94.5 (12/22), AI_edge_cases 80.5 (11/22), AI_efficiency 61.4 (15/22), AI_hallucination_resistance 100.0 (11/24), AI_memory_retention 0.0 (21/24), AI_parameter_accuracy 91.1 (6/24), AI_plan_coherence 23.8 (10/24), AI_recovery 98.2 (12/22), AI_refusal 100.0 (13/22), AI_spec 100.0 (13/22), AI_stability 75.1 (18/22), AI_task_completion 100.0 (7/24), AI_tool_selection 95.8 (3/24), ARC_AGI_2 0.0 (17/17), ArtificialAnalysisCoding 63.4 (8/21), ArtificialAnalysisIntelligence 59.7 (9/21), ArtificialAnalysisReasoning 56.4 (11/21), BlendedCost 80.1 (11/24), ContextWindow 85.3 (12/24), CopilotArenaOrLMArenaCode 38.7 (20/22), GPQA_HLE_Reasoning 56.4 (11/21), GSO 64.7 (4/15), IFBench 64.7 (8/21), LMArenaCreativeOrOpenEnded 31.5 (16/23), LMArenaSearchDocument 21.2 (11/19), LMArenaText 31.5 (16/23), LongContextRecall 53.9 (15/21), SWEBenchMultilingual 0.0 (6/6), SWEBenchPro 38.2 (14/15), SWEBenchVerified 81.3 (14/18), SWEComposite 45.6 (20/24), SciCode 54.6 (9/21), SonarBugDensity 64.2 (7/17), SonarComposite 59.7 (7/24), SonarFunctionalSkill 67.2 (10/17), SonarIssueDensity 35.7 (8/17), SonarVulnerabilityDensity 73.4 (6/17), Tau2Bench 50.1 (15/21), TerminalBench 58.2 (9/22)
| glm-4.7 | zai | 41.8 | 41.8 | 52.4 | 52.4 | 51.7 | 51.7 | 55.3 | ▸ |
group breakdown (score, rank/total): A_B 56.0 (21/24), A_I 55.0 (21/24), A_P 46.9 (21/24), A_R 58.5 (21/24), BUILD 44.5 (19/24), CRE 27.0 (19/24), GEN 40.8 (15/24), LM_ARENA_REVIEW_PROXY 50.0 (12/24), OPS_long 90.4 (4/24), OPS_precision 92.1 (1/24), OPS_review 89.7 (3/24), PLAN 55.5 (12/24)
metrics (score, rank/total): AI_context_awareness 0.0 (24/24), AI_hallucination_resistance 100.0 (18/24), AI_memory_retention 98.9 (5/24), AI_parameter_accuracy 0.0 (24/24), AI_plan_coherence 100.0 (5/24), AI_task_completion 0.0 (24/24), AI_tool_selection 0.0 (24/24), ArtificialAnalysisCoding 37.9 (14/21), ArtificialAnalysisIntelligence 42.6 (13/21), ArtificialAnalysisReasoning 55.8 (12/21), BlendedCost 96.1 (3/24), ContextWindow 74.9 (18/24), CopilotArenaOrLMArenaCode 68.8 (9/22), GPQA_HLE_Reasoning 55.8 (12/21), IFBench 72.2 (7/21), LMArenaCreativeOrOpenEnded 27.0 (18/23), LMArenaText 27.0 (18/23), LongContextRecall 57.4 (14/21), MCPAtlas 0.0 (13/13), OutputSpeed 90.8 (7/20), SWEComposite 58.4 (15/24), SWERebench 70.9 (13/21), SciCode 48.6 (11/21), SonarBugDensity 51.6 (13/17), SonarComposite 27.3 (21/24), SonarFunctionalSkill 0.0 (17/17), SonarIssueDensity 50.8 (5/17), SonarVulnerabilityDensity 28.7 (13/17), TTFT 98.4 (3/20), Tau2Bench 100.0 (2/21), TerminalBench 27.1 (17/22)
| claude-opus-4.1 | anthropic | 30.3 | 30.3 | 46.4 | 46.4 | 51.2 | 51.2 | 50.1 | ▸ |
group breakdown (score, rank/total): A_B 62.1 (16/24), A_I 74.1 (16/24), A_P 58.1 (18/24), A_R 85.4 (10/24), BUILD 48.5 (17/24), CRE 0.7 (23/24), GEN 37.7 (17/24), LM_ARENA_REVIEW_PROXY 0.0 (23/24), OPS_long 48.7 (22/24), OPS_precision 43.7 (24/24), OPS_review 46.2 (23/24), PLAN 45.9 (15/24)
metrics (score, rank/total): AI_canary_health 68.1 (7/7), AI_code 10.0 (15/22), AI_complexity 36.8 (8/22), AI_context_awareness 0.0 (6/24), AI_correctness 94.5 (4/22), AI_edge_cases 80.5 (3/22), AI_efficiency 50.3 (20/22), AI_hallucination_resistance 100.0 (1/24), AI_memory_retention 0.0 (11/24), AI_parameter_accuracy 71.0 (20/24), AI_plan_coherence 0.1 (19/24), AI_recovery 98.2 (4/22), AI_refusal 100.0 (1/22), AI_spec 100.0 (1/22), AI_stability 82.4 (12/22), AI_task_completion 83.3 (10/24), AI_tool_selection 86.5 (10/24), BlendedCost 0.0 (24/24), ContextWindow 74.7 (20/24), CopilotArenaOrLMArenaCode 53.2 (17/22), LMArenaCreativeOrOpenEnded 0.7 (22/23), LMArenaSearchDocument 0.0 (18/19), LMArenaText 0.7 (22/23), SWEComposite 50.9 (16/24), SWERebench 52.3 (16/21), SonarComposite 50.0 (14/24), TerminalBench 29.4 (16/22)
| gemini-2.5-pro | google | 62.1 | 62.1 | 42.7 | 42.7 | 49.3 | 49.3 | 41.8 | ▸ |
group breakdown (score, rank/total): A_B 82.1 (4/24), A_I 89.1 (4/24), A_P 71.9 (3/24), A_R 74.2 (16/24), BUILD 35.9 (21/24), CRE 60.5 (7/24), GEN 29.6 (19/24), LM_ARENA_REVIEW_PROXY 0.0 (24/24), OPS_long 80.5 (11/24), OPS_precision 70.2 (17/24), OPS_review 77.0 (16/24), PLAN 28.9 (20/24)
metrics (score, rank/total): AI_code 90.6 (3/22), AI_complexity 87.6 (4/22), AI_context_awareness 7.5 (2/24), AI_correctness 92.5 (16/22), AI_edge_cases 60.1 (17/22), AI_efficiency 88.1 (3/22), AI_hallucination_resistance 10.9 (20/24), AI_memory_retention 92.5 (6/24), AI_parameter_accuracy 78.1 (16/24), AI_plan_coherence 92.5 (6/24), AI_recovery 92.5 (16/22), AI_refusal 92.5 (18/22), AI_spec 92.5 (18/22), AI_stability 92.5 (3/22), AI_task_completion 27.4 (17/24), AI_tool_selection 11.8 (17/24), ARC_AGI_2 3.7 (13/17), ArtificialAnalysisCoding 23.6 (17/21), ArtificialAnalysisIntelligence 14.1 (17/21), ArtificialAnalysisReasoning 44.8 (14/21), BlendedCost 80.1 (10/24), ContextWindow 100.0 (4/24), CopilotArenaOrLMArenaCode 0.9 (21/22), GDPval 7.5 (15/16), GPQA_HLE_Reasoning 44.8 (14/21), GSO 0.0 (15/15), IFBench 19.3 (18/21), LMArenaCreativeOrOpenEnded 60.5 (7/23), LMArenaSearchDocument 0.0 (19/19), LMArenaText 60.5 (7/23), LongContextRecall 67.2 (10/21), MCPAtlas 71.1 (5/13), OutputSpeed 92.6 (5/20), SWEBenchPro 75.7 (11/15), SWEBenchVerified 38.2 (17/18), SWEComposite 36.6 (22/24), SWERebench 1.8 (20/21), SciCode 36.1 (15/21), SonarBugDensity 52.7 (10/17), SonarComposite 54.2 (10/24), SonarFunctionalSkill 78.9 (6/17), SonarIssueDensity 13.2 (11/17), SonarVulnerabilityDensity 58.2 (9/17), TTFT 32.7 (18/20), Tau2Bench 3.5 (19/21), TerminalBench 1.8 (20/22)
| claude-sonnet-4 | anthropic | 33.6 | 33.6 | 35.1 | 35.1 | 48.5 | 48.5 | 50.8 | ▸ |
group breakdown (score, rank/total): A_B 40.4 (22/24), A_I 40.4 (22/24), A_P 44.7 (22/24), A_R 46.8 (22/24), BUILD 49.4 (16/24), CRE 27.8 (18/24), GEN 21.0 (20/24), LM_ARENA_REVIEW_PROXY 86.2 (6/24), OPS_long 80.9 (9/24), OPS_precision 79.6 (11/24), OPS_review 82.1 (9/24), PLAN 30.1 (19/24)
metrics (score, rank/total): AI_code 10.0 (17/22), AI_complexity 36.8 (12/22), AI_context_awareness 0.0 (10/24), AI_correctness 46.6 (20/22), AI_edge_cases 7.8 (21/22), AI_efficiency 67.0 (9/22), AI_hallucination_resistance 100.0 (4/24), AI_memory_retention 0.0 (15/24), AI_parameter_accuracy 89.7 (8/24), AI_plan_coherence 2.8 (18/24), AI_recovery 27.6 (20/22), AI_refusal 100.0 (5/22), AI_spec 100.0 (5/22), AI_stability 4.5 (21/22), AI_task_completion 100.0 (3/24), AI_tool_selection 89.2 (8/24), ARC_AGI_2 0.2 (16/17), ArtificialAnalysisCoding 30.7 (16/21), ArtificialAnalysisIntelligence 29.7 (15/21), ArtificialAnalysisReasoning 8.6 (19/21), BlendedCost 74.4 (16/24), ContextWindow 99.3 (9/24), CopilotArenaOrLMArenaCode 52.9 (18/22), GDPval 80.1 (5/16), GPQA_HLE_Reasoning 8.6 (19/21), GSO 6.0 (14/15), IFBench 35.8 (14/21), LMArenaCreativeOrOpenEnded 27.8 (17/23), LMArenaSearchDocument 86.2 (6/19), LMArenaText 27.8 (17/23), LiveCodeBench 0.0 (2/2), LongContextRecall 60.8 (12/21), MCPAtlas 13.1 (10/13), OutputSpeed 79.5 (13/20), SWEBenchPro 78.4 (9/15), SWEBenchVerified 69.9 (16/18), SWEComposite 66.6 (13/24), SWERebench 55.1 (15/21), SciCode 20.8 (17/21), SonarBugDensity 0.0 (17/17), SonarComposite 19.5 (22/24), SonarFunctionalSkill 26.4 (14/17), SonarIssueDensity 35.8 (7/17), SonarVulnerabilityDensity 0.0 (17/17), TTFT 74.2 (14/20), Tau2Bench 27.7 (18/21), TerminalBench 47.4 (13/22)
| grok-code-fast-1 | xai | 33.9 | 33.9 | 29.9 | 29.9 | 47.2 | 47.2 | 48.0 | ▸ |
group breakdown (score, rank/total): A_B 91.3 (2/24), A_I 92.4 (2/24), A_P 66.4 (6/24), A_R 96.7 (1/24), BUILD 28.3 (22/24), CRE 7.5 (22/24), GEN 5.6 (23/24), LM_ARENA_REVIEW_PROXY 50.0 (10/24), OPS_long 87.5 (6/24), OPS_precision 86.6 (7/24), OPS_review 86.5 (6/24), PLAN 13.3 (24/24)
metrics (score, rank/total): AI_code 87.8 (6/22), AI_complexity 94.6 (2/22), AI_context_awareness 0.0 (22/24), AI_correctness 100.0 (3/22), AI_edge_cases 100.0 (2/22), AI_efficiency 52.9 (19/22), AI_hallucination_resistance 100.0 (16/24), AI_memory_retention 98.9 (3/24), AI_parameter_accuracy 0.0 (22/24), AI_plan_coherence 100.0 (3/24), AI_recovery 100.0 (3/22), AI_refusal 100.0 (17/22), AI_spec 100.0 (17/22), AI_stability 79.2 (15/22), AI_task_completion 0.0 (22/24), AI_tool_selection 0.0 (22/24), ARC_AGI_2 25.1 (7/17), ArtificialAnalysisCoding 0.0 (21/21), ArtificialAnalysisIntelligence 0.0 (21/21), ArtificialAnalysisReasoning 0.0 (21/21), BlendedCost 99.3 (2/24), ContextWindow 78.4 (16/24), CopilotArenaOrLMArenaCode 0.0 (22/22), GPQA_HLE_Reasoning 0.0 (21/21), IFBench 0.0 (21/21), LMArenaCreativeOrOpenEnded 7.5 (21/23), LMArenaText 7.5 (21/23), LongContextRecall 0.0 (21/21), OutputSpeed 90.9 (6/20), SWEComposite 41.2 (21/24), SWERebench 27.9 (19/21), SciCode 0.0 (21/21), SonarComposite 50.0 (20/24), TTFT 79.1 (9/20), Tau2Bench 53.3 (13/21), TerminalBench 0.0 (22/22)
| gemini-2.5-flash | google | 29.7 | 29.7 | 33.5 | 33.5 | 46.2 | 46.2 | 52.1 | ▸ |
group breakdown (score, rank/total): A_B 94.9 (1/24), A_I 89.1 (3/24), A_P 76.3 (1/24), A_R 96.1 (2/24), BUILD 24.5 (23/24), CRE 0.0 (24/24), GEN 3.7 (24/24), LM_ARENA_REVIEW_PROXY 78.8 (7/24), OPS_long 94.4 (2/24), OPS_precision 90.0 (4/24), OPS_review 92.6 (2/24), PLAN 17.2 (23/24)
metrics (score, rank/total): AI_code 100.0 (1/22), AI_complexity 100.0 (1/22), AI_context_awareness 100.0 (1/24), AI_correctness 100.0 (1/22), AI_edge_cases 100.0 (1/22), AI_efficiency 100.0 (1/22), AI_hallucination_resistance 100.0 (8/24), AI_memory_retention 50.9 (9/24), AI_parameter_accuracy 100.0 (1/24), AI_plan_coherence 53.8 (9/24), AI_recovery 100.0 (1/22), AI_refusal 100.0 (9/22), AI_spec 100.0 (9/22), AI_stability 60.7 (20/22), AI_task_completion 29.7 (16/24), AI_tool_selection 62.0 (15/24), ARC_AGI_2 0.8 (15/17), ArtificialAnalysisCoding 0.0 (20/21), ArtificialAnalysisIntelligence 0.8 (19/21), ArtificialAnalysisReasoning 17.9 (16/21), BlendedCost 94.4 (5/24), ContextWindow 100.0 (3/24), CopilotArenaOrLMArenaCode 65.3 (13/22), GDPval 10.3 (13/16), GPQA_HLE_Reasoning 17.9 (16/21), GSO 19.4 (12/15), IFBench 29.2 (17/21), LMArenaCreativeOrOpenEnded 0.0 (23/23), LMArenaSearchDocument 78.8 (7/19), LMArenaText 0.0 (23/23), LiveCodeBench 100.0 (1/2), LongContextRecall 58.8 (13/21), MCPAtlas 26.6 (8/13), OutputSpeed 100.0 (1/20), SWEBenchPro 52.5 (13/15), SWEBenchVerified 0.0 (18/18), SWEComposite 20.4 (24/24), SWERebench 0.0 (21/21), SciCode 23.5 (16/21), SonarComposite 50.0 (16/24), TTFT 74.8 (13/20), Tau2Bench 0.0 (20/21), TerminalBench 0.3 (21/22)
| grok-4-latest | xai | 45.2 | 45.2 | 43.3 | 43.3 | 43.0 | 43.0 | 41.7 | ▸ |
group breakdown (score, rank/total): A_B 24.5 (24/24), A_I 31.0 (23/24), A_P 32.3 (24/24), A_R 35.0 (23/24), BUILD 45.7 (18/24), CRE 48.2 (13/24), GEN 42.3 (14/24), LM_ARENA_REVIEW_PROXY 19.2 (19/24), OPS_long 77.9 (14/24), OPS_precision 76.7 (13/24), OPS_review 77.0 (15/24), PLAN 43.8 (16/24)
metrics (score, rank/total): AI_code 0.0 (22/22), AI_complexity 0.0 (22/22), AI_context_awareness 0.0 (21/24), AI_correctness 0.0 (22/22), AI_edge_cases 51.7 (20/22), AI_efficiency 0.0 (22/22), AI_hallucination_resistance 100.0 (15/24), AI_memory_retention 98.9 (2/24), AI_parameter_accuracy 0.0 (21/24), AI_plan_coherence 100.0 (2/24), AI_recovery 6.9 (21/22), AI_refusal 0.0 (22/22), AI_spec 0.0 (22/22), AI_stability 100.0 (2/22), AI_task_completion 0.0 (21/24), AI_tool_selection 0.0 (21/24), ARC_AGI_2 20.7 (8/17), ArtificialAnalysisCoding 51.5 (10/21), ArtificialAnalysisIntelligence 40.3 (14/21), ArtificialAnalysisReasoning 57.0 (10/21), BlendedCost 74.4 (19/24), ContextWindow 78.4 (15/24), CopilotArenaOrLMArenaCode 58.0 (15/22), GPQA_HLE_Reasoning 57.0 (10/21), IFBench 33.1 (15/21), LMArenaCreativeOrOpenEnded 48.2 (12/23), LMArenaSearchDocument 19.2 (14/19), LMArenaText 48.2 (12/23), LongContextRecall 77.0 (8/21), OutputSpeed 79.4 (14/20), SWEComposite 45.6 (19/24), SWERebench 39.1 (17/21), SciCode 51.9 (10/21), SonarComposite 50.0 (19/24), TTFT 75.0 (12/20), Tau2Bench 51.5 (14/21), TerminalBench 11.8 (19/22)
| kimi-k2-0905 | moonshot | 36.0 | 36.0 | 27.2 | 27.2 | 38.8 | 38.8 | 36.5 | ▸ |
group breakdown (score, rank/total): A_B 30.5 (23/24), A_I 24.1 (24/24), A_P 33.0 (23/24), A_R 34.4 (24/24), BUILD 41.1 (20/24), CRE 50.0 (12/24), GEN 20.0 (21/24), LM_ARENA_REVIEW_PROXY 50.0 (9/24), OPS_long 35.0 (24/24), OPS_precision 57.6 (21/24), OPS_review 54.0 (22/24), PLAN 22.7 (21/24)
metrics (score, rank/total): AI_canary_health 88.9 (1/7), AI_code 23.6 (10/22), AI_complexity 36.8 (16/22), AI_context_awareness 0.0 (15/24), AI_correctness 5.7 (21/22), AI_edge_cases 0.0 (22/22), AI_efficiency 10.7 (21/22), AI_hallucination_resistance 100.0 (9/24), AI_memory_retention 0.0 (19/24), AI_parameter_accuracy 82.5 (14/24), AI_plan_coherence 5.4 (15/24), AI_recovery 0.0 (22/22), AI_refusal 100.0 (11/22), AI_spec 100.0 (11/22), AI_stability 0.0 (22/22), AI_task_completion 83.3 (12/24), AI_tool_selection 86.5 (9/24), ArtificialAnalysisCoding 4.2 (19/21), ArtificialAnalysisIntelligence 0.0 (20/21), ArtificialAnalysisReasoning 0.0 (20/21), BlendedCost 92.7 (7/24), ContextWindow 53.4 (23/24), GPQA_HLE_Reasoning 0.0 (20/21), IFBench 0.0 (20/21), LongContextRecall 0.0 (20/21), OutputSpeed 0.0 (20/20), SWEComposite 50.0 (18/24), SciCode 0.0 (20/21), SonarComposite 50.0 (17/24), TTFT 88.6 (6/20), Tau2Bench 48.0 (16/21)
| glm-4.6 | zai | 38.2 | 38.2 | 29.8 | 29.8 | 34.8 | 34.8 | 38.1 | ▸ |
group breakdown (score, rank/total): A_B 56.0 (20/24), A_I 55.0 (20/24), A_P 46.9 (20/24), A_R 58.5 (20/24), BUILD 21.9 (24/24), CRE 33.1 (16/24), GEN 16.3 (22/24), LM_ARENA_REVIEW_PROXY 50.0 (11/24), OPS_long 77.2 (16/24), OPS_precision 84.5 (8/24), OPS_review 82.3 (8/24), PLAN 18.1 (22/24)
metrics (score, rank/total): AI_context_awareness 0.0 (23/24), AI_hallucination_resistance 100.0 (17/24), AI_memory_retention 98.9 (4/24), AI_parameter_accuracy 0.0 (23/24), AI_plan_coherence 100.0 (4/24), AI_task_completion 0.0 (23/24), AI_tool_selection 0.0 (23/24), ArtificialAnalysisCoding 15.9 (18/21), ArtificialAnalysisIntelligence 6.1 (18/21), ArtificialAnalysisReasoning 16.5 (17/21), BlendedCost 95.4 (4/24), ContextWindow 75.0 (17/24), CopilotArenaOrLMArenaCode 44.4 (19/22), GPQA_HLE_Reasoning 16.5 (17/21), IFBench 4.7 (19/21), LMArenaCreativeOrOpenEnded 33.1 (15/23), LMArenaText 33.1 (15/23), LongContextRecall 9.8 (19/21), MCPAtlas 7.5 (11/13), OutputSpeed 67.3 (18/20), SWEBenchPro 0.0 (15/15), SWEBenchVerified 79.0 (15/18), SWEComposite 30.2 (23/24), SWERebench 38.4 (18/21), SciCode 12.0 (19/21), SonarBugDensity 7.5 (15/17), SonarComposite 10.7 (24/24), SonarFunctionalSkill 7.5 (16/17), SonarIssueDensity 7.5 (14/17), SonarVulnerabilityDensity 29.0 (12/17), TTFT 96.9 (4/20), Tau2Bench 41.3 (17/21), TerminalBench 13.9 (18/22)