how scoring works
Each model gets four role scores, each computed from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the direct LM Arena search/document review proxy, that lead is discounted from the vendor's Idea, Plan, and Build scores, so a vendor cannot benefit from gaming its own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted. For example, claude-opus-4.7 holds the top proxy score (100.0), so its Build score falls from 84.9 raw to 83.2 adjusted, its Plan score from 78.5 to 77.6, and its Idea score from 92.2 to 91.8.
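The adjustment can be sketched in a few lines. This is a minimal illustration, not the site's implementation: it assumes the penalty is simply the role coefficient multiplied by the proxy lead in points, and that the lead is the gap between the leading model's proxy score and the best proxy score from another vendor. The names `PENALTY_COEFFICIENTS`, `adjusted_score`, and `proxy_lead` are hypothetical.

```python
# Role coefficients as stated in the text; Review is never adjusted.
PENALTY_COEFFICIENTS = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def adjusted_score(raw: float, role: str, proxy_lead: float) -> float:
    """Subtract the reviewer-reservation penalty from a raw role score.

    Assumption: penalty = coefficient * proxy_lead, where proxy_lead is the
    model's lead (in points) on the LM Arena review proxy. A non-positive
    lead means no penalty. The result is clamped to the 0-100 range.
    """
    coefficient = PENALTY_COEFFICIENTS[role]
    penalty = coefficient * max(proxy_lead, 0.0)
    return max(0.0, min(100.0, raw - penalty))


# Assumed lead: 100.0 (claude-opus-4.7) minus 94.7 (kimi-k2.6, best other vendor).
proxy_lead = 100.0 - 94.7
build_adjusted = round(adjusted_score(84.9, "build", proxy_lead), 1)
idea_adjusted = round(adjusted_score(92.2, "idea", proxy_lead), 1)
```

Under these assumptions the sketch gives Build and Idea values that match the claude-opus-4.7 row; the Plan value can differ in the last digit because the published raw scores are already rounded.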
missing data
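If a model is missing some metrics within a group, the group score depends on group coverage, the share of the group's metrics that are present. Below 60% coverage the score is shrunk toward 50. Between 60% and 80% coverage it blends from the shrink-to-50 value toward the weighted mean of the present metrics. At 80% coverage and above, the weighted mean of the present metrics is trusted directly.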
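The coverage rule can be sketched as follows. This is only one plausible reading: it assumes coverage is the present share of the group's metric weight, that "shrink-to-50" means `coverage * mean + (1 - coverage) * 50`, and that the blend between 60% and 80% is linear. None of these details are confirmed by the text, and `group_score` is a hypothetical name.

```python
from typing import Mapping, Optional


def group_score(values: Mapping[str, Optional[float]],
                weights: Mapping[str, float]) -> float:
    """Score a metric group when some metrics may be missing (None).

    Assumptions (not confirmed by the source): coverage is the present
    share of total weight, shrinkage pulls the mean toward 50 in
    proportion to the missing share, and the 60-80% blend is linear.
    """
    present = {k: v for k, v in values.items() if v is not None}
    total_weight = sum(weights.values())
    present_weight = sum(weights[k] for k in present)
    if present_weight == 0:
        return 50.0  # nothing observed: fall back to the neutral midpoint

    coverage = present_weight / total_weight
    mean = sum(weights[k] * v for k, v in present.items()) / present_weight
    shrunk = coverage * mean + (1.0 - coverage) * 50.0

    # trust rises from 0 at 60% coverage to 1 at 80% coverage and above
    trust = min(1.0, max(0.0, (coverage - 0.60) / (0.80 - 0.60)))
    return trust * mean + (1.0 - trust) * shrunk


weights = {f"m{i}": 1.0 for i in range(10)}
seven_present = {f"m{i}": (80.0 if i < 7 else None) for i in range(10)}
score_at_70_percent = group_score(seven_present, weights)
```

With seven of ten equally weighted metrics present, the result lies between the shrunk value and the plain mean of the present metrics.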
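| model | vendor | Idea raw | Idea adj | Plan raw | Plan adj | Build raw | Build adj | Review | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude-opus-4.7 | anthropic | 92.2 | 91.8 | 78.5 | 77.6 | 84.9 | 83.2 | 84.0 | ▸ |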
group breakdown (score, rank): A_B 88.7 (2/24); A_I 92.2 (2/24); A_P 62.2 (9/24); A_R 82.9 (9/24); BUILD 86.5 (3/24); CRE 93.6 (4/24); GEN 96.5 (2/24); LM_ARENA_REVIEW_PROXY 100.0 (1/24); OPS_long 72.2 (19/24); OPS_precision 65.0 (20/24); OPS_review 70.9 (20/24); PLAN 78.8 (4/24). metrics (score, rank): AI_code 99.0 (2/22); AI_complexity 100.0 (1/22); AI_context_awareness 8.7 (7/24); AI_correctness 100.0 (4/22); AI_edge_cases 100.0 (4/22); AI_efficiency 99.7 (2/22); AI_hallucination_resistance 0.0 (23/24); AI_memory_retention 23.8 (9/24); AI_parameter_accuracy 37.3 (16/24); AI_plan_coherence 22.2 (10/24); AI_recovery 100.0 (4/22); AI_refusal 100.0 (4/22); AI_spec 100.0 (4/22); AI_stability 100.0 (1/22); AI_task_completion 84.2 (14/24); AI_tool_selection 0.0 (20/24); ARC_AGI_2 92.7 (3/22); ArtificialAnalysisCoding 90.6 (3/23); ArtificialAnalysisIntelligence 100.0 (1/23); ArtificialAnalysisReasoning 95.6 (3/23); BlendedCost 61.9 (22/24); ContextWindow 99.3 (8/24); CopilotArenaOrLMArenaCode 100.0 (1/24); GDPval 95.0 (1/24); GPQA_HLE_Reasoning 95.6 (3/23); GSO 100.0 (1/16); IFBench 45.1 (12/23); LMArenaCreativeOrOpenEnded 93.6 (4/24); LMArenaSearchDocument 100.0 (1/22); LMArenaText 93.6 (4/24); LongContextRecall 88.2 (7/23); OutputSpeed 78.0 (16/23); SWEBenchMultilingual 95.0 (3/19); SWEBenchPro 95.0 (2/21); SWEBenchVerified 95.0 (3/23); SWEComposite 91.1 (4/24); SWERebench 85.3 (6/23); SciCode 100.0 (1/23); SonarBugDensity 50.1 (19/22); SonarComposite 51.4 (18/24); SonarFunctionalSkill 93.9 (2/22); SonarIssueDensity 0.0 (22/22); SonarVulnerabilityDensity 25.3 (19/22); TTFT 41.1 (20/23); Tau2Bench 79.8 (10/23); TerminalBench 78.2 (4/24) | |||||||||
| claude-opus-4.6 | anthropic | 91.7 | 91.7 | 76.1 | 76.1 | 81.4 | 81.4 | 73.5 | ▸ |
group breakdown (score, rank): A_B 70.7 (6/24); A_I 81.9 (5/24); A_P 64.9 (4/24); A_R 78.6 (12/24); BUILD 86.8 (2/24); CRE 100.0 (1/24); GEN 90.0 (4/24); LM_ARENA_REVIEW_PROXY 33.6 (12/24); OPS_long 78.8 (17/24); OPS_precision 77.4 (14/24); OPS_review 79.6 (14/24); PLAN 73.1 (7/24). metrics (score, rank): AI_canary_health 83.3 (6/7); AI_code 57.7 (6/22); AI_complexity 50.1 (6/22); AI_context_awareness 8.4 (8/24); AI_correctness 100.0 (3/22); AI_edge_cases 100.0 (3/22); AI_efficiency 63.6 (7/22); AI_hallucination_resistance 0.0 (22/24); AI_memory_retention 0.0 (13/24); AI_parameter_accuracy 100.0 (2/24); AI_plan_coherence 0.0 (23/24); AI_recovery 100.0 (3/22); AI_refusal 100.0 (3/22); AI_spec 100.0 (3/22); AI_stability 98.5 (5/22); AI_task_completion 96.7 (3/24); AI_tool_selection 99.9 (2/24); ARC_AGI_2 90.9 (4/22); ArtificialAnalysisCoding 76.7 (5/23); ArtificialAnalysisIntelligence 85.3 (6/23); ArtificialAnalysisReasoning 86.3 (5/23); BlendedCost 61.9 (21/24); ContextWindow 99.3 (7/24); CopilotArenaOrLMArenaCode 99.8 (2/24); GDPval 82.2 (7/24); GPQA_HLE_Reasoning 86.3 (5/23); GSO 75.3 (3/16); IFBench 30.4 (18/23); LMArenaCreativeOrOpenEnded 100.0 (1/24); LMArenaSearchDocument 33.6 (10/22); LMArenaText 100.0 (1/24); LongContextRecall 90.2 (5/23); OutputSpeed 76.9 (18/23); SWEBenchMultilingual 90.9 (8/19); SWEBenchPro 100.0 (1/21); SWEBenchVerified 99.7 (2/23); SWEComposite 95.7 (1/24); SWERebench 91.6 (4/23); SciCode 85.8 (5/23); SonarBugDensity 59.5 (12/22); SonarComposite 70.5 (7/24); SonarFunctionalSkill 92.2 (4/22); SonarIssueDensity 46.8 (9/22); SonarVulnerabilityDensity 66.6 (11/22); TTFT 77.4 (14/23); Tau2Bench 87.6 (7/23); TerminalBench 64.2 (7/24) | |||||||||
| claude-opus-4.5 | anthropic | 77.7 | 77.7 | 71.0 | 71.0 | 81.0 | 81.0 | 68.5 | ▸ |
group breakdown (score, rank): A_B 88.0 (3/24); A_I 89.8 (4/24); A_P 74.0 (1/24); A_R 86.0 (3/24); BUILD 80.6 (7/24); CRE 73.5 (11/24); GEN 73.5 (6/24); LM_ARENA_REVIEW_PROXY 10.8 (21/24); OPS_long 76.5 (18/24); OPS_precision 73.5 (17/24); OPS_review 73.6 (18/24); PLAN 67.1 (9/24). metrics (score, rank): AI_canary_health 88.5 (3/7); AI_code 98.9 (3/22); AI_complexity 91.9 (4/22); AI_context_awareness 75.4 (2/24); AI_correctness 100.0 (2/22); AI_edge_cases 100.0 (2/22); AI_efficiency 96.5 (3/22); AI_hallucination_resistance 20.0 (19/24); AI_memory_retention 0.0 (12/24); AI_parameter_accuracy 100.0 (1/24); AI_plan_coherence 13.9 (18/24); AI_recovery 100.0 (2/22); AI_refusal 100.0 (2/22); AI_spec 100.0 (2/22); AI_stability 96.8 (9/22); AI_task_completion 98.8 (2/24); AI_tool_selection 74.7 (12/24); ARC_AGI_2 84.8 (5/22); ArtificialAnalysisCoding 75.8 (6/23); ArtificialAnalysisIntelligence 73.7 (9/23); ArtificialAnalysisReasoning 63.7 (10/23); BlendedCost 61.9 (20/24); ContextWindow 74.7 (21/24); CopilotArenaOrLMArenaCode 77.7 (8/24); GDPval 80.4 (9/24); GPQA_HLE_Reasoning 63.7 (10/23); GSO 59.3 (5/16); IFBench 43.5 (14/23); LMArenaCreativeOrOpenEnded 73.5 (11/24); LMArenaSearchDocument 10.8 (19/22); LMArenaText 73.5 (11/24); LongContextRecall 100.0 (1/23); OutputSpeed 80.7 (13/23); SWEBenchMultilingual 95.0 (2/19); SWEBenchPro 88.4 (11/21); SWEBenchVerified 92.2 (10/23); SWEComposite 84.9 (10/24); SWERebench 76.5 (9/23); SciCode 72.7 (7/23); SonarBugDensity 73.7 (8/22); SonarComposite 87.1 (1/24); SonarFunctionalSkill 100.0 (1/22); SonarIssueDensity 77.2 (5/22); SonarVulnerabilityDensity 87.2 (4/22); TTFT 73.3 (17/23); Tau2Bench 81.8 (9/23); TerminalBench 54.8 (12/24) | |||||||||
| gpt-5.5 | openai | 82.7 | 82.7 | 83.9 | 83.9 | 80.9 | 80.9 | 78.9 | ▸ |
group breakdown (score, rank): A_B 62.3 (13/24); A_I 72.4 (14/24); A_P 62.2 (8/24); A_R 82.8 (11/24); BUILD 87.4 (1/24); CRE 82.6 (7/24); GEN 94.4 (3/24); LM_ARENA_REVIEW_PROXY 27.5 (13/24); OPS_long 82.1 (10/24); OPS_precision 79.5 (13/24); OPS_review 81.0 (13/24); PLAN 90.6 (2/24). metrics (score, rank): AI_code 30.3 (15/22); AI_complexity 18.9 (15/22); AI_context_awareness 0.0 (20/24); AI_correctness 93.8 (13/22); AI_edge_cases 71.1 (18/22); AI_efficiency 48.4 (14/22); AI_hallucination_resistance 80.0 (11/24); AI_memory_retention 0.0 (24/24); AI_parameter_accuracy 94.1 (6/24); AI_plan_coherence 8.3 (20/24); AI_recovery 97.9 (15/22); AI_refusal 100.0 (13/22); AI_spec 100.0 (13/22); AI_stability 84.2 (12/22); AI_task_completion 92.6 (10/24); AI_tool_selection 100.0 (1/24); ARC_AGI_2 96.7 (2/22); ArtificialAnalysisCoding 100.0 (2/23); ArtificialAnalysisIntelligence 98.2 (3/23); ArtificialAnalysisReasoning 100.0 (2/23); BlendedCost 50.6 (23/24); ContextWindow 100.0 (2/24); CopilotArenaOrLMArenaCode 71.7 (10/24); GDPval 95.0 (2/24); GPQA_HLE_Reasoning 100.0 (2/23); GSO 94.0 (2/16); IFBench 78.1 (7/23); LMArenaCreativeOrOpenEnded 82.6 (7/24); LMArenaSearchDocument 27.5 (11/22); LMArenaText 82.6 (7/24); LongContextRecall 98.0 (3/23); MCPAtlas 72.8 (7/16); OutputSpeed 81.7 (12/23); SWEBenchPro 95.0 (6/21); SWEBenchVerified 95.0 (8/23); SWEComposite 89.9 (5/24); SWERebench 83.5 (8/23); SciCode 94.5 (4/23); SonarBugDensity 94.5 (2/22); SonarComposite 65.5 (8/24); SonarFunctionalSkill 46.5 (18/22); SonarIssueDensity 52.7 (7/22); SonarVulnerabilityDensity 99.2 (2/22); TTFT 85.4 (8/23); Tau2Bench 86.9 (8/23); TerminalBench 100.0 (1/24) | |||||||||
| kimi-k2.6 | moonshot | 75.8 | 75.8 | 76.7 | 76.7 | 78.7 | 78.7 | 85.3 | ▸ |
group breakdown (score, rank): A_B 63.3 (12/24); A_I 77.8 (11/24); A_P 61.6 (12/24); A_R 82.8 (10/24); BUILD 84.4 (4/24); CRE 78.2 (8/24); GEN 74.3 (5/24); LM_ARENA_REVIEW_PROXY 94.7 (2/24); OPS_long 60.4 (22/24); OPS_precision 73.9 (15/24); OPS_review 72.4 (19/24); PLAN 87.8 (3/24). metrics (score, rank): AI_code 33.5 (11/22); AI_complexity 18.9 (11/22); AI_context_awareness 0.0 (16/24); AI_correctness 100.0 (8/22); AI_edge_cases 100.0 (8/22); AI_efficiency 38.3 (16/22); AI_hallucination_resistance 40.0 (18/24); AI_memory_retention 0.0 (20/24); AI_parameter_accuracy 86.4 (9/24); AI_plan_coherence 13.9 (19/24); AI_recovery 100.0 (9/22); AI_refusal 100.0 (9/22); AI_spec 100.0 (9/22); AI_stability 96.9 (6/22); AI_task_completion 77.2 (17/24); AI_tool_selection 77.6 (11/24); ARC_AGI_2 11.9 (14/22); ArtificialAnalysisCoding 73.6 (7/23); ArtificialAnalysisIntelligence 88.4 (4/23); ArtificialAnalysisReasoning 87.6 (4/23); BlendedCost 89.1 (9/24); ContextWindow 78.8 (14/24); CopilotArenaOrLMArenaCode 94.6 (5/24); GDPval 68.5 (11/24); GPQA_HLE_Reasoning 87.6 (4/23); IFBench 91.5 (6/23); LMArenaCreativeOrOpenEnded 78.2 (8/24); LMArenaSearchDocument 94.7 (2/22); LMArenaText 78.2 (8/24); LongContextRecall 85.3 (8/23); MCPAtlas 92.5 (4/16); OutputSpeed 37.8 (22/23); SWEBenchMultilingual 95.0 (5/19); SWEBenchPro 95.0 (4/21); SWEBenchVerified 95.0 (6/23); SWEComposite 86.2 (9/24); SWERebench 73.1 (14/23); SciCode 94.5 (3/23); SonarBugDensity 92.5 (5/22); SonarComposite 80.6 (6/24); SonarFunctionalSkill 66.8 (17/22); SonarIssueDensity 92.5 (4/22); SonarVulnerabilityDensity 81.6 (9/22); TTFT 94.2 (5/23); Tau2Bench 96.0 (3/23); TerminalBench 74.6 (5/24) | |||||||||
| glm-5.1 | zai | 79.2 | 79.2 | 71.2 | 71.2 | 77.1 | 77.1 | 80.7 | ▸ |
group breakdown (score, rank): A_B 61.3 (14/24); A_I 73.7 (13/24); A_P 59.9 (17/24); A_R 77.9 (14/24); BUILD 81.9 (6/24); CRE 87.0 (5/24); GEN 67.4 (8/24); LM_ARENA_REVIEW_PROXY 88.0 (7/24); OPS_long 83.8 (9/24); OPS_precision 88.1 (6/24); OPS_review 85.8 (6/24); PLAN 76.9 (6/24). metrics (score, rank): AI_code 36.0 (9/22); AI_complexity 23.6 (8/22); AI_context_awareness 7.5 (10/24); AI_correctness 92.5 (14/22); AI_edge_cases 92.5 (14/22); AI_efficiency 40.1 (15/22); AI_hallucination_resistance 41.5 (16/24); AI_memory_retention 7.5 (10/24); AI_parameter_accuracy 81.0 (13/24); AI_plan_coherence 19.3 (15/24); AI_recovery 92.5 (19/22); AI_refusal 92.5 (15/22); AI_spec 92.5 (15/22); AI_stability 89.9 (10/22); AI_task_completion 73.1 (18/24); AI_tool_selection 73.5 (13/24); ARC_AGI_2 5.2 (16/22); ArtificialAnalysisCoding 61.9 (10/23); ArtificialAnalysisIntelligence 79.6 (8/23); ArtificialAnalysisReasoning 63.3 (11/23); BlendedCost 93.0 (6/24); ContextWindow 74.9 (19/24); CopilotArenaOrLMArenaCode 97.1 (3/24); GDPval 73.4 (10/24); GPQA_HLE_Reasoning 63.3 (11/23); IFBench 92.3 (5/23); LMArenaCreativeOrOpenEnded 87.0 (5/24); LMArenaSearchDocument 88.0 (7/22); LMArenaText 87.0 (5/24); LongContextRecall 49.0 (17/23); MCPAtlas 100.0 (1/16); OutputSpeed 79.1 (15/23); SWEBenchMultilingual 50.9 (10/19); SWEBenchPro 95.0 (7/21); SWEBenchVerified 91.9 (12/23); SWEComposite 92.1 (3/24); SWERebench 100.0 (1/23); SciCode 41.5 (16/23); SonarBugDensity 100.0 (1/22); SonarComposite 86.0 (2/24); SonarFunctionalSkill 69.8 (12/22); SonarIssueDensity 100.0 (1/22); SonarVulnerabilityDensity 87.2 (5/22); TTFT 98.8 (4/23); Tau2Bench 100.0 (2/23); TerminalBench 55.8 (11/24) | |||||||||
| claude-sonnet-4.6 | anthropic | 75.7 | 75.7 | 62.3 | 62.3 | 76.5 | 76.5 | 65.3 | ▸ |
group breakdown (score, rank): A_B 87.6 (4/24); A_I 91.3 (3/24); A_P 66.5 (3/24); A_R 83.0 (8/24); BUILD 76.9 (8/24); CRE 73.8 (10/24); GEN 66.3 (9/24); LM_ARENA_REVIEW_PROXY 23.2 (14/24); OPS_long 66.2 (21/24); OPS_precision 53.7 (23/24); OPS_review 63.6 (21/24); PLAN 58.8 (12/24). metrics (score, rank): AI_canary_health 88.2 (4/7); AI_code 100.0 (1/22); AI_complexity 96.0 (3/22); AI_context_awareness 7.8 (9/24); AI_correctness 100.0 (6/22); AI_edge_cases 100.0 (6/22); AI_efficiency 100.0 (1/22); AI_hallucination_resistance 0.0 (24/24); AI_memory_retention 0.0 (16/24); AI_parameter_accuracy 100.0 (4/24); AI_plan_coherence 16.7 (16/24); AI_recovery 100.0 (6/22); AI_refusal 100.0 (6/22); AI_spec 100.0 (6/22); AI_stability 100.0 (2/22); AI_task_completion 95.9 (4/24); AI_tool_selection 51.6 (15/24); ARC_AGI_2 10.6 (15/22); ArtificialAnalysisCoding 85.5 (4/23); ArtificialAnalysisIntelligence 80.7 (7/23); ArtificialAnalysisReasoning 68.7 (9/23); BlendedCost 74.4 (18/24); ContextWindow 99.3 (11/24); CopilotArenaOrLMArenaCode 95.0 (4/24); GDPval 86.1 (6/24); GPQA_HLE_Reasoning 68.7 (9/23); GSO 30.7 (11/16); IFBench 39.7 (16/23); LMArenaCreativeOrOpenEnded 73.8 (10/24); LMArenaSearchDocument 23.2 (12/22); LMArenaText 73.8 (10/24); LongContextRecall 90.2 (6/23); MCPAtlas 69.8 (10/16); OutputSpeed 79.7 (14/23); SWEBenchMultilingual 95.0 (4/19); SWEBenchPro 76.5 (16/21); SWEBenchVerified 90.3 (13/23); SWEComposite 88.1 (8/24); SWERebench 95.7 (3/23); SciCode 57.9 (10/23); SonarBugDensity 65.8 (10/22); SonarComposite 55.8 (12/24); SonarFunctionalSkill 84.5 (5/22); SonarIssueDensity 22.3 (13/22); SonarVulnerabilityDensity 21.8 (20/22); TTFT 0.0 (23/23); Tau2Bench 51.2 (14/23); TerminalBench 47.4 (15/24) | |||||||||
| gemini-3.1-pro-preview | | 89.5 | 89.5 | 84.9 | 84.9 | 73.9 | 73.9 | 83.3 | ▸ |
group breakdown (score, rank): A_B 49.8 (19/24); A_I 63.2 (18/24); A_P 60.4 (16/24); A_R 71.5 (18/24); BUILD 82.1 (5/24); CRE 100.0 (2/24); GEN 100.0 (1/24); LM_ARENA_REVIEW_PROXY 92.3 (4/24); OPS_long 79.8 (16/24); OPS_precision 68.2 (19/24); OPS_review 75.5 (16/24); PLAN 92.6 (1/24). metrics (score, rank): AI_code 22.3 (18/22); AI_complexity 13.7 (18/22); AI_context_awareness 28.3 (5/24); AI_correctness 50.6 (20/22); AI_edge_cases 89.0 (17/22); AI_efficiency 21.6 (20/22); AI_hallucination_resistance 92.5 (9/24); AI_memory_retention 29.7 (7/24); AI_parameter_accuracy 26.7 (19/24); AI_plan_coherence 92.5 (8/24); AI_recovery 92.5 (18/22); AI_refusal 47.1 (18/22); AI_spec 56.2 (19/22); AI_stability 83.8 (15/22); AI_task_completion 88.3 (13/24); AI_tool_selection 18.6 (18/24); ARC_AGI_2 100.0 (1/22); ArtificialAnalysisCoding 100.0 (1/23); ArtificialAnalysisIntelligence 100.0 (2/23); ArtificialAnalysisReasoning 100.0 (1/23); BlendedCost 77.3 (13/24); ContextWindow 100.0 (6/24); CopilotArenaOrLMArenaCode 73.5 (9/24); GDPval 50.2 (15/24); GPQA_HLE_Reasoning 100.0 (1/23); GSO 51.3 (9/16); IFBench 94.4 (4/23); LMArenaCreativeOrOpenEnded 100.0 (2/24); LMArenaSearchDocument 92.3 (4/22); LMArenaText 100.0 (2/24); LongContextRecall 100.0 (2/23); MCPAtlas 71.1 (9/16); OutputSpeed 93.7 (5/23); SWEBenchMultilingual 36.0 (12/19); SWEBenchPro 89.1 (10/21); SWEBenchVerified 95.0 (5/23); SWEComposite 88.9 (6/24); SWERebench 99.8 (2/23); SciCode 100.0 (2/23); SonarBugDensity 52.7 (17/22); SonarComposite 54.2 (17/24); SonarFunctionalSkill 78.9 (10/22); SonarIssueDensity 13.2 (17/22); SonarVulnerabilityDensity 58.2 (16/22); TTFT 27.7 (22/23); Tau2Bench 95.3 (5/23); TerminalBench 89.4 (3/24) | |||||||||
| deepseek-v4-flash | deepseek | 65.3 | 65.3 | 69.9 | 69.9 | 73.6 | 73.6 | 80.8 | ▸ |
group breakdown (score, rank): A_B 65.9 (8/24); A_I 78.0 (10/24); A_P 63.3 (7/24); A_R 84.8 (7/24); BUILD 73.8 (11/24); CRE 58.8 (16/24); GEN 56.5 (13/24); LM_ARENA_REVIEW_PROXY 88.0 (5/24); OPS_long 88.0 (5/24); OPS_precision 91.5 (3/24); OPS_review 88.6 (5/24); PLAN 78.4 (5/24). metrics (score, rank): AI_canary_health 83.4 (5/7); AI_code 33.5 (10/22); AI_complexity 18.9 (9/22); AI_context_awareness 0.0 (14/24); AI_correctness 100.0 (7/22); AI_edge_cases 100.0 (7/22); AI_efficiency 65.8 (5/22); AI_hallucination_resistance 60.0 (12/24); AI_memory_retention 0.0 (17/24); AI_parameter_accuracy 86.3 (11/24); AI_plan_coherence 16.7 (17/24); AI_recovery 100.0 (7/22); AI_refusal 100.0 (7/22); AI_spec 100.0 (7/22); AI_stability 82.2 (16/22); AI_task_completion 77.2 (16/24); AI_tool_selection 98.4 (5/24); ARC_AGI_2 11.9 (12/22); ArtificialAnalysisCoding 47.2 (13/23); ArtificialAnalysisIntelligence 62.5 (12/23); ArtificialAnalysisReasoning 76.7 (8/23); BlendedCost 100.0 (1/24); ContextWindow 71.6 (22/24); CopilotArenaOrLMArenaCode 87.9 (6/24); GDPval 67.4 (13/24); GPQA_HLE_Reasoning 76.7 (8/23); IFBench 100.0 (1/23); LMArenaCreativeOrOpenEnded 58.8 (16/24); LMArenaSearchDocument 88.0 (5/22); LMArenaText 58.8 (16/24); LongContextRecall 52.5 (16/23); OutputSpeed 85.9 (11/23); SWEBenchMultilingual 58.6 (9/19); SWEBenchPro 95.0 (3/21); SWEBenchVerified 95.0 (4/23); SWEComposite 82.6 (11/24); SWERebench 73.1 (12/23); SciCode 47.5 (13/23); SonarBugDensity 92.5 (3/22); SonarComposite 80.6 (4/24); SonarFunctionalSkill 66.8 (15/22); SonarIssueDensity 92.5 (2/22); SonarVulnerabilityDensity 81.6 (7/22); TTFT 99.9 (2/23); Tau2Bench 94.0 (6/23); TerminalBench 60.9 (9/24) | |||||||||
| gpt-5.3-codex | openai | 70.1 | 70.1 | 51.6 | 51.6 | 71.8 | 71.8 | 71.5 | ▸ |
group breakdown (score, rank): A_B 66.3 (7/24); A_I 78.9 (7/24); A_P 59.2 (18/24); A_R 85.9 (4/24); BUILD 75.3 (9/24); CRE 72.6 (12/24); GEN 49.7 (15/24); LM_ARENA_REVIEW_PROXY 92.5 (3/24); OPS_long 85.6 (8/24); OPS_precision 82.6 (10/24); OPS_review 83.2 (10/24); PLAN 42.2 (17/24). metrics (score, rank): AI_code 30.3 (13/22); AI_complexity 18.9 (13/22); AI_context_awareness 0.0 (18/24); AI_correctness 100.0 (10/22); AI_edge_cases 100.0 (10/22); AI_efficiency 65.4 (6/22); AI_hallucination_resistance 60.0 (14/24); AI_memory_retention 0.0 (22/24); AI_parameter_accuracy 86.4 (10/24); AI_plan_coherence 2.8 (22/24); AI_recovery 100.0 (11/22); AI_refusal 100.0 (11/22); AI_spec 100.0 (11/22); AI_stability 96.9 (7/22); AI_task_completion 61.8 (19/24); AI_tool_selection 79.0 (10/24); ARC_AGI_2 71.9 (8/22); ArtificialAnalysisCoding 44.4 (15/23); ArtificialAnalysisIntelligence 34.3 (17/23); ArtificialAnalysisReasoning 35.3 (16/23); BlendedCost 76.6 (14/24); ContextWindow 85.3 (13/24); CopilotArenaOrLMArenaCode 60.1 (17/24); GDPval 68.0 (12/24); GPQA_HLE_Reasoning 35.3 (16/23); GSO 53.4 (8/16); IFBench 59.9 (11/23); LMArenaCreativeOrOpenEnded 72.6 (12/24); LMArenaSearchDocument 92.5 (3/22); LMArenaText 72.6 (12/24); LongContextRecall 45.0 (19/23); OutputSpeed 90.0 (8/23); SWEBenchPro 95.0 (5/21); SWEBenchVerified 92.5 (9/23); SWEComposite 92.1 (2/24); SWERebench 89.5 (5/23); SciCode 44.7 (15/23); SonarBugDensity 80.8 (7/22); SonarComposite 60.9 (9/24); SonarFunctionalSkill 72.3 (11/22); SonarIssueDensity 7.5 (18/22); SonarVulnerabilityDensity 92.5 (3/22); TTFT 78.4 (13/23); Tau2Bench 7.5 (20/23); TerminalBench 74.3 (6/24) | |||||||||
| gpt-5.4 | openai | 71.8 | 71.8 | 52.8 | 52.8 | 71.4 | 71.4 | 62.2 | ▸ |
group breakdown (score, rank): A_B 65.4 (10/24); A_I 78.2 (8/24); A_P 63.6 (6/24); A_R 85.9 (5/24); BUILD 74.2 (10/24); CRE 76.6 (9/24); GEN 47.2 (16/24); LM_ARENA_REVIEW_PROXY 17.1 (19/24); OPS_long 93.0 (3/24); OPS_precision 89.3 (5/24); OPS_review 90.7 (3/24); PLAN 43.1 (16/24). metrics (score, rank): AI_code 30.3 (14/22); AI_complexity 18.9 (14/22); AI_context_awareness 0.0 (19/24); AI_correctness 100.0 (11/22); AI_edge_cases 100.0 (11/22); AI_efficiency 53.6 (12/22); AI_hallucination_resistance 60.0 (15/24); AI_memory_retention 0.0 (23/24); AI_parameter_accuracy 87.3 (8/24); AI_plan_coherence 5.6 (21/24); AI_recovery 100.0 (12/22); AI_refusal 100.0 (12/22); AI_spec 100.0 (12/22); AI_stability 96.9 (8/22); AI_task_completion 92.6 (9/24); AI_tool_selection 99.8 (4/24); ARC_AGI_2 75.8 (7/22); ArtificialAnalysisCoding 35.5 (17/23); ArtificialAnalysisIntelligence 33.0 (18/23); ArtificialAnalysisReasoning 15.5 (19/23); BlendedCost 75.0 (15/24); ContextWindow 100.0 (1/24); CopilotArenaOrLMArenaCode 68.9 (13/24); GDPval 87.9 (4/24); GPQA_HLE_Reasoning 15.5 (19/23); GSO 54.0 (7/16); IFBench 60.5 (10/23); LMArenaCreativeOrOpenEnded 76.6 (9/24); LMArenaSearchDocument 17.1 (17/22); LMArenaText 76.6 (9/24); LongContextRecall 24.5 (20/23); MCPAtlas 72.8 (6/16); OutputSpeed 96.6 (4/23); SWEBenchPro 92.5 (9/21); SWEBenchVerified 95.0 (7/23); SWEComposite 88.9 (7/24); SWERebench 83.5 (7/23); SciCode 12.0 (20/23); SonarBugDensity 84.7 (6/22); SonarComposite 60.4 (10/24); SonarFunctionalSkill 66.8 (14/22); SonarIssueDensity 6.8 (20/22); SonarVulnerabilityDensity 100.0 (1/22); TTFT 86.8 (7/23); Tau2Bench 0.0 (23/23); TerminalBench 100.0 (2/24) | |||||||||
| claude-opus-4.1 | anthropic | 63.8 | 63.8 | 63.2 | 63.2 | 71.0 | 71.0 | 62.2 | ▸ |
group breakdown (score, rank): A_B 75.1 (5/24); A_I 80.9 (6/24); A_P 61.9 (11/24); A_R 85.0 (6/24); BUILD 71.9 (12/24); CRE 53.1 (17/24); GEN 66.3 (10/24); LM_ARENA_REVIEW_PROXY 0.1 (23/24); OPS_long 67.0 (20/24); OPS_precision 58.5 (22/24); OPS_review 59.0 (22/24); PLAN 63.0 (11/24). metrics (score, rank): AI_canary_health 68.1 (7/7); AI_code 70.2 (5/22); AI_complexity 54.4 (5/22); AI_context_awareness 0.0 (11/24); AI_correctness 100.0 (1/22); AI_edge_cases 100.0 (1/22); AI_efficiency 55.5 (11/22); AI_hallucination_resistance 40.0 (17/24); AI_memory_retention 0.0 (11/24); AI_parameter_accuracy 63.7 (15/24); AI_plan_coherence 19.4 (13/24); AI_recovery 100.0 (1/22); AI_refusal 100.0 (1/22); AI_spec 100.0 (1/22); AI_stability 81.7 (18/22); AI_task_completion 77.2 (15/24); AI_tool_selection 90.1 (8/24); ARC_AGI_2 82.8 (6/22); ArtificialAnalysisCoding 71.9 (8/23); ArtificialAnalysisIntelligence 70.1 (10/23); ArtificialAnalysisReasoning 61.7 (12/23); BlendedCost 0.0 (24/24); ContextWindow 74.7 (20/24); CopilotArenaOrLMArenaCode 53.8 (19/24); GDPval 80.4 (8/24); GPQA_HLE_Reasoning 61.7 (12/23); GSO 57.9 (6/16); IFBench 44.4 (13/23); LMArenaCreativeOrOpenEnded 53.1 (17/24); LMArenaSearchDocument 0.1 (21/22); LMArenaText 53.1 (17/24); LongContextRecall 92.5 (4/23); MCPAtlas 92.5 (2/16); OutputSpeed 76.1 (20/23); SWEBenchMultilingual 92.5 (6/19); SWEBenchPro 82.6 (12/21); SWEBenchVerified 92.0 (11/23); SWEComposite 72.9 (14/24); SWERebench 52.3 (18/23); SciCode 69.3 (8/23); SonarBugDensity 70.1 (9/22); SonarComposite 81.5 (3/24); SonarFunctionalSkill 92.5 (3/22); SonarIssueDensity 73.1 (6/22); SonarVulnerabilityDensity 81.6 (6/22); TTFT 69.8 (18/23); Tau2Bench 77.0 (11/23); TerminalBench 29.4 (18/24) | |||||||||
| gemini-3-flash | | 76.3 | 76.3 | 65.3 | 65.3 | 60.8 | 60.8 | 61.6 | ▸ |
group breakdown (score, rank): A_B 49.8 (18/24); A_I 63.2 (17/24); A_P 60.4 (15/24); A_R 71.5 (17/24); BUILD 60.7 (14/24); CRE 86.3 (6/24); GEN 63.0 (11/24); LM_ARENA_REVIEW_PROXY 19.2 (17/24); OPS_long 95.1 (1/24); OPS_precision 91.6 (2/24); OPS_review 93.4 (1/24); PLAN 64.5 (10/24). metrics (score, rank): AI_code 22.3 (17/22); AI_complexity 13.7 (17/22); AI_context_awareness 28.3 (4/24); AI_correctness 50.6 (19/22); AI_edge_cases 89.0 (16/22); AI_efficiency 21.6 (19/22); AI_hallucination_resistance 92.5 (8/24); AI_memory_retention 29.7 (6/24); AI_parameter_accuracy 26.7 (18/24); AI_plan_coherence 92.5 (7/24); AI_recovery 92.5 (17/22); AI_refusal 47.1 (17/22); AI_spec 56.2 (18/22); AI_stability 83.8 (14/22); AI_task_completion 88.3 (12/24); AI_tool_selection 18.6 (17/24); ARC_AGI_2 3.1 (19/22); ArtificialAnalysisCoding 59.4 (11/23); ArtificialAnalysisIntelligence 62.1 (13/23); ArtificialAnalysisReasoning 82.7 (7/23); BlendedCost 91.5 (8/24); ContextWindow 100.0 (5/24); CopilotArenaOrLMArenaCode 68.8 (14/24); GDPval 39.1 (17/24); GPQA_HLE_Reasoning 82.7 (7/23); GSO 14.0 (14/16); IFBench 96.8 (3/23); LMArenaCreativeOrOpenEnded 86.3 (6/24); LMArenaSearchDocument 19.2 (15/22); LMArenaText 86.3 (6/24); LongContextRecall 68.6 (9/23); MCPAtlas 22.4 (12/16); OutputSpeed 99.4 (2/23); SWEBenchMultilingual 100.0 (1/19); SWEBenchPro 53.0 (18/21); SWEBenchVerified 100.0 (1/23); SWEComposite 74.1 (12/24); SWERebench 76.3 (10/23); SciCode 78.7 (6/23); SonarBugDensity 52.7 (16/22); SonarComposite 54.2 (16/24); SonarFunctionalSkill 78.9 (9/22); SonarIssueDensity 13.2 (16/22); SonarVulnerabilityDensity 58.2 (15/22); TTFT 81.4 (9/23); Tau2Bench 61.6 (12/23); TerminalBench 48.3 (13/24) | |||||||||
| gemini-3-pro | | 76.1 | 76.1 | 57.6 | 57.6 | 60.2 | 60.2 | 57.5 | ▸ |
group breakdown (score, rank): A_B 49.8 (20/24); A_I 65.5 (15/24); A_P 62.2 (10/24); A_R 75.3 (15/24); BUILD 66.3 (13/24); CRE 94.8 (3/24); GEN 60.0 (12/24); LM_ARENA_REVIEW_PROXY 19.9 (16/24); OPS_long 45.2 (23/24); OPS_precision 48.0 (24/24); OPS_review 43.0 (24/24); PLAN 55.2 (14/24). metrics (score, rank): AI_code 17.4 (19/22); AI_complexity 7.3 (20/22); AI_context_awareness 24.5 (6/24); AI_correctness 50.7 (17/22); AI_edge_cases 95.9 (13/22); AI_efficiency 16.6 (21/22); AI_hallucination_resistance 100.0 (1/24); AI_memory_retention 26.2 (8/24); AI_parameter_accuracy 22.6 (20/24); AI_plan_coherence 100.0 (1/24); AI_recovery 100.0 (8/22); AI_refusal 46.6 (19/22); AI_spec 57.3 (16/22); AI_stability 89.8 (11/22); AI_task_completion 95.1 (5/24); AI_tool_selection 13.1 (19/24); ARC_AGI_2 41.9 (9/22); BlendedCost 77.3 (12/24); ContextWindow 0.0 (24/24); CopilotArenaOrLMArenaCode 69.2 (12/24); GDPval 37.2 (19/24); GSO 40.7 (10/16); LMArenaCreativeOrOpenEnded 94.8 (3/24); LMArenaSearchDocument 19.9 (14/22); LMArenaText 94.8 (3/24); MCPAtlas 74.9 (5/16); SWEBenchMultilingual 33.5 (13/19); SWEBenchPro 80.3 (14/21); SWEBenchVerified 82.9 (16/23); SWEComposite 72.1 (15/24); SWERebench 70.6 (16/23); SonarBugDensity 53.2 (13/22); SonarComposite 54.9 (13/24); SonarFunctionalSkill 84.1 (6/22); SonarIssueDensity 6.7 (21/22); SonarVulnerabilityDensity 59.7 (12/22); TerminalBench 61.2 (8/24) | |||||||||
| gpt-5.2 | openai | 67.9 | 67.9 | 58.5 | 58.5 | 57.8 | 57.8 | 60.0 | ▸ |
group breakdown (score, rank): A_B 65.8 (9/24); A_I 75.3 (12/24); A_P 60.9 (13/24); A_R 87.9 (2/24); BUILD 51.7 (17/24); CRE 67.8 (13/24); GEN 53.5 (14/24); LM_ARENA_REVIEW_PROXY 20.8 (15/24); OPS_long 86.0 (6/24); OPS_precision 83.2 (9/24); OPS_review 83.9 (9/24); PLAN 55.5 (13/24). metrics (score, rank): AI_code 30.3 (12/22); AI_complexity 18.9 (12/22); AI_context_awareness 0.0 (17/24); AI_correctness 100.0 (9/22); AI_edge_cases 100.0 (9/22); AI_efficiency 53.1 (13/22); AI_hallucination_resistance 80.0 (10/24); AI_memory_retention 0.0 (21/24); AI_parameter_accuracy 89.0 (7/24); AI_plan_coherence 0.0 (24/24); AI_recovery 100.0 (10/22); AI_refusal 100.0 (10/22); AI_spec 100.0 (10/22); AI_stability 82.2 (17/22); AI_task_completion 92.6 (8/24); AI_tool_selection 90.1 (7/24); ARC_AGI_2 0.0 (22/22); ArtificialAnalysisCoding 64.5 (9/23); ArtificialAnalysisIntelligence 62.8 (11/23); ArtificialAnalysisReasoning 56.4 (13/23); BlendedCost 80.1 (11/24); ContextWindow 85.3 (12/24); CopilotArenaOrLMArenaCode 39.2 (22/24); GDPval 66.3 (14/24); GPQA_HLE_Reasoning 56.4 (13/23); GSO 64.7 (4/16); IFBench 62.7 (9/23); LMArenaCreativeOrOpenEnded 67.8 (13/24); LMArenaSearchDocument 20.8 (13/22); LMArenaText 67.8 (13/24); LongContextRecall 53.9 (15/23); OutputSpeed 90.0 (7/23); SWEBenchMultilingual 0.0 (19/19); SWEBenchPro 38.2 (20/21); SWEBenchVerified 81.3 (18/23); SWEComposite 45.6 (21/24); SciCode 54.6 (11/23); SonarBugDensity 64.2 (11/22); SonarComposite 59.7 (11/24); SonarFunctionalSkill 67.2 (13/22); SonarIssueDensity 35.7 (11/22); SonarVulnerabilityDensity 73.4 (10/22); TTFT 78.4 (12/23); Tau2Bench 48.1 (16/23); TerminalBench 58.2 (10/24) | |||||||||
| claude-sonnet-4 | anthropic | 28.8 | 28.8 | 38.5 | 38.5 | 53.0 | 53.0 | 57.7 | ▸ |
group breakdown (score, rank): A_B 64.7 (11/24); A_I 78.2 (9/24); A_P 64.6 (5/24); A_R 78.4 (13/24); BUILD 47.3 (18/24); CRE 0.0 (23/24); GEN 16.3 (21/24); LM_ARENA_REVIEW_PROXY 86.2 (8/24); OPS_long 80.1 (15/24); OPS_precision 79.8 (12/24); OPS_review 82.0 (12/24); PLAN 29.7 (20/24). metrics (score, rank): AI_code 41.6 (7/22); AI_complexity 27.8 (7/22); AI_context_awareness 0.0 (12/24); AI_correctness 100.0 (5/22); AI_edge_cases 100.0 (5/22); AI_efficiency 61.1 (8/22); AI_hallucination_resistance 20.0 (20/24); AI_memory_retention 0.0 (14/24); AI_parameter_accuracy 81.0 (12/24); AI_plan_coherence 19.4 (14/24); AI_recovery 100.0 (5/22); AI_refusal 100.0 (5/22); AI_spec 100.0 (5/22); AI_stability 78.8 (19/22); AI_task_completion 92.6 (6/24); AI_tool_selection 97.0 (6/24); ARC_AGI_2 0.2 (21/22); ArtificialAnalysisCoding 32.7 (18/23); ArtificialAnalysisIntelligence 35.1 (16/23); ArtificialAnalysisReasoning 8.6 (21/23); BlendedCost 74.4 (16/24); ContextWindow 99.3 (9/24); CopilotArenaOrLMArenaCode 53.5 (20/24); GDPval 86.1 (5/24); GPQA_HLE_Reasoning 8.6 (21/23); GSO 6.0 (15/16); IFBench 34.7 (17/23); LMArenaCreativeOrOpenEnded 0.0 (23/24); LMArenaSearchDocument 86.2 (8/22); LMArenaText 0.0 (23/24); LiveCodeBench 0.0 (2/2); LongContextRecall 60.8 (12/23); MCPAtlas 13.1 (13/16); OutputSpeed 77.1 (17/23); SWEBenchMultilingual 10.8 (14/19); SWEBenchPro 78.4 (15/21); SWEBenchVerified 69.9 (21/23); SWEComposite 61.0 (17/24); SWERebench 55.1 (17/23); SciCode 20.8 (18/23); SonarBugDensity 0.0 (22/22); SonarComposite 19.5 (22/24); SonarFunctionalSkill 26.4 (19/22); SonarIssueDensity 35.8 (10/22); SonarVulnerabilityDensity 0.0 (22/22); TTFT 76.8 (15/23); Tau2Bench 26.6 (19/23); TerminalBench 47.4 (14/24) | |||||||||
| glm-4.7 | zai | 33.3 | 33.3 | 51.2 | 51.2 | 52.2 | 52.2 | 55.1 | ▸ |
group breakdown (score, rank): A_B 56.0 (16/24); A_I 55.0 (20/24); A_P 47.0 (21/24); A_R 58.5 (20/24); BUILD 45.3 (19/24); CRE 10.0 (22/24); GEN 38.0 (18/24); LM_ARENA_REVIEW_PROXY 50.0 (11/24); OPS_long 89.4 (4/24); OPS_precision 92.0 (1/24); OPS_review 89.4 (4/24); PLAN 54.3 (15/24). metrics (score, rank): AI_context_awareness 0.0 (24/24); AI_hallucination_resistance 100.0 (5/24); AI_memory_retention 100.0 (4/24); AI_parameter_accuracy 0.0 (24/24); AI_plan_coherence 100.0 (5/24); AI_task_completion 0.0 (24/24); AI_tool_selection 0.0 (24/24); ArtificialAnalysisCoding 39.6 (16/23); ArtificialAnalysisIntelligence 47.0 (15/23); ArtificialAnalysisReasoning 55.8 (14/23); BlendedCost 96.1 (3/24); ContextWindow 74.9 (18/24); CopilotArenaOrLMArenaCode 69.7 (11/24); GDPval 36.8 (20/24); GPQA_HLE_Reasoning 55.8 (14/23); IFBench 69.9 (8/23); LMArenaCreativeOrOpenEnded 10.0 (22/24); LMArenaText 10.0 (22/24); LongContextRecall 57.4 (14/23); MCPAtlas 0.0 (16/16); OutputSpeed 88.4 (9/23); SWEBenchMultilingual 5.0 (17/19); SWEBenchVerified 90.2 (14/23); SWEComposite 60.7 (18/24); SWERebench 70.9 (15/23); SciCode 48.6 (12/23); SonarBugDensity 51.6 (18/22); SonarComposite 27.3 (21/24); SonarFunctionalSkill 0.0 (22/22); SonarIssueDensity 50.8 (8/22); SonarVulnerabilityDensity 28.7 (18/22); TTFT 100.0 (1/23); Tau2Bench 96.0 (4/23); TerminalBench 27.1 (19/24) | |||||||||
| kimi-k2-0905 | moonshot | 24.7 | 24.7 | 29.4 | 29.4 | 51.4 | 51.4 | 47.3 | ▸ |
group breakdown (score, rank): A_B 32.3 (23/24); A_I 27.0 (23/24); A_P 37.9 (23/24); A_R 28.4 (23/24); BUILD 59.9 (15/24); CRE 27.6 (20/24); GEN 11.9 (24/24); LM_ARENA_REVIEW_PROXY 88.0 (6/24); OPS_long 35.6 (24/24); OPS_precision 58.7 (21/24); OPS_review 54.8 (23/24); PLAN 30.2 (19/24). metrics (score, rank): AI_canary_health 88.9 (2/7); AI_code 39.9 (8/22); AI_complexity 18.9 (10/22); AI_context_awareness 0.0 (15/24); AI_correctness 1.3 (21/22); AI_edge_cases 0.0 (22/22); AI_efficiency 58.3 (9/22); AI_hallucination_resistance 60.0 (13/24); AI_memory_retention 0.0 (19/24); AI_parameter_accuracy 79.3 (14/24); AI_plan_coherence 22.2 (12/24); AI_recovery 0.0 (22/22); AI_refusal 100.0 (8/22); AI_spec 100.0 (8/22); AI_stability 0.0 (22/22); AI_task_completion 92.6 (7/24); AI_tool_selection 83.2 (9/24); ARC_AGI_2 11.9 (13/22); ArtificialAnalysisCoding 6.9 (21/23); ArtificialAnalysisIntelligence 7.7 (21/23); ArtificialAnalysisReasoning 0.0 (22/23); BlendedCost 92.7 (7/24); ContextWindow 53.4 (23/24); CopilotArenaOrLMArenaCode 87.9 (7/24); GDPval 5.0 (23/24); GPQA_HLE_Reasoning 0.0 (22/23); IFBench 0.0 (22/23); LMArenaCreativeOrOpenEnded 27.6 (20/24); LMArenaSearchDocument 88.0 (6/22); LMArenaText 27.6 (20/24); LongContextRecall 0.0 (22/23); MCPAtlas 92.5 (3/16); OutputSpeed 0.0 (23/23); SWEBenchMultilingual 5.0 (15/19); SWEBenchPro 92.5 (8/21); SWEBenchVerified 78.6 (20/23); SWEComposite 73.9 (13/24); SWERebench 73.1 (13/23); SciCode 0.0 (22/23); SonarBugDensity 92.5 (4/22); SonarComposite 80.6 (5/24); SonarFunctionalSkill 66.8 (16/22); SonarIssueDensity 92.5 (3/22); SonarVulnerabilityDensity 81.6 (8/22); TTFT 91.8 (6/23); Tau2Bench 46.1 (17/23); TerminalBench 44.6 (16/24) | |||||||||
| grok-code-fast-1 | xai | 55.7 | 55.7 | 32.8 | 32.8 | 48.6 | 48.6 | 44.2 | ▸ |
group breakdown (score, rank): A_B 94.5 (1/24); A_I 96.6 (1/24); A_P 68.3 (2/24); A_R 99.1 (1/24); BUILD 29.5 (22/24); CRE 48.2 (18/24); GEN 15.8 (22/24); LM_ARENA_REVIEW_PROXY 15.7 (20/24); OPS_long 85.8 (7/24); OPS_precision 85.7 (8/24); OPS_review 85.6 (7/24); PLAN 12.8 (24/24). metrics (score, rank): AI_code 90.7 (4/22); AI_complexity 99.6 (2/22); AI_context_awareness 0.0 (22/24); AI_correctness 100.0 (12/22); AI_edge_cases 100.0 (12/22); AI_efficiency 57.8 (10/22); AI_hallucination_resistance 100.0 (3/24); AI_memory_retention 100.0 (2/24); AI_parameter_accuracy 0.0 (22/24); AI_plan_coherence 100.0 (3/24); AI_recovery 100.0 (14/22); AI_refusal 100.0 (14/22); AI_spec 100.0 (14/22); AI_stability 100.0 (4/22); AI_task_completion 0.0 (22/24); AI_tool_selection 0.0 (22/24); ARC_AGI_2 25.1 (10/22); ArtificialAnalysisCoding 0.0 (23/23); ArtificialAnalysisIntelligence 0.0 (23/23); ArtificialAnalysisReasoning 0.0 (23/23); BlendedCost 99.3 (2/24); ContextWindow 78.4 (16/24); CopilotArenaOrLMArenaCode 0.0 (24/24); GDPval 5.0 (24/24); GPQA_HLE_Reasoning 0.0 (23/23); IFBench 0.0 (23/23); LMArenaCreativeOrOpenEnded 48.2 (18/24); LMArenaSearchDocument 15.7 (18/22); LMArenaText 48.2 (18/24); LongContextRecall 0.0 (23/23); OutputSpeed 87.8 (10/23); SWEBenchVerified 82.7 (17/23); SWEComposite 46.1 (19/24); SWERebench 27.9 (21/23); SciCode 0.0 (23/23); SonarComposite 50.0 (20/24); TTFT 79.4 (10/23); Tau2Bench 51.2 (15/23); TerminalBench 0.0 (24/24) | |||||||||
| grok-4-latest | xai | 59.3 | 59.3 | 63.8 | 63.8 | 45.7 | 45.7 | 53.8 | ▸ |
group breakdown (score, rank): A_B 36.4 (22/24); A_I 43.6 (22/24); A_P 43.2 (22/24); A_R 57.2 (21/24); BUILD 43.8 (20/24); CRE 58.9 (15/24); GEN 69.0 (7/24); LM_ARENA_REVIEW_PROXY 18.4 (18/24); OPS_long 81.4 (12/24); OPS_precision 70.1 (18/24); OPS_review 74.0 (17/24); PLAN 71.2 (8/24). metrics (score, rank): AI_code 8.1 (20/22); AI_complexity 5.5 (21/22); AI_context_awareness 0.0 (21/24); AI_correctness 52.5 (16/22); AI_edge_cases 45.7 (19/22); AI_efficiency 0.0 (22/22); AI_hallucination_resistance 100.0 (2/24); AI_memory_retention 100.0 (1/24); AI_parameter_accuracy 0.0 (21/24); AI_plan_coherence 100.0 (2/24); AI_recovery 100.0 (13/22); AI_refusal 31.2 (20/22); AI_spec 36.6 (20/22); AI_stability 8.5 (21/22); AI_task_completion 0.0 (21/24); AI_tool_selection 0.0 (21/24); ARC_AGI_2 20.7 (11/22); ArtificialAnalysisCoding 54.4 (12/23); ArtificialAnalysisIntelligence 86.0 (5/23); ArtificialAnalysisReasoning 83.9 (6/23); BlendedCost 74.4 (19/24); ContextWindow 78.4 (15/24); CopilotArenaOrLMArenaCode 60.6 (16/24); GDPval 15.6 (22/24); GPQA_HLE_Reasoning 83.9 (6/23); IFBench 100.0 (2/23); LMArenaCreativeOrOpenEnded 58.9 (15/24); LMArenaSearchDocument 18.4 (16/22); LMArenaText 58.9 (15/24); LongContextRecall 58.8 (13/23); OutputSpeed 98.8 (3/23); SWEComposite 45.6 (20/24); SWERebench 39.1 (19/23); SciCode 60.7 (9/23); SonarComposite 50.0 (19/24); TTFT 39.4 (21/23); Tau2Bench 100.0 (1/23); TerminalBench 11.8 (21/24) | |||||||||
| claude-sonnet-4.5 | anthropic | 50.7 | 50.7 | 42.9 | 42.9 | 44.7 | 44.7 | 36.8 | ▸ |
group breakdown (score, rank): A_B 12.5 (24/24); A_I 20.7 (24/24); A_P 32.6 (24/24); A_R 17.6 (24/24); BUILD 52.9 (16/24); CRE 64.3 (14/24); GEN 44.1 (17/24); LM_ARENA_REVIEW_PROXY 2.3 (22/24); OPS_long 80.2 (14/24); OPS_precision 80.3 (11/24); OPS_review 82.4 (11/24); PLAN 40.9 (18/24). metrics (score, rank): AI_canary_health 89.1 (1/7); AI_code 0.0 (22/22); AI_complexity 0.0 (22/22); AI_context_awareness 0.0 (13/24); AI_correctness 0.0 (22/22); AI_edge_cases 15.0 (21/22); AI_efficiency 26.4 (17/22); AI_hallucination_resistance 20.0 (21/24); AI_memory_retention 0.0 (15/24); AI_parameter_accuracy 100.0 (3/24); AI_plan_coherence 22.2 (11/24); AI_recovery 26.2 (20/22); AI_refusal 0.0 (22/22); AI_spec 0.0 (22/22); AI_stability 75.4 (20/22); AI_task_completion 100.0 (1/24); AI_tool_selection 99.8 (3/24); ARC_AGI_2 3.7 (17/22); ArtificialAnalysisCoding 46.9 (14/23); ArtificialAnalysisIntelligence 50.2 (14/23); ArtificialAnalysisReasoning 35.3 (17/23); BlendedCost 74.4 (17/24); ContextWindow 99.3 (10/24); CopilotArenaOrLMArenaCode 54.1 (18/24); GDPval 88.2 (3/24); GPQA_HLE_Reasoning 35.3 (17/23); GSO 27.3 (12/16); IFBench 41.6 (15/23); LMArenaCreativeOrOpenEnded 64.3 (14/24); LMArenaSearchDocument 2.3 (20/22); LMArenaText 64.3 (14/24); LongContextRecall 65.7 (11/23); MCPAtlas 6.6 (15/16); OutputSpeed 76.6 (19/23); SWEBenchMultilingual 3.9 (18/19); SWEBenchPro 81.2 (13/21); SWEBenchVerified 85.7 (15/23); SWEComposite 71.6 (16/24); SWERebench 74.9 (11/23); SciCode 46.4 (14/23); SonarBugDensity 2.8 (21/22); SonarComposite 15.6 (23/24); SonarFunctionalSkill 17.2 (20/22); SonarIssueDensity 30.0 (12/22); SonarVulnerabilityDensity 4.6 (21/22); TTFT 78.8 (11/23); Tau2Bench 56.5 (13/23); TerminalBench 37.4 (17/24) | |||||||||
| gemini-2.5-pro | | 25.5 | 25.5 | 36.9 | 36.9 | 42.8 | 42.8 | 41.7 | ▸ |
group breakdown (score, rank): A_B 49.8 (17/24); A_I 63.2 (16/24); A_P 60.4 (14/24); A_R 71.5 (16/24); BUILD 37.5 (21/24); CRE 0.0 (24/24); GEN 17.3 (19/24); LM_ARENA_REVIEW_PROXY 0.0 (24/24); OPS_long 81.9 (11/24); OPS_precision 73.5 (16/24); OPS_review 79.2 (15/24); PLAN 28.8 (21/24). metrics (score, rank): AI_code 22.3 (16/22); AI_complexity 13.7 (16/22); AI_context_awareness 28.3 (3/24); AI_correctness 50.6 (18/22); AI_edge_cases 89.0 (15/22); AI_efficiency 21.6 (18/22); AI_hallucination_resistance 92.5 (7/24); AI_memory_retention 29.7 (5/24); AI_parameter_accuracy 26.7 (17/24); AI_plan_coherence 92.5 (6/24); AI_recovery 92.5 (16/22); AI_refusal 47.1 (16/22); AI_spec 56.2 (17/22); AI_stability 83.8 (13/22); AI_task_completion 88.3 (11/24); AI_tool_selection 18.6 (16/24); ARC_AGI_2 3.7 (18/22); ArtificialAnalysisCoding 25.8 (19/23); ArtificialAnalysisIntelligence 20.7 (19/23); ArtificialAnalysisReasoning 44.8 (15/23); BlendedCost 80.1 (10/24); ContextWindow 100.0 (4/24); CopilotArenaOrLMArenaCode 0.9 (23/24); GDPval 37.9 (18/24); GPQA_HLE_Reasoning 44.8 (15/23); GSO 0.0 (16/16); IFBench 18.7 (20/23); LMArenaCreativeOrOpenEnded 0.0 (24/24); LMArenaSearchDocument 0.0 (22/22); LMArenaText 0.0 (24/24); LongContextRecall 67.2 (10/23); MCPAtlas 71.1 (8/16); OutputSpeed 91.3 (6/23); SWEBenchMultilingual 36.0 (11/19); SWEBenchPro 75.7 (17/21); SWEBenchVerified 38.2 (22/23); SWEComposite 36.5 (22/24); SWERebench 1.8 (22/23); SciCode 36.1 (17/23); SonarBugDensity 52.7 (15/22); SonarComposite 54.2 (15/24); SonarFunctionalSkill 78.9 (8/22); SonarIssueDensity 13.2 (15/22); SonarVulnerabilityDensity 58.2 (14/22); TTFT 43.3 (19/23); Tau2Bench 3.3 (21/23); TerminalBench 1.8 (22/24) | |||||||||
| gemini-2.5-flash | | 43.4 | 43.4 | 29.7 | 29.7 | 35.2 | 35.2 | 40.6 | ▸ |
group breakdown: A_B 39.2 (21 / 24), A_I 48.1 (21 / 24), A_P 53.6 (19 / 24), A_R 47.1 (22 / 24), BUILD 28.7 (23 / 24), CRE 46.0 (19 / 24), GEN 14.2 (23 / 24), LM_ARENA_REVIEW_PROXY 78.8 (9 / 24), OPS_long 94.1 (2 / 24), OPS_precision 89.6 (4 / 24), OPS_review 92.2 (2 / 24), PLAN 14.2 (23 / 24); metrics: AI_code 6.0 (21 / 22), AI_complexity 11.6 (19 / 22), AI_context_awareness 100.0 (1 / 24), AI_correctness 56.3 (15 / 22), AI_edge_cases 39.4 (20 / 22), AI_efficiency 85.4 (4 / 22), AI_hallucination_resistance 95.5 (6 / 24), AI_memory_retention 0.0 (18 / 24), AI_parameter_accuracy 100.0 (5 / 24), AI_plan_coherence 56.0 (9 / 24), AI_recovery 11.0 (21 / 22), AI_refusal 14.8 (21 / 22), AI_spec 21.4 (21 / 22), AI_stability 100.0 (3 / 22), AI_task_completion 25.0 (20 / 24), AI_tool_selection 67.8 (14 / 24), ARC_AGI_2 0.8 (20 / 22), ArtificialAnalysisCoding 0.0 (22 / 23), ArtificialAnalysisIntelligence 0.0 (22 / 23), ArtificialAnalysisReasoning 14.1 (20 / 23), BlendedCost 94.4 (5 / 24), ContextWindow 100.0 (3 / 24), CopilotArenaOrLMArenaCode 66.0 (15 / 24), GDPval 39.7 (16 / 24), GPQA_HLE_Reasoning 14.1 (20 / 23), GSO 19.4 (13 / 16), IFBench 22.9 (19 / 23), LMArenaCreativeOrOpenEnded 46.0 (19 / 24), LMArenaSearchDocument 78.8 (9 / 22), LMArenaText 46.0 (19 / 24), LiveCodeBench 100.0 (1 / 2), LongContextRecall 46.1 (18 / 23), MCPAtlas 26.6 (11 / 16), OutputSpeed 100.0 (1 / 23), SWEBenchMultilingual 92.5 (7 / 19), SWEBenchPro 52.5 (19 / 21), SWEBenchVerified 0.0 (23 / 23), SWEComposite 27.6 (24 / 24), SWERebench 0.0 (23 / 23), SciCode 17.5 (19 / 23), SonarBugDensity 52.7 (14 / 22), SonarComposite 54.2 (14 / 24), SonarFunctionalSkill 78.9 (7 / 22), SonarIssueDensity 13.2 (14 / 22), SonarVulnerabilityDensity 58.2 (13 / 22), TTFT 73.4 (16 / 23), Tau2Bench 0.0 (22 / 23), TerminalBench 0.3 (23 / 24) | |||||||||
| glm-4.6 | zai | 34.7 | 34.7 | 30.1 | 30.1 | 34.3 | 34.3 | 37.8 | ▸ |
group breakdown: A_B 56.0 (15 / 24), A_I 55.0 (19 / 24), A_P 47.0 (20 / 24), A_R 58.5 (19 / 24), BUILD 20.8 (24 / 24), CRE 24.5 (21 / 24), GEN 17.3 (20 / 24), LM_ARENA_REVIEW_PROXY 50.0 (10 / 24), OPS_long 80.2 (13 / 24), OPS_precision 86.7 (7 / 24), OPS_review 84.3 (8 / 24), PLAN 17.7 (22 / 24); metrics: AI_context_awareness 0.0 (23 / 24), AI_hallucination_resistance 100.0 (4 / 24), AI_memory_retention 100.0 (3 / 24), AI_parameter_accuracy 0.0 (23 / 24), AI_plan_coherence 100.0 (4 / 24), AI_task_completion 0.0 (23 / 24), AI_tool_selection 0.0 (23 / 24), ArtificialAnalysisCoding 18.2 (20 / 23), ArtificialAnalysisIntelligence 13.3 (20 / 23), ArtificialAnalysisReasoning 16.5 (18 / 23), BlendedCost 95.4 (4 / 24), ContextWindow 75.0 (17 / 24), CopilotArenaOrLMArenaCode 45.0 (21 / 24), GDPval 20.5 (21 / 24), GPQA_HLE_Reasoning 16.5 (18 / 23), IFBench 4.5 (21 / 23), LMArenaCreativeOrOpenEnded 24.5 (21 / 24), LMArenaText 24.5 (21 / 24), LongContextRecall 9.8 (21 / 23), MCPAtlas 7.5 (14 / 16), OutputSpeed 71.9 (21 / 23), SWEBenchMultilingual 5.0 (16 / 19), SWEBenchPro 0.0 (21 / 21), SWEBenchVerified 79.0 (19 / 23), SWEComposite 27.7 (23 / 24), SWERebench 38.4 (20 / 23), SciCode 12.0 (21 / 23), SonarBugDensity 7.5 (20 / 22), SonarComposite 10.7 (24 / 24), SonarFunctionalSkill 7.5 (21 / 22), SonarIssueDensity 7.5 (19 / 22), SonarVulnerabilityDensity 29.0 (17 / 22), TTFT 99.4 (3 / 23), Tau2Bench 39.7 (18 / 23), TerminalBench 13.9 (20 / 24) | |||||||||