how scoring works
Each model gets four role scores, each derived from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill, drawing on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the direct LM Arena search/document review proxy, that proxy lead is discounted from their Idea, Plan, and Build scores, so a vendor cannot gain from gaming its own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted.
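The adjustment can be sketched in a few lines of Python. This is a minimal illustration, not the site's implementation: the function name `adjusted_score`, the `PENALTY` table, and in particular the definition of the proxy lead (the model's review-proxy score minus the best score from any other vendor, floored at zero) are assumptions. Under that assumption the sketch reproduces the claude-opus-4.7 row in the table below, whose proxy score of 100.0 leads the runner-up kimi-k2.6 (94.7) by 5.3 points.

```python
# Penalty coefficients quoted in the text; Review is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def adjusted_score(raw: float, role: str, proxy_lead: float) -> float:
    """Subtract coefficient * proxy lead from a raw role score.

    ASSUMPTION: proxy_lead is the number of points by which the model leads
    the best other-vendor model on the review proxy. A negative lead (the
    model is not leading) is treated as zero, so no penalty applies.
    """
    lead = max(proxy_lead, 0.0)
    return round(raw - PENALTY[role] * lead, 1)


# Raw scores and proxy values taken from the claude-opus-4.7 and kimi-k2.6 rows.
lead = 100.0 - 94.7
for role, raw in [("idea", 77.5), ("plan", 70.7), ("build", 70.8)]:
    print(role, adjusted_score(raw, role, lead))
```

With these inputs the sketch yields 77.1, 69.7, and 69.1, which match the adjusted values shown for that row. Every other row in the table has identical raw and adjusted scores, which is consistent with a zero lead.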
missing data
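If a model is missing some metrics within a group, the group score depends on how much of the group is covered. Between 60% and 80% group coverage, the score blends from shrinking toward 50 (at 60%) to trusting the present metrics (at 80%). At 80% coverage and above, the present-weight mean (the weighted mean over the metrics that are present) is trusted directly.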
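One way to read the coverage rule is as a linear ramp. The sketch below is an illustration built on several assumptions, not the scorer's actual code: it assumes coverage is the share of a group's metric weight that is present, that "shrink-to-50" means filling the missing weight with a neutral 50, that the blend between 60% and 80% is linear, and that coverage below 60% is fully shrunk. The names `group_score`, `NEUTRAL`, `LOW`, and `HIGH` are hypothetical.

```python
from typing import Mapping, Optional

NEUTRAL = 50.0            # value that missing data shrinks toward
LOW, HIGH = 0.60, 0.80    # coverage band over which the blend happens


def group_score(scores: Mapping[str, Optional[float]],
                weights: Mapping[str, float]) -> float:
    """Blend a group's present metrics with a neutral 50 based on coverage."""
    # Keep only metrics that belong to the group and have a value.
    present = {m: scores[m] for m in weights if scores.get(m) is not None}
    present_weight = sum(weights[m] for m in present)
    if present_weight == 0:
        return NEUTRAL  # nothing to trust at all

    coverage = present_weight / sum(weights.values())
    # Present-weight mean: weighted mean over the metrics that exist.
    mean = sum(weights[m] * s for m, s in present.items()) / present_weight
    # ASSUMPTION: shrink-to-50 imputes the missing weight as NEUTRAL.
    shrunk = coverage * mean + (1.0 - coverage) * NEUTRAL
    # Trust ramps linearly from 0 at LOW coverage to 1 at HIGH coverage.
    trust = min(max((coverage - LOW) / (HIGH - LOW), 0.0), 1.0)
    return trust * mean + (1.0 - trust) * shrunk


equal_weights = {m: 1.0 for m in "abcde"}
print(group_score({"a": 90, "b": 90, "c": 90, "d": 90}, equal_weights))  # 80% coverage
print(group_score({"a": 90, "b": 90, "c": 90}, equal_weights))           # 60% coverage
```

Under these assumptions, present metrics averaging 90 give a group score of 90 at 80% coverage and about 74 at 60% coverage, with intermediate coverage landing in between.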
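| model | vendor | Idea (raw) | Idea (adj) | Plan (raw) | Plan (adj) | Build (raw) | Build (adj) | Review | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-3.1-pro-preview | | 96.0 | 96.0 | 89.9 | 89.9 | 84.1 | 84.1 | 87.2 | ▸ |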
group breakdown: A_B 90.1 (5 / 25); A_I 88.9 (6 / 25); A_P 78.4 (5 / 25); A_R 85.8 (10 / 25); BUILD 82.1 (5 / 25); CRE 100.0 (2 / 25); GEN 100.0 (1 / 25); LM_ARENA_REVIEW_PROXY 92.3 (4 / 25); OPS_long 83.5 (10 / 25); OPS_precision 75.8 (17 / 25); OPS_review 80.7 (15 / 25); PLAN 92.7 (1 / 25)
metrics: AI_code 92.5 (5 / 23); AI_complexity 92.5 (5 / 23); AI_context_awareness 22.3 (7 / 25); AI_correctness 92.5 (13 / 23); AI_edge_cases 91.5 (11 / 23); AI_efficiency 92.5 (5 / 23); AI_hallucination_resistance 84.5 (18 / 25); AI_memory_retention 92.5 (8 / 25); AI_parameter_accuracy 33.3 (20 / 25); AI_plan_coherence 91.7 (8 / 25); AI_recovery 63.9 (19 / 23); AI_refusal 92.5 (17 / 23); AI_spec 92.5 (17 / 23); AI_stability 92.5 (12 / 23); AI_task_completion 80.7 (9 / 25); AI_tool_selection 64.6 (16 / 25); ARC_AGI_2 100.0 (1 / 23); ArtificialAnalysisCoding 100.0 (1 / 24); ArtificialAnalysisIntelligence 100.0 (2 / 24); ArtificialAnalysisReasoning 100.0 (1 / 24); BlendedCost 77.2 (14 / 25); ContextWindow 100.0 (7 / 25); CopilotArenaOrLMArenaCode 73.4 (10 / 25); GDPval 50.1 (16 / 25); GPQA_HLE_Reasoning 100.0 (1 / 24); GSO 51.3 (9 / 16); IFBench 94.5 (4 / 24); LMArenaCreativeOrOpenEnded 100.0 (2 / 25); LMArenaSearchDocument 92.3 (4 / 23); LMArenaText 100.0 (2 / 25); LongContextRecall 100.0 (2 / 24); MCPAtlas 71.1 (9 / 16); OutputSpeed 92.1 (5 / 24); SWEBenchMultilingual 36.0 (13 / 20); SWEBenchPro 89.1 (11 / 22); SWEBenchVerified 95.0 (6 / 24); SWEComposite 88.9 (6 / 25); SWERebench 99.8 (2 / 24); SciCode 100.0 (2 / 24); SonarBugDensity 52.7 (18 / 23); SonarComposite 54.2 (18 / 25); SonarFunctionalSkill 78.9 (10 / 23); SonarIssueDensity 13.2 (18 / 23); SonarVulnerabilityDensity 58.2 (17 / 23); TTFT 50.7 (22 / 24); Tau2Bench 95.5 (6 / 24); TerminalBench 89.4 (3 / 25)
| gpt-5.5 | openai | 83.6 | 83.6 | 83.3 | 83.3 | 81.1 | 81.1 | 76.5 | ▸ |
group breakdown: A_B 63.4 (15 / 25); A_I 76.1 (14 / 25); A_P 59.9 (13 / 25); A_R 73.3 (15 / 25); BUILD 87.4 (1 / 25); CRE 82.8 (7 / 25); GEN 94.5 (3 / 25); LM_ARENA_REVIEW_PROXY 27.5 (14 / 25); OPS_long 81.3 (12 / 25); OPS_precision 78.2 (15 / 25); OPS_review 80.0 (16 / 25); PLAN 90.6 (2 / 25)
metrics: AI_code 46.0 (15 / 23); AI_complexity 34.3 (12 / 23); AI_context_awareness 0.0 (21 / 25); AI_correctness 93.8 (9 / 23); AI_edge_cases 84.4 (13 / 23); AI_efficiency 58.6 (14 / 23); AI_hallucination_resistance 0.2 (24 / 25); AI_memory_retention 2.0 (18 / 25); AI_parameter_accuracy 97.5 (3 / 25); AI_plan_coherence 0.7 (24 / 25); AI_recovery 98.7 (8 / 23); AI_refusal 100.0 (12 / 23); AI_spec 100.0 (12 / 23); AI_stability 90.3 (15 / 23); AI_task_completion 80.1 (14 / 25); AI_tool_selection 84.8 (6 / 25); ARC_AGI_2 96.7 (2 / 23); ArtificialAnalysisCoding 100.0 (2 / 24); ArtificialAnalysisIntelligence 98.3 (3 / 24); ArtificialAnalysisReasoning 100.0 (2 / 24); BlendedCost 49.0 (24 / 25); ContextWindow 100.0 (2 / 25); CopilotArenaOrLMArenaCode 71.6 (11 / 25); GDPval 95.0 (2 / 25); GPQA_HLE_Reasoning 100.0 (2 / 24); GSO 94.0 (2 / 16); IFBench 78.2 (8 / 24); LMArenaCreativeOrOpenEnded 82.8 (7 / 25); LMArenaSearchDocument 27.5 (12 / 23); LMArenaText 82.8 (7 / 25); LongContextRecall 98.1 (3 / 24); MCPAtlas 72.8 (7 / 16); OutputSpeed 81.5 (12 / 24); SWEBenchPro 95.0 (7 / 22); SWEBenchVerified 95.0 (9 / 24); SWEComposite 89.9 (5 / 25); SWERebench 83.5 (8 / 24); SciCode 94.8 (4 / 24); SonarBugDensity 94.5 (2 / 23); SonarComposite 65.5 (9 / 25); SonarFunctionalSkill 46.5 (19 / 23); SonarIssueDensity 52.7 (8 / 23); SonarVulnerabilityDensity 99.2 (2 / 23); TTFT 82.8 (11 / 24); Tau2Bench 87.0 (9 / 24); TerminalBench 100.0 (1 / 25)
| kimi-k2.6 | moonshot | 76.7 | 76.7 | 76.5 | 76.5 | 80.1 | 80.1 | 85.7 | ▸ |
group breakdown: A_B 69.1 (12 / 25); A_I 81.5 (9 / 25); A_P 60.4 (12 / 25); A_R 84.2 (12 / 25); BUILD 84.4 (4 / 25); CRE 78.3 (8 / 25); GEN 74.3 (5 / 25); LM_ARENA_REVIEW_PROXY 94.7 (2 / 25); OPS_long 60.1 (23 / 25); OPS_precision 74.0 (19 / 25); OPS_review 72.4 (20 / 25); PLAN 87.8 (3 / 25)
metrics: AI_code 46.0 (14 / 23); AI_complexity 34.3 (15 / 23); AI_context_awareness 0.0 (17 / 25); AI_correctness 100.0 (6 / 23); AI_edge_cases 100.0 (4 / 23); AI_efficiency 56.5 (15 / 23); AI_hallucination_resistance 40.1 (22 / 25); AI_memory_retention 0.0 (22 / 25); AI_parameter_accuracy 68.9 (13 / 25); AI_plan_coherence 18.7 (15 / 25); AI_recovery 100.0 (4 / 23); AI_refusal 100.0 (8 / 23); AI_spec 100.0 (8 / 23); AI_stability 98.1 (6 / 23); AI_task_completion 66.7 (16 / 25); AI_tool_selection 57.8 (18 / 25); ARC_AGI_2 11.9 (15 / 23); ArtificialAnalysisCoding 73.8 (8 / 24); ArtificialAnalysisIntelligence 88.5 (4 / 24); ArtificialAnalysisReasoning 87.8 (4 / 24); BlendedCost 89.0 (10 / 25); ContextWindow 78.7 (15 / 25); CopilotArenaOrLMArenaCode 94.8 (5 / 25); GDPval 68.6 (11 / 25); GPQA_HLE_Reasoning 87.8 (4 / 24); IFBench 91.6 (7 / 24); LMArenaCreativeOrOpenEnded 78.3 (8 / 25); LMArenaSearchDocument 94.7 (2 / 23); LMArenaText 78.3 (8 / 25); LongContextRecall 85.3 (8 / 24); MCPAtlas 92.5 (4 / 16); OutputSpeed 36.9 (23 / 24); SWEBenchMultilingual 95.0 (6 / 20); SWEBenchPro 95.0 (5 / 22); SWEBenchVerified 95.0 (7 / 24); SWEComposite 86.2 (10 / 25); SWERebench 73.1 (15 / 24); SciCode 94.8 (3 / 24); SonarBugDensity 92.5 (6 / 23); SonarComposite 80.6 (7 / 25); SonarFunctionalSkill 66.8 (18 / 23); SonarIssueDensity 92.5 (5 / 23); SonarVulnerabilityDensity 81.6 (10 / 23); TTFT 95.3 (6 / 24); Tau2Bench 96.2 (4 / 24); TerminalBench 74.6 (5 / 25)
| glm-5.1 | zai | 80.0 | 80.0 | 70.9 | 70.9 | 78.3 | 78.3 | 80.9 | ▸ |
group breakdown: A_B 66.3 (13 / 25); A_I 76.8 (13 / 25); A_P 58.8 (14 / 25); A_R 79.1 (14 / 25); BUILD 81.9 (6 / 25); CRE 87.2 (5 / 25); GEN 67.4 (9 / 25); LM_ARENA_REVIEW_PROXY 88.0 (8 / 25); OPS_long 83.1 (11 / 25); OPS_precision 87.7 (6 / 25); OPS_review 85.3 (8 / 25); PLAN 76.9 (7 / 25)
metrics: AI_code 46.6 (13 / 23); AI_complexity 36.7 (11 / 23); AI_context_awareness 7.5 (10 / 25); AI_correctness 92.5 (14 / 23); AI_edge_cases 92.5 (8 / 23); AI_efficiency 55.5 (16 / 23); AI_hallucination_resistance 41.6 (21 / 25); AI_memory_retention 7.5 (14 / 25); AI_parameter_accuracy 66.1 (14 / 25); AI_plan_coherence 23.4 (13 / 25); AI_recovery 92.5 (10 / 23); AI_refusal 92.5 (18 / 23); AI_spec 92.5 (18 / 23); AI_stability 90.9 (14 / 23); AI_task_completion 64.2 (17 / 25); AI_tool_selection 56.6 (20 / 25); ARC_AGI_2 5.2 (17 / 23); ArtificialAnalysisCoding 62.1 (11 / 24); ArtificialAnalysisIntelligence 79.6 (9 / 24); ArtificialAnalysisReasoning 63.3 (12 / 24); BlendedCost 93.0 (7 / 25); ContextWindow 74.7 (20 / 25); CopilotArenaOrLMArenaCode 97.3 (3 / 25); GDPval 73.5 (10 / 25); GPQA_HLE_Reasoning 63.3 (12 / 24); IFBench 92.4 (6 / 24); LMArenaCreativeOrOpenEnded 87.2 (5 / 25); LMArenaSearchDocument 88.0 (8 / 23); LMArenaText 87.2 (5 / 25); LongContextRecall 48.8 (18 / 24); MCPAtlas 100.0 (1 / 16); OutputSpeed 77.9 (15 / 24); SWEBenchMultilingual 50.9 (11 / 20); SWEBenchPro 95.0 (8 / 22); SWEBenchVerified 91.9 (13 / 24); SWEComposite 92.1 (3 / 25); SWERebench 100.0 (1 / 24); SciCode 41.3 (17 / 24); SonarBugDensity 100.0 (1 / 23); SonarComposite 86.0 (2 / 25); SonarFunctionalSkill 69.8 (12 / 23); SonarIssueDensity 100.0 (1 / 23); SonarVulnerabilityDensity 87.2 (5 / 23); TTFT 98.6 (4 / 24); Tau2Bench 100.0 (2 / 24); TerminalBench 55.8 (12 / 25)
| claude-opus-4.6 | anthropic | 86.0 | 86.0 | 70.9 | 70.9 | 77.6 | 77.6 | 71.8 | ▸ |
group breakdown: A_B 55.2 (18 / 25); A_I 58.0 (16 / 25); A_P 43.6 (18 / 25); A_R 71.9 (16 / 25); BUILD 86.9 (2 / 25); CRE 100.0 (1 / 25); GEN 90.0 (4 / 25); LM_ARENA_REVIEW_PROXY 33.6 (13 / 25); OPS_long 77.9 (17 / 25); OPS_precision 76.2 (16 / 25); OPS_review 78.6 (17 / 25); PLAN 73.2 (8 / 25)
metrics: AI_canary_health 84.2 (5 / 6); AI_code 33.2 (20 / 23); AI_complexity 21.8 (20 / 23); AI_context_awareness 0.0 (12 / 25); AI_correctness 64.9 (16 / 23); AI_edge_cases 73.7 (16 / 23); AI_efficiency 41.5 (19 / 23); AI_hallucination_resistance 100.0 (1 / 25); AI_memory_retention 2.3 (16 / 25); AI_parameter_accuracy 58.1 (16 / 25); AI_plan_coherence 0.0 (25 / 25); AI_recovery 64.5 (16 / 23); AI_refusal 62.1 (20 / 23); AI_spec 62.1 (20 / 23); AI_stability 100.0 (2 / 23); AI_task_completion 44.1 (20 / 25); AI_tool_selection 90.9 (4 / 25); ARC_AGI_2 90.9 (4 / 23); ArtificialAnalysisCoding 77.0 (5 / 24); ArtificialAnalysisIntelligence 85.3 (6 / 24); ArtificialAnalysisReasoning 86.4 (5 / 24); BlendedCost 60.4 (22 / 25); ContextWindow 99.3 (8 / 25); CopilotArenaOrLMArenaCode 100.0 (1 / 25); GDPval 82.4 (7 / 25); GPQA_HLE_Reasoning 86.4 (5 / 24); GSO 75.3 (3 / 16); IFBench 30.3 (19 / 24); LMArenaCreativeOrOpenEnded 100.0 (1 / 25); LMArenaSearchDocument 33.6 (11 / 23); LMArenaText 100.0 (1 / 25); LongContextRecall 90.2 (5 / 24); OutputSpeed 76.2 (19 / 24); SWEBenchMultilingual 90.9 (9 / 20); SWEBenchPro 100.0 (1 / 22); SWEBenchVerified 99.7 (2 / 24); SWEComposite 95.7 (1 / 25); SWERebench 91.6 (4 / 24); SciCode 85.9 (5 / 24); SonarBugDensity 59.5 (13 / 23); SonarComposite 70.5 (8 / 25); SonarFunctionalSkill 92.2 (4 / 23); SonarIssueDensity 46.8 (10 / 23); SonarVulnerabilityDensity 66.6 (12 / 23); TTFT 75.3 (18 / 24); Tau2Bench 87.7 (8 / 24); TerminalBench 64.2 (8 / 25)
| claude-opus-4.5 | anthropic | 73.6 | 73.6 | 69.7 | 69.7 | 75.6 | 75.6 | 67.6 | ▸ |
group breakdown: A_B 64.8 (14 / 25); A_I 72.6 (15 / 25); A_P 68.3 (7 / 25); A_R 82.2 (13 / 25); BUILD 80.7 (7 / 25); CRE 73.6 (11 / 25); GEN 73.5 (6 / 25); LM_ARENA_REVIEW_PROXY 10.8 (22 / 25); OPS_long 77.1 (18 / 25); OPS_precision 74.4 (18 / 25); OPS_review 74.2 (19 / 25); PLAN 67.1 (10 / 25)
metrics: AI_canary_health 88.5 (2 / 6); AI_code 45.7 (16 / 23); AI_complexity 23.0 (19 / 23); AI_context_awareness 56.0 (4 / 25); AI_correctness 81.5 (15 / 23); AI_edge_cases 88.8 (12 / 23); AI_efficiency 55.0 (17 / 23); AI_hallucination_resistance 78.1 (19 / 25); AI_memory_retention 2.1 (17 / 25); AI_parameter_accuracy 99.8 (2 / 25); AI_plan_coherence 27.5 (11 / 25); AI_recovery 94.5 (9 / 23); AI_refusal 85.6 (19 / 23); AI_spec 85.6 (19 / 23); AI_stability 91.1 (13 / 23); AI_task_completion 99.9 (2 / 25); AI_tool_selection 84.4 (7 / 25); ARC_AGI_2 84.8 (5 / 23); ArtificialAnalysisCoding 76.1 (6 / 24); ArtificialAnalysisIntelligence 73.6 (10 / 24); ArtificialAnalysisReasoning 63.7 (11 / 24); BlendedCost 60.4 (21 / 25); ContextWindow 74.5 (22 / 25); CopilotArenaOrLMArenaCode 77.6 (8 / 25); GDPval 80.6 (9 / 25); GPQA_HLE_Reasoning 63.7 (11 / 24); GSO 59.3 (5 / 16); IFBench 43.4 (15 / 24); LMArenaCreativeOrOpenEnded 73.6 (11 / 25); LMArenaSearchDocument 10.8 (20 / 23); LMArenaText 73.6 (11 / 25); LongContextRecall 100.0 (1 / 24); OutputSpeed 80.9 (13 / 24); SWEBenchMultilingual 95.0 (2 / 20); SWEBenchPro 88.4 (12 / 22); SWEBenchVerified 92.2 (11 / 24); SWEComposite 84.9 (11 / 25); SWERebench 76.5 (9 / 24); SciCode 72.7 (8 / 24); SonarBugDensity 73.7 (9 / 23); SonarComposite 87.1 (1 / 25); SonarFunctionalSkill 100.0 (1 / 23); SonarIssueDensity 77.2 (6 / 23); SonarVulnerabilityDensity 87.2 (4 / 23); TTFT 76.9 (17 / 24); Tau2Bench 81.9 (10 / 24); TerminalBench 54.8 (13 / 25)
| gpt-5.3-codex | openai | 70.3 | 70.3 | 50.9 | 50.9 | 73.1 | 73.1 | 73.2 | ▸ |
group breakdown: A_B 71.9 (11 / 25); A_I 79.6 (11 / 25); A_P 56.6 (15 / 25); A_R 93.1 (4 / 25); BUILD 75.3 (10 / 25); CRE 72.7 (13 / 25); GEN 49.6 (16 / 25); LM_ARENA_REVIEW_PROXY 92.5 (3 / 25); OPS_long 85.5 (9 / 25); OPS_precision 82.8 (11 / 25); OPS_review 83.3 (12 / 25); PLAN 42.1 (18 / 25)
metrics: AI_code 36.1 (18 / 23); AI_complexity 34.3 (17 / 23); AI_context_awareness 0.0 (19 / 25); AI_correctness 100.0 (7 / 23); AI_edge_cases 100.0 (5 / 23); AI_efficiency 62.5 (12 / 23); AI_hallucination_resistance 100.0 (6 / 25); AI_memory_retention 0.0 (24 / 25); AI_parameter_accuracy 90.8 (5 / 25); AI_plan_coherence 0.7 (23 / 25); AI_recovery 100.0 (5 / 23); AI_refusal 100.0 (10 / 23); AI_spec 100.0 (10 / 23); AI_stability 94.5 (9 / 23); AI_task_completion 53.4 (19 / 25); AI_tool_selection 57.8 (19 / 25); ARC_AGI_2 71.9 (8 / 23); ArtificialAnalysisCoding 44.4 (16 / 24); ArtificialAnalysisIntelligence 34.1 (18 / 24); ArtificialAnalysisReasoning 35.1 (17 / 24); BlendedCost 76.5 (15 / 25); ContextWindow 85.2 (14 / 25); CopilotArenaOrLMArenaCode 59.9 (18 / 25); GDPval 68.1 (12 / 25); GPQA_HLE_Reasoning 35.1 (17 / 24); GSO 53.4 (8 / 16); IFBench 59.9 (12 / 24); LMArenaCreativeOrOpenEnded 72.7 (13 / 25); LMArenaSearchDocument 92.5 (3 / 23); LMArenaText 72.7 (13 / 25); LongContextRecall 44.8 (20 / 24); OutputSpeed 89.2 (8 / 24); SWEBenchPro 95.0 (6 / 22); SWEBenchVerified 92.5 (10 / 24); SWEComposite 92.1 (2 / 25); SWERebench 89.5 (5 / 24); SciCode 44.5 (16 / 24); SonarBugDensity 80.8 (8 / 23); SonarComposite 60.9 (10 / 25); SonarFunctionalSkill 72.3 (11 / 23); SonarIssueDensity 7.5 (19 / 23); SonarVulnerabilityDensity 92.5 (3 / 23); TTFT 79.8 (14 / 24); Tau2Bench 7.5 (21 / 24); TerminalBench 74.3 (6 / 25)
| gpt-5.4 | openai | 72.4 | 72.4 | 52.4 | 52.4 | 73.0 | 73.0 | 63.9 | ▸ |
group breakdown: A_B 72.1 (10 / 25); A_I 80.8 (10 / 25); A_P 62.3 (11 / 25); A_R 93.4 (3 / 25); BUILD 74.1 (11 / 25); CRE 76.7 (9 / 25); GEN 47.0 (17 / 25); LM_ARENA_REVIEW_PROXY 17.1 (20 / 25); OPS_long 92.2 (3 / 25); OPS_precision 89.2 (5 / 25); OPS_review 90.4 (3 / 25); PLAN 43.0 (17 / 25)
metrics: AI_code 36.1 (19 / 23); AI_complexity 34.3 (18 / 23); AI_context_awareness 0.0 (20 / 25); AI_correctness 100.0 (8 / 23); AI_edge_cases 100.0 (6 / 23); AI_efficiency 61.0 (13 / 23); AI_hallucination_resistance 100.0 (7 / 25); AI_memory_retention 0.0 (25 / 25); AI_parameter_accuracy 91.9 (4 / 25); AI_plan_coherence 7.4 (19 / 25); AI_recovery 100.0 (6 / 23); AI_refusal 100.0 (11 / 23); AI_spec 100.0 (11 / 23); AI_stability 98.1 (7 / 23); AI_task_completion 80.1 (13 / 25); AI_tool_selection 83.6 (8 / 25); ARC_AGI_2 75.8 (7 / 23); ArtificialAnalysisCoding 35.5 (18 / 24); ArtificialAnalysisIntelligence 32.8 (19 / 24); ArtificialAnalysisReasoning 15.2 (20 / 24); BlendedCost 74.9 (16 / 25); ContextWindow 100.0 (1 / 25); CopilotArenaOrLMArenaCode 68.7 (14 / 25); GDPval 88.2 (4 / 25); GPQA_HLE_Reasoning 15.2 (20 / 24); GSO 54.0 (7 / 16); IFBench 60.5 (11 / 24); LMArenaCreativeOrOpenEnded 76.7 (9 / 25); LMArenaSearchDocument 17.1 (18 / 23); LMArenaText 76.7 (9 / 25); LongContextRecall 24.2 (21 / 24); MCPAtlas 72.8 (6 / 16); OutputSpeed 94.9 (3 / 24); SWEBenchPro 92.5 (10 / 22); SWEBenchVerified 95.0 (8 / 24); SWEComposite 88.9 (7 / 25); SWERebench 83.5 (7 / 24); SciCode 11.5 (21 / 24); SonarBugDensity 84.7 (7 / 23); SonarComposite 60.4 (11 / 25); SonarFunctionalSkill 66.8 (14 / 23); SonarIssueDensity 6.8 (21 / 23); SonarVulnerabilityDensity 100.0 (1 / 23); TTFT 87.8 (8 / 24); Tau2Bench 0.0 (24 / 24); TerminalBench 100.0 (2 / 25)
| claude-opus-4.1 | anthropic | 65.4 | 65.4 | 66.0 | 66.0 | 72.1 | 72.1 | 63.6 | ▸ |
group breakdown: A_B 79.1 (8 / 25); A_I 87.5 (7 / 25); A_P 73.2 (6 / 25); A_R 90.9 (6 / 25); BUILD 71.9 (13 / 25); CRE 53.0 (18 / 25); GEN 66.2 (11 / 25); LM_ARENA_REVIEW_PROXY 0.1 (24 / 25); OPS_long 67.7 (21 / 25); OPS_precision 59.6 (22 / 25); OPS_review 59.7 (23 / 25); PLAN 63.0 (12 / 25)
metrics: AI_canary_health 68.1 (6 / 6); AI_code 59.8 (8 / 23); AI_complexity 66.3 (7 / 23); AI_context_awareness 56.9 (3 / 25); AI_correctness 100.0 (1 / 23); AI_edge_cases 100.0 (1 / 23); AI_efficiency 64.4 (10 / 23); AI_hallucination_resistance 69.9 (20 / 25); AI_memory_retention 13.9 (12 / 25); AI_parameter_accuracy 69.9 (12 / 25); AI_plan_coherence 36.7 (10 / 25); AI_recovery 100.0 (1 / 23); AI_refusal 100.0 (1 / 23); AI_spec 100.0 (1 / 23); AI_stability 100.0 (1 / 23); AI_task_completion 86.5 (4 / 25); AI_tool_selection 67.4 (12 / 25); ARC_AGI_2 82.8 (6 / 23); ArtificialAnalysisCoding 72.2 (9 / 24); ArtificialAnalysisIntelligence 70.1 (11 / 24); ArtificialAnalysisReasoning 61.7 (13 / 24); BlendedCost 0.0 (25 / 25); ContextWindow 74.5 (21 / 25); CopilotArenaOrLMArenaCode 53.5 (20 / 25); GDPval 80.6 (8 / 25); GPQA_HLE_Reasoning 61.7 (13 / 24); GSO 57.9 (6 / 16); IFBench 44.4 (14 / 24); LMArenaCreativeOrOpenEnded 53.0 (18 / 25); LMArenaSearchDocument 0.1 (22 / 23); LMArenaText 53.0 (18 / 25); LongContextRecall 92.5 (4 / 24); MCPAtlas 92.5 (2 / 16); OutputSpeed 76.3 (18 / 24); SWEBenchMultilingual 92.5 (7 / 20); SWEBenchPro 82.6 (13 / 22); SWEBenchVerified 92.0 (12 / 24); SWEComposite 72.9 (15 / 25); SWERebench 52.3 (19 / 24); SciCode 69.3 (9 / 24); SonarBugDensity 70.1 (10 / 23); SonarComposite 81.5 (3 / 25); SonarFunctionalSkill 92.5 (3 / 23); SonarIssueDensity 73.1 (7 / 23); SonarVulnerabilityDensity 81.6 (6 / 23); TTFT 72.9 (19 / 24); Tau2Bench 77.1 (12 / 24); TerminalBench 29.4 (19 / 25)
| gemini-3-pro | | 83.5 | 83.5 | 62.7 | 62.7 | 71.6 | 71.6 | 61.5 | ▸ |
group breakdown: A_B 97.1 (1 / 25); A_I 95.8 (1 / 25); A_P 83.4 (1 / 25); A_R 92.2 (5 / 25); BUILD 66.3 (14 / 25); CRE 95.0 (3 / 25); GEN 60.0 (13 / 25); LM_ARENA_REVIEW_PROXY 19.9 (17 / 25); OPS_long 45.2 (24 / 25); OPS_precision 47.9 (25 / 25); OPS_review 42.9 (25 / 25); PLAN 55.2 (15 / 25)
metrics: AI_code 100.0 (1 / 23); AI_complexity 100.0 (1 / 23); AI_context_awareness 17.4 (8 / 25); AI_correctness 100.0 (5 / 23); AI_edge_cases 98.8 (7 / 23); AI_efficiency 100.0 (1 / 23); AI_hallucination_resistance 90.6 (14 / 25); AI_memory_retention 100.0 (1 / 25); AI_parameter_accuracy 30.4 (21 / 25); AI_plan_coherence 99.0 (5 / 25); AI_recovery 66.3 (15 / 23); AI_refusal 100.0 (6 / 23); AI_spec 100.0 (6 / 23); AI_stability 100.0 (5 / 23); AI_task_completion 86.2 (5 / 25); AI_tool_selection 67.2 (13 / 25); ARC_AGI_2 41.9 (9 / 23); BlendedCost 77.2 (13 / 25); ContextWindow 0.0 (25 / 25); CopilotArenaOrLMArenaCode 69.1 (13 / 25); GDPval 36.9 (20 / 25); GSO 40.7 (10 / 16); LMArenaCreativeOrOpenEnded 95.0 (3 / 25); LMArenaSearchDocument 19.9 (15 / 23); LMArenaText 95.0 (3 / 25); MCPAtlas 74.9 (5 / 16); SWEBenchMultilingual 33.5 (14 / 20); SWEBenchPro 80.3 (15 / 22); SWEBenchVerified 82.9 (17 / 24); SWEComposite 72.1 (16 / 25); SWERebench 70.6 (17 / 24); SonarBugDensity 53.2 (14 / 23); SonarComposite 54.9 (14 / 25); SonarFunctionalSkill 84.1 (6 / 23); SonarIssueDensity 6.7 (22 / 23); SonarVulnerabilityDensity 59.7 (13 / 23); TerminalBench 61.2 (9 / 25)
| deepseek-v4-pro | deepseek | 62.0 | 62.0 | 68.7 | 68.7 | 71.4 | 71.4 | 72.4 | ▸ |
group breakdown: A_B 42.1 (19 / 25); A_I 32.3 (21 / 25); A_P 38.8 (19 / 25); A_R 38.7 (22 / 25); BUILD 80.1 (8 / 25); CRE 72.7 (12 / 25); GEN 68.5 (8 / 25); LM_ARENA_REVIEW_PROXY 88.0 (6 / 25); OPS_long 72.2 (20 / 25); OPS_precision 83.8 (9 / 25); OPS_review 84.2 (10 / 25); PLAN 83.5 (4 / 25)
metrics: AI_code 52.9 (11 / 23); AI_complexity 36.7 (10 / 23); AI_context_awareness 7.5 (9 / 25); AI_correctness 8.2 (18 / 23); AI_edge_cases 7.5 (20 / 23); AI_efficiency 63.1 (11 / 23); AI_hallucination_resistance 92.5 (13 / 25); AI_memory_retention 7.5 (13 / 25); AI_parameter_accuracy 81.2 (9 / 25); AI_plan_coherence 25.3 (12 / 25); AI_recovery 7.5 (20 / 23); AI_refusal 92.5 (14 / 23); AI_spec 92.5 (14 / 23); AI_stability 9.0 (20 / 23); AI_task_completion 75.6 (15 / 25); AI_tool_selection 73.3 (10 / 25); ARC_AGI_2 11.9 (13 / 23); ArtificialAnalysisCoding 75.1 (7 / 24); ArtificialAnalysisIntelligence 80.0 (8 / 24); ArtificialAnalysisReasoning 83.2 (7 / 24); BlendedCost 99.0 (3 / 25); ContextWindow 100.0 (3 / 25); CopilotArenaOrLMArenaCode 74.0 (9 / 25); GDPval 67.5 (14 / 25); GPQA_HLE_Reasoning 83.2 (7 / 24); IFBench 92.9 (5 / 24); LMArenaCreativeOrOpenEnded 72.7 (12 / 25); LMArenaSearchDocument 88.0 (6 / 23); LMArenaText 72.7 (12 / 25); LongContextRecall 68.5 (9 / 24); OutputSpeed 50.8 (22 / 24); SWEBenchMultilingual 95.0 (5 / 20); SWEBenchPro 95.0 (4 / 22); SWEBenchVerified 95.0 (5 / 24); SWEComposite 86.2 (9 / 25); SWERebench 73.1 (13 / 24); SciCode 75.5 (7 / 24); SonarBugDensity 92.5 (4 / 23); SonarComposite 80.6 (5 / 25); SonarFunctionalSkill 66.8 (16 / 23); SonarIssueDensity 92.5 (3 / 23); SonarVulnerabilityDensity 81.6 (8 / 23); TTFT 96.5 (5 / 24); Tau2Bench 96.8 (3 / 24); TerminalBench 70.0 (7 / 25)
| claude-opus-4.7 | anthropic | 77.5 | 77.1 | 70.7 | 69.7 | 70.8 | 69.1 | 75.7 | ▸ |
group breakdown: A_B 27.2 (25 / 25); A_I 28.8 (23 / 25); A_P 26.6 (25 / 25); A_R 46.3 (20 / 25); BUILD 86.6 (3 / 25); CRE 93.8 (4 / 25); GEN 96.6 (2 / 25); LM_ARENA_REVIEW_PROXY 100.0 (1 / 25); OPS_long 77.0 (19 / 25); OPS_precision 73.5 (20 / 25); OPS_review 76.8 (18 / 25); PLAN 78.9 (5 / 25)
metrics: AI_code 0.0 (23 / 23); AI_complexity 0.0 (23 / 23); AI_context_awareness 0.1 (11 / 25); AI_correctness 0.0 (23 / 23); AI_edge_cases 67.2 (18 / 23); AI_efficiency 39.6 (21 / 23); AI_hallucination_resistance 100.0 (2 / 25); AI_memory_retention 15.9 (11 / 25); AI_parameter_accuracy 55.6 (17 / 25); AI_plan_coherence 2.8 (21 / 25); AI_recovery 77.1 (14 / 23); AI_refusal 0.1 (22 / 23); AI_spec 0.1 (22 / 23); AI_stability 67.2 (18 / 23); AI_task_completion 53.6 (18 / 25); AI_tool_selection 99.3 (3 / 25); ARC_AGI_2 92.7 (3 / 23); ArtificialAnalysisCoding 91.0 (3 / 24); ArtificialAnalysisIntelligence 100.0 (1 / 24); ArtificialAnalysisReasoning 95.8 (3 / 24); BlendedCost 60.4 (23 / 25); ContextWindow 99.3 (9 / 25); CopilotArenaOrLMArenaCode 100.0 (2 / 25); GDPval 95.0 (1 / 25); GPQA_HLE_Reasoning 95.8 (3 / 24); GSO 100.0 (1 / 16); IFBench 45.0 (13 / 24); LMArenaCreativeOrOpenEnded 93.8 (4 / 25); LMArenaSearchDocument 100.0 (1 / 23); LMArenaText 93.8 (4 / 25); LongContextRecall 88.3 (7 / 24); OutputSpeed 77.9 (16 / 24); SWEBenchMultilingual 95.0 (3 / 20); SWEBenchPro 95.0 (2 / 22); SWEBenchVerified 95.0 (3 / 24); SWEComposite 91.1 (4 / 25); SWERebench 85.3 (6 / 24); SciCode 100.0 (1 / 24); SonarBugDensity 50.1 (20 / 23); SonarComposite 51.4 (19 / 25); SonarFunctionalSkill 93.9 (2 / 23); SonarIssueDensity 0.0 (23 / 23); SonarVulnerabilityDensity 25.3 (20 / 23); TTFT 66.2 (21 / 24); Tau2Bench 79.9 (11 / 24); TerminalBench 78.2 (4 / 25)
| gemini-3-flash | | 82.6 | 82.6 | 69.7 | 69.7 | 70.5 | 70.5 | 65.1 | ▸ |
group breakdown: A_B 90.1 (4 / 25); A_I 88.9 (5 / 25); A_P 78.4 (4 / 25); A_R 85.8 (9 / 25); BUILD 60.7 (15 / 25); CRE 86.4 (6 / 25); GEN 63.0 (12 / 25); LM_ARENA_REVIEW_PROXY 19.2 (18 / 25); OPS_long 95.3 (1 / 25); OPS_precision 92.1 (1 / 25); OPS_review 93.8 (1 / 25); PLAN 64.6 (11 / 25)
metrics: AI_code 92.5 (4 / 23); AI_complexity 92.5 (4 / 23); AI_context_awareness 22.3 (6 / 25); AI_correctness 92.5 (12 / 23); AI_edge_cases 91.5 (10 / 23); AI_efficiency 92.5 (4 / 23); AI_hallucination_resistance 84.5 (17 / 25); AI_memory_retention 92.5 (7 / 25); AI_parameter_accuracy 33.3 (19 / 25); AI_plan_coherence 91.7 (7 / 25); AI_recovery 63.9 (18 / 23); AI_refusal 92.5 (16 / 23); AI_spec 92.5 (16 / 23); AI_stability 92.5 (11 / 23); AI_task_completion 80.7 (8 / 25); AI_tool_selection 64.6 (15 / 25); ARC_AGI_2 3.1 (20 / 23); ArtificialAnalysisCoding 59.6 (12 / 24); ArtificialAnalysisIntelligence 62.0 (14 / 24); ArtificialAnalysisReasoning 82.8 (8 / 24); BlendedCost 91.5 (9 / 25); ContextWindow 100.0 (6 / 25); CopilotArenaOrLMArenaCode 68.7 (15 / 25); GDPval 38.9 (18 / 25); GPQA_HLE_Reasoning 82.8 (8 / 24); GSO 14.0 (14 / 16); IFBench 96.9 (3 / 24); LMArenaCreativeOrOpenEnded 86.4 (6 / 25); LMArenaSearchDocument 19.2 (16 / 23); LMArenaText 86.4 (6 / 25); LongContextRecall 68.5 (10 / 24); MCPAtlas 22.4 (12 / 16); OutputSpeed 99.2 (2 / 24); SWEBenchMultilingual 100.0 (1 / 20); SWEBenchPro 53.0 (19 / 22); SWEBenchVerified 100.0 (1 / 24); SWEComposite 74.1 (13 / 25); SWERebench 76.3 (10 / 24); SciCode 78.8 (6 / 24); SonarBugDensity 52.7 (17 / 23); SonarComposite 54.2 (17 / 25); SonarFunctionalSkill 78.9 (9 / 23); SonarIssueDensity 13.2 (17 / 23); SonarVulnerabilityDensity 58.2 (16 / 23); TTFT 83.1 (10 / 24); Tau2Bench 61.7 (13 / 24); TerminalBench 48.3 (14 / 25)
| deepseek-v4-flash | deepseek | 53.4 | 53.4 | 63.5 | 63.5 | 67.5 | 67.5 | 69.2 | ▸ |
group breakdown: A_B 40.7 (20 / 25); A_I 29.2 (22 / 25); A_P 36.8 (21 / 25); A_R 36.7 (24 / 25); BUILD 73.8 (12 / 25); CRE 58.8 (17 / 25); GEN 56.5 (14 / 25); LM_ARENA_REVIEW_PROXY 88.0 (5 / 25); OPS_long 87.0 (5 / 25); OPS_precision 90.9 (3 / 25); OPS_review 88.1 (5 / 25); PLAN 78.4 (6 / 25)
metrics: AI_canary_health 84.7 (4 / 6); AI_code 53.4 (10 / 23); AI_complexity 34.3 (13 / 23); AI_context_awareness 0.0 (15 / 25); AI_correctness 0.8 (20 / 23); AI_edge_cases 0.0 (21 / 23); AI_efficiency 65.4 (9 / 23); AI_hallucination_resistance 100.0 (4 / 25); AI_memory_retention 0.0 (20 / 25); AI_parameter_accuracy 86.7 (7 / 25); AI_plan_coherence 21.0 (14 / 25); AI_recovery 0.0 (21 / 23); AI_refusal 100.0 (4 / 23); AI_spec 100.0 (4 / 23); AI_stability 1.8 (21 / 23); AI_task_completion 80.1 (10 / 25); AI_tool_selection 77.5 (9 / 25); ARC_AGI_2 11.9 (12 / 23); ArtificialAnalysisCoding 47.2 (14 / 24); ArtificialAnalysisIntelligence 62.4 (13 / 24); ArtificialAnalysisReasoning 76.8 (9 / 24); BlendedCost 100.0 (1 / 25); ContextWindow 71.5 (23 / 25); CopilotArenaOrLMArenaCode 88.1 (6 / 25); GDPval 67.5 (13 / 25); GPQA_HLE_Reasoning 76.8 (9 / 24); IFBench 100.0 (1 / 24); LMArenaCreativeOrOpenEnded 58.8 (17 / 25); LMArenaSearchDocument 88.0 (5 / 23); LMArenaText 58.8 (17 / 25); LongContextRecall 52.3 (17 / 24); OutputSpeed 84.1 (11 / 24); SWEBenchMultilingual 58.6 (10 / 20); SWEBenchPro 95.0 (3 / 22); SWEBenchVerified 95.0 (4 / 24); SWEComposite 82.6 (12 / 25); SWERebench 73.1 (12 / 24); SciCode 47.4 (14 / 24); SonarBugDensity 92.5 (3 / 23); SonarComposite 80.6 (4 / 25); SonarFunctionalSkill 66.8 (15 / 23); SonarIssueDensity 92.5 (2 / 23); SonarVulnerabilityDensity 81.6 (7 / 23); TTFT 100.0 (1 / 24); Tau2Bench 94.2 (7 / 24); TerminalBench 60.9 (10 / 25)
| claude-sonnet-4.5 | anthropic | 68.2 | 68.2 | 54.4 | 54.4 | 64.7 | 64.7 | 56.3 | ▸ |
group breakdown: A_B 96.1 (2 / 25); A_I 94.1 (2 / 25); A_P 80.6 (2 / 25); A_R 98.8 (1 / 25); BUILD 52.9 (17 / 25); CRE 64.3 (15 / 25); GEN 44.0 (18 / 25); LM_ARENA_REVIEW_PROXY 2.3 (23 / 25); OPS_long 79.6 (15 / 25); OPS_precision 80.5 (14 / 25); OPS_review 82.3 (14 / 25); PLAN 40.8 (19 / 25)
metrics: AI_canary_health 89.1 (1 / 6); AI_code 99.8 (2 / 23); AI_complexity 99.2 (2 / 23); AI_context_awareness 89.9 (2 / 25); AI_correctness 100.0 (3 / 23); AI_edge_cases 100.0 (3 / 23); AI_efficiency 90.9 (6 / 23); AI_hallucination_resistance 92.8 (12 / 25); AI_memory_retention 38.2 (10 / 25); AI_parameter_accuracy 70.3 (11 / 25); AI_plan_coherence 49.4 (9 / 25); AI_recovery 100.0 (3 / 23); AI_refusal 100.0 (3 / 23); AI_spec 100.0 (3 / 23); AI_stability 100.0 (4 / 23); AI_task_completion 83.0 (6 / 25); AI_tool_selection 60.5 (17 / 25); ARC_AGI_2 3.7 (18 / 23); ArtificialAnalysisCoding 46.9 (15 / 24); ArtificialAnalysisIntelligence 50.0 (15 / 24); ArtificialAnalysisReasoning 35.1 (18 / 24); BlendedCost 74.2 (18 / 25); ContextWindow 99.3 (11 / 25); CopilotArenaOrLMArenaCode 53.8 (19 / 25); GDPval 88.6 (3 / 25); GPQA_HLE_Reasoning 35.1 (18 / 24); GSO 27.3 (12 / 16); IFBench 41.5 (16 / 24); LMArenaCreativeOrOpenEnded 64.3 (15 / 25); LMArenaSearchDocument 2.3 (21 / 23); LMArenaText 64.3 (15 / 25); LongContextRecall 65.6 (12 / 24); MCPAtlas 6.6 (15 / 16); OutputSpeed 74.8 (20 / 24); SWEBenchMultilingual 3.9 (19 / 20); SWEBenchPro 81.2 (14 / 22); SWEBenchVerified 85.7 (16 / 24); SWEComposite 71.6 (17 / 25); SWERebench 74.9 (11 / 24); SciCode 46.3 (15 / 24); SonarBugDensity 2.8 (22 / 23); SonarComposite 15.6 (24 / 25); SonarFunctionalSkill 17.2 (21 / 23); SonarIssueDensity 30.0 (13 / 23); SonarVulnerabilityDensity 4.6 (22 / 23); TTFT 80.8 (12 / 24); Tau2Bench 56.6 (14 / 24); TerminalBench 37.4 (18 / 25)
| claude-sonnet-4.6 | anthropic | 63.3 | 63.3 | 55.4 | 55.4 | 62.3 | 62.3 | 54.7 | ▸ |
group breakdown: A_B 28.4 (23 / 25); A_I 39.4 (19 / 25); A_P 38.0 (20 / 25); A_R 39.2 (21 / 25); BUILD 76.9 (9 / 25); CRE 73.8 (10 / 25); GEN 66.3 (10 / 25); LM_ARENA_REVIEW_PROXY 23.2 (15 / 25); OPS_long 65.8 (22 / 25); OPS_precision 53.5 (24 / 25); OPS_review 63.4 (22 / 25); PLAN 58.8 (13 / 25)
metrics: AI_canary_health 88.2 (3 / 6); AI_code 12.4 (21 / 23); AI_complexity 7.6 (21 / 23); AI_context_awareness 0.0 (14 / 25); AI_correctness 14.7 (17 / 23); AI_edge_cases 70.3 (17 / 23); AI_efficiency 42.4 (18 / 23); AI_hallucination_resistance 0.0 (25 / 25); AI_memory_retention 5.9 (15 / 25); AI_parameter_accuracy 60.1 (15 / 25); AI_plan_coherence 8.5 (18 / 25); AI_recovery 88.9 (12 / 23); AI_refusal 18.3 (21 / 23); AI_spec 18.3 (21 / 23); AI_stability 76.3 (16 / 23); AI_task_completion 100.0 (1 / 25); AI_tool_selection 100.0 (1 / 25); ARC_AGI_2 10.6 (16 / 23); ArtificialAnalysisCoding 85.9 (4 / 24); ArtificialAnalysisIntelligence 80.7 (7 / 24); ArtificialAnalysisReasoning 68.7 (10 / 24); BlendedCost 74.2 (19 / 25); ContextWindow 99.3 (12 / 25); CopilotArenaOrLMArenaCode 95.2 (4 / 25); GDPval 86.4 (6 / 25); GPQA_HLE_Reasoning 68.7 (10 / 24); GSO 30.7 (11 / 16); IFBench 39.7 (17 / 24); LMArenaCreativeOrOpenEnded 73.8 (10 / 25); LMArenaSearchDocument 23.2 (13 / 23); LMArenaText 73.8 (10 / 25); LongContextRecall 90.2 (6 / 24); MCPAtlas 69.8 (10 / 16); OutputSpeed 79.1 (14 / 24); SWEBenchMultilingual 95.0 (4 / 20); SWEBenchPro 76.5 (17 / 22); SWEBenchVerified 90.3 (14 / 24); SWEComposite 88.1 (8 / 25); SWERebench 95.7 (3 / 24); SciCode 57.8 (11 / 24); SonarBugDensity 65.8 (11 / 23); SonarComposite 55.8 (13 / 25); SonarFunctionalSkill 84.5 (5 / 23); SonarIssueDensity 22.3 (14 / 23); SonarVulnerabilityDensity 21.8 (21 / 23); TTFT 0.0 (24 / 24); Tau2Bench 51.2 (15 / 24); TerminalBench 47.4 (16 / 25)
| claude-sonnet-4 | anthropic | 31.5 | 31.5 | 39.2 | 39.2 | 58.7 | 58.7 | 62.3 | ▸ |
group breakdown: A_B 88.2 (6 / 25); A_I 89.4 (3 / 25); A_P 67.6 (8 / 25); A_R 97.6 (2 / 25); BUILD 47.2 (19 / 25); CRE 0.0 (24 / 25); GEN 16.2 (22 / 25); LM_ARENA_REVIEW_PROXY 86.2 (9 / 25); OPS_long 80.8 (13 / 25); OPS_precision 80.7 (13 / 25); OPS_review 82.7 (13 / 25); PLAN 29.6 (21 / 25)
metrics: AI_code 75.8 (6 / 23); AI_complexity 86.8 (6 / 23); AI_context_awareness 0.0 (13 / 25); AI_correctness 100.0 (2 / 23); AI_edge_cases 100.0 (2 / 23); AI_efficiency 86.1 (7 / 23); AI_hallucination_resistance 100.0 (3 / 25); AI_memory_retention 0.0 (19 / 25); AI_parameter_accuracy 100.0 (1 / 25); AI_plan_coherence 18.5 (16 / 25); AI_recovery 100.0 (2 / 23); AI_refusal 100.0 (2 / 23); AI_spec 100.0 (2 / 23); AI_stability 100.0 (3 / 23); AI_task_completion 88.1 (3 / 25); AI_tool_selection 99.7 (2 / 25); ARC_AGI_2 0.2 (22 / 23); ArtificialAnalysisCoding 32.6 (19 / 24); ArtificialAnalysisIntelligence 34.9 (17 / 24); ArtificialAnalysisReasoning 8.2 (22 / 24); BlendedCost 74.2 (17 / 25); ContextWindow 99.3 (10 / 25); CopilotArenaOrLMArenaCode 53.2 (21 / 25); GDPval 86.4 (5 / 25); GPQA_HLE_Reasoning 8.2 (22 / 24); GSO 6.0 (15 / 16); IFBench 34.6 (18 / 24); LMArenaCreativeOrOpenEnded 0.0 (24 / 25); LMArenaSearchDocument 86.2 (9 / 23); LMArenaText 0.0 (24 / 25); LiveCodeBench 0.0 (2 / 2); LongContextRecall 60.7 (13 / 24); MCPAtlas 13.1 (13 / 16); OutputSpeed 77.6 (17 / 24); SWEBenchMultilingual 10.8 (15 / 20); SWEBenchPro 78.4 (16 / 22); SWEBenchVerified 69.9 (22 / 24); SWEComposite 61.0 (18 / 25); SWERebench 55.1 (18 / 24); SciCode 20.3 (19 / 24); SonarBugDensity 0.0 (23 / 23); SonarComposite 19.5 (23 / 25); SonarFunctionalSkill 26.4 (20 / 23); SonarIssueDensity 35.8 (11 / 23); SonarVulnerabilityDensity 0.0 (23 / 23); TTFT 79.1 (15 / 24); Tau2Bench 26.5 (20 / 24); TerminalBench 47.4 (15 / 25)
| gemini-2.5-pro | | 32.0 | 32.0 | 41.8 | 41.8 | 53.1 | 53.1 | 45.6 | ▸ |
group breakdown: A_B 90.1 (3 / 25); A_I 88.9 (4 / 25); A_P 78.4 (3 / 25); A_R 85.8 (8 / 25); BUILD 37.4 (22 / 25); CRE 0.0 (25 / 25); GEN 17.2 (20 / 25); LM_ARENA_REVIEW_PROXY 0.0 (25 / 25); OPS_long 86.3 (7 / 25); OPS_precision 82.4 (12 / 25); OPS_review 85.4 (7 / 25); PLAN 28.7 (22 / 25)
metrics: AI_code 92.5 (3 / 23); AI_complexity 92.5 (3 / 23); AI_context_awareness 22.3 (5 / 25); AI_correctness 92.5 (11 / 23); AI_edge_cases 91.5 (9 / 23); AI_efficiency 92.5 (3 / 23); AI_hallucination_resistance 84.5 (16 / 25); AI_memory_retention 92.5 (6 / 25); AI_parameter_accuracy 33.3 (18 / 25); AI_plan_coherence 91.7 (6 / 25); AI_recovery 63.9 (17 / 23); AI_refusal 92.5 (15 / 23); AI_spec 92.5 (15 / 23); AI_stability 92.5 (10 / 23); AI_task_completion 80.7 (7 / 25); AI_tool_selection 64.6 (14 / 25); ARC_AGI_2 3.7 (19 / 23); ArtificialAnalysisCoding 25.6 (20 / 24); ArtificialAnalysisIntelligence 20.4 (20 / 24); ArtificialAnalysisReasoning 44.7 (16 / 24); BlendedCost 80.0 (11 / 25); ContextWindow 100.0 (5 / 25); CopilotArenaOrLMArenaCode 0.0 (24 / 25); GDPval 37.6 (19 / 25); GPQA_HLE_Reasoning 44.7 (16 / 24); GSO 0.0 (16 / 16); IFBench 18.5 (21 / 24); LMArenaCreativeOrOpenEnded 0.0 (25 / 25); LMArenaSearchDocument 0.0 (23 / 23); LMArenaText 0.0 (25 / 25); LongContextRecall 67.1 (11 / 24); MCPAtlas 71.1 (8 / 16); OutputSpeed 89.5 (6 / 24); SWEBenchMultilingual 36.0 (12 / 20); SWEBenchPro 75.7 (18 / 22); SWEBenchVerified 38.2 (23 / 24); SWEComposite 36.5 (23 / 25); SWERebench 1.8 (23 / 24); SciCode 35.8 (18 / 24); SonarBugDensity 52.7 (16 / 23); SonarComposite 54.2 (16 / 25); SonarFunctionalSkill 78.9 (8 / 23); SonarIssueDensity 13.2 (16 / 23); SonarVulnerabilityDensity 58.2 (15 / 23); TTFT 70.2 (20 / 24); Tau2Bench 3.2 (22 / 24); TerminalBench 1.8 (23 / 25)
| kimi-k2-0905 | moonshot | 23.4 | 23.4 | 27.5 | 27.5 | 52.4 | 52.4 | 49.3 | ▸ |
group breakdown: A_B 36.5 (21 / 25); A_I 22.4 (25 / 25); A_P 30.2 (24 / 25); A_R 36.7 (23 / 25); BUILD 59.9 (16 / 25); CRE 27.3 (21 / 25); GEN 11.7 (25 / 25); LM_ARENA_REVIEW_PROXY 88.0 (7 / 25); OPS_long 35.7 (25 / 25); OPS_precision 59.0 (23 / 25); OPS_review 54.8 (24 / 25); PLAN 30.2 (20 / 25)
metrics: AI_code 55.8 (9 / 23); AI_complexity 34.3 (14 / 23); AI_context_awareness 0.0 (16 / 25); AI_correctness 0.8 (21 / 23); AI_edge_cases 0.0 (22 / 23); AI_efficiency 8.8 (22 / 23); AI_hallucination_resistance 100.0 (5 / 25); AI_memory_retention 0.0 (21 / 25); AI_parameter_accuracy 85.0 (8 / 25); AI_plan_coherence 0.7 (22 / 25); AI_recovery 0.0 (22 / 23); AI_refusal 100.0 (7 / 23); AI_spec 100.0 (7 / 23); AI_stability 0.0 (22 / 23); AI_task_completion 80.1 (11 / 25); AI_tool_selection 70.1 (11 / 25); ARC_AGI_2 11.9 (14 / 23); ArtificialAnalysisCoding 6.6 (22 / 24); ArtificialAnalysisIntelligence 7.4 (22 / 24); ArtificialAnalysisReasoning 0.0 (23 / 24); BlendedCost 92.7 (8 / 25); ContextWindow 51.7 (24 / 25); CopilotArenaOrLMArenaCode 88.1 (7 / 25); GDPval 5.0 (24 / 25); GPQA_HLE_Reasoning 0.0 (23 / 24); IFBench 0.0 (23 / 24); LMArenaCreativeOrOpenEnded 27.3 (21 / 25); LMArenaSearchDocument 88.0 (7 / 23); LMArenaText 27.3 (21 / 25); LongContextRecall 0.0 (23 / 24); MCPAtlas 92.5 (3 / 16); OutputSpeed 0.0 (24 / 24); SWEBenchMultilingual 5.0 (16 / 20); SWEBenchPro 92.5 (9 / 22); SWEBenchVerified 78.6 (21 / 24); SWEComposite 73.9 (14 / 25); SWERebench 73.1 (14 / 24); SciCode 0.0 (23 / 24); SonarBugDensity 92.5 (5 / 23); SonarComposite 80.6 (6 / 25); SonarFunctionalSkill 66.8 (17 / 23); SonarIssueDensity 92.5 (4 / 23); SonarVulnerabilityDensity 81.6 (9 / 23); TTFT 93.4 (7 / 24); Tau2Bench 46.1 (18 / 24); TerminalBench 44.6 (17 / 25)
| glm-4.7 | zai | 32.9 | 32.9 | 51.0 | 51.0 | 52.0 | 52.0 | 55.0 | ▸ |
group breakdown: A_B 56.0 (17 / 25); A_I 55.0 (18 / 25); A_P 47.0 (17 / 25); A_R 58.5 (18 / 25); BUILD 45.3 (20 / 25); CRE 9.5 (23 / 25); GEN 37.8 (19 / 25); LM_ARENA_REVIEW_PROXY 50.0 (12 / 25); OPS_long 87.6 (4 / 25); OPS_precision 90.7 (4 / 25); OPS_review 88.3 (4 / 25); PLAN 54.3 (16 / 25)
metrics: AI_context_awareness 0.0 (25 / 25); AI_hallucination_resistance 100.0 (11 / 25); AI_memory_retention 99.6 (5 / 25); AI_parameter_accuracy 0.0 (25 / 25); AI_plan_coherence 100.0 (4 / 25); AI_task_completion 0.0 (25 / 25); AI_tool_selection 0.0 (25 / 25); ArtificialAnalysisCoding 39.6 (17 / 24); ArtificialAnalysisIntelligence 46.9 (16 / 24); ArtificialAnalysisReasoning 55.7 (15 / 24); BlendedCost 96.1 (4 / 25); ContextWindow 74.7 (19 / 25); CopilotArenaOrLMArenaCode 69.6 (12 / 25); GDPval 36.6 (21 / 25); GPQA_HLE_Reasoning 55.7 (15 / 24); IFBench 69.9 (9 / 24); LMArenaCreativeOrOpenEnded 9.5 (23 / 25); LMArenaText 9.5 (23 / 25); LongContextRecall 57.2 (15 / 24); MCPAtlas 0.0 (16 / 16); OutputSpeed 85.3 (10 / 24); SWEBenchMultilingual 5.0 (18 / 20); SWEBenchVerified 90.2 (15 / 24); SWEComposite 60.7 (19 / 25); SWERebench 70.9 (16 / 24); SciCode 48.5 (13 / 24); SonarBugDensity 51.6 (19 / 23); SonarComposite 27.3 (22 / 25); SonarFunctionalSkill 0.0 (23 / 23); SonarIssueDensity 50.8 (9 / 23); SonarVulnerabilityDensity 28.7 (19 / 23); TTFT 99.2 (3 / 24); Tau2Bench 96.2 (5 / 24); TerminalBench 27.1 (20 / 25)
| gpt-5.2 | openai | 56.4 | 56.4 | 52.1 | 52.1 | 49.2 | 49.2 | 43.3 | ▸ |
group breakdown: A_B 29.9 (22 / 25); A_I 27.5 (24 / 25); A_P 34.4 (22 / 25); A_R 18.3 (25 / 25); BUILD 51.7 (18 / 25); CRE 67.8 (14 / 25); GEN 53.4 (15 / 25); LM_ARENA_REVIEW_PROXY 20.8 (16 / 25); OPS_long 85.8 (8 / 25); OPS_precision 83.5 (10 / 25); OPS_review 84.0 (11 / 25); PLAN 55.4 (14 / 25)
metrics: AI_code 41.1 (17 / 23); AI_complexity 34.3 (16 / 23); AI_context_awareness 0.0 (18 / 25); AI_correctness 0.8 (22 / 23); AI_edge_cases 0.0 (23 / 23); AI_efficiency 66.9 (8 / 23); AI_hallucination_resistance 0.2 (23 / 25); AI_memory_retention 0.0 (23 / 25); AI_parameter_accuracy 90.4 (6 / 25); AI_plan_coherence 5.2 (20 / 25); AI_recovery 0.0 (23 / 23); AI_refusal 100.0 (9 / 23); AI_spec 100.0 (9 / 23); AI_stability 0.0 (23 / 23); AI_task_completion 80.1 (12 / 25); AI_tool_selection 84.8 (5 / 25); ARC_AGI_2 0.0 (23 / 23); ArtificialAnalysisCoding 64.6 (10 / 24); ArtificialAnalysisIntelligence 62.7 (12 / 24); ArtificialAnalysisReasoning 56.3 (14 / 24); BlendedCost 80.0 (12 / 25); ContextWindow 85.2 (13 / 25); CopilotArenaOrLMArenaCode 38.7 (23 / 25); GDPval 66.3 (15 / 25); GPQA_HLE_Reasoning 56.3 (14 / 24); GSO 64.7 (4 / 16); IFBench 62.7 (10 / 24); LMArenaCreativeOrOpenEnded 67.8 (14 / 25); LMArenaSearchDocument 20.8 (14 / 23); LMArenaText 67.8 (14 / 25); LongContextRecall 53.7 (16 / 24); OutputSpeed 89.2 (7 / 24); SWEBenchMultilingual 0.0 (20 / 20); SWEBenchPro 38.2 (21 / 22); SWEBenchVerified 81.3 (19 / 24); SWEComposite 45.6 (22 / 25); SciCode 54.5 (12 / 24); SonarBugDensity 64.2 (12 / 23); SonarComposite 59.7 (12 / 25); SonarFunctionalSkill 67.2 (13 / 23); SonarIssueDensity 35.7 (12 / 23); SonarVulnerabilityDensity 73.4 (11 / 23); TTFT 79.8 (13 / 24); Tau2Bench 48.1 (17 / 24); TerminalBench 58.2 (11 / 25)
| gemini-2.5-flash | | 50.7 | 50.7 | 32.8 | 32.8 | 45.1 | 45.1 | 50.2 | ▸ |
group breakdown: A_B 79.8 (7 / 25), A_I 78.8 (12 / 25), A_P 66.3 (9 / 25), A_R 86.6 (7 / 25), BUILD 28.7 (24 / 25), CRE 45.9 (20 / 25), GEN 14.1 (24 / 25), LM_ARENA_REVIEW_PROXY 78.8 (10 / 25), OPS_long 95.2 (2 / 25), OPS_precision 91.5 (2 / 25), OPS_review 93.6 (2 / 25), PLAN 14.1 (24 / 25)
metrics: AI_code 71.6 (7 / 23), AI_complexity 50.5 (8 / 23), AI_context_awareness 100.0 (1 / 25), AI_correctness 100.0 (4 / 23), AI_edge_cases 54.0 (19 / 23), AI_efficiency 99.1 (2 / 23), AI_hallucination_resistance 88.6 (15 / 25), AI_memory_retention 42.4 (9 / 25), AI_parameter_accuracy 77.4 (10 / 25), AI_plan_coherence 16.8 (17 / 25), AI_recovery 99.3 (7 / 23), AI_refusal 100.0 (5 / 23), AI_spec 100.0 (5 / 23), AI_stability 74.6 (17 / 23), AI_task_completion 22.0 (21 / 25), AI_tool_selection 25.1 (21 / 25), ARC_AGI_2 0.8 (21 / 23), ArtificialAnalysisCoding 0.0 (23 / 24), ArtificialAnalysisIntelligence 0.0 (23 / 24), ArtificialAnalysisReasoning 13.8 (21 / 24), BlendedCost 94.4 (6 / 25), ContextWindow 100.0 (4 / 25), CopilotArenaOrLMArenaCode 65.9 (16 / 25), GDPval 39.5 (17 / 25), GPQA_HLE_Reasoning 13.8 (21 / 24), GSO 19.4 (13 / 16), IFBench 22.8 (20 / 24), LMArenaCreativeOrOpenEnded 45.9 (20 / 25), LMArenaSearchDocument 78.8 (10 / 23), LMArenaText 45.9 (20 / 25), LiveCodeBench 100.0 (1 / 2), LongContextRecall 45.9 (19 / 24), MCPAtlas 26.6 (11 / 16), OutputSpeed 100.0 (1 / 24), SWEBenchMultilingual 92.5 (8 / 20), SWEBenchPro 52.5 (20 / 22), SWEBenchVerified 0.0 (24 / 24), SWEComposite 27.6 (25 / 25), SWERebench 0.0 (24 / 24), SciCode 17.0 (20 / 24), SonarBugDensity 52.7 (15 / 23), SonarComposite 54.2 (15 / 25), SonarFunctionalSkill 78.9 (7 / 23), SonarIssueDensity 13.2 (15 / 23), SonarVulnerabilityDensity 58.2 (14 / 23), TTFT 79.0 (16 / 24), Tau2Bench 0.0 (23 / 24), TerminalBench 0.3 (24 / 25)
| grok-code-fast-1 | xai | 52.3 | 52.3 | 31.9 | 31.9 | 43.6 | 43.6 | 40.9 | ▸ |
group breakdown: A_B 73.0 (9 / 25), A_I 82.2 (8 / 25), A_P 64.2 (10 / 25), A_R 85.1 (11 / 25), BUILD 29.5 (23 / 25), CRE 48.1 (19 / 25), GEN 15.8 (23 / 25), LM_ARENA_REVIEW_PROXY 15.7 (21 / 25), OPS_long 86.5 (6 / 25), OPS_precision 87.1 (7 / 25), OPS_review 86.6 (6 / 25), PLAN 12.8 (25 / 25)
metrics: AI_code 47.5 (12 / 23), AI_complexity 46.1 (9 / 23), AI_context_awareness 0.0 (23 / 25), AI_correctness 93.1 (10 / 23), AI_edge_cases 76.4 (15 / 23), AI_efficiency 40.3 (20 / 23), AI_hallucination_resistance 100.0 (9 / 25), AI_memory_retention 99.6 (3 / 25), AI_parameter_accuracy 0.0 (23 / 25), AI_plan_coherence 100.0 (2 / 25), AI_recovery 77.2 (13 / 23), AI_refusal 95.1 (13 / 23), AI_spec 95.1 (13 / 23), AI_stability 94.8 (8 / 23), AI_task_completion 0.0 (23 / 25), AI_tool_selection 0.0 (23 / 25), ARC_AGI_2 25.1 (10 / 23), ArtificialAnalysisCoding 0.0 (24 / 24), ArtificialAnalysisIntelligence 0.0 (24 / 24), ArtificialAnalysisReasoning 0.0 (24 / 24), BlendedCost 99.3 (2 / 25), ContextWindow 78.3 (17 / 25), CopilotArenaOrLMArenaCode 0.0 (25 / 25), GDPval 5.0 (25 / 25), GPQA_HLE_Reasoning 0.0 (24 / 24), IFBench 0.0 (24 / 24), LMArenaCreativeOrOpenEnded 48.1 (19 / 25), LMArenaSearchDocument 15.7 (19 / 23), LMArenaText 48.1 (19 / 25), LongContextRecall 0.0 (24 / 24), OutputSpeed 87.5 (9 / 24), SWEBenchVerified 82.7 (18 / 24), SWEComposite 46.1 (20 / 25), SWERebench 27.9 (22 / 24), SciCode 0.0 (24 / 24), SonarComposite 50.0 (21 / 25), TTFT 83.5 (9 / 24), Tau2Bench 51.2 (16 / 24), TerminalBench 0.0 (25 / 25)
| grok-4-latest | xai | 56.8 | 56.8 | 61.3 | 61.3 | 43.5 | 43.5 | 51.5 | ▸ |
group breakdown: A_B 28.0 (24 / 25), A_I 34.7 (20 / 25), A_P 33.4 (23 / 25), A_R 48.3 (19 / 25), BUILD 43.7 (21 / 25), CRE 58.9 (16 / 25), GEN 69.1 (7 / 25), LM_ARENA_REVIEW_PROXY 18.4 (19 / 25), OPS_long 78.1 (16 / 25), OPS_precision 68.0 (21 / 25), OPS_review 72.0 (21 / 25), PLAN 71.3 (9 / 25)
metrics: AI_code 0.1 (22 / 23), AI_complexity 3.3 (22 / 23), AI_context_awareness 0.0 (22 / 25), AI_correctness 7.4 (19 / 23), AI_edge_cases 77.6 (14 / 23), AI_efficiency 0.0 (23 / 23), AI_hallucination_resistance 100.0 (8 / 25), AI_memory_retention 99.6 (2 / 25), AI_parameter_accuracy 0.0 (22 / 25), AI_plan_coherence 100.0 (1 / 25), AI_recovery 88.9 (11 / 23), AI_refusal 0.0 (23 / 23), AI_spec 0.0 (23 / 23), AI_stability 38.5 (19 / 23), AI_task_completion 0.0 (22 / 25), AI_tool_selection 0.0 (22 / 25), ARC_AGI_2 20.7 (11 / 23), ArtificialAnalysisCoding 54.5 (13 / 24), ArtificialAnalysisIntelligence 86.0 (5 / 24), ArtificialAnalysisReasoning 84.0 (6 / 24), BlendedCost 74.2 (20 / 25), ContextWindow 78.3 (16 / 25), CopilotArenaOrLMArenaCode 60.4 (17 / 25), GDPval 15.1 (23 / 25), GPQA_HLE_Reasoning 84.0 (6 / 24), IFBench 100.0 (2 / 24), LMArenaCreativeOrOpenEnded 58.9 (16 / 25), LMArenaSearchDocument 18.4 (17 / 23), LMArenaText 58.9 (16 / 25), LongContextRecall 58.7 (14 / 24), OutputSpeed 93.3 (4 / 24), SWEComposite 45.6 (21 / 25), SWERebench 39.1 (20 / 24), SciCode 60.6 (10 / 24), SonarComposite 50.0 (20 / 25), TTFT 38.3 (23 / 24), Tau2Bench 100.0 (1 / 24), TerminalBench 11.8 (22 / 25)
| glm-4.6 | zai | 34.5 | 34.5 | 30.0 | 30.0 | 34.3 | 34.3 | 37.8 | ▸ |
group breakdown: A_B 56.0 (16 / 25), A_I 55.0 (17 / 25), A_P 47.0 (16 / 25), A_R 58.5 (17 / 25), BUILD 20.7 (25 / 25), CRE 24.2 (22 / 25), GEN 17.0 (21 / 25), LM_ARENA_REVIEW_PROXY 50.0 (11 / 25), OPS_long 80.3 (14 / 25), OPS_precision 86.8 (8 / 25), OPS_review 84.3 (9 / 25), PLAN 17.5 (23 / 25)
metrics: AI_context_awareness 0.0 (24 / 25), AI_hallucination_resistance 100.0 (10 / 25), AI_memory_retention 99.6 (4 / 25), AI_parameter_accuracy 0.0 (24 / 25), AI_plan_coherence 100.0 (3 / 25), AI_task_completion 0.0 (24 / 25), AI_tool_selection 0.0 (24 / 25), ArtificialAnalysisCoding 18.0 (21 / 24), ArtificialAnalysisIntelligence 13.0 (21 / 24), ArtificialAnalysisReasoning 16.2 (19 / 24), BlendedCost 95.4 (5 / 25), ContextWindow 74.9 (18 / 25), CopilotArenaOrLMArenaCode 44.6 (22 / 25), GDPval 20.1 (22 / 25), GPQA_HLE_Reasoning 16.2 (19 / 24), IFBench 4.3 (22 / 24), LMArenaCreativeOrOpenEnded 24.2 (22 / 25), LMArenaText 24.2 (22 / 25), LongContextRecall 9.4 (22 / 24), MCPAtlas 7.5 (14 / 16), OutputSpeed 72.0 (21 / 24), SWEBenchMultilingual 5.0 (17 / 20), SWEBenchPro 0.0 (22 / 22), SWEBenchVerified 79.0 (20 / 24), SWEComposite 27.7 (24 / 25), SWERebench 38.4 (21 / 24), SciCode 11.5 (22 / 24), SonarBugDensity 7.5 (21 / 23), SonarComposite 10.7 (25 / 25), SonarFunctionalSkill 7.5 (22 / 23), SonarIssueDensity 7.5 (20 / 23), SonarVulnerabilityDensity 29.0 (18 / 23), TTFT 99.7 (2 / 24), Tau2Bench 39.7 (19 / 24), TerminalBench 13.9 (21 / 25)