how scoring works
Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted — it is the source of the penalty.
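The adjustment above can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: it assumes the penalty is simply the role coefficient multiplied by the vendor's Review lead, and that the lead arrives as a precomputed number of points (`review_lead` is a hypothetical parameter; the text does not say how the lead is measured).

```python
# Role coefficients as stated in the text; Review has no coefficient
# because it is the source of the penalty and is never adjusted.
PENALTY_COEFFICIENTS = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def adjusted_scores(raw, review_lead):
    """Return adjusted role scores for one model.

    raw         -- dict of role name -> raw score on the 0-100 scale
    review_lead -- ASSUMPTION: the vendor's lead on Review, in points,
                   already computed elsewhere; non-positive means no penalty
    """
    lead = max(review_lead, 0.0)
    adjusted = {}
    for role, score in raw.items():
        coefficient = PENALTY_COEFFICIENTS.get(role, 0.0)  # 0.0 for "review"
        # ASSUMPTION: linear penalty, clamped so a score never goes below 0.
        adjusted[role] = max(0.0, score - coefficient * lead)
    return adjusted


example = adjusted_scores(
    {"idea": 90.0, "plan": 80.0, "build": 70.0, "review": 85.0},
    review_lead=10.0,
)
```

Under these assumptions a 10-point Review lead costs 0.8 points on Idea, 1.8 on Plan, and 3.2 on Build, while Review itself is unchanged.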
missing data
If a model is missing some metrics within a group, the group score is computed from the metrics that are present, provided they cover at least 70% of the group weight. Below that threshold, the score is shrunk toward 50 in proportion to the missing weight.
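One way to read the missing-data rule is sketched below. The function name, the weighted-mean aggregation, and the linear interpolation toward 50 are all assumptions made for illustration; the text only states the 70% threshold and that the shrinkage is proportional to the missing weight.

```python
def group_score(metrics, weights, coverage_threshold=0.70, neutral=50.0):
    """Compute one group score from possibly incomplete metrics.

    metrics -- dict of metric name -> score (0-100), or None when missing
    weights -- dict of metric name -> weight of that metric in the group
    """
    total_weight = sum(weights.values())
    present = {
        name: weight
        for name, weight in weights.items()
        if metrics.get(name) is not None
    }
    covered_weight = sum(present.values())
    if covered_weight == 0:
        return neutral  # ASSUMPTION: nothing measured means the neutral score

    # ASSUMPTION: weighted mean over present metrics, weights renormalized.
    mean = sum(metrics[name] * w for name, w in present.items()) / covered_weight

    coverage = covered_weight / total_weight
    if coverage >= coverage_threshold:
        return mean

    # ASSUMPTION: linear shrinkage toward 50 by the missing weight fraction.
    missing = 1.0 - coverage
    return mean + (neutral - mean) * missing


weights = {"a": 0.5, "b": 0.3, "c": 0.2}
well_covered = group_score({"a": 80.0, "b": 60.0, "c": None}, weights)
sparse = group_score({"a": 80.0, "b": None, "c": None}, weights)
```

In the first call 80% of the weight is covered, so the score is the renormalized mean of the two present metrics (72.5). In the second only 50% is covered, so the mean of 80 is pulled halfway toward 50 and comes out at 65.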
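| Model | Vendor | Idea (raw) | Idea (adj) | Plan (raw) | Plan (adj) | Build (raw) | Build (adj) | Review | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-3.1-pro-preview | google | 91.8 | 91.3 | 84.7 | 83.5 | 77.5 | 75.3 | 83.0 | ▸ |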
| group breakdown: A_B 81.6 (7/24), A_I 80.8 (10/24), A_P 68.2 (5/24), A_R 81.9 (8/24), BUILD 74.9 (3/24), CRE 100.0 (2/24), GEN 100.0 (1/24), LM_ARENA_REVIEW_PROXY 89.8 (4/24), OPS_long 82.0 (8/24), OPS_precision 73.5 (16/24), OPS_review 69.7 (16/24), PLAN 94.1 (2/24); metrics: AI_code 92.5 (5/22), AI_complexity 88.2 (7/22), AI_context_awareness 14.6 (8/24), AI_correctness 92.5 (11/22), AI_edge_cases 92.5 (11/22), AI_efficiency 92.5 (5/22), AI_hallucination_resistance 92.5 (8/24), AI_memory_retention 92.5 (4/24), AI_parameter_accuracy 92.5 (7/24), AI_plan_coherence 79.0 (8/24), AI_recovery 92.5 (17/22), AI_refusal 92.5 (17/22), AI_safety_compliance 92.5 (13/24), AI_spec 92.5 (17/22), AI_stability 92.5 (8/22), AI_task_completion 61.3 (18/24), AI_tool_selection 10.3 (19/24), ARC_AGI_2 100.0 (1/17), ArtificialAnalysisCoding 100.0 (1/21), ArtificialAnalysisIntelligence 100.0 (2/21), ArtificialAnalysisReasoning 100.0 (1/21), ContextWindow 100.0 (6/24), CopilotArenaOrLMArenaCode 73.0 (7/22), GDPval 23.8 (9/11), GPQA_HLE_Reasoning 100.0 (1/21), IFBench 97.5 (3/21), InverseCost 77.3 (13/24), InverseTTFT 40.9 (18/19), LMArenaCreativeOrOpenEnded 100.0 (2/24), LMArenaSearchDocument 89.8 (4/19), LMArenaText 100.0 (2/24), LongContextRecall 100.0 (2/21), MCPAtlas 66.8 (6/13), OutputSpeed 93.0 (5/19), SWEBenchPro 61.1 (4/14), SWEBenchVerified 78.0 (3/18), SWEComposite 75.4 (4/24), SWERebench 100.0 (2/20), SciCode 100.0 (2/21), SonarFunctionalSkill 86.3 (8/17), SonarIssueDensity 18.7 (13/17), Tau2Bench 99.3 (4/21), TerminalBench 89.9 (3/22) |
| gemini-3-pro | google | 86.2 | 85.7 | 63.8 | 62.5 | 72.0 | 69.9 | 67.0 | ▸ |
| group breakdown: A_B 99.5 (1/24), A_I 98.3 (1/24), A_P 78.5 (1/24), A_R 100.0 (1/24), BUILD 58.9 (10/24), CRE 95.0 (3/24), GEN 60.1 (10/24), LM_ARENA_REVIEW_PROXY 19.3 (18/24), OPS_long 45.2 (23/24), OPS_precision 46.6 (23/24), OPS_review 50.5 (22/24), PLAN 55.1 (13/24); metrics: AI_code 100.0 (2/22), AI_complexity 95.0 (3/22), AI_context_awareness 8.3 (10/24), AI_correctness 100.0 (7/22), AI_edge_cases 100.0 (7/22), AI_efficiency 100.0 (2/22), AI_hallucination_resistance 100.0 (1/24), AI_memory_retention 100.0 (1/24), AI_parameter_accuracy 100.0 (2/24), AI_plan_coherence 84.1 (5/24), AI_recovery 100.0 (8/22), AI_refusal 100.0 (7/22), AI_safety_compliance 100.0 (7/24), AI_spec 100.0 (7/22), AI_stability 100.0 (4/22), AI_task_completion 63.3 (15/24), AI_tool_selection 3.3 (20/24), ARC_AGI_2 42.4 (6/17), ContextWindow 0.0 (24/24), CopilotArenaOrLMArenaCode 67.8 (10/22), InverseCost 77.3 (12/24), LMArenaCreativeOrOpenEnded 95.0 (3/24), LMArenaSearchDocument 19.3 (13/19), LMArenaText 95.0 (3/24), MCPAtlas 69.7 (3/13), SWEBenchMultilingual 33.5 (4/6), SWEBenchPro 53.7 (8/14), SWEBenchVerified 52.1 (13/18), SWEComposite 54.5 (12/24), SWERebench 70.3 (13/20), SonarFunctionalSkill 92.7 (5/17), SonarIssueDensity 13.1 (15/17), TerminalBench 61.4 (8/22) |
| claude-opus-4.5 | anthropic | 76.0 | 76.0 | 67.4 | 67.4 | 71.5 | 71.5 | 65.6 | ▸ |
| group breakdown: A_B 75.0 (9/24), A_I 83.8 (4/24), A_P 67.7 (6/24), A_R 81.6 (9/24), BUILD 70.6 (6/24), CRE 73.6 (11/24), GEN 70.5 (6/24), LM_ARENA_REVIEW_PROXY 10.8 (21/24), OPS_long 76.4 (16/24), OPS_precision 74.6 (14/24), OPS_review 73.7 (14/24), PLAN 65.8 (8/24); metrics: AI_canary_health 89.5 (1/7), AI_code 65.4 (9/22), AI_complexity 72.6 (9/22), AI_context_awareness 55.0 (3/24), AI_correctness 100.0 (2/22), AI_edge_cases 100.0 (2/22), AI_efficiency 55.4 (9/22), AI_hallucination_resistance 20.0 (19/24), AI_memory_retention 10.2 (11/24), AI_parameter_accuracy 100.0 (1/24), AI_plan_coherence 10.7 (17/24), AI_recovery 100.0 (2/22), AI_refusal 100.0 (2/22), AI_safety_compliance 72.9 (18/24), AI_spec 100.0 (2/22), AI_stability 86.3 (9/22), AI_task_completion 65.3 (14/24), AI_tool_selection 100.0 (1/24), ArtificialAnalysisCoding 75.1 (6/21), ArtificialAnalysisIntelligence 71.5 (7/21), ArtificialAnalysisReasoning 63.7 (9/21), ContextWindow 74.7 (21/24), CopilotArenaOrLMArenaCode 76.5 (6/22), GDPval 73.8 (6/11), GPQA_HLE_Reasoning 63.7 (9/21), IFBench 44.9 (11/21), InverseCost 61.9 (20/24), InverseTTFT 74.5 (14/19), LMArenaCreativeOrOpenEnded 73.6 (11/24), LMArenaSearchDocument 10.8 (16/19), LMArenaText 73.6 (11/24), LongContextRecall 100.0 (1/21), OutputSpeed 80.3 (11/19), SWEBenchPro 60.5 (5/14), SWEBenchVerified 65.2 (9/18), SWEComposite 65.6 (6/24), SWERebench 76.3 (8/20), SciCode 72.7 (7/21), SonarFunctionalSkill 100.0 (1/17), SonarIssueDensity 63.6 (3/17), Tau2Bench 85.2 (8/21), TerminalBench 54.9 (11/22) |
| gpt-5.5 | openai | 80.9 | 80.9 | 81.4 | 81.4 | 69.8 | 69.8 | 74.2 | ▸ |
| group breakdown: A_B 58.7 (12/24), A_I 73.0 (12/24), A_P 59.5 (12/24), A_R 76.1 (12/24), BUILD 74.4 (4/24), CRE 82.0 (7/24), GEN 94.4 (3/24), LM_ARENA_REVIEW_PROXY 27.4 (14/24), OPS_long 81.7 (10/24), OPS_precision 80.0 (9/24), OPS_review 77.5 (11/24), PLAN 95.4 (1/24); metrics: AI_code 21.2 (17/22), AI_complexity 40.1 (20/22), AI_context_awareness 0.0 (20/24), AI_correctness 91.9 (17/22), AI_edge_cases 64.4 (19/22), AI_efficiency 39.9 (13/22), AI_hallucination_resistance 60.0 (14/24), AI_memory_retention 0.0 (24/24), AI_parameter_accuracy 95.8 (3/24), AI_plan_coherence 7.9 (19/24), AI_recovery 93.9 (14/22), AI_refusal 100.0 (13/22), AI_safety_compliance 100.0 (10/24), AI_spec 100.0 (13/22), AI_stability 79.9 (11/22), AI_task_completion 87.0 (9/24), AI_tool_selection 84.9 (5/24), ARC_AGI_2 98.1 (2/17), ArtificialAnalysisCoding 100.0 (2/21), ArtificialAnalysisIntelligence 98.1 (3/21), ArtificialAnalysisReasoning 100.0 (2/21), ContextWindow 100.0 (2/24), CopilotArenaOrLMArenaCode 71.4 (8/22), GPQA_HLE_Reasoning 100.0 (2/21), IFBench 80.7 (6/21), InverseCost 50.6 (23/24), InverseTTFT 81.7 (8/19), LMArenaCreativeOrOpenEnded 82.0 (7/24), LMArenaSearchDocument 27.4 (9/19), LMArenaText 82.0 (7/24), LongContextRecall 98.0 (3/21), OutputSpeed 82.4 (9/19), SWEBenchVerified 95.0 (1/18), SWEComposite 63.5 (8/24), SciCode 94.5 (4/21), SonarFunctionalSkill 70.7 (13/17), SonarIssueDensity 46.0 (4/17), Tau2Bench 90.5 (7/21), TerminalBench 100.0 (2/22) |
| claude-opus-4.6 | anthropic | 85.5 | 85.5 | 70.8 | 70.8 | 69.1 | 69.1 | 67.4 | ▸ |
| group breakdown: A_B 56.5 (14/24), A_I 72.4 (13/24), A_P 59.0 (14/24), A_R 71.2 (16/24), BUILD 78.3 (2/24), CRE 100.0 (1/24), GEN 89.6 (4/24), LM_ARENA_REVIEW_PROXY 32.2 (13/24), OPS_long 78.0 (13/24), OPS_precision 76.8 (13/24), OPS_review 74.7 (13/24), PLAN 72.8 (5/24); metrics: AI_canary_health 83.3 (4/7), AI_code 19.6 (19/22), AI_complexity 51.3 (12/22), AI_context_awareness 9.8 (9/24), AI_correctness 100.0 (3/22), AI_edge_cases 100.0 (3/22), AI_efficiency 28.6 (17/22), AI_hallucination_resistance 0.0 (22/24), AI_memory_retention 0.0 (14/24), AI_parameter_accuracy 71.8 (15/24), AI_plan_coherence 2.4 (20/24), AI_recovery 100.0 (3/22), AI_refusal 73.2 (19/22), AI_safety_compliance 83.7 (16/24), AI_spec 73.2 (19/22), AI_stability 100.0 (1/22), AI_task_completion 93.2 (4/24), AI_tool_selection 100.0 (2/24), ARC_AGI_2 92.2 (4/17), ArtificialAnalysisCoding 76.1 (5/21), ArtificialAnalysisIntelligence 84.0 (5/21), ArtificialAnalysisReasoning 86.3 (5/21), ContextWindow 99.3 (7/24), CopilotArenaOrLMArenaCode 100.0 (1/22), GDPval 78.0 (5/11), GPQA_HLE_Reasoning 86.3 (5/21), IFBench 31.4 (16/21), InverseCost 61.9 (21/24), InverseTTFT 73.6 (15/19), LMArenaCreativeOrOpenEnded 100.0 (1/24), LMArenaSearchDocument 32.2 (8/19), LMArenaText 100.0 (1/24), LongContextRecall 90.2 (4/21), OutputSpeed 76.6 (15/19), SWEBenchMultilingual 90.9 (2/6), SWEBenchPro 76.3 (3/14), SWEBenchVerified 67.9 (8/18), SWEComposite 78.3 (3/24), SWERebench 91.6 (4/20), SciCode 85.8 (5/21), SonarFunctionalSkill 97.4 (3/17), SonarIssueDensity 41.9 (6/17), Tau2Bench 91.2 (6/21), TerminalBench 64.5 (7/22) |
| kimi-k2.6 | moonshot | 73.0 | 73.0 | 70.0 | 70.0 | 66.3 | 66.3 | 76.1 | ▸ |
| group breakdown: A_B 54.4 (18/24), A_I 70.8 (15/24), A_P 53.8 (17/24), A_R 68.9 (17/24), BUILD 74.2 (5/24), CRE 77.6 (8/24), GEN 73.7 (5/24), LM_ARENA_REVIEW_PROXY 92.2 (2/24), OPS_long 58.0 (20/24), OPS_precision 59.8 (18/24), OPS_review 60.3 (18/24), PLAN 86.7 (3/24); metrics: AI_code 24.6 (13/22), AI_complexity 40.1 (17/22), AI_context_awareness 0.0 (16/24), AI_correctness 91.9 (13/22), AI_edge_cases 64.4 (15/22), AI_efficiency 23.4 (19/22), AI_hallucination_resistance 20.0 (21/24), AI_memory_retention 0.0 (20/24), AI_parameter_accuracy 66.7 (17/24), AI_plan_coherence 7.9 (18/24), AI_recovery 93.9 (10/22), AI_refusal 100.0 (9/22), AI_safety_compliance 66.7 (19/24), AI_spec 100.0 (9/22), AI_stability 72.6 (14/22), AI_task_completion 72.5 (11/24), AI_tool_selection 52.9 (14/24), ARC_AGI_2 11.9 (9/17), ArtificialAnalysisCoding 72.8 (7/21), ArtificialAnalysisIntelligence 87.5 (4/21), ArtificialAnalysisReasoning 87.6 (4/21), ContextWindow 78.4 (14/24), CopilotArenaOrLMArenaCode 94.7 (4/22), GDPval 54.7 (8/11), GPQA_HLE_Reasoning 87.6 (4/21), IFBench 94.5 (4/21), InverseCost 87.1 (9/24), LMArenaCreativeOrOpenEnded 77.6 (8/24), LMArenaSearchDocument 92.2 (2/19), LMArenaText 77.6 (8/24), LongContextRecall 85.3 (7/21), MCPAtlas 92.5 (2/13), SWEBenchVerified 77.0 (4/18), SWEComposite 62.7 (10/24), SWERebench 72.9 (11/20), SciCode 94.5 (3/21), SonarFunctionalSkill 79.2 (12/17), SonarIssueDensity 92.5 (2/17), Tau2Bench 100.0 (1/21), TerminalBench 74.9 (5/22) |
| gemini-3-flash | google | 80.8 | 80.3 | 67.3 | 66.1 | 65.6 | 63.4 | 64.3 | ▸ |
| group breakdown: A_B 81.6 (6/24), A_I 80.8 (9/24), A_P 68.2 (4/24), A_R 81.9 (7/24), BUILD 51.7 (12/24), CRE 86.2 (6/24), GEN 61.6 (9/24), LM_ARENA_REVIEW_PROXY 19.5 (17/24), OPS_long 94.9 (1/24), OPS_precision 91.6 (1/24), OPS_review 90.2 (3/24), PLAN 64.7 (9/24); metrics: AI_code 92.5 (4/22), AI_complexity 88.2 (6/22), AI_context_awareness 14.6 (7/24), AI_correctness 92.5 (10/22), AI_edge_cases 92.5 (10/22), AI_efficiency 92.5 (4/22), AI_hallucination_resistance 92.5 (7/24), AI_memory_retention 92.5 (3/24), AI_parameter_accuracy 92.5 (6/24), AI_plan_coherence 79.0 (7/24), AI_recovery 92.5 (16/22), AI_refusal 92.5 (16/22), AI_safety_compliance 92.5 (12/24), AI_spec 92.5 (16/22), AI_stability 92.5 (7/22), AI_task_completion 61.3 (17/24), AI_tool_selection 10.3 (18/24), ARC_AGI_2 3.0 (14/17), ArtificialAnalysisCoding 58.3 (9/21), ArtificialAnalysisIntelligence 58.9 (11/21), ArtificialAnalysisReasoning 82.7 (6/21), ContextWindow 100.0 (5/24), CopilotArenaOrLMArenaCode 67.4 (12/22), GDPval 5.0 (11/11), GPQA_HLE_Reasoning 82.7 (6/21), IFBench 100.0 (2/21), InverseCost 91.5 (8/24), InverseTTFT 80.2 (9/19), LMArenaCreativeOrOpenEnded 86.2 (6/24), LMArenaSearchDocument 19.5 (12/19), LMArenaText 86.2 (6/24), LongContextRecall 68.6 (9/21), MCPAtlas 22.3 (9/13), OutputSpeed 99.4 (2/19), SWEBenchMultilingual 100.0 (1/6), SWEBenchPro 31.0 (12/14), SWEBenchVerified 68.4 (7/18), SWEComposite 58.1 (11/24), SWERebench 76.1 (9/20), SciCode 78.7 (6/21), SonarFunctionalSkill 86.3 (7/17), SonarIssueDensity 18.7 (12/17), Tau2Bench 64.2 (10/21), TerminalBench 48.4 (12/22) |
| gpt-5.4 | openai | 70.1 | 70.1 | 55.1 | 55.1 | 64.9 | 64.9 | 62.5 | ▸ |
| group breakdown: A_B 54.3 (19/24), A_I 68.6 (18/24), A_P 57.8 (15/24), A_R 75.4 (14/24), BUILD 68.4 (7/24), CRE 77.5 (9/24), GEN 45.2 (16/24), LM_ARENA_REVIEW_PROXY 17.1 (20/24), OPS_long 92.8 (3/24), OPS_precision 91.2 (2/24), OPS_review 89.7 (5/24), PLAN 50.7 (15/24); metrics: AI_code 21.2 (16/22), AI_complexity 0.0 (22/22), AI_context_awareness 0.0 (19/24), AI_correctness 91.9 (16/22), AI_edge_cases 64.4 (18/22), AI_efficiency 42.6 (12/22), AI_hallucination_resistance 60.0 (13/24), AI_memory_retention 0.0 (23/24), AI_parameter_accuracy 91.0 (9/24), AI_plan_coherence 2.4 (22/24), AI_recovery 93.9 (13/22), AI_refusal 100.0 (12/22), AI_safety_compliance 100.0 (9/24), AI_spec 100.0 (12/22), AI_stability 72.6 (15/22), AI_task_completion 87.0 (8/24), AI_tool_selection 83.7 (6/24), ARC_AGI_2 76.9 (5/17), ArtificialAnalysisCoding 33.7 (15/21), ArtificialAnalysisIntelligence 27.4 (16/21), ArtificialAnalysisReasoning 15.5 (18/21), ContextWindow 100.0 (1/24), CopilotArenaOrLMArenaCode 67.4 (11/22), GPQA_HLE_Reasoning 15.5 (18/21), IFBench 62.5 (9/21), InverseCost 75.0 (15/24), InverseTTFT 90.4 (6/19), LMArenaCreativeOrOpenEnded 77.5 (9/24), LMArenaSearchDocument 17.1 (15/19), LMArenaText 77.5 (9/24), LongContextRecall 24.5 (18/21), MCPAtlas 68.2 (4/13), OutputSpeed 95.0 (3/19), SWEBenchPro 88.4 (2/14), SWEBenchVerified 72.3 (5/18), SWEComposite 82.0 (2/24), SWERebench 83.5 (7/20), SciCode 12.0 (18/21), SonarFunctionalSkill 82.6 (11/17), SonarIssueDensity 13.2 (14/17), Tau2Bench 0.0 (21/21), TerminalBench 100.0 (1/22) |
| glm-5.1 | zai | 73.2 | 73.2 | 62.1 | 62.1 | 64.0 | 64.0 | 70.0 | ▸ |
| group breakdown: A_B 52.8 (21/24), A_I 63.3 (19/24), A_P 52.4 (18/24), A_R 62.1 (19/24), BUILD 67.7 (9/24), CRE 86.7 (5/24), GEN 57.6 (11/24), LM_ARENA_REVIEW_PROXY 85.9 (5/24), OPS_long 81.9 (9/24), OPS_precision 86.5 (7/24), OPS_review 88.7 (7/24), PLAN 69.5 (7/24); metrics: AI_code 28.4 (10/22), AI_complexity 41.6 (13/22), AI_context_awareness 7.5 (11/24), AI_correctness 85.6 (18/22), AI_edge_cases 62.2 (20/22), AI_efficiency 27.4 (18/22), AI_hallucination_resistance 24.5 (18/24), AI_memory_retention 7.5 (12/24), AI_parameter_accuracy 64.2 (18/24), AI_plan_coherence 14.3 (16/24), AI_recovery 87.3 (19/22), AI_refusal 92.5 (18/22), AI_safety_compliance 64.2 (20/24), AI_spec 92.5 (18/22), AI_stability 69.2 (16/22), AI_task_completion 69.1 (12/24), AI_tool_selection 52.5 (15/24), ARC_AGI_2 5.2 (11/17), ArtificialAnalysisCoding 39.5 (13/21), ArtificialAnalysisIntelligence 60.5 (8/21), ArtificialAnalysisReasoning 54.0 (13/21), ContextWindow 74.9 (19/24), CopilotArenaOrLMArenaCode 96.2 (3/22), GDPval 63.0 (7/11), GPQA_HLE_Reasoning 54.0 (13/21), IFBench 86.8 (5/21), InverseCost 93.0 (6/24), InverseTTFT 100.0 (1/19), LMArenaCreativeOrOpenEnded 86.7 (5/24), LMArenaSearchDocument 85.9 (5/19), LMArenaText 86.7 (5/24), LongContextRecall 41.2 (17/21), MCPAtlas 100.0 (1/13), OutputSpeed 75.2 (17/19), SWEBenchMultilingual 50.9 (3/6), SWEBenchVerified 60.5 (11/18), SWEComposite 63.2 (9/24), SWERebench 100.0 (1/20), SciCode 40.4 (14/21), SonarFunctionalSkill 84.3 (9/17), SonarIssueDensity 100.0 (1/17), Tau2Bench 100.0 (3/21), TerminalBench 56.0 (10/22) |
| grok-4-latest | xai | 74.3 | 74.3 | 51.8 | 51.8 | 62.4 | 62.4 | 57.6 | ▸ |
| group breakdown: A_B 83.9 (4/24), A_I 83.3 (5/24), A_P 60.8 (10/24), A_R 85.5 (4/24), BUILD 47.4 (15/24), CRE 76.3 (10/24), GEN 49.4 (15/24), LM_ARENA_REVIEW_PROXY 18.6 (19/24), OPS_long 77.5 (14/24), OPS_precision 77.5 (12/24), OPS_review 77.3 (12/24), PLAN 38.0 (18/24); metrics: AI_code 92.3 (6/22), AI_complexity 89.4 (4/22), AI_context_awareness 0.0 (21/24), AI_correctness 100.0 (8/22), AI_edge_cases 94.1 (8/22), AI_efficiency 0.0 (21/22), AI_hallucination_resistance 100.0 (2/24), AI_memory_retention 85.5 (5/24), AI_parameter_accuracy 0.0 (21/24), AI_plan_coherence 100.0 (1/24), AI_recovery 28.2 (20/22), AI_refusal 100.0 (14/22), AI_safety_compliance 0.0 (21/24), AI_spec 100.0 (14/22), AI_stability 100.0 (5/22), AI_task_completion 0.0 (21/24), AI_tool_selection 0.0 (21/24), ARC_AGI_2 21.0 (8/17), ArtificialAnalysisCoding 51.5 (10/21), ArtificialAnalysisIntelligence 40.3 (14/21), ArtificialAnalysisReasoning 57.0 (10/21), ContextWindow 78.4 (15/24), CopilotArenaOrLMArenaCode 57.0 (15/22), GPQA_HLE_Reasoning 57.0 (10/21), IFBench 33.1 (15/21), InverseCost 74.4 (19/24), InverseTTFT 78.5 (10/19), LMArenaCreativeOrOpenEnded 76.3 (10/24), LMArenaSearchDocument 18.6 (14/19), LMArenaText 76.3 (10/24), LongContextRecall 77.0 (8/21), OutputSpeed 77.4 (14/19), SWEComposite 47.7 (19/24), SWERebench 38.3 (16/20), SciCode 51.9 (10/21), Tau2Bench 51.5 (14/21), TerminalBench 11.6 (19/22) |
| claude-sonnet-4.6 | anthropic | 69.7 | 69.7 | 59.4 | 59.4 | 61.4 | 61.4 | 57.0 | ▸ |
| group breakdown: A_B 53.0 (20/24), A_I 68.6 (17/24), A_P 59.3 (13/24), A_R 63.8 (18/24), BUILD 67.8 (8/24), CRE 73.1 (13/24), GEN 65.5 (7/24), LM_ARENA_REVIEW_PROXY 22.6 (15/24), OPS_long 66.7 (18/24), OPS_precision 54.3 (21/24), OPS_review 49.0 (23/24), PLAN 56.6 (12/24); metrics: AI_code 21.0 (18/22), AI_complexity 53.8 (11/22), AI_context_awareness 15.7 (4/24), AI_correctness 100.0 (5/22), AI_edge_cases 70.3 (12/22), AI_efficiency 45.1 (11/22), AI_hallucination_resistance 0.0 (24/24), AI_memory_retention 0.0 (17/24), AI_parameter_accuracy 94.4 (4/24), AI_plan_coherence 24.7 (11/24), AI_recovery 100.0 (6/22), AI_refusal 71.1 (20/22), AI_safety_compliance 100.0 (4/24), AI_spec 71.1 (20/22), AI_stability 65.5 (17/22), AI_task_completion 67.9 (13/24), AI_tool_selection 94.9 (3/24), ARC_AGI_2 10.6 (10/17), ArtificialAnalysisCoding 85.1 (4/21), ArtificialAnalysisIntelligence 79.1 (6/21), ArtificialAnalysisReasoning 68.7 (8/21), ContextWindow 99.3 (11/24), CopilotArenaOrLMArenaCode 93.4 (5/22), GDPval 82.5 (4/11), GPQA_HLE_Reasoning 68.7 (8/21), IFBench 41.0 (13/21), InverseCost 74.4 (18/24), InverseTTFT 0.0 (19/19), LMArenaCreativeOrOpenEnded 73.1 (13/24), LMArenaSearchDocument 22.6 (10/19), LMArenaText 73.1 (13/24), LongContextRecall 90.2 (5/21), MCPAtlas 65.1 (7/13), OutputSpeed 80.7 (10/19), SWEBenchPro 53.8 (7/14), SWEBenchVerified 63.4 (10/18), SWEComposite 66.4 (5/24), SWERebench 95.8 (3/20), SciCode 57.9 (8/21), SonarFunctionalSkill 92.9 (4/17), SonarIssueDensity 24.3 (10/17), Tau2Bench 53.3 (12/21), TerminalBench 47.5 (14/22) |
| claude-sonnet-4.5 | anthropic | 69.6 | 69.6 | 56.0 | 56.0 | 61.3 | 61.3 | 55.7 | ▸ |
| group breakdown: A_B 85.9 (3/24), A_I 92.0 (2/24), A_P 78.0 (2/24), A_R 88.2 (3/24), BUILD 46.8 (16/24), CRE 64.0 (15/24), GEN 42.2 (17/24), LM_ARENA_REVIEW_PROXY 1.7 (22/24), OPS_long 79.6 (12/24), OPS_precision 79.5 (11/24), OPS_review 78.3 (9/24), PLAN 42.4 (17/24); metrics: AI_canary_health 78.1 (6/7), AI_code 84.1 (7/22), AI_complexity 100.0 (1/22), AI_context_awareness 100.0 (1/24), AI_correctness 100.0 (4/22), AI_edge_cases 100.0 (5/22), AI_efficiency 78.2 (6/22), AI_hallucination_resistance 40.0 (16/24), AI_memory_retention 0.0 (16/24), AI_parameter_accuracy 61.3 (19/24), AI_plan_coherence 30.3 (10/24), AI_recovery 100.0 (5/22), AI_refusal 100.0 (4/22), AI_safety_compliance 73.6 (17/24), AI_spec 100.0 (4/22), AI_stability 100.0 (3/22), AI_task_completion 100.0 (2/24), AI_tool_selection 83.1 (7/24), ARC_AGI_2 3.6 (12/17), ArtificialAnalysisCoding 45.3 (12/21), ArtificialAnalysisIntelligence 46.0 (12/21), ArtificialAnalysisReasoning 35.3 (15/21), ContextWindow 99.3 (10/24), CopilotArenaOrLMArenaCode 52.3 (16/22), GDPval 88.3 (2/11), GPQA_HLE_Reasoning 35.3 (15/21), IFBench 43.0 (12/21), InverseCost 74.4 (17/24), InverseTTFT 76.3 (11/19), LMArenaCreativeOrOpenEnded 64.0 (15/24), LMArenaSearchDocument 1.7 (17/19), LMArenaText 64.0 (15/24), LongContextRecall 65.7 (11/21), MCPAtlas 8.0 (11/13), OutputSpeed 76.5 (16/19), SWEBenchMultilingual 3.9 (5/6), SWEBenchPro 54.5 (6/14), SWEBenchVerified 54.7 (12/18), SWEComposite 53.5 (14/24), SWERebench 74.7 (10/20), SciCode 46.4 (13/21), SonarFunctionalSkill 53.6 (15/17), SonarIssueDensity 29.8 (9/17), Tau2Bench 58.9 (11/21), TerminalBench 37.3 (15/22) |
| gpt-5.3-codex | openai | 67.6 | 67.6 | 55.8 | 55.8 | 57.4 | 57.4 | 68.7 | ▸ |
| group breakdown: A_B 56.0 (15/24), A_I 69.0 (16/24), A_P 52.4 (19/24), A_R 76.6 (11/24), BUILD 58.0 (11/24), CRE 73.4 (12/24), GEN 55.8 (12/24), LM_ARENA_REVIEW_PROXY 91.6 (3/24), OPS_long 58.0 (21/24), OPS_precision 59.3 (20/24), OPS_review 58.8 (21/24), PLAN 58.4 (10/24); metrics: AI_code 11.0 (20/22), AI_complexity 40.1 (19/22), AI_context_awareness 0.0 (18/24), AI_correctness 91.9 (15/22), AI_edge_cases 64.4 (17/22), AI_efficiency 32.6 (15/22), AI_hallucination_resistance 80.0 (10/24), AI_memory_retention 0.0 (22/24), AI_parameter_accuracy 85.9 (10/24), AI_plan_coherence 2.4 (21/24), AI_recovery 93.9 (12/22), AI_refusal 100.0 (11/22), AI_safety_compliance 88.9 (15/24), AI_spec 100.0 (11/22), AI_stability 61.3 (18/22), AI_task_completion 58.0 (19/24), AI_tool_selection 66.4 (12/24), ContextWindow 85.3 (13/24), CopilotArenaOrLMArenaCode 58.4 (14/22), InverseCost 76.6 (14/24), LMArenaCreativeOrOpenEnded 73.4 (12/24), LMArenaSearchDocument 91.6 (3/19), LMArenaText 73.4 (12/24), SWEBenchVerified 68.9 (6/18), SWEComposite 63.6 (7/24), SWERebench 89.4 (5/20), TerminalBench 74.6 (6/22) |
| claude-opus-4.1 | anthropic | 60.1 | 60.1 | 48.6 | 48.6 | 55.7 | 55.7 | 50.5 | ▸ |
| group breakdown: A_B 77.1 (8/24), A_I 82.7 (6/24), A_P 60.4 (11/24), A_R 82.9 (5/24), BUILD 48.4 (14/24), CRE 52.7 (17/24), GEN 50.7 (14/24), LM_ARENA_REVIEW_PROXY 0.0 (23/24), OPS_long 48.7 (22/24), OPS_precision 46.2 (24/24), OPS_review 42.5 (24/24), PLAN 43.0 (16/24); metrics: AI_canary_health 69.1 (7/7), AI_code 70.2 (8/22), AI_complexity 84.1 (8/22), AI_context_awareness 0.0 (12/24), AI_correctness 100.0 (1/22), AI_edge_cases 100.0 (1/22), AI_efficiency 64.6 (7/22), AI_hallucination_resistance 40.0 (15/24), AI_memory_retention 0.0 (13/24), AI_parameter_accuracy 71.5 (16/24), AI_plan_coherence 19.1 (14/24), AI_recovery 100.0 (1/22), AI_refusal 100.0 (1/22), AI_safety_compliance 100.0 (1/24), AI_spec 100.0 (1/22), AI_stability 60.8 (19/22), AI_task_completion 83.4 (10/24), AI_tool_selection 72.0 (11/24), ContextWindow 74.7 (20/24), CopilotArenaOrLMArenaCode 52.1 (17/22), InverseCost 0.0 (24/24), LMArenaCreativeOrOpenEnded 52.7 (17/24), LMArenaSearchDocument 0.0 (18/19), LMArenaText 52.7 (17/24), SWEComposite 50.3 (15/24), SWERebench 51.7 (15/20), TerminalBench 29.3 (16/22) |
| claude-opus-4.7 | anthropic | 60.8 | 60.8 | 62.2 | 62.2 | 55.3 | 55.3 | 54.3 | ▸ |
| group breakdown: A_B 2.7 (24/24), A_I 3.7 (24/24), A_P 23.2 (24/24), A_R 0.7 (24/24), BUILD 87.6 (1/24), CRE 95.0 (4/24), GEN 97.0 (2/24), LM_ARENA_REVIEW_PROXY 100.0 (1/24), OPS_long 74.3 (17/24), OPS_precision 69.0 (17/24), OPS_review 65.6 (17/24), PLAN 79.9 (4/24); metrics: AI_canary_health 88.2 (3/7), AI_code 0.0 (21/22), AI_complexity 0.0 (21/22), AI_context_awareness 15.2 (5/24), AI_correctness 0.0 (21/22), AI_edge_cases 0.0 (21/22), AI_efficiency 10.9 (20/22), AI_hallucination_resistance 0.0 (23/24), AI_memory_retention 30.5 (10/24), AI_parameter_accuracy 35.2 (20/24), AI_plan_coherence 21.9 (12/24), AI_recovery 0.0 (21/22), AI_refusal 0.0 (21/22), AI_safety_compliance 100.0 (2/24), AI_spec 0.0 (21/22), AI_stability 7.4 (20/22), AI_task_completion 100.0 (1/24), AI_tool_selection 57.6 (13/24), ARC_AGI_2 94.0 (3/17), ArtificialAnalysisCoding 90.3 (3/21), ArtificialAnalysisIntelligence 100.0 (1/21), ArtificialAnalysisReasoning 95.6 (3/21), ContextWindow 99.3 (8/24), CopilotArenaOrLMArenaCode 100.0 (2/22), GDPval 95.0 (1/11), GPQA_HLE_Reasoning 95.6 (3/21), IFBench 46.6 (10/21), InverseCost 61.9 (22/24), InverseTTFT 49.1 (16/19), LMArenaCreativeOrOpenEnded 95.0 (4/24), LMArenaSearchDocument 100.0 (1/19), LMArenaText 95.0 (4/24), LongContextRecall 88.2 (6/21), OutputSpeed 78.8 (12/19), SWEBenchPro 95.0 (1/14), SWEBenchVerified 94.6 (2/18), SWEComposite 92.7 (1/24), SWERebench 85.3 (6/20), SciCode 100.0 (1/21), SonarFunctionalSkill 98.4 (2/17), SonarIssueDensity 2.4 (17/17), Tau2Bench 83.1 (9/21), TerminalBench 78.6 (4/22) |
| deepseek-v4-flash | deepseek | 66.0 | 66.0 | 66.8 | 66.8 | 54.5 | 54.5 | 64.2 | ▸ |
| group breakdown: A_B 57.0 (13/24), A_I 75.2 (11/24), A_P 64.1 (9/24), A_R 72.0 (15/24), BUILD 49.8 (13/24), CRE 58.7 (16/24), GEN 62.8 (8/24), LM_ARENA_REVIEW_PROXY 50.0 (8/24), OPS_long 86.5 (6/24), OPS_precision 89.6 (5/24), OPS_review 91.8 (2/24), PLAN 71.1 (6/24); metrics: AI_canary_health 82.4 (5/7), AI_code 21.2 (15/22), AI_complexity 40.1 (15/22), AI_context_awareness 0.0 (14/24), AI_correctness 91.9 (12/22), AI_edge_cases 64.4 (14/22), AI_efficiency 46.4 (10/22), AI_hallucination_resistance 40.0 (17/24), AI_memory_retention 0.0 (18/24), AI_parameter_accuracy 84.2 (12/24), AI_plan_coherence 41.4 (9/24), AI_recovery 93.9 (9/22), AI_refusal 100.0 (5/22), AI_safety_compliance 100.0 (5/24), AI_spec 100.0 (5/22), AI_stability 72.6 (13/22), AI_task_completion 87.0 (5/24), AI_tool_selection 73.8 (9/24), ArtificialAnalysisCoding 45.6 (11/21), ArtificialAnalysisIntelligence 59.3 (10/21), ArtificialAnalysisReasoning 76.7 (7/21), ContextWindow 71.6 (22/24), GPQA_HLE_Reasoning 76.7 (7/21), IFBench 100.0 (1/21), InverseCost 100.0 (1/24), InverseTTFT 98.9 (3/19), LMArenaCreativeOrOpenEnded 58.7 (16/24), LMArenaText 58.7 (16/24), LongContextRecall 52.5 (16/21), OutputSpeed 83.6 (8/19), SWEComposite 50.0 (16/24), SciCode 47.5 (12/21), Tau2Bench 97.9 (5/21) |
| gemini-2.5-pro | google | 37.3 | 36.8 | 40.6 | 39.3 | 53.7 | 51.5 | 47.0 | ▸ |
| group breakdown: A_B 81.6 (5/24), A_I 80.8 (8/24), A_P 68.2 (3/24), A_R 81.9 (6/24), BUILD 34.1 (22/24), CRE 0.0 (24/24), GEN 14.5 (21/24), LM_ARENA_REVIEW_PROXY 0.0 (24/24), OPS_long 82.1 (7/24), OPS_precision 74.4 (15/24), OPS_review 71.0 (15/24), PLAN 21.8 (21/24); metrics: AI_code 92.5 (3/22), AI_complexity 88.2 (5/22), AI_context_awareness 14.6 (6/24), AI_correctness 92.5 (9/22), AI_edge_cases 92.5 (9/22), AI_efficiency 92.5 (3/22), AI_hallucination_resistance 92.5 (6/24), AI_memory_retention 92.5 (2/24), AI_parameter_accuracy 92.5 (5/24), AI_plan_coherence 79.0 (6/24), AI_recovery 92.5 (15/22), AI_refusal 92.5 (15/22), AI_safety_compliance 92.5 (11/24), AI_spec 92.5 (15/22), AI_stability 92.5 (6/22), AI_task_completion 61.3 (16/24), AI_tool_selection 10.3 (17/24), ARC_AGI_2 3.6 (13/17), ArtificialAnalysisCoding 23.6 (17/21), ArtificialAnalysisIntelligence 14.1 (17/21), ArtificialAnalysisReasoning 44.8 (14/21), ContextWindow 100.0 (4/24), CopilotArenaOrLMArenaCode 0.0 (21/22), GPQA_HLE_Reasoning 44.8 (14/21), IFBench 19.3 (18/21), InverseCost 80.1 (10/24), InverseTTFT 43.9 (17/19), LMArenaCreativeOrOpenEnded 0.0 (24/24), LMArenaSearchDocument 0.0 (19/19), LMArenaText 0.0 (24/24), LongContextRecall 67.2 (10/21), MCPAtlas 66.8 (5/13), OutputSpeed 91.4 (6/19), SWEBenchPro 53.2 (9/14), SWEBenchVerified 9.8 (17/18), SWEComposite 27.0 (22/24), SWERebench 0.5 (19/20), SciCode 36.1 (15/21), SonarFunctionalSkill 86.3 (6/17), SonarIssueDensity 18.7 (11/17), Tau2Bench 3.5 (19/21), TerminalBench 1.4 (20/22) |
| gemini-2.5-flash | google | 59.4 | 58.8 | 38.0 | 36.8 | 53.7 | 51.5 | 57.4 | ▸ |
| group breakdown: A_B 93.0 (2/24), A_I 88.7 (3/24), A_P 65.0 (7/24), A_R 92.5 (2/24), BUILD 24.7 (23/24), CRE 45.5 (19/24), GEN 15.0 (20/24), LM_ARENA_REVIEW_PROXY 76.9 (7/24), OPS_long 94.6 (2/24), OPS_precision 90.7 (3/24), OPS_review 89.2 (6/24), PLAN 13.4 (23/24); metrics: AI_code 100.0 (1/22), AI_complexity 100.0 (2/22), AI_context_awareness 100.0 (2/24), AI_correctness 100.0 (6/22), AI_edge_cases 100.0 (6/22), AI_efficiency 100.0 (1/22), AI_hallucination_resistance 69.9 (11/24), AI_memory_retention 31.9 (9/24), AI_parameter_accuracy 73.3 (14/24), AI_plan_coherence 0.0 (23/24), AI_recovery 100.0 (7/22), AI_refusal 100.0 (6/22), AI_safety_compliance 100.0 (6/24), AI_spec 100.0 (6/22), AI_stability 76.4 (12/22), AI_task_completion 44.3 (20/24), AI_tool_selection 27.5 (16/24), ARC_AGI_2 0.7 (15/17), ArtificialAnalysisCoding 0.0 (20/21), ArtificialAnalysisIntelligence 0.8 (19/21), ArtificialAnalysisReasoning 17.9 (16/21), ContextWindow 100.0 (3/24), CopilotArenaOrLMArenaCode 64.8 (13/22), GDPval 11.8 (10/11), GPQA_HLE_Reasoning 17.9 (16/21), IFBench 29.2 (17/21), InverseCost 94.4 (5/24), InverseTTFT 75.8 (12/19), LMArenaCreativeOrOpenEnded 45.5 (19/24), LMArenaSearchDocument 76.9 (7/19), LMArenaText 45.5 (19/24), LiveCodeBench 100.0 (1/2), LongContextRecall 58.8 (13/21), MCPAtlas 26.4 (8/13), OutputSpeed 100.0 (1/19), SWEBenchPro 33.8 (11/14), SWEBenchVerified 0.0 (18/18), SWEComposite 15.0 (24/24), SWERebench 0.0 (20/20), SciCode 23.5 (16/21), Tau2Bench 0.0 (20/21), TerminalBench 0.0 (21/22) |
| claude-sonnet-4 | anthropic | 37.2 | 37.2 | 43.5 | 43.5 | 52.0 | 52.0 | 59.9 | ▸ |
| group breakdown: A_B 63.2 (10/24), A_I 81.2 (7/24), A_P 64.6 (8/24), A_R 76.1 (13/24), BUILD 41.5 (20/24), CRE 0.0 (23/24), GEN 14.0 (22/24), LM_ARENA_REVIEW_PROXY 84.1 (6/24), OPS_long 80.0 (11/24), OPS_precision 79.6 (10/24), OPS_review 78.2 (10/24), PLAN 33.4 (19/24); metrics: AI_code 25.0 (12/22), AI_complexity 57.8 (10/22), AI_context_awareness 0.0 (13/24), AI_correctness 84.4 (19/22), AI_edge_cases 100.0 (4/22), AI_efficiency 56.4 (8/22), AI_hallucination_resistance 20.0 (20/24), AI_memory_retention 0.0 (15/24), AI_parameter_accuracy 91.9 (8/24), AI_plan_coherence 19.1 (15/24), AI_recovery 100.0 (4/22), AI_refusal 100.0 (3/22), AI_safety_compliance 100.0 (3/24), AI_spec 100.0 (3/22), AI_stability 100.0 (2/22), AI_task_completion 99.7 (3/24), AI_tool_selection 90.0 (4/24), ARC_AGI_2 0.1 (16/17), ArtificialAnalysisCoding 30.7 (16/21), ArtificialAnalysisIntelligence 29.7 (15/21), ArtificialAnalysisReasoning 8.6 (19/21), ContextWindow 99.3 (9/24), CopilotArenaOrLMArenaCode 52.0 (18/22), GDPval 82.5 (3/11), GPQA_HLE_Reasoning 8.6 (19/21), IFBench 35.8 (14/21), InverseCost 74.4 (16/24), InverseTTFT 75.4 (13/19), LMArenaCreativeOrOpenEnded 0.0 (23/24), LMArenaSearchDocument 84.1 (6/19), LMArenaText 0.0 (23/24), LiveCodeBench 0.0 (2/2), LongContextRecall 60.8 (12/21), MCPAtlas 14.3 (10/13), OutputSpeed 77.5 (13/19), SWEBenchPro 52.1 (10/14), SWEBenchVerified 39.7 (16/18), SWEComposite 48.5 (18/24), SWERebench 54.5 (14/20), SciCode 20.8 (17/21), SonarFunctionalSkill 59.0 (14/17), SonarIssueDensity 34.0 (7/17), Tau2Bench 27.7 (18/21), TerminalBench 47.5 (13/22) |
| glm-4.7 | zai | 35.6 | 35.6 | 49.8 | 49.8 | 50.8 | 50.8 | 55.3 | ▸ |
| group breakdown: A_B 55.4 (17/24), A_I 54.0 (21/24), A_P 46.1 (21/24), A_R 58.5 (21/24), BUILD 42.0 (18/24), CRE 9.2 (22/24), GEN 35.6 (18/24), LM_ARENA_REVIEW_PROXY 50.0 (12/24), OPS_long 87.6 (5/24), OPS_precision 90.1 (4/24), OPS_review 91.9 (1/24), PLAN 53.6 (14/24); metrics: AI_context_awareness 0.0 (24/24), AI_hallucination_resistance 100.0 (5/24), AI_memory_retention 85.5 (8/24), AI_parameter_accuracy 0.0 (24/24), AI_plan_coherence 100.0 (4/24), AI_safety_compliance 0.0 (24/24), AI_task_completion 0.0 (24/24), AI_tool_selection 0.0 (24/24), ArtificialAnalysisCoding 37.9 (14/21), ArtificialAnalysisIntelligence 42.6 (13/21), ArtificialAnalysisReasoning 55.8 (12/21), ContextWindow 74.9 (18/24), CopilotArenaOrLMArenaCode 68.2 (9/22), GPQA_HLE_Reasoning 55.8 (12/21), IFBench 72.2 (7/21), InverseCost 96.1 (3/24), InverseTTFT 99.0 (2/19), LMArenaCreativeOrOpenEnded 9.2 (22/24), LMArenaText 9.2 (22/24), LongContextRecall 57.4 (14/21), MCPAtlas 0.0 (13/13), OutputSpeed 85.3 (7/19), SWEComposite 54.1 (13/24), SWERebench 70.6 (12/20), SciCode 48.6 (11/21), SonarFunctionalSkill 31.3 (16/17), SonarIssueDensity 44.7 (5/17), Tau2Bench 100.0 (2/21), TerminalBench 27.0 (17/22) |
| gpt-5.2 | openai | 65.8 | 65.8 | 55.9 | 55.9 | 50.0 | 50.0 | 57.2 | ▸ |
| group breakdown: A_B 60.3 (11/24), A_I 71.8 (14/24), A_P 56.8 (16/24), A_R 79.9 (10/24), BUILD 41.7 (19/24), CRE 67.9 (14/24), GEN 52.2 (13/24), LM_ARENA_REVIEW_PROXY 20.6 (16/24), OPS_long 58.3 (19/24), OPS_precision 59.8 (19/24), OPS_review 59.5 (20/24), PLAN 56.6 (11/24); metrics: AI_code 24.6 (14/22), AI_complexity 40.1 (18/22), AI_context_awareness 0.0 (17/24), AI_correctness 91.9 (14/22), AI_edge_cases 64.4 (16/22), AI_efficiency 30.7 (16/22), AI_hallucination_resistance 80.0 (9/24), AI_memory_retention 0.0 (21/24), AI_parameter_accuracy 85.2 (11/24), AI_plan_coherence 0.0 (24/24), AI_recovery 93.9 (11/22), AI_refusal 100.0 (10/22), AI_safety_compliance 100.0 (8/24), AI_spec 100.0 (10/22), AI_stability 79.9 (10/22), AI_task_completion 87.0 (7/24), AI_tool_selection 76.3 (8/24), ARC_AGI_2 0.0 (17/17), ArtificialAnalysisCoding 63.4 (8/21), ArtificialAnalysisIntelligence 59.7 (9/21), ArtificialAnalysisReasoning 56.4 (11/21), ContextWindow 85.3 (12/24), CopilotArenaOrLMArenaCode 37.2 (20/22), GPQA_HLE_Reasoning 56.4 (11/21), IFBench 64.7 (8/21), InverseCost 80.1 (11/24), LMArenaCreativeOrOpenEnded 67.9 (14/24), LMArenaSearchDocument 20.6 (11/19), LMArenaText 67.9 (14/24), LongContextRecall 53.9 (15/21), SWEBenchMultilingual 0.0 (6/6), SWEBenchPro 18.6 (13/14), SWEBenchVerified 50.5 (14/18), SWEComposite 28.2 (21/24), SciCode 54.6 (9/21), SonarFunctionalSkill 82.8 (10/17), SonarIssueDensity 33.9 (8/17), Tau2Bench 50.1 (15/21), TerminalBench 58.4 (9/22) |
| grok-code-fast-1 | xai | 40.9 | 40.9 | 27.5 | 27.5 | 38.2 | 38.2 | 41.9 | ▸ |
| group breakdown: A_B 34.6 (22/24), A_I 33.8 (23/24), A_P 36.9 (22/24), A_R 52.6 (22/24), BUILD 34.2 (21/24), CRE 47.8 (18/24), GEN 15.8 (19/24), LM_ARENA_REVIEW_PROXY 50.0 (10/24), OPS_long 90.2 (4/24), OPS_precision 89.1 (6/24), OPS_review 89.7 (4/24), PLAN 11.4 (24/24); metrics: AI_code 0.0 (22/22), AI_complexity 41.3 (14/22), AI_context_awareness 0.0 (22/24), AI_correctness 59.3 (20/22), AI_edge_cases 65.5 (13/22), AI_efficiency 0.0 (22/22), AI_hallucination_resistance 100.0 (3/24), AI_memory_retention 85.5 (6/24), AI_parameter_accuracy 0.0 (22/24), AI_plan_coherence 100.0 (2/24), AI_recovery 91.1 (18/22), AI_refusal 0.0 (22/22), AI_safety_compliance 0.0 (22/24), AI_spec 0.0 (22/22), AI_stability 0.0 (22/22), AI_task_completion 0.0 (22/24), AI_tool_selection 0.0 (22/24), ARC_AGI_2 25.3 (7/17), ArtificialAnalysisCoding 0.0 (21/21), ArtificialAnalysisIntelligence 0.0 (21/21), ArtificialAnalysisReasoning 0.0 (21/21), ContextWindow 78.4 (16/24), CopilotArenaOrLMArenaCode 0.0 (22/22), GPQA_HLE_Reasoning 0.0 (21/21), IFBench 0.0 (21/21), InverseCost 99.3 (2/24), InverseTTFT 84.8 (7/19), LMArenaCreativeOrOpenEnded 47.8 (18/24), LMArenaText 47.8 (18/24), LongContextRecall 0.0 (21/21), OutputSpeed 93.8 (4/19), SWEComposite 45.4 (20/24), SWERebench 27.0 (18/20), SciCode 0.0 (21/21), Tau2Bench 53.3 (13/21), TerminalBench 0.0 (22/22) |
| kimi-k2-0905 | moonshot | 26.2 | 26.2 | 27.8 | 27.8 | 37.9 | 37.9 | 35.7 | ▸ |
| group breakdown: A_B 29.6 (23/24), A_I 34.7 (22/24), A_P 35.6 (23/24), A_R 27.0 (23/24), BUILD 42.7 (17/24), CRE 26.9 (20/24), GEN 7.9 (24/24), LM_ARENA_REVIEW_PROXY 50.0 (9/24), OPS_long 35.4 (24/24), OPS_precision 53.7 (22/24), OPS_review 60.2 (19/24), PLAN 29.6 (20/24); metrics: AI_canary_health 88.9 (2/7), AI_code 28.0 (11/22), AI_complexity 40.1 (16/22), AI_context_awareness 0.0 (15/24), AI_correctness 0.0 (22/22), AI_edge_cases 0.0 (22/22), AI_efficiency 33.0 (14/22), AI_hallucination_resistance 60.0 (12/24), AI_memory_retention 0.0 (19/24), AI_parameter_accuracy 81.1 (13/24), AI_plan_coherence 21.9 (13/24), AI_recovery 0.0 (22/22), AI_refusal 100.0 (8/22), AI_safety_compliance 88.9 (14/24), AI_spec 100.0 (8/22), AI_stability 0.0 (21/22), AI_task_completion 87.0 (6/24), AI_tool_selection 73.8 (10/24), ArtificialAnalysisCoding 4.2 (19/21), ArtificialAnalysisIntelligence 0.0 (20/21), ArtificialAnalysisReasoning 0.0 (20/21), ContextWindow 53.4 (23/24), GPQA_HLE_Reasoning 0.0 (20/21), IFBench 0.0 (20/21), InverseCost 92.7 (7/24), InverseTTFT 90.7 (5/19), LMArenaCreativeOrOpenEnded 26.9 (20/24), LMArenaText 26.9 (20/24), LongContextRecall 0.0 (20/21), OutputSpeed 0.0 (19/19), SWEComposite 50.0 (17/24), SciCode 0.0 (20/21), Tau2Bench 48.0 (16/21) |
| glm-4.6 | zai | 36.8 | 36.8 | 32.0 | 32.0 | 36.4 | 36.4 | 41.5 | ▸ |
| group breakdown: A_B 55.4 (16/24), A_I 54.0 (20/24), A_P 46.1 (20/24), A_R 58.5 (20/24), BUILD 18.2 (24/24), CRE 23.6 (21/24), GEN 13.4 (23/24), LM_ARENA_REVIEW_PROXY 50.0 (11/24), OPS_long 77.0 (15/24), OPS_precision 83.3 (8/24), OPS_review 86.0 (8/24), PLAN 18.0 (22/24); metrics: AI_context_awareness 0.0 (23/24), AI_hallucination_resistance 100.0 (4/24), AI_memory_retention 85.5 (7/24), AI_parameter_accuracy 0.0 (23/24), AI_plan_coherence 100.0 (3/24), AI_safety_compliance 0.0 (23/24), AI_task_completion 0.0 (23/24), AI_tool_selection 0.0 (23/24), ArtificialAnalysisCoding 15.9 (18/21), ArtificialAnalysisIntelligence 6.1 (18/21), ArtificialAnalysisReasoning 16.5 (17/21), ContextWindow 75.0 (17/24), CopilotArenaOrLMArenaCode 43.0 (19/22), GPQA_HLE_Reasoning 16.5 (17/21), IFBench 4.7 (19/21), InverseCost 95.4 (4/24), InverseTTFT 98.7 (4/19), LMArenaCreativeOrOpenEnded 23.6 (21/24), LMArenaText 23.6 (21/24), LongContextRecall 9.8 (19/21), MCPAtlas 7.5 (12/13), OutputSpeed 66.3 (18/19), SWEBenchPro 0.0 (14/14), SWEBenchVerified 48.4 (15/18), SWEComposite 24.5 (23/24), SWERebench 37.6 (17/20), SciCode 12.0 (19/21), SonarFunctionalSkill 7.5 (17/17), SonarIssueDensity 7.5 (16/17), Tau2Bench 41.3 (17/21), TerminalBench 13.7 (18/22) |