$ipbr-rank · live llm coding-role score
refreshed · 13 sources
[ idea ]
1  claude-opus-4.6  88.6  88.6
2  gemini-3.1-pro-preview  85.6  85.6
3  gemini-3-pro  80.8  80.8
[ plan ]
1  gemini-3.1-pro-preview  82.8  82.8
2  gpt-5.5  79.2  78.4
3  gpt-5.4  69.6  68.8
[ build ]
1  claude-opus-4.6  73.0  73.0
2  gemini-3.1-pro-preview  71.2  71.2
3  gpt-5.5  70.3  68.9
[ review ]
1  gpt-5.5  80.0  80.0
2  claude-opus-4.6  75.4  75.4
3  gemini-3.1-pro-preview  75.3  75.3
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill (SWE-bench, LiveCodeBench, terminal tasks). Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted, because it is the source of the penalty.
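A minimal Python sketch of the adjustment described above. The coefficients come from the text, but the page does not say how the Review lead is measured. Here it is assumed to be the gap between a vendor's best Review score and the best Review score of any other vendor, floored at zero. The names PENALTY, vendor_review_lead, and adjusted_score are hypothetical and are not taken from the ipbr-rank code.

```python
# Coefficients quoted in the text. Review has no entry because it is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def vendor_review_lead(review_by_model, vendor_by_model, vendor):
    """ASSUMPTION: lead = best own Review score minus best rival Review score, floored at 0."""
    own = [s for m, s in review_by_model.items() if vendor_by_model[m] == vendor]
    rivals = [s for m, s in review_by_model.items() if vendor_by_model[m] != vendor]
    if not own or not rivals:
        return 0.0
    return max(0.0, max(own) - max(rivals))


def adjusted_score(raw, role, lead):
    """Subtract the role's share of the vendor's Review lead from the raw score."""
    if role == "review":
        return raw  # Review is the source of the penalty, so it is left untouched.
    return raw - PENALTY[role] * lead


# Top three Review scores as listed on the page.
REVIEW = {"gpt-5.5": 80.0, "claude-opus-4.6": 75.4, "gemini-3.1-pro-preview": 75.3}
VENDOR = {"gpt-5.5": "openai", "claude-opus-4.6": "anthropic", "gemini-3.1-pro-preview": "google"}

lead = vendor_review_lead(REVIEW, VENDOR, "openai")
print(round(adjusted_score(79.2, "plan", lead), 1))
```

Under this assumed definition only the vendor that leads Review is penalized; every other vendor has a lead of zero, so its raw and adjusted scores coincide.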

missing data

If a model is missing some metrics within a group, the group score is computed from the metrics that are present, provided they cover at least 70% of the group weight. Below that threshold, the score shrinks toward 50 in proportion to the missing weight.
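The missing-data rule can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: the shrinkage is assumed to be a linear blend between the weighted mean of the present metrics and 50, weighted by the missing share, and the function name group_score and its parameters are hypothetical.

```python
def group_score(values, weights, min_coverage=0.70, neutral=50.0):
    """Weighted mean of present metrics, shrunk toward `neutral` when coverage is low.

    values:  metric name -> score, or None when the metric is missing
    weights: metric name -> weight of that metric within the group
    ASSUMPTION: below the threshold the result is a linear blend with `neutral`.
    """
    total = sum(weights.values())
    present = {k: v for k, v in values.items() if v is not None and k in weights}
    covered = sum(weights[k] for k in present)
    if covered == 0:
        return neutral  # nothing observed, fall back to the neutral score
    mean = sum(weights[k] * present[k] for k in present) / covered
    coverage = covered / total
    if coverage >= min_coverage:
        return mean  # enough weight covered, use the present metrics as-is
    missing = 1.0 - coverage
    return mean * (1.0 - missing) + neutral * missing


weights = {"a": 5, "b": 3, "c": 2}
print(group_score({"a": 80.0, "b": 60.0, "c": None}, weights))   # 80% of weight covered
print(group_score({"a": 80.0, "b": None, "c": None}, weights))   # 50% of weight covered
```

With no metrics present the sketch returns exactly 50, which matches the 50.0 group scores shown for a model whose metrics in that group are all listed as missing.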

Full math, role definitions, and source list →

claude-opus-4.6 · anthropic · idea 88.6 / 88.6 · plan 67.5 / 67.5 · build 73.0 / 73.0 · review 75.4

group breakdown

group  score  rank
A_B  78.7  3 / 14
A_I  86.3  3 / 14
A_P  61.6  4 / 14
A_R  80.2  4 / 14
BUILD  72.3  5 / 14
CRE  100.0  1 / 14
GEN  83.4  4 / 14
LM_ARENA_REVIEW_PROXY  100.0  1 / 14
OPS_long  52.9  12 / 14
OPS_precision  55.4  13 / 14
OPS_review  53.3  13 / 14
PLAN  65.6  9 / 14

metrics

metric  score  rank
AI_code  82.3  5 / 13
AI_complexity  84.4  3 / 13
AI_context_awareness  8.9  6 / 13
AI_correctness  100.0  1 / 13
AI_edge_cases  100.0  1 / 13
AI_efficiency  55.6  4 / 13
AI_hallucination_resistance  0.0  12 / 13
AI_memory_retention  0.0  7 / 13
AI_parameter_accuracy  42.7  9 / 13
AI_plan_coherence  5.1  10 / 13
AI_recovery  94.4  8 / 13
AI_refusal  100.0  1 / 13
AI_safety_compliance  51.1  10 / 13
AI_spec  100.0  1 / 13
AI_stability  100.0  1 / 13
AI_task_completion  82.3  4 / 13
AI_tool_selection  100.0  1 / 13
ARC_AGI_2  84.3  4 / 11
ArtificialAnalysisCoding  65.6  7 / 14
ArtificialAnalysisIntelligence  78.7  7 / 14
ArtificialAnalysisReasoning  70.7  8 / 14
ContextWindow  76.0  12 / 14
CopilotArenaOrLMArenaCode  96.4  2 / 12
GDPval  71.7  6 / 12
GPQA_HLE_Reasoning  70.7  8 / 14
IFBench  7.8  12 / 14
InverseCost  0.0  14 / 14
InverseTTFT  79.1  6 / 14
LMArenaCreativeOrOpenEnded  100.0  1 / 14
LMArenaSearchDocument  100.0  1 / 13
LMArenaText  100.0  1 / 14
LiveCodeBench  76.0  5 / 14
LongContextRecall  67.6  5 / 14
MCPAtlas  89.8  3 / 12
OutputSpeed  46.7  11 / 14
SWEBenchMultilingual  88.9  2 / 4
SWEBenchPro  57.6  5 / 10
SWEBenchVerified  71.3  8 / 14
SWEComposite  71.5  5 / 14
SWERebench  91.2  5 / 11
SciCode  64.3  8 / 14
SonarFunctionalSkill  76.7  3 / 12
SonarIssueDensity  36.6  4 / 12
Tau2Bench  90.2  5 / 14
TerminalBench  63.7  8 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
gemini-3.1-pro-preview · google · idea 85.6 / 85.6 · plan 82.8 / 82.8 · build 71.2 / 71.2 · review 75.3

group breakdown

group  score  rank
A_B  64.8  7 / 14
A_I  65.8  7 / 14
A_P  61.6  6 / 14
A_R  61.1  10 / 14
BUILD  72.5  4 / 14
CRE  97.9  2 / 14
GEN  96.9  1 / 14
LM_ARENA_REVIEW_PROXY  81.8  6 / 14
OPS_long  87.2  3 / 14
OPS_precision  84.8  3 / 14
OPS_review  83.2  4 / 14
PLAN  94.6  2 / 14

metrics

metric  score  rank
AI_code*  78.8  7 / 13
AI_complexity*  60.8  7 / 13
AI_context_awareness*  14.0  4 / 13
AI_correctness*  92.5  7 / 13
AI_edge_cases*  7.5  12 / 13
AI_efficiency*  33.1  6 / 13
AI_hallucination_resistance*  92.5  3 / 13
AI_memory_retention*  92.5  3 / 13
AI_parameter_accuracy*  92.5  3 / 13
AI_plan_coherence*  92.5  3 / 13
AI_recovery*  7.5  12 / 13
AI_refusal*  92.5  11 / 13
AI_safety_compliance*  92.5  8 / 13
AI_spec*  92.5  11 / 13
AI_stability*  92.5  5 / 13
AI_task_completion*  26.1  10 / 13
AI_tool_selection*  7.5  12 / 13
ARC_AGI_2  94.9  2 / 11
ArtificialAnalysisCoding  91.5  3 / 14
ArtificialAnalysisIntelligence  95.5  3 / 14
ArtificialAnalysisReasoning  100.0  1 / 14
ContextWindow  100.0  6 / 14
CopilotArenaOrLMArenaCode  59.4  6 / 12
GDPval  23.2  10 / 12
GPQA_HLE_Reasoning  100.0  1 / 14
IFBench  98.8  2 / 14
InverseCost  81.3  8 / 14
InverseTTFT  75.7  8 / 14
LMArenaCreativeOrOpenEnded  97.9  2 / 14
LMArenaSearchDocument  81.8  6 / 13
LMArenaText  97.9  2 / 14
LiveCodeBench  65.0  7 / 14
LongContextRecall  86.6  4 / 14
MCPAtlas  97.7  2 / 12
OutputSpeed  89.0  4 / 14
SWEBenchPro  33.1  6 / 10
SWEBenchVerified  79.6  3 / 14
SWEComposite  63.3  8 / 14
SWERebench  99.1  2 / 11
SciCode  100.0  1 / 14
SonarFunctionalSkill  100.0  1 / 12
SonarIssueDensity  26.3  6 / 12
Tau2Bench  98.0  3 / 14
TerminalBench  89.3  3 / 14

sources: aistupidlevel*, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
gpt-5.5 · openai · idea 75.4 / 75.0 · plan 79.2 / 78.4 · build 70.3 / 68.9 · review 80.0

group breakdown

group  score  rank
A_B  50.8  8 / 14
A_I  65.0  8 / 14
A_P  54.5  7 / 14
A_R  69.4  6 / 14
BUILD  80.8  2 / 14
CRE  76.2  10 / 14
GEN  93.9  2 / 14
LM_ARENA_REVIEW_PROXY  83.9  4 / 14
OPS_long  77.4  5 / 14
OPS_precision  77.3  5 / 14
OPS_review  75.9  5 / 14
PLAN  95.2  1 / 14

metrics

metric  score  rank
AI_code  20.1  12 / 13
AI_complexity  25.4  11 / 13
AI_context_awareness  0.0  13 / 13
AI_correctness  78.4  11 / 13
AI_edge_cases  43.3  9 / 13
AI_efficiency  25.2  9 / 13
AI_hallucination_resistance  60.0  7 / 13
AI_memory_retention  0.0  13 / 13
AI_parameter_accuracy  91.5  4 / 13
AI_plan_coherence  11.6  9 / 13
AI_recovery  97.1  7 / 13
AI_refusal  100.0  9 / 13
AI_safety_compliance  100.0  6 / 13
AI_spec  100.0  9 / 13
AI_stability  59.9  8 / 13
AI_task_completion  69.7  6 / 13
AI_tool_selection  86.4  3 / 13
ARC_AGI_2  100.0  1 / 11
ArtificialAnalysisCoding  100.0  1 / 14
ArtificialAnalysisIntelligence  100.0  1 / 14
ArtificialAnalysisReasoning  99.1  2 / 14
ContextWindow  100.0  2 / 14
GDPval  95.0  1 / 12
GPQA_HLE_Reasoning  99.1  2 / 14
IFBench  94.3  5 / 14
InverseCost  70.8  13 / 14
InverseTTFT  74.0  10 / 14
LMArenaCreativeOrOpenEnded*  76.2  10 / 14
LMArenaSearchDocument*  83.9  4 / 13
LMArenaText*  76.2  10 / 14
LiveCodeBench  100.0  1 / 14
LongContextRecall  100.0  1 / 14
MCPAtlas*  54.1  8 / 12
OutputSpeed  73.7  8 / 14
SWEBenchPro  82.2  4 / 10
SWEBenchVerified  95.0  1 / 14
SWEComposite  87.7  2 / 14
SciCode  91.5  4 / 14
SonarFunctionalSkill  0.0  12 / 12
SonarIssueDensity  41.2  3 / 12
Tau2Bench  94.2  4 / 14
TerminalBench  100.0  1 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena*, mcp_atlas*, openrouter, overrides, sonar, terminal_bench
missing: BUILD/CopilotArenaOrLMArenaCode, SWEComposite/SWEBenchMultilingual, SWEComposite/SWERebench
gpt-5.4 · openai · idea 55.6 / 55.3 · plan 69.6 / 68.8 · build 64.2 / 62.7 · review 65.1

group breakdown

group  score  rank
A_B  47.0  11 / 14
A_I  56.8  10 / 14
A_P  47.8  10 / 14
A_R  63.4  8 / 14
BUILD  75.7  3 / 14
CRE  42.6  13 / 14
GEN  79.7  5 / 14
LM_ARENA_REVIEW_PROXY  16.3  12 / 14
OPS_long  64.6  10 / 14
OPS_precision  53.3  14 / 14
OPS_review  48.5  14 / 14
PLAN  89.2  3 / 14

metrics

metric  score  rank
AI_code  20.1  11 / 13
AI_complexity  40.6  10 / 13
AI_context_awareness  0.0  12 / 13
AI_correctness  78.4  10 / 13
AI_edge_cases  43.3  8 / 13
AI_efficiency  18.1  10 / 13
AI_hallucination_resistance  60.0  6 / 13
AI_memory_retention  0.0  12 / 13
AI_parameter_accuracy  81.6  6 / 13
AI_plan_coherence  5.1  12 / 13
AI_recovery  97.1  6 / 13
AI_refusal  100.0  8 / 13
AI_safety_compliance  100.0  5 / 13
AI_spec  100.0  8 / 13
AI_stability  0.1  12 / 13
AI_task_completion  69.7  5 / 13
AI_tool_selection  85.1  4 / 13
ARC_AGI_2  90.9  3 / 11
ArtificialAnalysisCoding  97.8  2 / 14
ArtificialAnalysisIntelligence  93.9  4 / 14
ArtificialAnalysisReasoning  88.8  3 / 14
ContextWindow  100.0  1 / 14
CopilotArenaOrLMArenaCode  36.7  9 / 12
GDPval  80.4  4 / 12
GPQA_HLE_Reasoning  88.8  3 / 14
IFBench  86.7  7 / 14
InverseCost  78.8  9 / 14
InverseTTFT  0.0  14 / 14
LMArenaCreativeOrOpenEnded  42.6  13 / 14
LMArenaSearchDocument  16.3  11 / 13
LMArenaText  42.6  13 / 14
LiveCodeBench  98.1  2 / 14
LongContextRecall  99.0  3 / 14
MCPAtlas  54.9  6 / 12
OutputSpeed  75.8  7 / 14
SWEBenchPro  87.9  2 / 10
SWEBenchVerified  74.6  5 / 14
SWEComposite  82.2  3 / 14
SciCode  94.8  2 / 14
SonarFunctionalSkill  22.4  10 / 12
SonarIssueDensity  5.3  10 / 12
Tau2Bench  79.2  8 / 14
TerminalBench  99.9  2 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: SWEComposite/SWEBenchMultilingual, SWEComposite/SWERebench
gpt-5.3-codex · openai · idea 71.0 / 70.6 · plan 65.6 / 64.8 · build 62.7 / 61.2 · review 74.4

group breakdown

group  score  rank
A_B  50.3  9 / 14
A_I  62.9  9 / 14
A_P  45.7  11 / 14
A_R  72.8  5 / 14
BUILD  68.7  6 / 14
CRE  76.2  9 / 14
GEN  73.5  7 / 14
LM_ARENA_REVIEW_PROXY  83.9  3 / 14
OPS_long  74.6  7 / 14
OPS_precision  71.3  7 / 14
OPS_review  70.0  9 / 14
PLAN  79.4  5 / 14

metrics

metric  score  rank
AI_code  20.1  10 / 13
AI_complexity  0.0  13 / 13
AI_context_awareness  0.0  11 / 13
AI_correctness  78.4  9 / 13
AI_edge_cases  43.3  7 / 13
AI_efficiency  30.4  7 / 13
AI_hallucination_resistance  80.0  4 / 13
AI_memory_retention  0.0  11 / 13
AI_parameter_accuracy  71.3  7 / 13
AI_plan_coherence  5.1  11 / 13
AI_recovery  97.1  5 / 13
AI_refusal  100.0  7 / 13
AI_safety_compliance  66.7  9 / 13
AI_spec  100.0  7 / 13
AI_stability  59.9  7 / 13
AI_task_completion  11.1  12 / 13
AI_tool_selection  66.8  6 / 13
ARC_AGI_2*  36.2  8 / 11
ArtificialAnalysisCoding  83.1  4 / 14
ArtificialAnalysisIntelligence  81.1  6 / 14
ArtificialAnalysisReasoning  83.3  4 / 14
ContextWindow  86.0  10 / 14
CopilotArenaOrLMArenaCode  39.8  8 / 12
GDPval  50.2  9 / 12
GPQA_HLE_Reasoning  83.3  4 / 14
IFBench  92.4  6 / 14
InverseCost  84.4  6 / 14
InverseTTFT  53.1  13 / 14
LMArenaCreativeOrOpenEnded*  76.2  9 / 14
LMArenaSearchDocument*  83.9  3 / 13
LMArenaText*  76.2  9 / 14
LiveCodeBench  92.6  3 / 14
LongContextRecall  99.0  2 / 14
MCPAtlas*  54.1  7 / 12
OutputSpeed  77.6  6 / 14
SWEBenchPro*  82.2  3 / 10
SWEBenchVerified*  70.9  9 / 14
SWEComposite  80.2  4 / 14
SWERebench  90.1  6 / 11
SciCode  72.7  7 / 14
SonarFunctionalSkill  17.4  11 / 12
SonarIssueDensity  5.3  11 / 12
Tau2Bench  76.8  9 / 14
TerminalBench  73.9  6 / 14

sources: aistupidlevel, arc_agi*, artificial_analysis, livecodebench, lmarena, mcp_atlas*, openrouter, overrides, sonar, swebench_pro*, swerebench, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-sonnet-4 · anthropic · idea 79.2 / 79.2 · plan 61.8 / 61.8 · build 60.4 / 60.4 · review 57.9

group breakdown

group  score  rank
A_B  77.9  4 / 14
A_I  86.5  2 / 14
A_P  67.4  3 / 14
A_R  80.8  3 / 14
BUILD  47.7  10 / 14
CRE  77.4  8 / 14
GEN  70.1  9 / 14
LM_ARENA_REVIEW_PROXY  19.9  9 / 14
OPS_long  75.2  6 / 14
OPS_precision  74.4  6 / 14
OPS_review  73.0  6 / 14
PLAN  47.4  11 / 14

metrics

metric  score  rank
AI_code  82.9  4 / 13
AI_complexity  71.6  4 / 13
AI_context_awareness  0.0  9 / 13
AI_correctness  100.0  2 / 13
AI_edge_cases  70.2  4 / 13
AI_efficiency  66.3  3 / 13
AI_hallucination_resistance  20.0  10 / 13
AI_memory_retention  0.0  8 / 13
AI_parameter_accuracy  83.5  5 / 13
AI_plan_coherence  24.5  6 / 13
AI_recovery  100.0  1 / 13
AI_refusal  100.0  2 / 13
AI_safety_compliance  100.0  2 / 13
AI_spec  100.0  2 / 13
AI_stability  100.0  2 / 13
AI_task_completion  95.4  3 / 13
AI_tool_selection  91.7  2 / 13
ARC_AGI_2  73.7  5 / 11
ArtificialAnalysisCoding  75.4  6 / 14
ArtificialAnalysisIntelligence  73.5  8 / 14
ArtificialAnalysisReasoning  49.0  10 / 14
ContextWindow  99.3  8 / 14
CopilotArenaOrLMArenaCode  86.5  5 / 12
GDPval*  76.3  5 / 12
GPQA_HLE_Reasoning  49.0  10 / 14
IFBench  21.1  11 / 14
InverseCost  78.0  10 / 14
InverseTTFT  64.6  12 / 14
LMArenaCreativeOrOpenEnded  77.4  8 / 14
LMArenaSearchDocument  19.9  8 / 13
LMArenaText  77.4  8 / 14
LiveCodeBench  0.0  14 / 14
LongContextRecall  67.6  6 / 14
MCPAtlas  48.6  10 / 12
OutputSpeed  71.9  9 / 14
SWEBenchPro  18.7  9 / 10
SWEBenchVerified  46.5  12 / 14
SWEComposite  45.0  10 / 14
SWERebench  95.2  3 / 11
SciCode  31.2  10 / 14
SonarFunctionalSkill  60.4  4 / 12
SonarIssueDensity  17.4  8 / 12
Tau2Bench  54.1  12 / 14
TerminalBench*  46.8  11 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides*, sonar, swebench, swebench_pro, swerebench, terminal_bench*
missing: SWEComposite/SWEBenchMultilingual
gemini-3-pro · google · idea 80.8 / 80.8 · plan 68.3 / 68.3 · build 59.5 / 59.5 · review 57.9

group breakdown

group  score  rank
A_B  73.1  5 / 14
A_I  74.8  5 / 14
A_P  68.2  2 / 14
A_R  67.4  7 / 14
BUILD  49.1  9 / 14
CRE  93.5  4 / 14
GEN  67.6  10 / 14
LM_ARENA_REVIEW_PROXY  17.4  11 / 14
OPS_long  71.9  8 / 14
OPS_precision  69.0  9 / 14
OPS_review  72.3  7 / 14
PLAN  68.9  7 / 14

metrics

metric  score  rank
AI_code  83.9  3 / 13
AI_complexity  62.7  5 / 13
AI_context_awareness  7.6  7 / 13
AI_correctness  100.0  5 / 13
AI_edge_cases  0.0  13 / 13
AI_efficiency  30.1  8 / 13
AI_hallucination_resistance  100.0  1 / 13
AI_memory_retention  100.0  1 / 13
AI_parameter_accuracy  100.0  1 / 13
AI_plan_coherence  100.0  1 / 13
AI_recovery  0.0  13 / 13
AI_refusal  100.0  5 / 13
AI_safety_compliance  100.0  4 / 13
AI_spec  100.0  5 / 13
AI_stability  100.0  3 / 13
AI_task_completion  21.9  11 / 13
AI_tool_selection  0.0  13 / 13
ARC_AGI_2  36.2  7 / 11
ArtificialAnalysisCoding  60.0  9 / 14
ArtificialAnalysisIntelligence  60.2  10 / 14
ArtificialAnalysisReasoning  74.9  6 / 14
ContextWindow  0.0  14 / 14
CopilotArenaOrLMArenaCode  52.1  7 / 12
GDPval  5.0  12 / 12
GPQA_HLE_Reasoning  74.9  6 / 14
IFBench  73.4  8 / 14
InverseCost  81.3  7 / 14
InverseTTFT  73.1  11 / 14
LMArenaCreativeOrOpenEnded  93.5  4 / 14
LMArenaSearchDocument  17.4  10 / 13
LMArenaText  93.5  4 / 14
LiveCodeBench  59.5  8 / 14
LongContextRecall  67.6  7 / 14
MCPAtlas  53.2  9 / 12
OutputSpeed  89.3  3 / 14
SWEBenchMultilingual  27.1  3 / 4
SWEBenchPro  21.3  8 / 10
SWEBenchVerified  57.3  11 / 14
SWEComposite  42.6  11 / 14
SWERebench  70.9  9 / 11
SciCode  91.5  3 / 14
SonarFunctionalSkill  56.1  5 / 12
SonarIssueDensity  14.3  9 / 12
Tau2Bench  79.2  7 / 14
TerminalBench  60.7  9 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
claude-opus-4.7 · anthropic · idea 67.2 / 67.2 · plan 63.0 / 63.0 · build 58.7 / 58.7 · review 61.5

group breakdown

group  score  rank
A_B  13.6  14 / 14
A_I  19.9  14 / 14
A_P  30.6  14 / 14
A_R  23.2  14 / 14
BUILD  85.1  1 / 14
CRE  97.7  3 / 14
GEN  93.6  3 / 14
LM_ARENA_REVIEW_PROXY  98.6  2 / 14
OPS_long  65.7  9 / 14
OPS_precision  70.7  8 / 14
OPS_review  70.6  8 / 14
PLAN  74.0  6 / 14

metrics

metric  score  rank
AI_code  0.0  13 / 13
AI_complexity  3.3  12 / 13
AI_context_awareness  13.9  5 / 13
AI_correctness  0.0  13 / 13
AI_edge_cases  35.4  10 / 13
AI_efficiency  8.9  11 / 13
AI_hallucination_resistance  0.0  13 / 13
AI_memory_retention  30.5  5 / 13
AI_parameter_accuracy  0.0  13 / 13
AI_plan_coherence  27.8  5 / 13
AI_recovery  75.6  10 / 13
AI_refusal  0.0  13 / 13
AI_safety_compliance  100.0  1 / 13
AI_spec  0.0  13 / 13
AI_stability  49.9  9 / 13
AI_task_completion  100.0  1 / 13
AI_tool_selection  57.5  7 / 13
ArtificialAnalysisCoding  81.0  5 / 14
ArtificialAnalysisIntelligence  95.9  2 / 14
ArtificialAnalysisReasoning  82.3  5 / 14
ContextWindow  99.3  7 / 14
CopilotArenaOrLMArenaCode  100.0  1 / 12
GDPval  93.0  2 / 12
GPQA_HLE_Reasoning  82.3  5 / 14
IFBench  28.7  9 / 14
InverseCost  72.2  12 / 14
InverseTTFT  77.5  7 / 14
LMArenaCreativeOrOpenEnded  97.7  3 / 14
LMArenaSearchDocument  98.6  2 / 13
LMArenaText  97.7  3 / 14
LiveCodeBench  70.5  6 / 14
LongContextRecall  63.8  8 / 14
MCPAtlas  100.0  1 / 12
OutputSpeed  51.0  10 / 14
SWEBenchPro  95.0  1 / 10
SWEBenchVerified  94.2  2 / 14
SWEComposite  94.7  1 / 14
SciCode  81.1  5 / 14
SonarFunctionalSkill  80.4  2 / 12
SonarIssueDensity  0.0  12 / 12
Tau2Bench  82.5  6 / 14
TerminalBench  77.9  4 / 14

sources: aistupidlevel, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: GEN/ARC_AGI_2, SWEComposite/SWEBenchMultilingual, SWEComposite/SWERebench
kimi-k2.6 · moonshot · idea 68.5 / 68.5 · plan 64.8 / 64.8 · build 57.6 / 57.6 · review 67.9

group breakdown

group  score  rank
A_B  42.4  13 / 14
A_I  55.8  11 / 14
A_P  41.4  13 / 14
A_R  56.6  11 / 14
BUILD  66.2  7 / 14
CRE  80.2  6 / 14
GEN  79.6  6 / 14
LM_ARENA_REVIEW_PROXY  83.8  5 / 14
OPS_long  42.4  14 / 14
OPS_precision  59.4  12 / 14
OPS_review  65.0  12 / 14
PLAN  80.2  4 / 14

metrics

metric  score  rank
AI_code  20.1  9 / 13
AI_complexity  40.6  9 / 13
AI_context_awareness  0.0  10 / 13
AI_correctness  78.4  8 / 13
AI_edge_cases  43.3  6 / 13
AI_efficiency  0.0  13 / 13
AI_hallucination_resistance  20.0  11 / 13
AI_memory_retention  0.0  10 / 13
AI_parameter_accuracy  32.2  11 / 13
AI_plan_coherence  11.6  8 / 13
AI_recovery  97.1  4 / 13
AI_refusal  100.0  6 / 13
AI_safety_compliance  0.0  13 / 13
AI_spec  100.0  6 / 13
AI_stability  0.1  11 / 13
AI_task_completion  40.4  8 / 13
AI_tool_selection  52.5  8 / 13
ArtificialAnalysisCoding  62.1  8 / 14
ArtificialAnalysisIntelligence  82.3  5 / 14
ArtificialAnalysisReasoning  72.5  7 / 14
ContextWindow  69.2  13 / 14
CopilotArenaOrLMArenaCode  87.5  4 / 12
GDPval  50.9  8 / 12
GPQA_HLE_Reasoning  72.5  7 / 14
IFBench  94.6  4 / 14
InverseCost  98.5  3 / 14
InverseTTFT*  90.4  2 / 14
LMArenaCreativeOrOpenEnded  80.2  6 / 14
LMArenaSearchDocument  83.8  5 / 13
LMArenaText  80.2  6 / 14
LiveCodeBench  37.4  12 / 14
LongContextRecall  58.1  9 / 14
MCPAtlas*  78.1  5 / 12
OutputSpeed*  7.5  13 / 14
SWEBenchVerified  78.7  4 / 14
SWEComposite  67.1  7 / 14
SWERebench*  92.5  4 / 11
SciCode  74.7  6 / 14
SonarFunctionalSkill*  31.9  7 / 12
SonarIssueDensity*  92.5  2 / 12
Tau2Bench  98.6  2 / 14
TerminalBench  74.3  5 / 14

sources: aistupidlevel, artificial_analysis, livecodebench, lmarena, mcp_atlas*, openrouter, overrides, sonar*, swerebench*
missing: GEN/ARC_AGI_2, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
claude-sonnet-4.5 · anthropic · idea 70.1 / 70.1 · plan 50.8 / 50.8 · build 56.9 / 56.9 · review 51.3

group breakdown

group  score  rank
A_B  88.5  1 / 14
A_I  91.6  1 / 14
A_P  76.4  1 / 14
A_R  88.3  1 / 14
BUILD  36.4  12 / 14
CRE  68.4  11 / 14
GEN  36.7  12 / 14
LM_ARENA_REVIEW_PROXY  0.0  14 / 14
OPS_long  55.8  11 / 14
OPS_precision  66.2  10 / 14
OPS_review  67.8  11 / 14
PLAN  30.3  12 / 14

metrics

metric  score  rank
AI_code  98.9  2 / 13
AI_complexity  98.8  2 / 13
AI_context_awareness  98.4  2 / 13
AI_correctness  100.0  3 / 13
AI_edge_cases  100.0  2 / 13
AI_efficiency  84.3  2 / 13
AI_hallucination_resistance  40.0  8 / 13
AI_memory_retention  0.0  9 / 13
AI_parameter_accuracy  21.3  12 / 13
AI_plan_coherence  37.5  4 / 13
AI_recovery  100.0  2 / 13
AI_refusal  100.0  3 / 13
AI_safety_compliance  20.8  11 / 13
AI_spec  100.0  3 / 13
AI_stability  86.0  6 / 13
AI_task_completion  96.1  2 / 13
AI_tool_selection  84.5  5 / 13
ARC_AGI_2  13.9  9 / 11
ArtificialAnalysisCoding  32.4  12 / 14
ArtificialAnalysisIntelligence  38.6  12 / 14
ArtificialAnalysisReasoning  7.5  13 / 14
ContextWindow  99.3  9 / 14
CopilotArenaOrLMArenaCode  32.6  11 / 12
GDPval  80.9  3 / 12
GPQA_HLE_Reasoning  7.5  13 / 14
IFBench  23.7  10 / 14
InverseCost  78.0  11 / 14
InverseTTFT  83.2  4 / 14
LMArenaCreativeOrOpenEnded  68.4  11 / 14
LMArenaSearchDocument  0.0  13 / 13
LMArenaText  68.4  11 / 14
LiveCodeBench  81.5  4 / 14
LongContextRecall  20.0  12 / 14
MCPAtlas  0.0  12 / 12
OutputSpeed  29.9  12 / 14
SWEBenchMultilingual  0.0  4 / 4
SWEBenchPro  22.5  7 / 10
SWEBenchVerified  59.7  10 / 14
SWEComposite  41.9  12 / 14
SWERebench  75.0  8 / 11
SciCode  17.6  11 / 14
SonarFunctionalSkill  27.9  9 / 12
SonarIssueDensity  29.7  5 / 12
Tau2Bench  59.4  11 / 14
TerminalBench  36.5  12 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
glm-5.1 · zai · idea 69.1 / 69.1 · plan 60.1 / 60.1 · build 55.3 / 55.3 · review 62.9

group breakdown

group  score  rank
A_B  45.1  12 / 14
A_I  53.7  12 / 14
A_P  44.5  12 / 14
A_R  54.2  12 / 14
BUILD  60.1  8 / 14
CRE  87.2  5 / 14
GEN  70.3  8 / 14
LM_ARENA_REVIEW_PROXY  78.7  7 / 14
OPS_long  44.0  13 / 14
OPS_precision  63.0  11 / 14
OPS_review  68.9  10 / 14
PLAN  68.7  8 / 14

metrics

metric  score  rank
AI_code*  24.6  8 / 13
AI_complexity*  42.0  8 / 13
AI_context_awareness*  7.5  8 / 13
AI_correctness*  74.1  12 / 13
AI_edge_cases*  44.3  5 / 13
AI_efficiency*  7.5  12 / 13
AI_hallucination_resistance*  24.5  9 / 13
AI_memory_retention*  7.5  6 / 13
AI_parameter_accuracy*  34.8  10 / 13
AI_plan_coherence*  17.3  7 / 13
AI_recovery*  90.0  9 / 13
AI_refusal*  92.5  12 / 13
AI_safety_compliance*  7.5  12 / 13
AI_spec*  92.5  12 / 13
AI_stability*  7.6  10 / 13
AI_task_completion*  41.8  7 / 13
AI_tool_selection*  52.1  9 / 13
ArtificialAnalysisCoding  49.2  10 / 14
ArtificialAnalysisIntelligence  72.3  9 / 14
ArtificialAnalysisReasoning  42.3  11 / 14
ContextWindow  76.2  11 / 14
CopilotArenaOrLMArenaCode  89.6  3 / 12
GDPval  58.3  7 / 12
GPQA_HLE_Reasoning  42.3  11 / 14
IFBench  95.8  3 / 14
InverseCost  98.9  2 / 14
InverseTTFT  100.0  1 / 14
LMArenaCreativeOrOpenEnded  87.2  5 / 14
LMArenaSearchDocument*  78.7  7 / 13
LMArenaText  87.2  5 / 14
LiveCodeBench  31.9  13 / 14
LongContextRecall  0.0  14 / 14
MCPAtlas  83.1  4 / 12
OutputSpeed  4.9  14 / 14
SWEBenchVerified*  74.4  6 / 14
SWEComposite  67.3  6 / 14
SWERebench  100.0  1 / 11
SciCode  11.7  12 / 14
SonarFunctionalSkill  28.7  8 / 12
SonarIssueDensity  100.0  1 / 12
Tau2Bench  100.0  1 / 14
TerminalBench  69.9  7 / 14

sources: aistupidlevel*, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench
missing: GEN/ARC_AGI_2, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
gemini-3-flash · google · idea 72.0 / 72.0 · plan 62.0 / 62.0 · build 52.2 / 52.2 · review 52.3

group breakdown

group  score  rank
A_B  64.8  6 / 14
A_I  65.8  6 / 14
A_P  61.6  5 / 14
A_R  61.1  9 / 14
BUILD  38.0  11 / 14
CRE  78.1  7 / 14
GEN  59.3  11 / 14
LM_ARENA_REVIEW_PROXY  18.7  10 / 14
OPS_long  95.8  2 / 14
OPS_precision  93.8  1 / 14
OPS_review  93.0  1 / 14
PLAN  56.8  10 / 14

metrics

metric  score  rank
AI_code*  78.8  6 / 13
AI_complexity*  60.8  6 / 13
AI_context_awareness*  14.0  3 / 13
AI_correctness*  92.5  6 / 13
AI_edge_cases*  7.5  11 / 13
AI_efficiency*  33.1  5 / 13
AI_hallucination_resistance*  92.5  2 / 13
AI_memory_retention*  92.5  2 / 13
AI_parameter_accuracy*  92.5  2 / 13
AI_plan_coherence*  92.5  2 / 13
AI_recovery*  7.5  11 / 13
AI_refusal*  92.5  10 / 13
AI_safety_compliance*  92.5  7 / 13
AI_spec*  92.5  10 / 13
AI_stability*  92.5  4 / 13
AI_task_completion*  26.1  9 / 13
AI_tool_selection*  7.5  11 / 13
ARC_AGI_2  39.4  6 / 11
ArtificialAnalysisCoding  46.4  11 / 14
ArtificialAnalysisIntelligence  52.2  11 / 14
ArtificialAnalysisReasoning  66.3  9 / 14
ContextWindow  100.0  5 / 14
CopilotArenaOrLMArenaCode  32.8  10 / 12
GDPval  6.4  11 / 12
GPQA_HLE_Reasoning  66.3  9 / 14
IFBench  100.0  1 / 14
InverseCost  97.2  4 / 14
InverseTTFT  85.2  3 / 14
LMArenaCreativeOrOpenEnded  78.1  7 / 14
LMArenaSearchDocument  18.7  9 / 13
LMArenaText  78.1  7 / 14
LiveCodeBench  54.0  9 / 14
LongContextRecall  25.7  10 / 14
MCPAtlas  6.3  11 / 12
OutputSpeed  98.2  2 / 14
SWEBenchMultilingual  100.0  1 / 4
SWEBenchPro  0.0  10 / 10
SWEBenchVerified  71.7  7 / 14
SWEComposite  46.8  9 / 14
SWERebench  76.4  7 / 11
SciCode  55.8  9 / 14
SonarFunctionalSkill  43.2  6 / 12
SonarIssueDensity  20.9  7 / 12
Tau2Bench  64.5  10 / 14
TerminalBench  47.6  10 / 14

sources: aistupidlevel*, arc_agi, artificial_analysis, livecodebench, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
gemini-2.5-flash · google · idea 60.5 / 60.5 · plan 29.9 / 29.9 · build 48.3 / 48.3 · review 47.7

group breakdown

group  score  rank
A_B  86.9  2 / 14
A_I  78.0  4 / 14
A_P  53.3  8 / 14
A_R  84.9  2 / 14
BUILD  18.9  14 / 14
CRE  57.6  12 / 14
GEN  14.4  13 / 14
LM_ARENA_REVIEW_PROXY  50.0  8 / 14
OPS_long  96.4  1 / 14
OPS_precision  93.7  2 / 14
OPS_review  92.8  2 / 14
PLAN  1.2  14 / 14

metrics

metric  score  rank
AI_code  100.0  1 / 13
AI_complexity  100.0  1 / 13
AI_context_awareness  100.0  1 / 13
AI_correctness  100.0  4 / 13
AI_edge_cases  100.0  3 / 13
AI_efficiency  100.0  1 / 13
AI_hallucination_resistance  69.9  5 / 13
AI_memory_retention  31.9  4 / 13
AI_parameter_accuracy  45.8  8 / 13
AI_plan_coherence  0.0  13 / 13
AI_recovery  100.0  3 / 13
AI_refusal  100.0  4 / 13
AI_safety_compliance  100.0  3 / 13
AI_spec  100.0  4 / 13
AI_stability  0.0  13 / 13
AI_task_completion  0.0  13 / 13
AI_tool_selection  25.7  10 / 13
ARC_AGI_2  0.0  11 / 11
ArtificialAnalysisCoding  0.0  14 / 14
ArtificialAnalysisIntelligence  0.0  14 / 14
ArtificialAnalysisReasoning  0.0  14 / 14
ContextWindow  100.0  3 / 14
GPQA_HLE_Reasoning  0.0  14 / 14
IFBench  4.8  13 / 14
InverseCost  100.0  1 / 14
InverseTTFT  82.1  5 / 14
LMArenaCreativeOrOpenEnded  57.6  12 / 14
LMArenaText  57.6  12 / 14
LiveCodeBench  48.5  10 / 14
LongContextRecall  6.7  13 / 14
OutputSpeed  100.0  1 / 14
SWEBenchVerified  0.0  14 / 14
SWEComposite  25.0  14 / 14
SWERebench  0.0  11 / 11
SciCode  0.0  14 / 14
Tau2Bench  0.0  14 / 14
TerminalBench  0.0  14 / 14

sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: BUILD/CopilotArenaOrLMArenaCode, BUILD/GDPval, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
gemini-2.5-pro · google · idea 25.4 / 25.4 · plan 28.2 / 28.2 · build 37.2 / 37.2 · review 31.2

group breakdown

group  score  rank
A_B  50.0  10 / 14
A_I  50.0  13 / 14
A_P  50.0  9 / 14
A_R  50.0  13 / 14
BUILD  23.3  13 / 14
CRE  0.0  14 / 14
GEN  5.7  14 / 14
LM_ARENA_REVIEW_PROXY  0.1  13 / 14
OPS_long  86.6  4 / 14
OPS_precision  84.7  4 / 14
OPS_review  83.3  3 / 14
PLAN  7.7  13 / 14

metrics

metric  score  rank
ARC_AGI_2  1.3  10 / 11
ArtificialAnalysisCoding  8.9  13 / 14
ArtificialAnalysisIntelligence  4.9  13 / 14
ArtificialAnalysisReasoning  19.4  12 / 14
ContextWindow  100.0  4 / 14
CopilotArenaOrLMArenaCode  0.0  12 / 12
GPQA_HLE_Reasoning  19.4  12 / 14
IFBench  0.0  14 / 14
InverseCost  84.4  5 / 14
InverseTTFT  75.5  9 / 14
LMArenaCreativeOrOpenEnded  0.0  14 / 14
LMArenaSearchDocument  0.1  12 / 13
LMArenaText  0.0  14 / 14
LiveCodeBench  42.9  11 / 14
LongContextRecall  22.8  11 / 14
OutputSpeed  87.4  5 / 14
SWEBenchVerified  20.2  13 / 14
SWEComposite  31.9  13 / 14
SWERebench  4.3  10 / 11
SciCode  5.2  13 / 14
Tau2Bench  6.6  13 / 14
TerminalBench  0.5  13 / 14

sources: arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_hallucination_resistance, A_B/AI_memory_retention, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_code, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_plan_coherence, A_I/AI_recovery, A_I/AI_refusal, A_I/AI_spec, A_I/AI_stability, A_P/AI_context_awareness, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_memory_retention, A_P/AI_parameter_accuracy, A_P/AI_plan_coherence, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_P/AI_task_completion, A_P/AI_tool_selection, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_hallucination_resistance, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro