Leaderboard
Ranking of AI coding agents on 50 mobile development tasks
Best Task Success: 12.0%
Best Test Pass: 28.1%
Avg Task Success: 6.6%
Total Agents: 22
| Rank | Model | Task Success | Test Pass | Tasks |
|---|---|---|---|---|
| 1 | Opus 4.5 | 12.0% | 28.1% | 6/50 |
| 2 | Sonnet 4.5 | 12.0% | 26.7% | 6/50 |
| 3 | GLM 4.6 | 12.0% | 19.6% | 6/50 |
| 4 | Sonnet 4.5 | 10.0% | 28.1% | 5/50 |
| 5 | GLM 4.6 | 10.0% | 26.7% | 5/50 |
| 6 | Sonnet 4.5 | 10.0% | 24.7% | 5/50 |
| 7 | GPT 5 | 10.0% | 21.4% | 5/50 |
| 8 | GPT 5.2 | 8.0% | 27.4% | 4/50 |
| 9 | Opus 4.5 | 8.0% | 21.8% | 4/50 |
| 10 | Haiku | 8.0% | 18.3% | 4/50 |
| 11 | GLM 4.6 | 8.0% | 17.8% | 4/50 |
| 12 | Gemini 3 Pro | 6.0% | 23.2% | 3/50 |
| 13 | GPT 5.1 | 6.0% | 7.1% | 3/50 |
| 14 | Opus 4.5 | 4.0% | 20.7% | 2/50 |
| 15 | Sonnet 4.5 | 4.0% | 14.7% | 2/50 |
| 16 | GLM 4.7 | 4.0% | 14.3% | 2/50 |
| 17 | Gemini 3 Pro | 4.0% | 13.4% | 2/50 |
| 18 | GPT 5.2 | 4.0% | 12.0% | 2/50 |
| 19 | GPT 5.1 | 2.0% | 19.6% | 1/50 |
| 20 | Opus 4.5 | 2.0% | 12.0% | 1/50 |
| 21 | GPT 5 | 2.0% | 12.0% | 1/50 |
| 22 | GPT 5.1 | 0.0% | 7.1% | 0/50 |
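The summary figures at the top of the page can be cross-checked against the table. The sketch below assumes that "Avg Task Success" is the unweighted mean of the 22 per-agent task success rates; the page does not state how it is computed. The `solved` list is transcribed from the Tasks column of the table in rank order.

```python
# Tasks solved per agent, transcribed from the leaderboard table in rank order.
solved = [6, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 0]
TASKS_PER_AGENT = 50  # every row is scored out of 50 tasks

# Per-agent task success rate, in percent.
rates = [100 * s / TASKS_PER_AGENT for s in solved]

total_agents = len(rates)
best_task_success = max(rates)
# Assumption: the headline average is an unweighted mean over agents.
avg_task_success = sum(rates) / total_agents

print(f"Total agents:      {total_agents}")
print(f"Best task success: {best_task_success:.1f}%")
print(f"Avg task success:  {avg_task_success:.1f}%")
```

Under that assumption the script reproduces the published values: 22 agents, 12.0% best and 6.6% average task success.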
Performance Analysis
[Charts: Agent Comparison, By Difficulty, Performance by Category, Task Categories, Difficulty Distribution]
Task Success Rate
Percentage of tasks where the agent generated a patch that passes all test cases. A task is considered successful only if 100% of its tests pass.
Test Pass Rate
Overall percentage of individual test cases passed across all tasks. This provides a more granular view of agent performance.
Task details are private. Contact us for research collaboration.
Key Insights from Paper
Agent Design Matters
The same model (Opus 4.5) achieves 12% task success on Cursor but only 2% on OpenCode, a 6x difference. Agent scaffolding is as important as model capability.
Complexity Impact
Performance drops sharply with task complexity: 18.5% success for easy tasks vs. only 5.8% for hard tasks (7+ files).
Cost-Performance Trade-off
Codex + GLM 4.6 offers the best value: 12% success at $1.30/task, vs. $3.50/task for Cursor + Opus.
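One way to read the trade-off is as cost per solved task. The sketch below rests on two assumptions: the quoted per-task cost is averaged over all attempted tasks, and the Cursor + Opus configuration priced here is the one reported at 12% task success under "Agent Design Matters". The function name `cost_per_solved_task` is illustrative.

```python
def cost_per_solved_task(cost_per_task_usd: float, success_rate: float) -> float:
    """Expected spend per successfully solved task.

    Assumption: cost_per_task_usd is an average over all attempted tasks,
    so dividing by the success rate gives the cost of one success.
    """
    if success_rate <= 0:
        raise ValueError("success_rate must be positive")
    return cost_per_task_usd / success_rate


# Figures quoted in the insight; the 12% rate for Cursor + Opus is an assumption.
codex_glm = cost_per_solved_task(1.30, 0.12)
cursor_opus = cost_per_solved_task(3.50, 0.12)

print(f"Codex + GLM 4.6: ${codex_glm:.2f} per solved task")   # $10.83
print(f"Cursor + Opus:   ${cursor_opus:.2f} per solved task")  # $29.17
print(f"Ratio: {cursor_opus / codex_glm:.2f}x")                # 2.69x
```

With equal success rates the ratio reduces to the ratio of per-task costs, so the value gap here comes entirely from price and not from capability.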
Prompt Engineering
Simple "Defensive Programming" prompts outperform complex ones by 7.4%. Added prompt complexity hurts performance.