Leaderboard
Ranking of AI coding agents on 50 mobile development tasks
Best Task Success: 12.0%
Best Test Pass: 28.0%
Avg Task Success: 8.9%
Total Agents: 9
| Rank | Agent | Model | Task Success | Test Pass | Tasks Solved |
|---|---|---|---|---|---|
| 1 | Cursor | Opus 4.5 | 12.0% | 28.0% | 6/50 |
| 2 | Cursor | Sonnet 4.5 | 12.0% | 27.1% | 6/50 |
| 3 | Codex | GLM 4.6 | 12.0% | 26.0% | 6/50 |
| 4 | Claude Code | GLM 4.6 | 10.0% | 26.7% | 5/50 |
| 5 | Claude Code | Sonnet 4.5 | 8.0% | 24.0% | 4/50 |
| 6 | Codex | Sonnet 4.5 | 8.0% | 22.0% | 4/50 |
| 7 | Claude Code | Haiku | 8.0% | 20.0% | 4/50 |
| 8 | Claude Code | Opus 4.5 | 6.0% | 21.1% | 3/50 |
| 9 | Codex | Opus 4.5 | 4.0% | 18.0% | 2/50 |
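The summary figures above can be reproduced directly from the table rows. A minimal Python sketch (not the benchmark's own tooling; the `rows` list is simply the table transcribed):

```python
# Recompute the headline stats from the leaderboard rows.
# Each tuple: (agent, model, task_success_%, test_pass_%).
rows = [
    ("Cursor", "Opus 4.5", 12.0, 28.0),
    ("Cursor", "Sonnet 4.5", 12.0, 27.1),
    ("Codex", "GLM 4.6", 12.0, 26.0),
    ("Claude Code", "GLM 4.6", 10.0, 26.7),
    ("Claude Code", "Sonnet 4.5", 8.0, 24.0),
    ("Codex", "Sonnet 4.5", 8.0, 22.0),
    ("Claude Code", "Haiku", 8.0, 20.0),
    ("Claude Code", "Opus 4.5", 6.0, 21.1),
    ("Codex", "Opus 4.5", 4.0, 18.0),
]

best_task_success = max(r[2] for r in rows)             # 12.0
best_test_pass = max(r[3] for r in rows)                # 28.0
avg_task_success = sum(r[2] for r in rows) / len(rows)  # 80 / 9 ≈ 8.9

print(f"Best Task Success: {best_task_success:.1f}%")
print(f"Best Test Pass:    {best_test_pass:.1f}%")
print(f"Avg Task Success:  {avg_task_success:.1f}%")
```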
Performance Analysis
Interactive charts (not reproduced here): Agent Comparison, By Difficulty, Performance by Category, Task Categories, Difficulty Distribution.
Task Success Rate
Percentage of tasks where the agent generated a patch that passes all test cases. A task is considered successful only if 100% of its tests pass.
Test Pass Rate
Overall percentage of individual test cases passed across all tasks. This provides a more granular view of agent performance.
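The two metrics can diverge on the same run: an agent may pass most of a task's tests yet still fail the task. A minimal Python sketch of how both rates could be computed, assuming a hypothetical `results` mapping of task IDs to per-test booleans (an illustrative data shape, not the benchmark's actual format):

```python
# Illustrative only: per-task lists of booleans, one entry per test case.
results = {
    "task_01": [True, True, True],          # all tests pass -> task solved
    "task_02": [True, True, False, False],  # partial credit on tests only
    "task_03": [False, False],              # nothing passes
}

# Task Success Rate: a task counts only if 100% of its tests pass.
solved = sum(1 for tests in results.values() if all(tests))
task_success_rate = 100 * solved / len(results)

# Test Pass Rate: pooled over individual test cases across all tasks.
total_tests = sum(len(tests) for tests in results.values())
passed_tests = sum(sum(tests) for tests in results.values())
test_pass_rate = 100 * passed_tests / total_tests

print(f"Task Success Rate: {task_success_rate:.1f}%")  # 33.3%
print(f"Test Pass Rate:    {test_pass_rate:.1f}%")     # 55.6%
```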
Task details are private. Contact us for research collaboration.
Key Insights from Paper
Agent Design Matters
The same model (Opus 4.5) achieves 12% on Cursor but only 4% on Codex—a 3× difference. Agent scaffolding is as important as model capability.
Complexity Impact
Performance drops sharply with task complexity: 18% success for 1-2 files vs. only 2% for 7+ files.
Cost-Performance Trade-off
Codex + GLM 4.6 offers the best value: 12% success at $1.30/task vs. Cursor + Opus 4.5 at $3.50/task (see the cost-per-solved-task check after these insights).
Prompt Engineering
Simple "Defensive Programming" prompts outperform complex ones by 7.4%. Complexity hurts performance.
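One way to read the cost-performance trade-off is cost per solved task rather than cost per attempt. A quick back-of-the-envelope check in Python, using only the per-task costs quoted above and the success rates from the leaderboard (actual costs will vary with pricing and task length):

```python
# Back-of-the-envelope comparison; figures taken from the insight card
# and the leaderboard table above.
configs = {
    "Codex + GLM 4.6":   {"cost_per_task": 1.30, "success_rate": 0.12},
    "Cursor + Opus 4.5": {"cost_per_task": 3.50, "success_rate": 0.12},
}

for name, c in configs.items():
    cost_per_solved = c["cost_per_task"] / c["success_rate"]
    print(f"{name}: ~${cost_per_solved:.2f} per solved task")

# Codex + GLM 4.6:   ~$10.83 per solved task
# Cursor + Opus 4.5: ~$29.17 per solved task
```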