Leaderboard

Ranking of AI coding agents on 50 mobile development tasks

Best Task Success: 12.0%
Best Test Pass: 28.0%
Avg Task Success: 8.9%
Total Agents: 9

Rank  Agent        Model       Task Success  Test Pass  Tasks Solved
1     Cursor       Opus 4.5    12.0%         28.0%      6/50
2     Cursor       Sonnet 4.5  12.0%         27.1%      6/50
3     Codex        GLM 4.6     12.0%         26.0%      6/50
4     Claude Code  GLM 4.6     10.0%         26.7%      5/50
5     Claude Code  Sonnet 4.5   8.0%         24.0%      4/50
6     Codex        Sonnet 4.5   8.0%         22.0%      4/50
7     Claude Code  Haiku        8.0%         20.0%      4/50
8     Claude Code  Opus 4.5     6.0%         21.1%      3/50
9     Codex        Opus 4.5     4.0%         18.0%      2/50
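
The summary figures at the top can be reproduced directly from the table. A minimal sketch (the row data is copied from the leaderboard; the field layout and variable names are my own):

```python
# Leaderboard rows: (agent, model, task_success_%, test_pass_%, tasks_solved)
rows = [
    ("Cursor",      "Opus 4.5",   12.0, 28.0, 6),
    ("Cursor",      "Sonnet 4.5", 12.0, 27.1, 6),
    ("Codex",       "GLM 4.6",    12.0, 26.0, 6),
    ("Claude Code", "GLM 4.6",    10.0, 26.7, 5),
    ("Claude Code", "Sonnet 4.5",  8.0, 24.0, 4),
    ("Codex",       "Sonnet 4.5",  8.0, 22.0, 4),
    ("Claude Code", "Haiku",       8.0, 20.0, 4),
    ("Claude Code", "Opus 4.5",    6.0, 21.1, 3),
    ("Codex",       "Opus 4.5",    4.0, 18.0, 2),
]

best_task_success = max(r[2] for r in rows)              # 12.0
best_test_pass    = max(r[3] for r in rows)              # 28.0
avg_task_success  = sum(r[2] for r in rows) / len(rows)  # 80 / 9 ≈ 8.9
total_agents      = len(rows)                            # 9

print(f"Best Task Success: {best_task_success:.1f}%")
print(f"Best Test Pass:    {best_test_pass:.1f}%")
print(f"Avg Task Success:  {avg_task_success:.1f}%")
print(f"Total Agents:      {total_agents}")
```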

Performance Analysis

Agent Comparison
(Chart: Task Success and Test Pass per agent)

By Difficulty
(Chart: Task Count and Avg Pass % per difficulty tier)

Performance by Category

Task Categories

UI Components: 18 tasks · 24.5%
Gesture & Interaction: 8 tasks · 15.2%
Data Management: 12 tasks · 32.3%
Media & Assets: 6 tasks · 18.8%
Networking: 4 tasks · 22.2%
Other: 2 tasks · 20.5%

Difficulty Distribution

Easy (1-2 files): 18 tasks · 18%
Medium (3-6 files): 22 tasks · 10%
Hard (7+ files): 10 tasks · 2%

Task Success Rate

Percentage of tasks where the agent generated a patch that passes all test cases. A task is considered successful only if 100% of its tests pass.

Test Pass Rate

Overall percentage of individual test cases passed across all tasks. This provides a more granular view of agent performance.
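
To make the difference between the two metrics concrete, here is a minimal sketch with hypothetical per-task results (the data and variable names are illustrative, not from the benchmark; it assumes test cases are pooled across tasks, which may differ from the benchmark's exact aggregation):

```python
# Each entry: (tests_passed, tests_total) for one task -- hypothetical data.
task_results = [(5, 5), (3, 4), (0, 6)]

# Task Success Rate: fraction of tasks where 100% of that task's tests pass.
task_success_rate = sum(p == t for p, t in task_results) / len(task_results)

# Test Pass Rate: fraction of individual test cases passed across all tasks.
test_pass_rate = sum(p for p, _ in task_results) / sum(t for _, t in task_results)

print(f"Task Success Rate: {task_success_rate:.1%}")  # 33.3% (only the first task fully passes)
print(f"Test Pass Rate:    {test_pass_rate:.1%}")     # 8/15 ≈ 53.3%
```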

Task details are private. Contact us for research collaboration.

Key Insights from the Paper

Agent Design Matters

The same model (Opus 4.5) achieves 12% task success with Cursor but only 4% with Codex, a 3× difference. Agent scaffolding is as important as model capability.

Complexity Impact

Performance drops sharply with task complexity: 18% success for 1-2 files vs. only 2% for 7+ files.

Cost-Performance Trade-off

Codex + GLM 4.6 offers the best value: 12% success at $1.30/task vs. Cursor + Opus 4.5 at $3.50/task.
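
A back-of-envelope way to compare cost efficiency is cost per solved task (my own framing, using only the figures quoted above):

```python
# Cost per solved task = cost per attempted task / task success rate.
codex_glm   = 1.30 / 0.12  # ≈ $10.83 per solved task
cursor_opus = 3.50 / 0.12  # ≈ $29.17 per solved task

print(f"Codex + GLM 4.6:   ${codex_glm:.2f} per solved task")
print(f"Cursor + Opus 4.5: ${cursor_opus:.2f} per solved task")
```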

Prompt Engineering

Simple "Defensive Programming" prompts outperform complex ones by 7.4%; adding prompt complexity hurts performance.