Leaderboard

Ranking of AI coding agents on 50 mobile development tasks

Best Task Success: 12.0%

Best Test Pass: 28.1%

Avg Task Success: 6.6%

Total Agents: 22

| Rank | Agent | Model | Task Success | Test Pass | Tasks Solved |
|------|-------|-------|--------------|-----------|--------------|
| 1 | Cursor | Opus 4.5 | 12.0% | 28.1% | 6/50 |
| 2 | Cursor | Sonnet 4.5 | 12.0% | 26.7% | 6/50 |
| 3 | Codex | GLM 4.6 | 12.0% | 19.6% | 6/50 |
| 4 | Codex | Sonnet 4.5 | 10.0% | 28.1% | 5/50 |
| 5 | Claude Code | GLM 4.6 | 10.0% | 26.7% | 5/50 |
| 6 | Claude Code | Sonnet 4.5 | 10.0% | 24.7% | 5/50 |
| 7 | Codex | GPT 5 | 10.0% | 21.4% | 5/50 |
| 8 | Cursor | GPT 5.2 | 8.0% | 27.4% | 4/50 |
| 9 | Claude Code | Opus 4.5 | 8.0% | 21.8% | 4/50 |
| 10 | Claude Code | Haiku | 8.0% | 18.3% | 4/50 |
| 11 | OpenCode | GLM 4.6 | 8.0% | 17.8% | 4/50 |
| 12 | Cursor | Gemini 3 Pro | 6.0% | 23.2% | 3/50 |
| 13 | OpenCode | GPT 5.1 | 6.0% | 7.1% | 3/50 |
| 14 | Codex | Opus 4.5 | 4.0% | 20.7% | 2/50 |
| 15 | OpenCode | Sonnet 4.5 | 4.0% | 14.7% | 2/50 |
| 16 | OpenCode | GLM 4.7 | 4.0% | 14.3% | 2/50 |
| 17 | OpenCode | Gemini 3 Pro | 4.0% | 13.4% | 2/50 |
| 18 | OpenCode | GPT 5.2 | 4.0% | 12.0% | 2/50 |
| 19 | Cursor | GPT 5.1 | 2.0% | 19.6% | 1/50 |
| 20 | OpenCode | Opus 4.5 | 2.0% | 12.0% | 1/50 |
| 21 | OpenCode | GPT 5 | 2.0% | 12.0% | 1/50 |
| 22 | Codex | GPT 5.1 | 0.0% | 7.1% | 0/50 |

Performance Analysis

Agent Comparison

[Chart: Task Success and Test Pass rates by agent]

By Difficulty

[Chart: Task Count and Avg Pass % by difficulty level]

Performance by Category

Task Categories

UI Components: 18 tasks · 12.5%
Data Management: 10 tasks · 15.3%
Gesture & Interaction: 8 tasks · 8.0%
Media & Assets: 7 tasks · 9.8%
Networking: 4 tasks · 11.2%
Other: 3 tasks · 10.5%

Difficulty Distribution

Easy (1-2 files): 15 tasks · 18.5%
Medium (3-6 files): 25 tasks · 10.0%
Hard (7+ files): 10 tasks · 5.8%
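The difficulty bands above are defined purely by how many files a task's reference patch touches. A minimal sketch of that bucketing (the function name and signature are illustrative, not the benchmark's actual harness):

```python
def difficulty(files_changed: int) -> str:
    """Bucket a task by the number of files its reference patch touches,
    using the leaderboard's bands: 1-2 easy, 3-6 medium, 7+ hard."""
    if files_changed <= 2:
        return "Easy"
    if files_changed <= 6:
        return "Medium"
    return "Hard"

print(difficulty(2), difficulty(4), difficulty(9))  # Easy Medium Hard
```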

Task Success Rate

Percentage of tasks where the agent generated a patch that passes all test cases. A task is considered successful only if 100% of its tests pass.

Test Pass Rate

Overall percentage of individual test cases passed across all tasks. This provides a more granular view of agent performance.

Task details are private. Contact us for research collaboration.

Key Insights from Paper

Agent Design Matters

The same model (Opus 4.5) achieves 12% task success with Cursor but only 2% with OpenCode, a 6x difference. Agent scaffolding matters as much as model capability.

Complexity Impact

Performance drops sharply with task complexity: 18.5% success for easy tasks vs. only 5.8% for hard tasks (7+ files).

Cost-Performance Trade-off

Codex + GLM 4.6 offers best value: 12% success at $1.30/task vs. Cursor + Opus at $3.50/task.
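Taking the figures above at face value, the comparison is clearer as cost per *solved* task (per-attempt cost divided by success rate). A quick back-of-envelope check:

```python
# Cost per solved task = cost per attempt / task success rate.
configs = {
    "Codex + GLM 4.6":   {"success": 0.12, "cost_per_task": 1.30},
    "Cursor + Opus 4.5": {"success": 0.12, "cost_per_task": 3.50},
}
for name, c in configs.items():
    per_solve = c["cost_per_task"] / c["success"]
    print(f"{name}: ${per_solve:.2f} per solved task")
# Codex + GLM 4.6: $10.83 per solved task
# Cursor + Opus 4.5: $29.17 per solved task
```

Since both configurations tie on success rate here, the cheaper per-attempt cost wins outright.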

Prompt Engineering

Simple "Defensive Programming" prompts outperform complex ones by 7.4%; added prompt complexity hurts performance.