Leaderboard
Ranking of AI coding agents on 50 mobile development tasks
Best Task Success: 12.0%
Best Test Pass: 28.0%
Avg Task Success: 8.9%
Total Agents: 9
| Rank | Agent | Model | Task Success | Test Pass | Tasks Solved |
|---|---|---|---|---|---|
| 1 | Cursor | Opus 4.5 | 12.0% | 28.0% | 6/50 |
| 2 | Cursor | Sonnet 4.5 | 12.0% | 27.1% | 6/50 |
| 3 | Codex | GLM 4.6 | 12.0% | 26.0% | 6/50 |
| 4 | Claude Code | GLM 4.6 | 10.0% | 26.7% | 5/50 |
| 5 | Claude Code | Sonnet 4.5 | 8.0% | 24.0% | 4/50 |
| 6 | Codex | Sonnet 4.5 | 8.0% | 22.0% | 4/50 |
| 7 | Claude Code | Haiku | 8.0% | 20.0% | 4/50 |
| 8 | Claude Code | Opus 4.5 | 6.0% | 21.1% | 3/50 |
| 9 | Codex | Opus 4.5 | 4.0% | 18.0% | 2/50 |
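The summary figures above can be reproduced directly from the table rows. A minimal Python sketch (not the benchmark's own tooling; the `rows` list is simply the table transcribed):

```python
# Recompute the headline stats from the leaderboard rows.
# Each tuple: (agent, model, task_success_%, test_pass_%).
rows = [
    ("Cursor", "Opus 4.5", 12.0, 28.0),
    ("Cursor", "Sonnet 4.5", 12.0, 27.1),
    ("Codex", "GLM 4.6", 12.0, 26.0),
    ("Claude Code", "GLM 4.6", 10.0, 26.7),
    ("Claude Code", "Sonnet 4.5", 8.0, 24.0),
    ("Codex", "Sonnet 4.5", 8.0, 22.0),
    ("Claude Code", "Haiku", 8.0, 20.0),
    ("Claude Code", "Opus 4.5", 6.0, 21.1),
    ("Codex", "Opus 4.5", 4.0, 18.0),
]

best_task_success = max(r[2] for r in rows)             # 12.0
best_test_pass = max(r[3] for r in rows)                # 28.0
avg_task_success = sum(r[2] for r in rows) / len(rows)  # 80 / 9 ≈ 8.9

print(f"Best Task Success: {best_task_success:.1f}%")
print(f"Best Test Pass:    {best_test_pass:.1f}%")
print(f"Avg Task Success:  {avg_task_success:.1f}%")
```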
Performance Analysis
Interactive charts (not reproduced here): Agent Comparison, By Difficulty, Performance by Category, Task Categories, Difficulty Distribution.
Task Success Rate
Percentage of tasks where the agent generated a patch that passes all test cases. A task is considered successful only if 100% of its tests pass.
Test Pass Rate
Overall percentage of individual test cases passed across all tasks. This provides a more granular view of agent performance.
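The two metrics can diverge on the same run: an agent may pass most of a task's tests yet still fail the task. A minimal Python sketch of how both rates could be computed, assuming a hypothetical `results` mapping of task IDs to per-test booleans (an illustrative data shape, not the benchmark's actual format):

```python
# Illustrative only: per-task lists of booleans, one entry per test case.
results = {
    "task_01": [True, True, True],          # all tests pass -> task solved
    "task_02": [True, True, False, False],  # partial credit on tests only
    "task_03": [False, False],              # nothing passes
}

# Task Success Rate: a task counts only if 100% of its tests pass.
solved = sum(1 for tests in results.values() if all(tests))
task_success_rate = 100 * solved / len(results)

# Test Pass Rate: pooled over individual test cases across all tasks.
total_tests = sum(len(tests) for tests in results.values())
passed_tests = sum(sum(tests) for tests in results.values())
test_pass_rate = 100 * passed_tests / total_tests

print(f"Task Success Rate: {task_success_rate:.1f}%")  # 33.3%
print(f"Test Pass Rate:    {test_pass_rate:.1f}%")     # 55.6%
```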
Task details are private. Contact us for research collaboration.
Key Insights from Paper
Agent Design Matters
The same model (Opus 4.5) achieves 12% on Cursor but only 4% on Codex—a 3× difference. Agent scaffolding is as important as model capability.
Complexity Impact
Performance drops sharply with task complexity: 18% success for 1-2 files vs. only 2% for 7+ files.
Cost-Performance Trade-off
Codex + GLM 4.6 offers the best value: 12% success at $1.30/task vs. Cursor + Opus 4.5 at $3.50/task (see the cost-per-solved-task check after these insights).
Prompt Engineering
Simple "Defensive Programming" prompts outperform complex ones by 7.4%. Complexity hurts performance.
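One way to read the cost-performance trade-off is cost per solved task rather than cost per attempt. A quick back-of-the-envelope check in Python, using only the per-task costs quoted above and the success rates from the leaderboard (actual costs will vary with pricing and task length):

```python
# Back-of-the-envelope comparison; figures taken from the insight card
# and the leaderboard table above.
configs = {
    "Codex + GLM 4.6":   {"cost_per_task": 1.30, "success_rate": 0.12},
    "Cursor + Opus 4.5": {"cost_per_task": 3.50, "success_rate": 0.12},
}

for name, c in configs.items():
    cost_per_solved = c["cost_per_task"] / c["success_rate"]
    print(f"{name}: ~${cost_per_solved:.2f} per solved task")

# Codex + GLM 4.6:   ~$10.83 per solved task
# Cursor + Opus 4.5: ~$29.17 per solved task
```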