About Mobile-Bench
An industry-level benchmark for evaluating AI coding agents on real-world mobile development tasks.
Motivation
While benchmarks like SWE-bench have been instrumental in evaluating AI coding agents on general software engineering tasks, mobile development has remained a notable gap.
Mobile development poses its own challenges: platform-specific APIs, UI/UX considerations, gesture handling, and performance constraints that existing benchmarks do not represent well.
Mobile-Bench addresses this gap with a comprehensive benchmark of industry-level mobile development tasks derived from real product requirement documents (PRDs).
Our Approach
Real-World PRDs
Each task is based on actual product requirement documents from mobile app development, including feature specifications, design references, and acceptance criteria.
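To make this concrete, a task record in this style of benchmark might look roughly like the sketch below. The field names are illustrative assumptions, not Mobile-Bench's published schema.

# Hypothetical sketch of a benchmark task record; all field names here
# are illustrative assumptions, not Mobile-Bench's actual schema.
from dataclasses import dataclass, field

@dataclass
class MobileBenchTask:
    task_id: str                     # unique identifier, e.g. "mb-0001"
    prd: str                         # product requirement document text
    design_refs: list[str] = field(default_factory=list)   # mockup paths/URLs
    acceptance_criteria: list[str] = field(default_factory=list)
    repo_dir: str = ""               # app repository the agent must modify
    test_cmd: str = ""               # command that runs the task's test suite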
Automated Evaluation
We use comprehensive test suites to automatically evaluate generated patches, ensuring consistent and reproducible results across different agents.
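At its core, this kind of automated evaluation reduces to applying the agent's patch to a clean checkout and running the task's tests. The following is a minimal sketch under assumed conventions (unified diffs applied with git, a per-task test command); it is not Mobile-Bench's actual harness.

import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a candidate patch, run the task's tests, and report pass/fail.

    A minimal sketch: assumes patches are unified diffs applied with git
    and that a zero exit code from test_cmd means all tests passed.
    """
    # Apply the agent-generated patch to a clean checkout of the repo.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # patch does not apply cleanly: score as a failure
    # Run the task's test suite, e.g. ["./gradlew", "test"] for an Android app.
    result = subprocess.run(test_cmd, cwd=repo_dir)
    return result.returncode == 0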
Multi-Agent Support
The benchmark supports a range of AI coding agents, including Codex, Claude Code, and Cursor, enabling fair comparison across tools.
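One common way to achieve this is a thin adapter layer: every agent is wrapped behind the same interface, so the harness stays agnostic to how each tool is invoked. A minimal sketch with hypothetical names:

from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    """Uniform interface for running heterogeneous coding agents on a task."""

    @abstractmethod
    def generate_patch(self, prd: str, repo_dir: str) -> str:
        """Run the agent on the given repo and PRD; return a unified diff."""

class NullAgent(AgentAdapter):
    """Trivial stand-in that returns an empty patch, shown only to
    illustrate the interface; a real adapter would call the agent's CLI or API."""
    def generate_patch(self, prd: str, repo_dir: str) -> str:
        return ""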
Benchmark Statistics
50 Tasks · 450 Test Cases · 9+ Agents Tested
Citation
@misc{mobilebench2025,
  title={Mobile-Bench: Industry-Level Benchmark for Mobile Development AI Agents},
  author={Mobile-Bench Team},
  year={2025},
  url={https://mobile-bench.dev}
}