About Mobile-Bench

An industry-level benchmark for evaluating AI coding agents on real-world mobile development tasks.

Motivation

While benchmarks like SWE-bench have been instrumental in evaluating AI coding agents on general software engineering tasks, there has been a notable gap in evaluating these agents on mobile development specifically.

Mobile development presents unique challenges that are not well represented in existing benchmarks, including platform-specific APIs, UI/UX considerations, gesture handling, and performance constraints.

Mobile-Bench addresses this gap by providing a comprehensive benchmark focused on industry-level mobile development tasks derived from real Product Requirement Documents.

Our Approach

Real-World PRDs

Each task is based on an actual product requirement document from mobile app development, including its feature specification, design references, and acceptance criteria.

Automated Evaluation

We use comprehensive test suites to automatically evaluate generated patches, ensuring consistent and reproducible results across different agents.
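As a rough illustration, a single evaluation step might look like the Python sketch below. The patch-application step, repository layout, and Gradle test command are assumptions chosen for illustration, not the actual Mobile-Bench harness.

import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, patch_file: Path) -> bool:
    """Apply an agent-generated patch to a clean task checkout and run its tests."""
    # Apply the patch; a patch that does not apply cleanly counts as a failure.
    applied = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False

    # Run the task's acceptance test suite (hypothetical Gradle entry point).
    tests = subprocess.run(
        ["./gradlew", "test"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

The same patch-then-test loop runs identically for every agent, which is what makes results comparable across tools.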

Multi-Agent Support

The benchmark supports evaluation of a range of AI coding agents, including Codex, Claude Code, Cursor, and others, enabling fair comparison across tools.

Benchmark Statistics

50 Tasks
450 Test Cases
9+ Agents Tested

Citation

@misc{mobilebench2025,
  title={Mobile-Bench: Industry-Level Benchmark for Mobile Development AI Agents},
  author={Mobile-Bench Team},
  year={2025},
  url={https://mobile-bench.dev}
}

Contact

Interested in research collaboration or evaluating your AI coding agent? We'd love to hear from you.