Documentation
Everything you need to use Mobile-Bench
Quick Start
Get up and running with Mobile-Bench in minutes.
# Clone the repository and move into it
git clone https://github.com/realtmxi/mobile-bench.git
cd mobile-bench

# Run evaluation
./run_codex_evaluation.sh --task 1
Getting Started
Quick start guide to running your first evaluation
Task Format
Understanding PRD (product requirements document) structure and task specifications
Evaluation Pipeline
How the automated evaluation system works
Test Cases
Writing and understanding test validations
API Reference
Programmatic access to benchmark data
CLI Usage
Command-line tools for running evaluations
FAQ
What agents are supported?
Mobile-Bench currently supports Codex (OpenAI), Claude Code (Anthropic), and Cursor, and it can be extended to support any agent that can generate code patches.
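As a rough illustration of what "any agent that can generate code patches" means, a new agent essentially needs to be wrapped behind an interface that turns a task description into a patch. The sketch below is hypothetical: the names Task, PatchAgent, and generate_patch are assumptions for illustration, not Mobile-Bench's actual API.

# Hypothetical adapter sketch: the harness is assumed to only need a
# unified-diff patch back for each task. Not Mobile-Bench's real API.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    prd_text: str    # product requirements document describing the feature
    repo_path: str   # local checkout the agent is allowed to modify


class PatchAgent:
    """Any agent that can turn a task description into a code patch."""

    def generate_patch(self, task: Task) -> str:
        """Return a unified diff to apply against task.repo_path."""
        raise NotImplementedError


class MyCustomAgent(PatchAgent):
    def generate_patch(self, task: Task) -> str:
        # Call your model or tool of choice here and return its diff.
        return ""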
How are tasks evaluated?
Each task has a set of automated test cases that validate the generated code. A task is considered successful only if all its tests pass.
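The pass/fail rule is strict: a single failing test marks the whole task as failed. Below is a minimal sketch of that aggregation rule; the data shapes (task IDs mapped to lists of booleans) are assumptions for illustration, not the benchmark's real internals.

# Sketch of the "all tests must pass" rule described above.
def task_succeeded(test_results: list[bool]) -> bool:
    """A task counts as solved only if every one of its test cases passed."""
    return len(test_results) > 0 and all(test_results)


def success_rate(results_by_task: dict[int, list[bool]]) -> float:
    """Fraction of tasks where all tests passed."""
    solved = sum(task_succeeded(r) for r in results_by_task.values())
    return solved / len(results_by_task) if results_by_task else 0.0


# Example: task 1 passes every test, task 2 fails one, so the rate is 0.5.
print(success_rate({1: [True, True, True], 2: [True, False]}))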
Can I submit my own agent?
Yes! You can run the evaluation locally and submit your results for inclusion on the leaderboard. Contact us for details.