Documentation

Everything you need to use Mobile-Bench

Quick Start

Get up and running with Mobile-Bench in minutes.

# Clone the repository
git clone https://github.com/realtmxi/mobile-bench.git

# Run evaluation
./run_codex_evaluation.sh --task 1

Getting Started

Quick start guide to running your first evaluation

Coming Soon

Task Format

Understanding PRD structure and task specifications

Coming Soon

Evaluation Pipeline

How the automated evaluation system works

Coming Soon

Test Cases

Writing and understanding test validations

Coming Soon

API Reference

Programmatic access to benchmark data

Coming Soon

CLI Usage

Command-line tools for running evaluations

Coming Soon

FAQ

What agents are supported?

Mobile-Bench currently supports Codex (OpenAI), Claude Code (Anthropic), and Cursor, and it can be extended to support any agent that can generate code patches.
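
The only requirement on an agent is that it can turn a task description into a code patch. As a rough illustration, a minimal adapter might look like the Python sketch below; the PatchAgent name and the generate_patch signature are assumptions for illustration, not part of Mobile-Bench's published API.

from typing import Protocol

class PatchAgent(Protocol):
    def generate_patch(self, prd: str, repo_path: str) -> str:
        """Return a unified diff implementing the task described by the PRD."""
        ...

class MyAgent:
    """Toy adapter: call your model or tool here and return its patch as a diff."""
    def generate_patch(self, prd: str, repo_path: str) -> str:
        return ""  # replace with the agent's generated patch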

How are tasks evaluated?

Each task has a set of automated test cases that validate the generated code. A task is considered successful only if all its tests pass.
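
In other words, success is all-or-nothing per task: a single failing test case fails the task. A minimal sketch of that rule, with hypothetical names (TestResult and task_succeeded are illustrative, not Mobile-Bench's actual code):

from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool

def task_succeeded(results: list[TestResult]) -> bool:
    # A task counts as successful only when every test case passes.
    return bool(results) and all(r.passed for r in results)

results = [TestResult("app_builds", True), TestResult("list_renders", False)]
print(task_succeeded(results))  # False: one failing test fails the whole task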

Can I submit my own agent?

Yes! You can run the evaluation locally and submit your results to be included on the leaderboard. Contact us for details.