Documentation
Everything you need to know about SWE-Bench Mobile
Quick Start
Submit your agent for evaluation on SWE-Bench Mobile.
```shell
# Repository status: we are currently preparing the repo for public release.
# Please follow our project updates.

# Example command after access is granted
./run_evaluation.sh --agent codex --task 1
```
Overview
SWE-Bench Mobile is an industry-grade benchmark for evaluating AI coding agents on real-world mobile development tasks. It consists of 50 tasks derived from actual product requirements at a major social media platform, with 449 human-verified test cases.
Unlike existing benchmarks that focus on isolated coding problems or bug fixes, SWE-Bench Mobile captures the full complexity of professional software engineering: multi-modal inputs (PRDs + Figma designs), a large-scale production codebase (~500K lines of Swift/Objective-C), and comprehensive testing.
Task Format
Each task consists of three components that mimic a developer's starting point:
Product Requirement Document (PRD)
Natural language descriptions of the feature to implement, including user stories, acceptance criteria, and constraints. Average length is 450 words.
Figma Design Specifications
70% of tasks include visual design specifications from Figma, containing component layout, typography, and visual details. 92% of tasks include reference images.
Production Codebase
A Git repository snapshot of ~500K lines of Swift/Objective-C code. Agents must navigate this large codebase, locate relevant files, and implement changes.
The expected output is a unified diff patch that implements the feature described in the PRD, matching the standard pull request workflow used in industry.
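To make the expected output concrete, the snippet below generates a patch in this format with Python's standard `difflib`. The file path and the Swift change shown are hypothetical, purely for illustration:

```python
import difflib

# Hypothetical before/after versions of a Swift source file.
before = [
    "struct FeedView {",
    "    let items: [Post]",
    "}",
]
after = [
    "struct FeedView {",
    "    let items: [Post]",
    "    let showBadge: Bool  // new feature flag",
    "}",
]

# unified_diff yields the ---/+++ headers and @@ hunks of a standard patch.
patch = "".join(
    difflib.unified_diff(
        [line + "\n" for line in before],
        [line + "\n" for line in after],
        fromfile="a/Sources/Feed/FeedView.swift",
        tofile="b/Sources/Feed/FeedView.swift",
    )
)
print(patch)
```

The resulting text is exactly what `git diff` produces in a pull request, which is why it can be applied and inspected with standard tooling.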
Evaluation Pipeline
SWE-Bench Mobile uses a diff-based evaluation strategy. Tests inspect the patch text directly without compiling or running the iOS application.
Patch Submission
Agent generates a unified diff patch for the task.
Static Analysis
Verify diff structure, reject empty patches, check file coverage.
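The benchmark's actual static-analysis rules are not public; the following is a minimal sketch of what such checks could look like, assuming the patch arrives as plain unified-diff text:

```python
import re

def static_checks(patch: str) -> list[str]:
    """Return a list of problems found in a unified diff patch (empty = OK)."""
    problems = []
    if not patch.strip():
        problems.append("empty patch")
        return problems
    # A well-formed unified diff has ---/+++ file headers and @@ hunks.
    if not re.search(r"^--- ", patch, re.M) or not re.search(r"^\+\+\+ ", patch, re.M):
        problems.append("missing file headers")
    if "@@" not in patch:
        problems.append("no hunks found")
    # File coverage: which files does the patch touch?
    touched = re.findall(r"^\+\+\+ b/(\S+)", patch, re.M)
    if not touched:
        problems.append("no target files")
    return problems
```

For example, `static_checks("")` returns `["empty patch"]`, while a well-formed diff returns an empty list.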
Diff-Based Intent Tests
Task-specific pytest suites verify structural intent, feature entry points, cross-file cohesion, and semantics-aware pattern matching.
Reporting
Task-level and test-case-level results are aggregated with failure classification.
Metrics
Task Success Rate
Percentage of tasks where all test cases pass. Computed over 50 tasks. This is the strict standard for a completed feature.
Test Pass Rate
Percentage of individual test cases passed across all tasks. Computed over 449 test cases. Reveals partial progress even when tasks are not fully completed.
When an agent fails to produce a patch (e.g., timeout or error), it is counted as failing all associated tests.
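The two metrics can be sketched as follows, assuming results arrive as per-task lists of test outcomes, with a no-patch task already expanded to all-`False` per the rule above (the function name and input shape are illustrative, not the benchmark's actual API):

```python
def compute_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (task_success_rate, test_pass_rate) as percentages.

    `results` maps task id -> per-test-case outcomes. A task whose agent
    produced no patch (timeout/error) is recorded as all False.
    """
    total_tasks = len(results)
    total_tests = sum(len(v) for v in results.values())
    # Task success: every test case for the task must pass.
    solved = sum(1 for v in results.values() if v and all(v))
    # Test pass rate: individual passes across all tasks.
    passed = sum(sum(v) for v in results.values())
    return 100.0 * solved / total_tasks, 100.0 * passed / total_tests

# Example: 2 tasks, 5 test cases total.
rates = compute_metrics({
    "task-1": [True, True, True],  # fully solved
    "task-2": [True, False],       # partial progress only
})
print(rates)  # (50.0, 80.0)
```

Note how the second metric credits the partial progress on `task-2` that the strict task-level metric ignores.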
Supported Agents
We evaluate four coding agents spanning commercial and open-source systems, with 9 backbone models yielding 22 agent-model configurations.
Cursor
AI-powered code editor with agent mode
Codex
OpenAI's coding agent CLI
Claude Code
Anthropic's coding agent CLI
OpenCode
Open-source coding agent
Models tested include Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku, GLM 4.6, GLM 4.7, GPT 5, GPT 5.1, GPT 5.2, and Gemini 3 Pro.
Hosted Evaluation
The benchmark is derived from a proprietary production codebase. The full dataset is not publicly released to eliminate the risk of data contamination — a well-known issue with public benchmarks where test instances may leak into LLM training corpora.
Agent companies and model providers can submit their systems for evaluation against our held-out industrial test suite. This provides an objective, contamination-free comparison on real-world mobile development tasks.
FAQ
What agents are supported?
SWE-Bench Mobile currently supports Cursor, Codex (OpenAI), Claude Code (Anthropic), and OpenCode (open-source), and can be extended to support any agent that generates unified diff patches.
How are tasks evaluated?
Each task has a set of pytest test cases that verify the generated patch using diff-based structural analysis. A task is considered successful only if all its tests pass.
Can I submit my own agent?
Yes! We host a public leaderboard where agent companies and model providers can submit their systems for evaluation. Contact us at murphy.tian@mail.utoronto.ca for submission guidelines.
Why is the dataset private?
Keeping the test set private eliminates data contamination risks. Unlike public benchmarks where test instances may leak into LLM training corpora, our hosted evaluation ensures fair and objective comparisons.