Documentation

Everything you need to know about SWE-Bench Mobile

Quick Start

Submit your agent for evaluation on SWE-Bench Mobile.

```shell
# Repository status: we are currently preparing the repo for public release.
# Please follow our project updates.

# Example command after access is granted
./run_evaluation.sh --agent codex --task 1
```

Overview

SWE-Bench Mobile is an industry-level benchmark for evaluating AI coding agents on real-world mobile development tasks. It consists of 50 tasks derived from actual product requirements at a major social media platform, with 449 human-verified test cases.

Unlike existing benchmarks that focus on isolated coding problems or bug fixes, SWE-Bench Mobile captures the full complexity of professional software engineering: multi-modal inputs (PRDs + Figma designs), a large-scale production codebase (~500K lines of Swift/Objective-C), and comprehensive testing.

Task Format

Each task consists of three components that mimic a developer's starting point:

Product Requirement Document (PRD)

Natural language descriptions of the feature to implement, including user stories, acceptance criteria, and constraints. Average length is 450 words.

Figma Design Specifications

70% of tasks include visual design specifications from Figma, containing component layout, typography, and visual details. 92% of tasks include reference images.

Production Codebase

A Git repository snapshot of ~500K lines of Swift/Objective-C code. Agents must navigate this large codebase, locate relevant files, and implement changes.

The expected output is a unified diff patch that implements the feature described in the PRD, matching the standard pull request workflow used in industry.
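To make the expected output format concrete, here is a minimal sketch of a unified diff produced with Python's standard `difflib` module. The file name and contents are invented for illustration; real tasks modify Swift/Objective-C files in the codebase snapshot.

```python
import difflib

# Hypothetical before/after versions of a file (illustrative only).
before = ['let title = "Feed"\n']
after = ['let title = "Feed"\n', 'let showBadge = true\n']

# Generate a unified diff with the conventional a/ and b/ path prefixes
# used in pull-request-style patches.
patch = "".join(difflib.unified_diff(
    before, after,
    fromfile="a/FeedView.swift",
    tofile="b/FeedView.swift",
))
print(patch)
```

The output contains the familiar `---`/`+++` file headers, an `@@` hunk header, and `+`-prefixed added lines, which is the shape every submitted patch is expected to have.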

Evaluation Pipeline

SWE-Bench Mobile uses a diff-based evaluation strategy. Tests inspect the patch text directly without compiling or running the iOS application.

1. Patch Submission: the agent generates a unified diff patch for the task.

2. Static Analysis: verify diff structure, reject empty patches, check file coverage.

3. Diff-Based Intent Tests: task-specific pytest suites verify structural intent, feature entry points, cross-file cohesion, and semantics-aware pattern matching.

4. Reporting: task-level and test-case-level results are aggregated with failure classification.
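The pipeline's early stages can be sketched as plain text checks over the patch. This is a simplified illustration, not the benchmark's actual harness; the helper names and the sample patch are invented.

```python
import re

def static_checks(patch: str) -> bool:
    """Stage 2 sketch: reject empty patches and verify unified-diff structure."""
    if not patch.strip():
        return False
    # A well-formed unified diff has ---/+++ file headers and at least one hunk.
    has_headers = "--- " in patch and "+++ " in patch
    has_hunk = re.search(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@", patch, re.M) is not None
    return has_headers and has_hunk

def touches_file(patch: str, path: str) -> bool:
    """Stage 3 sketch: a trivial diff-based intent test for file coverage."""
    return f"+++ b/{path}" in patch

patch = """--- a/FeedView.swift
+++ b/FeedView.swift
@@ -1 +1,2 @@
 let title = "Feed"
+let showBadge = true
"""
print(static_checks(patch), touches_file(patch, "FeedView.swift"))
```

Real intent tests go further (entry points, cross-file cohesion, semantics-aware matching), but they operate on the same patch text without ever building the app.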

Metrics

Task Success Rate

Percentage of tasks where all test cases pass. Computed over 50 tasks. This is the strict standard for a completed feature.

Test Pass Rate

Percentage of individual test cases passed across all tasks. Computed over 449 test cases. Reveals partial progress even when tasks are not fully completed.

When an agent fails to produce a patch (e.g., timeout or error), it is counted as failing all associated tests.
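The two metrics above can be computed as follows. This is a minimal sketch assuming per-task lists of boolean test outcomes; the task IDs and numbers are made up for the demo.

```python
def compute_metrics(results: dict) -> tuple:
    """results maps task_id -> list of per-test pass/fail booleans.

    A task where the agent produced no patch (timeout/error) is recorded
    as all-False, so it fails all associated tests automatically.
    """
    # Task Success Rate: fraction of tasks where every test passes.
    task_success = sum(all(tests) for tests in results.values()) / len(results)
    # Test Pass Rate: fraction of individual tests passed across all tasks.
    total = sum(len(tests) for tests in results.values())
    passed = sum(sum(tests) for tests in results.values())
    return task_success, passed / total

demo = {
    "task-1": [True, True, True],  # fully passing: counts toward both metrics
    "task-2": [True, False],       # partial progress: only test pass rate
    "task-3": [False, False],      # no patch produced: all tests fail
}
print(compute_metrics(demo))
```

On the demo data this yields a task success rate of 1/3 and a test pass rate of 4/7, showing how the test-level metric surfaces partial progress that the strict task-level metric hides.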

Supported Agents

We evaluate four coding agents spanning commercial and open-source systems, with 9 backbone models yielding 22 agent-model configurations.

Cursor

AI-powered code editor with agent mode

Codex

OpenAI's coding agent CLI

Claude Code

Anthropic's coding agent CLI

OpenCode

Open-source coding agent

Models tested include Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku, GLM 4.6, GLM 4.7, GPT 5, GPT 5.1, GPT 5.2, and Gemini 3 Pro.

Hosted Evaluation

The benchmark is derived from a proprietary production codebase. The full dataset is not publicly released to eliminate the risk of data contamination — a well-known issue with public benchmarks where test instances may leak into LLM training corpora.

Agent companies and model providers can submit their systems for evaluation against our held-out industrial test suite. This provides an objective, contamination-free comparison on real-world mobile development tasks.

FAQ

What agents are supported?

SWE-Bench Mobile currently supports Cursor, Codex (OpenAI), Claude Code (Anthropic), and OpenCode (open-source), and can be extended to support any agent that generates unified diff patches.

How are tasks evaluated?

Each task has a set of pytest test cases that verify the generated patch using diff-based structural analysis. A task is considered successful only if all its tests pass.

Can I submit my own agent?

Yes! We host a public leaderboard where agent companies and model providers can submit their systems for evaluation. Contact us at murphy.tian@mail.utoronto.ca for submission guidelines.

Why is the dataset private?

Keeping the test set private eliminates data contamination risks. Unlike public benchmarks where test instances may leak into LLM training corpora, our hosted evaluation ensures fair and objective comparisons.