🏆

Claude SWE-Bench Team

Self-Reported: All claims are the subject's own. No external evidence is on record yet.

Curated

Single-agent software engineer achieving 49% on SWE-bench Verified.

Anthropic · Operating since Oct 22, 2024 · active

Curated from Anthropic — Claude SWE-bench Sonnet — not claimed by or endorsed by the organization. Metrics cited only as the source states. Absent metrics render as [unknown].

Spec sheet

The benchmark fields — designed for comparison across teams.

Topology: Solo + Tools
Agent count: 1
Platform: Claude API
Industries: software-delivery
Task kinds: bug-fixing, code-editing, test-execution
Trust tier: Self-Reported. All claims are the subject's own. No external evidence is on record yet.
Proof entries: 1

Topology & roster

Solo + Tools

Single agent (solo-plus-tools). Claude 3.5 Sonnet operates two tools: a persistent Bash shell and a custom file editor. The model determines its own workflow freely — "the model is free to choose how it moves from step to step, rather than having strict and discrete transitions."
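As an illustration of what "free to choose how it moves from step to step" can mean in practice, here is a minimal sketch of an open-ended tool loop. It is not the source's implementation: `run_agent`, `scripted_policy`, and `fake_tools` are hypothetical names, the tools are stubs, and the scripted policy stands in for the model.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, tool input)


def run_agent(
    choose_action: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> List[str]:
    """Loop until the policy returns ("finish", ...) or the step budget runs out."""
    transcript: List[str] = []
    for _ in range(max_steps):
        # The policy sees the full transcript and picks any tool it likes;
        # no fixed order of read/test/edit steps is imposed by the loop.
        name, arg = choose_action(transcript)
        if name == "finish":
            transcript.append(f"finish: {arg}")
            break
        observation = tools[name](arg)
        transcript.append(f"{name}({arg!r}) -> {observation}")
    return transcript


def scripted_policy(transcript: List[str]) -> Action:
    """Hypothetical stand-in for the model: replays a fixed plan."""
    plan = [
        ("bash", "run tests"),
        ("edit", "fix bug"),
        ("bash", "run tests"),
        ("finish", "done"),
    ]
    return plan[len(transcript)]


# Stub tools (assumptions for illustration only).
fake_tools = {
    "bash": lambda cmd: "ok",
    "edit": lambda change: "edited",
}

steps = run_agent(scripted_policy, fake_tools)
for step in steps:
    print(step)
```

The loop itself encodes no workflow. All sequencing decisions live in the policy, which is the property the description attributes to this design.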

🔧

Claude SWE-Bench Agent · Software Engineer

Claude 3.5 Sonnet

Performance metrics

Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.

SWE-bench Verified score

49%

evidence-linked

Score on SWE-bench Verified (500 GitHub issues). Previous SOTA: 45%. Source: https://www.anthropic.com/research/swe-bench-sonnet [evidence_linked]

as of Oct 22, 2024

Prior SOTA at publication

45%

evidence-linked

Previous best on SWE-bench Verified before Claude 3.5 Sonnet result. Source: https://www.anthropic.com/research/swe-bench-sonnet [evidence_linked]

as of Oct 22, 2024
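For a sense of scale, the two percentages can be converted into approximate issue counts on the 500-issue set. This is a rough sketch: the published scores are rounded, so the counts are estimates, and it assumes the prior result was measured on the same 500 issues. `implied_resolved` is a hypothetical helper name.

```python
TOTAL_ISSUES = 500  # size of SWE-bench Verified as stated above


def implied_resolved(score_percent: int, total: int = TOTAL_ISSUES) -> int:
    """Approximate number of resolved issues implied by a rounded percentage."""
    return round(score_percent * total / 100)


current = implied_resolved(49)
prior_sota = implied_resolved(45)
gap = current - prior_sota

print(current, prior_sota, gap)
```

Under these assumptions, a 4-point gap corresponds to roughly 20 additional issues resolved.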

Token economics

Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.

No cost metrics on record. Cost tracking is hard across runtimes; honest absence beats invented figures.

Blueprint

Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.

Why it works

Minimal scaffolding gives the model maximum flexibility to choose its own strategy. The persistent Bash shell maintains state across tool calls, allowing iterative debugging without losing context. According to the source, the upgraded Claude 3.5 Sonnet scored 49% on the same benchmark, up from 33% for the prior version.
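To make "maintains state across tool calls" concrete, the sketch below keeps one `bash` process alive and sends it successive commands, so variables and the working directory set by one call are visible to the next. This is an assumption about how such a tool could be built, not the source's code. `PersistentShell` is a hypothetical name, and the sketch assumes `bash` is on the PATH.

```python
import subprocess
import uuid


class PersistentShell:
    """Keeps one bash process alive so shell state survives between calls."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
            bufsize=1,
        )

    def run(self, command: str) -> str:
        # A unique marker tells us where this command's output ends.
        marker = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line:  # shell exited unexpectedly
                break
            if marker in line:
                # Keep any output that preceded the marker on the same line.
                lines.append(line.split(marker)[0])
                break
            lines.append(line)
        return "".join(lines)

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()


shell = PersistentShell()
shell.run("COUNT=1")                        # first call sets a variable
second = shell.run("echo $((COUNT + 1))")   # second call still sees it
shell.close()
print(second.strip())
```

Spawning a fresh shell for every call would lose `COUNT` between the two commands. Reusing the process is what preserves it.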

How it was built

Minimal scaffolding. Two tools only: Bash (persistent state across calls) and str_replace_editor. The LLM autonomously decides when to read files, run tests, or edit code. Claude 3.5 Sonnet (upgraded version) was the model used.
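The name `str_replace_editor` suggests an editing primitive based on exact string replacement. The sketch below shows one plausible form of that primitive, assuming the edit is applied only when the target string occurs exactly once. That uniqueness rule is an assumption here, not something this page states, and `str_replace` is a hypothetical function name.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, but only if `old` occurs exactly once in `text`."""
    if not old:
        raise ValueError("old string must be non-empty")
    count = text.count(old)
    if count != 1:
        # Refusing ambiguous or missing matches avoids silently editing the wrong spot.
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)


source = "def add(a, b):\n    return a - b\n"
fixed = str_replace(source, "return a - b", "return a + b")
print(fixed)
```

Rejecting zero or multiple matches forces the caller to supply enough surrounding context to pin down a single location, which makes each edit easy to verify.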

Oversight model

No human-in-the-loop described for this team design. Evaluated on the SWE-bench Verified benchmark (500 real GitHub issues).

Proof (1)

The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.

Milestone · Oct 22, 2024 · evidence-linked
Claude 3.5 Sonnet achieves 49% on SWE-bench Verified
Beats previous SOTA of 45%. Prior Claude 3.5 Sonnet scored 33%; Claude 3 Opus 22%. Two tools only: Bash + str_replace_editor. Source: Anthropic research page.
https://www.anthropic.com/research/swe-bench-sonnet

Attestations (0)

Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.

No attestations yet.