The benchmark fields — designed for comparison across teams.
Single agent (solo-plus-tools). Claude 3.5 Sonnet operates two tools: a persistent Bash shell and a custom file editor. The model determines its own workflow freely — "the model is free to choose how it moves from step to step, rather than having strict and discrete transitions."
Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.
Score on SWE-bench Verified (500 GitHub issues): 49%, per the source. Previous SOTA: 45%. Source: https://www.anthropic.com/research/swe-bench-sonnet [evidence_linked]
Previous best on SWE-bench Verified before the Claude 3.5 Sonnet result: 45%. Source: https://www.anthropic.com/research/swe-bench-sonnet [evidence_linked]
Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.
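The difference between an untracked value and a zero value can be made explicit in a data model. The following is only a minimal sketch: the `Metric` class, its field names, and the `render` helper are hypothetical and not part of the page's actual schema. The provenance tag and the 49% figure are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric record; field names are illustrative assumptions."""
    name: str
    value: Optional[float]  # None means "not tracked", never "zero"
    unit: str
    provenance: str  # e.g. "evidence_linked"


def render(metric: Metric) -> str:
    # An untracked metric is shown as [unknown] rather than as 0.
    if metric.value is None:
        return f"{metric.name}: [unknown]"
    return f"{metric.name}: {metric.value:g}{metric.unit} [{metric.provenance}]"


if __name__ == "__main__":
    score = Metric("SWE-bench Verified score", 49.0, "%", "evidence_linked")
    cost = Metric("Cost per task", None, " USD", "evidence_linked")
    print(render(score))
    print(render(cost))
```

Using `None` instead of `0.0` for the untracked case keeps a genuine zero distinguishable from a missing measurement.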
Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.
Minimal scaffolding gives the model maximum flexibility to choose its own strategy. The persistent Bash shell maintains state across tool calls, allowing iterative debugging without losing context. According to the source, the upgraded Claude 3.5 Sonnet scored 49% on the same benchmark, up from 33% for the prior version.
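To illustrate what "persistent state across tool calls" means in practice, here is a minimal sketch of a shell wrapper that keeps one long-lived `bash` process open. It is not the tool used in the source: the `PersistentShell` class and its sentinel-based output framing are assumptions made for illustration, and the sketch assumes `bash` is on the PATH and that commands do not read from standard input.

```python
import subprocess
import uuid


class PersistentShell:
    """Hypothetical wrapper: every call goes to the same bash process."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave errors with normal output
            text=True,
            bufsize=1,
        )

    def run(self, command: str) -> str:
        # A unique sentinel line marks where this command's output ends.
        sentinel = f"__done_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\nprintf '\\n%s\\n' {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if line == "" or line.rstrip("\n") == sentinel:
                break  # EOF or end-of-command marker
            lines.append(line)
        output = "".join(lines)
        # Drop the single newline that printf added before the sentinel.
        return output[:-1] if output.endswith("\n") else output

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()


if __name__ == "__main__":
    shell = PersistentShell()
    shell.run("export BUILD_DIR=/tmp/build")  # state set in one call...
    print(shell.run("echo $BUILD_DIR"))       # ...is still visible in the next
    shell.close()
```

Because the process is never restarted between calls, environment variables, the working directory, and background state carry over, which is what makes iterative debugging possible.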
Minimal scaffolding. Two tools only: Bash (persistent state across calls) and str_replace_editor. The LLM autonomously decides when to read files, run tests, or edit code. Claude 3.5 Sonnet (upgraded version) was the model used.
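The name str_replace_editor suggests edits expressed as exact string replacements. The sketch below shows one way such an edit could work. The `str_replace` function, its signature, and the rule that the old string must match exactly once are assumptions for illustration, not a description of the tool used in the source.

```python
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file, requiring a single exact match."""
    file = Path(path)
    text = file.read_text()
    count = text.count(old)
    # Assumed safeguard: refuse ambiguous or missing matches so an edit
    # never lands in an unintended place.
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    file.write_text(text.replace(old, new, 1))
```

Requiring a unique match forces the caller to include enough surrounding context to pin down the edit location, which avoids silent edits to the wrong line.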
No human-in-the-loop described for this team design. Evaluated on the SWE-bench Verified benchmark (500 real GitHub issues).
The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.
Scored 49% on SWE-bench Verified, beating the previous SOTA of 45%. Prior Claude 3.5 Sonnet scored 33%; Claude 3 Opus scored 22%. Two tools only: Bash + str_replace_editor. Source: Anthropic research page.
https://www.anthropic.com/research/swe-bench-sonnet
Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.
No attestations yet.