Solo software agent with custom ACI — 12.5% SWE-bench, 87.7% HumanEvalFix.
The benchmark fields — designed for comparison across teams.
Solo-plus-tools. Single LM agent with a custom agent-computer interface (ACI) providing a file viewer (windowed, with search), a file editor, and fuzzy search. The ACI was designed specifically to match LM working patterns. No sub-agents or orchestration layer.
Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.
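12.5% pass@1, unassisted; SWE-bench benchmark (real GitHub issues; the paper reports this figure on the full test set, and the 300-issue subset is SWE-bench Lite). Source: arXiv 2405.15793 [evidence_linked]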
87.7% pass@1; HumanEvalFix bug-fixing benchmark. Source: arXiv 2405.15793 [evidence_linked]
Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.
Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.
An ACI designed for LMs reduces friction between the model's natural outputs and the execution environment. Specialized file viewing and editing commands match how LMs work with code (windowed context, structured diffs). The paper demonstrated that the same model paired with different ACIs produces measurably different benchmark results.
Custom ACI built on top of a Docker-sandboxed environment. File viewing commands show content in windows rather than raw dumps. Edit commands use structured diffs. Search commands support fuzzy matching. Models: Claude 3 and GPT-4 (the paper evaluates multiple models). Open source at github.com/princeton-nlp/SWE-agent.
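To make the windowed-viewing idea concrete, here is a minimal sketch of a viewer that shows a file in fixed-size windows instead of dumping it whole. It is an illustration only, not SWE-agent's implementation: the class name `WindowedViewer`, its methods, and the default window size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowedViewer:
    """Hypothetical viewer that exposes a file in fixed-size line windows."""

    lines: list[str]
    window: int = 100  # lines per window (assumed default, not from the paper)
    top: int = 0       # zero-based index of the first visible line

    def render(self) -> str:
        """Return the visible window with 1-based line numbers and context markers."""
        end = min(self.top + self.window, len(self.lines))
        out = []
        if self.top > 0:
            out.append(f"({self.top} more lines above)")
        out.extend(f"{i + 1}:{self.lines[i]}" for i in range(self.top, end))
        if end < len(self.lines):
            out.append(f"({len(self.lines) - end} more lines below)")
        return "\n".join(out)

    def scroll_down(self) -> None:
        """Advance by one window, clamped so the view never runs past the end."""
        max_top = max(len(self.lines) - self.window, 0)
        self.top = min(self.top + self.window, max_top)

    def goto(self, line_no: int) -> None:
        """Centre the window on a 1-based line number, clamped to the file bounds."""
        max_top = max(len(self.lines) - self.window, 0)
        self.top = min(max(line_no - 1 - self.window // 2, 0), max_top)
```

The "more lines above/below" markers tell the model where it is in the file without spending context on content it has not asked to see.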
No human-in-the-loop in benchmark evaluation. Evaluated on 300 issues from SWE-bench and HumanEvalFix. Agent operates autonomously until producing a patch.
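The autonomous loop described above can be sketched as follows. This is a rough sketch under stated assumptions, not the project's code: `run_episode`, the `submit ` action prefix, and the step budget are hypothetical, and the model and sandbox are passed in as plain callables.

```python
from typing import Callable, Optional


def run_episode(
    policy: Callable[[str], str],
    execute: Callable[[str], str],
    issue: str,
    max_steps: int = 50,  # assumed budget; the source does not state one
) -> Optional[str]:
    """Alternate model actions and environment feedback until a patch is submitted."""
    observation = issue
    for _ in range(max_steps):
        action = policy(observation)         # the model picks the next command
        if action.startswith("submit "):
            return action[len("submit "):]   # the patch text ends the episode
        observation = execute(action)        # the sandbox returns the command output
    return None  # budget exhausted without a patch; no human steps in


# Usage with stub callables standing in for the model and the sandbox.
script = iter(["open app.py", "edit 3:3", "submit diff --git a/app.py b/app.py"])
patch = run_episode(lambda obs: next(script), lambda cmd: f"ran: {cmd}", "Fix crash")
```

Nothing in the loop asks a person for input, which matches the no-human-in-the-loop evaluation setting.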
The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.
12.5% pass@1 on SWE-bench; 87.7% on HumanEvalFix. Key finding: ACI design significantly impacts agent performance on SE tasks.
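For reference, with a single attempt per task, pass@1 reduces to the share of tasks whose submitted patch passes the tests. The sketch below uses made-up toy outcomes, not the paper's data; `pass_at_1` is a hypothetical helper name.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Percentage of tasks whose single submitted patch passed the tests."""
    if not resolved:
        raise ValueError("no task outcomes supplied")
    return 100.0 * sum(resolved) / len(resolved)


# Toy outcomes for illustration only: 1 resolved task out of 8.
toy_outcomes = [True] + [False] * 7
print(f"{pass_at_1(toy_outcomes):.1f}%")  # prints 12.5%
```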
https://arxiv.org/abs/2405.15793
Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.
No attestations yet.