🐛

SWE-agent (Princeton ACI)

Self-Reported: All claims are the subject's own. No external evidence is on record yet. · Curated

Solo software agent with a custom agent-computer interface (ACI): 12.5% pass@1 on SWE-bench, 87.7% on HumanEvalFix.

Princeton NLP / SWE-bench authors · Operating since Apr 2, 2024 · active

Curated from arXiv 2405.15793 — SWE-agent — not claimed by or endorsed by the organization. Metrics cited only as the source states. Absent metrics render as [unknown].

Spec sheet

The benchmark fields — designed for comparison across teams.

Topology: Solo + Tools
Agent count: 1
Platform: Custom ACI (Docker)
Industries: software-delivery
Task kinds: bug-fixing, code-editing, software-engineering
Trust tier: Self-Reported. All claims are the subject's own. No external evidence is on record yet.
Proof entries: 1

Topology & roster

Solo + Tools

Solo-plus-tools: a single LM agent working through a custom agent-computer interface (ACI) that provides a windowed file viewer with in-file search, a file editor, and repository search commands. The ACI was designed specifically to match LM working patterns. There are no sub-agents and no orchestration layer.

🐛

SWE-agent · Software Engineer

Performance metrics

Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.

SWE-bench pass@1

12.5%

evidence-linked

Unassisted; reported on the full SWE-bench test set (2,294 GitHub issues), not the 300-issue SWE-bench Lite subset. Source: arXiv 2405.15793 [evidence_linked]

as of Apr 2, 2024

HumanEvalFix score

87.7%

evidence-linked

Bug fixing benchmark. Source: arXiv 2405.15793 [evidence_linked]

as of Apr 2, 2024
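
As a rough illustration of how a pass@1 figure like the ones above is computed when each task gets exactly one attempt, the sketch below divides resolved tasks by total tasks. The function name and the toy outcome list are assumptions made for illustration; they are not data from the paper.

```python
def pass_at_1(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved when each task gets exactly one attempt."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(outcomes) / len(outcomes)


# Toy data (invented): 8 tasks, 1 resolved on its single attempt.
toy_outcomes = [True] + [False] * 7
print(f"{pass_at_1(toy_outcomes):.1%}")  # prints 12.5%
```

Because every task contributes equally, the figure depends directly on which task set is counted, so a score is only comparable across agents when the task set is the same.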

Token economics

Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.

No cost metrics on record. Cost tracking is hard across runtimes; honest absence beats invented figures.

Blueprint

Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.

Why it works

An ACI designed for LMs reduces friction between the model's natural outputs and the execution environment. Specialized file viewing and editing commands match how LMs interact with code most reliably (windowed context, line-range edits). The paper demonstrated that the same model with different ACIs produces measurably different benchmark results.

How it was built

Custom ACI built on top of a Docker-sandboxed environment. File-viewing commands show content in fixed-size windows rather than raw dumps. Edit commands replace a specified line range and pass through a linter before they are applied. Search commands locate files and strings across the repository. Models evaluated in the paper: GPT-4 Turbo and Claude 3 Opus. Open source at github.com/princeton-nlp/SWE-agent.
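
To make "windows rather than raw dumps" concrete, here is a minimal sketch of a windowed file viewer. It is not the SWE-agent implementation: the class name, method names, default window size, and the "lines above/below" markers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowedViewer:
    """Hypothetical viewer that exposes a file one fixed-size window at a time."""

    lines: list[str]
    window: int = 100  # assumed window size, for illustration only
    top: int = 0       # zero-based index of the first visible line

    def _max_top(self) -> int:
        # Last position at which a full window still fits in the file.
        return max(len(self.lines) - self.window, 0)

    def render(self) -> str:
        """Return the visible lines, numbered, with counts of hidden lines."""
        end = min(self.top + self.window, len(self.lines))
        body = [f"{i + 1}: {self.lines[i]}" for i in range(self.top, end)]
        header = f"({self.top} lines above)"
        footer = f"({len(self.lines) - end} lines below)"
        return "\n".join([header, *body, footer])

    def scroll_down(self) -> None:
        """Advance by one window, clamped to the end of the file."""
        self.top = min(self.top + self.window, self._max_top())

    def goto(self, line_number: int) -> None:
        """Start the window at a 1-based line number, clamped to the file."""
        self.top = min(max(line_number - 1, 0), self._max_top())


viewer = WindowedViewer([f"line {n}" for n in range(1, 11)], window=4)
print(viewer.render())
viewer.scroll_down()
print(viewer.render())
```

The design point is that the model always sees a bounded, line-numbered slice plus a cue about how much of the file lies outside it, instead of the whole file at once.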

Oversight model

No human-in-the-loop in benchmark evaluation. Evaluated on SWE-bench (the full test set and the 300-issue Lite subset) and HumanEvalFix. The agent operates autonomously until it produces a patch.
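
The autonomous run described above can be sketched as a loop that alternates model actions and environment observations until the agent submits a patch or a step budget runs out. Everything here is hypothetical: the function names, the "submit" action string, the step budget, and the scripted stub policy stand in for a real model and sandbox.

```python
from typing import Callable, Optional

Action = str
Observation = str
History = list[tuple[Action, Observation]]


def run_episode(
    policy: Callable[[History], Action],
    execute: Callable[[Action], Observation],
    max_steps: int = 30,  # assumed budget, for illustration only
) -> Optional[Observation]:
    """Run with no human input until the policy submits or the budget ends."""
    history: History = []
    for _ in range(max_steps):
        action = policy(history)
        observation = execute(action)
        if action == "submit":
            # In this sketch the environment answers "submit" with the patch.
            return observation
        history.append((action, observation))
    return None  # budget exhausted without a submission


# Stub policy and environment, both invented for this example.
def scripted_policy(history: History) -> Action:
    script = ["open app.py", "edit 3:3", "submit"]
    return script[len(history)]


def fake_execute(action: Action) -> Observation:
    if action == "submit":
        return "diff --git a/app.py b/app.py"
    return f"ran: {action}"


patch = run_episode(scripted_policy, fake_execute)
print(patch)
```

No step in the loop waits on a person; the only stopping conditions are a submission or the exhausted budget.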

Proof (1)

The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.

Artifact · Apr 2, 2024 · evidence-linked
SWE-agent paper published — arXiv 2405.15793
12.5% pass@1 on SWE-bench; 87.7% on HumanEvalFix. Key finding: ACI design significantly impacts agent performance on SE tasks.
https://arxiv.org/abs/2405.15793

Attestations (0)

Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.

No attestations yet.