5-role sequential pipeline — 22,949 tokens, 148s per software task.
The benchmark fields — designed for comparison across teams.
Sequential pipeline (chat chain) organized as 3 phases and 5 subtasks. Each subtask is a two-agent dialogue: an instructor issues directives and an assistant responds with solutions. This dual-agent structure, as opposed to a more complex multi-agent topology, is described as avoiding coordination overhead.
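A minimal sketch of how such a chain could be laid out, for illustration only. The `Subtask` type, the `make_agent` stub, and the phase and subtask names are assumptions introduced here; they are not taken from the text above or from any ChatDev code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: an agent is any function from a prompt to a reply.
Agent = Callable[[str], str]


@dataclass
class Subtask:
    phase: str          # which of the 3 phases this subtask belongs to
    name: str           # which of the 5 subtasks this is
    instructor: Agent   # issues the directive
    assistant: Agent    # responds with a solution


def make_agent(role: str) -> Agent:
    """Stub agent that echoes its role; a real agent would call a model."""
    def agent(prompt: str) -> str:
        return f"[{role}] {prompt[:40]}"
    return agent


def run_chain(task: str, chain: List[Subtask]) -> List[str]:
    """Run subtasks strictly in order, one instructor/assistant exchange each."""
    context = task
    transcript = []
    for st in chain:
        directive = st.instructor(context)    # instructor initiates
        solution = st.assistant(directive)    # assistant responds
        transcript.append(f"{st.phase}/{st.name}: {solution}")
        context = solution                    # output feeds the next subtask
    return transcript


# Illustrative names only (assumed, not from the source text).
LAYOUT = [
    ("design", "design"),
    ("coding", "code writing"),
    ("coding", "code completion"),
    ("testing", "code review"),
    ("testing", "system testing"),
]
CHAIN = [
    Subtask(phase, name, make_agent("instructor"), make_agent("assistant"))
    for phase, name in LAYOUT
]

transcript = run_chain("Build a small command-line tool", CHAIN)
```

Only two agents are active at any step, so there is no scheduling or routing logic beyond the ordered list itself.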
Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.
148.2 seconds average per software development task. Source: arXiv 2307.07924 Table 3 [evidence_linked]
Average token usage per software task: 22,949 (Table 3). Files generated: 4.39; lines of code: 144.3. Source: arXiv 2307.07924 [evidence_linked]
Executability: 0.88, vs GPT-Engineer 0.36 and MetaGPT 0.41. Source: arXiv 2307.07924 [evidence_linked]
GPT-4 evaluation: ChatDev wins 77% of pairwise comparisons against GPT-Engineer. Source: arXiv 2307.07924 [evidence_linked]
Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.
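One way to keep "not tracked" distinct from "zero" is to store the value as an optional field next to its provenance tag. This is a sketch under stated assumptions: the `Metric` class and its field names are hypothetical and do not describe how this page stores its data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str
    value: Optional[float]        # None means "not tracked", never zero
    provenance: str               # e.g. "evidence_linked" or "unknown"
    source: Optional[str] = None

    def render(self) -> str:
        # An untracked metric is shown as a hole, not as an invented number.
        if self.value is None:
            return f"{self.name}: [unknown]"
        return f"{self.name}: {self.value:g} [{self.provenance}]"


latency = Metric("seconds_per_task", 148.2, "evidence_linked", "arXiv 2307.07924")
tokens = Metric("tokens_per_task", 22949, "evidence_linked", "arXiv 2307.07924")
cost = Metric("cost_per_task", None, "unknown")  # hypothetical untracked field

lines = [m.render() for m in (latency, tokens, cost)]
```

Using `None` rather than `0` means any downstream sum or average has to handle the gap explicitly instead of silently treating it as free.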
Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.
Dual-agent dialogue (instructor + assistant) at each subtask stage enforces review before proceeding. Natural language bridging design and debugging reduces format translation errors. Communicative dehallucination is built into the dialogue structure rather than requiring separate verification agents.
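The review-before-proceeding idea can be sketched as a bounded loop in which the instructor either accepts a draft or sends back feedback. The function names, the `max_rounds` limit, and the toy agents below are assumptions made for illustration; the source does not specify how the check is implemented.

```python
from typing import Callable, Optional


def review_loop(
    directive: str,
    assistant: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Return the assistant's draft once the instructor's check accepts it.

    `check` returns None to accept, or a feedback string to request a revision.
    After `max_rounds` rejections the latest draft is returned as-is.
    """
    draft = assistant(directive)
    for _ in range(max_rounds):
        feedback = check(draft)
        if feedback is None:
            return draft
        draft = assistant(f"{directive}\nRevise: {feedback}")
    return draft


# Toy agents standing in for model calls.
def toy_assistant(prompt: str) -> str:
    if "Revise" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"   # deliberately wrong first draft


def toy_check(draft: str) -> Optional[str]:
    return "add should use +, not -" if "a - b" in draft else None


result = review_loop("Write add(a, b)", toy_assistant, toy_check)
```

Because the check sits inside the same two-agent exchange, no separate verification agent appears in the sketch.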
Chat chain organizes sequential phases and subtasks. Natural language used for design work; programming language for debugging. Executability: 0.88 vs 0.36 (GPT-Engineer) and 0.41 (MetaGPT). Quality score 0.3953 vs 0.1419 (GPT-Engineer) and 0.1523 (MetaGPT). Files generated per task: 4.39; lines of code: 144.3.
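For quick comparison, the reported scores can be turned into ratios against each baseline. The snippet below only does arithmetic on the figures quoted above; the `scores` layout and the `ratio_vs` helper are illustrative names, not part of any published tooling.

```python
# Figures as quoted above (arXiv 2307.07924).
scores = {
    "ChatDev": {"executability": 0.88, "quality": 0.3953},
    "GPT-Engineer": {"executability": 0.36, "quality": 0.1419},
    "MetaGPT": {"executability": 0.41, "quality": 0.1523},
}


def ratio_vs(baseline: str, metric: str) -> float:
    """How many times higher ChatDev scores than `baseline` on `metric`."""
    return scores["ChatDev"][metric] / scores[baseline][metric]


for baseline in ("GPT-Engineer", "MetaGPT"):
    for metric in ("executability", "quality"):
        print(f"{metric} vs {baseline}: {ratio_vs(baseline, metric):.2f}x")
```

On these numbers ChatDev's executability is a little over twice that of either baseline, and its quality score is more than 2.5 times higher.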
"Communicative dehallucination" built into the dialogue structure — the instructor role checks and redirects the assistant's outputs, reducing error propagation across phases.
The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.
Five-role sequential pipeline. Avg 22,949 tokens and 148.2 seconds per software task. Executability 0.88 vs 0.36 (GPT-Engineer). Wins 77% of comparisons vs GPT-Engineer (GPT-4 evaluation).
https://arxiv.org/abs/2307.07924
Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.
No attestations yet.