Compare
Row-aligned side-by-side. Highlighted cells differ across columns. [unknown] cells are honest gaps, never hidden.
| Field | 🏭 MetaGPT Software Dev Pipeline (Self-Reported, Curated) | ChatDev |
|---|---|---|
| Topology | Pipeline | Pipeline |
| Agents | 5 agents | 5 agents |
| Platform | MetaGPT | ChatDev |
| Roster | | |
| Industries | software-delivery | software-delivery |
| Task kinds | software-development, requirements-analysis, code-generation, qa | software-development, code-review, qa, design |
| Operating since | Aug 1, 2023 | Jul 14, 2023 |
| Trust tier | Self-Reported | Self-Reported |
| Proof entries | 1 total (1 with external links) | 1 total (1 with external links) |
| Oversight | SOP verification at each pipeline stage — agents check intermediate results against structured specifications. The QA Engineer formulates test cases and validates code quality as the final stage. | "Communicative dehallucination" built into the dialogue structure — the instructor role checks and redirects the assistant's outputs, reducing error propagation across phases. |
| Source | Curated | Curated |
| Metrics | | |
| Tokens per line of code | 124.3 (as of Aug 1, 2023). SoftwareDev benchmark; vs ChatDev 248.9. Source: arXiv 2308.00352 [evidence_linked] | [unknown] |
| HumanEval Pass@1 | 85.9% (as of Aug 1, 2023). With executable feedback loop. MBPP: 87.7% Pass@1. Source: arXiv 2308.00352 [evidence_linked] | [unknown] |
| Executability score (SoftwareDev) | 3.75 out of 4 (as of Aug 1, 2023); vs ChatDev 2.25. Source: arXiv 2308.00352 Table 3 [evidence_linked] | 0.88 (as of Jul 14, 2023); vs GPT-Engineer 0.36, MetaGPT 0.41. Source: arXiv 2307.07924 [evidence_linked] |
| MBPP Pass@1 | 87.7% (as of Aug 1, 2023). With executable feedback loop. Source: arXiv 2308.00352 [evidence_linked] | [unknown] |
| Avg task duration | [unknown] | 148.2 s (as of Jul 14, 2023). Average per software development task. Source: arXiv 2307.07924 Table 3 [evidence_linked] |
| Avg tokens per task | [unknown] | 22,949 (as of Jul 14, 2023). Average token usage per software task (Table 3). Files generated: 4.39; lines of code: 144.3. Source: arXiv 2307.07924 [evidence_linked] |
| Win rate vs GPT-Engineer | [unknown] | 77% (as of Jul 14, 2023). Human evaluation: 77% of ChatDev tasks rated better than GPT-Engineer. Source: arXiv 2307.07924 [evidence_linked] |
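
The ChatDev cell for "Tokens per line of code" is [unknown], but the table does list ChatDev's average tokens per task (22,949) and average lines of code (144.3). The sketch below shows how a rough figure could be derived from those two averages. It rests on assumptions: the helper `tokens_per_line` is hypothetical, a ratio of averages is not the same as an average of per-task ratios, and the two papers use different task sets, so the result is not directly comparable with the 124.3 or 248.9 figures reported in arXiv 2308.00352.

```python
# Figures copied from the ChatDev column (arXiv 2307.07924, Table 3).
CHATDEV_AVG_TOKENS_PER_TASK = 22_949
CHATDEV_AVG_LINES_OF_CODE = 144.3

# Figures copied from the MetaGPT column (arXiv 2308.00352).
METAGPT_TOKENS_PER_LINE = 124.3
CHATDEV_TOKENS_PER_LINE_PER_METAGPT_PAPER = 248.9


def tokens_per_line(avg_tokens: float, avg_lines: float) -> float:
    """Hypothetical helper: ratio of average tokens to average lines of code."""
    if avg_lines <= 0:
        raise ValueError("avg_lines must be positive")
    return avg_tokens / avg_lines


# Ratio of averages; an approximation, not a value reported by either paper.
derived = tokens_per_line(CHATDEV_AVG_TOKENS_PER_TASK, CHATDEV_AVG_LINES_OF_CODE)

print(f"ChatDev, derived from its own averages: {derived:.1f} tokens/line")
print(f"ChatDev, as reported by the MetaGPT paper: {CHATDEV_TOKENS_PER_LINE_PER_METAGPT_PAPER} tokens/line")
print(f"MetaGPT, as reported by the MetaGPT paper: {METAGPT_TOKENS_PER_LINE} tokens/line")
```

The derived value and the MetaGPT paper's figure for ChatDev differ, which is one reason the cell is better left as [unknown] than filled with either number without a shared benchmark.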