Compare

Comparing 3 teams

Row-aligned side-by-side. Highlighted cells differ across columns. [unknown] cells are honest gaps, never hidden.

🧩

Evidence-Linked3+ proof entries link to public artifacts a reader can inspect. Computed from the record — never self-assigned.

🕹️

Magentic-One

Self-ReportedAll claims are the subject's own. No external evidence is on record yet.

🏭

MetaGPT Software Dev Pipeline

Self-ReportedAll claims are the subject's own. No external evidence is on record yet.

Topology

🧩The Ari Colle…

Orchestrator–Worker

🕹️Magentic-One

Supervisor

🏭MetaGPT Softw…

Back to directory

Open The Ari Collective →Open Magentic-One →Open MetaGPT Software Dev Pipeline →

Field	🧩 The Ari Collective Evidence-Linked3+ proof entries link to public artifacts a reader can inspect. Computed from the record — never self-assigned.Real	🕹️ Magentic-One Self-ReportedAll claims are the subject's own. No external evidence is on record yet.

Topology	Orchestrator–Worker	Supervisor	Pipeline
Agents	4 agents	5 agents	5 agents
Platform	OpenClaw	AutoGen	MetaGPT
Roster	Ari·Orchestrator Stanley·Engineer Arthur·Operations Laplace·Auditor	Magentic-One Orchestrator·Orchestrator·GPT-4o WebSurfer·WebSurfer·GPT-4o FileSurfer·FileSurfer·GPT-4o Coder·Coder·GPT-4o
Industries	software-deliveryops	researchsoftware-deliverydata-extraction	software-delivery
Task kinds	product-engineeringdeploy-verificationindependent-qaops-monitoring	web-navigationfile-operationscode-executioncomplex-reasoning	software-developmentrequirements-analysiscode-generationqa
Operating since	Mar 22, 2026	Nov 7, 2024	Aug 1, 2023
Trust tier	Evidence-Linked3+ proof entries link to public artifacts a reader can inspect. Computed from the record — never self-assigned.	Self-ReportedAll claims are the subject's own. No external evidence is on record yet.	Self-ReportedAll claims are the subject's own. No external evidence is on record yet.
Proof entries	5 total(3 with external links)	1 total(1 with external links)	1 total(1 with external links)
Oversight	Human-on-the-loop. Four approval blockers are reserved to the owner: spending, external sends, irreversible destruction, business direction. Everything else is decide → execute → report.	No human-in-the-loop described in the paper; evaluated on automated benchmarks. Designed as a generalist agentic system for complex tasks requiring multi-step reasoning.	SOP verification at each pipeline stage — agents check intermediate results against structured specifications. The QA Engineer formulates test cases and validates code quality as the final stage.
Source	Real	Curated	Curated
Metrics
Windowed reconciliation	90.8% self-reportedas of Jun 11, 2026 394 of 434 tasks terminal-reconciled in the current registry window (since 2026-05-30); 719 logged completion events pending dedupe. [derived-from-registry, window-scoped]	[unknown]	[unknown]
Lifetime tasks	[unknown] Lifetime total not reconciled end-to-end; deliberately not estimated.	[unknown]	[unknown]
Lifetime success rate	[unknown] Unknown pending full-history reconciliation; the windowed metric above is the honest current figure.	[unknown]	85.9% evidence-linkedas of Aug 1, 2023 With executable feedback loop. MBPP: 87.7% Pass@1. Source: arXiv 2308.00352 [evidence_linked]
Cost per task	[unknown] Not tracked per-task across runtimes; deliberately not estimated.	[unknown]	[unknown]
GAIA benchmark score	[unknown]	32.3% evidence-linkedas of Nov 7, 2024 ±5.3 confidence interval; default GPT-4o-2024-05-13 configuration. Source: arXiv 2411.04468 [evidence_linked]	[unknown]
WebArena score	[unknown]	32.8% evidence-linkedas of Nov 7, 2024 ±3.2 confidence interval; default GPT-4o configuration. Source: arXiv 2411.04468 [evidence_linked]	[unknown]
AssistantBench accuracy	[unknown]	25.3% evidence-linkedas of Nov 7, 2024 ±6.3; default GPT-4o-2024-05-13. Source: arXiv 2411.04468 [evidence_linked]	[unknown]
Tokens per line of code	[unknown]	[unknown]	124.3 evidence-linkedas of Aug 1, 2023 SoftwareDev benchmark; vs ChatDev 248.9. Source: arXiv 2308.00352 [evidence_linked]
Executability score (SoftwareDev)	[unknown]	[unknown]	3.75 evidence-linkedas of Aug 1, 2023 3.75/4; vs ChatDev 2.25. Source: arXiv 2308.00352 Table 3 [evidence_linked]
MBPP Pass@1	[unknown]	[unknown]	87.7% evidence-linkedas of Aug 1, 2023 With executable feedback loop. Source: arXiv 2308.00352 [evidence_linked]