Microsoft Research generalist 5-agent system: GAIA 32.33%, WebArena 32.8%.
The benchmark fields — designed for comparison across teams.
Hierarchical. The Orchestrator (lead agent) plans, tracks progress, and re-plans to recover from errors, directing four specialist agents: WebSurfer (web browser), FileSurfer (file navigation), Coder (Python), and ComputerTerminal (code execution). Modular: the design allows "agents to be added or removed from the team without additional prompt tuning or training."
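The plan, track, and re-plan loop described above can be sketched in a few lines. This is a minimal illustration, not the Magentic-One or AutoGen implementation: the `Orchestrator` class, the `Specialist` signature, the rotation-based planner, and `max_replans` are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical specialist signature: takes a task string, returns a result
# string, and raises an exception on failure.
Specialist = Callable[[str], str]


@dataclass
class Orchestrator:
    specialists: Dict[str, Specialist]   # assumed non-empty
    max_replans: int = 2                 # assumption: bounded number of re-plans
    log: List[str] = field(default_factory=list)  # simple progress ledger

    def plan(self, attempt: int) -> List[str]:
        # Placeholder planner: try specialists in registration order and
        # rotate the starting point on every re-plan.
        names = list(self.specialists)
        shift = attempt % len(names)
        return names[shift:] + names[:shift]

    def run(self, task: str) -> str:
        for attempt in range(self.max_replans + 1):
            for name in self.plan(attempt):
                try:
                    result = self.specialists[name](task)
                except Exception as exc:
                    # Record the failure and abandon this plan so the next
                    # attempt starts from a different specialist.
                    self.log.append(f"{name} failed: {exc}")
                    break
                self.log.append(f"{name} ok")
                return result
        raise RuntimeError("task failed after re-planning")


def broken_web_surfer(task: str) -> str:
    raise ValueError("no browser")


def coder(task: str) -> str:
    return "done: " + task


team = Orchestrator({"WebSurfer": broken_web_surfer, "Coder": coder})
print(team.run("summarize report"))
print(team.log)
```

In this sketch, adding or removing a specialist is just a change to the dictionary passed to `Orchestrator`, which mirrors the modularity claim in spirit only; the real system delegates with an LLM-driven planner rather than a fixed rotation.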
Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.
GAIA: 32.33%, ±5.3 confidence interval; default GPT-4o-2024-05-13 configuration. Source: arXiv 2411.04468 [evidence_linked]
WebArena: 32.8%, ±3.2 confidence interval; default GPT-4o configuration. Source: arXiv 2411.04468 [evidence_linked]
AssistantBench: 25.3% accuracy, ±6.3 confidence interval; default GPT-4o-2024-05-13 configuration. Source: arXiv 2411.04468 [evidence_linked]
Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.
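One way to keep "not tracked" distinct from "zero" in a record is to store the value as an optional field. The sketch below is illustrative only: `CostEntry`, `render`, `total`, and the choice of US dollars are assumptions made for the example, not fields this profile defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CostEntry:
    label: str
    usd: Optional[float]  # None means "not tracked"; 0.0 means a tracked zero


def render(entry: CostEntry) -> str:
    # Show the [unknown] marker instead of silently printing a number.
    if entry.usd is None:
        return f"{entry.label}: [unknown]"
    return f"{entry.label}: ${entry.usd:.2f}"


def total(entries: List[CostEntry]) -> Optional[float]:
    # A total is only reported when every entry was actually tracked.
    tracked = [e.usd for e in entries if e.usd is not None]
    return sum(tracked) if len(tracked) == len(entries) else None


print(render(CostEntry("inference", None)))
print(render(CostEntry("storage", 0.0)))
```

Returning `None` from `total` when any entry is untracked avoids presenting a partial sum as if it were the full cost.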
Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.
Specialist agents each own a specific skill (web, files, code) that the Orchestrator cannot perform directly. The Orchestrator re-plans on error rather than failing silently. Modularity allows extending the team without retraining. GAIA benchmark: 32.33% (±5.3) with GPT-4o; 38.00% (±5.5) with GPT-4o + o1-preview.
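The two GAIA figures can be compared with simple interval arithmetic. The sketch below assumes each ± value is a symmetric margin around the reported score (this profile does not state the confidence level), and the helper names `interval` and `overlaps` are hypothetical. An overlap check of this kind is a rough reading aid, not a formal significance test.

```python
def interval(score: float, margin: float) -> tuple:
    # Assumption: the reported margin is symmetric around the score.
    return (score - margin, score + margin)


def overlaps(a: tuple, b: tuple) -> bool:
    # Two closed intervals overlap when each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]


gpt4o = interval(32.33, 5.3)      # GPT-4o
with_o1 = interval(38.00, 5.5)    # GPT-4o + o1-preview
gain = 38.00 - 32.33              # difference in percentage points

print(f"GPT-4o interval:     {gpt4o[0]:.2f} to {gpt4o[1]:.2f}")
print(f"With o1-preview:     {with_o1[0]:.2f} to {with_o1[1]:.2f}")
print(f"Gain: {gain:.2f} points; intervals overlap: {overlaps(gpt4o, with_o1)}")
```

Under these assumptions the intervals are roughly 27.03 to 37.63 and 32.50 to 43.50, so the 5.67-point gain sits inside overlapping ranges.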
Built on AutoGen (Microsoft). Default model: GPT-4o-2024-05-13, with optional integration of o1-preview for enhanced reasoning. Evaluation tool AutoGenBench provides built-in controls for repetition and isolation. Open-source.
No human-in-the-loop described in the paper; evaluated on automated benchmarks. Designed as a generalist agentic system for complex tasks requiring multi-step reasoning.
The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.
Five-agent system achieves GAIA 32.33% (±5.3), WebArena 32.8% (±3.2), AssistantBench 25.3% accuracy (±6.3) with GPT-4o. With o1-preview: GAIA 38.00% (±5.5).
https://arxiv.org/abs/2411.04468
Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.
No attestations yet.
GPT-4o