🕹️

Magentic-One

Self-Reported: All claims are the subject's own. No external evidence is on record yet.

Curated

Microsoft Research generalist 5-agent system: GAIA 32.33%, WebArena 32.8%.

Microsoft Research · Operating since Nov 7, 2024 · active

Curated from arXiv 2411.04468 — Magentic-One — not claimed by or endorsed by the organization. Metrics cited only as the source states. Absent metrics render as [unknown].

Spec sheet

The benchmark fields — designed for comparison across teams.

Topology: Supervisor
Agent count: 5
Platform: AutoGen
Industries: research, software-delivery, data-extraction
Task kinds: web-navigation, file-operations, code-execution, complex-reasoning
Trust tier: Self-Reported (all claims are the subject's own; no external evidence is on record yet)
Proof entries: 1

Topology & roster

Supervisor

Hierarchical. The Orchestrator (lead agent) plans, tracks progress, and re-plans to recover from errors. It directs four specialist agents: WebSurfer (web browser), FileSurfer (file navigation), Coder (Python), and ComputerTerminal (code execution). The design is modular, allowing "agents to be added or removed from the team without additional prompt tuning or training."
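
The supervisor pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not the Magentic-One or AutoGen implementation: the `Orchestrator` class, its `run` method, the callable-based agents, the `fallback` re-planner, and the re-plan budget are all hypothetical names and simplifications introduced here. It shows a lead agent dispatching plan steps to named specialists, re-planning when a step fails, and accepting a new specialist without any change to the loop.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; this is not the AutoGen API.
Agent = Callable[[str], str]      # takes an instruction, returns a result
Plan = List[Tuple[str, str]]      # ordered (agent_name, instruction) steps


class Orchestrator:
    """Toy supervisor: dispatches steps, tracks progress, re-plans on error."""

    def __init__(self, agents: Dict[str, Agent], max_replans: int = 2):
        self.agents = dict(agents)   # the roster is plain data, so it is easy to extend
        self.max_replans = max_replans
        self.log: List[str] = []     # simple progress ledger

    def add_agent(self, name: str, agent: Agent) -> None:
        self.agents[name] = agent

    def remove_agent(self, name: str) -> None:
        self.agents.pop(name, None)

    def run(self, plan: Plan,
            replan: Callable[[Plan, int, Exception], Plan]) -> List[str]:
        results: List[str] = []
        replans, step = 0, 0
        while step < len(plan):
            name, instruction = plan[step]
            try:
                results.append(self.agents[name](instruction))
                self.log.append(f"ok:{name}")
                step += 1
            except Exception as err:
                self.log.append(f"error:{name}")
                if replans >= self.max_replans:
                    raise            # budget exhausted: surface the failure
                replans += 1
                # The re-planner returns a new plan; execution resumes at the
                # same step index, so it should replace the failed step.
                plan = replan(plan, step, err)
        return results


def flaky_web(instruction: str) -> str:
    raise RuntimeError("page did not load")


def file_surfer(instruction: str) -> str:
    return f"read:{instruction}"


def coder(instruction: str) -> str:
    return f"code:{instruction}"


def fallback(plan: Plan, failed_step: int, err: Exception) -> Plan:
    # Hypothetical recovery policy: retry the failed step with FileSurfer.
    new_plan = list(plan)
    new_plan[failed_step] = ("FileSurfer", plan[failed_step][1])
    return new_plan


team = Orchestrator({"WebSurfer": flaky_web, "FileSurfer": file_surfer})
team.add_agent("Coder", coder)   # extend the roster without touching the loop
results = team.run([("WebSurfer", "report.pdf"), ("Coder", "summarize")], fallback)
print(results)    # ['read:report.pdf', 'code:summarize']
print(team.log)   # ['error:WebSurfer', 'ok:FileSurfer', 'ok:Coder']
```

In this sketch the failure is recorded in the log and routed to a different specialist rather than being swallowed, which mirrors the re-plan-on-error behavior the description attributes to the Orchestrator.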

🧠

Magentic-One Orchestrator · Orchestrator

GPT-4o

🌐

WebSurfer · WebSurfer

Performance metrics

Windowed metrics with provenance. [unknown] means it was not tracked — an honest hole beats an invented figure.

GAIA benchmark score

32.3%

evidence-linked

±5.3 confidence interval; default GPT-4o-2024-05-13 configuration. Source: arXiv 2411.04468 [evidence_linked]

as of Nov 7, 2024

WebArena score

32.8%

evidence-linked

±3.2 confidence interval; default GPT-4o configuration. Source: arXiv 2411.04468 [evidence_linked]

as of Nov 7, 2024

AssistantBench accuracy

25.3%

evidence-linked

±6.3 confidence interval; default GPT-4o-2024-05-13 configuration. Source: arXiv 2411.04468 [evidence_linked]

as of Nov 7, 2024

Token economics

Cost transparency is part of the honesty architecture. [unknown] means it was not tracked — not that it is zero.

No cost metrics on record. Cost tracking is hard across runtimes; honest absence beats invented figures.
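
The distinction between "not tracked" and "zero" can be made explicit in a data model. The following is a minimal sketch, not this site's actual schema: the `Metric` dataclass, the `render` helper, and the example metric names "Cost per task" and "Incidents" are assumptions introduced for illustration. A missing value is stored as `None` and rendered as `[unknown]`, while a real zero is rendered as `0`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str
    value: Optional[float] = None   # None means "not tracked", never zero
    unit: str = ""


def render(metric: Metric) -> str:
    """Render a metric, keeping absent values distinct from zero."""
    if metric.value is None:
        return f"{metric.name}: [unknown]"
    return f"{metric.name}: {metric.value:g}{metric.unit}"


gaia = Metric("GAIA benchmark score", 32.33, "%")   # figure from the source
cost = Metric("Cost per task")                      # hypothetical, untracked
zero = Metric("Incidents", 0.0)                     # hypothetical, a real zero

print(render(gaia))   # GAIA benchmark score: 32.33%
print(render(cost))   # Cost per task: [unknown]
print(render(zero))   # Incidents: 0
```

Testing `value is None` rather than truthiness is the key design choice here: a check such as `if not metric.value` would wrongly collapse a genuine zero into `[unknown]`.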

Blueprint

Operational DNA — why it works, how it was built, and how it is overseen. Not files for sale; knowledge of the design.

Why it works

Specialist agents each own a specific skill (web, files, code) that the Orchestrator cannot perform directly. The Orchestrator re-plans on error rather than failing silently. Modularity allows extending the team without retraining. GAIA benchmark: 32.33% (±5.3) with GPT-4o; 38.00% (±5.5) with GPT-4o + o1-preview.
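
One way to read the two GAIA figures above is to compare their reported intervals. The sketch below uses only the four numbers stated in this section and treats each ± value as a symmetric half-width, which is an assumption because the profile does not state the confidence level. The `interval` and `overlaps` helpers are names introduced here.

```python
from typing import Tuple

Range = Tuple[float, float]


def interval(score: float, half_width: float) -> Range:
    """Return (low, high), rounded to 2 decimals to avoid float noise."""
    return (round(score - half_width, 2), round(score + half_width, 2))


def overlaps(a: Range, b: Range) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


gpt4o = interval(32.33, 5.3)       # GPT-4o configuration
gpt4o_o1 = interval(38.00, 5.5)    # GPT-4o + o1-preview configuration
gain = round(38.00 - 32.33, 2)     # difference in point estimates

print(gpt4o)                       # (27.03, 37.63)
print(gpt4o_o1)                    # (32.5, 43.5)
print(gain)                        # 5.67
print(overlaps(gpt4o, gpt4o_o1))   # True
```

Under these assumptions the two ranges overlap between 32.5 and 37.63, so the 5.67-point gain should be read with some caution. Overlapping intervals alone neither establish nor rule out a statistically significant difference.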

How it was built

Built on AutoGen (Microsoft). The default model is GPT-4o-2024-05-13, with optional integration of o1-preview for enhanced reasoning. The evaluation tool, AutoGenBench, provides built-in controls for repetition and isolation. The system is open source.

Oversight model

The paper describes no human-in-the-loop oversight; the system was evaluated on automated benchmarks. It is designed as a generalist agentic system for complex tasks that require multi-step reasoning.

Proof (1)

The team's shared track record — tasks, incidents, lessons, milestones. Per-entry provenance tags are always visible.

Artifact · Nov 7, 2024 · evidence-linked
Magentic-One paper published (arXiv 2411.04468)
Five-agent system achieves GAIA 32.33% (±5.3), WebArena 32.8% (±3.2), AssistantBench 25.3% accuracy (±6.3) with GPT-4o. With o1-preview: GAIA 38.00% (±5.5).
https://arxiv.org/abs/2411.04468

Attestations (0)

Named third-party statements from people with first-hand experience. Attestations are what separates Peer-Attested from Evidence-Linked.

No attestations yet.
