June 22, 2026 · 8 min read

Multi-Agent Orchestration: When It Helps and When It's Hype

Sakana's Fugu reignited the multi-agent orchestration debate. Here is what it actually solves, when it earns its complexity, and when a single model wins.

Axel Dekker

Founder & CEO

Japan's Sakana AI recently launched Fugu, a model that quietly farms each request out to a pool of other models through a single API. You send one prompt; behind the scenes a core model chooses helpers, hands them work, checks their answers, and merges the result into one response. Sakana pitched it as a way to deliver frontier capability without the risk of export controls, the same controls that recently pulled Anthropic's top models from some markets. The reception was mixed: bold benchmark claims, skeptical users who felt it did not perform at the frontier in practice, and open questions about cost and what is actually running under the hood.

We build multi-agent systems for a living, so a launch like this is less interesting for the headline than for the question it forces into the open: when does orchestrating many models actually beat using one good one? Fugu is one answer to that question. Here is how we think about it, minus the hype.

What multi-agent orchestration actually is

Multi-agent orchestration is the practice of coordinating several AI models or agents to complete a task that a single model would otherwise handle alone. A coordinator breaks the goal into steps, routes each step to the model or tool best suited to it, runs them in sequence or in parallel, and combines the results into a single answer. The user sees one clean response; the work behind it was split across specialists.
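As a rough illustration of that coordinator loop, here is a minimal Python sketch. The names `orchestrate`, `toy_plan` and `toy_workers` are hypothetical, and the workers are plain functions standing in for real model or tool calls. It shows the shape of the pattern under those assumptions, not how Fugu or any particular product implements it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# A worker is anything that takes a sub-task and returns text.
# Here they are plain functions standing in for real model or tool calls.
Worker = Callable[[str], str]
Plan = Callable[[str], List[Tuple[str, str]]]


def orchestrate(goal: str, plan: Plan, workers: Dict[str, Worker],
                merge: Callable[[List[str]], str]) -> str:
    """Break the goal into steps, route each to a worker, run in parallel, merge."""
    steps = plan(goal)  # list of (worker_name, sub_task)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(workers[name], task) for name, task in steps]
        results = [f.result() for f in futures]  # keeps the planned order
    return merge(results)


# Toy planner and workers (hypothetical, for illustration only).
def toy_plan(goal: str) -> List[Tuple[str, str]]:
    return [("summarise", goal), ("extract", goal)]


toy_workers: Dict[str, Worker] = {
    "summarise": lambda text: f"summary({len(text)} chars)",
    "extract": lambda text: f"first_word({text.split()[0]})",
}

answer = orchestrate("Refund order 1234", toy_plan, toy_workers, " | ".join)
print(answer)
```

In a real system the planner is usually itself a model call and the merge step does far more than join strings, which is where most of the design effort goes.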

The pattern is not new. Router models that send easy queries to a cheap model and hard ones to an expensive one, critique loops where one model checks another's work, and tool-calling agents that reach into your systems are all forms of orchestration. What Fugu does is hide all of that behind one endpoint, so the complexity becomes someone else's problem. If terms like agentic AI or the Model Context Protocol are new to you, our AI glossary has plain-language definitions of both.

Why labs are betting on composition

Fugu is part of a clear trend. OpenRouter's Fusion does something similar, and plenty of production systems already blend models from different providers. The strategic logic is simple: training a single model that is best at everything is enormously expensive and increasingly constrained by hardware access. Combining several capable models is a way to reach for frontier results without owning a frontier model, and, in Sakana's framing, without depending on chips and weights that a policy change can take away overnight.

That makes orchestration attractive for reasons that have nothing to do with raw capability. It is a hedge: against vendor lock-in, against a single model's blind spots, against a provider that changes its pricing or pulls a model from your region. For a business, that resilience can matter more than topping a leaderboard.

When orchestration genuinely helps

In our own work, orchestration earns its complexity in a handful of recurring situations. The first is heterogeneous tasks: a workflow that needs a fast model for routine classification, a strong reasoning model for the hard exceptions, and a cheap model for bulk summarisation. Forcing one model to do all three either overpays on the easy work or underperforms on the hard work.

The second is tool use and actions. An agent that reads a ticket, looks up an order, drafts a reply and updates a record is already orchestrating: the language model coordinates calls to your systems and decides what to do next. The third is verification. Having a second model critique or fact-check the first, or running the same task twice and comparing the results, can catch errors on high-stakes output before they ship. The fourth is cost routing: sending the easy majority of requests, say 80 percent, to a cheap model and reserving the expensive one for the 20 percent that need it. Each of these is a real engineering win, not a benchmark trick.
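The cost-routing case is easy to put numbers on. The sketch below works through the arithmetic for an 80/20 split. The per-request prices are hypothetical, chosen only to make the calculation visible, and `blended_cost` is an illustrative helper rather than part of any library. The optional `router` term is a placeholder for whatever the routing step itself costs.

```python
def blended_cost(cheap: float, expensive: float, easy_share: float,
                 router: float = 0.0) -> float:
    """Average cost per request when the easy share goes to the cheap model."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return router + easy_share * cheap + (1.0 - easy_share) * expensive


# Hypothetical prices per request, not real vendor pricing.
CHEAP, EXPENSIVE = 0.001, 0.020

routed = blended_cost(CHEAP, EXPENSIVE, easy_share=0.8)
saving = 1.0 - routed / EXPENSIVE
print(f"routed: {routed:.4f} per request, saving {saving:.0%} vs expensive-only")
```

Under these assumed prices the routed setup costs roughly a quarter of sending everything to the expensive model. The saving shrinks as the router misclassifies hard requests as easy, so the split is worth measuring rather than assuming.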

To make that concrete, take a support workflow we see often. A fast, cheap model reads every incoming ticket and classifies it. Routine questions are answered straight away from the company's own documentation. Anything ambiguous is escalated to a stronger reasoning model that can weigh the edge cases, and before a refund or a policy decision goes out, a verification step checks it against the rules. A single model could attempt all of this, but it would be slower and more expensive on the easy tickets and less reliable on the hard ones. The orchestration is what makes the economics and the accuracy work at the same time.
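A minimal sketch of that flow might look like the following. Every function here is a hypothetical stand-in: the `toy_*` helpers replace real model calls and a real rules engine, and the sketch simplifies by verifying every escalated draft rather than only refunds and policy decisions. Sending a failed check to a human is our assumption about a sensible fallback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    handled_by: str  # "cheap", "strong" or "human"
    verified: bool


def handle_ticket(ticket: str,
                  classify: Callable[[str], str],
                  answer_from_docs: Callable[[str], str],
                  reason: Callable[[str], str],
                  verify: Callable[[str, str], bool]) -> Reply:
    """Cheap model triages; routine tickets are answered, the rest escalate."""
    if classify(ticket) == "routine":
        return Reply(answer_from_docs(ticket), "cheap", verified=False)
    draft = reason(ticket)  # stronger model weighs the edge cases
    if verify(ticket, draft):  # check the draft against the rules
        return Reply(draft, "strong", verified=True)
    # Assumption: a draft that fails the check goes to a person, not the customer.
    return Reply("Escalated to a human agent.", "human", verified=False)


# Toy stand-ins (hypothetical) for the model calls and the rule check.
def toy_classify(ticket: str) -> str:
    return "routine" if "password" in ticket.lower() else "ambiguous"

def toy_docs(ticket: str) -> str:
    return "See the password reset guide."

def toy_reason(ticket: str) -> str:
    return "We will refund your order."

def toy_verify(ticket: str, draft: str) -> bool:
    # Example rule: never promise a refund on a final sale item.
    return not ("refund" in draft.lower() and "final sale" in ticket.lower())


for t in ["I forgot my password", "My order arrived damaged",
          "Refund my final sale item"]:
    r = handle_ticket(t, toy_classify, toy_docs, toy_reason, toy_verify)
    print(f"{t!r} -> {r.handled_by} (verified={r.verified})")
```

Passing the steps in as functions keeps each one replaceable, so a cheaper classifier or a stricter rule check can be swapped in without touching the flow.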

When it is hype, and one model wins

Orchestration is not free. Every extra model in the chain adds latency, adds cost, and adds a place for things to go wrong. A single capable model that answers in one call is often faster, cheaper and easier to reason about than five models passing work between them. For simple, well-defined tasks, more agents usually means more overhead for no gain.
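The overhead compounds in a way that is easy to underestimate. The sketch below is a back-of-the-envelope calculation with hypothetical numbers: each call takes 1.2 seconds and succeeds 95 percent of the time, and failures are assumed to be independent, which real systems only approximate.

```python
import math
from typing import Sequence


def sequential_latency(seconds: Sequence[float]) -> float:
    """Calls that wait on each other add up."""
    return sum(seconds)


def parallel_latency(seconds: Sequence[float]) -> float:
    """Calls that run side by side cost as much as the slowest one."""
    return max(seconds)


def chain_success(step_success: Sequence[float]) -> float:
    """Chance that every step succeeds, assuming independent failures."""
    return math.prod(step_success)


# Hypothetical figures: 1.2 s per call, 95% success per step.
one_call = sequential_latency([1.2])
five_in_sequence = sequential_latency([1.2] * 5)
five_in_parallel = parallel_latency([1.2] * 5)
five_step_success = chain_success([0.95] * 5)

print(f"one call: {one_call:.1f}s, five in sequence: {five_in_sequence:.1f}s, "
      f"five in parallel: {five_in_parallel:.1f}s")
print(f"chance all five steps succeed: {five_step_success:.2f}")
```

With these assumed figures a five-step chain takes about five times as long and completes cleanly only around three times in four, which is the gap a verification or retry step then has to close.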

There is also a quieter risk that Fugu's mixed reception illustrates well. Combining models does not automatically combine their strengths; done carelessly, it can average them into a mixture of mediocre. If the coordinator routes poorly, or the helper models disagree and the merge step papers over it, you get an answer that is confident, expensive and worse than what a single strong model would have produced. And when the underlying models are hidden behind one API, as they are with Fugu, you lose the visibility you need to debug exactly that failure. Benchmarks rarely capture this; production does.

The part the benchmarks miss

Here is the thing the leaderboard numbers will not tell you: whether a multi-agent system works in production is mostly an engineering question, not a model question. The hard parts are not the models themselves but the scaffolding around them. How do you observe what each agent did and why? How do you evaluate the merged output against real cases, not toy prompts? Where are the guardrails, and where does a human get pulled in? How is the hand-off between agents designed so context is not lost at each step?
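Evaluation against real cases can start very small. The sketch below assumes a list of (input, expected output) pairs collected from actual traffic and scores a system by case-insensitive exact match. The names `evaluate` and `toy_system` are hypothetical, and exact match is a deliberate simplification: free-text answers usually need a rubric or a grader model instead.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input, expected output)
Failure = Tuple[str, str, str]  # (input, expected, actual)


def evaluate(system: Callable[[str], str],
             cases: List[Case]) -> Tuple[float, List[Failure]]:
    """Run every case through the system and collect the mismatches."""
    if not cases:
        raise ValueError("need at least one case")
    failures: List[Failure] = []
    for prompt, expected in cases:
        got = system(prompt)
        if got.strip().lower() != expected.strip().lower():
            failures.append((prompt, expected, got))
    accuracy = 1.0 - len(failures) / len(cases)
    return accuracy, failures


# Toy system and cases (hypothetical, for illustration only).
def toy_system(prompt: str) -> str:
    return "refund" if "broken" in prompt else "no_action"


cases: List[Case] = [
    ("item arrived broken", "refund"),
    ("where is my invoice", "send_invoice"),
    ("broken screen", "refund"),
    ("just saying thanks", "no_action"),
]

accuracy, failures = evaluate(toy_system, cases)
print(f"accuracy: {accuracy:.2f}")
for prompt, expected, got in failures:
    print(f"  {prompt!r}: expected {expected}, got {got}")
```

The list of failures matters more than the score: it shows which kinds of request the merged output gets wrong, and therefore where a guardrail or a human hand-off belongs.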

Get that scaffolding right and a modest set of models can outperform a single flagship. Get it wrong and the most impressive models in the world will still produce an unreliable system. This is why we treat AI as engineering rather than magic: the value is in the disciplined design, the monitoring and the evaluation, not in the model badge on the box.

What Fugu gets right, and what to watch

Fugu's instinct is sound. Hiding orchestration behind a single API is genuinely useful, because most teams do not want to assemble and maintain a committee of models themselves. And the export-control hedge is a real, underrated benefit: a system that can swap its underlying models is far more resilient to a provider or a policy disappearing than one wired to a single flagship.

What to watch is the cost and the opacity. If you cannot see which models ran, how often the expensive ones were called, and why a given answer was produced, you cannot control your spend or debug your failures. For a quick coding question that is a fine trade. For a system that touches revenue, compliance or customers, that visibility is not optional, and it is exactly the line we would draw before putting any orchestrated black box into production.

How we approach it in production

Our default is to start with the simplest thing that could work, usually a single well-prompted model grounded in the client's data, and to add orchestration only when a specific failure mode demands it. If exceptions need deeper reasoning, we add a stronger model for those. If accuracy is critical, we add a verification step. If volume makes cost a problem, we add routing. Each agent we introduce has to earn its place by fixing a real, measured problem.

We also instrument everything from day one, so we can see which model handled what, where errors crept in, and what each call cost. And we hand the system over with the code, documentation and monitoring the client needs to run it without us. A multi-agent system you cannot see into is a liability; one you can observe and own is an asset.
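What that instrumentation records can be sketched in a few lines. The example below is a simplified illustration, not our production tooling: `Trace`, `CallRecord` and the model names are hypothetical, and the cost is a flat per-call figure passed in by the caller rather than something computed from token counts.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class CallRecord:
    model: str
    step: str
    seconds: float
    cost: float
    error: Optional[str]


@dataclass
class Trace:
    records: List[CallRecord] = field(default_factory=list)

    def call(self, model: str, step: str, fn: Callable[[str], str],
             prompt: str, cost_per_call: float) -> str:
        """Run one model call and record who handled it, how long, and the cost."""
        start = time.perf_counter()
        error: Optional[str] = None
        try:
            return fn(prompt)
        except Exception as exc:
            error = repr(exc)  # keep the failure visible, then re-raise
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.records.append(
                CallRecord(model, step, elapsed, cost_per_call, error))

    def cost_by_model(self) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        for r in self.records:
            totals[r.model] = totals.get(r.model, 0.0) + r.cost
        return totals


def flaky(prompt: str) -> str:
    raise TimeoutError("upstream timed out")


trace = Trace()
label = trace.call("cheap-model", "classify", lambda p: "routine",
                   "Reset my password", 0.001)
try:
    trace.call("strong-model", "reason", flaky, "Edge case", 0.020)
except TimeoutError:
    pass  # the failed call is still in the trace

for r in trace.records:
    print(r.model, r.step, f"{r.seconds:.4f}s", r.cost, r.error)
print(trace.cost_by_model())
```

Recording failed calls alongside successful ones is the point of the `finally` block: the calls that time out or raise are usually the ones you most need to see afterwards.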

The bottom line

Multi-agent orchestration is a powerful tool, not a default setting. Fugu and systems like it are worth watching, and the export-control hedge is a genuinely interesting angle for businesses that need resilience. But the early gap between the benchmark claims and what users report is the reminder we keep coming back to: orchestration only pays off when the architecture matches the problem and the engineering around it is sound.

If you are weighing whether a single model or a multi-agent system fits your use case, that is exactly the kind of question we help mid-market teams answer, with a working prototype on your real data rather than a benchmark on someone else's. Tell us the problem and we will tell you whether it needs one agent, several, or none at all.
