Single-agent LLMs suck at long-running complex tasks.
We’ve open-sourced a multi-agent orchestrator that we’ve been using to handle long-running LLM tasks. We found that single LLM agents tend to stall, loop, or generate non-compiling code, so we built a harness for agents to coordinate over shared context while work is in progress.
How it works:
1. Orchestrator agent that manages task decomposition
2. Sub-agents for parallel work
3. Subscriptions to task state and progress
4. Real-time sharing of intermediate discoveries between agents
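Very roughly, the loop looks like this (a minimal, self-contained Python sketch of the pattern, not the actual skill code; the real thing runs Claude Code sub-agents and a shared memory layer instead of these toy stand-ins):

```python
# Minimal sketch of the orchestration pattern (illustrative names only).
from concurrent.futures import ThreadPoolExecutor

def decompose(task):
    # Orchestrator step: break the task into independent subgoals.
    return [f"{task}::part-{i}" for i in range(3)]

def work_on(goal, shared):
    # Sub-agent step: do the work, publishing intermediate state as you go.
    shared[goal] = {"status": "in_progress"}
    result = f"solved({goal})"            # stand-in for real agent output
    shared[goal] = {"status": "done", "result": result}
    return result

def orchestrate(task):
    shared = {}                            # stand-in for the shared memory layer
    goals = decompose(task)
    with ThreadPoolExecutor() as pool:     # sub-agents working in parallel
        futures = [pool.submit(work_on, g, shared) for g in goals]
        results = [f.result() for f in futures]
    # Here the orchestrator just reads shared state at the end; the real system
    # subscribes and streams these updates while work is still in progress.
    assert all(shared[g]["status"] == "done" for g in goals)
    return results

print(orchestrate("refactor-module"))
```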
We tested this on a Putnam-level math problem, but the pattern generalizes to things like refactors, app builds, and long research. It’s packaged as a Claude Code skill and designed to be small, readable, and modifiable.
Use it, break it, and tell me what workloads we should try running next!
giancarlostoro2 days ago
I feel like there are two camps:
* Throw more agents
* Use something like Beads
I'm in the latter camp. I don't have infinite resources, so I'd rather stick to one agent and optimize what it can do. When I hit my Claude Code limit, I stop; I use Claude Code primarily for side projects.
gck1 2 days ago
Even Anthropic's research articles consistently show that they themselves use one agent and just tune the harness around it.
I ignore all Skills and MCPs and view them as distractions that consume context, which leads to worse performance. It's better to observe what the agent is doing and where it needs help, and just throw a few bits of helpful, sometimes persistent, context at it.
You can't observe what 20 agents are doing.
austinbaggioop2 days ago
For most tasks, I agree. One agent with a good harness wins. The case for multiple agents is when the context required to solve the problem exceeds what one agent can hold. This Putnam problem needed more working context than fits in a single window. Decomposing into subgoals lets each agent work with a focused context instead of one agent suffocating on state. Ideally, multi-agent approaches shouldn't add more overall complexity, but there needs to be better tooling for observation and the like, as you describe.
giancarlostoro a day ago
That's the other thing; you hit the nail on the head. I don't want 20 agents unless they're doing research and scouring code. Claude can do that just fine. I want Claude Code doing as much as I can handle, and something like Beads does it for me.
Kiboneu2 days ago
Yes, but you can observe the agent observing what 20 agents are doing! /s
Now I see why Grey Walter made artificial tortoises in the 50s - he foresaw that it would be turtles all the way down.
austinbaggioop2 days ago
Yeah, I have seen those camps too. I think there will always be a set of problems whose complexity, measured by the amount of context that has to be kept in working memory, needs more than one agent to achieve a workable or optimal result. In single-player mode, dev + Claude Code, you'll come up against these less frequently, but cross-team, cross-codebase, bigger complex problems will need more complex agent coordination.
killbot_2000 10 hours ago
Running 70+ specialized agents locally here. The key insight for me was specialization over generalization - each agent handles a narrow domain (docs, testing, deployment, etc.) rather than trying to make one super-agent do everything. The orchestration overhead is real, but Herald-style message passing between agents with clear domain boundaries has worked better than shared context approaches. The observation problem mentioned in comments is solved by logging everything to a central activity stream - you can't watch 20 agents in real-time, but you can review what happened. Curious what coordination overhead you're seeing at scale?
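In case it helps picture it, here's a toy version of what I mean by domain-scoped message passing plus a central activity stream (all names and domains are made up for illustration; the real setup is just this pattern at larger scale):

```python
# Toy sketch: one inbox per specialized agent, plus a central activity log
# that can be reviewed after the fact instead of watching agents live.
import json
import queue
import time

activity_stream = []   # central log of everything that happened

def log(agent, event, payload):
    activity_stream.append({"t": time.time(), "agent": agent,
                            "event": event, "payload": payload})

# One inbox per domain-scoped agent.
inboxes = {"docs": queue.Queue(), "testing": queue.Queue(), "deploy": queue.Queue()}

def send(domain, message):
    log("orchestrator", f"send->{domain}", message)
    inboxes[domain].put(message)

def run_agent(domain):
    # Each agent only ever sees messages addressed to its own domain.
    while not inboxes[domain].empty():
        msg = inboxes[domain].get()
        log(domain, "handled", msg)

send("testing", {"task": "add regression test for parser"})
send("docs", {"task": "document new CLI flag"})
for d in inboxes:
    run_agent(d)

print(json.dumps(activity_stream, indent=2))
```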
raniazyane2 days ago
I wonder if there’s a third camp that isn’t about agent count at all, but about decision boundaries.
At some point the interesting question isn’t whether one agent or twenty agents can coordinate better, but which decisions we’re comfortable fully delegating versus which ones feel like they need a human checkpoint.
Multi-agent systems solve coordination and memory scaling, but they also make it easier to move further away from direct human oversight. I’m curious how people here think about where that boundary should sit — especially for tasks that have real downstream consequences.
austinbaggioop2 days ago
I think about this a lot through the analogy of MoE. Essentially it's a decision-routing process: just as you route to expert submodels, you route to a human in the loop or to decision sub-tasks when the task requires it.
More specifically, we've been working on a memory/context observability agent. It's currently really good at understanding users and understanding the wide memory space. It could help with the oversight and at least the introspection part.
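Very roughly, that routing idea looks something like this (a toy Python sketch; the threshold and scoring are purely illustrative and none of this is in the released harness):

```python
# Toy sketch: route high-consequence decisions to a human checkpoint and let
# the agent execute the rest. Threshold and names are made up.
def auto_execute(decision):
    return f"executed: {decision['action']}"

def route(decision, consequence_score, human_review_queue):
    if consequence_score >= 0.8:            # big blast radius -> human checkpoint
        human_review_queue.append(decision)
        return "queued for human review"
    return auto_execute(decision)            # small blast radius -> agent handles it

review_queue = []
print(route({"action": "drop prod table"}, 0.95, review_queue))
print(route({"action": "rename local variable"}, 0.10, review_queue))
```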
joshribakoff2 days ago
Check out ThePrimeagen's 99 prompts. The idea, as I understand it, is that you scope an agent to implementing a single function at a time with firm guardrails. So it's something in between YOLO agents and rudimentary tab-complete.
clairekart2 days ago
What’s the failure mode you see with single-agent Claude Code on complex tasks? (looping, context drift, plan collapse, tool misuse?)
austinbaggioop2 days ago
All of the above. The most frustrating one in the Putnam example was Claude generating solutions that obviously didn't compile. That feels like plan collapse: not verifying its own work. I'm sure that if you just had a dumb two-model setup, it would eventually get to compiling code after n runs, but that only covers this one failure mode.
Atotalnoob2 days ago
You can use hooks to prevent it from stopping without a successful build.
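For example, a Stop hook script along these lines (a sketch, assuming the documented hook convention that exit code 2 blocks the action and feeds stderr back to the model, and using `npm run build` as a stand-in build command; check the current Claude Code hooks docs before copying):

```python
#!/usr/bin/env python3
# Sketch of a Claude Code Stop hook: refuse to let the session end while the
# build is red. Assumes exit code 2 blocks the stop and stderr is shown back
# to the model -- verify against the current hooks documentation.
import subprocess
import sys

build = subprocess.run(["npm", "run", "build"], capture_output=True, text=True)
if build.returncode != 0:
    sys.stderr.write("Build is failing; keep working.\n")
    sys.stderr.write(build.stdout[-2000:] + build.stderr[-2000:])
    sys.exit(2)   # block the stop and surface the failure to the agent
sys.exit(0)       # build is green, allow the stop
```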
miligauss2 days ago
It's more of a black box with Claude; at least with this you see the proof strategy and the mistakes the model makes when it decomposes the problem. I think instead of Ralph looping you get something that is top-down. If models were smarter and context windows bigger, I am sure complex tasks like this one would be simpler, but breaking it down into sub-agents and having a collective "we already tried this strategy and it backtracked" intelligence is a nice way to scope a limited context window to an independent sub-problem.
raphaelmolly8 2 days ago
The Lean angle here is really interesting: most multi-agent demos dodge hard verification, but tying each agent's output to Lean verification makes the feedback loop objective. Curious how you're handling goal-claim conflicts/duplication when two agents find competing tactic sequences for the same subgoal: do you keep both in memory with some ranking signal (time-to-verify, proof term size, etc.)?
austinbaggioop2 days ago
We use TTL-based claim locks so only one agent works on one goal at a time.
Failed strategies + successful tactics all get written to shared memory, so if a claim expires and a new agent picks it up, it sees everything the previous agent tried.
Ranking is first-verified-wins.
For competing decomposition strategies, we backtrack: if children fail, the goal reopens, and the failed architecture gets recorded so the next attempt avoids it.
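Concretely, the protocol is roughly this shape (a self-contained Python sketch with a plain dict standing in for the shared memory layer; key names, tactics, and the TTL are illustrative):

```python
# Toy model of the claim/backtrack protocol: TTL claim locks, a shared record
# of attempts, first-verified-wins.
import time

store = {}   # key -> value, stand-in for shared memory

def claim(goal_id, agent, ttl=300):
    key = f"goals/{goal_id}/claim"
    existing = store.get(key)
    if existing and existing["expires"] > time.time():
        return False                              # someone else holds a live claim
    store[key] = {"agent": agent, "expires": time.time() + ttl}
    return True

def record_attempt(goal_id, tactic, verified):
    store.setdefault(f"goals/{goal_id}/attempts", []).append(
        {"tactic": tactic, "verified": verified})
    # First-verified-wins: only set the result if nobody has verified one yet.
    if verified and f"goals/{goal_id}/result" not in store:
        store[f"goals/{goal_id}/result"] = tactic

def reopen_if_failed(goal_id):
    # Backtracking: no verified result yet, so clear the claim; the next agent
    # that picks the goal up can read everything previous agents tried.
    if f"goals/{goal_id}/result" not in store:
        store.pop(f"goals/{goal_id}/claim", None)

# Agent A claims the goal and fails, the goal reopens, agent B succeeds.
assert claim("putnam-sub1", "agent-A")
record_attempt("putnam-sub1", "nlinarith", verified=False)
reopen_if_failed("putnam-sub1")
assert claim("putnam-sub1", "agent-B")
record_attempt("putnam-sub1", "induction n <;> simp", verified=True)
print(store["goals/putnam-sub1/result"])
```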
hsdev a day ago
Impressive! Are there any boilerplates that people know of for running something similar to this using open offline models? Would be cool to run this (or a single agent version) on a VPS that has some leftover compute resources.
visarga2 days ago
Great work! I like the approach of maximum freedom inside bounded blast radius and how you use code to encode policy.
austinbaggioop2 days ago
Thanks! That was the goal. We want to let agents be autonomous within their scope, so they can try new paths and fail gracefully. A bad tactic just fails to compile, it can't break anything else.
yodon2 days ago
The first screen of your signup flow asks for "organization" - is that used as a username, as an organization name, or both? (I can't tell what, if anything, will be on the next screen.)
If your registration process is eventually going to ask me for a username, can the org name and user name be the same?
austinbaggioop2 days ago
We're working on improvements to make it easier to join orgs as a user so you can add friends/colleagues, but for now treat them as the same object
yodon2 days ago
When you get a chance to work on your login flow, I recommend giving users an opportunity to request the key rather than automatically showing it once only on the first screen.
I created the account from my phone, and don't have access to the dev tools I'd want to paste the key into. I can deal with it, but I don't know if I'll be able to regenerate the key if I lose it, I'd rather not store it on my phone, and I don't trust my accuracy in manually typing it in on my laptop while looking at my phone, so all the options feel not great. Again, not an actual roadblock, but still something I'd encourage fixing.
Edit added: Good thing I copied the key to my phone before writing this message. Jumping over to this page seems to have forced a refresh/logout on the Ensue page in the other tab, so my token would (I think? maybe?) be lost at this point if I'd done it in the other order.
austinbaggioop2 days ago
Ahh good call. You absolutely can generate a new key from the dashboard, so if you did lose the one generated during the quickstart, you'd be able to generate another when you log in next and go to the API keys tab.
Will make this more clear in the quickstart, thanks for the feedback
austinbaggioop2 days ago
username == orgname for now, so yes, just treat them as one and the same
yodon2 days ago
Can you add a license.txt file so we know we have permission to run this? (e.g. MIT and GPL v3 are very different)
austinbaggioop2 days ago
Oversight - added MIT. How are you thinking of using it?
yodon2 days ago
For the moment, researching multi-agent orchestration. At first glance, your work looks among the best in class of the published work I've seen. I'm particularly interested to understand the memory/communication/search model you're using, as it sounds like you're trying to think well past the GasTown/Beads/Claude-Code-Swarms concepts.
austinbaggioop2 days ago
Very kind of you to say. Our whole vision is that agents can produce way better results, compounding their intelligence, when they lean on shared memory.
I'm curious to see how it feels for you when you run it. I'm happy to help however I can.
arbol2 days ago
So is this shared memory as in RAM or a markdown file that they update with their statuses?
austinbaggioop2 days ago
I'm using "RAM" loosely, meaning working memory here. In practice, it's a key-value store with pub/sub stored on our shared memory layer, Ensue. Agents write structured state to keys like proofs/{id}/goals/{goal_id}, others subscribe via SSE. Also has embedding-based semantic search, so agents can find tactics from similar past goals.
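To make the access pattern concrete, something like this (this is NOT the Ensue SDK, just a hypothetical in-memory client with made-up write/subscribe/search methods to show how agents share structured state under those keys):

```python
# Hypothetical client, for illustration only.
class FakeMemoryClient:
    def __init__(self):
        self.kv = {}
        self.subscribers = []

    def write(self, key, value):
        self.kv[key] = value
        for prefix, callback in self.subscribers:
            if key.startswith(prefix):
                callback(key, value)       # the real thing pushes this over SSE

    def subscribe(self, prefix, callback):
        self.subscribers.append((prefix, callback))

    def search(self, query):
        # Stand-in for embedding-based semantic search over past tactics.
        return [v for v in self.kv.values() if query in str(v)]

mem = FakeMemoryClient()
mem.subscribe("proofs/42/goals/", lambda k, v: print("update:", k, v["status"]))
mem.write("proofs/42/goals/7", {"status": "claimed", "tactic": "ring_nf"})
print(mem.search("ring_nf"))
```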
christinetyip2 days ago
Cool, what’s a good first task to try this on where it’s likely to beat a single agent?
austinbaggioop2 days ago
Math proofs are really easy to run with this specific harness. Our next experiments are going to be bigger, think full-codebase refactors. We're working on applying RLM to improve context window limits so we can keep more of the actual code in RAM.
Any workloads you want to see? The best ones have a clear way to measure whether the output succeeded. We're thinking about recreating the C compiler example Anthropic did, but doing it for less than the $20k in tokens they used.
dimitri-vs2 days ago
Maybe I'm just not working on complex or big enough projects, but I haven't encountered a feature that couldn't be implemented in one or two context windows, or with vanilla Claude Code, a multi-phase plan doc, a couple of sub-agents, and a final verification pass with Codex.
I guess maybe I'm doing the orchestration manually, but I always find there are tons of decisions that need to be made in the middle of large plan implementations.
Your refactor example terrifies me because the best part of a refactor is cleaning out all the bandaid workarounds and obsolete business logic you didn't even know existed. Can't see how an agent swarm would be able to figure that out unless you provide a giga-spec file containing all current business knowledge. And if you don't spec it the agents will just eagerly bake these inefficiencies and problems into your migrated app.
miligauss2 days ago
We tried Putnam A2.
slopusila2 days ago
seems like it requires an API key to your proprietary Ensue memory system
austinbaggioop2 days ago
Yeah we're using Ensue since it already handles the annoying infra pieces you’d otherwise have to build to make this work (shared task state + updates, event streams/subscriptions, embeddings + retrieval over intermediate artifacts). You can run the example with a free key from ensue-network.ai. This repo focuses on the orchestration harness.
zmanian2 days ago
How does progress subscription work? Are agents watching specific signals (test failures, TODO list, build status), or just a global feed?
miligauss2 days ago
Claude Code doesn't support subscriptions out of the box, so we use the subscription feature to just alert the orchestrator through a single polling file. Not the most elegant thing, but still a token save over reading a bunch of sub-agent logs. It's as reactive as you can be given the current feature set of Claude Code.
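The trick is roughly this (an illustrative Python sketch; the file path and payloads are made up, and the real alerting comes from the memory layer's subscriptions):

```python
# Illustrative sketch of the single-polling-file trick.
import json
import os

ALERT_FILE = "orchestrator_alerts.jsonl"

def on_update(key, value):
    # Subscription callback: funnel every sub-agent update into one file.
    with open(ALERT_FILE, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")

def poll_alerts(seen=0):
    # The orchestrator re-reads just this file instead of every sub-agent log.
    if not os.path.exists(ALERT_FILE):
        return seen, []
    with open(ALERT_FILE) as f:
        lines = f.readlines()
    return len(lines), [json.loads(line) for line in lines[seen:]]

on_update("proofs/42/goals/7", {"status": "verified"})
seen, new_events = poll_alerts()
print(new_events)
```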