2026-03-07 · AI & Agents

State of the Art: Agent Harnesses — March 2026

This is a dated snapshot. The agent harness space moves fast enough that anything written here will be partially stale within weeks. That's fine — a specific opinion from a specific moment is more useful than a vague evergreen overview.

I run a boss/worker agent pool on a 2-core Alpine LXC. This is written from that perspective: what actually works when you're coordinating multiple Claude Code instances on real tasks, not what looks good in a demo.


The Landscape

Agent harnesses fall into three buckets right now:

  1. CLI-native agent tools — Claude Code, Codex CLI, Gemini CLI. The agent IS the harness.
  2. Framework SDKs — CrewAI, LangGraph, AutoGen, OpenAI Agents SDK. You write Python/TS that orchestrates API calls.
  3. Orchestration layers — Gas Town, Ruflo, BMAD, Claude Flow. They sit on top of CLI agents and coordinate them.

The interesting tension: category 1 is eating categories 2 and 3. Claude Code's Agent tool, worktree isolation, and hooks system give you most of what the frameworks promise, with less abstraction tax.


What Works Right Now

Claude Code (the current king)

The agent harness I'd recommend to anyone starting today. Not because it's the most feature-rich, but because it's the thinnest layer between you and useful work.

What ships natively:

What's good: Minimal abstraction. Git-native. The filesystem IS your coordination layer — no databases, no message queues, no custom protocols. Hooks give deterministic control over a probabilistic system.

What's not: Agent Teams is still experimental. Token consumption in team mode is significantly higher because each teammate maintains its own context window. Documentation lags the release cadence (8 releases in 15 days in late Feb 2026).

Links:

OpenAI Codex CLI

OpenAI's answer to Claude Code. Terminal-based, supports multi-agent via the Agents SDK.

What's interesting: Codex CLI can be exposed as an MCP server, letting you orchestrate it from the Agents SDK. Two tools: codex() and codex-reply(). This lets you build PM/Designer/Dev/Tester pipelines.

What's not: Tied to OpenAI models. The Agents SDK is the supported production path (Swarm is deprecated to "educational only"). Less filesystem-native than Claude Code — more API-oriented.

Links:


Framework SDKs: The Python Layer

These make sense when you're building an agent-powered product, not when you're using agents to do development work. They orchestrate API calls, not CLI sessions.

LangGraph

Best for: Stateful, long-running workflows with human-in-the-loop. Graph-based state machines.

30-40% lower latency than alternatives on complex workflows in benchmarks. Durable execution. Good if you need to model agent behavior as a DAG with explicit state transitions.

The catch: Complexity. You're writing graph definitions, managing state objects, and debugging execution paths. Overkill for "run 4 agents in parallel on different tasks."

Links: LangGraph docs
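LangGraph's actual API isn't reproduced here, but the pattern it formalizes — agent behavior as a state machine with explicit transitions — reduces to a small loop in plain Python. Node names and state fields below are illustrative, not LangGraph's:

```python
# Sketch of the graph/state-machine pattern LangGraph formalizes.
# Each node is a function over shared state; edges are an explicit table.

def plan(state):
    state["steps"] = ["research", "draft"]
    return state

def execute(state):
    state["done"] = list(state["steps"])
    return state

def review(state):
    state["approved"] = len(state["done"]) == len(state["steps"])
    return state

NODES = {"plan": plan, "execute": execute, "review": review}
# Explicit transition table: node -> next node (None terminates).
GRAPH = {"plan": "execute", "execute": "review", "review": None}

def run(entry, state):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = GRAPH[node]
    return state

result = run("plan", {})
```

LangGraph layers durable checkpoints, conditional edges, and human-in-the-loop interrupts on top of this core loop — which is exactly where the complexity (and the value) lives.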

CrewAI

Best for: Role-based agent teams. Fastest setup of the framework SDKs.

You can deploy a multi-agent team roughly 40% faster than with LangGraph. The role/task abstraction is intuitive, and A2A protocol support is growing.

The catch: Less control over execution flow than LangGraph. The role abstraction can fight you when tasks don't fit clean role boundaries.

Links: CrewAI docs
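The role/task abstraction boils down to a sequential pipeline where each agent's output becomes the next task's context. A conceptual sketch — these are not CrewAI's actual classes, and a real agent would call a model where `work` just formats a string:

```python
# Conceptual sketch of the role/task abstraction CrewAI popularized.
# Class and field names are illustrative, not the CrewAI API.
from dataclasses import dataclass

@dataclass
class Agent:
    role: str
    goal: str

    def work(self, task: str, context: str) -> str:
        # A real agent would call an LLM here; we just record the handoff.
        return f"[{self.role}] {task} (given: {context})"

def run_crew(agents, tasks, initial="spec"):
    # Sequential pipeline: each agent's output feeds the next task.
    context = initial
    for agent, task in zip(agents, tasks):
        context = agent.work(task, context)
    return context

crew = [Agent("analyst", "scope it"), Agent("developer", "build it")]
out = run_crew(crew, ["write requirements", "implement feature"])
```

The "role abstraction can fight you" complaint follows directly from this shape: when a task spans roles, you end up threading context through handoffs that exist only to satisfy the pipeline.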

AutoGen (Microsoft)

Best for: Conversational multi-agent — group debates, consensus-building, sequential dialogues.

The catch: Microsoft has shifted strategic focus to the broader Microsoft Agent Framework, so AutoGen now receives bug fixes and security patches only; major new feature development has stopped. The community is noticing.

Links: AutoGen repo

Verdict on framework SDKs

If you're building an agent-powered product (customer support, data pipeline, content generation), these make sense. If you're doing software development with agents, they add abstraction without adding capability over Claude Code's native tools.


Orchestration Layers

These sit on top of CLI agents (usually Claude Code) and add coordination, roles, and persistence.

Gas Town (Steve Yegge)

The most opinionated orchestrator. Manages colonies of 20-30 parallel Claude Code agents through a structured hierarchy.

Architecture: Mayor (orchestrates), Polecats (execute in parallel), Witness and Deacon (monitor health), Refinery (manages merges). Git is the persistence layer — no databases.

Philosophy: Instead of fighting chaos with structure (BMAD) or memory (Claude Flow), Gas Town embraces chaos. Git handles durability and crash recovery.

What's good: Battle-tested by Yegge. Git-native persistence. Crash-recoverable.

What's not: Complex role hierarchy. The 20-30 agent scale requires more compute than most people have. The naming convention (Polecats? Deacon?) adds cognitive overhead.

Links: steveyegge/gastown | Maggie Appleton's analysis

BMAD Method

Assembly-line orchestration. Each agent has a specialized role (Business Analyst, Architect, Developer) and produces documents that feed the next agent.

What's good: Great for upfront planning discipline. Each stage has clear inputs/outputs.

What's not: Sequential by design. Slow for iterative work. The ceremony-to-output ratio can be painful.

Links: BMAD on GitHub

Ruflo

Claims to be "the leading agent orchestration platform for Claude." Adds self-learning neural routing, MCP integration, and distributed swarm intelligence.

What's good: Native Claude Code integration via MCP. Ambitious feature set.

What's not: The marketing language is aggressive relative to the maturity. "Self-learning neural capabilities that no other agent orchestration framework offers" — be skeptical. Evaluate on your own workloads.

Links: ruvnet/ruflo

Claude Flow

Deploys 54+ specialized agents in coordinated swarms: a Queen Agent orchestrates worker agents, with a memory-heavy approach to shared state.

What's good: Parallel execution with shared knowledge. Good for projects needing persistent memory across agent sessions.

What's not: 54 agents is a lot of context windows. Token costs add up fast.

Parallel Code (johannesjo)

Practical tool for running Claude Code, Codex CLI, and Gemini CLI side by side. Auto-creates git worktrees per task, unified GUI for monitoring.

What's good: Multi-vendor agent support. Solves the "I want to use Claude AND Codex" problem. GUI monitoring.

Worth a look if you're running multiple CLI agents and want a visual overview.


Observability

You can't run agents in production without watching them. The options, from simple to enterprise:

Filesystem + hooks (what we use)

Claude Code hooks POST tool-call data to a local service. We run disler's observability dashboard — Vue + WebSocket, shows tool calls, sessions, agent swim lanes. Runs as an OpenRC service on the same LXC.

Cost: zero. Complexity: low. Good enough for 1-4 agents.
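The receiver side of this can be very small. Claude Code hooks run a command and pipe a JSON event to it on stdin; the sketch below shapes that event for a dashboard. The field names are illustrative rather than a documented schema, and the actual POST (e.g. urllib.request against the dashboard's ingest endpoint) is left as a comment:

```python
# Minimal hook-receiver sketch: parse the JSON event a hook command
# receives on stdin, keep only what the dashboard needs.
# Field names below are illustrative, not a guaranteed schema.
import json

def build_event(raw: str) -> dict:
    event = json.loads(raw)
    # Drop large tool payloads; forward just the identifying fields.
    return {
        "tool": event.get("tool_name", "unknown"),
        "session": event.get("session_id", ""),
    }

# In the real hook script you would read sys.stdin, then POST the
# result with urllib.request to the local dashboard service.
sample = '{"tool_name": "Bash", "session_id": "abc123"}'
payload = build_event(sample)
```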

Langfuse (open source)

MIT licensed, self-hosted. Trace viewing, prompt versioning, cost tracking. Captures nested traces for agent workflows — model calls, tool usage, execution paths. The go-to open-source option.

Links: langfuse.com

Helicone (open source)

AI gateway and observability platform. Proxy-based integration, built in Rust. <1ms P99 latency overhead. Good if you want a proxy layer in front of your API calls.

Links: helicone.ai

Braintrust

Production traces load in seconds. 80x faster query performance than alternatives (their claim). SDK, OpenTelemetry, or proxy integration.

Links: braintrust.dev

Datadog LLM Observability

Enterprise-grade. Auto-instruments OpenAI Agents SDK, LangGraph, CrewAI, Google ADK. Maps decision paths, traces tool calls, measures token usage. AI Agent Monitoring visualizes decision paths in interactive graphs.

The catch: It's Datadog pricing.


Protocols: MCP and A2A

Two protocols worth knowing:

MCP (Model Context Protocol) — Anthropic's protocol for agent-to-tool interactions. 97M+ monthly SDK downloads. The standard for giving agents access to external tools and data. If you're building agent tooling, support MCP.

A2A (Agent-to-Agent Protocol) — Google's protocol for agent-to-agent communication. 100+ enterprise supporters. Complements MCP: MCP handles agent-tool, A2A handles agent-agent. NIST is now involved in standards work.

Neither is controversial. Both are winning their respective lanes.
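Concretely, MCP rides on JSON-RPC 2.0, and a tool invocation uses the tools/call method. A sketch of the wire format — the tool name and arguments here are made up:

```python
# Shape of an MCP tool invocation on the wire: JSON-RPC 2.0 with the
# "tools/call" method. Tool name and arguments are illustrative.
import json

def mcp_tool_call(req_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

msg = mcp_tool_call(1, "read_file", {"path": "notes.md"})
decoded = json.loads(msg)
```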


Key Patterns

Boss/Worker with Filesystem Task Queue

This is what we run. A persistent "boss" Claude Code session in tmux polls an inbox directory, spawns workers via the Agent tool with isolation: worktree, monitors their status.json files, and harvests results when they're done.

cron (*/5) -> is boss alive? -> no -> tmux new "claude -c"
                               yes -> noop

boss (persistent):
  reads ~/tasks/inbox/ -> validates -> spawns workers -> harvests

workers (ephemeral):
  do task -> log to thread.md -> retro -> exit
  if blocked -> write question -> boss responds -> resume
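A stripped-down version of the boss's dispatch/harvest loop. The directory layout matches our setup; the Agent-tool spawn is stubbed out — here "dispatching" just creates the worker directory and an initial status file:

```python
# Stripped-down boss loop: claim tasks from the inbox, then harvest
# workers whose status.json reports completion. Real spawning goes
# through Claude Code's Agent tool; here it is stubbed.
import json
from pathlib import Path

def poll(inbox: Path, workers: Path) -> list[str]:
    harvested = []
    # Dispatch: every file in the inbox becomes a worker directory.
    for task in sorted(inbox.glob("*.md")):
        wdir = workers / task.stem
        wdir.mkdir(parents=True, exist_ok=True)
        (wdir / "status.json").write_text(json.dumps({"state": "running"}))
        task.unlink()  # claimed; the inbox stays the source of truth
    # Harvest: collect workers that report they are done.
    for status in workers.glob("*/status.json"):
        if json.loads(status.read_text()).get("state") == "done":
            harvested.append(status.parent.name)
    return harvested
```

One loop iteration per boss wake-up is enough; crash recovery falls out for free because every piece of state is a file.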

Why this works:

Why it might not work for you: Limited to the Agent tool's capabilities. Max ~4 workers on modest hardware. No fancy routing or self-learning. You have to write the orchestration logic in CLAUDE.md instructions, not code.

Fan-out with Worktree Isolation

The general pattern: one orchestrator breaks a task into subtasks, spawns N agents each in their own worktree, waits for completion, merges results.

Claude Code supports this natively. Gas Town scales it to 20-30 agents. The key insight is that git worktrees are the isolation primitive — not containers, not VMs, not separate repos.
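A minimal fan-out helper built on that insight — one worktree and one branch per subtask. The path and branch naming are our convention, nothing Claude Code mandates:

```python
# Fan-out sketch: one git worktree per subtask, each on its own branch,
# so N agents can edit the same repo without stepping on each other.
import subprocess
from pathlib import Path

def make_worktree(repo: Path, task_id: str) -> Path:
    wt = repo.parent / f"wt-{task_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add",
         "-b", f"task/{task_id}", str(wt)],
        check=True, capture_output=True,
    )
    return wt  # hand this path to the agent as its working directory
```

Merging the branches back is the orchestrator's job afterward; `git worktree remove` cleans up once a branch has landed.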

Utility Model vs Smart Model

The biggest cost lever in any agent system: route tasks to the cheapest model that can handle them.

Model        Input /1M tok   Output /1M tok   Use for
Haiku 4.5    $1              $5               Classification, extraction, simple transforms, high-volume subtasks
Sonnet 4.5   $3              $15              Most development work, planning, code generation
Opus 4.5     $5              $25              Complex reasoning, architecture decisions, novel problems

Haiku 4.5 performs within 5 percentage points of Sonnet on many benchmarks at one-third the cost (per the table above) and 2x+ the speed. A smart orchestrator uses Sonnet/Opus for planning and Haiku for execution.

In Claude Code, you set the model per subagent in the YAML frontmatter. We run workers on Sonnet for most tasks.
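The routing itself can be dumb and still save real money. A sketch — the task-class heuristic is ours, and the price constants mirror the table above rather than any live price list:

```python
# Cheapest-capable-model routing over the table above.
# Task classes and price constants are illustrative.
PRICES = {  # (input, output) in $ per 1M tokens
    "haiku-4.5": (1, 5),
    "sonnet-4.5": (3, 15),
    "opus-4.5": (5, 25),
}

def route(task_kind: str) -> str:
    # Cheapest model that can plausibly handle the task class.
    if task_kind in {"classify", "extract", "transform"}:
        return "haiku-4.5"
    if task_kind in {"plan", "architect"}:
        return "opus-4.5"
    return "sonnet-4.5"  # default: most development work

def est_cost(model: str, in_tok: int, out_tok: int) -> float:
    cin, cout = PRICES[model]
    return (in_tok * cin + out_tok * cout) / 1_000_000
```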


What We Actually Use and Why

Our setup on a 2-core, 4GB RAM Alpine LXC:

Total infrastructure: one LXC, one tmux session, one cron job, one observability service.

Why not Gas Town / BMAD / CrewAI? They solve problems we don't have yet. With 4 workers max, the boss can track everything by reading files. The complexity of a framework isn't justified until you're scaling past what filesystem coordination handles cleanly.

Why not Agent Teams? Still experimental. Higher token consumption. Our boss/worker pattern predates it and works. We'll evaluate when it stabilizes.

What would make us switch: If Agent Teams gets native task queuing and worker-to-worker messaging without the token overhead of maintaining separate context windows, that would be worth migrating to.


What's Promising but Early


What to Ignore


Summary Table

Tool                   Category        Maturity          Best For
Claude Code (native)   CLI agent       Production        Software development, general tasks
Codex CLI              CLI agent       Production        OpenAI-ecosystem development
LangGraph              Framework SDK   Production        Stateful workflows, human-in-the-loop
CrewAI                 Framework SDK   Production        Role-based teams, fast setup
AutoGen                Framework SDK   Declining         Conversational agents (legacy)
Gas Town               Orchestrator    Early production  Large-scale parallel development (20+ agents)
BMAD                   Orchestrator    Production        Waterfall-style planning discipline
Ruflo                  Orchestrator    Early             Claude Code MCP integration
Agent Teams            Native feature  Experimental      Built-in multi-agent coordination
Langfuse               Observability   Production        Self-hosted tracing (open source)
Datadog LLM Obs        Observability   Production        Enterprise monitoring

Last updated: March 7, 2026. Written from production experience running a Claude Code boss/worker pool on a homelab LXC.