2026-03-07 · AI & Agents

State of the Art: Agent Harnesses — March 2026

This is a dated snapshot. The agent harness space moves fast enough that anything written here will be partially stale within weeks. That's fine — a specific opinion from a specific moment is more useful than a vague evergreen overview.

I run a boss/worker agent pool on a 2-core Alpine LXC. This is written from that perspective: what actually works when you're coordinating multiple Claude Code instances on real tasks, not what looks good in a demo.


The Landscape

Agent harnesses fall into three buckets right now:

  1. CLI-native agent tools — Claude Code, Codex CLI, Gemini CLI. The agent IS the harness.
  2. Framework SDKs — CrewAI, LangGraph, AutoGen, OpenAI Agents SDK. You write Python/TS that orchestrates API calls.
  3. Orchestration layers — Gas Town, Ruflo, BMAD, Claude Flow. They sit on top of CLI agents and coordinate them.

The interesting tension: category 1 is eating categories 2 and 3. Claude Code's Agent tool, worktree isolation, and hooks system give you most of what the frameworks promise, with less abstraction tax.


What Works Right Now

Claude Code (the current king)

The agent harness I'd recommend to anyone starting today. Not because it's the most feature-rich, but because it's the thinnest layer between you and useful work.

What ships natively:

What's good: Minimal abstraction. Git-native. The filesystem IS your coordination layer — no databases, no message queues, no custom protocols. Hooks give deterministic control over a probabilistic system.

What's not: Agent Teams is still experimental. Token consumption in team mode is significantly higher because each teammate maintains its own context window. Documentation lags the release cadence (8 releases in 15 days in late Feb 2026).

Links:

OpenAI Codex CLI

OpenAI's answer to Claude Code. Terminal-based, supports multi-agent via the Agents SDK.

What's interesting: Codex CLI can be exposed as an MCP server, letting you orchestrate it from the Agents SDK. Two tools: codex() and codex-reply(). This lets you build PM/Designer/Dev/Tester pipelines.

What's not: Tied to OpenAI models. The Agents SDK is the supported production path (Swarm is deprecated to "educational only"). Less filesystem-native than Claude Code — more API-oriented.

Links:


Framework SDKs: The Python Layer

These make sense when you're building an agent-powered product, not when you're using agents to do development work. They orchestrate API calls, not CLI sessions.

LangGraph

Best for: Stateful, long-running workflows with human-in-the-loop. Graph-based state machines.

30-40% lower latency than alternatives on complex workflows in benchmarks. Durable execution. Good if you need to model agent behavior as a DAG with explicit state transitions.

The catch: Complexity. You're writing graph definitions, managing state objects, and debugging execution paths. Overkill for "run 4 agents in parallel on different tasks."

Links: LangGraph docs
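LangGraph's actual API isn't reproduced here, but the pattern it formalizes — agent behavior as a state machine with explicit transitions — reduces to a small loop in plain Python. Node names and state fields below are illustrative, not LangGraph's:

```python
# Sketch of the graph/state-machine pattern LangGraph formalizes.
# Each node is a function over shared state; edges are an explicit table.

def plan(state):
    state["steps"] = ["research", "draft"]
    return state

def execute(state):
    state["done"] = list(state["steps"])
    return state

def review(state):
    state["approved"] = len(state["done"]) == len(state["steps"])
    return state

NODES = {"plan": plan, "execute": execute, "review": review}
# Explicit transition table: node -> next node (None terminates).
GRAPH = {"plan": "execute", "execute": "review", "review": None}

def run(entry, state):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = GRAPH[node]
    return state

result = run("plan", {})
```

LangGraph layers durable checkpoints, conditional edges, and human-in-the-loop interrupts on top of this core loop — which is exactly where the complexity (and the value) lives.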

CrewAI

Best for: Role-based agent teams. Fastest setup of the framework SDKs.

You can deploy a multi-agent team roughly 40% faster than with LangGraph. The role/task abstraction is intuitive, and A2A protocol support is growing.

The catch: Less control over execution flow than LangGraph. The role abstraction can fight you when tasks don't fit clean role boundaries.

Links: CrewAI docs
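The role/task abstraction boils down to a sequential pipeline where each agent's output becomes the next task's context. A conceptual sketch — these are not CrewAI's actual classes, and a real agent would call a model where `work` just formats a string:

```python
# Conceptual sketch of the role/task abstraction CrewAI popularized.
# Class and field names are illustrative, not the CrewAI API.
from dataclasses import dataclass

@dataclass
class Agent:
    role: str
    goal: str

    def work(self, task: str, context: str) -> str:
        # A real agent would call an LLM here; we just record the handoff.
        return f"[{self.role}] {task} (given: {context})"

def run_crew(agents, tasks, initial="spec"):
    # Sequential pipeline: each agent's output feeds the next task.
    context = initial
    for agent, task in zip(agents, tasks):
        context = agent.work(task, context)
    return context

crew = [Agent("analyst", "scope it"), Agent("developer", "build it")]
out = run_crew(crew, ["write requirements", "implement feature"])
```

The "role abstraction can fight you" complaint follows directly from this shape: when a task spans roles, you end up threading context through handoffs that exist only to satisfy the pipeline.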

AutoGen (Microsoft)

Best for: Conversational multi-agent — group debates, consensus-building, sequential dialogues.

The catch: Microsoft has shifted strategic focus to the broader Microsoft Agent Framework, so AutoGen now receives bug fixes and security patches only; major new feature development has stopped. The community is noticing.

Links: AutoGen repo

Verdict on framework SDKs

If you're building an agent-powered product (customer support, data pipeline, content generation), these make sense. If you're doing software development with agents, they add abstraction without adding capability over Claude Code's native tools.


Orchestration Layers

These sit on top of CLI agents (usually Claude Code) and add coordination, roles, and persistence.

Gas Town (Steve Yegge)

The most opinionated orchestrator. Manages colonies of 20-30 parallel Claude Code agents through a structured hierarchy.

Architecture: Mayor (orchestrates), Polecats (execute in parallel), Witness and Deacon (monitor health), Refinery (manages merges). Git is the persistence layer — no databases.

Philosophy: Instead of fighting chaos with structure (BMAD) or memory (Claude Flow), Gas Town embraces chaos. Git handles durability and crash recovery.

What's good: Battle-tested by Yegge. Git-native persistence. Crash-recoverable.

What's not: Complex role hierarchy. The 20-30 agent scale requires more compute than most people have. The naming convention (Polecats? Deacon?) adds cognitive overhead.

Links: steveyegge/gastown | Maggie Appleton's analysis

BMAD Method

Assembly-line orchestration. Each agent has a specialized role (Business Analyst, Architect, Developer) and produces documents that feed the next agent.

What's good: Great for upfront planning discipline. Each stage has clear inputs/outputs.

What's not: Sequential by design. Slow for iterative work. The ceremony-to-output ratio can be painful.

Links: BMAD on GitHub

Ruflo

Claims to be "the leading agent orchestration platform for Claude." Adds self-learning neural routing, MCP integration, and distributed swarm intelligence.

What's good: Native Claude Code integration via MCP. Ambitious feature set.

What's not: The marketing language is aggressive relative to the maturity. "Self-learning neural capabilities that no other agent orchestration framework offers" — be skeptical. Evaluate on your own workloads.

Links: ruvnet/ruflo

Claude Flow

Deploys 54+ specialized agents in coordinated swarms: a Queen Agent orchestrates worker agents, with a memory-heavy approach to shared state.

What's good: Parallel execution with shared knowledge. Good for projects needing persistent memory across agent sessions.

What's not: 54 agents is a lot of context windows. Token costs add up fast.

Parallel Code (johannesjo)

Practical tool for running Claude Code, Codex CLI, and Gemini CLI side by side. Auto-creates git worktrees per task, unified GUI for monitoring.

What's good: Multi-vendor agent support. Solves the "I want to use Claude AND Codex" problem. GUI monitoring.

Worth a look if you're running multiple CLI agents and want a visual overview.


Observability

You can't run agents in production without watching them. The options, from simple to enterprise:

Filesystem + hooks (what we use)

Claude Code hooks POST tool-call data to a local service. We run disler's observability dashboard — Vue + WebSocket, shows tool calls, sessions, agent swim lanes. Runs as an OpenRC service on the same LXC.

Cost: zero. Complexity: low. Good enough for 1-4 agents.
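The receiver side of this can be very small. Claude Code hooks run a command and pipe a JSON event to it on stdin; the sketch below shapes that event for a dashboard. The field names are illustrative rather than a documented schema, and the actual POST (e.g. urllib.request against the dashboard's ingest endpoint) is left as a comment:

```python
# Minimal hook-receiver sketch: parse the JSON event a hook command
# receives on stdin, keep only what the dashboard needs.
# Field names below are illustrative, not a guaranteed schema.
import json

def build_event(raw: str) -> dict:
    event = json.loads(raw)
    # Drop large tool payloads; forward just the identifying fields.
    return {
        "tool": event.get("tool_name", "unknown"),
        "session": event.get("session_id", ""),
    }

# In the real hook script you would read sys.stdin, then POST the
# result with urllib.request to the local dashboard service.
sample = '{"tool_name": "Bash", "session_id": "abc123"}'
payload = build_event(sample)
```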

Langfuse (open source)

MIT licensed, self-hosted. Trace viewing, prompt versioning, cost tracking. Captures nested traces for agent workflows — model calls, tool usage, execution paths. The go-to open-source option.

Links: langfuse.com

Helicone (open source)

AI gateway and observability platform. Proxy-based integration, built in Rust. <1ms P99 latency overhead. Good if you want a proxy layer in front of your API calls.

Links: helicone.ai

Braintrust

Production traces load in seconds. 80x faster query performance than alternatives (their claim). SDK, OpenTelemetry, or proxy integration.

Links: braintrust.dev

Datadog LLM Observability

Enterprise-grade. Auto-instruments OpenAI Agents SDK, LangGraph, CrewAI, Google ADK. Maps decision paths, traces tool calls, measures token usage. AI Agent Monitoring visualizes decision paths in interactive graphs.

The catch: It's Datadog pricing.


Protocols: MCP and A2A

Two protocols worth knowing:

MCP (Model Context Protocol) — Anthropic's protocol for agent-to-tool interactions. 97M+ monthly SDK downloads. The standard for giving agents access to external tools and data. If you're building agent tooling, support MCP.

A2A (Agent-to-Agent Protocol) — Google's protocol for agent-to-agent communication. 100+ enterprise supporters. Complements MCP: MCP handles agent-tool, A2A handles agent-agent. NIST is now involved in standards work.

Neither is controversial. Both are winning their respective lanes.
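Concretely, MCP rides on JSON-RPC 2.0, and a tool invocation uses the tools/call method. A sketch of the wire format — the tool name and arguments here are made up:

```python
# Shape of an MCP tool invocation on the wire: JSON-RPC 2.0 with the
# "tools/call" method. Tool name and arguments are illustrative.
import json

def mcp_tool_call(req_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

msg = mcp_tool_call(1, "read_file", {"path": "notes.md"})
decoded = json.loads(msg)
```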


Key Patterns

Boss/Worker with Filesystem Task Queue

This is what we run. A persistent "boss" Claude Code session in tmux polls an inbox directory, spawns workers via the Agent tool with isolation: worktree, monitors their status.json files, and harvests results when they're done.

cron (*/5) -> is boss alive? -> no -> tmux new "claude -c"
                               yes -> noop

boss (persistent):
  reads ~/tasks/inbox/ -> validates -> spawns workers -> harvests

workers (ephemeral):
  do task -> log to thread.md -> retro -> exit
  if blocked -> write question -> boss responds -> resume
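A stripped-down version of the boss's dispatch/harvest loop. The directory layout matches our setup; the Agent-tool spawn is stubbed out — here "dispatching" just creates the worker directory and an initial status file:

```python
# Stripped-down boss loop: claim tasks from the inbox, then harvest
# workers whose status.json reports completion. Real spawning goes
# through Claude Code's Agent tool; here it is stubbed.
import json
from pathlib import Path

def poll(inbox: Path, workers: Path) -> list[str]:
    harvested = []
    # Dispatch: every file in the inbox becomes a worker directory.
    for task in sorted(inbox.glob("*.md")):
        wdir = workers / task.stem
        wdir.mkdir(parents=True, exist_ok=True)
        (wdir / "status.json").write_text(json.dumps({"state": "running"}))
        task.unlink()  # claimed; the inbox stays the source of truth
    # Harvest: collect workers that report they are done.
    for status in workers.glob("*/status.json"):
        if json.loads(status.read_text()).get("state") == "done":
            harvested.append(status.parent.name)
    return harvested
```

One loop iteration per boss wake-up is enough; crash recovery falls out for free because every piece of state is a file.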

Why this works:

Why it might not work for you: Limited to the Agent tool's capabilities. Max ~4 workers on modest hardware. No fancy routing or self-learning. You have to write the orchestration logic in CLAUDE.md instructions, not code.

Fan-out with Worktree Isolation

The general pattern: one orchestrator breaks a task into subtasks, spawns N agents each in their own worktree, waits for completion, merges results.

Claude Code supports this natively. Gas Town scales it to 20-30 agents. The key insight is that git worktrees are the isolation primitive — not containers, not VMs, not separate repos.
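A minimal fan-out helper built on that insight — one worktree and one branch per subtask. The path and branch naming are our convention, nothing Claude Code mandates:

```python
# Fan-out sketch: one git worktree per subtask, each on its own branch,
# so N agents can edit the same repo without stepping on each other.
import subprocess
from pathlib import Path

def make_worktree(repo: Path, task_id: str) -> Path:
    wt = repo.parent / f"wt-{task_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add",
         "-b", f"task/{task_id}", str(wt)],
        check=True, capture_output=True,
    )
    return wt  # hand this path to the agent as its working directory
```

Merging the branches back is the orchestrator's job afterward; `git worktree remove` cleans up once a branch has landed.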

Utility Model vs Smart Model

The biggest cost lever in any agent system: route tasks to the cheapest model that can handle them.

Model        Input /1M tok   Output /1M tok   Use for
Haiku 4.5    $1              $5               Classification, extraction, simple transforms, high-volume subtasks
Sonnet 4.5   $3              $15              Most development work, planning, code generation
Opus 4.5     $5              $25              Complex reasoning, architecture decisions, novel problems

Haiku 4.5 performs within 5 percentage points of Sonnet on many benchmarks at one-third the cost (per the table above) and 2x+ the speed. A smart orchestrator uses Sonnet/Opus for planning and Haiku for execution.

In Claude Code, you set the model per subagent in the YAML frontmatter. We run workers on Sonnet for most tasks.
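The routing itself can be dumb and still save real money. A sketch — the task-class heuristic is ours, and the price constants mirror the table above rather than any live price list:

```python
# Cheapest-capable-model routing over the table above.
# Task classes and price constants are illustrative.
PRICES = {  # (input, output) in $ per 1M tokens
    "haiku-4.5": (1, 5),
    "sonnet-4.5": (3, 15),
    "opus-4.5": (5, 25),
}

def route(task_kind: str) -> str:
    # Cheapest model that can plausibly handle the task class.
    if task_kind in {"classify", "extract", "transform"}:
        return "haiku-4.5"
    if task_kind in {"plan", "architect"}:
        return "opus-4.5"
    return "sonnet-4.5"  # default: most development work

def est_cost(model: str, in_tok: int, out_tok: int) -> float:
    cin, cout = PRICES[model]
    return (in_tok * cin + out_tok * cout) / 1_000_000
```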


What We Actually Use and Why

Our setup on a 2-core, 4GB RAM Alpine LXC:

Total infrastructure: one LXC, one tmux session, one cron job, one observability service.

Why not Gas Town / BMAD / CrewAI? They solve problems we don't have yet. With 4 workers max, the boss can track everything by reading files. The complexity of a framework isn't justified until you're scaling past what filesystem coordination handles cleanly.

Why not Agent Teams? Still experimental. Higher token consumption. Our boss/worker pattern predates it and works. We'll evaluate when it stabilizes.

What would make us switch: If Agent Teams gets native task queuing and worker-to-worker messaging without the token overhead of maintaining separate context windows, that would be worth migrating to.


What's Promising but Early


What to Ignore


Summary Table

Tool                   Category        Maturity          Best For
Claude Code (native)   CLI agent       Production        Software development, general tasks
Codex CLI              CLI agent       Production        OpenAI-ecosystem development
LangGraph              Framework SDK   Production        Stateful workflows, human-in-the-loop
CrewAI                 Framework SDK   Production        Role-based teams, fast setup
AutoGen                Framework SDK   Declining         Conversational agents (legacy)
Gas Town               Orchestrator    Early production  Large-scale parallel development (20+ agents)
BMAD                   Orchestrator    Production        Waterfall-style planning discipline
Ruflo                  Orchestrator    Early             Claude Code MCP integration
Agent Teams            Native feature  Experimental      Built-in multi-agent coordination
Langfuse               Observability   Production        Self-hosted tracing (open source)
Datadog LLM Obs        Observability   Production        Enterprise monitoring

Last updated: March 7, 2026. Written from production experience running a Claude Code boss/worker pool on a homelab LXC.