2026-03-12 · AI & Agents

Best Ways for LLMs to Use Browsers — March 2026

Every agent eventually needs to use a browser. Filling forms, scraping data, navigating UIs that don't have APIs — the web is still the universal integration layer, and LLMs need to interact with it.

The space has matured fast. A year ago you were choosing between "screenshot everything and hope the vision model figures it out" and "write Puppeteer scripts by hand." Now there are dedicated frameworks, MCP servers, and hybrid approaches that combine structured data with AI reasoning. Some of them actually work.

This is a snapshot from March 2026, written from the perspective of building AI agents on a homelab. We've used Playwright MCP, Puppeteer, and direct browser automation in production. This is what we've found.


The Landscape

Browser automation for LLMs falls into four approaches:

  1. MCP browser servers — Playwright MCP, Puppeteer MCP. They expose browser actions as MCP tools. The LLM calls browser_navigate, browser_click, browser_snapshot like any other tool. No vision model required.
  2. AI-native browser agents — Browser Use, Stagehand. Purpose-built frameworks where the AI drives the browser through natural language. The framework handles element selection, action planning, and error recovery.
  3. Vision-based automation — Claude Computer Use, screenshot-and-click approaches. The model sees the screen as pixels and decides where to click. Most general, least reliable.
  4. Web scraping / extraction — Crawl4AI, Firecrawl, Jina Reader. Not browser automation per se, but they solve the "get web content into an LLM" problem without a full browser. Often the right answer when people reach for a browser.

The tension: approaches 1 and 2 are converging. Playwright MCP now recommends its CLI+SKILLS mode for coding agents over MCP, acknowledging that token-efficient CLI invocations beat verbose tool schemas. Meanwhile, Stagehand explicitly bridges code and natural language, letting developers choose which mode fits each step.


What Works

Playwright MCP (Microsoft)

The most popular browser MCP server. 28,000+ GitHub stars. Uses Playwright's accessibility tree instead of screenshots, which is a fundamental design decision that makes everything downstream better.

How it works: Exposes ~20 browser tools via MCP — browser_navigate, browser_click, browser_fill_form, browser_snapshot, browser_evaluate, browser_take_screenshot, etc. The key tool is browser_snapshot, which captures the page's accessibility tree as structured data. The LLM reads element roles, names, and states rather than interpreting pixels.

What's good:

  - The accessibility tree is structured and compact: the LLM reads element roles, names, and states instead of burning tokens on screenshots.
  - First-class Claude Code integration — one claude mcp add command and it works.
  - Maintained by Microsoft on top of Playwright, with the largest browser-MCP community in the space.

What's not:

  - ~20 tool schemas take up context window; Playwright's own CLI pivot acknowledges that verbose schemas hurt coding agents.
  - Falls down when the accessibility tree is incomplete — canvas UIs, missing ARIA, heavy custom components.

The CLI pivot: Playwright now ships a separate Playwright CLI with SKILLS (purpose-built commands). Their own README says CLI is better for "high-throughput coding agents that must balance browser automation with large codebases, tests, and reasoning within limited context windows." MCP is better for "exploratory automation, self-healing tests, or long-running autonomous workflows." This is an honest and useful distinction.

Setup:

# Claude Code
claude mcp add playwright npx @playwright/mcp@latest

# Or in MCP config
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}


Browser Use

The highest-starred project in the space at 80,000+ GitHub stars. A Python framework that gives LLMs autonomous browser control through natural language task descriptions.

How it works: You describe a task ("Fill in this job application with my resume"), provide an LLM, and Browser Use handles the browser session, element detection, action planning, and execution. It supports multiple LLM providers — their own ChatBrowserUse model, Anthropic (Claude), Google (Gemini), and OpenAI.

What's good:

  - Genuinely autonomous: describe the task in one sentence and it handles session management, element detection, planning, and recovery.
  - Provider-agnostic — works with its own ChatBrowserUse model or with Claude, Gemini, and OpenAI.
  - By far the largest community in the space (80,000+ stars).

What's not:

  - Autonomous loops are token-hungry and harder to audit than targeted tool calls; you trade control for convenience.
  - Python-only, so it doesn't slot into Node/TypeScript agent stacks.

Setup:

# Install
uv init && uv add browser-use && uv sync

# agent.py
from browser_use import Agent, Browser, ChatBrowserUse
import asyncio

async def main():
    agent = Agent(
        task="Find the latest pricing for Claude API",
        llm=ChatBrowserUse(),
        browser=Browser(),
    )
    await agent.run()

asyncio.run(main())


Stagehand (Browserbase)

The most thoughtful hybrid in the space. 21,000+ GitHub stars. Stagehand's core insight: let developers choose when to use code and when to use natural language, and bridge the gap between them.

How it works: Three primitives — act() for single AI-driven actions, extract() for structured data extraction with Zod schemas, and agent() for multi-step autonomous tasks. Underneath, it uses Playwright's CDP engine. You can freely mix Playwright code with AI actions.

What's good:

  - You choose per step: deterministic Playwright code where the page is known, AI where it isn't.
  - extract() with Zod schemas returns typed, structured data instead of free text.
  - Auto-caching and self-healing address real production failure modes.

What's not:

  - TypeScript-only.
  - More moving parts than a plain MCP server; overkill for simple targeted actions.

Setup:

npx create-browser-app

// main.ts: assumes a `stagehand` instance created via new Stagehand()
// and await stagehand.init(), per the quickstart
import { z } from "zod";

const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

// AI-driven action
await stagehand.act("click on the pricing link");

// Structured extraction
const { price, plan } = await stagehand.extract(
  "extract the price and plan name",
  z.object({
    price: z.string(),
    plan: z.string(),
  }),
);


Claude Computer Use (Vision-Based)

Anthropic's approach: the model sees the screen as pixels and acts through mouse/keyboard coordinates. The most general approach — it works on any visual interface, not just browsers.

How it works: Claude receives screenshots and can issue mouse clicks, keyboard input, and scroll actions at specific coordinates. The model reasons about what it sees visually, identifying buttons, text fields, and navigation elements from the rendered page.
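The loop behind this can be sketched with stub functions (not the actual Anthropic API, whose tool names and parameters aren't reproduced here):

```python
# Sketch of the vision agent loop: screenshot -> model decides -> execute ->
# repeat. take_screenshot / ask_model / perform are stubs standing in for
# real screenshot capture, a model call, and input injection.

def take_screenshot() -> bytes:
    return b"<png bytes>"  # stub

def ask_model(screenshot: bytes, goal: str, step: int) -> dict:
    # Stub: a real call sends the screenshot to the model and gets back an
    # action such as {"type": "click", "x": 412, "y": 87} or {"type": "done"}.
    return {"type": "click", "x": 412, "y": 87} if step == 0 else {"type": "done"}

def perform(action: dict) -> None:
    pass  # stub: inject mouse/keyboard input at the given coordinates

def run(goal: str, max_steps: int = 10) -> list[dict]:
    history = []
    for step in range(max_steps):
        action = ask_model(take_screenshot(), goal, step)
        history.append(action)
        if action["type"] == "done":
            break
        perform(action)  # every iteration costs a screenshot + a model call
    return history

actions = run("click the pricing link")
```

The cost structure is visible in the loop: each step is a full screenshot upload plus a vision-model round trip, which is why this approach is the slow, expensive fallback.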

What's good:

  - Works on any visual interface — browsers, desktop apps, canvas-heavy UIs — with no dependency on selectors or accessibility trees.

What's not:

  - Slow and expensive: every step is a screenshot plus a vision-model round trip.
  - The least reliable of the four approaches; pixel coordinates are sensitive to resolution and layout changes.

When to use it: When structured approaches fail — custom canvas apps, poorly accessible sites, or when you need to automate desktop applications alongside browser work. Not the right default for standard web automation.


Crawl4AI (When You Don't Need a Browser)

61,000+ GitHub stars. Often the right answer when people reach for browser automation but actually just need web content.

How it works: Crawls web pages and converts them to LLM-friendly formats — clean markdown, structured data, or raw content. Handles JavaScript rendering when needed but optimized for content extraction, not interaction.

What's good:

  - Fast and cheap: one crawl instead of a browser session, and the markdown output is already LLM-shaped.
  - Handles JavaScript rendering when a page needs it.

What's not:

  - Read-only. No clicking, form filling, or multi-step navigation — reach for a browser tool when you need interaction.

When to use it: Research tasks, content monitoring, data extraction, building RAG pipelines from web content. When the LLM needs to read the web, not use it.



What Doesn't

Raw Puppeteer/Selenium for LLM Agents

Writing Puppeteer or Selenium scripts and having the LLM generate them on the fly sounds logical but works poorly in practice. The LLM generates brittle selectors (document.querySelector('#app > div:nth-child(3) > button.submit-btn')) that break when the page changes. The accessibility tree approach (Playwright MCP) or natural language approach (Stagehand, Browser Use) handle this much better.

Puppeteer is still excellent as a library that other tools build on. But don't give an LLM raw Puppeteer and expect good results.
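To see why positional selectors are fragile, compare the two lookups below (stdlib-only illustration; a real page would go through Playwright or the accessibility tree):

```python
# Why positional selectors break: an nth-child-style path encodes page
# layout, while an accessible-name lookup encodes intent. Stdlib-only sketch.
from html.parser import HTMLParser

class ButtonFinder(HTMLParser):
    """Collect <button> elements with their attributes, in document order."""
    def __init__(self):
        super().__init__()
        self.buttons = []
    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.buttons.append(dict(attrs))

page_v1 = '<div><button aria-label="Cancel"></button><button aria-label="Submit"></button></div>'
# Same page after a redesign adds a Help button at the front:
page_v2 = '<div><button aria-label="Help"></button><button aria-label="Cancel"></button><button aria-label="Submit"></button></div>'

def nth_button(html: str, n: int) -> dict:
    f = ButtonFinder(); f.feed(html)
    return f.buttons[n]              # positional, like :nth-child selectors

def button_named(html: str, name: str) -> dict:
    f = ButtonFinder(); f.feed(html)
    return next(b for b in f.buttons if b.get("aria-label") == name)

# Positional lookup silently targets the wrong button after the layout change;
# the name-based lookup survives it.
print(nth_button(page_v1, 1))            # the Submit button
print(nth_button(page_v2, 1))            # now the Cancel button
print(button_named(page_v2, "Submit"))   # still the Submit button
```

LLM-generated selectors tend to be the positional kind, which is exactly why the accessibility-tree and natural-language approaches hold up better.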

Screenshot-Only Approaches (Without Structured Fallback)

Pure vision-based browser automation — screenshot, reason, click coordinates, repeat — is too slow and expensive for most production use. Claude Computer Use is the best implementation of this approach and even it is best reserved for cases where structured methods fail.

The exception: if you're automating non-browser desktop applications, vision is your only option, and it works acceptably there.

Over-Engineered MCP Chains

Running three MCP servers (browser + scraper + screenshot analyzer) in a pipeline is a pattern that's emerged in some agent architectures. It's almost always worse than picking one good tool. Each MCP hop adds latency, token overhead, and failure modes.


Key Patterns

Pattern 1: Accessibility-First, Vision Fallback

The most reliable pattern: start with Playwright MCP's accessibility tree. If elements aren't accessible (missing ARIA, canvas-based UI, heavy custom components), fall back to a screenshot with Claude's vision.

1. browser_snapshot → accessibility tree
2. Can I find the target element? → YES → browser_click(ref)
3. Can I find the target element? → NO → browser_take_screenshot
4. Vision model identifies coordinates → browser_click(coordinates)

This gives you speed and token efficiency 90% of the time, with a reliable fallback for the edge cases.

Pattern 2: Extract, Don't Navigate

Before building a multi-step browser automation flow, ask: can I get this data without navigating?

Option A: Navigate to page → click filters → read table → extract data
           (5-10 browser actions, 30+ seconds, high token cost)

Option B: Crawl4AI or fetch the URL → extract from markdown
           (1 HTTP request, 2 seconds, minimal tokens)

Many "browser automation" tasks are actually content extraction tasks dressed up. Crawl4AI, Jina Reader, or even a plain curl + markdown conversion handles them better.
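Option B can literally be a few lines of stdlib Python. Here a canned HTML string stands in for the HTTP fetch; in practice this would be urllib/httpx plus a proper HTML-to-markdown converter:

```python
# Option B in miniature: turn raw HTML into LLM-readable text without a
# browser session. Stdlib-only; real code would fetch the URL and use a
# real HTML-to-markdown library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text, skipping script/style contents."""
    SKIP = {"script", "style"}
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = """<html><head><style>body{color:red}</style></head>
<body><h1>Pricing</h1><p>Pro plan: $20/month</p>
<script>trackPageview()</script></body></html>"""

parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # → Pricing Pro plan: $20/month
```

One request, no browser process, and the output is already in a shape the LLM can reason over.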

Pattern 3: Hybrid Code + AI (Stagehand Pattern)

For production automations that need to be both reliable and adaptive:

Code:  page.goto(knownURL)           ← deterministic, fast
Code:  page.fill('#email', email)    ← known selector, no AI needed
AI:    stagehand.act("solve captcha") ← unpredictable, needs reasoning
Code:  page.click('#submit')         ← known selector
AI:    stagehand.extract("confirmation number", schema) ← structured output

Use code for the parts you understand, AI for the parts that change or require reasoning. This minimizes token cost and maximizes reliability.
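The same split, as a Python sketch. ai_step is a stub standing in for an act()/extract()-style model call; the deterministic steps are plain function calls:

```python
# Hybrid pipeline sketch: deterministic steps are plain code, and ai_step is
# a stub for a model-driven action. The point of the trace is that only two
# of the five steps would spend tokens.

def ai_step(instruction: str) -> str:
    return f"AI:{instruction}"      # stub for a model-driven browser action

def code_step(action: str) -> str:
    return f"code:{action}"         # deterministic, no model call

trace = [
    code_step("goto(known_url)"),
    code_step("fill(#email)"),
    ai_step("handle unexpected dialog"),
    code_step("click(#submit)"),
    ai_step("extract confirmation number"),
]

ai_calls = sum(1 for t in trace if t.startswith("AI:"))
print(f"{ai_calls}/{len(trace)} steps hit the model")  # → 2/5 steps hit the model
```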

Pattern 4: Persistent Browser Session

For agents that interact with the web repeatedly, keep the browser alive between tasks rather than launching fresh:

❌ Each task: launch browser → navigate → authenticate → do work → close
   (adds 2-5 seconds per task, loses session state)

✅ Persistent: browser stays open, MCP server maintains state
   (instant subsequent actions, cookies and auth preserved)

Playwright MCP handles this natively — the server maintains browser state across MCP calls. For Puppeteer, reuse the browser instance instead of launching new ones (reduces launch overhead by 60%).
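The reuse pattern is a lazily created, module-level session. A sketch with a stub launch; in real code the session would be a Playwright or Puppeteer browser instance:

```python
# Persistent-session sketch: launch once, reuse for every task. launch() is
# a stub for the expensive browser start; the counter proves it runs once.
launch_count = 0

def launch() -> dict:
    global launch_count
    launch_count += 1
    return {"cookies": {}, "id": launch_count}   # stub browser session

_session = None

def get_session() -> dict:
    """Return the live session, launching only on first use."""
    global _session
    if _session is None:
        _session = launch()
    return _session

def run_task(name: str) -> int:
    session = get_session()    # no relaunch; cookies and auth preserved
    return session["id"]

for task in ["scrape docs", "fill form", "check dashboard"]:
    run_task(task)

print(launch_count)  # → 1
```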


Our Take

For our homelab agent setup, Playwright MCP is the default browser tool. It's simple to add (claude mcp add playwright npx @playwright/mcp@latest), works with Claude Code natively, and the accessibility tree approach keeps token costs reasonable. We've used it for scraping documentation, filling forms, and navigating admin UIs.

When we just need to read web content — pulling docs, checking API references, monitoring pages — we skip the browser entirely. A curl or Crawl4AI call is faster, cheaper, and more reliable than spinning up a browser session.

We haven't needed Stagehand or Browser Use yet because our browser tasks are relatively simple — targeted tool calls rather than autonomous multi-step workflows. But if we were building a product that needed to navigate arbitrary websites (think: an agent that books appointments or fills out applications), Stagehand's hybrid approach would be the first thing we'd evaluate. The auto-caching and self-healing features solve real production problems.

Claude Computer Use stays in reserve for when nothing else works — desktop app automation or sites with such poor accessibility that the tree is useless. The cost and latency penalty is too high for routine use.

What would make us change: If Playwright MCP gets smarter about handling inaccessible elements — automatic vision fallback when the accessibility tree is incomplete — it would close the gap with Stagehand for interactive tasks. Microsoft seems to be heading this direction with the CLI+SKILLS split.


Summary Table

Tool           | Approach                   | Stars   | Language   | Best For                                    | Maturity
Playwright MCP | Accessibility tree via MCP | 28,000+ | Node       | Targeted browser actions from agents        | Production
Browser Use    | AI-driven autonomous agent | 80,000+ | Python     | End-to-end task automation                  | Production
Stagehand      | Hybrid code + AI           | 21,000+ | TypeScript | Production automations needing adaptability | Production
Computer Use   | Vision/screenshot          | n/a     | Any        | Desktop apps, inaccessible UIs              | Production
Crawl4AI       | Content extraction         | 61,000+ | Python     | Reading web content into LLMs               | Production
Puppeteer MCP  | DOM scripting via MCP      | n/a     | Node       | Simple page interaction, screenshots        | Production
Playwright CLI | CLI commands + SKILLS      | New     | Node       | Token-efficient browser automation          | Early


Last updated: March 12, 2026. Written from production experience running Playwright MCP on a homelab agent pool.