2026-03-12 · AI & Agents
Best Ways for LLMs to Use Browsers — March 2026
Every agent eventually needs to use a browser. Filling forms, scraping data, navigating UIs that don't have APIs — the web is still the universal integration layer, and LLMs need to interact with it.
The space has matured fast. A year ago you were choosing between "screenshot everything and hope the vision model figures it out" and "write Puppeteer scripts by hand." Now there are dedicated frameworks, MCP servers, and hybrid approaches that combine structured data with AI reasoning. Some of them actually work.
This is a snapshot from March 2026, written from the perspective of building AI agents on a homelab. We've used Playwright MCP, Puppeteer, and direct browser automation in production. This is what we've found.
The Landscape
Browser automation for LLMs falls into four approaches:
- MCP browser servers — Playwright MCP, Puppeteer MCP. They expose browser actions as MCP tools. The LLM calls browser_navigate, browser_click, and browser_snapshot like any other tool. No vision model required.
- AI-native browser agents — Browser Use, Stagehand. Purpose-built frameworks where the AI drives the browser through natural language. The framework handles element selection, action planning, and error recovery.
- Vision-based automation — Claude Computer Use, screenshot-and-click approaches. The model sees the screen as pixels and decides where to click. Most general, least reliable.
- Web scraping / extraction — Crawl4AI, Firecrawl, Jina Reader. Not browser automation per se, but solve the "get web content into an LLM" problem without a full browser. Often the right answer when people reach for a browser.
The tension: approaches 1 and 2 are converging. Playwright MCP now recommends its CLI+SKILLS mode for coding agents over MCP, acknowledging that token-efficient CLI invocations beat verbose tool schemas. Meanwhile, Stagehand explicitly bridges code and natural language, letting developers choose which mode fits each step.
What Works
Playwright MCP (Microsoft)
The most popular browser MCP server. 28,000+ GitHub stars. Uses Playwright's accessibility tree instead of screenshots, which is a fundamental design decision that makes everything downstream better.
How it works: Exposes ~20 browser tools via MCP — browser_navigate, browser_click, browser_fill_form, browser_snapshot, browser_evaluate, browser_take_screenshot, etc. The key tool is browser_snapshot, which captures the page's accessibility tree as structured data. The LLM reads element roles, names, and states rather than interpreting pixels.
What's good:
- Fast and lightweight. Accessibility trees are tiny compared to screenshots.
- No vision model needed. Works with any text LLM.
- Deterministic element targeting. browser_click takes a reference from the accessibility snapshot, not coordinates. No "click at pixel (342, 718)" guessing.
- browser_run_code lets the LLM write and execute arbitrary Playwright code when the predefined tools aren't enough.
- browser_fill_form handles multiple fields in one call — less back-and-forth.
- Works with every major MCP client: Claude Code, VS Code, Cursor, Windsurf, Codex CLI.
What's not:
- Accessibility trees can be incomplete. SPAs with custom components sometimes have poor ARIA markup, leaving the LLM blind to interactive elements.
- Token cost. Complex pages generate large accessibility snapshots. Microsoft acknowledges this — they now recommend their CLI+SKILLS approach for coding agents where token budgets are tight.
- The MCP server maintains browser state across calls, which is powerful but means a crashed server loses your session.
The CLI pivot: Playwright now ships a separate Playwright CLI with SKILLS (purpose-built commands). Their own README says CLI is better for "high-throughput coding agents that must balance browser automation with large codebases, tests, and reasoning within limited context windows." MCP is better for "exploratory automation, self-healing tests, or long-running autonomous workflows." This is an honest and useful distinction.
Setup:
# Claude Code
claude mcp add playwright npx @playwright/mcp@latest
# Or in MCP config
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
Links:
- microsoft/playwright-mcp — 28,000+ stars
- microsoft/playwright-cli — CLI+SKILLS alternative
Browser Use
The highest-starred project in the space at 80,000+ GitHub stars. A Python framework that gives LLMs autonomous browser control through natural language task descriptions.
How it works: You describe a task ("Fill in this job application with my resume"), provide an LLM, and Browser Use handles the browser session, element detection, action planning, and execution. It supports multiple LLM providers — their own ChatBrowserUse model, Anthropic (Claude), Google (Gemini), and OpenAI.
What's good:
- Highest-level abstraction available. Genuinely useful for end-to-end tasks like "buy groceries from Instacart" or "fill out this job application."
- Cloud offering for scalable, stealth-enabled browser automation (bypasses bot detection).
- CLI for quick one-off browser tasks from the terminal.
- Template system (uvx browser-use init --template default) for fast project scaffolding.
- Active community and fast iteration.
What's not:
- High-level abstraction means less control. When the agent misinterprets a page, debugging is harder than with Playwright MCP where you can inspect the exact accessibility tree.
- Python-only. If your agent stack is TypeScript/Node, you need a separate process.
- The "autonomous agent" approach burns more tokens than targeted tool calls. Each step requires the LLM to reason about what it sees, decide what to do, and verify the result.
- Reliability on complex, dynamic pages is still inconsistent. Works great on well-structured sites, struggles on SPAs with heavy JavaScript rendering.
Setup:
uv init && uv add browser-use && uv sync
from browser_use import Agent, Browser, ChatBrowserUse
import asyncio

async def main():
    agent = Agent(
        task="Find the latest pricing for Claude API",
        llm=ChatBrowserUse(),
        browser=Browser(),
    )
    await agent.run()

asyncio.run(main())
Links:
- browser-use/browser-use — 80,000+ stars
- Browser Use Cloud
- Docs
Stagehand (Browserbase)
The most thoughtful hybrid in the space. 21,000+ GitHub stars. Stagehand's core insight: let developers choose when to use code and when to use natural language, and bridge the gap between them.
How it works: Three primitives — act() for single AI-driven actions, extract() for structured data extraction with Zod schemas, and agent() for multi-step autonomous tasks. Underneath, it uses Playwright's CDP engine. You can freely mix Playwright code with AI actions.
What's good:
- The code/AI hybrid model is genuinely useful. Navigate with code (page.goto(url)), interact with AI (stagehand.act("click the login button")), extract with schemas (stagehand.extract("get the price", z.object({...}))).
- Auto-caching. Stagehand remembers previous actions and replays them without LLM inference. Only invokes AI when the page has changed. This dramatically reduces token costs for repeatable workflows.
- Self-healing. When cached actions fail (because the site changed), Stagehand automatically falls back to AI to figure out the new approach.
- TypeScript-native. Fits naturally into Node agent stacks.
- Zod schema validation on extract() means you get typed, validated data out of web pages.
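The auto-caching and self-healing behavior boils down to a keyed replay cache. A minimal sketch of the idea in Python — cached_act and plan_with_llm are hypothetical stand-ins, not Stagehand's real internals:

```python
# Sketch of Stagehand-style auto-caching: key each action by page state plus
# instruction, replay on a hit, and only pay for an LLM call on a miss.

def cached_act(cache, page_hash, instruction, plan_with_llm):
    key = (page_hash, instruction)
    if key not in cache:
        # Cache miss: the expensive LLM call plans the concrete action.
        cache[key] = plan_with_llm(instruction)
    # Cache hit: replay the recorded action with no inference cost.
    return cache[key]

llm_calls = []

def plan_with_llm(instruction):
    llm_calls.append(instruction)  # stand-in for a real LLM round trip
    return {"action": "click", "selector": "#login"}

cache = {}
cached_act(cache, "page-hash-a", "click the login button", plan_with_llm)
cached_act(cache, "page-hash-a", "click the login button", plan_with_llm)
print(len(llm_calls))  # → 1 (second call replays from cache)
```

Self-healing is the same cache with one more branch: when the replayed action fails because the page hash changed, evict the entry and fall back to the LLM.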
What's not:
- Requires Browserbase credentials for cloud browser sessions (or local Playwright for development).
- The
agent()mode for multi-step tasks is newer and less battle-tested than the single-action primitives. - TypeScript-only until recently — Python SDK now exists but is less mature.
Setup:
npx create-browser-app
const page = stagehand.context.pages()[0];
await page.goto("https://example.com");

// AI-driven action
await stagehand.act("click on the pricing link");

// Structured extraction
const { price, plan } = await stagehand.extract(
  "extract the price and plan name",
  z.object({
    price: z.string(),
    plan: z.string(),
  }),
);
Links:
- browserbase/stagehand — 21,000+ stars
- Docs
Claude Computer Use (Vision-Based)
Anthropic's approach: the model sees the screen as pixels and acts through mouse/keyboard coordinates. The most general approach — it works on any visual interface, not just browsers.
How it works: Claude receives screenshots and can issue mouse clicks, keyboard input, and scroll actions at specific coordinates. The model reasons about what it sees visually, identifying buttons, text fields, and navigation elements from the rendered page.
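That loop can be made concrete with a schematic sketch — the callables here are hypothetical stand-ins for the real Computer Use tool calls. Each step costs two screenshots and a model inference, which is why vision automation runs in seconds per action rather than milliseconds:

```python
# Schematic vision action loop (hypothetical stand-ins, not the real API):
# screenshot in, coordinate action out, screenshot again to verify.

def vision_step(take_screenshot, model_decide, execute):
    before = take_screenshot()      # pixels into the model
    action = model_decide(before)   # e.g. {"click": (342, 718)}
    execute(action)                 # mouse/keyboard output
    after = take_screenshot()       # verify the action landed
    return action, after

shots = []

def fake_screenshot():
    shots.append("png")
    return "png"

action, _ = vision_step(
    fake_screenshot,
    lambda img: {"click": (342, 718)},
    lambda a: None,
)
print(action, len(shots))  # → {'click': (342, 718)} 2
```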
What's good:
- Works on anything with a screen. Not just browsers — desktop apps, terminal UIs, mobile emulators.
- No dependency on accessibility trees or DOM structure. Sites with poor ARIA markup or heavy canvas rendering still work.
- Handles visual context that accessibility trees miss: colors indicating state, spatial layout suggesting relationships, icons without text labels.
What's not:
- Slow. Each action requires a screenshot, model inference, coordinate calculation, action execution, and another screenshot to verify. Multiple seconds per action vs. milliseconds for Playwright MCP.
- Expensive. Screenshots consume far more tokens than text. A typical browser session might use 10-50x more tokens than the accessibility tree approach.
- Coordinate precision is imperfect. "Click at (342, 718)" sometimes misses small targets. The model is getting better at this but it's fundamentally harder than targeting a named element.
- Requires a display environment. Headless browser + VNC or a real display. More infrastructure than MCP tools that work headlessly.
When to use it: When structured approaches fail — custom canvas apps, poorly accessible sites, or when you need to automate desktop applications alongside browser work. Not the right default for standard web automation.
Links:
- Claude Computer Use docs
- anthropic-quickstarts — reference implementations
Crawl4AI (When You Don't Need a Browser)
61,000+ GitHub stars. Often the right answer when people reach for browser automation but actually just need web content.
How it works: Crawls web pages and converts them to LLM-friendly formats — clean markdown, structured data, or raw content. Handles JavaScript rendering when needed but optimized for content extraction, not interaction.
What's good:
- If your task is "read this web page and understand it," Crawl4AI is 10x simpler and cheaper than spinning up a full browser automation session.
- Produces clean markdown that fits efficiently in LLM context windows.
- Handles JavaScript-rendered pages when configured to do so.
- Much faster than screenshot-based approaches for content extraction.
What's not:
- Can't interact with pages. No form filling, no clicking, no navigation flows.
- Not a browser automation tool — it's a content extraction tool. The distinction matters.
When to use it: Research tasks, content monitoring, data extraction, building RAG pipelines from web content. When the LLM needs to read the web, not use it.
Links:
- unclecode/crawl4ai — 61,000+ stars
What Doesn't
Raw Puppeteer/Selenium for LLM Agents
Writing Puppeteer or Selenium scripts and having the LLM generate them on the fly sounds logical but works poorly in practice. The LLM generates brittle selectors (document.querySelector('#app > div:nth-child(3) > button.submit-btn')) that break when the page changes. The accessibility tree approach (Playwright MCP) or natural language approach (Stagehand, Browser Use) handle this much better.
Puppeteer is still excellent as a library that other tools build on. But don't give an LLM raw Puppeteer and expect good results.
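The fragility is easy to demonstrate on a toy model. This sketch uses a hypothetical mini-DOM (nested dicts, not real Puppeteer objects) to contrast positional selection with role/name targeting:

```python
# Why positional selectors break: one inserted element shifts every
# nth-child path below it, while role/name lookup still finds the target.

def nth_child_select(dom, path):
    """Positional selection, like a div:nth-child(n) chain."""
    node = dom
    for i in path:
        node = node["children"][i]
    return node

def role_select(dom, role, name):
    """Accessibility-style selection: match by role and accessible name."""
    if dom.get("role") == role and dom.get("name") == name:
        return dom
    for child in dom.get("children", []):
        found = role_select(child, role, name)
        if found:
            return found
    return None

# The site ships a cookie banner; positional paths below it shift by one.
page = {"children": [
    {"role": "banner", "name": "nav", "children": []},
    {"role": "alert", "name": "cookie notice", "children": []},
    {"role": "button", "name": "Submit", "children": []},
]}

print(nth_child_select(page, [1])["name"])            # → cookie notice (broken)
print(role_select(page, "button", "Submit")["name"])  # → Submit (survives)
```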
Screenshot-Only Approaches (Without Structured Fallback)
Pure vision-based browser automation — screenshot, reason, click coordinates, repeat — is too slow and expensive for most production use. Claude Computer Use is the best implementation of this approach and even it is best reserved for cases where structured methods fail.
The exception: if you're automating non-browser desktop applications, vision is your only option, and it works acceptably there.
Over-Engineered MCP Chains
Running three MCP servers (browser + scraper + screenshot analyzer) in a pipeline is a pattern that's emerged in some agent architectures. It's almost always worse than picking one good tool. Each MCP hop adds latency, token overhead, and failure modes.
Key Patterns
Pattern 1: Accessibility-First, Vision Fallback
The most reliable pattern: start with Playwright MCP's accessibility tree. If elements aren't accessible (missing ARIA, canvas-based UI, heavy custom components), fall back to a screenshot with Claude's vision.
1. browser_snapshot → accessibility tree
2. Can I find the target element? → YES → browser_click(ref)
3. Can I find the target element? → NO → browser_take_screenshot
4. Vision model identifies coordinates → browser_click(coordinates)
This gives you speed and token efficiency 90% of the time, with a reliable fallback for the edge cases.
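The decision loop above can be sketched in a few lines. This is illustrative, not the MCP wire protocol: find_ref walks a toy snapshot, and click_ref / vision_fallback are hypothetical stand-ins for browser_click and the screenshot-plus-vision path:

```python
# Accessibility-first with vision fallback: try the tree, fall back to pixels.

def find_ref(node, name):
    """Search the accessibility tree for an element by accessible name."""
    if node.get("name") == name:
        return node.get("ref")
    for child in node.get("children", []):
        ref = find_ref(child, name)
        if ref is not None:
            return ref
    return None

def click_target(snapshot, name, click_ref, vision_fallback):
    ref = find_ref(snapshot, name)
    if ref is not None:
        click_ref(ref)        # fast path: browser_click with a snapshot ref
        return "accessibility"
    vision_fallback(name)     # slow path: screenshot + vision coordinates
    return "vision"

snapshot = {"role": "page", "children": [
    {"role": "button", "name": "Submit", "ref": "e42", "children": []},
]}

clicked = []
path = click_target(snapshot, "Submit", clicked.append, clicked.append)
print(path, clicked)  # → accessibility ['e42']
```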
Pattern 2: Extract, Don't Navigate
Before building a multi-step browser automation flow, ask: can I get this data without navigating?
Option A: Navigate to page → click filters → read table → extract data
(5-10 browser actions, 30+ seconds, high token cost)
Option B: Crawl4AI or fetch the URL → extract from markdown
(1 HTTP request, 2 seconds, minimal tokens)
Many "browser automation" tasks are actually content extraction tasks dressed up. Crawl4AI, Jina Reader, or even a plain curl + markdown conversion handles them better.
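Option B doesn't even need a library for simple pages. A stdlib-only sketch of "fetch, strip, hand to the LLM" — the HTML string here stands in for a fetched page (urllib.request.urlopen in practice):

```python
# Extract, don't navigate: convert HTML to plain text with zero browser
# involvement, keeping text nodes and skipping script/style contents.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive HTML-to-text converter."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

html = "<html><body><h1>Pricing</h1><script>track()</script><p>$20/mo</p></body></html>"
print(page_text(html))  # prints two lines: Pricing, $20/mo
```

Crawl4AI and Jina Reader do the same job far better (JavaScript rendering, markdown structure, boilerplate removal), but the shape of the solution is this simple.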
Pattern 3: Hybrid Code + AI (Stagehand Pattern)
For production automations that need to be both reliable and adaptive:
Code: page.goto(knownURL) ← deterministic, fast
Code: page.fill('#email', email) ← known selector, no AI needed
AI: stagehand.act("solve captcha") ← unpredictable, needs reasoning
Code: page.click('#submit') ← known selector
AI: stagehand.extract("confirmation number", schema) ← structured output
Use code for the parts you understand, AI for the parts that change or require reasoning. This minimizes token cost and maximizes reliability.
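The cost asymmetry is the whole point. A hedged sketch of the split (illustrative step list, not the real Stagehand API): "code" steps run deterministically for free, while "ai" steps stand in for act()/extract() calls that each cost an LLM round trip:

```python
# Hybrid flow: count how many steps actually need inference.

def run_hybrid(steps):
    llm_round_trips = 0
    for mode, action in steps:
        if mode == "ai":
            llm_round_trips += 1  # each AI step is one inference call
        action()
    return llm_round_trips

log = []
flow = [
    ("code", lambda: log.append("goto checkout url")),
    ("code", lambda: log.append("fill #email")),
    ("ai",   lambda: log.append("solve captcha")),
    ("code", lambda: log.append("click #submit")),
    ("ai",   lambda: log.append("extract confirmation number")),
]
print(run_hybrid(flow))  # → 2 (two LLM calls in a five-step flow)
```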
Pattern 4: Persistent Browser Session
For agents that interact with the web repeatedly, keep the browser alive between tasks rather than launching fresh:
❌ Each task: launch browser → navigate → authenticate → do work → close
(adds 2-5 seconds per task, loses session state)
✅ Persistent: browser stays open, MCP server maintains state
(instant subsequent actions, cookies and auth preserved)
Playwright MCP handles this natively — the server maintains browser state across MCP calls. For Puppeteer, reuse the browser instance instead of launching new ones (reduces launch overhead by 60%).
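If you're managing the browser yourself, the reuse pattern is a lazy launch-once wrapper. A sketch where the launch callable stands in for something expensive like Playwright's chromium.launch():

```python
# Launch once, reuse forever: every task shares one browser instance.

class BrowserSession:
    def __init__(self, launch):
        self._launch = launch
        self._browser = None
        self.launches = 0

    def browser(self):
        if self._browser is None:       # first task pays the launch cost
            self._browser = self._launch()
            self.launches += 1
        return self._browser            # later tasks reuse it instantly

session = BrowserSession(launch=object)
instances = {id(session.browser()) for _ in range(5)}
print(session.launches, len(instances))  # → 1 1
```

Cookies and auth state live on the shared instance, so subsequent tasks skip login as well as launch.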
Our Take
For our homelab agent setup, Playwright MCP is the default browser tool. It's simple to add (claude mcp add playwright npx @playwright/mcp@latest), works with Claude Code natively, and the accessibility tree approach keeps token costs reasonable. We've used it for scraping documentation, filling forms, and navigating admin UIs.
When we just need to read web content — pulling docs, checking API references, monitoring pages — we skip the browser entirely. A curl or Crawl4AI call is faster, cheaper, and more reliable than spinning up a browser session.
We haven't needed Stagehand or Browser Use yet because our browser tasks are relatively simple — targeted tool calls rather than autonomous multi-step workflows. But if we were building a product that needed to navigate arbitrary websites (think: an agent that books appointments or fills out applications), Stagehand's hybrid approach would be the first thing we'd evaluate. The auto-caching and self-healing features solve real production problems.
Claude Computer Use stays in reserve for when nothing else works — desktop app automation or sites with such poor accessibility that the tree is useless. The cost and latency penalty is too high for routine use.
What would make us change: If Playwright MCP gets smarter about handling inaccessible elements — automatic vision fallback when the accessibility tree is incomplete — it would close the gap with Stagehand for interactive tasks. Microsoft seems to be heading this direction with the CLI+SKILLS split.
Summary Table
| Tool | Approach | Stars | Language | Best For | Maturity |
|---|---|---|---|---|---|
| Playwright MCP | Accessibility tree via MCP | 28,000+ | Node | Targeted browser actions from agents | Production |
| Browser Use | AI-driven autonomous agent | 80,000+ | Python | End-to-end task automation | Production |
| Stagehand | Hybrid code + AI | 21,000+ | TypeScript | Production automations needing adaptability | Production |
| Computer Use | Vision/screenshot | — | Any | Desktop apps, inaccessible UIs | Production |
| Crawl4AI | Content extraction | 61,000+ | Python | Reading web content into LLMs | Production |
| Puppeteer MCP | DOM scripting via MCP | — | Node | Simple page interaction, screenshots | Production |
| Playwright CLI | CLI commands + SKILLS | New | Node | Token-efficient browser automation | Early |
Sources
- microsoft/playwright-mcp — MCP server with accessibility tree approach, 28,000+ stars
- microsoft/playwright-cli — CLI+SKILLS alternative for coding agents
- browser-use/browser-use — AI browser agent framework, 80,000+ stars
- browserbase/stagehand — Hybrid code+AI browser automation, 21,000+ stars
- unclecode/crawl4ai — LLM-friendly web crawler, 61,000+ stars
- Claude Computer Use docs — Anthropic's vision-based computer control
- Stagehand docs — Framework documentation
- Browser Use docs — Framework documentation
Last updated: March 12, 2026. Written from production experience running Playwright MCP on a homelab agent pool.