
browser-agent

Browser automation library for LLM agents

Overview

A Python library for browser automation over the Chrome DevTools Protocol (CDP). It is designed specifically for LLM-driven interaction and has built-in support for OpenAI, Anthropic, and Google Gemini backends.

Problem

Existing browser automation tools weren't designed for LLM agents. They lack structured element representation, confidence scoring, and native integration with language model APIs.

Solution

browser-agent correlates data from the DOM, DOMSnapshot, and Accessibility tree to build a unified view of interactive elements. Each element receives a confidence score between 0 and 1 indicating how actionable it is.
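
As a rough illustration of the correlation step, the sketch below joins the three sources on a shared node identifier and applies a simple additive score. CDP does expose a backend node ID in all three domains (backendNodeId in DOM and DOMSnapshot, backendDOMNodeId in Accessibility), but the flattened record shapes, the names merge_sources and score, and the weights are assumptions made for illustration, not browser-agent's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical, simplified records; real CDP payloads are much richer
# (for example, an accessibility role is a nested value object, not a string).
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass
class Element:
    backend_node_id: int
    tag: str
    role: str = ""
    name: str = ""
    bounds: tuple = (0, 0, 0, 0)  # x, y, width, height
    confidence: float = 0.0


def score(el: Element) -> float:
    """Illustrative heuristic (assumed weights): add evidence, cap at 1."""
    total = 0.0
    if el.role in INTERACTIVE_ROLES:
        total += 0.5  # accessibility data says the element is interactive
    if el.bounds[2] > 0 and el.bounds[3] > 0:
        total += 0.3  # snapshot data says it has a non-empty box
    if el.name:
        total += 0.2  # it has an accessible name the model can refer to
    return min(total, 1.0)


def merge_sources(dom_nodes, snapshot_bounds, ax_nodes):
    """Join DOM, snapshot, and accessibility records on the backend node ID."""
    ax_by_id = {n["backendDOMNodeId"]: n for n in ax_nodes}
    elements = []
    for node in dom_nodes:
        node_id = node["backendNodeId"]
        ax = ax_by_id.get(node_id, {})
        el = Element(
            backend_node_id=node_id,
            tag=node["nodeName"].lower(),
            role=ax.get("role", ""),
            name=ax.get("name", ""),
            bounds=snapshot_bounds.get(node_id, (0, 0, 0, 0)),
        )
        el.confidence = score(el)
        elements.append(el)
    return elements
```

Under these assumed weights, a visible, named button scores close to 1, while a bare container with no role, box, or name scores 0.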

Architecture

  • CDP Client — WebSocket connection to Chrome
  • Data Merger — Correlates DOM, snapshot, and accessibility data
  • Serialization — Converts page state to LLM-friendly text
  • Tool Executor — Maps LLM tool calls to browser actions
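
To make the last component more concrete, here is a minimal sketch of how a tool call might be translated into CDP commands. The CDP methods used (Input.dispatchMouseEvent and Input.insertText) and the id/method/params message shape are part of the protocol; the tool names, argument keys, and the function names cdp_command and tool_call_to_commands are hypothetical and may differ from what browser-agent does internally.

```python
import itertools
import json

_ids = itertools.count(1)  # each CDP command carries its own integer id


def cdp_command(method: str, params: dict) -> str:
    """Serialize one CDP command as the JSON text sent over the WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def tool_call_to_commands(name: str, args: dict) -> list:
    """Translate a (hypothetical) tool call into CDP Input-domain commands."""
    if name == "click":
        point = {"x": args["x"], "y": args["y"], "button": "left", "clickCount": 1}
        # A click is a press followed by a release at the same point.
        return [
            cdp_command("Input.dispatchMouseEvent", {"type": "mousePressed", **point}),
            cdp_command("Input.dispatchMouseEvent", {"type": "mouseReleased", **point}),
        ]
    if name == "type":
        return [cdp_command("Input.insertText", {"text": args["text"]})]
    if name == "scroll":
        return [cdp_command("Input.dispatchMouseEvent", {
            "type": "mouseWheel",
            "x": args.get("x", 0),
            "y": args.get("y", 0),
            "deltaX": 0,
            "deltaY": args["delta_y"],
        })]
    raise ValueError(f"unknown tool: {name}")
```

Returning serialized messages rather than sending them keeps the mapping easy to test without a running browser; a real executor would also wait for the response matching each id.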

Usage

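import asyncio

from browser_agent import Browser, Agent, OpenAIBackend

async def main():
    # Choose the LLM backend that drives the agent.
    backend = OpenAIBackend(model="gpt-4o")
    agent = Agent(backend)

    # await is only valid inside a coroutine, so the task runs here.
    history = await agent.run(
        task="Search for Python tutorials",
        start_url="https://google.com"
    )

asyncio.run(main())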

Features

  • Async API with context manager support
  • Tool schemas for OpenAI and Anthropic formats
  • Click, type, scroll, select, and keyboard actions
  • Screenshot capture (viewport or full-page)
  • Configurable agent with step limits and failure handling
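
The two tool-schema formats mentioned above differ mainly in how they wrap the same JSON Schema. The sketch below shows one tool definition rendered both ways. The wrapper layouts follow the publicly documented OpenAI Chat Completions and Anthropic Messages tool formats; the click tool, its element_id parameter, and the helper names are hypothetical and not taken from browser-agent.

```python
# Hypothetical tool definition: JSON Schema for the arguments of a "click" tool.
CLICK_PARAMETERS = {
    "type": "object",
    "properties": {
        "element_id": {
            "type": "integer",
            "description": "Identifier of the element to click",
        },
    },
    "required": ["element_id"],
}


def to_openai_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap a JSON Schema in the OpenAI function-tool layout."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


def to_anthropic_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap the same JSON Schema in the Anthropic tool layout."""
    return {
        "name": name,
        "description": description,
        "input_schema": parameters,
    }


openai_click = to_openai_tool("click", "Click an element", CLICK_PARAMETERS)
anthropic_click = to_anthropic_tool("click", "Click an element", CLICK_PARAMETERS)
```

Keeping a single schema per tool and converting at the edge means a new action only has to be described once, whichever backend is in use.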