Pry API Reference

Complete reference for the Pry web scraping engine. All endpoints return JSON. Base URL: http://localhost:8005

Quickstart

Pry runs as a Docker Compose stack. After deploying, the API is available on port 8005.

# Health check
curl http://localhost:8005/health

# Scrape a page
curl -X POST http://localhost:8005/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

Authentication

By default, Pry does not require authentication because it is designed to run on an internal network behind a firewall. For public-facing deployments, enable API key authentication through the /v1/config endpoint or an environment variable.

# Set an API key
curl -X POST http://localhost:8005/v1/config \
  -H "Content-Type: application/json" \
  -d '{"api_key":"your-secret-key"}'

# Then include it in requests
curl -X POST http://localhost:8005/v1/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{"url":"https://example.com"}'
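The same authenticated call can be made from a script. The sketch below uses only the Python standard library and mirrors the curl example above; the helper names build_scrape_request and send are illustrative and not part of Pry, and it assumes the default base URL and the X-API-Key header shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8005"  # default base URL from this reference


def build_scrape_request(url, api_key=None, base_url=BASE_URL, **options):
    """Build (but do not send) a POST /v1/scrape request.

    Extra keyword arguments (e.g. format="markdown") are copied into the
    JSON body unchanged. This helper is hypothetical, not a Pry API.
    """
    body = {"url": url, **options}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only sent when API key authentication has been enabled.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the request separately from sending it keeps the header and body logic easy to inspect; call send(build_scrape_request("https://example.com", api_key="your-secret-key")) against a running stack.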

Rate Limiting

Pry includes built-in rate limiting to prevent abuse. Default limits: 120 requests per minute per IP, with a burst allowance of 200. Configure via /v1/config.

# Check current rate limit config
curl http://localhost:8005/v1/config

# Response
{"rate_limit":{"rpm":120,"burst":200}}
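To build intuition for how rpm and burst interact, here is a minimal token-bucket sketch. This reference does not state which algorithm Pry uses, so the token-bucket model is an assumption, and the TokenBucket class is purely illustrative.

```python
class TokenBucket:
    """Illustrative limiter: refills at rpm/60 tokens per second, holds at most `burst`."""

    def __init__(self, rpm=120, burst=200):
        self.rate = rpm / 60.0        # tokens added per second
        self.capacity = float(burst)  # maximum stored tokens
        self.tokens = float(burst)    # start with a full bucket
        self.updated = 0.0            # timestamp of the last refill

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is allowed."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under this model a single IP could send 200 requests at once, after which requests are admitted at 2 per second (120 per minute) until the bucket refills.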

Core Endpoints

POST /v1/scrape: Scrape a single URL

The primary endpoint. Fetches and extracts content from any URL. Supports Cloudflare bypass, JavaScript rendering, and multiple output formats.

Request body:

{
  "url": "https://example.com",
  "bypass_cloudflare": true,     // Use FlareSolverr
  "render_js": true,             // Use Playwright browser
  "timeout": 30,                 // Timeout in seconds
  "format": "markdown"           // "markdown" | "html" | "text"
}

Response:

{
  "success": true,
  "data": {
    "url": "https://example.com",
    "title": "Example Domain",
    "content": "...",
    "status_code": 200,
    "method": "flaresolverr"
  }
}

POST /v1/crawl: Crawl multiple pages

Crawl a site starting from a URL, bounded by max_depth and max_pages. Returns all discovered pages.

{
  "url": "https://example.com",
  "max_pages": 10,
  "max_depth": 2,
  "same_domain": true
}
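The three limits in the request body can be read as bounds on a breadth-first traversal. The sketch below shows one way such a crawl might behave; Pry's actual traversal order is not documented here, and the crawl function and its get_links callback are hypothetical stand-ins for fetching a page and extracting its links.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_pages=10, max_depth=2, same_domain=True):
    """Breadth-first crawl sketch.

    get_links(url) is a caller-supplied function returning the URLs linked
    from a page. Returns visited URLs in visit order.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != start_host:
                continue  # skip off-site links when same_domain is set
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

In this sketch the start page counts as depth 0 and counts toward max_pages; whether Pry counts them the same way is an assumption.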

POST /v1/screenshot: Capture page screenshot

Take a full-page or viewport screenshot of any URL using Playwright.

{
  "url": "https://example.com",
  "full_page": true,
  "width": 1920,
  "height": 1080,
  "format": "png"          // "png" | "jpeg"
}

POST /v1/suggest: AI schema detection

Analyze a page and auto-detect its data structure. Returns CSS selectors for products, prices, names, descriptions, ratings, and stock levels. Uses a local LLM (Ollama) for detection.

{"url": "https://shop.example.com"}

Response:

{
  "success": true,
  "data": {
    "_page_title": "Shop",
    "suggested": {
      "price": "[class*='price']",
      "name": "h1.product-title",
      "image": "[class*='product-image']",
      "rating": "[class*='review-score']"
    }
  }
}

POST /v1/extract: Structured extraction

Extract specific fields from a page using CSS selectors. Combine with /v1/suggest to auto-detect selectors first.

{
  "url": "https://shop.example.com",
  "fields": {
    "price": "[class*='price']",
    "name": "h1.title",
    "image": "img.product-image"
  }
}
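Chaining the two endpoints amounts to copying the "suggested" selectors from a /v1/suggest response into the "fields" of a /v1/extract request. A minimal sketch, assuming the response shape shown for /v1/suggest; build_extract_request and its wanted parameter are hypothetical helpers, not part of Pry.

```python
def build_extract_request(url, suggest_response, wanted=None):
    """Turn a /v1/suggest response into a /v1/extract request body.

    wanted: optional collection of field names to keep; None keeps all.
    """
    if not suggest_response.get("success"):
        raise ValueError("suggest call did not succeed")
    suggested = suggest_response.get("data", {}).get("suggested", {})
    fields = {
        name: selector
        for name, selector in suggested.items()
        if wanted is None or name in wanted
    }
    if not fields:
        raise ValueError("no selectors available for extraction")
    return {"url": url, "fields": fields}
```

Reviewing the suggested selectors before extracting is still worthwhile, since auto-detected selectors may not match every page on a site.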

POST /v1/pipe: Data pipeline transform

Scrape a URL and transform the output into JSON, CSV, or SQL format. Pipe directly into your database or analytics stack.

{
  "url": "https://example.com",
  "transform": "json"      // "json" | "csv" | "sql"
}

Advanced Endpoints

POST /v1/batch: Batch scrape multiple URLs

Scrape multiple URLs in a single request. Results returned as an array.

{"urls": ["https://example.com", "https://example.org"], "bypass_cloudflare": true}

POST /v1/compare: Diff two pages

Scrape two URLs and return a diff of their content. Useful for monitoring competitor changes.

{"url_a": "https://example.com/v1", "url_b": "https://example.com/v2"}
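The diff format returned by /v1/compare is not specified in this reference. As an illustration of the idea only, the sketch below produces a line-based unified diff of two content strings with Python's difflib; the diff_content helper is hypothetical and the unified format is an assumption.

```python
import difflib


def diff_content(content_a, content_b, label_a="url_a", label_b="url_b"):
    """Return a unified diff (list of lines) between two scraped contents."""
    return list(
        difflib.unified_diff(
            content_a.splitlines(),
            content_b.splitlines(),
            fromfile=label_a,
            tofile=label_b,
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )
```

An empty list means the two contents are identical line for line; removed lines are prefixed with "-" and added lines with "+".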

POST /v1/watch: Monitor page for changes

Re-check a URL at a fixed interval and notify the given webhook URL when its content changes.

{"url": "https://example.com", "interval": "1h", "webhook": "https://your-server.com/hook"}
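Change monitoring of this kind usually reduces to two pieces: turning the interval string into seconds and comparing a fingerprint of the page between checks. The sketch below illustrates both under stated assumptions: only the "1h" form appears in this reference, so the s/m/h/d suffixes are a guess, and whitespace-normalized SHA-256 hashing is an illustrative choice rather than Pry's documented method.

```python
import hashlib

# Assumed suffixes; this reference only shows "1h".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(interval):
    """Convert a string such as "1h" or "30m" into seconds."""
    value, unit = interval[:-1], interval[-1:]
    if unit not in _UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(value) * _UNIT_SECONDS[unit]


def fingerprint(content):
    """Hash page content after collapsing runs of whitespace."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, content):
    """Return (changed, current_fingerprint) for a freshly scraped page."""
    current = fingerprint(content)
    return current != previous_fingerprint, current
```

Normalizing whitespace before hashing avoids reporting a change when only the formatting of the page differs.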

POST /v1/map: Generate sitemap

Crawl a domain and generate a sitemap of all discovered URLs.

{"url": "https://example.com", "max_pages": 500}

POST /v1/parse: Parse raw HTML

Parse and extract content from raw HTML input. Useful when you already have the HTML from another source.

{"html": "<html>...</html>", "format": "markdown"}

POST /v1/automate: Browser automation

Execute browser automation steps using Playwright. Navigate, click, type, wait for selectors, then extract.

{
  "url": "https://example.com/login",
  "steps": [
    {"action": "type", "selector": "#email", "value": "[email protected]"},
    {"action": "type", "selector": "#password", "value": "pass123"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "wait", "selector": ".dashboard"},
    {"action": "extract"}
  ]
}

System Endpoints

GET /v1/breaker/status: Circuit breaker status

View the circuit breaker state for all tracked domains. Shows failure counts and backoff timers.

# Check breaker status
curl http://localhost:8005/v1/breaker/status

# Reset breaker for a domain or all domains
curl -X POST http://localhost:8005/v1/breaker/reset \
  -H "Content-Type: application/json" \
  -d '{"domain":"example.com"}'

The circuit breaker prevents aggressive retries against failing targets. After N consecutive failures for a domain, Pry backs off exponentially, up to a maximum of 60 seconds, which reduces the risk of IP bans.
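A capped exponential backoff of this kind can be sketched in a few lines. The 60-second cap comes from the description above; the failure threshold (N) and the base delay are not given in this reference, so the values 5 and 1.0 below are hypothetical, as is the backoff_delay function itself.

```python
def backoff_delay(failures, threshold=5, base=1.0, cap=60.0):
    """Seconds to wait before the next attempt against a domain.

    threshold and base are placeholder values; only the 60 s cap is documented.
    """
    if failures < threshold:
        return 0.0  # breaker still closed: no enforced wait
    # Clamp the exponent so very large failure counts cannot overflow.
    exponent = min(failures - threshold, 32)
    return min(cap, base * 2 ** exponent)
```

With these placeholder values the wait doubles from 1 second at the fifth failure until it reaches the cap.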

GET /health: Health check

Returns service health, version, cache stats, rate limiter status, and enabled features.

{
  "status": "ok",
  "service": "pry",
  "version": "3.0.0",
  "backends": {"flaresolverr":true,"playwright":true}
}

GET /mcp/tools: MCP tools list

List all available MCP (Model Context Protocol) tools. Connect any MCP-compatible AI agent (Claude Desktop, Cursor) to give it web scraping capabilities.

curl http://localhost:8005/mcp/tools

Deployment

Pry ships as a Docker Compose project. Minimum requirements: a Linux server with Docker and at least 2 GB of RAM.

# 1. Extract the download
tar xzf pry-3.0.0.tar.gz && cd pry

# 2. Start the stack
docker compose up -d

# 3. Verify
curl http://localhost:8005/health

# 4. (Optional) Enable Tor support
docker compose --profile tor up -d

What's included in the stack:

Service	Port	Description
munchcrawl	8005	Pry API server (FastAPI)
flaresolverr	8191	Cloudflare bypass service
playwright	—	Headless Chromium (bundled)
tor (optional)	9050	Tor SOCKS5 proxy

Error Codes

Status	Meaning	Action
200	Success	—
400	Bad request — missing URL or invalid params	Check your request body
429	Rate limit exceeded	Wait and retry; increase burst in config
500	Scrape failed — target unreachable or blocked	Try bypass_cloudflare:true or render_js:true
502	FlareSolverr down	Restart flaresolverr container
503	Circuit breaker open for domain	Check /v1/breaker/status; wait for backoff
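A client can turn the table above into a simple retry policy. The sketch below is one possible reading of the "Action" column: wait and retry on 429 and 503, escalate to bypass_cloudflare and then render_js on 500, and give up on 400 and 502. The plan_retry function, the attempt limit, and the delay schedule are all assumptions, not Pry behavior.

```python
def plan_retry(status_code, body, attempt, max_attempts=3):
    """Decide what to do after a response.

    Returns (should_retry, delay_seconds, next_body). `attempt` is the
    number of retries already made for this request.
    """
    if status_code < 400 or attempt >= max_attempts:
        return False, 0.0, body
    if status_code in (429, 503):
        # Rate limited or breaker open: wait, then resend unchanged.
        return True, min(60.0, 2.0 ** attempt), body
    if status_code == 500:
        # Scrape failed: escalate to heavier fetch methods one at a time.
        if not body.get("bypass_cloudflare"):
            return True, 0.0, {**body, "bypass_cloudflare": True}
        if not body.get("render_js"):
            return True, 0.0, {**body, "render_js": True}
    # 400 needs a corrected request; 502 needs the container restarted.
    return False, 0.0, body
```

Returning a new body instead of mutating the original keeps the first request available for logging.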
