Live · POST /v1/extract · 5 credits per request

Define a Schema.
Get Back Structured Data.

Tell CrawlHQ what you want. Point it at any URL. Get back a JSON object that matches your exact schema — every run, every time. No regex, no XPath, no fragile selectors.

terminal
curl -X POST https://api.crawlhq.dev/v1/extract \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com/pricing",
    "schema": {
      "plans": [{"name":"string","price":"number"}]
    }
  }'
response
{
  "extracted": {
    "plans": [
      {"name": "Starter", "price": 49},
      {"name": "Pro", "price": 149}
    ]
  },
  "source_url": "https://competitor.com/pricing",
  "credits_used": 5
}
200 OK · 5 credits per request
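For Python pipelines, the same call can be sketched with the standard library alone. Everything below mirrors the curl example above; the send itself is commented out since it needs a live API key.

```python
import json
import os
import urllib.request

API_URL = "https://api.crawlhq.dev/v1/extract"

# Same schema as the curl example: field names mapped to type hints.
payload = {
    "url": "https://competitor.com/pricing",
    "schema": {"plans": [{"name": "string", "price": "number"}]},
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "X-API-Key": os.environ.get("API_KEY", ""),
        "Content-Type": "application/json",
    },
)

# Uncomment to send (requires a valid key in $API_KEY):
# with urllib.request.urlopen(req) as resp:
#     extracted = json.loads(resp.read())["extracted"]
```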

What makes it production-grade

Every module is built for pipelines that run without you watching.

📐

Schema-Defined Output

You define exactly what fields you want. CrawlHQ returns a JSON object matching your schema — nested objects, arrays, enums, whatever you need.

🧠

LLM-Powered Extraction

No fragile CSS selectors or XPath. Semantic understanding means extraction works even when the site redesigns or reorganises content.

🔍

Source Attribution

Every extracted field traces back to a source URL and timestamp. Full audit trail — know exactly where every data point came from.

📄

PDF Support

Extract structured data from PDF documents — ECI affidavits, financial reports, government filings, contracts. Same schema-driven approach.

🔁

Batch Extraction

Pass an array of URLs and get back an array of structured objects. Extract from 100 product pages in a single API call.
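A batch request might be assembled like this. Note the exact field name for the URL array is not shown on this page, so the `urls` key below is an illustrative assumption, as are the example shop URLs.

```python
import json

# One schema applied to every URL in the batch.
product_schema = {
    "name": "string",
    "price": "number",
    "sku": "string",
    "availability": "boolean",
}

# 100 hypothetical product pages extracted in a single API call.
urls = [f"https://shop.example.com/products/{i}" for i in range(1, 101)]

# Assumed batch payload shape: an array of URLs plus a shared schema.
batch_payload = {"urls": urls, "schema": product_schema}

body = json.dumps(batch_payload)
```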

Schema Validation

Output is validated against your schema before returning. You get a match_confidence score and field-level validation results.

Use Cases

What teams build with extract

Competitor Pricing Intelligence

Define a pricing schema. Point at 20 competitor pricing pages. Get back a structured comparison table — automatically, daily.

ECI Affidavit Data Extraction

Extract candidate criminal history, declared assets, and liabilities from public ECI PDF affidavits. 500 candidates processed in minutes.

Product Catalogue Scraping

Define a product schema with name, price, SKU, availability. Extract from any e-commerce site — even JS-rendered ones.

Job Listing Extraction

Extract structured job data — title, company, salary, requirements, location — from any job board. Build market intelligence tools.

Lead Data Enrichment

Point at a company's About page. Extract company size, founding year, tech stack, leadership team. Feed into your CRM automatically.

Financial Filing Analysis

Extract key metrics from annual reports, quarterly results, and investor presentations. Structure unstructured financial data at scale.

Frequently asked questions

What schema format do you use?
A simple JSON object where keys are field names and values are type hints: 'string', 'number', 'boolean', or nested objects and arrays. You don't need to write JSON Schema or Pydantic models — just describe what you want.
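To make the type-hint format concrete, here is a schema sketch plus a hypothetical local checker (not part of the API) that verifies a returned object conforms to the hinted types:

```python
# Type-hint schema as described above: plain strings for types,
# nested objects, and arrays with a single example element.
schema = {
    "plans": [{"name": "string", "price": "number"}],
    "currency": "string",
}

# Note: bool is a subclass of int in Python; a stricter checker would exclude it.
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def matches(value, hint):
    """Hypothetical local check that a value conforms to a type-hint schema."""
    if isinstance(hint, str):
        return isinstance(value, PY_TYPES[hint])
    if isinstance(hint, list):   # array: every element matches the item hint
        return isinstance(value, list) and all(matches(v, hint[0]) for v in value)
    if isinstance(hint, dict):   # nested object: every field present and matching
        return isinstance(value, dict) and all(
            k in value and matches(value[k], h) for k, h in hint.items()
        )
    return False

data = {"plans": [{"name": "Starter", "price": 49}], "currency": "USD"}
```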
Why does extract cost 5 credits vs 1 for scrape?
Extract uses an LLM to semantically understand the page and map content to your schema. The higher compute cost is reflected in the credit price — but you're getting structured, validated data ready for your database, not raw HTML.
What happens if the page doesn't contain some fields in my schema?
Missing fields are returned as null, with a match_confidence score indicating how well the schema was satisfied. You can set require_all_fields: true to get an error instead of nulls.
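In lenient mode (the default described above), a quick post-processing step can surface which fields came back null. The `null_fields` helper below is a hypothetical client-side utility, not an API feature; `require_all_fields` is the option named in this FAQ.

```python
# Strict mode: ask the API to error instead of returning nulls.
payload = {
    "url": "https://competitor.com/pricing",
    "schema": {"plans": [{"name": "string", "price": "number"}], "currency": "string"},
    "require_all_fields": True,
}

def null_fields(extracted):
    """Hypothetical helper: list top-level fields returned as null in lenient mode."""
    return [k for k, v in extracted.items() if v is None]

# Example lenient-mode result where one schema field was not found on the page:
extracted = {"plans": [{"name": "Starter", "price": 49}], "currency": None}
```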
Can it extract from PDFs?
Yes. Pass a PDF URL (including Google Drive, ECI portal, and most document hosting sites). CrawlHQ will fetch, parse, and extract structured data from the PDF content.
Does the extraction break if the website redesigns?
Unlike CSS selectors, semantic extraction is resilient to layout changes. As long as the content is still on the page, extraction will still work after a redesign.
Is there a way to validate the quality of extraction?
Yes. Each response includes a match_confidence field (0–1) and per-field extraction status. You can set a confidence threshold and retry or flag low-confidence extractions.
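A confidence threshold can be applied in a few lines. The `match_confidence` field is as described in this FAQ; the `triage` function, the 0.8 cutoff, and the mocked responses are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune for your pipeline

def triage(responses, threshold=CONFIDENCE_THRESHOLD):
    """Split extraction responses into accepted results and ones to retry or flag."""
    accepted, flagged = [], []
    for resp in responses:
        bucket = accepted if resp["match_confidence"] >= threshold else flagged
        bucket.append(resp)
    return accepted, flagged

# Mocked responses shaped like the API output shown at the top of the page:
responses = [
    {"source_url": "https://a.example/pricing", "match_confidence": 0.95},
    {"source_url": "https://b.example/pricing", "match_confidence": 0.42},
]

accepted, flagged = triage(responses)
```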

Start using extract in minutes

2,500 free credits. No credit card. One API key for all 9 modules.

Get API Key Free →