Live POST /v1/read 1–2 credits per request

Any URL to Clean Markdown.
LLM-Ready in One Call.

Strip the noise — ads, navbars, cookie banners, footers — and get back clean, structured Markdown your LLM can actually use. Deterministic output, every run.

terminal
curl -X POST https://api.crawlhq.dev/v1/read \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://techcrunch.com/article"}'
response
{
  "status": "success",
  "markdown": "# Article Title

Clean body text...",
  "title": "Article Title",
  "word_count": 1842,
  "credits_used": 1
}
200 OK 1–2 credits per request

What makes it production-grade

Every module is built for pipelines that run without you watching.

🧹

Noise Removal

Strips ads, cookie banners, navigation, sidebars, and footers automatically. Returns only the main content.

📝

Structured Markdown

Preserves heading hierarchy, lists, code blocks, and tables. Output is valid Markdown ready to drop into any LLM context.

🔗

Link Preservation

Internal and external links are preserved in Markdown syntax. Useful for citation tracking and source attribution.

🖼️

Image Alt Text

Image references are preserved with alt text. Your LLM knows what visuals were on the page without needing to process images.

📊

Metadata Extraction

Returns title, description, author, publish date, and word count alongside the Markdown. Structured context for free.

Fast & Deterministic

Same URL returns the same structure every run. Build RAG pipelines that chunk and embed predictably.

Use Cases

What teams build with read

RAG Knowledge Base

Crawl your documentation, competitor blogs, and industry news. Convert to Markdown, chunk, embed, and serve via RAG — with fresh content daily.

LLM Context Grounding

Before calling your LLM, fetch and read the relevant URL. Ground your prompt in live web content instead of stale training data.
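A minimal sketch of the grounding step: wrap the Markdown returned by `read` into your prompt before the LLM call. The template and function name here are illustrative, not part of the API.

```python
def build_grounded_prompt(question: str, markdown: str, source_url: str) -> str:
    """Wrap freshly fetched Markdown (e.g. the `markdown` field from the
    read endpoint) into a prompt so the model answers from live content
    rather than stale training data. The template is an example only."""
    return (
        "Answer the question using ONLY the source below.\n\n"
        f"Source ({source_url}):\n{markdown}\n\n"
        f"Question: {question}"
    )
```

Pass the result to whatever LLM client you already use; the endpoint only supplies the clean Markdown.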

Content Summarisation Pipeline

Read 100 articles, pass the Markdown to an LLM, get structured summaries. Build daily briefing tools in a weekend.

Competitor Blog Monitoring

Read competitor articles as clean Markdown. Feed into an LLM to extract topic clusters, identify content gaps, and track strategic messaging.

Research Automation

Turn any URL list into a structured reading list. Read, chunk, embed, and surface relevant passages using semantic search.
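The "surface relevant passages" step reduces to ranking chunk embeddings by similarity to a query embedding. A dependency-free sketch (assumes you already have embeddings from your vector model; the vectors below are toy values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_passages(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs.
    Returns the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In production you would store the embeddings in a vector database rather than ranking in memory, but the logic is the same.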

Legal & Compliance Monitoring

Read regulatory pages, government notices, and policy documents. Convert to searchable Markdown and alert on content changes.
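Because `read` output is deterministic for static pages, change detection can be as simple as hashing the Markdown and comparing against the last stored hash. A sketch (the function names are hypothetical, not API features):

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Stable fingerprint of a page's Markdown. Deterministic output for
    static pages means a changed hash implies the source page changed."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def has_changed(previous_hash: str, markdown: str) -> bool:
    """Compare fresh Markdown against the hash stored from the last run."""
    return content_fingerprint(markdown) != previous_hash
```

Store the hash per URL after each run and alert whenever `has_changed` returns true.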

Frequently asked questions

How is this different from just fetching the page and stripping HTML tags?
Simple HTML stripping gives you raw text with no structure. CrawlHQ's read endpoint understands content hierarchy — it identifies the main article, preserves heading levels, formats tables properly, and removes boilerplate using semantic content detection, not regex.
Does it work on paywalled or JS-rendered sites?
For JS-rendered sites, read automatically uses headless Chrome when needed. For paywalled sites, you can pass session cookies to access subscriber content.
What's the token count of typical output?
A standard news article returns 500–2,000 tokens of clean Markdown. Long-form content and documentation pages can be 5,000–15,000 tokens. The word_count field in the response helps you estimate before chunking.
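A quick way to estimate tokens from the `word_count` field, using the common rule of thumb of roughly 1.33 tokens per English word. The ratio is an assumption, not a value the API guarantees; use your model's tokenizer for exact counts.

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 1.33) -> int:
    """Rough token estimate from the response's word_count field.
    1.33 tokens/word is a rule of thumb for English prose only."""
    return round(word_count * tokens_per_word)
```

For the sample response above, `estimate_tokens(1842)` gives about 2,450 tokens.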
Can I use this to build a RAG pipeline?
Yes — this is the primary use case. Fetch URLs with read, chunk the Markdown at heading boundaries, embed with your vector model, and store in Pinecone, Weaviate, or pgvector. The deterministic output makes chunk boundaries predictable across runs.
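Chunking at heading boundaries can be sketched in a few lines. This is an illustrative splitter, not a CrawlHQ helper; it ignores edge cases like `#` characters inside fenced code blocks.

```python
def chunk_at_headings(markdown: str, max_level: int = 2) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading
    of level <= max_level. Deterministic input yields deterministic chunks,
    so chunk boundaries stay stable across runs."""
    chunks, current = [], []
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            # Flush the current chunk when a top-level-enough heading starts.
            if 1 <= level <= max_level and current:
                chunks.append("\n".join(current).strip())
                current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Embed each chunk with your model of choice and upsert into Pinecone, Weaviate, or pgvector keyed by URL plus chunk index.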
Is the output the same every time I call it on the same URL?
Yes, for static pages. For dynamic pages (news feeds, dashboards), the content changes as the source updates — which is the correct behaviour for a live web intelligence system.

Start using read in minutes

2,500 free credits. No credit card. One API key for all 9 modules.

Get API Key Free →