RAG Without Hallucinations: Why Your Data Pipeline Matters More Than Your Model

Most RAG hallucinations don't come from the model — they come from bad input data. Stale web pages, unparsed HTML noise, and broken chunk boundaries create garbage-in-garbage-out failure modes that no amount of prompt engineering can fix.

CrawlHQ Team · 27 March 2026 · 10 min read

If your RAG system is hallucinating, the usual suspects get the blame: the embedding model, the retrieval strategy, the LLM, the prompts. Developers iterate on all of these looking for the fix.

Most of the time, the real problem is upstream. It’s the data.

This post covers the data pipeline failures that cause RAG hallucinations and how to fix them before your documents ever reach an embedding model.

Why Input Data Quality Dominates RAG Performance

The fundamental promise of RAG is: give the model the right context, and it will reason correctly over it. The failure mode is: give the model the wrong context, and it will hallucinate with confidence.

“Wrong context” has several forms:

  1. Stale context — the web page you indexed six months ago no longer reflects current pricing, product status, or policy
  2. Noisy context — HTML boilerplate, navigation text, cookie banners, and footer links embedded in your chunks alongside the actual content
  3. Broken chunks — a sentence split mid-thought across two chunks, causing retrieval to surface incomplete fragments
  4. Mixed-language chunks — navigation text in English mixed with content in another language, causing embedding drift
  5. Duplicate context — the same information indexed multiple times with slight variations, causing the model to see “conflicting” information

None of these are LLM problems. They’re pipeline problems.

Failure Mode 1: HTML Noise in Embeddings

If you’re scraping web pages and passing the raw HTML through a naive text extractor (Beautiful Soup’s get_text(), for example), you’re embedding navigation menus, cookie consent text, ad labels, and footer links alongside your actual content.

Consider a pricing page. The text extractor might produce:

Home Products Pricing Blog About Login Sign Up
Skip to main content
Cookie preferences Accept all Reject all
Starter Plan ₹999/month Up to 5 users 10GB storage
Pro Plan ₹2,999/month Unlimited users 100GB storage
© 2026 Company Inc. Privacy Policy Terms of Service

The actual content — two pricing tiers — is buried in navigation and footer noise. When you chunk and embed this, the navigation text dilutes the semantic signal of the pricing information. Retrieval for “what does the Pro plan cost?” may not surface this chunk reliably.

The fix: Use /v1/read instead of /v1/scrape for content destined for RAG. The read endpoint converts web pages to clean Markdown, stripping navigation, footer, ads, and boilerplate. It preserves headers, lists, tables, and code blocks — the structural elements that help chunkers make sensible cuts.

# Instead of this:
resp = await crawlhq.scrape(url=url)
text = strip_html(resp.html)  # noisy

# Do this:
resp = await crawlhq.read(url=url)
markdown = resp.markdown  # clean, structured

Failure Mode 2: Stale Context

RAG systems built on crawled content have a fundamental freshness problem: the web changes constantly, but your index doesn’t.

A pricing page indexed three months ago may show ₹999/month for a plan that now costs ₹1,499/month. Your chatbot will confidently quote the old price. The user will sign up expecting one price and encounter another. This is a trust-destroying hallucination that has nothing to do with your LLM.

The fix: Use /v1/watch to maintain a live index. Register watches on the pages most likely to change (pricing, product features, API documentation) and trigger re-indexing whenever content changes.

from datetime import datetime

async def register_content_watch(url: str, collection_id: str):
    """Watch a URL and re-index it whenever content changes."""
    await crawlhq.watch(
        url=url,
        schedule="0 6 * * *",  # check daily
        webhook="https://yourapp.com/hooks/reindex",
        metadata={"collection_id": collection_id, "source_url": url}
    )

async def handle_reindex_webhook(event: dict):
    """Called by CrawlHQ when watched content changes."""
    if event["event"] != "content_changed":
        return

    source_url = event["metadata"]["source_url"]
    collection_id = event["metadata"]["collection_id"]

    # Re-fetch clean content
    resp = await crawlhq.read(url=source_url)

    # Delete old chunks from vector DB
    await vector_db.delete_by_source(source_url)

    # Re-chunk and re-embed
    chunks = chunk_markdown(resp.markdown)
    embeddings = await embed_batch(chunks)
    await vector_db.upsert(
        collection=collection_id,
        documents=chunks,
        embeddings=embeddings,
        metadata={"source_url": source_url, "indexed_at": datetime.utcnow().isoformat()}
    )

Failure Mode 3: Semantically Dead Chunks

The default chunking strategy — split every N tokens with M token overlap — is a blunt instrument. It creates chunks that cut mid-sentence, split tables across boundaries, and separate headings from the content they describe.
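To see how blunt it is, here is a minimal fixed-window chunker (a sketch; whitespace tokens stand in for a real tokenizer such as tiktoken):

```python
def naive_chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split every `size` tokens with `overlap` tokens of overlap.

    Ignores sentences, tables, and headings entirely: a cut can land
    anywhere, which is exactly the failure mode described above.
    """
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```

Nothing in this function knows where a table row or heading ends, so the cut points are effectively arbitrary.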

Consider a Markdown document with this structure:

## Pricing

| Plan | Price | Storage |
|------|-------|---------|
| Starter | ₹999/mo | 10GB |
| Pro | ₹2,999/mo | 100GB |

The Pro plan includes priority support and a 99.9% SLA.

A naive 512-token chunker might split this into:

Chunk 1: ## Pricing\n\n| Plan | Price | Storage |\n|------|-------|---------|
Chunk 2: | Starter | ₹999/mo | 10GB |\n| Pro | ₹2,999/mo | 100GB |

Chunk 1 has the heading but no data. Chunk 2 has data but no heading. A query for “Pro plan pricing” may retrieve Chunk 2 with no context about what these numbers refer to.

The fix: Structure-aware chunking. Split on Markdown headings first, then split large sections by paragraph, then by sentence. Keep tables intact. Attach the parent heading to each chunk as a metadata prefix.

def chunk_markdown(md: str, max_tokens: int = 512) -> list[dict]:
    """Chunk Markdown by structure, not by token count alone."""
    chunks = []
    current_heading = ""

    for section in split_by_heading(md):
        if section.is_heading:
            current_heading = section.text
            continue

        # Prepend the parent heading so the chunk carries its context
        content = f"{current_heading}\n\n{section.content}" if current_heading else section.content

        # If the section fits in one chunk, keep it whole (tables stay intact)
        if token_count(content) <= max_tokens:
            chunks.append({"text": content, "heading": current_heading})
        else:
            # Split large sections by paragraph, re-attaching the heading to each
            for para in split_by_paragraph(section.content):
                text = f"{current_heading}\n\n{para}" if current_heading else para
                chunks.append({"text": text, "heading": current_heading})

    return chunks
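The helpers here are left abstract. As one possible shape, a minimal `split_by_heading` can be built on a regex over ATX-style headings (this is a sketch, not the only way to do it):

```python
import re
from dataclasses import dataclass

@dataclass
class Section:
    text: str            # raw text of the section
    content: str = ""    # body text (empty for headings)
    is_heading: bool = False

def split_by_heading(md: str) -> list[Section]:
    """Split Markdown into alternating heading and body sections."""
    sections = []
    # Capturing group keeps the heading lines in the split output
    for part in re.split(r"(?m)^(#{1,6} .+)$", md):
        part = part.strip()
        if not part:
            continue
        if re.match(r"^#{1,6} ", part):
            sections.append(Section(text=part, is_heading=True))
        else:
            sections.append(Section(text=part, content=part))
    return sections
```

`token_count` and `split_by_paragraph` can be equally simple stand-ins (a tokenizer call and a split on blank lines, respectively).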

Failure Mode 4: Missing Source Attribution

When your RAG system produces a wrong answer, you need to trace it to its source. If your chunks don’t carry source metadata, you can’t debug the pipeline.

Every chunk should carry at minimum:

  • source_url — the exact URL it was extracted from
  • indexed_at — when it was indexed
  • chunk_index — its position within the source document

CrawlHQ’s read endpoint returns source attribution by default:

{
  "markdown": "...",
  "source_url": "https://example.com/pricing",
  "crawled_at": "2026-03-27T09:12:44Z",
  "title": "Pricing — Example",
  "word_count": 847
}

Propagate this metadata through your pipeline to every chunk and every embedding. When a hallucination occurs, you can trace it to the exact source document and determine whether the issue is stale data, bad chunking, or a retrieval failure.
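The propagation step is mechanical; a sketch, assuming the response fields shown above and a hypothetical `build_records` helper:

```python
def build_records(resp: dict, chunks: list[dict]) -> list[dict]:
    """Attach source attribution to every chunk before upserting."""
    return [
        {
            "text": chunk["text"],
            "source_url": resp["source_url"],
            "indexed_at": resp["crawled_at"],
            "chunk_index": i,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Every record that reaches the vector DB now answers "where did this come from, and when?" on its own.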

The Pipeline That Doesn’t Hallucinate

Combining these fixes, here’s a RAG data pipeline with strong hallucination resistance:

1. Crawl → /v1/read (clean Markdown, no HTML noise)
2. Watch → /v1/watch (re-index on change, no stale data)
3. Chunk → structure-aware chunker (whole tables, heading context)
4. Metadata → source_url + indexed_at + chunk_index on every chunk
5. Validate → confidence threshold before writing to vector DB
6. Monitor → log retrieval scores; alert when average drops
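Steps 5 and 6 can start very simple: a floor on extraction quality before indexing, and a rolling mean over retrieval scores for alerting. The thresholds below are illustrative, not tuned:

```python
from collections import deque

MIN_WORDS = 50               # illustrative indexing floor (step 5)
ALERT_BELOW = 0.45           # illustrative retrieval-score alert line (step 6)
scores = deque(maxlen=100)   # rolling window of recent retrieval scores

def worth_indexing(markdown: str) -> bool:
    """Reject near-empty extractions before they reach the vector DB."""
    return len(markdown.split()) >= MIN_WORDS

def record_retrieval(score: float) -> bool:
    """Log a retrieval score; return False when the rolling mean degrades."""
    scores.append(score)
    return (sum(scores) / len(scores)) >= ALERT_BELOW
```

A `False` from `record_retrieval` is your signal to inspect the index, not to blame the model.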

The LLM is the last 20% of this problem. The data pipeline is the first 80%. Get the pipeline right, and your model choice matters much less than you think.


CrawlHQ /v1/read produces clean, structured Markdown from any URL — built specifically for LLM ingestion. Start free →

CrawlHQ Team
Building India's web data API platform. Previously: data engineering, growth engineering, and too much time on HN.
