Building an ECI Affidavit Extraction Pipeline: From PDF to Database in Minutes
India's Election Commission publishes affidavits for every candidate running for office. This guide shows how to extract structured criminal records, asset declarations, and financial data from ECI PDFs at scale using CrawlHQ.
Before every election in India, candidates file sworn affidavits with the Election Commission of India (ECI) declaring their criminal cases, assets, liabilities, and educational qualifications. These documents are public. They contain extraordinarily valuable structured data. And historically, accessing them at scale required either manual extraction or expensive proprietary databases.
This changes with LLM-powered extraction.
In this guide, we’ll build a pipeline that takes a list of ECI affidavit URLs and returns a clean, structured database-ready JSON object for each candidate — criminal records, declared wealth, liabilities, and all.
Why ECI Affidavit Data Matters
The use cases are broader than you might expect:
- Voter information apps — Show voters their local candidates’ criminal records and declared wealth
- Journalism and RTI work — Build searchable databases of candidate histories
- Political consultancy — Screen candidates before recommending them to parties
- Academic research — Study wealth accumulation and criminal case patterns in Indian politics
- Civic tech — Power tools like the ADR (Association for Democratic Reforms) database
A general election spans 543 Lok Sabha constituencies, each with multiple candidates; state elections add thousands more. A fully automated pipeline can process the entire dataset in hours.
The Data Available in ECI Affidavits
ECI affidavits (Form 26) contain:
- Section A: Personal information (name, DOB, PAN, address)
- Section B: Criminal antecedents (pending cases, convictions)
- Section C: Assets — movable and immovable, self and dependents
- Section D: Liabilities (loans from banks, others)
- Section E: Income and tax details
- Section F: Educational qualifications
The PDFs are publicly accessible via the ECI portal and election-specific affidavit repositories.
Step 1: Discover Affidavit URLs
ECI provides a search interface. For bulk extraction, the URLs follow a predictable pattern once you have candidate IDs. We’ll use CrawlHQ’s /v1/extract endpoint to retrieve a structured candidate list first.
```python
import httpx
import asyncio

API_KEY = "your_api_key"
BASE = "https://api.crawlhq.dev"

async def get_candidate_list(constituency_url: str) -> list[dict]:
    """Scrape the candidate list page for a constituency."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{BASE}/v1/extract",
            headers={"X-API-Key": API_KEY},
            json={
                "url": constituency_url,
                "schema": {
                    "candidates": [{
                        "name": "string",
                        "party": "string",
                        "affidavit_url": "string"
                    }]
                }
            }
        )
        data = resp.json()
        return data["extracted"]["candidates"]
```
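If you already have candidate IDs from an earlier crawl, you can skip the list page and construct affidavit URLs directly. The template below is purely illustrative, an assumed pattern; confirm the real path against the portal for your election before relying on it:

```python
# Hypothetical URL template -- the actual path varies by election
# and must be verified against the ECI portal you are scraping.
AFFIDAVIT_URL_TEMPLATE = (
    "https://affidavit.eci.gov.in/candidate?candidateId={candidate_id}"
)

def build_affidavit_urls(candidate_ids: list[int]) -> list[str]:
    """Expand candidate IDs into affidavit URLs via the template."""
    return [
        AFFIDAVIT_URL_TEMPLATE.format(candidate_id=cid)
        for cid in candidate_ids
    ]
```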
Step 2: Define the Extraction Schema
The schema is the core of this pipeline. It tells CrawlHQ exactly what to extract from each PDF:
```python
AFFIDAVIT_SCHEMA = {
    "candidate_name": "string",
    "party": "string",
    "constituency": "string",
    "state": "string",
    "election_type": "string",  # lok_sabha, vidhan_sabha, etc.
    "criminal_cases": [{
        "case_number": "string",
        "sections": ["string"],  # IPC sections
        "court": "string",
        "status": "string",  # pending, convicted, acquitted
        "year": "number"
    }],
    "self_assets": {
        "movable_total_inr": "number",
        "immovable_total_inr": "number",
        "cash_inr": "number",
        "bank_deposits_inr": "number",
        "vehicles": [{
            "type": "string",
            "value_inr": "number"
        }]
    },
    "spouse_assets": {
        "movable_total_inr": "number",
        "immovable_total_inr": "number"
    },
    "total_liabilities_inr": "number",
    "bank_loan_inr": "number",
    "declared_income_inr": "number",
    "income_tax_paid_inr": "number",
    "education": {
        "highest_qualification": "string",
        "institution": "string"
    },
    "pan": "string",
    "age": "number"
}
```
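When mapping a schema like this to database columns, it helps to enumerate every leaf field first. The helper below is a small assumed utility (not part of CrawlHQ) that flattens any nested schema of this shape into dotted paths, marking list-of-object fields with `[]` so you know they belong in child tables:

```python
def flatten_schema(schema: dict, prefix: str = "") -> list[str]:
    """Flatten a nested extraction schema into dotted column paths.

    Objects recurse with a "." separator; lists of objects are
    marked with "[]" to flag fields that need a child table.
    """
    columns = []
    for key, value in schema.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.extend(flatten_schema(value, f"{path}."))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            columns.extend(flatten_schema(value[0], f"{path}[]."))
        else:
            columns.append(path)
    return columns

# A miniature schema in the same shape as AFFIDAVIT_SCHEMA:
cols = flatten_schema({
    "name": "string",
    "self_assets": {"cash_inr": "number"},
    "criminal_cases": [{"case_number": "string"}],
    "sections": ["string"],
})
# -> ['name', 'self_assets.cash_inr', 'criminal_cases[].case_number', 'sections']
```

Running it over `AFFIDAVIT_SCHEMA` gives you the full column inventory for Step 4.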
Step 3: Bulk Extract All Affidavits
With the schema defined, extraction is straightforward:
```python
async def extract_affidavit(affidavit_url: str) -> dict:
    """Extract structured data from a single ECI affidavit PDF."""
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            f"{BASE}/v1/extract",
            headers={"X-API-Key": API_KEY},
            json={
                "url": affidavit_url,
                "schema": AFFIDAVIT_SCHEMA
            }
        )
        result = resp.json()
        if result.get("match_confidence", 0) < 0.7:
            # Low confidence — flag for manual review
            return {"_needs_review": True, "url": affidavit_url, **result["extracted"]}
        return result["extracted"]

async def process_constituency(constituency_url: str) -> list[dict]:
    """Process all candidates in a constituency."""
    candidates = await get_candidate_list(constituency_url)
    # Process in batches of 10 to respect rate limits
    results = []
    for i in range(0, len(candidates), 10):
        batch = candidates[i:i + 10]
        batch_results = await asyncio.gather(*[
            extract_affidavit(c["affidavit_url"])
            for c in batch
            if c.get("affidavit_url")
        ])
        results.extend(batch_results)
        await asyncio.sleep(1)  # Brief pause between batches
    return results
```
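Fixed batches of 10 work, but each batch waits for its slowest request. An alternative worth considering is an `asyncio.Semaphore` that keeps a steady number of requests in flight. A self-contained sketch, with a stand-in coroutine in place of `extract_affidavit`:

```python
import asyncio

async def gather_bounded(coros, limit: int = 10):
    """Run coroutines with at most `limit` in flight at any moment."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order in its results
    return await asyncio.gather(*[run(c) for c in coros])

async def demo():
    async def fake_extract(i: int) -> dict:  # stand-in for extract_affidavit
        await asyncio.sleep(0.01)
        return {"id": i}

    return await gather_bounded([fake_extract(i) for i in range(25)], limit=5)

results = asyncio.run(demo())
```

Swapping `fake_extract` for `extract_affidavit` gives you a constant 10 requests in flight instead of a stop-and-go batch cadence.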
Step 4: Validate and Store
Before writing to your database, validate the extracted data. ECI affidavits can have inconsistencies:
```python
def validate_candidate(data: dict) -> tuple[bool, list[str]]:
    """Validate extracted affidavit data. Returns (is_valid, issues)."""
    issues = []
    if not data.get("candidate_name"):
        issues.append("Missing candidate name")
    # Asset sanity check
    if data.get("self_assets"):
        assets = data["self_assets"]
        total = (assets.get("movable_total_inr", 0) or 0) + \
                (assets.get("immovable_total_inr", 0) or 0)
        if total > 10_000_000_000:  # >1000 crore — likely extraction error
            issues.append(f"Suspicious asset total: ₹{total:,}")
    # Criminal case validation
    for case in data.get("criminal_cases", []):
        if not case.get("sections"):
            issues.append(f"Criminal case {case.get('case_number')} missing IPC sections")
    return len(issues) == 0, issues
```
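For the storage half, one pragmatic approach (an assumption, not the only option) is a relational table with the frequently queried scalar fields as columns plus the full record preserved as JSON. A SQLite sketch:

```python
import json
import sqlite3

def store_candidates(rows: list[dict], db_path: str = "affidavits.db") -> int:
    """Persist validated candidate records; nested fields kept as raw JSON."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS candidates (
            candidate_name TEXT,
            party TEXT,
            constituency TEXT,
            total_liabilities_inr REAL,
            raw_json TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO candidates VALUES (?, ?, ?, ?, ?)",
        [
            (
                r.get("candidate_name"),
                r.get("party"),
                r.get("constituency"),
                r.get("total_liabilities_inr"),
                json.dumps(r),  # full record for fields not broken out
            )
            for r in rows
        ],
    )
    conn.commit()
    conn.close()
    return len(rows)
```

Keeping `raw_json` alongside the flat columns means you can add new queryable columns later without re-extracting anything.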
Credit Cost for a Full Election
For the 2024 Lok Sabha election (7,928 total candidates):
| Operation | Count | Credits |
|---|---|---|
| Candidate list extraction | 543 constituencies | 2,715 |
| Affidavit PDF extraction | 7,928 affidavits | 39,640 |
| Total | 8,471 calls | ~42,355 |
At ₹0.40/credit (Starter plan): ₹16,942 for a complete dataset of every Lok Sabha candidate’s declared assets and criminal history. That’s less than the cost of a single day’s manual data entry.
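The arithmetic above, assuming the table's rate of 5 credits per extract call, checks out in a couple of lines:

```python
def estimate_credits(n_constituencies: int, n_candidates: int,
                     credits_per_call: int = 5) -> int:
    """One extract call per candidate-list page and one per affidavit."""
    return (n_constituencies + n_candidates) * credits_per_call

credits = estimate_credits(543, 7_928)  # -> 42355
cost_inr = credits * 0.40               # Starter plan rate
```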
What You Can Build on Top
With this structured dataset, you can:
- Filter by criminal cases — find all candidates with serious IPC charges pending
- Rank by declared wealth — identify the wealthiest candidates by constituency
- Build trend analysis — compare wealth declarations across election cycles
- Cross-reference — join with voter rolls, constituency demographics, party data
- Build voter apps — show citizens their candidates’ complete history before they vote
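For example, the first two items reduce to a few lines over the extracted records. The helper below is illustrative and assumes only the schema fields defined earlier:

```python
def pending_case_candidates(candidates: list[dict]) -> list[dict]:
    """Candidates with at least one pending criminal case, wealthiest first
    (ranked by declared movable assets)."""
    flagged = [
        c for c in candidates
        if any(case.get("status") == "pending"
               for case in c.get("criminal_cases", []))
    ]
    return sorted(
        flagged,
        key=lambda c: (c.get("self_assets") or {}).get("movable_total_inr", 0) or 0,
        reverse=True,
    )
```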
The Association for Democratic Reforms (ADR) has been doing this manually for years. With CrawlHQ, you can replicate and extend their work programmatically, refreshed in near real time as new filings come in.
Start processing ECI affidavits today. Get your free API key →