Building an ECI Affidavit Extraction Pipeline: From PDF to Database in Minutes
India's Election Commission publishes affidavits for every candidate running for office. This guide shows how to extract structured criminal records, asset declarations, and financial data from ECI PDFs at scale using CrawlHQ.
Before every election in India, candidates file sworn affidavits with the Election Commission of India (ECI) declaring their criminal cases, assets, liabilities, and educational qualifications. These documents are public. They contain extraordinarily valuable structured data. And historically, accessing them at scale required either manual extraction or expensive proprietary databases.
This changes with LLM-powered extraction.
In this guide, we’ll build a pipeline that takes a list of ECI affidavit URLs and returns a clean, structured database-ready JSON object for each candidate — criminal records, declared wealth, liabilities, and all.
Why ECI Affidavit Data Matters
The use cases are broader than you might expect:
- Voter information apps — Show voters their local candidates’ criminal records and declared wealth
- Journalism and RTI work — Build searchable databases of candidate histories
- Political consultancy — Screen candidates before recommending them to parties
- Academic research — Study wealth accumulation and criminal case patterns in Indian politics
- Civic tech — Power tools like the ADR (Association for Democratic Reforms) database
A general election spans 543 Lok Sabha constituencies, each with multiple candidates; state elections add thousands more. A fully automated pipeline can process the entire dataset in hours.
The Data Available in ECI Affidavits
ECI affidavits (Form 26) contain:
- Section A: Personal information (name, DOB, PAN, address)
- Section B: Criminal antecedents (pending cases, convictions)
- Section C: Assets — movable and immovable, self and dependents
- Section D: Liabilities (loans from banks, others)
- Section E: Income and tax details
- Section F: Educational qualifications
The PDFs are publicly accessible via the ECI portal and election-specific affidavit repositories.
Step 1: Discover Affidavit URLs
ECI provides a search interface. For bulk extraction, the URLs follow a predictable pattern once you have candidate IDs. We’ll use CrawlHQ’s /v1/extract endpoint to retrieve a structured candidate list first.
```python
import httpx
import asyncio

API_KEY = "your_api_key"
BASE = "https://api.crawlhq.dev"

async def get_candidate_list(constituency_url: str) -> list[dict]:
    """Scrape the candidate list page for a constituency."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{BASE}/v1/extract",
            headers={"X-API-Key": API_KEY},
            json={
                "url": constituency_url,
                "schema": {
                    "candidates": [{
                        "name": "string",
                        "party": "string",
                        "affidavit_url": "string"
                    }]
                }
            }
        )
        data = resp.json()
        return data["extracted"]["candidates"]
```
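If you already have candidate IDs from an earlier crawl, you can skip the list page and construct affidavit URLs directly. The template below is purely illustrative, an assumed pattern; confirm the real path against the portal for your election before relying on it:

```python
# Hypothetical URL template -- the actual path varies by election
# and must be verified against the ECI portal you are scraping.
AFFIDAVIT_URL_TEMPLATE = (
    "https://affidavit.eci.gov.in/candidate?candidateId={candidate_id}"
)

def build_affidavit_urls(candidate_ids: list[int]) -> list[str]:
    """Expand candidate IDs into affidavit URLs via the template."""
    return [
        AFFIDAVIT_URL_TEMPLATE.format(candidate_id=cid)
        for cid in candidate_ids
    ]
```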
Step 2: Define the Extraction Schema
The schema is the core of this pipeline. It tells CrawlHQ exactly what to extract from each PDF:
```python
AFFIDAVIT_SCHEMA = {
    "candidate_name": "string",
    "party": "string",
    "constituency": "string",
    "state": "string",
    "election_type": "string",  # lok_sabha, vidhan_sabha, etc.
    "criminal_cases": [{
        "case_number": "string",
        "sections": ["string"],  # IPC sections
        "court": "string",
        "status": "string",  # pending, convicted, acquitted
        "year": "number"
    }],
    "self_assets": {
        "movable_total_inr": "number",
        "immovable_total_inr": "number",
        "cash_inr": "number",
        "bank_deposits_inr": "number",
        "vehicles": [{
            "type": "string",
            "value_inr": "number"
        }]
    },
    "spouse_assets": {
        "movable_total_inr": "number",
        "immovable_total_inr": "number"
    },
    "total_liabilities_inr": "number",
    "bank_loan_inr": "number",
    "declared_income_inr": "number",
    "income_tax_paid_inr": "number",
    "education": {
        "highest_qualification": "string",
        "institution": "string"
    },
    "pan": "string",
    "age": "number"
}
```
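When mapping a schema like this to database columns, it helps to enumerate every leaf field first. The helper below is a small assumed utility (not part of CrawlHQ) that flattens any nested schema of this shape into dotted paths, marking list-of-object fields with `[]` so you know they belong in child tables:

```python
def flatten_schema(schema: dict, prefix: str = "") -> list[str]:
    """Flatten a nested extraction schema into dotted column paths.

    Objects recurse with a "." separator; lists of objects are
    marked with "[]" to flag fields that need a child table.
    """
    columns = []
    for key, value in schema.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.extend(flatten_schema(value, f"{path}."))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            columns.extend(flatten_schema(value[0], f"{path}[]."))
        else:
            columns.append(path)
    return columns

# A miniature schema in the same shape as AFFIDAVIT_SCHEMA:
cols = flatten_schema({
    "name": "string",
    "self_assets": {"cash_inr": "number"},
    "criminal_cases": [{"case_number": "string"}],
    "sections": ["string"],
})
# -> ['name', 'self_assets.cash_inr', 'criminal_cases[].case_number', 'sections']
```

Running it over `AFFIDAVIT_SCHEMA` gives you the full column inventory for Step 4.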
Step 3: Bulk Extract All Affidavits
With the schema defined, extraction is straightforward:
```python
async def extract_affidavit(affidavit_url: str) -> dict:
    """Extract structured data from a single ECI affidavit PDF."""
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            f"{BASE}/v1/extract",
            headers={"X-API-Key": API_KEY},
            json={
                "url": affidavit_url,
                "schema": AFFIDAVIT_SCHEMA
            }
        )
        result = resp.json()
        if result.get("match_confidence", 0) < 0.7:
            # Low confidence — flag for manual review
            return {"_needs_review": True, "url": affidavit_url, **result["extracted"]}
        return result["extracted"]

async def process_constituency(constituency_url: str) -> list[dict]:
    """Process all candidates in a constituency."""
    candidates = await get_candidate_list(constituency_url)
    # Process in batches of 10 to respect rate limits
    results = []
    for i in range(0, len(candidates), 10):
        batch = candidates[i:i + 10]
        batch_results = await asyncio.gather(*[
            extract_affidavit(c["affidavit_url"])
            for c in batch
            if c.get("affidavit_url")
        ])
        results.extend(batch_results)
        await asyncio.sleep(1)  # Brief pause between batches
    return results
```
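Fixed batches of 10 work, but each batch waits for its slowest request. An alternative worth considering is an `asyncio.Semaphore` that keeps a steady number of requests in flight. A self-contained sketch, with a stand-in coroutine in place of `extract_affidavit`:

```python
import asyncio

async def gather_bounded(coros, limit: int = 10):
    """Run coroutines with at most `limit` in flight at any moment."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order in its results
    return await asyncio.gather(*[run(c) for c in coros])

async def demo():
    async def fake_extract(i: int) -> dict:  # stand-in for extract_affidavit
        await asyncio.sleep(0.01)
        return {"id": i}

    return await gather_bounded([fake_extract(i) for i in range(25)], limit=5)

results = asyncio.run(demo())
```

Swapping `fake_extract` for `extract_affidavit` gives you a constant 10 requests in flight instead of a stop-and-go batch cadence.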
Step 4: Validate and Store
Before writing to your database, validate the extracted data. ECI affidavits can have inconsistencies:
```python
def validate_candidate(data: dict) -> tuple[bool, list[str]]:
    """Validate extracted affidavit data. Returns (is_valid, issues)."""
    issues = []
    if not data.get("candidate_name"):
        issues.append("Missing candidate name")
    # Asset sanity check
    if data.get("self_assets"):
        assets = data["self_assets"]
        total = (assets.get("movable_total_inr", 0) or 0) + \
                (assets.get("immovable_total_inr", 0) or 0)
        if total > 10_000_000_000:  # >1000 crore — likely extraction error
            issues.append(f"Suspicious asset total: ₹{total:,}")
    # Criminal case validation
    for case in data.get("criminal_cases", []):
        if not case.get("sections"):
            issues.append(f"Criminal case {case.get('case_number')} missing IPC sections")
    return len(issues) == 0, issues
```
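For the storage half, one pragmatic approach (an assumption, not the only option) is a relational table with the frequently queried scalar fields as columns plus the full record preserved as JSON. A SQLite sketch:

```python
import json
import sqlite3

def store_candidates(rows: list[dict], db_path: str = "affidavits.db") -> int:
    """Persist validated candidate records; nested fields kept as raw JSON."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS candidates (
            candidate_name TEXT,
            party TEXT,
            constituency TEXT,
            total_liabilities_inr REAL,
            raw_json TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO candidates VALUES (?, ?, ?, ?, ?)",
        [
            (
                r.get("candidate_name"),
                r.get("party"),
                r.get("constituency"),
                r.get("total_liabilities_inr"),
                json.dumps(r),  # full record for fields not broken out
            )
            for r in rows
        ],
    )
    conn.commit()
    conn.close()
    return len(rows)
```

Keeping `raw_json` alongside the flat columns means you can add new queryable columns later without re-extracting anything.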
Credit Cost for a Full Election
For the 2024 Lok Sabha election (7,928 total candidates):
| Operation | Count | Credits |
|---|---|---|
| Candidate list extraction | 543 constituencies | 2,715 |
| Affidavit PDF extraction | 7,928 affidavits | 39,640 |
| Total | 8,471 calls | ~42,355 |
At ₹0.40/credit (Starter plan): ₹16,942 for a complete dataset of every Lok Sabha candidate’s declared assets and criminal history. That’s less than the cost of a single day’s manual data entry.
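The arithmetic above, assuming the table's rate of 5 credits per extract call, checks out in a couple of lines:

```python
def estimate_credits(n_constituencies: int, n_candidates: int,
                     credits_per_call: int = 5) -> int:
    """One extract call per candidate-list page and one per affidavit."""
    return (n_constituencies + n_candidates) * credits_per_call

credits = estimate_credits(543, 7_928)  # -> 42355
cost_inr = credits * 0.40               # Starter plan rate
```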
What You Can Build on Top
With this structured dataset, you can:
- Filter by criminal cases — find all candidates with serious IPC charges pending
- Rank by declared wealth — identify the wealthiest candidates by constituency
- Build trend analysis — compare wealth declarations across election cycles
- Cross-reference — join with voter rolls, constituency demographics, party data
- Build voter apps — show citizens their candidates’ complete history before they vote
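For example, the first two items reduce to a few lines over the extracted records. The helper below is illustrative and assumes only the schema fields defined earlier:

```python
def pending_case_candidates(candidates: list[dict]) -> list[dict]:
    """Candidates with at least one pending criminal case, wealthiest first
    (ranked by declared movable assets)."""
    flagged = [
        c for c in candidates
        if any(case.get("status") == "pending"
               for case in c.get("criminal_cases", []))
    ]
    return sorted(
        flagged,
        key=lambda c: (c.get("self_assets") or {}).get("movable_total_inr", 0) or 0,
        reverse=True,
    )
```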
The Association for Democratic Reforms (ADR) has been doing this manually for years. With CrawlHQ, you can replicate and extend their work programmatically, refreshed in near real time as new filings come in.
Start processing ECI affidavits today. Get your free API key →