This repository provides a Python application that fetches case details from the Delhi High Court websites, solves CAPTCHAs automatically and presents the results via a simple Flask web interface.
It was built as part of an internship project and demonstrates how to combine web scraping, optical character recognition (OCR) and a lightweight dashboard to streamline access to public case information.
- Case Search Form – Users can enter a case type (e.g. FAO, LPA), registration number and filing year to query the Delhi High Court websites.
- Automated CAPTCHA Solving – The scraper detects the CAPTCHA image, takes a screenshot of it and uses Tesseract OCR to read the four‑digit code.
- Concurrent Scraping – Two separate scrapers run in parallel: one to fetch the filing date, petitioner and respondent (from the “pCase” site), and another to fetch the next hearing date and the first order PDF link.
- Data Persistence – Successful queries are logged into a SQLite database (`queries.db`) for future reference.
- Dashboard – Results are displayed in a Bootstrap‑styled HTML page (`templates/index.html`).
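The concurrent scraping feature can be sketched with the standard `concurrent.futures` module. This is a minimal illustration rather than the repository's code: `fetch_party_details`, `fetch_hearing_details` and `fetch_case` are hypothetical names, and the two stand-in scrapers return empty placeholders instead of driving a browser.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_party_details(case_type, number, year):
    """Hypothetical stand-in for the pCase scraper."""
    return {"filing_date": None, "petitioner": None, "respondent": None}


def fetch_hearing_details(case_type, number, year):
    """Hypothetical stand-in for the case status scraper."""
    return {"next_hearing": None, "order_pdf": None}


def fetch_case(case_type, number, year):
    """Run both scrapers in parallel threads and merge their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        parties = pool.submit(fetch_party_details, case_type, number, year)
        hearing = pool.submit(fetch_hearing_details, case_type, number, year)
        # result() blocks until the scraper finishes and re-raises any exception
        return {**parties.result(), **hearing.result()}
```

Threads are a reasonable fit here because each scraper spends most of its time waiting on the browser and the network rather than on the CPU.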
```text
.
├── app.py              # Flask application entry point
├── 2nd Scrap.py        # Scraper for the pCase site (renamed from `2nd scrap.py`)
├── data_extract.py     # Scraper for the public case status site (renamed from `data extract.py`)
├── chromedriver.exe    # ChromeDriver binary used by Selenium (Windows build)
├── queries.db          # SQLite database storing search logs
├── templates/
│   └── index.html      # HTML template for the dashboard
├── external/
│   └── tesseract/      # Place `tesseract.exe` here or adjust `TESSERACT_BUNDLE` in `app.py`
├── requirements.txt    # Python dependencies
├── .gitignore          # Files/directories to ignore in Git
└── README.md           # Project documentation (this file)
```
Many public court portals protect their search forms with simple four‑digit CAPTCHAs. To automate requests without human intervention, the scraper implements the following strategy:
- Locate the CAPTCHA element – When the page loads, Selenium locates the CAPTCHA image element and waits for its `src` attribute to be populated.
- Extract the image – If the `src` attribute contains a Base64‑encoded image, it decodes it; otherwise, it fetches the image via an HTTP request or falls back to a screenshot of the element.
- Preprocess the image – The image is converted to grayscale and then thresholded to create a high‑contrast black‑and‑white version. This helps Tesseract to distinguish digits from the noisy background.
- Run Tesseract OCR – Using the bundled `tesseract.exe`, the code calls `pytesseract.image_to_string` with a page segmentation mode that expects a single line of digits. A regular expression ensures that only 4‑digit results are accepted.
- Retry if necessary – If OCR fails, the scraper waits briefly and refreshes the CAPTCHA up to three times before giving up.
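The validate-and-retry steps can be sketched independently of Selenium. In this sketch `read_captcha` and `refresh_captcha` are hypothetical callables (the first returns the raw OCR text, the second reloads the CAPTCHA image), `solve_with_retries` is an illustrative name, and the one-second delay is an assumed value, since the text above only says the scraper waits briefly.

```python
import re
import time

# Accept only a standalone run of exactly four digits
FOUR_DIGITS = re.compile(r"\b\d{4}\b")


def solve_with_retries(read_captcha, refresh_captcha, max_attempts=3, delay=1.0):
    """Return a four-digit code, or None if every attempt fails.

    read_captcha and refresh_captcha are hypothetical callables supplied by
    the caller; delay (in seconds) is an assumed value.
    """
    for attempt in range(max_attempts):
        match = FOUR_DIGITS.search(read_captcha())
        if match:
            return match.group(0)
        if attempt < max_attempts - 1:
            time.sleep(delay)   # wait briefly before trying again
            refresh_captcha()   # load a new CAPTCHA image
    return None
```

Passing the OCR and refresh steps in as callables keeps the retry policy testable without a browser; whether the real scraper counts three attempts or three refreshes is not stated, so the limit here is a parameter.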
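A simplified version of the CAPTCHA solver can be found in `data_extract.py`:

```python
import base64
import io
import re
import cv2
import numpy as np
import pytesseract
from PIL import Image
from selenium.webdriver.common.by import By

# Extract the image data from the CAPTCHA element (driver is a Selenium WebDriver)
img = driver.find_element(By.ID, "captcha-code")
src = img.get_attribute("src") or ""
if src.startswith("data:image"):
    captcha_bytes = base64.b64decode(src.split(",", 1)[1])
else:
    captcha_bytes = img.screenshot_as_png

# Preprocess for OCR (with THRESH_OTSU the threshold is chosen automatically, so 150 is ignored)
img = Image.open(io.BytesIO(captcha_bytes)).convert("RGB")
gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
_, bw = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Read using Tesseract (single line of digits)
txt = pytesseract.image_to_string(bw, config="--psm 7 digits")
match = re.search(r"\b\d{4}\b", txt)
code = match.group(0) if match else None  # None when no four-digit code was read
```

- Python 3.8+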
- Google Chrome or another Chromium‑based browser
- ChromeDriver – The repository includes a Windows build (`chromedriver.exe`). If you are on Linux or macOS, download the matching driver for your browser and replace the binary accordingly.
- Tesseract OCR – Download the appropriate Tesseract binary for your platform and place it in `external/tesseract/tesseract.exe` (or update `TESSERACT_BUNDLE` in `app.py`).
- Clone this repository:

  ```bash
  git clone https://github.com/your-username/court-data-fetcher.git
  cd court-data-fetcher
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up Tesseract:
  - Download Tesseract for your operating system from the UB Mannheim builds or your package manager.
  - Copy the binary (`tesseract.exe` on Windows) into `external/tesseract/`, or modify the `TESSERACT_BUNDLE` path in `app.py` to point to your installation.
- Set up ChromeDriver:
  - Ensure that the version of ChromeDriver matches your installed version of Google Chrome.
  - Replace the `chromedriver.exe` in the root if necessary.
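One way to tolerate both Tesseract setups is to prefer the bundled binary and fall back to whatever is on `PATH`. This is a sketch under assumptions: the value of `TESSERACT_BUNDLE` in `app.py` is not shown in this README, so the default path below and the helper `resolve_tesseract` are illustrative only.

```python
import shutil
from pathlib import Path

# Assumed default; the actual value of TESSERACT_BUNDLE in app.py may differ
TESSERACT_BUNDLE = Path("external") / "tesseract" / "tesseract.exe"


def resolve_tesseract(bundle=TESSERACT_BUNDLE):
    """Return the bundled binary if it exists, else one found on PATH, else None."""
    bundle = Path(bundle)
    if bundle.is_file():
        return str(bundle)
    return shutil.which("tesseract")


# Usage (requires pytesseract to be installed):
#   import pytesseract
#   path = resolve_tesseract()
#   if path:
#       pytesseract.pytesseract.tesseract_cmd = path
```

Returning `None` instead of raising lets the caller decide how to report a missing installation.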
```bash
# Start the Flask server
python app.py
```

Then open http://localhost:5000 in your browser to access the dashboard.
Enter a case type, case number and year, and the application will fetch and display the case details.
- Error Handling – If the sites are down or the CAPTCHA cannot be solved after several attempts, the application will return a "No Case Found" message.
- Logging – All successful queries are appended to the `queries` table in `queries.db`. You can explore the database using SQLite tools.
- Legal and Ethical Considerations – Scraping court websites should respect their terms of service. This project is for educational purposes; you are responsible for complying with local laws and website policies.
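The logging note can be illustrated with the standard `sqlite3` module. The README names the `queries` table but not its schema, so the columns below (`case_type`, `case_number`, `filing_year`, `queried_at`) and the helper `log_query` are assumptions chosen for illustration, and the inserted values are made-up example inputs.

```python
import sqlite3
from datetime import datetime, timezone


def log_query(conn, case_type, case_number, filing_year):
    """Append one search to a `queries` table (assumed schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS queries (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               case_type TEXT,
               case_number TEXT,
               filing_year INTEGER,
               queried_at TEXT)"""
    )
    conn.execute(
        "INSERT INTO queries (case_type, case_number, filing_year, queried_at) "
        "VALUES (?, ?, ?, ?)",
        (case_type, case_number, filing_year,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


# Usage: an in-memory database keeps the example from touching queries.db
conn = sqlite3.connect(":memory:")
log_query(conn, "FAO", "123", 2023)
rows = conn.execute(
    "SELECT case_type, case_number, filing_year FROM queries"
).fetchall()
```

Parameterised `?` placeholders keep user-supplied form values out of the SQL string itself.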
This project is provided for educational purposes. You should add an appropriate license if you plan to distribute or use it beyond personal or academic work.