Court Data Fetcher & Dashboard

This repository provides a Python application that fetches case details from the Delhi High Court websites, solves CAPTCHAs automatically and presents the results via a simple Flask web interface.
It was built as part of an internship project and demonstrates how to combine web scraping, optical character recognition (OCR) and a lightweight dashboard to streamline access to public case information.

Features

Case Search Form – Users can enter a case type (e.g. FAO, LPA), registration number and filing year to query the Delhi High Court websites.
Automated CAPTCHA Solving – The scraper detects the CAPTCHA image, takes a screenshot of it and uses Tesseract OCR to read the four‑digit code.
Concurrent Scraping – Two separate scrapers run in parallel: one to fetch the filing date, petitioner and respondent (from the “pCase” site), and another to fetch the next hearing date and the first order PDF link.
Data Persistence – Successful queries are logged into a SQLite database (queries.db) for future reference.
Dashboard – Results are displayed in a Bootstrap‑styled HTML page (templates/index.html).
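
The parallel scraping described above could be arranged roughly as follows. This is a minimal sketch using only the standard library, not the code in app.py: the functions fetch_party_details and fetch_hearing_details are hypothetical stand-ins for the two scrapers, and the dictionary keys are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_party_details(case_type, number, year):
    # Hypothetical stand-in for the pCase scraper (filing date and parties).
    return {"filing_date": None, "petitioner": None, "respondent": None}


def fetch_hearing_details(case_type, number, year):
    # Hypothetical stand-in for the case status scraper (hearing date, order PDF).
    return {"next_hearing": None, "order_pdf": None}


def fetch_case(case_type, number, year):
    """Run both scrapers in parallel and merge their results into one dict."""
    args = (case_type, number, year)
    with ThreadPoolExecutor(max_workers=2) as pool:
        party_future = pool.submit(fetch_party_details, *args)
        hearing_future = pool.submit(fetch_hearing_details, *args)
        # result() blocks until the scraper finishes and re-raises its exception.
        merged = {**party_future.result(), **hearing_future.result()}
    return merged


result = fetch_case("FAO", "123", "2020")
```

Threads suit this workload because each scraper spends most of its time waiting on the browser and the network. The sketch assumes each scraper creates and closes its own Selenium driver rather than sharing one across threads.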

Project Structure

.
├── app.py                # Flask application entry point
├── 2nd Scrap.py          # Scraper for the pCase site (renamed from `2nd scrap.py`)
├── data_extract.py       # Scraper for the public case status site (renamed from `data extract.py`)
├── chromedriver.exe      # ChromeDriver binary used by Selenium (Windows build)
├── queries.db            # SQLite database storing search logs
├── templates/
│   └── index.html        # HTML template for the dashboard
├── external/
│   └── tesseract/        # Place `tesseract.exe` here or adjust `TESSERACT_BUNDLE` in `app.py`
├── requirements.txt      # Python dependencies
├── .gitignore            # Files/directories to ignore in Git
└── README.md             # Project documentation (this file)

CAPTCHA Bypass Approach

Many public court portals protect their search forms with simple four‑digit CAPTCHAs. To automate requests without human intervention, the scraper implements the following strategy:

Locate the CAPTCHA element – When the page loads, Selenium locates the CAPTCHA image element and waits for its src attribute to be populated.
Extract the image – If the src attribute contains a Base64‑encoded image, the scraper decodes it; otherwise, it fetches the image via an HTTP request or falls back to a screenshot of the element.
Preprocess the image – The image is converted to grayscale and then thresholded to create a high‑contrast black‑and‑white version. This helps Tesseract to distinguish digits from the noisy background.
Run Tesseract OCR – Using the bundled tesseract.exe, the code calls pytesseract.image_to_string with a page segmentation mode that expects a single line of digits. A regular expression ensures that only 4‑digit results are accepted.
Retry if necessary – If OCR fails, the scraper waits briefly and refreshes the CAPTCHA up to three times before giving up.

A simplified version of the CAPTCHA solver can be found in data_extract.py:

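import base64
import io
import re

import cv2
import numpy as np
import pytesseract
from PIL import Image
from selenium.webdriver.common.by import By

# Extract the image data from the CAPTCHA element (`driver` is an existing Selenium WebDriver)
captcha_el = driver.find_element(By.ID, "captcha-code")
src = captcha_el.get_attribute("src") or ""
if src.startswith("data:image"):
    captcha_bytes = base64.b64decode(src.split(",", 1)[1])
else:
    captcha_bytes = captcha_el.screenshot_as_png

# Preprocess for OCR: grayscale, then an inverted Otsu threshold
img = Image.open(io.BytesIO(captcha_bytes)).convert("RGB")
gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
# With THRESH_OTSU the threshold is computed automatically, so the 150 is ignored
_, bw = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Read using Tesseract (single line of digits)
txt = pytesseract.image_to_string(bw, config="--psm 7 digits")
match = re.search(r"\b\d{4}\b", txt)
code = match.group(0) if match else None  # None means OCR did not yield four digits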
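
The snippet above covers a single OCR attempt. The retry step could be wrapped around it roughly as shown below. This is a sketch under assumptions rather than the code in data_extract.py: read_captcha and refresh_captcha are hypothetical callables (one returns the raw OCR text for the current image, the other loads a new image), and the attempt count and delay are illustrative defaults.

```python
import re
import time

FOUR_DIGITS = re.compile(r"\b\d{4}\b")


def solve_with_retries(read_captcha, refresh_captcha, attempts=3, delay=1.0):
    """Return a four-digit code as a string, or None if every attempt fails.

    read_captcha and refresh_captcha are hypothetical callables supplied by
    the caller; they are not part of Selenium or pytesseract.
    """
    for attempt in range(attempts):
        match = FOUR_DIGITS.search(read_captcha())
        if match:
            return match.group(0)
        if attempt < attempts - 1:
            time.sleep(delay)   # brief pause before requesting a new image
            refresh_captcha()
    return None
```

Passing the two steps in as callables keeps the retry logic independent of the browser, so it can be exercised with plain functions before it is connected to Selenium.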

Prerequisites

Python 3.8+
Google Chrome or another Chromium‑based browser
ChromeDriver – The repository includes a Windows build (chromedriver.exe). If you are on Linux or macOS, download the matching driver for your browser and replace the binary accordingly.
Tesseract OCR – Download the appropriate Tesseract binary for your platform and place it in external/tesseract/tesseract.exe (or update TESSERACT_BUNDLE in app.py).
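
One way to locate the Tesseract binary on different platforms is sketched below. It is an assumption about how such a lookup might work, not the logic behind TESSERACT_BUNDLE in app.py, and resolve_tesseract is a hypothetical helper name. It checks the bundled directory first and then falls back to any tesseract executable on the PATH.

```python
import shutil
from pathlib import Path


def resolve_tesseract(bundle_dir="external/tesseract"):
    """Return the path to a Tesseract executable, or None if none is found."""
    for name in ("tesseract.exe", "tesseract"):
        candidate = Path(bundle_dir) / name
        if candidate.is_file():
            return str(candidate)
    # Fall back to an installation on PATH (e.g. from a package manager).
    return shutil.which("tesseract")


# Assumed usage with pytesseract, which reads the executable path from
# pytesseract.pytesseract.tesseract_cmd:
#   import pytesseract
#   path = resolve_tesseract()
#   if path:
#       pytesseract.pytesseract.tesseract_cmd = path
```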

Installation

Clone this repository:

git clone https://github.com/your-username/court-data-fetcher.git
cd court-data-fetcher

Create a virtual environment (optional but recommended):

python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up Tesseract:
- Download Tesseract for your operating system from the UB Mannheim builds or your package manager.
- Copy the binary (tesseract.exe on Windows) into external/tesseract/, or modify the TESSERACT_BUNDLE path in app.py to point to your installation.

Set up ChromeDriver:
- Ensure that the version of ChromeDriver matches your installed version of Google Chrome.
- Replace the chromedriver.exe in the root if necessary.

Running the Application

# Start the Flask server
python app.py

Then open http://localhost:5000 in your browser to access the dashboard.
Enter a case type, case number and year, and the application will fetch and display the case details.

Notes

Error Handling – If the sites are down or the CAPTCHA cannot be solved after several attempts, the application will return a "No Case Found" message.
Logging – All successful queries are appended to the queries table in queries.db. You can explore the database using SQLite tools.
Legal and Ethical Considerations – Scraping court websites should respect their terms of service. This project is for educational purposes; you are responsible for complying with local laws and website policies.
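
The logging described in the Logging note could look roughly like the sketch below. The table name queries comes from this README, but the column names and the log_query helper are assumptions made for illustration; the actual schema in queries.db may differ. The example writes to an in-memory database so that it does not touch the real file.

```python
import sqlite3
from datetime import datetime, timezone


def log_query(conn, case_type, case_number, year, result_summary):
    """Append one successful search to the queries table (assumed schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS queries (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               case_type TEXT, case_number TEXT, filing_year TEXT,
               result TEXT, searched_at TEXT)"""
    )
    # Parameter placeholders keep user input out of the SQL string.
    conn.execute(
        "INSERT INTO queries (case_type, case_number, filing_year, result, searched_at)"
        " VALUES (?, ?, ?, ?, ?)",
        (case_type, case_number, year, result_summary,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # the application itself uses queries.db
log_query(conn, "FAO", "123", "2020", "example result")
rows = conn.execute(
    "SELECT case_type, case_number, filing_year FROM queries"
).fetchall()
```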

License

This project is provided for educational purposes. You should add an appropriate license if you plan to distribute or use it beyond personal or academic work.
