🕷️ KHC Site Crawler


A fast, concurrent web crawler for developers who need to search for keywords across websites that don't offer native search functionality. Built with Python, Flask, and modern web technologies.

📋 Table of Contents

  • Features
  • Installation
  • Usage
  • Advanced Configuration
  • API Reference
  • Technical Architecture
  • How It Works
  • Limitations
  • Future Enhancements
  • Contributing
  • License

🔥 Features

Core Functionality

  • Deep Web Crawling: Recursively crawls web pages up to a specified depth (supports depths 1-5)
  • Concurrent Processing: Multi-threaded architecture for efficient scanning using Python's ThreadPoolExecutor
  • Keyword Detection: Fast regex-based pattern matching for finding keywords in page content
  • Respect for Website Structure: Only follows links within the same domain (see the sketch after this list)
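
As a rough illustration of the keyword matching and same-domain filtering described above, here is a minimal sketch. It is a simplified approximation, not the exact code in crawler.py, and the function names are hypothetical:

import re
from urllib.parse import urljoin, urlparse

def find_keywords(text, keywords):
    """Return the keywords that appear anywhere in the page text (case-insensitive)."""
    return {kw for kw in keywords if re.search(re.escape(kw), text, re.IGNORECASE)}

def same_domain_links(base_url, hrefs):
    """Resolve relative hrefs and keep only links on the same domain as base_url."""
    base_domain = urlparse(base_url).netloc
    links = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_domain:
            links.add(absolute)
    return links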

Advanced Crawler Features

  • URL Normalization: Handles different formats of the same URL to avoid duplicate crawling
  • Depth Control: Configurable depth limit to prevent excessive crawling
  • Request Management (see the sketch after this list):
    • Exponential backoff retry mechanism for failed requests
    • Random delay between requests to reduce server load
    • Timeout handling for stalled requests
    • Custom user agent rotation to avoid detection
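
The request-management behavior listed above can be pictured with the short sketch below. It is illustrative only, assuming the requests library is used under the hood and that USER_AGENTS is the list described under Advanced Configuration; crawler.py may differ in detail:

import random
import time
import requests
from urllib.parse import urlparse, urlunparse

MAX_RETRIES = 3
TIMEOUT = 15

def normalize_url(url):
    """Collapse equivalent forms of a URL (host case, trailing slash, fragment)."""
    parts = urlparse(url)
    path = parts.path.rstrip('/') or '/'
    return urlunparse((parts.scheme, parts.netloc.lower(), path, '', parts.query, ''))

def fetch(url, user_agents):
    """Fetch a URL with user-agent rotation, random delays, a timeout, and exponential backoff."""
    for attempt in range(MAX_RETRIES):
        time.sleep(random.uniform(0.5, 2.0))                  # random delay to reduce server load
        headers = {'User-Agent': random.choice(user_agents)}  # rotate user agents
        try:
            response = requests.get(url, headers=headers, timeout=TIMEOUT)
            response.raise_for_status()
            return response
        except requests.RequestException:
            time.sleep(2 ** attempt)                          # exponential backoff before retrying
    return None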

User Experience Enhancements

  • Real-Time Progress Display: Live terminal-like output showing crawling progress
  • Interactive Results: Double-click on found keywords to open the corresponding URL
  • Results Counter: Live-updating badge showing the count of found URL matches
  • Status Indicators: Visual status indicators (Processing/Error/Complete)
  • Export Functionality: Download scan results for offline analysis

Security & Performance

  • SSL Verification Override: Option to bypass SSL certificate verification for testing environments (see the sketch after this list)
  • Proxy Support: Configurable proxy rotation to avoid IP-based blocking
  • Connection Testing: Test connections before starting a full scan
  • Output Capture: Non-blocking output handling for smooth UI updates
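
In requests terms, the SSL override usually amounts to something like the snippet below (an illustrative fragment; verify=False should only ever be pointed at test environments you control):

import requests
import urllib3

# Silence the warning urllib3 emits for unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# verify=False skips certificate validation -- for testing environments only
response = requests.get('https://self-signed.example.test', verify=False, timeout=15)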

⚡ Installation

Prerequisites

  • Python 3.7 or higher
  • pip (Python package manager)
  • Modern web browser

Quick Install

# Clone this repository
git clone https://github.com/Krypto-Hashers-Community/KHC-Site-Crawler.git

# Navigate to the project directory
cd KHC-Site-Crawler

# Install dependencies
pip install -r requirements.txt

# Run the application using the start script
./start.sh

Docker Installation

# Build the Docker image
docker build -t khc-site-crawler .

# Run the container (which will use start.sh internally)
docker run -p 5000:5000 khc-site-crawler

🚀 Usage

Starting a Scan

  1. Access the Web Interface:

    • Run the application with ./start.sh
    • Open your browser and navigate to the URL shown in the terminal (usually http://localhost:5000, though start.sh may pick another free port)
    • You'll be greeted with the KHC Site Crawler welcome screen
  2. Enter URL:

    • Input the target website URL (e.g., example.com or https://example.com)
    • The protocol (http/https) will be added automatically if omitted
    • Click "Next" to proceed
  3. Configure Scan Parameters:

    • Keywords: Enter one or more keywords separated by commas
    • Crawl Depth: Select the depth of crawling
      • Shallow (2): Quick but limited coverage
      • Normal (3): Balanced performance and coverage
      • Deep (4): Extended coverage, slower performance
      • Very Deep (5): Exhaustive scanning, significantly slower
    • Proxy Usage: Toggle to enable/disable proxy rotation for bypassing restrictions
    • Click "Start Scanning" to begin
  4. Monitor Progress:

    • The terminal output window shows real-time scanning progress with color coding:
      • ๐Ÿ” Blue: Information messages
      • โš ๏ธ Yellow: Warning messages
      • โŒ Red: Error messages
      • โœ… Green: Success messages
    • The status button indicates current scan state:
      • Processing (yellow): Scan in progress
      • Error (red): Scan encountered critical errors
      • Complete (green): Scan successfully completed
  5. View Results:

    • Found keywords appear in the results section as they're discovered
    • The counter badge shows the total number of unique URLs with matches
    • Double-click on any result to open the corresponding page in a new tab
    • Use "Open All Found URLs" to open all matching pages at once
  6. Export Results:

    • Click "Download Results" to save a text file with all scan output and found keywords

Keyboard Shortcuts

  • Enter in URL field: Proceed to keywords page if URL is valid
  • Enter in keywords field: Start scanning if keywords are provided

🧰 Advanced Configuration

Proxy Configuration

The application supports custom proxy configurations for advanced users. Modify the FREE_PROXIES list in crawler.py:

FREE_PROXIES = [
    None,  # Direct connection
    'http://your-proxy-server:port',
    'http://another-proxy:port',
]
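
A proxy drawn from this list would typically be handed to requests roughly as follows (an illustrative fragment, not code taken from crawler.py):

import random
import requests

proxy = random.choice(FREE_PROXIES)                           # FREE_PROXIES as defined above; None means direct
proxies = {'http': proxy, 'https': proxy} if proxy else None
response = requests.get('https://example.com', proxies=proxies, timeout=15)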

User Agent Rotation

Add or modify user agents in the USER_AGENTS list in crawler.py:

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    # Add your custom user agent strings here
]

Timeout and Workers Configuration

Adjust the crawler's timeout and concurrency settings by modifying the global variables in crawler.py:

MAX_RETRIES = 3  # Maximum number of retry attempts
MAX_DEPTH = 3    # Default maximum crawl depth
MAX_WORKERS = 3  # Number of concurrent threads
TIMEOUT = 15     # Request timeout in seconds

Startup Script

The application uses a startup script (start.sh) that handles:

  • Finding an available port automatically
  • Killing any conflicting processes if needed
  • Starting the Flask application with the correct configuration
  • Providing a direct URL to access the web interface

You can modify start.sh if you need to customize the startup process further.


🔌 API Reference

Backend Routes

Endpoint           Method  Description                          Parameters
/                  GET     Renders the main application page    None
/start_scan        POST    Initiates a new scan                 url, keywords, max_depth, use_proxies
/get_scan_results  GET     Retrieves current scan results       None
/download_results  GET     Downloads scan results as text file  None

Example API Usage

// Start a scan
fetch('/start_scan', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
    },
    body: JSON.stringify({
        url: 'example.com',
        keywords: ['keyword1', 'keyword2'],
        max_depth: 3,
        use_proxies: true
    })
})
.then(response => response.json())
.then(data => console.log(data));

// Get scan results
fetch('/get_scan_results')
.then(response => response.json())
.then(data => {
    console.log('Scan Results:', data.scan_results);
    console.log('Found Keywords:', data.found_keywords);
    console.log('Is Scanning:', data.is_scanning);
});
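
The same endpoints can be driven from Python as well. The sketch below uses the requests library and assumes the server is reachable at http://localhost:5000 (adjust to the URL printed by start.sh):

import requests

BASE = 'http://localhost:5000'  # replace with the URL shown by start.sh

# Start a scan
requests.post(f'{BASE}/start_scan', json={
    'url': 'example.com',
    'keywords': ['keyword1', 'keyword2'],
    'max_depth': 3,
    'use_proxies': True,
})

# Poll for results
data = requests.get(f'{BASE}/get_scan_results').json()
print('Scan Results:', data.get('scan_results'))
print('Found Keywords:', data.get('found_keywords'))
print('Is Scanning:', data.get('is_scanning'))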

🛠️ Technical Architecture

Component Overview

KHC Site Crawler follows a modern web application architecture with three main components:

  1. Flask Backend (see the sketch after this list):

    • Serves the web interface
    • Manages the crawling process in a separate thread
    • Provides API endpoints for frontend communication
    • Captures and stores crawler output
  2. Crawler Engine:

    • Performs recursive website traversal
    • Handles HTTP requests with retry logic
    • Implements concurrent processing for efficiency
    • Detects keywords using regex pattern matching
  3. Web Frontend:

    • Provides an intuitive user interface
    • Displays real-time crawling progress
    • Updates results dynamically with AJAX polling
    • Offers interactive results with direct URL access
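
A minimal sketch of how the backend half of this architecture could be wired together is shown below. The route names follow the API Reference table, while run_crawler and the shared scan_state dictionary are hypothetical stand-ins for the project's actual internals:

import threading
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)
scan_state = {'scan_results': [], 'found_keywords': [], 'is_scanning': False}  # hypothetical shared state

def run_crawler(config, state):
    """Stand-in for the real crawler entry point in crawler.py."""
    # ... crawl config['url'] up to config['max_depth'], appending output to state ...
    state['is_scanning'] = False

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/start_scan', methods=['POST'])
def start_scan():
    config = request.get_json()
    scan_state['is_scanning'] = True
    # The crawl runs in a background thread so this request returns immediately
    threading.Thread(target=run_crawler, args=(config, scan_state), daemon=True).start()
    return jsonify({'status': 'started'})

@app.route('/get_scan_results')
def get_scan_results():
    return jsonify(scan_state)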

Data Flow

+------------------+       +-----------------+       +------------------+
|                  |       |                 |       |                  |
|  User Interface  |------>|  Flask Server   |------>|  Crawler Engine  |
|  (HTML/CSS/JS)   |       |  (Python/Flask) |       |  (Python)        |
|                  |<------|                 |<------|                  |
+------------------+       +-----------------+       +------------------+
                               |         ^
                               |         |
                               v         |
                           +----------------------+
                           |                      |
                           |  Output Capture &    |
                           |  Results Storage     |
                           |                      |
                           +----------------------+

🧩 How It Works

Crawling Process

  1. Initialization:

    • The application captures the base URL, keywords, max depth, and proxy settings
    • A separate thread is spawned to run the crawler
    • Output capture is initialized to redirect crawler output
  2. URL Processing:

    • The crawler starts with the base URL
    • For each URL, the crawler:
      • Adds the URL to the visited set
      • Sends an HTTP request with retry logic
      • Parses the HTML content with BeautifulSoup
      • Searches for keywords in the text content
      • Extracts links for further crawling
  3. Concurrent Execution (see the sketch after this list):

    • A ThreadPoolExecutor manages crawling multiple URLs simultaneously
    • Each URL is processed in a separate thread
    • Thread-safe operations use a mutex lock to prevent race conditions
  4. Output Handling:

    • Terminal output is captured in real-time
    • Lines containing keyword matches are identified and stored
    • Output is sent to the frontend via polling
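
Putting steps 2 and 3 together, a simplified version of the crawl loop might look like the sketch below. It is an approximation for illustration (a level-by-level traversal with a lock guarding shared output), not the exact structure of crawler.py:

import re
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

print_lock = threading.Lock()  # mutex so concurrent threads do not interleave output

def crawl_page(url, keywords):
    """Fetch one page, report keyword hits, and return its same-domain links."""
    try:
        response = requests.get(url, timeout=15)
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(response.text, 'html.parser')
    text = soup.get_text()
    for keyword in keywords:
        if re.search(re.escape(keyword), text, re.IGNORECASE):
            with print_lock:
                print(f"Found '{keyword}' at {url}")
    base_domain = urlparse(url).netloc
    return {urljoin(url, a['href']) for a in soup.find_all('a', href=True)
            if urlparse(urljoin(url, a['href'])).netloc == base_domain}

def crawl(start_url, keywords, max_depth=3, max_workers=3):
    """Breadth-first crawl: each depth level is fanned out over a thread pool."""
    visited, frontier = set(), {start_url}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for _ in range(max_depth):
            frontier -= visited
            if not frontier:
                break
            visited |= frontier
            results = executor.map(crawl_page, frontier, [keywords] * len(frontier))
            frontier = set().union(*results)

if __name__ == '__main__':
    # Example: scan example.com for two keywords, three levels deep
    crawl('https://example.com', ['keyword1', 'keyword2'], max_depth=3)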

Example Request Flow

For a typical scan:

1. User submits scan configuration
2. Frontend sends POST request to /start_scan
3. Backend spawns crawler thread
4. Frontend begins polling /get_scan_results
5. Crawler processes URLs and finds keywords
6. Output is captured and sent to frontend
7. Frontend updates UI with results
8. Scan completes and status is updated
9. User can interact with or export results

⚠️ Limitations

  • No robots.txt Support: The crawler does not currently respect robots.txt directives
  • JavaScript Rendering: The crawler doesn't execute JavaScript, so content loaded dynamically may not be crawled
  • Form Authentication: Cannot crawl pages behind login forms or authentication walls
  • Rate Limiting: While the crawler implements delays, aggressive crawling may trigger rate limiting
  • Resource Intensive: Deep crawling of large sites may consume significant system resources
  • Single-Domain Focus: The crawler remains within a single domain and doesn't follow external links

🔮 Future Enhancements

  • Advanced Filtering: Content filtering by type, date, or relevance
  • Robots.txt Support: Respecting website crawling policies
  • Headless Browser Integration: Using Selenium/Puppeteer for JavaScript rendering
  • Authentication Support: Crawling behind login forms
  • Schedule Scans: Set up recurring scans with notifications
  • Differential Results: Compare scan results over time
  • Export Formats: Add support for CSV, JSON, and PDF exports
  • Custom Regex Patterns: Allow more complex search patterns
  • Sitemap Support: Use XML sitemaps for more efficient crawling
  • Visual Crawl Map: Interactive visualization of crawled pages
  • Content Classification: AI-powered content categorization

🤝 Contributing

Contributions are welcome! Here's how you can help improve KHC Site Crawler:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add some amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

For major changes, please open an issue first to discuss what you would like to change.


📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by the Krypto Hashers Community