mibrahim0499/candidate-scrapper

πŸš€ JobToday Candidate Scraper

A powerful, user-friendly browser extension and API solution for automating candidate data collection from JobToday with AI-powered chat summaries and seamless integrations.

🎯 Overview

JobToday Candidate Scraper is a comprehensive solution that automates the collection of candidate information from JobToday.com. It consists of:

  • πŸ”Œ Browser Extension: A sleek, user-friendly Chrome extension with a modern UI that requires zero technical knowledge
  • βš™οΈ Backend API: A robust Flask API with Playwright-based web scraping
  • πŸ€– AI Integration: Automatic chat history summarization using OpenAI GPT-4o-mini
  • πŸ“Š Data Management: Seamless integration with Airtable and n8n for automated workflows

Perfect for recruiters and HR professionals who want to streamline their candidate data collection process without writing a single line of code.

✨ Features

🎨 Browser Extension

  • Zero-Config Setup: Guided onboarding flow that walks you through setup in minutes
  • Beautiful UI: Modern, sleek interface with gradient designs and smooth animations
  • Real-Time Dashboard: Live stats showing total candidates, new today, and candidates with chat
  • Candidate Cards: Visual cards displaying key information at a glance
  • Chat Visualization: Beautiful chat-bubble interface for viewing conversation history
  • Search & Filter: Quickly find candidates by name, phone, location, or filter by chat availability
  • Detailed Profiles: Comprehensive candidate views with tabs for Overview, Chat, and Experience

⚑ Backend Scraper

  • Automated Data Collection: Extracts comprehensive candidate information including:
    • Personal details (name, phone, email, location)
    • Professional experience and work history
    • Certificates and qualifications
    • Languages spoken
    • Complete chat conversation history
  • AI-Powered Summaries: Automatically generates concise chat summaries using OpenAI
  • Smart Duplicate Prevention: Checks Airtable to avoid duplicate entries
  • Session Management: Persistent login sessions reduce authentication overhead
  • Error Handling: Robust retry logic for failed operations
  • Progress Tracking: Real-time progress updates during scraping

πŸ”— Integrations

  • Airtable: Automatic syncing of new candidates to your database
  • n8n: Webhook integration for custom automation workflows
  • Local Export: JSON and CSV exports for backup and analysis

πŸ“Έ Features in Action

Browser Extension Dashboard

The extension features a modern, gradient-based dashboard displaying:

  • Real-time Statistics: Total candidates, candidates with chat history, and new candidates today
  • Recent Candidates Preview: Quick access to the 5 most recent candidates with visual cards
  • Progress Tracking: Live progress updates during scraping operations
  • Quick Actions: One-click scraping and settings access

Candidate List View

  • Search Functionality: Instantly find candidates by name, phone number, or location
  • Smart Filters: Filter by "All", "With Chat", or "New" candidates
  • Visual Cards: Each candidate card displays key information with chat indicators
  • Smooth Scrolling: Efficient navigation through large candidate lists

Candidate Detail View

Comprehensive candidate profiles organized into three intuitive tabs:

  1. Overview Tab

    • Contact information (phone, email, location)
    • Personal "About" section
    • Languages spoken
    • Certificates and qualifications
    • Application date and job role
  2. Chat Tab

    • Beautiful chat-bubble interface with color-coded messages
    • Candidate messages (left-aligned, white bubbles)
    • Recruiter messages (right-aligned, gradient bubbles)
    • System messages (centered, highlighted)
    • Timestamps for each message
    • AI-generated chat summary at the top
    • Date separators for conversation organization
  3. Experience Tab

    • Formatted work experience
    • Company names and roles
    • Employment dates and durations
    • Detailed job descriptions

Onboarding Flow

A guided, step-by-step setup process:

  • Welcome Screen: Introduction to the extension
  • Credentials Setup: Secure JobToday login configuration
  • Job ID Configuration: Easy-to-follow instructions for finding your Job ID
  • Optional Integrations: Airtable, n8n, and OpenAI setup (all optional)
  • Validation: Real-time testing of credentials and connections
  • Summary: Overview of your configuration before completion

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Browser Extension (Chrome)                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚   Dashboard  β”‚  β”‚ Candidate    β”‚  β”‚   Settings   β”‚ β”‚
β”‚  β”‚    View      β”‚  β”‚ Detail View  β”‚  β”‚ & Onboarding β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        ↕️ HTTP REST API
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Flask Backend API                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  API Routes  β”‚  β”‚   Scraper    β”‚  β”‚   Config     β”‚ β”‚
β”‚  β”‚   /api/*     β”‚  β”‚   Engine     β”‚  β”‚   Storage    β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        ↕️ Playwright
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              JobToday.com                               β”‚
β”‚         (Web Scraping Target)                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        ↕️ Integrations
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Airtable   β”‚  β”‚     n8n      β”‚  β”‚    OpenAI    β”‚
β”‚   Database   β”‚  β”‚   Webhooks   β”‚  β”‚     API      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Installation

Prerequisites

  • Python 3.8+ - Download Python
  • pip - Usually included with Python
  • Chrome Browser - Required for the extension
  • Git - For cloning the repository (optional)

Backend Setup

  1. Clone the repository

    git clone https://github.com/yourusername/jobtoday-scraper.git
    cd jobtoday-scraper
  2. Create a virtual environment

    Windows:

    python -m venv venv
    .\venv\Scripts\activate

    macOS/Linux:

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Install Playwright browsers

    playwright install chromium
  5. Configure environment variables

    Create a .env file in the project root:

    # JobToday Credentials
    JOBTODAY_EMAIL=your-email@example.com
    JOBTODAY_PASSWORD=your-password
    
    # Job ID (optional, can be set via extension)
    JOB_ID=p3j9ox
    
    # Airtable Configuration (optional)
    AIRTABLE_PAT=your-airtable-token
    AIRTABLE_BASE_ID=your-base-id
    AIRTABLE_TABLE_NAME=Candidates
    
    # n8n Webhook URL (optional)
    N8N_WEBHOOK_URL=https://your-n8n-webhook-url
    
    # OpenAI API Key (optional, for chat summaries)
    OPENAI_API_KEY=sk-your-openai-key

Browser Extension Setup

  1. Load the extension in Chrome

    # Navigate to Chrome Extensions page
    # chrome://extensions/
    
    # Steps:
    # 1. Enable "Developer mode" (toggle in top-right corner)
    # 2. Click "Load unpacked" button
    # 3. Select the 'extension' folder from this project
  2. Configure the extension

    • Click the JobToday Scraper icon in your Chrome toolbar
    • Follow the guided 5-step onboarding flow:
      1. Welcome - Get introduced to the extension
      2. Credentials - Enter your JobToday email and password
      3. Job ID - Enter your job posting ID (found in the URL)
      4. Integrations - Optionally set up Airtable, n8n, and OpenAI
      5. Complete - Review your configuration and finish setup

    Note: The backend URL is automatically configured. If running locally, it defaults to http://localhost:5001.

🎬 Quick Start

Starting the Backend Server

macOS/Linux:

# Navigate to project directory
cd jobtoday-scraper

# Activate virtual environment
source venv/bin/activate

# Install dependencies (if not already done)
pip install -r requirements.txt
playwright install chromium

# Start the Flask API server
python scraper_api.py

Windows:

# Navigate to project directory
cd jobtoday-scraper

# Activate virtual environment
.\venv\Scripts\activate

# Install dependencies (if not already done)
pip install -r requirements.txt
playwright install chromium

# Start the Flask API server
python scraper_api.py

The server will start on http://localhost:5001 by default (or the port specified in the PORT environment variable).

Using the Browser Extension

  1. Start the backend server (see above)
  2. Click the extension icon in your Chrome toolbar
  3. Complete onboarding if this is your first time (or click Settings to reconfigure)
  4. View your dashboard - You'll see statistics and recent candidates
  5. Browse all candidates - Click "View All β†’" to see the full candidate list
  6. Search and filter - Use the search bar and filter buttons to find specific candidates
  7. View candidate details - Click any candidate card to see their full profile and chat history
  8. Start scraping - Click the "Start Scraping" button to initiate a new scraping session
  9. Monitor progress - Watch real-time progress updates on the dashboard

🌐 Browser Extension

Features Overview

The browser extension provides a complete, user-friendly interface for managing your candidate scraping workflow:

  • Dashboard: Overview of all candidates with quick statistics
  • Candidate List: Browse, search, and filter all scraped candidates
  • Candidate Detail: Comprehensive profile view with:
    • Contact information and location
    • Professional experience and qualifications
    • Full chat conversation history with AI summary
    • Languages and certificates

Key Benefits

  • βœ… No command-line knowledge required
  • βœ… Visual progress tracking
  • βœ… Instant access to candidate data
  • βœ… Beautiful chat interface
  • βœ… Mobile-friendly design

πŸ“‘ API Documentation

Base URL

http://localhost:5001

Endpoints

Health Check

GET /health

Returns the health status of the API.

Response:

{
  "status": "healthy",
  "service": "JobToday Scraper API",
  "timestamp": "2025-01-08T23:32:10.760000"
}

Get Status

GET /status

Returns the current scraper status and progress.

Response:

{
  "status": "idle",
  "last_run": "2025-01-08T20:00:00",
  "candidates_count": 58,
  "progress": {
    "section": "recommended",
    "candidate": "John Doe",
    "processed": 25,
    "total": 58
  }
}

Trigger Scraping

POST /trigger-scrape
Content-Type: application/json

Initiates a scraping run. Configuration can be provided in the request body; otherwise the stored configuration is used.

Request Body (optional):

{
  "job_id": "p3j9ox",
  "email": "your-email@example.com",
  "password": "your-password",
  "airtable_pat": "pat...",
  "airtable_base_id": "app...",
  "airtable_table_name": "Candidates",
  "n8n_webhook_url": "https://...",
  "openai_api_key": "sk-..."
}

Response:

{
  "status": "started",
  "message": "Scraper started in background",
  "check_status_at": "/status",
  "started_at": "2025-01-08T23:32:10"
}
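
Putting the two endpoints together, a client can trigger a run and then poll /status until the scraper returns to idle. A sketch, assuming the local default port; `is_done` and `run_scrape` are illustrative names:

```python
import time
import requests

BASE_URL = "http://localhost:5001"

def is_done(status: dict) -> bool:
    # /status reports "idle" once a run has finished
    return status.get("status") == "idle"

def run_scrape(poll_seconds: int = 5) -> dict:
    """Trigger a scrape using the stored configuration, then poll until it finishes."""
    resp = requests.post(f"{BASE_URL}/trigger-scrape", json={}, timeout=30)
    resp.raise_for_status()
    time.sleep(poll_seconds)  # give the background run a moment to leave "idle"
    while True:
        status = requests.get(f"{BASE_URL}/status", timeout=30).json()
        if is_done(status):
            return status
        progress = status.get("progress", {})
        print(f"Processed {progress.get('processed', '?')}/{progress.get('total', '?')}")
        time.sleep(poll_seconds)
```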

Get All Candidates

GET /api/candidates?limit=10&sort=date_desc

Retrieves all scraped candidates with optional pagination and sorting.

Query Parameters:

  • limit (optional): Maximum number of candidates to return
  • sort (optional): Sort order (date_desc, name_asc)

Response:

{
  "candidates": [...],
  "total": 58,
  "returned": 10,
  "scraped_at": "2025-01-08T19:56:07",
  "job_id": "p3j9ox"
}

Get Single Candidate

GET /api/candidates/{candidate_id}

Retrieves detailed information for a specific candidate.

Response:

{
  "name": "John Doe",
  "phone": "+1234567890",
  "email": "john@example.com",
  "location": "New York, NY",
  "chat_history": "...",
  "chat_summary": "...",
  ...
}

Configure Settings

POST /api/configure
Content-Type: application/json

Saves configuration settings (called automatically by the extension).

Validate Credentials

POST /api/validate-credentials
Content-Type: application/json

Tests JobToday login credentials without initiating a full scrape.

βš™οΈ Configuration

Environment Variables

All configuration can be set via environment variables or through the browser extension interface.

  • JOBTODAY_EMAIL (required): Your JobToday account email
  • JOBTODAY_PASSWORD (required): Your JobToday account password
  • JOB_ID (required): Job posting ID (e.g., p3j9ox)
  • AIRTABLE_PAT (optional): Airtable Personal Access Token
  • AIRTABLE_BASE_ID (optional): Airtable Base ID
  • AIRTABLE_TABLE_NAME (optional): Airtable table name (default: Candidates)
  • N8N_WEBHOOK_URL (optional): n8n webhook URL for notifications
  • OPENAI_API_KEY (optional): OpenAI API key for chat summaries
  • PORT (optional): Backend server port (default: 5001)
  • BACKEND_URL (optional): Backend URL for production deployments

Extension Configuration

The browser extension stores configuration locally and automatically syncs with the backend API. All settings can be managed through the extension's Settings button.

🚒 Deployment

Local Development

Simply run the Flask server as described in Quick Start.

Production Deployment

The application can be deployed to any platform that supports Python applications. Example deployment options:

Render

  • Connect your GitHub repository
  • Set environment variables in the Render dashboard
  • The app will automatically deploy using gunicorn (configured in render.yaml)

Docker

A Dockerfile is included for containerized deployment:

docker build -t jobtoday-scraper .
docker run -p 5001:5001 --env-file .env jobtoday-scraper

Other Platforms

The application can be deployed to:

  • Heroku
  • AWS Elastic Beanstalk
  • Google Cloud Run
  • Azure App Service
  • Any VPS with Python support

πŸ“ Project Structure

jobtoday-scraper/
β”œβ”€β”€ extension/                 # Browser extension files
β”‚   β”œβ”€β”€ manifest.json         # Extension manifest
β”‚   β”œβ”€β”€ popup.html            # Main popup UI
β”‚   β”œβ”€β”€ popup.js              # Popup logic
β”‚   β”œβ”€β”€ onboarding.html       # Onboarding flow
β”‚   β”œβ”€β”€ onboarding.js         # Onboarding logic
β”‚   β”œβ”€β”€ background.js         # Background service worker
β”‚   β”œβ”€β”€ styles.css            # Extension styling
β”‚   └── icons/                # Extension icons
β”œβ”€β”€ scraper_api.py            # Main Flask API server
β”œβ”€β”€ jobtoday_1.py             # Core scraping logic
β”œβ”€β”€ config_storage.py         # Configuration management
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ Dockerfile                # Docker configuration
β”œβ”€β”€ render.yaml               # Render deployment config
β”œβ”€β”€ install_dependencies.sh   # Installation script (macOS/Linux)
β”œβ”€β”€ install_dependencies.bat  # Installation script (Windows)
└── README.md                 # This file

πŸ”§ Development

Project Structure Explained

  • scraper_api.py: Main Flask API server with all endpoints
  • jobtoday_1.py: Core scraping engine using Playwright
  • config_storage.py: Secure configuration storage with encryption
  • extension/: Browser extension source code
    • popup.html/js: Main extension interface
    • onboarding.html/js: Setup wizard
    • background.js: Service worker for background tasks
    • styles.css: Modern UI styling

Running in Development Mode

# Enable debug mode for auto-reload and detailed error pages
# (FLASK_ENV is deprecated in Flask 2.3+; use FLASK_DEBUG instead)
export FLASK_DEBUG=1
python scraper_api.py

Code Style

The project follows PEP 8 style guidelines. For development, consider using:

  • black - Code formatting
  • flake8 - Linting
  • pylint - Advanced linting

πŸ› Troubleshooting

Common Issues

Port 5000/5001 already in use:

  • On macOS, port 5000 is often used by AirPlay Receiver
  • The app defaults to port 5001 to avoid conflicts
  • You can change it via the PORT environment variable

Playwright not found:

# Make sure you've installed playwright browsers
playwright install chromium

Extension not loading:

  • Ensure "Developer mode" is enabled in Chrome
  • Check the browser console for errors (chrome://extensions/ β†’ Details β†’ Inspect views)
  • Verify all files are in the extension folder

Cannot connect to backend:

  • Verify the Flask server is running (python scraper_api.py)
  • Check the backend URL in extension settings (default: http://localhost:5001)
  • Ensure no firewall is blocking the connection

Candidates not loading:

  • Verify candidates_detailed.json exists in the project root
  • Check that a scraping session has been completed at least once
  • Review backend logs for any errors

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

πŸ™ Acknowledgments

πŸ”’ Security & Privacy

  • Local Storage: All credentials are encrypted and stored locally in the extension
  • No Data Collection: The extension does not collect or transmit any user data to third parties
  • Secure Communication: All API communication uses HTTPS (in production)
  • Session Management: Login sessions are stored locally and never shared

πŸ“Š Data Collected

The scraper collects the following information from JobToday candidate profiles:

  • Personal information (name, phone, email, location)
  • Professional experience and work history
  • Languages and certifications
  • Complete chat conversation history
  • Application dates and job role information

All data is stored locally and can be exported to JSON/CSV format. When configured, data is also synced to your Airtable database.

πŸ› οΈ Built With

  • Python 3.8+ - Programming language
  • Flask - Web framework
  • Playwright - Browser automation
  • Chrome Extension API - Browser extension platform
  • OpenAI API - AI chat summarization
  • Airtable API - Database integration
  • n8n - Workflow automation

πŸ“ˆ Roadmap

Future enhancements may include:

  • Support for multiple job postings
  • Advanced filtering and sorting options
  • Export to Excel format
  • Dark mode theme
  • Candidate comparison view
  • Automated scheduling of scraping tasks
  • Email notifications
  • Mobile app companion

🀝 Other Ways to Contribute

Beyond pull requests, you can help by:

  1. Reporting Bugs: Open an issue with detailed information about the bug
  2. Suggesting Features: Share your ideas for new features
  3. Improving Documentation: Make the docs better for everyone
  4. Sharing Feedback: Let us know how we can improve

Please ensure your code follows the existing style and includes appropriate tests.

πŸ“§ Support & Contact

  • GitHub Issues: Open an issue
  • Questions: Check existing issues or open a new one
  • Feature Requests: We'd love to hear your ideas!

⭐ Show Your Support

If you find this project useful, please:

  • ⭐ Star the repository on GitHub
  • πŸ› Report bugs to help improve the project
  • πŸ’‘ Suggest features to make it even better
  • πŸ“’ Share with others who might benefit from it

Made with ❀️ for recruiters and HR professionals

⭐ Star on GitHub β€’ πŸ› Report Bug β€’ πŸ’‘ Request Feature
