This document outlines the product requirements for the AI Job Scraper, a local-first, privacy-focused Python application designed to automate the scraping, filtering, and management of AI/ML job postings. It serves as the single source of truth for development, aligning business goals with technical implementation.
-
In Scope:
- Scraping from major job boards (LinkedIn, Indeed, Glassdoor) and configurable company career pages
- 2-tier scraping strategy using JobSpy (90% structured sources) and ScrapeGraphAI (10% unstructured sources)
- Local AI processing using Qwen/Qwen3-4B-Instruct-2507-FP8 with vLLM inference
- Content-aware database synchronization with SQLite + SQLModel foundation
- SQLite FTS5 full-text search with porter stemming: 5-300ms response scaling from 1K-500K records
- Rich, interactive Streamlit UI with modern card-based job browsing and filtering
- Component-based, modular architecture for maintainability and extensibility
- Background task management using Python threading.Thread with real-time progress tracking
- Performance-triggered analytics scaling from SQLite to DuckDB (p95 >500ms threshold)
- Production-ready deployment via Docker with GPU support for RTX 4090
-
Out of Scope:
- Multi-user or enterprise deployment patterns
- Complex vector search or semantic similarity (over-engineered for personal use)
- Real-time push notifications or external integrations
- Custom UI frameworks beyond Streamlit native components
- Production hosting or SaaS deployment
-
Alex, the Job Seeker (Primary): A mid-level AI engineer who is actively or passively looking for new opportunities. Alex is tech-savvy but time-constrained and needs an efficient way to aggregate relevant job postings without manually checking dozens of sites.
-
Sam, the Power User/Developer (Secondary): An open-source contributor who wants to customize the tool, add new scraping sources, or integrate it into a larger workflow. Sam values clean, modular code and comprehensive documentation.
-
FR-SCR-01: 2-Tier Scraping Strategy: The system must use JobSpy for 90% coverage of structured job boards (LinkedIn, Indeed, Glassdoor) with native proxy support and ScrapeGraphAI for 10% coverage of unstructured company career pages with AI-powered extraction.
-
FR-SCR-02: Background Execution: All scraping operations must run via threading.Thread with st.status for real-time progress display, allowing the user to continue interacting with the UI.
-
FR-SCR-03: Real-Time Progress: The UI must display real-time progress using st.rerun() + session_state for non-blocking updates, including overall progress, per-company status, and jobs found.
-
FR-SCR-04: Structured Extraction: AI-powered extraction must use LiteLLM + Instructor for structured outputs with reliable JSON parsing, eliminating custom parsing logic.
-
FR-SCR-05: Bot Evasion: The system must employ bot evasion strategies, including IPRoyal residential proxy rotation, user-agent randomization, and respectful rate limiting.
-
FR-DB-01: SQLModel Foundation: The database must use SQLModel + SQLite with type-safe operations and automatic relationship handling for personal scale (tested capacity 500K+ jobs).
-
FR-DB-02: Database Synchronization: The system must use content hash-based synchronization to preserve user data during updates.
- FR-DB-02a (Change Detection): Use content hash of core job fields (title, company, description) to detect changes.
- FR-DB-02b (User Data Preservation): Application status, favorites, and notes must be preserved across job updates.
- FR-DB-02c (Conflict Resolution): Automatic merge of updated job content with existing user annotations.
-
FR-DB-03: Analytics Scaling: The system must provide SQLite foundation with automatic DuckDB sqlite_scanner activation when p95 query latency exceeds 500ms threshold.
-
FR-UI-01: Component-Based UI: The UI must be built with a modular, component-based architecture using Streamlit native components.
-
FR-UI-02: Job Browser: The primary interface must be a responsive, card-based grid of job postings with mobile-first design.
-
FR-UI-03: Advanced Filtering: Users must be able to filter jobs by text search (SQLite FTS5 with porter stemming), company, application status, salary range, and date posted with <10ms response time.
-
FR-UI-04: Application Tracking: Users must be able to set and update the status of their job applications (
New,Interested,Applied,Rejected) with visual status indicators. -
FR-UI-05: Job Details View: Users must be able to view full job details and add personal notes using expandable interface components.
-
FR-UI-06: Real-time Updates: The UI must provide non-blocking background task progress using threading with st.status components and st.rerun() for session state updates.
-
FR-UI-07: High-Performance Caching: UI filter and search operations must complete in <100ms via st.cache_data (Streamlit native caching).
-
FR-UI-08: Analytics Dashboard: The system must provide analytics with automatic method selection between SQLite (baseline) and DuckDB (high-performance) based on p95 query latency thresholds.
-
FR-UI-09: Company Management: A dedicated UI must exist for users to add, view, and activate/deactivate companies for scraping.
-
FR-UI-10: Settings Management: A settings page must allow users to manage API keys, configure AI processing thresholds, proxy usage, and database optimization settings.
-
FR-ANALYTICS-01: Automatic Method Selection: The system must automatically select between SQLite and DuckDB analytics methods based on performance triggers (p95 latency >500ms).
-
FR-ANALYTICS-02: Streamlit Caching: Performance tracking must use Streamlit native caching (st.cache_data) for optimized operations with minimal overhead.
-
FR-ANALYTICS-03: Cost Control Integration: Real-time cost tracking must monitor the $50 monthly budget with automated alerts at 80% and 100% utilization.
-
FR-ANALYTICS-04: Visual Performance Indicators: The analytics dashboard must display current method selection, performance metrics, and cost utilization with interactive visualizations.
-
FR-CARDS-01: Mobile-First Design: Job cards must automatically adapt to screen size with responsive grid layouts (desktop: 3 columns, mobile: 1 column).
-
FR-CARDS-02: Status Visual Indicators: Cards must display color-coded status indicators with icons for application progress (new, interested, applied, rejected).
-
FR-CARDS-03: Interactive Actions: Each card must provide quick actions for status updates, apply links, and additional options menu.
-
FR-CARDS-04: Enhanced Information Display: Cards must present job information with clear hierarchy: company/title header, salary/location body, status/actions footer.
-
NFR-PERF-01: Search Performance: SQLite FTS5 search operations must complete in <10ms for 1,000 jobs with porter stemming and BM25 relevance ranking.
-
NFR-PERF-02: AI Processing: The system must route content <8K tokens to local Qwen3-4B processing and ≥8K tokens to GPT-4o-mini cloud fallback based on tiktoken measurement.
-
NFR-PERF-03: UI Responsiveness: Job card rendering must complete in <200ms for 50+ jobs with full content using Streamlit native caching and optimization.
-
NFR-SCALE-01: Personal Scale Architecture: The application must be architected for single user with tested capacity of 500K jobs (1.3GB database), with performance-based scaling triggers.
-
NFR-COST-01: Cost Optimization: Monthly operational costs must remain under $50 budget ceiling with actual target of $25-30 through local/cloud processing allocation.
-
NFR-MAINT-01: Maintenance Requirements: The system must operate with monthly dependency updates and quarterly library maintenance through library-first architecture.
-
NFR-PRIV-01: Local-First Privacy: All job data must be processed and stored locally with only processing-level API usage for cloud AI fallback, no data retention by external services.
-
NFR-LIB-01: Library-First Implementation: The codebase must leverage modern Python libraries and frameworks to minimize custom code and maintenance overhead while maximizing reliability and performance.
- AI Processing: LiteLLM unified client + Instructor structured outputs + vLLM inference server
- Local AI: Qwen/Qwen3-4B-Instruct-2507-FP8 with FP8 quantization on RTX 4090
- Token Routing: 8K context window threshold measured via tiktoken for local/cloud processing routing
- Cloud AI: GPT-4o-mini for complex tasks and fallback scenarios
- Database: SQLModel + SQLite with WAL mode and type-safe operations
- Search: SQLite FTS5 with porter stemming and BM25 relevance ranking via sqlite-utils (ADR-018)
- Analytics: Automatic method selection - SQLite baseline with DuckDB sqlite_scanner scaling (ADR-019)
- Performance Optimization: Streamlit native caching for streamlined performance tracking
- Cost Control: Real-time $50 budget monitoring with automated service-level cost tracking
- Synchronization: Database sync with content hash detection and user data preservation
- 2-Tier Strategy: JobSpy (90% structured job boards) + ScrapeGraphAI (10% company pages)
- Proxy Integration: IPRoyal residential proxies with JobSpy native compatibility
- HTTP Resilience: Native HTTPX transport retries (eliminates custom retry logic)
- Background Tasks: Python threading.Thread with Streamlit st.status integration
- Framework: Streamlit with native fragments, column configuration, and auto-refresh
- Modern UI: Card-based job browser with mobile-first responsive design (ADR-021)
- Visual Design: Status-coded cards for improved information scanning compared to table format
- Interactive Elements: Real-time status updates, quick actions, and hover effects
- Search Integration: Real-time FTS5 search with <10ms response time via sqlite-utils
- Analytics Dashboard: Analytics visualization with method selection indicators
- State Management: Native st.session_state with optimistic UI feedback
- GPU: RTX 4090 Laptop GPU with 16GB VRAM for local AI processing
- Software: CUDA >=12.1, Python 3.12+
- Storage: 2GB free disk space for models and database
- Network: Internet connection for proxy services and cloud AI fallback
- Package Management: uv for Python dependency management
- Code Quality: ruff for linting and formatting
- Testing: pytest with library-first testing patterns
- Containerization: Docker + docker-compose for consistent development environment
- Timeline: 7-day implementation with 4-phase deployment strategy
- Search: <10ms FTS5 queries with porter stemming and BM25 ranking via sqlite-utils (ADR-018)
- Analytics: Automatic method selection triggers at p95 >500ms for DuckDB sqlite_scanner activation (ADR-019)
- AI Processing: <2s local vLLM inference with FP8 optimization, 98% local processing rate (ADR-011/012)
- Performance Optimization: Streamlit native caching provides streamlined operations vs custom monitoring (ADR-019)
- UI Rendering: Card-based interface with improved scanning efficiency, <200ms rendering for 50+ jobs (ADR-021)
- GPU Utilization: 90% efficiency with RTX 4090 FP8 quantization and continuous batching
- Cost Control: $50 monthly budget with real-time monitoring and automated alerts (ADR-019)
- Success Rate: JobSpy extraction reliability with native proxy integration and resilient HTTP transport
- Library Integration: Significant code reduction through library utilization (ADR-001)