AI-powered tech job aggregation platform. Collects jobs from 6 sources, enriches them with Gemini AI into a 40-field schema, and serves them via a fast REST API with full-text search.
- 6 Data Sources: RemoteOK, JSearch, Adzuna (7 countries), HackerNews "Who is Hiring?", RSS feeds (WeWorkRemotely + RemoteOK), ATS scraper (Greenhouse, Lever, Ashby, Workable, SmartRecruiters)
- AI Enrichment: Gemini 2.5 Flash-Lite processes raw jobs into a structured 40-field schema (batch of 5, 10 concurrent)
- Full-Text Search: PostgreSQL tsvector + GIN index with weighted fields and relevance ranking (~35ms)
- Save-Per-Batch: Each batch of 5 jobs is saved to DB immediately after AI processing
- Automatic Deduplication: Title + company hash prevents duplicates
- Age Filtering: Jobs older than 15 days are dropped during ingestion
- Company Discovery: SerpAPI-powered discovery of companies on ATS platforms
- REST API: FastAPI with filtering, pagination, and full-text search
- Scheduled Fetching: APScheduler runs ingestion every 30 minutes (configurable)
- Docker Support: One-command deployment with docker-compose
- Quick Start
- Installation
- Configuration
- Usage
- API Documentation
- Architecture
- Data Schema
- Development
- Troubleshooting
- Docker & Docker Compose
- Python 3.12+ (for local development)
- API Keys:
  - Gemini API Key (AI enrichment)
  - Adzuna API (App ID + API Key)
  - RapidAPI Key (for JSearch)
  - SerpAPI Key (for ATS company discovery)
```bash
git clone <your-repo-url>
cd jobs.ai
cp .env.example .env
# Edit .env with your API keys
docker-compose up -d
```

- API: http://localhost:8000
- Docs: http://localhost:8000/docs
- Health: http://localhost:8000/api/health
```bash
docker-compose up -d     # Start all services
docker-compose logs -f   # View logs
docker-compose down      # Stop services
```

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Start the server (with scheduler)
python -m src.main

# Start API only (no auto-ingestion)
DISABLE_SCHEDULER=1 python -m src.main
```

Create a `.env` file based on `.env.example`:
```bash
# Database
DATABASE_URL=postgres://user:pass@host:port/db?sslmode=require&ssl_no_verify=true

# API Keys
RAPIDAPI_KEY=your_rapidapi_key
ADZUNA_APP_ID=your_adzuna_app_id
ADZUNA_API_KEY=your_adzuna_api_key
GEMINI_API_KEY=your_gemini_api_key
SERPAPI_KEY=your_serpapi_key

# AI Enrichment
ENABLE_AI_ENRICHMENT=true

# App
ENVIRONMENT=development
LOG_LEVEL=INFO
API_PORT=8000
INGESTION_INTERVAL_MINUTES=30
```

| Flag | Purpose |
|---|---|
| `DISABLE_SCHEDULER=1` | Start the API only, with no auto-ingestion |
| `ENABLE_AI_ENRICHMENT=false` | Use the rule-based fallback instead of Gemini |
```bash
python -m src.main
```

The server will start on http://0.0.0.0:8000, run an initial ingestion, and schedule fetching every 30 minutes.

API only (no scheduler):

```bash
DISABLE_SCHEDULER=1 python -m src.main
```

Trigger a manual ingestion:

```bash
curl -X POST http://localhost:8000/api/jobs/ingest
```

GET /api/health
GET /api/jobs?limit=50&offset=0&search=kubernetes&remote_only=true&category=devops
Query Parameters:

| Parameter | Type | Description |
|---|---|---|
| `limit` | int | Results per page (1-200, default 50) |
| `offset` | int | Pagination offset (default 0) |
| `search` | string | Full-text search across title, company, skills, description (min 2 chars) |
| `source` | string[] | Filter by source: remoteok, jsearch, adzuna, hackernews, rss_feed, ats_scraper |
| `employment_type` | string | FULLTIME, PARTTIME, CONTRACT, INTERN |
| `remote_only` | bool | Filter remote jobs only |
| `seniority` | string[] | junior, mid, senior, staff, principal |
| `category` | string[] | backend, frontend, fullstack, devops, data, ml, mobile, security, qa, general |
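For example, the listing endpoint above can be queried from Python roughly like this (a minimal sketch using `requests`; the base URL assumes a local instance on port 8000):

```python
import requests

# Remote senior backend jobs mentioning Kubernetes (local instance assumed)
resp = requests.get(
    "http://localhost:8000/api/jobs",
    params={
        "search": "kubernetes",
        "remote_only": "true",
        "category": "backend",
        "seniority": "senior",
        "limit": 20,
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print(f"{data['total']} matching jobs")
for job in data["jobs"]:
    print(f"{job['title']} @ {job['company']} ({job['source']})")
```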
Response:
```json
{
"total": 313,
"jobs": [
{
"id": "adzuna_5609947686",
"source": "adzuna",
"source_id": "5609947686",
"source_url": "https://...",
"title": "Senior Backend Engineer",
"company": "TechCorp",
"company_logo": null,
"company_website": "https://techcorp.com",
"description": "Full HTML description...",
"short_description": "AI-generated 2-3 sentence summary.",
"location": { "city": "Berlin", "country": "de", "remote": true },
"country": "de",
"city": "Berlin",
"state": null,
"is_remote": true,
"work_arrangement": "remote",
"employment_type": "FULLTIME",
"seniority_level": "senior",
"department": "Engineering",
"category": "backend",
"salary_min": "90000",
"salary_max": "130000",
"salary_currency": "EUR",
"salary_period": "year",
"skills": ["Python", "Kubernetes", "PostgreSQL"],
"required_experience_years": 5,
"required_education": "Bachelor's",
"key_responsibilities": ["Design microservices", "..."],
"nice_to_have_skills": ["Go", "Terraform"],
"benefits": ["Remote work", "Stock options"],
"visa_sponsorship": "yes",
"posted_at": "2026-02-07T10:30:00Z",
"application_deadline": null,
"fetched_at": "2026-02-07T12:00:00Z",
"apply_url": "https://...",
"apply_options": null,
"tags": ["IT Jobs"],
"quality_score": 85
}
]
}
```

GET /api/jobs/{job_id}
GET /api/filters
Returns available values with counts for source, category, seniority, employment type, etc.
POST /api/jobs/ingest
Visit http://localhost:8000/docs for Swagger UI.
```
      ┌───────────────────────────────────────────────────────────┐
      │                  FastAPI Server (:8000)                   │
      │          /api/jobs   /api/filters   /api/health           │
      └─────────────────────────────┬─────────────────────────────┘
                                    │
                      ┌─────────────▼─────────────┐
                      │     Ingestion Service     │
                      │       (APScheduler)       │
                      └─────────────┬─────────────┘
                                    │ asyncio.gather (parallel)
      ┌──────────────┬──────────────┼──────────────┬──────────────┬───────────────┐
      │              │              │              │              │               │
┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌──────▼───────┐
│  RemoteOK  │ │  JSearch   │ │   Adzuna   │ │ HackerNews │ │    RSS     │ │ ATS Scraper  │
│    API     │ │  RapidAPI  │ │  7 ctries  │ │   Thread   │ │   Feeds    │ │ 5 platforms  │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └──────┬───────┘
      │              │              │              │              │               │
      └──────────────┴──────────────┼──────────────┴──────────────┴───────────────┘
                                    │ raw jobs
                      ┌─────────────▼─────────────┐
                      │    Enrichment Pipeline    │
                      │   Gemini 2.5 Flash-Lite   │
                      │ (5/batch, 10 concurrent)  │
                      └─────────────┬─────────────┘
                                    │ structured 40-field jobs
                      ┌─────────────▼─────────────┐
                      │    Save Per Batch (DB)    │
                      │    dedup → insert/skip    │
                      └─────────────┬─────────────┘
                                    │
                      ┌─────────────▼─────────────┐
                      │        PostgreSQL         │
                      │   tsvector + GIN index    │
                      │     full-text search      │
                      └───────────────────────────┘
```
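The Ingestion Service box above is driven by APScheduler. A rough sketch of that wiring (function names here are illustrative; the actual code lives in `src/services/ingestion.py` and `src/main.py`):

```python
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def run_ingestion() -> None:
    # Stand-in for the real orchestration: fetch from all sources in parallel,
    # enrich with Gemini, and save each batch to PostgreSQL.
    ...


async def main() -> None:
    scheduler = AsyncIOScheduler()
    # INGESTION_INTERVAL_MINUTES in .env controls this interval (default 30)
    scheduler.add_job(run_ingestion, "interval", minutes=30)
    scheduler.start()

    await run_ingestion()          # initial ingestion on startup
    await asyncio.Event().wait()   # keep running (the real app serves FastAPI here)


asyncio.run(main())
```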
| Directory | Purpose |
|---|---|
| `src/agents/` | 6 fetcher agents, each pulling raw jobs from an external source |
| `src/enrichment/` | AI pipeline: Gemini processes raw data into the 40-field schema |
| `src/services/` | Ingestion orchestration, company discovery, scheduling |
| `src/api/` | FastAPI routes, Pydantic schemas |
| `src/database/` | SQLAlchemy models, CRUD operations, full-text search |
| `src/utils/` | Config, logging |
| `scripts/` | Migration & utility scripts |
```
Raw API data → Gemini 2.5 Flash-Lite (JSON mode, temperature=0)
             → 40-field structured extraction
             → quality scoring
             → title_company_hash dedup
             → save to PostgreSQL
```
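The `title_company_hash` dedup step can be pictured as something like the following (an illustrative sketch; the actual normalization and hash function live in the enrichment/database code and may differ):

```python
import hashlib


def title_company_hash(title: str, company: str) -> str:
    """Normalize title + company so a re-fetched posting maps to the same key."""
    key = f"{title.strip().lower()}|{company.strip().lower()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same posting seen twice (with cosmetic differences) hashes identically,
# so the second insert is skipped.
assert title_company_hash("Senior Backend Engineer ", "TechCorp") == \
       title_company_hash("senior backend engineer", "techcorp")
```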
- Batch size: 5 jobs per Gemini API call
- Concurrency: 10 parallel batch calls (sketched below)
- Fallback: Rule-based extraction if AI is disabled or fails
- Age filter: Jobs older than 15 days are dropped
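The batching behaviour above can be sketched roughly as follows (names are illustrative; the real implementation is in `src/enrichment/enrichment_pipeline.py`):

```python
import asyncio

BATCH_SIZE = 5        # jobs per Gemini call
MAX_CONCURRENCY = 10  # parallel batch calls


async def enrich_all(raw_jobs: list[dict], enrich_batch, save_batch) -> None:
    """Split raw jobs into batches of 5, enrich up to 10 batches at once,
    and save each batch as soon as it finishes (save-per-batch)."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def process(batch: list[dict]) -> None:
        async with sem:
            enriched = await enrich_batch(batch)  # one Gemini call per batch
        await save_batch(enriched)                # write to DB immediately

    batches = [raw_jobs[i:i + BATCH_SIZE] for i in range(0, len(raw_jobs), BATCH_SIZE)]
    await asyncio.gather(*(process(b) for b in batches))
```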
Each job has 41 API fields (42 DB columns including internal search_vector):
| Group | Fields |
|---|---|
| Identity | id, source, source_id, source_url |
| Core | title, company, company_logo, company_website, description, short_description |
| Location | location, country, city, state, is_remote, work_arrangement, latitude, longitude |
| Employment | employment_type, seniority_level, department, category |
| Compensation | salary_min, salary_max, salary_currency, salary_period |
| Skills | skills, required_experience_years, required_education, key_responsibilities, nice_to_have_skills |
| Benefits | benefits, visa_sponsorship |
| Dates | posted_at, application_deadline, fetched_at |
| Apply | apply_url, apply_options |
| Meta | tags, quality_score |
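For consumers of the API, a trimmed-down model of this schema might look like the sketch below (representative fields only; the authoritative response models live in `src/api/schemas.py`):

```python
from datetime import datetime

from pydantic import BaseModel


class JobSummary(BaseModel):
    """A few representative fields out of the 41-field job schema."""
    id: str
    source: str
    title: str
    company: str
    is_remote: bool
    category: str | None = None
    seniority_level: str | None = None
    salary_min: str | None = None
    salary_max: str | None = None
    skills: list[str] = []
    posted_at: datetime | None = None
    quality_score: int | None = None
```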
Indexes:

- `source` + `source_id` (unique) → deduplication
- `posted_at` → sort by recency
- `category`, `is_remote`, `seniority_level` → filter queries
- `skills` (GIN) → array containment queries
- `search_vector` (GIN) → full-text search (see the SQL sketch below)
- `title`, `company` → direct lookups
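The full-text setup can be pictured with the SQL below (illustrative only; the actual column, weights, and indexes are defined in the SQLAlchemy models and migration scripts, and the production vector also covers skills):

```sql
-- Weighted search vector over title, company, and description
ALTER TABLE jobs ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')),       'A') ||
        setweight(to_tsvector('english', coalesce(company, '')),     'B') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'C')
    ) STORED;

CREATE INDEX idx_jobs_search_vector ON jobs USING GIN (search_vector);

-- Roughly what GET /api/jobs?search=kubernetes runs under the hood
SELECT id, title, company,
       ts_rank(search_vector, websearch_to_tsquery('english', 'kubernetes')) AS rank
FROM jobs
WHERE search_vector @@ websearch_to_tsquery('english', 'kubernetes')
ORDER BY rank DESC
LIMIT 50;
```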
```
jobs.ai/
├── src/
│   ├── agents/                    # Job fetcher agents
│   │   ├── __init__.py            # BaseFetcher ABC
│   │   ├── remoteok.py            # RemoteOK API
│   │   ├── jsearch.py             # JSearch via RapidAPI
│   │   ├── adzuna.py              # Adzuna API (7 countries)
│   │   ├── hackernews.py          # HN "Who is Hiring?" scraper
│   │   ├── rss_feed.py            # RSS feeds (WWR, RemoteOK)
│   │   └── ats_scraper.py         # ATS platforms (5 APIs)
│   ├── enrichment/                # AI processing layer
│   │   ├── ai_processor.py        # Gemini integration
│   │   ├── enrichment_pipeline.py # Batch processing + fallback
│   │   ├── skills_extractor.py    # Rule-based skill extraction
│   │   └── quality_scorer.py      # Job quality scoring
│   ├── api/                       # FastAPI application
│   │   ├── main.py                # App factory + CORS
│   │   ├── routes.py              # Endpoint definitions
│   │   └── schemas.py             # Pydantic response models
│   ├── database/                  # Data layer
│   │   ├── models.py              # SQLAlchemy models (42 columns)
│   │   └── operations.py          # CRUD + full-text search
│   ├── services/                  # Business logic
│   │   ├── ingestion.py           # Orchestrator + scheduler
│   │   └── company_discovery.py   # SerpAPI company finder
│   ├── utils/                     # Utilities
│   │   ├── config.py              # Settings (pydantic-settings)
│   │   └── logger.py              # Logging setup
│   └── main.py                    # Application entrypoint
├── scripts/                       # Migration & test scripts
├── tests/                         # Test suite
├── .env.example                   # Example configuration
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Container definition
├── docker-compose.yml             # Multi-service orchestration
└── README.md
```
- Create a fetcher in `src/agents/`:

```python
from src.agents import BaseFetcher


class NewSourceFetcher(BaseFetcher):
    def __init__(self):
        super().__init__("newsource")

    async def fetch_jobs(self):
        # Return a list of raw dicts; no normalization needed.
        # The AI pipeline handles all field extraction.
        return [{"title": "...", "description": "...", ...}]
```

- Register it in `src/services/ingestion.py`:

```python
from src.agents.newsource import NewSourceFetcher

FETCHER_CLASSES = [
    ...,
    NewSourceFetcher,  # Add here
]
```

That's it; the enrichment pipeline and DB layer handle everything else.
| Problem | Solution |
|---|---|
| `asyncpg.exceptions.InvalidCatalogNameError` | `docker-compose down -v && docker-compose up -d postgres` |
| `ssl.SSLCertVerificationError` | Add `ssl_no_verify=true` to `DATABASE_URL` |
| `429 Too Many Requests` | Increase `INGESTION_INTERVAL_MINUTES` in `.env` |
| `FutureWarning: google.generativeai` | Non-blocking; migration to `google.genai` is planned |
| Server returns 500 on search | Check for malformed location data in the DB |
```python
import asyncio

from src.agents.remoteok import RemoteOKFetcher


async def test():
    fetcher = RemoteOKFetcher()
    jobs = await fetcher.fetch_jobs()
    print(f"Found {len(jobs)} jobs")


asyncio.run(test())
```

- Sources: 6 (RemoteOK, JSearch, Adzuna, HackerNews, RSS, ATS Scraper)
- ATS Platforms: 5 (Greenhouse, Lever, Ashby, Workable, SmartRecruiters)
- Adzuna Countries: 7 (US, GB, CA, AU, DE, FR, NL)
- Jobs per Run: ~10,000+
- Job Schema: 41 API fields, AI-extracted
- Search Speed: ~35ms (PostgreSQL full-text, GIN indexed)
- Processing: 100% success rate (12,182/12,182 in last full run)
- Fetch Interval: 30 minutes (configurable)
MIT License β see LICENSE file for details.
Built with Python, FastAPI, PostgreSQL, and Gemini AI