A CLI tool and Jupyter notebook for scraping, filtering, deduplicating, and exporting job listings from RemoteOK. This package enables you to automate the collection of remote tech‑job data, transform it into a structured format, and visualize key metrics—all with a few simple commands or notebook cells.
- Fetch live postings from RemoteOK’s public JSON API
- Filter roles by keyword list (e.g., Python, Data, SQL)
- Deduplicate entries based on job URL, appending new listings to a master file (see the sketch after this list)
- Export results in CSV, Parquet, or JSON formats
- Visualize posting trends over time via bar charts
- CLI entry point: `scrape-jobs` script for automated runs or cron jobs
- Notebook: interactive exploration with pandas & matplotlib
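The dedup-and-append step boils down to a drop-duplicates on the job URL. Here is a minimal sketch, assuming pandas and a `url` column; the actual logic lives in `remoteok_scraper/scraper.py` and may differ in detail:

```python
# Minimal sketch of URL-based dedup-and-append; assumes pandas and a
# 'url' column. The real logic lives in remoteok_scraper/scraper.py.
from pathlib import Path

import pandas as pd

def append_dedup(new_df, master_path="data/all_jobs_master.csv"):
    """Merge new listings into the master file, dropping repeated job URLs."""
    path = Path(master_path)
    if path.exists():
        combined = pd.concat([pd.read_csv(path), new_df], ignore_index=True)
    else:
        combined = new_df
    # Keep the first occurrence of each job URL so earlier runs win
    combined = combined.drop_duplicates(subset="url", keep="first")
    path.parent.mkdir(parents=True, exist_ok=True)
    combined.to_csv(path, index=False)
    return combined
```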
Ensure you have:
- Python 3.8 or higher
- `pip` package manager
- (Optional) A virtual environment tool such as `venv`
1. Clone the repository:

   ```bash
   git clone https://github.com/<your-username>/python-jobs-scraper.git
   cd python-jobs-scraper
   ```
2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv

   # Windows PowerShell
   .\.venv\Scripts\Activate.ps1

   # macOS/Linux
   source .venv/bin/activate
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Install the package in editable mode:

   ```bash
   pip install -e .
   ```

   Now you have access to the `scrape-jobs` console script.
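To confirm the entry point landed on your `PATH`, ask it for help (assuming the CLI is built on argparse, which provides `--help` automatically):

```bash
scrape-jobs --help
```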
Fetch, filter, and export in one step:
```bash
# Basic run with defaults:
# keywords = [Power BI, Python], format = csv
scrape-jobs

# Custom keywords, JSON output, no plot:
scrape-jobs \
  --keywords Python Data SQL JavaScript AWS Docker \
  --format json \
  --out data \
  --prefix all_jobs \
  --no-plot
```

Arguments:

- `-k`, `--keywords`: List of keywords to filter positions by.
- `-o`, `--out`: Output directory for result files.
- `-p`, `--prefix`: Filename prefix (the date is appended).
- `--format`: `csv`, `parquet`, or `json`.
- `--no-plot`: Skip chart display (useful for automated runs).
Results:

- `all_jobs_YYYYMMDD.csv`: the current run, stamped with the date
- `all_jobs_master.csv`: accumulates all previous runs
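The exported files are plain tabular data, so they can be pulled straight back into pandas for downstream analysis (paths assume the default `data` output directory and the `position` column used in the notebook):

```python
import pandas as pd

# Load the accumulated master file and peek at the most common titles
master = pd.read_csv("data/all_jobs_master.csv")
print(master["position"].value_counts().head())
```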
Open `scraper.ipynb` in Jupyter or VS Code. Key cells:
1. Fetch & parse data (a plausible sketch of these helpers follows this list):

   ```python
   from scraper import fetch_data, parse_data

   jobs_raw = fetch_data()
   df = parse_data(jobs_raw)
   ```
2. Filter:

   ```python
   from scraper import filter_data

   keywords = ["Python", "Data", "SQL"]
   filtered = filter_data(df, keywords)
   filtered.head()
   ```
3. Visualize:

   ```python
   import matplotlib.pyplot as plt

   counts = filtered['position'].value_counts()
   plt.figure(figsize=(10, 4))
   counts.plot(kind='bar')
   plt.title('Open Positions by Title')
   plt.show()
   ```
4. Save to CSV:

   ```python
   from scraper import save_data

   save_data(filtered, out_dir='data', prefix='filtered_jobs', fmt='csv')
   ```
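For orientation, here is a plausible sketch of what `fetch_data` and `parse_data` might look like internally. It assumes the requests library and the column names used above; the real implementations live in `remoteok_scraper/scraper.py` and may differ:

```python
# Plausible sketch of the fetch/parse helpers; not the package's exact code.
import requests
import pandas as pd

API_URL = "https://remoteok.com/api"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; YourNameBot/1.0)"}

def fetch_data(url=API_URL):
    """Download the raw JSON payload from the RemoteOK API."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def parse_data(jobs_raw):
    """Flatten the payload into a DataFrame, skipping non-job entries."""
    # The API may prepend a legal/metadata object; keep only elements
    # that look like job postings.
    records = [j for j in jobs_raw if isinstance(j, dict) and "position" in j]
    df = pd.DataFrame(records)
    wanted = ["position", "company", "date", "url", "tags"]
    return df[[c for c in wanted if c in df.columns]]
```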
Edit `config.yml` to customize defaults:
```yaml
# config.yml
url: https://remoteok.com/api
headers:
  User-Agent: Mozilla/5.0 (compatible; YourNameBot/1.0)
keywords:
  - Power BI
  - Python
out: data
prefix: jobs
```

- `url`: API endpoint
- `headers`: HTTP headers for requests
- `keywords`: Default keyword filters
- `out`: Default output folder
- `prefix`: Default file name prefix
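Loading these defaults is a one-liner with PyYAML (assumed to be among the dependencies; the package's actual loader may differ):

```python
from pathlib import Path

import yaml  # PyYAML; assumed to be in requirements.txt

def load_config(path="config.yml"):
    """Read the default url, headers, keywords, out, and prefix values."""
    with Path(path).open(encoding="utf-8") as fh:
        return yaml.safe_load(fh)

config = load_config()
print(config["keywords"])  # e.g. ['Power BI', 'Python']
```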
```text
python-jobs-scraper/
├── LICENSE
├── README.md
├── config.yml
├── pyproject.toml
├── setup.py
├── requirements.txt
├── scraper.ipynb
├── remoteok_scraper/
│   ├── __init__.py
│   └── scraper.py
└── data/               # output folder created at runtime
```
- Add fields: modify `parse_data` to include additional JSON keys (e.g., `salary`).
- Advanced filters: extend `filter_data` to filter by tags, date ranges, or location.
- Scheduling: wrap `scrape-jobs` in a cron or Windows Task Scheduler job for daily updates (see the crontab sketch after this list).
- Web dashboard: feed the CSV/Parquet output into a BI tool or build a Streamlit app.
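For example, a crontab entry for a daily 08:00 run might look like this (paths are illustrative; adjust them to wherever your repo and virtual environment live):

```bash
# m h dom mon dow  command
0 8 * * * cd $HOME/python-jobs-scraper && .venv/bin/scrape-jobs --no-plot >> data/scrape.log 2>&1
```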
This project is licensed under the MIT License. Feel free to use and adapt!
Happy scraping!