A small, deliberately conservative crawler that pulls intern listings from
LinkedIn's Shanghai job search and writes them in the same 20-column schema
as internship_finding/official_jobs_raw.csv.
Read first if changing anything: ~/.claude/plans/abundant-foraging-kahan.md.
The crawler:
- attaches to a manually-launched Chrome via CDP (real browser, persisted profile, visible window),
- listens to the Voyager JSON responses the page already loads — never replays requests,
- scrolls with random jitter and read-pauses to mimic a human,
- has hard caps (per-session and minimum cool-down) persisted in
output/budget_state.json,
- aborts on any auth/challenge redirect.
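The attach-and-listen approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in linkedin_crawler.py: the function names `is_voyager_response` and `capture_voyager` are hypothetical, the `/voyager/api/` path prefix is taken from this README, and the fixed wait stands in for the real scroll loop.

```python
from urllib.parse import urlparse


def is_voyager_response(url: str) -> bool:
    """True for LinkedIn URLs whose path starts with /voyager/api/."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    return on_linkedin and parsed.path.startswith("/voyager/api/")


def capture_voyager(cdp_url: str = "http://localhost:9222", wait_ms: int = 30_000) -> list:
    """Attach to an already-running Chrome and collect Voyager JSON bodies.

    Only responses that the page itself triggers are read; nothing is re-sent.
    """
    # Imported lazily so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    payloads = []

    def on_response(response):
        if response.ok and is_voyager_response(response.url):
            try:
                payloads.append(response.json())
            except Exception:
                pass  # body was not JSON or is no longer available; skip it

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        context = browser.contexts[0]  # default context of the launched profile
        page = context.pages[0] if context.pages else context.new_page()
        page.on("response", on_response)
        page.wait_for_timeout(wait_ms)  # placeholder for the real scroll loop
    return payloads
```

Keeping the URL filter as a separate pure function makes it possible to check the matching rule without a browser.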
pip install -r requirements.txt
playwright install chromium # not strictly needed since we attach to real Chrome,
# but pip resolves cleanly with this present
- Double-click launch_chrome_for_linkedin.bat.
  - Opens a dedicated Chrome instance on --remote-debugging-port=9222 using an isolated profile under linkedin_profile/ (your normal Chrome is untouched).
  - First time only: log in to LinkedIn in that window. Login persists.
- Verify CDP is up: curl http://localhost:9222/json/version should return JSON.
- Run the crawler: python linkedin_crawler.py
- Watch the Chrome window. The script scrolls slowly, captures JSON, parses,
  appends to output/linkedin_jobs_raw.csv, and exits.
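If curl is not available, the CDP check in the steps above can also be done from Python. The helper below is a sketch using only the standard library; the name `cdp_version` is hypothetical and not part of this repo.

```python
import http.client
import json
import urllib.request


def cdp_version(base_url: str = "http://localhost:9222", timeout: float = 2.0):
    """Return the parsed /json/version document, or None if CDP is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/json/version", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and timeouts are OSError subclasses; malformed JSON raises ValueError.
        return None
```

A `None` result means Chrome is not listening on the debug port, so the launcher has not been started or was closed. When Chrome is up, the returned dictionary normally includes a `Browser` entry with the Chrome version string.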
| File | Purpose |
|---|---|
| launch_chrome_for_linkedin.bat | Launches real Chrome on the debug port with an isolated profile |
| probe_voyager.py | One-shot: dumps raw Voyager payloads to output/intercepted_payloads/ for debugging |
| linkedin_crawler.py | Main entry point: drives the page and writes CSV |
| parser.py | Walks Voyager GraphQL/REST payloads, extracts JobPostingCard + JobDescription |
| enrich.py | Detail-field extractor (applicantCount, expireAt, employeeCount, …) keyed by entityUrn |
| enrich_dumps.py | Run after a crawl: walks all on-disk payload dumps, fills raw_tags/company_size/deadline in CSV without overwriting filled data |
| enrich_jd.py | Offline regex pass over jd_raw: fills academic/duration/salary |
| backfill_jd.py | Targeted JD-body backfill: visits /jobs/view/{id} for rows with empty jd_raw, captures detail JSON, fills the column. Shares the crawler's session budget. |
| schema.py | 20-column schema (matches internship_finding) and row normalization |
| config.py | Search URL, paths, all caps and pacing knobs |
| budget.py | Persistent session-budget enforcement |
| pacing.py | Human-like scroll/pause helpers |
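To give a feel for what "human-like scroll/pause helpers" could look like, here is a small sketch of randomized scroll distances and pauses. It is an assumption about the general technique, not the contents of pacing.py; the function names and the default numbers are hypothetical.

```python
import random
from typing import Iterator, Optional, Tuple


def jittered_pause(base_s: float, jitter: float = 0.4,
                   rng: Optional[random.Random] = None) -> float:
    """Return a pause in seconds drawn uniformly from base_s * (1 +/- jitter)."""
    source = rng if rng is not None else random
    return base_s * source.uniform(1.0 - jitter, 1.0 + jitter)


def scroll_plan(steps: int, base_px: int = 600, base_pause_s: float = 2.5,
                seed: Optional[int] = None) -> Iterator[Tuple[int, float]]:
    """Yield (pixels, pause_seconds) pairs for a slow, irregular scroll."""
    rng = random.Random(seed)
    for _ in range(steps):
        pixels = int(base_px * rng.uniform(0.6, 1.4))  # vary each scroll distance
        yield pixels, jittered_pause(base_pause_s, rng=rng)
```

Each pair would be consumed by the driver loop: scroll by `pixels`, then sleep for `pause_seconds` before the next step.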
After each linkedin_crawler.py run, the CSV has the basic 8 columns
populated (url, company, name, city, jd_raw, publish_time, …). Three
post-processing scripts add the rest. Two are offline (zero LinkedIn
traffic), one does talk to LinkedIn but uses the same session budget
as the main crawler so total traffic per cool-down stays bounded.
python enrich_dumps.py # offline. raw_tags, company_size, deadline.
python enrich_jd.py # offline. academic, duration, salary (regex on jd_raw).
python backfill_jd.py # ONLINE. Fills jd_raw for rows the search-page didn't prefetch.
# Visits /jobs/view/{id} one at a time with strict pacing.
# Default cap: 30 jobs/run. Shares crawler's 12h cooldown.
enrich_dumps.py and enrich_jd.py are idempotent and safe to re-run any time.
backfill_jd.py is the one to run when jd_raw coverage is below ~80% and
you want to grow it. Each contributor's run fills up to 30 missing-JD rows
(the default cap); with 2-3 collaborators running in parallel, the gap
typically closes in 1-3 days. After
running it, re-run enrich_jd.py so the new JD bodies get regex-mined for
academic/duration/salary.
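The "fill without overwriting" behaviour is what makes the offline passes safe to re-run. A minimal sketch of that rule, assuming rows are plain dictionaries keyed by column name (the helper name `fill_missing` is hypothetical):

```python
def fill_missing(row: dict, extracted: dict) -> dict:
    """Copy extracted values into empty cells only; filled cells are left alone."""
    merged = dict(row)
    for column, value in extracted.items():
        current = merged.get(column)
        is_empty = current is None or str(current).strip() == ""
        if is_empty and value not in (None, ""):
            merged[column] = value
    return merged
```

Because a filled cell is never replaced, applying the same extraction a second time leaves the row unchanged.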
Recommended contributor session order:
git pull # get the latest CSV from the repo
python linkedin_crawler.py # search + scroll, capture new jobs
python enrich_dumps.py
python enrich_jd.py
python backfill_jd.py # OPTIONAL: only on days you don't want to crawl new jobs
python enrich_jd.py # re-run if backfill_jd.py added new JDs
git add output/linkedin_jobs_raw.csv
git commit -m "data: <your_handle> session <date>"
git push # PR or direct push, depending on your access
Realistic post-enrichment fill rates (based on Shanghai intern listings):
| Column | Typical fill | Notes |
|---|---|---|
| url, company, name, city, publish_time, external_job_id | 100% | from list payload |
| jd_raw | ~55–60% | only the ~50 cards LinkedIn auto-prefetches per page-load have detail |
| raw_tags (companyId, repostedJob, …) | 100% | from list+detail payloads |
| academic | ~35% | regex on JD; only rows with JD body |
| duration | ~25% | regex on JD |
| salary | ~15% | LinkedIn rarely returns structured salary data; mostly extracted from JD prose |
| company_size | ~3% | only when the company entity payload was prefetched |
| deadline | ~2% | LinkedIn intern postings rarely have explicit deadlines |
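The regex-driven columns above depend on phrases actually appearing in the JD body, which is why their fill rates are low. The sketch below shows the kind of pattern matching involved. The patterns and function names are illustrative assumptions, not the ones in enrich_jd.py, and they only cover a few English and Chinese phrasings.

```python
import re

# Hypothetical patterns: "3 months", "6个月", "200元/天", "250 RMB per day".
DURATION_RE = re.compile(r"(\d+)\s*(?:months?|个月)", re.IGNORECASE)
SALARY_RE = re.compile(
    r"(\d{2,5})\s*(?:元|RMB|CNY)\s*(?:/|per\s+)\s*(?:天|日|day)", re.IGNORECASE
)


def extract_duration(jd_raw: str) -> str:
    """Return the first stated duration as 'N months', or '' if none is found."""
    match = DURATION_RE.search(jd_raw)
    return f"{match.group(1)} months" if match else ""


def extract_salary(jd_raw: str) -> str:
    """Return the first stated daily rate as 'N CNY/day', or '' if none is found."""
    match = SALARY_RE.search(jd_raw)
    return f"{match.group(1)} CNY/day" if match else ""
```

Returning an empty string when nothing matches keeps the column empty rather than guessed.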
- output/linkedin_jobs_raw.csv: appended each session, UTF-8 with BOM
- output/budget_state.json: last session start/end, lifetime count
- output/intercepted_payloads/: raw JSON from the latest run (for debugging; safe to delete)
- output/anomaly.flag: created if a session aborted due to a challenge/login redirect; investigate before running again
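As an illustration of how a challenge/login redirect could be turned into output/anomaly.flag, here is a minimal sketch. The path prefixes are assumptions about what LinkedIn login and challenge pages look like and should be verified against real redirects; the function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed path prefixes for login/challenge pages; adjust after observing real redirects.
SUSPECT_PATH_PREFIXES = ("/login", "/uas/login", "/checkpoint/", "/authwall")


def is_challenge_url(url: str) -> bool:
    """True if the page URL looks like a login wall or security challenge."""
    return urlparse(url).path.startswith(SUSPECT_PATH_PREFIXES)


def raise_anomaly_flag(output_dir: str, url: str) -> Path:
    """Write anomaly.flag with the offending URL and return its path."""
    flag = Path(output_dir) / "anomaly.flag"
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.write_text(f"aborted on redirect to {url}\n", encoding="utf-8")
    return flag
```

A driver loop would call `is_challenge_url(page.url)` after each navigation or scroll step and stop the session as soon as it returns true.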
After the canary run completes:
- Open output/linkedin_jobs_raw.csv in Excel: it should have 20 columns and ~10 rows.
- Spot-check that company, name, city, publish_time, external_job_id are populated for every row.
- Manually scroll the LinkedIn tab the script left open. If LinkedIn shows a "we noticed unusual activity" banner or forces re-login, shrink the budget in config.py (canary down to 5, set MIN_SECONDS_BETWEEN_SESSIONS = 86400) before any further run.
- Re-run within the cool-down: the script must refuse with a [BUDGET] log line and exit cleanly.
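The first two checks can also be scripted instead of eyeballed in Excel. The sketch below assumes the CSV header uses the column names listed above; `check_csv` is a hypothetical helper, not part of this repo.

```python
import csv

REQUIRED = ("company", "name", "city", "publish_time", "external_job_id")


def check_csv(path: str, expected_columns: int = 20) -> list:
    """Return a list of human-readable problems; an empty list means the file passes."""
    problems = []
    # utf-8-sig strips the BOM that the crawler writes.
    with open(path, newline="", encoding="utf-8-sig") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        if len(header) != expected_columns:
            problems.append(f"expected {expected_columns} columns, found {len(header)}")
        for row_no, row in enumerate(reader, start=1):
            for column in REQUIRED:
                if not (row.get(column) or "").strip():
                    problems.append(f"row {row_no}: empty {column}")
    return problems
```

Calling `check_csv("output/linkedin_jobs_raw.csv")` after the canary run should return an empty list.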
If all four pass, lift the cap by editing config.py:
MAX_JOBS_PER_SESSION = STEADY_MAX_JOBS_PER_SESSION # 50
MIN_SECONDS_BETWEEN_SESSIONS = STEADY_MIN_SECONDS_BETWEEN_SESSIONS # 12h

To force a fresh first-session state, delete output/budget_state.json. The
CSV is append-only and is not cleared by reset.
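The cool-down refusal and the effect of deleting the state file can be pictured with a small sketch. It assumes the state file stores the last session end as a Unix timestamp under a key named `last_session_end`; that key name and the function name are hypothetical, and the real budget.py may be organized differently.

```python
import json
import time
from pathlib import Path
from typing import Optional


def seconds_until_allowed(state_path: str, min_seconds_between_sessions: float,
                          now: Optional[float] = None) -> float:
    """Return how long to wait before the next session; 0.0 means it may start."""
    now = time.time() if now is None else now
    path = Path(state_path)
    if not path.exists():
        return 0.0  # no state file: treated as a first session
    state = json.loads(path.read_text(encoding="utf-8"))
    last_end = state.get("last_session_end")  # hypothetical field name
    if last_end is None:
        return 0.0
    return max(0.0, last_end + min_seconds_between_sessions - now)
```

With a 12-hour cool-down (43200 seconds), a session that ended one hour ago yields 39600 seconds of remaining wait, and a missing state file yields zero.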
This crawler is safe to publish as long as .gitignore is respected. The
two paths that must never be committed:
- linkedin_profile/: contains login cookies for whoever launched Chrome. Pushing it == handing the LinkedIn account to anyone who clones.
- output/: each contributor's run state and CSV are local. Sharing them via git would clobber each other and publish the scraped data.
These are excluded by the included .gitignore. Verify with git status
before any push that those paths are not staged.
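The pre-push check can be automated with a short script. This is a sketch that shells out to `git diff --cached --name-only` and flags anything under the two paths named above; the function names are hypothetical.

```python
import subprocess

FORBIDDEN_PREFIXES = ("linkedin_profile/", "output/")


def forbidden_paths(staged: list) -> list:
    """Return the staged paths that fall under a forbidden directory."""
    return [path for path in staged if path.startswith(FORBIDDEN_PREFIXES)]


def staged_files(repo_dir: str = ".") -> list:
    """List files currently staged for commit, relative to the repo root."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Running `forbidden_paths(staged_files())` from the repo root before a push should return an empty list.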
- They clone the repo to their machine.
- They run pip install -r requirements.txt and playwright install chromium.
- They double-click launch_chrome_for_linkedin.bat. A fresh linkedin_profile/ is created on their machine (gitignored).
- They log into LinkedIn with their own account in that window.
- They run python linkedin_crawler.py. Their CSV grows in their output/.
Each contributor has their own cooldown counter, their own session state, their own login. There is no shared infrastructure.
Friends running this on their own machines + accounts will mostly capture the same listings, because LinkedIn's search results don't vary much by viewer for the same query. So plain duplication of the same query gives ~10–20% unique gain per extra account — not 3×.
To make distribution actually compounding, have each collaborator search a
different segment by editing KEYWORD_VARIANTS and/or GEO_ID_* in
their config.py:
| Person | KEYWORD_VARIANTS slice | GEO_ID |
|---|---|---|
| You | English/Chinese generic intern | Shanghai |
| Friend A | data/analyst/finance specific | Shanghai |
| Friend B | software/research/product specific | Shanghai |
| Friend C | generic intern | Beijing or Shenzhen |
Then merge their CSVs with a one-liner:
import pandas as pd, glob
df = pd.concat([pd.read_csv(p, encoding="utf-8-sig") for p in glob.glob("contributors/*.csv")])
df.drop_duplicates(subset="external_job_id", keep="first").to_csv("merged.csv", index=False, encoding="utf-8-sig")

- One session per cooldown window. Do not delete output/budget_state.json to retry; that's exactly what gets accounts flagged. If a session failed, fix the underlying issue and wait it out.
- If LinkedIn shows ANY captcha / "unusual activity" banner during a run, stop, file an issue, do not run again until the group decides.
- Do not increase MAX_JOBS_PER_SESSION past 300 without group sign-off.
- No headless / no Playwright-bundled Chromium (fingerprint risk).
- No throwaway accounts (new accounts trigger detection more readily).
- No replay of /voyager/api/* POSTs (single biggest ban risk).
- No screenshot/OCR fallback (the JSON path covers all needed fields cleanly).
- No salary fabrication (LinkedIn rarely returns compensation for CN intern listings; the column stays empty).