LinkedIn Shanghai Intern Crawler

A small, deliberately conservative crawler that pulls intern listings from LinkedIn's Shanghai job search and writes them in the same 20-column schema as internship_finding/official_jobs_raw.csv.

Design priority: do not get the LinkedIn account banned

Read first if changing anything: ~/.claude/plans/abundant-foraging-kahan.md.

The crawler:

- attaches to a manually-launched Chrome via CDP (real browser, persisted profile, visible window),
- listens to the Voyager JSON responses the page already loads and never replays requests,
- scrolls with random jitter and read-pauses to mimic a human,
- has hard caps (per-session and minimum cool-down) persisted in output/budget_state.json,
- aborts on any auth/challenge redirect.
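
Two of the behaviours listed above, jittered pauses and aborting on an auth/challenge redirect, can be sketched with small pure functions. This is only an illustration, not the code in pacing.py or linkedin_crawler.py: the function names, the jitter fraction, and the list of URL path prefixes are all assumptions.

```python
import random
from urllib.parse import urlparse

# Hypothetical path prefixes that would indicate a login wall or security challenge.
AUTH_PATH_PREFIXES = ("/login", "/checkpoint", "/authwall", "/uas/login")


def is_auth_redirect(url: str) -> bool:
    """Return True if the URL path starts with one of the auth/challenge prefixes."""
    return urlparse(url).path.startswith(AUTH_PATH_PREFIXES)


def jittered_pause(base_seconds: float, jitter: float = 0.4, rng=random) -> float:
    """Return a pause drawn uniformly from base*(1-jitter) to base*(1+jitter)."""
    low = base_seconds * (1.0 - jitter)
    high = base_seconds * (1.0 + jitter)
    return rng.uniform(low, high)
```

A driver loop would sleep for `jittered_pause(...)` between scrolls and stop as soon as `is_auth_redirect(page_url)` becomes true.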

One-time setup

pip install -r requirements.txt
playwright install chromium    # not strictly needed since we attach to real Chrome,
                               # but pip resolves cleanly with this present

Each run

1. Double-click launch_chrome_for_linkedin.bat.
   - Opens a dedicated Chrome instance on --remote-debugging-port=9222 using an isolated profile under linkedin_profile/ (your normal Chrome is untouched).
   - First time only: log into LinkedIn in that window. Login persists.
2. Verify CDP is up: curl http://localhost:9222/json/version should return JSON.
3. Run the crawler:
   ```
   python linkedin_crawler.py
   ```
4. Watch the Chrome window. The script scrolls slowly, captures JSON, parses, appends to output/linkedin_jobs_raw.csv, and exits.
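
If curl is not available, the CDP check in the steps above can be approximated from Python with the standard library. This is a minimal sketch: the function name `cdp_version` is hypothetical, and it deliberately bypasses any system proxy because the endpoint is local.

```python
import json
import urllib.request


def cdp_version(host: str = "localhost", port: int = 9222, timeout: float = 3.0):
    """Return the parsed /json/version document, or None if CDP is unreachable."""
    url = f"http://{host}:{port}/json/version"
    # An empty ProxyHandler keeps the request off any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
```

Calling `cdp_version()` after launching Chrome should return a dictionary; a `None` result suggests the debug port is not open yet.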

Files

| File | Purpose |
| --- | --- |
| `launch_chrome_for_linkedin.bat` | Launches real Chrome on the debug port with an isolated profile |
| `probe_voyager.py` | One-shot: dumps raw Voyager payloads to `output/intercepted_payloads/` for debugging |
| `linkedin_crawler.py` | Main entry point: drives the page and writes CSV |
| `parser.py` | Walks Voyager GraphQL/REST payloads, extracts JobPostingCard + JobDescription |
| `enrich.py` | Detail-field extractor (applicantCount, expireAt, employeeCount, …) keyed by entityUrn |
| `enrich_dumps.py` | Run after a crawl: walks all on-disk payload dumps, fills `raw_tags`/`company_size`/`deadline` in CSV without overwriting filled data |
| `enrich_jd.py` | Offline regex pass over `jd_raw`: fills `academic`/`duration`/`salary` |
| `backfill_jd.py` | Targeted JD-body backfill: visits `/jobs/view/{id}` for rows with empty `jd_raw`, captures detail JSON, fills the column. Shares the crawler's session budget. |
| `schema.py` | 20-column schema (matches `internship_finding`) and row normalization |
| `config.py` | Search URL, paths, all caps and pacing knobs |
| `budget.py` | Persistent session-budget enforcement |
| `pacing.py` | Human-like scroll/pause helpers |
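
To give a feel for what "walking" a payload means in the parser.py row above, here is a minimal recursive sketch. It assumes that entities are dictionaries carrying a `"$type"` string and an `"entityUrn"` key; the sample payload and the `iter_entities` name are invented for illustration and do not reproduce the real Voyager response shape.

```python
from typing import Any, Iterator


def iter_entities(node: Any, type_suffix: str) -> Iterator[dict]:
    """Yield every dict in a nested payload whose "$type" ends with type_suffix."""
    if isinstance(node, dict):
        if str(node.get("$type", "")).endswith(type_suffix):
            yield node
        for value in node.values():
            yield from iter_entities(value, type_suffix)
    elif isinstance(node, list):
        for item in node:
            yield from iter_entities(item, type_suffix)


# Illustrative payload only; real responses are far larger and may be shaped differently.
sample = {
    "included": [
        {"$type": "com.example.JobPostingCard", "entityUrn": "urn:li:x:1", "title": "Data Intern"},
        {"$type": "com.example.Company", "entityUrn": "urn:li:y:9"},
        {"wrapper": {"$type": "com.example.JobPostingCard", "entityUrn": "urn:li:x:2"}},
    ]
}

# Key the matches by entityUrn so later enrichment passes can look them up.
cards = {card["entityUrn"]: card for card in iter_entities(sample, "JobPostingCard")}
```

Matching on a type suffix rather than a fixed path keeps the walk tolerant of payloads that nest the same entity at different depths.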

Post-crawl enrichment pipeline

After each linkedin_crawler.py run, the CSV has the 8 basic columns populated (url, company, name, city, jd_raw, publish_time, …). Three post-processing scripts add the rest. Two are offline (zero LinkedIn traffic); the third does talk to LinkedIn, but it draws on the same session budget as the main crawler, so total traffic per cool-down window stays bounded.

python enrich_dumps.py    # offline. raw_tags, company_size, deadline.
python enrich_jd.py       # offline. academic, duration, salary (regex on jd_raw).
python backfill_jd.py     # ONLINE. Fills jd_raw for rows the search-page didn't prefetch.
                          # Visits /jobs/view/{id} one at a time with strict pacing.
                          # Default cap: 30 jobs/run. Shares crawler's 12h cooldown.

enrich_dumps.py and enrich_jd.py are idempotent and safe to re-run any time.
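
Idempotency here comes down to filling a field only when it is empty, so a second pass changes nothing. The sketch below shows that pattern for a single field. The regex and the `fill_duration` name are assumptions made for illustration; they are not the patterns used in enrich_jd.py, and they ignore Chinese phrasing that real JDs would contain.

```python
import re

# Hypothetical pattern: matches phrases such as "3 months", "6-month", "12 weeks".
DURATION_RE = re.compile(r"(\d{1,2})\s*-?\s*(month|week)s?", re.IGNORECASE)


def fill_duration(row: dict) -> dict:
    """Fill row["duration"] from row["jd_raw"] only when it is currently empty."""
    if row.get("duration"):
        return row  # already filled: never overwrite existing data
    match = DURATION_RE.search(row.get("jd_raw") or "")
    if match:
        count = int(match.group(1))
        unit = match.group(2).lower()
        row["duration"] = f"{count} {unit}" + ("" if count == 1 else "s")
    return row
```

Running `fill_duration` twice on the same row gives the same result as running it once, which is the property that makes re-runs safe.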

backfill_jd.py is the script to run when jd_raw coverage is below ~80% and you want to raise it. Each contributor's run fills at most 30 missing-JD rows (the default cap), so with 2-3 collaborators running in parallel the gap closes in 1-3 days. After running it, re-run enrich_jd.py so the new JD bodies get regex-mined for academic/duration/salary.

Recommended contributor session order:

git pull                       # get the latest CSV from the repo
python linkedin_crawler.py     # search + scroll, capture new jobs
python enrich_dumps.py
python enrich_jd.py
python backfill_jd.py          # OPTIONAL: only on days you are not crawling new listings
python enrich_jd.py            # re-run if backfill_jd.py added new JDs
git add output/linkedin_jobs_raw.csv
git commit -m "data: <your_handle> session <date>"
git push                       # PR or direct push, depending on your access

Realistic post-enrichment fill rates (based on Shanghai intern listings):

| Column | Typical fill | Notes |
| --- | --- | --- |
| url, company, name, city, publish_time, external_job_id | 100% | from list payload |
| jd_raw | ~55–60% | only the ~50 cards LinkedIn auto-prefetches per page-load have detail |
| raw_tags (companyId, repostedJob, …) | 100% | from list+detail payloads |
| academic | ~35% | regex on JD; only rows with JD body |
| duration | ~25% | regex on JD |
| salary | ~15% | LinkedIn rarely provides structured salary data; mostly extracted from JD prose |
| company_size | ~3% | only when the company entity payload was prefetched |
| deadline | ~2% | LinkedIn intern postings rarely have explicit deadlines |
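
To compare your own CSV against the figures above, per-column fill rates can be computed with the standard library. This is a sketch under the assumption that a cell counts as filled when it contains any non-whitespace text; the `fill_rates` name is hypothetical.

```python
import csv
from collections import Counter


def fill_rates(csv_path: str) -> dict:
    """Return {column: fraction of rows with a non-blank value}."""
    filled = Counter()
    total = 0
    # utf-8-sig strips the BOM so the first header name is read cleanly.
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        reader = csv.DictReader(fh)
        columns = reader.fieldnames or []
        for row in reader:
            total += 1
            for col in columns:
                if (row.get(col) or "").strip():
                    filled[col] += 1
    return {col: (filled[col] / total if total else 0.0) for col in columns}
```

For example, `fill_rates("output/linkedin_jobs_raw.csv")["jd_raw"]` gives the JD coverage as a fraction between 0 and 1.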

Output

- output/linkedin_jobs_raw.csv: appended each session, UTF-8 with BOM
- output/budget_state.json: last session start/end, lifetime count
- output/intercepted_payloads/: raw JSON from the latest run (for debugging; safe to delete)
- output/anomaly.flag: created if a session aborted due to a challenge/login redirect; investigate before running again
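
Appending to a CSV that is "UTF-8 with BOM" needs a little care so that the BOM and the header appear exactly once. The following is a minimal sketch of one way to do it, not the crawler's actual writer; `COLUMNS` is an illustrative four-column subset rather than the real 20-column schema, and `append_rows` is a hypothetical name.

```python
import csv
import os

# Illustrative subset of columns; the real schema has 20 and lives in schema.py.
COLUMNS = ["url", "company", "name", "city"]


def append_rows(csv_path: str, rows: list) -> None:
    """Append rows, writing the BOM and header only when the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # "utf-8-sig" emits the BOM; plain "utf-8" on later appends avoids a second one.
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(csv_path, "a", newline="", encoding=encoding) as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Readers should open the file with `encoding="utf-8-sig"`, which strips the BOM transparently.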

Canary verification (do this on the first real run)

After the canary run completes:

1. Open output/linkedin_jobs_raw.csv in Excel; it should have 20 columns and ~10 rows.
2. Spot-check that company, name, city, publish_time, external_job_id are populated for every row.
3. Manually scroll the LinkedIn tab the script left open. If LinkedIn shows a "we noticed unusual activity" banner or forces re-login, shrink the budget in config.py (canary down to 5, set MIN_SECONDS_BETWEEN_SESSIONS = 86400) before any further run.
4. Re-run within the cool-down: the script must refuse with a [BUDGET] log line and exit cleanly.

If all four pass, lift the cap by editing config.py:

MAX_JOBS_PER_SESSION = STEADY_MAX_JOBS_PER_SESSION       # 50
MIN_SECONDS_BETWEEN_SESSIONS = STEADY_MIN_SECONDS_BETWEEN_SESSIONS  # 12h
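
The cool-down refusal checked during canary verification can be pictured as follows. This is a sketch only, not the contents of budget.py: the `last_session_end` key (epoch seconds) is an assumed field name for output/budget_state.json, and both function names are hypothetical.

```python
import json
import time
from pathlib import Path

MIN_SECONDS_BETWEEN_SESSIONS = 12 * 3600  # the 12h steady-state value quoted above


def seconds_until_allowed(state_path, now=None, min_gap=MIN_SECONDS_BETWEEN_SESSIONS) -> float:
    """Return 0.0 if a session may start, else the remaining cool-down in seconds."""
    now = time.time() if now is None else now
    path = Path(state_path)
    if not path.exists():
        return 0.0  # no recorded session yet
    state = json.loads(path.read_text(encoding="utf-8"))
    last_end = state.get("last_session_end")  # hypothetical key, epoch seconds
    if last_end is None:
        return 0.0
    return max(0.0, last_end + min_gap - now)


def check_budget(state_path) -> bool:
    """Print a [BUDGET] line and return False while the cool-down is active."""
    remaining = seconds_until_allowed(state_path)
    if remaining > 0:
        print(f"[BUDGET] cool-down active, {remaining / 3600:.1f}h remaining; exiting.")
        return False
    return True
```

Keeping the state in a file rather than in memory is what lets the limit survive across separate script invocations.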

Reset

To force a fresh first-session state, delete output/budget_state.json. The CSV is append-only and is not cleared by reset.

Sharing with collaborators (GitHub, friends running on their own machines)

This crawler is safe to publish as long as .gitignore is respected. The two paths that must never be committed:

- linkedin_profile/: contains login cookies for whoever launched Chrome. Pushing it hands the LinkedIn account to anyone who clones the repo.
- output/: each contributor's run state and CSV are local. Sharing them via git would let contributors clobber each other's files and would publish the scraped data.

These are excluded by the included .gitignore. Verify with git status before any push that those paths are not staged.
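
The manual git status check can be backed by a small script. The sketch below lists staged paths with `git diff --cached --name-only` and flags any that fall under the two directories named above; the function names are hypothetical, and the prefix list is a parameter so it can be adjusted to the repository's actual policy.

```python
import subprocess

FORBIDDEN_PREFIXES = ("linkedin_profile/", "output/")


def forbidden_staged(staged, prefixes=FORBIDDEN_PREFIXES):
    """Return the staged paths that fall under a forbidden directory."""
    return [path for path in staged if path.startswith(tuple(prefixes))]


def staged_paths(repo_dir="."):
    """List the paths currently staged in git (names only)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

Running `forbidden_staged(staged_paths())` before a push should return an empty list; anything else means a sensitive path has been staged.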

Adding a new contributor

1. They clone the repo to their machine.
2. They run pip install -r requirements.txt and playwright install chromium.
3. They double-click launch_chrome_for_linkedin.bat. A fresh linkedin_profile/ is created on their machine (gitignored).
4. They log into LinkedIn with their own account in that window.
5. They run python linkedin_crawler.py. Their CSV grows in their output/.

Each contributor has their own cooldown counter, their own session state, their own login. There is no shared infrastructure.

Merging contributors' CSVs

Friends running this on their own machines and accounts will mostly capture the same listings, because LinkedIn's search results don't vary much by viewer for the same query. Simply repeating the same query therefore yields only ~10–20% more unique listings per extra account, not 3×.
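
The per-account gain mentioned above can be measured directly from two sets of job IDs. In this sketch, "gain" is taken to mean the number of IDs that only the extra account found, divided by the number of distinct IDs you already had; that definition and the `unique_gain` name are assumptions.

```python
def unique_gain(base_ids, extra_ids) -> float:
    """Fraction by which extra_ids grows the set of distinct job IDs in base_ids."""
    base = set(base_ids)
    if not base:
        return 0.0
    new = set(extra_ids) - base
    return len(new) / len(base)


# Made-up IDs: 100 existing listings, and a second account that re-finds 85 and adds 15.
mine = [str(i) for i in range(100)]
theirs = [str(i) for i in range(15, 115)]
gain = unique_gain(mine, theirs)
```

In practice the two ID lists would come from the `external_job_id` column of each contributor's CSV.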

To make adding collaborators actually pay off, have each one search a different segment by editing KEYWORD_VARIANTS and/or GEO_ID_* in their config.py:

| Person | `KEYWORD_VARIANTS` slice | `GEO_ID` |
| --- | --- | --- |
| You | English/Chinese generic intern | Shanghai |
| Friend A | data/analyst/finance specific | Shanghai |
| Friend B | software/research/product specific | Shanghai |
| Friend C | generic intern | Beijing or Shenzhen |

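Then merge their CSVs with a short script:

import glob
import pandas as pd

# Read every contributor CSV (UTF-8 with BOM) as text so job IDs compare consistently,
# then keep the first row seen for each external_job_id.
frames = [pd.read_csv(p, encoding="utf-8-sig", dtype=str) for p in sorted(glob.glob("contributors/*.csv"))]
merged = pd.concat(frames, ignore_index=True).drop_duplicates(subset="external_job_id", keep="first")
merged.to_csv("merged.csv", index=False, encoding="utf-8-sig")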

Safety rules to put in your contributor README

- One session per cooldown window. Do not delete output/budget_state.json to retry; that is exactly what gets accounts flagged. If a session failed, fix the underlying issue and wait it out.
- If LinkedIn shows ANY captcha / "unusual activity" banner during a run, stop, file an issue, and do not run again until the group decides.
- Do not increase MAX_JOBS_PER_SESSION past 300 without group sign-off.

What this crawler intentionally does NOT do

- No headless / no Playwright-bundled Chromium (fingerprint risk).
- No throwaway accounts (new accounts trigger detection more readily).
- No replay of /voyager/api/* POSTs (single biggest ban risk).
- No screenshot/OCR fallback (the JSON path covers all needed fields cleanly).
- No salary fabrication (LinkedIn rarely returns compensation for CN intern listings; the column stays empty).
