Soli22de/Linkedin_crawler

LinkedIn Shanghai Intern Crawler

A small, deliberately conservative crawler that pulls intern listings from LinkedIn's Shanghai job search and writes them in the same 20-column schema as internship_finding/official_jobs_raw.csv.

Design priority: do not get the LinkedIn account banned

Read first if changing anything: ~/.claude/plans/abundant-foraging-kahan.md.

The crawler:

  • attaches to a manually-launched Chrome via CDP (real browser, persisted profile, visible window),
  • listens to the Voyager JSON responses the page already loads — never replays requests,
  • scrolls with random jitter and read-pauses to mimic a human,
  • has hard caps (per-session and minimum cool-down) persisted in output/budget_state.json,
  • aborts on any auth/challenge redirect.
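The passive-capture and abort rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's code: the helper names (`is_voyager_json`, `is_auth_redirect`, `attach_and_listen`), the URL markers, and the choice of the first open tab are all hypothetical. Only `connect_over_cdp` and `page.on("response", ...)` come from Playwright's sync API.

```python
from urllib.parse import urlparse

# Assumed markers for this sketch; the real crawler may match differently.
VOYAGER_PREFIX = "/voyager/api/"
AUTH_PATH_MARKERS = ("/login", "/checkpoint/", "/authwall", "/uas/")


def is_voyager_json(url: str, content_type: str) -> bool:
    """True for responses worth recording: Voyager endpoints that return JSON."""
    path = urlparse(url).path
    return path.startswith(VOYAGER_PREFIX) and "json" in content_type.lower()


def is_auth_redirect(url: str) -> bool:
    """True when the page has landed on a login or challenge URL."""
    path = urlparse(url).path
    return any(marker in path for marker in AUTH_PATH_MARKERS)


def attach_and_listen(cdp_url: str = "http://localhost:9222") -> list:
    """Attach to the already-running Chrome and record matching responses.

    Read-only: it inspects responses the page requested on its own and
    never issues a Voyager request itself.
    """
    from playwright.sync_api import sync_playwright  # imported lazily

    captured = []

    def on_response(response):
        if is_voyager_json(response.url, response.headers.get("content-type", "")):
            try:
                captured.append(response.json())
            except Exception:
                pass  # body unavailable or not valid JSON; skip it

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        page = browser.contexts[0].pages[0]  # assumes the LinkedIn tab is the first one
        page.on("response", on_response)
        page.wait_for_timeout(5000)  # let the page's own traffic arrive
        if is_auth_redirect(page.url):
            raise RuntimeError("auth/challenge redirect detected; aborting")
    return captured
```

Keeping the two predicates free of any browser dependency means they can be unit-tested without launching Chrome.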

One-time setup

pip install -r requirements.txt
playwright install chromium    # not strictly needed since we attach to real Chrome,
                               # but pip resolves cleanly with this present

Each run

  1. Double-click launch_chrome_for_linkedin.bat.
    • Opens a dedicated Chrome instance on --remote-debugging-port=9222 using an isolated profile under linkedin_profile/ (your normal Chrome is untouched).
    • First time only: log into LinkedIn in that window. Login persists.
  2. Verify CDP is up: curl http://localhost:9222/json/version should return JSON.
  3. Run the crawler:
    python linkedin_crawler.py
    
  4. Watch the Chrome window. The script scrolls slowly, captures JSON, parses, appends to output/linkedin_jobs_raw.csv, and exits.
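The CDP check in step 2 can also be done from Python, which is handy on machines without curl. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it relies only on the `Browser` and `webSocketDebuggerUrl` fields that Chrome's `/json/version` endpoint returns.

```python
import json
import urllib.error
import urllib.request


def parse_version_payload(raw: bytes) -> dict:
    """Decode the /json/version body and keep the fields useful for a sanity check."""
    info = json.loads(raw.decode("utf-8"))
    return {
        "browser": info.get("Browser", ""),
        "ws_url": info.get("webSocketDebuggerUrl", ""),
    }


def cdp_is_up(port: int = 9222, timeout: float = 2.0) -> bool:
    """Return True if a Chrome debug endpoint answers on localhost:<port>."""
    url = f"http://localhost:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return bool(parse_version_payload(resp.read())["ws_url"])
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not JSON.
        return False
```

Calling `cdp_is_up()` before starting the crawler gives a clear early failure instead of a connection error partway through a session.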

Files

| File | Purpose |
| --- | --- |
| launch_chrome_for_linkedin.bat | Launches real Chrome on the debug port with an isolated profile |
| probe_voyager.py | One-shot: dumps raw Voyager payloads to output/intercepted_payloads/ for debugging |
| linkedin_crawler.py | Main entry point: drives the page and writes CSV |
| parser.py | Walks Voyager GraphQL/REST payloads, extracts JobPostingCard + JobDescription |
| enrich.py | Detail-field extractor (applicantCount, expireAt, employeeCount, …) keyed by entityUrn |
| enrich_dumps.py | Run after a crawl: walks all on-disk payload dumps, fills raw_tags/company_size/deadline in CSV without overwriting filled data |
| enrich_jd.py | Offline regex pass over jd_raw: fills academic/duration/salary |
| backfill_jd.py | Targeted JD-body backfill: visits /jobs/view/{id} for rows with empty jd_raw, captures detail JSON, fills the column. Shares the crawler's session budget. |
| schema.py | 20-column schema (matches internship_finding) and row normalization |
| config.py | Search URL, paths, all caps and pacing knobs |
| budget.py | Persistent session-budget enforcement |
| pacing.py | Human-like scroll/pause helpers |
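To illustrate what "walking" a payload can mean for parser.py, a recursive search for typed entities might look like the sketch below. The `$type` key, the `iter_entities` helper, and the sample payload shape are assumptions made for the example; they are not taken from the repo's parser.

```python
from typing import Any, Iterator


def iter_entities(node: Any, type_suffix: str) -> Iterator[dict]:
    """Yield every dict in a nested JSON structure whose "$type" ends with type_suffix."""
    if isinstance(node, dict):
        if str(node.get("$type", "")).endswith(type_suffix):
            yield node
        for value in node.values():
            yield from iter_entities(value, type_suffix)
    elif isinstance(node, list):
        for item in node:
            yield from iter_entities(item, type_suffix)


# Hypothetical payload, shaped only for this example.
payload = {
    "data": {"paging": {"count": 2}},
    "included": [
        {"$type": "com.example.JobPostingCard", "entityUrn": "urn:li:card:1", "title": "Data Intern"},
        {"$type": "com.example.Company", "entityUrn": "urn:li:company:9"},
        {"wrapper": [{"$type": "com.example.JobPostingCard", "entityUrn": "urn:li:card:2"}]},
    ],
}

cards = list(iter_entities(payload, "JobPostingCard"))
urns = [card["entityUrn"] for card in cards]
```

A depth-first walk like this does not depend on where in the response the cards sit, which helps when the same entity type appears in both list and detail payloads.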

Post-crawl enrichment pipeline

After each linkedin_crawler.py run, the CSV has the 8 basic columns populated (url, company, name, city, jd_raw, publish_time, …). Three post-processing scripts add the rest. Two are offline (zero LinkedIn traffic); the third does talk to LinkedIn, but it draws on the same session budget as the main crawler, so total traffic per cool-down window stays bounded.

python enrich_dumps.py    # offline. raw_tags, company_size, deadline.
python enrich_jd.py       # offline. academic, duration, salary (regex on jd_raw).
python backfill_jd.py     # ONLINE. Fills jd_raw for rows the search-page didn't prefetch.
                          # Visits /jobs/view/{id} one at a time with strict pacing.
                          # Default cap: 30 jobs/run. Shares crawler's 12h cooldown.

enrich_dumps.py and enrich_jd.py are idempotent and safe to re-run any time.
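One way such idempotence can be achieved is to fill only empty cells, so a second pass has nothing left to change. The sketch below assumes that approach; `fill_missing` and the sample values are hypothetical and not the scripts' actual code.

```python
def fill_missing(row: dict, extracted: dict) -> dict:
    """Return a copy of row with empty cells filled from extracted; filled cells are kept."""
    merged = dict(row)
    for column, value in extracted.items():
        current = merged.get(column)
        is_empty = current is None or str(current).strip() == ""
        if is_empty and value not in (None, ""):
            merged[column] = value
    return merged


row = {"external_job_id": "123", "company_size": "", "deadline": "2025-06-30"}
found = {"company_size": "51-200", "deadline": "2025-07-15"}

once = fill_missing(row, found)    # company_size filled, existing deadline kept
twice = fill_missing(once, found)  # second pass changes nothing
```

Because the function returns a copy, the original row is untouched, and applying it any number of times gives the same result as applying it once.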

backfill_jd.py is the one to run when jd_raw coverage is below ~80% and you want to raise it. Each run fills at most 30 missing-JD rows (the default cap), so with 2-3 collaborators each running it on their own account, the gap closes in 1-3 days. After running it, re-run enrich_jd.py so the new JD bodies get regex-mined for academic/duration/salary.
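For a sense of what regex-mining a JD body involves, here is a minimal sketch. The patterns, the `mine_jd` name, and the output formats are illustrative assumptions; enrich_jd.py's real expressions are not shown in this README and are likely broader.

```python
import re

# Illustrative patterns only, covering a few English and Chinese phrasings.
ACADEMIC_RE = re.compile(r"\b(bachelor|master|phd)\b|(本科|硕士|博士)", re.IGNORECASE)
DURATION_RE = re.compile(r"(\d+)\s*(?:months?|个月)", re.IGNORECASE)
SALARY_RE = re.compile(
    r"(\d{2,5})\s*(?:-\s*(\d{2,5})\s*)?(?:RMB|CNY|元)\s*/\s*(?:day|天)",
    re.IGNORECASE,
)


def mine_jd(jd_raw: str) -> dict:
    """Extract academic/duration/salary hints from a JD body; empty string when absent."""
    result = {"academic": "", "duration": "", "salary": ""}
    if not jd_raw:
        return result
    if m := ACADEMIC_RE.search(jd_raw):
        result["academic"] = m.group(0).lower()
    if m := DURATION_RE.search(jd_raw):
        result["duration"] = f"{m.group(1)} months"
    if m := SALARY_RE.search(jd_raw):
        low, high = m.group(1), m.group(2)
        result["salary"] = f"{low}-{high} CNY/day" if high else f"{low} CNY/day"
    return result


english = mine_jd("Master or PhD students preferred. At least 3 months. Pay: 200-300 RMB/day.")
chinese = mine_jd("要求本科及以上,实习6个月,薪资150元/天")
```

Rows with an empty jd_raw come back with all three fields empty, which is why these columns can only be filled for rows that already have a JD body.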

Recommended contributor session order:

git pull                       # get the latest CSV from the repo
python linkedin_crawler.py     # search + scroll, capture new jobs
python enrich_dumps.py
python enrich_jd.py
python backfill_jd.py          # OPTIONAL: only on days you skip crawling new listings (shares the session budget)
python enrich_jd.py            # re-run if backfill_jd.py added new JDs
git add output/linkedin_jobs_raw.csv
git commit -m "data: <your_handle> session <date>"
git push                       # PR or direct push, depending on your access

Realistic post-enrichment fill rates (based on Shanghai intern listings):

| Column | Typical fill | Notes |
| --- | --- | --- |
| url, company, name, city, publish_time, external_job_id | 100% | from list payload |
| jd_raw | ~55–60% | only the ~50 cards LinkedIn auto-prefetches per page-load have detail |
| raw_tags (companyId, repostedJob, …) | 100% | from list+detail payloads |
| academic | ~35% | regex on JD; only rows with JD body |
| duration | ~25% | regex on JD |
| salary | ~15% | LinkedIn rarely structured; mostly extracted from JD prose |
| company_size | ~3% | only when the company entity payload was prefetched |
| deadline | ~2% | LinkedIn intern postings rarely have explicit deadlines |

Output

  • output/linkedin_jobs_raw.csv — appended each session, UTF-8 with BOM
  • output/budget_state.json — last session start/end, lifetime count
  • output/intercepted_payloads/ — raw JSON from the latest run (for debugging; safe to delete)
  • output/anomaly.flag — created if a session aborted due to a challenge/login redirect; investigate before running again
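An append-only CSV with a BOM needs the BOM and the header written exactly once, at file creation. The sketch below shows one way to do that with the standard library; `append_rows` is a hypothetical helper, not the crawler's writer.

```python
import csv
import os
import tempfile


def append_rows(path: str, header: list, rows: list) -> None:
    """Append rows to a CSV, writing the BOM and header only when the file is created."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Use the BOM-writing codec only for a new file, so the BOM can appear
    # only at the very start; later appends use plain UTF-8.
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(path, "a", newline="", encoding=encoding) as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


# Two simulated sessions writing to the same file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "linkedin_jobs_raw.csv")
    append_rows(csv_path, ["external_job_id", "company"], [["1", "Acme"]])
    append_rows(csv_path, ["external_job_id", "company"], [["2", "Beta 公司"]])
    with open(csv_path, "rb") as fh:
        raw = fh.read()
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        rows_read = list(csv.reader(fh))
```

The BOM is what lets Excel detect UTF-8 and display Chinese company names correctly when the file is opened by double-click.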

Canary verification (do this on the first real run)

After the canary run completes:

  1. Open output/linkedin_jobs_raw.csv in Excel — should have 20 columns and ~10 rows.
  2. Spot-check that company, name, city, publish_time, external_job_id are populated for every row.
  3. Manually scroll the LinkedIn tab the script left open — if LinkedIn shows a "we noticed unusual activity" banner or forces re-login, shrink the budget in config.py (canary down to 5, set MIN_SECONDS_BETWEEN_SESSIONS = 86400) before any further run.
  4. Re-run within the cool-down — script must refuse with a [BUDGET] log line and exit cleanly.
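Checks 1 and 2 can be automated instead of eyeballed in Excel. The following is a rough sketch under assumptions: `check_canary` is a hypothetical helper, and it tests only the column count and the five fields named in step 2.

```python
import csv
import io

REQUIRED = ("company", "name", "city", "publish_time", "external_job_id")


def check_canary(csv_text: str, expected_columns: int = 20) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    header = reader.fieldnames or []
    if len(header) != expected_columns:
        problems.append(f"expected {expected_columns} columns, found {len(header)}")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: {column} is empty")
    return problems
```

To run it against the real file, read output/linkedin_jobs_raw.csv with `encoding="utf-8-sig"` so the BOM does not end up in the first column name, then pass the text to `check_canary`.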

If all four pass, lift the cap by editing config.py:

MAX_JOBS_PER_SESSION = STEADY_MAX_JOBS_PER_SESSION       # 50
MIN_SECONDS_BETWEEN_SESSIONS = STEADY_MIN_SECONDS_BETWEEN_SESSIONS  # 12h

Reset

To force a fresh first-session state, delete output/budget_state.json. The CSV is append-only and is not cleared by reset.
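The reason deleting the file resets the state is easiest to see in a sketch of a persisted cool-down check. The field names (`last_session_end`, `lifetime_jobs`) and both functions are hypothetical; the actual layout of budget_state.json may differ.

```python
import json
import os
import tempfile


def may_start(state_path: str, min_gap_seconds: int, now: float) -> bool:
    """True if no session was recorded yet or the cool-down has fully elapsed."""
    if not os.path.exists(state_path):
        return True  # no state file means first session
    with open(state_path, encoding="utf-8") as fh:
        state = json.load(fh)
    return now - state.get("last_session_end", 0.0) >= min_gap_seconds


def record_session(state_path: str, jobs_captured: int, now: float) -> None:
    """Persist the session end time and add to the lifetime count."""
    state = {"lifetime_jobs": 0}
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as fh:
            state = json.load(fh)
    state["last_session_end"] = now
    state["lifetime_jobs"] = state.get("lifetime_jobs", 0) + jobs_captured
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


# Simulated timeline with a 12h cool-down, using explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "budget_state.json")
    t0 = 1_000_000.0
    first = may_start(path, 12 * 3600, now=t0)
    record_session(path, jobs_captured=10, now=t0)
    too_soon = may_start(path, 12 * 3600, now=t0 + 3600)
    after_gap = may_start(path, 12 * 3600, now=t0 + 12 * 3600)
    os.remove(path)
    after_reset = may_start(path, 12 * 3600, now=t0 + 60)
```

Because the only memory of the last session lives in that file, removing it makes the next run look like a first run, which is also why it should not be used as a retry shortcut.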

Sharing with collaborators (GitHub, friends running on their own machines)

This crawler is safe to publish as long as .gitignore is respected. The two paths that must never be committed:

  • linkedin_profile/ — contains login cookies for whoever launched Chrome. Pushing it == handing the LinkedIn account to anyone who clones.
  • output/ — each contributor's run state and CSV are local. Sharing them via git would clobber each other and publish the scraped data.

These are excluded by the included .gitignore. Verify with git status before any push that those paths are not staged.
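The pre-push check can be scripted as well. In this sketch the helper names are hypothetical; the only external call is `git diff --cached --name-only`, which lists staged paths.

```python
import subprocess

FORBIDDEN_PREFIXES = ("linkedin_profile/", "output/")


def forbidden_staged(staged_paths: list) -> list:
    """Return the staged paths that live under a directory that must never be committed."""
    return [p for p in staged_paths if p.startswith(FORBIDDEN_PREFIXES)]


def staged_paths_from_git() -> list:
    """List staged files via `git diff --cached --name-only` (run inside the repo)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

Running `forbidden_staged(staged_paths_from_git())` from the repo root and refusing to push when the result is non-empty gives a second line of defense behind .gitignore.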

Adding a new contributor

  1. They clone the repo to their machine.
  2. They run pip install -r requirements.txt and playwright install chromium.
  3. They double-click launch_chrome_for_linkedin.bat. A fresh linkedin_profile/ is created on their machine (gitignored).
  4. They log into LinkedIn with their own account in that window.
  5. They run python linkedin_crawler.py. Their CSV grows in their output/.

Each contributor has their own cooldown counter, their own session state, their own login. There is no shared infrastructure.

Merging contributors' CSVs

Friends running this on their own machines and accounts will mostly capture the same listings, because LinkedIn's search results don't vary much by viewer for the same query. Repeating the same query from extra accounts therefore adds only ~10–20% unique listings per account, far short of a proportional (e.g. 3×) increase.
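The overlap effect is easy to measure once several contributors have data. The sketch below assumes "unique gain" means the share of a contributor's job IDs that no earlier contributor had already captured; `marginal_unique_gain` and the sample IDs are hypothetical.

```python
def marginal_unique_gain(runs: list) -> list:
    """For each contributor's job IDs, the fraction not seen in any earlier contributor's run."""
    seen = set()
    gains = []
    for ids in runs:
        unique_ids = set(ids)
        new = unique_ids - seen
        gains.append(len(new) / len(unique_ids) if unique_ids else 0.0)
        seen |= new
    return gains


# Two accounts running the same query: 10 listings each, 8 of them shared.
account_a = list(range(1, 11))
account_b = list(range(3, 13))
gains = marginal_unique_gain([account_a, account_b])
```

In this toy case the second account contributes only 2 new listings out of 10, a 20% gain, which is the pattern the paragraph above describes.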

To make extra accounts actually add coverage, have each collaborator search a different segment by editing KEYWORD_VARIANTS and/or GEO_ID_* in their config.py:

| Person | KEYWORD_VARIANTS slice | GEO_ID |
| --- | --- | --- |
| You | English/Chinese generic intern | Shanghai |
| Friend A | data/analyst/finance specific | Shanghai |
| Friend B | software/research/product specific | Shanghai |
| Friend C | generic intern | Beijing or Shenzhen |

Then merge their CSVs with a one-liner:

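import glob
import pandas as pd

# dtype=str keeps external_job_id comparable across files; sorted() makes keep="first" deterministic
df = pd.concat([pd.read_csv(p, encoding="utf-8-sig", dtype=str) for p in sorted(glob.glob("contributors/*.csv"))], ignore_index=True)
df.drop_duplicates(subset="external_job_id", keep="first").to_csv("merged.csv", index=False, encoding="utf-8-sig")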

Safety rules to put in your contributor README

  • One session per cooldown window. Do not delete output/budget_state.json to retry — that's exactly what gets accounts flagged. If a session failed, fix the underlying issue and wait it out.
  • If LinkedIn shows ANY captcha / "unusual activity" banner during a run, stop, file an issue, do not run again until the group decides.
  • Do not increase MAX_JOBS_PER_SESSION past 300 without group sign-off.

What this crawler intentionally does NOT do

  • No headless / no Playwright-bundled Chromium (fingerprint risk).
  • No throwaway accounts (new accounts trigger detection more readily).
  • No replay of /voyager/api/* POSTs (single biggest ban risk).
  • No screenshot/OCR fallback (the JSON path covers all needed fields cleanly).
  • No salary fabrication (LinkedIn rarely returns compensation for CN intern listings; the column stays empty unless the JD text itself states a figure).

About

LinkedIn Shanghai intern jobs crawler. Attaches via CDP to a manually-launched Chrome, multi-keyword sweep, ban-safe pacing budget, schema-compatible CSV output.
