Internship Finder -- Multi-Source Campus Recruitment Crawler

A Python-based web crawler and data pipeline that aggregates campus recruitment internship postings from 9 major Chinese technology companies. The system crawls official career portals, normalizes heterogeneous job data into a unified schema, applies rule-based filtering for target graduation cohorts, and produces ranked candidate-job match reports.

Overview

Chinese tech companies each maintain distinct career portal architectures with varying APIs, page structures, and data formats. This project abstracts those differences behind a unified adapter pattern, enabling a single crawl command to collect, clean, deduplicate, and score thousands of internship listings across all target companies.
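As a rough illustration of the adapter idea, the sketch below defines a small abstract base class and one toy adapter. The method names (fetch_list, parse, get_27_signal) and the BaseCompanyAdapter class name come from the Features section; the signatures, return types, the DemoAdapter class, and the crawl_all helper are assumptions made for this example and are not taken from parsers/base_adapter.py.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseCompanyAdapter(ABC):
    """Assumed shape of the adapter interface (signatures are hypothetical)."""

    company: str = ""

    @abstractmethod
    def fetch_list(self) -> list[dict[str, Any]]:
        """Return raw postings as the company portal delivers them."""

    @abstractmethod
    def parse(self, raw: dict[str, Any]) -> dict[str, Any]:
        """Map one raw posting onto the unified schema."""

    @abstractmethod
    def get_27_signal(self, job: dict[str, Any]) -> str:
        """Return a cohort-confidence label for one parsed posting."""


class DemoAdapter(BaseCompanyAdapter):
    """Toy adapter for a made-up company; performs no network access."""

    company = "DemoCo"

    def fetch_list(self):
        return [{"name": "2027 Data Analyst Intern", "loc": "Shanghai"}]

    def parse(self, raw):
        return {"company": self.company, "title": raw["name"], "city": raw["loc"]}

    def get_27_signal(self, job):
        # Hypothetical rule: an explicit year in the title counts as "high".
        return "high" if "2027" in job["title"] else "low"


def crawl_all(adapters):
    """Core loop: knows only the interface, never company-specific details."""
    rows = []
    for adapter in adapters:
        for raw in adapter.fetch_list():
            job = adapter.parse(raw)
            job["cohort27"] = adapter.get_27_signal(job)
            rows.append(job)
    return rows
```

Under this layout, supporting another portal would mean adding one subclass while crawl_all stays unchanged.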

Supported Companies

ByteDance, Tencent, Kuaishou, Xiaohongshu, Meituan, Alibaba, JD.com, Bilibili, Baidu

Features

Multi-source crawling -- Fetches job listings from 9 company career portals using both REST API calls (requests) and browser-rendered pages (Playwright).
Adapter pattern -- Each company has a dedicated adapter class (extending BaseCompanyAdapter) that implements fetch_list, parse, and get_27_signal, isolating company-specific logic from the core pipeline.
Configurable company selection -- Enable or disable individual company crawlers and set per-company strict mode via config.py.
Snapshot-based change detection -- Maintains JSON snapshots (snapshot_latest.json, snapshot_prev.json) to detect new, updated, and removed listings between crawl runs.
Content-hash deduplication -- Generates MD5 hashes from job title, city, JD text, and recruitment type to eliminate duplicate postings across sources.
Cohort classification engine -- A rule-based classifier (cohort27_rules.py) assigns each listing a high, medium, or low confidence level indicating whether it targets the 2027 graduation cohort, using 12+ pattern rules covering explicit year mentions, special recruitment programs (X-Star, RedStar, etc.), and timing heuristics.
Location normalization -- Maps varied city representations (district names, slash-separated multi-city strings, aliases) to canonical city names via a JSON alias mapping and fallback regex.
Data enrichment pipeline (merge_file.py):
- JD parsing: splits raw job descriptions into responsibility and requirement sections.
- Structured field extraction: degree requirements, graduation batch, internship duration, days per week.
- Salary normalization: converts K-range, K-plus, and daily-rate formats to a unified monthly range.
- Experience extraction: parses year-range, year-plus, and zero-experience patterns.
- Skill keyword matching against a configurable list (SQL, Python, Spark, Tableau, etc.).
- Composite relevance scoring (0--100) combining company tier, target city, cohort match, role-keyword density, and JD quality.
Vectorized pandas operations -- Performance-critical normalization and scoring functions have vectorized implementations for large datasets.
Link validation -- Async batch link checking to flag dead application URLs.
Multi-format output -- Generates CSV, Excel, and JSON reports, including filtered views (Shanghai-only, verified-only, new-this-update).
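The content-hash deduplication and snapshot-based change detection described in the list above can be sketched as follows. This is a minimal illustration and not the project's code: the dictionary key names (title, city, jd_text, recruit_type), the whitespace and case normalization before hashing, and the assumption that a snapshot maps a job ID to its content hash are all hypothetical.

```python
import hashlib


def content_hash(job: dict) -> str:
    """MD5 over title, city, JD text, and recruitment type (assumed key names)."""
    fields = ("title", "city", "jd_text", "recruit_type")
    # Assumption: trim and lowercase so cosmetic differences do not change the hash.
    joined = "|".join(str(job.get(key, "")).strip().lower() for key in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def dedupe(jobs):
    """Keep the first posting seen for each content hash."""
    seen = {}
    for job in jobs:
        seen.setdefault(content_hash(job), job)
    return list(seen.values())


def diff_snapshots(prev: dict, latest: dict) -> dict:
    """Compare two {job_id: content_hash} mappings from consecutive crawl runs."""
    new = sorted(latest.keys() - prev.keys())
    removed = sorted(prev.keys() - latest.keys())
    updated = sorted(k for k in latest.keys() & prev.keys() if latest[k] != prev[k])
    return {"new": new, "updated": updated, "removed": removed}
```

In a layout like this, the two mappings would be loaded from snapshot_prev.json and snapshot_latest.json before calling diff_snapshots.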

Tech Stack

Language: Python 3.10+
Web scraping: Requests, Playwright (Chromium)
Data processing: pandas, regex
Async: asyncio (link validation)
Configuration: Python dict-based config with environment variable overrides
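The asyncio entry above refers to batch link validation. One way such a checker could look is sketched below, under several assumptions: the function names are invented for this example, the blocking check uses the standard-library urllib (a stand-in chosen to keep the sketch dependency-free, not necessarily what the project uses), and concurrency is capped with a semaphore.

```python
import asyncio
import urllib.request


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Blocking check: True when a HEAD request succeeds with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return False


async def check_links(urls, check=head_ok, concurrency=10):
    """Run the blocking check in worker threads, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def check_one(url):
        async with semaphore:
            return url, await asyncio.to_thread(check, url)

    pairs = await asyncio.gather(*(check_one(url) for url in urls))
    return dict(pairs)
```

Passing a custom check callable makes the batch logic testable offline, for example asyncio.run(check_links(urls, check=my_fake_check)).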

Project Structure

internship_finding/
  official_multi_crawler.py   # Core multi-source crawler orchestrator
  merge_file.py               # Data cleaning, enrichment, scoring pipeline
  config.py                   # Company enable/disable and crawl settings
  parsers/
    base_adapter.py           # Abstract adapter interface
    company_adapters.py       # Per-company adapter implementations
    company_registry.py       # Adapter registry and factory
  rules/
    cohort27_rules.py         # Graduation cohort classification rules
    location_normalizer.py    # City name normalization with alias mapping
  release_data/               # Output CSVs and Excel files
  archive/                    # Historical crawl snapshots

How to Run

# Install dependencies
pip install pandas requests playwright apscheduler

# Install Playwright browsers (first time only)
playwright install chromium

# Run the multi-source crawler
python official_multi_crawler.py

# Run the merge and scoring pipeline
python merge_file.py

Environment variables (optional):

Variable	Default	Description
`HEADLESS`	`1`	Run browser in headless mode
`MAX_SCROLL`	`10`	Maximum scroll iterations for infinite-scroll pages
`CITY_KEYWORD`	`Shanghai`	Target city filter
`GRAD_KEYWORDS`	`27届,2027届,2027`	Comma-separated graduation cohort keywords
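
For illustration, the variables in the table could be read with defaults along these lines. The load_settings function and the returned key names are hypothetical, and treating any HEADLESS value other than 1 as "not headless" is an assumption; the actual handling in config.py may differ.

```python
import os


def load_settings(env=os.environ):
    """Read the optional environment variables, falling back to the documented defaults."""
    raw_keywords = env.get("GRAD_KEYWORDS", "27届,2027届,2027")
    return {
        # Assumption: "1" enables headless mode, any other value disables it.
        "headless": env.get("HEADLESS", "1") == "1",
        "max_scroll": int(env.get("MAX_SCROLL", "10")),
        "city_keyword": env.get("CITY_KEYWORD", "Shanghai"),
        "grad_keywords": [k.strip() for k in raw_keywords.split(",") if k.strip()],
    }
```

With this sketch, a shell invocation such as HEADLESS=0 MAX_SCROLL=3 python official_multi_crawler.py would open a visible browser and limit scrolling to three iterations.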
