An agentic workflow that crawls UC Berkeley EECS faculty pages, explores lab links, filters for AI-related research, and produces a mail-merge-ready Excel workbook with personalized outreach lines and resume matches.
- Step 1: Load config, `keywords.txt`, and `resume.txt`, and initialize the OpenAI client.
- Step 2: Crawl the faculty lists (CS/EE) to collect professor names, departments, and profile/homepages.
- Step 3: Explore promising links per professor (depth-limited, adaptive thresholds, caching, politeness).
- Step 4: LLM filters pages for AI relevance and writes a personalized one-line outreach summary (see the sketch after this list).
- Step 5: LLM crafts a one-sentence “I have …” resume match tailored to the lab.
- Step 6: Extract and assemble rows with emails, lab names, and additional names.
- Step 7: Save to `berkeley_ai_labs.xlsx` and write a `log.txt` with any warnings.
- Step 8: Politeness, rate limiting, caching, and rerun safety.
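Steps 4 and 5 hinge on one chat-completion call per page. A minimal sketch of what the Step 4 filter could look like, assuming the OpenAI Python SDK (v1.x); the function name, prompt wording, truncation limit, and `NO_MATCH` sentinel are all hypothetical:

```python
# Hypothetical sketch of the Step 4 relevance filter; names and prompt are illustrative.
import os
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

def filter_page(professor: str, page_text: str, keywords: list[str]) -> str | None:
    """Return a one-line outreach summary if the page matches the keywords, else None."""
    prompt = (
        f"Keywords: {', '.join(keywords)}\n"
        f"Professor: {professor}\n"
        f"Page text:\n{page_text[:4000]}\n\n"  # truncation limit is an arbitrary choice
        "If this page describes research related to the keywords, reply with one "
        "personalized outreach sentence. Otherwise reply with exactly NO_MATCH."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = (resp.choices[0].message.content or "").strip()
    return None if answer == "NO_MATCH" else answer
```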
- Python 3.11+ recommended. Create a virtual env and install deps:
```bash
cd /path/to/ScrapeSearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- Create `.env` with your OpenAI key (not committed):

```env
OPENAI_API_KEY=sk-...
# Optional: choose a model; default is gpt-4o-mini
# OPENAI_MODEL=gpt-4o-mini
```
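If the project uses python-dotenv to read this file (an assumption; any loader that exports the variables works the same way), the client setup reduces to a few lines:

```python
# Assumes python-dotenv; swap in your preferred .env loader.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # exports OPENAI_API_KEY (and OPENAI_MODEL, if set) from .env
client = OpenAI()  # the SDK picks up OPENAI_API_KEY from the environment
model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # default matches the README
```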
- Add your inputs (not committed):
  - `keywords.txt`: one topic per line (e.g., AI, ML, RL, vision, NLP, robotics, adversarial ML, security, privacy)
  - `resume.txt`: paste your resume text. You can also point to PDFs for one-off tests via `--match`.
  - `start_links.txt`: starting URLs to crawl (one per line). Defaults are provided for the Berkeley EECS CS/EE lists.
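Reading these inputs is straightforward; a minimal sketch assuming one entry per line with blanks skipped (hypothetical helper name; the project's parsing may differ):

```python
# Hypothetical helper; the project's parsing may differ.
from pathlib import Path

def read_lines(path: str) -> list[str]:
    """Return non-empty, stripped lines from a one-entry-per-line file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]

keywords = read_lines("keywords.txt")        # e.g., ["AI", "ML", "RL", ...]
start_links = read_lines("start_links.txt")  # crawl entry points
resume_text = Path("resume.txt").read_text()
```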
- Small test:
```bash
python main.py --run --limit 10
```
- Exhaustive/overnight (writes progress/errors to log):
```bash
nohup python main.py --run --limit 1000 > run.out 2>&1 &
tail -f run.out
tail -f log.txt
```
- `berkeley_ai_labs.xlsx` (overwritten each run) with columns:
  - Professor Name, Department, Lab Name, Lab/Research Link, 1-Sentence Project Summary, Matched Resume Experience, Email Address, Extra Notes, Additional Names.
- `log.txt` is cleared at the start of each run and accumulates warnings/errors (e.g., 404s).
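A sketch of how the two outputs could be produced, assuming pandas (with openpyxl installed for `.xlsx` writing) and the standard `logging` module; `filemode="w"` is what gives the clear-on-each-run behavior, and everything beyond the column names above is hypothetical:

```python
# Sketch only: assumes pandas + openpyxl; the project's writer may differ.
import logging
import pandas as pd

COLUMNS = [
    "Professor Name", "Department", "Lab Name", "Lab/Research Link",
    "1-Sentence Project Summary", "Matched Resume Experience",
    "Email Address", "Extra Notes", "Additional Names",
]

# filemode="w" truncates log.txt at startup, giving the clear-each-run behavior.
logging.basicConfig(filename="log.txt", filemode="w", level=logging.WARNING)

def save_rows(rows: list[dict]) -> None:
    """Overwrite berkeley_ai_labs.xlsx with one row per professor/lab."""
    pd.DataFrame(rows, columns=COLUMNS).to_excel("berkeley_ai_labs.xlsx", index=False)
```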
- Caching: `.cache/http` stores fetched pages for faster reruns (see the fetch sketch after these notes).
- Politeness: a random 0.5–2.0s delay per request; depth-limited crawl with visited-URL tracking.
- Personalization: prompts include the professor and lab names to produce outreach-style one-liners.
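A minimal sketch of a polite, cached fetch combining the two behaviors above (assumed names and cache-key scheme; the project's fetcher may differ):

```python
# Sketch of a polite, cached GET; names and cache-key scheme are assumptions.
import hashlib
import logging
import random
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".cache/http")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch(url: str) -> str:
    """Return page HTML, serving repeat URLs from .cache/http and delaying live requests."""
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cached.exists():
        return cached.read_text(errors="ignore")
    time.sleep(random.uniform(0.5, 2.0))  # politeness delay from the note above
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        logging.warning("%s -> HTTP %s", url, resp.status_code)  # e.g., 404s land in log.txt
        return ""
    cached.write_text(resp.text, errors="ignore")
    return resp.text
```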
- Debug flags:
  - `--crawl`: print the first N professors from Step 2
  - `--links`: print top links per professor (Step 3)
  - `--deep`: deep link discovery from the first professor
  - `--filter`: run Step 4 on one discovered page
  - `--match`: run Step 5 on one page to generate the resume sentence
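These flags map naturally onto `argparse`; a hypothetical sketch of the CLI wiring (flag semantics as described above; defaults and types are assumptions):

```python
# Hypothetical CLI wiring for the documented flags; main.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Berkeley EECS AI-lab outreach crawler")
parser.add_argument("--run", action="store_true", help="run the full pipeline")
parser.add_argument("--limit", type=int, default=10, help="max professors (default assumed)")
parser.add_argument("--crawl", action="store_true", help="print first N professors (Step 2)")
parser.add_argument("--links", action="store_true", help="print top links per professor (Step 3)")
parser.add_argument("--deep", action="store_true", help="deep link discovery, first professor only")
parser.add_argument("--filter", action="store_true", help="run Step 4 on one discovered page")
parser.add_argument("--match", action="store_true", help="run Step 5 on one page")
args = parser.parse_args()
```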
- Already ignored: `.env`, `berkeley_ai_labs.xlsx`, `log.txt`, `resume.txt`, `keywords.txt`, PDFs, caches, and the virtualenv.