An agentic workflow that crawls UC Berkeley EECS faculty pages, explores lab links, filters for AI-related research, and produces a mail-merge-ready Excel workbook with personalized outreach lines and resume matches.
- Step 1: Load config, `keywords.txt`, and `resume.txt`, and initialize the OpenAI client.
- Step 2: Crawl the faculty lists (CS/EE) to collect professor names, departments, and profile/homepages.
- Step 3: Explore promising links per professor (depth-limited, adaptive thresholds, caching, politeness).
- Step 4: LLM filters pages for AI relevance and writes a personalized one-line outreach summary (see the sketch after this list).
- Step 5: LLM crafts a one-sentence “I have …” resume match tailored to the lab.
- Step 6: Extract and assemble rows with emails, lab names, and additional names.
- Step 7: Save to `berkeley_ai_labs.xlsx` and write a `log.txt` with any warnings.
- Step 8: Politeness, rate limiting, caching, and rerun safety.
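Steps 4 and 5 hinge on one chat-completion call per page. A minimal sketch of what the Step 4 filter could look like, assuming the OpenAI Python SDK (v1.x); the function name, prompt wording, truncation limit, and `NO_MATCH` sentinel are all hypothetical:

```python
# Hypothetical sketch of the Step 4 relevance filter; names and prompt are illustrative.
import os
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

def filter_page(professor: str, page_text: str, keywords: list[str]) -> str | None:
    """Return a one-line outreach summary if the page matches the keywords, else None."""
    prompt = (
        f"Keywords: {', '.join(keywords)}\n"
        f"Professor: {professor}\n"
        f"Page text:\n{page_text[:4000]}\n\n"  # truncation limit is an arbitrary choice
        "If this page describes research related to the keywords, reply with one "
        "personalized outreach sentence. Otherwise reply with exactly NO_MATCH."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = (resp.choices[0].message.content or "").strip()
    return None if answer == "NO_MATCH" else answer
```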
- Python 3.11+ recommended. Create a virtual env and install deps:
```bash
cd /path/to/ScrapeSearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- Create `.env` with your OpenAI key (not committed):

```env
OPENAI_API_KEY=sk-...
# Optional: choose a model; default is gpt-4o-mini
# OPENAI_MODEL=gpt-4o-mini
```
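If the project uses python-dotenv to read this file (an assumption; any loader that exports the variables works the same way), the client setup reduces to a few lines:

```python
# Assumes python-dotenv; swap in your preferred .env loader.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # exports OPENAI_API_KEY (and OPENAI_MODEL, if set) from .env
client = OpenAI()  # the SDK picks up OPENAI_API_KEY from the environment
model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # default matches the README
```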
- Add your inputs (not committed):
  - `keywords.txt`: one topic per line (e.g., AI, ML, RL, vision, NLP, robotics, adversarial ML, security, privacy)
  - `resume.txt`: paste your resume text. You can also point to PDFs for one-off tests via `--match`.
  - `start_links.txt`: starting URLs to crawl (one per line). Defaults are provided for the Berkeley EECS CS/EE lists.
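Reading these inputs is straightforward; a minimal sketch assuming one entry per line with blanks skipped (hypothetical helper name; the project's parsing may differ):

```python
# Hypothetical helper; the project's parsing may differ.
from pathlib import Path

def read_lines(path: str) -> list[str]:
    """Return non-empty, stripped lines from a one-entry-per-line file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]

keywords = read_lines("keywords.txt")        # e.g., ["AI", "ML", "RL", ...]
start_links = read_lines("start_links.txt")  # crawl entry points
resume_text = Path("resume.txt").read_text()
```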
- Small test:
```bash
python main.py --run --limit 10
```
- Exhaustive/overnight (writes progress/errors to log):
```bash
nohup python main.py --run --limit 1000 > run.out 2>&1 &
tail -f run.out
tail -f log.txt
```
- `berkeley_ai_labs.xlsx` (overwritten each run) with columns:
  - Professor Name, Department, Lab Name, Lab/Research Link, 1-Sentence Project Summary, Matched Resume Experience, Email Address, Extra Notes, Additional Names.
- `log.txt` is cleared at the start of each run and accumulates warnings/errors (e.g., 404s).
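A sketch of how the two outputs could be produced, assuming pandas (with openpyxl installed for `.xlsx` writing) and the standard `logging` module; `filemode="w"` is what gives the clear-on-each-run behavior, and everything beyond the column names above is hypothetical:

```python
# Sketch only: assumes pandas + openpyxl; the project's writer may differ.
import logging
import pandas as pd

COLUMNS = [
    "Professor Name", "Department", "Lab Name", "Lab/Research Link",
    "1-Sentence Project Summary", "Matched Resume Experience",
    "Email Address", "Extra Notes", "Additional Names",
]

# filemode="w" truncates log.txt at startup, giving the clear-each-run behavior.
logging.basicConfig(filename="log.txt", filemode="w", level=logging.WARNING)

def save_rows(rows: list[dict]) -> None:
    """Overwrite berkeley_ai_labs.xlsx with one row per professor/lab."""
    pd.DataFrame(rows, columns=COLUMNS).to_excel("berkeley_ai_labs.xlsx", index=False)
```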
- Caching: `.cache/http` stores fetched pages for faster reruns (see the fetch sketch after these notes).
- Politeness: a random 0.5–2.0s delay per request; depth-limited crawl with visited-URL tracking.
- Personalization: prompts include the professor and lab names to produce outreach-style one-liners.
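A minimal sketch of a polite, cached fetch combining the two behaviors above (assumed names and cache-key scheme; the project's fetcher may differ):

```python
# Sketch of a polite, cached GET; names and cache-key scheme are assumptions.
import hashlib
import logging
import random
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".cache/http")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch(url: str) -> str:
    """Return page HTML, serving repeat URLs from .cache/http and delaying live requests."""
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cached.exists():
        return cached.read_text(errors="ignore")
    time.sleep(random.uniform(0.5, 2.0))  # politeness delay from the note above
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        logging.warning("%s -> HTTP %s", url, resp.status_code)  # e.g., 404s land in log.txt
        return ""
    cached.write_text(resp.text, errors="ignore")
    return resp.text
```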
- Debug flags:
  - `--crawl`: print the first N professors from Step 2
  - `--links`: print top links per professor (Step 3)
  - `--deep`: deep link discovery from the first professor
  - `--filter`: run Step 4 on one discovered page
  - `--match`: run Step 5 on one page to generate the resume sentence
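These flags map naturally onto `argparse`; a hypothetical sketch of the CLI wiring (flag semantics as described above; defaults and types are assumptions):

```python
# Hypothetical CLI wiring for the documented flags; main.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Berkeley EECS AI-lab outreach crawler")
parser.add_argument("--run", action="store_true", help="run the full pipeline")
parser.add_argument("--limit", type=int, default=10, help="max professors (default assumed)")
parser.add_argument("--crawl", action="store_true", help="print first N professors (Step 2)")
parser.add_argument("--links", action="store_true", help="print top links per professor (Step 3)")
parser.add_argument("--deep", action="store_true", help="deep link discovery, first professor only")
parser.add_argument("--filter", action="store_true", help="run Step 4 on one discovered page")
parser.add_argument("--match", action="store_true", help="run Step 5 on one page")
args = parser.parse_args()
```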
- Already ignored: `.env`, `berkeley_ai_labs.xlsx`, `log.txt`, `resume.txt`, `keywords.txt`, PDFs, caches, and the virtualenv.