NishK05/ScrapeSearch

Berkeley EECS AI Labs Scraper & Summarizer

Agentic workflow to crawl UC Berkeley EECS faculty pages, explore lab links, filter for AI-related research, and produce a mail-merge-ready Excel workbook with personalized outreach lines and resume matches.

What it does (Steps 1–8)

  • Step 1: Load config, keywords.txt, resume.txt, and initialize the OpenAI client.
  • Step 2: Crawl faculty lists (CS/EE) to collect professor names, departments, and profile/homepages.
  • Step 3: Explore promising links per professor (depth-limited, adaptive thresholds, caching, politeness).
  • Step 4: An LLM filters pages for AI relevance and writes a personalized one-line outreach summary.
  • Step 5: An LLM crafts a one-sentence “I have …” resume match tailored to the lab.
  • Step 6: Extract and assemble rows with emails, lab names, and additional names.
  • Step 7: Save to berkeley_ai_labs.xlsx and write a log.txt with any warnings.
  • Step 8: Politeness, rate limiting, caching, and rerun safety.
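The Step 4 filtering idea above can be sketched in Python. This is only an illustration: the function name, the `professors` record shape, and the keyword check are assumptions, not the repository's actual API (the real Step 4 uses an LLM rather than substring matching).

```python
# Illustrative skeleton of the relevance-filter step; every name here
# (run_pipeline, "page_text", "dept") is hypothetical, not the repo's API.

def run_pipeline(professors, keywords, limit=10):
    """Keep professors whose page text mentions any keyword (a crude
    stand-in for the LLM relevance filter in Step 4)."""
    rows = []
    for prof in professors[:limit]:  # --limit caps how many profiles we process
        text = prof.get("page_text", "").lower()
        if any(kw.lower() in text for kw in keywords):
            rows.append({"Professor Name": prof["name"],
                         "Department": prof.get("dept", "")})
    return rows
```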

Quick start

  1. Python 3.11+ recommended. Create a virtual env and install deps:
    cd /path/to/ScrapeSearch
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  2. Create .env with your OpenAI key (not committed):
    OPENAI_API_KEY=sk-...
    # Optional: choose a model; default is gpt-4o-mini
    # OPENAI_MODEL=gpt-4o-mini
  3. Add your inputs (not committed):
    • keywords.txt: one topic per line (e.g., AI, ML, RL, vision, NLP, robotics, adversarial ML, security, privacy)
    • resume.txt: paste your resume text. You can also point to PDFs for one-off tests via --match.
    • start_links.txt: starting URLs to crawl (one per line). Defaults provided for Berkeley EECS CS/EE lists.
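Since keywords.txt and start_links.txt are both one-entry-per-line files, a loader for them can be as small as the sketch below (the helper name is hypothetical; the project's actual loader may differ):

```python
from pathlib import Path

def load_lines(path):
    """Read one entry per line, dropping blank lines and surrounding
    whitespace (suits keywords.txt and start_links.txt)."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
```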

Run

  • Small test:
    python main.py --run --limit 10
  • Exhaustive/overnight (writes progress/errors to log):
    nohup python main.py --run --limit 1000 > run.out 2>&1 &
    tail -f run.out
    tail -f log.txt

Output

  • berkeley_ai_labs.xlsx (overwritten each run) with columns:
    • Professor Name, Department, Lab Name, Lab/Research Link, 1-Sentence Project Summary, Matched Resume Experience, Email Address, Extra Notes, Additional Names.
  • log.txt is cleared at the start of each run; warnings/errors (e.g., 404s) are appended as the run progresses.
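A minimal sketch of that log behavior using the standard logging module (the logger name and format are assumptions; the actual implementation may differ):

```python
import logging

def init_log(path="log.txt"):
    """Recreate the log file each run; WARNING+ messages (e.g. 404s)
    are then appended for the rest of the run."""
    logger = logging.getLogger("scrape")
    logger.setLevel(logging.WARNING)
    logger.handlers.clear()                        # avoid duplicate handlers on rerun
    handler = logging.FileHandler(path, mode="w")  # mode="w" truncates the old log
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```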

Design notes

  • Caching: .cache/http stores fetched pages for faster reruns.
  • Politeness: 0.5–2.0s random delay per request; depth-limited crawl with visited tracking.
  • Personalization: Prompts include professor and lab names; outreach-style one-liners.
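The caching and politeness notes above could look roughly like this. The cache-key scheme and function names are assumptions, and `fetch` stands in for whatever HTTP layer the project actually uses:

```python
import hashlib, os, random, time

def cached_fetch(url, fetch, cache_dir=".cache/http"):
    """Return a cached copy of the page if present; otherwise wait a
    polite 0.5-2.0 s, fetch, and cache the body for future reruns."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()  # assumed cache-key scheme
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):                        # cache hit: no delay, no request
        with open(path, encoding="utf-8") as f:
            return f.read()
    time.sleep(random.uniform(0.5, 2.0))            # politeness delay per request
    body = fetch(url)                               # caller supplies the HTTP layer
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return body
```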

Development

  • Debug flags:
    • --crawl: print first N professors from Step 2
    • --links: print top links per professor (Step 3)
    • --deep: deep link discovery from the first professor
    • --filter: run Step 4 on one discovered page
    • --match: run Step 5 on one page to generate the resume sentence
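A hypothetical argparse setup mirroring the flags listed above; the defaults, value types, and help strings are assumptions, not taken from main.py:

```python
import argparse

def build_parser():
    """Sketch of a CLI matching the documented flags (names only are
    from the README; everything else is assumed)."""
    p = argparse.ArgumentParser(prog="main.py")
    p.add_argument("--run", action="store_true", help="run the full pipeline")
    p.add_argument("--limit", type=int, default=10, help="max professors to process")
    p.add_argument("--crawl", action="store_true", help="print first N professors (Step 2)")
    p.add_argument("--links", action="store_true", help="print top links per professor (Step 3)")
    p.add_argument("--deep", action="store_true", help="deep link discovery, first professor")
    p.add_argument("--filter", action="store_true", help="run Step 4 on one discovered page")
    p.add_argument("--match", help="run Step 5 on one page; may take a PDF path")
    return p
```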

What to commit

Already ignored: .env, berkeley_ai_labs.xlsx, log.txt, resume.txt, keywords.txt, PDFs, caches, and virtualenv.
