# LinkedIn Graph Scraper - Setup & Usage

## Overview

This graph scraper discovers LinkedIn networks by:

1. Starting with a seed person or company
2. Finding connections via:
   - **Strategy E (Primary)**: Company employees (people who work or have worked at the same companies)
   - **Strategy D (Secondary)**: "People also viewed" suggestions
3. Using BFS (breadth-first search) with a configurable depth limit
4. Storing results in dual databases (PostgreSQL + Neo4j)
5. Stopping at a configurable profile limit or on manual interrupt (Ctrl+C)
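The BFS traversal described above can be sketched as follows. This is a minimal illustration, not the scraper's actual code: `get_connections` is a hypothetical callback standing in for whichever discovery strategy (employees or "people also viewed") is in use.

```python
from collections import deque

def bfs_discover(seed, get_connections, max_depth=3, max_profiles=100):
    """Breadth-first discovery: visit profiles level by level up to
    max_depth, stopping once max_profiles have been collected."""
    visited = {seed}
    queue = deque([(seed, 0)])  # (profile, depth) pairs
    discovered = []
    while queue and len(discovered) < max_profiles:
        profile, depth = queue.popleft()
        discovered.append(profile)
        if depth == max_depth:
            continue  # don't expand beyond the depth limit
        for neighbor in get_connections(profile):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return discovered
```

Because the queue is FIFO, closer connections are always scraped before more distant ones, so hitting `max_profiles` early still yields the most relevant part of the network.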
| 13 | + |
| 14 | +## Quick Start |
| 15 | + |
| 16 | +### 1. Start Databases |
| 17 | + |
| 18 | +```bash |
| 19 | +# Start PostgreSQL and Neo4j |
| 20 | +docker-compose up -d |
| 21 | + |
| 22 | +# Wait for containers to be healthy (30-60 seconds) |
| 23 | +docker-compose ps |
| 24 | + |
| 25 | +# Check logs if needed |
| 26 | +docker-compose logs -f |
| 27 | +``` |
| 28 | + |
| 29 | +### 2. Install Dependencies |
| 30 | + |
| 31 | +```bash |
| 32 | +pip install -r requirements.txt |
| 33 | +``` |
| 34 | + |
| 35 | +### 3. Configure |
| 36 | + |
| 37 | +Edit `config.yaml`: |
| 38 | +```yaml |
| 39 | +scraping: |
| 40 | + max_profiles: 100 # Change this limit as needed |
| 41 | + max_depth: 3 |
| 42 | + rate_limit_min: 5 |
| 43 | + rate_limit_max: 10 |
| 44 | +``` |
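Once parsed into Python (e.g. with `yaml.safe_load`), these settings can be sanity-checked before a run starts. A minimal sketch, assuming the scraper receives the parsed config as a plain dict; the function name is illustrative:

```python
def validate_scraping_config(cfg):
    """Basic sanity checks on the 'scraping' section of config.yaml.
    Expects the already-parsed config (e.g. from yaml.safe_load)."""
    s = cfg["scraping"]
    if s["max_profiles"] <= 0:
        raise ValueError("max_profiles must be positive")
    if s["max_depth"] < 0:
        raise ValueError("max_depth cannot be negative")
    if not (0 <= s["rate_limit_min"] <= s["rate_limit_max"]):
        raise ValueError("rate_limit_min must be <= rate_limit_max")
    return s
```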

### 4. Run Scraper

```bash
# From a person
python cli.py scrape "https://linkedin.com/in/williamhgates" --max-profiles 50

# From a company
python cli.py scrape "https://linkedin.com/company/microsoft" --max-profiles 50

# Resume from previous session
python cli.py resume <session-id>
```
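For `resume` to work, the Ctrl+C interrupt has to leave the session in a resumable state. One common pattern is to trap SIGINT and set a flag that the scrape loop checks between profiles, so the current profile finishes and state is saved cleanly. A hypothetical sketch; the class name and flag are illustrative, not the scraper's actual internals:

```python
import signal

class GracefulInterrupt:
    """Turn Ctrl+C (SIGINT) into a shutdown flag instead of an
    abrupt KeyboardInterrupt mid-request."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Replace the default SIGINT handler (main thread only).
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop_requested = True  # checked by the loop between profiles
```

The scrape loop would then test `interrupt.stop_requested` after each profile and, when set, persist the queue under the session id before exiting.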

## Database Access

### PostgreSQL

```bash
# Connect via psql
docker exec -it linkedin_postgres psql -U linkedin_user -d linkedin

# Or use connection string
postgresql://linkedin_user:linkedin_pass@localhost:5432/linkedin
```
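From Python, the same connection string can be assembled from its parts and handed to any PostgreSQL driver (e.g. `psycopg2.connect(dsn)`). A small sketch using the docker-compose defaults above; the helper name is illustrative:

```python
def postgres_dsn(user="linkedin_user", password="linkedin_pass",
                 host="localhost", port=5432, dbname="linkedin"):
    """Build the libpq-style connection URI shown above from its parts.
    Defaults match the docker-compose credentials."""
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"
```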

### Neo4j

Open the Neo4j Browser at http://localhost:7474 and log in with:

- Username: `neo4j`
- Password: `linkedin_pass`

Example query - find the shortest network path between two people:

```cypher
MATCH path = shortestPath(
  (p1:Person {name: "Bill Gates"})-[*]-(p2:Person {name: "Satya Nadella"})
)
RETURN path
```
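The same query can be run from Python with the official `neo4j` driver (installed separately via `pip install neo4j`). A sketch with two assumptions not stated above: the database listens on the default Bolt port 7687, and query parameters replace the hard-coded names:

```python
SHORTEST_PATH_QUERY = """
MATCH path = shortestPath(
  (p1:Person {name: $name1})-[*]-(p2:Person {name: $name2})
)
RETURN path
"""

def find_path(name1, name2, uri="bolt://localhost:7687",
              auth=("neo4j", "linkedin_pass")):
    """Run the shortest-path query with bound parameters.
    Assumes the default Bolt port 7687 (the Browser port 7474 is HTTP)."""
    # Deferred import so the module loads even without the driver installed.
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _summary, _keys = driver.execute_query(
            SHORTEST_PATH_QUERY, name1=name1, name2=name2)
        return records
```

Using `$name1`/`$name2` parameters instead of string interpolation lets Neo4j cache the query plan and avoids Cypher-injection issues.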

## Configuration

`config.yaml` settings:

| Setting | Description | Default |
|---------|-------------|---------|
| `max_profiles` | Stop after N profiles | 100 |
| `max_depth` | BFS depth limit | 3 |
| `rate_limit_min` / `rate_limit_max` | Delay range between requests (seconds) | 5-10 |
| `employee_limit` | Max employees per company (`null` = unlimited) | `null` |
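The rate-limit pair translates to a randomized pause between consecutive requests, which makes the request timing less uniform than a fixed sleep. A minimal sketch of how such a delay could be drawn; the helper is illustrative, not the scraper's actual code:

```python
import random
import time

def polite_sleep(rate_limit_min=5, rate_limit_max=10):
    """Sleep for a random duration in [rate_limit_min, rate_limit_max]
    seconds, mirroring the rate_limit settings in config.yaml."""
    delay = random.uniform(rate_limit_min, rate_limit_max)
    time.sleep(delay)
    return delay
```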

## Implementation Status

- [x] Docker Compose setup
- [ ] Phase 1: Enhanced scrapers (in progress)
- [ ] Phase 2: Connection discovery
- [ ] Phase 3: Dual storage
- [ ] Phase 4: Queue system
- [ ] Phase 5: Main orchestrator
- [ ] Phase 6: CLI
- [ ] Phase 7: Testing

## Next Steps

Currently implementing Phase 1: enhancing `PersonScraper` and `CompanyScraper` with connection/employee discovery methods.