
Commit 9b63afd

3.0.0 use playwright instead of selenium
1 parent 56305c6 commit 9b63afd

71 files changed

Lines changed: 9274 additions & 1813 deletions


.env.example

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# LinkedIn credentials for scraping
# Copy this file to .env and fill in your credentials

# Use either LINKEDIN_EMAIL or LINKEDIN_USERNAME (both work)
LINKEDIN_EMAIL=your.email@example.com
# LINKEDIN_USERNAME=your.email@example.com

LINKEDIN_PASSWORD=your_password_here
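
These credentials feed the Playwright-based login flow that this release switches to (the commit replaces Selenium with Playwright). The sketch below is illustrative only, not the package's actual API: the function name and selectors are hypothetical, while the environment variable names and the `linkedin_session.json` session file come from this commit. It assumes `python-dotenv` and `playwright` are installed.

```python
# Hypothetical sketch: log in with Playwright using the .env credentials
# and persist cookies to linkedin_session.json (listed in .gitignore below).
import os
from dotenv import load_dotenv               # assumes python-dotenv
from playwright.sync_api import sync_playwright

def login_and_save_session():
    load_dotenv()  # reads LINKEDIN_EMAIL / LINKEDIN_PASSWORD from .env
    email = os.environ.get("LINKEDIN_EMAIL") or os.environ.get("LINKEDIN_USERNAME")
    password = os.environ["LINKEDIN_PASSWORD"]

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://www.linkedin.com/login")
        page.fill("#username", email)        # selectors are illustrative
        page.fill("#password", password)
        page.click("button[type=submit]")
        page.wait_for_url("**/feed/**")
        # Save cookies so later runs can skip the login step
        context.storage_state(path="linkedin_session.json")
        browser.close()
```
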

.gitignore

Lines changed: 30 additions & 0 deletions
@@ -13,3 +13,33 @@ scrape.py
creds.json
venv
*.zip
.env

# Test outputs
*.log
test_*.db
test_linkedin.db
test_summary.json
results_*.json
person_*.json
*_improved.json

# Debug scripts (keep debug_connection_selectors.py)
debug_*.py
!debug_connection_selectors.py

# Backup files
*_old.py
*.backup

# Session files (sensitive cookies)
linkedin_session.json

# Build artifacts
MANIFEST
MANIFEST.ini
.pytest_cache/

# Basic package build artifacts
build-basic/
dist-basic/

ARCHITECTURE.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# LinkedIn Graph Scraper - Setup & Usage

## Overview

This graph scraper discovers LinkedIn networks by:
1. Starting with a seed person or company
2. Finding connections via:
   - **Strategy E (Primary)**: Company employees (people who work/worked at same companies)
   - **Strategy D (Secondary)**: "People also viewed" suggestions
3. Using BFS (breadth-first search) with configurable depth
4. Storing in dual databases (PostgreSQL + Neo4j)
5. Stopping at a configurable profile limit or manual interrupt (Ctrl+C)

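The BFS traversal described above can be sketched roughly as follows. This is a simplified illustration, not the package's actual orchestrator: the three callables stand in for the scraping, storage, and discovery steps (Strategies E and D) and are passed in rather than implemented here.

```python
# Simplified BFS sketch of the crawl described above (illustrative only).
from collections import deque
from typing import Callable, Iterable

def crawl(seed_url: str,
          scrape_profile: Callable[[str], dict],
          store_profile: Callable[[dict], None],
          discover_neighbors: Callable[[dict], Iterable[str]],
          max_profiles: int = 100,
          max_depth: int = 3) -> int:
    """Breadth-first crawl: scrape_profile fetches one page, store_profile
    writes it to both databases, discover_neighbors yields candidate URLs
    (Strategy E first, then Strategy D)."""
    queue = deque([(seed_url, 0)])   # (profile_url, depth)
    visited, scraped = set(), 0

    while queue and scraped < max_profiles:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        profile = scrape_profile(url)
        store_profile(profile)
        scraped += 1

        if depth < max_depth:
            for next_url in discover_neighbors(profile):
                if next_url not in visited:
                    queue.append((next_url, depth + 1))
    return scraped
```

A Ctrl+C interrupt would simply break out of this loop; the visited set and pending queue are the kind of state a resume command would need to persist.
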
## Quick Start

### 1. Start Databases

```bash
# Start PostgreSQL and Neo4j
docker-compose up -d

# Wait for containers to be healthy (30-60 seconds)
docker-compose ps

# Check logs if needed
docker-compose logs -f
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure

Edit `config.yaml`:
```yaml
scraping:
  max_profiles: 100  # Change this limit as needed
  max_depth: 3
  rate_limit_min: 5
  rate_limit_max: 10
```

### 4. Run Scraper

```bash
# From a person
python cli.py scrape "https://linkedin.com/in/williamhgates" --max-profiles 50

# From a company
python cli.py scrape "https://linkedin.com/company/microsoft" --max-profiles 50

# Resume from previous session
python cli.py resume <session-id>
```

## Database Access

### PostgreSQL
```bash
# Connect via psql
docker exec -it linkedin_postgres psql -U linkedin_user -d linkedin

# Or use connection string
postgresql://linkedin_user:linkedin_pass@localhost:5432/linkedin
```

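The same connection string works from Python with any PostgreSQL driver. A minimal sketch, assuming `psycopg2` is installed; the table schema isn't shown here, so the query just lists whatever tables the scraper has created:

```python
# Minimal connectivity check using the connection string above
# (assumes the psycopg2 package; no table names are assumed).
import psycopg2

conn = psycopg2.connect("postgresql://linkedin_user:linkedin_pass@localhost:5432/linkedin")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```
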
### Neo4j
```
# Open Neo4j Browser
http://localhost:7474

# Login credentials
Username: neo4j
Password: linkedin_pass

# Example query - find network path
MATCH path = shortestPath(
  (p1:Person {name: "Bill Gates"})-[*]-(p2:Person {name: "Satya Nadella"})
)
RETURN path
```

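The same query can be run from Python with the official `neo4j` driver. A sketch, assuming the driver package is installed and that the container exposes the standard Bolt port 7687 (only the HTTP browser port 7474 is listed above):

```python
# Run the shortest-path query above over the Bolt protocol
# (assumes the `neo4j` Python driver and the default Bolt port 7687).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "linkedin_pass"))
query = """
MATCH path = shortestPath(
  (p1:Person {name: $a})-[*]-(p2:Person {name: $b})
)
RETURN length(path) AS hops
"""
with driver.session() as session:
    record = session.run(query, a="Bill Gates", b="Satya Nadella").single()
    if record:
        print(f"Degrees of separation: {record['hops']}")
driver.close()
```
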
## Configuration

`config.yaml` settings:

| Setting | Description | Default |
|---------|-------------|---------|
| `max_profiles` | Stop after N profiles | 100 |
| `max_depth` | BFS depth limit | 3 |
| `rate_limit_min/max` | Delay range (seconds) | 5-10 |
| `employee_limit` | Max employees per company (null = unlimited) | null |

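A minimal sketch of how these settings might be read, and how the rate-limit range turns into a delay between page loads. It assumes PyYAML; the key names match the table above, but the loader itself is illustrative rather than the package's actual code:

```python
# Illustrative config loader: picks a random delay from the
# rate_limit_min/rate_limit_max range before each page load.
import random
import time
import yaml  # assumes PyYAML

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["scraping"]

def polite_pause() -> None:
    """Sleep for a random delay within the configured range."""
    delay = random.uniform(cfg["rate_limit_min"], cfg["rate_limit_max"])
    time.sleep(delay)

print(f"max_profiles={cfg['max_profiles']}, max_depth={cfg['max_depth']}")
polite_pause()
```
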
## Implementation Status

- [x] Docker Compose setup
- [ ] Phase 1: Enhanced scrapers (in progress)
- [ ] Phase 2: Connection discovery
- [ ] Phase 3: Dual storage
- [ ] Phase 4: Queue system
- [ ] Phase 5: Main orchestrator
- [ ] Phase 6: CLI
- [ ] Phase 7: Testing

## Next Steps

Currently implementing Phase 1: enhancing PersonScraper and CompanyScraper with connection/employee discovery methods.

MANIFEST

Lines changed: 0 additions & 6 deletions
This file was deleted.

MANIFEST.in

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Include documentation
include README.md
include LICENSE
include AGENTS.md
include ARCHITECTURE.md
include USAGE.md
include TESTING.md

# Include configuration files
include config.yaml
include test_config.yaml
include docker-compose.yml
include requirements.txt
include requirements-dev.txt

# Include setup files
include setup.py
include setup.cfg
include pytest.ini

# Include all Python files in the package
recursive-include linkedin_scraper *.py

# Exclude build, test, and cache files
global-exclude __pycache__
global-exclude *.py[cod]
global-exclude *$py.class
global-exclude *.so
global-exclude .DS_Store

# Exclude test files and directories
prune tests
prune samples
prune build
prune dist
prune *.egg-info
prune venv
prune .pytest_cache
prune .git

MANIFEST.ini

Lines changed: 0 additions & 5 deletions
This file was deleted.
