
Commit 9b63afd

3.0.0 use playwright instead of selenium
1 parent 56305c6 commit 9b63afd

71 files changed

Lines changed: 9274 additions & 1813 deletions


.env.example

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# LinkedIn credentials for scraping
# Copy this file to .env and fill in your credentials

# Use either LINKEDIN_EMAIL or LINKEDIN_USERNAME (both work)
LINKEDIN_EMAIL=your.email@example.com
# LINKEDIN_USERNAME=your.email@example.com

LINKEDIN_PASSWORD=your_password_here
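
These credentials feed the Playwright-based login flow that this release switches to (the commit replaces Selenium with Playwright). The sketch below is illustrative only, not the package's actual API: the function name and selectors are hypothetical, while the environment variable names and the `linkedin_session.json` session file come from this commit. It assumes `python-dotenv` and `playwright` are installed.

```python
# Hypothetical sketch: log in with Playwright using the .env credentials
# and persist cookies to linkedin_session.json (listed in .gitignore below).
import os
from dotenv import load_dotenv               # assumes python-dotenv
from playwright.sync_api import sync_playwright

def login_and_save_session():
    load_dotenv()  # reads LINKEDIN_EMAIL / LINKEDIN_PASSWORD from .env
    email = os.environ.get("LINKEDIN_EMAIL") or os.environ.get("LINKEDIN_USERNAME")
    password = os.environ["LINKEDIN_PASSWORD"]

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://www.linkedin.com/login")
        page.fill("#username", email)        # selectors are illustrative
        page.fill("#password", password)
        page.click("button[type=submit]")
        page.wait_for_url("**/feed/**")
        # Save cookies so later runs can skip the login step
        context.storage_state(path="linkedin_session.json")
        browser.close()
```
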

.gitignore

Lines changed: 30 additions & 0 deletions
@@ -13,3 +13,33 @@ scrape.py
creds.json
venv
*.zip
.env

# Test outputs
*.log
test_*.db
test_linkedin.db
test_summary.json
results_*.json
person_*.json
*_improved.json

# Debug scripts (keep debug_connection_selectors.py)
debug_*.py
!debug_connection_selectors.py

# Backup files
*_old.py
*.backup

# Session files (sensitive cookies)
linkedin_session.json

# Build artifacts
MANIFEST
MANIFEST.ini
.pytest_cache/

# Basic package build artifacts
build-basic/
dist-basic/

ARCHITECTURE.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# LinkedIn Graph Scraper - Setup & Usage

## Overview

This graph scraper discovers LinkedIn networks by:
1. Starting with a seed person or company
2. Finding connections via:
   - **Strategy E (Primary)**: Company employees (people who work/worked at same companies)
   - **Strategy D (Secondary)**: "People also viewed" suggestions
3. Using BFS (breadth-first search) with configurable depth
4. Storing in dual databases (PostgreSQL + Neo4j)
5. Stopping at a configurable profile limit or manual interrupt (Ctrl+C)

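The BFS traversal described above can be sketched roughly as follows. This is a simplified illustration, not the package's actual orchestrator: the three callables stand in for the scraping, storage, and discovery steps (Strategies E and D) and are passed in rather than implemented here.

```python
# Simplified BFS sketch of the crawl described above (illustrative only).
from collections import deque
from typing import Callable, Iterable

def crawl(seed_url: str,
          scrape_profile: Callable[[str], dict],
          store_profile: Callable[[dict], None],
          discover_neighbors: Callable[[dict], Iterable[str]],
          max_profiles: int = 100,
          max_depth: int = 3) -> int:
    """Breadth-first crawl: scrape_profile fetches one page, store_profile
    writes it to both databases, discover_neighbors yields candidate URLs
    (Strategy E first, then Strategy D)."""
    queue = deque([(seed_url, 0)])   # (profile_url, depth)
    visited, scraped = set(), 0

    while queue and scraped < max_profiles:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        profile = scrape_profile(url)
        store_profile(profile)
        scraped += 1

        if depth < max_depth:
            for next_url in discover_neighbors(profile):
                if next_url not in visited:
                    queue.append((next_url, depth + 1))
    return scraped
```

A Ctrl+C interrupt would simply break out of this loop; the visited set and pending queue are the kind of state a resume command would need to persist.
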
## Quick Start

### 1. Start Databases

```bash
# Start PostgreSQL and Neo4j
docker-compose up -d

# Wait for containers to be healthy (30-60 seconds)
docker-compose ps

# Check logs if needed
docker-compose logs -f
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure

Edit `config.yaml`:
```yaml
scraping:
  max_profiles: 100  # Change this limit as needed
  max_depth: 3
  rate_limit_min: 5
  rate_limit_max: 10
```

### 4. Run Scraper

```bash
# From a person
python cli.py scrape "https://linkedin.com/in/williamhgates" --max-profiles 50

# From a company
python cli.py scrape "https://linkedin.com/company/microsoft" --max-profiles 50

# Resume from previous session
python cli.py resume <session-id>
```

## Database Access

### PostgreSQL
```bash
# Connect via psql
docker exec -it linkedin_postgres psql -U linkedin_user -d linkedin

# Or use connection string
postgresql://linkedin_user:linkedin_pass@localhost:5432/linkedin
```

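The same connection string works from Python with any PostgreSQL driver. A minimal sketch, assuming `psycopg2` is installed; the table schema isn't shown here, so the query just lists whatever tables the scraper has created:

```python
# Minimal connectivity check using the connection string above
# (assumes the psycopg2 package; no table names are assumed).
import psycopg2

conn = psycopg2.connect("postgresql://linkedin_user:linkedin_pass@localhost:5432/linkedin")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```
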
### Neo4j
```
# Open Neo4j Browser
http://localhost:7474

# Login credentials
Username: neo4j
Password: linkedin_pass

# Example query - find network path
MATCH path = shortestPath(
  (p1:Person {name: "Bill Gates"})-[*]-(p2:Person {name: "Satya Nadella"})
)
RETURN path
```

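The same query can be run from Python with the official `neo4j` driver. A sketch, assuming the driver package is installed and that the container exposes the standard Bolt port 7687 (only the HTTP browser port 7474 is listed above):

```python
# Run the shortest-path query above over the Bolt protocol
# (assumes the `neo4j` Python driver and the default Bolt port 7687).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "linkedin_pass"))
query = """
MATCH path = shortestPath(
  (p1:Person {name: $a})-[*]-(p2:Person {name: $b})
)
RETURN length(path) AS hops
"""
with driver.session() as session:
    record = session.run(query, a="Bill Gates", b="Satya Nadella").single()
    if record:
        print(f"Degrees of separation: {record['hops']}")
driver.close()
```
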
## Configuration

`config.yaml` settings:

| Setting | Description | Default |
|---------|-------------|---------|
| `max_profiles` | Stop after N profiles | 100 |
| `max_depth` | BFS depth limit | 3 |
| `rate_limit_min/max` | Delay range (seconds) | 5-10 |
| `employee_limit` | Max employees per company (null = unlimited) | null |

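A minimal sketch of how these settings might be read, and how the rate-limit range turns into a delay between page loads. It assumes PyYAML; the key names match the table above, but the loader itself is illustrative rather than the package's actual code:

```python
# Illustrative config loader: picks a random delay from the
# rate_limit_min/rate_limit_max range before each page load.
import random
import time
import yaml  # assumes PyYAML

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["scraping"]

def polite_pause() -> None:
    """Sleep for a random delay within the configured range."""
    delay = random.uniform(cfg["rate_limit_min"], cfg["rate_limit_max"])
    time.sleep(delay)

print(f"max_profiles={cfg['max_profiles']}, max_depth={cfg['max_depth']}")
polite_pause()
```
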
## Implementation Status

- [x] Docker Compose setup
- [ ] Phase 1: Enhanced scrapers (in progress)
- [ ] Phase 2: Connection discovery
- [ ] Phase 3: Dual storage
- [ ] Phase 4: Queue system
- [ ] Phase 5: Main orchestrator
- [ ] Phase 6: CLI
- [ ] Phase 7: Testing

## Next Steps

Currently implementing Phase 1: enhancing PersonScraper and CompanyScraper with connection/employee discovery methods.

MANIFEST

Lines changed: 0 additions & 6 deletions
This file was deleted.

MANIFEST.in

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Include documentation
include README.md
include LICENSE
include AGENTS.md
include ARCHITECTURE.md
include USAGE.md
include TESTING.md

# Include configuration files
include config.yaml
include test_config.yaml
include docker-compose.yml
include requirements.txt
include requirements-dev.txt

# Include setup files
include setup.py
include setup.cfg
include pytest.ini

# Include all Python files in the package
recursive-include linkedin_scraper *.py

# Exclude build, test, and cache files
global-exclude __pycache__
global-exclude *.py[cod]
global-exclude *$py.class
global-exclude *.so
global-exclude .DS_Store

# Exclude test files and directories
prune tests
prune samples
prune build
prune dist
prune *.egg-info
prune venv
prune .pytest_cache
prune .git

MANIFEST.ini

Lines changed: 0 additions & 5 deletions
This file was deleted.
