feat/patchright & persistent context #280

stickerdaniel wants to merge 7 commits into joeyism:master
Conversation
text_content() captures invisible React RSC serialized JSON that LinkedIn now embeds on every page, containing "try again later" as a preloaded error template. This causes false positive rate limit detection on every scrape. inner_text() returns only visible text, matching the pattern used throughout the rest of the codebase. Resolves: joeyism#277 See also: joeyism#275
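The difference can be sketched without a browser: a naive phrase match over the full DOM text (roughly what `text_content()` returns) trips on the hidden error template, while the same match over visible text only (roughly what `inner_text()` returns) does not. The sample strings and the `looks_rate_limited` helper below are illustrative, not the project's actual code.

```python
# Illustrative sketch: why matching on full DOM text misfires.
# LinkedIn embeds serialized React RSC payloads that are never rendered,
# but which contain preloaded error templates like "try again later".

VISIBLE_TEXT = "Jane Doe - Software Engineer at Example Corp"
HIDDEN_RSC_PAYLOAD = '{"errorTemplates":{"rateLimit":"Please try again later"}}'

def looks_rate_limited(text: str) -> bool:
    """Naive phrase check, similar in spirit to the scraper's detection."""
    return "try again later" in text.lower()

# text_content()-style extraction: visible text plus hidden payloads.
full_text = VISIBLE_TEXT + " " + HIDDEN_RSC_PAYLOAD
# inner_text()-style extraction: only rendered, visible text.
visible_text = VISIBLE_TEXT

assert looks_rate_limited(full_text) is True      # false positive
assert looks_rate_limited(visible_text) is False  # correct result
```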
Replace playwright with patchright (anti-detection fork) and rewrite BrowserManager to use launch_persistent_context(user_data_dir=...). BREAKING CHANGE: BrowserManager now requires user_data_dir parameter. save_session() and load_session() methods removed - state auto-persists.
This repo uses requirements.txt, not uv.
Pull request overview
This PR implements a major architectural change to improve LinkedIn session persistence and evade bot detection by:
- Switching from Playwright to Patchright - A drop-in replacement that removes automation signals (navigator.webdriver, HeadlessChrome UA, CDP leaks) to avoid detection
- Replacing manual session handling with persistent browser context - Instead of saving/loading cookies+localStorage to JSON files, the entire Chromium profile (cache, service workers, history, cookies, etc.) is now stored in a persistent directory
- Updating Python version requirement - Changed from >=3.8 to >=3.10 to align with existing code that uses modern type hint syntax (`list[...]`, `tuple[...]`)
Changes:
- Replaced all `playwright` imports with `patchright`
- Removed `save_session()` and `load_session()` methods from BrowserManager
- Replaced `browser.new_context()` with `launch_persistent_context()` that stores the full browser profile
- Added `user_data_dir` parameter to BrowserManager (defaults to `~/.linkedin_scraper/browser_data`)
- Improved exception handling with more specific error types and better logging
- Updated all tests to use `tmp_path` fixtures for isolated test environments
- Updated sample scripts to remove manual session loading
- Removed setup.py in favor of pyproject.toml
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| requirements.txt | Replaced playwright>=1.40.0 with patchright>=1.40.0 |
| pyproject.toml | Updated dependency to patchright, bumped Python requirement to >=3.10, removed Python 3.8/3.9 classifiers |
| linkedin_scraper/__init__.py | Updated module docstring to reference Patchright |
| linkedin_scraper/core/browser.py | Complete rewrite to use persistent context; removed save_session/load_session/browser property |
| linkedin_scraper/core/auth.py | Updated all Playwright references to Patchright, fixed indentation in one docstring |
| linkedin_scraper/core/utils.py | Updated imports to patchright, improved exception handling specificity |
| linkedin_scraper/scrapers/*.py | Updated imports, improved error handling, removed redundant callback defaults |
| linkedin_scraper/models/person.py | Changed open_to_work from bool = False to Optional[bool] = None |
| tests/*.py | Added tmp_path fixtures, removed session save/load tests, updated fixtures for persistent context |
| samples/*.py | Removed manual session loading calls, updated documentation |
| setup.py | Removed entire file (migrated to pyproject.toml) |
| """ | ||
| Login to LinkedIn using li_at cookie. | ||
|
|
||
| Args: | ||
| page: Playwright page object | ||
| cookie_value: Value of li_at cookie | ||
| Args: | ||
| page: Patchright page object | ||
| cookie_value: Value of li_at cookie | ||
|
|
The docstring and Args section have inconsistent indentation. The opening triple quotes at line 192 uses a single space instead of proper indentation, and the Args section at line 195 has extra leading spaces. This should be corrected to maintain consistent code formatting.
@copilot open a new pull request to apply changes based on this feedback
```python
        cookies_before = await browser.context.cookies()
        assert isinstance(cookies_before, list)

    # Second session: same data dir should reuse the persistent profile
    async with BrowserManager(user_data_dir=data_dir, headless=True) as browser:
        assert browser.page is not None
        assert browser.context is not None
```
The test for persistent context doesn't actually verify that data persists between sessions. Consider enhancing this test to add a cookie or localStorage item in the first session, then verify it exists in the second session. For example, after navigating in the first session, you could add a test cookie via await browser.context.add_cookies([{"name": "test_cookie", "value": "test_value", "domain": ".example.com", "path": "/"}]), then in the second session, check that this cookie still exists.
Suggested change:

```python
        cookies_before = await browser.context.cookies()
        assert isinstance(cookies_before, list)
        await browser.context.add_cookies(
            [
                {
                    "name": "test_cookie",
                    "value": "test_value",
                    "domain": ".example.com",
                    "path": "/",
                }
            ]
        )
        cookies = await browser.context.cookies()
        assert any(
            c.get("name") == "test_cookie" and c.get("value") == "test_value"
            for c in cookies
        )

    # Second session: same data dir should reuse the persistent profile
    async with BrowserManager(user_data_dir=data_dir, headless=True) as browser:
        await browser.page.goto("https://www.example.com")
        cookies_after = await browser.context.cookies()
        assert any(
            c.get("name") == "test_cookie" and c.get("value") == "test_value"
            for c in cookies_after
        )
```
```python
logger = logging.getLogger(__name__)

_DEFAULT_USER_DATA_DIR = Path.home() / ".linkedin_scraper" / "browser_data"
```
When using the default user_data_dir, concurrent executions of the scraper will fail because Chromium locks the profile directory. If multiple browser instances are needed, users must provide different user_data_dir paths. Consider documenting this limitation or adding a check that provides a helpful error message when the directory is locked, rather than letting Chromium's generic error surface.
```diff
             if len(text) < 50:
                 return text
-        except:
+        except PlaywrightTimeoutError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
             if 'applicant' in text_lower or 'people clicked' in text_lower or 'applied' in text_lower:
                 return text
-        except:
+        except PlaywrightTimeoutError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
-        except Exception:
+        except RateLimitError:
+            raise
+        except PlaywrightTimeoutError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
             return True
-        except:
+        except PlaywrightTimeoutError:
             pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-            pass
+            logger.debug("Timeout while attempting to close modal; treating as no modal closed")
         return False
```
Playwright exports some LinkedIn cookies with .www.linkedin.com domain but Chromium stores them as .linkedin.com. On Linux, the mismatched domains prevent cookies from being sent with HTTP requests. - Normalize .www.linkedin.com → .linkedin.com at export time - Filter to LinkedIn-only cookies on export (skip third-party) - Only import auth cookies (li_at, li_rm) to avoid redirect loops from stale anti-bot/routing cookies - Clear existing cookies before import to remove undecryptable entries
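The export-time normalization and auth-only filtering described in this commit message could be sketched as follows (the helper names and sample cookies are illustrative, not the PR's actual code):

```python
# Sketch: normalize ".www.linkedin.com" to ".linkedin.com", drop third-party
# cookies at export time, and re-import only the auth cookies.

AUTH_COOKIES = {"li_at", "li_rm"}  # only auth cookies are re-imported

def normalize_exported_cookies(cookies: list[dict]) -> list[dict]:
    out = []
    for c in cookies:
        domain = c.get("domain", "")
        if not domain.endswith("linkedin.com"):
            continue  # skip third-party cookies on export
        if domain == ".www.linkedin.com":
            c = {**c, "domain": ".linkedin.com"}  # match Chromium's stored form
        out.append(c)
    return out

def importable_auth_cookies(cookies: list[dict]) -> list[dict]:
    # li_at / li_rm only, to avoid redirect loops from stale anti-bot cookies.
    return [c for c in normalize_exported_cookies(cookies) if c["name"] in AUTH_COOKIES]

raw = [
    {"name": "li_at", "value": "x", "domain": ".www.linkedin.com"},
    {"name": "bcookie", "value": "y", "domain": ".linkedin.com"},
    {"name": "tracker", "value": "z", "domain": ".ads.example.com"},
]
assert importable_auth_cookies(raw) == [
    {"name": "li_at", "value": "x", "domain": ".linkedin.com"}
]
```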
Implementation of the enhancements proposed in #279: switches from Playwright to Patchright and replaces the manual session handling with Playwright's persistent context.