feat/patchright & persistent context#280

Open
stickerdaniel wants to merge 7 commits into joeyism:master from stickerdaniel:feat/patchright-persistent-context

Conversation

@stickerdaniel
Contributor

Implementation of the enhancements proposed in #279: switch from Playwright to Patchright and replace the manual session handling with Playwright's persistent context.

text_content() captures invisible React RSC serialized JSON that
LinkedIn now embeds on every page, containing "try again later" as
a preloaded error template. This causes false positive rate limit
detection on every scrape.

inner_text() returns only visible text, matching the pattern used
throughout the rest of the codebase.
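The failure mode can be illustrated without a browser. A minimal sketch, assuming a naive marker-based detector (`looks_rate_limited` and the marker phrase are hypothetical, not the project's actual code): `visible_text` stands in for `page.inner_text("body")`, and `full_text` for `page.text_content("body")`, which also includes the hidden serialized payload.

```python
RATE_LIMIT_MARKERS = ("try again later",)  # assumed marker phrase

def looks_rate_limited(text: str) -> bool:
    """Naive substring check, as a rate-limit detector might do."""
    lowered = text.lower()
    return any(marker in lowered for marker in RATE_LIMIT_MARKERS)

# Visible page text: a normal profile, no error shown to the user.
visible_text = "Jane Doe\nSoftware Engineer at Example Corp"

# text_content() additionally picks up the preloaded error template
# from the invisible React RSC payload embedded in the page.
full_text = visible_text + '\n{"errorTemplate": "Something went wrong, try again later"}'

assert not looks_rate_limited(visible_text)  # inner_text(): no false positive
assert looks_rate_limited(full_text)         # text_content(): false positive
```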

Resolves: joeyism#277
See also: joeyism#275
Replace playwright with patchright (anti-detection fork) and rewrite
BrowserManager to use launch_persistent_context(user_data_dir=...).

BREAKING CHANGE: BrowserManager now requires user_data_dir parameter.
save_session() and load_session() methods removed - state auto-persists.
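A minimal sketch of the new BrowserManager shape. The class name, `user_data_dir` parameter, and default directory come from this PR's description; everything else is illustrative, not the actual project code.

```python
from pathlib import Path

_DEFAULT_USER_DATA_DIR = Path.home() / ".linkedin_scraper" / "browser_data"

class BrowserManager:
    """Async context manager around a persistent Chromium profile (sketch)."""

    def __init__(self, user_data_dir: Path = _DEFAULT_USER_DATA_DIR, headless: bool = True):
        self.user_data_dir = Path(user_data_dir)
        self.headless = headless
        self.context = None
        self.page = None

    async def __aenter__(self):
        # Import lazily so the class can be defined without patchright installed.
        from patchright.async_api import async_playwright

        self.user_data_dir.mkdir(parents=True, exist_ok=True)
        self._pw = await async_playwright().start()
        # The whole Chromium profile (cookies, cache, service workers, ...)
        # lives in user_data_dir, so session state persists across runs
        # with no explicit save_session()/load_session() calls.
        self.context = await self._pw.chromium.launch_persistent_context(
            user_data_dir=str(self.user_data_dir),
            headless=self.headless,
        )
        self.page = self.context.pages[0] if self.context.pages else await self.context.new_page()
        return self

    async def __aexit__(self, *exc_info):
        await self.context.close()
        await self._pw.stop()
```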
This repo uses requirements.txt, not uv.
Copilot AI review requested due to automatic review settings February 12, 2026 20:23

Copilot AI left a comment


Pull request overview

This PR implements a major architectural change to improve LinkedIn session persistence and evade bot detection by:

  1. Switching from Playwright to Patchright - A drop-in replacement that removes automation signals (navigator.webdriver, HeadlessChrome UA, CDP leaks) to avoid detection
  2. Replacing manual session handling with persistent browser context - Instead of saving/loading cookies+localStorage to JSON files, the entire Chromium profile (cache, service workers, history, cookies, etc.) is now stored in a persistent directory
  3. Updating Python version requirement - Changed from >=3.8 to >=3.10 to align with existing code that uses modern type hint syntax (list[...], tuple[...])

Changes:

  • Replaced all playwright imports with patchright
  • Removed save_session() and load_session() methods from BrowserManager
  • Replaced browser.new_context() with launch_persistent_context() that stores full browser profile
  • Added user_data_dir parameter to BrowserManager (defaults to ~/.linkedin_scraper/browser_data)
  • Improved exception handling with more specific error types and better logging
  • Updated all tests to use tmp_path fixtures for isolated test environments
  • Updated sample scripts to remove manual session loading
  • Removed setup.py in favor of pyproject.toml

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 13 comments.

Show a summary per file

File: Description
requirements.txt: Replaced playwright>=1.40.0 with patchright>=1.40.0
pyproject.toml: Updated dependency to patchright, bumped Python requirement to >=3.10, removed Python 3.8/3.9 classifiers
linkedin_scraper/__init__.py: Updated module docstring to reference Patchright
linkedin_scraper/core/browser.py: Complete rewrite to use persistent context; removed save_session/load_session/browser property
linkedin_scraper/core/auth.py: Updated all Playwright references to Patchright, fixed indentation in one docstring
linkedin_scraper/core/utils.py: Updated imports to patchright, improved exception handling specificity
linkedin_scraper/scrapers/*.py: Updated imports, improved error handling, removed redundant callback defaults
linkedin_scraper/models/person.py: Changed open_to_work from bool = False to Optional[bool] = None
tests/*.py: Added tmp_path fixtures, removed session save/load tests, updated fixtures for persistent context
samples/*.py: Removed manual session loading calls, updated documentation
setup.py: Removed entire file (migrated to pyproject.toml)


Comment on lines 192 to 198

     """
     Login to LinkedIn using li_at cookie.

-    Args:
-        page: Playwright page object
-        cookie_value: Value of li_at cookie
+      Args:
+          page: Patchright page object
+          cookie_value: Value of li_at cookie


Copilot AI Feb 12, 2026


The docstring and Args section have inconsistent indentation. The opening triple quote at line 192 uses a single space instead of proper indentation, and the Args section at line 195 has extra leading spaces. This should be corrected to maintain consistent code formatting.

Contributor Author


@copilot open a new pull request to apply changes based on this feedback

Comment on lines +32 to +38

        cookies_before = await browser.context.cookies()
        assert isinstance(cookies_before, list)

    # Second session: same data dir should reuse the persistent profile
    async with BrowserManager(user_data_dir=data_dir, headless=True) as browser:
        assert browser.page is not None
        assert browser.context is not None

Copilot AI Feb 12, 2026


The test for persistent context doesn't actually verify that data persists between sessions. Consider enhancing this test to add a cookie or localStorage item in the first session, then verify it exists in the second session. For example, after navigating in the first session, you could add a test cookie via await browser.context.add_cookies([{"name": "test_cookie", "value": "test_value", "domain": ".example.com", "path": "/"}]), then in the second session, check that this cookie still exists.

Suggested change

-        cookies_before = await browser.context.cookies()
-        assert isinstance(cookies_before, list)
-
-    # Second session: same data dir should reuse the persistent profile
-    async with BrowserManager(user_data_dir=data_dir, headless=True) as browser:
-        assert browser.page is not None
-        assert browser.context is not None
+        await browser.context.add_cookies(
+            [
+                {
+                    "name": "test_cookie",
+                    "value": "test_value",
+                    "domain": ".example.com",
+                    "path": "/",
+                }
+            ]
+        )
+        cookies = await browser.context.cookies()
+        assert any(
+            c.get("name") == "test_cookie" and c.get("value") == "test_value"
+            for c in cookies
+        )
+
+    # Second session: same data dir should reuse the persistent profile
+    async with BrowserManager(user_data_dir=data_dir, headless=True) as browser:
+        await browser.page.goto("https://www.example.com")
+        cookies_after = await browser.context.cookies()
+        assert any(
+            c.get("name") == "test_cookie" and c.get("value") == "test_value"
+            for c in cookies_after
+        )


logger = logging.getLogger(__name__)

_DEFAULT_USER_DATA_DIR = Path.home() / ".linkedin_scraper" / "browser_data"

Copilot AI Feb 12, 2026


When using the default user_data_dir, concurrent executions of the scraper will fail because Chromium locks the profile directory. If multiple browser instances are needed, users must provide different user_data_dir paths. Consider documenting this limitation or adding a check that provides a helpful error message when the directory is locked, rather than letting Chromium's generic error surface.

        if len(text) < 50:
            return text
-    except:
+    except PlaywrightTimeoutError:

Copilot AI Feb 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

        if 'applicant' in text_lower or 'people clicked' in text_lower or 'applied' in text_lower:
            return text
-    except:
+    except PlaywrightTimeoutError:

Copilot AI Feb 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

-        except Exception:
+        except RateLimitError:
+            raise
+        except PlaywrightTimeoutError:

Copilot AI Feb 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

            return True
-        except:
+        except PlaywrightTimeoutError:
            pass

Copilot AI Feb 12, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-            pass
+            logger.debug("Timeout while attempting to close modal; treating as no modal closed")
+            return False

@stickerdaniel stickerdaniel force-pushed the feat/patchright-persistent-context branch from 20081d1 to 2224286 Compare February 12, 2026 23:44
Playwright exports some LinkedIn cookies with .www.linkedin.com domain
but Chromium stores them as .linkedin.com. On Linux, the mismatched
domains prevent cookies from being sent with HTTP requests.

- Normalize .www.linkedin.com → .linkedin.com at export time
- Filter to LinkedIn-only cookies on export (skip third-party)
- Only import auth cookies (li_at, li_rm) to avoid redirect loops
  from stale anti-bot/routing cookies
- Clear existing cookies before import to remove undecryptable entries
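The export/import rules above can be sketched as follows; the function names and cookie-dict shape are illustrative, not the project's actual API.

```python
AUTH_COOKIES = {"li_at", "li_rm"}  # the only cookies imported back

def normalize_for_export(cookies: list[dict]) -> list[dict]:
    """Keep LinkedIn cookies only, rewriting .www.linkedin.com -> .linkedin.com."""
    out = []
    for cookie in cookies:
        domain = cookie.get("domain", "")
        if not domain.endswith("linkedin.com"):
            continue  # filter third-party cookies at export time
        if domain == ".www.linkedin.com":
            cookie = {**cookie, "domain": ".linkedin.com"}  # normalize domain
        out.append(cookie)
    return out

def filter_for_import(cookies: list[dict]) -> list[dict]:
    """Import only auth cookies, avoiding redirect loops from stale anti-bot cookies."""
    return [c for c in cookies if c.get("name") in AUTH_COOKIES]

raw = [
    {"name": "li_at", "value": "tok", "domain": ".www.linkedin.com"},
    {"name": "bcookie", "value": "x", "domain": ".linkedin.com"},
    {"name": "tracker", "value": "y", "domain": ".ads.example.com"},
]
exported = normalize_for_export(raw)
assert [c["domain"] for c in exported] == [".linkedin.com", ".linkedin.com"]
assert [c["name"] for c in filter_for_import(exported)] == ["li_at"]
```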