feat(scraping): add get_saved_posts tool for LinkedIn bookmarks #353
blaisebruno wants to merge 798 commits into
- Use fixed 25-per-page offset instead of dynamic ID count
- Read "Page X of Y" from pagination state to cap pagination
- Add soft rate-limit retry via _extract_search_page helper
- Use keyword arguments in tool wrapper for clarity
- Stop on page 0 when no job IDs found (avoid useless page 1)
- Fix test_stops_at_total_pages to use distinct IDs per page so only the total_pages guard stops pagination
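The pagination rules described in these commits (a fixed 25-per-page offset and a cap read from the "Page X of Y" text) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `parse_total_pages`, `page_offsets`, and `PAGE_SIZE` are hypothetical names, and the regex is an assumption about how the pagination text is formatted.

```python
import re
from typing import Iterator, Optional

PAGE_SIZE = 25  # assumed fixed page size, taken from the commit message


def parse_total_pages(text: str) -> Optional[int]:
    """Return Y from a 'Page X of Y' string, or None when it is absent."""
    match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", text)
    return int(match.group(2)) if match else None


def page_offsets(max_pages: int, total_pages: Optional[int]) -> Iterator[int]:
    """Yield start offsets, capped by max_pages and by total_pages if known."""
    limit = max_pages if total_pages is None else min(max_pages, total_pages)
    for page in range(limit):
        yield page * PAGE_SIZE
```

When the total is unknown the loop falls back to `max_pages`, which mirrors the "degrades gracefully" behaviour the later commits mention.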
Add date_posted, job_type, experience_level, work_type, easy_apply, and sort_by filters to search_jobs with human-readable normalization. Fix Greptile review: always log no-results break, move _PAGE_SIZE to module level, add Field(ge=1, le=10) on max_pages, skip ID extraction on empty text. Resolves: stickerdaniel#174
Use _normalize_csv for job_type to preserve raw commas in multi-value filters and add human-readable names (full_time, contract, etc.).
Break early when _extract_search_page returns _RATE_LIMITED_MSG to avoid extracting IDs from unreliable DOM state. Remove redundant truthiness check now guarded by the early break.
Move _normalize_csv out of _build_job_search_url to module level for reusability. Wait for job card links before sidebar scrolling to handle async rendering. Document DOM-independence principle in CONTRIBUTING.md and AGENTS.md.
The pagination state element has display:none so innerText cannot capture it. Document why the class-based selector is necessary and that it degrades gracefully to max_pages if LinkedIn renames it.
Use direct .get() lookup for date_posted and sort_by (single-select filters). Remove unreachable _RATE_LIMITED_MSG check after early break. Query _get_total_search_pages only once per search to avoid repeated evaluate() calls when the element is absent.
Apply quote_plus to date_posted and sort_by passthrough values to prevent malformed URLs from unexpected input. Use consistent 1-indexed page numbers in all debug log messages.
Warn when search page rate-limit retry also fails. Add console.debug in scroll_job_sidebar when no scrollable container is found.
Skip sidebar scrolling when <main> is absent to avoid 5s timeout on edge-case pages. Fix off-by-one in total_pages log message. Add page count assertion to test_deduplication_across_pages.
Append text to page_texts before breaking on no new IDs so the LLM can read LinkedIn's feedback (e.g. "No jobs found") instead of receiving empty sections.
Add await_count == 2 assertion to test_page_texts_joined_with_separator matching the pattern already used in test_deduplication_across_pages.
Switch from innerText to textContent in _get_total_search_pages so the "Page X of Y" text is readable regardless of CSS visibility.
- Replace console.debug in scroll_job_sidebar JS with sentinel return so the message is logged via Python logger instead
- Wrap _get_total_search_pages in its own try/except to prevent an exception from discarding already-fetched page text and job IDs
- Inline offset calculation into URL ternary for clarity
- Add debug log when sidebar container is found but no new content loads (scrolled == 0)
- Add debug log when <main> is absent and body fallback is used on search pages
- Use -2 sentinel for "job card link vanished" vs -1 for "no scrollable container" vs 0 for "no new content loaded"
- Return {source, text} from search page JS evaluate so the body fallback log fires based on actual DOM state, not the pre-evaluate wait_for_selector flag
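One way to read the sentinel scheme above is as a small lookup on the Python side, so that the in-page JS only returns an integer and the logging happens through the Python logger. The constant names, logger name, and message wording below are hypothetical; only the three sentinel values come from the commit message.

```python
import logging

logger = logging.getLogger("sidebar_scroll_sketch")  # hypothetical logger name

# Hypothetical constant names for the sentinel values in the commit message.
JOB_CARD_VANISHED = -2
NO_SCROLL_CONTAINER = -1


def describe_scroll_result(scrolled: int) -> str:
    """Translate the integer returned by the in-page scroll script into a log message."""
    if scrolled == JOB_CARD_VANISHED:
        message = "job card link vanished before scrolling"
    elif scrolled == NO_SCROLL_CONTAINER:
        message = "no scrollable container found"
    elif scrolled == 0:
        message = "container found but no new content loaded"
    else:
        message = f"scrolled {scrolled} time(s)"
    logger.debug(message)
    return message
```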
- Add URL sanity check before _extract_job_ids to prevent extracting IDs from a stale page after a swallowed navigation failure
- Add test_no_ids_on_first_page_captures_text to pin the behavior where non-empty text with zero job IDs is returned in sections
- Change total_pages mock to None in test_pagination_uses_fixed_page_size since max_pages=2 caps the loop before total_pages is relevant
…uard
- Move _NOISE_MARKERS comment to directly precede the list it describes
- Log when <main> appears after wait_for_selector timeout but before evaluate (sidebar scroll skipped on late-appearing element)
- Add test_url_redirect_skips_id_extraction to exercise the URL sanity guard that prevents extracting IDs from a stale/redirect page
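The URL sanity guard mentioned in these commits could look something like the sketch below: check that the browser is still on a search URL before pulling job IDs out of the page. The function names, the `/jobs/search` path prefix, and the `/jobs/view/<id>` link pattern are assumptions for illustration, not the project's actual implementation.

```python
import re
from urllib.parse import urlparse


def is_expected_search_url(current_url: str, expected_path: str = "/jobs/search") -> bool:
    """Return True when the current URL still points at the (assumed) search path."""
    return urlparse(current_url).path.startswith(expected_path)


def extract_job_ids_if_on_search(current_url: str, html: str) -> list[str]:
    """Extract unique job IDs in first-seen order, or nothing after a redirect."""
    if not is_expected_search_url(current_url):
        return []  # stale or redirected page: skip ID extraction entirely
    ids = re.findall(r"/jobs/view/(\d+)", html)
    return list(dict.fromkeys(ids))  # de-duplicate while preserving order
```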
Capture _get_total_search_pages mock in test_stops_at_total_pages and verify await_count == 1 to pin the query-once optimization.
…ols_add_job_ids_sidebar_scrolling_and_pagination_to_search_jobs feat(tools): add job IDs, sidebar scrolling, and pagination to search_jobs
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…l commands (stickerdaniel#202)
Greptile Summary
This PR adds a "Verifying Bug Reports" section to `AGENTS.md` with step-by-step `curl` commands for testing the MCP server end-to-end via HTTP transport. The `SESSION_ID` extraction via `grep`/`awk`/`tr -d '\r'` is correct and properly handles Windows-style line endings in curl header output. However, the **server startup command blocks the terminal**: without `&` or an explicit note to use a separate shell, developers or agents following the script linearly will never reach the `curl` commands.
Confidence Score: 4/5
- Safe to merge once the server startup command is backgrounded or explicit terminal-switching instructions are added.
- The change is documentation-only and does not affect runtime code. The session-ID extraction logic is correct. The primary issue is a usability blocker: the server startup command blocks the terminal, preventing the documented workflow from executing end-to-end in a single shell. This is straightforward to fix with `&` or an explicit note.
- AGENTS.md: specifically, the server startup command (line 138) needs to either background the process or include explicit instructions to use a separate terminal.
Last reviewed commit: e8e8eb9
> Greptile also left **1 inline comment** on this PR.
Activity feed pages lazy-load post content after tab headers render. Add wait_for_function check and slower scroll params for /recent-activity/ URLs so posts section returns actual content instead of just tab headers. Resolves: stickerdaniel#201
…ity-feed-posts-empty fix(scraping): Wait for activity feed content before extracting
…ump_version_to_4.8.2 chore: Bump version to 4.8.2
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [docker/login-action](https://redirect.github.com/docker/login-action) ([changelog](https://redirect.github.com/docker/login-action/compare/b45d80f862d83dbcd57f89517bcf500b2ab88fb2..4907a6ddec9925e35a0a9e82d7399ccc52663121)) | action | digest | `b45d80f` → `4907a6d` |
| ghcr.io/astral-sh/uv | final | digest | `c4f5de3` → `90bbb3c` |

---

### Configuration

📅 **Schedule**: Branch creation - "before 6am on Monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://redirect.github.com/renovatebot/renovate/discussions) if that's undesired.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/stickerdaniel/linkedin-mcp-server).
…olve_ty_type-checker_errors_blocking_ci fix: Resolve ty type-checker errors blocking CI
…prove_release_notes_with_direct_download_link_and_changelog_config docs: Improve release notes with direct download link and changelog config
…tools
- Use --no-editable --no-dev --compile-bytecode in builder stage
- Remove git (no git-based deps exist)
- Remove COPY . /app/ and /app/data from runtime stage
- Bump requires-python to >=3.12,<3.15 (all 371 tests pass on 3.14.3)
- Add Python 3.14 classifier
- Restore README.md in Docker context for clean wheel builds
- Add Renovate rule to block automatic Python Docker tag bumps
…r-security chore: optimize Dockerfile security and multi-stage build
…ump_version_to_4.8.3 chore: Bump version to 4.8.3
- Remove redundant text.length > 200 guard from details-page wait condition; startsWith checks are sufficient and the length threshold would cause 10s timeouts on legitimately short sections
…aping_wait_for_detail_panel_before_extracting_experience_sections fix(scraping): wait for detail panel before extracting experience sections
- Fix off-by-one in MID line causing box misalignment with double-width emoji
- Update comment to say Unicode box-drawing instead of ASCII
…dd_dynamic_ascii_download_button_to_release_notes style: Add dynamic ASCII download button to release notes
…me-status-sync docs(readme): sync tool status table
- keep only tool-specific issue links in Features & Tool Status
- add send_message issue stickerdaniel#344 and mark unaffected tools as working

Closes stickerdaniel#346
…me-tool-status-sync docs(readme): sync tool status table
Adds a new get_saved_posts MCP tool that navigates to https://www.linkedin.com/my-items/saved-posts/ and extracts the user's bookmarked posts, including author, full text, and post URLs.

Changes:
- linkedin_mcp_server/tools/messaging.py: register get_saved_posts tool with readOnlyHint annotation, limit param (1-50), same pattern as get_inbox
- linkedin_mcp_server/scraping/extractor.py: add get_saved_posts method on LinkedInExtractor, plus _expand_see_more_buttons and _click_show_more_results private helpers for pagination

Closes: resolves the missing saved-posts coverage gap
Greptile Summary
Adds
Confidence Score: 4/5
Safe to merge after fixing the see-more expansion ordering; all other findings are style or minor robustness suggestions. One P1 logic defect: linkedin_mcp_server/scraping/extractor.py, specifically the ordering of
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client
participant Tool as get_saved_posts (tool)
participant Extractor as LinkedInExtractor
participant Page as LinkedIn Page
Client->>Tool: call(limit=N)
Tool->>Extractor: get_saved_posts(limit)
Extractor->>Page: navigate /my-items/saved-posts/
Page-->>Extractor: page loaded
Extractor->>Page: _expand_see_more_buttons() ⚠️ before scroll
Page-->>Extractor: initial posts expanded
Extractor->>Page: _scroll_main_scrollable_region(attempts=limit//5)
Page-->>Extractor: new posts loaded (unexpanded)
Extractor->>Page: _click_show_more_results() once
Page-->>Extractor: additional posts loaded (unexpanded)
Extractor->>Page: _extract_root_content([main])
Page-->>Extractor: raw text + references
Extractor->>Extractor: strip_noise, build_references, filter /feed/update/ and /in/
Extractor-->>Tool: url, sections, references
Tool-->>Client: result
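The final filtering step in the diagram (keeping only `/feed/update/` and `/in/` references) might be sketched like this. `filter_references` is a hypothetical helper written for illustration; the project's own reference-building code is not shown in this conversation and may differ.

```python
from urllib.parse import urlparse

# Path prefixes named in the diagram: post URLs and profile URLs.
_KEPT_PREFIXES = ("/feed/update/", "/in/")


def filter_references(urls: list[str]) -> list[str]:
    """Keep unique post and profile URLs in their original order."""
    kept: list[str] = []
    for url in urls:
        path = urlparse(url).path
        if path.startswith(_KEPT_PREFIXES) and url not in kept:
            kept.append(url)
    return kept
```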
Prompt To Fix All With AI
This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2232-2241
Comment:
**See-more expansion runs before loading, leaving new posts truncated**
`_expand_see_more_buttons` is called once before the scroll loop, so it only expands posts that were visible on initial page load. Any posts surfaced by `_scroll_main_scrollable_region` or `_click_show_more_results` will still have their content truncated when `_extract_root_content` runs. Moving the expansion call to after all loading steps is complete ensures it covers all loaded posts.
```suggestion
# Scroll to load more posts up to the limit
scrolls = max(1, limit // 5)
await self._scroll_main_scrollable_region(
position="bottom", attempts=scrolls, pause_time=0.8
)
# Click additional "Show more results" buttons if present
await self._click_show_more_results()
# Expand truncated posts by clicking "see more" buttons (after all content is loaded)
await self._expand_see_more_buttons()
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2279-2280
Comment:
**Redundant `import asyncio` inside method bodies**
`asyncio` is already imported at the top of the file (line 5). The inline `import asyncio` statements in both `_expand_see_more_buttons` (here) and `_click_show_more_results` (line 2297) are unnecessary. Python deduplicates repeated imports, but this is non-idiomatic and clutters the method body.
```suggestion
await asyncio.sleep(0.5)
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2284-2300
Comment:
**`_click_show_more_results` fires only once**
For `limit` values above ~10, a single "show more results" click is unlikely to surface enough posts — the page typically loads a fixed batch per click. Combined with the scroll loop (which may not trigger a "show more results" button on its own), users requesting 30–50 posts may see fewer results than asked for with no indication of truncation. Consider looping the click until the button disappears or a target count is met, similar to how scrolling uses `attempts`.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "feat(scraping): add get_saved_posts tool..."
    await self._expand_see_more_buttons()

    # Scroll to load more posts up to the limit
    scrolls = max(1, limit // 5)
    await self._scroll_main_scrollable_region(
        position="bottom", attempts=scrolls, pause_time=0.8
    )

    # Click additional "Show more results" buttons if present
    await self._click_show_more_results()
See-more expansion runs before loading, leaving new posts truncated
_expand_see_more_buttons is called once before the scroll loop, so it only expands posts that were visible on initial page load. Any posts surfaced by _scroll_main_scrollable_region or _click_show_more_results will still have their content truncated when _extract_root_content runs. Moving the expansion call to after all loading steps is complete ensures it covers all loaded posts.
    - await self._expand_see_more_buttons()
    - # Scroll to load more posts up to the limit
    - scrolls = max(1, limit // 5)
    - await self._scroll_main_scrollable_region(
    -     position="bottom", attempts=scrolls, pause_time=0.8
    - )
    - # Click additional "Show more results" buttons if present
    - await self._click_show_more_results()
    + # Scroll to load more posts up to the limit
    + scrolls = max(1, limit // 5)
    + await self._scroll_main_scrollable_region(
    +     position="bottom", attempts=scrolls, pause_time=0.8
    + )
    + # Click additional "Show more results" buttons if present
    + await self._click_show_more_results()
    + # Expand truncated posts by clicking "see more" buttons (after all content is loaded)
    + await self._expand_see_more_buttons()
    import asyncio
    await asyncio.sleep(0.5)
Redundant import asyncio inside method bodies
asyncio is already imported at the top of the file (line 5). The inline import asyncio statements in both _expand_see_more_buttons (here) and _click_show_more_results (line 2297) are unnecessary. Python deduplicates repeated imports, but this is non-idiomatic and clutters the method body.
    - import asyncio
    - await asyncio.sleep(0.5)
    + await asyncio.sleep(0.5)
    async def _click_show_more_results(self) -> None:
        """Click the show-more-results button if present to load additional posts."""
        try:
            await self._page.evaluate(
                """() => {
                    const buttons = Array.from(document.querySelectorAll('button'))
                        .filter(btn => {
                            const text = (btn.innerText || btn.textContent || '').trim().toLowerCase();
                            return text.includes('show more') || text.includes('afficher plus');
                        });
                    if (buttons.length > 0) buttons[0].click();
                }"""
            )
            import asyncio
            await asyncio.sleep(1.0)
        except Exception:
            pass  # Non-critical
_click_show_more_results fires only once
For limit values above ~10, a single "show more results" click is unlikely to surface enough posts — the page typically loads a fixed batch per click. Combined with the scroll loop (which may not trigger a "show more results" button on its own), users requesting 30–50 posts may see fewer results than asked for with no indication of truncation. Consider looping the click until the button disappears or a target count is met, similar to how scrolling uses attempts.
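The looping approach suggested in this comment could be sketched as below. This is only an illustration under stated assumptions: `click_until_done` and its callback parameters are hypothetical, and the callbacks stand in for whatever page interactions the extractor would actually perform (clicking the button and counting loaded posts).

```python
import asyncio
from typing import Awaitable, Callable


async def click_until_done(
    click_once: Callable[[], Awaitable[bool]],  # returns False when no button is left
    count_posts: Callable[[], Awaitable[int]],  # number of posts currently loaded
    target: int,
    max_clicks: int = 10,
    pause: float = 0.0,
) -> int:
    """Click 'show more' until the button disappears, the target is met,
    or max_clicks is reached. Returns the number of clicks performed."""
    clicks = 0
    while clicks < max_clicks and await count_posts() < target:
        if not await click_once():
            break  # button is gone: nothing more to load
        clicks += 1
        await asyncio.sleep(pause)  # give the page time to render the new batch
    return clicks
```

The `max_clicks` bound keeps the loop from spinning forever if the button never disappears, which is the same role `attempts` plays for scrolling.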
What this adds
A new get_saved_posts tool that navigates to https://www.linkedin.com/my-items/saved-posts/ and extracts the user's bookmarked posts, including author, full text content, and post URLs.
Why
Saved posts is one of the most-used LinkedIn features for power users (content research, reading lists, CRM workflows). It is the only major user-facing page not yet covered by the server.
Changes
- linkedin_mcp_server/tools/messaging.py: new get_saved_posts MCP tool registered with readOnlyHint, limit param (1-50, default 20), same pattern as get_inbox
- linkedin_mcp_server/scraping/extractor.py: new get_saved_posts method on LinkedInExtractor, plus two private helpers:
  - _expand_see_more_buttons: clicks see-more buttons to unfold truncated posts before scraping
  - _click_show_more_results: handles pagination by clicking the load-more button
Behaviour
limit parameter (1-50, default 20)
limit
/feed/update/) and profile URLs (/in/)
get_inbox
Tool table entry (for README)
| get_saved_posts | Get bookmarked posts from the LinkedIn saved items page | working |