feat(scraping): add get_saved_posts tool for LinkedIn bookmarks#353

Open
blaisebruno wants to merge 798 commits into stickerdaniel:main from blaisebruno:feat/get-saved-posts

Conversation

@blaisebruno

What this adds

A new get_saved_posts tool that navigates to https://www.linkedin.com/my-items/saved-posts/ and extracts the user's bookmarked posts, including author, full text content, and post URLs.

Why

Saved posts is one of the most-used LinkedIn features for power users (content research, reading lists, CRM workflows). It is the only major user-facing page not yet covered by the server.

Changes

  • linkedin_mcp_server/tools/messaging.py: new get_saved_posts MCP tool registered with readOnlyHint, limit param (1-50, default 20), same pattern as get_inbox
  • linkedin_mcp_server/scraping/extractor.py: new get_saved_posts method on LinkedInExtractor, plus two private helpers:
    • _expand_see_more_buttons: clicks see-more buttons to unfold truncated posts before scraping
    • _click_show_more_results: handles pagination by clicking the load-more button
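
The `limit` bounds listed above could be enforced along the following lines. This is a minimal stdlib-only sketch: `validate_limit` and the constants are hypothetical names for illustration, and the actual tool may well rely on the framework's own parameter validation instead.

```python
DEFAULT_LIMIT = 20            # default stated in the PR description
MIN_LIMIT, MAX_LIMIT = 1, 50  # bounds stated in the PR description


def validate_limit(limit: int = DEFAULT_LIMIT) -> int:
    """Return limit unchanged if it lies in [1, 50], otherwise raise."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError(f"limit must be an int, got {type(limit).__name__}")
    if not MIN_LIMIT <= limit <= MAX_LIMIT:
        raise ValueError(
            f"limit must be between {MIN_LIMIT} and {MAX_LIMIT}, got {limit}"
        )
    return limit
```

Rejecting out-of-range values, rather than silently clamping them, keeps the caller informed when fewer posts than requested can be returned.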

Behaviour

  • Accepts a limit parameter (1-50, default 20)
  • Expands truncated posts before scraping
  • Scrolls to load more results up to limit
  • Returns references filtered to post URLs (/feed/update/) and profile URLs (/in/)
  • Authentication error handling identical to get_inbox
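
The reference filtering described above might look roughly like this. It is a sketch under assumptions: references are taken to be dicts with a `url` key, and `filter_references` is a hypothetical name, not the extractor's actual helper.

```python
from urllib.parse import urlparse

# Path prefixes named in the PR description: post URLs and profile URLs.
ALLOWED_PATH_PREFIXES = ("/feed/update/", "/in/")


def filter_references(references: list[dict]) -> list[dict]:
    """Keep only references whose URL path is a post or profile link."""
    kept = []
    for ref in references:
        # Compare on the parsed path so absolute and relative URLs both work.
        path = urlparse(ref.get("url", "")).path
        if path.startswith(ALLOWED_PATH_PREFIXES):
            kept.append(ref)
    return kept
```

Matching on the parsed path, rather than a substring of the whole URL, avoids false positives such as a query string that happens to contain `/in/`.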

Tool table entry (for README)

| get_saved_posts | Get bookmarked posts from the LinkedIn saved items page | working |

stickerdaniel and others added 30 commits March 5, 2026 15:22
- Use fixed 25-per-page offset instead of dynamic ID count
- Read "Page X of Y" from pagination state to cap pagination
- Add soft rate-limit retry via _extract_search_page helper
- Use keyword arguments in tool wrapper for clarity
- Stop on page 0 when no job IDs found (avoid useless page 1)
- Fix test_stops_at_total_pages to use distinct IDs per page so
  only the total_pages guard stops pagination
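
The fixed-offset pagination described in this commit can be sketched as follows. The function names are hypothetical, and the "Page X of Y" parsing is an assumption about the text format based only on the commit message.

```python
import re
from typing import Optional

PAGE_SIZE = 25  # fixed number of results per search page, per the commit


def page_offset(page_index: int) -> int:
    """Offset for a 0-indexed page, independent of how many IDs were found."""
    return page_index * PAGE_SIZE


def parse_total_pages(state_text: str) -> Optional[int]:
    """Extract Y from a 'Page X of Y' string, or None if it is absent."""
    match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", state_text)
    return int(match.group(2)) if match else None


def pages_to_fetch(max_pages: int, total_pages: Optional[int]) -> int:
    """Cap the loop at total_pages when known, else fall back to max_pages."""
    return max_pages if total_pages is None else min(max_pages, total_pages)
```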
Add date_posted, job_type, experience_level, work_type, easy_apply,
and sort_by filters to search_jobs with human-readable normalization.
Fix Greptile review: always log no-results break, move _PAGE_SIZE to
module level, add Field(ge=1, le=10) on max_pages, skip ID extraction
on empty text.

Resolves: stickerdaniel#174
Use _normalize_csv for job_type to preserve raw commas in multi-value
filters and add human-readable names (full_time, contract, etc.).
Break early when _extract_search_page returns _RATE_LIMITED_MSG to
avoid extracting IDs from unreliable DOM state. Remove redundant
truthiness check now guarded by the early break.
Move _normalize_csv out of _build_job_search_url to module level for
reusability. Wait for job card links before sidebar scrolling to handle
async rendering. Document DOM-independence principle in CONTRIBUTING.md
and AGENTS.md.
The pagination state element has display:none so innerText cannot
capture it. Document why the class-based selector is necessary and
that it degrades gracefully to max_pages if LinkedIn renames it.
Use direct .get() lookup for date_posted and sort_by (single-select
filters). Remove unreachable _RATE_LIMITED_MSG check after early break.
Query _get_total_search_pages only once per search to avoid repeated
evaluate() calls when the element is absent.
Apply quote_plus to date_posted and sort_by passthrough values to
prevent malformed URLs from unexpected input. Use consistent 1-indexed
page numbers in all debug log messages.
Warn when search page rate-limit retry also fails. Add console.debug
in scroll_job_sidebar when no scrollable container is found.
Skip sidebar scrolling when <main> is absent to avoid 5s timeout on
edge-case pages. Fix off-by-one in total_pages log message. Add
page count assertion to test_deduplication_across_pages.
Append text to page_texts before breaking on no new IDs so the LLM
can read LinkedIn's feedback (e.g. "No jobs found") instead of
receiving empty sections.
Add await_count == 2 assertion to test_page_texts_joined_with_separator
matching the pattern already used in test_deduplication_across_pages.
Switch from innerText to textContent in _get_total_search_pages
so the "Page X of Y" text is readable regardless of CSS visibility.
- Replace console.debug in scroll_job_sidebar JS with sentinel return
  so the message is logged via Python logger instead
- Wrap _get_total_search_pages in its own try/except to prevent an
  exception from discarding already-fetched page text and job IDs
- Inline offset calculation into URL ternary for clarity
- Add debug log when sidebar container is found but no new content
  loads (scrolled == 0)
- Add debug log when <main> is absent and body fallback is used on
  search pages
- Use -2 sentinel for "job card link vanished" vs -1 for "no
  scrollable container" vs 0 for "no new content loaded"
- Return {source, text} from search page JS evaluate so the body
  fallback log fires based on actual DOM state, not the pre-evaluate
  wait_for_selector flag
- Add URL sanity check before _extract_job_ids to prevent extracting
  IDs from a stale page after a swallowed navigation failure
- Add test_no_ids_on_first_page_captures_text to pin the behavior
  where non-empty text with zero job IDs is returned in sections
- Change total_pages mock to None in test_pagination_uses_fixed_page_size
  since max_pages=2 caps the loop before total_pages is relevant
…uard

- Move _NOISE_MARKERS comment to directly precede the list it describes
- Log when <main> appears after wait_for_selector timeout but before
  evaluate (sidebar scroll skipped on late-appearing element)
- Add test_url_redirect_skips_id_extraction to exercise the URL
  sanity guard that prevents extracting IDs from a stale/redirect page
Capture _get_total_search_pages mock in test_stops_at_total_pages
and verify await_count == 1 to pin the query-once optimization.
…ols_add_job_ids_sidebar_scrolling_and_pagination_to_search_jobs

feat(tools): add job IDs, sidebar scrolling, and pagination to search_jobs
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…l commands (stickerdaniel#202)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR adds a "Verifying Bug Reports" section to `AGENTS.md` with step-by-step `curl` commands for testing the MCP server end-to-end via HTTP transport. The `SESSION_ID` extraction via `grep`/`awk`/`tr -d '\r'` is correct and properly handles Windows-style line endings in curl header output.

However, the **server startup command blocks the terminal** — without `&` or an explicit note to use a separate shell, developers or agents following the script linearly will never reach the `curl` commands.

<h3>Confidence Score: 4/5</h3>

- Safe to merge once the server startup command is backgrounded or explicit terminal-switching instructions are added.
- The change is documentation-only and does not affect runtime code. The session-ID extraction logic is correct. The primary issue is a usability blocker: the server startup command blocks the terminal, preventing the documented workflow from executing end-to-end in a single shell. This is straightforward to fix with `&` or an explicit note.
- AGENTS.md — specifically the server startup command (line 138) needs to either background the process or include explicit instructions to use a separate terminal.

<sub>Last reviewed commit: e8e8eb9</sub>

> Greptile also left **1 inline comment** on this PR.

<!-- /greptile_comment -->
Activity feed pages lazy-load post content after tab headers render.
Add wait_for_function check and slower scroll params for /recent-activity/
URLs so posts section returns actual content instead of just tab headers.

Resolves: stickerdaniel#201
…ity-feed-posts-empty

fix(scraping): Wait for activity feed content before extracting
stickerdaniel and others added 27 commits April 6, 2026 10:53
…ump_version_to_4.8.2

chore: Bump version to 4.8.2
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [docker/login-action](https://redirect.github.com/docker/login-action) ([changelog](https://redirect.github.com/docker/login-action/compare/b45d80f862d83dbcd57f89517bcf500b2ab88fb2..4907a6ddec9925e35a0a9e82d7399ccc52663121)) | action | digest | `b45d80f` → `4907a6d` |
| ghcr.io/astral-sh/uv | final | digest | `c4f5de3` → `90bbb3c` |

---

### Configuration

📅 **Schedule**: Branch creation - "before 6am on Monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://redirect.github.com/renovatebot/renovate/discussions) if that's undesired.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/stickerdaniel/linkedin-mcp-server).
…olve_ty_type-checker_errors_blocking_ci

fix: Resolve ty type-checker errors blocking CI
…prove_release_notes_with_direct_download_link_and_changelog_config

docs: Improve release notes with direct download link and changelog config
…tools

- Use --no-editable --no-dev --compile-bytecode in builder stage
- Remove git (no git-based deps exist)
- Remove COPY . /app/ and /app/data from runtime stage
- Bump requires-python to >=3.12,<3.15 (all 371 tests pass on 3.14.3)
- Add Python 3.14 classifier
- Restore README.md in Docker context for clean wheel builds
- Add Renovate rule to block automatic Python Docker tag bumps
…r-security

chore: optimize Dockerfile security and multi-stage build
…ump_version_to_4.8.3

chore: Bump version to 4.8.3
- Remove redundant text.length > 200 guard from details-page wait condition; startsWith checks are sufficient and the length threshold would cause 10s timeouts on legitimately short sections
…aping_wait_for_detail_panel_before_extracting_experience_sections

fix(scraping): wait for detail panel before extracting experience sections
- Fix off-by-one in MID line causing box misalignment with double-width emoji
- Update comment to say Unicode box-drawing instead of ASCII
…dd_dynamic_ascii_download_button_to_release_notes

style: Add dynamic ASCII download button to release notes
…me-status-sync

docs(readme): sync tool status table
- keep only tool-specific issue links in Features & Tool Status
- add send_message issue stickerdaniel#344 and mark unaffected tools as working

Closes stickerdaniel#346
…me-tool-status-sync

docs(readme): sync tool status table
Adds a new get_saved_posts MCP tool that navigates to
https://www.linkedin.com/my-items/saved-posts/ and extracts the
user's bookmarked posts, including author, full text, and post URLs.

Changes:
- linkedin_mcp_server/tools/messaging.py: register get_saved_posts tool
  with readOnlyHint annotation, limit param (1-50), same pattern as get_inbox
- linkedin_mcp_server/scraping/extractor.py: add get_saved_posts method
  on LinkedInExtractor, plus _expand_see_more_buttons and
  _click_show_more_results private helpers for pagination

Resolves the missing saved-posts coverage gap
@greptile-apps
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

Adds get_saved_posts, a new MCP tool that navigates to https://www.linkedin.com/my-items/saved-posts/, scrolls for pagination, expands truncated posts, and returns bookmarked post text plus filtered references. The tool registration in messaging.py closely mirrors get_inbox and looks correct; the main concern is in the extractor.

  • _expand_see_more_buttons is called before the scroll loop, so posts loaded via scrolling or the "show more results" click remain truncated when content is extracted. Moving the expansion call to after all loading steps resolves this.

Confidence Score: 4/5

Safe to merge after fixing the see-more expansion ordering — all other findings are style or minor robustness suggestions.

One P1 logic defect: _expand_see_more_buttons runs before scrolling loads new content, so posts surfaced by scroll/pagination return truncated text. The tool registration and auth error handling are solid. All other findings are P2.

linkedin_mcp_server/scraping/extractor.py — specifically the ordering of _expand_see_more_buttons relative to the scroll and show-more calls.

Important Files Changed

| Filename | Overview |
|---|---|
| linkedin_mcp_server/scraping/extractor.py | Adds get_saved_posts, _expand_see_more_buttons, and _click_show_more_results; expansion runs before scrolling so newly loaded posts remain truncated (P1), and inline import asyncio is redundant. |
| linkedin_mcp_server/tools/messaging.py | Registers the get_saved_posts MCP tool with correct annotations, limit validation, auth error handling, and progress reporting — closely follows the get_inbox pattern. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Tool as get_saved_posts (tool)
    participant Extractor as LinkedInExtractor
    participant Page as LinkedIn Page

    Client->>Tool: call(limit=N)
    Tool->>Extractor: get_saved_posts(limit)
    Extractor->>Page: navigate /my-items/saved-posts/
    Page-->>Extractor: page loaded
    Extractor->>Page: _expand_see_more_buttons() ⚠️ before scroll
    Page-->>Extractor: initial posts expanded
    Extractor->>Page: _scroll_main_scrollable_region(attempts=limit//5)
    Page-->>Extractor: new posts loaded (unexpanded)
    Extractor->>Page: _click_show_more_results() once
    Page-->>Extractor: additional posts loaded (unexpanded)
    Extractor->>Page: _extract_root_content([main])
    Page-->>Extractor: raw text + references
    Extractor->>Extractor: strip_noise, build_references, filter /feed/update/ and /in/
    Extractor-->>Tool: url, sections, references
    Tool-->>Client: result
Prompt To Fix All With AI
This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2232-2241

Comment:
**See-more expansion runs before loading, leaving new posts truncated**

`_expand_see_more_buttons` is called once before the scroll loop, so it only expands posts that were visible on initial page load. Any posts surfaced by `_scroll_main_scrollable_region` or `_click_show_more_results` will still have their content truncated when `_extract_root_content` runs. Moving the expansion call to after all loading steps is complete ensures it covers all loaded posts.

```suggestion
        # Scroll to load more posts up to the limit
        scrolls = max(1, limit // 5)
        await self._scroll_main_scrollable_region(
            position="bottom", attempts=scrolls, pause_time=0.8
        )

        # Click additional "Show more results" buttons if present
        await self._click_show_more_results()

        # Expand truncated posts by clicking "see more" buttons (after all content is loaded)
        await self._expand_see_more_buttons()
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2279-2280

Comment:
**Redundant `import asyncio` inside method bodies**

`asyncio` is already imported at the top of the file (line 5). The inline `import asyncio` statements in both `_expand_see_more_buttons` (here) and `_click_show_more_results` (line 2297) are unnecessary. Python caches a module after its first import, so the repeats are harmless, but they are non-idiomatic and clutter the method body.

```suggestion
            await asyncio.sleep(0.5)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 2284-2300

Comment:
**`_click_show_more_results` fires only once**

For `limit` values above ~10, a single "show more results" click is unlikely to surface enough posts — the page typically loads a fixed batch per click. Combined with the scroll loop (which may not trigger a "show more results" button on its own), users requesting 30–50 posts may see fewer results than asked for with no indication of truncation. Consider looping the click until the button disappears or a target count is met, similar to how scrolling uses `attempts`.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(scraping): add get_saved_posts tool..."

Comment on lines +2232 to +2241
await self._expand_see_more_buttons()

# Scroll to load more posts up to the limit
scrolls = max(1, limit // 5)
await self._scroll_main_scrollable_region(
    position="bottom", attempts=scrolls, pause_time=0.8
)

# Click additional "Show more results" buttons if present
await self._click_show_more_results()

P1 See-more expansion runs before loading, leaving new posts truncated

_expand_see_more_buttons is called once before the scroll loop, so it only expands posts that were visible on initial page load. Any posts surfaced by _scroll_main_scrollable_region or _click_show_more_results will still have their content truncated when _extract_root_content runs. Moving the expansion call to after all loading steps is complete ensures it covers all loaded posts.

Suggested change
- await self._expand_see_more_buttons()
- # Scroll to load more posts up to the limit
- scrolls = max(1, limit // 5)
- await self._scroll_main_scrollable_region(
-     position="bottom", attempts=scrolls, pause_time=0.8
- )
- # Click additional "Show more results" buttons if present
- await self._click_show_more_results()
+ # Scroll to load more posts up to the limit
+ scrolls = max(1, limit // 5)
+ await self._scroll_main_scrollable_region(
+     position="bottom", attempts=scrolls, pause_time=0.8
+ )
+ # Click additional "Show more results" buttons if present
+ await self._click_show_more_results()
+ # Expand truncated posts by clicking "see more" buttons (after all content is loaded)
+ await self._expand_see_more_buttons()
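
To illustrate why the ordering matters, here is a small simulation. It assumes a hypothetical `FakeSavedPostsPage` that stands in for the browser page (five posts initially, five more per load); it does not touch the real extractor or LinkedIn.

```python
import asyncio


class FakeSavedPostsPage:
    """Hypothetical stand-in for the browser page: posts start truncated."""

    def __init__(self, initial: int, per_load: int) -> None:
        self.per_load = per_load
        self.expanded: list[bool] = [False] * initial

    async def load_more(self) -> None:
        # Scrolling or "show more results" appends new, still-truncated posts.
        self.expanded.extend([False] * self.per_load)

    async def expand_see_more(self) -> None:
        # Expands only the posts that exist on the page right now.
        self.expanded = [True] * len(self.expanded)


async def scrape(expand_first: bool, limit: int = 20) -> list[bool]:
    page = FakeSavedPostsPage(initial=5, per_load=5)
    if expand_first:
        await page.expand_see_more()
    for _ in range(max(1, limit // 5)):  # same scroll budget as the PR
        await page.load_more()
    if not expand_first:
        await page.expand_see_more()
    return page.expanded


before = asyncio.run(scrape(expand_first=True))
after = asyncio.run(scrape(expand_first=False))
print(sum(before), "of", len(before), "posts expanded when expanding first")
print(sum(after), "of", len(after), "posts expanded when expanding last")
```

In this model, expanding first leaves every post loaded afterwards truncated, while expanding last covers all of them.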

Comment on lines +2279 to +2280
import asyncio
await asyncio.sleep(0.5)

P2 Redundant import asyncio inside method bodies

asyncio is already imported at the top of the file (line 5). The inline import asyncio statements in both _expand_see_more_buttons (here) and _click_show_more_results (line 2297) are unnecessary. Python caches a module after its first import, so the repeats are harmless, but they are non-idiomatic and clutter the method body.

Suggested change
- import asyncio
- await asyncio.sleep(0.5)
+ await asyncio.sleep(0.5)
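
As a quick check of the claim that a repeated import is harmless but redundant, the following sketch shows that an import inside a function body returns the module object already cached in `sys.modules`. The function name `reimport_inside_function` is hypothetical.

```python
import asyncio
import sys


def reimport_inside_function():
    # Re-importing inside a function is a cache lookup, not a second load.
    import asyncio as inner
    return inner


# Both names refer to the single module object cached in sys.modules.
same_object = reimport_inside_function() is asyncio is sys.modules["asyncio"]
print(same_object)
```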

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +2284 to +2300
async def _click_show_more_results(self) -> None:
    """Click the show-more-results button if present to load additional posts."""
    try:
        await self._page.evaluate(
            """() => {
                const buttons = Array.from(document.querySelectorAll('button'))
                    .filter(btn => {
                        const text = (btn.innerText || btn.textContent || '').trim().toLowerCase();
                        return text.includes('show more') || text.includes('afficher plus');
                    });
                if (buttons.length > 0) buttons[0].click();
            }"""
        )
        import asyncio
        await asyncio.sleep(1.0)
    except Exception:
        pass  # Non-critical

P2 _click_show_more_results fires only once

For limit values above ~10, a single "show more results" click is unlikely to surface enough posts — the page typically loads a fixed batch per click. Combined with the scroll loop (which may not trigger a "show more results" button on its own), users requesting 30–50 posts may see fewer results than asked for with no indication of truncation. Consider looping the click until the button disappears or a target count is met, similar to how scrolling uses attempts.
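
One way the suggested loop could be structured is sketched below. It assumes two hypothetical async callbacks, `click_once` (returns False when the button is gone) and `count_posts`, in place of real page calls, and `FakeFeed` exists only to exercise the loop.

```python
import asyncio
from typing import Awaitable, Callable


async def click_until_done(
    click_once: Callable[[], Awaitable[bool]],
    count_posts: Callable[[], Awaitable[int]],
    target: int,
    max_clicks: int = 10,
    pause: float = 0.0,
) -> int:
    """Click until target posts are loaded, the button is gone, or
    max_clicks is reached. Returns the number of clicks performed."""
    clicks = 0
    while clicks < max_clicks and await count_posts() < target:
        if not await click_once():  # button no longer present
            break
        clicks += 1
        await asyncio.sleep(pause)  # give the page time to render the batch
    return clicks


class FakeFeed:
    """Hypothetical feed that loads a fixed batch of posts per click."""

    def __init__(self, loaded: int, available: int, batch: int) -> None:
        self.loaded, self.available, self.batch = loaded, available, batch

    async def count(self) -> int:
        return self.loaded

    async def click(self) -> bool:
        if self.loaded >= self.available:
            return False  # nothing left, so the button has disappeared
        self.loaded = min(self.available, self.loaded + self.batch)
        return True


feed = FakeFeed(loaded=10, available=37, batch=10)
clicks = asyncio.run(click_until_done(feed.click, feed.count, target=30))
print(clicks, feed.loaded)
```

The `max_clicks` cap bounds the loop even if the button never disappears, which mirrors how the scroll step is bounded by `attempts`.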

