feat(scan): add local-parser provider for portal scans by lejrn · Pull Request #595 · santifer/career-ops

lejrn · 2026-05-07T13:11:05Z

Summary

Rebased onto latest main and refactored the local-parser work to fit the new provider plugin architecture (providers/*.mjs + resolveProvider() in scan.mjs).

Add providers/local-parser.mjs for companies with parser.command + parser.script in portals.yml
Route parser-first; on parser failure, fall back to Ashby/Greenhouse/Lever API providers (same pattern as other providers)
Extend modes/scan.md with Nivel 0 (local parser) and local_parser_ok rules so agent scan skips redundant Playwright/API/WebSearch when Nivel 0 succeeds
Document generic parser contract in templates/portals.example.yml and docs/local-parser-cookbook.md
test-all.mjs contract checks (module exists, API fallback, no bundled runtime parsers)
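
The parser-first routing with API fallback listed above could look roughly like the sketch below. It is a minimal illustration and not the PR's code: the `providers` array, the `matches`/`fetch` method names, and the `fetchWithFallback` helper are hypothetical, and the providers are synchronous here only to keep the example short. Only `resolveProvider` and the `local-parser` id come from the PR, and the real signature may differ.

```javascript
// Hypothetical provider list, ordered by priority (local parser first).
const providers = [
  {
    id: 'local-parser',
    matches: (company) => Boolean(company.parser && company.parser.command),
    // Simulates a parser script that is missing or exits with an error.
    fetch: (company) => { throw new Error(`parser failed for ${company.name}`); },
  },
  {
    id: 'ashby',
    matches: (company) => typeof company.api === 'string',
    fetch: (company) => [{ title: 'Engineer', company: company.name }],
  },
];

// Return the first matching provider whose id is not in skipIds.
function resolveProvider(company, skipIds = new Set()) {
  return providers.find((p) => !skipIds.has(p.id) && p.matches(company)) || null;
}

// Try the preferred provider; if the local parser throws,
// resolve again with 'local-parser' skipped and use that provider instead.
function fetchWithFallback(company) {
  const first = resolveProvider(company);
  if (!first) return { source: null, jobs: [] };
  try {
    return { source: first.id, jobs: first.fetch(company) };
  } catch (err) {
    if (first.id !== 'local-parser') throw err;
    const fallback = resolveProvider(company, new Set(['local-parser']));
    if (!fallback) throw err;
    return { source: fallback.id, jobs: fallback.fetch(company), parserError: err.message };
  }
}

const result = fetchWithFallback({
  name: 'Example',
  parser: { command: 'node', script: 'my-parser.mjs' },
  api: 'https://api.ashbyhq.com/posting-api/job-board/example',
});
console.log(result.source); // prints "ashby"
```

Keeping the original parser error on the result lets the scan summary report the failure without dropping the company.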

Closes #594

PR cleanup (vs earlier revisions)

Rebase: Branch is one commit on current upstream/main (includes the provider refactor from Feature: cold email #593 / scan plugin work).
No bundled Python parsers: Removed scripts/parsers/cohere_jobs.py and other company-specific runtime scripts from this PR (per maintainer direction: runtime-neutral contract only).
Tests/fixtures elsewhere: Cohere/Mobileye fixture parsers, test-local-parser.mjs, test-scan-token-comparison.mjs, and benchmark tables live on fork branch feature/local-parser-integration-tests (lejrn/career-ops) — not in this PR.

Test plan

node test-all.mjs --quick
node --check scan.mjs and node --check providers/local-parser.mjs
Configure a company in portals.yml with parser: pointing at your own script; run node scan.mjs --dry-run --company <Name>
Confirm parser success uses local-parser in scan summary; confirm API fallback when parser script is missing or fails
Agent scan: verify modes/scan.md local_parser_ok behavior (skip Playwright/API for parser-covered companies)

coderabbitai · 2026-05-07T13:11:22Z

📝 Walkthrough

Adds configurable per-company local parser support: new local-parser provider runs configured commands, parses normalized jobs JSON from stdout, integrates with scan pipeline (provider resolution, provenance, fallback), updates docs and templates, adds .gitkeep and .gitignore rules, and includes tests.

Changes

Local Parser Scanning Feature

Layer / File(s)	Summary
Version, gitignore, keepfile, and portals template `VERSION`, `package.json`, `.gitignore`, `data/parser-output/.gitkeep`, `templates/portals.example.yml`	Bump to 1.8.1, ignore generated parser JSON while preserving `.gitkeep`, add `.gitkeep`, and expand portals example docs to describe `scan_method: local_parser` and parser config contract.
scan.mjs provider resolution and fetch flow `scan.mjs`	Make PORTALS path overrideable, add `resolveProvider(skipIds)` and priority for `local-parser`, validate targets, log local-parser counts, compute `sourceName` per-target, retry provider resolution skipping `local-parser` on fetch errors, and write provenance using `sourceName`.
providers/local-parser implementation `providers/local-parser.mjs`	New provider that runs configured parser command via `execFile` with placeholder-expanded args, parses stdout JSON (array
User docs and cookbook `modes/scan.md`, `docs/SCRIPTS.md`, `docs/local-parser-cookbook.md`	Document Nivel 0 local-parser in scanning workflow, `portals.yml` parser contract and placeholder rules, stdout JSON shapes and normalization, artifact storage under `data/parser-output/{company}/` with `.gitkeep`, failure/fallback behavior, and guidance to skip redundant agent/API passes when local parser succeeds.
Test-all updates for local parser contract `test-all.mjs`	Add a LOCAL PARSER CONTRACT test section asserting scan/provider/docs contain expected local-parser guards, fallback behavior, provider module presence, and that portals example documents a generic parser (no bundled vendor script); renumber following sections.

Sequence Diagram(s)

(omitted — changes are primarily localized provider addition, docs, and wiring involving two main runtime components)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

#339 — Contains the same objective to add local-parser support and matches the implemented acceptance criteria (parser stdout shapes, fallback behavior, relative URL resolution).

Possibly related PRs

santifer/career-ops#602: Related changes touching provider resolution and scan provenance handling.

Suggested labels

🔴 core-architecture

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	All coding requirements from issue `#594` are met: local parser execution without shell interpolation [scan.mjs, providers/local-parser.mjs], JSON stdout parsing [providers/local-parser.mjs], relative URL resolution [providers/local-parser.mjs], failure reporting without abort [scan.mjs], artifact storage/ignore configuration [.gitignore, .gitkeep], and comprehensive documentation [modes/scan.md, templates/portals.example.yml, docs/local-parser-cookbook.md].
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#594` scope: local parser support, configuration examples, documentation, and version bumps. No unrelated modifications detected.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: adding a local-parser provider for portal scans, which is the primary feature introduced across scan.mjs, documentation, configuration templates, and the new local-parser provider module.


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cohere-scan-token-comparison-pr.md`:
- Around line 1-13: Add a top-level H1 heading to the report and insert the
required header fields by adding lines beginning with "**URL:**" and
"**Legitimacy:**" directly under the H1 (e.g., "# What does this PR do?" becomes
"# <title>" followed by "**URL:** <value>" and "**Legitimacy:** <value>").
Ensure the header block appears at the very top of the document before the
existing "## What does this PR do?" content and format the two fields exactly as
"**URL:**" and "**Legitimacy:**".

In `@modes/scan.md`:
- Line 61: Update the documentation in modes/scan.md to clarify that parser
artifacts should not be checked into git (store them under
data/parser-output/{company}/ and add them to .gitignore) but that any .gitkeep
placeholder files used to preserve the directory structure must remain in the
repo; explicitly state that the scanner does not need to retain full JSON after
stdout is read and that only .gitkeep files are an exception to the "keep out of
git" rule.
- Around line 53-59: Update the parser contract example in modes/scan.md to list
all three accepted JSON stdout formats: the array format ([{...}]), the
object-with-jobs format ({ "jobs": [{...}] }), and the object-with-results
format ({ "results": [{...}] }); show each format as a separate JSON example and
keep the note that company is optional and scan.mjs will fall back to
tracked_companies when company is missing so parser implementers know all
allowed shapes to support.
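
To make the three accepted stdout shapes concrete, here is a minimal sketch of how a consumer might normalize them. The function name `parseParserStdout`, the `fallbackCompany` parameter, and the `title`/`url` fields are assumptions made for illustration; the actual normalization in the PR may differ.

```javascript
// Accepts raw stdout text and returns a flat array of job objects.
// Supported shapes: [ {...} ], { "jobs": [ {...} ] }, { "results": [ {...} ] }.
function parseParserStdout(stdout, fallbackCompany) {
  const data = JSON.parse(stdout); // throws SyntaxError on invalid JSON

  let jobs;
  if (Array.isArray(data)) jobs = data;
  else if (data && Array.isArray(data.jobs)) jobs = data.jobs;
  else if (data && Array.isArray(data.results)) jobs = data.results;
  else throw new Error('Unsupported parser output shape');

  // `company` is optional in parser output; fill it from the tracked company name.
  return jobs.map((job) => ({ ...job, company: job.company || fallbackCompany }));
}

const sampleJob = { title: 'ML Engineer', url: 'https://example.com/jobs/1' };
const fromArray = parseParserStdout(JSON.stringify([sampleJob]), 'Example');
const fromJobs = parseParserStdout(JSON.stringify({ jobs: [sampleJob] }), 'Example');
const fromResults = parseParserStdout(JSON.stringify({ results: [sampleJob] }), 'Example');
console.log(fromArray.length, fromJobs.length, fromResults.length);
```

Rejecting any other shape with an error, rather than returning an empty list, keeps a malformed parser from looking like a company with no openings.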

In `@scan.mjs`:
- Around line 416-423: The local-parser branch should fall back to the API
parser when runLocalParser fails; wrap the call to runLocalParser(company) in a
try/catch (or check its failure result) and on error log the failure and then
set jobs by fetching the API JSON (using fetchJson(source.url)) and invoking
PARSERS[source.type](json, company.name), while still setting sourceName to
`${source.type}-api`; ensure you reference runLocalParser, fetchJson and PARSERS
and handle missing source.url or unknown source.type errors similarly so the
code degrades to the API scan rather than dropping the company.

In `@scripts/parsers/cohere_jobs.py`:
- Around line 28-47: The code currently trusts any source_url before calling
urlopen; update ashby_api_url (and callers like fetch_jobs) to enforce an
HTTPS-only allowlist for Ashby hosts: parse the URL and require parsed.scheme ==
"https" and parsed.netloc be one of the approved Ashby hostnames (e.g.,
"api.ashbyhq.com" and any official public job-board hostnames your product
supports); if the host is "api.ashbyhq.com" return the original URL, otherwise
validate the host is in the allowlist before extracting board_slug and
constructing the API URL, and raise ValueError for disallowed schemes/hosts to
prevent opening arbitrary URLs.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2c029fd-9f87-48c4-8489-3f8d62666d31

📥 Commits

Reviewing files that changed from the base of the PR and between 8e554cc and fceaf4a.

📒 Files selected for processing (9)

.gitignore
data/parser-output/.gitkeep
data/parser-output/cohere/.gitkeep
docs/SCRIPTS.md
docs/cohere-scan-token-comparison-pr.md
modes/scan.md
scan.mjs
scripts/parsers/cohere_jobs.py
templates/portals.example.yml

coderabbitai · 2026-05-07T13:14:58Z

+## What does this PR do?
+
+Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:
+
+- Playwright-rendered scraping of the Cohere Ashby board.
+- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.
+
+The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.
+
+## Related issue
+
+N/A - measurement and documentation artifact.
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add required report header fields and top-level heading.

This report should start with an H1 and include **URL:** and **Legitimacy:** in the header block.

Proposed patch

-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines: "All reports must include URL and Legitimacy fields in the header — format as URL: and Legitimacy: in the report".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

## What does this PR do?

Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:

- Playwright-rendered scraping of the Cohere Ashby board.

- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.

The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.

## Related issue

N/A - measurement and documentation artifact.

# Cohere scan token comparison (PR artifact)

**URL:** https://github.com/santifer/career-ops/pull/595

**Legitimacy:** Internal PR measurement artifact (non-production benchmark)

## What does this PR do?

Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:

- Playwright-rendered scraping of the Cohere Ashby board.

- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.

The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.

## Related issue

N/A - measurement and documentation artifact.

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/cohere-scan-token-comparison-pr.md` around lines 1 - 13, Add a top-level H1 heading to the report and insert the required header fields by adding lines beginning with "**URL:**" and "**Legitimacy:**" directly under the H1 (e.g., "# What does this PR do?" becomes "# <title>" followed by "**URL:** <value>" and "**Legitimacy:** <value>"). Ensure the header block appears at the very top of the document before the existing "## What does this PR do?" content and format the two fields exactly as "**URL:**" and "**Legitimacy:**".

coderabbitai · 2026-05-07T13:14:59Z

+def ashby_api_url(source_url: str) -> str:
+    parsed = urlparse(source_url)
+    if parsed.netloc == "api.ashbyhq.com":
+        return source_url
+
+    board_slug = parsed.path.strip("/").split("/")[0]
+    if not board_slug:
+        raise ValueError(f"Cannot infer Ashby board slug from {source_url}")
+
+    return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"
+
+
+def fetch_jobs(source_url: str) -> list[dict]:
+    req = urllib.request.Request(
+        ashby_api_url(source_url),
+        headers={"User-Agent": "Mozilla/5.0"},
+    )
+    with urllib.request.urlopen(req, timeout=30) as resp:
+        data = json.loads(resp.read())
+    return data["jobs"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict source URL to HTTPS Ashby hosts before opening it.

source_url can currently pass through with insufficient scheme/host validation before urlopen. Add strict allowlisting to prevent unintended scheme/endpoint usage.

Proposed patch

 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme not in {"http", "https"}:
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if parsed.scheme != "https":
+            raise ValueError("Ashby API URL must use https")
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")

     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

     return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 35-35: Avoid specifying long messages outside the exception class

(TRY003)

[error] 41-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

[error] 45-45: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parsers/cohere_jobs.py` around lines 28 - 47, The code currently trusts any source_url before calling urlopen; update ashby_api_url (and callers like fetch_jobs) to enforce an HTTPS-only allowlist for Ashby hosts: parse the URL and require parsed.scheme == "https" and parsed.netloc be one of the approved Ashby hostnames (e.g., "api.ashbyhq.com" and any official public job-board hostnames your product supports); if the host is "api.ashbyhq.com" return the original URL, otherwise validate the host is in the allowlist before extracting board_slug and constructing the API URL, and raise ValueError for disallowed schemes/hosts to prevent opening arbitrary URLs.
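
For readers who keep their parser on the Node side, the same allowlist idea can be sketched in JavaScript. This is a hedged analogue of the Python suggestion and not code from the PR: the function name `ashbyApiUrl` is hypothetical, and the two hostnames are taken from the proposed patch above.

```javascript
// Validate a source URL and return the Ashby posting API URL for it.
// Throws for non-HTTPS schemes, unknown hosts, or unexpected API paths.
function ashbyApiUrl(sourceUrl) {
  const url = new URL(sourceUrl); // throws TypeError on malformed input

  if (url.protocol !== 'https:') {
    throw new Error(`Unsupported URL scheme: ${url.protocol}`);
  }

  if (url.host === 'api.ashbyhq.com') {
    if (!url.pathname.startsWith('/posting-api/job-board/')) {
      throw new Error('Unexpected Ashby API endpoint');
    }
    return url.href;
  }

  if (url.host !== 'jobs.ashbyhq.com') {
    throw new Error(`Unsupported host for source URL: ${url.host}`);
  }

  // The first non-empty path segment of a board page is treated as the board slug.
  const boardSlug = url.pathname.split('/').filter(Boolean)[0];
  if (!boardSlug) {
    throw new Error(`Cannot infer Ashby board slug from ${sourceUrl}`);
  }
  return `https://api.ashbyhq.com/posting-api/job-board/${boardSlug}`;
}

console.log(ashbyApiUrl('https://jobs.ashbyhq.com/cohere'));
```

Comparing `url.host` rather than `url.hostname` also rejects URLs that carry a non-default port, which is the stricter of the two options.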

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (4)

modes/scan.md (1)

61-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify .gitkeep is the exception to “keep artifacts out of git.”

This line should explicitly preserve .gitkeep placeholders while excluding generated JSON artifacts.

Suggested patch

-El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git.
+El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git (los JSON en `.gitignore`; los `.gitkeep` se mantienen en git para conservar la estructura).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modes/scan.md` at line 61, Update the sentence in modes/scan.md to explicitly
state that generated parser artifacts (e.g., JSON in
data/parser-output/{company}/) must be kept out of git but that .gitkeep files
are the allowed exception and should be committed; also ensure the repository
.gitignore excludes the generated JSON artifacts in data/parser-output/ while
not ignoring .gitkeep so the placeholder stays in source control.

scripts/parsers/cohere_jobs.py (1)

28-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict parser source URL to HTTPS Ashby allowlist before fetch.

ashby_api_url still allows unsafe/unexpected URLs through to urlopen when netloc is api.ashbyhq.com (scheme/path are not validated). Enforce https and allowed Ashby hosts/paths before issuing the request.

Suggested patch

 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme != "https":
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")

     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parsers/cohere_jobs.py` around lines 28 - 47, The ashby_api_url
function currently lets unsafe URLs reach urlopen because it only checks netloc;
update ashby_api_url and its usage in fetch_jobs to enforce an HTTPS allowlist:
require parsed.scheme == "https" and parsed.netloc in the allowed set (at
minimum "api.ashbyhq.com"), and if parsed.netloc == "api.ashbyhq.com" validate
the path prefix (e.g., it must start with "/posting-api/" or the expected
posting path) before returning source_url; otherwise, when constructing the API
URL from a board page (the existing board_slug path flow), always build an
explicit "https://api.ashbyhq.com/posting-api/job-board/{board_slug}" and raise
ValueError for any non-https scheme, unknown host, or malformed path so that
fetch_jobs only ever calls urlopen with a validated HTTPS Ashby API URL.

docs/cohere-scan-token-comparison-pr.md (1)

1-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add required report header block at the top (H1 + URL + Legitimacy).

This report is missing the required header fields and top-level heading format.

Suggested patch

-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines, "All reports must include URL and Legitimacy fields in the header — format as URL: and Legitimacy: in the report".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cohere-scan-token-comparison-pr.md` around lines 1 - 13, Add the
required top header block to this report: insert an H1 title line at the very
top (e.g., "# Cohere scan token comparison"), then add "URL:" followed by the
report's source URL and "Legitimacy:" followed by the legitimacy statement
(e.g., "Legitimacy: internal measurement/dry-run") as plain lines immediately
under the H1; update the existing "## What does this PR do?" section to follow
that header so the document begins with the mandated H1 + URL + Legitimacy
fields.

scan.mjs (1)

416-423: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fall back to the ATS API when the local parser fails.

If runLocalParser(company) throws on Lines 416-418, the outer catch records the error and drops that company entirely. This still misses the PR objective of running the local parser before falling back to Greenhouse/Ashby/Lever.

Suggested fix

       if (source.kind === 'local-parser') {
-        jobs = await runLocalParser(company);
-        sourceName = 'local-parser';
+        try {
+          jobs = await runLocalParser(company);
+          sourceName = 'local-parser';
+        } catch (parserErr) {
+          const apiFallback = detectApi(company);
+          if (!apiFallback) throw parserErr;
+
+          const json = await fetchJson(apiFallback.url);
+          jobs = PARSERS[apiFallback.type](json, company.name);
+          sourceName = `${apiFallback.type}-api`;
+          errors.push({
+            company: company.name,
+            error: `local parser failed, used API fallback: ${parserErr.message}`,
+          });
+        }
       } else {
         const json = await fetchJson(source.url);
         jobs = PARSERS[source.type](json, company.name);
         sourceName = `${source.type}-api`;
       }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan.mjs` around lines 416 - 423, The current branch that calls
runLocalParser(company) can throw and abort processing the company; wrap the
runLocalParser call in its own try/catch so that on error you log the
local-parser failure but then fall back to the ATS API path: call
fetchJson(source.url), set jobs = PARSERS[source.type](json, company.name) and
set sourceName = `${source.type}-api`; ensure you still surface/log the original
local-parser error but continue processing rather than letting the outer catch
drop the company. Use the existing symbols runLocalParser, fetchJson, PARSERS,
source.kind, source.url, source.type, jobs and sourceName to locate and
implement the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan.mjs`:
- Around line 383-389: The filtering assumes c.name is always a string and calls
toLowerCase(), which can throw for malformed entries; update the
selectedCompanies/filter logic to guard c.name (e.g., treat non-string names as
empty string or skip those entries) before calling toLowerCase and ensure
filterCompany is normalized (e.g., default to empty string) so a bad
tracked_companies item doesn't abort the scan; apply this guard where
selectedCompanies is built (the companies.filter(...) and subsequent .filter(c
=> c.name.toLowerCase().includes(filterCompany)) usage) and also ensure
downstream code using detectSource still receives well-formed company objects or
skips null/invalid ones.
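
A minimal sketch of the guard described above, assuming a hypothetical helper `selectCompanies`; the real filtering in `scan.mjs` is structured differently, and only the `c.name` check and the `filterCompany` name come from the review comment.

```javascript
// Keep only entries with a usable string name, then apply the optional name filter.
function selectCompanies(companies, filterCompany) {
  // Normalize the filter so a missing or non-string value means "no filter".
  const needle = typeof filterCompany === 'string' ? filterCompany.trim().toLowerCase() : '';

  return companies
    // Skip null entries and entries whose name is not a non-empty string.
    .filter((c) => c && typeof c.name === 'string' && c.name.trim() !== '')
    .filter((c) => needle === '' || c.name.toLowerCase().includes(needle));
}

const tracked = [{ name: 'Cohere' }, { name: 42 }, null, { name: '   ' }, { name: 'Mobileye' }];
console.log(selectCompanies(tracked, 'co').map((c) => c.name)); // prints [ 'Cohere' ]
```

Skipping malformed entries, instead of throwing, means one bad `tracked_companies` item does not abort the whole scan.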



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a4ae2065-d06c-46fa-822c-af3c04da5e0d

📥 Commits

Reviewing files that changed from the base of the PR and between fceaf4a and 08ea970.

📒 Files selected for processing (9)

.gitignore
data/parser-output/.gitkeep
data/parser-output/cohere/.gitkeep
docs/SCRIPTS.md
docs/cohere-scan-token-comparison-pr.md
modes/scan.md
scan.mjs
scripts/parsers/cohere_jobs.py
templates/portals.example.yml

santifer · 2026-05-15T16:32:41Z

Hey @lejrn — local parser source is a useful extension to the scanner. Two questions before merging:

CONFLICTING with main: needs rebase onto current.
Python dependency: the PR adds scripts/parsers/cohere_jobs.py, which introduces a Python file into a JS-only repo. I'd rather not add Python as a runtime requirement for career-ops users (many install via Claude Code with only Node + Go).

Could the Cohere example live elsewhere (your own repo, a docs/cookbook page, etc.) and the PR ship just the local_parser contract (the JS-side scan.mjs changes + portals.example.yml schema + docs)? Users who want a Python parser write their own and point at it.

The contract itself (stdout JSON spec, before-fallback execution) is clean. Just want to keep the runtime minimal.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modes/scan.md`:
- Around line 116-120: The priority list is ambiguous about runtime context;
update the section so it explicitly states this is the agent-mode execution
order (e.g., title: "Agent-mode execution priority") or split into two
subsections "Agent-mode execution order" and "Scanner/runtime (scan.mjs)
execution notes" and move the numbered levels (Local parser, Playwright, API,
WebSearch) under the agent-mode heading; ensure you reference scan.mjs in the
scanner/runtime subsection to clarify the runtime ordering differences.

In `@scan.mjs`:
- Around line 197-198: Validate and sanitize parser.timeout_ms and
parser.max_buffer_bytes before calling execFile: ensure the derived timeout and
maxBuffer are finite numbers greater than zero (use Number(...) then
isFinite(...) and >0 checks) and fall back to LOCAL_PARSER_TIMEOUT_MS and
LOCAL_PARSER_MAX_BUFFER_BYTES when invalid; update the variables timeout and
maxBuffer (used when invoking execFile) to these sanitized values so execFile
never receives NaN, non-finite, or non-positive arguments.

In `@test-all.mjs`:
- Around line 287-306: The tests currently use literal string includes against
scan.mjs which is brittle; replace them with resilient pattern/AST-based checks:
for the company-normalization contract, parse scan.mjs (scanScript) and assert a
typeof guard on c.name (e.g., a UnaryExpression or BinaryExpression checking
typeof c.name === 'string') and that subsequent MemberExpressions call trim()
and toLowerCase() on c.name (or its safe variant) rather than matching exact
source text; for the parser-fallback contract, detect a call expression to
detectApi(company) and verify a control-flow path or adjacent statements that
log "local parser failed, used API fallback" and call fetchJson(apiFallback.url)
(use regex or AST traversal to ensure the sequence/structure exists, not raw
includes).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 529d6682-f1a7-4a5d-a898-a2098247722c

📥 Commits

Reviewing files that changed from the base of the PR and between 08ea970 and 4ebd978.

📒 Files selected for processing (10)

.gitignore
VERSION
data/parser-output/.gitkeep
docs/SCRIPTS.md
docs/local-parser-cookbook.md
modes/scan.md
package.json
scan.mjs
templates/portals.example.yml
test-all.mjs

coderabbitai · 2026-05-17T19:47:08Z

+  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
+  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate parser timeout_ms and max_buffer_bytes before passing to execFile.

Line 197 and Line 198 accept non-finite/invalid values (NaN, <=0) from config, which can turn a simple config typo into a runtime failure for that company scan path.

Proposed fix

-  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
-  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+  const requestedTimeout = Number(parser.timeout_ms);
+  const timeout =
+    Number.isFinite(requestedTimeout) && requestedTimeout > 0
+      ? Math.trunc(requestedTimeout)
+      : LOCAL_PARSER_TIMEOUT_MS;
+
+  const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+  const maxBuffer =
+    Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+      ? Math.trunc(requestedMaxBuffer)
+      : LOCAL_PARSER_MAX_BUFFER_BYTES;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scan.mjs` around lines 197 - 198, Validate and sanitize parser.timeout_ms and parser.max_buffer_bytes before calling execFile: ensure the derived timeout and maxBuffer are finite numbers greater than zero (use Number(...) then isFinite(...) and >0 checks) and fall back to LOCAL_PARSER_TIMEOUT_MS and LOCAL_PARSER_MAX_BUFFER_BYTES when invalid; update the variables timeout and maxBuffer (used when invoking execFile) to these sanitized values so execFile never receives NaN, non-finite, or non-positive arguments.
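To illustrate why the `||` fallback alone is not enough, here is a minimal sketch of the sanitizing approach. The helper name `positiveIntOr` and the two default values are assumptions made for this example; they are not taken from `scan.mjs`. Unlike the proposed diff, this variant truncates before the `> 0` check, so a fractional value such as `0.5` also falls back to the default.

```javascript
// Hypothetical defaults for illustration; the real constants live in scan.mjs.
const LOCAL_PARSER_TIMEOUT_MS = 30_000;
const LOCAL_PARSER_MAX_BUFFER_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: return a positive integer, or the fallback when the
// configured value is missing, non-numeric, non-finite, or truncates to <= 0.
function positiveIntOr(value, fallback) {
  const n = Math.trunc(Number(value));
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// The original pattern lets a config typo through as NaN,
// because the non-empty string '30s' is truthy.
const naive = Number('30s' || LOCAL_PARSER_TIMEOUT_MS);

// A config entry with one typo and one non-positive value.
const parser = { timeout_ms: '30s', max_buffer_bytes: -1 };
const timeout = positiveIntOr(parser.timeout_ms, LOCAL_PARSER_TIMEOUT_MS);
const maxBuffer = positiveIntOr(parser.max_buffer_bytes, LOCAL_PARSER_MAX_BUFFER_BYTES);

console.log({ naive, timeout, maxBuffer });
```

Both invalid values resolve to the defaults here, while the naive expression evaluates to `NaN`.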

coderabbitai · 2026-05-17T19:47:08Z

+const scanScript = readFile('scan.mjs');
+if (
+  scanScript.includes("typeof c.name === 'string'") &&
+  scanScript.includes('c.name.trim()') &&
+  scanScript.includes('c.name.toLowerCase()')
+) {
+  pass('scan.mjs guards company names before filtering');
+} else {
+  fail('scan.mjs does not guard company names before filtering');
+}
+
+if (
+  scanScript.includes('detectApi(company)') &&
+  scanScript.includes('local parser failed, used API fallback') &&
+  scanScript.includes('fetchJson(apiFallback.url)')
+) {
+  pass('scan.mjs falls back to ATS API when local parser fails');
+} else {
+  fail('scan.mjs does not fall back to ATS API when local parser fails');
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace literal includes(...) contract checks with resilient structure checks.

These assertions are currently tied to exact source strings, so they can fail on harmless refactors and pass even when behavior regresses (e.g., target text in comments). Prefer regex/AST-shape checks for guarded company normalization and parser-fallback flow to keep this contract test reliable.

Suggested direction

-if (
-  scanScript.includes("typeof c.name === 'string'") &&
-  scanScript.includes('c.name.trim()') &&
-  scanScript.includes('c.name.toLowerCase()')
-) {
+if (
+  /typeof\s+\w+\.name\s*===\s*['"]string['"]/.test(scanScript) &&
+  /\.\s*name\s*\.\s*trim\s*\(/.test(scanScript) &&
+  /\.\s*name\s*\.\s*toLowerCase\s*\(/.test(scanScript)
+) {
   pass('scan.mjs guards company names before filtering');
 } else {
   fail('scan.mjs does not guard company names before filtering');
 }

-if (
-  scanScript.includes('detectApi(company)') &&
-  scanScript.includes('local parser failed, used API fallback') &&
-  scanScript.includes('fetchJson(apiFallback.url)')
-) {
+if (
+  /detectApi\s*\(\s*company\s*\)/.test(scanScript) &&
+  /local parser failed,\s*used API fallback/.test(scanScript) &&
+  /fetchJson\s*\(\s*apiFallback\.url\s*\)/.test(scanScript)
+) {
   pass('scan.mjs falls back to ATS API when local parser fails');
 } else {
   fail('scan.mjs does not fall back to ATS API when local parser fails');
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test-all.mjs` around lines 287 - 306, The tests currently use literal string includes against scan.mjs which is brittle; replace them with resilient pattern/AST-based checks: for the company-normalization contract, parse scan.mjs (scanScript) and assert a typeof guard on c.name (e.g., a UnaryExpression or BinaryExpression checking typeof c.name === 'string') and that subsequent MemberExpressions call trim() and toLowerCase() on c.name (or its safe variant) rather than matching exact source text; for the parser-fallback contract, detect a call expression to detectApi(company) and verify a control-flow path or adjacent statements that log "local parser failed, used API fallback" and call fetchJson(apiFallback.url) (use regex or AST traversal to ensure the sequence/structure exists, not raw includes).
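The difference between the two styles of check can be sketched as follows. The `stripComments` and `guardsCompanyName` helpers and both sample sources are hypothetical, written only for this illustration; the comment stripper is deliberately naive (it does not understand string literals) and is no substitute for a real parser.

```javascript
// Naive, hypothetical helper: drop /* block */ and // line comments.
// It does not understand string literals, so treat it as a sketch only.
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '')
    .replace(/\/\/.*$/gm, '');
}

// Shape check: tolerant of whitespace and of the variable name before `.name`.
function guardsCompanyName(source) {
  const code = stripComments(source);
  return (
    /typeof\s+\w+\.name\s*===\s*['"]string['"]/.test(code) &&
    /\.name\s*\.\s*trim\s*\(/.test(code) &&
    /\.name\s*\.\s*toLowerCase\s*\(/.test(code)
  );
}

// The guard survives only in a comment, yet a literal includes() still passes.
const regressed = `
// typeof c.name === 'string' ... c.name.trim() ... c.name.toLowerCase()
const names = companies.map((c) => c.name);
`;

// A harmless refactor (renamed variable, extra spaces, double quotes)
// breaks the literal includes() check but keeps the behavior.
const refactored = `
const ok = companies.filter((entry) => typeof entry.name  ===  "string" && entry.name.trim());
const names = ok.map((entry) => entry.name.toLowerCase());
`;

const literal = "typeof c.name === 'string'";
console.log({
  regressedLiteral: regressed.includes(literal),
  regressedShape: guardsCompanyName(regressed),
  refactoredLiteral: refactored.includes(literal),
  refactoredShape: guardsCompanyName(refactored),
});
```

The literal check reports a false pass on the regressed source and a false failure on the refactored one; the shape check gets both right. An AST-based check would be stricter still, at the cost of a parser dependency.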

lejrn · 2026-05-17T19:50:08Z

Hey @santifer, thank you for your code review and comments!
I followed your guidance and made these changes:

Rebase: Done onto current main.
Python: Removed bundled example parser; PR is contract + scanner + docs only; parsers live outside the repo or user-written.
Docs: docs/local-parser-cookbook.md plus commented JS/Python examples under templates/portals.example.yml; token savings shown in a small comparison table (generic wording).
Scanner: Local parser still runs first; on failure we fall back to Greenhouse/Ashby/Lever API when detectable; malformed tracked_companies entries no longer crash filtering.

Forgive me if I made a mistake, and please let me know what to fix, if needed.

Thank you!

Introduce providers/local-parser.mjs for executable parsers in portals.yml, with scan.mjs routing parser-first and ATS API fallback. Extend modes/scan.md with Nivel 0 contract and local_parser_ok to skip redundant Playwright/API work. Add generic portals.example.yml docs, local-parser-cookbook.md, and test-all contract checks. Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (1)

providers/local-parser.mjs (1)

79-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sanitize parser numeric limits before calling execFile.

timeout_ms / max_buffer_bytes are not validated for finite positive numbers. Invalid values can propagate directly to execFile options.

Proposed fix

-  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
-  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+  const requestedTimeout = Number(parser.timeout_ms);
+  const timeout =
+    Number.isFinite(requestedTimeout) && requestedTimeout > 0
+      ? Math.trunc(requestedTimeout)
+      : LOCAL_PARSER_TIMEOUT_MS;
+
+  const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+  const maxBuffer =
+    Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+      ? Math.trunc(requestedMaxBuffer)
+      : LOCAL_PARSER_MAX_BUFFER_BYTES;

#!/bin/bash
rg -n "timeout_ms|max_buffer_bytes|execFileAsync|Number\\.isFinite" providers/local-parser.mjs -C2

As per coding guidelines, **/*.mjs: Ensure scripts handle missing data/ directories gracefully.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/local-parser.mjs` around lines 79 - 80, The parsed numeric limits
timeout and maxBuffer are not validated before being passed to execFileAsync;
update the code around the expressions that compute timeout (using
parser.timeout_ms and LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using
parser.max_buffer_bytes and LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers,
check Number.isFinite(...) and that values are > 0 (and integer where
appropriate), and fall back to the corresponding LOCAL_PARSER_* defaults when
invalid; ensure the validated variables used by execFileAsync are safe positive
finite numbers to prevent passing bad options to execFile.
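To show where the sanitized limits end up, here is a rough sketch of a shell-free parser invocation with placeholder expansion. The helpers `expandArgs` and `runParser` are hypothetical, and the sketch uses the synchronous `execFileSync` for brevity, whereas the review refers to `execFileAsync` in `providers/local-parser.mjs`. The inline `-e` program stands in for a real parser script path.

```javascript
import { execFileSync } from 'node:child_process';

// Hypothetical helper: substitute {careers_url} and {company} in each argument.
function expandArgs(args, entry) {
  return args.map((arg) =>
    String(arg)
      .replaceAll('{careers_url}', entry.careers_url ?? '')
      .replaceAll('{company}', entry.name ?? ''),
  );
}

// Hypothetical runner: execFileSync spawns the command directly (no shell),
// so argument values are never interpreted as shell syntax.
function runParser(entry, { timeout, maxBuffer }) {
  const { command, script, args = [] } = entry.parser;
  const stdout = execFileSync(command, [script, ...expandArgs(args, entry)], {
    timeout, // milliseconds; must be a positive finite number
    maxBuffer, // bytes allowed on stdout/stderr before the child is killed
    encoding: 'utf8',
  });
  return JSON.parse(stdout);
}

// Demo: use the current Node binary with an inline program in place of a
// real parser script. The program echoes its last argument as a job title.
const demoEntry = {
  name: 'Acme',
  careers_url: 'https://example.com/careers',
  parser: {
    command: process.execPath,
    script: '-e',
    args: ['console.log(JSON.stringify([{ title: process.argv.at(-1) }]))', '{company}'],
  },
};

const demoJobs = runParser(demoEntry, { timeout: 10_000, maxBuffer: 1024 * 1024 });
console.log(demoJobs);
```

If `timeout` or `maxBuffer` were `NaN` or negative at this point, the failure would surface inside the child-process call rather than at config load time, which is the situation the review comment is trying to avoid.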

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-parser-cookbook.md`:
- Line 30: Replace the Spanish words used in the sentence that references agent
scan mode and scan behavior so the doc uses consistent English: change "niveles"
to "levels" and "Nivel 0" to "Level 0" in the sentence that mentions
`/career-ops scan`, Playwright and API payloads, and the `local_parser_ok`
behavior (also referenced alongside `modes/scan.md`) so the doc consistently
reads "levels" / "Level 0".

In `@modes/scan.md`:
- Around line 156-170: The ordered list numbering in modes/scan.md uses explicit
numbers like "3.5." and "4." which triggers markdownlint MD029; update the
ordered list items (e.g., the heading "3.5. **Nivel 0 — Local parser**", its
sub-steps a–g, and the subsequent "4. **Nivel 1 — Playwright scan**" heading) to
use the auto-numbering style by replacing explicit numerals with "1." for each
list item (and any nested ordered lists) so markdown auto-numbers them
consistently and satisfies MD029.

In `@scan.mjs`:
- Around line 88-92: In resolveProvider, the explicit-provider branch ignores
skipIds: when entry.provider is set you currently immediately fetch from
providers and return, which can reselect a skipped provider (e.g., local-parser)
and prevent API fallback; change the branch so it first checks skipIds (e.g.,
skipIds.has(entry.id) or equivalent) and if the entry id is skipped then do not
return the explicit provider but fall through to the normal selection/fallback
path; otherwise, keep the existing providers.get(entry.provider) lookup and
return provider or unknown-provider error. Ensure you reference resolveProvider,
entry.provider, providers, and skipIds in the change.

In `@test-all.mjs`:
- Around line 331-341: The test currently calls
readFile('templates/portals.example.yml') unguarded which will throw if the file
is absent; update the test in test-all.mjs to first call
fileExists('templates/portals.example.yml') and if it returns false call
fail(...) with a clear message that the portals example is missing, otherwise
read the file into portalExample and run the existing includes checks and
pass/fail logic; reference the readFile usage and the portalExample/pass/fail
block to apply the guard.

---

Duplicate comments:
In `@providers/local-parser.mjs`:
- Around line 79-80: The parsed numeric limits timeout and maxBuffer are not
validated before being passed to execFileAsync; update the code around the
expressions that compute timeout (using parser.timeout_ms and
LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using parser.max_buffer_bytes and
LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers, check Number.isFinite(...)
and that values are > 0 (and integer where appropriate), and fall back to the
corresponding LOCAL_PARSER_* defaults when invalid; ensure the validated
variables used by execFileAsync are safe positive finite numbers to prevent
passing bad options to execFile.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 32e27756-5881-4c43-b7cd-929cc6efd41f

📥 Commits

Reviewing files that changed from the base of the PR and between 4ebd978 and 98fccce.

📒 Files selected for processing (11)

.gitignore
VERSION
data/parser-output/.gitkeep
docs/SCRIPTS.md
docs/local-parser-cookbook.md
modes/scan.md
package.json
providers/local-parser.mjs
scan.mjs
templates/portals.example.yml
test-all.mjs

coderabbitai · 2026-05-18T00:51:58Z

+
+`scan.mjs` uses **0 LLM tokens** for discovery: parsers run locally and only normalized job rows enter the pipeline.
+
+In agent scan mode (`/career-ops scan`), Playwright and API niveles send large page or JSON payloads into the model. When Nivel 0 succeeds, `modes/scan.md` requires skipping those niveles for the same company (`local_parser_ok`).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use consistent English terminology in this English doc.

Line 30 mixes Spanish (niveles, Nivel) with English text. Replace with levels / Level 0 for consistency and readability.

🧰 Tools

🪛 LanguageTool

[grammar] ~30-~30: Ensure spelling is correct
Context: .../career-ops scan), Playwright and API niveles send large page or JSON payloads into t...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~30-~30: Ensure spelling is correct
Context: ...e or JSON payloads into the model. When Nivel 0 succeeds, modes/scan.md requires sk...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~30-~30: Ensure spelling is correct
Context: ...modes/scan.md requires skipping those niveles for the same company (local_parser_ok...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/local-parser-cookbook.md` at line 30, Replace the Spanish words used in the sentence that references agent scan mode and scan behavior so the doc uses consistent English: change "niveles" to "levels" and "Nivel 0" to "Level 0" in the sentence that mentions `/career-ops scan`, Playwright and API payloads, and the `local_parser_ok` behavior (also referenced alongside `modes/scan.md`) so the doc consistently reads "levels" / "Level 0".

coderabbitai · 2026-05-18T00:51:58Z

+3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
+   Initialize `local_parser_ok = []`.
+   Prefer running `node scan.mjs` once to cover all zero-token parsers + APIs; if done manually, repeat the logic below.
+   For each company in `tracked_companies` with `enabled: true`, `parser.command`, and an existing script:
+   a. Run `parser.command` with `parser.script` + `parser.args` using local execution without a shell
+   b. Expand the `{careers_url}` and `{company}` placeholders in arguments
+   c. Read JSON from stdout (`[]`, `{ jobs: [] }`, or `{ results: [] }`)
+   d. Normalize each job to `{title, url, company, location}`
+   e. Resolve relative URLs against `careers_url`
+   f. If the parser fails, log the error, try the ATS API fallback if one exists, and continue with the remaining companies (do **not** add to `local_parser_ok`)
+   g. If the parser finishes successfully (steps c–e with no fatal error), add `entry.name` to `local_parser_ok` and accumulate jobs into candidates
+
 4. **Nivel 1 — Playwright scan** (parallel, in batches of 3-5):
-   For each company in `tracked_companies` with `enabled: true` and `careers_url` defined:
+   For each company in `tracked_companies` with `enabled: true`, `careers_url` defined, and **name not listed in `local_parser_ok`**:
   a. `browser_navigate` to the `careers_url`


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix ordered-list numbering to satisfy markdownlint MD029.

Line 168 currently continues with 4. after 3.5., which triggers the configured ordered-list style check. Use 1. style for list items (Markdown auto-numbers).

Proposed doc fix

-3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
+1. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
 ...
-4. **Nivel 1 — Playwright scan** (parallel, in batches of 3-5):
+1. **Nivel 1 — Playwright scan** (parallel, in batches of 3-5):
 ...
-5. **Nivel 2 — ATS APIs / feeds** (parallel):
+1. **Nivel 2 — ATS APIs / feeds** (parallel):
 ...
-6. **Nivel 3 — WebSearch queries** (parallel if possible):
+1. **Nivel 3 — WebSearch queries** (parallel if possible):

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 168-168: Ordered list item prefix
Expected: 1; Actual: 4; Style: 1/2/3

(MD029, ol-prefix)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modes/scan.md` around lines 156 - 170, The ordered list numbering in modes/scan.md uses explicit numbers like "3.5." and "4." which triggers markdownlint MD029; update the ordered list items (e.g., the heading "3.5. **Nivel 0 — Local parser**", its sub-steps a–g, and the subsequent "4. **Nivel 1 — Playwright scan**" heading) to use the auto-numbering style by replacing explicit numerals with "1." for each list item (and any nested ordered lists) so markdown auto-numbers them consistently and satisfies MD029.
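Steps c to e of the Nivel 0 contract quoted above (read JSON from stdout, normalize each job, resolve relative URLs) can be sketched roughly as follows. The function names `extractJobs` and `normalizeJob`, and the fallback choices for missing `company` and `location` fields, are assumptions for illustration rather than the behavior of `providers/local-parser.mjs`.

```javascript
// Accept the three stdout shapes named in the contract:
// a bare array, { jobs: [...] }, or { results: [...] }.
function extractJobs(payload) {
  if (Array.isArray(payload)) return payload;
  if (Array.isArray(payload?.jobs)) return payload.jobs;
  if (Array.isArray(payload?.results)) return payload.results;
  throw new Error('parser output is not a recognized jobs payload');
}

// Normalize one raw job to { title, url, company, location }, resolving
// relative URLs against the company's careers_url via the WHATWG URL API.
function normalizeJob(raw, entry) {
  return {
    title: String(raw.title ?? '').trim(),
    url: new URL(raw.url, entry.careers_url).href,
    company: raw.company ?? entry.name, // assumed fallback: the tracked company name
    location: raw.location ?? '', // assumed fallback: empty string
  };
}

const entry = { name: 'Example Co', careers_url: 'https://example.com/careers/' };
const stdout = '{"jobs":[{"title":" Staff Engineer ","url":"jobs/123"}]}';

const jobs = extractJobs(JSON.parse(stdout)).map((job) => normalizeJob(job, entry));
console.log(jobs);
```

Note that relative resolution depends on the trailing slash of `careers_url`: without it, `jobs/123` would resolve against the parent path instead.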

coderabbitai · 2026-05-18T00:51:59Z

  if (entry.provider) {
    const p = providers.get(entry.provider);
    if (!p) return { error: `unknown provider: ${entry.provider}` };
    return { provider: p };
  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Respect skipIds for explicit provider overrides.

resolveProvider() bypasses skipIds when entry.provider is set. In the local-parser failure path, this can reselect local-parser and block intended API fallback for explicitly configured entries.

Proposed fix

 function resolveProvider(entry, providers, { skipIds = [] } = {}) {
   if (entry.provider) {
+    if (skipIds.includes(entry.provider)) return null;
     const p = providers.get(entry.provider);
     if (!p) return { error: `unknown provider: ${entry.provider}` };
     return { provider: p };
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scan.mjs` around lines 88 - 92, In resolveProvider, the explicit-provider branch ignores skipIds: when entry.provider is set you currently immediately fetch from providers and return, which can reselect a skipped provider (e.g., local-parser) and prevent API fallback; change the branch so it first checks skipIds (e.g., skipIds.has(entry.id) or equivalent) and if the entry id is skipped then do not return the explicit provider but fall through to the normal selection/fallback path; otherwise, keep the existing providers.get(entry.provider) lookup and return provider or unknown-provider error. Ensure you reference resolveProvider, entry.provider, providers, and skipIds in the change.
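A minimal sketch of the fallback behavior under discussion, assuming a `Map` of providers keyed by id. The `detect(entry)` method and the two provider objects are hypothetical stand-ins, not the real `providers/*.mjs` modules. This variant falls through to detection when the explicit provider is skipped, as the agent prompt describes, rather than returning `null` as the proposed diff does.

```javascript
// Sketch only: `detect(entry)` and the provider objects below are
// hypothetical stand-ins for the real provider plugin interface.
function resolveProvider(entry, providers, { skipIds = [] } = {}) {
  // Honor an explicit override unless that provider was already tried and skipped.
  if (entry.provider && !skipIds.includes(entry.provider)) {
    const p = providers.get(entry.provider);
    if (!p) return { error: `unknown provider: ${entry.provider}` };
    return { provider: p };
  }
  // Otherwise fall through to auto-detection over the remaining providers.
  for (const [id, p] of providers) {
    if (!skipIds.includes(id) && p.detect(entry)) return { provider: p };
  }
  return null;
}

const providers = new Map([
  ['local-parser', { id: 'local-parser', detect: (e) => Boolean(e.parser?.command) }],
  ['greenhouse', { id: 'greenhouse', detect: (e) => /greenhouse\.io/.test(e.careers_url ?? '') }],
]);

const entry = {
  name: 'Example Co',
  provider: 'local-parser',
  parser: { command: 'node', script: 'parser.mjs' },
  careers_url: 'https://boards.greenhouse.io/example',
};

// First attempt: the explicit override wins.
const first = resolveProvider(entry, providers);
// After a parser failure the caller retries with local-parser skipped.
const fallback = resolveProvider(entry, providers, { skipIds: ['local-parser'] });

console.log(first.provider.id, fallback.provider.id);
```

Without the `skipIds` check in the explicit branch, the second call would return `local-parser` again and the API fallback would never be reached.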

coderabbitai · 2026-05-18T00:51:59Z

+const portalExample = readFile('templates/portals.example.yml');
+if (
+  !portalExample.includes('cohere_jobs.py') &&
+  portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
+  portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
+  portalExample.includes('already know their target careers URL')
+) {
+  pass('portals example documents a generic local parser contract');
+} else {
+  fail('portals example still points at a bundled Cohere parser');
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard templates/portals.example.yml before reading to avoid hard crash.

Line 331 calls readFile(...) unconditionally. If the file is missing/renamed, the whole suite throws instead of reporting a failed check. Please gate this with fileExists(...) and fail gracefully.

Proposed fix

-const portalExample = readFile('templates/portals.example.yml');
-if (
-  !portalExample.includes('cohere_jobs.py') &&
-  portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
-  portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
-  portalExample.includes('already know their target careers URL')
-) {
-  pass('portals example documents a generic local parser contract');
-} else {
-  fail('portals example still points at a bundled Cohere parser');
-}
+if (!fileExists('templates/portals.example.yml')) {
+  fail('Missing templates/portals.example.yml');
+} else {
+  const portalExample = readFile('templates/portals.example.yml');
+  if (
+    !portalExample.includes('cohere_jobs.py') &&
+    portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
+    portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
+    portalExample.includes('already know their target careers URL')
+  ) {
+    pass('portals example documents a generic local parser contract');
+  } else {
+    fail('portals example still points at a bundled Cohere parser');
+  }
+}

As per coding guidelines, "**/*.mjs: ... Ensure scripts handle missing data/ directories gracefully."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test-all.mjs` around lines 331 - 341, The test currently calls readFile('templates/portals.example.yml') unguarded which will throw if the file is absent; update the test in test-all.mjs to first call fileExists('templates/portals.example.yml') and if it returns false call fail(...) with a clear message that the portals example is missing, otherwise read the file into portalExample and run the existing includes checks and pass/fail logic; reference the readFile usage and the portalExample/pass/fail block to apply the guard.
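The guard pattern can be factored into a small wrapper, sketched below. The `pass`, `fail`, `fileExists`, and `readFile` definitions here are hypothetical stand-ins for the helpers that `test-all.mjs` already provides, and `checkFile` is an invented name.

```javascript
import { existsSync, readFileSync } from 'node:fs';

// Hypothetical stand-ins for the helpers that test-all.mjs already provides.
const results = [];
const pass = (msg) => results.push({ ok: true, msg });
const fail = (msg) => results.push({ ok: false, msg });
const fileExists = (path) => existsSync(path);
const readFile = (path) => readFileSync(path, 'utf8');

// Run `check` on the file contents, or record a failure if the file is absent,
// so one missing file fails one check instead of aborting the whole suite.
function checkFile(path, check) {
  if (!fileExists(path)) {
    fail(`Missing ${path}`);
    return;
  }
  check(readFile(path));
}

checkFile('templates/portals.example.yml', (text) => {
  if (text.includes('cohere_jobs.py')) {
    fail('portals example still points at a bundled Cohere parser');
  } else {
    pass('portals example documents a generic local parser contract');
  }
});

console.log(results);
```

Run outside the repository, this records a single `Missing templates/portals.example.yml` failure and continues, which is the graceful behavior the review asks for.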

lejrn · 2026-05-18T00:59:53Z

Hey @santifer again,
I had to do some refactoring.
First, I fixed the issues mentioned prior to this message. GitHub then showed conflicts on the PR, so I fetched all of the 38 commits that had been added prior to this PR, refactored my local parser feature to comply with the new providers structure of scan.md and scan.mjs, ran some tests, logged the results, cleaned up the branch, and committed and pushed again.

Hopefully it works now.. Please update me if anything requires modifications. Thanks!

github-actions Bot added 📄 docs ⚠️ agent-behavior 🔧 scripts labels May 7, 2026

coderabbitai Bot reviewed May 7, 2026

lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from fceaf4a to 08ea970 Compare May 7, 2026 13:16

coderabbitai Bot reviewed May 7, 2026

lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from 8060d74 to 4ebd978 Compare May 17, 2026 19:43

github-actions Bot added the 📦 dependencies label May 17, 2026

coderabbitai Bot reviewed May 17, 2026

lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from 4ebd978 to 98fccce Compare May 18, 2026 00:48

lejrn changed the title ~~Add local parser source support for portal scans~~ feat(scan): add local-parser provider for portal scans May 18, 2026

coderabbitai Bot reviewed May 18, 2026
