feat(scan): add local-parser provider for portal scans #595

Open
lejrn wants to merge 1 commit into santifer:main from lejrn:feature/local-parser-scan-w-cohere-exmp

Conversation

@lejrn lejrn commented May 7, 2026

Summary

Rebased onto latest main and refactored the local-parser work to fit the new provider plugin architecture (providers/*.mjs + resolveProvider() in scan.mjs).

  • Add providers/local-parser.mjs for companies with parser.command + parser.script in portals.yml
  • Route parser-first; on parser failure, fall back to Ashby/Greenhouse/Lever API providers (same pattern as other providers)
  • Extend modes/scan.md with Nivel 0 (local parser) and local_parser_ok rules so agent scan skips redundant Playwright/API/WebSearch when Nivel 0 succeeds
  • Document generic parser contract in templates/portals.example.yml and docs/local-parser-cookbook.md
  • test-all.mjs contract checks (module exists, API fallback, no bundled runtime parsers)
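
The parser-first routing in the list above could look roughly like this. It is a minimal sketch rather than the PR's code: only the `resolveProvider()` name and the local-parser-then-API ordering come from this thread, while the provider objects, their `id`/`matches`/`fetchJobs` fields, the `careers_url` field, and `scanCompany` are hypothetical.

```javascript
// Hypothetical provider registry, ordered by priority: local-parser first.
const providers = [
  {
    id: 'local-parser',
    matches: (company) => Boolean(company.parser?.command && company.parser?.script),
    // Simulate a failing parser so the fallback path is exercised.
    fetchJobs: async () => { throw new Error('parser script missing'); },
  },
  {
    id: 'ashby',
    matches: (company) => /ashbyhq\.com/.test(company.careers_url ?? ''),
    fetchJobs: async (company) => [{ title: 'Engineer', company: company.name }],
  },
];

// Return the first matching provider whose id is not in skipIds, or null.
function resolveProvider(company, skipIds = new Set()) {
  return providers.find((p) => !skipIds.has(p.id) && p.matches(company)) ?? null;
}

// Try providers in priority order; when one fails, skip it and resolve again.
async function scanCompany(company) {
  const skipIds = new Set();
  const errors = [];
  let provider;
  while ((provider = resolveProvider(company, skipIds))) {
    try {
      const jobs = await provider.fetchJobs(company);
      return { sourceName: provider.id, jobs, errors };
    } catch (err) {
      errors.push(`${provider.id} failed: ${err.message}`);
      skipIds.add(provider.id);
    }
  }
  return { sourceName: null, jobs: [], errors };
}
```

In this sketch a parser failure is recorded in `errors` but does not drop the company; the next matching provider supplies the jobs and the `sourceName`.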

Closes #594

PR cleanup (vs earlier revisions)

  • Rebase: Branch is one commit on current upstream/main (includes the provider refactor from Feature: cold email #593 / scan plugin work).
  • No bundled Python parsers: Removed scripts/parsers/cohere_jobs.py and other company-specific runtime scripts from this PR (per maintainer direction: runtime-neutral contract only).
  • Tests/fixtures elsewhere: Cohere/Mobileye fixture parsers, test-local-parser.mjs, test-scan-token-comparison.mjs, and benchmark tables live on fork branch feature/local-parser-integration-tests (lejrn/career-ops) — not in this PR.

Test plan

  • node test-all.mjs --quick
  • node --check scan.mjs and node --check providers/local-parser.mjs
  • Configure a company in portals.yml with parser: pointing at your own script; run node scan.mjs --dry-run --company <Name>
  • Confirm parser success uses local-parser in scan summary; confirm API fallback when parser script is missing or fails
  • Agent scan: verify modes/scan.md local_parser_ok behavior (skip Playwright/API for parser-covered companies)

coderabbitai Bot commented May 7, 2026

Walkthrough

Adds configurable per-company local parser support: new local-parser provider runs configured commands, parses normalized jobs JSON from stdout, integrates with scan pipeline (provider resolution, provenance, fallback), updates docs and templates, adds .gitkeep and .gitignore rules, and includes tests.

Changes

Local Parser Scanning Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Version, gitignore, keepfile, and portals template | VERSION, package.json, .gitignore, data/parser-output/.gitkeep, templates/portals.example.yml | Bump to 1.8.1, ignore generated parser JSON while preserving .gitkeep, add .gitkeep, and expand portals example docs to describe scan_method: local_parser and parser config contract. |
| scan.mjs provider resolution and fetch flow | scan.mjs | Make PORTALS path overrideable, add resolveProvider(skipIds) and priority for local-parser, validate targets, log local-parser counts, compute sourceName per-target, retry provider resolution skipping local-parser on fetch errors, and write provenance using sourceName. |
| providers/local-parser implementation | providers/local-parser.mjs | New provider that runs configured parser command via execFile with placeholder-expanded args, parses stdout JSON (array |
| User docs and cookbook | modes/scan.md, docs/SCRIPTS.md, docs/local-parser-cookbook.md | Document Nivel 0 local-parser in scanning workflow, portals.yml parser contract and placeholder rules, stdout JSON shapes and normalization, artifact storage under data/parser-output/{company}/ with .gitkeep, failure/fallback behavior, and guidance to skip redundant agent/API passes when local parser succeeds. |
| Test-all updates for local parser contract | test-all.mjs | Add a LOCAL PARSER CONTRACT test section asserting scan/provider/docs contain expected local-parser guards, fallback behavior, provider module presence, and that portals example documents a generic parser (no bundled vendor script); renumber following sections. |
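
To illustrate what "placeholder-expanded args" passed to `execFile` can mean in practice, here is a small sketch. The `{careers_url}` and `{company}` placeholder names, the `args` config key, and the choice to throw on unknown placeholders are assumptions for illustration; the thread does not spell out the actual placeholder rules.

```javascript
// Replace {name} placeholders in each argument with values from `context`.
// Each argument stays a separate argv entry, so nothing is shell-interpolated.
function expandArgs(args, context) {
  return args.map((arg) =>
    String(arg).replace(/\{(\w+)\}/g, (match, key) => {
      if (!Object.hasOwn(context, key)) {
        throw new Error(`Unknown placeholder ${match}`);
      }
      return String(context[key]);
    }),
  );
}

// Hypothetical parser config, shaped like a portals.yml `parser:` entry.
const parser = {
  command: 'python3',
  script: 'my_parser.py',
  args: ['--url', '{careers_url}', '--company', '{company}'],
};

const argv = [
  parser.script,
  ...expandArgs(parser.args, {
    careers_url: 'https://example.com/careers?team=a b',
    company: 'Acme; rm -rf /',
  }),
];

// argv would then be the second argument of
// child_process.execFile(parser.command, argv, { timeout, maxBuffer }, callback),
// which spawns the command directly instead of going through a shell.
console.log(argv);
```

Because values are substituted per argument and never concatenated into a command string, spaces or shell metacharacters in a company name or URL reach the parser as literal text.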

Sequence Diagram(s)

(omitted — changes are primarily localized provider addition, docs, and wiring involving two main runtime components)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #339: Shares this PR's objective of adding local-parser support, and its acceptance criteria (parser stdout shapes, fallback behavior, relative URL resolution) match what is implemented here.

Possibly related PRs

Suggested labels

🔴 core-architecture

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | All coding requirements from issue #594 are met: local parser execution without shell interpolation [scan.mjs, providers/local-parser.mjs], JSON stdout parsing [providers/local-parser.mjs], relative URL resolution [providers/local-parser.mjs], failure reporting without abort [scan.mjs], artifact storage/ignore configuration [.gitignore, .gitkeep], and comprehensive documentation [modes/scan.md, templates/portals.example.yml, docs/local-parser-cookbook.md]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #594 scope: local parser support, configuration examples, documentation, and version bumps. No unrelated modifications detected. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a local-parser provider for portal scans, which is the primary feature introduced across scan.mjs, documentation, configuration templates, and the new local-parser provider module. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cohere-scan-token-comparison-pr.md`:
- Around line 1-13: Add a top-level H1 heading to the report and insert the
required header fields by adding lines beginning with "**URL:**" and
"**Legitimacy:**" directly under the H1 (e.g., "# What does this PR do?" becomes
"# <title>" followed by "**URL:** <value>" and "**Legitimacy:** <value>").
Ensure the header block appears at the very top of the document before the
existing "## What does this PR do?" content and format the two fields exactly as
"**URL:**" and "**Legitimacy:**".

In `@modes/scan.md`:
- Line 61: Update the documentation in modes/scan.md to clarify that parser
artifacts should not be checked into git (store them under
data/parser-output/{company}/ and add them to .gitignore) but that any .gitkeep
placeholder files used to preserve the directory structure must remain in the
repo; explicitly state that the scanner does not need to retain full JSON after
stdout is read and that only .gitkeep files are an exception to the "keep out of
git" rule.
- Around line 53-59: Update the parser contract example in modes/scan.md to list
all three accepted JSON stdout formats: the array format ([{...}]), the
object-with-jobs format ({ "jobs": [{...}] }), and the object-with-results
format ({ "results": [{...}] }); show each format as a separate JSON example and
keep the note that company is optional and scan.mjs will fall back to
tracked_companies when company is missing so parser implementers know all
allowed shapes to support.

In `@scan.mjs`:
- Around line 416-423: The local-parser branch should fall back to the API
parser when runLocalParser fails; wrap the call to runLocalParser(company) in a
try/catch (or check its failure result) and on error log the failure and then
set jobs by fetching the API JSON (using fetchJson(source.url)) and invoking
PARSERS[source.type](json, company.name), while still setting sourceName to
`${source.type}-api`; ensure you reference runLocalParser, fetchJson and PARSERS
and handle missing source.url or unknown source.type errors similarly so the
code degrades to the API scan rather than dropping the company.

In `@scripts/parsers/cohere_jobs.py`:
- Around line 28-47: The code currently trusts any source_url before calling
urlopen; update ashby_api_url (and callers like fetch_jobs) to enforce an
HTTPS-only allowlist for Ashby hosts: parse the URL and require parsed.scheme ==
"https" and parsed.netloc be one of the approved Ashby hostnames (e.g.,
"api.ashbyhq.com" and any official public job-board hostnames your product
supports); if the host is "api.ashbyhq.com" return the original URL, otherwise
validate the host is in the allowlist before extracting board_slug and
constructing the API URL, and raise ValueError for disallowed schemes/hosts to
prevent opening arbitrary URLs.
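
The three stdout shapes named in the modes/scan.md comment above (bare array, `{ "jobs": [...] }`, `{ "results": [...] }`), together with the company fallback, can be handled by a small normalizer. This is a sketch under assumptions: `extractJobs`, `normalizeJobs`, and the `title`/`url` job fields are hypothetical names, and resolving relative URLs against a base URL follows the acceptance criteria mentioned earlier in this thread rather than the PR's actual code.

```javascript
// Accept a bare array, { "jobs": [...] }, or { "results": [...] }.
function extractJobs(stdout) {
  const parsed = JSON.parse(stdout);
  if (Array.isArray(parsed)) return parsed;
  if (Array.isArray(parsed?.jobs)) return parsed.jobs;
  if (Array.isArray(parsed?.results)) return parsed.results;
  throw new Error('Parser stdout must be an array or an object with "jobs" or "results"');
}

// Fill in a missing company from the tracked_companies entry and
// resolve relative job URLs against the company's base URL.
function normalizeJobs(stdout, fallbackCompany, baseUrl) {
  return extractJobs(stdout).map((job) => ({
    ...job,
    company: job.company || fallbackCompany,
    url: job.url ? new URL(job.url, baseUrl).href : undefined,
  }));
}

const jobs = normalizeJobs(
  '[{"title":"ML Engineer","url":"/jobs/1"}]',
  'Acme',
  'https://example.com/careers',
);
console.log(jobs);
```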

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2c029fd-9f87-48c4-8489-3f8d62666d31

📥 Commits

Reviewing files that changed from the base of the PR and between 8e554cc and fceaf4a.

📒 Files selected for processing (9)
  • .gitignore
  • data/parser-output/.gitkeep
  • data/parser-output/cohere/.gitkeep
  • docs/SCRIPTS.md
  • docs/cohere-scan-token-comparison-pr.md
  • modes/scan.md
  • scan.mjs
  • scripts/parsers/cohere_jobs.py
  • templates/portals.example.yml

Comment thread docs/cohere-scan-token-comparison-pr.md Outdated
Comment on lines +1 to +13
## What does this PR do?

Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:

- Playwright-rendered scraping of the Cohere Ashby board.
- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.

The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.

## Related issue

N/A - measurement and documentation artifact.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add required report header fields and top-level heading.

This report should start with an H1 and include **URL:** and **Legitimacy:** in the header block.

Proposed patch
-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines: "All reports must include URL and Legitimacy fields in the header — format as URL: and Legitimacy: in the report".

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)


Comment thread modes/scan.md
Comment thread modes/scan.md Outdated
Comment thread scan.mjs Outdated
Comment thread scripts/parsers/cohere_jobs.py Outdated
Comment on lines +28 to +47
def ashby_api_url(source_url: str) -> str:
    parsed = urlparse(source_url)
    if parsed.netloc == "api.ashbyhq.com":
        return source_url

    board_slug = parsed.path.strip("/").split("/")[0]
    if not board_slug:
        raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

    return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"


def fetch_jobs(source_url: str) -> list[dict]:
    req = urllib.request.Request(
        ashby_api_url(source_url),
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
    return data["jobs"]

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict source URL to HTTPS Ashby hosts before opening it.

source_url can currently pass through with insufficient scheme/host validation before urlopen. Add strict allowlisting to prevent unintended scheme/endpoint usage.

Proposed patch
 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme not in {"http", "https"}:
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if parsed.scheme != "https":
+            raise ValueError("Ashby API URL must use https")
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")
 
     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")
 
     return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"
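
For a parser written on the Node side instead, the same allowlist idea can be sketched in JavaScript. This is an illustrative rendering, not part of the PR: `ashbyApiUrl` is a hypothetical name, it applies the stricter HTTPS-only rule to every input, and the two hostnames are the ones named in the patch above.

```javascript
const ASHBY_API_HOST = 'api.ashbyhq.com';
const ASHBY_BOARD_HOST = 'jobs.ashbyhq.com';

// Return a validated HTTPS Ashby API URL, or throw for anything else.
function ashbyApiUrl(sourceUrl) {
  const parsed = new URL(sourceUrl); // throws TypeError on malformed input

  if (parsed.protocol !== 'https:') {
    throw new Error(`Unsupported URL scheme: ${parsed.protocol}`);
  }

  if (parsed.hostname === ASHBY_API_HOST) {
    if (!parsed.pathname.startsWith('/posting-api/job-board/')) {
      throw new Error('Unexpected Ashby API endpoint');
    }
    return parsed.href;
  }

  if (parsed.hostname !== ASHBY_BOARD_HOST) {
    throw new Error(`Unsupported host for source URL: ${parsed.hostname}`);
  }

  // First non-empty path segment is treated as the board slug.
  const boardSlug = parsed.pathname.split('/').filter(Boolean)[0];
  if (!boardSlug) {
    throw new Error(`Cannot infer Ashby board slug from ${sourceUrl}`);
  }

  return `https://${ASHBY_API_HOST}/posting-api/job-board/${boardSlug}`;
}

console.log(ashbyApiUrl('https://jobs.ashbyhq.com/cohere'));
```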
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 35-35: Avoid specifying long messages outside the exception class

(TRY003)


[error] 41-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 45-45: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


@lejrn lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from fceaf4a to 08ea970 Compare May 7, 2026 13:16

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
modes/scan.md (1)

61-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify .gitkeep is the exception to “keep artifacts out of git.”

This line should explicitly preserve .gitkeep placeholders while excluding generated JSON artifacts.

Suggested patch
-El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git.
+El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git (los JSON en `.gitignore`; los `.gitkeep` se mantienen en git para conservar la estructura).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modes/scan.md` at line 61, Update the sentence in modes/scan.md to explicitly
state that generated parser artifacts (e.g., JSON in
data/parser-output/{company}/) must be kept out of git but that .gitkeep files
are the allowed exception and should be committed; also ensure the repository
.gitignore excludes the generated JSON artifacts in data/parser-output/ while
not ignoring .gitkeep so the placeholder stays in source control.
scripts/parsers/cohere_jobs.py (1)

28-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict parser source URL to HTTPS Ashby allowlist before fetch.

ashby_api_url still allows unsafe/unexpected URLs through to urlopen when netloc is api.ashbyhq.com (scheme/path are not validated). Enforce https and allowed Ashby hosts/paths before issuing the request.

Suggested patch
 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme != "https":
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")

     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parsers/cohere_jobs.py` around lines 28 - 47, The ashby_api_url
function currently lets unsafe URLs reach urlopen because it only checks netloc;
update ashby_api_url and its usage in fetch_jobs to enforce an HTTPS allowlist:
require parsed.scheme == "https" and parsed.netloc in the allowed set (at
minimum "api.ashbyhq.com"), and if parsed.netloc == "api.ashbyhq.com" validate
the path prefix (e.g., it must start with "/posting-api/" or the expected
posting path) before returning source_url; otherwise, when constructing the API
URL from a board page (the existing board_slug path flow), always build an
explicit "https://api.ashbyhq.com/posting-api/job-board/{board_slug}" and raise
ValueError for any non-https scheme, unknown host, or malformed path so that
fetch_jobs only ever calls urlopen with a validated HTTPS Ashby API URL.
docs/cohere-scan-token-comparison-pr.md (1)

1-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add required report header block at the top (H1 + URL + Legitimacy).

This report is missing the required header fields and top-level heading format.

Suggested patch
-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines, "All reports must include URL and Legitimacy fields in the header — format as URL: and Legitimacy: in the report".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cohere-scan-token-comparison-pr.md` around lines 1 - 13, Add the
required top header block to this report: insert an H1 title line at the very
top (e.g., "# Cohere scan token comparison"), then add "URL:" followed by the
report's source URL and "Legitimacy:" followed by the legitimacy statement
(e.g., "Legitimacy: internal measurement/dry-run") as plain lines immediately
under the H1; update the existing "## What does this PR do?" section to follow
that header so the document begins with the mandated H1 + URL + Legitimacy
fields.
scan.mjs (1)

416-423: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fall back to the ATS API when the local parser fails.

If runLocalParser(company) throws on Lines 416-418, the outer catch records the error and drops that company entirely. This still misses the PR objective of running the local parser before falling back to Greenhouse/Ashby/Lever.

Suggested fix
       if (source.kind === 'local-parser') {
-        jobs = await runLocalParser(company);
-        sourceName = 'local-parser';
+        try {
+          jobs = await runLocalParser(company);
+          sourceName = 'local-parser';
+        } catch (parserErr) {
+          const apiFallback = detectApi(company);
+          if (!apiFallback) throw parserErr;
+
+          const json = await fetchJson(apiFallback.url);
+          jobs = PARSERS[apiFallback.type](json, company.name);
+          sourceName = `${apiFallback.type}-api`;
+          errors.push({
+            company: company.name,
+            error: `local parser failed, used API fallback: ${parserErr.message}`,
+          });
+        }
       } else {
         const json = await fetchJson(source.url);
         jobs = PARSERS[source.type](json, company.name);
         sourceName = `${source.type}-api`;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan.mjs` around lines 416 - 423, The current branch that calls
runLocalParser(company) can throw and abort processing the company; wrap the
runLocalParser call in its own try/catch so that on error you log the
local-parser failure but then fall back to the ATS API path: call
fetchJson(source.url), set jobs = PARSERS[source.type](json, company.name) and
set sourceName = `${source.type}-api`; ensure you still surface/log the original
local-parser error but continue processing rather than letting the outer catch
drop the company. Use the existing symbols runLocalParser, fetchJson, PARSERS,
source.kind, source.url, source.type, jobs and sourceName to locate and
implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan.mjs`:
- Around line 383-389: The filtering assumes c.name is always a string and calls
toLowerCase(), which can throw for malformed entries; update the
selectedCompanies/filter logic to guard c.name (e.g., treat non-string names as
empty string or skip those entries) before calling toLowerCase and ensure
filterCompany is normalized (e.g., default to empty string) so a bad
tracked_companies item doesn't abort the scan; apply this guard where
selectedCompanies is built (the companies.filter(...) and subsequent .filter(c
=> c.name.toLowerCase().includes(filterCompany)) usage) and also ensure
downstream code using detectSource still receives well-formed company objects or
skips null/invalid ones.
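
A guarded version of that filter might look like the following. It is a sketch only: `selectCompanies` is a hypothetical helper, and skipping malformed entries (rather than treating their names as empty strings) is one of the two options the comment allows.

```javascript
// Keep only well-formed companies whose name contains the (optional) filter.
function selectCompanies(companies, filterCompany) {
  // Normalize the filter so a missing or non-string value matches everything.
  const needle = typeof filterCompany === 'string' ? filterCompany.trim().toLowerCase() : '';
  return companies
    // Skip null entries and entries without a usable string name.
    .filter((c) => c && typeof c.name === 'string' && c.name.trim() !== '')
    .filter((c) => c.name.trim().toLowerCase().includes(needle));
}

// Malformed tracked_companies entries no longer abort the scan.
const tracked = [{ name: 'Cohere' }, { name: 42 }, null, {}, { name: '  Mobileye ' }];
console.log(selectCompanies(tracked, 'co'));
console.log(selectCompanies(tracked));
```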


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a4ae2065-d06c-46fa-822c-af3c04da5e0d

📥 Commits

Reviewing files that changed from the base of the PR and between fceaf4a and 08ea970.

📒 Files selected for processing (9)
  • .gitignore
  • data/parser-output/.gitkeep
  • data/parser-output/cohere/.gitkeep
  • docs/SCRIPTS.md
  • docs/cohere-scan-token-comparison-pr.md
  • modes/scan.md
  • scan.mjs
  • scripts/parsers/cohere_jobs.py
  • templates/portals.example.yml

Comment thread scan.mjs Outdated
@santifer
Owner

Hey @lejrn — local parser source is a useful extension to the scanner. Two questions before merging:

  1. CONFLICTING with main: needs rebase onto current.

  2. Python dependency: the PR adds scripts/parsers/cohere_jobs.py, which introduces a Python file into a JS-only repo. I'd rather not add Python as a runtime requirement for career-ops users (many install via Claude Code with only Node + Go).

    Could the Cohere example live elsewhere (your own repo, a docs/cookbook page, etc.) and the PR ship just the local_parser contract (the JS-side scan.mjs changes + portals.example.yml schema + docs)? Users who want a Python parser write their own and point at it.

The contract itself (stdout JSON spec, before-fallback execution) is clean. Just want to keep the runtime minimal.

@lejrn lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from 8060d74 to 4ebd978 Compare May 17, 2026 19:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modes/scan.md`:
- Around line 116-120: The priority list is ambiguous about runtime context;
update the section so it explicitly states this is the agent-mode execution
order (e.g., title: "Agent-mode execution priority") or split into two
subsections "Agent-mode execution order" and "Scanner/runtime (scan.mjs)
execution notes" and move the numbered levels (Local parser, Playwright, API,
WebSearch) under the agent-mode heading; ensure you reference scan.mjs in the
scanner/runtime subsection to clarify the runtime ordering differences.

In `@scan.mjs`:
- Around line 197-198: Validate and sanitize parser.timeout_ms and
parser.max_buffer_bytes before calling execFile: ensure the derived timeout and
maxBuffer are finite numbers greater than zero (use Number(...) then
isFinite(...) and >0 checks) and fall back to LOCAL_PARSER_TIMEOUT_MS and
LOCAL_PARSER_MAX_BUFFER_BYTES when invalid; update the variables timeout and
maxBuffer (used when invoking execFile) to these sanitized values so execFile
never receives NaN, non-finite, or non-positive arguments.

In `@test-all.mjs`:
- Around line 287-306: The tests currently use literal string includes against
scan.mjs which is brittle; replace them with resilient pattern/AST-based checks:
for the company-normalization contract, parse scan.mjs (scanScript) and assert a
typeof guard on c.name (e.g., a UnaryExpression or BinaryExpression checking
typeof c.name === 'string') and that subsequent MemberExpressions call trim()
and toLowerCase() on c.name (or its safe variant) rather than matching exact
source text; for the parser-fallback contract, detect a call expression to
detectApi(company) and verify a control-flow path or adjacent statements that
log "local parser failed, used API fallback" and call fetchJson(apiFallback.url)
(use regex or AST traversal to ensure the sequence/structure exists, not raw
includes).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 529d6682-f1a7-4a5d-a898-a2098247722c

📥 Commits

Reviewing files that changed from the base of the PR and between 08ea970 and 4ebd978.

📒 Files selected for processing (10)
  • .gitignore
  • VERSION
  • data/parser-output/.gitkeep
  • docs/SCRIPTS.md
  • docs/local-parser-cookbook.md
  • modes/scan.md
  • package.json
  • scan.mjs
  • templates/portals.example.yml
  • test-all.mjs

Comment thread modes/scan.md Outdated
Comment thread scan.mjs Outdated
Comment on lines +197 to +198
const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate parser timeout_ms and max_buffer_bytes before passing to execFile.

Line 197 and Line 198 accept non-finite/invalid values (NaN, <=0) from config, which can turn a simple config typo into a runtime failure for that company scan path.

Proposed fix
-  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
-  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+  const requestedTimeout = Number(parser.timeout_ms);
+  const timeout =
+    Number.isFinite(requestedTimeout) && requestedTimeout > 0
+      ? Math.trunc(requestedTimeout)
+      : LOCAL_PARSER_TIMEOUT_MS;
+
+  const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+  const maxBuffer =
+    Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+      ? Math.trunc(requestedMaxBuffer)
+      : LOCAL_PARSER_MAX_BUFFER_BYTES;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan.mjs` around lines 197 - 198, Validate and sanitize parser.timeout_ms and
parser.max_buffer_bytes before calling execFile: ensure the derived timeout and
maxBuffer are finite numbers greater than zero (use Number(...) then
isFinite(...) and >0 checks) and fall back to LOCAL_PARSER_TIMEOUT_MS and
LOCAL_PARSER_MAX_BUFFER_BYTES when invalid; update the variables timeout and
maxBuffer (used when invoking execFile) to these sanitized values so execFile
never receives NaN, non-finite, or non-positive arguments.
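The validation described above could also be factored into one small helper so both limits share the same rule. This is only a sketch: `positiveIntOr`, `resolveParserLimits`, and the two default values are hypothetical names and numbers chosen for illustration, not the PR's actual constants.

```javascript
// Hypothetical helper: returns a positive integer parsed from `value`,
// or `fallback` when the value is missing, non-numeric, non-finite, or below 1.
function positiveIntOr(value, fallback) {
  // null and '' coerce to 0 and undefined coerces to NaN; both are rejected below.
  const n = Number(value);
  return Number.isFinite(n) && n >= 1 ? Math.trunc(n) : fallback;
}

// Assumed defaults for illustration only; the real constants live in the PR.
const DEFAULT_TIMEOUT_MS = 30000;
const DEFAULT_MAX_BUFFER_BYTES = 10 * 1024 * 1024;

// Builds sanitized limits from a (possibly malformed) `parser` config block.
function resolveParserLimits(parser = {}) {
  return {
    timeout: positiveIntOr(parser.timeout_ms, DEFAULT_TIMEOUT_MS),
    maxBuffer: positiveIntOr(parser.max_buffer_bytes, DEFAULT_MAX_BUFFER_BYTES),
  };
}
```

One design choice worth noting: the sketch requires `n >= 1` rather than `n > 0`, so a fractional value such as `0.5` falls back to the default instead of being truncated to `0`.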

Comment thread test-all.mjs
Comment on lines +287 to +306
const scanScript = readFile('scan.mjs');
if (
scanScript.includes("typeof c.name === 'string'") &&
scanScript.includes('c.name.trim()') &&
scanScript.includes('c.name.toLowerCase()')
) {
pass('scan.mjs guards company names before filtering');
} else {
fail('scan.mjs does not guard company names before filtering');
}

if (
scanScript.includes('detectApi(company)') &&
scanScript.includes('local parser failed, used API fallback') &&
scanScript.includes('fetchJson(apiFallback.url)')
) {
pass('scan.mjs falls back to ATS API when local parser fails');
} else {
fail('scan.mjs does not fall back to ATS API when local parser fails');
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace literal includes(...) contract checks with resilient structure checks.

These assertions are currently tied to exact source strings, so they can fail on harmless refactors and pass even when behavior regresses (e.g., target text in comments). Prefer regex/AST-shape checks for guarded company normalization and parser-fallback flow to keep this contract test reliable.

Suggested direction
- if (
-   scanScript.includes("typeof c.name === 'string'") &&
-   scanScript.includes('c.name.trim()') &&
-   scanScript.includes('c.name.toLowerCase()')
- ) {
+ if (
+   /typeof\s+\w+\.name\s*===\s*['"]string['"]/.test(scanScript) &&
+   /\.\s*name\s*\.\s*trim\s*\(/.test(scanScript) &&
+   /\.\s*name\s*\.\s*toLowerCase\s*\(/.test(scanScript)
+ ) {
    pass('scan.mjs guards company names before filtering');
  } else {
    fail('scan.mjs does not guard company names before filtering');
  }

- if (
-   scanScript.includes('detectApi(company)') &&
-   scanScript.includes('local parser failed, used API fallback') &&
-   scanScript.includes('fetchJson(apiFallback.url)')
- ) {
+ if (
+   /detectApi\s*\(\s*company\s*\)/.test(scanScript) &&
+   /local parser failed,\s*used API fallback/.test(scanScript) &&
+   /fetchJson\s*\(\s*apiFallback\.url\s*\)/.test(scanScript)
+ ) {
    pass('scan.mjs falls back to ATS API when local parser fails');
  } else {
    fail('scan.mjs does not fall back to ATS API when local parser fails');
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test-all.mjs` around lines 287 - 306, The tests currently use literal string
includes against scan.mjs which is brittle; replace them with resilient
pattern/AST-based checks: for the company-normalization contract, parse scan.mjs
(scanScript) and assert a typeof guard on c.name (e.g., a UnaryExpression or
BinaryExpression checking typeof c.name === 'string') and that subsequent
MemberExpressions call trim() and toLowerCase() on c.name (or its safe variant)
rather than matching exact source text; for the parser-fallback contract, detect
a call expression to detectApi(company) and verify a control-flow path or
adjacent statements that log "local parser failed, used API fallback" and call
fetchJson(apiFallback.url) (use regex or AST traversal to ensure the
sequence/structure exists, not raw includes).
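To make the brittleness concrete, here is a small sketch comparing a literal `includes` check with pattern checks like those in the suggested direction. The two source snippets are invented for illustration and are not taken from `scan.mjs`.

```javascript
// Two behaviorally equivalent snippets that differ only in variable name,
// quote style, and line breaks (hypothetical examples, not real scan.mjs code).
const original = "if (typeof c.name === 'string') return c.name.trim();";
const refactored = 'if (typeof entry.name === "string") return entry.name\n  .trim();';

// Literal check: tied to the exact source text.
const literalCheck = (src) => src.includes("typeof c.name === 'string'");

// Pattern check: tolerates renamed variables, either quote style, and whitespace.
const patternCheck = (src) =>
  /typeof\s+\w+\.name\s*===\s*['"]string['"]/.test(src) &&
  /\.\s*name\s*\.\s*trim\s*\(/.test(src);

// A comment containing the target text satisfies both kinds of check.
const commentOnly = "// typeof c.name === 'string' then c.name.trim()";

console.log(literalCheck(original), literalCheck(refactored)); // true false
console.log(patternCheck(original), patternCheck(refactored)); // true true
console.log(literalCheck(commentOnly), patternCheck(commentOnly)); // true true
```

The last line shows a limit of this approach: pattern checks survive harmless refactors, but they still match text inside comments, so only an AST-based or behavioral test would rule that case out.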

@lejrn
Author

lejrn commented May 17, 2026

Hey @santifer, thank you for your code review and comments!
I followed your guidance and made these changes:

  • Rebase: Done onto current main.

  • Python: Removed bundled example parser; PR is contract + scanner + docs only; parsers live outside the repo or user-written.

  • Docs: docs/local-parser-cookbook.md plus commented JS/Python examples under templates/portals.example.yml; token savings shown in a small comparison table (generic wording).

  • Scanner: Local parser still runs first; on failure we fall back to Greenhouse/Ashby/Lever API when detectable; malformed tracked_companies entries no longer crash filtering.

Forgive me if I made any mistakes, and please let me know what to fix, if anything.

Thank you!

Introduce providers/local-parser.mjs for executable parsers in portals.yml,
with scan.mjs routing parser-first and ATS API fallback. Extend modes/scan.md
with Nivel 0 contract and local_parser_ok to skip redundant Playwright/API
work. Add generic portals.example.yml docs, local-parser-cookbook.md, and
test-all contract checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
@lejrn lejrn force-pushed the feature/local-parser-scan-w-cohere-exmp branch from 4ebd978 to 98fccce Compare May 18, 2026 00:48
@lejrn lejrn changed the title Add local parser source support for portal scans feat(scan): add local-parser provider for portal scans May 18, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
providers/local-parser.mjs (1)

79-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sanitize parser numeric limits before calling execFile.

timeout_ms / max_buffer_bytes are not validated for finite positive numbers. Invalid values can propagate directly to execFile options.

Proposed fix
-  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
-  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+  const requestedTimeout = Number(parser.timeout_ms);
+  const timeout =
+    Number.isFinite(requestedTimeout) && requestedTimeout > 0
+      ? Math.trunc(requestedTimeout)
+      : LOCAL_PARSER_TIMEOUT_MS;
+
+  const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+  const maxBuffer =
+    Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+      ? Math.trunc(requestedMaxBuffer)
+      : LOCAL_PARSER_MAX_BUFFER_BYTES;
#!/bin/bash
rg -n "timeout_ms|max_buffer_bytes|execFileAsync|Number\\.isFinite" providers/local-parser.mjs -C2

As per coding guidelines, **/*.mjs: Ensure scripts handle missing data/ directories gracefully.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/local-parser.mjs` around lines 79 - 80, The parsed numeric limits
timeout and maxBuffer are not validated before being passed to execFileAsync;
update the code around the expressions that compute timeout (using
parser.timeout_ms and LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using
parser.max_buffer_bytes and LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers,
check Number.isFinite(...) and that values are > 0 (and integer where
appropriate), and fall back to the corresponding LOCAL_PARSER_* defaults when
invalid; ensure the validated variables used by execFileAsync are safe positive
finite numbers to prevent passing bad options to execFile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-parser-cookbook.md`:
- Line 30: Replace the Spanish words used in the sentence that references agent
scan mode and scan behavior so the doc uses consistent English: change "niveles"
to "levels" and "Nivel 0" to "Level 0" in the sentence that mentions
`/career-ops scan`, Playwright and API payloads, and the `local_parser_ok`
behavior (also referenced alongside `modes/scan.md`) so the doc consistently
reads "levels" / "Level 0".

In `@modes/scan.md`:
- Around line 156-170: The ordered list numbering in modes/scan.md uses explicit
numbers like "3.5." and "4." which triggers markdownlint MD029; update the
ordered list items (e.g., the heading "3.5. **Nivel 0 — Local parser**", its
sub-steps a–g, and the subsequent "4. **Nivel 1 — Playwright scan**" heading) to
use the auto-numbering style by replacing explicit numerals with "1." for each
list item (and any nested ordered lists) so markdown auto-numbers them
consistently and satisfies MD029.

In `@scan.mjs`:
- Around line 88-92: In resolveProvider, the explicit-provider branch ignores
skipIds: when entry.provider is set you currently immediately fetch from
providers and return, which can reselect a skipped provider (e.g., local-parser)
and prevent API fallback; change the branch so it first checks skipIds (e.g.,
skipIds.has(entry.id) or equivalent) and if the entry id is skipped then do not
return the explicit provider but fall through to the normal selection/fallback
path; otherwise, keep the existing providers.get(entry.provider) lookup and
return provider or unknown-provider error. Ensure you reference resolveProvider,
entry.provider, providers, and skipIds in the change.

In `@test-all.mjs`:
- Around line 331-341: The test currently calls
readFile('templates/portals.example.yml') unguarded which will throw if the file
is absent; update the test in test-all.mjs to first call
fileExists('templates/portals.example.yml') and if it returns false call
fail(...) with a clear message that the portals example is missing, otherwise
read the file into portalExample and run the existing includes checks and
pass/fail logic; reference the readFile usage and the portalExample/pass/fail
block to apply the guard.

---

Duplicate comments:
In `@providers/local-parser.mjs`:
- Around line 79-80: The parsed numeric limits timeout and maxBuffer are not
validated before being passed to execFileAsync; update the code around the
expressions that compute timeout (using parser.timeout_ms and
LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using parser.max_buffer_bytes and
LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers, check Number.isFinite(...)
and that values are > 0 (and integer where appropriate), and fall back to the
corresponding LOCAL_PARSER_* defaults when invalid; ensure the validated
variables used by execFileAsync are safe positive finite numbers to prevent
passing bad options to execFile.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 32e27756-5881-4c43-b7cd-929cc6efd41f

📥 Commits

Reviewing files that changed from the base of the PR and between 4ebd978 and 98fccce.

📒 Files selected for processing (11)
  • .gitignore
  • VERSION
  • data/parser-output/.gitkeep
  • docs/SCRIPTS.md
  • docs/local-parser-cookbook.md
  • modes/scan.md
  • package.json
  • providers/local-parser.mjs
  • scan.mjs
  • templates/portals.example.yml
  • test-all.mjs


`scan.mjs` uses **0 LLM tokens** for discovery: parsers run locally and only normalized job rows enter the pipeline.

In agent scan mode (`/career-ops scan`), Playwright and API niveles send large page or JSON payloads into the model. When Nivel 0 succeeds, `modes/scan.md` requires skipping those niveles for the same company (`local_parser_ok`).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use consistent English terminology in this English doc.

Line 30 mixes Spanish (niveles, Nivel) with English text. Replace with levels / Level 0 for consistency and readability.

🧰 Tools
🪛 LanguageTool

[grammar] ~30-~30: Ensure spelling is correct
Context: .../career-ops scan), Playwright and API niveles send large page or JSON payloads into t...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~30-~30: Ensure spelling is correct
Context: ...e or JSON payloads into the model. When Nivel 0 succeeds, modes/scan.md requires sk...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~30-~30: Ensure spelling is correct
Context: ...modes/scan.md requires skipping those niveles for the same company (local_parser_ok...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/local-parser-cookbook.md` at line 30, Replace the Spanish words used in
the sentence that references agent scan mode and scan behavior so the doc uses
consistent English: change "niveles" to "levels" and "Nivel 0" to "Level 0" in
the sentence that mentions `/career-ops scan`, Playwright and API payloads, and
the `local_parser_ok` behavior (also referenced alongside `modes/scan.md`) so
the doc consistently reads "levels" / "Level 0".

Comment thread modes/scan.md
Comment on lines +156 to 170
3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
Inicializar `local_parser_ok = []`.
Preferir ejecutar `node scan.mjs` una vez para cubrir todos los parsers + APIs zero-token; si se hace manualmente, repetir la lógica siguiente.
Para cada empresa en `tracked_companies` con `enabled: true`, `parser.command` y script existente:
a. Ejecutar `parser.command` con `parser.script` + `parser.args` usando ejecución local sin shell
b. Expandir placeholders `{careers_url}` y `{company}` en argumentos
c. Leer JSON de stdout (`[]`, `{ jobs: [] }`, o `{ results: [] }`)
d. Normalizar cada job a `{title, url, company, location}`
e. Resolver URLs relativas contra `careers_url`
f. Si el parser falla, registrar error, intentar fallback por API ATS si existe, y continuar con las demás empresas (**no** añadir a `local_parser_ok`)
g. Si el parser termina con éxito (pasos c–e sin error fatal), añadir `entry.name` a `local_parser_ok` y acumular jobs en candidatos

4. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5):
Para cada empresa en `tracked_companies` con `enabled: true` y `careers_url` definida:
Para cada empresa en `tracked_companies` con `enabled: true`, `careers_url` definida, y **nombre no listado en `local_parser_ok`**:
a. `browser_navigate` a la `careers_url`

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix ordered-list numbering to satisfy markdownlint MD029.

Line 168 currently continues with 4. after 3.5., which triggers the configured ordered-list style check. Use 1. style for list items (Markdown auto-numbers).

Proposed doc fix
-3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
+1. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
   ...
-4. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5):
+1. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5):
   ...
-5. **Nivel 2 — ATS APIs / feeds** (paralelo):
+1. **Nivel 2 — ATS APIs / feeds** (paralelo):
   ...
-6. **Nivel 3 — WebSearch queries** (paralelo si posible):
+1. **Nivel 3 — WebSearch queries** (paralelo si posible):
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 168-168: Ordered list item prefix
Expected: 1; Actual: 4; Style: 1/2/3

(MD029, ol-prefix)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modes/scan.md` around lines 156 - 170, The ordered list numbering in
modes/scan.md uses explicit numbers like "3.5." and "4." which triggers
markdownlint MD029; update the ordered list items (e.g., the heading "3.5.
**Nivel 0 — Local parser**", its sub-steps a–g, and the subsequent "4. **Nivel 1
— Playwright scan**" heading) to use the auto-numbering style by replacing
explicit numerals with "1." for each list item (and any nested ordered lists) so
markdown auto-numbers them consistently and satisfies MD029.
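Steps b through e of the Nivel 0 contract quoted above (placeholder expansion, reading the three accepted stdout shapes, normalizing jobs, and resolving relative URLs) can be sketched roughly as follows. The function names are hypothetical, and mapping `{company}` to `entry.name` is an assumption; the real logic lives in `providers/local-parser.mjs` and may differ.

```javascript
// Step b: expand {careers_url} and {company} placeholders in parser args.
// Assumption: {company} maps to entry.name.
function expandArgs(args, entry) {
  const values = { careers_url: entry.careers_url, company: entry.name };
  return (args || []).map((arg) =>
    String(arg).replace(/\{(careers_url|company)\}/g, (_, key) => String(values[key] ?? ''))
  );
}

// Step c: accept the three stdout shapes: [], { jobs: [] }, { results: [] }.
function extractJobs(stdout) {
  const data = JSON.parse(stdout);
  if (Array.isArray(data)) return data;
  if (data && Array.isArray(data.jobs)) return data.jobs;
  if (data && Array.isArray(data.results)) return data.results;
  throw new Error('parser output is not a job list');
}

// Steps d and e: normalize to { title, url, company, location } and
// resolve relative URLs against careers_url. Rows without title or url are dropped.
function normalizeJobs(stdout, entry) {
  return extractJobs(stdout)
    .filter((job) => job && job.title && job.url)
    .map((job) => ({
      title: String(job.title).trim(),
      url: new URL(job.url, entry.careers_url).href,
      company: job.company || entry.name,
      location: job.location || '',
    }));
}
```

Throwing from `extractJobs` on an unexpected shape lets the caller treat it like any other parser failure (step f), so the company is not added to `local_parser_ok`.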

Comment thread scan.mjs
Comment on lines 88 to 92
if (entry.provider) {
const p = providers.get(entry.provider);
if (!p) return { error: `unknown provider: ${entry.provider}` };
return { provider: p };
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Respect skipIds for explicit provider overrides.

resolveProvider() bypasses skipIds when entry.provider is set. In the local-parser failure path, this can reselect local-parser and block intended API fallback for explicitly configured entries.

Proposed fix
 function resolveProvider(entry, providers, { skipIds = [] } = {}) {
   if (entry.provider) {
+    if (skipIds.includes(entry.provider)) return null;
     const p = providers.get(entry.provider);
     if (!p) return { error: `unknown provider: ${entry.provider}` };
     return { provider: p };
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan.mjs` around lines 88 - 92, In resolveProvider, the explicit-provider
branch ignores skipIds: when entry.provider is set you currently immediately
fetch from providers and return, which can reselect a skipped provider (e.g.,
local-parser) and prevent API fallback; change the branch so it first checks
skipIds (e.g., skipIds.includes(entry.provider) or equivalent) and if that provider id is
skipped then do not return the explicit provider but fall through to the normal
selection/fallback path; otherwise, keep the existing
providers.get(entry.provider) lookup and return provider or unknown-provider
error. Ensure you reference resolveProvider, entry.provider, providers, and
skipIds in the change.
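A minimal sketch of the fall-through variant described in this prompt is shown below. The provider shape (`id` plus a `detect(entry)` method), the auto-detection loop, and the sample providers are assumptions made for illustration; the real `resolveProvider()` in `scan.mjs` may select providers differently, and the proposed fix above returns `null` instead of falling through.

```javascript
// Hypothetical provider shape: { id, detect(entry) => boolean }.
function resolveProvider(entry, providers, { skipIds = [] } = {}) {
  // Honor an explicit provider only when it has not been skipped.
  if (entry.provider && !skipIds.includes(entry.provider)) {
    const p = providers.get(entry.provider);
    if (!p) return { error: `unknown provider: ${entry.provider}` };
    return { provider: p };
  }
  // Explicit provider absent or skipped: fall through to auto-detection.
  for (const p of providers.values()) {
    if (skipIds.includes(p.id)) continue;
    if (p.detect(entry)) return { provider: p };
  }
  return null;
}

// Usage with two made-up providers.
const providers = new Map([
  ['local-parser', { id: 'local-parser', detect: (e) => Boolean(e.parser) }],
  ['greenhouse', { id: 'greenhouse', detect: (e) => String(e.careers_url).includes('greenhouse.io') }],
]);
const entry = {
  provider: 'local-parser',
  parser: { command: 'node' },
  careers_url: 'https://boards.greenhouse.io/acme',
};

const first = resolveProvider(entry, providers);
const fallback = resolveProvider(entry, providers, { skipIds: ['local-parser'] });
console.log(first.provider.id, fallback.provider.id); // local-parser greenhouse
```

With this shape, a failed local parser can be retried by calling `resolveProvider` again with `skipIds: ['local-parser']`, and an explicitly configured entry still reaches the API provider.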

Comment thread test-all.mjs
Comment on lines +331 to +341
const portalExample = readFile('templates/portals.example.yml');
if (
!portalExample.includes('cohere_jobs.py') &&
portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
portalExample.includes('already know their target careers URL')
) {
pass('portals example documents a generic local parser contract');
} else {
fail('portals example still points at a bundled Cohere parser');
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard templates/portals.example.yml before reading to avoid hard crash.

Line 331 calls readFile(...) unconditionally. If the file is missing/renamed, the whole suite throws instead of reporting a failed check. Please gate this with fileExists(...) and fail gracefully.

Proposed fix
-const portalExample = readFile('templates/portals.example.yml');
-if (
-  !portalExample.includes('cohere_jobs.py') &&
-  portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
-  portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
-  portalExample.includes('already know their target careers URL')
-) {
-  pass('portals example documents a generic local parser contract');
-} else {
-  fail('portals example still points at a bundled Cohere parser');
-}
+if (!fileExists('templates/portals.example.yml')) {
+  fail('Missing templates/portals.example.yml');
+} else {
+  const portalExample = readFile('templates/portals.example.yml');
+  if (
+    !portalExample.includes('cohere_jobs.py') &&
+    portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
+    portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
+    portalExample.includes('already know their target careers URL')
+  ) {
+    pass('portals example documents a generic local parser contract');
+  } else {
+    fail('portals example still points at a bundled Cohere parser');
+  }
+}

As per coding guidelines, "**/*.mjs: ... Ensure scripts handle missing data/ directories gracefully."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test-all.mjs` around lines 331 - 341, The test currently calls
readFile('templates/portals.example.yml') unguarded which will throw if the file
is absent; update the test in test-all.mjs to first call
fileExists('templates/portals.example.yml') and if it returns false call
fail(...) with a clear message that the portals example is missing, otherwise
read the file into portalExample and run the existing includes checks and
pass/fail logic; reference the readFile usage and the portalExample/pass/fail
block to apply the guard.

@lejrn
Author

lejrn commented May 18, 2026

Hey @santifer again,
I had to do some refactoring.
First, I fixed the issues mentioned before this message. GitHub then showed conflicts on the PR, so I fetched all 38 commits that had been added prior to this PR, refactored my local-parser feature to fit the new providers architecture that scan.md and scan.mjs are now built on, ran some tests, and logged the results. Finally, I cleaned up the branch, committed, and pushed again.

Hopefully it works now. Please let me know if anything needs to be changed. Thanks!

Development

Successfully merging this pull request may close these issues.

Add local parser support for zero-token portal scans

2 participants