feat(scan): add local-parser provider for portal scans #595
📝 Walkthrough
Adds configurable per-company local parser support.
Changes: Local Parser Scanning Feature
Sequence diagram(s): omitted (changes are primarily localized provider addition, docs, and wiring involving two main runtime components).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/cohere-scan-token-comparison-pr.md`:
- Around line 1-13: Add a top-level H1 heading to the report and insert the
required header fields by adding lines beginning with "**URL:**" and
"**Legitimacy:**" directly under the H1 (e.g., "# What does this PR do?" becomes
"# <title>" followed by "**URL:** <value>" and "**Legitimacy:** <value>").
Ensure the header block appears at the very top of the document before the
existing "## What does this PR do?" content and format the two fields exactly as
"**URL:**" and "**Legitimacy:**".
In `@modes/scan.md`:
- Line 61: Update the documentation in modes/scan.md to clarify that parser
artifacts should not be checked into git (store them under
data/parser-output/{company}/ and add them to .gitignore) but that any .gitkeep
placeholder files used to preserve the directory structure must remain in the
repo; explicitly state that the scanner does not need to retain full JSON after
stdout is read and that only .gitkeep files are an exception to the "keep out of
git" rule.
- Around line 53-59: Update the parser contract example in modes/scan.md to list
all three accepted JSON stdout formats: the array format ([{...}]), the
object-with-jobs format ({ "jobs": [{...}] }), and the object-with-results
format ({ "results": [{...}] }); show each format as a separate JSON example and
keep the note that company is optional and scan.mjs will fall back to
tracked_companies when company is missing so parser implementers know all
allowed shapes to support.
In `@scan.mjs`:
- Around line 416-423: The local-parser branch should fall back to the API
parser when runLocalParser fails; wrap the call to runLocalParser(company) in a
try/catch (or check its failure result) and on error log the failure and then
set jobs by fetching the API JSON (using fetchJson(source.url)) and invoking
PARSERS[source.type](json, company.name), while still setting sourceName to
`${source.type}-api`; ensure you reference runLocalParser, fetchJson and PARSERS
and handle missing source.url or unknown source.type errors similarly so the
code degrades to the API scan rather than dropping the company.
In `@scripts/parsers/cohere_jobs.py`:
- Around line 28-47: The code currently trusts any source_url before calling
urlopen; update ashby_api_url (and callers like fetch_jobs) to enforce an
HTTPS-only allowlist for Ashby hosts: parse the URL and require parsed.scheme ==
"https" and parsed.netloc be one of the approved Ashby hostnames (e.g.,
"api.ashbyhq.com" and any official public job-board hostnames your product
supports); if the host is "api.ashbyhq.com" return the original URL, otherwise
validate the host is in the allowlist before extracting board_slug and
constructing the API URL, and raise ValueError for disallowed schemes/hosts to
prevent opening arbitrary URLs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b2c029fd-9f87-48c4-8489-3f8d62666d31
📒 Files selected for processing (9)
- .gitignore
- data/parser-output/.gitkeep
- data/parser-output/cohere/.gitkeep
- docs/SCRIPTS.md
- docs/cohere-scan-token-comparison-pr.md
- modes/scan.md
- scan.mjs
- scripts/parsers/cohere_jobs.py
- templates/portals.example.yml
    ## What does this PR do?

    Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:

    - Playwright-rendered scraping of the Cohere Ashby board.
    - `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.

    The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.

    ## Related issue

    N/A - measurement and documentation artifact.

Add required report header fields and top-level heading.
This report should start with an H1 and include **URL:** and **Legitimacy:** in the header block.
Proposed patch
-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines: "All reports must include URL and Legitimacy fields in the header; format as **URL:** and **Legitimacy:** in the report".
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 1-1: First line in a file should be a top-level heading
(MD041, first-line-heading, first-line-h1)
    def ashby_api_url(source_url: str) -> str:
        parsed = urlparse(source_url)
        if parsed.netloc == "api.ashbyhq.com":
            return source_url

        board_slug = parsed.path.strip("/").split("/")[0]
        if not board_slug:
            raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

        return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"


    def fetch_jobs(source_url: str) -> list[dict]:
        req = urllib.request.Request(
            ashby_api_url(source_url),
            headers={"User-Agent": "Mozilla/5.0"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read())
        return data["jobs"]

Restrict source URL to HTTPS Ashby hosts before opening it.
source_url can currently pass through with insufficient scheme/host validation before urlopen. Add strict allowlisting to prevent unintended scheme/endpoint usage.
Proposed patch
 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme not in {"http", "https"}:
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if parsed.scheme != "https":
+            raise ValueError("Ashby API URL must use https")
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")

     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

     return f"https://api.ashbyhq.com/posting-api/job-board/{board_slug}"

🧰 Tools
🪛 Ruff (0.15.12)
[warning] 35-35: Avoid specifying long messages outside the exception class
(TRY003)
[error] 41-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
[error] 45-45: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
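The HTTPS-only allowlist this comment asks for can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in `scripts/parsers/cohere_jobs.py`: `validated_ashby_api_url` and the three constants are hypothetical names, and the allowlist is limited to the two hosts that appear in this review (`api.ashbyhq.com` and `jobs.ashbyhq.com`).

```python
from urllib.parse import urlparse

# Hosts and path prefix taken from the review thread; names are hypothetical.
API_HOST = "api.ashbyhq.com"
BOARD_HOST = "jobs.ashbyhq.com"
API_PREFIX = "/posting-api/job-board/"


def validated_ashby_api_url(source_url: str) -> str:
    """Return an HTTPS Ashby API URL, or raise ValueError for anything else."""
    parsed = urlparse(source_url)

    # Reject file:, http:, and any custom scheme before looking at the host.
    if parsed.scheme != "https":
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")

    if parsed.netloc == API_HOST:
        # An API URL is passed through only if it targets the job-board endpoint.
        if not parsed.path.startswith(API_PREFIX):
            raise ValueError("Unexpected Ashby API endpoint")
        return source_url

    if parsed.netloc != BOARD_HOST:
        raise ValueError(f"Unsupported host for source URL: {parsed.netloc or 'missing'}")

    # Public board page: the first path segment is treated as the board slug.
    board_slug = parsed.path.strip("/").split("/")[0]
    if not board_slug:
        raise ValueError(f"Cannot infer Ashby board slug from {source_url}")
    return f"https://{API_HOST}{API_PREFIX}{board_slug}"
```

Unlike the proposed patch, this sketch rejects plain `http` for the public board host as well, which matches the stricter wording of the agent prompt; whether that strictness is wanted is a call for the PR author.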
fceaf4a to 08ea970
Actionable comments posted: 1
♻️ Duplicate comments (4)
modes/scan.md (1)
61-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Clarify `.gitkeep` is the exception to "keep artifacts out of git."
This line should explicitly preserve `.gitkeep` placeholders while excluding generated JSON artifacts.
Suggested patch
-El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git.
+El escáner no necesita conservar el JSON completo después de leer stdout. Si un parser también genera un artefacto para auditoría o depuración, guardarlo en `data/parser-output/{company}/` y mantenerlo fuera de git (los JSON en `.gitignore`; los `.gitkeep` se mantienen en git para conservar la estructura).

scripts/parsers/cohere_jobs.py (1)
28-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Restrict parser source URL to HTTPS Ashby allowlist before fetch.
`ashby_api_url` still allows unsafe/unexpected URLs through to `urlopen` when `netloc` is `api.ashbyhq.com` (scheme/path are not validated). Enforce `https` and allowed Ashby hosts/paths before issuing the request.
Suggested patch
 def ashby_api_url(source_url: str) -> str:
     parsed = urlparse(source_url)
-    if parsed.netloc == "api.ashbyhq.com":
-        return source_url
+    if parsed.scheme != "https":
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or 'missing'}")
+
+    if parsed.netloc == "api.ashbyhq.com":
+        if not parsed.path.startswith("/posting-api/job-board/"):
+            raise ValueError("Unexpected Ashby API endpoint")
+        return source_url
+
+    if parsed.netloc != "jobs.ashbyhq.com":
+        raise ValueError(f"Unsupported host for source URL: {parsed.netloc}")

     board_slug = parsed.path.strip("/").split("/")[0]
     if not board_slug:
         raise ValueError(f"Cannot infer Ashby board slug from {source_url}")

docs/cohere-scan-token-comparison-pr.md (1)
1-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add required report header block at the top (H1 + URL + Legitimacy).
This report is missing the required header fields and top-level heading format.
Suggested patch
-## What does this PR do?
+# Cohere scan token comparison (PR artifact)
+
+**URL:** https://github.com/santifer/career-ops/pull/595
+**Legitimacy:** Internal PR measurement artifact (non-production benchmark)
+
+## What does this PR do?

As per coding guidelines, "All reports must include URL and Legitimacy fields in the header; format as **URL:** and **Legitimacy:** in the report".

scan.mjs (1)
416-423: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fall back to the ATS API when the local parser fails.
If `runLocalParser(company)` throws on Lines 416-418, the outer catch records the error and drops that company entirely. This still misses the PR objective of running the local parser before falling back to Greenhouse/Ashby/Lever.
Suggested fix
 if (source.kind === 'local-parser') {
-  jobs = await runLocalParser(company);
-  sourceName = 'local-parser';
+  try {
+    jobs = await runLocalParser(company);
+    sourceName = 'local-parser';
+  } catch (parserErr) {
+    const apiFallback = detectApi(company);
+    if (!apiFallback) throw parserErr;
+
+    const json = await fetchJson(apiFallback.url);
+    jobs = PARSERS[apiFallback.type](json, company.name);
+    sourceName = `${apiFallback.type}-api`;
+    errors.push({
+      company: company.name,
+      error: `local parser failed, used API fallback: ${parserErr.message}`,
+    });
+  }
 } else {
   const json = await fetchJson(source.url);
   jobs = PARSERS[source.type](json, company.name);
   sourceName = `${source.type}-api`;
 }

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scan.mjs`:
- Around line 383-389: The filtering assumes c.name is always a string and calls
toLowerCase(), which can throw for malformed entries; update the
selectedCompanies/filter logic to guard c.name (e.g., treat non-string names as
empty string or skip those entries) before calling toLowerCase and ensure
filterCompany is normalized (e.g., default to empty string) so a bad
tracked_companies item doesn't abort the scan; apply this guard where
selectedCompanies is built (the companies.filter(...) and subsequent .filter(c
=> c.name.toLowerCase().includes(filterCompany)) usage) and also ensure
downstream code using detectSource still receives well-formed company objects or
skips null/invalid ones.
---
Duplicate comments:
In `@docs/cohere-scan-token-comparison-pr.md`:
- Around line 1-13: Add the required top header block to this report: insert an
H1 title line at the very top (e.g., "# Cohere scan token comparison"), then add
"URL:" followed by the report's source URL and "Legitimacy:" followed by the
legitimacy statement (e.g., "Legitimacy: internal measurement/dry-run") as plain
lines immediately under the H1; update the existing "## What does this PR do?"
section to follow that header so the document begins with the mandated H1 + URL
+ Legitimacy fields.
In `@modes/scan.md`:
- Line 61: Update the sentence in modes/scan.md to explicitly state that
generated parser artifacts (e.g., JSON in data/parser-output/{company}/) must be
kept out of git but that .gitkeep files are the allowed exception and should be
committed; also ensure the repository .gitignore excludes the generated JSON
artifacts in data/parser-output/ while not ignoring .gitkeep so the placeholder
stays in source control.
In `@scan.mjs`:
- Around line 416-423: The current branch that calls runLocalParser(company) can
throw and abort processing the company; wrap the runLocalParser call in its own
try/catch so that on error you log the local-parser failure but then fall back
to the ATS API path: call fetchJson(source.url), set jobs =
PARSERS[source.type](json, company.name) and set sourceName =
`${source.type}-api`; ensure you still surface/log the original local-parser
error but continue processing rather than letting the outer catch drop the
company. Use the existing symbols runLocalParser, fetchJson, PARSERS,
source.kind, source.url, source.type, jobs and sourceName to locate and
implement the change.
In `@scripts/parsers/cohere_jobs.py`:
- Around line 28-47: The ashby_api_url function currently lets unsafe URLs reach
urlopen because it only checks netloc; update ashby_api_url and its usage in
fetch_jobs to enforce an HTTPS allowlist: require parsed.scheme == "https" and
parsed.netloc in the allowed set (at minimum "api.ashbyhq.com"), and if
parsed.netloc == "api.ashbyhq.com" validate the path prefix (e.g., it must start
with "/posting-api/" or the expected posting path) before returning source_url;
otherwise, when constructing the API URL from a board page (the existing
board_slug path flow), always build an explicit
"https://api.ashbyhq.com/posting-api/job-board/{board_slug}" and raise
ValueError for any non-https scheme, unknown host, or malformed path so that
fetch_jobs only ever calls urlopen with a validated HTTPS Ashby API URL.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a4ae2065-d06c-46fa-822c-af3c04da5e0d
📒 Files selected for processing (9)
- .gitignore
- data/parser-output/.gitkeep
- data/parser-output/cohere/.gitkeep
- docs/SCRIPTS.md
- docs/cohere-scan-token-comparison-pr.md
- modes/scan.md
- scan.mjs
- scripts/parsers/cohere_jobs.py
- templates/portals.example.yml
Hey @lejrn, local parser source is a useful extension to the scanner. Two questions before merging:
The contract itself (stdout JSON spec, before-fallback execution) is clean. Just want to keep the runtime minimal.
8060d74 to 4ebd978
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modes/scan.md`:
- Around line 116-120: The priority list is ambiguous about runtime context;
update the section so it explicitly states this is the agent-mode execution
order (e.g., title: "Agent-mode execution priority") or split into two
subsections "Agent-mode execution order" and "Scanner/runtime (scan.mjs)
execution notes" and move the numbered levels (Local parser, Playwright, API,
WebSearch) under the agent-mode heading; ensure you reference scan.mjs in the
scanner/runtime subsection to clarify the runtime ordering differences.
In `@scan.mjs`:
- Around line 197-198: Validate and sanitize parser.timeout_ms and
parser.max_buffer_bytes before calling execFile: ensure the derived timeout and
maxBuffer are finite numbers greater than zero (use Number(...) then
isFinite(...) and >0 checks) and fall back to LOCAL_PARSER_TIMEOUT_MS and
LOCAL_PARSER_MAX_BUFFER_BYTES when invalid; update the variables timeout and
maxBuffer (used when invoking execFile) to these sanitized values so execFile
never receives NaN, non-finite, or non-positive arguments.
In `@test-all.mjs`:
- Around line 287-306: The tests currently use literal string includes against
scan.mjs which is brittle; replace them with resilient pattern/AST-based checks:
for the company-normalization contract, parse scan.mjs (scanScript) and assert a
typeof guard on c.name (e.g., a UnaryExpression or BinaryExpression checking
typeof c.name === 'string') and that subsequent MemberExpressions call trim()
and toLowerCase() on c.name (or its safe variant) rather than matching exact
source text; for the parser-fallback contract, detect a call expression to
detectApi(company) and verify a control-flow path or adjacent statements that
log "local parser failed, used API fallback" and call fetchJson(apiFallback.url)
(use regex or AST traversal to ensure the sequence/structure exists, not raw
includes).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 529d6682-f1a7-4a5d-a898-a2098247722c
📒 Files selected for processing (10)
- .gitignore
- VERSION
- data/parser-output/.gitkeep
- docs/SCRIPTS.md
- docs/local-parser-cookbook.md
- modes/scan.md
- package.json
- scan.mjs
- templates/portals.example.yml
- test-all.mjs
    const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
    const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);

Validate parser timeout_ms and max_buffer_bytes before passing to execFile.
Line 197 and Line 198 accept non-finite/invalid values (NaN, <=0) from config, which can turn a simple config typo into a runtime failure for that company scan path.
Proposed fix
-  const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
-  const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+  const requestedTimeout = Number(parser.timeout_ms);
+  const timeout =
+    Number.isFinite(requestedTimeout) && requestedTimeout > 0
+      ? Math.trunc(requestedTimeout)
+      : LOCAL_PARSER_TIMEOUT_MS;
+
+  const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+  const maxBuffer =
+    Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+      ? Math.trunc(requestedMaxBuffer)
+      : LOCAL_PARSER_MAX_BUFFER_BYTES;

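The same guard can be expressed compactly as one reusable helper. The sketch below is in Python purely for illustration (the code under review is JavaScript); `positive_int_or_default` and both default constants are hypothetical, since the real `LOCAL_PARSER_*` values are not shown in this thread.

```python
import math

# Hypothetical defaults; the real LOCAL_PARSER_* values are not shown here.
DEFAULT_TIMEOUT_MS = 30_000
DEFAULT_MAX_BUFFER_BYTES = 1_048_576


def positive_int_or_default(value, default):
    """Return `value` truncated to an int if it is finite and >= 1, else `default`."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None, non-numeric strings, lists, and similar config typos land here.
        return default
    if not math.isfinite(number):
        return default
    truncated = math.trunc(number)
    # Checking after truncation also rejects fractional values such as 0.5.
    return truncated if truncated > 0 else default


timeout_ms = positive_int_or_default("5000", DEFAULT_TIMEOUT_MS)
max_buffer = positive_int_or_default(None, DEFAULT_MAX_BUFFER_BYTES)
```

One deliberate difference from the proposed fix: the positivity check runs after truncation, so a value like `0.5` falls back to the default instead of truncating to `0`.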
    const scanScript = readFile('scan.mjs');
    if (
      scanScript.includes("typeof c.name === 'string'") &&
      scanScript.includes('c.name.trim()') &&
      scanScript.includes('c.name.toLowerCase()')
    ) {
      pass('scan.mjs guards company names before filtering');
    } else {
      fail('scan.mjs does not guard company names before filtering');
    }

    if (
      scanScript.includes('detectApi(company)') &&
      scanScript.includes('local parser failed, used API fallback') &&
      scanScript.includes('fetchJson(apiFallback.url)')
    ) {
      pass('scan.mjs falls back to ATS API when local parser fails');
    } else {
      fail('scan.mjs does not fall back to ATS API when local parser fails');
    }

Replace literal includes(...) contract checks with resilient structure checks.
These assertions are currently tied to exact source strings, so they can fail on harmless refactors and pass even when behavior regresses (e.g., target text in comments). Prefer regex/AST-shape checks for guarded company normalization and parser-fallback flow to keep this contract test reliable.
Suggested direction
-  if (
-    scanScript.includes("typeof c.name === 'string'") &&
-    scanScript.includes('c.name.trim()') &&
-    scanScript.includes('c.name.toLowerCase()')
-  ) {
+  if (
+    /typeof\s+\w+\.name\s*===\s*['"]string['"]/.test(scanScript) &&
+    /\.\s*name\s*\.\s*trim\s*\(/.test(scanScript) &&
+    /\.\s*name\s*\.\s*toLowerCase\s*\(/.test(scanScript)
+  ) {
     pass('scan.mjs guards company names before filtering');
   } else {
     fail('scan.mjs does not guard company names before filtering');
   }

-  if (
-    scanScript.includes('detectApi(company)') &&
-    scanScript.includes('local parser failed, used API fallback') &&
-    scanScript.includes('fetchJson(apiFallback.url)')
-  ) {
+  if (
+    /detectApi\s*\(\s*company\s*\)/.test(scanScript) &&
+    /local parser failed,\s*used API fallback/.test(scanScript) &&
+    /fetchJson\s*\(\s*apiFallback\.url\s*\)/.test(scanScript)
+  ) {
     pass('scan.mjs falls back to ATS API when local parser fails');
   } else {
     fail('scan.mjs does not fall back to ATS API when local parser fails');
   }

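The brittleness argument in this comment can be demonstrated on a toy input. The sketch below uses Python's `re` module for illustration only (the test file itself is JavaScript); the two source strings are made-up examples, and the pattern mirrors the first regex from the suggested direction.

```python
import re

# Two hypothetical formattings of the same guard in a JavaScript source file.
ORIGINAL = "if (typeof c.name === 'string') { return c.name.trim(); }"
REFORMATTED = 'if (typeof company.name  ===  "string") { return company.name.trim(); }'

# The literal the current test looks for, and the looser pattern from the review.
LITERAL = "typeof c.name === 'string'"
PATTERN = re.compile(r"""typeof\s+\w+\.name\s*===\s*['"]string['"]""")


def literal_check(source: str) -> bool:
    """Exact-substring contract check: breaks on renames or quote changes."""
    return LITERAL in source


def pattern_check(source: str) -> bool:
    """Regex contract check: tolerant of whitespace, quoting, and variable name."""
    return PATTERN.search(source) is not None


print(literal_check(ORIGINAL), literal_check(REFORMATTED))
print(pattern_check(ORIGINAL), pattern_check(REFORMATTED))
```

The pattern survives a harmless rename and quote change that defeats the literal check. It still matches the same text inside a comment, which is the case the review suggests AST-based checks for.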
Hey @santifer, thank you for your code review and comments!
Forgive me if I made a mistake, and please tell me what to fix if needed. Thank you!
Introduce providers/local-parser.mjs for executable parsers in portals.yml, with scan.mjs routing parser-first and ATS API fallback. Extend modes/scan.md with Nivel 0 contract and local_parser_ok to skip redundant Playwright/API work. Add generic portals.example.yml docs, local-parser-cookbook.md, and test-all contract checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
4ebd978 to 98fccce
Actionable comments posted: 4
♻️ Duplicate comments (1)
providers/local-parser.mjs (1)
79-80: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Sanitize parser numeric limits before calling `execFile`.
`timeout_ms`/`max_buffer_bytes` are not validated for finite positive numbers. Invalid values can propagate directly to `execFile` options.
Proposed fix
- const timeout = Number(parser.timeout_ms || LOCAL_PARSER_TIMEOUT_MS);
- const maxBuffer = Number(parser.max_buffer_bytes || LOCAL_PARSER_MAX_BUFFER_BYTES);
+ const requestedTimeout = Number(parser.timeout_ms);
+ const timeout =
+   Number.isFinite(requestedTimeout) && requestedTimeout > 0
+     ? Math.trunc(requestedTimeout)
+     : LOCAL_PARSER_TIMEOUT_MS;
+
+ const requestedMaxBuffer = Number(parser.max_buffer_bytes);
+ const maxBuffer =
+   Number.isFinite(requestedMaxBuffer) && requestedMaxBuffer > 0
+     ? Math.trunc(requestedMaxBuffer)
+     : LOCAL_PARSER_MAX_BUFFER_BYTES;
#!/bin/bash
rg -n "timeout_ms|max_buffer_bytes|execFileAsync|Number\\.isFinite" providers/local-parser.mjs -C2
As per coding guidelines, `**/*.mjs`: Ensure scripts handle missing data/ directories gracefully.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@providers/local-parser.mjs` around lines 79 - 80, The parsed numeric limits timeout and maxBuffer are not validated before being passed to execFileAsync; update the code around the expressions that compute timeout (using parser.timeout_ms and LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using parser.max_buffer_bytes and LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers, check Number.isFinite(...) and that values are > 0 (and integer where appropriate), and fall back to the corresponding LOCAL_PARSER_* defaults when invalid; ensure the validated variables used by execFileAsync are safe positive finite numbers to prevent passing bad options to execFile.
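One way to express this validation is a small shared helper, sketched below under stated assumptions: `positiveIntOr` and `resolveLimits` are hypothetical names, and the two default values are placeholders rather than the constants actually defined in `providers/local-parser.mjs`. This variant truncates before the positivity check, so a fractional value such as `0.5` falls back to the default instead of becoming `0`.

```javascript
// Hypothetical helper, not part of providers/local-parser.mjs.
// Returns a positive finite integer, or the fallback for anything else.
function positiveIntOr(value, fallback) {
  const n = Math.trunc(Number(value));
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Placeholder defaults for illustration; the real values live in the provider.
const LOCAL_PARSER_TIMEOUT_MS = 30000;
const LOCAL_PARSER_MAX_BUFFER_BYTES = 1024 * 1024;

// Builds the numeric limits that would later be passed as execFile options.
function resolveLimits(parser = {}) {
  return {
    timeout: positiveIntOr(parser.timeout_ms, LOCAL_PARSER_TIMEOUT_MS),
    maxBuffer: positiveIntOr(parser.max_buffer_bytes, LOCAL_PARSER_MAX_BUFFER_BYTES),
  };
}
```

For example, `resolveLimits({ timeout_ms: '5000', max_buffer_bytes: -1 })` would keep the requested timeout and fall back to the default buffer size.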
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/local-parser-cookbook.md`:
- Line 30: Replace the Spanish words used in the sentence that references agent
scan mode and scan behavior so the doc uses consistent English: change "niveles"
to "levels" and "Nivel 0" to "Level 0" in the sentence that mentions
`/career-ops scan`, Playwright and API payloads, and the `local_parser_ok`
behavior (also referenced alongside `modes/scan.md`) so the doc consistently
reads "levels" / "Level 0".
In `@modes/scan.md`:
- Around line 156-170: The ordered list numbering in modes/scan.md uses explicit
numbers like "3.5." and "4." which triggers markdownlint MD029; update the
ordered list items (e.g., the heading "3.5. **Nivel 0 — Local parser**", its
sub-steps a–g, and the subsequent "4. **Nivel 1 — Playwright scan**" heading) to
use the auto-numbering style by replacing explicit numerals with "1." for each
list item (and any nested ordered lists) so markdown auto-numbers them
consistently and satisfies MD029.
In `@scan.mjs`:
- Around line 88-92: In resolveProvider, the explicit-provider branch ignores
skipIds: when entry.provider is set you currently immediately fetch from
providers and return, which can reselect a skipped provider (e.g., local-parser)
and prevent API fallback; change the branch so it first checks skipIds (e.g.,
skipIds.includes(entry.provider) or equivalent) and if that provider is skipped then do not
return the explicit provider but fall through to the normal selection/fallback
path; otherwise, keep the existing providers.get(entry.provider) lookup and
return provider or unknown-provider error. Ensure you reference resolveProvider,
entry.provider, providers, and skipIds in the change.
In `@test-all.mjs`:
- Around line 331-341: The test currently calls
readFile('templates/portals.example.yml') unguarded which will throw if the file
is absent; update the test in test-all.mjs to first call
fileExists('templates/portals.example.yml') and if it returns false call
fail(...) with a clear message that the portals example is missing, otherwise
read the file into portalExample and run the existing includes checks and
pass/fail logic; reference the readFile usage and the portalExample/pass/fail
block to apply the guard.
---
Duplicate comments:
In `@providers/local-parser.mjs`:
- Around line 79-80: The parsed numeric limits timeout and maxBuffer are not
validated before being passed to execFileAsync; update the code around the
expressions that compute timeout (using parser.timeout_ms and
LOCAL_PARSER_TIMEOUT_MS) and maxBuffer (using parser.max_buffer_bytes and
LOCAL_PARSER_MAX_BUFFER_BYTES) to coerce to numbers, check Number.isFinite(...)
and that values are > 0 (and integer where appropriate), and fall back to the
corresponding LOCAL_PARSER_* defaults when invalid; ensure the validated
variables used by execFileAsync are safe positive finite numbers to prevent
passing bad options to execFile.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 32e27756-5881-4c43-b7cd-929cc6efd41f
📒 Files selected for processing (11)
- .gitignore
- VERSION
- data/parser-output/.gitkeep
- docs/SCRIPTS.md
- docs/local-parser-cookbook.md
- modes/scan.md
- package.json
- providers/local-parser.mjs
- scan.mjs
- templates/portals.example.yml
- test-all.mjs
|
|
||
| `scan.mjs` uses **0 LLM tokens** for discovery: parsers run locally and only normalized job rows enter the pipeline. | ||
|
|
||
| In agent scan mode (`/career-ops scan`), Playwright and API niveles send large page or JSON payloads into the model. When Nivel 0 succeeds, `modes/scan.md` requires skipping those niveles for the same company (`local_parser_ok`). |
Use consistent English terminology in this English doc.
Line 30 mixes Spanish (niveles, Nivel) with English text. Replace with levels / Level 0 for consistency and readability.
🧰 Tools
🪛 LanguageTool
[grammar] ~30-~30: Ensure spelling is correct
Context: .../career-ops scan), Playwright and API niveles send large page or JSON payloads into t...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~30-~30: Ensure spelling is correct
Context: ...e or JSON payloads into the model. When Nivel 0 succeeds, modes/scan.md requires sk...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~30-~30: Ensure spelling is correct
Context: ...modes/scan.md requires skipping those niveles for the same company (local_parser_ok...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/local-parser-cookbook.md` at line 30, Replace the Spanish words used in
the sentence that references agent scan mode and scan behavior so the doc uses
consistent English: change "niveles" to "levels" and "Nivel 0" to "Level 0" in
the sentence that mentions `/career-ops scan`, Playwright and API payloads, and
the `local_parser_ok` behavior (also referenced alongside `modes/scan.md`) so
the doc consistently reads "levels" / "Level 0".
| 3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token): | ||
| Inicializar `local_parser_ok = []`. | ||
| Preferir ejecutar `node scan.mjs` una vez para cubrir todos los parsers + APIs zero-token; si se hace manualmente, repetir la lógica siguiente. | ||
| Para cada empresa en `tracked_companies` con `enabled: true`, `parser.command` y script existente: | ||
| a. Ejecutar `parser.command` con `parser.script` + `parser.args` usando ejecución local sin shell | ||
| b. Expandir placeholders `{careers_url}` y `{company}` en argumentos | ||
| c. Leer JSON de stdout (`[]`, `{ jobs: [] }`, o `{ results: [] }`) | ||
| d. Normalizar cada job a `{title, url, company, location}` | ||
| e. Resolver URLs relativas contra `careers_url` | ||
| f. Si el parser falla, registrar error, intentar fallback por API ATS si existe, y continuar con las demás empresas (**no** añadir a `local_parser_ok`) | ||
| g. Si el parser termina con éxito (pasos c–e sin error fatal), añadir `entry.name` a `local_parser_ok` y acumular jobs en candidatos | ||
|
|
||
| 4. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5): | ||
| Para cada empresa en `tracked_companies` con `enabled: true` y `careers_url` definida: | ||
| Para cada empresa en `tracked_companies` con `enabled: true`, `careers_url` definida, y **nombre no listado en `local_parser_ok`**: | ||
| a. `browser_navigate` a la `careers_url` |
Fix ordered-list numbering to satisfy markdownlint MD029.
Line 168 currently continues with 4. after 3.5., which triggers the configured ordered-list style check. Use 1. style for list items (Markdown auto-numbers).
Proposed doc fix
-3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
+1. **Nivel 0 — Local parser** (`scan.mjs`, zero-token):
...
-4. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5):
+1. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5):
...
-5. **Nivel 2 — ATS APIs / feeds** (paralelo):
+1. **Nivel 2 — ATS APIs / feeds** (paralelo):
...
-6. **Nivel 3 — WebSearch queries** (paralelo si posible):
+1. **Nivel 3 — WebSearch queries** (paralelo si posible):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 3.5. **Nivel 0 — Local parser** (`scan.mjs`, zero-token): | |
| Inicializar `local_parser_ok = []`. | |
| Preferir ejecutar `node scan.mjs` una vez para cubrir todos los parsers + APIs zero-token; si se hace manualmente, repetir la lógica siguiente. | |
| Para cada empresa en `tracked_companies` con `enabled: true`, `parser.command` y script existente: | |
| a. Ejecutar `parser.command` con `parser.script` + `parser.args` usando ejecución local sin shell | |
| b. Expandir placeholders `{careers_url}` y `{company}` en argumentos | |
| c. Leer JSON de stdout (`[]`, `{ jobs: [] }`, o `{ results: [] }`) | |
| d. Normalizar cada job a `{title, url, company, location}` | |
| e. Resolver URLs relativas contra `careers_url` | |
| f. Si el parser falla, registrar error, intentar fallback por API ATS si existe, y continuar con las demás empresas (**no** añadir a `local_parser_ok`) | |
| g. Si el parser termina con éxito (pasos c–e sin error fatal), añadir `entry.name` a `local_parser_ok` y acumular jobs en candidatos | |
| 4. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5): | |
| Para cada empresa en `tracked_companies` con `enabled: true` y `careers_url` definida: | |
| Para cada empresa en `tracked_companies` con `enabled: true`, `careers_url` definida, y **nombre no listado en `local_parser_ok`**: | |
| a. `browser_navigate` a la `careers_url` | |
| 1. **Nivel 0 — Local parser** (`scan.mjs`, zero-token): | |
| Inicializar `local_parser_ok = []`. | |
| Preferir ejecutar `node scan.mjs` una vez para cubrir todos los parsers + APIs zero-token; si se hace manualmente, repetir la lógica siguiente. | |
| Para cada empresa en `tracked_companies` con `enabled: true`, `parser.command` y script existente: | |
| a. Ejecutar `parser.command` con `parser.script` + `parser.args` usando ejecución local sin shell | |
| b. Expandir placeholders `{careers_url}` y `{company}` en argumentos | |
| c. Leer JSON de stdout (`[]`, `{ jobs: [] }`, o `{ results: [] }`) | |
| d. Normalizar cada job a `{title, url, company, location}` | |
| e. Resolver URLs relativas contra `careers_url` | |
| f. Si el parser falla, registrar error, intentar fallback por API ATS si existe, y continuar con las demás empresas (**no** añadir a `local_parser_ok`) | |
| g. Si el parser termina con éxito (pasos c–e sin error fatal), añadir `entry.name` a `local_parser_ok` y acumular jobs en candidatos | |
| 1. **Nivel 1 — Playwright scan** (paralelo en batches de 3-5): | |
| Para cada empresa en `tracked_companies` con `enabled: true`, `careers_url` definida, y **nombre no listado en `local_parser_ok`**: | |
| a. `browser_navigate` a la `careers_url` |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 168-168: Ordered list item prefix
Expected: 1; Actual: 4; Style: 1/2/3
(MD029, ol-prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modes/scan.md` around lines 156 - 170, The ordered list numbering in
modes/scan.md uses explicit numbers like "3.5." and "4." which triggers
markdownlint MD029; update the ordered list items (e.g., the heading "3.5.
**Nivel 0 — Local parser**", its sub-steps a–g, and the subsequent "4. **Nivel 1
— Playwright scan**" heading) to use the auto-numbering style by replacing
explicit numerals with "1." for each list item (and any nested ordered lists) so
markdown auto-numbers them consistently and satisfies MD029.
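The Nivel 0 sub-steps quoted in this comment (placeholder expansion, accepted JSON shapes, normalization, and relative-URL resolution) can be sketched roughly as follows. All function names are hypothetical, and mapping `{company}` to `entry.name` is an assumption; the actual `providers/local-parser.mjs` may differ.

```javascript
// Sketch only: these names are hypothetical, not taken from the provider.

// Step b: expand {careers_url} and {company} placeholders in parser args.
// Assumes {company} maps to entry.name.
function expandArgs(args, entry) {
  return (args || []).map((arg) =>
    String(arg)
      .replaceAll('{careers_url}', entry.careers_url ?? '')
      .replaceAll('{company}', entry.name ?? '')
  );
}

// Step c: accept [], { jobs: [] } or { results: [] } from parser stdout.
function extractJobs(stdout) {
  const parsed = JSON.parse(stdout);
  if (Array.isArray(parsed)) return parsed;
  if (parsed && Array.isArray(parsed.jobs)) return parsed.jobs;
  if (parsed && Array.isArray(parsed.results)) return parsed.results;
  throw new Error('unsupported parser output shape');
}

// Steps d-e: normalize to {title, url, company, location} and resolve
// relative URLs against careers_url. Rows without a string url are dropped.
function normalizeJobs(stdout, entry) {
  return extractJobs(stdout)
    .filter((job) => job && typeof job.url === 'string')
    .map((job) => ({
      title: String(job.title ?? '').trim(),
      url: new URL(job.url, entry.careers_url).href,
      company: job.company ?? entry.name,
      location: job.location ?? '',
    }));
}
```

A thrown error from `extractJobs` or `JSON.parse` would correspond to step f: the caller records the failure, tries the ATS API fallback if one exists, and does not add the company to `local_parser_ok`.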
| if (entry.provider) { | ||
| const p = providers.get(entry.provider); | ||
| if (!p) return { error: `unknown provider: ${entry.provider}` }; | ||
| return { provider: p }; | ||
| } |
Respect skipIds for explicit provider overrides.
resolveProvider() bypasses skipIds when entry.provider is set. In the local-parser failure path, this can reselect local-parser and block intended API fallback for explicitly configured entries.
Proposed fix
function resolveProvider(entry, providers, { skipIds = [] } = {}) {
if (entry.provider) {
+ if (skipIds.includes(entry.provider)) return null;
const p = providers.get(entry.provider);
if (!p) return { error: `unknown provider: ${entry.provider}` };
return { provider: p };
}📝 Committable suggestion
| if (entry.provider) { | |
| const p = providers.get(entry.provider); | |
| if (!p) return { error: `unknown provider: ${entry.provider}` }; | |
| return { provider: p }; | |
| } | |
| if (entry.provider) { | |
| if (skipIds.includes(entry.provider)) return null; | |
| const p = providers.get(entry.provider); | |
| if (!p) return { error: `unknown provider: ${entry.provider}` }; | |
| return { provider: p }; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scan.mjs` around lines 88 - 92, In resolveProvider, the explicit-provider
branch ignores skipIds: when entry.provider is set you currently immediately
fetch from providers and return, which can reselect a skipped provider (e.g.,
local-parser) and prevent API fallback; change the branch so it first checks
skipIds (e.g., skipIds.includes(entry.provider) or equivalent) and if that provider is
skipped then do not return the explicit provider but fall through to the normal
selection/fallback path; otherwise, keep the existing
providers.get(entry.provider) lookup and return provider or unknown-provider
error. Ensure you reference resolveProvider, entry.provider, providers, and
skipIds in the change.
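A sketch of how the guarded branch might sit inside the whole function. Only the explicit-provider branch comes from the proposed fix above; the auto-detection loop, the `detect()` method, and the `Map` shape of `providers` are assumptions made so the example runs on its own.

```javascript
// Sketch: everything after the explicit-provider branch is an assumption.
function resolveProvider(entry, providers, { skipIds = [] } = {}) {
  if (entry.provider) {
    // Mirrors the proposed fix: a skipped explicit provider is not reselected.
    if (skipIds.includes(entry.provider)) return null;
    const p = providers.get(entry.provider);
    if (!p) return { error: `unknown provider: ${entry.provider}` };
    return { provider: p };
  }
  // Hypothetical auto-detection: first non-skipped provider whose detect() matches.
  for (const [id, p] of providers) {
    if (skipIds.includes(id)) continue;
    if (typeof p.detect === 'function' && p.detect(entry)) return { provider: p };
  }
  return null;
}
```

Note that the proposed fix returns `null` for a skipped explicit provider rather than falling through to detection, so with this shape the caller remains responsible for trying the ATS API fallback after a local-parser failure.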
| const portalExample = readFile('templates/portals.example.yml'); | ||
| if ( | ||
| !portalExample.includes('cohere_jobs.py') && | ||
| portalExample.includes('scripts/parsers/example-js-company-jobs.js') && | ||
| portalExample.includes('scripts/parsers/example_python_company_jobs.py') && | ||
| portalExample.includes('already know their target careers URL') | ||
| ) { | ||
| pass('portals example documents a generic local parser contract'); | ||
| } else { | ||
| fail('portals example still points at a bundled Cohere parser'); | ||
| } |
Guard templates/portals.example.yml before reading to avoid hard crash.
Line 331 calls readFile(...) unconditionally. If the file is missing/renamed, the whole suite throws instead of reporting a failed check. Please gate this with fileExists(...) and fail gracefully.
Proposed fix
-const portalExample = readFile('templates/portals.example.yml');
-if (
- !portalExample.includes('cohere_jobs.py') &&
- portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
- portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
- portalExample.includes('already know their target careers URL')
-) {
- pass('portals example documents a generic local parser contract');
-} else {
- fail('portals example still points at a bundled Cohere parser');
-}
+if (!fileExists('templates/portals.example.yml')) {
+ fail('Missing templates/portals.example.yml');
+} else {
+ const portalExample = readFile('templates/portals.example.yml');
+ if (
+ !portalExample.includes('cohere_jobs.py') &&
+ portalExample.includes('scripts/parsers/example-js-company-jobs.js') &&
+ portalExample.includes('scripts/parsers/example_python_company_jobs.py') &&
+ portalExample.includes('already know their target careers URL')
+ ) {
+ pass('portals example documents a generic local parser contract');
+ } else {
+ fail('portals example still points at a bundled Cohere parser');
+ }
+}
As per coding guidelines, "`**/*.mjs`: ... Ensure scripts handle missing data/ directories gracefully."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test-all.mjs` around lines 331 - 341, The test currently calls
readFile('templates/portals.example.yml') unguarded which will throw if the file
is absent; update the test in test-all.mjs to first call
fileExists('templates/portals.example.yml') and if it returns false call
fail(...) with a clear message that the portals example is missing, otherwise
read the file into portalExample and run the existing includes checks and
pass/fail logic; reference the readFile usage and the portalExample/pass/fail
block to apply the guard.
|
Hey @santifer again, hopefully it works now. Please let me know if anything requires modification. Thanks! |
Summary
Rebased onto latest `main` and refactored the local-parser work to fit the new provider plugin architecture (`providers/*.mjs` + `resolveProvider()` in `scan.mjs`).
- `providers/local-parser.mjs` for companies with `parser.command` + `parser.script` in `portals.yml`
- `modes/scan.md` with Nivel 0 (local parser) and `local_parser_ok` rules so agent scan skips redundant Playwright/API/WebSearch when Nivel 0 succeeds
- `templates/portals.example.yml` and `docs/local-parser-cookbook.md`
- `test-all.mjs` contract checks (module exists, API fallback, no bundled runtime parsers)
Closes #594
PR cleanup (vs earlier revisions)
- `upstream/main` (includes the provider refactor from Feature: cold email #593 / scan plugin work).
- `scripts/parsers/cohere_jobs.py` and other company-specific runtime scripts from this PR (per maintainer direction: runtime-neutral contract only).
- `test-local-parser.mjs`, `test-scan-token-comparison.mjs`, and benchmark tables live on fork branch `feature/local-parser-integration-tests` (`lejrn/career-ops`), not in this PR.
Test plan
- `node test-all.mjs --quick`
- `node --check scan.mjs` and `node --check providers/local-parser.mjs`
- `portals.yml` with `parser:` pointing at your own script; run `node scan.mjs --dry-run --company <Name>`
- `local-parser` in scan summary; confirm API fallback when parser script is missing or fails
- `modes/scan.md` `local_parser_ok` behavior (skip Playwright/API for parser-covered companies)