feat(scan): add local parser source support

lejrn · cursoragent · lejrn · commit fceaf4aaa3ad · 2026-05-06T02:40:53.000+02:00
Add a zero-token local parser source for scan.mjs and document it with a Cohere example so SSR/static career pages can be scanned without Playwright.

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/.gitignore b/.gitignore
@@ -3,6 +3,9 @@ cv.md
 data/applications.md
 data/pipeline.md
 data/scan-history.tsv
+data/parser-output/**/*.json
+!data/parser-output/.gitkeep
+!data/parser-output/**/.gitkeep
 reports/*.md
 !reports/.gitkeep
 output/*
diff --git a/data/parser-output/.gitkeep b/data/parser-output/.gitkeep
@@ -0,0 +1 @@
+
diff --git a/data/parser-output/cohere/.gitkeep b/data/parser-output/cohere/.gitkeep
@@ -0,0 +1 @@
+
diff --git a/docs/SCRIPTS.md b/docs/SCRIPTS.md
@@ -180,7 +180,23 @@ Each URL gets a verdict: `active`, `expired`, or `uncertain` with a reason.
 
 ## scan
 
-Zero-token portal scanner. Hits ATS APIs (Greenhouse, Ashby, Lever) and career pages directly — no LLM tokens consumed. Reads `portals.yml` for target companies and search queries, outputs matching listings to stdout and optionally appends to `data/pipeline.md`.
+Zero-token portal scanner. Runs configured local parsers for SSR/static career pages and hits ATS APIs (Greenhouse, Ashby, Lever) directly — no LLM tokens consumed. Reads `portals.yml` for target companies, outputs matching listings to stdout, and optionally appends to `data/pipeline.md`.
+
+For custom SSR pages, configure a tracked company with `scan_method: local_parser` and a `parser` block. The parser command must print JSON jobs to stdout:
+
+```yaml
+parser:
+  command: python3
+  script: scripts/parsers/cohere_jobs.py
+  args:
+    - --url
+    - "{careers_url}"
+    - --stdout-jobs
+    - --no-output
+  format: jobs-json-v1
+```
+
+If a parser writes full extraction artifacts for debugging or audit, store them under `data/parser-output/{company}/`. `scan.mjs` reads stdout and does not require those JSON files after parsing.
 
 ```bash
 npm run scan
diff --git a/docs/cohere-scan-token-comparison-pr.md b/docs/cohere-scan-token-comparison-pr.md
@@ -0,0 +1,49 @@
+## What does this PR do?
+
+Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:
+
+- Playwright-rendered scraping of the Cohere Ashby board.
+- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.
+
+The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.
+
+## Related issue
+
+N/A - measurement and documentation artifact.
+
+## Type of change
+
+- [ ] Bug fix
+- [ ] New feature
+- [x] Documentation / translation
+- [ ] Refactor (no behavior change)
+
+## Summary
+
+The local parser path is the cheaper search path for Cohere because it runs locally through `scan.mjs` and does not send scraped job data to an LLM. In this test, the local parser found 71 Cohere Engineering/R&D jobs, then the scanner filtered and deduplicated them down to 33 dry-run new offers.
+
+The Playwright path rendered `https://jobs.ashbyhq.com/cohere` in Chromium and extracted the page text/job links from the live careers board. That path found 129 unique job URLs across the whole board.
+
+These counts are not perfectly apples-to-apples: the Playwright scrape read the full Cohere board, while the local parser is intentionally scoped to Engineering/R&D departments.
+
+## Token Comparison
+
+| Mode | Test command / method | Jobs found | LLM tokens used by search | Token estimate basis |
+|---|---|---:|---:|---|
+| Playwright scrape | Headless Chromium render of `https://jobs.ashbyhq.com/cohere` | 129 unique job URLs | Not directly exposed by Cursor | Rendered page body was 17,526 characters, roughly 4,382 estimated payload tokens using `characters / 4` |
+| Local parser | `node scan.mjs --dry-run --company Cohere` | 71 parser jobs, 33 dry-run new offers after scanner filters/dedup | 0 | `scan.mjs` uses local Python + JSON parsing and does not send the scraped data to an LLM |
+
+## Token Estimate Disclaimer
+
+Cursor does not expose exact billable token usage per slash-command/tool run in this environment, so the Playwright number above is a payload estimate, not an invoice-grade token counter.
+
+The estimate uses the rendered page body size as a proxy for what an agent would need to read from the browser snapshot. Actual model input can differ because browser snapshots include accessibility structure, refs, prompt context, tool-call metadata, and conversation history. A compact snapshot may be smaller than raw page text; a full agent run with surrounding instructions may be larger.
+
+The local parser result is different: the search itself uses zero LLM tokens because `scan.mjs` runs locally. If the full parser JSON were pasted back into the chat for analysis, that would consume tokens, but that is not part of the scanner search path.
+
+## Test Plan
+
+- Ran `node scan.mjs --dry-run --company Cohere` after temporarily enabling Cohere in `portals.yml`.
+- Confirmed dry-run mode did not write to `data/pipeline.md` or `data/scan-history.tsv`.
+- Restored `portals.yml` after the test.
+- Ran a headless Playwright script to render and measure the Cohere Ashby board payload.
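
The payload estimate in the Token Comparison table can be reproduced with a short calculation. This is a minimal sketch that assumes the `characters / 4` heuristic named in the table and rounds up; the function name `estimate_tokens` is hypothetical, and the result is a rough proxy, not a tokenizer count.

```python
import math


def estimate_tokens(char_count: int, chars_per_token: int = 4) -> int:
    """Rough payload-token estimate: characters / 4, rounded up (assumption)."""
    return math.ceil(char_count / chars_per_token)


# 17,526 is the rendered page body size reported in the comparison table.
print(estimate_tokens(17_526))  # 4382
```

Because the divisor is only a heuristic, the output is best read as an order-of-magnitude figure for the Playwright path, next to the zero tokens used by the local parser path.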
diff --git a/modes/scan.md b/modes/scan.md
@@ -2,7 +2,7 @@
 
 Scans the configured job portals, filters by title relevance, and adds new offers to the pipeline for later evaluation.
 
-> **Note (v1.5+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and only queries the public Greenhouse, Ashby, and Lever APIs directly. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has no Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.
+> **Note (v1.6+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and uses structured sources: local parsers configured per company and the public Greenhouse, Ashby, and Lever APIs. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has neither a local parser nor a Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.
 
 ## Recommended execution
 
@@ -21,9 +21,44 @@ Agent(
 Read `portals.yml`, which contains:
 - `search_queries`: List of WebSearch queries with `site:` filters per portal (broad discovery)
 - `tracked_companies`: Specific companies with a `careers_url` for direct navigation
+- `tracked_companies[].parser`: Optional local parser for SSR or stable-HTML pages
 - `title_filter`: positive/negative/seniority_boost keywords for title filtering
 
-## Discovery strategy (3 levels)
+## Discovery strategy (4 levels)
+
+### Level 0: Local parser (CHEAPEST)
+
+**For each company in `tracked_companies` with `parser:` configured:** run the local parser defined in `portals.yml`. This level is ideal when the careers page uses SSR or stable HTML and a Python/Node script already exists that extracts the jobs without help from the agent.
+
+Recommended contract:
+
+```yaml
+- name: Cohere
+  careers_url: https://jobs.ashbyhq.com/cohere
+  scan_method: local_parser
+  parser:
+    command: python3
+    script: scripts/parsers/cohere_jobs.py
+    args:
+      - --url
+      - "{careers_url}"
+      - --stdout-jobs
+      - --no-output
+    format: jobs-json-v1
+  enabled: true
+```
+
+The parser must print JSON to stdout:
+
+```json
+[
+  { "title": "Senior AI Engineer", "url": "https://example.com/jobs/123", "location": "Remote" }
+]
+```
+
+`company` is optional; if it is missing, `scan.mjs` uses the name from `tracked_companies`.
+
+The scanner does not need to keep the full JSON after reading stdout. If a parser also produces an artifact for audit or debugging, store it in `data/parser-output/{company}/` and keep it out of git.
 
 ### Level 1: Direct Playwright (PRIMARY)
 
@@ -60,9 +95,10 @@ Para empresas con API pública o feed estructurado, usar la respuesta JSON/XML c
 The `search_queries` with `site:` filters cover portals across the board (all Ashby boards, all Greenhouse boards, etc.). Useful for discovering NEW companies that are not yet in `tracked_companies`, but the results may be stale.
 
 **Execution priority:**
-1. Level 1: Playwright → all `tracked_companies` with `careers_url`
-2. Level 2: API → all `tracked_companies` with `api:`
-3. Level 3: WebSearch → all `search_queries` with `enabled: true`
+1. Level 0: Local parser → companies with `parser:` configured and an existing script
+2. Level 1: Playwright → all `tracked_companies` with `careers_url`
+3. Level 2: API → all `tracked_companies` with `api:`
+4. Level 3: WebSearch → all `search_queries` with `enabled: true`
 
 The levels are additive: all of them run, and the results are merged and deduplicated.
 
@@ -72,6 +108,15 @@ Los niveles son aditivos — se ejecutan todos, los resultados se mezclan y dedu
 2. **Read history**: `data/scan-history.tsv` → URLs already seen
 3. **Read dedup sources**: `data/applications.md` + `data/pipeline.md`
 
+3.5. **Level 0: Local parser** (`scan.mjs`, zero-token):
+   For each company in `tracked_companies` with `enabled: true`, a `parser.command`, and an existing script:
+   a. Run `parser.command` with `parser.script` + `parser.args` as a local process, without a shell
+   b. Expand the `{careers_url}` and `{company}` placeholders in the arguments
+   c. Read JSON from stdout (`[]`, `{ jobs: [] }`, or `{ results: [] }`)
+   d. Normalize each job to `{title, url, company, location}`
+   e. Resolve relative URLs against `careers_url`
+   f. If the parser fails, log the error and continue with the remaining companies
+
 4. **Level 1: Playwright scan** (parallel, in batches of 3-5):
    For each company in `tracked_companies` with `enabled: true` and a defined `careers_url`:
    a. `browser_navigate` to the `careers_url`
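
Steps b through e of the Level 0 (Nivel 0) flow can be illustrated with a small Python sketch. This is not the code in `scan.mjs`, whose diff body is not shown here. The function names `expand_args`, `extract_jobs`, and `normalize_jobs` are hypothetical. Skipping entries that lack a title or URL is an assumption, not documented behavior.

```python
import json
from urllib.parse import urljoin


def expand_args(args, careers_url, company):
    """Step b: substitute the {careers_url} and {company} placeholders."""
    return [
        arg.replace("{careers_url}", careers_url).replace("{company}", company)
        for arg in args
    ]


def extract_jobs(stdout_text):
    """Step c: accept a bare list, {"jobs": [...]}, or {"results": [...]}."""
    data = json.loads(stdout_text)
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        for key in ("jobs", "results"):
            if isinstance(data.get(key), list):
                return data[key]
    raise ValueError("unsupported parser output shape")


def normalize_jobs(jobs, careers_url, company):
    """Steps d and e: normalize fields and resolve relative URLs."""
    normalized = []
    for job in jobs:
        # Assumption: entries without a title or URL are skipped.
        if not isinstance(job, dict) or not job.get("title") or not job.get("url"):
            continue
        normalized.append({
            "title": job["title"],
            "url": urljoin(careers_url, job["url"]),
            # `company` is optional; fall back to the tracked company name.
            "company": job.get("company") or company,
            "location": job.get("location", ""),
        })
    return normalized


raw = '{"jobs": [{"title": "Senior AI Engineer", "url": "/jobs/123", "location": "Remote"}]}'
jobs = normalize_jobs(extract_jobs(raw), "https://example.com/careers", "Cohere")
print(jobs[0]["url"])  # https://example.com/jobs/123
```

Accepting all three output shapes in one helper keeps parser scripts free to wrap their results however they like, while the scanner still sees a flat list of `{title, url, company, location}` records.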
diff --git a/scan.mjs b/scan.mjs
diff --git a/scripts/parsers/cohere_jobs.py b/scripts/parsers/cohere_jobs.py
diff --git a/templates/portals.example.yml b/templates/portals.example.yml