Commit fceaf4a

lejrn and cursoragent committed
feat(scan): add local parser source support
Add a zero-token local parser source for scan.mjs and document it with a Cohere example so SSR/static career pages can be scanned without Playwright.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent b8a3a12 commit fceaf4a

9 files changed

Lines changed: 428 additions & 24 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@ cv.md
 data/applications.md
 data/pipeline.md
 data/scan-history.tsv
+data/parser-output/**/*.json
+!data/parser-output/.gitkeep
+!data/parser-output/**/.gitkeep
 reports/*.md
 !reports/.gitkeep
 output/*

data/parser-output/.gitkeep

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+

data/parser-output/cohere/.gitkeep

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+

docs/SCRIPTS.md

Lines changed: 17 additions & 1 deletion
@@ -180,7 +180,23 @@ Each URL gets a verdict: `active`, `expired`, or `uncertain` with a reason.

 ## scan

-Zero-token portal scanner. Hits ATS APIs (Greenhouse, Ashby, Lever) and career pages directly — no LLM tokens consumed. Reads `portals.yml` for target companies and search queries, outputs matching listings to stdout and optionally appends to `data/pipeline.md`.
+Zero-token portal scanner. Runs configured local parsers for SSR/static career pages and hits ATS APIs (Greenhouse, Ashby, Lever) directly — no LLM tokens consumed. Reads `portals.yml` for target companies, outputs matching listings to stdout, and optionally appends to `data/pipeline.md`.
+
+For custom SSR pages, configure a tracked company with `scan_method: local_parser` and a `parser` block. The parser command must print JSON jobs to stdout:
+
+```yaml
+parser:
+  command: python3
+  script: scripts/parsers/cohere_jobs.py
+  args:
+    - --url
+    - "{careers_url}"
+    - --stdout-jobs
+    - --no-output
+  format: jobs-json-v1
+```
+
+If a parser writes full extraction artifacts for debugging or audit, store them under `data/parser-output/{company}/`. `scan.mjs` reads stdout and does not require those JSON files after parsing.
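To illustrate the stdout contract described above, here is a minimal Python sketch of what a local parser could look like. It is not the `cohere_jobs.py` script from this commit, which is not shown in this diff. The `/jobs/` link convention and the names `JobLinkParser`, `extract_jobs`, and `emit_jobs` are hypothetical; the `title` and `url` fields follow the JSON example in `modes/scan.md`.

```python
import json
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin


class JobLinkParser(HTMLParser):
    """Collect <a> links whose href contains '/jobs/'.

    The '/jobs/' marker is a hypothetical page convention used only for
    this sketch; a real careers page needs its own selection rule.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.jobs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/jobs/" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse internal whitespace in the link text to get the title.
            title = " ".join("".join(self._text).split())
            if title:
                self.jobs.append(
                    {"title": title, "url": urljoin(self.base_url, self._href)}
                )
            self._href = None


def extract_jobs(html, base_url):
    """Return a list of {"title", "url"} dicts found in server-rendered HTML."""
    parser = JobLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.jobs


def emit_jobs(jobs, stream=sys.stdout):
    """Write the job list as a single JSON array, the shape a scanner reads."""
    json.dump(jobs, stream, ensure_ascii=False)
    stream.write("\n")
```

A script built this way would fetch the page, call `extract_jobs(html, url)`, and pass the result to `emit_jobs`, keeping any diagnostics on stderr so that stdout carries only JSON.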

 ```bash
 npm run scan
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+## What does this PR do?
+
+Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:
+
+- Playwright-rendered scraping of the Cohere Ashby board.
+- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.
+
+The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.
+
+## Related issue
+
+N/A - measurement and documentation artifact.
+
+## Type of change
+
+- [ ] Bug fix
+- [ ] New feature
+- [x] Documentation / translation
+- [ ] Refactor (no behavior change)
+
+## Summary
+
+The local parser path is the cheaper search path for Cohere because it runs locally through `scan.mjs` and does not send scraped job data to an LLM. In this test, the local parser found 71 Cohere Engineering/R&D jobs, then the scanner filtered and deduplicated them down to 33 dry-run new offers.
+
+The Playwright path rendered `https://jobs.ashbyhq.com/cohere` in Chromium and extracted the page text/job links from the live careers board. That path found 129 unique job URLs across the whole board.
+
+These counts are not perfectly apples-to-apples: the Playwright scrape read the full Cohere board, while the local parser is intentionally scoped to Engineering/R&D departments.
+
+## Token Comparison
+
+| Mode | Test command / method | Jobs found | LLM tokens used by search | Token estimate basis |
+|---|---|---:|---:|---|
+| Playwright scrape | Headless Chromium render of `https://jobs.ashbyhq.com/cohere` | 129 unique job URLs | Not directly exposed by Cursor | Rendered page body was 17,526 characters, roughly 4,382 estimated payload tokens using `characters / 4` |
+| Local parser | `node scan.mjs --dry-run --company Cohere` | 71 parser jobs, 33 dry-run new offers after scanner filters/dedup | 0 | `scan.mjs` uses local Python + JSON parsing and does not send the scraped data to an LLM |
+
+## Token Estimate Disclaimer
+
+Cursor does not expose exact billable token usage per slash-command/tool run in this environment, so the Playwright number above is a payload estimate, not an invoice-grade token counter.
+
+The estimate uses the rendered page body size as a proxy for what an agent would need to read from the browser snapshot. Actual model input can differ because browser snapshots include accessibility structure, refs, prompt context, tool-call metadata, and conversation history. A compact snapshot may be smaller than raw page text; a full agent run with surrounding instructions may be larger.
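The `characters / 4` estimate in the comparison table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `estimate_tokens` is hypothetical, and rounding up is an assumption chosen because it matches the 4,382 figure reported for a 17,526-character page body.

```python
import math


def estimate_tokens(char_count, chars_per_token=4):
    """Rough payload estimate: characters divided by an assumed chars-per-token ratio."""
    # Rounding up is an assumption; 17,526 / 4 = 4,381.5, reported as 4,382.
    return math.ceil(char_count / chars_per_token)


print(estimate_tokens(17_526))  # 4382
```

The ratio of four characters per token is a common rule of thumb for English text, not a tokenizer measurement, so the result is best read as an order-of-magnitude figure.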
+
+The local parser result is different: the search itself uses zero LLM tokens because `scan.mjs` runs locally. If the full parser JSON were pasted back into the chat for analysis, that would consume tokens, but that is not part of the scanner search path.
+
+## Test Plan
+
+- Ran `node scan.mjs --dry-run --company Cohere` after temporarily enabling Cohere in `portals.yml`.
+- Confirmed dry-run mode did not write to `data/pipeline.md` or `data/scan-history.tsv`.
+- Restored `portals.yml` after the test.
+- Ran a headless Playwright script to render and measure the Cohere Ashby board payload.

modes/scan.md

Lines changed: 50 additions & 5 deletions
@@ -2,7 +2,7 @@

 Scans the configured job portals, filters by title relevance, and adds new offers to the pipeline for later evaluation.

-> **Note (v1.5+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and only queries the public Greenhouse, Ashby, and Lever APIs directly. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has no Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.
+> **Note (v1.6+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and uses structured sources: local parsers configured per company and the public Greenhouse, Ashby, and Lever APIs. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has neither a local parser nor a Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.

 ## Recommended execution

@@ -21,9 +21,44 @@ Agent(
 Read `portals.yml`, which contains:
 - `search_queries`: List of WebSearch queries with `site:` filters per portal (broad discovery)
 - `tracked_companies`: Specific companies with a `careers_url` for direct navigation
+- `tracked_companies[].parser`: Optional local parser for SSR or stable-HTML pages
 - `title_filter`: positive/negative/seniority_boost keywords for title filtering

-## Discovery strategy (3 levels)
+## Discovery strategy (4 levels)
+
+### Level 0 — Local parser (CHEAPEST)
+
+**For each company in `tracked_companies` with `parser:` configured:** run the local parser defined in `portals.yml`. This level is ideal when the careers page uses SSR or stable HTML and a Python/Node script already exists that extracts the jobs without help from the agent.
+
+Recommended contract:
+
+```yaml
+- name: Cohere
+  careers_url: https://jobs.ashbyhq.com/cohere
+  scan_method: local_parser
+  parser:
+    command: python3
+    script: scripts/parsers/cohere_jobs.py
+    args:
+      - --url
+      - "{careers_url}"
+      - --stdout-jobs
+      - --no-output
+    format: jobs-json-v1
+  enabled: true
+```
+
+The parser must print JSON to stdout:
+
+```json
+[
+  { "title": "Senior AI Engineer", "url": "https://example.com/jobs/123", "location": "Remote" }
+]
+```
+
+`company` is optional; if it is missing, `scan.mjs` uses the name from `tracked_companies`.
+
+The scanner does not need to keep the full JSON after reading stdout. If a parser also generates an artifact for auditing or debugging, store it in `data/parser-output/{company}/` and keep it out of git.

 ### Level 1 — Direct Playwright (PRIMARY)

@@ -60,9 +95,10 @@ Para empresas con API pública o feed estructurado, usar la respuesta JSON/XML c
 The `search_queries` with `site:` filters cover portals across the board (all Ashby boards, all Greenhouse boards, etc.). Useful for discovering NEW companies that are not yet in `tracked_companies`, but the results may be stale.

 **Execution priority:**
-1. Level 1: Playwright → all `tracked_companies` with `careers_url`
-2. Level 2: API → all `tracked_companies` with `api:`
-3. Level 3: WebSearch → all `search_queries` with `enabled: true`
+1. Level 0: Local parser → companies with `parser:` configured and an existing script
+2. Level 1: Playwright → all `tracked_companies` with `careers_url`
+3. Level 2: API → all `tracked_companies` with `api:`
+4. Level 3: WebSearch → all `search_queries` with `enabled: true`

 The levels are additive: all of them run, and the results are merged and deduplicated.
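The additive merge described above can be sketched in Python as follows. This is an illustration only, not the logic in `scan.mjs`: the function name `merge_and_dedupe`, the choice of URL as the dedup key, and the trailing-slash normalization are all assumptions.

```python
def merge_and_dedupe(levels, seen_urls=()):
    """Merge job lists in priority order, keeping the first job seen per URL.

    `levels` is a list of job lists (Level 0 first). `seen_urls` stands in for
    URLs already recorded in history or pipeline files.
    """

    def key(url):
        # Assumption: URLs that differ only by a trailing slash are the same job.
        return url.strip().rstrip("/")

    seen = {key(u) for u in seen_urls}
    merged = []
    for jobs in levels:
        for job in jobs:
            k = key(job["url"])
            if k in seen:
                continue  # already found by an earlier level or a previous scan
            seen.add(k)
            merged.append(job)
    return merged
```

Because earlier levels are processed first, a job found by both the local parser and a later level keeps the local parser's record.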

@@ -72,6 +108,15 @@ Los niveles son aditivos — se ejecutan todos, los resultados se mezclan y dedu
 2. **Read history**: `data/scan-history.tsv` → URLs already seen
 3. **Read dedup sources**: `data/applications.md` + `data/pipeline.md`

+3.5. **Level 0 — Local parser** (`scan.mjs`, zero-token):
+   For each company in `tracked_companies` with `enabled: true`, `parser.command`, and an existing script:
+   a. Run `parser.command` with `parser.script` + `parser.args` using local, shell-free execution
+   b. Expand the `{careers_url}` and `{company}` placeholders in the arguments
+   c. Read JSON from stdout (`[]`, `{ jobs: [] }`, or `{ results: [] }`)
+   d. Normalize each job to `{title, url, company, location}`
+   e. Resolve relative URLs against `careers_url`
+   f. If the parser fails, log the error and continue with the remaining companies
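Steps a through f above can be sketched in Python as shown below. `scan.mjs` itself is JavaScript and its implementation is not part of this excerpt, so treat this as an illustration of the described contract rather than the actual code. The function names, the 60-second timeout, and the rule of skipping entries without a `title` or `url` are assumptions.

```python
import json
import subprocess
import sys
from urllib.parse import urljoin


def build_argv(parser_cfg, company):
    """Step b: expand {careers_url} and {company}, then build an argv list."""
    values = {"{careers_url}": company["careers_url"], "{company}": company["name"]}
    args = []
    for arg in parser_cfg.get("args", []):
        for placeholder, value in values.items():
            arg = arg.replace(placeholder, value)
        args.append(arg)
    return [parser_cfg["command"], parser_cfg["script"], *args]


def parse_jobs(stdout_text, company):
    """Steps c-e: accept [], {"jobs": []} or {"results": []}; normalize each job."""
    data = json.loads(stdout_text)
    if isinstance(data, dict):
        data = data.get("jobs") or data.get("results") or []
    jobs = []
    for job in data:
        # Assumption: entries without a title or url are skipped.
        if not isinstance(job, dict) or not job.get("title") or not job.get("url"):
            continue
        jobs.append({
            "title": job["title"],
            # Relative URLs are resolved against the company's careers_url.
            "url": urljoin(company["careers_url"], job["url"]),
            "company": job.get("company") or company["name"],
            "location": job.get("location", ""),
        })
    return jobs


def run_parser(company, timeout=60):
    """Steps a and f: run without a shell; on failure, log and return []."""
    argv = build_argv(company["parser"], company)
    try:
        # Passing a list (shell=False by default) avoids shell interpretation.
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=True
        )
        return parse_jobs(proc.stdout, company)
    except (OSError, subprocess.SubprocessError, ValueError) as exc:
        print(f"parser failed for {company['name']}: {exc}", file=sys.stderr)
        return []
```

Passing the arguments as a list keeps a `careers_url` containing characters such as `&` or `;` from being interpreted by a shell, which is the practical benefit of shell-free execution.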
+
 4. **Level 1 — Playwright scan** (parallel, in batches of 3-5):
    For each company in `tracked_companies` with `enabled: true` and a defined `careers_url`:
    a. `browser_navigate` to the `careers_url`
