Add a zero-token local parser source for scan.mjs and document it with a Cohere example so SSR/static career pages can be scanned without Playwright.
Co-authored-by: Cursor <cursoragent@cursor.com>
docs/SCRIPTS.md: 17 additions, 1 deletion
@@ -180,7 +180,23 @@ Each URL gets a verdict: `active`, `expired`, or `uncertain` with a reason.
 
 ## scan
 
-Zero-token portal scanner. Hits ATS APIs (Greenhouse, Ashby, Lever) and career pages directly; no LLM tokens consumed. Reads `portals.yml` for target companies and search queries, outputs matching listings to stdout and optionally appends to `data/pipeline.md`.
+Zero-token portal scanner. Runs configured local parsers for SSR/static career pages and hits ATS APIs (Greenhouse, Ashby, Lever) directly; no LLM tokens consumed. Reads `portals.yml` for target companies, outputs matching listings to stdout, and optionally appends to `data/pipeline.md`.
+
+For custom SSR pages, configure a tracked company with `scan_method: local_parser` and a `parser` block. The parser command must print JSON jobs to stdout:
+
+```yaml
+parser:
+  command: python3
+  script: scripts/parsers/cohere_jobs.py
+  args:
+    - --url
+    - "{careers_url}"
+    - --stdout-jobs
+    - --no-output
+  format: jobs-json-v1
+```
+
+If a parser writes full extraction artifacts for debugging or audit, store them under `data/parser-output/{company}/`. `scan.mjs` reads stdout and does not require those JSON files after parsing.
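The stdout contract described in this hunk can be illustrated with a minimal parser sketch. This is not the repository's `scripts/parsers/cohere_jobs.py`, whose contents are not shown in the diff. The `--url`, `--stdout-jobs`, and `--no-output` flags come from the config above; the `JobLinkParser` class, the `extract_jobs` helper, and the rule that a job is any titled link nested under the careers URL are assumptions made for illustration.

```python
"""Hypothetical local parser sketch: static careers HTML in, JSON jobs on stdout."""
import argparse
import json
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class JobLinkParser(HTMLParser):
    """Collects (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def extract_jobs(html, careers_url):
    """Assumption: a job is any titled link nested under the careers URL."""
    parser = JobLinkParser()
    parser.feed(html)
    prefix = careers_url.rstrip("/") + "/"
    jobs, seen = [], set()
    for href, title in parser.links:
        url = urljoin(prefix, href)
        if title and url.startswith(prefix) and url != prefix and url not in seen:
            seen.add(url)
            jobs.append({"title": title, "url": url})
    return jobs


def main(argv=None):
    cli = argparse.ArgumentParser()
    cli.add_argument("--url", required=True)
    cli.add_argument("--stdout-jobs", action="store_true")
    # Accepted for compatibility; this sketch never writes artifact files.
    cli.add_argument("--no-output", action="store_true")
    args = cli.parse_args(argv)
    with urlopen(args.url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    if args.stdout_jobs:
        json.dump(extract_jobs(html, args.url), sys.stdout)
    return 0


# Only act as a CLI when arguments are actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

Keeping the extraction in a pure function (`extract_jobs`) separate from the network fetch makes the parser easy to check against a saved HTML snapshot before wiring it into `portals.yml`.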
+Documents a dry-run comparison of `/career-ops scan` for Cohere in two scan modes:
+
+- Playwright-rendered scraping of the Cohere Ashby board.
+- `scan.mjs --dry-run --company Cohere` using the configured Cohere `local_parser`.
+
+The comparison is intended to show the token tradeoff between an agent reading a rendered careers page and the zero-token local parser path.
+
+## Related issue
+
+N/A - measurement and documentation artifact.
+
+## Type of change
+
+- [ ] Bug fix
+- [ ] New feature
+- [x] Documentation / translation
+- [ ] Refactor (no behavior change)
+
+## Summary
+
+The local parser path is the cheaper search path for Cohere because it runs locally through `scan.mjs` and does not send scraped job data to an LLM. In this test, the local parser found 71 Cohere Engineering/R&D jobs, then the scanner filtered and deduplicated them down to 33 dry-run new offers.
+
+The Playwright path rendered `https://jobs.ashbyhq.com/cohere` in Chromium and extracted the page text/job links from the live careers board. That path found 129 unique job URLs across the whole board.
+
+These counts are not perfectly apples-to-apples: the Playwright scrape read the full Cohere board, while the local parser is intentionally scoped to Engineering/R&D departments.
+
+## Token Comparison
+
+| Mode | Test command / method | Jobs found | LLM tokens used by search | Token estimate basis |
+|---|---|---:|---:|---|
+| Playwright scrape | Headless Chromium render of `https://jobs.ashbyhq.com/cohere` | 129 unique job URLs | Not directly exposed by Cursor | Rendered page body was 17,526 characters, roughly 4,382 estimated payload tokens using `characters / 4` |
+| Local parser | `node scan.mjs --dry-run --company Cohere` | 71 parser jobs, 33 dry-run new offers after scanner filters/dedup | 0 | `scan.mjs` uses local Python + JSON parsing and does not send the scraped data to an LLM |
+
+## Token Estimate Disclaimer
+
+Cursor does not expose exact billable token usage per slash-command/tool run in this environment, so the Playwright number above is a payload estimate, not an invoice-grade token counter.
+
+The estimate uses the rendered page body size as a proxy for what an agent would need to read from the browser snapshot. Actual model input can differ because browser snapshots include accessibility structure, refs, prompt context, tool-call metadata, and conversation history. A compact snapshot may be smaller than raw page text; a full agent run with surrounding instructions may be larger.
+
+The local parser result is different: the search itself uses zero LLM tokens because `scan.mjs` runs locally. If the full parser JSON were pasted back into the chat for analysis, that would consume tokens, but that is not part of the scanner search path.
+
+## Test Plan
+
+- Ran `node scan.mjs --dry-run --company Cohere` after temporarily enabling Cohere in `portals.yml`.
+- Confirmed dry-run mode did not write to `data/pipeline.md` or `data/scan-history.tsv`.
+- Restored `portals.yml` after the test.
+- Ran a headless Playwright script to render and measure the Cohere Ashby board payload.
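The `characters / 4` estimate from the Token Comparison table can be reproduced with a few lines of Python. This is only a sketch of the heuristic, not a tokenizer; the function name `estimate_payload_tokens` is hypothetical, and rounding up is an assumption (the document only says "roughly 4,382").

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the comparison table, not a real tokenizer


def estimate_payload_tokens(char_count: int) -> int:
    """Approximate a token count as characters / 4, rounded up (assumed rounding)."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


rendered_body_chars = 17_526  # rendered Cohere board body size reported in the table
print(estimate_payload_tokens(rendered_body_chars))  # 17526 / 4 = 4381.5, rounds up to 4382
```

As the disclaimer notes, the real model input for an agent run can be smaller or larger than this figure, so the result is best read as an order-of-magnitude payload size.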
modes/scan.md: 50 additions, 5 deletions
@@ -2,7 +2,7 @@
 
 Scans the configured job portals, filters by title relevance, and adds new listings to the pipeline for later evaluation.
 
-> **Note (v1.5+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and only queries the public Greenhouse, Ashby, and Lever APIs directly. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has no Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.
+> **Note (v1.6+):** The default scanner (`scan.mjs` / `npm run scan`) is **zero-token** and uses structured sources: local parsers configured per company and the public Greenhouse, Ashby, and Lever APIs. The Playwright/WebSearch levels described below are the **agent** flow (run by Claude/Codex), not what `scan.mjs` does. If a company has neither a local parser nor a Greenhouse/Ashby/Lever API, `scan.mjs` will skip it; in those cases, the agent must complete Level 1 (Playwright) or Level 3 (WebSearch) manually.
 
 ## Recommended execution
@@ -21,9 +21,44 @@ Agent(
 Read `portals.yml`, which contains:
 - `search_queries`: list of WebSearch queries with `site:` filters per portal (broad discovery)
 - `tracked_companies`: specific companies with a `careers_url` for direct navigation
+- `tracked_companies[].parser`: optional local parser for SSR or stable-HTML pages
 - `title_filter`: positive/negative/seniority_boost keywords for title filtering
 
-## Discovery strategy (3 levels)
+## Discovery strategy (4 levels)
+
+### Level 0: Local parser (CHEAPEST)
+
+**For each company in `tracked_companies` with `parser:` configured:** run the local parser defined in `portals.yml`. This level is ideal when the careers page uses SSR or stable HTML and a Python/Node script already exists that extracts the jobs without help from the agent.
+
+Recommended contract:
+
+```yaml
+- name: Cohere
+  careers_url: https://jobs.ashbyhq.com/cohere
+  scan_method: local_parser
+  parser:
+    command: python3
+    script: scripts/parsers/cohere_jobs.py
+    args:
+      - --url
+      - "{careers_url}"
+      - --stdout-jobs
+      - --no-output
+    format: jobs-json-v1
+  enabled: true
+```
+
+The parser must print JSON to stdout:
+
+```json
+[
+  { "title": "Senior AI Engineer", "url": "https://example.com/jobs/123", "location": "Remote" }
+]
+```
+
+`company` is optional; if it is missing, `scan.mjs` uses the name from `tracked_companies`.
+
+The scanner does not need to keep the full JSON after reading stdout. If a parser also generates an artifact for auditing or debugging, store it in `data/parser-output/{company}/` and keep it out of git.
 
 ### Level 1: Direct Playwright (PRIMARY)
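The Level 0 contract can also be sketched from the scanner's side. The real `scan.mjs` is a Node script and is not shown in this diff, so the Python below only illustrates the contract. `build_argv` and `normalize_jobs` are hypothetical names, placing `script` directly after `command` is an assumption, and so is skipping entries that lack a `title` or `url`.

```python
import json


def build_argv(company: dict) -> list:
    """Expand a company's parser block into an argv list, filling `{careers_url}`."""
    parser = company["parser"]
    args = [
        arg.replace("{careers_url}", company["careers_url"])
        for arg in parser.get("args", [])
    ]
    # Assumption: the script path directly follows the command.
    return [parser["command"], parser["script"], *args]


def normalize_jobs(stdout_text: str, company_name: str) -> list:
    """Parse parser stdout; default a missing `company` to the tracked name."""
    jobs = json.loads(stdout_text)
    if not isinstance(jobs, list):
        raise ValueError("expected a JSON array of jobs")
    normalized = []
    for job in jobs:
        if not job.get("title") or not job.get("url"):
            continue  # assumption: entries without a title or url are skipped
        # Keys present in the job (including `company`) override the default.
        normalized.append({"company": company_name, **job})
    return normalized


cohere = {
    "name": "Cohere",
    "careers_url": "https://jobs.ashbyhq.com/cohere",
    "parser": {
        "command": "python3",
        "script": "scripts/parsers/cohere_jobs.py",
        "args": ["--url", "{careers_url}", "--stdout-jobs", "--no-output"],
    },
}
argv = build_argv(cohere)
# argv could then be handed to subprocess.run(argv, capture_output=True, text=True)
# and the captured stdout passed to normalize_jobs(result.stdout, cohere["name"]).
```

Passing an argv list rather than a shell string avoids quoting problems when the careers URL contains characters such as `&` or `?`.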
@@ -60,9 +95,10 @@ Para empresas con API pública o feed estructurado, usar la respuesta JSON/XML c
 The `search_queries` with `site:` filters cover portals across the board (all Ashby boards, all Greenhouse boards, etc.). Useful for discovering NEW companies that are not yet in `tracked_companies`, but the results may be stale.
 
 **Execution priority:**
-1. Level 1: Playwright → all `tracked_companies` with a `careers_url`
-2. Level 2: API → all `tracked_companies` with `api:`
-3. Level 3: WebSearch → all `search_queries` with `enabled: true`
+1. Level 0: Local parser → companies with `parser:` configured and an existing script
+2. Level 1: Playwright → all `tracked_companies` with a `careers_url`
+3. Level 2: API → all `tracked_companies` with `api:`
+4. Level 3: WebSearch → all `search_queries` with `enabled: true`
 
 The levels are additive: all of them run, and the results are merged and deduplicated.
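The additive merge described in this hunk might look like the following sketch. It assumes each level yields a list of job dicts with a `url` key and that the URL (minus a trailing slash) is the deduplication key; the diff does not state the actual key, and `merge_levels` is a hypothetical name.

```python
def merge_levels(*levels):
    """Merge job lists in priority order, keeping the first job seen per URL."""
    merged, seen_urls = [], set()
    for jobs in levels:
        for job in jobs:
            key = job["url"].rstrip("/")  # assumption: URL is the dedup key
            if key not in seen_urls:
                seen_urls.add(key)
                merged.append(job)
    return merged


level0 = [{"title": "Senior AI Engineer", "url": "https://example.com/jobs/123"}]
level1 = [
    {"title": "Senior AI Engineer", "url": "https://example.com/jobs/123/"},
    {"title": "ML Researcher", "url": "https://example.com/jobs/456"},
]
combined = merge_levels(level0, level1)
print([job["url"] for job in combined])
```

Because earlier arguments win, passing the levels in priority order keeps the Level 0 record when the same listing also shows up in a later level.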
@@ -72,6 +108,15 @@ Los niveles son aditivos — se ejecutan todos, los resultados se mezclan y dedu
 2. **Read history**: `data/scan-history.tsv` → URLs already seen