Commit 1161c6a

chore: release 0.9.1
1 parent 1eaf6b5 commit 1161c6a

43 files changed

Lines changed: 4280 additions & 2 deletions

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
id: adzuna
3+
title: Adzuna Extractor
4+
description: API-based Adzuna extraction with orchestrator ingestion and progress updates.
5+
sidebar_position: 6
6+
---
7+
8+
## What it is
9+
10+
Original website: [adzuna.com](https://www.adzuna.com)
11+
12+
Adzuna is an API-backed extractor implemented in two lean pieces:
13+
14+
1. `extractors/adzuna/src/main.ts` fetches paginated Adzuna search results and writes `jobs.json`.
15+
2. `orchestrator/src/server/services/adzuna.ts` runs the extractor, parses progress lines, and maps rows into `CreateJobInput`.
16+
17+
De-duplication happens in the existing repository path, keyed on `sourceJobId` with a fallback to `jobUrl`.
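
A minimal sketch of that fallback, assuming a row shape with an optional `sourceJobId` and a required `jobUrl`. The `JobRow` type and the `dedupeKey` and `dedupeJobs` helpers are hypothetical names for illustration; the repository's actual de-duplication code is not shown on this page.

```typescript
// Hypothetical row shape: only sourceJobId and jobUrl are named in the docs.
interface JobRow {
  sourceJobId?: string | null;
  jobUrl: string;
}

// Prefer the upstream ID; fall back to the URL when the ID is missing or blank.
function dedupeKey(job: JobRow): string {
  const id = job.sourceJobId?.trim();
  return id ? `id:${id}` : `url:${job.jobUrl}`;
}

// Keep the first occurrence of each key, preserving input order.
function dedupeJobs<T extends JobRow>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    const key = dedupeKey(job);
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(job);
  }
  return unique;
}
```

Prefixing the key with `id:` or `url:` keeps an ID from ever colliding with a URL string, which is a design choice of this sketch rather than something the source describes.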
18+
19+
## Why it exists
20+
21+
Adzuna provides stable API discovery for countries that are not covered by UK-only sources. It adds a lower-maintenance source without introducing new API routes or UI sections.
22+
23+
## How to use it
24+
25+
1. Create an Adzuna developer account.
26+
2. Open [Adzuna Access Details](https://developer.adzuna.com/admin/access_details).
27+
3. Copy your **App ID** and **App Key**.
28+
4. In Job Ops, open **Settings** and paste them into `Adzuna App ID` and `Adzuna App Key` under **Environment & Workspaces**.
29+
5. In **Pipeline Run** (Automatic tab), select a compatible country and enable **Adzuna** in Sources.
30+
6. Start the run; Adzuna progress appears in the existing crawl progress stream.
31+
32+
City behavior:
33+
34+
- If **Search cities** are set in Automatic advanced settings, Adzuna runs once per city.
35+
- City runs use strict post-filtering (`job.location` contains requested city) to avoid broad country-level spillover.
36+
- The normalized job payload now preserves structured location evidence from Adzuna's `location.display_name` field.
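
The strict city post-filter described above could look roughly like this. It is a sketch that assumes a case-insensitive substring match on `job.location`; `LocatedJob`, `matchesCity`, and `filterByCity` are hypothetical names, and the real comparison may normalize text differently.

```typescript
// Hypothetical shape: the docs only say that job.location is checked.
interface LocatedJob {
  location?: string | null;
}

// True when the job's location text contains the requested city.
// Case-insensitive matching is an assumption of this sketch.
function matchesCity(job: LocatedJob, city: string): boolean {
  const needle = city.trim().toLowerCase();
  if (!needle) return true; // no city requested, so nothing to filter on
  return (job.location ?? "").toLowerCase().includes(needle);
}

// Drop country-level results that do not mention the requested city.
function filterByCity<T extends LocatedJob>(jobs: T[], city: string): T[] {
  return jobs.filter((job) => matchesCity(job, city));
}
```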
37+
38+
Default controls:
39+
40+
- `ADZUNA_APP_ID`
41+
- `ADZUNA_APP_KEY`
42+
- `ADZUNA_MAX_JOBS_PER_TERM` (default `50`)
43+
- `ADZUNA_LOCATION_QUERY` (optional city/location text)
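
To illustrate how these controls might be read with the documented default of `50`, here is a small sketch. `readPositiveInt`, `readAdzunaControls`, and the `AdzunaControls` shape are hypothetical; the extractor's real parsing rules are not shown here and may differ.

```typescript
type Env = Record<string, string | undefined>;

// Parse a positive integer, falling back when the value is unset or invalid.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Hypothetical container for the four controls listed above.
interface AdzunaControls {
  appId: string | undefined;
  appKey: string | undefined;
  maxJobsPerTerm: number;
  locationQuery: string | undefined;
}

function readAdzunaControls(env: Env): AdzunaControls {
  return {
    appId: env["ADZUNA_APP_ID"],
    appKey: env["ADZUNA_APP_KEY"],
    // Documented default is 50.
    maxJobsPerTerm: readPositiveInt(env["ADZUNA_MAX_JOBS_PER_TERM"], 50),
    // Optional; a blank value is treated as "not set" in this sketch.
    locationQuery: env["ADZUNA_LOCATION_QUERY"]?.trim() || undefined,
  };
}
```

In a Node process this would typically be called as `readAdzunaControls(process.env)`.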
44+
45+
Supported countries in this integration:
46+
47+
- United Kingdom, United States, Austria, Australia, Belgium, Brazil, Canada, Switzerland, Germany, Spain, France, India, Italy, Mexico, Netherlands, New Zealand, Poland, Singapore, South Africa.
48+
49+
## Common problems
50+
51+
### Adzuna is disabled in source selection
52+
53+
- `Adzuna App ID` and `Adzuna App Key` are missing from Settings (or env).
54+
55+
### Adzuna is skipped for my selected country
56+
57+
- The selected country is not in the supported list above.
58+
59+
### Adzuna fails with authorization errors
60+
61+
- Verify `ADZUNA_APP_ID` and `ADZUNA_APP_KEY` are valid and active in your Adzuna account.
62+
63+
## Related pages
64+
65+
- [Extractors Overview](/docs/next/extractors/overview)
66+
- [Pipeline Run](/docs/next/features/pipeline-run)
67+
- [Settings](/docs/next/features/settings)
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
---
2+
id: golang-jobs
3+
title: Golang Jobs Extractor
4+
description: Golang Jobs extraction integrated through the site's public Supabase-backed feed.
5+
sidebar_position: 10
6+
---
7+
8+
## What it is
9+
10+
Original website: [Golang Jobs](https://www.golangjobs.tech/)
11+
12+
This extractor reads the public Golang Jobs feed exposed through the site's browser-facing Supabase API and maps those rows into the existing job-ops schema.
13+
14+
Implementation split:
15+
16+
1. `extractors/golangjobs/src/run.ts` paginates the public feed, applies local term, country, city, and workplace filters, and maps returned rows into `CreateJobInput`.
17+
2. `extractors/golangjobs/src/manifest.ts` adapts pipeline settings, emits progress updates, and registers the source for runtime discovery.
18+
19+
## Why it exists
20+
21+
Golang Jobs adds a Go-focused niche board that broad aggregators often miss.
22+
23+
Using the same public feed the site already serves in the browser keeps the integration lighter and more stable than scraping rendered React pages.
24+
25+
## How to use it
26+
27+
1. Open **Run jobs** and choose **Automatic**.
28+
2. Leave **Golang Jobs** enabled in **Sources** or toggle it on.
29+
3. Set your usual automatic run controls:
30+
- `searchTerms` are matched locally against title, company, description, requirements, and location.
31+
- selected country or explicit city filters are applied after feed download.
32+
- workplace type is respected from the location shape returned by the feed.
33+
- run budget path (`jobspyResultsWanted`) is reused as a per-term cap.
34+
4. Start the run and monitor progress in the pipeline progress card.
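
The local term matching and per-term cap described in step 3 can be sketched as follows. This assumes a case-insensitive substring match across the five listed fields; `FeedJob`, `matchesTerm`, and `takeForTerm` are hypothetical names, and the extractor's real matching rules may be stricter or looser.

```typescript
// Hypothetical feed row carrying the five fields the docs say are searched.
interface FeedJob {
  title?: string | null;
  company?: string | null;
  description?: string | null;
  requirements?: string | null;
  location?: string | null;
}

// True when the term appears in any searched field (substring match is assumed).
function matchesTerm(job: FeedJob, term: string): boolean {
  const needle = term.trim().toLowerCase();
  if (!needle) return true;
  const haystack = [job.title, job.company, job.description, job.requirements, job.location]
    .filter((value): value is string => typeof value === "string")
    .join("\n")
    .toLowerCase();
  return haystack.includes(needle);
}

// Apply the term filter, then the per-term cap taken from the run budget.
function takeForTerm<T extends FeedJob>(jobs: T[], term: string, cap: number): T[] {
  return jobs.filter((job) => matchesTerm(job, term)).slice(0, Math.max(0, cap));
}
```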
35+
36+
Defaults and constraints:
37+
38+
- The extractor includes a built-in browser-facing anon key for the upstream public feed.
39+
- You can override that default with `GOLANG_JOBS_SUPABASE_ANON_KEY` if the upstream rotates the key.
40+
- The upstream feed is already Go-specific, but it is still broader than most job-ops searches, so local filtering remains important.
41+
- The extractor currently relies on the public `jobs` and `cities` relationship exposed by the site; if the site changes that schema, the extractor will need updating.
42+
- Remote roles are inferred from `cities.name === "Remote"`.
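
A short sketch of the remote inference from the last bullet. The `cities.name === "Remote"` check comes from the text above; the `FeedRow` shape, the `onsite` default for every other row, and the helper names are assumptions made for illustration.

```typescript
type WorkplaceType = "remote" | "onsite";

// Hypothetical row shape: a job row with its joined city record.
interface FeedRow {
  cities?: { name?: string | null } | null;
}

// Rows linked to the city named "Remote" are remote; everything else is
// treated as onsite here, which is an assumption of this sketch.
function inferWorkplaceType(row: FeedRow): WorkplaceType {
  return row.cities?.name === "Remote" ? "remote" : "onsite";
}

// Keep only rows whose inferred type is among the selected workplace types.
function filterByWorkplace<T extends FeedRow>(rows: T[], selected: WorkplaceType[]): T[] {
  if (selected.length === 0) return rows;
  return rows.filter((row) => selected.includes(inferWorkplaceType(row)));
}
```

Under this logic a run restricted to `onsite` drops every row linked to the `Remote` city, which matches the "Onsite-only runs return no Golang Jobs jobs" entry under Common problems.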
43+
44+
## Common problems
45+
46+
### Golang Jobs does not appear in sources
47+
48+
- Check that the app is running a build that includes the new extractor manifest and shared source metadata.
49+
50+
### Golang Jobs health checks or runs fail immediately
51+
52+
- If the upstream rotates its public browser key, set `GOLANG_JOBS_SUPABASE_ANON_KEY` in the server/container environment to override the built-in default.
53+
- Rebuild the container after adding new environment variables if you run job-ops through Docker.
54+
55+
### Results are broader than expected
56+
57+
- The source is niche but still broad within the Go ecosystem.
58+
- Add more specific search terms or explicit cities when you want a narrower result set.
59+
60+
### Onsite-only runs return no Golang Jobs jobs
61+
62+
- Many rows on this board are remote and are marked as such from the linked city record.
63+
- Include `remote` in workplace type selection if you want this source to contribute jobs.
64+
65+
## Related pages
66+
67+
- [Extractors Overview](/docs/next/extractors/overview)
68+
- [Pipeline Run](/docs/next/features/pipeline-run)
69+
- [Add an Extractor](/docs/next/workflows/add-an-extractor)
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
---
2+
id: gradcracker
3+
title: Gradcracker Extractor
4+
description: How the Gradcracker crawler builds search URLs and extracts jobs.
5+
sidebar_position: 2
6+
---
7+
8+
A plain-English walkthrough of the Gradcracker extractor in `extractors/gradcracker`.
9+
10+
Original website: [gradcracker.com](https://www.gradcracker.com)
11+
12+
## What it is
13+
14+
The Gradcracker extractor finds UK graduate roles from [gradcracker.com](https://www.gradcracker.com).
15+
16+
It now uses a fast HTTP-first scraper for normal runs. The scraper fetches Gradcracker list and detail HTML with a browser-like HTTP fingerprint, parses job cards locally, and decodes Gradcracker apply links without opening a browser. The older Playwright/Crawlee flow remains as a fallback when the HTTP path is blocked.
17+
18+
## Why it exists
19+
20+
Gradcracker is useful for UK graduate and early-career STEM roles that broad aggregators often miss.
21+
22+
The HTTP-first implementation keeps the same normalized job output while avoiding the startup cost of launching a browser for every successful run.
23+
24+
## How to use it
25+
26+
1. Open **Run jobs** and choose **Automatic**.
27+
2. Select **United Kingdom** as the country.
28+
3. Leave **Gradcracker** enabled in **Sources** or toggle it on.
29+
4. Set your usual search terms and run budget.
30+
5. Start the run and monitor progress in the pipeline progress card.
31+
32+
Defaults and controls:
33+
34+
- Search terms are converted to Gradcracker role slugs; for example, `software systems` becomes `software-systems`.
35+
- Defaults include `web-development` and `software-systems`.
36+
- `GRADCRACKER_MAX_JOBS_PER_TERM` controls the per-term cap.
37+
- `GRADCRACKER_HTTP_DETAIL_CONCURRENCY` controls concurrent detail-page fetches. The default is `2`.
38+
- `GRADCRACKER_HTTP_REQUEST_DELAY_MS` controls the minimum delay between HTTP request starts. The default is `1000`.
39+
- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` and `JOBOPS_EXISTING_JOB_URLS_FILE` are still honored by the browser fallback.
40+
- `GRADCRACKER_FORCE_BROWSER=1` forces the legacy Playwright/Crawlee path.
41+
- `GRADCRACKER_DISABLE_BROWSER_FALLBACK=1` disables the browser fallback, so the HTTP scraper result is returned directly even if the fast path is blocked.
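
The term-to-slug conversion from the first bullet can be sketched as below. `toRoleSlug` is a hypothetical helper; apart from the documented `software systems` example, the exact normalization rules (lowercasing, collapsing punctuation) are assumptions.

```typescript
// Convert a free-text search term into a hyphenated role slug.
function toRoleSlug(term: string): string {
  return term
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse spaces and punctuation into one hyphen
    .replace(/^-+|-+$/g, ""); // drop leading or trailing hyphens
}
```

With this sketch, `toRoleSlug("software systems")` yields `software-systems`, matching the example above.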
42+
43+
Implementation flow:
44+
45+
1. Build search URLs from UK regions and role terms.
46+
2. Fetch list pages and parse `article[wire:key]` job cards.
47+
3. Fetch detail pages for new jobs only.
48+
4. Extract `.body-content` description text.
49+
5. Decode Gradcracker `/out/...` apply URLs from the `u` query parameter.
50+
6. Reuse saved Cloudflare clearance cookies from the headed solve flow on the HTTP retry.
51+
7. Fall back to Playwright/Crawlee only when the HTTP path cannot proceed.
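
Step 5 can be illustrated with the standard `URL` API. This sketch assumes the `u` parameter carries a percent-encoded absolute URL; if Gradcracker encodes it differently, the decoding step would need to change. `decodeApplyUrl` is a hypothetical name.

```typescript
// Resolve a Gradcracker /out/... link to its external target, or null
// when no usable target is present.
function decodeApplyUrl(href: string, base = "https://www.gradcracker.com"): string | null {
  let url: URL;
  try {
    url = new URL(href, base); // accepts relative and absolute hrefs
  } catch {
    return null;
  }
  if (!url.pathname.startsWith("/out/")) return null;

  // searchParams.get() percent-decodes the value (assumed encoding).
  const target = url.searchParams.get("u");
  if (!target) return null;

  try {
    const parsed = new URL(target);
    const isHttp = parsed.protocol === "http:" || parsed.protocol === "https:";
    return isHttp ? parsed.toString() : null;
  } catch {
    return null; // not an absolute URL, so keep the Gradcracker job URL instead
  }
}
```

Returning `null` rather than throwing lets the caller keep the Gradcracker job URL, which is the behavior the "Apply links stay on Gradcracker" entry under Common problems describes.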
52+
53+
## Common problems
54+
55+
### Gradcracker does not return jobs
56+
57+
- Confirm the selected country is **United Kingdom**.
58+
- Try Gradcracker-specific terms such as `software systems`, `web development`, or `data science`.
59+
- Lower the run budget if a term is too broad and you only need the newest listings.
60+
61+
### The HTTP scraper is blocked
62+
63+
- Leave browser fallback enabled so the extractor can use the existing Playwright/Crawlee challenge handling.
64+
- When the app opens a challenge browser, complete the challenge and wait for the solver to save a `cf_clearance` cookie. The next HTTP retry uses that saved cookie and the same browser user agent.
65+
- Set `GRADCRACKER_FORCE_BROWSER=1` when you specifically need to debug the legacy browser flow.
66+
67+
### Apply links stay on Gradcracker
68+
69+
- Some listings may not expose a decodable `/out/...` target.
70+
- The extractor still stores the Gradcracker job URL, so those postings remain usable even when the final application URL is unavailable.
71+
72+
## Related pages
73+
74+
- [Extractors Overview](/docs/next/extractors/overview)
75+
- [Pipeline Run](/docs/next/features/pipeline-run)
76+
- [Add an Extractor](/docs/next/workflows/add-an-extractor)
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
---
2+
id: hiring-cafe
3+
title: Hiring Cafe Extractor
4+
description: Browser-backed Hiring Cafe extraction integrated into the pipeline source selector.
5+
sidebar_position: 7
6+
---
7+
8+
## What it is
9+
10+
Original website: [hiring.cafe](https://hiring.cafe)
11+
12+
Special thanks: Initial implementation inspiration came from [umur957/hiring-cafe-job-scraper](https://github.com/umur957/hiring-cafe-job-scraper).
13+
14+
Hiring Cafe is a browser-backed extractor that queries Hiring Cafe search APIs and maps results into the orchestrator `CreateJobInput` shape.
15+
16+
Implementation split:
17+
18+
1. `extractors/hiringcafe/src/main.ts` builds search state, calls Hiring Cafe APIs, fetches job detail pages when search hits omit full descriptions, and writes dataset JSON.
19+
2. `orchestrator/src/server/services/hiring-cafe.ts` runs the extractor, streams progress events, and maps rows for pipeline import.
20+
21+
## Why it exists
22+
23+
Hiring Cafe adds another non-credentialed source that can be enabled from the existing source picker, without adding new settings UI.
24+
25+
It also supports term-by-term search and country-aware search state using the same pipeline knobs you already set for automatic runs.
26+
27+
## How to use it
28+
29+
1. Open **Run jobs** and choose **Automatic**.
30+
2. **Hiring Cafe** is enabled by default in **Sources** (toggle it off if you do not want it for this run).
31+
3. Set your existing automatic run knobs:
32+
- `searchTerms` drive the per-term Hiring Cafe `searchQuery`.
33+
- selected country maps into Hiring Cafe location search state.
34+
- run budget path (`jobspyResultsWanted`) is reused as the max jobs-per-term cap.
35+
- optional **Search cities** narrow results by city.
36+
- workplace type is forwarded from the automatic run modal as a global run filter.
37+
4. Start the run and watch progress in the pipeline progress card.
38+
39+
Defaults and constraints:
40+
41+
- No new Hiring Cafe settings fields were added.
42+
- `worldwide` and `usa/ca` run in broad mode without a strict country location filter.
43+
- Hiring Cafe is enabled by default in source selection.
44+
- Full job descriptions are loaded from Hiring Cafe detail pages when the search result payload only includes summary fields.
45+
- The normalized job payload now preserves structured location evidence from the formatted workplace and city/state/country fields.
46+
- `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` controls the recency window when running the extractor directly (default `7`).
47+
- When a city is provided via `searchCities`, Hiring Cafe uses city radius search (default `1` mile) and strict city post-filtering.
48+
- Workplace type is global to the run and is not configured separately per city in this integration.
49+
- City geocoding is resolved through Nominatim (OpenStreetMap data); if you scale extractor traffic, add attribution and cache repeated city lookups.
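
One way to cache repeated city lookups, as the last bullet suggests, is to wrap whatever geocoding function is in use. The sketch below is not tied to the extractor's code: `GeoPoint`, `Geocoder`, and `withCityCache` are hypothetical names, and the wrapped lookup is injected instead of calling Nominatim directly.

```typescript
interface GeoPoint {
  lat: number;
  lon: number;
}

// Any async city lookup, for example one backed by Nominatim.
type Geocoder = (city: string) => Promise<GeoPoint | null>;

// Wrap a geocoder so each distinct city is looked up at most once per process.
function withCityCache(lookup: Geocoder): Geocoder {
  const cache = new Map<string, Promise<GeoPoint | null>>();
  return (city) => {
    const key = city.trim().toLowerCase();
    let pending = cache.get(key);
    if (!pending) {
      pending = lookup(city).catch((error: unknown) => {
        cache.delete(key); // do not cache failures, so a later call can retry
        throw error;
      });
      cache.set(key, pending);
    }
    return pending;
  };
}
```

Caching the promise rather than the resolved value also collapses concurrent lookups for the same city into a single upstream request.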
50+
51+
Local run example:
52+
53+
```bash
54+
HIRING_CAFE_SEARCH_TERMS='["backend engineer"]' \
55+
HIRING_CAFE_COUNTRY='united kingdom' \
56+
HIRING_CAFE_MAX_JOBS_PER_TERM='50' \
57+
npm --workspace hiringcafe-extractor run start
58+
```
59+
60+
## Common problems
61+
62+
### Hiring Cafe returns 429 / Vercel security checkpoint
63+
64+
- The extractor first attempts Camoufox-backed Firefox and falls back to vanilla Firefox startup if Camoufox is unstable locally.
65+
- If upstream blocks continue, retry later or reduce run concurrency at the pipeline level by selecting fewer sources.
66+
67+
### Hiring Cafe does not appear in sources
68+
69+
- Check that the client is running the latest build, which contains the new source list.
70+
- Hiring Cafe is source-only and does not require credentials, so it should appear once the new build is loaded.
71+
72+
### Results are lower than expected
73+
74+
- The cap is tied to the automatic run budget path (`jobspyResultsWanted`) and the number of search terms.
75+
- Country mapping can narrow results when a strict country location is applied.
76+
77+
## Related pages
78+
79+
- [Extractors Overview](/docs/next/extractors/overview)
80+
- [Pipeline Run](/docs/next/features/pipeline-run)
81+
- [Settings](/docs/next/features/settings)
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
---
2+
id: jobindex
3+
title: Jobindex Extractor
4+
description: Denmark-only Jobindex extraction through the browser page Stash payload.
5+
sidebar_position: 11
6+
---
7+
8+
## What it is
9+
10+
Original website: [Jobindex](https://www.jobindex.dk/)
11+
12+
This extractor reads Jobindex search result pages and parses the embedded `var Stash = ...` payload that powers the browser result app. It supports Denmark searches with query terms and optional city filtering through Jobindex `geoareaid` resolution.
13+
14+
Implementation split:
15+
16+
1. `extractors/jobindex/src/run.ts` fetches `/jobsoegning?q=...`, resolves selected city locations through `storeData.geoareaOptions`, appends matching `geoareaid` filters, parses `searchResponse.results`, paginates with `page=2`, and maps rows into `CreateJobInput`.
17+
2. `extractors/jobindex/src/manifest.ts` enforces Denmark-only runs, adapts pipeline settings, emits progress updates, and registers the source for runtime discovery.
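
A rough sketch of pulling the embedded payload out of the page HTML. It assumes the value assigned to `Stash` is a strict JSON object literal, which may not hold for the real page; `extractStashJson` is a hypothetical helper, and only the `var Stash = ...` marker and the `searchResponse.results` path come from the text above.

```typescript
// Find `var Stash = {...}` in the HTML and parse the object literal as JSON.
// Returns null when the marker is missing or the payload is not valid JSON.
function extractStashJson(html: string): unknown {
  const marker = /var\s+Stash\s*=\s*/.exec(html);
  if (!marker) return null;
  const start = marker.index + marker[0].length;
  if (html[start] !== "{") return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      // Braces inside string values must not affect the depth counter.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(html.slice(start, i + 1));
        } catch {
          return null;
        }
      }
    }
  }
  return null; // unbalanced braces
}
```

Scanning for the balanced closing brace avoids relying on a regular expression to find the end of a nested object, which would break on braces inside string values.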
18+
19+
## Why it exists
20+
21+
Jobindex is a strong local source for Denmark roles, and the embedded Stash payload exposes structured result data without needing browser rendering.
22+
23+
Using the page payload keeps the scraper small while still capturing company, location, dates, rating, listing URL, and direct application links when they appear in the result HTML.
24+
25+
## How to use it
26+
27+
1. Open **Run jobs** and choose **Automatic**.
28+
2. Select **Denmark** as the country.
29+
3. Leave **Jobindex** enabled in **Sources** or toggle it on.
30+
4. Enter search terms such as:
31+
```text
32+
software engineer
33+
platform engineer
34+
backend developer
35+
```
36+
5. Start the run and monitor list-page progress in the pipeline progress card.
37+
38+
Defaults and constraints:
39+
40+
- The extractor only runs when the selected country is `denmark`.
41+
- When city locations are selected, the extractor resolves them to Jobindex `geoareaid` filters and applies them on the search URL.
42+
- `JOBINDEX_MAX_JOBS_PER_TERM` controls the default per-term cap when no automatic run budget override is present.
43+
- Direct application links are parsed from the result card heading where available; otherwise the Jobindex listing URL is used.
44+
- Job descriptions come from paragraph text in the result card, not from full detail-page scraping.
45+
46+
## Common problems
47+
48+
### Jobindex does not run
49+
50+
- Confirm the selected country is Denmark.
51+
- Check that the app build includes `extractors/jobindex/src/manifest.ts` and the shared `jobindex` source metadata.
52+
53+
### Results are not filtered to a city
54+
55+
- City filtering only applies when the selected city can be resolved to a Jobindex `geoareaid`.
56+
- If no match is found, the extractor falls back to the query-only URL for that search term.
57+
- Danish names and common transliterations such as `Kobenhavn`, `Aabenraa`, and `Sonderborg` are supported.
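
Matching Danish names against their transliterations can be done by reducing both to a common key. The mapping below (`å` to `aa`, `ø` to `o`, `æ` to `ae`) is inferred from the three examples above and is an assumption; `cityKey` is a hypothetical helper, and the extractor may use a lookup table instead.

```typescript
// Reduce a city name to a comparison key that is the same for the Danish
// spelling and the transliterations listed above.
function cityKey(name: string): string {
  return name
    .normalize("NFC") // fold decomposed accents into single code points
    .trim()
    .toLowerCase()
    .replace(/å/g, "aa")
    .replace(/ø/g, "o")
    .replace(/æ/g, "ae")
    .replace(/\s+/g, " ");
}

// True when two spellings refer to the same city under this mapping.
function sameCity(a: string, b: string): boolean {
  return cityKey(a) === cityKey(b);
}
```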
58+
59+
### Application links point to Jobindex
60+
61+
- Some listings do not expose a direct external apply link in the result card.
62+
- In those cases the extractor keeps the stable Jobindex listing URL so the job is still actionable.
63+
64+
## Related pages
65+
66+
- [Extractors Overview](/docs/next/extractors/overview)
67+
- [Pipeline Run](/docs/next/features/pipeline-run)
68+
- [Add an Extractor](/docs/next/workflows/add-an-extractor)
