DaKheera47
diff --git a/docs-site/versioned_docs/version-0.9.1/extractors/adzuna.md b/docs-site/versioned_docs/version-0.9.1/extractors/adzuna.md
Lines changed: 67 additions & 0 deletions
diff --git a/docs-site/versioned_docs/version-0.9.1/extractors/golang-jobs.md b/docs-site/versioned_docs/version-0.9.1/extractors/golang-jobs.md
Lines changed: 69 additions & 0 deletions
diff --git a/docs-site/versioned_docs/version-0.9.1/extractors/gradcracker.md b/docs-site/versioned_docs/version-0.9.1/extractors/gradcracker.md
Lines changed: 76 additions & 0 deletions
diff --git a/docs-site/versioned_docs/version-0.9.1/extractors/hiring-cafe.md b/docs-site/versioned_docs/version-0.9.1/extractors/hiring-cafe.md
Lines changed: 81 additions & 0 deletions
diff --git a/docs-site/versioned_docs/version-0.9.1/extractors/jobindex.md b/docs-site/versioned_docs/version-0.9.1/extractors/jobindex.md
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,67 @@
+---
+id: adzuna
+title: Adzuna Extractor
+description: API-based Adzuna extraction with orchestrator ingestion and progress updates.
+sidebar_position: 6
+---
+
+## What it is
+
+Original website: [adzuna.com](https://www.adzuna.com)
+
+Adzuna is an API-backed extractor implemented in two lean pieces:
+
+1. `extractors/adzuna/src/main.ts` fetches paginated Adzuna search results and writes `jobs.json`.
+2. `orchestrator/src/server/services/adzuna.ts` runs the extractor, parses progress lines, and maps rows into `CreateJobInput`.
+
+De-duplication happens in the existing repository path, keyed on `sourceJobId` with a fallback to `jobUrl`.
+
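The de-duplication rule can be illustrated with a minimal sketch. It is not the repository's actual code: the `JobRow` shape and the `dedupeKey` and `dedupeJobs` helpers are hypothetical, and only the `sourceJobId` and `jobUrl` field names come from the text above.

```typescript
// Hypothetical row shape: only the two fields named in the text are modelled.
interface JobRow {
  sourceJobId?: string | null;
  jobUrl: string;
}

// Prefer the source's own ID; fall back to the job URL when it is missing or blank.
function dedupeKey(row: JobRow): string {
  const id = row.sourceJobId?.trim();
  return id ? `id:${id}` : `url:${row.jobUrl}`;
}

// Keep the first row seen for each key, preserving input order.
function dedupeJobs<T extends JobRow>(rows: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const row of rows) {
    const key = dedupeKey(row);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(row);
    }
  }
  return unique;
}
```

Prefixing the key with `id:` or `url:` keeps an ID from ever colliding with a URL that happens to contain the same text.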
+## Why it exists
+
+Adzuna provides stable API discovery for countries that are not covered by UK-only sources. It adds a lower-maintenance source without introducing new API routes or UI sections.
+
+## How to use it
+
+1. Create an Adzuna developer account.
+2. Open [Adzuna Access Details](https://developer.adzuna.com/admin/access_details).
+3. Copy your **App ID** and **App Key**.
+4. In Job Ops, open **Settings** and paste them into `Adzuna App ID` and `Adzuna App Key` under **Environment & Workspaces**.
+5. In **Pipeline Run** (Automatic tab), select a compatible country and enable **Adzuna** in Sources.
+6. Start the run; Adzuna progress appears in the existing crawl progress stream.
+
+City behavior:
+
+- If **Search cities** are set in Automatic advanced settings, Adzuna runs once per city.
+- City runs use strict post-filtering (`job.location` must contain the requested city) to avoid broad country-level spillover.
+- The normalized job payload now preserves structured location evidence from Adzuna's `location.display_name` field.
+
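As a rough illustration of the strict city post-filter, the sketch below keeps only jobs whose location text contains the requested city. The `LocatedJob` shape and `filterByCity` helper are hypothetical, and case-insensitive matching is an assumption rather than confirmed extractor behavior.

```typescript
// Hypothetical job shape; `location` stands in for the `job.location` field named above.
interface LocatedJob {
  title: string;
  location?: string | null;
}

// Keep only jobs whose location text contains the requested city.
// Assumption: matching is case-insensitive. Jobs without a location are dropped,
// because a strict filter cannot confirm where they are.
function filterByCity<T extends LocatedJob>(jobs: T[], city: string): T[] {
  const needle = city.trim().toLowerCase();
  if (needle === "") return jobs; // no city requested: nothing to filter
  return jobs.filter((job) => (job.location ?? "").toLowerCase().includes(needle));
}
```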
+Default controls:
+
+- `ADZUNA_APP_ID`
+- `ADZUNA_APP_KEY`
+- `ADZUNA_MAX_JOBS_PER_TERM` (default `50`)
+- `ADZUNA_LOCATION_QUERY` (optional city/location text)
+
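The controls above could be read along these lines. This is a sketch only: `AdzunaConfig` and `readAdzunaConfig` are hypothetical names, and treating an invalid or non-positive cap as the default `50` is an assumption about how the extractor handles bad input.

```typescript
// Hypothetical config shape built from the four environment variables listed above.
interface AdzunaConfig {
  appId: string;
  appKey: string;
  maxJobsPerTerm: number;
  locationQuery?: string;
}

type Env = Record<string, string | undefined>;

function readAdzunaConfig(env: Env): AdzunaConfig {
  const appId = env.ADZUNA_APP_ID?.trim();
  const appKey = env.ADZUNA_APP_KEY?.trim();
  if (!appId || !appKey) {
    throw new Error("ADZUNA_APP_ID and ADZUNA_APP_KEY are required");
  }
  // Assumption: anything that is not a positive integer falls back to the default of 50.
  const parsed = Number.parseInt(env.ADZUNA_MAX_JOBS_PER_TERM ?? "", 10);
  const maxJobsPerTerm = Number.isFinite(parsed) && parsed > 0 ? parsed : 50;
  // The location query is optional; blank values are treated as unset.
  const locationQuery = env.ADZUNA_LOCATION_QUERY?.trim() || undefined;
  return { appId, appKey, maxJobsPerTerm, locationQuery };
}
```

Passing the environment in as a plain object, rather than reading a global, keeps the function easy to exercise with different settings.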
+Supported countries in this integration:
+
+- United Kingdom, United States, Austria, Australia, Belgium, Brazil, Canada, Switzerland, Germany, Spain, France, India, Italy, Mexico, Netherlands, New Zealand, Poland, Singapore, South Africa.
+
+## Common problems
+
+### Adzuna is disabled in source selection
+
+- `Adzuna App ID` and `Adzuna App Key` are missing from Settings (or env).
+
+### Adzuna is skipped for my selected country
+
+- The selected country is not in the supported list above.
+
+### Adzuna fails with authorization errors
+
+- Verify `ADZUNA_APP_ID` and `ADZUNA_APP_KEY` are valid and active in your Adzuna account.
+
+## Related pages
+
+- [Extractors Overview](/docs/next/extractors/overview)
+- [Pipeline Run](/docs/next/features/pipeline-run)
+- [Settings](/docs/next/features/settings)
@@ -0,0 +1,69 @@
+---
+id: golang-jobs
+title: Golang Jobs Extractor
+description: Golang Jobs extraction integrated through the site's public Supabase-backed feed.
+sidebar_position: 10
+---
+
+## What it is
+
+Original website: [Golang Jobs](https://www.golangjobs.tech/)
+
+This extractor reads the public Golang Jobs feed exposed through the site's browser-facing Supabase API and maps those rows into the existing job-ops schema.
+
+Implementation split:
+
+1. `extractors/golangjobs/src/run.ts` paginates the public feed, applies local term, country, city, and workplace filters, and maps returned rows into `CreateJobInput`.
+2. `extractors/golangjobs/src/manifest.ts` adapts pipeline settings, emits progress updates, and registers the source for runtime discovery.
+
+## Why it exists
+
+Golang Jobs adds a Go-focused niche board that broad aggregators often miss.
+
+Using the same public feed the site already serves in the browser keeps the integration lighter and more stable than scraping rendered React pages.
+
+## How to use it
+
+1. Open **Run jobs** and choose **Automatic**.
+2. Leave **Golang Jobs** enabled in **Sources** or toggle it on.
+3. Set your usual automatic run controls:
+   - `searchTerms` are matched locally against title, company, description, requirements, and location.
+   - selected country or explicit city filters are applied after feed download.
+   - workplace type is respected from the location shape returned by the feed.
+   - run budget path (`jobspyResultsWanted`) is reused as a per-term cap.
+4. Start the run and monitor progress in the pipeline progress card.
+
+Defaults and constraints:
+
+- The extractor includes a built-in browser-facing anon key for the upstream public feed.
+- You can override that default with `GOLANG_JOBS_SUPABASE_ANON_KEY` if the upstream rotates the key.
+- The upstream feed is already Go-specific, but it is still broader than most job-ops searches, so local filtering remains important.
+- The extractor currently relies on the public `jobs` and `cities` relationship exposed by the site; if the site changes that schema, the extractor will need updating.
+- Remote roles are inferred from `cities.name === "Remote"`.
+
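The local filtering and remote inference described above might look roughly like this. The `FeedRow` shape is hypothetical apart from `cities.name`, and plain case-insensitive substring matching is an assumption about how terms are compared.

```typescript
// Hypothetical feed row; only `cities.name` is named in the text above.
interface FeedRow {
  title: string;
  company: string;
  description: string;
  requirements: string;
  cities: { name: string } | null;
}

// Remote roles are inferred from the linked city record.
function isRemote(row: FeedRow): boolean {
  return row.cities?.name === "Remote";
}

// Assumption: a term matches when it appears, case-insensitively, in any of the
// searched fields (title, company, description, requirements, location).
function matchesTerm(row: FeedRow, term: string): boolean {
  const needle = term.trim().toLowerCase();
  if (needle === "") return true; // an empty term does not narrow anything
  const haystack = [
    row.title,
    row.company,
    row.description,
    row.requirements,
    row.cities?.name ?? "",
  ]
    .join("\n")
    .toLowerCase();
  return haystack.includes(needle);
}
```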
+## Common problems
+
+### Golang Jobs does not appear in sources
+
+- Check that the app is running a build that includes the new extractor manifest and shared source metadata.
+
+### Golang Jobs health checks or runs fail immediately
+
+- If the upstream rotates its public browser key, set `GOLANG_JOBS_SUPABASE_ANON_KEY` in the server/container environment to override the built-in default.
+- Rebuild the container after adding new environment variables if you run job-ops through Docker.
+
+### Results are broader than expected
+
+- The source is niche but still broad within the Go ecosystem.
+- Add more specific search terms or explicit cities when you want a narrower result set.
+
+### Onsite-only runs return no Golang Jobs jobs
+
+- Many rows on this board are remote and are marked as such from the linked city record.
+- Include `remote` in workplace type selection if you want this source to contribute jobs.
+
+## Related pages
+
+- [Extractors Overview](/docs/next/extractors/overview)
+- [Pipeline Run](/docs/next/features/pipeline-run)
+- [Add an Extractor](/docs/next/workflows/add-an-extractor)
@@ -0,0 +1,76 @@
+---
+id: gradcracker
+title: Gradcracker Extractor
+description: How the Gradcracker crawler builds search URLs and extracts jobs.
+sidebar_position: 2
+---
+
+A plain-English walkthrough of the Gradcracker extractor in `extractors/gradcracker`.
+
+Original website: [gradcracker.com](https://www.gradcracker.com)
+
+## What it is
+
+The Gradcracker extractor finds UK graduate roles from [gradcracker.com](https://www.gradcracker.com).
+
+It now uses a fast HTTP-first scraper for normal runs. The scraper fetches Gradcracker list and detail HTML with a browser-like HTTP fingerprint, parses job cards locally, and decodes Gradcracker apply links without opening a browser. The older Playwright/Crawlee flow remains as a fallback when the HTTP path is blocked.
+
+## Why it exists
+
+Gradcracker is useful for UK graduate and early-career STEM roles that broad aggregators often miss.
+
+The HTTP-first implementation keeps the same normalized job output while avoiding the startup cost of launching a browser for every successful run.
+
+## How to use it
+
+1. Open **Run jobs** and choose **Automatic**.
+2. Select **United Kingdom** as the country.
+3. Leave **Gradcracker** enabled in **Sources** or toggle it on.
+4. Set your usual search terms and run budget.
+5. Start the run and monitor progress in the pipeline progress card.
+
+Defaults and controls:
+
+- Search terms are converted to Gradcracker role slugs, such as `software systems` to `software-systems`.
+- Defaults include `web-development` and `software-systems`.
+- `GRADCRACKER_MAX_JOBS_PER_TERM` controls the per-term cap.
+- `GRADCRACKER_HTTP_DETAIL_CONCURRENCY` controls concurrent detail-page fetches. The default is `2`.
+- `GRADCRACKER_HTTP_REQUEST_DELAY_MS` controls the minimum delay between HTTP request starts. The default is `1000`.
+- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` and `JOBOPS_EXISTING_JOB_URLS_FILE` are still honored by the browser fallback.
+- `GRADCRACKER_FORCE_BROWSER=1` forces the legacy Playwright/Crawlee path.
+- `GRADCRACKER_DISABLE_BROWSER_FALLBACK=1` skips the browser fallback and returns the HTTP scraper result directly, even if the fast path is blocked.
+
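The term-to-slug conversion can be sketched as follows. `toRoleSlug` is a hypothetical helper, and lowercasing plus collapsing every non-alphanumeric run into one hyphen is an assumption that reproduces the `software systems` to `software-systems` example.

```typescript
// Convert a free-text search term into a hyphenated role slug.
function toRoleSlug(term: string): string {
  return term
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse spaces and punctuation into single hyphens
    .replace(/^-+|-+$/g, ""); // drop any leading or trailing hyphen
}
```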
+Implementation flow:
+
+1. Build search URLs from UK regions and role terms.
+2. Fetch list pages and parse `article[wire:key]` job cards.
+3. Fetch detail pages for new jobs only.
+4. Extract `.body-content` description text.
+5. Decode Gradcracker `/out/...` apply URLs from the `u` query parameter.
+6. Reuse saved Cloudflare clearance cookies from the headed solve flow on the HTTP retry.
+7. Fall back to Playwright/Crawlee only when the HTTP path cannot proceed.
+
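Step 5 can be illustrated with the standard `URL` API. This sketch assumes the `u` query parameter carries a percent-encoded absolute URL; the real encoding may differ, and `decodeApplyUrl` is a hypothetical name.

```typescript
// Resolve a Gradcracker `/out/...` link to its external apply URL, or null if it
// cannot be decoded. Assumption: `u` holds a percent-encoded absolute http(s) URL.
function decodeApplyUrl(href: string, base = "https://www.gradcracker.com"): string | null {
  let parsed: URL;
  try {
    parsed = new URL(href, base); // accepts both relative and absolute hrefs
  } catch {
    return null;
  }
  if (!parsed.pathname.startsWith("/out/")) return null;

  const target = parsed.searchParams.get("u"); // searchParams percent-decodes the value
  if (!target) return null;

  try {
    const url = new URL(target);
    return url.protocol === "http:" || url.protocol === "https:" ? url.toString() : null;
  } catch {
    return null; // `u` was present but not an absolute URL
  }
}
```

Returning `null` rather than throwing lets the caller keep the Gradcracker job URL when no external target is available.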
+## Common problems
+
+### Gradcracker does not return jobs
+
+- Confirm the selected country is **United Kingdom**.
+- Try Gradcracker-specific terms such as `software systems`, `web development`, or `data science`.
+- Lower the run budget if a term is too broad and you only need the newest listings.
+
+### The HTTP scraper is blocked
+
+- Leave browser fallback enabled so the extractor can use the existing Playwright/Crawlee challenge handling.
+- When the app opens a challenge browser, complete the challenge and wait for the solver to save a `cf_clearance` cookie. The next HTTP retry uses that saved cookie and the same browser user agent.
+- Set `GRADCRACKER_FORCE_BROWSER=1` when you specifically need to debug the legacy browser flow.
+
+### Apply links stay on Gradcracker
+
+- Some listings may not expose a decodable `/out/...` target.
+- The extractor still stores the Gradcracker job URL, so those postings remain usable even when the final application URL is unavailable.
+
+## Related pages
+
+- [Extractors Overview](/docs/next/extractors/overview)
+- [Pipeline Run](/docs/next/features/pipeline-run)
+- [Add an Extractor](/docs/next/workflows/add-an-extractor)
@@ -0,0 +1,81 @@
+---
+id: hiring-cafe
+title: Hiring Cafe Extractor
+description: Browser-backed Hiring Cafe extraction integrated into the pipeline source selector.
+sidebar_position: 7
+---
+
+## What it is
+
+Original website: [hiring.cafe](https://hiring.cafe)
+
+Special thanks: Initial implementation inspiration came from [umur957/hiring-cafe-job-scraper](https://github.com/umur957/hiring-cafe-job-scraper).
+
+Hiring Cafe is a browser-backed extractor that queries Hiring Cafe search APIs and maps results into the orchestrator `CreateJobInput` shape.
+
+Implementation split:
+
+1. `extractors/hiringcafe/src/main.ts` builds search state, calls Hiring Cafe APIs, fetches job detail pages when search hits omit full descriptions, and writes dataset JSON.
+2. `orchestrator/src/server/services/hiring-cafe.ts` runs the extractor, streams progress events, and maps rows for pipeline import.
+
+## Why it exists
+
+Hiring Cafe adds another non-credentialed source that can be enabled from the existing source picker, without adding new settings UI.
+
+It also supports term-by-term search and country-aware search state using the same pipeline knobs you already set for automatic runs.
+
+## How to use it
+
+1. Open **Run jobs** and choose **Automatic**.
+2. **Hiring Cafe** is enabled by default in **Sources** (toggle it off if you do not want it for this run).
+3. Set your existing automatic run knobs:
+   - `searchTerms` drive the per-term Hiring Cafe `searchQuery`.
+   - selected country maps into Hiring Cafe location search state.
+   - run budget path (`jobspyResultsWanted`) is reused as the max jobs-per-term cap.
+   - optional **Search cities** narrow results by city.
+   - workplace type is forwarded from the automatic run modal as a global run filter.
+4. Start the run and watch progress in the pipeline progress card.
+
+Defaults and constraints:
+
+- No new Hiring Cafe settings fields were added.
+- `worldwide` and `usa/ca` run in broad mode without a strict country location filter.
+- Hiring Cafe is enabled by default in source selection.
+- Full job descriptions are loaded from Hiring Cafe detail pages when the search result payload only includes summary fields.
+- The normalized job payload now preserves structured location evidence from the formatted workplace and city/state/country fields.
+- `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` controls the recency window when running the extractor directly (default `7`).
+- When a city is provided via `searchCities`, Hiring Cafe uses city radius search (default `1` mile) and strict city post-filtering.
+- Workplace type is global to the run and is not configured separately per city in this integration.
+- City geocoding is resolved through Nominatim (OpenStreetMap data); if you scale extractor traffic, add attribution and cache repeated city lookups.
+
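One way to cache repeated city lookups is to memoize the geocoding call per normalized city name. The sketch below is not part of the extractor: `Coordinates`, `GeocodeFn`, and `cachedGeocoder` are hypothetical, and the actual Nominatim request is left to whatever `lookup` function the caller supplies.

```typescript
type Coordinates = { lat: number; lon: number };
type GeocodeFn = (city: string) => Promise<Coordinates | null>;

// Wrap a geocoding function so each distinct city is looked up at most once.
// The cache stores promises, so concurrent requests for one city share a single call.
function cachedGeocoder(lookup: GeocodeFn): GeocodeFn {
  const cache = new Map<string, Promise<Coordinates | null>>();
  return (city) => {
    const key = city.trim().toLowerCase();
    let pending = cache.get(key);
    if (!pending) {
      pending = lookup(city).catch((err) => {
        cache.delete(key); // do not cache failures; allow a later retry
        throw err;
      });
      cache.set(key, pending);
    }
    return pending;
  };
}
```

An in-memory map only lasts for one process; a longer-lived cache would need persistent storage.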
+Local run example:
+
+```bash
+HIRING_CAFE_SEARCH_TERMS='["backend engineer"]' \
+HIRING_CAFE_COUNTRY='united kingdom' \
+HIRING_CAFE_MAX_JOBS_PER_TERM='50' \
+npm --workspace hiringcafe-extractor run start
+```
+
+## Common problems
+
+### Hiring Cafe returns 429 / Vercel security checkpoint
+
+- The extractor first attempts Camoufox-backed Firefox and falls back to vanilla Firefox startup if Camoufox is unstable locally.
+- If upstream blocks continue, retry later or reduce run concurrency at the pipeline level by selecting fewer sources.
+
+### Hiring Cafe does not appear in sources
+
+- Check that the client is running the latest build, which contains the new source list.
+- Hiring Cafe is source-only and does not require credentials, so it should appear once the new build is loaded.
+
+### Results are lower than expected
+
+- The cap is tied to the automatic run budget path (`jobspyResultsWanted`) and the number of search terms.
+- Country mapping can narrow results when a strict country location is applied.
+
+## Related pages
+
+- [Extractors Overview](/docs/next/extractors/overview)
+- [Pipeline Run](/docs/next/features/pipeline-run)
+- [Settings](/docs/next/features/settings)
@@ -0,0 +1,68 @@
+---
+id: jobindex
+title: Jobindex Extractor
+description: Denmark-only Jobindex extraction through the browser page Stash payload.
+sidebar_position: 11
+---
+
+## What it is
+
+Original website: [Jobindex](https://www.jobindex.dk/)
+
+This extractor reads Jobindex search result pages and parses the embedded `var Stash = ...` payload that powers the browser result app. It supports Denmark searches with query terms and optional city filtering through Jobindex `geoareaid` resolution.
+
+Implementation split:
+
+1. `extractors/jobindex/src/run.ts` fetches `/jobsoegning?q=...`, resolves selected city locations through `storeData.geoareaOptions`, appends matching `geoareaid` filters, parses `searchResponse.results`, paginates with `page=2`, and maps rows into `CreateJobInput`.
+2. `extractors/jobindex/src/manifest.ts` enforces Denmark-only runs, adapts pipeline settings, emits progress updates, and registers the source for runtime discovery.
+
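Reading the embedded payload can be sketched as below. It assumes the right-hand side of `var Stash = ...` is a plain JSON object literal, which may not hold for every page, and `extractStash` is a hypothetical helper rather than the extractor's real parser.

```typescript
// Locate `var Stash = {...}` in the page HTML and parse the object literal as JSON.
// Braces are counted outside string literals so values containing "{" or "}" are safe.
function extractStash(html: string): unknown {
  const marker = html.search(/var\s+Stash\s*=/);
  if (marker === -1) return null;
  const start = html.indexOf("{", marker);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(html.slice(start, i + 1));
        } catch {
          return null; // the payload was not valid JSON after all
        }
      }
    }
  }
  return null; // unbalanced braces: payload truncated or malformed
}
```

Scanning for the matching brace avoids a greedy regular expression, which could overrun into later script content.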
+## Why it exists
+
+Jobindex is a strong local source for Denmark roles, and the embedded Stash payload exposes structured result data without needing browser rendering.
+
+Using the page payload keeps the scraper small while still capturing company, location, dates, rating, listing URL, and direct application links when they appear in the result HTML.
+
+## How to use it
+
+1. Open **Run jobs** and choose **Automatic**.
+2. Select **Denmark** as the country.
+3. Leave **Jobindex** enabled in **Sources** or toggle it on.
+4. Enter search terms such as:
+   ```text
+   software engineer
+   platform engineer
+   backend developer
+   ```
+5. Start the run and monitor list-page progress in the pipeline progress card.
+
+Defaults and constraints:
+
+- The extractor only runs when the selected country is `denmark`.
+- When city locations are selected, the extractor resolves them to Jobindex `geoareaid` filters and applies them on the search URL.
+- `JOBINDEX_MAX_JOBS_PER_TERM` controls the default per-term cap when no automatic run budget override is present.
+- Direct application links are parsed from the result card heading where available; otherwise the Jobindex listing URL is used.
+- Job descriptions come from paragraph text in the result card, not from full detail-page scraping.
+
+## Common problems
+
+### Jobindex does not run
+
+- Confirm the selected country is Denmark.
+- Check that the app build includes `extractors/jobindex/src/manifest.ts` and the shared `jobindex` source metadata.
+
+### Results are not filtered to a city
+
+- City filtering only applies when the selected city can be resolved to a Jobindex `geoareaid`.
+- If no match is found, the extractor falls back to the query-only URL for that search term.
+- Danish names and common transliterations such as `Kobenhavn`, `Aabenraa`, and `Sonderborg` are supported.
+
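Matching Danish names against their transliterations can be sketched by normalizing both sides before comparing. The mapping (`æ` to `ae`, `ø` to `o`, `å` to `aa`) is an assumption chosen to reproduce the examples above, and `GeoareaOption` and `resolveGeoareaId` are hypothetical names for the `geoareaOptions` lookup.

```typescript
// Fold Danish letters into the ASCII spellings used in the examples above.
function normalizeDanishName(name: string): string {
  return name
    .normalize("NFC") // compose any decomposed accents first
    .trim()
    .toLowerCase()
    .replace(/æ/g, "ae")
    .replace(/ø/g, "o")
    .replace(/å/g, "aa");
}

// Hypothetical shape for one entry of `storeData.geoareaOptions`.
interface GeoareaOption {
  id: number;
  label: string;
}

// Return the matching geoareaid, or null so the caller can fall back to the query-only URL.
function resolveGeoareaId(options: GeoareaOption[], city: string): number | null {
  const wanted = normalizeDanishName(city);
  const match = options.find((option) => normalizeDanishName(option.label) === wanted);
  return match ? match.id : null;
}
```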
+### Application links point to Jobindex
+
+- Some listings do not expose a direct external apply link in the result card.
+- In those cases the extractor keeps the stable Jobindex listing URL so the job is still actionable.
+
+## Related pages
+
+- [Extractors Overview](/docs/next/extractors/overview)
+- [Pipeline Run](/docs/next/features/pipeline-run)
+- [Add an Extractor](/docs/next/workflows/add-an-extractor)