feat(scan): add Workable, SmartRecruiters, Recruitee ATS parsers #653
jrojomartinez wants to merge 7 commits into
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (4)
📝 Walkthrough
Adds three providers (Workable, SmartRecruiters, Recruitee) that derive tenant feed/API URLs from careers URLs, validate HTTPS and allowlisted hostnames, fetch with redirects disabled, and parse responses into normalized job objects. Also adds tests and documentation.
Changes
Job Feed Providers with Auto-Detection and Parsing
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json.
Workable's documented JSON API requires an auth token; the only no-auth
public surface is a Markdown feed at `apply.workable.com/<slug>/jobs.md`.
The provider auto-detects from the `apply.workable.com/<slug>` careers_url
pattern, fetches via ctx.fetchText, and parses the table rows.
Follows the SSRF defence pattern from providers/greenhouse.mjs:
hostname allowlist + URL parse + HTTPS check + redirect:'error' on the
fetch call.
Exports parseWorkableMarkdown as a named export so test-all.mjs §11 can
unit-test the parser independently of the network.
Tests in test-all.mjs §11:
- detect() resolves apply.workable.com/<slug> → /jobs.md feed
- detect() returns null for non-workable URLs
- parseWorkableMarkdown extracts title/location/company correctly
- parseWorkableMarkdown strips .md suffix from job URLs
- empty / null inputs yield empty results without crashing
- fetch() with allowed hostname reaches the http context
Refs santifer#651
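The defence pattern named in this commit message (URL parse, HTTPS check, hostname allowlist, `redirect: 'error'`) could look roughly like the sketch below. It is an illustration only, not the code in `providers/workable.mjs`: `ALLOWED_HOSTS`, `isAllowedFeedUrl`, and `fetchFeed` are hypothetical names, and the `fetchText(url, options)` signature is an assumption about the `ctx.fetchText` helper.

```javascript
// Hosts this provider may contact; apply.workable.com is the host named in the commit message.
const ALLOWED_HOSTS = new Set(['apply.workable.com']);

// True only for well-formed HTTPS URLs whose hostname is on the allowlist.
function isAllowedFeedUrl(raw) {
  if (typeof raw !== 'string') return false;
  try {
    const u = new URL(raw);
    return u.protocol === 'https:' && ALLOWED_HOSTS.has(u.hostname);
  } catch {
    return false; // new URL() throws on unparseable input
  }
}

// ASSUMPTION: fetchText(url, options) forwards `options` to fetch().
async function fetchFeed(feedUrl, fetchText) {
  if (!isAllowedFeedUrl(feedUrl)) {
    throw new Error(`refusing to fetch untrusted URL: ${feedUrl}`);
  }
  // redirect: 'error' makes fetch() reject rather than follow a redirect to another host.
  return fetchText(feedUrl, { redirect: 'error' });
}
```

Checking the parsed `hostname` rather than the raw string is what keeps a URL such as `https://evil.example/apply.workable.com/acme` from passing.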
Auto-detects from careers_url pattern
`https://(careers|jobs).smartrecruiters.com/<slug>` and hits the
public /postings endpoint. tracked_companies entries can also set
`provider: smartrecruiters` to bypass detection (useful when the
public careers URL is a branded custom domain like `careers.adyen.com`).
Follows the SSRF defence pattern from providers/greenhouse.mjs:
hostname allowlist (api.smartrecruiters.com) + URL parse + HTTPS
check + redirect:'error'.
Notable parse decisions:
- location: prefer location.fullLocation; else assemble from
city/region/country (skipping empties); append "Remote" when
location.remote is true.
- url: rewrite j.ref's api.smartrecruiters.com prefix to
jobs.smartrecruiters.com so the link points at the public job
page, not the API. Falls back to a synthetic URL when ref is
missing.
Exports parseSmartRecruitersResponse as a named export so
test-all.mjs §12 can unit-test the parser.
Tests in test-all.mjs §12:
- detect() resolves both careers.* and jobs.* hostnames
- detect() returns null for non-SR URLs
- parser uses fullLocation when present
- parser assembles city/country/remote when fullLocation absent
- parser rewrites api.smartrecruiters.com → jobs.smartrecruiters.com
- parser synthesises a URL when ref is missing
- empty / malformed inputs yield empty results without crashing
Refs santifer#651
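As a rough illustration of the location rule described above, the sketch below prefers `fullLocation` and otherwise assembles the parts. The field names come from the commit message; the `', '` separator, the function name `formatLocation`, and applying "Remote" only in the assembled branch are assumptions.

```javascript
// Build a display location from a SmartRecruiters-style `location` object.
function formatLocation(location) {
  if (!location || typeof location !== 'object') return '';
  // Prefer the pre-formatted value when the API supplies one.
  if (location.fullLocation) return location.fullLocation;
  // Otherwise assemble from the parts, skipping empty or missing ones.
  const parts = [location.city, location.region, location.country].filter(Boolean);
  if (location.remote === true) parts.push('Remote');
  return parts.join(', '); // ASSUMPTION: comma-separated output
}
```

For example, `formatLocation({ city: 'Amsterdam', country: 'nl', remote: true })` yields `'Amsterdam, nl, Remote'` under these assumptions.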
Auto-detects from careers_url pattern `https://<slug>.recruitee.com`
and hits the public /api/offers/ endpoint. tracked_companies entries
can also set `provider: recruitee` to bypass detection.
SSRF defence: per-tenant subdomains are the variable part, so a
static hostname allowlist isn't workable. Uses a regex match on
`<safe-slug>.recruitee.com` (`^[a-z0-9][a-z0-9-]*\.recruitee\.com$`)
+ HTTPS check + redirect:'error'. The regex constrains the slug to
safe characters, preventing attacker-controlled hostnames from
slipping through.
Notable parse decisions:
- url: prefer `careers_url` (the public job page), fall back to
`url` (some installs use it instead), empty string otherwise.
- location: prefer the explicit `location` field; else assemble
from city/country with "Remote" appended when remote is true.
Exports parseRecruiteeResponse as a named export for tests.
Tests in test-all.mjs §13:
- detect() resolves <slug>.recruitee.com → /api/offers/
- detect() returns null for non-recruitee URLs
- parser prefers careers_url over url
- parser assembles location from city/country/remote
- parser uses explicit location field when present
- empty / null inputs yield empty results without crashing
Refs santifer#651
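The per-tenant hostname check can be sketched as follows. The regex is the one quoted in the commit message; `resolveOffersUrl` is a hypothetical name, and applying the regex to a parsed hostname (rather than the raw string) is one possible way to use it, not necessarily what this commit does.

```javascript
// One safe label directly under recruitee.com (regex quoted in the commit message).
const RECRUITEE_HOST_RE = /^[a-z0-9][a-z0-9-]*\.recruitee\.com$/;

// Returns the tenant's offers endpoint, or null when the URL is not a trusted Recruitee host.
function resolveOffersUrl(careersUrl) {
  if (typeof careersUrl !== 'string') return null;
  let u;
  try {
    u = new URL(careersUrl);
  } catch {
    return null; // unparseable input
  }
  if (u.protocol !== 'https:' || !RECRUITEE_HOST_RE.test(u.hostname)) return null;
  return `https://${u.hostname}/api/offers/`;
}
```

Because the slug character class excludes `.`, a nested host such as `a.b.recruitee.com` does not match, and neither does `recruitee.com.evil.com`.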
f88f2a2 to a67e794
@coderabbitai review
✅ Actions performed
Review triggered.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/recruitee.mjs`:
- Around line 26-30: resolveApiUrl currently uses a regex on the whole
careers_url and can be tricked by arbitrary strings; instead, parse
entry.careers_url with the URL constructor, verify protocol === 'https:',
validate hostname exactly matches the pattern "<slug>.recruitee.com" (where slug
matches /^[a-z0-9][a-z0-9-]*$/) by splitting hostname on '.' and checking parts
length and values, then extract slug from the hostname and return
`https://${slug}.recruitee.com/api/offers/`; ensure resolveApiUrl catches URL
parsing errors and returns null for missing, non-https, or non-matching
hostnames to avoid SSRF/command-injection/path-traversal risks.
In `@providers/smartrecruiters.mjs`:
- Around line 26-30: The resolveApiUrl function should parse entry.careers_url
with the URL constructor (guarding with try/catch for invalid/missing values),
then require urlObj.hostname to equal exactly "careers.smartrecruiters.com" or
"jobs.smartrecruiters.com" before extracting the slug from urlObj.pathname
(e.g., the first non-empty path segment) and returning the same API string
(https://api.smartrecruiters.com/v1/companies/{slug}/postings?limit=100&offset=0&status=PUBLIC);
if parsing fails, hostname doesn't match, or the slug is missing, return null.
- Around line 76-78: Validate and parse j.ref with the URL constructor before
doing any replace: check that j.ref is a valid URL whose hostname is
"api.smartrecruiters.com" and whose pathname starts with "/v1/companies/"; only
then map it to the jobs.smartrecruiters.com pattern (preserving protocol and
path parts) and otherwise fall back to a sanitized slug. Replace the current
inline replace logic for the url variable with a guarded branch: attempt to
parse j.ref, validate host/path, build the jobs URL from parsed parts if valid,
else construct the fallback using a slugified companyName (lowercase, trim,
collapse whitespace, remove/replace non-alphanumeric chars with hyphens and
strip leading/trailing hyphens) combined with j.id and slugified; ensure you
handle missing companyName/j.id safely and never trust raw j.ref to prevent
malformed URLs or SSRF.
In `@providers/workable.mjs`:
- Around line 26-30: The current resolveFeedUrl(entry) uses a substring regex
and can misdetect non-Workable URLs; instead, parse entry.careers_url with new
URL() inside resolveFeedUrl, catch any thrown errors and return null for
missing/invalid URLs, verify url.protocol === 'https:' and url.hostname ===
'apply.workable.com', then extract the slug from url.pathname (the first path
segment) and return `https://apply.workable.com/${slug}/jobs.md`; do not rely on
a regex on the raw string and ensure all error paths return null to avoid
SSRF/invalid inputs.
In `@test-all.mjs`:
- Around line 373-387: Add a true-negative SSRF test that ensures untrusted
hosts are rejected and fetchText/fetchJson are never invoked: call
workable.fetch with a careers_url like
"https://evil.example/apply.workable.com/slug" (or similar) and provide
transport handlers where fetchText and fetchJson throw if called; then assert
workable.fetch rejects (or throws) for that input so the test verifies the
untrusted-host path rejects before any network helper is invoked. Reference
workable.fetch and the transport methods fetchText/fetchJson when making the
change.
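The true-negative test requested for `test-all.mjs` might be shaped like the sketch below. Since the real `workable.fetch` is not reproduced here, `providerFetch` is a simplified stand-in written for this example; `trippedTransport` and `rejectsBeforeNetwork` are likewise hypothetical names.

```javascript
// Stand-in for a provider's fetch(); NOT the real workable.fetch.
async function providerFetch(entry, ctx) {
  let u;
  try { u = new URL(entry.careers_url); } catch { u = null; }
  if (!u || u.protocol !== 'https:' || u.hostname !== 'apply.workable.com') {
    throw new Error('untrusted careers_url');
  }
  const slug = u.pathname.split('/').filter(Boolean)[0];
  return ctx.fetchText(`https://apply.workable.com/${slug}/jobs.md`);
}

// Transport whose helpers fail loudly if the provider ever reaches the network.
const trippedTransport = {
  fetchText() { throw new Error('fetchText must not be called'); },
  fetchJson() { throw new Error('fetchJson must not be called'); },
};

// Resolves to true only if the provider rejected before touching the transport.
async function rejectsBeforeNetwork(careersUrl) {
  try {
    await providerFetch({ careers_url: careersUrl }, trippedTransport);
    return false; // provider accepted the URL
  } catch (err) {
    return err.message === 'untrusted careers_url';
  }
}
```

Distinguishing the two error messages matters: a test that only asserts "it threw" would also pass if the provider had already called `fetchText`.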
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3c7a9f83-d383-4497-ac17-9b85efab2eb7
📒 Files selected for processing (5)
providers/recruitee.mjs
providers/smartrecruiters.mjs
providers/workable.mjs
templates/portals.example.yml
test-all.mjs
Pre-emptive hardening following the same defensive pattern CodeRabbit flagged on PR santifer#652. All changes are within the providers shipped in this PR; no scan.mjs / framework changes.
- All three providers: `careers_url` is now type-checked before .match() so a non-string YAML value (number, object, array) returns null from detect() rather than throwing.
- smartrecruiters: ref-rewrite uses an anchored regex (`/^https:\/\/api\.smartrecruiters\.com\/v1\/companies\//`) so the replacement only fires at the URL prefix. The fallback URL path (when both j.ref AND j.id are missing) now returns an empty string instead of synthesising a URL containing the literal "undefined"; the empty string is the contract-allowed default for url per _types.js > Job. The magic number 100 in the postings limit is now a named SR_PAGE_SIZE constant.
- workable: parseWorkableMarkdown now extracts URLs via a line-level regex `/\[View\]\(([^)]+)\)/` rather than a column-position match, so a title containing a stray `|` doesn't shift cols[7] and silently drop the URL. Rows that still don't resolve a URL are skipped (no empty-URL entries leak into the dedup tracker).
- test-all.mjs: 6 new assertions covering the defensive paths (non-string careers_url across all 3 providers, the SR no-ref/no-id fallback, the Workable stray-pipe survival, and a real Workable fetch() rejection test against an unresolvable careers_url).
Refs santifer#651
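The line-level link extraction described for Workable can be illustrated with the sketch below. Only the `[View](...)` regex and the `.md` stripping come from the text; the function name `parseRows` and the assumption that the title sits in the first table cell are invented for the example, and the real feed layout may differ.

```javascript
const VIEW_LINK_RE = /\[View\]\(([^)]+)\)/; // regex quoted in the commit message

// Parse table rows from a Markdown feed.
// ASSUMPTION: the title is the first cell; other columns are ignored here.
function parseRows(markdown) {
  if (typeof markdown !== 'string') return [];
  const jobs = [];
  for (const line of markdown.split('\n')) {
    const row = line.trim();
    if (!row.startsWith('|')) continue;
    // Match the link anywhere in the line, so extra pipes cannot shift it out of reach.
    const match = VIEW_LINK_RE.exec(row);
    if (!match) continue; // header, separator, or row without a link: skip
    const url = match[1].replace(/\.md$/, '');
    // A stray pipe may still truncate this naive title; the URL survives regardless.
    const title = row.split('|')[1].trim();
    jobs.push({ title, url });
  }
  return jobs;
}
```

Skipping rows without a link also drops the header and separator rows, so no empty-URL entries are produced.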
♻️ Duplicate comments (1)
providers/smartrecruiters.mjs (1)
79-79: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Slugify `companyName` in fallback URL construction.
When `j.ref` is missing, the fallback URL uses `(companyName || '').toLowerCase()` directly, which preserves spaces and special characters (e.g., "SGS Group" → "sgs group"). This produces malformed URL paths.
🔧 Suggested fix
+ const companySlug = (companyName || '').toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
  const url = j.ref
    ? j.ref.replace(/^https:\/\/api\.smartrecruiters\.com\/v1\/companies\//, 'https://jobs.smartrecruiters.com/')
-   : j.id ? `https://jobs.smartrecruiters.com/${(companyName || '').toLowerCase()}/${j.id}-${slugified}` : '';
+   : j.id ? `https://jobs.smartrecruiters.com/${companySlug}/${j.id}-${slugified}` : '';
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@providers/smartrecruiters.mjs` at line 79, The fallback URL uses (companyName || '').toLowerCase() which leaves spaces/special chars unescaped; update the ternary branch that builds the URL for j.id to slugify companyName the same way as the existing slugified job name (use the same slugifying logic/helper used to compute slugified) and insert that slugifiedCompanyName in place of (companyName || '').toLowerCase() so the URL path becomes https://jobs.smartrecruiters.com/{slugifiedCompanyName}/{j.id}-{slugified}.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@providers/smartrecruiters.mjs`:
- Line 79: The fallback URL uses (companyName || '').toLowerCase() which leaves
spaces/special chars unescaped; update the ternary branch that builds the URL
for j.id to slugify companyName the same way as the existing slugified job name
(use the same slugifying logic/helper used to compute slugified) and insert that
slugifiedCompanyName in place of (companyName || '').toLowerCase() so the URL
path becomes
https://jobs.smartrecruiters.com/{slugifiedCompanyName}/{j.id}-{slugified}.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b1eec1a7-9d2d-41d2-98cb-3e14f71d0722
📒 Files selected for processing (4)
providers/recruitee.mjs
providers/smartrecruiters.mjs
providers/workable.mjs
test-all.mjs
Addresses 5 CodeRabbit comments on PR santifer#653 asking for tighter validation than substring regex on raw URL strings.
- All 3 providers: detect()/resolveXxxUrl() now use new URL() to parse careers_url, verify protocol === 'https:', check hostname exactly (Workable: apply.workable.com; SmartRecruiters: careers./jobs.smartrecruiters.com; Recruitee: regex-validated <slug>.recruitee.com), then derive the slug from the parsed pathname/hostname. This rejects path-spoofed inputs like https://evil.example/apply.workable.com/slug (substring regex would have falsely matched).
- smartrecruiters parseSmartRecruitersResponse: j.ref is now validated (parses as URL, hostname must be api.smartrecruiters.com, pathname must start with /v1/companies/) before the prefix rewrite. Invalid refs fall through to the fallback URL path. The fallback companyName is now slugified (non-alphanumerics → -, strip leading/trailing -) so "My Acme & Co." → "my-acme-co" rather than producing a URL with raw spaces/symbols.
- test-all.mjs: 5 new assertions covering the path-spoof rejection for all 3 providers, the untrusted-ref-host fall-through, and the companyName slugification.
Refs santifer#651
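A minimal sketch of the ref validation and slugified fallback described in this commit follows. `slugify` and `publicJobUrl` are hypothetical names, `job.name` as the title field is an assumption, and the real parser may structure this differently.

```javascript
// Lowercase, collapse runs of non-alphanumerics into one hyphen, trim hyphens at both ends.
function slugify(text) {
  return String(text || '').toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

const API_PREFIX = '/v1/companies/';

// Map a trusted API ref to the public job page; otherwise build a fallback URL, else ''.
function publicJobUrl(job, companyName) {
  try {
    const ref = new URL(job.ref);
    if (ref.hostname === 'api.smartrecruiters.com' && ref.pathname.startsWith(API_PREFIX)) {
      return `https://jobs.smartrecruiters.com/${ref.pathname.slice(API_PREFIX.length)}`;
    }
  } catch {
    // Missing or malformed ref: fall through to the fallback below.
  }
  if (!job.id) return ''; // nothing to build a URL from
  const tail = slugify(job.name); // ASSUMPTION: the posting title lives in `name`
  const leaf = tail ? `${job.id}-${tail}` : `${job.id}`;
  return `https://jobs.smartrecruiters.com/${slugify(companyName)}/${leaf}`;
}
```

With this helper, `slugify('My Acme & Co.')` gives `'my-acme-co'`, matching the example in the commit message, and a ref on any other host is ignored rather than rewritten.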
Thanks for the thorough review. Addressed all 5 comments in the latest force-push. Summary:
Ready for re-review.
@coderabbitai review
✅ Actions performed
Review triggered.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/recruitee.mjs`:
- Around line 80-84: parseRecruiteeResponse currently returns j.careers_url or
j.url verbatim which can lead to off-domain or non-HTTPS job.url values; modify
parseRecruiteeResponse to validate and normalize the URL: parse
j.careers_url/j.url with the URL constructor, ensure protocol === 'https:' and
hostname matches the tenant API host (or an allowed host list derived from the
request/tenant), and only return the validated URL (otherwise return '' or
undefined); reference parseRecruiteeResponse, job.url, j.careers_url, and j.url
when implementing this check.
In `@providers/smartrecruiters.mjs`:
- Line 41: The current postings URL builder hardcodes limit and offset (uses
SR_PAGE_SIZE and offset=0) and the fetch logic only issues a single request,
truncating results; modify the postings fetch flow to page through results by
looping requests: call the URL builder (the function that returns
`https://api.smartrecruiters.com/v1/companies/${slug}/postings?...`) with an
increasing offset (offset += SR_PAGE_SIZE) and fetch repeatedly until the API
returns an empty `content` array, aggregating each response's content into a
single results array; ensure SR_PAGE_SIZE is used for limit, handle HTTP errors
as the existing fetch logic does, and return the combined list instead of a
single-page response.
In `@providers/workable.mjs`:
- Around line 84-88: In parseWorkableMarkdown(), don't trust the raw url from
urlMatch; attempt to construct a URL object from the extracted string (wrap in
try/catch to handle malformed values) and validate that urlObj.protocol ===
'https:' and urlObj.hostname matches your allowed host(s) (or the same
origin/host used when fetching the feed) before pushing to jobs; if validation
fails or URL construction throws, skip that row (i.e., do not push to jobs). Use
the existing symbols urlMatch, url, and jobs and add hostname/protocol checks
and error-safe parsing inside that loop.
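The output-URL checks requested for the Recruitee and Workable parsers share one shape, sketched below. `sanitizeJobUrl` and the two host predicates are hypothetical helpers written for this example; returning `''` for a rejected URL follows the "url = ''" suggestion in the Recruitee comment, whereas the Workable comment asks for the row to be skipped instead.

```javascript
// Return the URL unchanged if it is HTTPS on an accepted host, otherwise ''.
// `isAllowedHost` is a caller-supplied predicate on the parsed hostname.
function sanitizeJobUrl(raw, isAllowedHost) {
  if (typeof raw !== 'string' || raw === '') return '';
  try {
    const u = new URL(raw);
    return u.protocol === 'https:' && isAllowedHost(u.hostname) ? raw : '';
  } catch {
    return ''; // malformed URL
  }
}

// Example predicates for the two providers discussed above.
const isWorkableHost = (host) => host === 'apply.workable.com';
const isRecruiteeHost = (host) => /^[a-z0-9][a-z0-9-]*\.recruitee\.com$/.test(host);
```

A Workable parser could then `continue` when the result is empty, while a Recruitee parser could store the empty string directly.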
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c4afeaa3-d7f3-4f45-873a-760e2b2b4cb2
📒 Files selected for processing (4)
providers/recruitee.mjs
providers/smartrecruiters.mjs
providers/workable.mjs
test-all.mjs
Addresses 3 CodeRabbit comments on PR santifer#653 (round 2).
- recruitee: parseRecruiteeResponse now validates the offer URL via new URL() + protocol === 'https:' + RECRUITEE_HOST_RE hostname check. Off-domain or non-HTTPS values are dropped (url = '' per the Job contract) rather than passed through verbatim.
- workable: parseWorkableMarkdown now validates each [View] link the same way (hostname must be apply.workable.com, protocol must be https). Rows that fail validation are skipped (continue), matching the existing "skip rows with no resolvable URL" semantic.
- smartrecruiters: fetch() now paginates the /postings endpoint instead of returning only the first 100 results. Added resolveSlug() and buildPostingsUrl(slug, offset) helpers, refactored resolveApiUrl() to delegate to them, and the fetch loop walks offsets 0, SR_PAGE_SIZE, 2*SR_PAGE_SIZE, ... until either an empty page or a short page (less than SR_PAGE_SIZE). Safety cap SR_MAX_PAGES = 50 (= 5000 postings) prevents runaway loops against a broken API.
- test-all.mjs: 4 new assertions
  - Workable: off-domain + non-https [View] links are dropped
  - Recruitee: off-domain + non-https + missing offer URLs → url=''
  - SmartRecruiters: 2-page aggregation (150 items across 2 pages)
  - SmartRecruiters: stop on the first empty page (1 request)
Refs santifer#651
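The pagination loop described for SmartRecruiters can be sketched as follows. `SR_PAGE_SIZE` and `SR_MAX_PAGES` use the values stated in the commit message; `fetchAllPostings` and the `fetchPage(offset)` callback are hypothetical, standing in for whatever the provider does to request and decode one page shaped like `{ content: [...] }`.

```javascript
const SR_PAGE_SIZE = 100; // page size named in the commit message
const SR_MAX_PAGES = 50;  // safety cap: at most 50 * 100 = 5000 postings

// ASSUMPTION: fetchPage(offset) resolves to one decoded page, e.g. { content: [...] }.
async function fetchAllPostings(fetchPage) {
  const all = [];
  for (let page = 0; page < SR_MAX_PAGES; page++) {
    const body = await fetchPage(page * SR_PAGE_SIZE);
    const items = Array.isArray(body && body.content) ? body.content : [];
    all.push(...items);
    // An empty or short page means there is nothing further to request.
    if (items.length < SR_PAGE_SIZE) break;
  }
  return all;
}
```

Stopping on a short page saves one request per company compared with waiting for an empty page, and the page cap bounds the loop even if the API keeps returning full pages.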
Round 2 fixes pushed:
Ready for re-review.
@coderabbitai review
✅ Actions performed
Review triggered.
@coderabbitai review: checking against latest commit (434375b) after the round-2 fixes for the 3 comments from 17:04.
✅ Round-2 fixes verified
✅ All earlier comments confirmed resolved
🔍 One minor nit (non-blocking)
In the fallback URL, an empty name leaves a trailing hyphen:
// name = '' → slugified = '' → url = `…/${j.id}-`
url = `https://jobs.smartrecruiters.com/${companySlug}/${j.id}-${slugified}`;
You could guard this with:
url = slugified
  ? `https://jobs.smartrecruiters.com/${companySlug}/${j.id}-${slugified}`
  : `https://jobs.smartrecruiters.com/${companySlug}/${j.id}`;
Not a security concern, just cosmetic for the rare case of a nameless posting.
Overall
The implementation is solid: clean architecture, proper SSRF mitigations.
[approve]
Rewritten for the 1.8.0 plugin-based provider architecture. The original PR edited `detectApi` / `PARSERS` in `scan.mjs`, but those have been removed in 1.8.0; providers now live in `providers/*.mjs` and follow the contract in `providers/_types.js`. This PR delivers three new provider files.
Summary
Adds Workable, SmartRecruiters, and Recruitee as zero-token providers. Strictly additive: existing providers are untouched, and a user with none of these in `tracked_companies` sees no behaviour change.
Files
- `providers/workable.mjs`: markdown-feed parser (Workable's only no-auth surface)
- `providers/smartrecruiters.mjs`: public /postings API
- `providers/recruitee.mjs`: public /api/offers/ per-tenant API
- `test-all.mjs`: adds §11 / §12 / §13 with ~27 unit-test assertions
- `templates/portals.example.yml`: documents the new URL patterns
Design note: `fetchText` is already there
1.8.0's `providers/_http.mjs` exports both `fetchJson` and `fetchText`. Workable's documented JSON API requires an auth token and the legacy unauthenticated endpoint 404s universally; the only no-auth public feed is a Markdown document at `apply.workable.com/{slug}/jobs.md`. The Workable provider uses `ctx.fetchText` + the new `parseWorkableMarkdown` parser. No `_http.mjs` changes needed.
SSRF defence (matches `providers/greenhouse.mjs`)
Each provider:
- Parses the URL with `new URL(...)`.
- Requires the `https:` protocol.
- Checks the hostname against an allowlist (`apply.workable.com`, `api.smartrecruiters.com`), or a regex for Recruitee since slugs vary per tenant (`^[a-z0-9][a-z0-9-]*\.recruitee\.com$`).
- Sets `redirect: 'error'` on the fetch call to prevent server-side-redirect SSRF.
Tests
- `node test-all.mjs --quick` passes (upstream baseline + ~27 new assertions across §11 / §12 / §13)
- Each provider's `detect()` matches its URL pattern and returns null otherwise
- Workable parser strips the `.md` suffix; SmartRecruiters parser rewrites `j.ref` to the public hostname; Recruitee parser prefers `careers_url` over `url`
- Each provider's `fetch()` honours the hostname allowlist (sample test exercises the success path)
Validated downstream
- `optimile` (Ghent / Belgium / Hybrid)
- `sgs` (known-active tenant)
- `channable`
Summary by CodeRabbit