feat(notion): add scan-notion.mjs — read job postings from Notion database into pipeline #671
Schlaflied wants to merge 3 commits into
…abase into pipeline

Adds a zero-token scanner that queries a Notion job database, filters by title keywords, deduplicates, and appends new URLs to pipeline.md.

Features:
- Queries Notion database filtered by configurable status property
- Extracts URL, title, company from configurable property names
- Same title keyword filter as scan.mjs / scan-gmail.mjs
- Deduplicates against scan-history.tsv + applications.md + pipeline.md
- Optionally marks processed pages as "Queued" in Notion after scan
- --days N: filter by Notion page creation date (created_time filter)
- --dry-run, --auth flags
- Config via config/profile.yml under notion: key

Closes santifer#668
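The deduplication step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `normalizeUrl`, `seenFromHistory`, and `filterNew` are hypothetical names, and the normalization rules (drop the fragment and a trailing slash) are assumptions. The only detail taken from this PR is that the URL is the first tab-separated column of a history row.

```javascript
// Hypothetical normalizer: strips the fragment and a trailing slash so that
// trivially different spellings of one posting compare equal.
function normalizeUrl(raw) {
  try {
    const u = new URL(raw.trim());
    u.hash = '';
    return u.toString().replace(/\/$/, '');
  } catch {
    // Not a parseable URL (e.g. a header row): fall back to the trimmed text.
    return raw.trim();
  }
}

// Builds the set of already-seen URLs from TSV text whose first column is the URL.
function seenFromHistory(tsv) {
  const seen = new Set();
  for (const line of tsv.split('\n')) {
    const url = line.split('\t')[0];
    if (url) seen.add(normalizeUrl(url));
  }
  return seen;
}

// Returns only URLs not seen before, and records them so that duplicates
// within the same batch are dropped as well.
function filterNew(urls, seen) {
  const fresh = [];
  for (const raw of urls) {
    const url = normalizeUrl(raw);
    if (seen.has(url)) continue;
    seen.add(url);
    fresh.push(url);
  }
  return fresh;
}

// Usage with made-up data.
const history = 'https://example.com/jobs/1\t2024-01-01\tNotion\tEngineer\tAcme\tadded\n';
const fresh = filterNew(
  ['https://example.com/jobs/1/', 'https://example.com/jobs/2', 'https://example.com/jobs/2#apply'],
  seenFromHistory(history),
);
console.log(fresh);
```

In the real script the seen set would presumably also be seeded from applications.md and pipeline.md, which this sketch leaves out.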
📝 Walkthrough
A new CLI script …
Changes: Notion Database Scanner Implementation
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 ESLint: skipped, no ESLint configuration detected in root package.json.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scan-notion.mjs`:
- Around line 220-235: appendPipeline currently assumes the parent directory for
PIPELINE_PATH exists and can throw ENOENT; modify appendPipeline to ensure the
directory exists before any write by computing the directory (use
path.dirname(PIPELINE_PATH)) and creating it if missing (e.g., fs.mkdirSync(dir,
{ recursive: true })) prior to the existsSync/writeFileSync/appendFileSync calls
so the initial run on a clean checkout succeeds; keep references to
PIPELINE_PATH and the appendPipeline function when making the change.
- Around line 116-124: The updatePageStatus function currently always builds a
select payload; modify it to first retrieve the page/database property type (use
an existing helper or call notionFetch to GET the property schema) for the given
propertyName, then switch payload shapes: if the property is "status" send {
status: { name: statusValue } }, otherwise send { select: { name: statusValue }
}; update both updatePageStatus and the other update call sites (the similar
logic referenced around the other patch/update usage) to use this
property-type-aware branching and reuse notionFetch for the PATCH request.
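
For the first finding, a minimal sketch of a directory-safe append might look like the following. It assumes Node's built-in `fs`, `os`, and `path` modules. The two-argument signature and the `# Pipeline` header are illustrative assumptions, since the real `appendPipeline` reads `PIPELINE_PATH` and its file layout is not shown in this review.

```javascript
import { appendFileSync, existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { dirname, join } from 'node:path';

// Hypothetical stand-in for appendPipeline: takes the path as a parameter
// instead of reading a module-level PIPELINE_PATH constant.
function appendPipeline(pipelinePath, lines) {
  // Create the parent directory first; with { recursive: true } this is a
  // no-op when the directory already exists.
  mkdirSync(dirname(pipelinePath), { recursive: true });
  if (!existsSync(pipelinePath)) {
    writeFileSync(pipelinePath, '# Pipeline\n\n'); // assumed header, for illustration only
  }
  appendFileSync(pipelinePath, lines.map((line) => `${line}\n`).join(''));
}

// Usage: simulate a clean checkout where data/ does not exist yet.
const root = mkdtempSync(join(tmpdir(), 'scan-notion-'));
const pipelinePath = join(root, 'data', 'pipeline.md');
appendPipeline(pipelinePath, ['- [ ] https://example.com/jobs/1']);
const written = readFileSync(pipelinePath, 'utf8');
console.log(written);
```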
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ddb599fd-a41c-426d-878f-5c1f9456cd1c
📒 Files selected for processing (1)
scan-notion.mjs
- appendPipeline: ensure data/ directory exists before writing pipeline.md
(mkdirSync recursive) so first run on a clean checkout does not throw ENOENT
- updatePageStatus: detect property type via GET /pages/{id} and send
{ status: } vs { select: } payload accordingly — avoids API 400 errors
on databases using the newer Notion 'status' property type
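
The payload branching described in this commit could be sketched as below. The helper names are hypothetical, and the page object is assumed to have the shape returned by `GET /v1/pages/{id}`, where each entry in `properties` carries a `type` field. Defaulting to `select` when the property is missing is also an assumption.

```javascript
// Hypothetical helper: reads the property type off a retrieved page object.
// Falls back to 'select' when the property cannot be found (assumed default).
function detectPropertyType(page, propertyName) {
  return page?.properties?.[propertyName]?.type ?? 'select';
}

// Hypothetical helper: builds the PATCH body for updating the page,
// using { status: ... } for native status properties and { select: ... } otherwise.
function buildStatusPayload(propertyName, propertyType, statusValue) {
  const key = propertyType === 'status' ? 'status' : 'select';
  return { properties: { [propertyName]: { [key]: { name: statusValue } } } };
}

// Usage with a made-up page object.
const page = { properties: { Status: { id: 'abc', type: 'status', status: { name: 'To Review' } } } };
const payload = buildStatusPayload('Status', detectPropertyType(page, 'Status'), 'Queued');
console.log(JSON.stringify(payload));
```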
Actionable comments posted: 1
♻️ Duplicate comments (2)
scan-notion.mjs (2)
371-383: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Skip pages with no extracted title before classifying or queueing them.
Because this branch is gated by `title &&`, an empty or misconfigured title falls through as a match, lands in the default bucket, and can be marked `Queued`. Add an explicit `!title` skip before the filter/classification path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scan-notion.mjs` around lines 371 - 383, The code currently proceeds to classify and queue pages because the existing branch only skips when title && !passesFilter(title); add an explicit check for missing titles (e.g., if (!title) { ... continue; }) before the filter/classification block: increment a suitable counter (e.g., totalSkippedNoTitle), write a historyRows entry noting the skip (use the same format as other entries, e.g., `${normalUrl}\t${TODAY}\tNotion\t${title}\t${company}\tskipped_no_title`), mark seenUrls.add(normalUrl), and do not push into newItems, toMarkQueued, or call classifyBucket/page.id when cfg.statusQueued is true. Ensure you reference the same symbols (title, passesFilter, classifyBucket, newItems, historyRows, seenUrls, toMarkQueued, page.id, cfg.statusQueued) so the change sits immediately before the existing filter/classify logic.
305-308: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Use a `status` filter for real Notion status properties.
The scan path still always sends `{ select: { equals: ... } }` here. Notion's database-query docs treat `select` and `status` as different filter types, and a `status` property must be queried with a `status` filter, so databases using the default `Status` field can still fail before any pages are processed. (developers.notion.com)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scan-notion.mjs` around lines 305 - 308, The current filter construction always uses select (let filter = { property: cfg.statusProperty, select: { equals: cfg.statusUnreviewed } }), which fails for Notion status properties; update the code that builds filter to choose the filter type based on the property kind — if the configured property is a Notion "status" field, create { property: cfg.statusProperty, status: { equals: cfg.statusUnreviewed } }, otherwise keep the select form; detect the property kind via whatever metadata you have (e.g., a cfg flag like statusPropertyType or by querying the database/prop schema) and apply the appropriate branch where the filter variable is set.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scan-notion.mjs`:
- Around line 55-65: The startup currently aborts if PROFILE_PATH is missing
which prevents the env fallback; change the logic so absence of
config/profile.yml does not immediately exit: if (!existsSync(PROFILE_PATH)) set
profile = {} (or skip reading the file) and continue, then compute token =
n.token || process.env.NOTION_TOKEN and databaseId = n.database_id ||
process.env.NOTION_DATABASE_ID and only call process.exit(1) when AUTH_MODE is
false AND both token and databaseId are still missing. Update the block around
PROFILE_PATH, loadYaml/readFileSync, and the subsequent auth check (symbols:
PROFILE_PATH, existsSync, loadYaml, profile, n, token, databaseId, AUTH_MODE) to
implement this conditional exit behavior.
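
The conditional exit described in this finding could be sketched as a pure function. `resolveNotionAuth` and its return shape are hypothetical. `profile` stands for the parsed config/profile.yml, or `{}` when the file is absent. The exit condition mirrors the wording above literally: exit only when not in auth mode and both values are missing.

```javascript
// Hypothetical resolver: profile values win, environment variables are the fallback.
function resolveNotionAuth(profile, env, authMode) {
  const n = profile?.notion ?? {};
  const token = n.token || env.NOTION_TOKEN || '';
  const databaseId = n.database_id || env.NOTION_DATABASE_ID || '';
  // Mirrors the review text: exit only when not in auth mode AND both are missing.
  const shouldExit = !authMode && !token && !databaseId;
  return { token, databaseId, shouldExit };
}

// Usage: env-only setup with no profile file on disk (values are made up).
const envOnly = resolveNotionAuth({}, { NOTION_TOKEN: 'secret_x', NOTION_DATABASE_ID: 'db123' }, false);
const nothing = resolveNotionAuth({}, {}, false);
console.log(envOnly.shouldExit, nothing.shouldExit);
```

A stricter variant would exit when either value is missing. Which one the script should use is a design decision this sketch does not settle.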
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 46902718-5c5b-414c-9a62-02fbfb0eeb28
📒 Files selected for processing (1)
scan-notion.mjs
- loadConfig: don't exit on missing profile.yml — warn and fall back to
NOTION_TOKEN / NOTION_DATABASE_ID env vars so env-only setups work
- query filter: add statusPropertyType config (select | status, default
select) and build filter with correct shape per Notion API — databases
using the native 'status' property type now get { status: { equals: } }
instead of { select: { equals: } } which would 400
- main loop: add explicit !title check before passesFilter so pages with
no extractable title are skipped (skipped_no_title) rather than falling
through to the added path with classifyBucket on an empty string
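
A rough sketch of the two behavioral changes in this commit follows. The function names and the `skipped_title` and `added` labels are illustrative assumptions. The `cfg` field names (`statusProperty`, `statusPropertyType`, `statusUnreviewed`) and the `skipped_no_title` label come from the review text. The `created_time` timestamp filter is how the `--days N` option is assumed to map onto a Notion query.

```javascript
// Builds the database query filter, choosing the filter key by property type
// and optionally adding a created_time cutoff for --days N.
function buildQueryFilter(cfg, days, now = new Date()) {
  const kind = cfg.statusPropertyType === 'status' ? 'status' : 'select';
  const statusFilter = { property: cfg.statusProperty, [kind]: { equals: cfg.statusUnreviewed } };
  if (days == null) return statusFilter;
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000).toISOString();
  return {
    and: [statusFilter, { timestamp: 'created_time', created_time: { on_or_after: since } }],
  };
}

// Decides what happens to a page: the explicit !title check runs first so an
// empty title never reaches the keyword filter or the classification step.
function decideAction(title, passesFilter) {
  if (!title) return 'skipped_no_title';
  if (!passesFilter(title)) return 'skipped_title'; // hypothetical label
  return 'added'; // hypothetical label
}

// Usage with made-up config values and a fixed clock.
const cfg = { statusProperty: 'Status', statusPropertyType: 'status', statusUnreviewed: 'To Review' };
const filter = buildQueryFilter(cfg, 3, new Date('2024-01-10T00:00:00.000Z'));
console.log(JSON.stringify(filter));
console.log(decideAction('', () => true));
```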
@coderabbitai resolve
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scan-notion.mjs`:
- Around line 49-51: The current parsing of --days (variables daysArgIdx,
_daysParsed, and DAYS) silently falls back to null when the flag is present but
the next argv token is missing or not a positive number; change this to fail
fast: if process.argv includes '--days' and _daysParsed is not a finite positive
number, emit a clear error (including the offending token) and exit non‑zero (or
throw) so the script does not proceed to scan the entire DB; keep the existing
positive-number check and only allow null when the flag is entirely absent.
- Around line 153-158: The extractUrl function currently returns only the first
rich_text segment; update its rich_text branch to concatenate all segments like
extractTitle does: when prop.type === 'rich_text', map over prop.rich_text to
collect each segment's plain_text and join('') to produce the full string
(falling back to '' if absent), leaving the url branch unchanged so
multi-segment Notion URLs are not truncated.
- Around line 18-23: The code assumes statusPropertyType is 'select' by default
and builds a Notion filter accordingly, which fails for native Status
properties; fix this by either auto-detecting the property type from the
database schema (fetch the database via the Notion client and inspect
database.properties[statusPropertyName].type) before constructing the filter in
scan-notion.mjs, or by documenting and requiring a new config key
status_property_type (accepting 'select' or 'status') and validating it early
(throw a clear error if the actual database property type doesn't match the
config). Update logic that builds the filter (where statusPropertyType is used)
to branch on the detected/validated type and construct the correct Notion filter
shape for 'status' vs 'select'.
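
The first two findings above could be addressed along these lines. This is a sketch under assumptions: `parseDaysFlag` is a hypothetical name that throws instead of calling `process.exit`, and the property objects passed to `extractUrl` are assumed to follow the Notion page property shape, with a `type` field plus `url` or a `rich_text` array of segments carrying `plain_text`.

```javascript
// Fail-fast parsing for --days: null only when the flag is entirely absent.
function parseDaysFlag(argv) {
  const idx = argv.indexOf('--days');
  if (idx === -1) return null;
  const raw = argv[idx + 1];
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    // Include the offending token so the user can see what was rejected.
    throw new Error(`--days expects a positive number, got: ${raw ?? '(missing)'}`);
  }
  return parsed;
}

// Joins every rich_text segment so a URL split across segments is not truncated.
function extractUrl(prop) {
  if (!prop) return '';
  if (prop.type === 'url') return prop.url ?? '';
  if (prop.type === 'rich_text') {
    return (prop.rich_text ?? []).map((segment) => segment.plain_text ?? '').join('');
  }
  return '';
}

// Usage with made-up inputs.
const days = parseDaysFlag(['node', 'scan-notion.mjs', '--days', '7']);
const url = extractUrl({
  type: 'rich_text',
  rich_text: [{ plain_text: 'https://example.com/jobs/' }, { plain_text: '42' }],
});
console.log(days, url);
```

A caller in the script would presumably catch the error, print the message, and exit non-zero.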
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 95deddef-6684-45a5-9c3f-bc82c794eeb7
📒 Files selected for processing (1)
scan-notion.mjs
✅ Actions performed
Comments resolved. Approval is disabled; enable
Closes #668
What this adds
`scan-notion.mjs` reads a user's Notion job database (where they manually collect job postings), filters by role keywords, deduplicates, and appends new URLs to `pipeline.md` for evaluation.

This is the read direction: pulling job URLs from Notion into career-ops.
See #670 for the write direction (`sync-notion.mjs`, which pushes the tracker back to Notion as a Kanban).

How it works
- … `Status = "To Review"`)
- … `scan.mjs` and `scan-gmail.mjs`
- … `data/scan-history.tsv`, `data/applications.md`, `data/pipeline.md`
- … `pipeline.md` bucketed by role category

Flags
- `status_unreviewed`
- `status`
- `--dry-run`
- `--days N`
- `--auth`

Config

Scope
- `scan-notion.mjs` (418 lines)
- `js-yaml` (already in `package.json`) + native `fetch`

Test plan
- `--auth` lists accessible Notion databases with IDs
- `--dry-run` prints what would be added without writing files
- `--days N` applies `created_time` filter correctly
- `status_queued` patches page status in Notion after scan
- `token`/`database_id` exits with a clear setup message

🤖 Generated with Claude Code