feat(notion): add scan-notion.mjs — read job postings from Notion database into pipeline #671

Open
Schlaflied wants to merge 3 commits into santifer:main from Schlaflied:feat/scan-notion

@Schlaflied Schlaflied commented May 16, 2026

Closes #668

What this adds

scan-notion.mjs reads a user's Notion job database (where they manually collect job postings), filters by role keywords, deduplicates, and appends new URLs to pipeline.md for evaluation.

This is the read direction — pulling job URLs from Notion into career-ops.
See #670 for the write direction (sync-notion.mjs — pushing the tracker back to Notion as a Kanban).

How it works

  1. Queries the configured Notion database filtered by a status property (e.g., Status = "To Review")
  2. Extracts URL, title, and company from configurable property names
  3. Applies the same positive/negative keyword title filter used in scan.mjs and scan-gmail.mjs
  4. Deduplicates against data/scan-history.tsv, data/applications.md, data/pipeline.md
  5. Appends new entries to pipeline.md bucketed by role category
  6. Optionally patches processed pages to a "Queued" status in Notion
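
The title filter in step 3 can be pictured roughly as follows. This is a minimal sketch, not the code in scan-notion.mjs: only the name passesFilter appears in the review walkthrough, while the keyword lists, the filterConfig object, and the case-insensitive substring matching are assumptions made for illustration.

```javascript
// Hypothetical keyword lists; the real ones come from the project's own config.
const filterConfig = {
  positive: ['engineer', 'developer'],
  negative: ['intern', 'unpaid'],
};

// A title passes when it contains at least one positive keyword
// (or no positive keywords are configured) and no negative keyword.
function passesFilter(title, { positive = [], negative = [] } = filterConfig) {
  const lower = title.toLowerCase();
  const hasPositive =
    positive.length === 0 || positive.some((k) => lower.includes(k.toLowerCase()));
  const hasNegative = negative.some((k) => lower.includes(k.toLowerCase()));
  return hasPositive && !hasNegative;
}
```

With these example lists, "Backend Engineer" would pass, while "Engineering Intern" would be rejected by the negative keyword and "Product Manager" would be rejected for lacking a positive one.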

Flags

| Flag | Behaviour |
| --- | --- |
| (none) | Scan all pages whose status equals the configured status_unreviewed value |
| --dry-run | Preview without writing any files |
| --days N | Filter to pages created in the last N days |
| --auth | Test token + list accessible databases with IDs |

Config

# config/profile.yml
notion:
  token: secret_xxx
  database_id: <32-char hex>
  status_property: Status
  status_unreviewed: To Review
  status_queued: Queued          # optional — blank to skip
  url_property: URL
  title_property: Name
  company_property: Company      # optional

Scope

  • 1 new file: scan-notion.mjs (418 lines)
  • 0 existing files modified
  • 0 new npm dependencies: uses js-yaml (already in package.json) + native fetch

Test plan

  • --auth lists accessible Notion databases with IDs
  • --dry-run prints what would be added without writing files
  • Pages with matching status are fetched and title-filtered correctly
  • Dedup skips URLs already in tracker / history / pipeline
  • --days N applies created_time filter correctly
  • status_queued patches page status in Notion after scan
  • Missing token / database_id exits with a clear setup message

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a CLI to scan and sync job postings from a configured Notion database.
    • Performs include/exclude keyword filtering, deterministic bucket assignment, and automatic URL deduplication (query params ignored).
    • Appends new items to the pipeline and records scan outcomes in history, with optional dry-run mode.
    • Can optionally mark scanned pages as queued in Notion and reports per-item errors.
    • Adds auth validation and date-range scanning.
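
The "query params ignored" deduplication mentioned above could look something like this. It is a sketch under assumptions: the function names normalizeUrl and dedupe are hypothetical, and dropping the fragment and a trailing slash goes beyond what the summary states.

```javascript
// Normalize a URL for comparison: drop the query string and fragment and trim
// trailing slashes from the path. The URL parser also lowercases the host.
// (Only "query params ignored" is from the PR summary; the rest is assumed.)
function normalizeUrl(raw) {
  try {
    const u = new URL(raw.trim());
    const path = u.pathname.replace(/\/+$/, '');
    return `${u.protocol}//${u.host}${path}`;
  } catch {
    // Not a parseable URL: fall back to the trimmed input so it can still be compared.
    return raw.trim();
  }
}

// Keep only items whose normalized URL has not been seen yet.
// Mutates `seen` so later items in the same run are deduplicated too.
function dedupe(items, seen) {
  const fresh = [];
  for (const item of items) {
    const key = normalizeUrl(item.url);
    if (seen.has(key)) continue;
    seen.add(key);
    fresh.push(item);
  }
  return fresh;
}
```

In this sketch the seen set would be seeded from the URLs already present in the history, tracker, and pipeline files before any Notion pages are processed.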

…abase into pipeline

Adds a zero-token scanner that queries a Notion job database, filters
by title keywords, deduplicates, and appends new URLs to pipeline.md.

Features:
- Queries Notion database filtered by configurable status property
- Extracts URL, title, company from configurable property names
- Same title keyword filter as scan.mjs / scan-gmail.mjs
- Deduplicates against scan-history.tsv + applications.md + pipeline.md
- Optionally marks processed pages as "Queued" in Notion after scan
- --days N: filter by Notion page creation date (created_time filter)
- --dry-run, --auth flags
- Config via config/profile.yml under notion: key

Closes santifer#668

coderabbitai Bot commented May 16, 2026

Walkthrough

A new CLI script scan-notion.mjs queries a Notion database of job postings, extracts URLs and titles, applies keyword filters, deduplicates against existing tracker files, and appends new matches to pipeline.md and scan-history.tsv. It supports --dry-run, --days N date filtering, --auth token validation, and optional Notion status updates.

Changes

Notion Database Scanner Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Entry point and configuration loading (scan-notion.mjs) | CLI flag parsing (--dry-run, --auth, --days), core constants for Notion API endpoints and file paths, and configuration loader that reads Notion token, database ID, and property names from config/profile.yml. |
| Notion API integration (scan-notion.mjs) | Authenticated Notion REST API helpers: notionFetch() for error-aware requests, queryDatabase() with pagination support, property-type probing, and updatePageStatus() for PATCH updates; runAuth() validates token and discovers accessible database IDs. |
| Property extraction and keyword filtering (scan-notion.mjs) | Property extractors (extractTitle, extractUrl, extractText, extractStatusValue) handle multiple Notion property types (title/rich_text/url/select/status); passesFilter() applies keyword inclusion/exclusion; classifyBucket() assigns items to pipeline sections by title keywords. |
| Deduplication and local file persistence (scan-notion.mjs) | loadSeenUrls() reads and normalizes URLs from data/scan-history.tsv, data/applications.md, and data/pipeline.md; appendPipeline() writes new items grouped by bucket; appendHistory() appends tab-separated rows with timestamp/URL/title. |
| Orchestration and execution (scan-notion.mjs) | Main orchestration builds Notion query filters (status + optional created_time), paginates matching pages, extracts/normalizes fields, records skipped reasons, groups accepted items by bucket, conditionally writes files (unless --dry-run), optionally PATCHes queued status per accepted page, and handles top-level errors. |
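
As an illustration of the "queryDatabase() with pagination support" row, a cursor loop against the Notion database query endpoint might be sketched as below. The endpoint, headers, and the has_more / next_cursor / start_cursor fields follow the public Notion API; the function signature, the injectable fetchImpl parameter, and the pinned API version are assumptions and may differ from the PR's code.

```javascript
const NOTION_API = 'https://api.notion.com/v1';
const NOTION_VERSION = '2022-06-28'; // assumed API version, not taken from the PR

// Fetch every page matching `filter`, following Notion's cursor pagination.
// `fetchImpl` is injectable (hypothetical parameter) so the loop can be tested offline.
async function queryDatabase(token, databaseId, filter, fetchImpl = fetch) {
  const pages = [];
  let cursor;
  do {
    const body = { page_size: 100 };
    if (filter) body.filter = filter;
    if (cursor) body.start_cursor = cursor;

    const res = await fetchImpl(`${NOTION_API}/databases/${databaseId}/query`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Notion-Version': NOTION_VERSION,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    });
    if (!res.ok) {
      throw new Error(`Notion query failed: ${res.status} ${await res.text()}`);
    }

    const data = await res.json();
    pages.push(...data.results);
    // Continue only while Notion reports more results.
    cursor = data.has_more ? data.next_cursor : undefined;
  } while (cursor);
  return pages;
}
```

Passing the fetch implementation as a parameter is only a testing convenience here; calling queryDatabase(token, databaseId, filter) would use Node's native fetch.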

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding scan-notion.mjs to read job postings from a Notion database into the pipeline. |
| Linked Issues check | ✅ Passed | The implementation fully addresses all coding requirements from issue #668: database querying with status filtering, URL/title/company extraction, keyword filtering, deduplication, pipeline.md appending with bucketing, Notion status patching, and CLI flags (--dry-run, --days, --auth). |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: a single new file (scan-notion.mjs) with no modifications to existing files or new npm dependencies, exactly as specified in issue #668. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan-notion.mjs`:
- Around line 220-235: appendPipeline currently assumes the parent directory for
PIPELINE_PATH exists and can throw ENOENT; modify appendPipeline to ensure the
directory exists before any write by computing the directory (use
path.dirname(PIPELINE_PATH)) and creating it if missing (e.g., fs.mkdirSync(dir,
{ recursive: true })) prior to the existsSync/writeFileSync/appendFileSync calls
so the initial run on a clean checkout succeeds; keep references to
PIPELINE_PATH and the appendPipeline function when making the change.
- Around line 116-124: The updatePageStatus function currently always builds a
select payload; modify it to first retrieve the page/database property type (use
an existing helper or call notionFetch to GET the property schema) for the given
propertyName, then switch payload shapes: if the property is "status" send {
status: { name: statusValue } }, otherwise send { select: { name: statusValue }
}; update both updatePageStatus and the other update call sites (the similar
logic referenced around the other patch/update usage) to use this
property-type-aware branching and reuse notionFetch for the PATCH request.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ddb599fd-a41c-426d-878f-5c1f9456cd1c

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1f3a3 and 70922d9.

📒 Files selected for processing (1)
  • scan-notion.mjs

- appendPipeline: ensure data/ directory exists before writing pipeline.md
  (mkdirSync recursive) so first run on a clean checkout does not throw ENOENT
- updatePageStatus: detect property type via GET /pages/{id} and send
  { status: } vs { select: } payload accordingly — avoids API 400 errors
  on databases using the newer Notion 'status' property type
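
The payload switch described in the second bullet can be sketched as two small helpers. The { status: { name } } and { select: { name } } shapes and the type field on page properties follow the Notion API; the helper names detectPropertyType and buildStatusUpdate are hypothetical and not taken from the PR.

```javascript
// Read a property's type from a page object as returned by GET /v1/pages/{page_id}.
// Returns null when the property is missing (hypothetical helper name).
function detectPropertyType(page, propertyName) {
  return page?.properties?.[propertyName]?.type ?? null;
}

// Build the JSON body for PATCH /v1/pages/{page_id}.
// Notion expects { status: { name } } for status properties
// and { select: { name } } for select properties.
function buildStatusUpdate(propertyName, propertyType, value) {
  if (propertyType !== 'status' && propertyType !== 'select') {
    throw new Error(`Unsupported property type for status update: ${propertyType}`);
  }
  return {
    properties: {
      [propertyName]: { [propertyType]: { name: value } },
    },
  };
}
```

Failing loudly on any other property type is a design choice of this sketch; it surfaces a misconfigured status_property before a PATCH request is sent.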

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
scan-notion.mjs (2)

371-383: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip pages with no extracted title before classifying or queueing them.

Because the skip branch is gated by title && !passesFilter(title), a page with an empty title (for example, from a misconfigured title property) never triggers the skip: it falls through as a match, lands in the default bucket, and can be marked Queued. Add an explicit !title skip before the filter/classification path.
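
To make the ordering concrete, the per-page decision could be sketched as a small function that checks for a missing title before the keyword filter runs. Only skipped_no_title and the idea of an "added" path come from this thread; the function name decide and the other outcome labels are hypothetical.

```javascript
// Decide what to do with one extracted page.
// With the original guard `title && !passesFilter(title)`, an empty title makes
// the whole condition falsy, so the page is never skipped. Checking `!title`
// first closes that gap.
function decide({ title, url }, seenUrls, passesFilter) {
  if (!url) return 'skipped_no_url';           // hypothetical label
  if (seenUrls.has(url)) return 'skipped_dup'; // hypothetical label
  if (!title) return 'skipped_no_title';       // explicit empty-title skip
  if (!passesFilter(title)) return 'skipped_title'; // hypothetical label
  return 'added';
}
```

Only pages that reach 'added' would be classified into a bucket and, if a queued status is configured, marked Queued in Notion.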

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan-notion.mjs` around lines 371 - 383, The code currently proceeds to
classify and queue pages because the existing branch only skips when title &&
!passesFilter(title); add an explicit check for missing titles (e.g., if
(!title) { ... continue; }) before the filter/classification block: increment a
suitable counter (e.g., totalSkippedNoTitle), write a historyRows entry noting
the skip (use the same format as other entries, e.g.,
`${normalUrl}\t${TODAY}\tNotion\t${title}\t${company}\tskipped_no_title`), mark
seenUrls.add(normalUrl), and do not push into newItems, toMarkQueued, or call
classifyBucket/page.id when cfg.statusQueued is true. Ensure you reference the
same symbols (title, passesFilter, classifyBucket, newItems, historyRows,
seenUrls, toMarkQueued, page.id, cfg.statusQueued) so the change sits
immediately before the existing filter/classify logic.

305-308: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use a status filter for real Notion status properties.

The scan path still always sends { select: { equals: ... } } here. Notion’s database-query docs treat select and status as different filter types, and a status property must be queried with a status filter, so databases using the default Status field can still fail before any pages are processed. (developers.notion.com)
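
A type-aware filter builder might look like the sketch below. The select/status filter shapes, the compound and filter, and the created_time timestamp filter follow the Notion database-query API; the function name buildQueryFilter, the shape of its config argument, and the sinceIso parameter are assumptions for illustration.

```javascript
// Build the `filter` object for POST /v1/databases/{id}/query.
// `statusPropertyType` must be 'select' or 'status'; Notion rejects a select
// filter applied to a status property (and vice versa).
// `sinceIso` is an optional ISO 8601 date string used for a --days style cutoff.
function buildQueryFilter(cfg, sinceIso) {
  const { statusProperty, statusPropertyType = 'select', statusUnreviewed } = cfg;
  if (statusPropertyType !== 'select' && statusPropertyType !== 'status') {
    throw new Error(`statusPropertyType must be "select" or "status", got: ${statusPropertyType}`);
  }
  const statusFilter = {
    property: statusProperty,
    [statusPropertyType]: { equals: statusUnreviewed },
  };
  if (!sinceIso) return statusFilter;
  return {
    and: [
      statusFilter,
      { timestamp: 'created_time', created_time: { on_or_after: sinceIso } },
    ],
  };
}
```

For a database using Notion's native Status property, calling this with statusPropertyType set to 'status' would produce { property: "Status", status: { equals: "To Review" } }.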

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan-notion.mjs` around lines 305 - 308, The current filter construction
always uses select (let filter = { property: cfg.statusProperty, select: {
equals: cfg.statusUnreviewed } }), which fails for Notion status properties;
update the code that builds filter to choose the filter type based on the
property kind — if the configured property is a Notion "status" field, create {
property: cfg.statusProperty, status: { equals: cfg.statusUnreviewed } },
otherwise keep the select form; detect the property kind via whatever metadata
you have (e.g., a cfg flag like statusPropertyType or by querying the
database/prop schema) and apply the appropriate branch where the filter variable
is set.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan-notion.mjs`:
- Around line 55-65: The startup currently aborts if PROFILE_PATH is missing
which prevents the env fallback; change the logic so absence of
config/profile.yml does not immediately exit: if (!existsSync(PROFILE_PATH)) set
profile = {} (or skip reading the file) and continue, then compute token =
n.token || process.env.NOTION_TOKEN and databaseId = n.database_id ||
process.env.NOTION_DATABASE_ID and only call process.exit(1) when AUTH_MODE is
false AND both token and databaseId are still missing. Update the block around
PROFILE_PATH, loadYaml/readFileSync, and the subsequent auth check (symbols:
PROFILE_PATH, existsSync, loadYaml, profile, n, token, databaseId, AUTH_MODE) to
implement this conditional exit behavior.
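
The fallback order described above could be sketched as a pure function that takes the parsed profile and the environment and reports what is still missing. The NOTION_TOKEN and NOTION_DATABASE_ID variable names come from this comment; the function name resolveNotionConfig and the missing array are hypothetical, and the decision of when to exit is left to the caller.

```javascript
// Resolve Notion credentials from the parsed profile first, then from env vars.
// `profile` is the parsed contents of config/profile.yml, or {} when the file
// does not exist. Nothing here exits the process; the caller decides.
function resolveNotionConfig(profile = {}, env = {}) {
  const n = profile.notion ?? {};
  const token = n.token || env.NOTION_TOKEN || '';
  const databaseId = n.database_id || env.NOTION_DATABASE_ID || '';

  const missing = [];
  if (!token) missing.push('token');
  if (!databaseId) missing.push('database_id');

  return { token, databaseId, missing };
}
```

A caller would pass process.env as the second argument and print a setup message based on the missing list before deciding whether to exit.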


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 46902718-5c5b-414c-9a62-02fbfb0eeb28

📥 Commits

Reviewing files that changed from the base of the PR and between 70922d9 and 822e7ea.

📒 Files selected for processing (1)
  • scan-notion.mjs

- loadConfig: don't exit on missing profile.yml — warn and fall back to
  NOTION_TOKEN / NOTION_DATABASE_ID env vars so env-only setups work
- query filter: add statusPropertyType config (select | status, default
  select) and build filter with correct shape per Notion API — databases
  using the native 'status' property type now get { status: { equals: } }
  instead of { select: { equals: } } which would 400
- main loop: add explicit !title check before passesFilter so pages with
  no extractable title are skipped (skipped_no_title) rather than falling
  through to the added path with classifyBucket on an empty string
@Schlaflied
Author

@coderabbitai resolve

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan-notion.mjs`:
- Around line 49-51: The current parsing of --days (variables daysArgIdx,
_daysParsed, and DAYS) silently falls back to null when the flag is present but
the next argv token is missing or not a positive number; change this to fail
fast: if process.argv includes '--days' and _daysParsed is not a finite positive
number, emit a clear error (including the offending token) and exit non‑zero (or
throw) so the script does not proceed to scan the entire DB; keep the existing
positive-number check and only allow null when the flag is entirely absent.
- Around line 153-158: The extractUrl function currently returns only the first
rich_text segment; update its rich_text branch to concatenate all segments like
extractTitle does: when prop.type === 'rich_text', map over prop.rich_text to
collect each segment's plain_text and join('') to produce the full string
(falling back to '' if absent), leaving the url branch unchanged so
multi-segment Notion URLs are not truncated.
- Around line 18-23: The code assumes statusPropertyType is 'select' by default
and builds a Notion filter accordingly, which fails for native Status
properties; fix this by either auto-detecting the property type from the
database schema (fetch the database via the Notion client and inspect
database.properties[statusPropertyName].type) before constructing the filter in
scan-notion.mjs, or by documenting and requiring a new config key
status_property_type (accepting 'select' or 'status') and validating it early
(throw a clear error if the actual database property type doesn't match the
config). Update logic that builds the filter (where statusPropertyType is used)
to branch on the detected/validated type and construct the correct Notion filter
shape for 'status' vs 'select'.
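
The first two findings can be illustrated with a short sketch. It is not the PR's code: parseDays is a hypothetical name, the error wording is made up, and the url branch of extractUrl is an assumption since the original is not shown here.

```javascript
// Parse `--days N` from an argv array.
// Returns null only when the flag is absent; throws when the flag is present
// but the following token is missing or not a positive finite number.
function parseDays(argv) {
  const idx = argv.indexOf('--days');
  if (idx === -1) return null;
  const raw = argv[idx + 1];
  const n = Number(raw);
  if (raw === undefined || !Number.isFinite(n) || n <= 0) {
    throw new Error(`--days expects a positive number, got: ${raw ?? '(nothing)'}`);
  }
  return n;
}

// Extract a URL from a Notion property value.
// For rich_text, join every segment's plain_text so a URL split across
// multiple segments is not truncated to its first piece.
function extractUrl(prop) {
  if (!prop) return '';
  if (prop.type === 'url') return prop.url ?? ''; // assumed shape of the url branch
  if (prop.type === 'rich_text') {
    return (prop.rich_text ?? []).map((seg) => seg.plain_text ?? '').join('');
  }
  return '';
}
```

A caller could wrap parseDays(process.argv) in a try/catch that prints the message and exits with a non-zero code, so a malformed flag never results in a scan of the whole database.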

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 95deddef-6684-45a5-9c3f-bc82c794eeb7

📥 Commits

Reviewing files that changed from the base of the PR and between 822e7ea and 8e3db3d.

📒 Files selected for processing (1)
  • scan-notion.mjs

coderabbitai Bot commented May 16, 2026

✅ Actions performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.
