
WebScrape Reference

Purpose

Operator runbook for advanced and maintenance tasks.

Use SKILL.md for the concise day-to-day workflow.

Standard Daily Workflow

  1. Run the collector:
python3 -m collector.pipeline.scrape
# one profile only:
python3 -m collector.pipeline.scrape --profile <profile>
  2. Read the generated artifacts:
  • data/runs/<profile>/latest.json
  3. If needed, generate a report:
python3 -m collector.pipeline.report --profile <profile> --target <slug> --days 3
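A quick way to act on step 2 is a short script that flags targets needing attention. The latest.json field names below (targets, items_collected, errors) are illustrative assumptions, not a verified schema; check a real run artifact before relying on them.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical latest.json shape -- the real artifact layout may differ.
sample = {
    "profile": "default",
    "targets": [
        {"slug": "example", "items_collected": 12, "errors": []},
        {"slug": "other", "items_collected": 0, "errors": ["timeout"]},
    ],
}

# Stand-in for data/runs/<profile>/latest.json written by the scrape run.
run_dir = Path(tempfile.mkdtemp()) / "data" / "runs" / "default"
run_dir.mkdir(parents=True)
(run_dir / "latest.json").write_text(json.dumps(sample))

# Load the artifact and flag targets that collected nothing or errored.
data = json.loads((run_dir / "latest.json").read_text())
needs_attention = [
    t["slug"] for t in data["targets"] if t["items_collected"] == 0 or t["errors"]
]
print(needs_attention)  # -> ['other']
```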

Recommended Cadence

  • Run scrape once daily at roughly the same time.
  • Generate --days 1 reports for urgent daily review.
  • Generate --days 7 reports for deeper weekly review.

Target Management

Registry files:

  • data/config/targets.json
  • data/config/targets.history.jsonl

Add target:

python3 -m collector.pipeline.manage_targets add \
  --profile <profile> \
  --slug <slug> \
  --name "<display name>" \
  --source webpage=https://example.com \
  --source linkedin_company=https://www.linkedin.com/company/example/ \
  --source x_profile=https://x.com/example

Add target without auto-discovery:

python3 -m collector.pipeline.manage_targets add \
  --profile <profile> \
  --slug <slug> \
  --name "<display name>" \
  --source webpage=https://example.com \
  --skip-discovery

List targets:

python3 -m collector.pipeline.manage_targets list --profile <profile>
python3 -m collector.pipeline.manage_targets list --profile <profile> --active-only

Deactivate target:

python3 -m collector.pipeline.manage_targets remove --profile <profile> --slug <slug>

Hard-delete target:

python3 -m collector.pipeline.manage_targets remove --profile <profile> --slug <slug> --hard-delete

Source Discovery and Cleanup

Discover source suggestions:

python3 -m collector.pipeline.discover_sources --profile <profile>

If BRAVE_SEARCH_API_KEY or BRAVE_API_KEY is set, discovery uses Brave Search to find more feed/blog/news/changelog candidates.

Rediscover and append sources for one target:

python3 -m collector.pipeline.manage_targets rediscover --profile <profile> --slug <slug>

Cleanup LinkedIn source entries:

python3 -m collector.pipeline.manage_targets cleanup-linkedin --profile <profile>
python3 -m collector.pipeline.manage_targets cleanup-linkedin --profile <profile> --slug <slug>

Manual Source Edits

  1. Open data/config/targets.json.
  2. Choose a profile under profiles (for example default).
  3. Locate target by slug.
  4. Append source entries to the target's sources list.

Example:

{
  "type": "rss",
  "url": "https://example.com/rss.xml",
  "label": "News RSS"
}
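Step 4 can also be done programmatically. The sketch below assumes the profiles → targets → sources layout described above (built here as an in-memory stand-in for data/config/targets.json) and skips entries that would duplicate an existing (type, url) pair.

```python
import json
import tempfile
from pathlib import Path

# Minimal stand-in for data/config/targets.json (layout assumed from this doc).
path = Path(tempfile.mkdtemp()) / "targets.json"
path.write_text(json.dumps({
    "profiles": {
        "default": {
            "targets": [{"slug": "example", "sources": []}]
        }
    }
}))

new_source = {"type": "rss", "url": "https://example.com/rss.xml", "label": "News RSS"}

cfg = json.loads(path.read_text())
for target in cfg["profiles"]["default"]["targets"]:
    if target["slug"] == "example":
        # Skip if an identical (type, url) pair is already present.
        if not any((s["type"], s["url"]) == (new_source["type"], new_source["url"])
                   for s in target["sources"]):
            target["sources"].append(new_source)

# Write back with stable indentation to keep diffs readable.
path.write_text(json.dumps(cfg, indent=2))
print(len(cfg["profiles"]["default"]["targets"][0]["sources"]))  # -> 1
```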

Common source types:

  • rss
  • blog
  • changelog
  • webpage
  • linkedin_company
  • sitemap

Manual edit rules:

  • Keep URLs public and unauthenticated.
  • Avoid duplicate (type, url) pairs.
  • Prefer section/index pages over one-off article URLs.
  • Run a new scrape after edits:
python3 -m collector.pipeline.scrape
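The duplicate rule above is easy to check mechanically after a hand edit. This sketch again assumes the profiles → targets → sources layout; it reports every (type, url) pair that appears more than once within a target.

```python
import json

# Assumed targets.json shape: profiles -> {profile: {targets: [...]}}.
config = {
    "profiles": {
        "default": {
            "targets": [
                {
                    "slug": "example",
                    "sources": [
                        {"type": "rss", "url": "https://example.com/rss.xml"},
                        {"type": "webpage", "url": "https://example.com"},
                        {"type": "rss", "url": "https://example.com/rss.xml"},
                    ],
                }
            ]
        }
    }
}

def duplicate_sources(cfg):
    """Return (profile, slug, type, url) tuples that appear more than once."""
    dupes = []
    for profile, pdata in cfg["profiles"].items():
        for target in pdata["targets"]:
            seen = set()
            for src in target["sources"]:
                key = (src["type"], src["url"])
                if key in seen:
                    dupes.append((profile, target["slug"], *key))
                seen.add(key)
    return dupes

print(duplicate_sources(config))
# -> [('default', 'example', 'rss', 'https://example.com/rss.xml')]
```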

Artifact Intent

  • JSON: stable machine-readable summary/manifests.
  • JSONL: append-only item streams for audit and downstream filtering.
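The JSONL property that matters here is line-at-a-time processing: items can be filtered without loading the whole file. The field names (slug, title, ts) are illustrative assumptions about the item records, not a documented schema.

```python
import io
import json

# Illustrative JSONL stream: one JSON object per line, appended run after run.
stream = io.StringIO(
    '{"slug": "example", "title": "Release 1.2", "ts": "2024-05-01"}\n'
    '{"slug": "example", "title": "Blog post", "ts": "2024-05-02"}\n'
)

# Filter line by line -- no need to parse the file as one document.
items = [json.loads(line) for line in stream if line.strip()]
recent = [i["title"] for i in items if i["ts"] >= "2024-05-02"]
print(recent)  # -> ['Blog post']
```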

Troubleshooting

  • If a source fails, inspect errors in data/runs/<profile>/latest.json.
  • LinkedIn blocking is expected in some environments; the collector can fall back to Brave Search when BRAVE_SEARCH_API_KEY is set.
  • If schema validation fails, verify data/config/targets.json against config/sources.schema.json.