
WebScrape Reference

Purpose

Operator runbook for advanced and maintenance tasks.

Use SKILL.md for the concise day-to-day workflow.

Standard Daily Workflow

  1. Run the collector:
python3 -m collector.pipeline.scrape
# one profile only:
python3 -m collector.pipeline.scrape --profile <profile>
  2. Read the generated artifacts:
  • data/runs/<profile>/latest.json
  3. If needed, generate a report:
python3 -m collector.pipeline.report --profile <profile> --target <slug> --days 3
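A quick way to act on step 2 is a short script that flags targets needing attention. The latest.json field names below (targets, items_collected, errors) are illustrative assumptions, not a verified schema; check a real run artifact before relying on them.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical latest.json shape -- the real artifact layout may differ.
sample = {
    "profile": "default",
    "targets": [
        {"slug": "example", "items_collected": 12, "errors": []},
        {"slug": "other", "items_collected": 0, "errors": ["timeout"]},
    ],
}

# Stand-in for data/runs/<profile>/latest.json written by the scrape run.
run_dir = Path(tempfile.mkdtemp()) / "data" / "runs" / "default"
run_dir.mkdir(parents=True)
(run_dir / "latest.json").write_text(json.dumps(sample))

# Load the artifact and flag targets that collected nothing or errored.
data = json.loads((run_dir / "latest.json").read_text())
needs_attention = [
    t["slug"] for t in data["targets"] if t["items_collected"] == 0 or t["errors"]
]
print(needs_attention)  # -> ['other']
```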

Recommended Cadence

  • Run scrape once daily at roughly the same time.
  • Generate --days 1 reports for urgent daily review.
  • Generate --days 7 reports for deeper weekly review.

Target Management

Registry files:

  • data/config/targets.json
  • data/config/targets.history.jsonl

Add target:

python3 -m collector.pipeline.manage_targets add \
  --profile <profile> \
  --slug <slug> \
  --name "<display name>" \
  --source webpage=https://example.com \
  --source linkedin_company=https://www.linkedin.com/company/example/ \
  --source x_profile=https://x.com/example

Add target without auto-discovery:

python3 -m collector.pipeline.manage_targets add \
  --profile <profile> \
  --slug <slug> \
  --name "<display name>" \
  --source webpage=https://example.com \
  --skip-discovery

List targets:

python3 -m collector.pipeline.manage_targets list --profile <profile>
python3 -m collector.pipeline.manage_targets list --profile <profile> --active-only

Deactivate target:

python3 -m collector.pipeline.manage_targets remove --profile <profile> --slug <slug>

Hard-delete target:

python3 -m collector.pipeline.manage_targets remove --profile <profile> --slug <slug> --hard-delete

Source Discovery and Cleanup

Discover source suggestions:

python3 -m collector.pipeline.discover_sources --profile <profile>

If BRAVE_SEARCH_API_KEY or BRAVE_API_KEY is set, discovery uses Brave Search to find more feed/blog/news/changelog candidates.

Rediscover and append sources for one target:

python3 -m collector.pipeline.manage_targets rediscover --profile <profile> --slug <slug>

Cleanup LinkedIn source entries:

python3 -m collector.pipeline.manage_targets cleanup-linkedin --profile <profile>
python3 -m collector.pipeline.manage_targets cleanup-linkedin --profile <profile> --slug <slug>

Manual Source Edits

  1. Open data/config/targets.json.
  2. Choose a profile under profiles (for example default).
  3. Locate target by slug.
  4. Append source entries to the target's sources list.

Example:

{
  "type": "rss",
  "url": "https://example.com/rss.xml",
  "label": "News RSS"
}
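Step 4 can also be done programmatically. The sketch below assumes the profiles → targets → sources layout described above (built here as an in-memory stand-in for data/config/targets.json) and skips entries that would duplicate an existing (type, url) pair.

```python
import json
import tempfile
from pathlib import Path

# Minimal stand-in for data/config/targets.json (layout assumed from this doc).
path = Path(tempfile.mkdtemp()) / "targets.json"
path.write_text(json.dumps({
    "profiles": {
        "default": {
            "targets": [{"slug": "example", "sources": []}]
        }
    }
}))

new_source = {"type": "rss", "url": "https://example.com/rss.xml", "label": "News RSS"}

cfg = json.loads(path.read_text())
for target in cfg["profiles"]["default"]["targets"]:
    if target["slug"] == "example":
        # Skip if an identical (type, url) pair is already present.
        if not any((s["type"], s["url"]) == (new_source["type"], new_source["url"])
                   for s in target["sources"]):
            target["sources"].append(new_source)

# Write back with stable indentation to keep diffs readable.
path.write_text(json.dumps(cfg, indent=2))
print(len(cfg["profiles"]["default"]["targets"][0]["sources"]))  # -> 1
```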

Common source types:

  • rss
  • blog
  • changelog
  • webpage
  • linkedin_company
  • sitemap

Manual edit rules:

  • Keep URLs public and unauthenticated.
  • Avoid duplicate (type, url) pairs.
  • Prefer section/index pages over one-off article URLs.
  • Run a new scrape after edits:
python3 -m collector.pipeline.scrape
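The duplicate rule above is easy to check mechanically after a hand edit. This sketch again assumes the profiles → targets → sources layout; it reports every (type, url) pair that appears more than once within a target.

```python
import json

# Assumed targets.json shape: profiles -> {profile: {targets: [...]}}.
config = {
    "profiles": {
        "default": {
            "targets": [
                {
                    "slug": "example",
                    "sources": [
                        {"type": "rss", "url": "https://example.com/rss.xml"},
                        {"type": "webpage", "url": "https://example.com"},
                        {"type": "rss", "url": "https://example.com/rss.xml"},
                    ],
                }
            ]
        }
    }
}

def duplicate_sources(cfg):
    """Return (profile, slug, type, url) tuples that appear more than once."""
    dupes = []
    for profile, pdata in cfg["profiles"].items():
        for target in pdata["targets"]:
            seen = set()
            for src in target["sources"]:
                key = (src["type"], src["url"])
                if key in seen:
                    dupes.append((profile, target["slug"], *key))
                seen.add(key)
    return dupes

print(duplicate_sources(config))
# -> [('default', 'example', 'rss', 'https://example.com/rss.xml')]
```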

Artifact Intent

  • JSON: stable machine-readable summary/manifests.
  • JSONL: append-only item streams for audit and downstream filtering.
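The JSONL property that matters here is line-at-a-time processing: items can be filtered without loading the whole file. The field names (slug, title, ts) are illustrative assumptions about the item records, not a documented schema.

```python
import io
import json

# Illustrative JSONL stream: one JSON object per line, appended run after run.
stream = io.StringIO(
    '{"slug": "example", "title": "Release 1.2", "ts": "2024-05-01"}\n'
    '{"slug": "example", "title": "Blog post", "ts": "2024-05-02"}\n'
)

# Filter line by line -- no need to parse the file as one document.
items = [json.loads(line) for line in stream if line.strip()]
recent = [i["title"] for i in items if i["ts"] >= "2024-05-02"]
print(recent)  # -> ['Blog post']
```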

Troubleshooting

  • If a source fails, inspect errors in data/runs/<profile>/latest.json.
  • LinkedIn blocking is expected in some environments; the collector can fall back to Brave Search when BRAVE_SEARCH_API_KEY is set.
  • If schema validation fails, verify data/config/targets.json against config/sources.schema.json.