Make "biotope get" a universal ingress verb #32

Draft
peymanvahidi wants to merge 6 commits into main from feat/get-universal-ingress

@peymanvahidi (Collaborator)

Generalises biotope get beyond single-URL downloads. It now dispatches on source shape (local files, local directories, http(s) URLs, and bounded --crawl website scrapes), copies the data into the project, bakes the manifest, and records source provenance (dct:source + fetch timestamp) in one shot.

  • New --into and --status options; default destination is now data/<basename>
  • New polite, bounded static-HTML crawler (adds beautifulsoup4)
  • biotope add records external-source provenance on baked manifests

fixes #22

`biotope get --crawl` parses fetched HTML to discover same-host links.

Introduce set_source/get_source plus the SOURCE_KEY (`dct:source`) and
FETCHED_AT_KEY (`biotope:fetchedAt`) constants so manifests can record where
external data was brought in from. This is distinct from `prov:wasDerivedFrom`,
which links to other datasets inside the project.
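
A minimal sketch of what such helpers could look like, assuming the manifest is a plain dict. Only the names set_source, get_source, SOURCE_KEY and FETCHED_AT_KEY come from the commit message; the signatures, the ISO-8601 timestamp format and the no-op behaviour for a missing source are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

SOURCE_KEY = "dct:source"
FETCHED_AT_KEY = "biotope:fetchedAt"


def set_source(manifest: dict, source: Optional[str],
               fetched_at: Optional[datetime] = None) -> dict:
    """Stamp external-origin provenance on a manifest (assumed signature)."""
    if source is None:
        # No external origin: leave the manifest untouched.
        return manifest
    stamp = fetched_at or datetime.now(timezone.utc)
    manifest[SOURCE_KEY] = source
    manifest[FETCHED_AT_KEY] = stamp.isoformat()
    return manifest


def get_source(manifest: dict) -> Optional[str]:
    """Return the recorded external source, or None if there is none."""
    return manifest.get(SOURCE_KEY)


# Usage: external origin and in-project derivation live under separate keys.
manifest = {"name": "example", "prov:wasDerivedFrom": []}
set_source(manifest, "https://example.com/data.csv",
           datetime(2024, 1, 1, tzinfo=timezone.utc))
```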

Also fix resolve_target's directory branch to append `.jsonld` to the full
relative path rather than using `.with_suffix`, which clobbered a dot in the
directory name (`example.com` → `example.jsonld`) and broke the manifest↔data
mirroring. Only dotted directory names are affected; plain names are unchanged.
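
The dotted-directory bug can be reproduced with pathlib alone. The two helper names below are hypothetical and stand in for the directory branch of resolve_target; they are not the project's actual code.

```python
from pathlib import PurePosixPath


def manifest_path_buggy(rel_dir: PurePosixPath) -> PurePosixPath:
    # with_suffix treats everything after the last dot as a suffix,
    # so "example.com" loses ".com".
    return rel_dir.with_suffix(".jsonld")


def manifest_path_fixed(rel_dir: PurePosixPath) -> PurePosixPath:
    # Append to the full name instead of replacing a presumed suffix.
    return rel_dir.with_name(rel_dir.name + ".jsonld")


dotted = PurePosixPath("data/example.com")
plain = PurePosixPath("data/reads")

print(manifest_path_buggy(dotted))  # data/example.jsonld
print(manifest_path_fixed(dotted))  # data/example.com.jsonld
print(manifest_path_fixed(plain))   # data/reads.jsonld
```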

Thread optional `source`/`fetched_at` through the overrides dict and stamp them
via _apply_source_provenance in both _add_file and _bake_directory. For plain
`biotope add` (no external origin) the stamping is a no-op, so existing
manifests are unaffected; `biotope get` populates both fields.

Also fix _resolve_dataset_ref's directory branch to append `.jsonld` (same
dotted-dir bug as resolve_target), so a scraped `data/example.com` dataset
remains addressable by `biotope mark` and `--derived-from`.

New biotope/scrape.py backs `biotope get --crawl`: a same-host breadth-first
crawl to a bounded depth with a page cap. v1 is deliberately conservative:

- robots.txt honoured by default (4xx → allow-all, 5xx → disallow-all,
  network failure → allow-all, Crawl-delay respected);
- per-request rate limiting (--rate);
- content-type gated to HTML, responses written as bytes;
- URL→file mapping is path-traversal-safe and de-duplicated by output path;
- redirects that land off-host are skipped, and the final URL is recorded.
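
One way a path-traversal-safe, de-duplicated URL→file mapping could work is sketched below. The function names, the host-prefixed layout and the `index.html`/`.html` conventions are assumptions for illustration, not the mapping scrape.py actually uses.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable
from urllib.parse import unquote, urlsplit


def url_to_relpath(url: str) -> PurePosixPath:
    """Map a URL to a relative output path that cannot escape its root."""
    parts = urlsplit(url)
    path = unquote(parts.path)  # decode %2e%2e before filtering
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result stays inside the root.
    segments = [s.replace("\\", "_") for s in path.split("/")
                if s not in ("", ".", "..")]
    rel = PurePosixPath(parts.hostname or "unknown-host", *segments)
    if rel.suffix.lower() not in (".html", ".htm"):
        rel = rel.with_name(rel.name + ".html")
    return rel


def plan_outputs(urls: Iterable[str]) -> Dict[PurePosixPath, str]:
    """De-duplicate by output path: the first URL for a path wins."""
    planned: Dict[PurePosixPath, str] = {}
    for url in urls:
        planned.setdefault(url_to_relpath(url), url)
    return planned
```

With this sketch, `https://example.com/docs/` and `https://example.com/docs/index.html` collapse to a single output file, and `..` segments (plain or percent-encoded) are discarded rather than followed.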

The module is free of any biotope-manifest knowledge: it returns the saved
pages so the caller can bake the manifest.
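
The overall shape (bounded same-host BFS that only returns what it saved) might look roughly like the following. `SavedPage`, `crawl`, the `fetch` callback and the default limits are hypothetical; robots.txt handling, rate limiting and content-type checks are left out to keep the sketch short.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urldefrag, urljoin, urlsplit


@dataclass
class SavedPage:
    url: str
    depth: int


# Hypothetical fetcher: returns (body bytes, hrefs found on the page).
Fetch = Callable[[str], Tuple[bytes, Iterable[str]]]


def crawl(start_url: str, fetch: Fetch,
          max_depth: int = 2, max_pages: int = 50) -> List[SavedPage]:
    """Breadth-first, same-host crawl bounded by depth and page count."""
    host = urlsplit(start_url).hostname
    queue = deque([(start_url, 0)])
    seen = {start_url}
    saved: List[SavedPage] = []
    while queue and len(saved) < max_pages:
        url, depth = queue.popleft()
        _body, hrefs = fetch(url)
        saved.append(SavedPage(url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth bound
        for href in hrefs:
            link, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(link).hostname != host or link in seen:
                continue  # off-host or already queued
            seen.add(link)
            queue.append((link, depth + 1))
    return saved


# Usage with an in-memory "site" instead of real HTTP requests.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/c", "/#top"],
    "https://example.com/b": ["/a"],
    "https://example.com/c": ["/d"],
    "https://example.com/d": [],
}


def fake_fetch(url: str) -> Tuple[bytes, Iterable[str]]:
    return b"", SITE.get(url, [])


pages = crawl("https://example.com/", fake_fetch, max_depth=2, max_pages=10)
```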

`biotope get <source>` now classifies its source and brings the data in, bakes
the manifest, and records provenance in one shot:

- local file        → copy + bake
- local directory   → recursive copy + bake (+ .biotope.yaml scaffold)
- http(s) file/page → download + bake (existing behaviour preserved)
- website scrape    → --crawl bounded BFS, one file per page, one manifest
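
The dispatch table above could be driven by a small classifier along these lines. `SourceKind` and `classify_source` are hypothetical names, and the rule that any other `scheme://` source counts as deferred is an assumption based on the description, not the actual implementation.

```python
import tempfile
from enum import Enum
from pathlib import Path
from urllib.parse import urlsplit


class SourceKind(Enum):
    LOCAL_FILE = "local file"
    LOCAL_DIR = "local directory"
    HTTP = "http(s) file/page"
    CRAWL = "website scrape"
    DEFERRED = "deferred scheme"


def classify_source(source: str, crawl: bool = False) -> SourceKind:
    """Decide how a source should be ingested (illustrative only)."""
    scheme = urlsplit(source).scheme.lower()
    if scheme in ("http", "https"):
        return SourceKind.CRAWL if crawl else SourceKind.HTTP
    if "://" in source:
        # Any other URL scheme is treated as not supported yet.
        return SourceKind.DEFERRED
    path = Path(source)
    if path.is_dir():
        return SourceKind.LOCAL_DIR
    if path.is_file():
        return SourceKind.LOCAL_FILE
    raise FileNotFoundError(source)


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reads.csv"
    sample.write_text("a,b\n")
    kinds = [classify_source(tmp), classify_source(str(sample))]
```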

`--into` (default `data`; `--output-dir`/`-o` kept as aliases) controls where
data lands. Manifests get `dct:source` + `biotope:fetchedAt`; scraped pages each
carry their own `dct:source`. Status auto-classifies like `biotope add`
(`--status` overrides), and the full add-parity metadata flags are accepted.
`--no-add` brings data in without baking; s3://, scp://, … sources are deferred
(not handled yet) and abort.

Security guards: reject a path-escaping `--into` (absolute or `..`) before it
can dump data outside the tree, sanitise server-supplied download filenames to
a contained basename, and skip symlinks when copying directories.
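
The first two guards can be approximated as follows. `validate_into` and `sanitise_filename` are hypothetical names and the exact rejection rules are assumptions; symlink skipping is not shown.

```python
from pathlib import PurePosixPath, PureWindowsPath


def validate_into(into: str) -> PurePosixPath:
    """Reject a destination that is absolute or climbs out with '..'."""
    candidate = PurePosixPath(into.replace("\\", "/"))
    if (candidate.is_absolute()
            or PureWindowsPath(into).is_absolute()
            or ".." in candidate.parts):
        raise ValueError(f"--into must stay inside the project: {into!r}")
    return candidate


def sanitise_filename(name: str, fallback: str = "download") -> str:
    """Reduce a server-supplied filename to a contained basename."""
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    base = base.replace("\x00", "").strip()
    if base in ("", ".", ".."):
        return fallback
    return base


print(validate_into("data/raw"))              # data/raw
print(sanitise_filename("../../etc/passwd"))  # passwd
```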
- New docs/api-docs/get.md + nav entry (mkdocs.yml, mkdocs.nav.yml).
- Rewrite the AGENTS.md "Bring data in" section to use `biotope get` for all
  ingress (drops the `cp -r … && biotope add` step) and make `get` the
  canonical second step of the workflow; `add` narrows to in-tree data.
- Update README and docs/index command surface accordingly.
Linked issue (may close on merge): Generalise biotope get into a universal ingress verb (local paths, directories, scraping)