Make "biotope get" a universal ingress verb #32
Draft
peymanvahidi wants to merge 6 commits into
`biotope get --crawl` parses fetched HTML to discover same-host links.
Introduce set_source/get_source plus the SOURCE_KEY (`dct:source`) and FETCHED_AT_KEY (`biotope:fetchedAt`) constants so manifests can record where external data was brought in from — distinct from `prov:wasDerivedFrom`, which links to other datasets inside the project. Also fix resolve_target's directory branch to append `.jsonld` to the full relative path rather than using `.with_suffix`, which clobbered a dot in the directory name (`example.com` → `example.jsonld`) and broke the manifest↔data mirroring. Only dotted directory names are affected; plain names are unchanged.
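The dotted-directory bug can be reproduced with `pathlib` alone. The snippet below is a minimal illustration, not the project's `resolve_target`; the `data/example.com` path is only an example, and the manifest root the real code joins onto is left out.

```python
from pathlib import PurePosixPath

data_dir = PurePosixPath("data/example.com")

# with_suffix treats ".com" as the extension and replaces it.
clobbered = data_dir.with_suffix(".jsonld")

# Appending to the final path component keeps the dotted name intact.
appended = data_dir.with_name(data_dir.name + ".jsonld")

# A plain directory name gives the same result either way.
plain = PurePosixPath("data/raw")
plain_suffix = plain.with_suffix(".jsonld")
plain_append = plain.with_name(plain.name + ".jsonld")

print(clobbered)  # data/example.jsonld
print(appended)   # data/example.com.jsonld
```

Because the two approaches agree for names without a dot, switching to the append form should leave existing manifests for plain directory names where they are.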
Thread optional `source`/`fetched_at` through the overrides dict and stamp them via _apply_source_provenance in both _add_file and _bake_directory. Plain `biotope add` (no external origin) stays a no-op, so existing manifests are unaffected; `biotope get` populates them. Also fix _resolve_dataset_ref's directory branch to append `.jsonld` (same dotted-dir bug as resolve_target), so a scraped `data/example.com` dataset remains addressable by `biotope mark` and `--derived-from`.
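A rough sketch of the stamping step, assuming the manifest is a plain dict and the overrides carry optional `source` and `fetched_at` strings. `apply_source_provenance` here is a hypothetical stand-in for `_apply_source_provenance`, whose real signature is not shown in this PR text; only the two key names come from the commit message.

```python
from datetime import datetime, timezone

SOURCE_KEY = "dct:source"
FETCHED_AT_KEY = "biotope:fetchedAt"


def apply_source_provenance(manifest: dict, overrides: dict) -> dict:
    """Hypothetical helper: stamp external-origin fields when they are given."""
    source = overrides.get("source")
    fetched_at = overrides.get("fetched_at")
    if source:
        manifest[SOURCE_KEY] = source
    if fetched_at:
        manifest[FETCHED_AT_KEY] = fetched_at
    return manifest


# Plain `biotope add`: no external origin, so the manifest is left untouched.
added = apply_source_provenance({"name": "raw"}, {})

# `biotope get`: both fields are supplied by the caller.
now = datetime.now(timezone.utc).isoformat()
fetched = apply_source_provenance(
    {"name": "example.com"},
    {"source": "https://example.com/", "fetched_at": now},
)
```

Keeping the helper a no-op when neither override is present is what lets the same code path serve both `add` and `get` in this sketch.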
New biotope/scrape.py backs `biotope get --crawl`: a same-host breadth-first crawl to a bounded depth with a page cap. v1 is deliberately conservative:

- robots.txt honoured by default (4xx → allow-all, 5xx → disallow-all, network failure → allow-all, Crawl-delay respected);
- per-request rate limiting (--rate);
- content-type gated to HTML, responses written as bytes;
- URL→file mapping is path-traversal-safe and de-duplicated by output path;
- redirects that land off-host are skipped, and the final URL is recorded.

The module is free of any biotope-manifest knowledge: it returns the saved pages so the caller can bake the manifest.
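The core of a same-host, depth-bounded, page-capped breadth-first crawl might look like the sketch below. It is not the code in biotope/scrape.py: the `crawl` function, its parameters, and the `fetch_links` callback are hypothetical, and robots.txt handling, rate limiting, and content-type gating are left out so the traversal logic stays visible. An in-memory link table stands in for the network.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(
    start: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
) -> list[str]:
    """Visit same-host pages breadth-first, bounded by depth and page count."""
    host = urlsplit(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth bound
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc != host or link in seen:
                continue  # off-host or already queued
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Stand-in for fetching a page and extracting its hrefs.
site = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/c", "/#top"],
    "https://example.com/b": [],
    "https://example.com/c": ["/d"],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []), max_depth=2)
capped = crawl("https://example.com/", lambda u: site.get(u, []), max_pages=2)
```

Passing the link extractor in as a callback mirrors the separation described above: the crawler knows nothing about manifests and simply returns what it visited.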
`biotope get <source>` now classifies its source and brings the data in, bakes the manifest, and records provenance in one shot:

- local file → copy + bake
- local directory → recursive copy + bake (+ .biotope.yaml scaffold)
- http(s) file/page → download + bake (existing behaviour preserved)
- website scrape → --crawl bounded BFS, one file per page, one manifest

`--into` (default `data`; `--output-dir`/`-o` kept as aliases) controls where data lands. Manifests get `dct:source` + `biotope:fetchedAt`; scraped pages each carry their own `dct:source`. Status auto-classifies like `biotope add` (`--status` overrides), and the full add-parity metadata flags are accepted. `--no-add` brings data in without baking; s3://, scp://, … abort as deferred.

Security guards: reject a path-escaping `--into` (absolute or `..`) before it can dump data outside the tree, sanitise server-supplied download filenames to a contained basename, and skip symlinks when copying directories.
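As an illustration of dispatching on source shape and of the `--into` guard, here is a small sketch. `classify_source` and `check_into` are hypothetical names, the returned labels are made up for the example, and the real command may order or word its checks differently.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def classify_source(source: str) -> str:
    """Hypothetical classifier: map a source string to an ingress kind."""
    scheme = urlsplit(source).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if "://" in source:
        return "deferred"  # e.g. s3://, scp://: not supported yet
    path = Path(source)
    if path.is_dir():
        return "local-directory"
    if path.is_file():
        return "local-file"
    raise FileNotFoundError(source)


def check_into(into: str) -> Path:
    """Hypothetical guard: reject destinations that escape the project tree."""
    path = Path(into)
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"--into must stay inside the project: {into!r}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reads.csv"
    sample.write_text("id,count\n1,2\n")
    kinds = [
        classify_source("https://example.com/data.csv"),
        classify_source("s3://bucket/key"),
        classify_source(tmp),
        classify_source(str(sample)),
    ]
```

Checking `--into` before any copy or download starts means a bad destination fails fast, with nothing written outside the tree.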
- New docs/api-docs/get.md + nav entry (mkdocs.yml, mkdocs.nav.yml).
- Rewrite the AGENTS.md "Bring data in" section to use `biotope get` for all ingress (drops the `cp -r … && biotope add` step) and make `get` the canonical second step of the workflow; `add` narrows to in-tree data.
- Update README and docs/index command surface accordingly.
Generalises `biotope get` beyond single-URL downloads. It now dispatches on source shape (local files, local directories, and bounded `--crawl` website scrapes), copying data into the project, baking the manifest, and recording source provenance (`dct:source` + fetch timestamp) in one shot.

- `--into` and `--status` options; default destination is now `data/<basename>`
- `beautifulsoup4`)
- `biotope add` records external-source provenance on baked manifests

fixes #22