profile

Extract, derive, and infer metadata from a CSV (local path or URL) using the dataset's statistical profile. The output is mapped and driven by a configurable metadata scheming YAML spec (DCAT-US v3, DCAT-AP v3 and Croissant 1.1 are bundled; Geoconnex is available when qsv is built with the geoconnex feature), with optional CKAN/DCAT metadata discovery for URL inputs. This enables FAIRification at scale.

Table of Contents | Source: src/cmd/profile.rs | 📇🧠🤖📚⛩️ CKAN

Description | Examples | Usage | Arguments | Profile Options | Common Options

Description

Profile a CSV (local path or URL) and emit a .metadata.json file carrying five top-level blocks:

- dpp: inferred dataset signals: lat/lon/date columns, file size, row count, encoding, etc. (the legacy datapusher-plus inference block).
- stats: per-column summary statistics from qsv stats.
- frequency: per-column value counts from qsv frequency.
- ckan: a CKAN-shaped block (package + resources) that datapusher-plus consumes to prepopulate CKAN packages.
- projection: the dataset re-expressed in the active profile's metadata vocabulary. Default is DCAT-US v3; bundled alternates are dcat-ap-v3 (EU portals), croissant (ML/AI registries) and geoconnex (water-data federations). Consumable directly by data.gov harvesters, EU DCAT-AP catalogs, mlcommons / Hugging Face / Kaggle, and Internet of Water tooling.
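
As a rough illustration of consuming this output downstream, the Python sketch below reports which of the five top-level blocks are present in a parsed metadata document. The key names are the block names described above; the helper name `summarize_blocks` and the in-memory sample are hypothetical, and the placeholder values do not reflect the real shape of each block.

```python
import json

# Top-level block names as described above.
EXPECTED_BLOCKS = ("dpp", "stats", "frequency", "ckan", "projection")


def summarize_blocks(metadata):
    """Map each expected block name to True/False depending on presence."""
    return {name: name in metadata for name in EXPECTED_BLOCKS}


# In-memory stand-in for a metadata file (assumption: real blocks are far
# richer; empty placeholders are used here only to show presence checks).
sample_text = '{"dpp": {}, "stats": [], "frequency": [], "ckan": {}}'
sample = json.loads(sample_text)

for block, present in summarize_blocks(sample).items():
    print(f"{block}: {'present' if present else 'missing'}")
```

To run this against a real file, replace the sample with `json.load(open(path))`, where `path` is wherever qsv wrote the output. A block can legitimately be missing, for instance when --no-projection or --no-ckan was passed.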

Behind the scenes qsv runs the same statistical + frequency analysis datapusher-plus (DP+) runs in CKAN, builds a Jinja2 evaluation context from the results, and — when an optional CKAN scheming YAML spec is supplied — evaluates the spec's formula / suggestion_formula templates against that context. Jinja2 helpers and filters are a native Rust port of DP+'s jinja2_helpers.py, built on minijinja.

When the input is a URL whose response carries DCAT markup (HTTP Link: rel=describedBy), qsv discovers the publisher's stated metadata and merges it as a base layer beneath the inferred projection.
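
To make the discovery signal concrete, here is a minimal Python sketch that pulls rel=describedBy targets out of an HTTP Link header value. It is not qsv's implementation (which this page does not show); the function name `find_described_by` and the example URLs are hypothetical, and the parser is deliberately simplified: it ignores edge cases such as commas or semicolons inside quoted parameter values.

```python
import re
from typing import List


def find_described_by(link_header: str) -> List[str]:
    """Return the targets of all link-values whose rel includes describedBy."""
    targets = []
    # Each link-value looks like: <target>; param=value; param="value"
    for match in re.finditer(r"<([^>]*)>([^<]*)", link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().rstrip(",").partition("=")
            if name.strip().lower() != "rel":
                continue
            # rel may hold several space-separated relation types.
            rels = value.strip().strip('"').lower().split()
            if "describedby" in rels and target not in targets:
                targets.append(target)
    return targets


# Hypothetical header value for illustration only.
header = (
    '<https://example.org/meta.jsonld>; rel="describedBy"; '
    'type="application/ld+json", <https://example.org/other>; rel="alternate"'
)
print(find_described_by(header))
```

Relation types are compared case-insensitively, so `describedBy` and `describedby` are treated the same.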

For an example CKAN scheming YAML spec, see:
https://github.com/dathere/datapusher-plus/blob/main/ckanext/datapusher_plus/dataset-druf.yaml

For more extensive examples, see https://github.com/dathere/qsv/blob/master/tests/test_profile.rs. See also https://github.com/dathere/qsv/wiki/Metadata-Profiling

Examples

Quick: dpp/stats/frequency + default DCAT-US v3 projection.

qsv profile data.csv

Pipe stdin; output defaults to stdin.metadata.json.

cat data.csv | qsv profile

URL input: discover the publisher's DCAT markup and merge it as a base layer.

qsv profile https://data.example.gov/datasets/sample.csv

Seed publisher/contact info from a JSON file; write to a chosen output path.

qsv profile data.csv --initial-context publisher.json -o data.metadata.json
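
For orientation, the Python sketch below builds a small seed document of the kind passed to --initial-context. The three top-level keys and the {"value": ..., "force": true} wrapper come from the option description on this page, as does package.title; every other field name and value is a hypothetical placeholder, and `is_forced` is an illustrative helper rather than anything qsv exposes. The fully-populated example referenced under --initial-context remains the authoritative guide.

```python
import json

# Seed values for --initial-context. Only the top-level keys and
# package.title are taken from this page; the rest is illustrative.
initial_context = {
    "package": {
        # Wrapped leaf: force=true marks the value as overriding both
        # discovered DCAT markup and qsv's own inference.
        "title": {"value": "Sample Dataset", "force": True},
        # Plain leaf without the wrapper (field name is an assumption).
        "notes": "Illustrative description.",
    },
    "resource": {},
    "dataset_info": {},
}


def is_forced(leaf) -> bool:
    """True when a leaf uses the {"value": ..., "force": true} wrapper."""
    return isinstance(leaf, dict) and "value" in leaf and leaf.get("force") is True


def to_json(context) -> str:
    """Serialize the seed document as it would be saved to publisher.json."""
    return json.dumps(context, indent=2)


print(to_json(initial_context))
```

Saving the printed JSON as publisher.json gives a file that can be passed with the command above.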

data.gov-style harvest: validate against DCAT-US v3 JSON Schema, abort on violations, wrap in a Catalog envelope.

qsv profile data.csv --validate --strict --catalog -o data.metadata.json

DCAT-AP v3 for EU data portals; pyshacl validates the bundled SHACL shapes.

qsv profile open-data.csv --profile dcat-ap-v3 --validate --strict

Croissant JSON-LD for an ML dataset; mlcroissant validates the output.

qsv profile train.csv --profile croissant --validate -o train.croissant.json

Geoconnex JSON-LD for hydrologic data (qsv built with the geoconnex feature).

qsv profile gages.csv --profile geoconnex --validate --strict

Evaluate a CKAN scheming spec: Jinja2 formulas compute spatial/temporal extents, accrual periodicity, and other derived fields.

qsv profile data.csv --spec dataset-druf.yaml -o data.metadata.json

CKAN-only output: drop the projection block, keep dpp/stats/frequency/ckan.

qsv profile data.csv --no-projection --spec dataset-druf.yaml

Custom YAML profile from disk (embedded names always win over same-named files, so use a non-clashing name for custom profiles).

qsv profile data.csv --profile ./my-org-dcat.yaml --validate

Usage

qsv profile [options] [<input>]
qsv profile --help
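
When driving the command from a script, the usage line above maps naturally onto an argument list. The sketch below assembles one in Python using only flags documented on this page; `build_profile_cmd` and its parameter names are hypothetical, and the function only builds the list without invoking qsv.

```python
import shlex
from typing import List, Optional


def build_profile_cmd(
    input_path: Optional[str] = None,
    *,
    profile: Optional[str] = None,
    spec: Optional[str] = None,
    output: Optional[str] = None,
    validate: bool = False,
    strict: bool = False,
) -> List[str]:
    """Assemble `qsv profile [options] [<input>]` as an argv list."""
    cmd = ["qsv", "profile"]
    if spec:
        cmd += ["--spec", spec]
    if profile:
        cmd += ["--profile", profile]
    if validate:
        cmd.append("--validate")
    if strict:
        cmd.append("--strict")
    if output:
        cmd += ["-o", output]
    # Omitting the input leaves qsv reading from stdin.
    if input_path:
        cmd.append(input_path)
    return cmd


cmd = build_profile_cmd(
    "data.csv", validate=True, strict=True, output="data.metadata.json"
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where qsv is installed.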

Arguments

Argument  Description
 <input>  Path or URL to the CSV to profile. When - or omitted, reads from stdin. When the URL has DCAT markup, qsv will attempt to discover and ingest it as a base layer of metadata (unless --no-dcat-discovery is set). See --no-dcat-discovery and --dcat-discovery-timeout for details and opt-out.

Profile Options

           Option            Type Description Default
 ‑‑spec  string CKAN scheming YAML spec file. If omitted, only the inferred dpp block (lat/lon/date columns, dataset stats) is emitted; no formulas are evaluated.
 ‑‑initial‑context  string JSON file providing seed values for the package / resource dicts plus optional JSON-Pointer overrides for the final projection block. Replaces the older --package-meta / --resource-meta flags. Top-level keys: package, resource, dataset_info. Each leaf value may be wrapped as {"value": ..., "force": true} to mark it as overriding any value discovered from URL DCAT markup AND any value qsv inferred. Force is honored across all three subtrees: dataset_info entries override their target path verbatim; package / resource entries route through the active profile's field_mappings: table (e.g. package.title force=true lands at /projection/dct:title, beating inference and discovery). Forced values for slots the profile does not surface are silently dropped (no-op). See tests/resources/profile/dcat-init-context.README.md for a fully-populated example.
 ‑‑no‑projection  flag Skip the metadata projection block (dcat/croissant/geoconnex, depending on the active profile).
 ‑‑no‑ckan  flag Skip the CKAN-shape block.
 ‑‑croissant‑frequency  flag Embed per-column value-frequency distributions in the metadata projection. The croissant profile renders them as inline cr:RecordSets (one <col>-frequency RecordSet of {value, count, percentage} rows per column), per the spec's "distribution of values is a statistic on the field" guidance. Off by default (keeps the projection compact); the raw counts always remain in the top-level frequency block regardless. Other bundled profiles ignore this flag.
 ‑‑dcat‑legacy‑license  flag Transitional: re-emit dct:license on the Dataset alongside the v3-required Distribution-level copy. Default: off (strict v3, license on Distribution only).
 ‑‑no‑dcat‑discovery  flag Skip DCAT-markup discovery on URL inputs. Discovery sniffs HTTP Link: rel=describedBy (and, in future, sibling .metadata.json / JSON-LD <script> blocks) to use the publisher's stated metadata as a base layer.
 ‑‑dcat‑discovery‑timeout  integer Per-request timeout for DCAT-markup discovery probes. Default: 5.
 ‑‑validate  flag Validate the emitted projection block against the active profile's declared validators. For dcat-us-v3 that's the vendored GSA JSON Schema bundle (see resources/dcat-us-v3/); for dcat-ap-v3 / geoconnex it's pyshacl over the bundled SHACL shapes; for croissant it's mlcroissant. Catches missing mandatory fields, cardinality issues, and shape violations. Violations append to projection_warnings by default.
 ‑‑strict  flag With --validate, fail the command on JSON Schema violations or non-Info external-validator findings (Required/Recommended severities) instead of just warning. Note: RFC4180 structural failures from qsv validate (emitted when a spec declares validators) are always appended as warnings, regardless of this flag.
 ‑‑allow‑external‑validator  flag Opt in to spawning the validator binary declared by validation.external when the profile was loaded from an arbitrary YAML file. Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant, geoconnex) always run their declared external validators because the profile content is vetted at qsv release time. Without this flag, file-loaded profiles emit a Recommended-severity warning instead of running the binary, so an untrusted YAML can't silently execute arbitrary commands. Default: off.
 ‑‑catalog  flag Wrap the emitted DCAT-US v3 Dataset inside a dcat:Catalog envelope (Catalog{dataset:[...]}). Useful for federation harvesters (data.gov, CKAN ingest) that expect Catalog-shaped top-level metadata. Default: off (Dataset-only, backwards-compatible).
 ‑‑profile  string Metadata projection profile to use. Embedded names: dcat-us-v3 (default), dcat-ap-v3, croissant; geoconnex (when built with the geoconnex feature — qsv default; qsvdp opt-in via -F datapusher_plus,geoconnex). A path to a custom YAML profile is also accepted; embedded names always win over same-named files. See resources/profiles/README.md for the schema and authoring guide.
 ‑‑force  flag Force recomputing cardinality and unique values even if a stats cache file exists.
 ‑j, ‑‑jobs  integer The number of jobs to run in parallel for the underlying stats/frequency passes. When not set, the number of jobs is set to the number of CPUs detected.
 ‑o, ‑‑output  string Output JSON path. Default: .metadata.json.
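
Since --validate without --strict records violations under projection_warnings rather than failing, a post-run check can be useful. This page does not say where that key sits in the output JSON, so the Python sketch below searches for it recursively; `collect_key` is a hypothetical helper, and the sample document and warning text are invented for illustration.

```python
from typing import Any, List


def collect_key(node: Any, key: str) -> List[Any]:
    """Return every value stored under `key` at any depth of a JSON tree."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(collect_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_key(item, key))
    return found


# Invented sample; the real location and shape of the warnings may differ.
sample = {
    "projection": {"dct:title": "Sample Dataset"},
    "projection_warnings": ["example warning text"],
}

warnings = collect_key(sample, "projection_warnings")
print(f"found {len(warnings)} projection_warnings entr{'y' if len(warnings) == 1 else 'ies'}")
```

A wrapper script could treat a non-empty result as a soft failure while still keeping the generated metadata.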

Common Options

     Option      Type Description Default
 ‑h, ‑‑help  flag Display this message
 ‑n, ‑‑no‑headers  flag When set, the first row will not be interpreted as headers. Namely, it will be processed with the rest of the rows. Otherwise, the first row will always appear as the header row in the output.
 ‑d, ‑‑delimiter  string The field delimiter for reading CSV data. Must be a single character.
 ‑‑memcheck  flag Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics.
