profile

Extract, derive and infer metadata from a CSV (local path or URL) using the dataset's statistical profile, mapped and driven by a configurable metadata scheming YAML spec (DCAT-US v3, DCAT-AP v3 and Croissant 1.1 bundled; Geoconnex when built with the geoconnex feature), with optional CKAN/DCAT metadata discovery for URL inputs. This enables FAIRification at scale.

Table of Contents | Source: src/cmd/profile.rs | 📇🧠🤖📚⛩️

Description ↩

Profile a CSV (local path or URL) and emit a .metadata.json file carrying five top-level blocks:

- `dpp`: inferred dataset signals such as lat/lon/date columns, file size, row count and encoding (the legacy datapusher-plus inference block).
- `stats`: per-column summary statistics from qsv stats.
- `frequency`: per-column value counts from qsv frequency.
- `ckan`: a CKAN-shaped block (package + resources) that datapusher-plus consumes to prepopulate CKAN packages.
- `projection`: the dataset re-expressed in the active profile's metadata vocabulary. Default is DCAT-US v3; bundled alternates are dcat-ap-v3 (EU portals), croissant (ML/AI registries) and geoconnex (water-data federations). Consumable directly by data.gov harvesters, EU DCAT-AP catalogs, mlcommons / Hugging Face / Kaggle, and Internet of Water tooling.
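
As a minimal sketch of how a downstream script might check which of the five blocks a given output file carries (for example after `--no-projection` or `--no-ckan`), the Python below inspects only the top-level keys. The block names come from the description above; the sample document is a made-up stand-in, and its empty inner objects do not reflect the real structure of each block.

```python
import json

# Top-level block names as described above.
EXPECTED_BLOCKS = ("dpp", "stats", "frequency", "ckan", "projection")


def summarize_blocks(doc):
    """Map each expected block name to True/False depending on presence."""
    return {name: name in doc for name in EXPECTED_BLOCKS}


# Hypothetical, heavily trimmed stand-in for a .metadata.json file;
# the inner fields are placeholders, not qsv's actual schema.
sample_text = json.dumps({
    "dpp": {},
    "stats": {},
    "frequency": {},
    "ckan": {},
})

summary = summarize_blocks(json.loads(sample_text))
missing = [name for name, present in summary.items() if not present]
print(missing)
```

In real use, `sample_text` would be replaced by the contents of the emitted file, read with `open(...)`.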

Behind the scenes, qsv runs the same statistical and frequency analysis that datapusher-plus (DP+) runs in CKAN and builds a Jinja2 evaluation context from the results. When an optional CKAN scheming YAML spec is supplied, it evaluates the spec's formula / suggestion_formula templates against that context. The Jinja2 helpers and filters are a native Rust port of DP+'s jinja2_helpers.py, built on minijinja.

When the input is a URL whose response carries DCAT markup (HTTP Link: rel=describedBy), qsv discovers the publisher's stated metadata and merges it as a base layer beneath the inferred projection.
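
To make the discovery step concrete, here is a rough Python sketch of pulling a `describedBy` target out of an HTTP `Link` header value. It is not qsv's implementation (which is written in Rust); the function name `find_describedby` and the `sample.jsonld` URL are assumptions made for illustration, and the parser handles only the common `<url>; rel="..."` form.

```python
import re


def find_describedby(link_header):
    """Return the first target URL whose rel list includes 'describedby', else None."""
    # Split on commas that are followed by the '<' of the next link-value.
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part, re.S)
        if not match:
            continue
        url, params = match.groups()
        # rel may be quoted (possibly a space-separated list) or a bare token.
        rel = re.search(r';\s*rel\s*=\s*(?:"([^"]*)"|([^;\s]+))', params, re.I)
        if not rel:
            continue
        rel_value = rel.group(1) if rel.group(1) is not None else rel.group(2)
        # Relation types are compared case-insensitively.
        if "describedby" in rel_value.lower().split():
            return url
    return None


# Hypothetical header a publisher might send alongside the CSV.
header = (
    '<https://data.example.gov/datasets/sample.csv>; rel="canonical", '
    '<https://data.example.gov/datasets/sample.jsonld>; rel="describedBy"; '
    'type="application/ld+json"'
)
print(find_describedby(header))
```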

For an example CKAN scheming YAML spec, see:
https://github.com/dathere/datapusher-plus/blob/main/ckanext/datapusher_plus/dataset-druf.yaml

For more extensive examples, see https://github.com/dathere/qsv/blob/master/tests/test_profile.rs. See also https://github.com/dathere/qsv/wiki/Metadata-Profiling

Examples ↩

Quick: dpp/stats/frequency + default DCAT-US v3 projection.

qsv profile data.csv

Pipe stdin; output defaults to stdin.metadata.json.

cat data.csv | qsv profile

URL input: discover the publisher's DCAT markup and merge it as a base layer.

qsv profile https://data.example.gov/datasets/sample.csv

Seed publisher/contact info from a JSON file; write to a chosen output path.

qsv profile data.csv --initial-context publisher.json -o data.metadata.json

data.gov-style harvest: validate against DCAT-US v3 JSON Schema, abort on violations, wrap in a Catalog envelope.

qsv profile data.csv --validate --strict --catalog -o data.metadata.json

DCAT-AP v3 for EU data portals; pyshacl validates the bundled SHACL shapes.

qsv profile open-data.csv --profile dcat-ap-v3 --validate --strict

Croissant JSON-LD for an ML dataset; mlcroissant validates the output.

qsv profile train.csv --profile croissant --validate -o train.croissant.json

Geoconnex JSON-LD for hydrologic data (qsv built with the geoconnex feature).

qsv profile gages.csv --profile geoconnex --validate --strict

Evaluate a CKAN scheming spec: Jinja2 formulas compute spatial/temporal extents, accrual periodicity, and other derived fields.

qsv profile data.csv --spec dataset-druf.yaml -o data.metadata.json

CKAN-only output: drop the projection block, keep dpp/stats/frequency/ckan.

qsv profile data.csv --no-projection --spec dataset-druf.yaml

Custom YAML profile from disk (embedded names always win over same-named files, so use a non-clashing name for custom profiles).

qsv profile data.csv --profile ./my-org-dcat.yaml --validate
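
For the `--initial-context` example above, the following Python sketch writes a small seed file. The top-level keys (`package`, `resource`, `dataset_info`) and the `{"value": ..., "force": true}` wrapper follow the option description under Profile Options; the individual field names (`title`, `notes`, `format`) and their values are illustrative assumptions, not a list of what qsv expects.

```python
import json
import os
import tempfile


def forced(value):
    """Wrap a leaf so it is marked as overriding discovered and inferred values."""
    return {"value": value, "force": True}


# Field names below are illustrative CKAN-style names chosen for this sketch.
initial_context = {
    "package": {
        "title": forced("Sample Dataset"),  # forced leaf
        "notes": "Seed description.",       # plain seed value
    },
    "resource": {
        "format": "CSV",
    },
    # Left empty: the exact shape of dataset_info entries is not shown here.
    "dataset_info": {},
}

path = os.path.join(tempfile.mkdtemp(), "publisher.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(initial_context, fh, indent=2)

# Round-trip to confirm the file is valid JSON with the expected wrapper.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded["package"]["title"])
```

The resulting file would then be passed as `qsv profile data.csv --initial-context publisher.json`.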

Usage ↩

qsv profile [options] [<input>]
qsv profile --help

Arguments ↩

Argument	Description
`<input>`	Path or URL to the CSV to profile. When `-` or omitted, reads from stdin. When the URL has DCAT markup, qsv will attempt to discover and ingest it as a base layer of metadata (unless --no-dcat-discovery is set). See --no-dcat-discovery and --dcat-discovery-timeout for details and opt-out.

Profile Options ↩

Option	Type	Description
`‑‑spec`	string	CKAN scheming YAML spec file. If omitted, only the inferred `dpp` block (lat/lon/date columns, dataset stats) is emitted; no formulas are evaluated.
`‑‑initial‑context`	string	JSON file providing seed values for the package / resource dicts plus optional JSON-Pointer overrides for the final projection block. Replaces the older --package-meta / --resource-meta flags. Top-level keys: `package`, `resource`, `dataset_info`. Each leaf value may be wrapped as {"value": ..., "force": true} to mark it as overriding any value discovered from URL DCAT markup AND any value qsv inferred. Force is honored across all three subtrees: dataset_info entries override their target path verbatim; package / resource entries route through the active profile's `field_mappings:` table (e.g. `package.title force=true` lands at `/projection/dct:title`, beating inference and discovery). Forced values for slots the profile does not surface are silently dropped (no-op). See tests/resources/profile/dcat-init-context.README.md for a fully-populated example.
`‑‑no‑projection`	flag	Skip the metadata projection block (dcat/croissant/geoconnex, depending on the active profile).
`‑‑no‑ckan`	flag	Skip the CKAN-shape block.
`‑‑croissant‑frequency`	flag	Embed per-column value-frequency distributions in the metadata projection. The croissant profile renders them as inline cr:RecordSets (one `<col>-frequency` RecordSet of {value, count, percentage} rows per column), per the spec's "distribution of values is a statistic on the field" guidance. Off by default (keeps the projection compact); the raw counts always remain in the top-level `frequency` block regardless. Other bundled profiles ignore this flag.
`‑‑dcat‑legacy‑license`	flag	Transitional: re-emit dct:license on the Dataset alongside the v3-required Distribution-level copy. Default: off (strict v3, license on Distribution only).
`‑‑no‑dcat‑discovery`	flag	Skip DCAT-markup discovery on URL inputs. Discovery sniffs HTTP Link: rel=describedBy (and, in future, sibling .metadata.json / JSON-LD <script> blocks) to use the publisher's stated metadata as a base layer.
`‑‑dcat‑discovery‑timeout`	integer	Per-request timeout for DCAT-markup discovery probes. Default: 5.
`‑‑validate`	flag	Validate the emitted projection block against the active profile's declared validators. For dcat-us-v3 that's the vendored GSA JSON Schema bundle (see resources/dcat-us-v3/); for dcat-ap-v3 / geoconnex it's pyshacl over the bundled SHACL shapes; for croissant it's mlcroissant. Catches missing mandatory fields, cardinality issues, and shape violations. Violations append to projection_warnings by default.
`‑‑strict`	flag	With --validate, fail the command on JSON Schema violations or non-Info external-validator findings (Required/Recommended severities) instead of just warning. Note: RFC4180 structural failures from `qsv validate` (emitted when a spec declares `validators`) are always appended as warnings, regardless of this flag.
`‑‑allow‑external‑validator`	flag	Opt in to spawning the validator binary declared by `validation.external` when the profile was loaded from an arbitrary YAML file. Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant, geoconnex) always run their declared external validators because the profile content is vetted at qsv release time. Without this flag, file-loaded profiles emit a Recommended-severity warning instead of running the binary, so an untrusted YAML can't silently execute arbitrary commands. Default: off.
`‑‑catalog`	flag	Wrap the emitted DCAT-US v3 Dataset inside a dcat:Catalog envelope (Catalog{dataset:[...]}). Useful for federation harvesters (data.gov, CKAN ingest) that expect Catalog-shaped top-level metadata. Default: off (Dataset-only, backwards-compatible).
`‑‑profile`	string	Metadata projection profile to use. Embedded names: dcat-us-v3 (default), dcat-ap-v3, croissant; geoconnex (when built with the `geoconnex` feature — qsv default; qsvdp opt-in via -F datapusher_plus,geoconnex). A path to a custom YAML profile is also accepted; embedded names always win over same-named files. See resources/profiles/README.md for the schema and authoring guide.
`‑‑force`	flag	Force recomputing cardinality and unique values even if a stats cache file exists.
`‑j,` `‑‑jobs`	integer	The number of jobs to run in parallel for the underlying stats/frequency passes. When not set, the number of jobs is set to the number of CPUs detected.
`‑o,` `‑‑output`	string	Output JSON path. Default: .metadata.json.
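
The layering described for `--initial-context` and DCAT discovery can be pictured with a small Python sketch. It is a conceptual model only: it uses flat keys rather than routing through a profile's `field_mappings:` table, the keys other than `dct:title` are hypothetical, and the treatment of unforced seed values (filling gaps only) is an assumption, not documented behaviour.

```python
def unwrap(leaf):
    """Return (value, is_forced) for a plain leaf or a {"value", "force"} wrapper."""
    if isinstance(leaf, dict) and "value" in leaf and set(leaf) <= {"value", "force"}:
        return leaf["value"], bool(leaf.get("force", False))
    return leaf, False


def resolve(discovered, inferred, seeds):
    """Layer discovered < inferred, then apply forced seed values on top."""
    result = dict(discovered)  # publisher's DCAT markup is the base layer
    result.update(inferred)    # inferred values sit above the base layer
    for key, leaf in seeds.items():
        value, is_forced = unwrap(leaf)
        if is_forced:
            result[key] = value            # forced values beat both layers
        else:
            result.setdefault(key, value)  # ASSUMPTION: unforced seeds only fill gaps
    return result


# Hypothetical inputs for illustration.
discovered = {"dct:title": "Publisher Title", "dct:publisher": "Example Agency"}
inferred = {"dct:title": "sample", "dct:format": "CSV"}
seeds = {
    "dct:title": {"value": "Official Title", "force": True},
    "dct:publisher": "Seed Agency",
    "dct:description": "Seed description",
}

resolved = resolve(discovered, inferred, seeds)
print(resolved["dct:title"])
```

Here the forced title replaces both the discovered and the inferred one, while the unforced publisher seed leaves the discovered value in place.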

Common Options ↩

Option	Type	Description
`‑h,` `‑‑help`	flag	Display this message
`‑n,` `‑‑no‑headers`	flag	When set, the first row will not be interpreted as headers. Namely, it will be processed with the rest of the rows. Otherwise, the first row will always appear as the header row in the output.
`‑d,` `‑‑delimiter`	string	The field delimiter for reading CSV data. Must be a single character.
`‑‑memcheck`	flag	Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics.
