Hey all, I have a suggestion that I thought could help to simplify downstream usage of OSV records, so just bringing it here for discussion - thank you!
Background
The OSV schema currently has database_specific and ecosystem_specific fields, which are great for custom data that doesn't fit into the rest of the schema.
However, there are many cases where multiple databases convey the same types of data within these fields, but the representations are inconsistent across records and databases. This is because there is no formal coordination of the *_specific fields: they are intended for free-form data defined by the publishing database, and are therefore outside the scope of the OSV schema.
This forces downstream tools and consumers to account for multiple inconsistent representations of metadata that can directly influence prioritisation and reporting workflows in vulnerability scanners.
For example, CWE classifications can help scanners categorize and prioritize vulnerabilities, such as determining that a vulnerability is malware-related (CWE-506). Currently, CWE information is commonly represented as an array named cwe_ids within database_specific, but this structure is not documented and cannot be relied upon as consistent by scanners.
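To make the inconsistency concrete, here are two abbreviated, hypothetical records carrying the same CWE information in different shapes (the IDs and the second structure are invented for illustration; only the cwe_ids array reflects the commonly seen pattern):

```json
[
  {
    "id": "EXAMPLE-2024-0001",
    "database_specific": {
      "cwe_ids": ["CWE-506"]
    }
  },
  {
    "id": "EXAMPLE-2024-0002",
    "database_specific": {
      "cwes": [{ "id": "CWE-506", "name": "Embedded Malicious Code" }]
    }
  }
]
```

A scanner that wants to flag malware-related entries has to know about both shapes, plus any others in the wild.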
At the same time, it may not be appropriate to lock certain types of metadata into the JSON schema as new fields, in order to keep the schema lean and backwards-compatible.
This suggestion proposes the addition of a metadata object to the JSON schema, whose purpose is to contain normative representations of commonly shared types of metadata.
The expected structure of subfields within metadata would be documented in this repository and validated through the linter (rather than immediately locked into the schema), in the same way that the linter is used to validate the content of existing fields.
These standards wouldn't be appropriate to define within *_specific fields, which are intentionally free-form and database-owned.
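As a rough sketch of the shape this could take (the subfield names and structure here are only illustrative, not a settled proposal):

```json
{
  "id": "EXAMPLE-2024-0003",
  "summary": "Malicious code in example-package",
  "metadata": {
    "cwe_ids": ["CWE-506"]
  },
  "database_specific": {
    "review_status": "triaged"
  }
}
```

Here the CWE data sits in a documented, linted location, while database_specific stays available for genuinely database-owned data.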
The coordinated structure for any new metadata subfield would be introduced via the normal repository approval process. To be eligible for standardisation in metadata, a field must:
- Not be unique to a specific database or ecosystem.
- Have demonstrated downstream value for scanners and consumers.
If the expected format of a certain field is defined via metadata and becomes stable and widely adopted, it can later be ratified in the schema (within metadata).
Therefore, only fields that are inherently metadata should be eligible. Fields that would logically belong within other schema properties (e.g., new types of severity scores, version ranges) would be out of scope.
A few other examples for which a common standard could be defined within metadata (a rough sketch follows the list):
- CNA assigner - Useful for assessing scoring source trust or authority.
- KEV - Indicates known exploitation in the wild.
- SSVC - Enables decision-based prioritisation workflows.
- The recently proposed symbols field - Useful for reachability analysis. As this data is currently fragmented across ecosystem_specific, perhaps it could be an appropriate candidate for normalization within metadata.
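Purely as an illustration of what coordinated subfields for the examples above might look like (none of these names or shapes are settled; each would go through the approval process described earlier):

```json
{
  "metadata": {
    "cna_assigner": "example-cna",
    "kev": true,
    "ssvc": {
      "exploitation": "active",
      "automatable": "no",
      "technical_impact": "partial"
    },
    "cwe_ids": ["CWE-506"]
  }
}
```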
Pros
- Can prevent inconsistent representations of the same data across databases and ecosystems.
- This proposal does not introduce new expressive capability to OSV records. Instead, it provides a coordination mechanism for normalizing common types of metadata that are currently, and could otherwise remain, fragmented across database_specific and ecosystem_specific fields.
  - *_specific fields remain fully free-form and defined by the publishing database.
  - metadata is reserved for fields intended for use by multiple databases that require a predictable structure due to their high downstream value. It's a location from which downstream tools can interpret data without maintaining bespoke parsing rules for different databases.
- Provides a clear stabilization path prior to potential schema ratification of a field. Pre-emptively defining a consistent structure through metadata documentation and the linter can prevent fragmentation across *_specific fields before a field is ready for formal schema inclusion.
- Doesn't require additional fields to be locked into the schema (see the schema fragment sketched after this list).
- Downstream tools can use the linter to validate metadata subfields before consuming them.
  - Works like the linter does for existing fields, validating the contents of subfields with a defined name.
- Provides an approach to many previously raised issues on OSV.dev and osv-schema, e.g.
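On the "lean schema" point above, one hypothetical way to declare this is to add metadata to the JSON schema as an open object with no enumerated subfields, leaving the per-subfield rules to the documentation and linter (this fragment is only a sketch, not a proposed diff):

```json
{
  "properties": {
    "metadata": {
      "type": "object",
      "description": "Coordinated metadata subfields; structures documented in this repository and validated by the linter."
    }
  }
}
```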
Cons
- The distinction between metadata and *_specific fields cannot be enforced by JSON schema validation and remains a semantic convention.
- Introducing a coordination layer outside the core schema may add conceptual complexity for contributors unfamiliar with the distinction between metadata and *_specific fields.
- metadata may be too generic a name.