Hey all, I have a suggestion that I thought could help to simplify downstream usage of OSV records, so just bringing it here for discussion - thank you!
Background
The OSV schema currently has database_specific and ecosystem_specific fields, which are great for custom data that doesn't fit into the rest of the schema.
However, there are many cases where multiple databases convey the same types of data within these fields, but the representations are inconsistent across records and databases. This is because there is no formal coordination of the *_specific fields: they are intended for free-form data defined by the publishing database, and are therefore outside the scope of the OSV schema.
This forces downstream tools and consumers to account for multiple inconsistent representations of metadata that can directly influence prioritisation and reporting workflows in vulnerability scanners.
For example, CWE classifications can help scanners categorize and prioritize vulnerabilities, such as determining that a vulnerability is malware-related (CWE-506). Currently, CWE information is commonly represented as an array named cwe_ids within database_specific, but this structure is not documented and cannot be relied upon as consistent by scanners.
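To make the inconsistency concrete, here are two abbreviated, hypothetical records carrying the same CWE information in different shapes (the IDs and the second structure are invented for illustration; only the cwe_ids array reflects the commonly seen pattern):

```json
[
  {
    "id": "EXAMPLE-2024-0001",
    "database_specific": {
      "cwe_ids": ["CWE-506"]
    }
  },
  {
    "id": "EXAMPLE-2024-0002",
    "database_specific": {
      "cwes": [{ "id": "CWE-506", "name": "Embedded Malicious Code" }]
    }
  }
]
```

A scanner that wants to flag malware-related entries has to know about both shapes, plus any others in the wild.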
At the same time, it may not be appropriate to lock certain types of metadata into the JSON schema as new fields, in order to keep the schema lean and backwards-compatible.
This suggestion proposes the addition of a metadata object to the JSON schema, whose purpose is to contain normative representations of commonly shared types of metadata.
The expected structure of subfields within metadata would be documented in this repository and validated through the linter (rather than immediately locked into the schema), in the same way that the linter is used to validate the content of existing fields.
These standards wouldn't be appropriate to define within *_specific fields, which are intentionally free-form and database-owned.
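As a rough sketch of the shape this could take (the subfield names and structure here are only illustrative, not a settled proposal):

```json
{
  "id": "EXAMPLE-2024-0003",
  "summary": "Malicious code in example-package",
  "metadata": {
    "cwe_ids": ["CWE-506"]
  },
  "database_specific": {
    "review_status": "triaged"
  }
}
```

Here the CWE data sits in a documented, linted location, while database_specific stays available for genuinely database-owned data.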
The coordinated structure for any new metadata subfield would be introduced via the normal repository approval process. To be eligible for standardisation in metadata, a field must:
- Not be unique to a specific database or ecosystem.
- Have demonstrated downstream value for scanners and consumers.
If the expected format of a certain field is defined via metadata and becomes stable and widely adopted, it can later be ratified in the schema (within metadata).
Therefore, only fields that are inherently metadata should be eligible. Fields that would logically belong within other schema properties (e.g., new types of severity scores, version ranges) would be out of scope.
A few other examples for which a common standard could be defined within metadata (a rough sketch follows the list):
- CNA assigner - Useful for assessing scoring source trust or authority.
- KEV - Indicates known exploitation in the wild.
- SSVC - Enables decision-based prioritisation workflows.
- The recently proposed symbols field - Useful for reachability analysis. As this data is currently fragmented across ecosystem_specific, perhaps it could be an appropriate candidate for normalization within metadata.
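Purely as an illustration of what coordinated subfields for the examples above might look like (none of these names or shapes are settled; each would go through the approval process described earlier):

```json
{
  "metadata": {
    "cna_assigner": "example-cna",
    "kev": true,
    "ssvc": {
      "exploitation": "active",
      "automatable": "no",
      "technical_impact": "partial"
    },
    "cwe_ids": ["CWE-506"]
  }
}
```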
Pros
- Can prevent inconsistent representations of the same data across databases and ecosystems.
- This proposal does not introduce new expressive capability to OSV records. Instead, it provides a coordination mechanism for normalizing common types of metadata that are currently, and could otherwise remain, fragmented across database_specific and ecosystem_specific fields.
  - *_specific fields remain fully free-form and defined by the publishing database.
  - metadata is reserved for fields intended for use by multiple databases that require a predictable structure due to their high downstream value. It's a location from which downstream tools can interpret data without maintaining bespoke parsing rules for different databases.
- Provides a clear stabilization path prior to potential schema ratification of a field. Pre-emptively defining a consistent structure through metadata documentation and the linter can prevent fragmentation across *_specific fields before a field is ready for formal schema inclusion.
- Doesn't require additional fields to be locked into the schema (see the schema fragment sketched after this list).
- Downstream tools can use the linter to validate metadata subfields before consuming them.
  - Works like the linter does for existing fields, validating the contents of subfields with a defined name.
- Provides an approach to many previously raised issues on OSV.dev and osv-schema, e.g.
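On the "lean schema" point above, one hypothetical way to declare this is to add metadata to the JSON schema as an open object with no enumerated subfields, leaving the per-subfield rules to the documentation and linter (this fragment is only a sketch, not a proposed diff):

```json
{
  "properties": {
    "metadata": {
      "type": "object",
      "description": "Coordinated metadata subfields; structures documented in this repository and validated by the linter."
    }
  }
}
```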
Cons
- The distinction between metadata and *_specific fields cannot be enforced by JSON schema validation and remains a semantic convention.
- Introducing a coordination layer outside the core schema may add conceptual complexity for contributors unfamiliar with the distinction between metadata and *_specific fields.
- metadata may be too generic a name.