Improve support for manifest files (Common crawl example) #920

@benjelloun

From email thread:

I think we can represent Common Crawl (CC) data by treating each paths.gz file as a manifest that "contains" the fileset. The 1.0 spec describes this only vaguely:

"A FileSet is a set of files located in a container, which can be an archive FileObject or a "manifest" file."

We should improve the description in the 1.1 spec and give an example.

You can extend the description you have created by adding a "containedIn" relationship from each FileSet to the corresponding manifest FileObject:

"distribution": [
  {
    "@type": "cr:FileObject",
    "@id": "warc.paths.gz",
    ...
  },
  {
    "@type": "cr:FileSet",
    "@id": "warc_paths",
    "containedIn": "warc.paths.gz",
    ...
  },
  {
    "@type": "cr:FileObject",
    "@id": "wat.paths.gz",
    ...
  },
  {
    "@type": "cr:FileSet",
    "@id": "wat_paths",
    "containedIn": "wat.paths.gz",
    ...
  },
  ...
]

We are stretching things a bit by having the container be both a gz archive and a manifest, but that seems reasonable to me. We could add some syntax to make things more explicit if needed.
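
To make the "manifest as container" idea concrete, here is a minimal Python sketch of how a consumer might expand a manifest-backed FileSet into its member files. It assumes the manifest is gzip-compressed text with one relative path per line. The helper names (`read_manifest`, `resolve_fileset`), the base URL, and the sample paths are all hypothetical and are not part of the Croissant spec or library.

```python
import gzip
import io


def read_manifest(gz_bytes: bytes) -> list[str]:
    """Return the paths listed in a gzip-compressed manifest.

    Assumption: the manifest is plain text, one relative path per line.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        # Skip blank lines and strip trailing newlines.
        return [line.strip() for line in fh if line.strip()]


def resolve_fileset(gz_bytes: bytes, base_url: str) -> list[str]:
    """Expand a manifest-backed FileSet into full URLs (hypothetical helper)."""
    prefix = base_url.rstrip("/")
    return [prefix + "/" + path.lstrip("/") for path in read_manifest(gz_bytes)]


# Made-up manifest content, standing in for a real paths.gz file.
sample = gzip.compress(
    b"segments/a/warc/file-00000.warc.gz\n"
    b"segments/a/warc/file-00001.warc.gz\n"
)

# Placeholder base URL; a real dataset would supply its own.
urls = resolve_fileset(sample, "https://example.org/data/")
```

In this reading, the gz layer is only a transport encoding, and the decompressed text is what acts as the manifest. That may be one way to phrase the distinction in the 1.1 spec.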

Status

In Progress
