Draft zip proposal for extended CCF crawl croissants + Provenance mockup #967

Description

@handecelikkanat

@benjelloun cc. @wumpus

Sharing a draft zip file as a follow-up to #961:

CCF_crawl_croissants_and_provenance_mockup.zip

Zip file includes:

  • 117 croissant drafts, one for each of our crawls.
  • 1 mockup example of a provenance citation to our crawls
    • This kind of hierarchy doesn't exist in our crawls, so we won't actually have this file in CCF; it is only a mockup for datasets that refer to CCF.

We would like feedback especially on:

  • How we are using provenance

    • e.g., I haven't used IDs because these are not referred to within the same croissant
    • but is it valid/important to use IDs to refer to other croissants?
  • How we use "distribution" with FileObjects and FileSets

    • The distributions include several FileObjects that act as manifest files, each listing the paths of the files in the related FileSet
      • e.g., warc.paths.gz FileObject pointing to
    • And one additional FileObject example that holds the data itself, so it is just a FileObject with no related FileSet
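
For readers who haven't opened the zip, here is a rough sketch of the kind of "distribution" layout described above: a manifest FileObject, a FileSet tied to it, and a standalone FileObject. It is not copied from the drafts. The crawl ID, the base URL, the `@id` naming scheme, the encoding formats, and in particular the use of `containedIn` to link the FileSet to its manifest are all assumptions made for illustration, and `{crawl_id}.domains-top-1000` is used as the standalone example only because it is named in this issue. Required fields such as `sha256` are left out for brevity.

```python
import json


def build_distribution(crawl_id: str, base_url: str) -> list[dict]:
    """Return a sketch of a Croissant "distribution" list for one crawl."""
    manifest_id = f"{crawl_id}-warc.paths.gz"  # hypothetical @id scheme

    # Manifest FileObject: a gzipped list of paths to the crawl's WARC files.
    manifest = {
        "@type": "cr:FileObject",
        "@id": manifest_id,
        "name": "warc.paths.gz",
        "contentUrl": f"{base_url}/{crawl_id}/warc.paths.gz",
        "encodingFormat": "application/gzip",
    }

    # FileSet: the WARC files themselves.
    # ASSUMPTION: linking to the manifest via containedIn is one possible
    # choice, not necessarily what the drafts in the zip do.
    warc_files = {
        "@type": "cr:FileSet",
        "@id": f"{crawl_id}-warc-files",
        "name": "WARC files",
        "encodingFormat": "application/warc",  # assumed media type
        "includes": "*.warc.gz",               # assumed glob pattern
        "containedIn": {"@id": manifest_id},
    }

    # Standalone FileObject: holds data directly, so no FileSet goes with it.
    standalone = {
        "@type": "cr:FileObject",
        "@id": f"{crawl_id}-domains-top-1000",
        "name": f"{crawl_id}.domains-top-1000",
        "contentUrl": f"{base_url}/{crawl_id}/{crawl_id}.domains-top-1000",
        "encodingFormat": "text/plain",  # assumed
    }

    return [manifest, warc_files, standalone]


if __name__ == "__main__":
    # "EXAMPLE-CRAWL" and example.org are placeholders, not real values.
    dist = build_distribution("EXAMPLE-CRAWL", "https://example.org/crawl-data")
    print(json.dumps({"distribution": dist}, indent=2))
```

The open question is whether a manifest-to-FileSet link of this kind is a valid use of "distribution", or whether the manifest and the FileSet should be related in some other way.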

Please let us know if anything looks awry!

Changes since #961:

  • New FileObject added: {crawl_id}.domains-top-1000 (crawls > 2012)
  • Switched to using MAJOR.MINOR.PATCH also for build version: 1.0.0+1.0.0
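
To make the version format concrete, below is a small sketch of how a consumer might split a string like `1.0.0+1.0.0` into its two parts. It assumes the part before `+` is the release version and the part after it is the build version, both in MAJOR.MINOR.PATCH form; the function name `parse_version` and the choice to treat the build part as optional are hypothetical.

```python
import re

# MAJOR.MINOR.PATCH, optionally followed by "+MAJOR.MINOR.PATCH" for the build.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\+(\d+)\.(\d+)\.(\d+))?$")


def parse_version(version: str):
    """Split 'X.Y.Z+A.B.C' into ((X, Y, Z), (A, B, C)); build may be None."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH[+MAJOR.MINOR.PATCH] string: {version!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    # The build groups are either all present or all absent.
    build = None
    if match.group(4) is not None:
        build = tuple(int(part) for part in match.group(4, 5, 6))
    return release, build


if __name__ == "__main__":
    print(parse_version("1.0.0+1.0.0"))
```

Keeping both parts as integer tuples means release versions can be compared with ordinary tuple comparison, while the build part can be compared or ignored separately.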
