
Draft zip proposal for extended CCF crawl croissants + Provenance mockup #967

@handecelikkanat

@benjelloun cc. @wumpus

Sharing a draft zip file as a follow-up to #961:

CCF_crawl_croissants_and_provenance_mockup.zip

Zip file includes:

  • 117 croissant drafts, one for each of our crawls.
  • 1 mockup example of a provenance citation to our crawls.
    • This kind of hierarchy doesn't exist in our crawls, so we won't actually have this file in CCF; it is a mockup for datasets that refer to CCF.

We would like feedback especially on:

  • How we are using provenance

    • e.g., I haven't used IDs because they are not referred to within the same croissant
    • but is it valid/important to use IDs to refer to other croissants?
  • How we are using "distribution" with FileObjects and FileSets

    • The drafts include several FileObjects that act as manifest files, each listing the paths of the files in the related FileSet
      • e.g., warc.paths.gz FileObject pointing to
    • There is also one additional FileObject example that holds the data itself, so it is just a FileObject
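
To make the layout described above easier to discuss, here is a minimal Python sketch of what such a "distribution" list might look like. It is not taken from the zip file: the `build_distribution` helper, the base URL, the `includes` glob, the `@id` values and the `example-data-file` entry are all assumptions for illustration, and the output has not been validated against the Croissant spec. How the FileSet should point back to its manifest is deliberately left out, since that is part of what we are asking about.

```python
import json


def build_distribution(crawl_id: str) -> list[dict]:
    """Return an illustrative Croissant-style distribution for one crawl.

    All URLs, ids and glob patterns here are assumptions, not the
    values used in the actual drafts.
    """
    base = f"https://data.commoncrawl.org/crawl-data/{crawl_id}"  # assumed layout

    # Manifest FileObject: a small file listing the paths of the WARC files.
    manifest = {
        "@type": "cr:FileObject",
        "@id": "warc.paths.gz",
        "name": "warc.paths.gz",
        "contentUrl": f"{base}/warc.paths.gz",
        "encodingFormat": "application/gzip",
    }

    # FileSet: the (many) files whose paths appear in the manifest.
    # The link from this FileSet to the manifest is intentionally omitted.
    warc_files = {
        "@type": "cr:FileSet",
        "@id": "warc-files",
        "name": "warc-files",
        "encodingFormat": "application/warc",
        "includes": f"crawl-data/{crawl_id}/segments/*/warc/*.warc.gz",  # assumed glob
    }

    # Data-bearing FileObject: holds data itself, so it has no related FileSet.
    data_file = {
        "@type": "cr:FileObject",
        "@id": "example-data-file",  # hypothetical name
        "name": "example-data-file",
        "contentUrl": f"{base}/example-data-file",  # hypothetical URL
        "encodingFormat": "text/plain",
    }

    return [manifest, warc_files, data_file]


if __name__ == "__main__":
    print(json.dumps(build_distribution("CC-MAIN-2024-10"), indent=2))
```

Running the script prints the three entries as JSON, which can be compared side by side with the `distribution` section of any of the drafts.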

Please let us know if anything looks awry!

Changes since #961:

  • New FileObject added: {crawl_id}.domains-top-1000 (crawls > 2012)
  • Switched to using MAJOR.MINOR.PATCH for the build version as well: 1.0.0+1.0.0
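
For anyone consuming the version string programmatically, the following is a rough Python sketch of how `1.0.0+1.0.0` could be split into its two MAJOR.MINOR.PATCH parts. The `Version` type and `parse_version` function are hypothetical names, and the sketch assumes that both sides of the `+` always have exactly three numeric components.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


# Three dot-separated integers, used on both sides of the "+".
_PART = r"(\d+)\.(\d+)\.(\d+)"
_PATTERN = re.compile(rf"^{_PART}\+{_PART}$")


def parse_version(text: str) -> tuple[Version, Version]:
    """Split 'X.Y.Z+A.B.C' into (version, build version)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in MAJOR.MINOR.PATCH+MAJOR.MINOR.PATCH form: {text!r}")
    numbers = [int(group) for group in match.groups()]
    return Version(*numbers[:3]), Version(*numbers[3:])


if __name__ == "__main__":
    version, build = parse_version("1.0.0+1.0.0")
    print(version, build)
```

Note that under Semantic Versioning the part after `+` is build metadata and is ignored when comparing precedence, so tools that follow SemVer strictly will treat `1.0.0+1.0.0` and `1.0.0+1.0.1` as the same version.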
