@handecelikkanat
Contributor

This PR adds the remaining 6 RecordSets:
- wat-records
- wet-records
- robotstxt-records
- non200responses-records
- cc-index-records
- cc-index-table-records

@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@handecelikkanat
Contributor Author

@mqzhou-dev I added the remaining 6 RecordSets. Do they look alright to you?

Btw, did your previous PR (#1001) allow adding prefixes to the URLs read in by read_lines? For example, prepending the fixed prefix "http://data.commoncrawl.org/" to each line that is read? I am not sure whether this is left for future work or has already been added :)

I'm asking so that we can add this to our metadata if it can already be represented.
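
To make the question concrete, here is a minimal sketch in plain Python of the prefixing described above. It does not use Croissant's read_lines and says nothing about how PR #1001 works; the helper name `prefix_lines` and the sample paths are made up for illustration.

```python
import io

# Fixed prefix from the comment above.
PREFIX = "http://data.commoncrawl.org/"


def prefix_lines(lines, prefix=PREFIX):
    """Yield prefix + line for every non-empty line (hypothetical helper)."""
    for line in lines:
        path = line.strip()  # drop the trailing newline and surrounding whitespace
        if path:  # skip blank lines
            yield prefix + path


# Made-up relative paths standing in for the contents of a path listing.
sample = io.StringIO(
    "crawl-data/example/wat/file-0001.warc.wat.gz\n"
    "crawl-data/example/wat/file-0002.warc.wat.gz\n"
)
urls = list(prefix_lines(sample))
for url in urls:
    print(url)
```

If something equivalent can already be expressed in the metadata, each line read would map to one full URL in this way.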
