Create post-processing QA for scrape-meta #196

## Additional details

The ca_sacramento_pd scraper from #36 did not automatically parse out YouTube playlists, and the asset download functionality doesn't expand them either. (Plus, we want to be in control of the metadata and have every asset matched to a metadata item, right?)

I used `pytubefix`, a newer fork of `pytube`, to knock out a parser that grabs the metadata. See #195 on pytube.

**I'm suggesting the CLI handle YouTube playlist parsing after each regular metadata scrape: if at least one URL matches a playlist fingerprint, the whole metadata file gets reparsed before the scrape concludes.**
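To make the fingerprint idea concrete, here is a minimal sketch of how the check might look. The function names (`is_youtube_playlist`, `needs_playlist_pass`) and the host list are hypothetical, and it assumes each metadata item is a dict with an `asset_url` key. It only matches `/playlist?list=...` pages; whether `watch?v=...&list=...` links should also count is an open question.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical fingerprint: hosts we treat as YouTube.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def is_youtube_playlist(url: str) -> bool:
    """Return True when the URL looks like a YouTube playlist page."""
    parsed = urlparse(url)
    if (parsed.hostname or "").lower() not in YOUTUBE_HOSTS:
        return False
    # Playlist pages live at /playlist and carry a non-empty `list` parameter.
    return parsed.path.rstrip("/") == "/playlist" and "list" in parse_qs(parsed.query)


def needs_playlist_pass(metadata: list[dict]) -> bool:
    """Return True if at least one metadata item points at a playlist."""
    return any(is_youtube_playlist(item.get("asset_url", "")) for item in metadata)
```

The CLI could call something like `needs_playlist_pass` once the regular scrape finishes and only run the slower playlist expansion when it returns `True`.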

**We should also evaluate metadata for Vimeo playlists, per #193 and possibly #110.**

**Unrelated suggestion: Post-processing should also check for duplicate `asset_url` entries, which would suggest a problem with the original code or data.**
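A rough sketch of that duplicate check, assuming the metadata is a list of dicts keyed by `asset_url`. The function name `find_duplicate_asset_urls` is made up for illustration.

```python
from collections import Counter


def find_duplicate_asset_urls(metadata: list[dict]) -> dict[str, int]:
    """Map each asset_url that appears more than once to its count."""
    counts = Counter(
        item["asset_url"] for item in metadata if item.get("asset_url")
    )
    return {url: n for url, n in counts.items() if n > 1}
```

An empty result means no repeats were found. Anything else is worth a warning so someone can look at the scraper or the source data.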

The YouTube sample parser here has some good and some bad.

Good: One line decides whether it outputs only the metadata for YouTube playlist links or merges that with the rest of the metadata file.

Good: It works!

Bad: It's slow. As near as I can tell, there's no method to pull video metadata from the playlist itself. So I think we're looking at using the playlist data to get at the individual videos, and then making a call for each video to get the watch URL, length, date published and maybe another piece or two of metadata.
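The per-video pattern described above might look roughly like this. This is a sketch and not the sample parser itself: `expand_playlist` and every output key other than `asset_url` are hypothetical, and it assumes `pytubefix` keeps the `Playlist.videos`, `watch_url`, `title`, `length` and `publish_date` attributes inherited from `pytube`. The class is injectable so the logic can be exercised without network access.

```python
def expand_playlist(playlist_url, playlist_cls=None):
    """Return one metadata dict per video in a YouTube playlist."""
    if playlist_cls is None:
        # Imported lazily so the function can be tested with a stand-in class.
        from pytubefix import Playlist
        playlist_cls = Playlist

    playlist = playlist_cls(playlist_url)
    rows = []
    # Each video's details are fetched separately, which is why this is slow.
    for video in playlist.videos:
        published = video.publish_date
        rows.append(
            {
                "asset_url": video.watch_url,
                "title": video.title,  # hypothetical key
                "length_seconds": video.length,  # hypothetical key
                "publish_date": published.isoformat() if published else None,
                "parent_page": playlist_url,  # hypothetical key
            }
        )
    return rows
```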

Bad: It's slow even with no throttle function built in, so adding a delay between requests would only make it slower.

For scale: Sacramento PD had something like 83 playlists with about 4,000 files to parse, and that took 90 minutes.
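Back-of-the-envelope numbers for that run, using the approximate figures above (so treat the results as rough estimates):

```python
playlists = 83
videos = 4000
elapsed_minutes = 90

# Average wall-clock cost of one video's metadata lookup.
seconds_per_video = elapsed_minutes * 60 / videos  # 1.35 seconds

# Average playlist size.
videos_per_playlist = videos / playlists  # roughly 48

print(f"{seconds_per_video:.2f} s per video, ~{videos_per_playlist:.0f} videos per playlist")
```

So each video costs a bit over a second, and an average playlist takes around a minute to expand.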

## Related pull request(s)

#39 Sacramento PD scraper

#110 Vimeo handling

#193 Vimeo playlists

#200 repeat entries
