Skip to content

Create post-processing QA for scrape-meta #196

@stucka

Description

@stucka

Additional details

The ca_sacramento_pd scraper from #36 did not automatically parse out YouTube playlists, and the asset download functionality doesn't do it either (plus, we want to be in control of metadata and have every asset matched to a metadata item, right?).

I took a newer copy of pytube called pytubefix and knocked out a parser that grabs the metadata. See #195 on pytube.

I'm suggesting the CLI should handle YouTube playlist parsing after each regular metadata scrape; if there's at least one URL that matches a fingerprint, the whole file gets reparsed before concluding.

We should also evaluate metadata for Vimeo playlists, per #193 and possibly #110 .

Unrelated suggestion: Post-processing should also check for overlapping asset_url entries, which suggests a problem with the original code or data.

The YouTube sample parser here has some good and some bad.

Good: One line decides if it's only outputting metadata for YouTube playlist links, or mixing it up with the entire thing.

Good: It works!

Bad: It's slow. As near as I can tell, there's not a method to pull metadata from playlists per se. So in this case I think we're looking at using the playlist data to get at the individual videos, and then from there calling to get the watch URL, length, date published and maybe another piece or two of metadata.

Bad: It's slow even without a throttle function built in.

So Sacramento PD had something like 83 playlists with about 4,000 files to parse ... and that took 90 minutes.

Related pull request(s)

#39 Sacramento PD scraper

#110 Vimeo handling

#193 Vimeo playlists

#200 repeat entries

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions