tutubo

YouTube and YouTube Music metadata library. Searches videos, channels, playlists, music tracks, albums, and artists — with per-item content-type classification and lazy channel tab iteration. No pytube dependency.

Install

pip install tutubo
# Downloading requires yt-dlp (optional):
pip install yt-dlp
# Stealth transport — browser-fingerprinted TLS via curl_cffi (optional):
pip install tutubo[stealth]

30-second quickstart

from tutubo import YoutubeSearch

# Search and classify
for v in YoutubeSearch("rob zombie").iterate_videos(max_res=5):
    print(v.title, v.length, v.published_time)
    print("  content_type:", v.content_type)
    print("  badges:", v.badges)

# Intent-focused factory — appends "full movie" to the query
for v in YoutubeSearch.for_movies("blade runner").iterate_movies(max_res=3):
    print(v.title, v.length)

High-level shape

| Class | What it gives you |
| --- | --- |
| YoutubeSearch | Searches youtube.com; yields VideoPreview, ChannelPreview, PlaylistPreview, mixes, related queries |
| YoutubeMusicSearch | Searches music.youtube.com; yields MusicTrack, MusicAlbum, MusicArtist, MusicPlaylist, MusicVideo |
| Channel | Fetches a channel page; exposes .videos, .shorts, .streams, .live, .playlists, .podcasts |
| Playlist | Fetches a playlist page; exposes .videos (lazy-paginated generator) |
| ContentType | 30-value enum inferred from title, duration, badges, and channel tags |

Key invariants

  • A Video from a channel tab has view_count as a human string (e.g. "31K views") and published_time as a relative string (e.g. "5 hours ago"). It has no length field — duration is not available from channel-page renderers.
  • A VideoPreview from a search result has length as seconds (int) and view_count as an exact integer.
  • MusicAlbum is a subclass of MusicPlaylist. Both expose .tracks as a list[MusicTrack], .track_count, .year, and .playlist_url.
  • Channel.live — fetches /@handle/live (a watch-page redirect); returns one Video or None — the currently on-air stream.
  • Channel.streams — fetches /@handle/streams (a browse tab); returns a lazy list of all livestream videos (past + current).
  • VideoPreview.content_type does not use channel tags (search results don't include them). Video.content_type (from Channel.videos) does, giving better accuracy for ambiguous titles.
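
Because view_count is a string on channel-tab items and an integer on search results, code that mixes both sources may want to normalise it first. The helper below is a minimal sketch, not part of tutubo: the name parse_view_count is hypothetical, and it assumes English-style strings such as "31K views" with K/M/B suffixes. The result for abbreviated strings is only approximate, since the page itself rounds the number.

```python
import re

# Hypothetical helper, not part of tutubo. Assumes English-style labels
# such as "31K views" or "1,234 views".
_SUFFIXES = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_VIEWS_RE = re.compile(r"^\s*([\d.,]+)\s*([KMB]?)\b", re.IGNORECASE)


def parse_view_count(value):
    """Return an int for ints and for strings like '31K views', else None."""
    if isinstance(value, int):
        return value  # search results already carry an exact integer
    if not isinstance(value, str):
        return None
    match = _VIEWS_RE.match(value)
    if match is None:
        return None  # e.g. "No views"
    number, suffix = match.groups()
    try:
        # Drop thousands separators, then scale by the suffix multiplier.
        return round(float(number.replace(",", "")) * _SUFFIXES[suffix.upper()])
    except ValueError:
        return None  # malformed number such as "1.2.3"


print(parse_view_count("31K views"))  # 31000
```

Returning None rather than raising keeps the helper safe to call on every item in a mixed result set.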

Factory classmethods — 24 intent-focused search shortcuts

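Every factory appends a keyword phrase to the query to bias YouTube's ranking toward that kind of content (for_movies, for example, appends "full movie"). Pair each factory with the matching typed iterator to enforce the content type on the results: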

from tutubo import YoutubeSearch

for v in YoutubeSearch.for_concerts("black sabbath").iterate_concerts(max_res=5):
    print(v.title, v.length)

for v in YoutubeSearch.for_podcasts("lex fridman").iterate_podcasts():
    print(v.title)

All 24 factories: for_movies, for_short_films, for_trailers, for_documentaries, for_behind_the_scenes, for_anime, for_tv_episodes, for_audiobooks, for_audio_dramas, for_podcasts, for_stand_up, for_interviews, for_lectures, for_concerts, for_news, for_live_news, for_sport, for_gaming, for_tutorials, for_reactions, for_compilations, for_kids, for_music_videos, for_music_audio. See docs/search.md.
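
The factory idea can be illustrated with a small stand-in class. This is a sketch of the pattern only, not tutubo's code: SketchSearch and _with_suffix are hypothetical names, and joining with a single space and skipping an already-present phrase are assumptions. The only phrase taken from this README is "full movie" for for_movies.

```python
class SketchSearch:
    """Illustrative stand-in; this is not tutubo's YoutubeSearch."""

    def __init__(self, query):
        self.query = query

    @classmethod
    def _with_suffix(cls, query, suffix):
        # Assumption: the phrase is appended after a single space and is
        # not duplicated when the caller already typed it.
        query = query.strip()
        if query.lower().endswith(suffix.lower()):
            return cls(query)
        return cls(f"{query} {suffix}")

    @classmethod
    def for_movies(cls, query):
        # "full movie" is the phrase the quickstart documents for for_movies.
        return cls._with_suffix(query, "full movie")


print(SketchSearch.for_movies("blade runner").query)  # blade runner full movie
```

Using classmethods keeps the plain constructor available for unmodified queries while giving each intent a discoverable name.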

YouTube Music search

YoutubeMusicSearch is a separate class from YoutubeSearch. It queries the YouTube Music API and returns structured music objects:

from tutubo import YoutubeMusicSearch

s = YoutubeMusicSearch("black sabbath paranoid")

for track in s.iterate_tracks(max_res=5):
    print(track.title, track.artist, track.length)
    print("  audio_only:", track.is_audio_only, "| music_video:", track.is_music_video)

for album in s.iterate_albums(max_res=3):
    print(album.title, album.artist, album.year, f"({album.track_count} tracks)")
    for t in album.tracks:
        print(f"  {t.track_number}. {t.title} [{t.length}s]")

for artist in s.iterate_artists(max_res=2):
    print(artist.name, artist.subscribers)
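
The album loop above prints each track length in seconds. A small sketch for turning those values into a readable runtime follows. It assumes length is a number of seconds and may be missing on some tracks. format_duration and total_runtime are hypothetical helpers, and the demo objects are stand-ins with made-up lengths rather than real MusicTrack instances.

```python
from types import SimpleNamespace


def format_duration(seconds):
    """Format a duration in seconds as M:SS or H:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def total_runtime(tracks):
    """Sum track lengths, skipping tracks whose length is missing."""
    return sum(t.length for t in tracks if getattr(t, "length", None))


# Stand-in objects with made-up lengths; real code would pass album.tracks.
demo_tracks = [
    SimpleNamespace(length=170),
    SimpleNamespace(length=None),
    SimpleNamespace(length=3500),
]
print(format_duration(total_runtime(demo_tracks)))  # 1:01:10
```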

Channel metadata and tab iteration

from tutubo import Channel

c = Channel("https://www.youtube.com/@Metallica")
print(c.channel_name, c.subscribers, c.video_count_label)
print("keywords:", c.keywords[:5])
print("rss:", c.rss_url)

# Regular uploads
for video in c.videos:
    print(video.title, video.view_count, video.published_time)
    print("  content_type:", video.content_type)

# Currently on-air stream — one Video or None (reads /@handle/live)
live = c.live
if live:
    print("LIVE:", live.title, live.watch_url)

# Full stream archive — paginated list from /@handle/streams
for stream in c.streams:
    print(stream.title, stream.is_live)

# Podcast shows
c2 = Channel("https://www.youtube.com/@TheDissenterRL")
for pod in c2.podcasts:
    print(pod.title, pod.episode_count)
    pl = pod.get()
    for ep in pl.videos:
        print("  episode:", ep.watch_url)
        break
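
Since the tabs are iterated lazily, itertools.islice is a convenient way to cap how much gets consumed, for example islice(c.videos, 4). The sketch below uses a stand-in generator (fake_tab, hypothetical) instead of a real channel so it runs offline. It assumes that a tab fetches one page at a time as iteration proceeds; tutubo's actual paging granularity may differ.

```python
from itertools import islice

pages_fetched = []


def fake_tab(page_size=3, pages=4):
    """Stand-in for a lazily paginated tab; records each simulated fetch."""
    for page in range(pages):
        pages_fetched.append(page)  # a real tab would issue a request here
        for i in range(page_size):
            yield f"video-{page * page_size + i}"


# Take only the first four items; later pages are never requested.
first_four = list(islice(fake_tab(), 4))
print(first_four)     # ['video-0', 'video-1', 'video-2', 'video-3']
print(pages_fetched)  # [0, 1]
```

Compared with a manual counter and break, islice keeps the loop body free of bookkeeping and works on any iterable.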

mediavocab integration

mediavocab is a hard runtime dependency. It provides ContentType, classify_video(), parse_title(), extract_tags(), and the Work / Release / Entity data model.

# ContentType and classifiers re-exported from tutubo for convenience
from tutubo import ContentType, classify_video, parse_title, extract_tags

# Convert a search result to typed mediavocab objects
from tutubo import YoutubeSearch

for v in YoutubeSearch.for_movies("nosferatu").iterate_movies(max_res=3):
    work = v.to_work()       # mediavocab.Work
    release = v.to_release() # mediavocab.Release
    print(work.title, work.year, work.media_type)
    print(release.resolution, release.accessibility)
    print(release.external_ids)   # {"youtube": "<video_id>"}

Badge → resolution mapping: "4K" → "2160p", "8K" → "4320p", "HD" → "1080p". CC badge → AccessibilityTrack(kind="captions"). See docs/mediavocab.md.
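
The badge mapping can be pictured as a plain lookup table. The values below come from the mapping described above, but the lookup logic is only an illustrative sketch, not the implementation used by tutubo or mediavocab. resolution_from_badges and has_captions are hypothetical names, and preferring the highest resolution when several badges are present is an assumption.

```python
# Values taken from the documented mapping; ordered highest first.
# The lookup logic is illustrative only, not the library's implementation.
BADGE_TO_RESOLUTION = {"8K": "4320p", "4K": "2160p", "HD": "1080p"}


def resolution_from_badges(badges):
    """Return the highest resolution implied by the badges, or None."""
    present = {str(b).strip().upper() for b in badges or ()}
    for badge, resolution in BADGE_TO_RESOLUTION.items():
        if badge in present:
            return resolution
    return None


def has_captions(badges):
    """True when a CC badge is present."""
    return any(str(b).strip().upper() == "CC" for b in badges or ())


print(resolution_from_badges(["HD", "4K"]))  # 2160p
print(has_captions(["4K", "CC"]))            # True
```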

Pluggable session and TUTUBO_TRANSPORT

Channel and playlist HTML pages are fetched via a pluggable session (requests.Session by default). Set TUTUBO_TRANSPORT=curl_cffi to use browser-fingerprinted TLS:

export TUTUBO_TRANSPORT=curl_cffi   # requires: pip install tutubo[stealth]

Or inject a session object directly:

from curl_cffi import requests as cffi_requests
from tutubo.channel import Channel

ch = Channel("https://www.youtube.com/@LinusTechTips",
             session=cffi_requests.Session(impersonate="chrome"))

Note: tutubo._innertube._post (the search path) uses stdlib urllib.request and is not affected by the transport setting. See docs/transport.md.
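
For applications that want to make the same choice explicitly and pass the result through the session argument, a minimal sketch could look like this. select_transport and build_session are hypothetical helpers, not tutubo functions. Treating the value case-insensitively and falling back to requests for any other value are assumptions.

```python
import os


def select_transport(env=None):
    """Return 'curl_cffi' or 'requests' based on TUTUBO_TRANSPORT."""
    env = os.environ if env is None else env
    # Assumption: compare case-insensitively and ignore surrounding spaces.
    value = env.get("TUTUBO_TRANSPORT", "").strip().lower()
    return "curl_cffi" if value == "curl_cffi" else "requests"


def build_session(env=None):
    """Create a session object for the selected transport (lazy imports)."""
    if select_transport(env) == "curl_cffi":
        # Requires the optional extra: pip install tutubo[stealth]
        from curl_cffi import requests as cffi_requests
        return cffi_requests.Session(impersonate="chrome")
    import requests
    return requests.Session()


print(select_transport({"TUTUBO_TRANSPORT": "curl_cffi"}))  # curl_cffi
```

The result can then be injected as Channel(url, session=build_session()). The imports are deferred so that curl_cffi is only needed when it is actually selected.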

Configuration

| Env var | Effect |
| --- | --- |
| TUTUBO_TRANSPORT | Set to curl_cffi to enable stealth transport for channel/playlist fetches |
| MEDIAVOCAB_LANG | Default language for classification (e.g. es-es, fr-fr); default en-us |
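
Both variables can also be set from Python, which is handy in tests. The context manager below is a generic sketch (temp_env is a hypothetical name) that sets variables and restores the previous state on exit. Whether tutubo and mediavocab read these variables at import time or at call time is not stated here, so the safe assumption is to set them before importing either package.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_env(**overrides):
    """Temporarily set environment variables, restoring them on exit."""
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old


with temp_env(MEDIAVOCAB_LANG="es-es", TUTUBO_TRANSPORT="curl_cffi"):
    print(os.environ["MEDIAVOCAB_LANG"])  # es-es
```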

Examples

| File | What it shows |
| --- | --- |
| examples/01_quickstart.py | Search, result types, content_type, dict interface |
| examples/02_search_factories.py | 24 intent-focused factory methods + typed iterators |
| examples/03_channel.py | Channel metadata, videos tab, Channel.live vs Channel.streams |
| examples/04_playlist.py | Channel playlists and direct playlist iteration |
| examples/05_podcasts.py | Podcast shows, episode listing, is_podcast=True classification |
| examples/06_music_search.py | YoutubeMusicSearch: tracks, artists, community playlists |
| examples/07_music_album.py | MusicAlbum with full track listing |
| examples/08_fanedits.py | Fan-edit detection via parse_title() + VariantKind.FANEDIT |
| examples/09_to_mediavocab.py | All mediavocab fields from to_work() / to_release() |
| examples/10_custom_session.py | Pluggable session, TUTUBO_TRANSPORT, curl_cffi injection |
| examples/11_pipeline.py | Full parse_title → classify → to_routing() → Signals pipeline |

Documentation

License

Apache 2.0

About

Lightweight YouTube scraper that emits typed mediavocab Releases — channels, videos, music, podcasts, livestreams, IPTV
