
jzhou77/apple-podcast-transcript-extractor

 
 


Apple Podcasts Transcript Extractor

This script extracts transcripts from the Podcasts app on macOS.

Installation

  1. Clone the repository
  2. Install dependencies: npm install

Usage

Note: You need to download the desired podcast episode(s) before you can extract the transcript.

Batch Mode

To process all TTML files in your Apple Podcasts cache:

node extractTranscript.js [--timestamps]

This will:

  1. Find all TTML files in ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML
  2. Create a ./transcripts directory
  3. Save each transcript as ./transcripts/<podcast_name> - <episode_title>.txt

Note: the podcast name and episode title are read from the SQLite database at ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite.
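
As a rough illustration of step 1, the sketch below walks the cache directory recursively and collects every file ending in .ttml. The helper name findTtmlFiles is hypothetical and is not taken from extractTranscript.js; the script itself may do this differently.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Cache location quoted in this README; adjust it if your setup differs.
const TTML_DIR = path.join(
  os.homedir(),
  'Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML'
);

// Hypothetical helper: recursively collect every file whose name ends in ".ttml".
function findTtmlFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findTtmlFiles(fullPath));
    } else if (entry.isFile() && entry.name.endsWith('.ttml')) {
      found.push(fullPath);
    }
  }
  return found.sort();
}

// Only list files when the cache directory actually exists on this machine.
if (fs.existsSync(TTML_DIR)) {
  console.log(findTtmlFiles(TTML_DIR).join('\n'));
}
```

The results are sorted only to make the listing stable between runs.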

Sample output is:

$ ls ./transcripts
Secret Leaders - Hermann Hauser- The Man Who Saved Apple from Bankruptcy.txt
Secret Leaders - How to Earn Millions by charging $0.txt
Secret Leaders - I built a $2bn Company By Paying Everybody The Same As Me - Nicola Kilner.txt
Secret Leaders - Spencer Matthews- How pushing my body to the limit changed my life.txt
The BugBash Podcast - Ergonomics, reliability, durability.txt
The BugBash Podcast - Every map is wrong, but we made one anyway.txt
The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt
The Intelligence from The Economist - Against the clock- Gaza peace talks.txt
The Intelligence from The Economist - All the president’s money men- the Trumponomics team.txt
The Intelligence from The Economist - Billions of voices heard- a year of elections.txt
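
The file names above suggest that characters such as ":" are replaced with "-" before saving. The sketch below shows one way to build such a name. The helper buildTranscriptFileName and its replacement rules are assumptions inferred from the listing, not the script's actual logic.

```javascript
// Hypothetical helper. The replacement rules are assumptions inferred from the
// sample listing (":" appears to become "-"), not taken from the script.
function buildTranscriptFileName(podcastTitle, episodeTitle) {
  // Replace ":" and "/" with "-", then collapse runs of whitespace.
  const clean = (s) => s.replace(/[:\/]/g, '-').replace(/\s+/g, ' ').trim();
  return `${clean(podcastTitle)} - ${clean(episodeTitle)}.txt`;
}

// The episode title here is an assumed original form, used only for illustration.
console.log(buildTranscriptFileName(
  'The BugBash Podcast',
  'FoundationDB: From Idea to Apple Acquisition'
));
// prints "The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt"
```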

Timestamps Option

Add --timestamps to include timestamps for each paragraph in the format [HH:MM:SS].

For example:

[00:01:23] This is what the speaker said
[00:01:25] And then they said this
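
A minimal sketch of producing the [HH:MM:SS] prefix, assuming the paragraph start time is already available as a number of seconds. How the script actually reads times from the TTML file is not shown here, and formatTimestamp is a hypothetical name.

```javascript
// Hypothetical helper: format a number of seconds as "[HH:MM:SS]".
// Fractional seconds are dropped, not rounded.
function formatTimestamp(totalSeconds) {
  const whole = Math.floor(totalSeconds);
  const hours = Math.floor(whole / 3600);
  const minutes = Math.floor((whole % 3600) / 60);
  const seconds = whole % 60;
  const pad = (n) => String(n).padStart(2, '0');
  return `[${pad(hours)}:${pad(minutes)}:${pad(seconds)}]`;
}

console.log(formatTimestamp(83.4)); // prints "[00:01:23]"
```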

Single File Mode

node extractTranscript.js <input_file> <output_file> [--timestamps]

Where does the input file come from?

The input file is the transcript_<long_episode_id>.ttml file found in a nested subdirectory of ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent<short_episode_id>.

How do I get the episode IDs?

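I don't know how these IDs are generated by the Podcasts app. However, the ZTRANSCRIPTIDENTIFIER column of the ZMTEPISODE table records the transcript path for each episode, so the IDs can be looked up there (see SQLite Schemas below).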

SQLite Schemas

To dump the SQLite schema of the Apple Podcasts library database:

sqlite3 /Users/<username>/Library/Group\ Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite .schema > schema.sql

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE limit 5;
PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml
PodcastContent122/v4/f4/1a/aa/f41aaa81-24f0-b259-9870-9b1e48e676f6/transcript_1000427522064.ttml

sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION,
   ...>          p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY
   ...>   FROM ZMTEPISODE e
   ...>   JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
   ...>   WHERE e.ZTRANSCRIPTIDENTIFIER = 'PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml';
Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History

The schema.sql file contains the SQLite schemas for the Apple Podcasts cache.
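
The same join can be run from Node by shelling out to the sqlite3 command-line tool. The following is a sketch only: buildMetadataQuery and lookupEpisode are hypothetical names, and the -json output mode assumes a reasonably recent sqlite3 binary on the PATH.

```javascript
const { execFileSync } = require('node:child_process');

// Build the join used in the session above. Single quotes inside the
// identifier are doubled so it stays a valid SQL string literal.
function buildMetadataQuery(transcriptIdentifier) {
  const literal = transcriptIdentifier.replace(/'/g, "''");
  return (
    'SELECT e.ZTITLE AS episode_title, e.ZPUBDATE, e.ZDURATION, ' +
    'p.ZTITLE AS podcast_title, p.ZAUTHOR, p.ZCATEGORY ' +
    'FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID ' +
    `WHERE e.ZTRANSCRIPTIDENTIFIER = '${literal}';`
  );
}

// Hypothetical wrapper: opens the database read-only through the sqlite3 CLI
// and returns the matching rows as an array of objects (empty if none match).
function lookupEpisode(dbPath, transcriptIdentifier) {
  const out = execFileSync(
    'sqlite3',
    ['-readonly', '-json', dbPath, buildMetadataQuery(transcriptIdentifier)],
    { encoding: 'utf8' }
  );
  return out.trim() ? JSON.parse(out) : [];
}
```

Opening the database read-only avoids any chance of modifying the library that the Podcasts app is using.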

Key Tables for Metadata:

ZMTEPISODE (episodes):

  • ZTITLE - Episode title
  • ZITEMDESCRIPTION - Episode description
  • ZPUBDATE - Publication date, stored as a number (553291535 in the example above)
  • ZDURATION - Episode duration in seconds (16097.0 in the example above)
  • ZPODCASTUUID - Links to podcast
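  • ZTRANSCRIPTIDENTIFIER - Path of the TTML file relative to the TTML cache directory, for example PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml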

ZMTPODCAST (podcasts/shows):

  • ZTITLE - Podcast title
  • ZAUTHOR - Podcast author
  • ZCATEGORY - Podcast category
  • ZUUID - Podcast UUID (matches ZPODCASTUUID in episodes)

ZMTCHANNEL (channels):

  • ZNAME - Channel name
  • Linked via ZCHANNEL field in ZMTPODCAST
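
The ZPUBDATE values in the examples (such as 553291535) appear to be seconds counted from 2001-01-01 00:00:00 UTC, the reference date Core Data uses, and not from the Unix epoch. That interpretation is an assumption, not something documented by the Podcasts app. Under it, a value can be converted to a JavaScript Date like this (coreDataToDate is a hypothetical helper):

```javascript
// Seconds between 1970-01-01 and 2001-01-01 (UTC).
const CORE_DATA_EPOCH_OFFSET = 978307200;

// Hypothetical helper. Assumes zpubdate is seconds since 2001-01-01 UTC.
function coreDataToDate(zpubdate) {
  return new Date((zpubdate + CORE_DATA_EPOCH_OFFSET) * 1000);
}

console.log(coreDataToDate(553291535).toISOString());
// prints "2018-07-14T20:05:35.000Z"
```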

For the file ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml, take the path relative to the TTML directory and drop the trailing -1000740205141.ttml. The result, PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml, is the ZMTEPISODE.ZTRANSCRIPTIDENTIFIER. The metadata can then be looked up as follows:

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE where ZTRANSCRIPTIDENTIFIER like '%02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml%';
PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION, p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID WHERE e.ZTRANSCRIPTIDENTIFIER = 'PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml';
Transitional injustice: Syria one year after Assad|786884509|1486.0|The Intelligence from The Economist|The Economist|Daily News
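
Deriving the identifier from an on-disk path can be sketched as below. The helper toTranscriptIdentifier is hypothetical, and it assumes the only differences are the leading cache directory and an optional -<id>.ttml suffix, as in the example above.

```javascript
// Hypothetical helper. Assumes the on-disk name may carry an extra
// "-<id>.ttml" suffix after "transcript_<id>.ttml", as in the example above.
function toTranscriptIdentifier(filePath) {
  const marker = '/Assets/TTML/';
  const start = filePath.indexOf(marker);
  // Keep only the part after the TTML cache directory, if the marker is present.
  const relative = start === -1 ? filePath : filePath.slice(start + marker.length);
  // Drop the duplicated "-<id>.ttml" suffix when it is there.
  return relative.replace(/(transcript_\d+\.ttml)-\d+\.ttml$/, '$1');
}

console.log(toTranscriptIdentifier(
  '~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/' +
  'PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/' +
  'transcript_1000740205141.ttml-1000740205141.ttml'
));
// prints "PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml"
```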
