This script extracts transcripts from the Podcasts app on macOS.
- Clone the repository
- Install dependencies:
npm install
Note: You need to download the desired podcast episode(s) before you can extract the transcript.
To process all TTML files in your Apple Podcasts cache:
node extractTranscript.js [--timestamps]
This will:
- Find all TTML files in ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML
- Create a ./transcripts directory
- Save each transcript as ./transcripts/<podcast_name> - <episode_title>.txt
Note: The podcast name and episode title are read from the SQLite database at ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite.
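The "find all TTML files" step can be sketched as a recursive directory walk. This is a minimal illustration, not necessarily how extractTranscript.js does it: the helper name findTtmlFiles and the constant TTML_DIR are hypothetical, and the only thing taken from this README is the cache path.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Cache location quoted in this README.
const TTML_DIR = path.join(
  os.homedir(),
  "Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML"
);

// Hypothetical helper: recursively collect every file whose name ends in ".ttml".
function findTtmlFiles(dir) {
  // A missing directory (e.g. no episodes downloaded yet) yields an empty list.
  if (!fs.existsSync(dir)) return [];
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findTtmlFiles(fullPath)); // descend into PodcastContent<id>/v4/...
    } else if (entry.isFile() && entry.name.endsWith(".ttml")) {
      found.push(fullPath);
    }
  }
  return found;
}
```

Calling findTtmlFiles(TTML_DIR) would return the absolute paths of the cached transcripts, or an empty array if the cache directory does not exist.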
Sample output is:
$ ls ./transcripts
Secret Leaders - Hermann Hauser- The Man Who Saved Apple from Bankruptcy.txt
Secret Leaders - How to Earn Millions by charging $0.txt
Secret Leaders - I built a $2bn Company By Paying Everybody The Same As Me - Nicola Kilner.txt
Secret Leaders - Spencer Matthews- How pushing my body to the limit changed my life.txt
The BugBash Podcast - Ergonomics, reliability, durability.txt
The BugBash Podcast - Every map is wrong, but we made one anyway.txt
The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt
The Intelligence from The Economist - Against the clock- Gaza peace talks.txt
The Intelligence from The Economist - All the president’s money men- the Trumponomics team.txt
The Intelligence from The Economist - Billions of voices heard- a year of elections.txt
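The file names in the listing above join the podcast and episode titles with " - ", and some titles contain "- " where a colon might be expected. One way such names could be produced is sketched below. Both the helper transcriptFileName and the replacement rule (":" and "/" become "-") are assumptions inferred from the sample names, not confirmed behaviour of the script.

```javascript
// Hypothetical helper: build an output file name from the two titles.
// Assumption: ":" and "/" are replaced with "-" so the name is safe on macOS.
function transcriptFileName(podcastTitle, episodeTitle) {
  const clean = (s) => s.replace(/[:\/]/g, "-").trim();
  return `${clean(podcastTitle)} - ${clean(episodeTitle)}.txt`;
}

// Usage example with an assumed original episode title.
console.log(transcriptFileName("The BugBash Podcast", "FoundationDB: From Idea to Apple Acquisition"));
```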
Add --timestamps to include timestamps for each paragraph in the format [HH:MM:SS].
For example:
[00:01:23] This is what the speaker said
[00:01:25] And then they said this
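The [HH:MM:SS] prefix can be derived from an offset in seconds. The sketch below assumes the paragraph start time is already available as a number of seconds; how the script actually reads times from the TTML is not covered here, and formatTimestamp is a hypothetical name.

```javascript
// Hypothetical helper: format an offset in seconds as "[HH:MM:SS]".
function formatTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds); // drop fractional seconds
  const pad = (n) => String(n).padStart(2, "0");
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  return `[${pad(hours)}:${pad(minutes)}:${pad(seconds)}]`;
}

console.log(formatTimestamp(83), "This is what the speaker said");
// prints "[00:01:23] This is what the speaker said"
```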
To process a single TTML file:
node extractTranscript.js <input_file> <output_file> [--timestamps]
The input file is the transcript_<long_episode_id>.ttml file found under the ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent<short_episode_id> directory.
I don't know how these IDs are generated by the Podcasts app.
Generating SQLite schemas for the Apple Podcasts cache:
sqlite3 /Users/<username>/Library/Group\ Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite .schema > schema.sql
sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE limit 5;
PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml
PodcastContent122/v4/f4/1a/aa/f41aaa81-24f0-b259-9870-9b1e48e676f6/transcript_1000427522064.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION,
...> p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY
...> FROM ZMTEPISODE e
...> JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
...> WHERE e.ZTRANSCRIPTIDENTIFIER = 'PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml';
Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History
The schema.sql file contains the SQLite schemas for the Apple Podcasts cache.
Key Tables for Metadata:
ZMTEPISODE (episodes):
- ZTITLE - Episode title
- ZITEMDESCRIPTION - Episode description
- ZPUBDATE - Publication date
- ZDURATION - Episode duration
- ZPODCASTUUID - Links to podcast
- ZTRANSCRIPTIDENTIFIER - Transcript path relative to the TTML cache directory; appears to match the location of the TTML file
ZMTPODCAST (podcasts/shows):
- ZTITLE - Podcast title
- ZAUTHOR - Podcast author
- ZCATEGORY - Podcast category
- ZUUID - Podcast UUID (matches ZPODCASTUUID in episodes)
ZMTCHANNEL (channels):
- ZNAME - Channel name
- Linked via ZCHANNEL field in ZMTPODCAST
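The pipe-separated rows printed by sqlite3 can be turned into objects with a few lines of code. The sketch below also converts ZPUBDATE under the assumption that it counts seconds since 2001-01-01 UTC (the Core Data reference date). That assumption is consistent with the sample row, which decodes to July 2018, but it is not confirmed by the schema. The names coreDataToDate and parseMetadataRow are hypothetical.

```javascript
// Seconds between the Unix epoch (1970-01-01) and 2001-01-01T00:00:00Z.
const CORE_DATA_EPOCH_OFFSET = 978307200;

// Assumption: ZPUBDATE is seconds since 2001-01-01 UTC.
function coreDataToDate(seconds) {
  return new Date((seconds + CORE_DATA_EPOCH_OFFSET) * 1000);
}

// Parse one row in the column order used by the query in this README:
// episode title, ZPUBDATE, ZDURATION, podcast title, author, category.
// Note: a title containing "|" would break this simple split.
function parseMetadataRow(row) {
  const [episodeTitle, pubDate, duration, podcastTitle, author, category] = row.split("|");
  return {
    episodeTitle,
    published: coreDataToDate(Number(pubDate)),
    durationSeconds: Number(duration),
    podcastTitle,
    author,
    category,
  };
}

const sample = parseMetadataRow(
  "Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History"
);
console.log(sample.published.toISOString()); // prints "2018-07-14T20:05:35.000Z"
```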
Consider the file "~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml". Its path relative to the TTML directory, without the trailing -1000740205141.ttml suffix, is PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml, and this is the value stored in ZMTEPISODE.ZTRANSCRIPTIDENTIFIER. The metadata can be extracted as follows:
sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE where ZTRANSCRIPTIDENTIFIER like '%02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml%';
PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION, p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID WHERE e.ZTRANSCRIPTIDENTIFIER = 'PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml';
Transitional injustice: Syria one year after Assad|786884509|1486.0|The Intelligence from The Economist|The Economist|Daily News
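The mapping from a cached file path to a ZTRANSCRIPTIDENTIFIER value can be sketched as follows. It is based only on the single example above: take everything after the TTML directory, then drop a trailing "-<digits>.ttml" suffix if one is present. The helper name transcriptIdentifierFromPath is hypothetical, and other files may follow a different naming pattern.

```javascript
// Hypothetical helper: derive the ZTRANSCRIPTIDENTIFIER for a cached TTML file.
function transcriptIdentifierFromPath(filePath) {
  const marker = "/TTML/";
  const idx = filePath.indexOf(marker);
  if (idx === -1) return null; // not inside the TTML cache directory
  const relative = filePath.slice(idx + marker.length);
  // "transcript_123.ttml-123.ttml" becomes "transcript_123.ttml";
  // names without the extra suffix are returned unchanged.
  return relative.replace(/(\.ttml)-\d+\.ttml$/, "$1");
}

console.log(
  transcriptIdentifierFromPath(
    "/Users/me/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml"
  )
);
```

The returned string can then be used as the value compared against e.ZTRANSCRIPTIDENTIFIER in the metadata query.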