Apple Podcasts Transcript Extractor

This script extracts transcripts from the Podcasts app on macOS.

Installation

Clone the repository
Install dependencies: npm install

Usage

Note: You need to download the desired podcast episode(s) before you can extract the transcript.

Batch Mode

To process all TTML files in your Apple Podcasts cache:

node extractTranscript.js [--timestamps]

This will:

Find all TTML files in ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML
Create a ./transcripts directory
Save each transcript as ./transcripts/<podcast_name> - <episode_title>.txt

Note: the podcast name and episode title are read from the SQLite database at ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite.

Sample output is:

$ ls ./transcripts
Secret Leaders - Hermann Hauser- The Man Who Saved Apple from Bankruptcy.txt
Secret Leaders - How to Earn Millions by charging $0.txt
Secret Leaders - I built a $2bn Company By Paying Everybody The Same As Me - Nicola Kilner.txt
Secret Leaders - Spencer Matthews- How pushing my body to the limit changed my life.txt
The BugBash Podcast - Ergonomics, reliability, durability.txt
The BugBash Podcast - Every map is wrong, but we made one anyway.txt
The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt
The Intelligence from The Economist - Against the clock- Gaza peace talks.txt
The Intelligence from The Economist - All the president’s money men- the Trumponomics team.txt
The Intelligence from The Economist - Billions of voices heard- a year of elections.txt
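
The batch steps above can be sketched roughly as follows. This is a minimal illustration, not the code in extractTranscript.js: findTtmlFiles and outputFileName are hypothetical helper names, and replacing ':' and '/' with '-' is an assumption inferred from the sample file names.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Cache location quoted in this README.
const TTML_ROOT = path.join(
  os.homedir(),
  'Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML'
);

// Recursively collect every file under `dir` whose name ends in ".ttml".
function findTtmlFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findTtmlFiles(fullPath));
    } else if (entry.name.endsWith('.ttml')) {
      found.push(fullPath);
    }
  }
  return found;
}

// Hypothetical helper: build "<podcast> - <episode>.txt".
// Assumed rule: ':' and '/' are replaced with '-' to keep the name file-safe.
function outputFileName(podcastTitle, episodeTitle) {
  const clean = (s) => s.replace(/[\/:]/g, '-').trim();
  return `${clean(podcastTitle)} - ${clean(episodeTitle)}.txt`;
}
```

Calling findTtmlFiles(TTML_ROOT) would return the absolute paths to process, and fs.mkdirSync('./transcripts', { recursive: true }) creates the output directory if it does not exist yet.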

Timestamps Option

Add --timestamps to include timestamps for each paragraph in the format [HH:MM:SS].

For example:

[00:01:23] This is what the speaker said
[00:01:25] And then they said this
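
As an illustration of the [HH:MM:SS] format, here is a small sketch. It assumes the start time of a paragraph is already available as a number of seconds; formatTimestamp is a hypothetical name, and how the script actually reads times from the TTML file is not shown here.

```javascript
// Format a number of seconds as a zero-padded [HH:MM:SS] prefix.
// Fractional seconds are truncated.
function formatTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const hh = String(Math.floor(s / 3600)).padStart(2, '0');
  const mm = String(Math.floor((s % 3600) / 60)).padStart(2, '0');
  const ss = String(s % 60).padStart(2, '0');
  return `[${hh}:${mm}:${ss}]`;
}

// Prefix a paragraph with its timestamp.
function withTimestamp(startSeconds, text) {
  return `${formatTimestamp(startSeconds)} ${text}`;
}
```

For example, withTimestamp(83, 'This is what the speaker said') produces the first line of the sample above.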

Single File Mode

node extractTranscript.js <input_file> <output_file> [--timestamps]
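
One way a script could tell batch mode from single file mode is by counting positional arguments. The sketch below is an assumption about how such parsing might look, not the actual logic of extractTranscript.js; parseArgs and the shape of its return value are hypothetical.

```javascript
// Decide between batch and single-file mode from the CLI arguments.
// `argv` is expected to be process.argv.slice(2).
function parseArgs(argv) {
  const timestamps = argv.includes('--timestamps');
  const positional = argv.filter((arg) => !arg.startsWith('--'));

  if (positional.length === 0) {
    return { mode: 'batch', timestamps };
  }
  if (positional.length === 2) {
    const [inputFile, outputFile] = positional;
    return { mode: 'single', inputFile, outputFile, timestamps };
  }
  throw new Error(
    'Usage: node extractTranscript.js [<input_file> <output_file>] [--timestamps]'
  );
}
```

With this approach the --timestamps flag can appear anywhere on the command line, and passing only one file name is reported as a usage error rather than silently falling back to batch mode.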

Where does the input file come from?

The input file is a transcript_<long_episode_id>.ttml file stored under the ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent<short_episode_id> directory, nested several subdirectories deep.

How do I get the episode IDs?

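I don't know how the Podcasts app generates these IDs. However, the ZTRANSCRIPTIDENTIFIER column of the ZMTEPISODE table (see SQLite Schemas) stores the relative TTML path for each episode, so it can be used to match a file to its episode.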

SQLite Schemas

Generating SQLite schemas for the Apple Podcasts cache:

sqlite3 /Users/<username>/Library/Group\ Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite .schema > schema.sql

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE limit 5;
PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml
PodcastContent122/v4/f4/1a/aa/f41aaa81-24f0-b259-9870-9b1e48e676f6/transcript_1000427522064.ttml

sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION,
   ...>          p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY
   ...>   FROM ZMTEPISODE e
   ...>   JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
   ...>   WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml";
Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History

The schema.sql file contains the SQLite schemas for the Apple Podcasts cache.

Key Tables for Metadata:

ZMTEPISODE (episodes):

ZTITLE - Episode title
ZITEMDESCRIPTION - Episode description
ZPUBDATE - Publication date
ZDURATION - Episode duration
ZPODCASTUUID - Links to podcast
ZTRANSCRIPTIDENTIFIER - Path of the episode's TTML file, relative to the TTML cache directory

ZMTPODCAST (podcasts/shows):

ZTITLE - Podcast title
ZAUTHOR - Podcast author
ZCATEGORY - Podcast category
ZUUID - Podcast UUID (matches ZPODCASTUUID in episodes)

ZMTCHANNEL (channels):

ZNAME - Channel name
Linked via ZCHANNEL field in ZMTPODCAST
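
Two of the ZMTEPISODE columns listed above are numeric and may need converting before display. The sketch below assumes ZPUBDATE counts seconds from Core Data's reference date (2001-01-01 UTC) and ZDURATION is in seconds. Neither is stated in this README, but both are consistent with the sample rows shown in this section. The function names are hypothetical.

```javascript
// Seconds between the Unix epoch (1970-01-01) and the Core Data
// reference date (2001-01-01), both UTC.
const CORE_DATA_EPOCH_OFFSET = 978307200;

// Convert a ZPUBDATE value to a JavaScript Date (assumed Core Data epoch).
function coreDataToDate(zPubDate) {
  return new Date((zPubDate + CORE_DATA_EPOCH_OFFSET) * 1000);
}

// Format a ZDURATION value (assumed seconds) as H:MM:SS.
function formatDuration(seconds) {
  const s = Math.floor(seconds);
  const h = Math.floor(s / 3600);
  const mm = String(Math.floor((s % 3600) / 60)).padStart(2, '0');
  const ss = String(s % 60).padStart(2, '0');
  return `${h}:${mm}:${ss}`;
}
```

Under these assumptions, a ZPUBDATE of 553291535 falls on 2018-07-14 and a ZDURATION of 16097.0 is 4:28:17.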

For example, take the file ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml. Its path relative to the TTML directory, without the trailing -1000740205141.ttml suffix, is PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml, which is the value stored in ZMTEPISODE.ZTRANSCRIPTIDENTIFIER. We can look up the metadata as follows:

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE where ZTRANSCRIPTIDENTIFIER like "%02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml%";
PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION, p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml";
Transitional injustice: Syria one year after Assad|786884509|1486.0|The Intelligence from The Economist|The Economist|Daily News
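
The same lookup can be sketched from Node by shelling out to the sqlite3 command-line tool. This is an illustration under several assumptions, not necessarily how extractTranscript.js does it: the function names are hypothetical, the suffix-stripping rule is inferred from the single example above, and the -json flag requires a reasonably recent sqlite3 (3.33 or later).

```javascript
const { execFileSync } = require('child_process');

// Strip everything up to and including ".../Assets/TTML/" and drop the
// trailing "-<digits>.ttml" suffix seen on the cached file name.
function transcriptIdentifierFromPath(filePath) {
  const marker = '/Assets/TTML/';
  const idx = filePath.indexOf(marker);
  const relative = idx === -1 ? filePath : filePath.slice(idx + marker.length);
  return relative.replace(/(\.ttml)-\d+\.ttml$/, '$1');
}

// Build the metadata query, escaping single quotes for the SQL literal.
function buildMetadataQuery(identifier) {
  const literal = identifier.replace(/'/g, "''");
  return (
    'SELECT e.ZTITLE AS episode_title, e.ZPUBDATE, e.ZDURATION, ' +
    'p.ZTITLE AS podcast_title, p.ZAUTHOR, p.ZCATEGORY ' +
    'FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID ' +
    `WHERE e.ZTRANSCRIPTIDENTIFIER = '${literal}';`
  );
}

// Run the query through the sqlite3 CLI in read-only mode.
// Returns an array of row objects (empty if nothing matched).
function lookupMetadata(dbPath, identifier) {
  const out = execFileSync(
    'sqlite3',
    ['-readonly', '-json', dbPath, buildMetadataQuery(identifier)],
    { encoding: 'utf8' }
  );
  return out.trim() ? JSON.parse(out) : [];
}
```

Opening the database read-only avoids accidentally modifying the library that the Podcasts app is using. A call such as lookupMetadata(dbPath, transcriptIdentifierFromPath(ttmlPath)) would then return the episode and podcast titles needed to name the output file.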

jzhou77/apple-podcast-transcript-extractor
