📝 Background & Objective
We have a vast library of podcast transcripts spanning data science, AI, ML, open-source, career transitions, and engineering. Currently, each episode exists only as a separate page with a recording and timestamps. Because actionable insights remain buried deep inside individual episodes, users are forced to consume entire recordings just to find specific advice.
To solve this, we are shifting our platform from episode-based discovery to theme-based discovery. We want to build thematic "Insight Hubs" inspired by the Huberman Lab’s topic pages (e.g., hubermanlab.com/topics/fitness-and-workout-routines). Each hub will act as an educational guide, aggregating insights, takeaways, and relevant podcast clips around a single, recurring theme like "Open Source Contribution" or "AI Evaluation."
Scope Note for Contributors: This is a massive, end-to-end milestone. You do not have to implement everything at once! You are welcome to select just one phase of this task to work on and submit a PR for that specific piece.
The Strategic Value
Creating these hubs gives the platform a scalable knowledge architecture with the following benefits:
- Cross-Linking Across Episodes: Connect insights from multiple episodes under one theme (e.g., an /insights/open-source/ hub featuring lessons from 5 different guests).
- Topical Authority: Each page becomes an SEO “pillar” optimized for a broader theme, building domain depth and authority in search engines.
- Internal Linking & SEO Signals: Episode pages link back to their corresponding insight hubs, and vice versa, creating strong semantic connections for Google crawlers.
- Repurposing Potential: These pages can be easily reused as blog posts, newsletter features, or social media content.
- Enhanced User Experience: Visitors can explore related episodes in one place, encouraging deeper engagement and longer session times.
- Scalable Knowledge Architecture: Over time, these hubs form a structured, easily navigable knowledge system.
Step-by-Step Implementation Plan
You can choose to tackle the entire pipeline or contribute to a specific phase below.
Phase 1: Create a Clean Topic Taxonomy
Propose a standardized master vocabulary of major topics and subtopics. This taxonomy must be broad enough to cover all historical podcast episodes but granular enough to accurately categorize short video clips.
- Requirement: Every major topic must contain at least three meaningful subtopics.
- Example:
  - Topic: Open-source
  - Subtopics: Getting started, Contributing effectively, Roles and responsibilities, Growing a career.
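One possible way to keep the "at least three subtopics" requirement honest is a small check that runs over the proposed taxonomy. The sketch below is only an illustration: the dictionary layout, the `TAXONOMY` name, and the `find_thin_topics` helper are assumptions, and only the Open-source entry comes from the example above.

```python
# Hypothetical layout: major topic -> list of subtopics.
# Only the "Open-source" entry is taken from the task description.
TAXONOMY = {
    "Open-source": [
        "Getting started",
        "Contributing effectively",
        "Roles and responsibilities",
        "Growing a career",
    ],
}

MIN_SUBTOPICS = 3  # the Phase 1 requirement


def find_thin_topics(taxonomy, minimum=MIN_SUBTOPICS):
    """Return major topics with fewer than `minimum` distinct, non-empty subtopics."""
    thin = []
    for topic, subtopics in taxonomy.items():
        # Normalise case and whitespace so near-duplicates count once.
        distinct = {s.strip().lower() for s in subtopics if s.strip()}
        if len(distinct) < minimum:
            thin.append(topic)
    return thin


if __name__ == "__main__":
    print(find_thin_topics(TAXONOMY))
```

Normalising case and whitespace before counting is a design choice, not a requirement of the task; it simply prevents "Getting started" and "getting started " from being counted as two meaningful subtopics.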
Phase 2: Categorize Episodes and Timestamps
Using your taxonomy, automate the categorization of our existing content located in the _podcast folder.
- Whole Episodes: Assign 1-3 major topics and relevant subtopics to the overall episode. Write this data directly into the markdown front matter.
- Individual Timestamps (Clips): Analyze each timestamp in the transcript. Assign exactly one major topic and one or more subtopics based strictly on the content of the clip. Inject these tags into the existing YAML transcript formatting.
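Whatever tool produces the tags, it may help to validate them against the rules above before writing anything back to the files. The following is a minimal sketch under stated assumptions: the field names `topic` and `subtopics`, the episode layout (a mapping from major topic to subtopics), and all function names are hypothetical, since the actual front matter and transcript schema are not specified here.

```python
def check_subtopics(topic, subtopics, taxonomy):
    """Return error strings for a topic or subtopic that is not in the taxonomy."""
    if topic not in taxonomy:
        return [f"unknown topic: {topic!r}"]
    return [
        f"unknown subtopic under {topic!r}: {s!r}"
        for s in subtopics
        if s not in taxonomy[topic]
    ]


def validate_episode(topics, taxonomy):
    """Episode rule: 1-3 major topics, each with subtopics from the taxonomy."""
    errors = []
    if not 1 <= len(topics) <= 3:
        errors.append(f"expected 1-3 major topics, got {len(topics)}")
    for topic, subtopics in topics.items():
        errors.extend(check_subtopics(topic, subtopics, taxonomy))
    return errors


def validate_clip(clip, taxonomy):
    """Clip rule: exactly one major topic and at least one subtopic."""
    topic = clip.get("topic")
    subtopics = clip.get("subtopics") or []
    if not isinstance(topic, str):
        return ["clip must have exactly one major topic"]
    errors = check_subtopics(topic, subtopics, taxonomy)
    if not subtopics:
        errors.append("clip needs at least one subtopic")
    return errors
```

Both validators return a list of messages rather than raising, so a batch run over the whole _podcast folder can report every problem in one pass.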
Phase 3: Automate "Insight Hub" Page Generation via LLM
Using the dense, categorized dataset created in Phase 2, write a script that uses an LLM to author and construct the actual /insights/<topic>/ Pillar Pages.
Rather than just programmatically listing links, your script should:
- Aggregate Context: For a given Topic and its Subtopics, query the dataset to find all relevant episodes and their specific timestamped transcript snippets.
- Analyze with LLM: Feed this curated collection of transcripts, clip data, and guest names into an LLM.
- Generate the Content: Prompt the LLM to analyze the aggregated insights and write the comprehensive Pillar Page. The LLM should generate the introductory paragraphs, summarize the subtopics, and seamlessly embed the relevant podcast cards, guest lists, and CTAs.
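The aggregation and prompt-assembly steps above could be sketched roughly as follows. This is not a prescribed implementation: the record fields (`title`, `guest`, `clips`, `time`, `text`, `topic`, `subtopics`) and both function names are assumptions about how the Phase 2 output might be loaded, and no LLM is actually called; the resulting string would be passed to whichever model client you choose.

```python
from collections import defaultdict


def collect_clips(episodes, topic):
    """Group clips tagged with `topic` by subtopic, keeping guest and episode context."""
    grouped = defaultdict(list)
    for episode in episodes:
        for clip in episode["clips"]:
            if clip["topic"] != topic:
                continue
            # A clip may carry several subtopics; list it under each one.
            for subtopic in clip["subtopics"]:
                grouped[subtopic].append({
                    "episode": episode["title"],
                    "guest": episode["guest"],
                    "time": clip["time"],
                    "text": clip["text"],
                })
    return dict(grouped)


def build_prompt(topic, grouped):
    """Assemble the text that would be sent to an LLM; no model is called here."""
    lines = [
        f"Write an insight hub page for the topic: {topic}.",
        "Use only the transcript snippets below and name the guest for each insight.",
    ]
    for subtopic in sorted(grouped):
        lines.append(f"\n## {subtopic}")
        for clip in grouped[subtopic]:
            lines.append(
                f"- [{clip['episode']} @ {clip['time']}] {clip['guest']}: {clip['text']}"
            )
    return "\n".join(lines)
```

Keeping the prompt builder separate from the model call makes it easy to inspect exactly which snippets the page was written from, and to rerun generation for a single topic when new episodes are tagged.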
Target Page Structure:
The LLM must be prompted to output the page following this exact content hierarchy and interface layout:
- Navigation: A left-hand Table of Contents or sidebar for quick navigation.
- Header: A Title (the big topic) followed by a 1-2 paragraph introduction that is heavily interlinked with other topics on the site.
- Subtopic Sections (H2 & H3): This is the core interface. It should feature a clickable subtopic header that reveals all relevant podcast cards under it. Each section must include:
  - A 1-2 paragraph introduction summarizing the subtopic insights (written by the LLM based on the transcripts) and interlinked with other topics.
  - Links to the relevant podcasts, displayed visually as cards.
  - A CTA callout linking to a dedicated subtopic page, a relevant newsletter, or a related blog post.
- Footer - Guest Experts: An H2 section listing all relevant guest experts who contributed insights to this specific page.
- Footer - Related Topics: An H2 section providing a list of clickable links to other related Insight Hubs.
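Because LLM output does not always follow the requested layout, a lightweight structural check on each generated page may be worthwhile. The sketch below assumes the page is produced as Markdown with `## ` marking H2 headings, which is an assumption rather than something this task specifies; the helper names are hypothetical, while the two footer titles come from the structure listed above.

```python
import re

# H2 footer sections named in the target page structure.
REQUIRED_FOOTERS = ["Guest Experts", "Related Topics"]


def h2_titles(markdown):
    """Return the text of every H2 heading (lines starting with '## ') in order."""
    return [
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", markdown, flags=re.MULTILINE)
    ]


def missing_sections(markdown, subtopics):
    """List the required H2 sections that the generated page does not contain."""
    found = set(h2_titles(markdown))
    expected = list(subtopics) + REQUIRED_FOOTERS
    return [title for title in expected if title not in found]
```

A page for which `missing_sections` returns a non-empty list could be regenerated or flagged for manual review instead of being published.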
📂 Resources You Can Use
- _podcast folder. The transcripts are formatted in YAML/JSON, with each line recording the time and who said what: https://github.com/DataTalksClub/datatalksclub.github.io/tree/main/_podcast
- _people folder: https://github.com/DataTalksClub/datatalksclub.github.io/tree/main/_people