Replies: 1 comment
It's a bit tricky to say without seeing your actual code, but one thing I can say is that if you are only passing individual sentences to BERTopic, that might explain the out-of-context results you are getting. Each sentence is embedded independently, and it may need more surrounding context to get a good representation. For instance, you could aggregate consecutive sentences into larger chunks, or add context to the embedding based on a summary of the podcast.
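A minimal sketch of the aggregation idea: merge consecutive Whisper segments into overlapping chunks before passing them to BERTopic, so each document carries more context. The `window` and `overlap` parameters and the `merge_segments` helper are hypothetical names for illustration, not part of BERTopic's API; tune the sizes to your data.

```python
def merge_segments(segments, window=5, overlap=1):
    """Merge consecutive transcript segments into overlapping chunks.

    segments: list of segment texts in transcript order
    window:   number of segments per chunk
    overlap:  segments shared between adjacent chunks (keeps topical
              continuity across chunk boundaries)
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(segments), step):
        chunks.append(" ".join(segments[start:start + window]))
        if start + window >= len(segments):
            break
    return chunks

# The resulting chunks would then be the documents you fit BERTopic on,
# e.g. topic_model.fit_transform(merge_segments(raw_segments))
```

With ~150 segments per show, a window of 3 to 5 segments keeps each chunk small enough to stay on a single topic while giving the embedding model more to work with than a lone sentence.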
Hello,
I’m currently working with a dataset of 25 podcast transcriptions generated using Whisper. Each show consists of about 150 segments. I have access to the raw segments, but I can clean, merge, or structure the data as needed.
My goal is to extract meaningful topics from these transcriptions and later compare them across different episodes. I’m currently using BERTopic, which produces interesting results, but I’ve noticed that many sentences are extracted out of context, which affects the coherence of the resulting topics.
I’ve already spent time tuning the parameters carefully and I’m using an embedding model that is well suited to my language.
Do you have any advice, for example on how to modify the embeddings, adjust the dimensionality reduction step, or on alternative methods that would better preserve context in a scenario like mine?
Thank you in advance for your help!