Replies: 1 comment
It's a bit tricky to say without seeing your actual code, but one thing I can say is that if you are only passing individual sentences to BERTopic, that might explain the out-of-context results you are getting. Each sentence is embedded independently, and it may need more surrounding context to get a good representation. For instance, you could aggregate consecutive sentences into larger chunks, or add context to the embedding based on a summary of the podcast.
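A minimal sketch of the aggregation idea: merge consecutive Whisper segments into overlapping chunks before passing them to BERTopic, so each document carries more context. The `window` and `overlap` parameters and the `merge_segments` helper are hypothetical names for illustration, not part of BERTopic's API; tune the sizes to your data.

```python
def merge_segments(segments, window=5, overlap=1):
    """Merge consecutive transcript segments into overlapping chunks.

    segments: list of segment texts in transcript order
    window:   number of segments per chunk
    overlap:  segments shared between adjacent chunks (keeps topical
              continuity across chunk boundaries)
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(segments), step):
        chunks.append(" ".join(segments[start:start + window]))
        if start + window >= len(segments):
            break
    return chunks

# The resulting chunks would then be the documents you fit BERTopic on,
# e.g. topic_model.fit_transform(merge_segments(raw_segments))
```

With ~150 segments per show, a window of 3 to 5 segments keeps each chunk small enough to stay on a single topic while giving the embedding model more to work with than a lone sentence.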
Hello,
I’m currently working with a dataset of 25 podcast transcriptions generated using Whisper. Each show consists of about 150 segments. I have access to the raw segments, but I can clean, merge, or structure the data as needed.
My goal is to extract meaningful topics from these transcriptions and later compare them across different episodes. I’m currently using BERTopic, which produces interesting results, but I’ve noticed that many sentences are extracted out of context, which affects the coherence of the resulting topics.
I’ve already spent time tuning the parameters carefully and I’m using an embedding model that is well suited to my language.
Do you have any advice, for example on how to modify the embeddings, adjust the dimensionality reduction step, or on alternative methods that would better preserve context in a scenario like mine?
Thank you in advance for your help!