Rethinking Topic Coherence: Word Embeddings vs. Co-occurrence (NPMI)? #2413

SongLin99 · 2025-08-27T06:48:53Z

Hi everyone,

When evaluating topic models, we all rely on topic coherence to measure the quality of our topics. Standard metrics such as C_v are built on NPMI scores computed from word co-occurrence statistics in a reference corpus, often a massive external one.
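For reference, here is a minimal sketch of the NPMI building block, so we are all talking about the same baseline. It assumes boolean document-level co-occurrence over a tiny tokenized corpus, which is a simplification: C_v itself uses sliding windows and an additional indirect confirmation step. The names `npmi` and `npmi_coherence` are hypothetical helpers, not any library's API, and the return values at the boundaries (no co-occurrence, total co-occurrence) are conventions I picked for the sketch.

```python
import math
from itertools import combinations


def npmi(p_i, p_j, p_ij):
    """NPMI of one word pair from marginal and joint probabilities."""
    if p_ij <= 0.0:
        return -1.0  # convention: the pair never co-occurs
    if p_ij >= 1.0:
        return 1.0   # convention: both words appear in every document
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


def npmi_coherence(topic_words, documents):
    """Average pairwise NPMI using boolean document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = [
        npmi(prob(a), prob(b), prob(a, b))
        for a, b in combinations(topic_words, 2)
    ]
    return sum(scores) / len(scores)


# Toy reference corpus (made up purely for illustration).
docs = [
    ["doctor", "nurse"],
    ["doctor", "nurse"],
    ["car", "road"],
    ["car", "road"],
]
print(npmi_coherence(["doctor", "nurse"], docs))
print(npmi_coherence(["doctor", "car"], docs))
```

Note how the score depends entirely on which words happen to share a document (or window) in the reference corpus, which is exactly the property I am questioning.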

This method has served us well, but it raises a fundamental question: Is counting direct co-occurrences the best way to capture semantic relationships today?

The Proposal:
I'd like to propose a discussion around using word embeddings to calculate topic coherence, replacing the traditional co-occurrence-based approach.

The workflow would be simple: instead of calculating the NPMI score for a pair of topic words, we would calculate the cosine similarity between their word vectors. The core idea is to shift from measuring explicit co-occurrence to leveraging the latent semantic relationships that embeddings are designed to capture.
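To make the proposal concrete, here is a rough sketch of what I have in mind: average the pairwise cosine similarity over a topic's top words. Everything here is an assumption for illustration. `embedding_coherence` is a hypothetical name, the vectors are made-up 3-dimensional toys rather than real pretrained embeddings, and silently skipping out-of-vocabulary words is just one possible policy.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_coherence(topic_words, vectors):
    """Average pairwise cosine similarity over a topic's words.

    `vectors` maps word -> embedding. Words without a vector are skipped
    (an assumed out-of-vocabulary policy, not a recommendation).
    """
    known = [vectors[w] for w in topic_words if w in vectors]
    if len(known) < 2:
        raise ValueError("need at least two words with embeddings")
    sims = [cosine(u, v) for u, v in combinations(known, 2)]
    return sum(sims) / len(sims)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "doctor": [1.0, 0.0, 0.0],
    "scalpel": [1.0, 0.0, 0.0],
    "road": [0.0, 1.0, 0.0],
}
print(embedding_coherence(["doctor", "scalpel"], toy_vectors))
print(embedding_coherence(["doctor", "scalpel", "road"], toy_vectors))
```

With real embeddings the dictionary would come from whatever pretrained model is chosen, and the score would then inherit that model's training corpus and biases instead of the reference corpus's counts.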

Why might this be a better approach?
Deeper Semantic Representation: Embeddings capture deeper, latent relationships, not just direct co-occurrence. For instance, "doctor" and "scalpel" are highly related, even if they don't frequently appear within the same small context window.

More Robust & Generalized: Representations from embeddings trained on huge, diverse corpora are often more robust and generalized than sparse counts from a single reference corpus.

Alignment with Modern NLP: This would align topic model evaluation with the broader shift in NLP from count-based methods to representation learning.

Open Questions for Discussion:
I'm keen to hear what the community thinks about this methodological shift.

Does this move from "co-occurrence" to "latent semantics" make theoretical and practical sense for evaluating topic quality?

Replies: 0 comments
