Skip to content

feat: add Topic Data Quality module and Streamlit app#780

Open
dharani-dj wants to merge 5 commits into
pegasystems:masterfrom
dharani-dj:feature/topic-data-quality
Open

feat: add Topic Data Quality module and Streamlit app#780
dharani-dj wants to merge 5 commits into
pegasystems:masterfrom
dharani-dj:feature/topic-data-quality

Conversation

@dharani-dj

Copy link
Copy Markdown
  • Add pdstools.data_quality.TopicDataQuality library class with: embedding computation, UMAP visualization, topic similarity, text quality analysis, duplicate detection, sample adequacy, cluster tightness, outlier detection, confused samples, keyword overlap, health scoring, and recommendations
  • Add Streamlit app under app/data_quality/ with home page, quality report page, and about page
  • Integrate into unified launcher (4th tile) and CLI (pdstools run dq)
  • Add data_quality optional dependency group (numpy, pandas, scikit-learn, sentence-transformers, umap-learn)
  • Add 16 library tests and 4 Streamlit page tests
  • Update launcher tests for 4-app layout

@StijnKas

StijnKas commented May 8, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the PR. The data-quality angle is genuinely useful and the diagnostic surface (embeddings, similarity heatmap, outlier detection, sample adequacy) is real value. Before we merge, though, this needs to be brought in line with pdstools conventions — there's quite a bit of duplication and divergence from how we do things elsewhere.


Drop pandas, use polars

The repo is polars-first and pandas isn't earning its place here. Going through every pandas use in _topic_data_quality.py: the DataFrame work — dropna, value_counts, nunique, astype(str), groupby(col)[other].apply(" ".join), sort_values, building frames from row-dicts — is all trivially polars-equivalent. Several are cleaner in polars (the groupby-then-join is just df.group_by(topic).agg(pl.col(text).str.concat(" "))).

The ML libraries you feed into — sentence_transformers.encode(), sklearn.cosine_similarity(), umap.UMAP().fit_transform() — all take lists or numpy arrays, not DataFrames. The seam is df.get_column(text_col).to_list() for the encoder and series.to_numpy() (zero-copy when dtypes align) for the ML calls. Pandas adds a roundtrip without earning anything.

Please convert the module to polars and drop pandas from the data_quality extra. Numpy stays — the ML libs need it.

Use LazyNamespace for the sub-namespaces from the start

There are already ~12 public methods on TopicDataQuality and the design clearly splits along three axes: compute (embeddings, UMAP, similarity), inspect (duplicates, outliers, confused samples, adequacy), and report (health score, recommendations, summary). That's the textbook case for the namespace-facade pattern we use on ADMDatamart (dm.plot, dm.aggregates, dm.generate, dm.bin_aggregator).

Set this up now, not after it grows past 20 methods:

class TopicDataQuality:
    def __init__(self, df: pl.DataFrame, ...):
        ...
        self.compute = Compute(self)   # embeddings, UMAP, similarity
        self.plot = Plot(self)         # the Plotly figures (see below)
        self.health = Health(self)     # outliers, duplicates, scoring, recommendations

Each sub-namespace inherits from pdstools.utils.namespaces.LazyNamespace, declares its dependencies (e.g. ["plotly"] for Plot, ["umap-learn"] for Compute) and dependency_group = "data_quality". That gives you free MissingDependenciesException handling — users without the extra installed get a clear install message on dq.plot, not a raw ImportError from inside compute_umap. Replaces the current ad-hoc lazy imports.

Pure __init__, no caller mutation

__post_init__ mutates the caller's frame in place:

self.df = self.df[[self.text_col, self.topic_col]].dropna().reset_index(drop=True)
self.df[self.text_col] = self.df[self.text_col].astype(str)

Then compute_umap() writes x / y columns onto the same frame. The caller hands you a DataFrame and gets it back with extra columns silently appended. AGENTS.md has a pure-__init__ rule for exactly this reason — the constructor takes already-clean data, and normalisation lives in a from_<source> classmethod:

@classmethod
def from_dataframe(cls, df: pl.DataFrame, text_col: str, topic_col: str, ...) -> TopicDataQuality:
    clean = df.select(text_col, topic_col).drop_nulls().with_columns(pl.col(text_col).cast(pl.Utf8))
    return cls(df=clean, text_col=text_col, topic_col=topic_col, ...)

UMAP coordinates should be returned, or stored on self as a separate _umap_coords attribute, not appended to the input frame.

Plot construction belongs in the library

pages/1_Quality_Report.py builds Plotly figures directly — px.bar, go.Figure/go.Heatmap for the UMAP scatter and similarity heatmap, layout choices, hover templates. AGENTS.md is explicit that pages are zero-functionality presentation layers: every plot in the app should be reproducible from a notebook with one library call, and the page is just st.plotly_chart(dq.plot.xxx()).

Move those three figures into the Plot(LazyNamespace) namespace and follow the standard signature:

def topic_distribution(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...
def umap_2d(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...
def similarity_heatmap(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...

The return_df keyword (with @overload) is the standard across existing analyzer plot methods — makes plots scriptable, testable, and composable.

Tests: tighten and add state-transition

The structure is good but most assertions are structural (len(short_texts) == 2, len(dups) >= 4, "diagonal is 1.0"). With 16 hand-crafted rows every output is hand-computable — avg_words, off-diagonal similarity values, cluster-tightness scores. Trade structural checks for exact-value asserts on the same rows; that's what catches silent regressions where the shape stays right but the math drifts.

Separately: the home page has an upload + analyze flow that mutates st.session_state["topic_dq"], and test_dq_pages.py only smoke-tests the renders. Add a state-transition test that simulates the upload and asserts the resulting analyzer was built (row count, detected columns). The exact pattern bit us in the v5.0.0 launcher; AGENTS.md has the reference shape.


Summary

To get this merge-ready:

  1. Switch the module to polars; drop pandas from the data_quality extra.
  2. Set up Compute / Plot / Health as LazyNamespace sub-namespaces from the start, with dependency_group="data_quality" — replaces the ad-hoc lazy imports.
  3. Make __init__ pure; move filtering/coercion into from_dataframe; don't mutate the input frame with UMAP coords.
  4. Move the three Plotly figures from the Quality Report page into dq.plot.* with return_df support.
  5. Tighten unit tests to exact-value assertions; add an upload→analyze state-transition test.

Happy to talk through any of this if it's helpful.

- Add pdstools.data_quality.TopicDataQuality with pure __init__ and
  from_dataframe classmethod (polars-only, no pandas)
- Three LazyNamespace sub-namespaces:
  - dq.compute: embeddings, UMAP, TF-IDF similarity
  - dq.plot: topic_distribution, umap_2d, similarity_heatmap (with return_df)
  - dq.health: text quality, duplicates, adequacy, tightness,
    outliers, confused samples, health score, recommendations, summary
- UMAP coords stored on _umap_coords, not mutated onto input frame
- Streamlit app under app/data_quality/ (zero-functionality presentation)
- Integrated into unified launcher (4th tile) and CLI (pdstools run dq)
- data_quality optional dep group: numpy, scikit-learn,
  sentence-transformers, umap-learn, plotly
- 35 library tests with exact-value assertions + 5 Streamlit tests
  including state-transition test
- Updated launcher tests for 4-app layout
@dharani-dj dharani-dj force-pushed the feature/topic-data-quality branch from 8fe9721 to 2e94b3b Compare May 8, 2026 09:07
@operdeck

Copy link
Copy Markdown
Collaborator

@dharani-dj please have a look at the build failures and the PR review comments

- Add ClassVar annotation to dependencies attrs (RUF012)
- Add strict=True to zip() call (B905)
- Remove unused imports: numpy, TopicOverlapPair (F401)
- Add python_version<3.14 markers for sentence-transformers and
  umap-learn (no torch/numba wheels for 3.14 yet)
@dharani-dj

Copy link
Copy Markdown
Author

Pushed fix commit (eecc819) addressing all CI failures:

Ruff fixes:

  • Added ClassVar annotations to dependencies attrs (RUF012)
  • Added strict=True to zip() (B905)
  • Removed unused imports: numpy, TopicOverlapPair (F401)

Python 3.14 fix:

  • Added python_version < '3.14' markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet)

Review feedback (all 5 items addressed in prior commit):

  1. Polars-only — pandas removed from library and data_quality extra
  2. Compute/Plot/Health as LazyNamespace sub-namespaces with dependency_group="data_quality"
  3. Pure __init__ + from_dataframe() classmethod; UMAP coords stored separately
  4. All three Plotly figures moved to dq.plot.* with return_df support
  5. 35 exact-value library tests + 5 Streamlit tests including state-transition

All tests and ruff pass locally. Could someone please approve the pending workflow runs when you find time? Thanks! @StijnKas @operdeck

@operdeck

operdeck commented May 15, 2026 via email

Copy link
Copy Markdown
Collaborator

@operdeck

Copy link
Copy Markdown
Collaborator

@dharani-dj I checked w @StijnKas about the app groups, lets create an "nlp" group for this (not data-quality), and make that part of the "app" dependency group, pretty much like app pulls in healthcheck or onnx pulls in api.

chowj added 2 commits May 27, 2026 12:54
- Rename optional dependency group data_quality -> nlp (per operdeck feedback)
- Add data/dq_nlp/smalltalk.csv as built-in demo dataset (1100 rows, 5 topics)
- Add dq_sample() helper in utils/datasets.py for one-liner loading
- Add 'Load sample dataset' button on home page for instant demo
- Add 'How to read this chart' guidance below UMAP visualization
- Fix ruff formatting in _health.py and _plot.py
Add pytest.importorskip for sentence_transformers and umap at module
level in both test files. On Python 3.14 (where sentence-transformers
and umap-learn have no wheels), the tests are gracefully skipped
instead of failing with ImportError.
@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0.24876% with 401 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.15%. Comparing base (73b548c) to head (3a804c4).
⚠️ Report is 73 commits behind head on master.

Files with missing lines Patch % Lines
python/pdstools/data_quality/_health.py 0.00% 237 Missing ⚠️
...ython/pdstools/data_quality/_topic_data_quality.py 0.00% 63 Missing ⚠️
python/pdstools/data_quality/_compute.py 0.00% 46 Missing ⚠️
python/pdstools/data_quality/_plot.py 0.00% 46 Missing ⚠️
python/pdstools/utils/datasets.py 10.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   91.62%   88.15%   -3.48%     
==========================================
  Files         110      116       +6     
  Lines        9617    10252     +635     
==========================================
+ Hits         8812     9038     +226     
- Misses        805     1214     +409     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

torch.onnx.export() in local_model_utils.from_pytorch requires onnxscript
on PyTorch >= 2.6 (dynamo exporter). On master nothing in the tests extra
installs torch so these tests skip; this branch's nlp extra pulls in
sentence-transformers -> torch, exposing the latent missing-dep bug.
Tighten the skipif guard to also require onnxscript.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants