feat: add Topic Data Quality module and Streamlit app by dharani-dj · Pull Request #780 · pegasystems/pega-datascientist-tools

dharani-dj · 2026-05-02T04:49:52Z

Add pdstools.data_quality.TopicDataQuality library class with: embedding computation, UMAP visualization, topic similarity, text quality analysis, duplicate detection, sample adequacy, cluster tightness, outlier detection, confused samples, keyword overlap, health scoring, and recommendations
Add Streamlit app under app/data_quality/ with home page, quality report page, and about page
Integrate into unified launcher (4th tile) and CLI (pdstools run dq)
Add data_quality optional dependency group (numpy, pandas, scikit-learn, sentence-transformers, umap-learn)
Add 16 library tests and 4 Streamlit page tests
Update launcher tests for 4-app layout

StijnKas · 2026-05-08T08:30:30Z

Thanks for the PR. The data-quality angle is genuinely useful and the diagnostic surface (embeddings, similarity heatmap, outlier detection, sample adequacy) is real value. Before we merge, though, this needs to be brought in line with pdstools conventions — there's quite a bit of duplication and divergence from how we do things elsewhere.

Drop pandas, use polars

The repo is polars-first and pandas isn't earning its place here. Going through every pandas use in _topic_data_quality.py: the DataFrame work — dropna, value_counts, nunique, astype(str), groupby(col)[other].apply(" ".join), sort_values, building frames from row-dicts — is all trivially polars-equivalent. Several are cleaner in polars (the groupby-then-join is just df.group_by(topic).agg(pl.col(text).str.concat(" "))).

The ML libraries you feed into — sentence_transformers.encode(), sklearn.cosine_similarity(), umap.UMAP().fit_transform() — all take lists or numpy arrays, not DataFrames. The seam is df.get_column(text_col).to_list() for the encoder and series.to_numpy() (zero-copy when dtypes align) for the ML calls. Pandas adds a roundtrip without earning anything.

Please convert the module to polars and drop pandas from the data_quality extra. Numpy stays — the ML libs need it.

Use `LazyNamespace` for the sub-namespaces from the start

There are already ~12 public methods on TopicDataQuality and the design clearly splits along three axes: compute (embeddings, UMAP, similarity), inspect (duplicates, outliers, confused samples, adequacy), and report (health score, recommendations, summary). That's the textbook case for the namespace-facade pattern we use on ADMDatamart (dm.plot, dm.aggregates, dm.generate, dm.bin_aggregator).

Set this up now, not after it grows past 20 methods:

class TopicDataQuality:
    def __init__(self, df: pl.DataFrame, ...):
        ...
        self.compute = Compute(self)   # embeddings, UMAP, similarity
        self.plot = Plot(self)         # the Plotly figures (see below)
        self.health = Health(self)     # outliers, duplicates, scoring, recommendations

Each sub-namespace inherits from pdstools.utils.namespaces.LazyNamespace, declares its dependencies (e.g. ["plotly"] for Plot, ["umap-learn"] for Compute) and dependency_group = "data_quality". That gives you free MissingDependenciesException handling — users without the extra installed get a clear install message on dq.plot, not a raw ImportError from inside compute_umap. Replaces the current ad-hoc lazy imports.

Pure `init`, no caller mutation

__post_init__ mutates the caller's frame in place:

self.df = self.df[[self.text_col, self.topic_col]].dropna().reset_index(drop=True)
self.df[self.text_col] = self.df[self.text_col].astype(str)

Then compute_umap() writes x / y columns onto the same frame. The caller hands you a DataFrame and gets it back with extra columns silently appended. AGENTS.md has a pure-__init__ rule for exactly this reason — the constructor takes already-clean data, and normalisation lives in a from_<source> classmethod:

@classmethod
def from_dataframe(cls, df: pl.DataFrame, text_col: str, topic_col: str, ...) -> TopicDataQuality:
    clean = df.select(text_col, topic_col).drop_nulls().with_columns(pl.col(text_col).cast(pl.Utf8))
    return cls(df=clean, text_col=text_col, topic_col=topic_col, ...)

UMAP coordinates should be returned, or stored on self as a separate _umap_coords attribute, not appended to the input frame.

Plot construction belongs in the library

pages/1_Quality_Report.py builds Plotly figures directly — px.bar, go.Figure/go.Heatmap for the UMAP scatter and similarity heatmap, layout choices, hover templates. AGENTS.md is explicit that pages are zero-functionality presentation layers: every plot in the app should be reproducible from a notebook with one library call, and the page is just st.plotly_chart(dq.plot.xxx()).

Move those three figures into the Plot(LazyNamespace) namespace and follow the standard signature:

def topic_distribution(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...
def umap_2d(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...
def similarity_heatmap(self, *, return_df: bool = False) -> go.Figure | pl.DataFrame: ...

The return_df keyword (with @overload) is the standard across existing analyzer plot methods — makes plots scriptable, testable, and composable.

Tests: tighten and add state-transition

The structure is good but most assertions are structural (len(short_texts) == 2, len(dups) >= 4, "diagonal is 1.0"). With 16 hand-crafted rows every output is hand-computable — avg_words, off-diagonal similarity values, cluster-tightness scores. Trade structural checks for exact-value asserts on the same rows; that's what catches silent regressions where the shape stays right but the math drifts.

Separately: the home page has an upload + analyze flow that mutates st.session_state["topic_dq"], and test_dq_pages.py only smoke-tests the renders. Add a state-transition test that simulates the upload and asserts the resulting analyzer was built (row count, detected columns). The exact pattern bit us in the v5.0.0 launcher; AGENTS.md has the reference shape.

Summary

To get this merge-ready:

Switch the module to polars; drop pandas from the data_quality extra.
Set up Compute / Plot / Health as LazyNamespace sub-namespaces from the start, with dependency_group="data_quality" — replaces the ad-hoc lazy imports.
Make __init__ pure; move filtering/coercion into from_dataframe; don't mutate the input frame with UMAP coords.
Move the three Plotly figures from the Quality Report page into dq.plot.* with return_df support.
Tighten unit tests to exact-value assertions; add an upload→analyze state-transition test.

Happy to talk through any of this if it's helpful.

- Add pdstools.data_quality.TopicDataQuality with pure __init__ and from_dataframe classmethod (polars-only, no pandas) - Three LazyNamespace sub-namespaces: - dq.compute: embeddings, UMAP, TF-IDF similarity - dq.plot: topic_distribution, umap_2d, similarity_heatmap (with return_df) - dq.health: text quality, duplicates, adequacy, tightness, outliers, confused samples, health score, recommendations, summary - UMAP coords stored on _umap_coords, not mutated onto input frame - Streamlit app under app/data_quality/ (zero-functionality presentation) - Integrated into unified launcher (4th tile) and CLI (pdstools run dq) - data_quality optional dep group: numpy, scikit-learn, sentence-transformers, umap-learn, plotly - 35 library tests with exact-value assertions + 5 Streamlit tests including state-transition test - Updated launcher tests for 4-app layout

operdeck · 2026-05-12T13:41:03Z

@dharani-dj please have a look at the build failures and the PR review comments

- Add ClassVar annotation to dependencies attrs (RUF012) - Add strict=True to zip() call (B905) - Remove unused imports: numpy, TopicOverlapPair (F401) - Add python_version<3.14 markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet)

dharani-dj · 2026-05-13T07:53:39Z

Pushed fix commit (eecc819) addressing all CI failures:

Ruff fixes:

Added ClassVar annotations to dependencies attrs (RUF012)
Added strict=True to zip() (B905)
Removed unused imports: numpy, TopicOverlapPair (F401)

Python 3.14 fix:

Added python_version < '3.14' markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet)

Review feedback (all 5 items addressed in prior commit):

Polars-only — pandas removed from library and data_quality extra
Compute/Plot/Health as LazyNamespace sub-namespaces with dependency_group="data_quality"
Pure __init__ + from_dataframe() classmethod; UMAP coords stored separately
All three Plotly figures moved to dq.plot.* with return_df support
35 exact-value library tests + 5 Streamlit tests including state-transition

All tests and ruff pass locally. Could someone please approve the pending workflow runs when you find time? Thanks! @StijnKas @operdeck

operdeck · 2026-05-15T14:27:16Z

Is “data-quality “ the right name for this dep group? Sounds over generic, shouldn’t it be nlp/topic modeling and do we even need another dep group? It makes it harder to have one shared launcher. Op wo 13 mei 2026 om 09:54 schreef dharani ***@***.***>

…

*dharani-dj* left a comment (pegasystems/pega-datascientist-tools#780) <#780 (comment)> Pushed fix commit (eecc819 <eecc819>) addressing all CI failures: *Ruff fixes:* - Added ClassVar annotations to dependencies attrs (RUF012) - Added strict=True to zip() (B905) - Removed unused imports: numpy, TopicOverlapPair (F401) *Python 3.14 fix:* - Added python_version < '3.14' markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet) *Review feedback (all 5 items addressed in prior commit):* 1. Polars-only — pandas removed from library and data_quality extra 2. Compute/Plot/Health as LazyNamespace sub-namespaces with dependency_group="data_quality" 3. Pure __init__ + from_dataframe() classmethod; UMAP coords stored separately 4. All three Plotly figures moved to dq.plot.* with return_df support 5. 35 exact-value library tests + 5 Streamlit tests including state-transition All tests and ruff pass locally. Could someone please approve the pending workflow runs when you find time? Thanks! @StijnKas <https://github.com/StijnKas> @operdeck <https://github.com/operdeck> — Reply to this email directly, view it on GitHub <#780 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABEVTKFICGQEKGFV6HREHXL42QSZTAVCNFSM6AAAAACYOBTEHKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DIMZYGY2DQMRYGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

operdeck · 2026-05-26T09:42:29Z

@dharani-dj I checked w @StijnKas about the app groups, lets create an "nlp" group for this (not data-quality), and make that part of the "app" dependency group, pretty much like app pulls in healthcheck or onnx pulls in api.

- Rename optional dependency group data_quality -> nlp (per operdeck feedback) - Add data/dq_nlp/smalltalk.csv as built-in demo dataset (1100 rows, 5 topics) - Add dq_sample() helper in utils/datasets.py for one-liner loading - Add 'Load sample dataset' button on home page for instant demo - Add 'How to read this chart' guidance below UMAP visualization - Fix ruff formatting in _health.py and _plot.py

Add pytest.importorskip for sentence_transformers and umap at module level in both test files. On Python 3.14 (where sentence-transformers and umap-learn have no wheels), the tests are gracefully skipped instead of failing with ImportError.

codecov · 2026-05-28T06:11:12Z

Codecov Report

❌ Patch coverage is 0.24876% with 401 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.15%. Comparing base (73b548c) to head (3a804c4).
⚠️ Report is 73 commits behind head on master.

Files with missing lines	Patch %	Lines
python/pdstools/data_quality/_health.py	0.00%	237 Missing ⚠️
...ython/pdstools/data_quality/_topic_data_quality.py	0.00%	63 Missing ⚠️
python/pdstools/data_quality/_compute.py	0.00%	46 Missing ⚠️
python/pdstools/data_quality/_plot.py	0.00%	46 Missing ⚠️
python/pdstools/utils/datasets.py	10.00%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   91.62%   88.15%   -3.48%     
==========================================
  Files         110      116       +6     
  Lines        9617    10252     +635     
==========================================
+ Hits         8812     9038     +226     
- Misses        805     1214     +409

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

torch.onnx.export() in local_model_utils.from_pytorch requires onnxscript on PyTorch >= 2.6 (dynamo exporter). On master nothing in the tests extra installs torch so these tests skip; this branch's nlp extra pulls in sentence-transformers -> torch, exposing the latent missing-dep bug. Tighten the skipif guard to also require onnxscript.

dharani-dj force-pushed the feature/topic-data-quality branch from 8fe9721 to 2e94b3b Compare May 8, 2026 09:07

chowj added 2 commits May 27, 2026 12:54

jmelburg mentioned this pull request Jun 3, 2026

add onnxscript dependency and split up PyTorch tests #822

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Topic Data Quality module and Streamlit app#780

feat: add Topic Data Quality module and Streamlit app#780
dharani-dj wants to merge 5 commits into
pegasystems:masterfrom
dharani-dj:feature/topic-data-quality

dharani-dj commented May 2, 2026

Uh oh!

StijnKas commented May 8, 2026

Uh oh!

operdeck commented May 12, 2026

Uh oh!

dharani-dj commented May 13, 2026

Uh oh!

operdeck commented May 15, 2026 via email

Uh oh!

operdeck commented May 26, 2026

Uh oh!

codecov Bot commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dharani-dj commented May 2, 2026

Uh oh!

StijnKas commented May 8, 2026

Drop pandas, use polars

Use LazyNamespace for the sub-namespaces from the start

Pure __init__, no caller mutation

Plot construction belongs in the library

Tests: tighten and add state-transition

Summary

Uh oh!

operdeck commented May 12, 2026

Uh oh!

dharani-dj commented May 13, 2026

Uh oh!

operdeck commented May 15, 2026 via email

Uh oh!

operdeck commented May 26, 2026

Uh oh!

codecov Bot commented May 28, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use `LazyNamespace` for the sub-namespaces from the start

Pure `init`, no caller mutation