feat: add Topic Data Quality module and Streamlit app #780
dharani-dj
commented
May 2, 2026
- Add pdstools.data_quality.TopicDataQuality library class with: embedding computation, UMAP visualization, topic similarity, text quality analysis, duplicate detection, sample adequacy, cluster tightness, outlier detection, confused samples, keyword overlap, health scoring, and recommendations
- Add Streamlit app under app/data_quality/ with home page, quality report page, and about page
- Integrate into unified launcher (4th tile) and CLI (pdstools run dq)
- Add data_quality optional dependency group (numpy, pandas, scikit-learn, sentence-transformers, umap-learn)
- Add 16 library tests and 4 Streamlit page tests
- Update launcher tests for 4-app layout
|
Thanks for the PR. The data-quality angle is genuinely useful, and the diagnostic surface (embeddings, similarity heatmap, outlier detection, sample adequacy) is real value. Before we merge, though, this needs to be brought in line with pdstools conventions: there's quite a bit of duplication and divergence from how we do things elsewhere.
Drop pandas, use polars
The repo is polars-first and pandas isn't earning its place here. Going through every pandas use in […] The ML libraries you feed into […] Please convert the module to polars and drop pandas from the […]
Use […]
|
- Add pdstools.data_quality.TopicDataQuality with pure __init__ and
from_dataframe classmethod (polars-only, no pandas)
- Three LazyNamespace sub-namespaces:
- dq.compute: embeddings, UMAP, TF-IDF similarity
- dq.plot: topic_distribution, umap_2d, similarity_heatmap (with return_df)
- dq.health: text quality, duplicates, adequacy, tightness,
outliers, confused samples, health score, recommendations, summary
- UMAP coords stored on _umap_coords, not mutated onto input frame
- Streamlit app under app/data_quality/ (zero-functionality presentation)
- Integrated into unified launcher (4th tile) and CLI (pdstools run dq)
- data_quality optional dep group: numpy, scikit-learn,
sentence-transformers, umap-learn, plotly
- 35 library tests with exact-value assertions + 5 Streamlit tests
including state-transition test
- Updated launcher tests for 4-app layout
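The constructor shape this commit describes (a side-effect-free `__init__`, an alternate classmethod constructor that does the validation, derived coordinates kept on a private attribute instead of being written onto the input, and a `return_df` switch on plot methods) might look roughly like the sketch below. It is not the pdstools implementation: `TopicFrame`, `from_rows`, the column names, and the fake coordinate computation are hypothetical stand-ins, and a plain list of dicts replaces the polars frame so the sketch runs without extra dependencies.

```python
from collections import Counter
from typing import Optional


class TopicFrame:
    """Hypothetical stand-in for a class like TopicDataQuality."""

    def __init__(self, rows: list[dict[str, str]]) -> None:
        # Pure __init__: only store state, no validation or heavy computation.
        self._rows = rows
        self._coords: Optional[list[tuple[float, float]]] = None

    @classmethod
    def from_rows(cls, rows: list[dict[str, str]]) -> "TopicFrame":
        # Validation lives in the alternate constructor, not in __init__.
        for row in rows:
            if "text" not in row or "topic" not in row:
                raise ValueError("each row needs 'text' and 'topic' keys")
        # Copy so later work never touches the caller's objects.
        return cls([dict(row) for row in rows])

    def compute_coords(self) -> list[tuple[float, float]]:
        # Placeholder for a real 2-D projection such as UMAP (assumption):
        # the "coordinates" here are just (row index, text length).
        if self._coords is None:
            self._coords = [
                (float(i), float(len(row["text"])))
                for i, row in enumerate(self._rows)
            ]
        return self._coords

    def topic_distribution(self, return_df: bool = False):
        # With return_df=True the caller gets the data behind the plot
        # instead of a figure; figure rendering is omitted in this sketch.
        data = sorted(Counter(row["topic"] for row in self._rows).items())
        if return_df:
            return data
        raise NotImplementedError("figure rendering is out of scope here")


rows = [
    {"text": "hello there", "topic": "greeting"},
    {"text": "bye", "topic": "farewell"},
    {"text": "hi", "topic": "greeting"},
]
dq = TopicFrame.from_rows(rows)
coords = dq.compute_coords()
distribution = dq.topic_distribution(return_df=True)
```

Keeping the coordinates on `_coords` means the caller's `rows` still contain only the `text` and `topic` keys after the projection has been computed, which is the property the commit message calls out.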
8fe9721 to 2e94b3b
|
@dharani-dj please have a look at the build failures and the PR review comments.
- Add ClassVar annotation to dependencies attrs (RUF012)
- Add strict=True to zip() call (B905)
- Remove unused imports: numpy, TopicOverlapPair (F401)
- Add python_version<3.14 markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet)
|
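Pushed fix commit (eecc819) addressing all CI failures:
Ruff fixes:
- Added ClassVar annotations to dependencies attrs (RUF012)
- Added strict=True to zip() (B905)
- Removed unused imports: numpy, TopicOverlapPair (F401)
Python 3.14 fix:
- Added python_version < '3.14' markers for sentence-transformers and umap-learn (no torch/numba wheels for 3.14 yet)
Review feedback (all 5 items addressed in prior commit):
1. Polars-only: pandas removed from library and data_quality extra
2. Compute/Plot/Health as LazyNamespace sub-namespaces with dependency_group="data_quality"
3. Pure __init__ + from_dataframe() classmethod; UMAP coords stored separately
4. All three Plotly figures moved to dq.plot.* with return_df support
5. 35 exact-value library tests + 5 Streamlit tests including state-transition
All tests and ruff pass locally. Could someone please approve the pending workflow runs when you find time? Thanks! @StijnKas @operdeck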
|
Is “data-quality” the right name for this dep group? It sounds overly generic.
Shouldn’t it be nlp/topic modeling? And do we even need another dep group?
It makes it harder to have one shared launcher.
On Wed, May 13, 2026 at 09:54, dharani ***@***.***> wrote:
*dharani-dj* left a comment (pegasystems/pega-datascientist-tools#780)
Pushed fix commit (eecc819) addressing all CI failures:
*Ruff fixes:*
- Added ClassVar annotations to dependencies attrs (RUF012)
- Added strict=True to zip() (B905)
- Removed unused imports: numpy, TopicOverlapPair (F401)
*Python 3.14 fix:*
- Added python_version < '3.14' markers for sentence-transformers and
umap-learn (no torch/numba wheels for 3.14 yet)
*Review feedback (all 5 items addressed in prior commit):*
1. Polars-only — pandas removed from library and data_quality extra
2. Compute/Plot/Health as LazyNamespace sub-namespaces with
dependency_group="data_quality"
3. Pure __init__ + from_dataframe() classmethod; UMAP coords stored
separately
4. All three Plotly figures moved to dq.plot.* with return_df support
5. 35 exact-value library tests + 5 Streamlit tests including
state-transition
All tests and ruff pass locally. Could someone please approve the pending
workflow runs when you find time? Thanks! @StijnKas
<https://github.com/StijnKas> @operdeck <https://github.com/operdeck>
|
|
@dharani-dj I checked with @StijnKas about the app groups. Let's create an "nlp" group for this (not data-quality) and make it part of the "app" dependency group, much like app pulls in healthcheck or onnx pulls in api.
- Rename optional dependency group data_quality -> nlp (per operdeck feedback)
- Add data/dq_nlp/smalltalk.csv as built-in demo dataset (1100 rows, 5 topics)
- Add dq_sample() helper in utils/datasets.py for one-liner loading
- Add 'Load sample dataset' button on home page for instant demo
- Add 'How to read this chart' guidance below UMAP visualization
- Fix ruff formatting in _health.py and _plot.py
Add pytest.importorskip for sentence_transformers and umap at module level in both test files. On Python 3.14 (where sentence-transformers and umap-learn have no wheels), the tests are gracefully skipped instead of failing with ImportError.
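The skip-on-missing-import behaviour this commit relies on can be illustrated without pytest installed. The sketch below is a stdlib-only approximation, not pytest's implementation: `import_or_skip` is a hypothetical helper that raises `unittest.SkipTest` where `pytest.importorskip` would skip the module, and the missing package name is made up.

```python
import importlib
import unittest
from types import ModuleType


def import_or_skip(name: str) -> ModuleType:
    """Return the imported module, or signal a skip if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # unittest.SkipTest signals "skipped" rather than "failed".
        raise unittest.SkipTest(f"could not import {name!r}") from exc


# Stdlib modules always import, so this returns the module object.
json_mod = import_or_skip("json")

# A missing package turns into a skip signal instead of an ImportError.
try:
    import_or_skip("package_that_is_not_installed_xyz")
    skipped = False
except unittest.SkipTest:
    skipped = True
```

In an actual pytest test file the equivalent is a module-level call such as `pytest.importorskip("sentence_transformers")`, placed before any import that depends on the optional package.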
Codecov Report
❌ Patch coverage is […]
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   91.62%   88.15%   -3.48%
==========================================
  Files         110      116       +6
  Lines        9617    10252     +635
==========================================
+ Hits         8812     9038     +226
- Misses        805     1214     +409
☔ View full report in Codecov by Sentry.
|
torch.onnx.export() in local_model_utils.from_pytorch requires onnxscript on PyTorch >= 2.6 (dynamo exporter). On master nothing in the tests extra installs torch so these tests skip; this branch's nlp extra pulls in sentence-transformers -> torch, exposing the latent missing-dep bug. Tighten the skipif guard to also require onnxscript.
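A guard that requires every optional package, rather than only the first one, could be written along these lines. This is a sketch under assumptions: `has_modules`, `HAS_EXPORT_DEPS`, and the test class are hypothetical names, `unittest.skipUnless` stands in for whatever skip decorator the repository's tests actually use, and the test body is a placeholder.

```python
import importlib.util
import unittest


def has_modules(*names: str) -> bool:
    """True only if every named top-level module can be found."""
    return all(importlib.util.find_spec(name) is not None for name in names)


# The export path needs both packages, so the guard checks both
# (assumption: these are the only optional imports the test exercises).
HAS_EXPORT_DEPS = has_modules("torch", "onnxscript")


class TestFromPytorch(unittest.TestCase):
    @unittest.skipUnless(HAS_EXPORT_DEPS, "requires torch and onnxscript")
    def test_export_roundtrip(self) -> None:
        # Placeholder body; a real test would call the export helper here.
        self.assertTrue(HAS_EXPORT_DEPS)
```

Checking with `find_spec` avoids importing torch just to decide whether to skip, which keeps collection fast when the package is present.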