Add centralized data pipeline configuration (#26) #113

GunaPalanivel · 2026-01-19T17:53:29Z

Summary

introduce a centralized YAML config at chatbot-core/config/data-pipeline.yml for collection, preprocessing, chunking, embedding, and storage
add a pipeline config loader with env-var overrides and update Makefile data targets to accept CONFIG_PATH/DATA_PIPELINE_CONFIG
refactor pipeline scripts to consume the centralized config, add documentation, and unit tests for the loader

Testing

python -m py_compile chatbot-core/data/chunking/extract_chunk_docs.py
python -m py_compile chatbot-core/data/chunking/extract_chunk_plugins.py
python -m py_compile chatbot-core/rag/embedding/embed_chunks.py
python -m py_compile chatbot-core/rag/vectorstore/store_embeddings.py
python -c "from config.pipeline_loader import load_pipeline_config; print(load_pipeline_config()['chunking'])"
python -c "import os; os.environ['CHUNK_SIZE']='700'; os.environ['CHUNK_OVERLAP']='140'; from config.pipeline_loader import load_pipeline_config; cfg=load_pipeline_config(); print(cfg['chunking']['chunk_size'], cfg['chunking']['chunk_overlap'])"

Notes

Issue: Add centralized config file for data pipeline #26
Defaults preserve previous hard-coded values; env overrides supported (CHUNK_SIZE, CHUNK_OVERLAP, EMBEDDING_MODEL, FAISS_N_LIST, FAISS_N_PROBE)

- Create centralized YAML config at chatbot-core/config/data-pipeline.yml - Implement config loader with environment variable override support - Refactor collection, preprocessing, chunking, embedding, and storage scripts - Update Makefile targets to support CONFIG_PATH variable - Add comprehensive documentation for config usage - Add unit tests for config loader functionality Fixes jenkinsci#26

- Remove uppercase logger name transformation to allow caplog to capture 'data-pipeline' logger records - Enable logger propagation so pytest caplog can capture test records correctly - Add centralized path_utils.resolve_data_dir() to eliminate duplicate path resolution code (fixes R0801) - Refactor docs_crawler.py and preprocess_docs.py to use resolve_data_dir utility - Convert remaining f-string logging to lazy %s formatting (W1203) - Split long lines under 100 char limit (C0301) - Use raw string for regex pattern in test_pipeline_config.py - Remove trailing whitespace across affected files All pipeline config tests now pass: 16 passed, 1 skipped

- Convert file to Unix line endings (LF) - Strip trailing whitespace from all lines - This resolves all remaining C0303 pylint violations

GunaPalanivel · 2026-01-19T19:15:55Z

CI Fix Summary: Pipeline Config Linting & Tests

What Was Fixed

Pylint C0303 (Trailing Whitespace)

Removed trailing spaces from pipeline_loader.py docstrings and blank lines by converting file to Unix line endings (LF) and stripping all trailing whitespace.

Test Failure (test_env_override_invalid_value)

Fixed logger caplog capture by:

Removing uppercase transformation in LoggerFactory.get_logger()
Enabling logger.propagate = True so pytest's caplog can capture named logger records

Duplicate Code (R0801)

Eliminated duplicate path resolution:

Created utils/path_utils.py with resolve_data_dir() utility
Refactored docs_crawler.py and preprocess_docs.py to use it

Minor Lint Issues

Fixed:

Line-too-long (C0301): Split long lines, used raw strings for regex
Lazy logging (W1203): Converted f-string formatting to %s in logger calls

Files Changed

pipeline_loader.py — Cleaned docstrings, fixed line endings
logger.py — Removed name uppercasing, enabled propagation
utils/path_utils.py — New shared utility (NEW)
docs_crawler.py — Refactored to use path utility
preprocess_docs.py — Refactored to use path utility
test_pipeline_config.py — Fixed line length, raw string regex

chatbot-core/retriv

cnu1812 · 2026-01-20T06:59:48Z

@giovanni-vaccarino what's your view on this?

giovanni-vaccarino · 2026-01-21T00:28:10Z

@giovanni-vaccarino what's your view on this?

lgtm at first glance. I'll give another check tomorrow

GunaPalanivel · 2026-01-21T10:08:44Z

Centralized data pipeline configuration: validation + small additions

Summary

Centralized YAML at chatbot-core/config/data-pipeline.yml for collection, preprocessing, chunking, embedding, storage.
Loader with env var overrides (CHUNK_SIZE, CHUNK_OVERLAP, EMBEDDING_MODEL, FAISS_N_LIST, FAISS_N_PROBE).
Makefile supports CONFIG_PATH/DATA_PIPELINE_CONFIG for reproducible runs.
Refactored pipeline scripts to consume config; added docs and unit tests.

Validation

Unit tests passed locally; YAML loads; env overrides verified.

Additions beyond original scope (please confirm if desired)

preprocessing.docs: filter_min_visible_text_length (default 300), filter_max_link_ratio (default 0.1) to fine-tune filtering; wired into filter_processed_docs.py.
stackoverflow chunker now config-driven (input/output paths, chunk size/overlap).

Open questions

Keep these new filter knobs in default YAML, or remove to keep minimal?
Any other sources to include by default in embedding.chunk_files?
Preference for default storage.index_type (IVFFlat vs Flat) for smaller datasets?

Next actions

Happy to adjust defaults, trim YAML, or split knobs into a follow-up PR if preferred.

cc: @giovanni-vaccarino , @berviantoleo

cnu1812 · 2026-01-23T07:09:11Z

@GunaPalanivel checkout these new guidelines https://www.jenkins.io/projects/gsoc/ai-usage-policy/

berviantoleo

Please update the other documentation too.

…file usage

GunaPalanivel requested a review from a team as a code owner January 19, 2026 17:53

GunaPalanivel added 4 commits January 19, 2026 23:33

Fix pipeline config tests and lint issues

173e0a0

Remove trailing whitespace from docstrings in pipeline_loader.py

b7fa3d7

Fix line endings and remove all trailing whitespace

e908df5

- Convert file to Unix line endings (LF) - Strip trailing whitespace from all lines - This resolves all remaining C0303 pylint violations

berviantoleo reviewed Jan 19, 2026

View reviewed changes

chatbot-core/retriv Outdated Show resolved Hide resolved

berviantoleo added the enhancement For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted label Jan 20, 2026

cnu1812 and others added 2 commits January 20, 2026 12:30

Merge branch 'main' into feature/centralized-data-pipeline-config

1108c07

Remove accidentally vendored retriv directory

22eb841

GunaPalanivel requested a review from berviantoleo January 20, 2026 10:44

berviantoleo requested a review from cnu1812 January 21, 2026 00:46

berviantoleo requested changes Jan 23, 2026

View reviewed changes

docs: reference centralized data-pipeline config across docs and Make…

ca12968

…file usage

GunaPalanivel requested a review from berviantoleo January 24, 2026 09:27

Merge branch 'main' into feature/centralized-data-pipeline-config

58fc1e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add centralized data pipeline configuration (#26) #113

Add centralized data pipeline configuration (#26) #113

Uh oh!

GunaPalanivel commented Jan 19, 2026

Uh oh!

GunaPalanivel commented Jan 19, 2026

Uh oh!

Uh oh!

cnu1812 commented Jan 20, 2026 •

edited

Loading

Uh oh!

giovanni-vaccarino commented Jan 21, 2026

Uh oh!

GunaPalanivel commented Jan 21, 2026 •

edited

Loading

Uh oh!

cnu1812 commented Jan 23, 2026

Uh oh!

berviantoleo left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Add centralized data pipeline configuration (#26) #113

Are you sure you want to change the base?

Add centralized data pipeline configuration (#26) #113

Uh oh!

Conversation

GunaPalanivel commented Jan 19, 2026

Summary

Testing

Notes

Uh oh!

GunaPalanivel commented Jan 19, 2026

CI Fix Summary: Pipeline Config Linting & Tests

What Was Fixed

Pylint C0303 (Trailing Whitespace)

Test Failure (test_env_override_invalid_value)

Duplicate Code (R0801)

Minor Lint Issues

Files Changed

Uh oh!

Uh oh!

cnu1812 commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giovanni-vaccarino commented Jan 21, 2026

Uh oh!

GunaPalanivel commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cnu1812 commented Jan 23, 2026

Uh oh!

berviantoleo left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cnu1812 commented Jan 20, 2026 •

edited

Loading

GunaPalanivel commented Jan 21, 2026 •

edited

Loading