Skip to content

Conversation

@GunaPalanivel
Copy link
Contributor

Summary

  • introduce a centralized YAML config at chatbot-core/config/data-pipeline.yml for collection, preprocessing, chunking, embedding, and storage
  • add a pipeline config loader with env-var overrides and update Makefile data targets to accept CONFIG_PATH/DATA_PIPELINE_CONFIG
  • refactor pipeline scripts to consume the centralized config, add documentation, and unit tests for the loader

Testing

  • python -m py_compile chatbot-core/data/chunking/extract_chunk_docs.py
  • python -m py_compile chatbot-core/data/chunking/extract_chunk_plugins.py
  • python -m py_compile chatbot-core/rag/embedding/embed_chunks.py
  • python -m py_compile chatbot-core/rag/vectorstore/store_embeddings.py
  • python -c "from config.pipeline_loader import load_pipeline_config; print(load_pipeline_config()['chunking'])"
  • python -c "import os; os.environ['CHUNK_SIZE']='700'; os.environ['CHUNK_OVERLAP']='140'; from config.pipeline_loader import load_pipeline_config; cfg=load_pipeline_config(); print(cfg['chunking']['chunk_size'], cfg['chunking']['chunk_overlap'])"

Notes

- Create centralized YAML config at chatbot-core/config/data-pipeline.yml
- Implement config loader with environment variable override support
- Refactor collection, preprocessing, chunking, embedding, and storage scripts
- Update Makefile targets to support CONFIG_PATH variable
- Add comprehensive documentation for config usage
- Add unit tests for config loader functionality

Fixes jenkinsci#26
@GunaPalanivel GunaPalanivel requested a review from a team as a code owner January 19, 2026 17:53
- Remove uppercase logger name transformation to allow caplog to capture 'data-pipeline' logger records
- Enable logger propagation so pytest caplog can capture test records correctly
- Add centralized path_utils.resolve_data_dir() to eliminate duplicate path resolution code (fixes R0801)
- Refactor docs_crawler.py and preprocess_docs.py to use resolve_data_dir utility
- Convert remaining f-string logging to lazy %s formatting (W1203)
- Split long lines under 100 char limit (C0301)
- Use raw string for regex pattern in test_pipeline_config.py
- Remove trailing whitespace across affected files

All pipeline config tests now pass: 16 passed, 1 skipped
- Convert file to Unix line endings (LF)
- Strip trailing whitespace from all lines
- This resolves all remaining C0303 pylint violations
@GunaPalanivel
Copy link
Contributor Author

CI Fix Summary: Pipeline Config Linting & Tests

What Was Fixed

Pylint C0303 (Trailing Whitespace)

Removed trailing spaces from pipeline_loader.py docstrings and blank lines by converting file to Unix line endings (LF) and stripping all trailing whitespace.

Test Failure (test_env_override_invalid_value)

Fixed logger caplog capture by:

  • Removing uppercase transformation in LoggerFactory.get_logger()
  • Enabling logger.propagate = True so pytest's caplog can capture named logger records

Duplicate Code (R0801)

Eliminated duplicate path resolution:

  • Created utils/path_utils.py with resolve_data_dir() utility
  • Refactored docs_crawler.py and preprocess_docs.py to use it

Minor Lint Issues

Fixed:

  • Line-too-long (C0301): Split long lines, used raw strings for regex
  • Lazy logging (W1203): Converted f-string formatting to %s in logger calls

Files Changed

  • pipeline_loader.py — Cleaned docstrings, fixed line endings
  • logger.py — Removed name uppercasing, enabled propagation
  • utils/path_utils.py — New shared utility (NEW)
  • docs_crawler.py — Refactored to use path utility
  • preprocess_docs.py — Refactored to use path utility
  • test_pipeline_config.py — Fixed line length, raw string regex

@berviantoleo berviantoleo added the enhancement For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted label Jan 20, 2026
@cnu1812
Copy link
Member

cnu1812 commented Jan 20, 2026

@giovanni-vaccarino what's your view on this?

@giovanni-vaccarino
Copy link
Contributor

@giovanni-vaccarino what's your view on this?

lgtm at first glance. I'll give another check tomorrow

@berviantoleo berviantoleo requested a review from cnu1812 January 21, 2026 00:46
@GunaPalanivel
Copy link
Contributor Author

GunaPalanivel commented Jan 21, 2026

Centralized data pipeline configuration: validation + small additions

Summary

  • Centralized YAML at chatbot-core/config/data-pipeline.yml for collection, preprocessing, chunking, embedding, storage.
  • Loader with env var overrides (CHUNK_SIZE, CHUNK_OVERLAP, EMBEDDING_MODEL, FAISS_N_LIST, FAISS_N_PROBE).
  • Makefile supports CONFIG_PATH/DATA_PIPELINE_CONFIG for reproducible runs.
  • Refactored pipeline scripts to consume config; added docs and unit tests.

Validation

  • Unit tests passed locally; YAML loads; env overrides verified.

Additions beyond original scope (please confirm if desired)

  • preprocessing.docs: filter_min_visible_text_length (default 300), filter_max_link_ratio (default 0.1) to fine-tune filtering; wired into filter_processed_docs.py.
  • stackoverflow chunker now config-driven (input/output paths, chunk size/overlap).

Open questions

  1. Keep these new filter knobs in default YAML, or remove to keep minimal?
  2. Any other sources to include by default in embedding.chunk_files?
  3. Preference for default storage.index_type (IVFFlat vs Flat) for smaller datasets?

Next actions

  • Happy to adjust defaults, trim YAML, or split knobs into a follow-up PR if preferred.

cc: @giovanni-vaccarino , @berviantoleo

@cnu1812
Copy link
Member

cnu1812 commented Jan 23, 2026

@GunaPalanivel checkout these new guidelines https://www.jenkins.io/projects/gsoc/ai-usage-policy/

Copy link
Contributor

@berviantoleo berviantoleo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the other documentation too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants