Add centralized data pipeline configuration (#26) #113
- Create centralized YAML config at chatbot-core/config/data-pipeline.yml
- Implement config loader with environment variable override support
- Refactor collection, preprocessing, chunking, embedding, and storage scripts
- Update Makefile targets to support CONFIG_PATH variable
- Add comprehensive documentation for config usage
- Add unit tests for config loader functionality

Fixes jenkinsci#26
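For readers who want a picture of what "environment variable override support" can look like, here is a minimal sketch. It is not the loader from this PR: the function name `apply_env_overrides`, the `PIPELINE_` prefix, and the `SECTION_KEY` naming scheme are all assumptions made for illustration. It takes an already-parsed config dict (for example, the result of loading the YAML file), replaces values with matching environment variables, and keeps the default with a warning when an override cannot be parsed.

```python
import copy
import logging
import os

# Logger name taken from the PR discussion; everything else here is hypothetical.
logger = logging.getLogger("data-pipeline")


def _coerce(raw, default):
    """Convert the raw env string to the type of the default value."""
    if isinstance(default, bool):  # check bool before int: bool is an int subclass
        lowered = raw.strip().lower()
        if lowered in ("1", "true", "yes"):
            return True
        if lowered in ("0", "false", "no"):
            return False
        raise ValueError(raw)
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw


def apply_env_overrides(config, environ=None, prefix="PIPELINE_"):
    """Return a copy of a two-level config dict with env overrides applied.

    A value at config[section][key] is overridden by the variable
    <prefix><SECTION>_<KEY> (assumed naming scheme), if it is set.
    """
    environ = os.environ if environ is None else environ
    result = copy.deepcopy(config)
    for section, values in result.items():
        for key, default in values.items():
            env_name = f"{prefix}{section}_{key}".upper()
            if env_name not in environ:
                continue
            raw = environ[env_name]
            try:
                values[key] = _coerce(raw, default)
            except ValueError:
                # Invalid override: keep the configured default and warn.
                logger.warning("Ignoring invalid value %r for %s", raw, env_name)
    return result
```

Passing `environ` explicitly keeps the function easy to unit test without touching the real process environment.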
- Remove uppercase logger name transformation to allow caplog to capture 'data-pipeline' logger records
- Enable logger propagation so pytest caplog can capture test records correctly
- Add centralized path_utils.resolve_data_dir() to eliminate duplicate path resolution code (fixes R0801)
- Refactor docs_crawler.py and preprocess_docs.py to use resolve_data_dir utility
- Convert remaining f-string logging to lazy %s formatting (W1203)
- Split long lines under 100 char limit (C0301)
- Use raw string for regex pattern in test_pipeline_config.py
- Remove trailing whitespace across affected files

All pipeline config tests now pass: 16 passed, 1 skipped
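The two logging changes in this commit can be illustrated with a small sketch. The helper names `get_pipeline_logger` and `report_chunks` are hypothetical and not taken from the PR; only the logger name 'data-pipeline' comes from the commit message. The sketch shows a logger that propagates to the root logger (pytest's caplog captures records through a handler on the root logger) and a log call that uses lazy %s formatting instead of an f-string, which is what pylint W1203 asks for.

```python
import logging


def get_pipeline_logger(name="data-pipeline"):
    """Return the pipeline logger with propagation enabled.

    With propagate=True, records travel up to handlers attached to the
    root logger, which is where pytest's caplog listens.
    """
    pipeline_logger = logging.getLogger(name)
    pipeline_logger.propagate = True
    return pipeline_logger


logger = get_pipeline_logger()


def report_chunks(count, source):
    """Log a summary line using lazy %s formatting (avoids pylint W1203)."""
    # The arguments are only interpolated if the record is actually emitted.
    logger.info("Created %s chunks from %s", count, source)
```

In a pytest test, something like `caplog.at_level(logging.INFO, logger="data-pipeline")` would then see the record under the unmodified logger name.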
- Convert file to Unix line endings (LF)
- Strip trailing whitespace from all lines
- This resolves all remaining C0303 pylint violations
CI Fix Summary: Pipeline Config Linting & Tests
What Was Fixed
Pylint C0303 (Trailing Whitespace)
Removed trailing spaces from
Test Failure (test_env_override_invalid_value)
Fixed logger caplog capture by:
Duplicate Code (R0801)
Eliminated duplicate path resolution:
Minor Lint Issues
Fixed:
Files Changed
@giovanni-vaccarino what's your view on this?
lgtm at first glance. I'll give another check tomorrow
Centralized data pipeline configuration: validation + small additions
Summary
Validation
Additions beyond original scope (please confirm if desired)
Open questions
Next actions
@GunaPalanivel check out these new guidelines: https://www.jenkins.io/projects/gsoc/ai-usage-policy/
berviantoleo
left a comment
Please update the other documentation too.
Summary
- chatbot-core/config/data-pipeline.yml for collection, preprocessing, chunking, embedding, and storage
- CONFIG_PATH/DATA_PIPELINE_CONFIG
Testing
Notes