feat(llama_stack): centralize vector/RAG config and shared helpers by jgarciao · Pull Request #1266 · opendatahub-io/opendatahub-tests

jgarciao · 2026-03-20T15:12:45Z

- Fix automation for product bug https://redhat.atlassian.net/browse/RHAIENG-3816
- Move Postgres, vLLM, embedding, and AWS-related defaults into constants (env overrides)
- Add IBM 2025 earnings PDFs (encrypted/unencrypted) and finance query sets per search mode
- Add vector_store_create_and_poll, file-from-URL/path helpers, and upload assertions in utils
- Replace vector_store_with_example_docs with a doc_sources parameter on the vector_store fixture
- Refactor conftest and vector store + upgrade RAG tests to use the new constants and helpers

Note: in a follow-up PR we'll add a JSONL file to the dataset to store questions and expected answers.

Summary by CodeRabbit

Tests
- Enhanced vector-store test fixtures to accept configurable document sources, perform reliable upload+pacing checks, verify uploaded files, include cleanup on ingestion failures, and added a file-upload verification test plus parametrized cases (including expected failures).
Chores
- Centralized environment and secret-related test configuration into a shared constants module.
Documentation
- Added dataset README describing internal test corpora and usage/legal constraints.
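
The "constants with env overrides" pattern mentioned in the Chores item can be sketched as follows. This is a minimal illustration, not the contents of `tests/llama_stack/constants.py`: the helper `env_or_default`, the environment variable names, and the default values are all assumptions made for the example.

```python
import os


def env_or_default(name: str, default: str) -> str:
    """Return the environment value for `name`, or `default` when unset or empty."""
    value = os.getenv(name)
    return value if value else default


# Hypothetical names and defaults, for illustration only.
POSTGRES_USER = env_or_default("EXAMPLE_POSTGRES_USER", "example_user")
EMBEDDING_MODEL = env_or_default("EXAMPLE_EMBEDDING_MODEL", "example-embedding-model")
```

Treating an empty string like an unset variable is a design choice in this sketch; a CI system that exports blank variables would otherwise silently override the defaults.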

- Fix automation for product bug https://redhat.atlassian.net/browse/RHAIENG-3816
- Move Postgres, vLLM, embedding, and AWS-related defaults into constants (env overrides)
- Add IBM 2025 earnings PDFs (encrypted/unencrypted) and finance query sets per search mode
- Add vector_store_create_and_poll, file-from-URL/path helpers, and upload assertions in utils
- Replace vector_store_with_example_docs with a doc_sources parameter on the vector_store fixture
- Refactor conftest and vector store + upgrade RAG tests to use the new constants and helpers

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
Made-with: Cursor

github-actions · 2026-03-20T15:13:09Z

The following are automatically added/executed:

PR size label.
Run pre-commit
Run tox
Add PR author as the PR assignee
Build image based on the PR

Available user actions:

To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel on the PR.
To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
To mark a PR as approved, add /lgtm in a comment. To remove it, add /lgtm cancel.
The lgtm label is removed on each new commit push.
To mark a PR as verified, comment /verified on the PR. To un-verify, comment /verified cancel.
The verified label is removed on each new commit push.
To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
and the current PR is merged, a cherry-picked PR is created and linked to the current PR.
To build and push an image to Quay, add /build-push-pr-image in a comment. This creates an image with the tag
pr-<pr_number> in the Quay repository. The image tag is deleted when the PR is merged or closed.

Supported labels

{'/hold', '/verified', '/wip', '/cherry-pick', '/lgtm', '/build-push-pr-image'}

coderabbitai · 2026-03-20T15:14:52Z

Walkthrough

Environment/config constants were moved to a new constants module; the vector store fixture was rewritten to accept parametrized document sources and to ingest local or URL documents with polling and error handling; new file-upload utilities and updated tests were added; dataset README introduced.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration constants `tests/llama_stack/constants.py` | New module exposing env-derived test constants (Postgres image/user/password, vLLM inference/embedding settings, AWS creds, secret-data mapping, upgrade name, IBM document paths and search queries). Note: consolidates secrets from env into module-level values (review handling of secrets at rest, CWE-200). |
| Test fixtures (conftest) `tests/llama_stack/conftest.py` | Removed inline `os.getenv` uses in favor of imports from constants. Reworked `vector_store` fixture to accept `request.param` as a dict with optional `doc_sources` (list of URL/path/dir strings) and `vector_io_provider` (defaults to `"milvus"` if falsy). Ingests docs on create path (skips on post-upgrade reuse), validates paths/URLs, and attempts cleanup (delete vector store) on ingestion failure. Verify path containment to prevent traversal (CWE-22). |
| File ingestion utilities `tests/llama_stack/utils.py` | Added polling upload helper `vector_store_create_and_poll()` and path-based upload `vector_store_create_file_from_path()`. Refactored `vector_store_create_file_from_url()` to download to temp file then delegate to path upload; added upload/assertion helpers and `vector_store_upload_doc_sources()` to iterate directories and multiple source types. Review temp-file handling and exceptions to ensure no leftover sensitive files. |
| Vector store tests `tests/llama_stack/vector_io/test_vector_stores.py` | Replaced in-file IBM constants with imports; parametrized `vector_store` with `doc_sources`; added `test_vector_stores_file_upload()` asserting presence of completed uploaded files; added xfail encrypted-document Milvus case; replaced `vector_store_with_example_docs` usages with `vector_store`. |
| Upgrade RAG tests `tests/llama_stack/vector_io/upgrade/test_upgrade_vector_store_rag.py` | Updated helper/test inputs to use `vector_store` with `doc_sources` and adjusted references to `vector_store.id`. |
| Dataset docs `tests/llama_stack/dataset/README.md` | New README describing internal test PDFs (IBM earnings corpus and pdf-testing corpus), usage constraints and legal guidance. |
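
The parameter handling described for the `vector_store` fixture (a dict with optional `doc_sources` and a `vector_io_provider` that falls back to `"milvus"` when falsy) can be sketched as a plain function. The function name `parse_vector_store_params` and the exact error messages are assumptions for illustration; the real fixture reads `request.param` inside pytest and may differ.

```python
from typing import Any


def parse_vector_store_params(raw: Any) -> tuple[str, list[str]]:
    """Normalize a fixture param into (vector_io_provider, doc_sources)."""
    params = raw or {}
    if not isinstance(params, dict):
        raise TypeError(f"vector_store param must be a dict, got {type(params).__name__}")

    # Falsy provider values (None, "") fall back to "milvus", as the summary describes.
    provider = params.get("vector_io_provider") or "milvus"

    doc_sources = params.get("doc_sources") or []
    if not isinstance(doc_sources, list) or not all(isinstance(s, str) for s in doc_sources):
        raise TypeError("doc_sources must be a list of URL/path/dir strings")
    return provider, doc_sources
```

Rejecting a bare string for `doc_sources` guards against a common parametrization slip, where a single path would otherwise be iterated character by character.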

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: centralizing configuration into constants and adding shared helper utilities for vector store operations. |
| Description check | ✅ Passed | The description covers the key changes (constants, utilities, fixture refactoring) and links the related product bug; missing some template sections but the content is substantive and clear. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/llama_stack/utils.py (1)

209-236: ⚠️ Potential issue | 🟠 Major

CWE-918: SSRF risk in arbitrary remote URL fetching without validation.

requests.get(url, timeout=60) will fetch any http(s) URL passed as a test parameter. Even though doc_sources is currently parametrized with hardcoded URLs only, pytest fixtures can be driven from external configuration, plugins, or environment sources. Validate scheme/host/IP before the request, revalidate any redirect target, and stream with a size cap instead of buffering response.content to avoid unbounded memory consumption.

Also remove invalid # noqa: FCN001 at line 225.

Suggested fix

-        response = requests.get(url, timeout=60)
+        _validate_remote_doc_source(url)
+        response = requests.get(url, timeout=60, stream=True)
         response.raise_for_status()
@@
-            temp_file.write(response.content)
-            temp_file_path = Path(temp_file.name)  # noqa: FCN001
+            for chunk in response.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    temp_file.write(chunk)
+            temp_file_path = Path(temp_file.name)

def _validate_remote_doc_source(url: str, allowed_hosts: set[str]) -> None:
    parsed = urllib.parse.urlsplit(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        raise ValueError(f"Unsupported doc source URL: {url!r}")
    if parsed.hostname not in allowed_hosts:
        raise ValueError(f"Untrusted doc source host: {parsed.hostname!r}")

    for result in socket.getaddrinfo(parsed.hostname, parsed.port or 443, type=socket.SOCK_STREAM):
        ip = ipaddress.ip_address(result[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_multicast or ip.is_reserved:
            raise ValueError(f"Blocked doc source host {parsed.hostname!r} resolving to {ip}")
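
Note that the suggested diff above calls `_validate_remote_doc_source(url)` with one argument while the function is declared with a required `allowed_hosts` parameter, and that the function always falls back to port 443 even for `http` URLs. A self-contained sketch that reconciles both points follows. It uses only the standard library; the names `validate_remote_doc_source` and `write_with_size_cap` and the cap semantics are assumptions for illustration, and redirect re-validation is not covered here.

```python
import ipaddress
import socket
import urllib.parse
from typing import BinaryIO, Iterable


def validate_remote_doc_source(url: str, allowed_hosts: set[str]) -> None:
    """Reject URLs with an unexpected scheme, an untrusted host, or a non-public address."""
    parsed = urllib.parse.urlsplit(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        raise ValueError(f"Unsupported doc source URL: {url!r}")
    if parsed.hostname not in allowed_hosts:
        raise ValueError(f"Untrusted doc source host: {parsed.hostname!r}")

    # Default port follows the scheme instead of assuming 443.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    for info in socket.getaddrinfo(parsed.hostname, port, type=socket.SOCK_STREAM):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_multicast or ip.is_reserved:
            raise ValueError(f"Blocked doc source host {parsed.hostname!r} resolving to {ip}")


def write_with_size_cap(chunks: Iterable[bytes], out: BinaryIO, max_bytes: int) -> int:
    """Copy chunks to `out`, raising once the running total exceeds `max_bytes`."""
    total = 0
    for chunk in chunks:
        if not chunk:
            continue
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"Download exceeds the {max_bytes}-byte cap")
        out.write(chunk)
    return total
```

With `requests`, the chunks could come from `response.iter_content(chunk_size=...)` on a response opened with `stream=True`, so the body is never buffered whole in memory.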

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/llama_stack/utils.py` around lines 209 - 236, The download block using
requests.get(url, timeout=60) in this file poses SSRF and memory risks; add a
pre-validation function (e.g., _validate_remote_doc_source) and call it before
requests.get to enforce allowed schemes (http/https), a whitelist of hostnames,
and reject hostnames resolving to private/loopback/reserved IPs, then perform
requests.get with allow_redirects=True but re-validate each redirect target
against the same rules and use stream=True to iterate response.iter_content with
a hard size cap to avoid buffering response.content; also remove the incorrect
"# noqa: FCN001" on temp_file_path and ensure temp_file_path is safely
initialized/checked before os.unlink in the finally block.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/llama_stack/conftest.py`:
- Around line 810-850: Wrap the entire doc_sources ingestion block (the code
that iterates doc_sources and calls vector_store_create_file_from_url and
vector_store_create_file_from_path) in a try/except that catches Exception; in
the except handler call the vector store deletion API to remove the
partially-created store (e.g., vector_store.delete() or use
unprivileged_llama_stack_client to delete the store by id) and then re-raise the
exception so the fixture teardown/yield behavior is preserved; ensure you
reference the same variables (doc_sources, vector_store,
unprivileged_llama_stack_client, vector_store_create_file_from_url,
vector_store_create_file_from_path) when implementing the try/except and
deletion.
- Around line 819-849: The code handling doc_sources allows absolute paths and
"../" traversal; fix by resolving each non-URL source against
request.config.rootdir using Path(...).resolve() and compare it to the repo root
Path(request.config.rootdir).resolve(); if the resolved path is not within the
repo root (use path.relative_to(root) or a startswith check), raise
FileNotFoundError/ValueError and do not call vector_store_create_file_from_path;
keep existing behavior for URLs and the calls to
vector_store_create_file_from_path and vector_store_create_file_from_url
otherwise.
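
The two inline comments above (cleanup on ingestion failure, and containment of local paths within the repository root) can be sketched together. This is an illustrative outline under stated assumptions: `resolve_within_root` and `ingest_or_cleanup` are hypothetical names, and the `upload` and `delete_store` callables stand in for the real client calls, whose signatures are not shown in this thread.

```python
from pathlib import Path
from typing import Callable, Iterable


def resolve_within_root(source: str, root: Path) -> Path:
    """Resolve `source` against `root` and reject anything that lands outside it."""
    root_resolved = root.resolve()
    # An absolute `source` replaces the root in the join, so it is checked too.
    candidate = (root_resolved / source).resolve()
    try:
        candidate.relative_to(root_resolved)
    except ValueError:
        raise ValueError(f"doc source escapes the repository root: {source!r}") from None
    if not candidate.exists():
        raise FileNotFoundError(candidate)
    return candidate


def ingest_or_cleanup(
    sources: Iterable[str],
    upload: Callable[[str], None],
    delete_store: Callable[[], None],
) -> None:
    """Upload every source; on any failure delete the store, then re-raise."""
    try:
        for source in sources:
            upload(source)
    except Exception:
        delete_store()
        raise
```

Because `Path.resolve()` follows symlinks before the `relative_to` check, a link inside the repository that points outside it is rejected as well.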

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 626426c8-00d9-413b-b1b8-1543e6445ca6

📥 Commits

Reviewing files that changed from the base of the PR and between f381daa and 0270be8.

⛔ Files ignored due to path filters (5)

tests/llama_stack/dataset/corpus/finance/ibm-1q25-earnings-press-release-unencrypted.pdf is excluded by !**/*.pdf
tests/llama_stack/dataset/corpus/finance/ibm-2q25-earnings-press-release-unencrypted.pdf is excluded by !**/*.pdf
tests/llama_stack/dataset/corpus/finance/ibm-3q25-earnings-press-release-unencrypted.pdf is excluded by !**/*.pdf
tests/llama_stack/dataset/corpus/finance/ibm-4q25-earnings-press-release-unencrypted.pdf is excluded by !**/*.pdf
tests/llama_stack/dataset/corpus/pdf-testing/ibm-4q25-press-release-encrypted.pdf is excluded by !**/*.pdf

📒 Files selected for processing (5)

tests/llama_stack/conftest.py
tests/llama_stack/constants.py
tests/llama_stack/utils.py
tests/llama_stack/vector_io/test_vector_stores.py
tests/llama_stack/vector_io/upgrade/test_upgrade_vector_store_rag.py

tests/llama_stack/conftest.py

Bobbins228

/lgtm thanks Jorge this looks much cleaner than before, just need to address code rabbit

Ygnas

/lgtm

tests/llama_stack/constants.py

- Improved error handling for doc_sources input, ensuring it is a list and paths are validated against the repository root.
- Added logging for successful and failed ingestion of document sources.
- Streamlined the process for uploading files from URLs and local paths, including directory handling.

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/llama_stack/conftest.py`:
- Around line 785-788: The fixture currently uses
params_raw/params/vector_io_provider/doc_sources but hard-codes the persisted
store name ("test_vector_store"), causing collisions; add a helper function
named _vector_store_name that takes vector_io_provider and doc_sources,
normalizes/sorts doc_sources, serializes the payload (vector_io_provider +
normalized sources), hashes it (e.g., sha256) and returns a deterministic store
name (like "test-vector-store-<hash-prefix>"), then replace the hard-coded
"test_vector_store" usages in the fixture (and the other occurrences referenced
around the same block) with a call to
_vector_store_name(vector_io_provider=vector_io_provider,
doc_sources=doc_sources) so each param combination gets a unique persisted store
name.
- Around line 839-849: Re-resolve each directory entry before uploading to
prevent symlink/traversal attacks: inside the loop over files, call resolve() on
the child (file_path_resolved = file_path.resolve(strict=True)) and compare it
against the already-resolved repo_root (repo_root.resolve()) to ensure
file_path_resolved is a descendant of repo_root; skip or raise if not. Use
file_path_resolved.is_file() and pass that resolved path into
vector_store_create_file_from_path (referencing file_path, source_path,
repo_root, and vector_store_create_file_from_path) so you never follow a symlink
that points outside the repo.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 6115d4e1-8a3b-425d-81fe-c70de57cb749

📥 Commits

Reviewing files that changed from the base of the PR and between 0270be8 and f52b283.

📒 Files selected for processing (3)

tests/llama_stack/conftest.py
tests/llama_stack/constants.py
tests/llama_stack/dataset/README.md

✅ Files skipped from review due to trivial changes (1)

tests/llama_stack/dataset/README.md

🚧 Files skipped from review as they are similar to previous changes (1)

tests/llama_stack/constants.py

tests/llama_stack/conftest.py

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/llama_stack/utils.py (2)

225-225: Invalid # noqa: FCN001 rule codes.

Ruff doesn't recognize FCN001. Either remove these comments or add FCN001 to lint.external in pyproject.toml/ruff.toml if it's from another linter in your toolchain.

Also applies to: 322-322

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/llama_stack/utils.py` at line 225, The inline comment "# noqa: FCN001"
on the temp_file_path assignment (temp_file_path = Path(temp_file.name) in
utils.py) uses an unknown Ruff rule code; remove the invalid noqa marker or
replace it with a recognized code, or alternatively add "FCN001" to
lint.external in pyproject.toml/ruff.toml if that code is required by another
linter; apply the same fix to the other occurrence at the similar assignment
around line 322 so no invalid noqa codes remain.

231-233: Redundant exception tuple.

Exception is a superclass of requests.exceptions.RequestException, so the tuple is redundant.

Proposed fix

-    except (requests.exceptions.RequestException, Exception) as e:
+    except Exception as e:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/llama_stack/utils.py` around lines 231 - 233, The except clause
currently uses a redundant tuple "(requests.exceptions.RequestException,
Exception)"; change it to catch only requests.exceptions.RequestException (i.e.,
"except requests.exceptions.RequestException as e:") so you don't redundantly
include Exception or unintentionally swallow unrelated exceptions; update the
except block referencing LOGGER and url and keep the existing LOGGER.warning and
raise behavior.
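
The proposed diff and the prompt above are not equivalent: `except Exception` keeps the current catch-all behavior, while catching only `requests.exceptions.RequestException` narrows it. A small standard-library sketch shows the difference. `FetchError` is a hypothetical stand-in for `RequestException` (which also derives from `OSError`), and `caught_by` is a helper invented for this example.

```python
class FetchError(OSError):
    """Hypothetical stand-in for requests.exceptions.RequestException."""


def caught_by(handler: tuple, exc: Exception) -> bool:
    """Return True if an `except handler:` clause would catch `exc`."""
    try:
        raise exc
    except handler:
        return True
    except Exception:
        return False


# The tuple form behaves exactly like a bare `except Exception`.
tuple_catches_unrelated = caught_by((FetchError, Exception), ValueError("unrelated"))
# Narrowing to the specific class lets unrelated errors propagate.
narrow_catches_unrelated = caught_by((FetchError,), ValueError("unrelated"))
```

Which option is appropriate depends on whether the surrounding code logs and re-raises (where a broad catch is harmless) or swallows the error.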

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/llama_stack/utils.py`:
- Line 268: Fix the typos in the log statements: in the LOGGER.info call that
references uploaded_file.filename and vector_store.id (the f-string currently
has "filename{uploaded_file.filename") change it to use
"filename={uploaded_file.filename}" so the variable interpolates correctly; also
correct the misspelling "Addded" to "Added" in the other LOGGER.info message
that mentions the upload/vector store operation. Ensure both messages remain
f-strings and still reference uploaded_file.filename and vector_store.id (or the
equivalent variables) exactly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b2ff3dae-cffe-44c1-99ec-f16d3ac917bf

📥 Commits

Reviewing files that changed from the base of the PR and between f52b283 and fbb2881.

📒 Files selected for processing (2)

tests/llama_stack/conftest.py
tests/llama_stack/utils.py

tests/llama_stack/utils.py

Bobbins228

/lgtm
/approve

github-actions · 2026-03-20T17:06:24Z

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.

jgarciao requested a review from a team as a code owner March 20, 2026 15:12

jgarciao requested a review from Ygnas March 20, 2026 15:12

github-actions bot added the size/xl label Mar 20, 2026

jgarciao requested review from Bobbins228 and jiripetrlik March 20, 2026 15:13

github-actions bot assigned jgarciao Mar 20, 2026

Merge branch 'main' into refactor-vector-stores-add-dataset

0270be8

coderabbitai bot reviewed Mar 20, 2026

Bobbins228 reviewed Mar 20, 2026

rhods-ci-bot added the commented-by-Bobbins228 label Mar 20, 2026

Ygnas reviewed Mar 20, 2026

rhods-ci-bot removed the commented-by-Bobbins228 label Mar 20, 2026

fix: delete unused constant

5b230b0

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>

rhods-ci-bot added the commented-by-jgarciao label Mar 20, 2026

Bobbins228 previously approved these changes Mar 20, 2026

rhods-ci-bot added the lgtm-by-Bobbins228 label Mar 20, 2026

jgarciao enabled auto-merge (squash) March 20, 2026 15:52

Ygnas approved these changes Mar 20, 2026

feat: add README.md in the dataset folder

f52b283

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>

jgarciao dismissed Bobbins228’s stale review via f52b283 March 20, 2026 16:25

rhods-ci-bot removed commented-by-jgarciao lgtm-by-Bobbins228 labels Mar 20, 2026

coderabbitai bot reviewed Mar 20, 2026

fix: prevent symlink/traversal attacks

fbb2881

Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>

rhods-ci-bot added the commented-by-jgarciao label Mar 20, 2026

coderabbitai bot reviewed Mar 20, 2026

jgarciao requested a review from Bobbins228 March 20, 2026 17:01

dbasunag approved these changes Mar 20, 2026

rhods-ci-bot added the lgtm-by-dbasunag label Mar 20, 2026

Bobbins228 approved these changes Mar 20, 2026

jgarciao merged commit 6be3cb8 into opendatahub-io:main Mar 20, 2026
16 checks passed


5 participants
