
Feature/Evaluation assistant #1891

Draft

RonShakutai wants to merge 22 commits into main from
ronshakutai/presidio-evaluation-repo

Conversation

@RonShakutai
Collaborator

@RonShakutai RonShakutai commented Mar 8, 2026

Change Description

Presidio Evaluation Flow

This PR introduces the Presidio Evaluation Flow, a new interactive tool under evaluation/ai-assistant/ that guides users through a human-in-the-loop PII detection evaluation process.

Use the run.md file to run it in your environment.

This is the main evaluation branch. PRs will continue to be merged here until the full evaluation flow is complete.

The UI was designed as a Figma Make prototype and converted to working code using the Figma MCP server.

Frontend: React + TypeScript + Vite + Tailwind CSS v4 + shadcn/ui + Recharts
Backend: Python + FastAPI + Poetry + Pydantic v2

Next Steps

  • Dataset interface — Users can load local CSV/JSON files by path, select datasets from a dropdown, and preview records. If the dataset includes a pre-tagged entities column, detection steps can be skipped. Entity schema uses entity_type to match Presidio Analyzer's RecognizerResult format.
  • LLM integration (optional) — Add real LLM-based entity detection as an optional comparison source; the flow should work without LLM if the user doesn't need it
  • Improved sampling — Real backend sampling with two strategies: random (pandas.sample() with a fixed seed) and length-based (stratified by text-length terciles into short/medium/long buckets). The frontend lets users pick the method via radio buttons; semantic diversity is marked as coming soon.
  • Run Presidio based on config — Execute actual Presidio analysis using the configured recognizers and thresholds (skip if CSV with detected entities is provided)
  • Improved tagging experience — Better UX for the human review step (inline text highlighting, keyboard shortcuts, batch operations)
  • Import presidio-research and presidio-evaluator — Bring the evaluation and research tooling into the Presidio repo for tighter integration
  • Use the actual evaluator — Wire the evaluation step to presidio-evaluator for real precision/recall/F1 calculation against the golden set
  • Improve the scoring dashboard — Richer visualizations, drill-down by entity type, historical comparison across runs
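The length-based strategy above can be sketched in plain Python. The backend reportedly uses pandas.sample(); this dependency-free version (with hypothetical names) only illustrates the tercile-bucketing idea:

```python
import random

def length_stratified_sample(texts, n, seed=42):
    """Illustrative sketch: split records into text-length terciles
    (short/medium/long) and sample evenly from each bucket."""
    rng = random.Random(seed)  # fixed seed, mirroring the reproducible random strategy
    ordered = sorted(texts, key=len)
    third = len(ordered) // 3
    buckets = [ordered[:third], ordered[third:2 * third], ordered[2 * third:]]
    per_bucket = max(1, n // 3)
    sample = []
    for bucket in buckets:
        k = min(per_bucket, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample[:n]
```

With a fixed seed the same sample is returned on every run, which keeps evaluation results comparable across sessions.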

Screenshots

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

- Implemented HumanReview component for entity validation and golden set creation.
- Created Sampling component to configure sample size and method for evaluation.
- Developed Setup component for dataset selection and compliance framework configuration.
- Established routing for new pages including Setup, Sampling, Human Review, and Evaluation.
- Defined types for datasets, entities, and evaluation metrics.
- Set up main application entry point and integrated styles using Tailwind CSS.
- Configured Vite for development with React and Tailwind CSS support.
@RonShakutai RonShakutai self-assigned this Mar 8, 2026
@RonShakutai RonShakutai requested a review from a team as a code owner March 8, 2026 05:12
@github-actions

github-actions bot commented Mar 8, 2026

Coverage report (presidio-anonymizer)

This PR does not seem to contain any modification to coverable code.

@github-actions

github-actions bot commented Mar 8, 2026

Dependency Review

The following issues were found:

  • ✅ 0 vulnerable package(s)
  • ❌ 14 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 9 package(s) with unknown licenses.
  • ⚠️ 24 packages with OpenSSF Scorecard issues.

View full job summary

@github-actions

github-actions bot commented Mar 8, 2026

Coverage report (presidio-structured)

This PR does not seem to contain any modification to coverable code.

@github-actions

github-actions bot commented Mar 8, 2026

Coverage report (presidio-cli)

This PR does not seem to contain any modification to coverable code.

@github-actions

github-actions bot commented Mar 8, 2026

Coverage report (presidio-image-redactor)

This PR does not seem to contain any modification to coverable code.

@github-actions

github-actions bot commented Mar 8, 2026

Coverage report (presidio-analyzer)

This PR does not seem to contain any modification to coverable code.

* feat: Enhance Human Review and Setup pages with dataset handling and auto-accept functionality

- Updated HumanReview component to include a "Skip Tagging" button that auto-accepts all entities from records.
- Integrated session storage for setup configuration in HumanReview.
- Modified Setup component to allow loading datasets from CSV/JSON files with a preview feature.
- Added new types for UploadedDataset and SetupConfig to manage dataset metadata.
- Implemented backend API for loading datasets, including CSV and JSON parsing.
- Created sample medical records dataset for testing and demonstration purposes.

* feat: Implement auto-confirm all functionality in Human Review page
file_path = os.path.expanduser(req.path)
if not os.path.isabs(file_path):
    raise HTTPException(status_code=400, detail="Path must be absolute.")
if not os.path.isfile(file_path):

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 20 days ago

In general, the problem is that user-controlled data (req["path"]) is used directly as a filesystem path. To fix this, we must validate and constrain the path before calling os.path.isfile and open. A common and appropriate strategy here is to define a safe root directory (for example, the existing _PROJECT_ROOT or a dedicated subdirectory under it), normalize the user-supplied path with os.path.realpath or os.path.normpath relative to that root, and then check that the normalized path is actually within the root (e.g., using os.path.commonpath). Only then should we allow file access.

The best fix that preserves existing functionality while making it safe is:

  • Interpret the incoming req["path"] as a relative path under a safe root (e.g., _PROJECT_ROOT), not as an arbitrary absolute filesystem path.
  • Normalize the joined path using os.path.realpath or os.path.normpath.
  • Use os.path.commonpath([safe_root, normalized]) == safe_root to ensure the resulting path does not escape the root via .. or symlinks.
  • If validation fails, return 400 with an explanatory message.
  • Then use the validated path for os.path.isfile and open.

To implement this in evaluation/ai-assistant/backend/routers/upload.py:

  • Reuse the existing _PROJECT_ROOT constant as the safe root, or introduce a dedicated _ALLOWED_CSV_ROOT that points somewhere under _PROJECT_ROOT (e.g., os.path.join(_PROJECT_ROOT, "data")). We'll reuse _PROJECT_ROOT since it already exists and no new imports are necessary.
  • Update get_csv_columns_from_path:
    • Read raw_path from req["path"].
    • Reject empty or purely whitespace paths.
    • Disallow path separators that would indicate attempts to pass an absolute path directly; instead, treat the value as relative and always join against _PROJECT_ROOT.
    • Compute candidate = os.path.realpath(os.path.join(_PROJECT_ROOT, raw_path)).
    • Check os.path.commonpath([_PROJECT_ROOT, candidate]) == _PROJECT_ROOT; if not, reject.
    • Use candidate in the subsequent os.path.isfile and open calls.

This keeps behavior close to the original intent (read a CSV-like file accessible to the backend) but ensures only files under the repository root can be read, and prevents path traversal or arbitrary absolute-path access. No new imports or external dependencies are required.

Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -264,13 +264,22 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
-    if not os.path.isfile(file_path):
-        raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
-    with open(file_path, encoding="utf-8") as f:
+    """Read the header row of a CSV at the given path under the project root."""
+    raw_path = (req.get("path") or "").strip()
+    if not raw_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+    # Resolve the user-supplied path against the project root and ensure it stays within it.
+    candidate_path = os.path.realpath(os.path.join(_PROJECT_ROOT, raw_path))
+    try:
+        common = os.path.commonpath([_PROJECT_ROOT, candidate_path])
+    except ValueError:
+        # Different drives or invalid paths
+        raise HTTPException(status_code=400, detail="Invalid path.")
+    if common != _PROJECT_ROOT:
+        raise HTTPException(status_code=400, detail="Access to this path is not allowed.")
+    if not os.path.isfile(candidate_path):
+        raise HTTPException(status_code=400, detail=f"File not found: {raw_path}")
+    with open(candidate_path, encoding="utf-8") as f:
         head = f.read(65_536)
     reader = csv.DictReader(io.StringIO(head))
     columns = list(reader.fieldnames or [])
EOF
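The containment check at the heart of this autofix can be shown in isolation (the root path and function name here are hypothetical, not the actual repository code):

```python
import os

PROJECT_ROOT = os.path.realpath("/srv/presidio")  # hypothetical safe root

def is_within_root(root: str, raw_path: str) -> bool:
    """Resolve raw_path under root and verify it cannot escape via '..' or symlinks."""
    candidate = os.path.realpath(os.path.join(root, raw_path))
    try:
        return os.path.commonpath([root, candidate]) == root
    except ValueError:
        # raised e.g. for paths on different Windows drives
        return False
```

Note that an absolute input such as /etc/passwd makes os.path.join discard the root entirely, so the commonpath comparison is what actually rejects it.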
Copilot is powered by AI and may make mistakes. Always verify output.
if not os.path.isfile(file_path):
    raise HTTPException(status_code=400, detail=f"File not found: {file_path}")

file_size = os.path.getsize(file_path)

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 20 days ago

In general, the problem is fixed by restricting user-controlled paths to a well-defined safe root directory and validating the normalized path before using it, or by otherwise constraining the allowed files (for example, an allow list). For this backend, there is already a _DATA_DIR directory intended for managed datasets; the safest and least invasive fix is to ensure that any path passed to /load and /columns-from-path is resolved relative to _DATA_DIR (or another chosen safe root) and that the normalized final path is checked to be inside this directory before any filesystem operations.

Concretely, we can introduce a helper _resolve_safe_path that: (1) takes the user-provided path (which may be absolute or relative), (2) expands ~, (3) if it is absolute, strips the leading path separator and treats it as relative to _DATA_DIR rather than the filesystem root, (4) joins this with _DATA_DIR, (5) normalizes it with os.path.normpath, and (6) checks that the resulting path starts with the _DATA_DIR prefix (using a robust prefix check that avoids partial-directory matches). If the check fails, we raise HTTPException(400, "Path not allowed."). We then use this helper for both get_csv_columns_from_path (around line 265) and load_dataset (around line 288) instead of directly using os.path.expanduser and absolute-path checks. This preserves existing functionality to load arbitrary CSV/JSON files within the project’s data directory while preventing access to other filesystem locations.

To implement this, we will:

  • Add a new helper _resolve_safe_path near the existing _resolve_path helper.
  • Update get_csv_columns_from_path to call _resolve_safe_path(req.get("path", "")), remove the absolute-path requirement, and keep the existing checks for existence and file type.
  • Update load_dataset to call _resolve_safe_path(req.path) and reuse existing size and format validations.
    No new external libraries are needed, and we only rely on os.path which is already imported.
Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -81,6 +81,32 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
 
 
+def _resolve_safe_path(user_path: str) -> str:
+    """
+    Resolve a user-provided path to a location under the managed data directory.
+
+    The returned path is guaranteed to be contained within ``_DATA_DIR`` or an
+    HTTPException is raised.
+    """
+    if not user_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+
+    expanded = os.path.expanduser(user_path)
+
+    # Treat absolute paths as paths relative to the data directory root
+    if os.path.isabs(expanded):
+        expanded = expanded.lstrip(os.sep)
+
+    candidate = os.path.normpath(os.path.join(_DATA_DIR, expanded))
+
+    data_dir_norm = os.path.normpath(_DATA_DIR)
+    # Ensure the candidate path is inside _DATA_DIR
+    if not (candidate == data_dir_norm or candidate.startswith(data_dir_norm + os.sep)):
+        raise HTTPException(status_code=400, detail="Path not allowed.")
+
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,10 +290,9 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the data directory."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_safe_path(raw_path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
@@ -287,7 +312,7 @@
 
 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the data directory."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,9 +322,7 @@
             ),
         )
 
-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    file_path = _resolve_safe_path(req.path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
 
EOF
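This second strategy — re-rooting absolute inputs under the data directory and applying a normalized prefix check — can also be shown standalone (DATA_DIR here is a hypothetical stand-in):

```python
import os

DATA_DIR = "/srv/presidio/backend/data"  # hypothetical managed data directory

def resolve_safe_path(user_path: str) -> str:
    """Interpret user_path under DATA_DIR and reject anything that escapes it."""
    expanded = os.path.expanduser(user_path)
    if os.path.isabs(expanded):
        # treat absolute paths as relative to the data directory root
        expanded = expanded.lstrip(os.sep)
    candidate = os.path.normpath(os.path.join(DATA_DIR, expanded))
    root = os.path.normpath(DATA_DIR)
    if not (candidate == root or candidate.startswith(root + os.sep)):
        raise ValueError("Path not allowed.")
    return candidate
```

Unlike the commonpath variant, this version never rejects an absolute path outright; it silently reinterprets it under DATA_DIR, which is friendlier for callers but can surprise users who pass genuine absolute paths.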
if file_size > MAX_FILE_SIZE:
    raise HTTPException(status_code=400, detail="File too large (max 50 MB)")

with open(file_path, encoding="utf-8") as f:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 20 days ago

In general, to fix uncontrolled path usage, you should constrain file access to a known-safe directory (or set of directories) and validate any user-supplied path against that root. This typically involves: resolving ~ and relative segments, normalizing with os.path.realpath or os.path.normpath, then checking that the resulting path is within the allowed root. Deny access if the path escapes that root or is not a regular file.

In this code, load_dataset currently requires an absolute path and permits access anywhere. The least-disruptive fix that keeps the same general functionality but removes the vulnerability is:

  • Define a dedicated root directory for loadable datasets (for example, reuse _DATA_DIR as the allowed root) or a new _ALLOWED_DATA_ROOT.
  • Update load_dataset to:
    • Expand user (os.path.expanduser) and normalize/resolve the candidate path using os.path.realpath.
    • Join relative inputs to the allowed root if you decide to support non-absolute paths, or continue to require absolute paths but still enforce containment.
    • Verify that the resolved path is under the allowed root using a robust prefix check such as os.path.commonpath([_ALLOWED_DATA_ROOT, resolved_path]) == _ALLOWED_DATA_ROOT.
    • Optionally, reject paths that are not regular files.
  • Use this validated file_path for os.path.isfile, os.path.getsize, and open.

To avoid assuming anything outside the snippet, we can introduce a new _ALLOWED_DATA_ROOT constant alongside _DATA_DIR at the top of the file and a small helper _validate_and_resolve_user_path near _resolve_path or directly in load_dataset. Since os is already imported, no new imports are required. We will also tighten get_csv_columns_from_path in the same way, because it has the same pattern: arbitrary absolute path from req["path"] going straight into open().

Concretely:

  • Add _ALLOWED_DATA_ROOT = _DATA_DIR (or a sibling directory) where _DATA_DIR is defined.
  • Add a helper _resolve_user_file_path(raw_path: str) -> str that:
    • Ensures raw_path is non-empty.
    • Expands ~, obtains realpath, and checks that commonpath with _ALLOWED_DATA_ROOT equals _ALLOWED_DATA_ROOT.
    • Checks os.path.isfile.
    • Returns the safe path or raises HTTPException(400, ...) on violation.
  • Replace in get_csv_columns_from_path and load_dataset the manual expanduser/isabs/isfile logic with calls to _resolve_user_file_path.

This keeps the external behavior (loading given paths in a controlled dataset directory) while preventing directory traversal or arbitrary file access outside the allowed root.

Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -46,7 +46,10 @@
 _DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data")
 os.makedirs(_DATA_DIR, exist_ok=True)
 
+# Root directory from which users are allowed to load files by path
+_ALLOWED_DATA_ROOT = _DATA_DIR
 
+
 def _save_registry() -> None:
     """Persist the dataset registry to disk."""
     data = [ds.model_dump() for ds in _uploaded.values()]
@@ -81,6 +83,37 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
 
 
+def _resolve_user_file_path(raw_path: str) -> str:
+    """
+    Resolve and validate a user-supplied file path so that it stays under _ALLOWED_DATA_ROOT.
+
+    The returned path is an absolute, normalized path pointing to an existing regular file.
+    """
+    if not raw_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+
+    # Expand '~' and environment variables, then resolve symlinks and '..'.
+    expanded = os.path.expanduser(raw_path)
+    if not expanded:
+        raise HTTPException(status_code=400, detail="Path is required.")
+
+    # Allow both absolute and relative inputs, but always interpret them under _ALLOWED_DATA_ROOT.
+    if os.path.isabs(expanded):
+        candidate = os.path.realpath(expanded)
+    else:
+        candidate = os.path.realpath(os.path.join(_ALLOWED_DATA_ROOT, expanded))
+
+    allowed_root = os.path.realpath(_ALLOWED_DATA_ROOT)
+    # Ensure the candidate path is within the allowed root directory.
+    if os.path.commonpath([allowed_root, candidate]) != allowed_root:
+        raise HTTPException(status_code=400, detail="Access to the requested path is not allowed.")
+
+    if not os.path.isfile(candidate):
+        raise HTTPException(status_code=400, detail=f"File not found: {candidate}")
+
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,12 +297,9 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
-    if not os.path.isfile(file_path):
-        raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
+    """Read the header row of a CSV at the given path under the allowed data root."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_user_file_path(raw_path)
     with open(file_path, encoding="utf-8") as f:
         head = f.read(65_536)
     reader = csv.DictReader(io.StringIO(head))
@@ -287,7 +317,7 @@
 
 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the allowed data root."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,11 +327,7 @@
             ),
         )
 
-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
-    if not os.path.isfile(file_path):
-        raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
+    file_path = _resolve_user_file_path(req.path)
 
     file_size = os.path.getsize(file_path)
     if file_size > MAX_FILE_SIZE:
EOF
RonShakutai and others added 4 commits March 8, 2026 14:21
* feat: implement sampling configuration and record retrieval in Sampling component

* change sampling message

* feat: add sampling methods and integrate into sampling configuration
* Refactor entity comparison logic to support new entity status and source tracking

- Updated EntityComparison component to handle multiple sources for entities and revised status types.
- Enhanced logic for combining and classifying entities from Presidio, LLM, and predefined datasets.
- Improved context retrieval for entities using indexOf for accurate highlighting.
- Adjusted UI badges for entity statuses to reflect new terminology.

Implement LLM Judge functionality in Anonymization page

- Added state management for LLM Judge including model selection, connection status, and progress tracking.
- Integrated API calls to fetch model configurations and analyze records.
- Enhanced user feedback with loading indicators and error handling.

Fetch and display sampled records in Human Review page

- Implemented API calls to load records and LLM results on component mount.
- Updated record handling to support dynamic data fetching and error management.
- Improved UI to reflect loading states and error messages.

Enhance dataset management in Setup page

- Added functionality to fetch saved datasets from the backend on component mount.
- Introduced fields for dataset name and description during dataset upload.
- Implemented editing and deletion capabilities for existing datasets.

Update types to include new dataset properties

- Modified UploadedDataset interface to include name, description, and path fields.

* refactor: update entity status legend and remove conflict indication
@RonShakutai RonShakutai marked this pull request as draft March 9, 2026 16:17
if req.config_path:
    if not os.path.isabs(req.config_path):
        raise HTTPException(status_code=400, detail="Config path must be an absolute path.")
    if not os.path.isfile(req.config_path):

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 20 days ago

In general, the fix is to ensure that any filesystem path derived from user input is constrained to a safe directory and/or to an allow‑listed set of known paths. For this API, the safest approach without changing intended functionality is to only allow config_path values that point inside a designated data/config directory, or that exactly match one of the saved configuration paths stored in the existing _CONFIGS_FILE mechanism. This lets administrators choose from previously uploaded configs instead of arbitrary server paths.

Concretely, within configure_presidio, instead of accepting any absolute req.config_path, we can resolve it to a Path, normalize it, and then verify that it is (a) absolute and (b) under a known safe root directory used for Presidio configs. The code already uses a _DATA_DIR when saving uploaded configs and stores absolute file paths in the config registry; we can use that as the root and enforce resolved_path.is_relative_to(_DATA_DIR) (for Python 3.9+, or a manual prefix/ancestor check). If the path is not under _DATA_DIR, we reject the request with HTTP 400. We still keep the os.path.isfile check but apply it to the sanitized path. To avoid changing behavior more than necessary, we allow both the explicit config_path (if safe) and the previously existing “named configs” mechanism.

Implementation details:

  • Ensure _DATA_DIR is defined in the same file (it already must be for upload_config; we just rely on it).
  • In configure_presidio, replace the current if req.config_path: block (lines 207–211) with logic that:
    • Wraps req.config_path in Path, calls .resolve() to normalize.
    • Checks that the resolved path is within _DATA_DIR (via is_relative_to or fallback).
    • Verifies that the file exists (resolved_path.is_file()).
    • Uses the sanitized string form of this resolved path going forward (assign back to req.config_path or a local variable).
  • This keeps functionality (config still loaded from a path), but ensures it’s only from our configs directory, eliminating arbitrary filesystem access.
Suggested changeset 1
evaluation/ai-assistant/backend/routers/presidio_service.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/presidio_service.py b/evaluation/ai-assistant/backend/routers/presidio_service.py
--- a/evaluation/ai-assistant/backend/routers/presidio_service.py
+++ b/evaluation/ai-assistant/backend/routers/presidio_service.py
@@ -205,11 +205,32 @@
     global _engine
 
     if req.config_path:
-        if not os.path.isabs(req.config_path):
-            raise HTTPException(status_code=400, detail="Config path must be an absolute path.")
-        if not os.path.isfile(req.config_path):
-            raise HTTPException(status_code=400, detail=f"Config file not found: {req.config_path}")
+        raw_path = req.config_path
+        try:
+            resolved_path = Path(raw_path).resolve()
+        except Exception:
+            raise HTTPException(status_code=400, detail="Invalid config path.")
 
+        # Ensure the config file resides within the Presidio data directory
+        try:
+            is_within_data_dir = resolved_path.is_relative_to(_DATA_DIR)
+        except AttributeError:
+            # Fallback for Python versions without Path.is_relative_to
+            try:
+                resolved_data_dir = _DATA_DIR.resolve()
+            except Exception:
+                raise HTTPException(status_code=500, detail="Server configuration error.")
+            is_within_data_dir = resolved_path == resolved_data_dir or resolved_data_dir in resolved_path.parents
+
+        if not is_within_data_dir:
+            raise HTTPException(status_code=400, detail="Config path must be inside the server config directory.")
+
+        if not resolved_path.is_file():
+            raise HTTPException(status_code=400, detail=f"Config file not found: {raw_path}")
+
+        # Use the normalized, safe path from here on
+        req.config_path = str(resolved_path)
+
     # Reset
     _engine = None
     _state["loading"] = True
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
… extras and add custom analyzer configuration

    raise HTTPException(status_code=400, detail="Config path is required.")
    if not os.path.isabs(path):
        raise HTTPException(status_code=400, detail="Config path must be absolute.")
    if not os.path.isfile(path):
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, to fix this class of problem, you should ensure that any path derived from user input is constrained to a safe location and/or sanitized. Common patterns include: (1) allowing only filenames and joining them with a fixed, trusted base directory; (2) normalizing and verifying that the resulting path stays within a designated root directory; or (3) enforcing an allow list of known-safe paths. Which one to use depends on how much flexibility is required.

For this endpoint, the best low-impact fix is to restrict configuration imports to a specific “configs root” directory on the server, and then verify that the requested path stays within that root after normalization. We can do this by: defining a _CONFIGS_ROOT directory (for example, under the project root’s configs or data/configs folder); joining the user-provided path to this root; normalizing it with os.path.normpath; and then checking that the resulting full path is both a file and resides under _CONFIGS_ROOT. Instead of requiring the client to send an absolute path, we then treat req.path as a relative path (or simple filename) under _CONFIGS_ROOT. Finally, shutil.copy2 must operate on this validated fullpath rather than the raw user string.

Concretely, in evaluation/ai-assistant/backend/routers/presidio_service.py:

  • Add a module-level constant _CONFIGS_ROOT near _DATA_DIR, e.g. Path(__file__).resolve().parent.parent / "configs".
  • In save_config, stop requiring os.path.isabs(path), and instead:
    • Reject path components that look like absolute paths or contain null bytes.
    • Build fullpath = os.path.normpath(os.path.join(_CONFIGS_ROOT, path)).
    • Ensure fullpath is inside _CONFIGS_ROOT (e.g. comparing fullpath.resolve() to _CONFIGS_ROOT.resolve() via is_relative_to when available, or a safe startswith on path components).
    • Check fullpath.is_file() instead of os.path.isfile(path).
  • Use fullpath as the source argument to shutil.copy2 instead of the unvalidated path.

This preserves the intended functionality (importing a config file into _DATA_DIR) while eliminating the ability to reference arbitrary filesystem paths.
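The join-normalize-verify pattern described here can be tried in isolation; `CONFIGS_ROOT` and the function name are hypothetical stand-ins for the values used in the patch:

```python
import os

CONFIGS_ROOT = "/srv/app/configs"  # hypothetical trusted root

def resolve_in_root(relative: str) -> str:
    """Join a user-supplied relative path to CONFIGS_ROOT and verify containment."""
    if os.path.isabs(relative) or "\x00" in relative:
        raise ValueError("absolute paths and null bytes are not allowed")
    full = os.path.normpath(os.path.join(CONFIGS_ROOT, relative))
    root = os.path.normpath(CONFIGS_ROOT)
    # After normpath, any '..' escape produces a path outside the root prefix
    if full != root and not full.startswith(root + os.sep):
        raise ValueError(f"path escapes configs root: {relative}")
    return full

# resolve_in_root("sub/a.yml")     -> "/srv/app/configs/sub/a.yml"
# resolve_in_root("../escape.yml") -> raises ValueError
```

A prefix check on the normalized string is sufficient here because absolute inputs are rejected up front; when symlinks inside the root are a concern, `os.path.realpath` should be applied before the comparison, as the patch below does.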

Suggested changeset 1
evaluation/ai-assistant/backend/routers/presidio_service.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/presidio_service.py b/evaluation/ai-assistant/backend/routers/presidio_service.py
--- a/evaluation/ai-assistant/backend/routers/presidio_service.py
+++ b/evaluation/ai-assistant/backend/routers/presidio_service.py
@@ -60,6 +60,7 @@
 # Per-config results accumulator (replaces run snapshots)
 # ---------------------------------------------------------------------------
 _DATA_DIR = Path(__file__).resolve().parent.parent / "data"
+_CONFIGS_ROOT = Path(__file__).resolve().parent.parent / "configs"
 
 # config_name -> {rec_id -> entities}
 _all_config_results: dict[str, dict[str, list]] = {}
@@ -123,16 +124,28 @@
         raise HTTPException(status_code=400, detail="Name may only contain letters, numbers, hyphens, underscores, dots, and spaces.")
     if not path:
         raise HTTPException(status_code=400, detail="Config path is required.")
-    if not os.path.isabs(path):
-        raise HTTPException(status_code=400, detail="Config path must be absolute.")
-    if not os.path.isfile(path):
+
+    # Treat the provided path as relative to a trusted configs root and
+    # validate that the normalized path stays within that root.
+    # Disallow absolute paths outright.
+    if os.path.isabs(path):
+        raise HTTPException(status_code=400, detail="Config path must be relative to the server configs directory.")
+    # Build and normalize the full path under the configs root
+    _CONFIGS_ROOT.mkdir(parents=True, exist_ok=True)
+    fullpath = os.path.normpath(os.path.join(str(_CONFIGS_ROOT), path))
+    # Ensure the normalized path is still within the configs root
+    configs_root_str = str(_CONFIGS_ROOT.resolve())
+    fullpath_resolved = os.path.realpath(fullpath)
+    if not fullpath_resolved.startswith(configs_root_str + os.sep) and fullpath_resolved != configs_root_str:
+        raise HTTPException(status_code=400, detail="Config path is not allowed.")
+    if not os.path.isfile(fullpath_resolved):
         raise HTTPException(status_code=400, detail=f"Config file not found: {path}")
 
     # Copy the config file into our data/ folder so we own the copy
     _DATA_DIR.mkdir(parents=True, exist_ok=True)
     safe_name = name.replace(" ", "_").replace("/", "_")
     dest = _DATA_DIR / f"config-{safe_name}.yml"
-    shutil.copy2(path, dest)
+    shutil.copy2(fullpath_resolved, dest)
     abs_path = str(dest.resolve())
 
     user_configs = _get_user_configs()
EOF
    _DATA_DIR.mkdir(parents=True, exist_ok=True)
    safe_name = name.replace(" ", "_").replace("/", "_")
    dest = _DATA_DIR / f"config-{safe_name}.yml"
    shutil.copy2(path, dest)
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, the fix is to restrict and validate any user-controlled path before using it for filesystem access. Common approaches are: (1) restrict paths to live under a specific trusted root directory and verify this after normalizing the path, or (2) maintain an allow list of permitted paths and reject everything else.

Here, we want to keep the current functionality of letting users “register” a config file, but without letting them point to arbitrary locations like /etc/shadow. The safest change, without altering the later behavior (which only ever reads from our own _DATA_DIR copies), is to restrict the source config file path to be under a known safe root. A natural candidate is the project’s data directory (_DATA_DIR’s parent or a subdirectory of it), or some other predetermined configs root. We can implement this by normalizing the input path with os.path.realpath, then checking that it is inside a chosen base directory using os.path.commonpath. If the check fails, we return a 400 error.

Concretely, in save_config (lines 117–143), after confirming that path is absolute and points to an existing file, we will:

  • Normalize the user-supplied path via os.path.realpath.
  • Define a trusted root directory for source configs, e.g. a configs subdirectory under _DATA_DIR (or _DATA_DIR itself, depending on your policy).
  • Use os.path.commonpath([trusted_root, normalized_path]) == trusted_root to ensure the normalized path is within the trusted root.
  • Reject paths outside that directory with HTTPException(400, ...).

We will then use the normalized path (normalized_path) in the shutil.copy2 call, instead of the raw path. This adds path traversal protection and prevents access to arbitrary filesystem locations, while preserving the existing behavior for valid config files in the allowed directory. All needed utilities (os.path.realpath, os.path.commonpath) are already available via the existing import os, so no new imports are required.
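The realpath/commonpath containment check is small enough to try on its own; the root and file paths below are hypothetical:

```python
import os

def validate_under_root(path: str, root: str) -> str:
    """Normalize a user-supplied path and ensure it stays under root."""
    normalized = os.path.realpath(path)      # resolves symlinks and '..'
    root_real = os.path.realpath(root)
    if os.path.commonpath([root_real, normalized]) != root_real:
        raise ValueError(f"path outside allowed root: {path}")
    return normalized

# validate_under_root("/srv/cfg/a.yml", "/srv/cfg")     -> "/srv/cfg/a.yml"
# validate_under_root("/srv/cfg/../secret", "/srv/cfg") -> raises ValueError
```

`os.path.commonpath` compares whole path components, so it avoids the prefix pitfall where `/srv/cfg-evil` would pass a naive `startswith("/srv/cfg")` test.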

Suggested changeset 1
evaluation/ai-assistant/backend/routers/presidio_service.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/presidio_service.py b/evaluation/ai-assistant/backend/routers/presidio_service.py
--- a/evaluation/ai-assistant/backend/routers/presidio_service.py
+++ b/evaluation/ai-assistant/backend/routers/presidio_service.py
@@ -128,11 +128,17 @@
     if not os.path.isfile(path):
         raise HTTPException(status_code=400, detail=f"Config file not found: {path}")
 
+    # Restrict source config files to live under the data directory (or a subdirectory)
+    base_path = os.path.realpath(str(_DATA_DIR))
+    source_path = os.path.realpath(path)
+    if os.path.commonpath([base_path, source_path]) != base_path:
+        raise HTTPException(status_code=400, detail="Config path must be located under the allowed data directory.")
+
     # Copy the config file into our data/ folder so we own the copy
     _DATA_DIR.mkdir(parents=True, exist_ok=True)
     safe_name = name.replace(" ", "_").replace("/", "_")
     dest = _DATA_DIR / f"config-{safe_name}.yml"
-    shutil.copy2(path, dest)
+    shutil.copy2(source_path, dest)
     abs_path = str(dest.resolve())
 
     user_configs = _get_user_configs()
EOF
    _DATA_DIR.mkdir(parents=True, exist_ok=True)
    safe_name = name.replace(" ", "_").replace("/", "_")
    dest = _DATA_DIR / f"config-{safe_name}.yml"
    shutil.copy2(path, dest)
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.

    safe_name = name.replace(" ", "_").replace("/", "_")
    dest = _DATA_DIR / f"config-{safe_name}.yml"
    shutil.copy2(path, dest)
    abs_path = str(dest.resolve())
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.

    _DATA_DIR.mkdir(parents=True, exist_ok=True)
    safe_name = name.replace(" ", "_").replace("/", "_")
    dest = _DATA_DIR / f"config-{safe_name}.yml"
    dest.write_bytes(content)
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.

    safe_name = name.replace(" ", "_").replace("/", "_")
    dest = _DATA_DIR / f"config-{safe_name}.yml"
    dest.write_bytes(content)
    abs_path = str(dest.resolve())
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.

    safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
    stored_filename = f"{safe_ds_name}_{uid}{ext}"
    stored_path = os.path.join(_DATA_DIR, stored_filename)
    shutil.copy2(file_path, stored_path)
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, paths derived from user input must be constrained to a safe directory tree. The usual pattern is: define a fixed root directory, normalize the user-supplied path relative to that root, and reject any path that resolves outside the root. For absolute paths, you typically either (a) disallow them, or (b) normalize them and verify they live under the allowed root before accessing them.

For this code, the safest minimal fix that preserves behavior is:

  • Introduce a dedicated data root for user-loadable files (e.g., under the existing _DATA_DIR or a sibling directory).
  • Add a helper _resolve_user_path that:
    • Takes the raw req.path (after expanduser).
    • Normalizes it with os.path.realpath.
    • Joins it to the allowed root if it is not absolute, or at least checks that the resulting real path starts with the allowed root path.
    • Rejects paths that fall outside the allowed root with an HTTP 400/403.
  • Use this helper instead of the current direct expanduser + isabs logic in load_dataset, and also in get_csv_columns_from_path for consistency, so that both endpoints only operate on files inside the allowed root.
  • Keep the rest of the logic unchanged (copying to _DATA_DIR, size checks, parsing, etc.).

Concretely:

  • In evaluation/ai-assistant/backend/routers/upload.py, define a new constant like _IMPORT_ROOT that points to a dedicated directory under the project root (e.g., os.path.join(_PROJECT_ROOT, "import")), and ensure it exists (os.makedirs(..., exist_ok=True)).
  • Define a new function _resolve_user_path(raw_path: str) -> str near _resolve_path that:
    • Validates presence of raw_path.
    • Expands ~.
    • If the resulting path is absolute, uses it directly; if it is relative, join it to _IMPORT_ROOT.
    • Normalizes with os.path.realpath.
    • Verifies it starts with _IMPORT_ROOT (using os.path.commonpath or prefix check on normalized paths).
    • Raises HTTPException if invalid or outside root.
  • Update get_csv_columns_from_path and load_dataset to call _resolve_user_path(...) instead of manually working with user-controlled paths. Keep checks like os.path.isfile and size limits the same but applied to the resolved safe path.
  • Ensure no other changes to business logic (dataset naming, registry saving, etc.).
Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -73,7 +73,11 @@
     os.path.join(os.path.dirname(__file__), "..", "..")
 )
 
+# Root directory for user-loadable files; all user-specified paths must resolve here
+_IMPORT_ROOT = os.path.join(_PROJECT_ROOT, "import")
+os.makedirs(_IMPORT_ROOT, exist_ok=True)
 
+
 def _resolve_path(path: str) -> str:
     """Resolve a path; relative paths are resolved against the project root."""
     if os.path.isabs(path):
@@ -81,6 +84,29 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
 
 
+def _resolve_user_path(raw_path: str) -> str:
+    """
+    Resolve a user-supplied path safely under _IMPORT_ROOT.
+
+    The returned path is guaranteed to be within _IMPORT_ROOT, or an HTTPException is raised.
+    """
+    if not raw_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+    expanded = os.path.expanduser(raw_path)
+    # If the user supplied an absolute path, use it as-is; otherwise, resolve relative to _IMPORT_ROOT.
+    if os.path.isabs(expanded):
+        candidate = expanded
+    else:
+        candidate = os.path.join(_IMPORT_ROOT, expanded)
+    # Normalize and resolve symlinks
+    real_path = os.path.realpath(candidate)
+    import_root_real = os.path.realpath(_IMPORT_ROOT)
+    # Ensure the resolved path is inside the allowed import root
+    if os.path.commonpath([real_path, import_root_real]) != import_root_real:
+        raise HTTPException(status_code=400, detail="Path is outside the allowed import directory.")
+    return real_path
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,10 +290,9 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the import directory."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_user_path(raw_path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
@@ -287,7 +312,7 @@
 
 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the import directory."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,9 +322,7 @@
             ),
         )
 
-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    file_path = _resolve_user_path(req.path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
 
EOF
    safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
    stored_filename = f"{safe_ds_name}_{uid}{ext}"
    stored_path = os.path.join(_DATA_DIR, stored_filename)
    shutil.copy2(file_path, stored_path)
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, to fix uncontrolled data in path expressions where you want to allow some flexibility but keep files within a specific directory, you should normalize the final path and then verify that it remains inside a designated root directory. This usually means combining a fixed base directory with a user-influenced filename, normalizing with os.path.normpath (or os.path.realpath), and then checking that the normalized path starts with the expected base directory (plus a path separator, to avoid prefix tricks).

For this specific code, the issue is around how stored_path is constructed from display_name (user input) and then passed into shutil.copy2. We already have _DATA_DIR as a fixed storage directory, and we already lightly sanitize the display name and validate it with _validate_name. The best minimal fix is to additionally normalize and validate stored_path itself, ensuring it cannot escape _DATA_DIR even if a future change weakens _validate_name or path semantics differ across platforms.

Concretely, in load_dataset:

  • Keep the generation of safe_ds_name and stored_filename as-is to preserve current naming behavior.
  • After building stored_path = os.path.join(_DATA_DIR, stored_filename), compute a normalized absolute path, e.g. normalized_stored_path = os.path.normpath(os.path.abspath(stored_path)).
  • Do the same for _DATA_DIR (once, or inline here) and verify that normalized_stored_path is within _DATA_DIR. A straightforward, cross-platform-safe way that avoids prefix edge cases is to use os.path.commonpath([normalized_stored_path, _DATA_DIR]) == os.path.normpath(_DATA_DIR).
  • If the check fails, raise an HTTPException with a 400 code indicating that the dataset name is invalid.
  • Use normalized_stored_path for the copy and when saving stored_path into UploadedDataset.

All changes are confined to evaluation/ai-assistant/backend/routers/upload.py around lines 336–350; no new imports are needed because os.path is already imported as os.

Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -337,7 +337,11 @@
     safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
     stored_filename = f"{safe_ds_name}_{uid}{ext}"
     stored_path = os.path.join(_DATA_DIR, stored_filename)
-    shutil.copy2(file_path, stored_path)
+    normalized_data_dir = os.path.normpath(os.path.abspath(_DATA_DIR))
+    normalized_stored_path = os.path.normpath(os.path.abspath(stored_path))
+    if os.path.commonpath([normalized_stored_path, normalized_data_dir]) != normalized_data_dir:
+        raise HTTPException(status_code=400, detail="Invalid dataset name.")
+    shutil.copy2(file_path, normalized_stored_path)
 
     description = req.description.strip() if req.description else ""
     dataset = UploadedDataset(
@@ -346,7 +350,7 @@
         name=display_name,
         description=description,
         path=file_path,
-        stored_path=stored_path,
+        stored_path=normalized_stored_path,
         format=req.format,
         record_count=len(records),
         has_entities=has_entities,
EOF
    safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
    stored_filename = f"{safe_ds_name}_{uid}.csv"
    stored_path = os.path.join(_DATA_DIR, stored_filename)
    with open(stored_path, "w", encoding="utf-8") as f:
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value.
    stored_filename = f"{safe_name}_{dataset_id}{ext}"
    stored_path = os.path.join(_DATA_DIR, stored_filename)

    shutil.copy2(resolved, stored_path)

Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High). This path depends on a user-provided value (reported for multiple variants of this alert).

Copilot Autofix (AI, 20 days ago)

In general, to fix uncontrolled path usage we must (1) restrict all file operations to a known safe root, (2) normalize paths before checking them, and (3) ensure filenames are sanitized. In this code, the primary issues are that _resolve_path allows arbitrary absolute paths (and relative paths that may traverse upwards), and that ds.name and dataset_id are used in the stored filename without extra validation. We can address all the variants by making _resolve_path enforce that any dataset path resolves into a dedicated datasets root directory, and by tightening the sanitization of the generated stored filename. Because _ensure_records_loaded and the /download route both rely on _ensure_stored_copy and _resolve_path, strengthening those two places will automatically cover all the CodeQL variants.

Concretely, we can:

  1. Introduce a dedicated datasets root directory under the backend (e.g. backend/data/datasets) derived from _DATA_DIR and ensure it exists.
  2. Update _resolve_path so it always resolves against this datasets root:
    • Compute candidate = os.path.normpath(os.path.join(_DATASETS_ROOT, path)).
    • Reject the path if os.path.isabs(path) or if .. is used to escape the root (checked via os.path.commonpath).
    • Return the safe normalized candidate.
      This prevents absolute paths and directory traversal while still allowing relative dataset paths inside the datasets root.
  3. Harden the stored filename generation in _ensure_stored_copy by restricting characters:
    • Derive a safe_name from ds.name using a regex (similar to _NAME_RE) or a simple whitelist of alphanumerics, space, dot, underscore, and hyphen; convert anything else to _.
    • Similarly sanitize dataset_id so even if it’s user-controlled, it cannot inject path separators or special characters.
    • Build stored_filename = f"{safe_name}_{safe_id}{ext}" and join with _DATA_DIR as before.

These changes stay within the provided files, only add basic os.path.commonpath usage (no new dependencies), and do not change the external API; they only reject malicious or malformed dataset paths and filenames.
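The whitelist sanitization described in step 3 can be sketched as follows; the function name and fallback value are illustrative:

```python
import re

def sanitize_component(value: str, fallback: str = "dataset") -> str:
    """Collapse any run of characters outside [A-Za-z0-9_.-] into a single '_'."""
    safe = re.sub(r"[^A-Za-z0-9_.\-]+", "_", value.strip())
    return safe or fallback

# sanitize_component("my data/set") -> "my_data_set"
# sanitize_component("   ")         -> "dataset"
```

Because path separators are replaced, a sanitized component joined under `_DATA_DIR` cannot traverse out of it, even though dots themselves remain allowed.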


Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -46,7 +46,11 @@
 _DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data")
 os.makedirs(_DATA_DIR, exist_ok=True)
 
+# Root directory for dataset source files (restrict all resolved paths to this tree)
+_DATASETS_ROOT = os.path.join(_DATA_DIR, "datasets")
+os.makedirs(_DATASETS_ROOT, exist_ok=True)
 
+
 def _save_registry() -> None:
     """Persist the dataset registry to disk."""
     data = [ds.model_dump() for ds in _uploaded.values()]
@@ -75,12 +78,40 @@
 
 
 def _resolve_path(path: str) -> str:
-    """Resolve a path; relative paths are resolved against the project root."""
+    """Resolve a dataset path safely under the datasets root directory.
+
+    Absolute paths and paths that escape the configured datasets root are rejected.
+    """
+    # Reject absolute paths to avoid pointing outside the managed datasets tree
     if os.path.isabs(path):
-        return path
-    return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
+        raise HTTPException(
+            status_code=400,
+            detail="Absolute paths are not allowed for dataset files.",
+        )
 
+    # Resolve relative path against the dedicated datasets root
+    candidate = os.path.normpath(os.path.join(_DATASETS_ROOT, path))
 
+    # Ensure the normalized path stays within the datasets root (prevents '..' escape)
+    datasets_root_norm = os.path.normpath(_DATASETS_ROOT)
+    try:
+        common = os.path.commonpath([datasets_root_norm, candidate])
+    except ValueError:
+        # On path-type mismatch, treat as invalid
+        raise HTTPException(
+            status_code=400,
+            detail="Invalid dataset path.",
+        )
+
+    if common != datasets_root_norm:
+        raise HTTPException(
+            status_code=400,
+            detail="Dataset path escapes the allowed datasets directory.",
+        )
+
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -98,8 +126,10 @@
         )
 
     ext = os.path.splitext(resolved)[1] or ".csv"
-    safe_name = ds.name.replace(" ", "_").replace("/", "_")
-    stored_filename = f"{safe_name}_{dataset_id}{ext}"
+    # Sanitize dataset name and id to avoid injecting path separators or special chars
+    safe_name = re.sub(r"[^A-Za-z0-9_.\-]+", "_", ds.name.strip()) or "dataset"
+    safe_id = re.sub(r"[^A-Za-z0-9_.\-]+", "_", dataset_id)
+    stored_filename = f"{safe_name}_{safe_id}{ext}"
     stored_path = os.path.join(_DATA_DIR, stored_filename)
 
     shutil.copy2(resolved, stored_path)
EOF
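The `'..'`-escape check in this patch can be exercised on its own. A minimal standalone sketch (the `DATASETS_ROOT` value and `resolve_dataset_path` name are hypothetical, not the actual router code):

```python
import os

# Hypothetical root; the patch derives it from the backend's data directory.
DATASETS_ROOT = os.path.normpath("/srv/app/data/datasets")

def resolve_dataset_path(path: str) -> str:
    """Resolve a relative dataset path, rejecting absolute paths and '..' escapes."""
    if os.path.isabs(path):
        raise ValueError("Absolute paths are not allowed.")
    candidate = os.path.normpath(os.path.join(DATASETS_ROOT, path))
    # commonpath returns the deepest directory shared by both paths; if it is
    # not the root itself, the candidate escaped via '..' components.
    if os.path.commonpath([DATASETS_ROOT, candidate]) != DATASETS_ROOT:
        raise ValueError("Path escapes the datasets directory.")
    return candidate
```

With this, `resolve_dataset_path("sub/a.csv")` resolves under the root, while `"../secret.csv"` normalizes to a sibling of the root and is rejected. Note that `os.path.normpath` does not follow symlinks, so a symlink planted inside the root can still point outside it; the later autofix variant uses `os.path.realpath` for that reason.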
Copilot is powered by AI and may make mistakes. Always verify output.
Flagged code in evaluation/ai-assistant/backend/routers/upload.py:

        raise HTTPException(status_code=400, detail="Path must be absolute.")
    if not os.path.isfile(file_path):
        raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
    with open(file_path, encoding="utf-8") as f:

Check failure (Code scanning / CodeQL): Uncontrolled data used in path expression. Severity: High. This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, to fix uncontrolled path usage you must (1) define a safe root directory for any file access based on user input, (2) normalize the combined path using os.path.normpath (and preferably os.path.realpath), and (3) verify the normalized path is within the safe root before passing it to filesystem APIs like open, os.path.isfile, etc. Optionally, also restrict the allowed file extension (.csv here).

For this specific function, the least intrusive, backwards‑compatible fix is to require that the caller’s path lies under an allowed base directory (for example _PROJECT_ROOT or a dedicated _DATA_DIR) and to normalize the path before use. We can reuse _PROJECT_ROOT that already exists in this file to implement a simple _resolve_and_validate_path helper that expands ~, resolves relative paths against _PROJECT_ROOT, normalizes via os.path.realpath, and then enforces that the result is still under _PROJECT_ROOT (using a prefix check on the normalized absolute path). Then get_csv_columns_from_path should call this helper instead of using os.path.expanduser directly and should apply the existing .csv check already used in get_csv_columns. We only need to modify the _resolve_path helper and the get_csv_columns_from_path endpoint in this file; no new imports are required.

Concretely:

  • Update _resolve_path to normalize and secure paths:
    • Expand ~ with os.path.expanduser.
    • If the input is absolute, normalize/realpath it.
    • If relative, join it to _PROJECT_ROOT, then realpath it.
    • Verify the resulting path starts with _PROJECT_ROOT (with a separator boundary).
    • Raise HTTPException(400, "Access to this path is not allowed.") if check fails.
  • In get_csv_columns_from_path:
    • Replace file_path = os.path.expanduser(req.get("path", "")) and the absolute‑path requirement with file_path = _resolve_path(req.get("path", "")).
    • Add a .csv extension check consistent with the upload‑based endpoint.
    • Keep the os.path.isfile check but now operating on the validated file_path.

This preserves the basic behavior (“give me columns from this path”) but constrains it to the project tree and only CSV files, eliminating arbitrary filesystem read capability.

Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch
Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -75,10 +75,26 @@
 
 
 def _resolve_path(path: str) -> str:
-    """Resolve a path; relative paths are resolved against the project root."""
+    """Resolve and validate a path under the project root.
+
+    - Expands '~'
+    - Resolves relative paths against the project root
+    - Normalizes and resolves symlinks
+    - Ensures the resulting path stays within the project root
+    """
+    if not path:
+        raise HTTPException(status_code=400, detail="Path must not be empty.")
+    # Expand user home and normalize
+    path = os.path.expanduser(path)
     if os.path.isabs(path):
-        return path
-    return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
+        candidate = os.path.realpath(path)
+    else:
+        candidate = os.path.realpath(os.path.join(_PROJECT_ROOT, path))
+    project_root_real = os.path.realpath(_PROJECT_ROOT)
+    # Ensure the candidate path is within the project root
+    if not (candidate == project_root_real or candidate.startswith(project_root_real + os.sep)):
+        raise HTTPException(status_code=400, detail="Access to this path is not allowed.")
+    return candidate
 
 
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
@@ -264,10 +279,11 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the project root."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_path(raw_path)
+    if not file_path.lower().endswith(".csv"):
+        raise HTTPException(status_code=400, detail="Only .csv files are accepted.")
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
EOF
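One property of this variant is worth spelling out: `os.path.realpath` also resolves symlinks, so a link planted inside the allowed root cannot smuggle reads from outside it, which a pure `normpath` check would miss. A standalone sketch of the containment test (the `is_under_root` helper is hypothetical, not the actual router code):

```python
import os
import tempfile

def is_under_root(path: str, root: str) -> bool:
    """True if `path`, after resolving '..' and symlinks, stays inside `root`."""
    candidate = os.path.realpath(path)
    root_real = os.path.realpath(root)
    # The os.sep suffix prevents "/root-evil" from matching "/root".
    return candidate == root_real or candidate.startswith(root_real + os.sep)

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "project")
    outside = os.path.join(tmp, "outside")
    os.makedirs(root)
    os.makedirs(outside)
    # A symlink that lives inside the root but targets a directory outside it.
    os.symlink(outside, os.path.join(root, "sneaky"))
    print(is_under_root(os.path.join(root, "data.csv"), root))         # True
    print(is_under_root(os.path.join(root, "sneaky", "x.csv"), root))  # False
```

The `root_real + os.sep` suffix in the prefix check matters: without it, a sibling directory whose name merely starts with the root's name would pass.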
Flagged code in evaluation/ai-assistant/backend/routers/upload.py:

    file_path = os.path.expanduser(req.path)
    if not os.path.isabs(file_path):
        raise HTTPException(status_code=400, detail="Path must be absolute.")
    if not os.path.isfile(file_path):

Check failure (Code scanning / CodeQL): Uncontrolled data used in path expression. Severity: High. This path depends on a user-provided value.

Copilot Autofix (AI, 20 days ago)

In general, to fix this type of issue you must ensure that any filesystem path derived from user input is validated against a safe root directory (or an explicit allow‑list) after normalization. The usual pattern is: take the user path, expand user (~) if needed, join it to a fixed root, normalize the result with os.path.normpath or os.path.realpath, and then check that the final path is still inside the root. Only then use the path in open, os.path.isfile, os.path.getsize, etc.

In this codebase, the single best fix with minimal functional change is to require that all user-specified paths for /columns-from-path and /load live under the existing _DATA_DIR (the managed backend/data directory). We already have _DATA_DIR and _PROJECT_ROOT; we can implement a small helper that resolves user input against _DATA_DIR safely:

  • Create a function _resolve_safe_path(user_path: str) -> str that:
    • Rejects empty paths.
    • If os.path.isabs(user_path), strip any leading path separator and treat it as relative (so /foo.csv becomes foo.csv), or simply reject absolute inputs; the most conservative change is to reject them.
    • Joins the (possibly relative) user_path with _DATA_DIR.
    • Normalizes the result with os.path.normpath.
    • Verifies that the normalized path starts with _DATA_DIR plus a path separator (or equals _DATA_DIR exactly) to prevent “..” escaping.
    • Raises HTTPException(400, ...) if validation fails.
    • Returns the safe path otherwise.

Then update:

  • get_csv_columns_from_path (lines 265–279): instead of using os.path.expanduser and requiring an absolute path, call _resolve_safe_path on the user-supplied path, then perform os.path.isfile, open, etc. on the safe path.
  • load_dataset (lines 288–312): similarly, derive file_path from _resolve_safe_path(req.path) instead of os.path.expanduser and the absolute-path check. The rest of the logic (size check, parsing) stays unchanged.

This keeps existing endpoint semantics (loading from local files) but limits them to files under the application’s data directory, eliminating arbitrary filesystem access while avoiding new external dependencies.

Suggested changeset 1
evaluation/ai-assistant/backend/routers/upload.py

Autofix patch
Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/evaluation/ai-assistant/backend/routers/upload.py b/evaluation/ai-assistant/backend/routers/upload.py
--- a/evaluation/ai-assistant/backend/routers/upload.py
+++ b/evaluation/ai-assistant/backend/routers/upload.py
@@ -81,6 +81,28 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
 
 
+def _resolve_safe_path(user_path: str) -> str:
+    """
+    Resolve a user-provided path safely under the local data directory.
+
+    The resulting path is normalised and validated to ensure it stays within
+    the configured _DATA_DIR, preventing directory traversal and access to
+    unexpected filesystem locations.
+    """
+    if not user_path:
+        raise HTTPException(status_code=400, detail="Path must not be empty.")
+    # Treat the user input as relative to the data directory
+    # to avoid exposing arbitrary filesystem locations.
+    # This also prevents use of absolute paths like "/etc/passwd".
+    relative = user_path.lstrip(os.sep)
+    candidate = os.path.normpath(os.path.join(_DATA_DIR, relative))
+    data_dir_norm = os.path.normpath(_DATA_DIR)
+    # Ensure the resolved path is within the data directory
+    if not (candidate == data_dir_norm or candidate.startswith(data_dir_norm + os.sep)):
+        raise HTTPException(status_code=400, detail="Path is outside the allowed data directory.")
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,10 +286,9 @@
 
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the data directory."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_safe_path(raw_path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
@@ -287,7 +308,7 @@
 
 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the data directory."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,9 +318,7 @@
             ),
         )
 
-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    file_path = _resolve_safe_path(req.path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
 
EOF
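The behavioral difference from the earlier fixes: instead of rejecting absolute input, `lstrip(os.sep)` remaps it to a relative path, so `/etc/passwd` resolves to a (presumably missing) file under the data directory rather than producing a 400 error. A standalone sketch (the `DATA_DIR` value and `resolve_safe_path` name are hypothetical):

```python
import os

DATA_DIR = os.path.normpath("/srv/app/backend/data")  # hypothetical data directory

def resolve_safe_path(user_path: str) -> str:
    """Resolve user input under DATA_DIR, remapping absolute paths and blocking '..'."""
    if not user_path:
        raise ValueError("Path must not be empty.")
    relative = user_path.lstrip(os.sep)           # "/etc/passwd" -> "etc/passwd"
    candidate = os.path.normpath(os.path.join(DATA_DIR, relative))
    if not (candidate == DATA_DIR or candidate.startswith(DATA_DIR + os.sep)):
        raise ValueError("Path is outside the allowed data directory.")
    return candidate
```

Here `resolve_safe_path("/etc/passwd")` returns `DATA_DIR + "/etc/passwd"` (which then fails the endpoint's `os.path.isfile` check), while `"../escape.csv"` normalizes outside `DATA_DIR` and raises. Like the first patch, this relies on `normpath`, so it does not defend against symlinks inside the data directory.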
@RonShakutai changed the title from "Feature/ Evaluation assistant baseline" to "Feature/ Evaluation assistant" on Mar 17, 2026