fix(upload): store relative paths in uploaded_files.storage_path#235
fix(upload): store relative paths in uploaded_files.storage_path#235tanbro wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Should this be a alembic migration?
There was a problem hiding this comment.
I think it is no need to make an ablembic migration for records have absolute path in uploads_file table into relative path.
Because the PR just works with absolute path records. Let them be for less risks.
There was a problem hiding this comment.
This is ok, but the problem of script is that there is too much explanation cost to teach users how to use it, what's your idea?
There was a problem hiding this comment.
OK, I'll optimize the migration script, apply batch and transaction on it, to make it possible for large uploaded_files table.
There was a problem hiding this comment.
The script has been optimized to be production-ready:
- Batch processing with transactions (default batch size: 1000)
- Progress bars for visibility
- Subcommands (
check/migrate) for better UX
The previous implementation wasn't practical for real use for large uploads fiels table.
There was a problem hiding this comment.
This is ok, but the problem of script is that there is too much explanation cost to teach users how to use it, what's your idea?
I prefer do NOT make a migration in our alembic scripts.
The migration tool scripts/migrate_uploads_file_abs_path.py provide by the PR is OPTIONAL.
Users may choose to use it - the tool is production-ready with streaming/batch support, and the UX is clear and easy.
Also, users may choose to keep their uploaded_files table as they are - because the PR's backward compatibility is solid.
There was a problem hiding this comment.
My concern is if compatibility is fine, what's the meaning of this script, no one would use it other than you and me.
There was a problem hiding this comment.
I think we can keep English version only.
- Convert storage_path from absolute to relative paths for portability - Add path conversion utilities (to_relative_path, to_absolute_path, find_file_by_path) - Update file upload, workspace, and websocket to store relative paths - Add Alembic migration for data transformation (upgrade: abs->rel, downgrade: rel->abs) - Add manual migration tool for multiple uploads_dir scenarios - Add Chinese and English README for manual migration tool
5d2ba0d to
13b177a
Compare
Also keep scripts markdown documentations by adding !scripts/**/*.md to .gitignore to ensure markdown files in scripts directory are not ignored. fix(scripts): translate Chinese README to English for migration tool Convert the Chinese documentation in scripts/migrate_uploads_file_abs_path.README.md to English to improve internationalization and maintain consistency with the project's English-based documentation standards.
- Fix kb.py import order per isort - Remove relative path migration tool script's non-English readme
PR xorbitsai#247 changed UPLOADS_DIR from a constant to get_uploads_dir() function. The new test test_relative_path_works_with_different_upload_dir was using the old UPLOADS_DIR attribute approach, causing AttributeError. This fix aligns with the unified configuration module introduced in PR xorbitsai#247.
rogercloud
left a comment
There was a problem hiding this comment.
Review: Store relative paths in uploaded_files.storage_path
The overall approach is solid and well-structured. Backward compatibility via the absolute_path property and dual-format find_file_by_path is a good design. However, I found several issues that need to be addressed before merging.
Critical (not inline — design issues)
1. storage_path UNIQUE constraint will break the migration in multi-user deployments
storage_path has a global UNIQUE constraint (Column(String(2048), nullable=False, unique=True)). After migration, two users with the same relative path will collide:
- User 1:
/uploads/user_1/web_task_1/output/file.txt→web_task_1/output/file.txt - User 2:
/uploads/user_2/web_task_1/output/file.txt→web_task_1/output/file.txt
The migration aborts mid-batch, leaving the database in a partially-migrated state.
Fix: Either (a) drop the UNIQUE constraint, (b) change it to UniqueConstraint('user_id', 'storage_path'), or (c) include user_{user_id}/ in the relative path to preserve uniqueness.
2. Telegram and Feishu channel bots still store absolute paths
src/xagent/web/channels/telegram/bot.py:238 and src/xagent/web/channels/feishu/bot.py:476 still use storage_path=str(target_path). These files weren't updated by this PR but should be converted to use to_relative_path() for consistency.
Summary
| Severity | Count | Key Issues |
|---|---|---|
| Critical | 2 | UNIQUE constraint collision; missed bot conversions |
| Major | 4 | Broken dedup; wrong find_file_by_path usage (x2); backfill UNIQUE risk; silent fallback |
| Minor | 4 | Migration logging; downgrade POSIX; docstring; separator consistency |
| _auto_register = contextvars.ContextVar("_auto_register", default=False) | ||
|
|
||
|
|
||
| def _to_relative_path(file_path: Path, user_id: int) -> str: |
There was a problem hiding this comment.
Major: Broken deduplication in list_all_user_files
After this change, DB records have storage_path as a relative string (e.g. "web_task_123/output/file.txt"), but the filesystem scan below (lines 855-856) compares against file_path which is an absolute string (e.g. "/uploads/user_1/web_task_123/output/file.txt").
is_already_listed = any(
f.get("storage_path") == file_path for f in result_files
)Before this PR, both sides were absolute, so the comparison worked. Now it never matches, causing duplicate entries — once from DB records and once from the workspace filesystem scan.
Fix: Use str(file_record.absolute_path) for the comparison, or compare using absolute_path consistently.
is_already_listed = any(
f.get("storage_path") == relative_storage_path
for f in result_files
)Where relative_storage_path matches the format stored in storage_path for DB records.
| from ..utils.db_timezone import safe_timestamp_to_unix | ||
| from ..utils.file import find_file_by_path, to_relative_path | ||
|
|
||
| logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Major: find_file_by_path called with relative path — won't find old absolute records
find_file_by_path is designed to accept an absolute path (see its docstring at line 81: file_path: Absolute file path to search for). Internally it tries: (1) exact match, then (2) relative conversion if file_path.is_absolute().
But here relative_storage_path is already a relative string. Inside the function:
- First query matches
storage_path == "web_task_123/..."— OK for new records - Second branch:
file_path.is_absolute()→False— skipped, so old absolute-path records are never found
During the transition period (before migration), the DB still has absolute paths. This code would fail to find existing records and attempt to create duplicates, potentially hitting the UNIQUE constraint.
Fix: Pass the absolute resolved_path directly — the function handles conversion internally:
file_record = find_file_by_path(db, resolved_path, task_user_id)| normalized_relative_path | ||
| ) | ||
|
|
||
| file_record = ( |
There was a problem hiding this comment.
Major: Same find_file_by_path issue as above
Same problem as _normalize_file_outputs — relative_storage_path is already relative, so find_file_by_path cannot find old-style absolute records.
Fix: Pass resolved_path (absolute) instead:
file_record = find_file_by_path(db, resolved_path, owner_user_id)Then use relative_storage_path only for creating new records.
| from ..models.database import get_db | ||
| from ..models.uploaded_file import UploadedFile | ||
| from ..models.user import User | ||
| from ..utils.file import find_file_by_path, to_relative_path |
There was a problem hiding this comment.
Major: Backfill dedup fails with mixed old/new records
existing_paths collects raw storage_path values from the DB. If some records still have absolute paths (pre-migration), the set contains strings like "/uploads/user_1/web_task_123/output/file.txt". The computed relative_storage_path is "web_task_123/output/file.txt" — the in check fails, and the backfill attempts to insert a duplicate, violating the UNIQUE constraint.
Fix: Normalize existing_paths to relative format:
existing_paths = set()
for row in db.query(UploadedFile.storage_path)
.filter(UploadedFile.user_id.in_(target_user_ids))
.all():
p = Path(row[0])
if p.is_absolute():
existing_paths.add(to_relative_path(p, target_user_ids[0])) # normalize
else:
existing_paths.add(row[0])Or simpler: use find_file_by_path to check existence instead of a set lookup.
| except ImportError: | ||
| # Fallback for non-web contexts (e.g., tests) | ||
| # Store as-is absolute path | ||
| return str(file_path) |
There was a problem hiding this comment.
Major: Silent fallback stores absolute paths
If the import fails, return str(file_path) stores an absolute path — the exact problem this PR is trying to solve. No warning is logged, making it hard to diagnose.
However, this fallback is likely dead code since other imports from ..web in this file (e.g. line 227: from ..web.models.uploaded_file import UploadedFile) would also fail if the web module were unavailable.
Suggestion: Either make the import mandatory (remove the try/except), or at minimum add logger.warning(...):
except ImportError:
logger.warning("Could not import web.utils.file; storing absolute path")
return str(file_path)| record.storage_path = relative_path.as_posix() # pyright: ignore[reportAttributeAccessIssue] | ||
| except ValueError as e: | ||
| # Path outside uploads_dir - keep as absolute but log the issue | ||
| print( |
There was a problem hiding this comment.
Minor: Multi-line warning printed for every unconvertible record
This 7-line print() block fires for each record that can't be converted. If there are hundreds of such records, this floods stdout with thousands of lines.
Fix: Print the detailed message once, then just count subsequent failures:
unconvertible = 0
...
except ValueError as e:
if unconvertible == 0:
print(f"Warning: Some paths cannot be converted...")
print("See scripts/migrate_uploads_file_abs_path.README.md")
unconvertible += 1
...
if unconvertible > 0:
print(f"{unconvertible} records could not be converted")| # Convert relative to absolute | ||
| user_root = get_uploads_dir() / f"user_{record.user_id}" | ||
| absolute_path = user_root / path_obj | ||
| record.storage_path = str(absolute_path) # pyright: ignore[reportAttributeAccessIssue] |
There was a problem hiding this comment.
Minor: downgrade uses str() instead of .as_posix()
The upgrade path uses .as_posix() (line 74) to ensure forward-slash separators. But downgrade uses str(absolute_path), which on Windows produces backslash paths like C:\uploads\user_1\....
While the detection code in upgrade handles both formats, the asymmetry is inconsistent. Consider:
record.storage_path = Path(absolute_path).as_posix()This ensures roundtrip consistency: upgrade → downgrade → upgrade produces identical results on all platforms.
| Relative path string with POSIX separators (/) | ||
|
|
||
| Raises: | ||
| ValueError: If path is not within UPLOADS_DIR (caught by caller) |
There was a problem hiding this comment.
Minor: Docstring says Raises: ValueError but the function never raises it
The docstring documents ValueError: If path is not within UPLOADS_DIR, but the actual implementation catches ValueError at line 42 and returns absolute_path.as_posix() as a fallback. The function never raises ValueError.
Fix: Update the docstring to reflect actual behavior:
Returns:
Relative path with POSIX separators. If the path is outside
UPLOADS_DIR, returns the absolute path as POSIX.
And remove the Raises section.
| @@ -127,13 +128,21 @@ def delete_collection_uploaded_files( | |||
There was a problem hiding this comment.
Minor: os.sep vs "/" inconsistency
to_relative_path() always returns POSIX paths (uses .as_posix()), so rel_prefix never contains os.sep. The check endswith(os.sep) works by coincidence on Linux (os.sep = "/") but is semantically misleading — on Windows it checks for "\\" which never matches.
Fix: Just check for "/":
if not rel_prefix.endswith("/"):
rel_prefix_str = rel_prefix + "/"
Store Relative Paths in
uploaded_files.storage_pathSummary
Convert
uploaded_files.storage_pathfrom storing absolute paths to relative paths (withoutuser_{user_id}prefix).Before:
/uploads/user_1/web_task_123/output/file.txtAfter:
web_task_123/output/file.txtMotivation
Storing absolute paths in the database creates portability and configuration issues:
XAGENT_UPLOADS_DIRinvalidates existing recordsChanges
New Utilities
src/xagent/web/utils/file.py:to_relative_path()- Convert absolute to relative for storageto_absolute_path()- Convert relative to absolute for file accessfind_file_by_path()- Query helper handling both formatsModel Enhancement
src/xagent/web/models/uploaded_file.py:absolute_pathproperty - transparently resolves both absolute (old) and relative (new) pathsUpdated Storage Locations
files.py: File upload stores relative pathsworkspace.py: Agent workspace registration uses relative pathswebsocket.py: Output file registration uses relative pathskb_file_service.py: Knowledge base operations use relative pathsBackward Compatibility
The
absolute_pathproperty andfind_file_by_path()handle both formats seamlessly. Existing code continues to work after migration.Migration
Automatic (Recommended for most cases)
Migration
3da89273f616converts all absolute paths to relative paths using currentUPLOADS_DIR.Manual (For complex scenarios)
If
XAGENT_UPLOADS_DIRhas changed multiple times, use the manual tool:See
scripts/migrate_uploads_file_abs_path.README.mdfor details.Testing
tests/web/test_storage_path_relative.py- 6 tests covering:absolute_pathproperty for both formatsAll passing (822/823 tests; 1 unrelated pre-existing failure).
Design Notes
Single-column design: Data transformation only, no schema changes. This allows full rollback via
alembic downgrade.Reversible: Upgrade converts abs→rel, downgrade converts rel→abs.