fix(upload): store relative paths in uploaded_files.storage_path #235

Open
tanbro wants to merge 6 commits into xorbitsai:main from tanbro:fix/upload_relative_path

@tanbro tanbro commented Apr 1, 2026

Store Relative Paths in uploaded_files.storage_path

Summary

Convert uploaded_files.storage_path from storing absolute paths to relative paths (without user_{user_id} prefix).

Before: /uploads/user_1/web_task_123/output/file.txt
After: web_task_123/output/file.txt

Motivation

Storing absolute paths in the database creates portability and configuration issues:

| Issue | Impact |
| --- | --- |
| Non-portable | Cannot migrate database to different environments |
| Config changes break data | Changing XAGENT_UPLOADS_DIR invalidates existing records |
| Backup/restore complex | Restoring to new path requires updating all records |
| Cross-platform | Windows and Unix path formats incompatible |

Changes

New Utilities

src/xagent/web/utils/file.py:

  • to_relative_path() - Convert absolute to relative for storage
  • to_absolute_path() - Convert relative to absolute for file access
  • find_file_by_path() - Query helper handling both formats
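
As a rough illustration of what the two conversion helpers do, here is a minimal sketch. The signatures below, including the explicit `uploads_dir` parameter, are assumptions made for illustration; the PR's actual helpers live in src/xagent/web/utils/file.py and may differ (this sketch, for instance, simply lets `ValueError` propagate when the path is outside the user's directory).

```python
from pathlib import Path


def to_relative_path(abs_path: Path, user_id: int, uploads_dir: Path) -> str:
    """Strip <uploads_dir>/user_<id>/ and return a POSIX-style relative path."""
    user_root = uploads_dir / f"user_{user_id}"
    # relative_to() raises ValueError when abs_path is not under user_root.
    return abs_path.relative_to(user_root).as_posix()


def to_absolute_path(stored: str, user_id: int, uploads_dir: Path) -> Path:
    """Rebuild the on-disk location from a stored relative path."""
    return uploads_dir / f"user_{user_id}" / stored


uploads = Path("/uploads")  # hypothetical uploads root
rel = to_relative_path(
    Path("/uploads/user_1/web_task_123/output/file.txt"), 1, uploads
)
print(rel)
print(to_absolute_path(rel, 1, uploads))
```

Because only the part below the user directory is stored, changing the uploads root affects `to_absolute_path()` at read time and leaves the stored values untouched.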

Model Enhancement

src/xagent/web/models/uploaded_file.py:

  • Added absolute_path property - transparently resolves both absolute (old) and relative (new) paths
@property
def absolute_path(self) -> Path:
    stored = Path(self.storage_path)
    if stored.is_absolute():
        return stored  # Old data: return as-is
    user_root = get_uploads_dir() / f"user_{self.user_id}"
    return (user_root / self.storage_path).resolve()  # New data: resolve

Updated Storage Locations

  • files.py: File upload stores relative paths
  • workspace.py: Agent workspace registration uses relative paths
  • websocket.py: Output file registration uses relative paths
  • kb_file_service.py: Knowledge base operations use relative paths

Backward Compatibility

The absolute_path property and find_file_by_path() handle both formats seamlessly. Existing code continues to work after migration.

Migration

Automatic (Recommended for most cases)

alembic upgrade head

Migration 3da89273f616 converts all absolute paths to relative paths using current UPLOADS_DIR.

Manual (For complex scenarios)

If XAGENT_UPLOADS_DIR has changed multiple times, use the manual tool:

python scripts/migrate_uploads_file_abs_path.py migrate -d /old/uploads --confirm

See scripts/migrate_uploads_file_abs_path.README.md for details.

Testing

tests/web/test_storage_path_relative.py - 6 tests covering:

  • Absolute ↔ relative conversion
  • absolute_path property for both formats
  • Backward compatibility

All passing (822/823 tests; 1 unrelated pre-existing failure).

Design Notes

Single-column design: Data transformation only, no schema changes. This allows full rollback via alembic downgrade.

Reversible: Upgrade converts abs→rel, downgrade converts rel→abs.

@XprobeBot XprobeBot added the bug Something isn't working label Apr 1, 2026
@tanbro tanbro marked this pull request as draft April 1, 2026 06:15
@tanbro tanbro marked this pull request as ready for review April 1, 2026 06:36
Contributor

Should this be an Alembic migration?

Contributor Author

@tanbro tanbro Apr 1, 2026

I don't think we need an Alembic migration to convert the existing absolute-path records in the uploaded_files table to relative paths.

The PR already works with absolute-path records, so leaving them as they are carries less risk.

Contributor

This is OK, but the problem with the script is that it takes too much explanation to teach users how to use it. What's your idea?

Contributor Author

OK, I'll optimize the migration script with batching and transactions so that it can handle a large uploaded_files table.

Contributor Author

The script has been optimized to be production-ready:

  • Batch processing with transactions (default batch size: 1000)
  • Progress bars for visibility
  • Subcommands (check / migrate) for better UX
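
To illustrate the batching idea, here is a minimal sketch using the standard-library sqlite3 module. The table layout, the `migrate_in_batches` name, and the `/uploads` root are assumptions for illustration and are not taken from scripts/migrate_uploads_file_abs_path.py.

```python
import sqlite3
from pathlib import PurePosixPath


def migrate_in_batches(
    conn: sqlite3.Connection, uploads_dir: str, batch_size: int = 1000
) -> int:
    """Convert absolute storage_path values to relative, one transaction per batch."""
    converted = 0
    last_id = 0
    while True:
        # Keyset pagination on the primary key keeps each batch bounded.
        rows = conn.execute(
            "SELECT id, user_id, storage_path FROM uploaded_files "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return converted
        with conn:  # commits on success, rolls back this batch on error
            for row_id, user_id, stored in rows:
                path = PurePosixPath(stored)
                root = PurePosixPath(uploads_dir) / f"user_{user_id}"
                if not path.is_absolute():
                    continue  # already relative
                try:
                    relative = path.relative_to(root).as_posix()
                except ValueError:
                    continue  # outside the user's root: leave as-is
                conn.execute(
                    "UPDATE uploaded_files SET storage_path = ? WHERE id = ?",
                    (relative, row_id),
                )
                converted += 1
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaded_files "
    "(id INTEGER PRIMARY KEY, user_id INTEGER, storage_path TEXT)"
)
conn.executemany(
    "INSERT INTO uploaded_files (user_id, storage_path) VALUES (?, ?)",
    [(1, "/uploads/user_1/a/file.txt"), (1, "b/file.txt"), (2, "/elsewhere/c.txt")],
)
conn.commit()
count = migrate_in_batches(conn, "/uploads", batch_size=2)
paths = [r[0] for r in conn.execute("SELECT storage_path FROM uploaded_files ORDER BY id")]
```

Committing per batch keeps transactions short on a large table, at the cost that a failure partway through leaves earlier batches already converted.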

The previous implementation wasn't practical for real use on a large uploaded_files table.

Contributor Author

@tanbro tanbro Apr 2, 2026

> This is ok, but the problem of script is that there is too much explanation cost to teach users how to use it, what's your idea?

I prefer NOT to add a migration to our Alembic scripts.

The migration tool scripts/migrate_uploads_file_abs_path.py provided by the PR is OPTIONAL.

Users may choose to use it: the tool is production-ready with streaming/batch support, and the UX is clear and easy.

Users may also choose to keep their uploaded_files table as it is, because the PR's backward compatibility is solid.

Contributor

My concern is this: if compatibility is fine, what is the point of this script? No one other than you and me would use it.

@tanbro tanbro marked this pull request as draft April 2, 2026 01:33
@tanbro tanbro marked this pull request as ready for review April 2, 2026 04:05
@tanbro tanbro requested a review from qinxuye April 2, 2026 05:53
@tanbro tanbro marked this pull request as draft April 2, 2026 13:28
@tanbro tanbro marked this pull request as ready for review April 2, 2026 14:50
Contributor

I think we can keep the English version only.

- Convert storage_path from absolute to relative paths for portability
- Add path conversion utilities (to_relative_path, to_absolute_path, find_file_by_path)
- Update file upload, workspace, and websocket to store relative paths
- Add Alembic migration for data transformation (upgrade: abs->rel, downgrade: rel->abs)
- Add manual migration tool for multiple uploads_dir scenarios
- Add Chinese and English README for manual migration tool
@tanbro tanbro force-pushed the fix/upload_relative_path branch from 5d2ba0d to 13b177a Compare April 9, 2026 01:21
@tanbro tanbro marked this pull request as ready for review April 9, 2026 01:59
tanbro added 5 commits April 9, 2026 10:06
Also keep the Markdown documentation under scripts by adding !scripts/**/*.md to .gitignore, so that Markdown files in the scripts directory are not ignored.

fix(scripts): translate Chinese README to English for migration tool

Convert the Chinese documentation in scripts/migrate_uploads_file_abs_path.README.md to English to improve internationalization and maintain consistency with the project's English-based documentation standards.
- Fix kb.py import order per isort
- Remove relative path migration tool script's non-English readme
PR xorbitsai#247 changed UPLOADS_DIR from a constant to get_uploads_dir()
function. The new test test_relative_path_works_with_different_upload_dir
was using the old UPLOADS_DIR attribute approach, causing AttributeError.

This fix aligns with the unified configuration module introduced in PR xorbitsai#247.
Collaborator

@rogercloud rogercloud left a comment

Review: Store relative paths in uploaded_files.storage_path

The overall approach is solid and well-structured. Backward compatibility via the absolute_path property and dual-format find_file_by_path is a good design. However, I found several issues that need to be addressed before merging.


Critical (not inline — design issues)

1. storage_path UNIQUE constraint will break the migration in multi-user deployments

storage_path has a global UNIQUE constraint (Column(String(2048), nullable=False, unique=True)). After migration, two users with the same relative path will collide:

  • User 1: /uploads/user_1/web_task_1/output/file.txt → web_task_1/output/file.txt
  • User 2: /uploads/user_2/web_task_1/output/file.txt → web_task_1/output/file.txt

The migration aborts mid-batch, leaving the database in a partially-migrated state.

Fix: Either (a) drop the UNIQUE constraint, (b) change it to UniqueConstraint('user_id', 'storage_path'), or (c) include user_{user_id}/ in the relative path to preserve uniqueness.
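
The collision, and how option (b) avoids it, can be reproduced with a small sketch using the standard-library sqlite3 module. The two table definitions are simplified stand-ins for the real uploaded_files schema, and `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in: global UNIQUE on storage_path, as in the current model.
conn.execute(
    "CREATE TABLE global_unique (user_id INTEGER, storage_path TEXT UNIQUE)"
)
# Option (b): uniqueness scoped per user.
conn.execute(
    "CREATE TABLE per_user_unique (user_id INTEGER, storage_path TEXT, "
    "UNIQUE (user_id, storage_path))"
)

# Two users whose files end up with the same relative path after migration.
rows = [(1, "web_task_1/output/file.txt"), (2, "web_task_1/output/file.txt")]


def try_insert(table: str) -> bool:
    """Return True if both users' rows can be stored in the given table."""
    try:
        with conn:  # rolls back if the constraint is violated
            conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


global_ok = try_insert("global_unique")
per_user_ok = try_insert("per_user_unique")
print(global_ok, per_user_ok)
```

With the composite constraint, the same relative path may exist once per user, which matches how the files are laid out on disk.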

2. Telegram and Feishu channel bots still store absolute paths

src/xagent/web/channels/telegram/bot.py:238 and src/xagent/web/channels/feishu/bot.py:476 still use storage_path=str(target_path). These files weren't updated by this PR but should be converted to use to_relative_path() for consistency.


Summary

| Severity | Count | Key Issues |
| --- | --- | --- |
| Critical | 2 | UNIQUE constraint collision; missed bot conversions |
| Major | 4 | Broken dedup; wrong find_file_by_path usage (x2); backfill UNIQUE risk; silent fallback |
| Minor | 4 | Migration logging; downgrade POSIX; docstring; separator consistency |

_auto_register = contextvars.ContextVar("_auto_register", default=False)


def _to_relative_path(file_path: Path, user_id: int) -> str:
Collaborator

Major: Broken deduplication in list_all_user_files

After this change, DB records have storage_path as a relative string (e.g. "web_task_123/output/file.txt"), but the filesystem scan below (lines 855-856) compares against file_path which is an absolute string (e.g. "/uploads/user_1/web_task_123/output/file.txt").

is_already_listed = any(
    f.get("storage_path") == file_path for f in result_files
)

Before this PR, both sides were absolute, so the comparison worked. Now it never matches, causing duplicate entries — once from DB records and once from the workspace filesystem scan.

Fix: Use str(file_record.absolute_path) for the comparison, or compare using absolute_path consistently.

is_already_listed = any(
    f.get("storage_path") == relative_storage_path
    for f in result_files
)

Where relative_storage_path matches the format stored in storage_path for DB records.
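
One way to make the comparison independent of the stored format is to normalize both sides to the relative form before checking. This is only a sketch: the helper name `as_relative` and the hard-coded `/uploads/user_1` root are assumptions for illustration.

```python
from pathlib import PurePosixPath


def as_relative(path: str, user_root: str) -> str:
    """Return path relative to user_root; relative inputs pass through unchanged."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return p.as_posix()
    try:
        return p.relative_to(user_root).as_posix()
    except ValueError:
        return p.as_posix()  # outside user_root: keep the absolute form


user_root = "/uploads/user_1"
# DB-backed entries may be in either format during the transition period.
result_files = [
    {"storage_path": "web_task_123/output/file.txt"},
    {"storage_path": "/uploads/user_1/web_task_9/old.txt"},
]
listed = {as_relative(f["storage_path"], user_root) for f in result_files}

# Path produced by the filesystem scan (absolute).
scanned = "/uploads/user_1/web_task_123/output/file.txt"
is_already_listed = as_relative(scanned, user_root) in listed
```

Building the set once also avoids rescanning result_files for every file found on disk.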

from ..utils.db_timezone import safe_timestamp_to_unix
from ..utils.file import find_file_by_path, to_relative_path

logger = logging.getLogger(__name__)
Collaborator

Major: find_file_by_path called with relative path — won't find old absolute records

find_file_by_path is designed to accept an absolute path (see its docstring at line 81: file_path: Absolute file path to search for). Internally it tries: (1) exact match, then (2) relative conversion if file_path.is_absolute().

But here relative_storage_path is already a relative string. Inside the function:

  1. First query matches storage_path == "web_task_123/..." — OK for new records
  2. Second branch: file_path.is_absolute() → False → skipped, so old absolute-path records are never found

During the transition period (before migration), the DB still has absolute paths. This code would fail to find existing records and attempt to create duplicates, potentially hitting the UNIQUE constraint.

Fix: Pass the absolute resolved_path directly — the function handles conversion internally:

file_record = find_file_by_path(db, resolved_path, task_user_id)
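
The lookup order described in this comment can be sketched against an in-memory list. This is a hypothetical stand-in for find_file_by_path, not its actual implementation; the `Record` class, the `find_record` name, and the explicit `uploads_dir` argument are invented for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class Record:
    user_id: int
    storage_path: str


def find_record(
    records: List[Record], file_path: str, user_id: int, uploads_dir: str
) -> Optional[Record]:
    """Exact match first; then, for absolute inputs only, the relative form."""
    candidates = [file_path]
    path = PurePosixPath(file_path)
    user_root = PurePosixPath(uploads_dir) / f"user_{user_id}"
    if path.is_absolute():
        try:
            candidates.append(path.relative_to(user_root).as_posix())
        except ValueError:
            pass  # outside the user's root: only the exact match applies
    for candidate in candidates:
        for record in records:
            if record.user_id == user_id and record.storage_path == candidate:
                return record
    return None


records = [
    Record(1, "/uploads/user_1/old/file.txt"),  # pre-migration row
    Record(1, "new/file.txt"),  # post-migration row
]
old = find_record(records, "/uploads/user_1/old/file.txt", 1, "/uploads")
new = find_record(records, "/uploads/user_1/new/file.txt", 1, "/uploads")
# A relative input skips the second branch, so the old row is missed.
missed = find_record(records, "old/file.txt", 1, "/uploads")
```

Passing the absolute path finds rows in both formats, while the relative input only ever matches new-style rows.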

normalized_relative_path
)

file_record = (
Collaborator

Major: Same find_file_by_path issue as above

Same problem as _normalize_file_outputs: relative_storage_path is already relative, so find_file_by_path cannot find old-style absolute records.

Fix: Pass resolved_path (absolute) instead:

file_record = find_file_by_path(db, resolved_path, owner_user_id)

Then use relative_storage_path only for creating new records.

from ..models.database import get_db
from ..models.uploaded_file import UploadedFile
from ..models.user import User
from ..utils.file import find_file_by_path, to_relative_path
Collaborator

Major: Backfill dedup fails with mixed old/new records

existing_paths collects raw storage_path values from the DB. If some records still have absolute paths (pre-migration), the set contains strings like "/uploads/user_1/web_task_123/output/file.txt". The computed relative_storage_path is "web_task_123/output/file.txt" — the in check fails, and the backfill attempts to insert a duplicate, violating the UNIQUE constraint.

Fix: Normalize existing_paths to relative format:

existing_paths = set()
rows = (
    db.query(UploadedFile.user_id, UploadedFile.storage_path)
    .filter(UploadedFile.user_id.in_(target_user_ids))
    .all()
)
for user_id, storage_path in rows:
    p = Path(storage_path)
    if p.is_absolute():
        # Old record: normalize using the owning user's ID
        existing_paths.add(to_relative_path(p, user_id))
    else:
        existing_paths.add(storage_path)

Or simpler: use find_file_by_path to check existence instead of a set lookup.

except ImportError:
    # Fallback for non-web contexts (e.g., tests)
    # Store as-is absolute path
    return str(file_path)
Collaborator

Major: Silent fallback stores absolute paths

If the import fails, return str(file_path) stores an absolute path — the exact problem this PR is trying to solve. No warning is logged, making it hard to diagnose.

However, this fallback is likely dead code since other imports from ..web in this file (e.g. line 227: from ..web.models.uploaded_file import UploadedFile) would also fail if the web module were unavailable.

Suggestion: Either make the import mandatory (remove the try/except), or at minimum add logger.warning(...):

except ImportError:
    logger.warning("Could not import web.utils.file; storing absolute path")
    return str(file_path)

record.storage_path = relative_path.as_posix() # pyright: ignore[reportAttributeAccessIssue]
except ValueError as e:
# Path outside uploads_dir - keep as absolute but log the issue
print(
Collaborator

Minor: Multi-line warning printed for every unconvertible record

This 7-line print() block fires for each record that can't be converted. If there are hundreds of such records, this floods stdout with thousands of lines.

Fix: Print the detailed message once, then just count subsequent failures:

unconvertible = 0
...
except ValueError as e:
    if unconvertible == 0:
        print(f"Warning: Some paths cannot be converted...")
        print("See scripts/migrate_uploads_file_abs_path.README.md")
    unconvertible += 1
...
if unconvertible > 0:
    print(f"{unconvertible} records could not be converted")

# Convert relative to absolute
user_root = get_uploads_dir() / f"user_{record.user_id}"
absolute_path = user_root / path_obj
record.storage_path = str(absolute_path) # pyright: ignore[reportAttributeAccessIssue]
Collaborator

Minor: downgrade uses str() instead of .as_posix()

The upgrade path uses .as_posix() (line 74) to ensure forward-slash separators. But downgrade uses str(absolute_path), which on Windows produces backslash paths like C:\uploads\user_1\....

While the detection code in upgrade handles both formats, the asymmetry is inconsistent. Consider:

record.storage_path = Path(absolute_path).as_posix()

This ensures roundtrip consistency: upgrade → downgrade → upgrade produces identical results on all platforms.
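
The difference between the two calls can be seen on any platform by using `PureWindowsPath`, which applies Windows path rules without touching the filesystem. The path below is a made-up example.

```python
from pathlib import PureWindowsPath

# Hypothetical absolute path as it would look on a Windows deployment.
absolute = PureWindowsPath("C:/uploads/user_1/web_task_1/file.txt")

native = str(absolute)  # uses backslash separators
portable = absolute.as_posix()  # uses forward-slash separators

print(native)
print(portable)
```

Both strings parse back to the same `PureWindowsPath`, but only the POSIX form is stable as stored text across platforms.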

Relative path string with POSIX separators (/)

Raises:
ValueError: If path is not within UPLOADS_DIR (caught by caller)
Collaborator

Minor: Docstring says Raises: ValueError but the function never raises it

The docstring documents ValueError: If path is not within UPLOADS_DIR, but the actual implementation catches ValueError at line 42 and returns absolute_path.as_posix() as a fallback. The function never raises ValueError.

Fix: Update the docstring to reflect actual behavior:

Returns:
    Relative path with POSIX separators. If the path is outside
    UPLOADS_DIR, returns the absolute path as POSIX.

And remove the Raises section.

@@ -127,13 +128,21 @@ def delete_collection_uploaded_files(
Collaborator

Minor: os.sep vs "/" inconsistency

to_relative_path() always returns POSIX paths (uses .as_posix()), so rel_prefix never contains os.sep. The check endswith(os.sep) works by coincidence on Linux (os.sep = "/") but is semantically misleading — on Windows it checks for "\\" which never matches.

Fix: Just check for "/":

if not rel_prefix.endswith("/"):
    rel_prefix_str = rel_prefix + "/"
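
A short sketch of why the separator check should not depend on the host OS. It uses the standard-library `ntpath` and `posixpath` modules to stand in for `os.sep` on Windows and Linux respectively; `with_trailing_slash` is a hypothetical helper name.

```python
import ntpath
import posixpath


def with_trailing_slash(prefix: str) -> str:
    """Ensure a POSIX-style prefix ends with exactly one trailing '/'."""
    return prefix if prefix.endswith("/") else prefix + "/"


rel_prefix = with_trailing_slash("web_task_1/output")

# posixpath.sep is "/" and ntpath.sep is "\\", mirroring os.sep per platform.
matches_on_linux = rel_prefix.endswith(posixpath.sep)
matches_on_windows = rel_prefix.endswith(ntpath.sep)
print(rel_prefix, matches_on_linux, matches_on_windows)
```

On Windows the `os.sep` check would report the prefix as unterminated even though it already ends with "/", so a second slash would be appended.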


4 participants