Skip to content

Conversation

@potter-potter
Copy link
Contributor

@potter-potter potter-potter commented Jan 29, 2026

Note

Low Risk
Small, localized change to SFTP indexing output (file identifiers) plus a version bump; risk is mainly backward-compatibility for any downstream consumers relying on the previous identifier format.

Overview
Removes the SFTP indexer’s custom run() override that rewrote each FileData.identifier into a fully qualified sftp://host:port/... URL, reverting to the base FsspecIndexer behavior.

Bumps version to 1.2.38 and records the change in CHANGELOG.md.

Written by Cursor Bugbot for commit f284d5f. This will update automatically on new commits. Configure here.

@potter-potter
Copy link
Contributor Author

@claude

please review

this was the reason we made the change

The SFTP indexer was overriding run() to prefix file identifiers with sftp://host:port/, creating identifiers like sftp://3.22.168.92:22/uuid. This format contains forward slashes that break URL routing when the identifier is used as a path parameter in the ETL Job API. Since the identifier is only used for logging (file retrieval uses original_file_path from metadata), removing this override makes SFTP consistent with all other fsspec connectors that use UUID-only identifiers.

@claude
Copy link

claude bot commented Jan 29, 2026

Claude finished @potter-potter's task —— View job


Code Review for SFTP Indexer Override Removal

Todo List

  • Read repository guidelines (CLAUDE.md)
  • Examine the changed files to understand modifications
  • Review the removed SFTP indexer override implementation
  • Analyze consistency with other fsspec connectors
  • Provide comprehensive review feedback

Review Summary

This PR removes the SftpIndexer.run() method override that was prefixing file identifiers with sftp://{host}:{port}/, reverting to the base FsspecIndexer behavior. The change is well-justified and correctly implemented.

Analysis

What was removed:

def run(self, **kwargs: Any) -> Generator[FileData, None, None]:
    for file in super().run(**kwargs):
        new_identifier = (
            f"sftp://"
            f"{self.connection_config.host}:"
            f"{self.connection_config.port}/"
            f"{file.identifier}"
        )
        file.identifier = new_identifier
        yield file

Impact Assessment:

✅ Positive impacts:

  • Fixes URL routing issues: Removes forward slashes from identifiers that break URL path parameters in ETL Job API
  • Consistency: Makes SFTP behavior consistent with other fsspec connectors (Azure, GCS, S3, Box, Dropbox) that use UUID-only identifiers
  • Simplicity: Reduces custom logic and relies on well-tested base class behavior

⚠️ Considerations:

  • Breaking change: Existing systems that rely on the sftp://host:port/uuid identifier format will be affected
  • Deduplication impact: Systems using identifiers for deduplication may see duplicates during transition
  • State tracking: Any persisted references to the old identifier format will need updating

Code Quality Review

Implementation: ✅ Clean and correct

  • Proper version bump to 1.2.38 in unstructured_ingest/__version__.py:1
  • Appropriate changelog entry in CHANGELOG.md:3
  • Clean removal without leaving orphaned code

Base FsspecIndexer behavior: ✅ Reliable fallback

  • Uses uuid5(NAMESPACE_DNS, file_path) for identifier generation in fsspec.py:244
  • Stores original file path in additional_metadata["original_file_path"] for retrieval in fsspec.py:242
  • Consistent with 5 other fsspec connector implementations

Recommendations

  1. Documentation: Consider updating any API documentation that references SFTP identifier format
  2. Migration guide: Document the identifier format change for users who may have dependencies on the old format
  3. Monitoring: Monitor for any deduplication or state tracking issues after deployment

Conclusion

This is a well-reasoned fix that prioritizes API compatibility over connector-specific identifier formatting. The removal is clean, properly versioned, and brings SFTP in line with other connectors. The trade-off of breaking existing identifier dependencies is justified by fixing URL routing issues in the ETL Job API.

Recommendation: Approve


@potter-potter potter-potter merged commit 060bd52 into main Jan 29, 2026
37 of 40 checks passed
@potter-potter potter-potter deleted the potter/sftp-indexer branch January 29, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants