GCS: Honor filetype hints #495
Greptile Summary
This PR refactors endpoint parsing for blob sources (S3 and GCS) by extracting the logic into a new function.
Critical Issues Found:
Both issues will cause runtime failures when using the filetype hint feature.
3 files reviewed, 3 comments
ingestr/src/blob.py
Outdated
def determine_endpoint(table: str, path: str):
    """
    determines the endpoint/method to use for reading data from a blob source
    """

    if "#" in table:
        _, endpoint = table.split("#")
        if endpoint not in ["csv", "jsonl", "parquet"]:
            raise UnsupportedEndpointError(f"Unsupported file format: {endpoint}")
        endpoint = f"read_{endpoint}"
    else:
        try:
            endpoint = parse_endpoint(path)
        except Exception as e:
            raise ValueError(
                f"Failed to parse endpoint from path: {path}"
            ) from e
[P0] Function never returns a value. Add `return endpoint` at the end (after line 94) to return the computed endpoint string.
Additional Comments (2)
ingestr/src/sources.py, line 2332
[P0] `parse_uri` is called with `table`, which may contain a `#filetype` hint. This corrupts the parsed path, since `parse_uri` doesn't strip the hint. Strip the hint before calling: `table_without_hint = table.split('#')[0]`
ingestr/src/sources.py, line 1722
[P0] `parse_uri` is called with `table`, which may contain a `#filetype` hint. This corrupts the parsed path, since `parse_uri` doesn't strip the hint. Strip the hint before calling: `table_without_hint = table.split('#')[0]`
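The fix suggested for both `sources.py` call sites can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: `split_filetype_hint` is a hypothetical helper name, and the example table string is made up. It uses `str.partition` instead of `split('#')[0]` so that the hint stays available to the caller.

```python
from typing import Optional, Tuple


def split_filetype_hint(table: str) -> Tuple[str, Optional[str]]:
    """Split 'uri#hint' into (uri, hint); hint is None when no '#' is present.

    Hypothetical helper for illustration only; not part of ingestr.
    """
    uri, sep, hint = table.partition("#")
    return uri, (hint if sep else None)


# The hint-free URI is what would then be handed to a URI parser such as parse_uri.
uri, hint = split_filetype_hint("bucket/path/data.bin#csv")
print(uri, hint)
```

Splitting once, up front, keeps the URI parser unaware of the hint syntax, which is the property the review comments ask for.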
Pull request overview
This pull request refactors the blob file type detection logic to honor explicit filetype hints for both S3 and GCS sources. The changes introduce a new determine_endpoint function that consolidates the logic for determining file types either from explicit hints (via # separator in the table parameter) or from file extensions.
Changes:
- Introduced `determine_endpoint` function in `blob.py` to centralize filetype detection logic
- Refactored S3 and GCS source classes to use the new `determine_endpoint` function instead of inline logic
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ingestr/src/blob.py | Added new determine_endpoint function that handles filetype hints via # separator and falls back to parsing file extensions |
| ingestr/src/sources.py | Updated S3Source and GCSSource to use the new determine_endpoint function for consistent filetype detection |
ingestr/src/blob.py
Outdated
        except Exception as e:
            raise ValueError(
                f"Failed to parse endpoint from path: {path}"
            ) from e
Copilot
AI
Jan 13, 2026
The function is missing a return statement. After determining the endpoint value in either branch, it should return the endpoint string. Add 'return endpoint' at the end of the function.
Suggested change
-            ) from e
+            ) from e
+    return endpoint
ingestr/src/blob.py
Outdated
        try:
            endpoint = parse_endpoint(path)
        except Exception as e:
            raise ValueError(
                f"Failed to parse endpoint from path: {path}"
            ) from e
Copilot
AI
Jan 13, 2026
The exception handling is inconsistent with the function's purpose. The function should propagate UnsupportedEndpointError when parse_endpoint raises it, rather than catching it and re-wrapping it as ValueError. The calling code already handles UnsupportedEndpointError separately from generic exceptions. Only catch and wrap truly unexpected exceptions.
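One way to read this comment is sketched below. It is an illustration under assumptions, not the merged code: `UnsupportedEndpointError` is redefined locally as a stand-in, `parse_endpoint` is a simplified stand-in that only looks at the file extension (the real function may do more), and `SUPPORTED_FORMATS` is a name introduced here. The sketch re-raises the specific error untouched, wraps only unexpected exceptions, and returns a value on every path.

```python
import os


class UnsupportedEndpointError(Exception):
    """Local stand-in for ingestr's exception of the same name."""


# Name introduced for this sketch; the formats come from the reviewed code.
SUPPORTED_FORMATS = ("csv", "jsonl", "parquet")


def parse_endpoint(path: str) -> str:
    """Simplified stand-in: infer the reader from the file extension."""
    file_extension = os.path.splitext(path)[1].lstrip(".")
    if file_extension not in SUPPORTED_FORMATS:
        raise UnsupportedEndpointError(f"Unsupported file format: {file_extension}")
    return f"read_{file_extension}"


def determine_endpoint(table: str, path: str) -> str:
    """Pick the reader from an explicit '#hint' if present, else from the path."""
    if "#" in table:
        _, _, hint = table.partition("#")
        if hint not in SUPPORTED_FORMATS:
            raise UnsupportedEndpointError(f"Unsupported file format: {hint}")
        return f"read_{hint}"
    try:
        return parse_endpoint(path)
    except UnsupportedEndpointError:
        # Let the specific error through so callers can handle it separately.
        raise
    except Exception as e:
        # Only unexpected failures are wrapped.
        raise ValueError(f"Failed to parse endpoint from path: {path}") from e
```

With this shape, a caller that already distinguishes `UnsupportedEndpointError` from generic exceptions sees the same error type whether the format came from a hint or from the extension.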
ingestr/src/blob.py
Outdated
        raise UnsupportedEndpointError(f"Unsupported file format: {file_extension}")
    return endpoint


def determine_endpoint(table: str, path: str):
Copilot
AI
Jan 13, 2026
The function signature is missing a return type annotation. Add '-> str' to indicate that this function returns a string, consistent with the parse_endpoint function and the usage in calling code where the result is annotated as 'str'.
Suggested change
- def determine_endpoint(table: str, path: str):
+ def determine_endpoint(table: str, path: str) -> str:
Background
Blob storage sources (S3, GCS) can read data in multiple file formats. To know which decoder to use, ingestr relies on:
1. The file extension of the path
2. An explicit filetype hint appended to the table parameter after a `#` separator
The S3 source supports both of these hints, but GCS only supports the first. This change adds support for the second hint to the GCS source.