Skip to content

Conversation

@turtleDev
Copy link
Contributor

@turtleDev turtleDev commented Jan 13, 2026

Background

Blob storage source (s3, gcs) can read data in multiple file formats. In order to know which decoder to use, ingestr relies on:

  1. The file extension of the --source-table, or
  2. A type hint at the end, prefixed with a hash (for e.g.: '/path/to/file.bin#csv')

S3 source supports both of these hints, but GCS only supports the 1st. This change adds support for the 2nd hint to the GCS source.

Copilot AI review requested due to automatic review settings January 13, 2026 13:17
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Summary

This PR refactors endpoint parsing for blob sources (S3 and GCS) by extracting the logic into a new determine_endpoint() function. The refactoring enables users to specify file format hints using a # separator (e.g., bucket/file#csv) to override automatic format detection based on file extensions.

Critical Issues Found:

  1. The new determine_endpoint() function is missing a return statement, causing it to return None instead of the computed endpoint string
  2. The table parameter (which may contain #filetype hints) is passed to parse_uri() before the hint is stripped, corrupting the parsed file path

Both issues will cause runtime failures when using the filetype hint feature.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 78 to 94
def determine_endpoint(table: str, path: str):
"""
determines the endpoint/method to use for reading data from a blob source
"""

if "#" in table:
_, endpoint = table.split("#")
if endpoint not in ["csv", "jsonl", "parquet"]:
raise UnsupportedEndpointError(f"Unsupported file format: {endpoint}")
endpoint = f"read_{endpoint}"
else:
try:
endpoint = parse_endpoint(path)
except Exception as e:
raise ValueError(
f"Failed to parse endpoint from path: {path}"
) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] Function never returns a value. Add return endpoint at the end (after line 94) to return the computed endpoint string.

Prompt To Fix With AI
This is a comment left during a code review.
Path: ingestr/src/blob.py
Line: 78:94

Comment:
[P0] Function never returns a value. Add `return endpoint` at the end (after line 94) to return the computed endpoint string.

How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 13, 2026

Additional Comments (2)

ingestr/src/sources.py
[P0] parse_uri is called with table that may contain #filetype hint. This corrupts the parsed path since parse_uri doesn't strip the hint. Strip the hint before calling: table_without_hint = table.split('#')[0]

Prompt To Fix With AI
This is a comment left during a code review.
Path: ingestr/src/sources.py
Line: 2332:2332

Comment:
[P0] `parse_uri` is called with `table` that may contain `#filetype` hint. This corrupts the parsed path since `parse_uri` doesn't strip the hint. Strip the hint before calling: `table_without_hint = table.split('#')[0]`

How can I resolve this? If you propose a fix, please make it concise.

ingestr/src/sources.py
[P0] parse_uri is called with table that may contain #filetype hint. This corrupts the parsed path since parse_uri doesn't strip the hint. Strip the hint before calling: table_without_hint = table.split('#')[0]

Prompt To Fix With AI
This is a comment left during a code review.
Path: ingestr/src/sources.py
Line: 1722:1722

Comment:
[P0] `parse_uri` is called with `table` that may contain `#filetype` hint. This corrupts the parsed path since `parse_uri` doesn't strip the hint. Strip the hint before calling: `table_without_hint = table.split('#')[0]`

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the blob file type detection logic to honor explicit filetype hints for both S3 and GCS sources. The changes introduce a new determine_endpoint function that consolidates the logic for determining file types either from explicit hints (via # separator in the table parameter) or from file extensions.

Changes:

  • Introduced determine_endpoint function in blob.py to centralize filetype detection logic
  • Refactored S3 and GCS source classes to use the new determine_endpoint function instead of inline logic

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
ingestr/src/blob.py Added new determine_endpoint function that handles filetype hints via # separator and falls back to parsing file extensions
ingestr/src/sources.py Updated S3Source and GCSSource to use the new determine_endpoint function for consistent filetype detection

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

except Exception as e:
raise ValueError(
f"Failed to parse endpoint from path: {path}"
) from e
Copy link

Copilot AI Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is missing a return statement. After determining the endpoint value in either branch, it should return the endpoint string. Add 'return endpoint' at the end of the function.

Suggested change
) from e
) from e
return endpoint

Copilot uses AI. Check for mistakes.
Comment on lines 89 to 94
try:
endpoint = parse_endpoint(path)
except Exception as e:
raise ValueError(
f"Failed to parse endpoint from path: {path}"
) from e
Copy link

Copilot AI Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception handling is inconsistent with the function's purpose. The function should propagate UnsupportedEndpointError when parse_endpoint raises it, rather than catching it and re-wrapping it as ValueError. The calling code already handles UnsupportedEndpointError separately from generic exceptions. Only catch and wrap truly unexpected exceptions.

Copilot uses AI. Check for mistakes.
raise UnsupportedEndpointError(f"Unsupported file format: {file_extension}")
return endpoint

def determine_endpoint(table: str, path: str):
Copy link

Copilot AI Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature is missing a return type annotation. Add '-> str' to indicate that this function returns a string, consistent with the parse_endpoint function and the usage in calling code where the result is annotated as 'str'.

Suggested change
def determine_endpoint(table: str, path: str):
def determine_endpoint(table: str, path: str) -> str:

Copilot uses AI. Check for mistakes.
@turtleDev turtleDev merged commit 3812ec7 into main Jan 13, 2026
11 of 15 checks passed
@turtleDev turtleDev deleted the gcs/honor-filetype-hints branch January 13, 2026 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants