Skip to content

[WIP] Support data skipping based on column stats in Hudi connector #25184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

yihua
Copy link
Member

@yihua yihua commented Feb 28, 2025

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 28, 2025
@github-actions github-actions bot added hudi Hudi connector hive Hive connector labels Feb 28, 2025
@yihua yihua force-pushed the HUDI-data-skipping branch from 74846a0 to 30b1e6d Compare March 12, 2025 00:00
@@ -188,6 +188,16 @@ public static Map<String, List<FileSlice>> lookupCandidateFilesInMetadataTable(
.stream()
.filter(fileSlice -> pruneFiles(fileSlice, statsByFileName, regularColumnPredicates, regularColumns))
.collect(Collectors.toList())));
if (log.isInfoEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check debug level instead?

public long getPos()
throws IOException
{
return super.getPos() - startOffset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can return negative value? Or is it guaranteed that the underlying stream is positioned at or after startOffset at initialization?

throws IOException
{
if (pos > length) {
throw new IOException(String.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we should also add a safety check just in case pos < 0 then throw exception in this case as well?

"Attempting to seek past inline content: position to seek to is %s but the length is %s",
pos, length));
}
super.seek(startOffset + pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we seek till pos == length. I am assuming that length corresponds to a single hfile data block. Is there any chance to hit EOF?

throws IOException
{
return new InlineSeekableDataInputStream(
(TrinoInputStream) storage.open(getFilePathFromInlinePath(path)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we test with colstats having log files in latest file slices? Just want t ensure that we don't hit any ClassCastException while casting here?

}

@Override
public SeekableDataInputStream openSeekable(StoragePath path, int bufferSize, boolean wrapStream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we ignore bufferSize and wrapStream?

Copy link

github-actions bot commented Apr 4, 2025

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Apr 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector hudi Hudi connector stale
Development

Successfully merging this pull request may close these issues.

2 participants