Skip to content

Conversation

@huleilei
Copy link
Contributor

Changes Made

VideoFile.keyframes() computed an incorrect seek timestamp when start_time > 0. The previous implementation used start_time * time_base, which keeps the effective seek position near 0 for small time_base values (for example 1/30000). As a result, the start_time parameter was effectively ignored and keyframes were decoded from the beginning of the video.

This also affected the daft.functions.video_keyframes() UDF, which delegates to VideoFile.keyframes(). Users expecting to extract keyframes from a later time range (for example starting at 10 seconds) instead received frames starting close to time 0.

Related Issues

#5949

@github-actions github-actions bot added the fix label Jan 10, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 10, 2026

Greptile Overview

Greptile Summary

This PR fixes a critical bug in VideoFile.keyframes() where the start_time parameter was effectively ignored due to incorrect timestamp calculation.

The Bug

The original implementation used seek_timestamp = int(start_time * video.time_base) to calculate the seek position. Since time_base represents seconds per PTS unit (typically values like 1/30000), multiplying by start_time produced timestamps near 0 regardless of the actual start_time value. For example, with start_time=10 seconds and time_base=1/30000, the old formula gave 10 * (1/30000) = 0.000333, which rounds to 0.

The Fix

The PR correctly changes the formula to seek_timestamp = int(start_time / float(video.time_base)), which properly converts seconds to PTS units. With the same example: 10 / (1/30000) = 300000 PTS units - the correct seek position.

Additional improvements include:

  • Adding stream=video parameter to container.seek() for stream-specific seeking
  • Adding post-seek frame filtering to skip frames that land before start_time (since seeks may land on earlier keyframes)
  • Adding test coverage for start_time functionality

Code Quality Notes

The core fix is mathematically sound and addresses the reported issue. The code properly handles edge cases like None time_base values. Minor style inconsistencies exist in the end_time checking logic, and test coverage could be strengthened to verify actual frame timestamps rather than just frame counts.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it fixes a critical bug with a mathematically correct solution
  • The core timestamp calculation fix is mathematically sound and directly addresses the reported issue. The implementation properly handles edge cases (None time_base, frame.time checks). Additional safety measures (post-seek filtering, stream parameter) strengthen robustness. Minor style inconsistencies and test coverage gaps exist but don't affect correctness. Score is 4/5 rather than 5/5 due to: (1) test coverage validates counts but not actual frame timestamps, and (2) minor style inconsistency in end_time handling that could be cleaned up.
  • No files require special attention - both files have been thoroughly reviewed and the changes are straightforward

Important Files Changed

File Analysis

Filename Score Overview
daft/file/video.py 4/5 Fixed critical timestamp calculation bug and added post-seek frame filtering; minor style inconsistency in end_time handling
tests/file/test_video.py 3/5 Added three tests for start_time functionality; tests verify basic behavior but lack edge case coverage and precise frame time validation

Sequence Diagram

sequenceDiagram
    participant User
    participant VideoFile
    participant PyAV
    participant Container
    participant VideoStream
    
    User->>VideoFile: keyframes(start_time=10, end_time=None)
    VideoFile->>VideoFile: open() file
    VideoFile->>PyAV: av.open(file)
    PyAV-->>VideoFile: container
    VideoFile->>Container: get video stream
    Container-->>VideoFile: video stream
    
    alt start_time > 0 and video.time_base exists
        VideoFile->>VideoFile: seek_timestamp = int(start_time / float(time_base))
        VideoFile->>Container: seek(seek_timestamp, stream=video)
        Container-->>VideoFile: positioned at keyframe
    end
    
    VideoFile->>VideoStream: skip_frame = "NONKEY"
    VideoFile->>Container: decode(video)
    
    loop for each keyframe
        Container-->>VideoFile: frame
        
        alt frame.time < start_time
            VideoFile->>VideoFile: continue (skip frame)
        else if end_time exists and frame.time > end_time
            VideoFile->>VideoFile: break (stop iteration)
        else
            VideoFile->>VideoFile: frame.to_image()
            VideoFile-->>User: yield PIL.Image
        end
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@huleilei huleilei force-pushed the hll/fix-video-keyframe-seek branch from 5d250c0 to 1d64cd0 Compare January 10, 2026 07:28
@huleilei huleilei marked this pull request as draft January 10, 2026 08:05
@huleilei huleilei force-pushed the hll/fix-video-keyframe-seek branch from 1d64cd0 to 7bacde6 Compare January 12, 2026 02:08
@codecov
Copy link

codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.91%. Comparing base (ae14540) to head (7bacde6).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6005      +/-   ##
==========================================
+ Coverage   72.89%   72.91%   +0.01%     
==========================================
  Files         971      971              
  Lines      125987   125989       +2     
==========================================
+ Hits        91842    91859      +17     
+ Misses      34145    34130      -15     
Files with missing lines Coverage Δ
daft/file/video.py 69.84% <100.00%> (+30.49%) ⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@huleilei huleilei marked this pull request as ready for review January 12, 2026 03:29
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +83 to +94
def test_keyframes_start_time_skips_early_frames(sample_video_path):
file = daft.VideoFile(sample_video_path)
metadata = file.metadata()
duration = metadata["duration"]
assert duration is not None

start_time = duration / 2.0

all_keyframes = list(file.keyframes())
later_keyframes = list(file.keyframes(start_time=start_time))

assert len(later_keyframes) < len(all_keyframes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only verifies that fewer frames are returned, but doesn't validate that the returned frames actually start from the correct time. Consider strengthening the test to verify the actual frame times:

Suggested change
def test_keyframes_start_time_skips_early_frames(sample_video_path):
file = daft.VideoFile(sample_video_path)
metadata = file.metadata()
duration = metadata["duration"]
assert duration is not None
start_time = duration / 2.0
all_keyframes = list(file.keyframes())
later_keyframes = list(file.keyframes(start_time=start_time))
assert len(later_keyframes) < len(all_keyframes)
def test_keyframes_start_time_skips_early_frames(sample_video_path):
file = daft.VideoFile(sample_video_path)
metadata = file.metadata()
duration = metadata["duration"]
assert duration is not None
start_time = duration / 2.0
all_keyframes = list(file.keyframes())
later_keyframes = list(file.keyframes(start_time=start_time))
assert len(later_keyframes) < len(all_keyframes)
# Verify that returned keyframes actually start after start_time
# This would require accessing frame metadata, which PIL.Image doesn't have
# Consider using a lower-level API or storing timestamps to validate

The current test passes as long as the count is different, but doesn't verify correctness. Ideally, you'd want to check that the first returned frame is at or after start_time.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/file/test_video.py
Line: 83:94

Comment:
This test only verifies that fewer frames are returned, but doesn't validate that the returned frames actually start from the correct time. Consider strengthening the test to verify the actual frame times:

```suggestion
def test_keyframes_start_time_skips_early_frames(sample_video_path):
    file = daft.VideoFile(sample_video_path)
    metadata = file.metadata()
    duration = metadata["duration"]
    assert duration is not None

    start_time = duration / 2.0

    all_keyframes = list(file.keyframes())
    later_keyframes = list(file.keyframes(start_time=start_time))

    assert len(later_keyframes) < len(all_keyframes)
    # Verify that returned keyframes actually start after start_time
    # This would require accessing frame metadata, which PIL.Image doesn't have
    # Consider using a lower-level API or storing timestamps to validate
```

The current test passes as long as the count is different, but doesn't verify correctness. Ideally, you'd want to check that the first returned frame is at or after `start_time`.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

daft/file/video.py
The end_time check has a style inconsistency with the start_time check. Line 117 directly uses frame.time, but here you assign it to frame_time first. Consider refactoring for consistency:

                    # Check end time if specified
                    if end_time is not None:
                        if frame.time and frame.time > end_time:
                            break

This makes the code more consistent and slightly cleaner.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: daft/file/video.py
Line: 121:124

Comment:
The `end_time` check has a style inconsistency with the `start_time` check. Line 117 directly uses `frame.time`, but here you assign it to `frame_time` first. Consider refactoring for consistency:

```suggestion
                    # Check end time if specified
                    if end_time is not None:
                        if frame.time and frame.time > end_time:
                            break
```

This makes the code more consistent and slightly cleaner.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant