Skip to content

Conversation

@rfbr
Copy link
Collaborator

@rfbr rfbr commented Oct 13, 2025

Previously when a RequestException occurred while getting a WARC url, the sub-process running the document_generator function would crash without yielding any result.
The following logic

while True:
    ...
    if progress_bar.n == len(warc_paths):
        break

would thus never reach the expected count, preventing the loop from breaking, causing the main process to hang indefinitely.

This commit fixes the issue by:

  • adding a WarcResults dataclass to track success status
  • adding the get_response call in a try-except to catch RequestException, yielding a WarcResults with success=False in case of a RequestException
    It also improve the error reporting with some statistics about failed WARC and records.

@rfbr rfbr force-pushed the fix-hanging-processes-on-warc-error branch 3 times, most recently from 6898c41 to fbe31fe Compare October 13, 2025 12:43
@rfbr rfbr force-pushed the fix-hanging-processes-on-warc-error branch from fbe31fe to 2c7a0b5 Compare October 13, 2025 14:11
class WarcResults:
warc_url: str
success: bool
total_records: int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an attribute computed dynamically to avoid asking the caller to keep track of it?
Something like

@property
def total_records():
    return self.processed_records + self.failed_records

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done.

processed_records=processed_records,
failed_records=failed_records,
)
except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we never print the exception, it can be a pain to debug then. Can we put it in WarcResults? If it's not pickable we can always store the error message there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it was mainly RequestException or IncompleteRead errors I thought it wasn't really needed. Nevertheless it makes more sense to log the error. Done!

@rfbr rfbr merged commit cd37a59 into kyutai-labs:main Oct 14, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants