
Conversation

@dougyster (Collaborator) commented on Jan 1, 2026

Motivation

Improve the quality of the failure monitor by showing test-level failures for each job failure, along with other fixes and design improvements.

Modifications

ci_failures_analysis.py

Accuracy Tests

(https://github.com/sgl-project/sglang/actions/runs/20641823449)

Benchmarking and Profiling

N/A

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @dougyster, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the CI failure monitoring system by introducing the capability to analyze failures at the individual test file level. It enhances the existing job-level failure tracking with more detailed insights, allowing developers to quickly pinpoint the root cause of CI instability. The changes include fetching job logs, parsing test summaries, and presenting this granular information in an improved, interactive GitHub summary report, making it easier to identify and address persistent test failures.
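
For context, here is a minimal sketch of how job logs can be fetched and a pytest-style summary parsed. The endpoint is the standard GitHub REST API route for downloading job logs; the function names, regexes, and summary format are illustrative assumptions, not the PR's actual implementation.

import os
import re
import requests

def fetch_job_log(owner, repo, job_id, token):
    """Download the raw log text for a single workflow job."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/jobs/{job_id}/logs",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

def parse_test_summary(log_text):
    """Extract passed/total counts and failed test files from a pytest-style log."""
    failed_files = sorted(
        {os.path.basename(m.group(1)) for m in re.finditer(r"FAILED (\S+\.py)", log_text)}
    )
    summary = re.search(r"(\d+) failed, (\d+) passed", log_text)
    if summary is None:
        return None  # no recognizable summary line in this log
    failed, passed = int(summary.group(1)), int(summary.group(2))
    return {"passed": passed, "total": passed + failed, "failed_tests": failed_files}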

Highlights

  • Detailed Test-Level Failure Analysis: Introduced new functionality to fetch and parse job logs to identify specific failing test files within a CI job. This includes extracting test summaries, counting passed/total tests, and listing individual failed Python test files.
  • Enhanced Failure Tracking: Implemented logic to track total failures and current consecutive failure streaks for individual test files. This allows for more granular identification of flaky or consistently failing tests; a minimal sketch of this bookkeeping appears after this list.
  • Improved Reporting for Broken Jobs: Added a new analysis step that focuses on jobs with existing failure streaks (>=2) or high failure rates (>=50%). For these critical jobs, the system now retrieves and presents detailed test-level failure information directly in the report.
  • Expanded Run History Tracking: Increased the number of recent job runs tracked for analysis from 5 to 10, providing a broader historical context for job and test stability.
  • Refined GitHub Summary Output: The GitHub Actions summary now includes collapsible sections for detailed test failures within problematic jobs, highlighting tests with high streaks and providing direct links to job logs. The report structure has also been adjusted, moving runner health to a more prominent position and introducing distinct categories for jobs with high intermittent failure rates.
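
A minimal sketch of the streak bookkeeping referenced above, assuming each tracked test carries a newest-first recent_runs list whose entries have a failed key of True, False, or None; the field names are placeholders, not the PR's actual data model.

def failure_stats(recent_runs):
    """Compute totals from a newest-first run history for one test file."""
    total_failures = sum(1 for run in recent_runs if run.get("failed") is True)
    current_streak = 0
    for run in recent_runs:
        if run.get("failed") is True:
            current_streak += 1
        elif run.get("failed") is False:
            break  # the most recent passing run ends the streak
        # None (cancelled/skipped) neither extends nor breaks the streak
    return {"total_failures": total_failures, "current_streak": current_streak}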


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the CI failure monitor by introducing test-level failure analysis. It adds functionality to fetch and parse job logs to identify specific failing tests within a job, track their failure streaks, and report them in a new collapsible section in the GitHub Actions summary. The reporting has been refactored from console output to a more detailed and user-friendly markdown format. The changes are well-structured, but there are a few opportunities to improve code quality by reducing duplication and adhering to Python best practices, which I've detailed in my comments.
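
As an illustration of the collapsible reporting described above, a section can be appended to the Actions summary through the GITHUB_STEP_SUMMARY file; the table columns and field names below are placeholders rather than the PR's exact output.

import os

def write_job_details(job_name, test_failures):
    """Append a collapsible per-job test-failure table to the Actions summary."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return  # not running inside GitHub Actions
    lines = [
        f"<details><summary>{job_name}: {len(test_failures)} failing test file(s)</summary>",
        "",
        "| Test file | Total failures | Current streak |",
        "| --- | --- | --- |",
    ]
    for test_file, stats in sorted(test_failures.items()):
        lines.append(
            f"| {test_file} | {stats['total_failures']} | {stats['current_streak']} |"
        )
    lines += ["", "</details>", ""]
    with open(summary_path, "a") as f:
        f.write("\n".join(lines) + "\n")

The report in this PR additionally highlights tests with high streaks and links each entry to the corresponding job logs.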

Returns:
    Dict with passed/total counts and list of failed tests, or None if no summary found.
"""
import re

Severity: medium

For better code organization and to adhere to standard Python style (PEP 8), module imports should be at the top of the file. Placing import re inside the function can lead to repeated, unnecessary imports if the function is called multiple times. Please move it to the top of the file with the other imports.

for match in re.finditer(r"(\S+\.py)", failed_section):
    full_path = match.group(1)
    # Extract just the filename from the path
    test_file = full_path.split("/")[-1] if "/" in full_path else full_path

Severity: medium

Using os.path.basename() is a more idiomatic and robust way to extract a filename from a path in Python than manual string splitting, and it makes the intent of the code clearer. The os module is already imported in this file.

Suggested change
-    test_file = full_path.split("/")[-1] if "/" in full_path else full_path
+    test_file = os.path.basename(full_path)

Args:
    recent_runs: List of recent run info dicts with job_id, job_url, conclusion, etc.
    debug: Enable debug logging

Severity: medium

The docstring for analyze_test_failures_for_job includes a debug argument that is not defined in the function's signature. This can be misleading for future developers. Please remove it to keep the documentation accurate.
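
For instance, the docstring can simply mirror the actual signature (this sketch assumes the method takes only recent_runs, as implied by the quoted hunk):

def analyze_test_failures_for_job(self, recent_runs):
    """Analyze per-test failures across a job's recent runs.

    Args:
        recent_runs: List of recent run info dicts with job_id, job_url, conclusion, etc.
    """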

Comment on lines +238 to +245
test_failures[test_file]["recent_runs"].append(
    {
        "run_number": run_info.get("run_number"),
        "job_url": run_info.get("job_url"),
        "status": "❌",
        "failed": True,
    }
)

Severity: medium

The dictionary for a recent_runs entry is created in multiple places with the same structure (e.g., lines 253-260, 264-271, etc.). To improve maintainability and reduce code duplication, consider creating a private helper method to construct this dictionary.

For example:

def _create_run_entry(self, run_info, status, failed):
    return {
        "run_number": run_info.get("run_number"),
        "job_url": run_info.get("job_url"),
        "status": status,
        "failed": failed,
    }

You could then call it like self._create_run_entry(run_info, "❌", True).

Comment on lines +284 to +294
else:
    # Other conclusion (cancelled, skipped, etc.) - don't reset streaks, mark as unknown
    for test_file in test_failures.keys():
        test_failures[test_file]["recent_runs"].append(
            {
                "run_number": run_info.get("run_number"),
                "job_url": run_info.get("job_url"),
                "status": "⚪",
                "failed": None,
            }
        )

Severity: medium

This else block, which handles conclusions like 'cancelled' or 'skipped', is identical to the else block on lines 261-271 that handles a job failure where the test summary could not be parsed. This code duplication can be avoided by restructuring the conditional logic to group all cases that should result in an "unknown" status.
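
One way to factor out the shared behavior, as a sketch: both the unparsed-failure branch and the cancelled/skipped branch would call a small helper like the one below (the entry shape follows the quoted hunks; the helper name is illustrative).

def _record_unknown_runs(test_failures, run_info):
    """Append an 'unknown' entry for every tracked test without resetting streaks."""
    for details in test_failures.values():
        details["recent_runs"].append(
            {
                "run_number": run_info.get("run_number"),
                "job_url": run_info.get("job_url"),
                "status": "⚪",
                "failed": None,
            }
        )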
