Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses intermittent failures in 400ZR and 400ZR+ optical channel tests by improving the robustness of telemetry collection and validation. The changes give optical interfaces more time to converge and telemetry data more time to stabilize, add retry mechanisms for data collection, and refine the statistical comparison logic to account for real-world variation and floating-point precision issues. These adjustments aim to eliminate the statistical validation failures and interface timeout errors and to produce more reliable test results.
Code Review
This pull request aims to fix intermittent failures in optical channel tests by making them more robust. Key changes include increasing timeouts for interface convergence, adding stabilization delays using time.Sleep before telemetry validation, and implementing a retry mechanism for fetching telemetry data. The validation logic for statistical telemetry (min, max, avg) has been significantly improved: it now checks the statistics for internal consistency with a tolerance instead of comparing them against instantaneous values, which makes it resilient to race conditions. These changes are applied consistently across both zr_tunable_parameters_test.go and zrp_tunable_parameters_test.go. My review includes one point of feedback regarding a discrepancy between the configured statistical tolerance and the value mentioned in the pull request description; documenting such 'magic numbers' with an explanatory comment is a best practice.
```go
	samplingInterval    = 10 * time.Second
	frequencyTolerance  = 1800
	interfaceTimeout    = 3 * time.Minute
	telemetryWaitTime   = 60 * time.Second // 6 sampling windows
	maxTelemetryRetries = 3
	statisticsTolerance = 3.0 // Relaxed tolerance for statistical comparisons
)
```
The PR description mentions a statistical tolerance of ±0.1, but the statisticsTolerance constant is set to 3.0. This seems like a large discrepancy. Could you confirm if 3.0 is the intended value? If so, it might be helpful to add a comment explaining the units for this tolerance (e.g., dBm for power, MHz for frequency offset) to provide more context.
PR:4709 - updated to have gnpsi field in main
The `Test400ZRTunableFrequency` and related optical channel tests are experiencing intermittent failures with two main failure patterns:
Statistical Validation Failures
```
Optical-Channel: carrier-frequency-offset min: -1 greater than carrier-frequency-offset avg: -13
```
This error occurs when telemetry statistical values (min/max/avg) are inconsistent, due to:
- Race conditions during telemetry collection
- Stale/cached telemetry data being used for validation
- The device updating statistical values non-atomically
Interface Timeout Failures
```
context deadline exceeded
```
This occurs when optical interfaces take longer than the configured timeout to come up after configuration changes.
Root Causes:
- Premature collection: the test collects telemetry immediately after configuration, but optical modules need time to stabilize their statistical measurements
- Insufficient stabilization time: the 90-second timeout and 80-second stabilization delays are too short for optical channel convergence
- Floating-point precision issues in statistical comparisons
This PR implements a targeted fix addressing the specific failure patterns:
Enhanced Telemetry Stabilization
- Increased timeout from 90 seconds to 3 minutes for optical interface convergence
- Increased stabilization delays after configuration changes (from 80s to 100s before validation)
- Extended telemetry wait time to allow statistical measurements to stabilize
Sample Flushing for Fresh Data
- Flushes old/stale samples from telemetry streams before validation
- Validates data sanity before using telemetry for statistical comparisons
- Retry logic for telemetry collection, with up to 3 attempts
Robust Statistical Validation
- Proper floating-point handling with rounding to 1 decimal place
- Statistical tolerance (±0.1) for min/max/avg comparisons