Description
Describe the bug
A clear and concise description of what the bug is.
The success rate Grafana graph here shows success rate consistently >= 100% even though we know that there have been transcode failures.
From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess()
in monitor/census.go
.
I see this method being called in at least two places:
- In SegmentFullyTranscoded() which is called here after B finishes downloading all results from O
- In segmentTranscodeFailed() which is called in
SegmentTranscodeFailed()
which is called whenever a transcode error is encountered
For 2, there is a concept of a "permanent" vs. "non-permanent" transcode error indicated via the permanent
bool passed to SegmentTranscodeFailed()
. We can see non-permanent errors being recorded here. The only place where there is a permanent error recorded is here for NoOrchestrator transcode errors. This seems problematic because only permanent errors will trigger a call to census.sendSuccess()
here when recording a transcode error. As a result, I don't think we are properly updating the success rate metric in at least two places:
- When we hit the max # of transcode attempts which prompts B to give up on a segment here
- When we hit a non-retryable transcode error which prompts B to give up on a segment here
- When the caller (i.e. HTTP push client) gives up on the transcode resulting in a context cancellation here
To Reproduce
Steps to reproduce the behavior:
- Go to '...'
- Click on '....'
- Scroll down to '....'
- See error
We could trigger the aforementioned errors that are not being factored in right now to see if the success rate is not affected. A solution should demonstrate that these errors cause the success rate to drop.
Expected behavior
A clear and concise description of what you expected to happen.
I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.
Generally, I see at least these categories of transcode errors that should cause success rate to drop:
- If B hits the max # of transcode attempts - B should give up on the segment b/c it tried enough times already
- If B hits a non-retryable error - B should give up on the segment b/c it knows that this segment likely just cannot be transcoded
- If B knows that the caller (i.e. HTTP push client) is no longer waiting for a result - B should give up on the segment b/c it knows no one cares about the results anymore because the transcode was too slow
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]
Smartphone (please complete the following information):
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]
Additional context
Add any other context about the problem here.