P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

Open
@yondonfu

Description

Describe the bug
The success rate Grafana graph here shows success rate consistently >= 100% even though we know that there have been transcode failures.

From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess() in monitor/census.go.

I see this method being called in at least two places:

  1. In SegmentFullyTranscoded() which is called here after B finishes downloading all results from O
  2. In segmentTranscodeFailed() which is called in SegmentTranscodeFailed() which is called whenever a transcode error is encountered

For 2, there is a concept of a "permanent" vs. "non-permanent" transcode error, indicated via the permanent bool passed to SegmentTranscodeFailed(). We can see non-permanent errors being recorded here. The only place where a permanent error is recorded is here, for NoOrchestrator transcode errors. This seems problematic because, when recording a transcode error, only permanent errors trigger a call to census.sendSuccess() here. As a result, I don't think we are properly updating the success rate metric in at least three places:

  • When we hit the max # of transcode attempts which prompts B to give up on a segment here
  • When we hit a non-retryable transcode error which prompts B to give up on a segment here
  • When the caller (i.e. HTTP push client) gives up on the transcode resulting in a context cancellation here

To Reproduce
We could trigger the aforementioned errors, which are not being factored in right now, and confirm that the success rate is unaffected. A solution should demonstrate that these errors cause the success rate to drop.

Expected behavior
I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.

Generally, I see at least these categories of transcode errors that should cause success rate to drop:

  • If B hits the max # of transcode attempts - B should give up on the segment b/c it tried enough times already
  • If B hits a non-retryable error - B should give up on the segment b/c it knows that this segment likely just cannot be transcoded
  • If B knows that the caller (i.e. HTTP push client) is no longer waiting for a result - B should give up on the segment b/c it knows no one cares about the results anymore because the transcode was too slow
