Added Otel Tracing to soci-snapshotter #1645
Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
)

func Init(ctx context.Context) (func(context.Context) error, error) {
	if err := isDisabled(); err != nil {

If tracing is intentionally disabled, shouldn't we treat that as a non-error? From the code in main.go it seems like any error will make this fail. IMO we should return a nil error if tracing is explicitly disabled, so maybe isDisabled should be split into isDisabled and something like checkSetup to separate a bad setup from an explicit disable.
Updated the isDisabled method to return a bool indicating whether tracing is disabled, instead of always returning an error, and updated the related code in commit 334a4de.
…onally Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
		log.G(ctx).WithError(err).Fatalf("failed to configure snapshotter")
	}

	log.G(ctx).Info("setting up otel tracing")

Info-level seems a little excessive here, since we aren't actually doing anything yet; maybe debug-level would be more appropriate. We already report when tracing is successfully set up or errors out, so I think it's best to make this a debug log (or remove it entirely, up to you).
	log.G(ctx).Info("setting up otel tracing")
	tracingDisabled, shutDownTracing, err := tracing.Init(ctx)
	if err != nil {
		log.G(ctx).WithError(err).Info("failed to initialize otel tracing")

Is there any reason we don't just return the error instead of logging an error message? With the current behavior we log the error but proceed anyway, since we never return err after getting one.
From my understanding, tracing should not interfere with the main soci functionality. If we return the error from the main function when it fails to set up the tracer, will that not prevent soci from starting its gRPC server?
Synced offline on this, it doesn't seem like a big deal either way. A hard-failure is louder but maybe undesirable for folks who don't really care about this.
I think the best middle-ground solution we can do here is:
- If all the otel-specific variables are empty, continue launching the snapshotter (optionally maybe a message here too?).
- If the disable var is true, print the disabled message and continue launching the snapshotter.
- If any otel-specific variables are set incorrectly, we should hard-fail.
Does that work? If not LMK, this is my preferred solution but I won't block on this change per se.
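
The three cases above could be sketched roughly as follows. This is only an illustration of the proposal, not the code in this PR: `tracingEnabled` is a hypothetical helper, and the choice of `OTEL_SDK_DISABLED`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `OTEL_EXPORTER_OTLP_PROTOCOL` as "the otel-specific variables" is an assumption (they are standard OpenTelemetry names, but the PR may check a different set). The validation rules are deliberately simple.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
)

// Standard OpenTelemetry environment variable names. Which of these the
// snapshotter actually inspects is an assumption made for this sketch.
const (
	sdkDisabledEnv  = "OTEL_SDK_DISABLED"
	otlpEndpointEnv = "OTEL_EXPORTER_OTLP_ENDPOINT"
	otlpProtocolEnv = "OTEL_EXPORTER_OTLP_PROTOCOL"
)

// tracingEnabled (hypothetical) reports whether tracing should be set up.
// A non-nil error means the configuration is present but invalid, which the
// caller would treat as a hard failure.
func tracingEnabled(getenv func(string) string) (bool, error) {
	disabled := getenv(sdkDisabledEnv)
	endpoint := getenv(otlpEndpointEnv)
	protocol := getenv(otlpProtocolEnv)

	// Case 1: nothing is configured, so start the snapshotter without tracing.
	if disabled == "" && endpoint == "" && protocol == "" {
		return false, nil
	}

	// Case 2: tracing is explicitly disabled, which is not an error.
	// ParseBool is more lenient than the OpenTelemetry spec (it accepts "1",
	// "t", and so on); that is a simplification.
	if disabled != "" {
		off, err := strconv.ParseBool(disabled)
		if err != nil {
			return false, fmt.Errorf("invalid %s value %q: %w", sdkDisabledEnv, disabled, err)
		}
		if off {
			return false, nil
		}
	}

	// Case 3: something is set, so it has to be well-formed.
	if endpoint != "" {
		u, err := url.Parse(endpoint)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return false, fmt.Errorf("invalid %s value %q", otlpEndpointEnv, endpoint)
		}
	}
	switch protocol {
	case "", "grpc", "http/protobuf":
	default:
		return false, fmt.Errorf("unsupported %s value %q", otlpProtocolEnv, protocol)
	}
	return true, nil
}

func main() {
	enabled, err := tracingEnabled(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid otel configuration:", err)
		os.Exit(1)
	}
	fmt.Println("tracing enabled:", enabled)
}
```

Passing the environment lookup in as a function keeps the decision testable without touching the real process environment.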
	} else {
		defer func() {
			if err := shutDownTracing(ctx); err != nil {
				log.G(ctx).WithError(err).Errorf("failed to shutdown tracing")

nit: Errorf -> Error — we aren't using any formatting
}

func FetchSociArtifacts(ctx context.Context, refspec reference.Spec, indexDesc ocispec.Descriptor, localStore store.Store, remoteStore resolverStorage) (*soci.Index, error) {
	ctx, span := otel.Tracer("").Start(ctx, "soci-snapshotter.fs.artifact_fetcher.FetchSociArtifacts")

soci-snapshotter.fs.artifact_fetcher.FetchSociArtifacts is a pretty long name. Does it have to be formatted this way? If we want to keep this format (either because we have to or because we think it's the clearest way to do this) we should probably create a dynamic way to generate this string easily. (Alternatively, we can just store this string into a var so that it's a little easier to refer to down the line.)
Not a blocking change, just food for thought 🤷‍♀️
As per containerd:
Span Names
- Dot-separated notation.
- Span Names may include relative path to the package.
- Span Names should include a name that represents the specific component or service performing the operation.
- For example: "pkg.cri.sbserver.CreateContainer"
  - "pkg.cri.sbserver" - relative path to the package
  - "CreateContainer" - describes the operation that is traced
I moved the long names to a separate variable for now. Let me know if this is okay or if I should use the reflect package to try to generate this dynamically.
Let's do this for now. If we get too many different variable names we can consider generating dynamically.
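
If the number of span names does grow, one lightweight option short of reflection is a small helper that assembles the dot-separated name from its parts. The sketch below is only an illustration: `spanName`, `servicePrefix`, and `fetchSociArtifactsSpan` are hypothetical names, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// servicePrefix is the leading component of every span name in this sketch.
const servicePrefix = "soci-snapshotter"

// spanName (hypothetical) joins the service prefix, the relative package
// path, and the traced operation using dot-separated notation.
func spanName(pkgPath, operation string) string {
	return strings.Join([]string{servicePrefix, pkgPath, operation}, ".")
}

// Built once at package initialization so call sites refer to a short name.
var fetchSociArtifactsSpan = spanName("fs.artifact_fetcher", "FetchSociArtifacts")

func main() {
	fmt.Println(fetchSociArtifactsSpan)
	// prints "soci-snapshotter.fs.artifact_fetcher.FetchSociArtifacts"
}
```

A call site would then pass `fetchSociArtifactsSpan` as the span name instead of repeating the string literal.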
		v = os.Getenv(otlpProtocolEnv)
	}

	const timeout = 5 * time.Second

Style comment: do we want to just do timeout := 5 * time.Second? I couldn't find anything in their style guide on constants, and this works perfectly fine, but it still sticks out to me for some reason 🤷
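
For comparison, a minimal sketch of the two spellings side by side (the function and variable names here are made up for illustration). Both yield a `time.Duration` of five seconds; the only practical difference is that the constant cannot be reassigned later.

```go
package main

import (
	"fmt"
	"time"
)

// timeouts returns the same duration declared both ways.
func timeouts() (time.Duration, time.Duration) {
	// Function-scoped constant: typed as time.Duration because time.Second
	// is a typed constant, and it cannot be reassigned.
	const constTimeout = 5 * time.Second

	// Short variable declaration: same type and value, but mutable.
	varTimeout := 5 * time.Second

	return constTimeout, varTimeout
}

func main() {
	c, v := timeouts()
	fmt.Println(c, v, c == v)
}
```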
	defaultServiceName = "soci-snapshotter"
)

func Init(ctx context.Context) (bool, func(context.Context) error, error) {

nit: would it be more clear to require the caller to decide if tracing is disabled and not call this function?
e.g.
if !tracing.Disabled() {
	cleanup, err := tracing.Init(ctx)
	// ...
}
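
Expanded a little, the caller-decides shape might look like the sketch below. Everything here is a stand-in: `disabled` and `initTracing` are hypothetical placeholders for `tracing.Disabled` and `tracing.Init`, reading `OTEL_SDK_DISABLED` is an assumption about how the disable switch is exposed, and the placeholder init returns a no-op cleanup so the example runs on its own.

```go
package main

import (
	"context"
	"log"
	"os"
	"strconv"
)

// disabled (hypothetical) reports whether tracing was explicitly switched
// off. Unset or unparsable values count as "not disabled" in this sketch.
func disabled(getenv func(string) string) bool {
	off, err := strconv.ParseBool(getenv("OTEL_SDK_DISABLED"))
	return err == nil && off
}

// initTracing is a placeholder for tracing.Init: it sets nothing up and
// returns a cleanup function that does nothing.
func initTracing(ctx context.Context) (func(context.Context) error, error) {
	return func(context.Context) error { return nil }, nil
}

func run(ctx context.Context) error {
	if disabled(os.Getenv) {
		log.Print("otel tracing is disabled")
	} else {
		cleanup, err := initTracing(ctx)
		if err != nil {
			return err
		}
		// Flush and shut down tracing when run returns.
		defer func() {
			if err := cleanup(ctx); err != nil {
				log.Print("failed to shut down tracing: ", err)
			}
		}()
	}

	// ... start the gRPC server here ...
	return nil
}

func main() {
	if err := run(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```

With this split, `initTracing` only returns a cleanup function and an error, so the extra bool in the current `Init` signature is no longer needed.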
…or when otel setup fails after setting the env vars correctly Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Issue #, if available:
Description of changes:
This PR adds the Otel Stats handler to the SOCI gRPC server along with its own Exporter. This allows SOCI to instrument its gRPC calls and export the traces enabling us to monitor SOCI performance and request latencies. This is demonstrated by adding spans to FetchSociArtifacts and Resolve functions.

Testing performed:
By setting the env variable OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and starting a Jaeger server that listens to the same port, we can see traces like this when we pull an image with crictl -

For public.ecr.aws/soci-workshop-examples/tensorflow_gpu:latest

For public.ecr.aws/soci-workshop-examples/tensortflow:latest

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.