Skip to content

Added Otel Tracing to soci-snapshotter#1645

Open
Swapnanil-Gupta wants to merge 9 commits into
awslabs:mainfrom
Swapnanil-Gupta:soci-instrumentation
Open

Added Otel Tracing to soci-snapshotter#1645
Swapnanil-Gupta wants to merge 9 commits into
awslabs:mainfrom
Swapnanil-Gupta:soci-instrumentation

Conversation

@Swapnanil-Gupta
Copy link
Copy Markdown
Contributor

Issue #, if available:

Description of changes:
This PR adds the Otel Stats handler to the SOCI gRPC server along with its own Exporter. This allows SOCI to instrument its gRPC calls and export the traces enabling us to monitor SOCI performance and request latencies. This is demonstrated by adding spans to FetchSociArtifacts and Resolve functions.

Testing performed:
By setting the env variable OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and starting a Jaeger server that listens to the same port, we can see traces like this when we pull an image with crictl -

For public.ecr.aws/soci-workshop-examples/tensorflow_gpu:latest

image

For public.ecr.aws/soci-workshop-examples/tensortflow:latest

image

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Swapnanil-Gupta Swapnanil-Gupta requested a review from a team as a code owner July 10, 2025 21:34
@github-actions github-actions Bot added dependencies Pull requests that update a dependency file go Pull requests that update Go code labels Jul 10, 2025
Comment thread tracing/tracing.go Outdated
)

func Init(ctx context.Context) (func(context.Context) error, error) {
if err := isDisabled(); err != nil {
Copy link
Copy Markdown
Contributor

@sondavidb sondavidb Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If tracing is disabled intentionally shouldn't we treat this as not an error? From the code bit in main.go it seems like any error will make this fail. IMO we should return a nil error if it is explicitly disabled, so maybe isDisabled should be split into isDisabled and something like checkSetup or whatever to separate a bad setup from explicitly disabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the isDisabled method to return a bool indicating if tracing is disabled or not, instead of always throwing an error and updated the related code in this commit 334a4de.

…onally

Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Comment thread cmd/soci-snapshotter-grpc/main.go Outdated
log.G(ctx).WithError(err).Fatalf("failed to configure snapshotter")
}

log.G(ctx).Info("setting up otel tracing")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info-level seems a little excessive here, since we aren't actually doing anything — maybe debug-level would be more appropriate. We already report when it is successfully setup or errors out so I think it's best to leave this as a debug log (or gone entirely, up to you)

Comment thread cmd/soci-snapshotter-grpc/main.go Outdated
log.G(ctx).Info("setting up otel tracing")
tracingDisabled, shutDownTracing, err := tracing.Init(ctx)
if err != nil {
log.G(ctx).WithError(err).Info("failed to initialize otel tracing")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we don't just return error instead of logging an error message? The current behavior means we will log an error but proceed since we don't return the err after getting one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, tracing should not interfere with the main soci functionality. If we return the error from the main function when it fails to setup the tracer, will that not prevent soci from starting its gRPC server?

Copy link
Copy Markdown
Contributor

@sondavidb sondavidb Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline on this, it doesn't seem like a big deal either way. A hard-failure is louder but maybe undesirable for folks who don't really care about this.

I think the best middle-ground solution we can do here is:

  • If all the otel-specific variables are empty, continue launching the snapshotter (optionally maybe a message here too?).
  • If the disable var is true, print the disabled message and continue launching the snapshotter.
  • If any otel-specific variables are set incorrectly, we should hard-fail.

Does that work? If not LMK, this is my preferred solution but I won't block on this change per se.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with 8cf422c

Comment thread cmd/soci-snapshotter-grpc/main.go Outdated
} else {
defer func() {
if err := shutDownTracing(ctx); err != nil {
log.G(ctx).WithError(err).Errorf("failed to shutdown tracing")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Errorf -> Error — we aren't using any formatting

Comment thread fs/artifact_fetcher.go Outdated
}

func FetchSociArtifacts(ctx context.Context, refspec reference.Spec, indexDesc ocispec.Descriptor, localStore store.Store, remoteStore resolverStorage) (*soci.Index, error) {
ctx, span := otel.Tracer("").Start(ctx, "soci-snapshotter.fs.artifact_fetcher.FetchSociArtifacts")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

soci-snapshotter.fs.artifact_fetcher.FetchSociArtifacts is a pretty long name. Does it have to be formatted this way? If we want to keep this format (either because we have to or because we think it's the clearest way to do this) we should probably create a dynamic way to generate this string easily. (Alternatively, we can just store this string into a var so that it's a little easier to refer to down the line.)

Not a blocking change, just food for thought 🤷‍♀️

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per containerd:

Span Names
    - Dot-separated notation.
    - Span Names may include relative path to the package.
    - Span Names should include a name that represents the specific component or service performing the operation.
    - For example: "pkg.cri.sbserver.CreateContainer"
        - "pkg.cri.sbserver" - relative path to the package
        - "CreateContainer" - describes the operation that is traced

I moved the long names to a separate variable for now. Let me know if this okay or if I should use the reflect package to try to generate this dynamically.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this for now. If we get too many different variable names we can consider generating dynamically.

Comment thread tracing/tracing.go Outdated
v = os.Getenv(otlpProtocolEnv)
}

const timeout = 5 * time.Second
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style comment: do we want to just do timeout := 5 * time.Second? I couldn't find anything in their style guide on constants and this works perfectly fine but it still sticks out to me some reason🤷

Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Comment thread tracing/tracing.go Outdated
defaultServiceName = "soci-snapshotter"
)

func Init(ctx context.Context) (bool, func(context.Context) error, error) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would it be more clear to require the caller to decide if tracing is disabled and not call this function?

e.g.

if !tracing.Disabled() {
    cleanup, err := tracing.Init(ctx)
    // ...
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with 8cf422c

…or when otel setup fails after setting the env vars correctly

Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Signed-off-by: Swapnanil-Gupta <swpnlg@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants