feat: integrate OTEL/Jaeger #3815
…u1110/gno into feat/integrate-otel-jaeger
c6d3691 to f8437e6
Hi @gfanton, I've updated the code with your suggested changes. Could you take another look, please? :)
Can you address the failing CI?
Hi @gfanton, sorry for the late reply. It seems some tests are broken on master, so I'm waiting for those to be fixed first.
Maybe the consensus module is not the best place, for now, to showcase tracing.
What do you think about hooking up the keeper query handlers, or RPC handlers?
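To make the suggestion concrete, here is a minimal sketch of instrumenting at the RPC handler level instead of inside consensus. It uses only the standard library: `spanRecorder`, `withSpan`, `handleOnce`, and the `/status` route are hypothetical names for illustration, and the recorder is a stand-in for a real OpenTelemetry tracer, whose `Start(ctx, name)` and `span.End()` calls would take its place.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// finishedSpan is what the hypothetical recorder keeps per request.
type finishedSpan struct {
	Name     string
	Duration time.Duration
}

// spanRecorder is a hypothetical stand-in for a tracer: it only
// remembers the name and duration of every finished span.
type spanRecorder struct {
	mu    sync.Mutex
	spans []finishedSpan
}

func (r *spanRecorder) record(name string, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.spans = append(r.spans, finishedSpan{Name: name, Duration: d})
}

// withSpan wraps a handler so every request produces one span,
// without touching the handler's own logic.
func withSpan(rec *spanRecorder, name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		defer func() { rec.record(name, time.Since(start)) }()
		next.ServeHTTP(w, req)
	})
}

// handleOnce sends a single in-memory request through a wrapped
// handler and returns the recorded span names and the status code.
func handleOnce() ([]string, int) {
	rec := &spanRecorder{}
	status := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	h := withSpan(rec, "rpc.status", status)

	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, httptest.NewRequest(http.MethodGet, "/status", nil))

	names := make([]string, 0, len(rec.spans))
	for _, s := range rec.spans {
		names = append(names, s.Name)
	}
	return names, rr.Code
}

func main() {
	names, code := handleOnce()
	fmt.Println(names, code)
}
```

The wrapper keeps tracing out of the handler bodies, so a showcase at this layer can be added or removed without touching consensus code.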
MeterName string `json:"meter_name" toml:"meter_name"`
ServiceName string `json:"service_name" toml:"service_name" comment:"in Prometheus this is transformed into the label 'exported_job'"`
ServiceInstanceID string `json:"service_instance_id" toml:"service_instance_id" comment:"the ID helps to distinguish instances of the same service that exist at the same time (e.g. instances of a horizontally scaled service), in Prometheus this is transformed into the label 'exported_instance"`
ExporterEndpoint string `json:"exporter_endpoint" toml:"exporter_endpoint" comment:"the endpoint to export metrics to, like a local OpenTelemetry collector"`

TracesEnabled bool `json:"traces_enabled" toml:"traces_enabled"`
GracefulShutdownTelemetry bool `json:"graceful_shutdown_telemetry" toml:"graceful_shutdown_telemetry"`
Why would we ever not want to gracefully shut down telemetry?
Originally there was no graceful shutdown, so I added this flag for the cases where we need it. I think we will need graceful shutdown when running in production, but not when running locally or in tests. WDYT?
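For context on this exchange, the sketch below shows why a bounded, unconditional shutdown is usually cheap enough to drop the flag. It is an assumption-laden illustration, not the PR's code: `bufferedExporter`, `shutdownWithTimeout`, and `demoShutdown` are hypothetical, and the exporter stands in for the OpenTelemetry SDK providers, whose `Shutdown(ctx)` methods play the same role.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// bufferedExporter is a hypothetical exporter that batches spans in
// memory; anything still pending at exit is lost unless flushed.
type bufferedExporter struct {
	pending  []string
	exported []string
}

func (e *bufferedExporter) Add(span string) { e.pending = append(e.pending, span) }

// Shutdown flushes pending spans, giving up if ctx is already done.
func (e *bufferedExporter) Shutdown(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("telemetry shutdown aborted: %w", err)
	}
	e.exported = append(e.exported, e.pending...)
	e.pending = nil
	return nil
}

// shutdownWithTimeout bounds the flush so a slow collector cannot
// hang process exit.
func shutdownWithTimeout(e *bufferedExporter, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return e.Shutdown(ctx)
}

// demoShutdown runs the normal path, then an already-expired
// deadline to show the failure path.
func demoShutdown() (exported, pending int, timedOut bool) {
	exp := &bufferedExporter{}
	exp.Add("addProposalBlockPart")
	exp.Add("tryAddVote")
	if err := shutdownWithTimeout(exp, 5*time.Second); err != nil {
		return 0, len(exp.pending), false
	}
	err := shutdownWithTimeout(exp, -time.Second)
	return len(exp.exported), len(exp.pending), errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demoShutdown())
}
```

With a timeout in place, the same code path works for production, local runs, and tests, so the caller does not have to choose.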
tm2/pkg/telemetry/tracer.go
Outdated
"go.opentelemetry.io/otel/trace/noop"
)

type TracerFactory func() trace.Tracer
You can drop this indirection
tm2/pkg/telemetry/init.go
Outdated
// Check if the metrics are enabled at all
if !c.MetricsEnabled {
return nil
func Init(c config.Config, logger *slog.Logger) (*sdkTrace.TracerProvider, *sdkMetric.MeterProvider, error) {
Passing a logger just to log that something happened in the func is overkill
This function should also be split up into 2 different ones, one for the tracing, and one for the metrics -- because in the current state you still require the caller to check whether something is nil
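One possible shape for that split, as a sketch only: each subsystem gets its own init function that returns a shutdown callback, and a disabled subsystem returns a no-op, so the caller never checks for nil. `InitMetrics`, `InitTraces`, `ShutdownFunc`, and `run` are hypothetical names, and the `Config` struct here carries just the two toggles, not the real config.

```go
package main

import (
	"context"
	"fmt"
)

// Config mirrors only the two toggles relevant here; the real
// config has more fields.
type Config struct {
	MetricsEnabled bool
	TracesEnabled  bool
}

// ShutdownFunc releases whatever an init function set up.
type ShutdownFunc func(context.Context) error

func noopShutdown(context.Context) error { return nil }

// InitMetrics (hypothetical) sets up metrics only.
func InitMetrics(c Config) (ShutdownFunc, error) {
	if !c.MetricsEnabled {
		return noopShutdown, nil
	}
	// Placeholder: real code would build the exporter and meter provider.
	return func(context.Context) error { return nil }, nil
}

// InitTraces (hypothetical) sets up tracing only.
func InitTraces(c Config) (ShutdownFunc, error) {
	if !c.TracesEnabled {
		return noopShutdown, nil
	}
	// Placeholder: real code would build the exporter and tracer provider.
	return func(context.Context) error { return nil }, nil
}

// run shows the caller's side: no nil checks in either configuration.
func run(c Config) error {
	stopMetrics, err := InitMetrics(c)
	if err != nil {
		return fmt.Errorf("init metrics: %w", err)
	}
	stopTraces, err := InitTraces(c)
	if err != nil {
		return fmt.Errorf("init traces: %w", err)
	}

	ctx := context.Background()
	if err := stopTraces(ctx); err != nil {
		return err
	}
	return stopMetrics(ctx)
}

func main() {
	fmt.Println(run(Config{}), run(Config{MetricsEnabled: true, TracesEnabled: true}))
}
```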
tm2/pkg/telemetry/traces/traces.go
Outdated
return nil, fmt.Errorf("unable to create http traces exporter, %w", err)
}
default:
exp, err = otlptracegrpc.New(
After talking to Jae, we have a convention that gRPC fallbacks like this should be dropped entirely -- only support http / https
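A minimal sketch of what that convention could look like at the endpoint-parsing step, assuming the endpoint is configured as a full URL: anything that is not http or https is rejected with an error instead of falling through to a gRPC exporter. `validateExporterEndpoint` is a hypothetical helper, and the endpoints in `main` are example values.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateExporterEndpoint accepts only http and https endpoints and
// rejects everything else instead of falling back to gRPC.
func validateExporterEndpoint(endpoint string) (*url.URL, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return nil, fmt.Errorf("error parsing exporter endpoint: %s, %w", endpoint, err)
	}

	switch u.Scheme {
	case "http", "https":
		if u.Host == "" {
			return nil, fmt.Errorf("exporter endpoint %q has no host", endpoint)
		}
		return u, nil
	default:
		return nil, fmt.Errorf("unsupported exporter endpoint scheme %q: only http and https are supported", u.Scheme)
	}
}

func main() {
	endpoints := []string{
		"http://localhost:4318",
		"https://collector.example.com",
		"grpc://localhost:4317",
	}
	for _, ep := range endpoints {
		_, err := validateExporterEndpoint(ep)
		fmt.Println(ep, err == nil)
	}
}
```

Failing loudly on an unknown scheme also surfaces misconfiguration early, rather than silently picking a different transport.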
tm2/pkg/telemetry/traces/traces.go
Outdated
return nil, fmt.Errorf("error parsing tracer exporter endpoint: %s, %w", cfg.ExporterEndpoint, err)
}

// Use oltp metric exporter with http/https or grpc
Typo here?
tm2/pkg/bft/consensus/state.go
Outdated
@@ -1445,6 +1513,12 @@ func (cs *ConsensusState) defaultSetProposal(proposal *types.Proposal) error {
// NOTE: block is not necessarily valid.
// Asynchronously triggers either enterPrevote (before we timeout of propose) or tryFinalizeCommit, once we have the full block.
func (cs *ConsensusState) addProposalBlockPart(msg *BlockPartMessage, peerID p2pTypes.ID) (added bool, err error) {
	span := cs.addSpan("addProposalBlockPart")
We can drop this
tm2/pkg/bft/consensus/state.go
Outdated
@@ -1515,6 +1589,13 @@ func (cs *ConsensusState) addProposalBlockPart(msg *BlockPartMessage, peerID p2p

// Attempt to add the vote. if its a duplicate signature, dupeout the validator
func (cs *ConsensusState) tryAddVote(vote *types.Vote, peerID p2pTypes.ID) (bool, error) {
	span := cs.addSpan("tryAddVote")
We can drop this
tm2/pkg/bft/consensus/state.go
Outdated
@@ -1741,6 +1832,9 @@ func (cs *ConsensusState) voteTime() time.Time {

// sign the vote and publish on internalMsgQueue
func (cs *ConsensusState) signAddVote(type_ types.SignedMsgType, hash []byte, header types.PartSetHeader) *types.Vote {
	span := cs.addSpan("signAddVote")
We can drop this
tm2/pkg/bft/consensus/state.go
Outdated
func (cs *ConsensusState) addTrace(spanName string, opts ...trace.SpanStartOption) trace.Span {
	var span trace.Span

	if telemetry.TracesEnabled() {
Any updates on this?
tm2/pkg/bft/consensus/state.go
Outdated
attribute.Int("csStep", int(cs.Step)),
)
}
cs.traceCtx = ctx
@gfanton is right, there is no need to introduce traceCtx. The fact that the context is not propagated to most of the consensus state operations is a bug, not a feature.
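To illustrate the difference, here is a rough sketch of explicit context propagation: each operation takes a `ctx` argument and derives its span from it, so nothing is stored on the state struct. `startSpan`, `state`, `addVote`, and `traceDemo` are hypothetical stand-ins (only `tryAddVote` echoes a name from the diff), and `startSpan` imitates the shape of a tracer's `Start(ctx, name)` using a plain context value.

```go
package main

import (
	"context"
	"fmt"
)

// spanKey is the context key under which the current span name lives.
type spanKey struct{}

// span is a hypothetical, minimal stand-in for a tracing span.
type span struct {
	Name   string
	Parent string // empty for a root span
}

// state holds recorded spans but no trace context of its own.
type state struct{ spans []span }

// startSpan derives a child span from whatever span ctx carries and
// returns a new context carrying the child.
func (s *state) startSpan(ctx context.Context, name string) context.Context {
	parent, _ := ctx.Value(spanKey{}).(string)
	s.spans = append(s.spans, span{Name: name, Parent: parent})
	return context.WithValue(ctx, spanKey{}, name)
}

// tryAddVote receives ctx from its caller and passes the derived
// context down, so the child span is linked to it.
func (s *state) tryAddVote(ctx context.Context) {
	ctx = s.startSpan(ctx, "tryAddVote")
	s.addVote(ctx)
}

func (s *state) addVote(ctx context.Context) {
	_ = s.startSpan(ctx, "addVote")
}

// traceDemo runs one call chain from a fresh root context.
func traceDemo() []span {
	s := &state{}
	s.tryAddVote(context.Background())
	return s.spans
}

func main() {
	for _, sp := range traceDemo() {
		fmt.Printf("%s (parent: %q)\n", sp.Name, sp.Parent)
	}
}
```

Because the parent link travels with the call, concurrent operations cannot overwrite each other's context the way a shared struct field could.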
@zivkovicmilos I addressed your feedback, please re-check this PR :) Thanks
Implementation for #2434
We can filter by BlockHeight

We can trace the calls
