feat: integrate OTEL/Jaeger #3815
Conversation
In general I think it is fine to rely on Tempo for the reference OTEL configuration.
I requested a few changes to the Docker Compose config where I disagree with some choices, but they are mostly nits.
My main concern is about splitting the telemetry configuration so that only metrics or only traces can be enabled.
We should discuss internally whether this is fine-grained configuration we actually want, or whether it overloads the node configuration already in place.
Can you elaborate on that? (just asking and open to discussion)
misc/telemetry/docker-compose.yml (Outdated)

  tempo:
    image: grafana/tempo:latest
    command: [ "-config.file=/etc/tempo.yaml" ]
Please follow the other commands' syntax:
command:
- "-config.file=/etc/tempo.yaml"
fixed
misc/telemetry/docker-compose.yml (Outdated)

    depends_on:
      - tempo
      - prometheus
    profiles:
This is a reference Compose file. We don't need to add profiles
OK, I'll remove it.
Just FYI: I added it because when testing locally we need to run all the other components except the gno chain, which makes testing the integration easier. But you are right, we should handle that locally only.
fixed
misc/telemetry/docker-compose.yml (Outdated)

@@ -22,6 +42,9 @@ services:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - gnoland-net
    profiles:
Same as above
fixed
misc/telemetry/docker-compose.yml (Outdated)

@@ -35,14 +58,18 @@ services:
      - "3000:3000"
    networks:
      - gnoland-net

    profiles:
Same as above
fixed
misc/telemetry/docker-compose.yml
Outdated
gnoland-val: | ||
image: ghcr.io/gnolang/gno/gnoland:master | ||
networks: | ||
- gnoland-net | ||
volumes: | ||
# Shared Volume | ||
- gnoland-shared:/gnoroot/shared-data | ||
- ../../genesis.json:/gnoroot/genesis.json |
This does not make sense. Why is it needed? In the commands below, the genesis is generated from scratch.
Hmm, I admit I don't remember why I added this line.
Removed it.
misc/telemetry/docker-compose.yml (Outdated)

@@ -98,6 +129,8 @@ services:
      gnoland-val:
        condition: service_healthy
        restart: true
    profiles:
Same
fixed
misc/telemetry/docker-compose.yml (Outdated)

@@ -108,12 +141,16 @@ services:
    restart: unless-stopped
    networks:
      - gnoland-net
    profiles:
Remove
fixed
misc/telemetry/docker-compose.yml (Outdated)

      - tempo:/var/tempo
      - ./tempo/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"
Why is this port public?
As far as I understand, Tempo should only be accessed through Grafana. Do we need external access?
I've checked; we do not need to expose this port. Removed it.
@@ -10,6 +10,10 @@ processors:

exporters:
  prometheus:
    endpoint: collector:8090
  otlp:
This is confusing.
Why use the same name, otlp, for both the receiver and the exporter?
Can we use a different name? Or, if they must be the same for a reason, what is that reason?
Yeah, in fact otlp is not mandatory for the exporter in this case.
I will rename it, but I hesitate between two options: tempo vs traces. traces is more generic, so we could switch to Jaeger later if needed, but it has the same name as the service, which could create confusion later by implying those names must match. So I decided to use tempo for now.
Wdyt?
@sw360cab FYI, after checking, otlp here is actually part of the naming convention: the exporter name must be (or start with) a known exporter type. So the best supported way is otlp/tempo.
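For reference, a minimal sketch of how the type/name convention looks in an OTEL Collector config; the endpoints and pipeline wiring below are illustrative assumptions, not copied from this PR:

exporters:
  prometheus:
    endpoint: collector:8090
  # The part before the slash ("otlp") must be a known exporter type;
  # the part after the slash ("tempo") is a free-form instance name.
  otlp/tempo:
    endpoint: tempo:4317   # assumed OTLP gRPC endpoint of the Tempo service
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]        # the otlp receiver defined elsewhere in the config
      exporters: [otlp/tempo]  # referenced by its full type/name ID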
Good job! I've left some suggestions. I will let @zivkovicmilos have the final word on this, as he is the most familiar with the consensus package.
tm2/pkg/bft/consensus/state.go (Outdated)

func (cs *ConsensusState) addTrace(spanName string, opts ...trace.SpanStartOption) trace.Span {
    var span trace.Span

    if telemetry.TracesEnabled() {
I'm really not a big fan of using TracesEnabled explicitly everywhere we need it; it seems a bit redundant and error-prone.
What do you think about creating a helper inside our trace package, something like this:

import (
    "sync"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/noop"
)

type TracerFactory func() trace.Tracer

var noopTracerProvider = noop.NewTracerProvider()

func Tracer(name string, options ...trace.TracerOption) TracerFactory {
    var once sync.Once
    var t trace.Tracer = noopTracerProvider.Tracer(name, options...) // Initialize a noop tracer as the default

    return func() trace.Tracer {
        if TracingEnabled() {
            once.Do(func() {
                provider := otel.GetTracerProvider()
                t = provider.Tracer(name, options...)
            })
        }
        return t
    }
}

So we can use it this way in any package:

var tracer = telemetry.Tracer("consensus")

ctx, span := tracer().Start(...)
defer span.End()
I've followed the current implementation of Metrics:

func (cs *ConsensusState) logTelemetry(block *types.Block) {
    if !telemetry.MetricsEnabled() {
        return
    }
    // ...
}

But I think your suggestion is really cool.
I hesitate a little on a couple of points:
- Should we apply the same pattern (the one you proposed for traces) to Metrics, or is it fine to have two different patterns?
- With ctx, span := tracer().Start(...) we have to use lazy initialization here, but the way we call tracer().Start is a little cumbersome. Should we wrap Tracer.Start so we can call tracer.Start() for easier usage?
wdyt?
I've followed the current implementation of Metrics:

Yeah, I know, but as you might guess, I don't like that either :p.

Should we apply the same pattern (the one you proposed for traces) to Metrics, or is it fine to have two different patterns?

Yes, ideally we should use the same pattern. However, I believe Metrics need their own revamp because there are additional changes required. For instance, every package using Metrics should declare the related instruments within its own package. Although it's not ideal to have two different patterns at the moment, this is a separate issue we should address later. For now, let's focus on tracing.

Should we wrap Tracer.Start so we can call tracer.Start() for easier usage?

I'm fine with that 👍
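A minimal sketch of what that wrapper could look like, assuming a TracingEnabled helper in the telemetry package; the package layout and names below are illustrative, not the final API:

package tracing // illustrative package, not the PR's actual layout

import (
    "context"
    "sync"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/noop"
)

// TracingEnabled is a stand-in for the real telemetry check (assumption):
// it should report whether tracing has been configured on the node.
func TracingEnabled() bool { return false }

// LazyTracer resolves the real tracer only once tracing is enabled,
// and falls back to a noop tracer otherwise.
type LazyTracer struct {
    name    string
    options []trace.TracerOption
    once    sync.Once
    tracer  trace.Tracer
}

func Tracer(name string, options ...trace.TracerOption) *LazyTracer {
    return &LazyTracer{
        name:    name,
        options: options,
        tracer:  noop.NewTracerProvider().Tracer(name, options...),
    }
}

// Start wraps trace.Tracer.Start so call sites can write tracer.Start(...) directly.
func (lt *LazyTracer) Start(ctx context.Context, spanName string, opts ...trace.SpanStartOption) (context.Context, trace.Span) {
    if TracingEnabled() {
        lt.once.Do(func() {
            lt.tracer = otel.GetTracerProvider().Tracer(lt.name, lt.options...)
        })
    }
    return lt.tracer.Start(ctx, spanName, opts...)
}

Call sites would then read:

var tracer = tracing.Tracer("tm2/bft/consensus")

ctx, span := tracer.Start(ctx, "handleMsg")
defer span.End()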
tm2/pkg/bft/consensus/state.go (Outdated)

            attribute.Int("csStep", int(cs.Step)),
        )
    }
    cs.traceCtx = ctx
This feels really wrong. If addTrace is used in the wrong place, it could create a data race and have some traces overlapping with each other. Can't we just pass the context as an argument to propagate it through the functions?
I was trying to propagate the context through the functions, but with the current implementation I finally realized that keeping traceCtx in ConsensusState is the simplest way (I've tried several implementations but didn't find any better than using traceCtx :().
IMHO, ConsensusState is used to keep the state of the consensus, so traceCtx could reasonably be part of it. In any case, the function calls change the state, so adding traces in those functions should be fine.
I've checked on my side and I don't yet see a case where this could cause a problem. Maybe we need the opinion of @zivkovicmilos, who knows this part well.
To summarize for @zivkovicmilos:
- I'm using traceCtx in ConsensusState to keep the context for the span.
- @gfanton suggests that we should pass ctx through the functions instead.
IMHO, I totally agree with @gfanton on the principle, but in this case I haven't found a better way than using traceCtx in ConsensusState. Maybe there is some edge case that I don't see yet with my current knowledge, so I'd really appreciate your help on this.
Thanks @zivkovicmilos, @gfanton
Yeah, I get it, and I think you're right; it could be fine to manage it directly on the state like you did, but you must ensure that updates occur only within mtx.Lock. If we have a lock, we can assume that everything outside of it may happen asynchronously (here, for example). My concern is that overlapping traces can lead to confusing behavior and incorrect information. That's why we need to be extra cautious and ensure that this state is updated only while it is locked. So I think you can create a span context anywhere, but it must reach a lock before being set on or read from the state.
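A minimal sketch of the kind of guarded access this means; the field and helper names are illustrative, and the real ConsensusState has many more fields and its own locking discipline:

package consensus // illustrative sketch, not the actual file

import (
    "context"
    "sync"
)

// Illustrative only: just the pieces needed to show the locking pattern.
type ConsensusState struct {
    mtx      sync.Mutex
    traceCtx context.Context
}

// setTraceCtx stores the root span context while holding the state lock, so a
// concurrently running routine can never observe a half-updated trace context.
func (cs *ConsensusState) setTraceCtx(ctx context.Context) {
    cs.mtx.Lock()
    defer cs.mtx.Unlock()
    cs.traceCtx = ctx
}

// traceContext reads the stored context under the same lock, falling back to
// context.Background() when no span has been started yet.
func (cs *ConsensusState) traceContext() context.Context {
    cs.mtx.Lock()
    defer cs.mtx.Unlock()
    if cs.traceCtx == nil {
        return context.Background()
    }
    return cs.traceCtx
}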
tm2/pkg/bft/consensus/state.go (Outdated)

)

// -----------------------------------------------------------------------------
// Tracer
var tracer = otel.Tracer("consensus")
We will likely implement this in other packages, so I think we should prefix the name of the tracer to provide more context.
var tracer = otel.Tracer("consensus") | |
var tracer = otel.Tracer("tm2/bft/consensus") |
tm2/pkg/bft/consensus/state.go (Outdated)

@@ -666,6 +691,11 @@ func (cs *ConsensusState) receiveRoutine(maxSteps int) {

// state transitions on complete-proposal, 2/3-any, 2/3-one
func (cs *ConsensusState) handleMsg(mi msgInfo) {
    span := cs.addTrace(fmt.Sprintf("handleMsg(peerID:%s)", mi.PeerID))
Why not use attributes here (and everywhere else)?

- span := cs.addTrace(fmt.Sprintf("handleMsg(peerID:%s)", mi.PeerID))
+ span := cs.addTrace("handleMsg", trace.WithAttributes(attribute.String("PeerID", mi.PeerID)))

Using attributes could later help to better identify a trace by filtering on specific attributes.
I've added attributes in addTrace (I will add some more attrs like you suggest), but here I keep some short info in the label to have more visibility when viewing the traces (like in the screenshot).
I can remove those labels and keep only the function name as the span name if we think that's not necessary?
wdyt?
Yes, sure! You can include brief information in the name to help identify the trace, but please keep it concise for readability. Feel free to duplicate this information as an attribute as well. If I remember correctly (this was the case in Jaeger) when you click on a trace, you should see the attributes, which is very helpful for understanding the context.
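For illustration, a sketch of that pattern: a short, readable span name with the same information duplicated as attributes. The helper name and attribute keys are illustrative, not this PR's exact code:

package consensus // illustrative sketch

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("tm2/bft/consensus")

// traceHandleMsg is a hypothetical helper: brief info in the span name for
// quick scanning in the trace view, plus the same info as a queryable attribute.
func traceHandleMsg(ctx context.Context, peerID string) (context.Context, trace.Span) {
    return tracer.Start(ctx,
        fmt.Sprintf("handleMsg(%s)", peerID),                       // concise, human-readable name
        trace.WithAttributes(attribute.String("peer_id", peerID)), // same info, filterable
    )
}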
tm2/pkg/bft/consensus/state.go (Outdated)

    spanNameWithInfo := fmt.Sprintf("%s - CS(%v/%v/%v)", spanName, cs.Height, cs.Round, cs.Step)
    spanNameWithInfo = strings.ReplaceAll(spanNameWithInfo, "RoundStep", "RS")
This seems redundant. Why not specify them once in the root span and keep them as attributes for other spans?
I think we could retain the step attribute in the name, though.
Yeah, I think some info like Height is redundant; maybe I should keep just Round + Step?
Or, maybe better, I can separate the round into another trace root, so we can keep just the step and make it easier to trace?
Co-authored-by: Guilhem Fanton <[email protected]>
Thanks @gfanton for your time and your reviews. I have some questions related to your review; could you check them when you're available, please? Thanks!
Force-pushed from c6d3691 to f8437e6
Hi @gfanton, I've updated the code with your suggested changes. Could you take a look again, please? :)
Implementation for #2434
We can filter by BlockHeight

We can trace the calls
