feat: integrate OTEL/Jaeger #3815

Open
wants to merge 33 commits into
base: master

hthieu1110
Contributor

@hthieu1110 hthieu1110 commented Feb 24, 2025

Implementation for #2434

We can filter by BlockHeight
Screenshot 2025-03-05 at 11 19 19

We can trace the calls
Screenshot 2025-03-05 at 11 20 03

@github-actions github-actions bot added 📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related labels Feb 24, 2025
@hthieu1110 hthieu1110 changed the title wip infra: integrate OTEL/Jaeger Feb 24, 2025
@Gno2D2 Gno2D2 requested a review from a team February 24, 2025 14:26
@Gno2D2
Collaborator

Gno2D2 commented Feb 24, 2025

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)
🟢 Pending initial approval by a review team member, or review from tech-staff

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: hthieu1110/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Pending initial approval by a review team member, or review from tech-staff

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 Not (🔴 Pull request author is a member of the team: tech-staff)

Then

🟢 Requirement satisfied
└── 🟢 If
    ├── 🟢 Condition
    │   └── 🟢 Or
    │       ├── 🔴 At least 1 user(s) of the organization reviewed the pull request (with state "APPROVED")
    │       ├── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request
    │       └── 🔴 This pull request is a draft
    └── 🟢 Then
        └── 🟢 And
            ├── 🟢 Not (🔴 This label is applied to pull request: review/triage-pending)
            └── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

codecov bot commented Feb 25, 2025

@hthieu1110 hthieu1110 marked this pull request as ready for review March 5, 2025 04:22
@Kouteki Kouteki requested a review from zivkovicmilos March 10, 2025 10:48
@Kouteki Kouteki added this to the ⏭️Next after mainnet beta milestone Mar 10, 2025
@zivkovicmilos zivkovicmilos linked an issue Mar 11, 2025 that may be closed by this pull request
@zivkovicmilos zivkovicmilos marked this pull request as draft March 13, 2025 17:27
@Gno2D2 Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Mar 13, 2025
@Kouteki Kouteki moved this from In Review to Todo in 🧙‍♂️gno.land core team Apr 7, 2025
Contributor

@sw360cab sw360cab left a comment

In general, I think it is fine to rely on Tempo for the reference OTEL configuration.
I requested a few changes to the Docker Compose config since I disagree with some choices, but they are mostly nits.

My main concern is about splitting the telemetry configuration so that metrics and traces can be enabled separately.
We should discuss internally whether we want this fine-grained configuration, or whether it overloads the node configuration already in place.
Can you elaborate on that? (Just asking; open to discussion.)


tempo:
  image: grafana/tempo:latest
  command: [ "-config.file=/etc/tempo.yaml" ]
Contributor

Please follow other commands' syntax

command:
  - "-config.file=/etc/tempo.yaml"

Contributor Author

fixed

depends_on:
- tempo
- prometheus
profiles:
Contributor

This is a reference Compose file. We don't need to add profiles

Contributor Author

OK, I'll remove it.
Just FYI: I added it because, when testing locally, we need to run every component except the gnochain, which makes the integration easier to test. But you are right, we should handle that locally only.

Contributor Author

fixed

@@ -22,6 +42,9 @@ services:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
networks:
- gnoland-net
profiles:
Contributor

Same as above

Contributor Author

fixed

@@ -35,14 +58,18 @@ services:
- "3000:3000"
networks:
- gnoland-net

profiles:
Contributor

Same as above

Contributor Author

fixed

gnoland-val:
image: ghcr.io/gnolang/gno/gnoland:master
networks:
- gnoland-net
volumes:
# Shared Volume
- gnoland-shared:/gnoroot/shared-data
- ../../genesis.json:/gnoroot/genesis.json
Contributor

This does not make sense.
Why is it needed? In the commands below, the genesis is generated from scratch.

Contributor Author

Hmm, I admit that I don't remember why I added this line.
Removed it.

@@ -98,6 +129,8 @@ services:
gnoland-val:
condition: service_healthy
restart: true
profiles:
Contributor

Same

Contributor Author

fixed

@@ -108,12 +141,16 @@ services:
restart: unless-stopped
networks:
- gnoland-net
profiles:
Contributor

Remove

Contributor Author

fixed

- tempo:/var/tempo
- ./tempo/tempo.yaml:/etc/tempo.yaml
ports:
- "3200:3200"
Contributor

Why is this port public?
As far as I understand, Tempo should only be accessed through Grafana. Do we need external access?

Contributor Author

I've checked, we do not need to expose this port. Removed it.

@@ -10,6 +10,10 @@ processors:
exporters:
prometheus:
endpoint: collector:8090
otlp:
Contributor

This is confusing.
Why use the same name, otlp, for both the receiver and the exporter?
Can we use a different name? If they must be the same, what is the reason?

Contributor Author

Yeah, in fact otlp is not mandatory for the exporter in this case.

I will rename it, but I'm hesitating between two options: tempo and traces. traces is more generic, so we could switch to Jaeger later if needed, but it shares its name with the service, which could later create the impression that those names must match. So I decided to use tempo for now.

Wdyt ?

Contributor Author

@hthieu1110 hthieu1110 Apr 17, 2025

@sw360cab FYI, after checking: otlp here is in fact part of the naming convention, since the name must start with a known exporter type.
So the best-supported way is: otlp/tempo

@sw360cab sw360cab moved this from Todo to In Review in 🧙‍♂️gno.land core team Apr 9, 2025
@sw360cab sw360cab marked this pull request as ready for review April 9, 2025 16:16
Member

@gfanton gfanton left a comment

Good job! I've left some suggestions. I will let @zivkovicmilos have the final word on this, as he is the most familiar with the consensus package.

func (cs *ConsensusState) addTrace(spanName string, opts ...trace.SpanStartOption) trace.Span {
var span trace.Span

if telemetry.TracesEnabled() {
Member

@gfanton gfanton Apr 10, 2025

I'm really not a big fan of using TracesEnabled explicitly everywhere we need it; it seems a bit redundant and error-prone.

What do you think about creating a helper inside our trace package, something like this:

import (
	"sync"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
	"go.opentelemetry.io/otel/trace/noop"
)

// TracerFactory returns the tracer to use at call time.
type TracerFactory func() trace.Tracer

var noopTracerProvider = noop.NewTracerProvider()

// Tracer returns a factory that hands out a noop tracer until tracing is
// enabled, then resolves the real tracer from the global provider once.
func Tracer(name string, options ...trace.TracerOption) TracerFactory {
	var once sync.Once
	var t trace.Tracer = noopTracerProvider.Tracer(name, options...) // Initialize noop tracer as default
	return func() trace.Tracer {
		if TracesEnabled() {
			once.Do(func() {
				provider := otel.GetTracerProvider()
				t = provider.Tracer(name, options...)
			})
		}

		return t
	}
}

so we can use it this way in any package:

var tracer = telemetry.Tracer("consensus")

ctx, span := tracer().Start(....)
defer span.End()

Contributor Author

@hthieu1110 hthieu1110 Apr 11, 2025

I've followed the current implementation of Metrics:

func (cs *ConsensusState) logTelemetry(block *types.Block) {
	if !telemetry.MetricsEnabled() {
		return
	}

But I think your suggestion is really cool.

I hesitate a little bit on some points:

  • Should we apply the same pattern (the one you proposed for traces) to metrics, or is it fine to have two different patterns?
  • ctx, span := tracer().Start(....): we have to resolve the tracer lazily here, but calling tracer().Start is a little cumbersome. Should we wrap Tracer.Start so that we can call tracer.Start() directly?

wdyt ?

Member

I've followed the current implementation of Metrics:

Yeah, I know, but as you might guess, I don't like that either :p.

Should we apply the same pattern (the one you proposed for traces) to metrics, or is it fine to have two different patterns?

Yes, ideally we should use the same pattern. However, I believe Metrics need their own revamp because there are additional changes required. For instance, every package using Metrics should declare the related instruments within its own package. Although it's not ideal to have two different patterns at the moment, this is a separate issue we should address later. For now, let's focus on tracing.

ctx, span := tracer().Start(....): we have to resolve the tracer lazily here, but calling tracer().Start is a little cumbersome. Should we wrap Tracer.Start so that we can call tracer.Start() directly?

I'm fine with that 👍
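
For illustration, here is a minimal sketch of what wrapping Start could look like, so that callers write tracer.Start(...) instead of tracer().Start(...). It is not the code from this PR: Span, Tracer, noopTracer, recordingTracer, tracesEnabled, LazyTracer, and NewLazyTracer are hypothetical names defined locally so that the example runs with the standard library only. In the real code they would roughly correspond to trace.Span, trace.Tracer, the noop provider, a tracer from the global provider, and telemetry.TracesEnabled().

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// Span and Tracer are local stand-ins (assumptions) for trace.Span and
// trace.Tracer, reduced to what this sketch needs.
type Span interface{ End() }

type Tracer interface {
	Start(ctx context.Context, name string) (context.Context, Span)
}

// noopSpan and noopTracer discard everything, like a noop tracer provider.
type noopSpan struct{}

func (noopSpan) End() {}

type noopTracer struct{}

func (noopTracer) Start(ctx context.Context, _ string) (context.Context, Span) {
	return ctx, noopSpan{}
}

// recordingTracer stands in for the real tracer; it only remembers span names.
type recordingTracer struct {
	mu    sync.Mutex
	names []string
}

func (r *recordingTracer) Start(ctx context.Context, name string) (context.Context, Span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.names = append(r.names, name)
	return ctx, noopSpan{}
}

// tracesEnabled is a hypothetical switch mirroring telemetry.TracesEnabled().
var tracesEnabled atomic.Bool

// LazyTracer hides both the enabled check and the one-time tracer lookup
// behind a plain Start method.
type LazyTracer struct {
	once    sync.Once
	resolve func() Tracer // e.g. a lookup on the global tracer provider
	real    Tracer
}

func NewLazyTracer(resolve func() Tracer) *LazyTracer {
	return &LazyTracer{resolve: resolve}
}

func (l *LazyTracer) Start(ctx context.Context, name string) (context.Context, Span) {
	if !tracesEnabled.Load() {
		return noopTracer{}.Start(ctx, name)
	}
	l.once.Do(func() { l.real = l.resolve() })
	return l.real.Start(ctx, name)
}

// demo starts one span with tracing off and one with tracing on, and returns
// the span names that reached the recording tracer.
func demo() []string {
	rec := &recordingTracer{}
	tracer := NewLazyTracer(func() Tracer { return rec })

	tracesEnabled.Store(false)
	_, span := tracer.Start(context.Background(), "ignored")
	span.End()

	tracesEnabled.Store(true)
	_, span = tracer.Start(context.Background(), "handleMsg")
	span.End()

	return rec.names
}

func main() {
	fmt.Println(demo())
}
```

With this shape, call sites never check the enabled flag themselves: the span started while tracing is off goes to the no-op tracer, and the real tracer is resolved only once, on the first call after tracing is enabled.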

attribute.Int("csStep", int(cs.Step)),
)
}
cs.traceCtx = ctx
Member

@gfanton gfanton Apr 10, 2025

This feels really wrong. If addTrace is used in the wrong place, it could create a data race and have some traces overlapping with each other. Can't we just pass the context as an argument to propagate it through the functions?

Contributor Author

@hthieu1110 hthieu1110 Apr 16, 2025

I tried to propagate the context through the functions, but with the current implementation I finally realized that keeping traceCtx in ConsensusState is the simplest way (I tried a few implementations but found nothing better than traceCtx :()
IMHO, ConsensusState holds the state of consensus, so it makes sense for traceCtx to be part of it. In any case, these function calls change the state, so adding traces in those functions should be fine.

I've checked on my side and I don't yet see a case where this could cause a problem. Maybe we need the opinion of @zivkovicmilos, who knows this part well.

To summarize for @zivkovicmilos:

  • I'm using traceCtx in ConsensusState to keep the ctx for the span.
  • @gfanton suggests that we pass ctx through the functions instead.
    IMHO, I totally agree with @gfanton in principle, but in this case I can't find a better way than using traceCtx in ConsensusState. There may be an edge case that I don't see with my current knowledge, so I'd really appreciate your help on this.

Thanks @zivkovicmilos, @gfanton

Member

Yeah, I get it, and I think you're right; it could be fine to manage it directly on the state like you did, but you must ensure that updates occur only while mtx is locked. Since there is a lock, we have to assume that everything outside it may happen asynchronously (here, for example). My concern is that overlapping traces can lead to confusing behavior and incorrect information. That's why we need to be extra cautious and ensure that this state is updated only while the lock is held. So I think you can create a span context anywhere, but it must be under the lock before being set on, or read from, the state.
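
The rule described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: state, spanKey, handleMsg, and step are hypothetical, simplified stand-ins for ConsensusState and its handlers, and a plain context value replaces a real span. The span context is created before the lock, but it is stored on and read from the state only while mtx is held.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// spanKey is the context key under which this sketch stores a span name.
type spanKey struct{}

// state is a hypothetical, reduced stand-in for ConsensusState.
type state struct {
	mtx      sync.Mutex
	traceCtx context.Context // read and written only while mtx is held
	matched  int             // steps that observed their own caller's context
}

// handleMsg creates its span context before locking, but stores it on the
// state only after the lock is acquired.
func (s *state) handleMsg(peer string) {
	ctx := context.WithValue(context.Background(), spanKey{}, "handleMsg:"+peer)

	s.mtx.Lock()
	defer s.mtx.Unlock()

	s.traceCtx = ctx
	s.step(peer)
}

// step assumes the caller holds mtx, so traceCtx is the value that the same
// caller has just stored.
func (s *state) step(peer string) {
	name, _ := s.traceCtx.Value(spanKey{}).(string)
	if name == "handleMsg:"+peer {
		s.matched++
	}
}

// demo runs several handlers concurrently and returns how many steps saw the
// context belonging to their own handler.
func demo() int {
	s := &state{traceCtx: context.Background()}

	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.handleMsg(fmt.Sprintf("peer%d", i))
		}(i)
	}
	wg.Wait()

	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.matched
}

func main() {
	fmt.Println(demo())
}
```

Because every write and read of traceCtx happens under the same lock, each step sees the context set by its own handler. If the assignment were moved above the Lock call, concurrent handlers could overwrite each other's context, which is the overlapping-traces problem raised in this thread.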

)

// -----------------------------------------------------------------------------
// Tracer
var tracer = otel.Tracer("consensus")
Member

We will likely implement this in other packages, so I think we should prefix the name of the tracer to provide more context.

Suggested change
var tracer = otel.Tracer("consensus")
var tracer = otel.Tracer("tm2/bft/consensus")

@@ -666,6 +691,11 @@ func (cs *ConsensusState) receiveRoutine(maxSteps int) {

// state transitions on complete-proposal, 2/3-any, 2/3-one
func (cs *ConsensusState) handleMsg(mi msgInfo) {
span := cs.addTrace(fmt.Sprintf("handleMsg(peerID:%s)", mi.PeerID))
Member

why not use attributes here (and everywhere else)?

Suggested change
span := cs.addTrace(fmt.Sprintf("handleMsg(peerID:%s)", mi.PeerID))
span := cs.addTrace("handleMsg", trace.WithAttributes(attribute.String("PeerID", mi.PeerID)))

Using attributes could later help to better identify the trace given specific attributes.

Contributor Author

I've added attributes in addTrace for tracing (and I will add more, as you suggest), but here I keep some short info in the span name so the traces are easier to read at a glance (like in the screenshot).
I can remove that info and keep only the function name as the span name, if we think it's not necessary.

wdyt ?

Member

Yes, sure! You can include brief information in the name to help identify the trace, but please keep it concise for readability. Feel free to duplicate this information as an attribute as well. If I remember correctly (this was the case in Jaeger) when you click on a trace, you should see the attributes, which is very helpful for understanding the context.
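
As a rough sketch of this split between a concise span name and full attributes: KeyValue, shortID, and spanInfo below are hypothetical helpers defined only for this example (KeyValue stands in for attribute.KeyValue), and the peer ID is made up. The name carries a shortened identifier for quick scanning, while the attributes keep the complete values.

```go
package main

import (
	"fmt"
	"strconv"
)

// KeyValue is a local stand-in (assumption) for attribute.KeyValue.
type KeyValue struct {
	Key   string
	Value string
}

// shortID trims long identifiers so that span names stay readable.
func shortID(id string) string {
	const maxLen = 8
	if len(id) <= maxLen {
		return id
	}
	return id[:maxLen]
}

// spanInfo is a hypothetical helper: it returns a concise span name plus the
// full details as attributes, duplicating the peer ID in both places.
func spanInfo(fn, peerID string, height int64, round int) (string, []KeyValue) {
	name := fmt.Sprintf("%s(%s)", fn, shortID(peerID))
	attrs := []KeyValue{
		{Key: "peerID", Value: peerID},
		{Key: "height", Value: strconv.FormatInt(height, 10)},
		{Key: "round", Value: strconv.Itoa(round)},
	}
	return name, attrs
}

func main() {
	name, attrs := spanInfo("handleMsg", "peer-0123456789abcdef", 42, 0)
	fmt.Println(name)
	for _, kv := range attrs {
		fmt.Printf("  %s=%s\n", kv.Key, kv.Value)
	}
}
```

Keeping the complete values in attributes means a trace backend can filter on them, while the short name stays legible in the trace view.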

Comment on lines 1939 to 1940
spanNameWithInfo := fmt.Sprintf("%s - CS(%v/%v/%v)", spanName, cs.Height, cs.Round, cs.Step)
spanNameWithInfo = strings.ReplaceAll(spanNameWithInfo, "RoundStep", "RS")
Member

@gfanton gfanton Apr 10, 2025

This seems redundant. Why not specify them once in the root span and keep them as attributes for other spans?
I think we could retain the step attribute in the name, though.

Contributor Author

Yeah, I think some info like Height is redundant. Maybe I should keep only Round + Step?

Or, maybe better, I could move the round into a separate root trace, so that we keep just the step and make it easier to trace?

@hthieu1110
Contributor Author

Thanks @gfanton for your time and your reviews. I have some questions related to your review; could you check them when you're available, please? Thanks.

@hthieu1110 hthieu1110 force-pushed the feat/integrate-otel-jaeger branch from c6d3691 to f8437e6 Compare April 17, 2025 10:20
@hthieu1110
Contributor Author

Hi @gfanton, I've updated the code with your suggested changes. Could you take another look, please? :)
I've checked: all the code related to addSpan should already run after the lock acquisition.

@hthieu1110 hthieu1110 requested review from sw360cab and gfanton April 22, 2025 02:15
Labels
📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related
Projects
Status: In Progress
Status: In Review
Development

Successfully merging this pull request may close these issues.

[chain] Add OTEL tracing functionality + Jaeger
5 participants