@yokaze yokaze commented Oct 22, 2025

  • Add neco-server-exporter to expose custom node-local metrics
    • Focus on performance figures that neither node-exporter nor cAdvisor exposes
    • Extensible structure
      • It can collect information from multiple aspects of a server
      • BPF performance monitoring as the first plugin
        • Assume Linux 6.6+ and Cilium 1.16+ (TCX mode) to collect extended information

Signed-off-by: Daichi Sakaue [email protected]

@yokaze yokaze self-assigned this Oct 22, 2025
@yokaze yokaze force-pushed the neconet-exporter branch 8 times, most recently from 4574776 to 65f5f30 Compare October 29, 2025 04:13
@yokaze yokaze force-pushed the neconet-exporter branch 3 times, most recently from 4e7564b to 57d1bc0 Compare October 30, 2025 01:11
@yokaze yokaze marked this pull request as ready for review October 30, 2025 06:29
yokaze commented Oct 30, 2025

NOTE: merge this PR after preparing ghcr.io

@yokaze yokaze requested a review from chez-shanpu October 30, 2025 06:31
@yokaze yokaze force-pushed the neconet-exporter branch 2 times, most recently from aca806e to 4f2916e Compare November 4, 2025 07:34
return nil
}

func (e *Exporter) Run(ctx context.Context) error {
Contributor

It seems that Run() never returns an error.

Contributor Author

I think it's better to leave it as is, because the signature states that Run may return an error in future development.
Without it, someone may be confused about how to handle errors that occur inside it.

Contributor

That makes sense. How about recording what you just explained in a doc comment?
For example:

// Run starts the metric collection loop.
// The error return is reserved for future fatal error conditions.
// Currently always returns nil except when context is cancelled.
func (e *Exporter) Run(ctx context.Context) error {

Contributor Author

Sounds good, I will add it.

}
}

func (e *Exporter) AddCollector(c Collector) error {
Contributor

suggestion: How about making the collectors a constructor parameter instead of using AddCollector()? This would remove the need for running and make the API simpler.

Contributor Author

Cool idea, thank you!

Comment on lines +52 to +55
if err := http.ListenAndServe("0.0.0.0:8080", nil); err != nil {
e.log.Error("metrics server stopped", slog.Any("error", err))
return err
}
Contributor

This HTTP server doesn't stop when ctx is cancelled. How about using http.Server.Shutdown() when ctx is done for graceful shutdown?
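
One way the suggested graceful shutdown could look, as a sketch rather than the PR's actual fix. The serve and demo functions, the 5-second shutdown timeout, and the loopback listener on an ephemeral port are assumptions made so the example runs on its own; the code under review listens on 0.0.0.0:8080 via http.ListenAndServe.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serve runs an HTTP server on ln until ctx is cancelled,
// then shuts the server down gracefully.
func serve(ctx context.Context, ln net.Listener, handler http.Handler) error {
	srv := &http.Server{Handler: handler}

	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		// The server stopped on its own before ctx was cancelled.
		return err
	case <-ctx.Done():
	}

	// Give in-flight requests a bounded amount of time to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}

	// After Shutdown, Serve returns http.ErrServerClosed; treat that as a clean exit.
	if err := <-errCh; err != nil && !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// demo serves on an ephemeral loopback port and cancels after 100 ms.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return serve(ctx, ln, http.NotFoundHandler())
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("metrics server stopped:", err)
		return
	}
	fmt.Println("metrics server stopped cleanly")
}
```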

Comment on lines 10 to 12
"github.com/cilium/cilium/pkg/client"
"github.com/cilium/ebpf"
"github.com/cybozu/neco-containers/neco-server-exporter/pkg/exporter"
Contributor

Suggested change
"github.com/cilium/cilium/pkg/client"
"github.com/cilium/ebpf"

"github.com/cybozu/neco-containers/neco-server-exporter/pkg/exporter"

Missing blank line between external and internal imports.

Contributor Author

@yokaze yokaze Nov 4, 2025

This point is new to me, because our team does not currently use the proposed style.

I'm neutral on it and fine with accepting it. However, it would be better to automate the check.
I'm going to install goimports in tools.mk and check the style in make check-generate.

Comment on lines 8 to 11
"github.com/cybozu/neco-containers/neco-server-exporter/pkg/collector/bpf"
"github.com/cybozu/neco-containers/neco-server-exporter/pkg/exporter"
"github.com/spf13/cobra"
ctrl "sigs.k8s.io/controller-runtime"
Contributor

ditto

uninstall:
$(KUBECTL) delete -f testdata/namespace.yaml
-docker image rm $(IMAGE_TAG)
-docker image prune -f
Contributor

Isn't docker image rm $(IMAGE_TAG) enough?
image prune seems a little dangerous to me.

Contributor Author

It's consistent with other scripts.
Do you want all of them to be fixed?

$ git grep -F -- '-docker image prune -f'
cep-checker/e2e/Makefile:43:    -docker image prune -f
neco-server-exporter/e2e/Makefile:57:   -docker image prune -f
squid-exporter/e2e/Makefile:34: -docker image prune -f

Contributor

I think the others could be fixed as well, but for now fixing only this one is fine.

Contributor Author

Let me clarify what I meant. I think it's very important to choose one of the following options:

  1. Fix all of them at once.
  2. Leave them alone as negligible.

If we fix the problem only partially, the remaining instances will eventually be copy-pasted, and we will have to point out the same problem again and again. That's not productive.
So my question is whether the problem is worth the cost of fixing it everywhere at once.

Contributor

I see. What do you think about image prune?
If you also think it's a problem, let's fix them.

Contributor Author

In my opinion, it looks a little dangerous too.
It's not intuitive that make uninstall prunes all the unrelated images.

My preference is to remove automatic image prune from the entire repository.

Comment on lines 31 to 36

// Please uncomment when needed

// func kubectl(input []byte, args ...string) ([]byte, []byte, error) {
// return runCommand(kubectlPath, input, args...)
// }
Contributor

nit: I think we can remove these lines (I think we can add it when it's needed).

Contributor Author

I think it's better to leave it here, because the function will very likely be needed.
Otherwise, whoever needs kubectl will have to grep across the neco repositories to find an appropriate version.

Contributor

OK, then why don't you call kubectl() from kubectlSafe()?

Contributor Author

Not calling it was a tiny optimization to save call-stack memory, and not a very effective one (~10 ns / 32 bytes).
I'll call kubectl() from kubectlSafe(); that's more robust and readable.
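
A rough sketch of the layering agreed on here. Only the kubectl wrapper matches the commented-out code quoted above; the bodies of runCommand and kubectlSafe are not shown in this thread, so both are assumptions, including the choice to have kubectlSafe return an error that embeds stderr.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// kubectlPath is the binary to run; the default here is an assumption.
var kubectlPath = "kubectl"

// runCommand executes path with args, feeding input on stdin,
// and returns stdout, stderr, and the execution error.
func runCommand(path string, input []byte, args ...string) ([]byte, []byte, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(path, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if input != nil {
		cmd.Stdin = bytes.NewReader(input)
	}
	err := cmd.Run()
	return stdout.Bytes(), stderr.Bytes(), err
}

// kubectl is the thin wrapper quoted in the review.
func kubectl(input []byte, args ...string) ([]byte, []byte, error) {
	return runCommand(kubectlPath, input, args...)
}

// kubectlSafe (hypothetical shape) builds on kubectl instead of calling
// runCommand directly, and folds stderr into the returned error.
func kubectlSafe(input []byte, args ...string) ([]byte, error) {
	stdout, stderr, err := kubectl(input, args...)
	if err != nil {
		return nil, fmt.Errorf("kubectl %v failed: %w (stderr: %s)", args, err, stderr)
	}
	return stdout, nil
}

func main() {
	out, err := kubectlSafe(nil, "version", "--client")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s", out)
}
```

With this layering, kubectl() is exercised by every kubectlSafe() call, so it no longer needs to be kept as commented-out code.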

ret := make(map[ebpf.ProgramID]TCXMetadata)
for it.Next() {
li := it.Take()
defer li.Close()
Contributor

The handle should be closed at the end of each loop iteration, or the loop body should be extracted into a function.

Contributor Author

I don't think it's a major problem, but I will change the implementation to close the handle per program, so that a large number of BPF programs does not hold thousands of file descriptors open at once.
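
The point about defer inside a loop can be illustrated without the ebpf package. In the sketch below, handle, newHandle, inspect, and collect are hypothetical stand-ins: a deferred Close in the loop body would only run when the enclosing function returns, whereas moving the body into its own function makes each handle close before the next one is opened.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for a resource such as a BPF link.
type handle struct {
	id   int
	open *int // shared counter of currently open handles
}

// newHandle opens a handle and increments the open counter.
func newHandle(id int, open *int) *handle {
	*open++
	return &handle{id: id, open: open}
}

// Close releases the handle and decrements the open counter.
func (h *handle) Close() error {
	*h.open--
	return nil
}

// inspect processes one handle. Its deferred Close runs when inspect
// returns, that is, once per loop iteration of the caller.
func inspect(h *handle, peak *int) int {
	defer h.Close()
	if *h.open > *peak {
		*peak = *h.open
	}
	return h.id * 10
}

// collect processes n handles and reports the peak number that were
// open at the same time.
func collect(n int) (map[int]int, int) {
	ret := make(map[int]int)
	open, peak := 0, 0
	for id := 0; id < n; id++ {
		h := newHandle(id, &open)
		ret[id] = inspect(h, &peak)
	}
	return ret, peak
}

func main() {
	ret, peak := collect(1000)
	fmt.Println("entries:", len(ret), "peak open handles:", peak)
}
```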


type Collector interface {
// Metrics names will be "neco_server_<SectionName>_<MetricsName>{MetricsLabels}".
SectionName() string
Contributor

nit: I think just Name() would be fine too, since it's clear that it's the collector's (section's?) name.

Contributor Author

I prefer SectionName, because it hints that the value has a special role.
Name is too terse to convey that the value is used as part of the metric names.
Someone may think it's OK to put arbitrary text here and write something like "BPF Performance Exporter".

BTW how about MetricsGroupName()?

Contributor

MetricsGroupName() is fine. MetricsPrefix would also be clear to me in this case.

Contributor Author

OK I'm going to use MetricsPrefix.
Thanks for the suggestion 🙇
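
A small sketch of how the renamed method could feed the naming scheme from the interface comment, "neco_server_<SectionName>_<MetricsName>". The metricName helper, the bpfCollector type, the "bpf" prefix, and the example metric name are all assumptions for illustration, not code from the PR.

```go
package main

import "fmt"

// Collector is reduced here to the method discussed in this thread.
type Collector interface {
	// MetricsPrefix returns the section part of the exported metric names.
	MetricsPrefix() string
}

// bpfCollector is a hypothetical collector whose prefix is "bpf".
type bpfCollector struct{}

func (bpfCollector) MetricsPrefix() string { return "bpf" }

// metricName builds "neco_server_<prefix>_<name>" for a collector.
func metricName(c Collector, name string) string {
	return fmt.Sprintf("neco_server_%s_%s", c.MetricsPrefix(), name)
}

func main() {
	fmt.Println(metricName(bpfCollector{}, "run_time_seconds_total"))
}
```

Seeing the value spliced into the name makes it clearer why free-form text such as "BPF Performance Exporter" would be a poor return value.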

@@ -0,0 +1,120 @@
module github.com/cybozu/neco-containers/neco-server-exporter

go 1.24.5
Contributor

nit: 1.24.5 has some vulnerabilities. I think the latest would be good.

Contributor Author

The version must exactly match the one in the workflow:
https://github.com/cybozu/neco-containers/blob/main/.github/workflows/main.yaml#L281
https://github.com/cybozu/neco-containers/actions/runs/19063539929/job/54448614430

Do you think it's OK to postpone this to the regular update, or is it better to update all of them now?

$ git grep -F 1.24.5
.github/workflows/main.yaml:285:      go-version: "1.24.5"
admission/go.mod:3:go 1.24.5
bmc-log-collector/go.mod:3:go 1.24.5
cep-checker/go.mod:3:go 1.24.5
envoy/go.mod:3:go 1.24.5
neco-server-exporter/go.mod:3:go 1.24.5
squid-exporter/go.mod:3:go 1.24.5
tcp-keepalive/go.mod:3:go 1.24.5

Contributor

For now, I think it's enough to just update this one.

Contributor Author

It's impossible to update only neco-server-exporter, due to the repository structure.
For this problem we must choose between updating everything at once and waiting for the regular update.

@yokaze yokaze force-pushed the neconet-exporter branch 5 times, most recently from 427e2d0 to 8e2eb90 Compare November 5, 2025 07:57
Signed-off-by: Daichi Sakaue <[email protected]>