This repository was archived by the owner on Oct 23, 2024. It is now read-only.

Commit d956633

Ben Keith (keitwb) authored and committed
Allow use of K8s node name instead of machine_id
This is optional and disabled by default, but it is necessary on platforms like EKS and PKS that have duplicated machine-ids across nodes. Eventually it would be better to migrate away from machine-id altogether, since it is so unreliable, but that would break backwards compatibility and will therefore have to wait for a major release.
1 parent 16e51e6 commit d956633

File tree

8 files changed: +164 −79 lines changed

deployments/k8s/README.md
Lines changed: 4 additions & 0 deletions

@@ -17,6 +17,10 @@ A few things to do before deploying these:
    reference in [./clusterrolebinding.yaml](./clusterrolebinding.yaml) to the
    namespace in which you are deploying the agent.
 
+3. Create a secret in K8s with your org's access token:
+
+   `kubectl create secret generic --from-literal access-token=MY_ACCESS_TOKEN signalfx-agent`
+
 Then to deploy run the following from the present directory:
 
 `cat *.yaml | kubectl apply -f -`

docs/kubernetes-setup.md
Lines changed: 43 additions & 1 deletion

@@ -15,7 +15,12 @@ cluster and configure it to auto-discover SignalFx-supported integrations to
 monitor.
 
 1. Store your organization's Access Token as a key named `access-token` in a
-   Kubernetes secret named `signalfx-agent`.
+   Kubernetes secret named `signalfx-agent`:
+
+   ```sh
+   $ kubectl create secret generic --from-literal access-token=MY_ACCESS_TOKEN signalfx-agent
+   ```
+
 2. If you use [Helm](https://github.com/kubernetes/helm), you can use [our
    chart](https://github.com/kubernetes/charts/tree/master/stable/signalfx-agent)
    in the stable Helm chart repository. Otherwise, download the following

@@ -43,6 +48,17 @@ monitor.
 **If you are using Rancher for your Kubernetes deployment,** complete the
 instructions in [Rancher](#rancher) before proceeding with the next step.
 
+**If you are using AWS Elastic Container Service for Kubernetes (EKS) for
+your Kubernetes deployment,** complete the instructions in [AWS Elastic
+Container Service for Kubernetes
+(EKS)](#aws-elastic-container-service-for-kubernetes-eks) before
+proceeding with the next step.
+
+**If you are using Pivotal Container Service (PKS) for your Kubernetes
+deployment,** complete the instructions in [Pivotal Container Service
+(PKS)](#pivotal-container-service-pks) before proceeding with the next
+step.
+
 **If you are using Google Container Engine (GKE) for your Kubernetes
 deployment,** complete the instructions in [Google Container Engine
 (GKE)](#google-container-engine-gke) before proceeding with the next

@@ -257,6 +273,32 @@ Cadvisor runs on port 9344 instead of the standard 4194.
 When you have completed these steps, continue with step 3 in
 [Installation](#installation).
 
+
+## AWS Elastic Container Service for Kubernetes (EKS)
+On EKS, machine IDs are identical across worker nodes, which makes that value
+useless for identification. Therefore, there are two changes you should make
+to the configmap to use the K8s node name instead of machine-id.
+
+1) In the configmap.yaml, change the top-level config option `sendMachineId`
+to `false`. This will cause the agent to omit the machine_id dimension from
+all datapoints and instead send the `kubernetes_node` dimension on all
+datapoints emitted by the agent.
+
+2) Under the kubernetes-cluster monitor configuration, set the option
+`useNodeName: true`. This will cause that monitor to sync node labels to the
+`kubernetes_node` dimension instead of the `machine_id` dimension.
+
+Note that in EKS there is no concept of a "master" node (at least not one
+that is exposed via the K8s API), so all nodes will be treated as workers.
+
+
+## Pivotal Container Service (PKS)
+
+See [AWS Elastic Container Service for
+Kubernetes](#aws-elastic-container-service-for-kubernetes-eks) -- the setup for
+PKS is identical because of the similar lack of reliable machine IDs.
+
+
 ## Google Container Engine (GKE)
 
 On GKE, access to the kubelet is highly restricted and service accounts will
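Taken together, the two EKS/PKS changes above correspond to a configmap fragment roughly like the following (a sketch using only the option names documented above — the surrounding layout of your agent config file may differ):

```yaml
# Top-level agent option: omit the unreliable machine_id dimension and send
# the kubernetes_node dimension on all datapoints instead.
sendMachineId: false

monitors:
  - type: kubernetes-cluster
    # Sync node labels to the kubernetes_node dimension rather than machine_id.
    useNodeName: true
```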

docs/monitors/kubernetes-cluster.md
Lines changed: 4 additions & 2 deletions

@@ -37,6 +37,7 @@ Monitor Type: `kubernetes-cluster`
 | Config option | Required | Type | Description |
 | --- | --- | --- | --- |
 | `alwaysClusterReporter` | no | `bool` | If `true`, leader election is skipped and metrics are always reported. (**default:** `false`) |
+| `useNodeName` | no | `bool` | If set to true, the Kubernetes node name will be used as the dimension to which to sync properties about each respective node. This is necessary if your cluster's machines do not have unique machine-id values, as can happen when machine images are improperly cloned. (**default:** `false`) |
 | `kubernetesAPI` | no | `object (see below)` | Config for the K8s API client |

@@ -85,8 +86,9 @@ dimensions may be specific to certain metrics.
 | --- | --- |
 | `kubernetes_name` | The name of the resource that the metric describes |
 | `kubernetes_namespace` | The namespace of the resource that the metric describes |
+| `kubernetes_node` | The name of the node, as defined by the `name` field of the node resource. |
 | `kubernetes_pod_uid` | The UID of the pod that the metric describes |
-| `machine_id` | The machine ID from /etc/machine-id. This should be unique across all nodes in your cluster, but some cluster deployment tools don't guarantee this. |
+| `machine_id` | The machine ID from /etc/machine-id. This should be unique across all nodes in your cluster, but some cluster deployment tools don't guarantee this. This will not be sent if the `useNodeName` config option is set to true. |
 | `metric_source` | This is always set to `kubernetes` |

## Properties
@@ -97,7 +99,7 @@ are set on the dimension values of the dimension specified.
 | Name | Dimension | Description |
 | --- | --- | --- |
-| `<node label>` | `machine_id` | All non-blank labels on a given node will be synced as properties to the `machine_id` dimension value for that node. Any blank values will be synced as tags on that same dimension. |
+| `<node label>` | `machine_id/kubernetes_node` | All non-blank labels on a given node will be synced as properties to the `machine_id` or `kubernetes_node` dimension value for that node. Which dimension gets the properties is determined by the `useNodeName` config option. Any blank values will be synced as tags on that same dimension. |
 | `<pod label>` | `kubernetes_pod_uid` | Any labels with non-blank values on the pod will be synced as properties to the `kubernetes_pod_uid` dimension. Any blank labels will be synced as tags on that same dimension. |
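The non-blank-vs-blank label behavior described in the Properties table can be sketched in Go (an illustrative stand-alone helper, not the monitor's actual `propsAndTagsFromLabels` implementation):

```go
package main

import "fmt"

// splitLabels mirrors the documented behavior: labels with non-blank values
// become properties on the dimension, labels with blank values become tags.
func splitLabels(labels map[string]string) (props map[string]string, tags map[string]bool) {
	props = make(map[string]string)
	tags = make(map[string]bool)
	for k, v := range labels {
		if v == "" {
			tags[k] = true
		} else {
			props[k] = v
		}
	}
	return props, tags
}

func main() {
	props, tags := splitLabels(map[string]string{
		"kubernetes.io/role": "node",
		"dedicated":          "", // blank value -> synced as a tag
	})
	fmt.Println(props["kubernetes.io/role"], tags["dedicated"])
}
```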

internal/core/hostid/dims.go
Lines changed: 46 additions & 27 deletions

@@ -1,15 +1,19 @@
 package hostid
 
-import log "github.com/sirupsen/logrus"
+import (
+	"sync"
+
+	log "github.com/sirupsen/logrus"
+)
 
 // Dimensions returns a map of host-specific dimensions that are derived from
 // the environment.
 func Dimensions(sendMachineID bool, hostname string, useFullyQualifiedHost *bool) map[string]string {
 	log.Info("Fetching host id dimensions")
-	// Fire off all lookups simultaneously so we delay agent startup as little
-	// as possible.
 
-	hostProvider := callConcurrent(func() string {
+	var g dimGatherer
+
+	g.GatherDim("host", func() string {
 		if hostname != "" {
 			return hostname
 		}
@@ -18,38 +22,53 @@ func Dimensions(sendMachineID bool, hostname string, useFullyQualifiedHost *bool
 		// if the user specified it explicitly as false with this logic.
 		return getHostname(useFullyQualifiedHost == nil || *useFullyQualifiedHost)
 	})
-	awsProvider := callConcurrent(AWSUniqueID)
-	gcpProvider := callConcurrent(GoogleComputeID)
-	machineIDProvider := callConcurrent(MachineID)
-	azureProvider := callConcurrent(AzureUniqueID)
-
-	dims := make(map[string]string)
-	insertNextChanValue(dims, "host", hostProvider)
-	insertNextChanValue(dims, "AWSUniqueId", awsProvider)
-	insertNextChanValue(dims, "gcp_id", gcpProvider)
+	g.GatherDim("AWSUniqueId", AWSUniqueID)
+	g.GatherDim("gcp_id", GoogleComputeID)
 	if sendMachineID {
-		insertNextChanValue(dims, "machine_id", machineIDProvider)
+		g.GatherDim("machine_id", MachineID)
+	} else {
+		// If not running on k8s, this will be blank and thus omitted. It is
+		// only sent as an alternative to machine id because k8s node labels
+		// are synced as properties to this instead of machine_id when
+		// machine_id isn't available.
+		g.GatherDim("kubernetes_node", KubernetesNodeName)
 	}
-	insertNextChanValue(dims, "azure_resource_id", azureProvider)
+	g.GatherDim("azure_resource_id", AzureUniqueID)
+
+	dims := g.WaitForDimensions()
 
 	log.Infof("Using host id dimensions %v", dims)
 	return dims
 }
 
-func callConcurrent(f func() string) <-chan string {
-	res := make(chan string)
+// Helper to fire off the dim lookups in parallel to minimize delay to agent
+// start up.
+type dimGatherer struct {
+	lock sync.Mutex
+	dims map[string]string
+	wg   sync.WaitGroup
+}
+
+// GatherDim inserts the given dim key based on the output of the provider
+// func. If the output is blank, the dimension will not be inserted.
+func (dg *dimGatherer) GatherDim(key string, provider func() string) {
+	dg.wg.Add(1)
 	go func() {
-		res <- f()
+		res := provider()
+		if res != "" {
+			dg.lock.Lock()
+			if dg.dims == nil {
+				dg.dims = make(map[string]string)
+			}
+
+			dg.dims[key] = res
+			dg.lock.Unlock()
+		}
+		dg.wg.Done()
 	}()
-	return res
 }
 
-func insertNextChanValue(m map[string]string, k string, ch <-chan string) {
-	select {
-	case val := <-ch:
-		// Don't insert blank values
-		if val != "" {
-			m[k] = val
-		}
-	}
+func (dg *dimGatherer) WaitForDimensions() map[string]string {
+	dg.wg.Wait()
+	return dg.dims
 }
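The dimGatherer added here replaces the channel-per-lookup scheme with a WaitGroup plus a mutex-guarded map. A self-contained sketch of the same pattern (simplified names, not the agent's actual type):

```go
package main

import (
	"fmt"
	"sync"
)

// gatherer runs each provider concurrently and collects non-blank results
// into a map, mirroring the dimGatherer pattern in this diff.
type gatherer struct {
	mu   sync.Mutex
	dims map[string]string
	wg   sync.WaitGroup
}

func (g *gatherer) gather(key string, provider func() string) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if res := provider(); res != "" { // blank results are omitted
			g.mu.Lock()
			if g.dims == nil {
				g.dims = make(map[string]string)
			}
			g.dims[key] = res
			g.mu.Unlock()
		}
	}()
}

func (g *gatherer) wait() map[string]string {
	g.wg.Wait()
	return g.dims
}

func main() {
	var g gatherer
	g.gather("host", func() string { return "node-1" })
	g.gather("machine_id", func() string { return "" }) // omitted
	fmt.Println(g.wait()) // only the non-blank "host" entry survives
}
```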

internal/core/hostid/k8s.go
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+package hostid
+
+import "os"
+
+// KubernetesNodeName returns the name of the current K8s node, if running
+// on K8s and if the appropriate envvar (MY_NODE_NAME) has been injected in the
+// agent pod by the downward API mechanism.
+func KubernetesNodeName() string {
+	return os.Getenv("MY_NODE_NAME")
+}

internal/monitors/kubernetes/cluster/metrics/cache.go
Lines changed: 9 additions & 3 deletions

@@ -24,20 +24,26 @@ type cachedResourceKey struct {
 	UID types.UID
 }
 
+var logger = log.WithFields(log.Fields{
+	"monitorType": "kubernetes-cluster",
+})
+
 // DatapointCache holds an up to date copy of datapoints pertaining to the
 // cluster. It is updated whenever the HandleChange method is called with new
 // K8s resources.
 type DatapointCache struct {
 	dpCache      map[cachedResourceKey][]*datapoint.Datapoint
 	dimPropCache map[cachedResourceKey]*atypes.DimProperties
+	useNodeName  bool
 	mutex        sync.Mutex
 }
 
 // NewDatapointCache creates a new clean cache
-func NewDatapointCache() *DatapointCache {
+func NewDatapointCache(useNodeName bool) *DatapointCache {
 	return &DatapointCache{
 		dpCache:      make(map[cachedResourceKey][]*datapoint.Datapoint),
 		dimPropCache: make(map[cachedResourceKey]*atypes.DimProperties),
+		useNodeName:  useNodeName,
 	}
 }
 
@@ -109,8 +115,8 @@ func (dc *DatapointCache) HandleAdd(newObj runtime.Object) interface{} {
 	case *v1beta1.ReplicaSet:
 		dps = datapointsForReplicaSet(o)
 	case *v1.Node:
-		dps = datapointsForNode(o)
-		dimProps = dimPropsForNode(o)
+		dps = datapointsForNode(o, dc.useNodeName)
+		dimProps = dimPropsForNode(o, dc.useNodeName)
 	default:
 		log.WithFields(log.Fields{
 			"obj": spew.Sdump(newObj),
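The shape of this change — a constructor-injected flag on a mutex-guarded cache — can be sketched stand-alone (simplified types, not the agent's actual DatapointCache):

```go
package main

import (
	"fmt"
	"sync"
)

// resourceKey stands in for cachedResourceKey: resources are identified by
// kind plus UID.
type resourceKey struct {
	Kind string
	UID  string
}

// cache is a simplified stand-in for DatapointCache: a mutex-guarded map
// keyed by resource identity, carrying the useNodeName flag set at creation.
type cache struct {
	mu          sync.Mutex
	data        map[resourceKey][]string
	useNodeName bool
}

func newCache(useNodeName bool) *cache {
	return &cache{data: make(map[resourceKey][]string), useNodeName: useNodeName}
}

func (c *cache) handleAdd(kind, uid string, dps []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[resourceKey{kind, uid}] = dps
}

func main() {
	c := newCache(true)
	c.handleAdd("Node", "abc-123", []string{"kubernetes.node_ready"})
	fmt.Println(c.useNodeName, len(c.data))
}
```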
Lines changed: 42 additions & 45 deletions

@@ -1,12 +1,9 @@
 package metrics
 
 import (
-	"reflect"
-
 	"github.com/signalfx/golib/datapoint"
 	"github.com/signalfx/golib/sfxclient"
 	atypes "github.com/signalfx/signalfx-agent/internal/monitors/types"
-	log "github.com/sirupsen/logrus"
 	"k8s.io/api/core/v1"
 )
 
@@ -15,48 +12,70 @@
 
 // DIMENSION(machine_id): The machine ID from /etc/machine-id. This should be
 // unique across all nodes in your cluster, but some cluster deployment tools
-// don't guarantee this.
+// don't guarantee this. This will not be sent if the `useNodeName` config
+// option is set to true.
+
+// DIMENSION(kubernetes_node): The name of the node, as defined by the `name`
+// field of the node resource.
 
-// PROPERTY(machine_id:<node label>): All non-blank labels on a given node will
-// be synced as properties to the `machine_id` dimension value for that node.
-// Any blank values will be synced as tags on that same dimension.
+// PROPERTY(machine_id/kubernetes_node:<node label>): All non-blank labels on a
+// given node will be synced as properties to the `machine_id` or
+// `kubernetes_node` dimension value for that node. Which dimension gets the
+// properties is determined by the `useNodeName` config option. Any blank
+// values will be synced as tags on that same dimension.
 
 // A map to check for duplicate machine IDs
-var machineIDToHostMap = make(map[string]string)
+var machineIDToNodeNameMap = make(map[string]string)
 
-func datapointsForNode(node *v1.Node) []*datapoint.Datapoint {
+func datapointsForNode(node *v1.Node, useNodeName bool) []*datapoint.Datapoint {
 	dims := map[string]string{
-		"host":       firstNodeHostname(node),
-		"machine_id": node.Status.NodeInfo.MachineID,
+		"host":            firstNodeHostname(node),
+		"kubernetes_node": node.Name,
+	}
+
+	// If we aren't using the node name as the node id, then we need machine_id
+	// to sync properties to. Eventually we should just get rid of machine_id
+	// if it doesn't become more reliable and dependable across k8s deployments.
+	if !useNodeName {
+		dims["machine_id"] = node.Status.NodeInfo.MachineID
 	}
 
 	return []*datapoint.Datapoint{
 		sfxclient.Gauge("kubernetes.node_ready", dims, nodeConditionValue(node, v1.NodeReady)),
 	}
 }
 
-func dimPropsForNode(node *v1.Node) *atypes.DimProperties {
+func dimPropsForNode(node *v1.Node, useNodeName bool) *atypes.DimProperties {
 	props, tags := propsAndTagsFromLabels(node.Labels)
 
 	if len(props) == 0 && len(tags) == 0 {
 		return nil
 	}
 
-	host := firstNodeHostname(node)
-	machineID := node.Status.NodeInfo.MachineID
-
-	if otherHost, ok := machineIDToHostMap[machineID]; ok && otherHost != host {
-		log.Errorf("Your K8s cluster appears to have duplicate node machine IDs, "+
-			"host %s and %s both have machine ID %s. This is probably kubelet's fault.", host, otherHost, machineID)
-		return nil
+	dim := atypes.Dimension{
+		Name:  "kubernetes_node",
+		Value: node.Name,
 	}
-	machineIDToHostMap[machineID] = host
 
-	return &atypes.DimProperties{
-		Dimension: atypes.Dimension{
+	if !useNodeName {
+		machineID := node.Status.NodeInfo.MachineID
+		dim = atypes.Dimension{
 			Name:  "machine_id",
 			Value: machineID,
-		},
+		}
+
+		if otherNodeName, ok := machineIDToNodeNameMap[machineID]; ok && otherNodeName != node.Name {
+			logger.Errorf("Your K8s cluster appears to have duplicate node machine IDs, "+
+				"node %s and %s both have machine ID %s. Please set the `useNodeName` option "+
+				"in this monitor config and set the top-level config option `sendMachineID` to "+
+				"false.", node.Name, otherNodeName, machineID)
+		}
+
+		machineIDToNodeNameMap[machineID] = node.Name
+	}
+
+	return &atypes.DimProperties{
+		Dimension:  dim,
 		Properties: props,
 		Tags:       tags,
 	}
@@ -87,25 +106,3 @@ func firstNodeHostname(node *v1.Node) string {
 	}
 	return ""
 }
-
-// Nodes get updated a lot due to heartbeat checks that alter the
-// lastHeartbeatCheck field within condition items. Also the images can
-// sometimes come in different orderings and we don't really care about them
-// anyway, so just get rid of them before comparing.
-func nodesDifferent(n1 *v1.Node, n2 *v1.Node) bool {
-	if n2 == nil {
-		return true
-	}
-	c1 := *n1
-	c2 := *n2
-
-	c1.ResourceVersion = c2.ResourceVersion
-
-	c1.Status.Conditions = nil
-	c2.Status.Conditions = nil
-
-	c1.Status.Images = nil
-	c2.Status.Images = nil
-
-	return !reflect.DeepEqual(c1, c2)
-}
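The duplicate-machine-ID check in dimPropsForNode can be exercised in isolation (a simplified sketch of the same map-based detection; the function name is illustrative):

```go
package main

import "fmt"

// Tracks which node name was last seen for each machine ID, mirroring the
// machineIDToNodeNameMap in the diff.
var machineIDToNodeName = map[string]string{}

// recordNode returns the name of a previously seen node that shares this
// machine ID, or "" if the machine ID is unique so far — the same check the
// diff performs before logging the duplicate-ID error.
func recordNode(machineID, nodeName string) string {
	other, seen := machineIDToNodeName[machineID]
	machineIDToNodeName[machineID] = nodeName
	if seen && other != nodeName {
		return other
	}
	return ""
}

func main() {
	fmt.Println(recordNode("ec2-dup-id", "node-a")) // unique so far: ""
	fmt.Println(recordNode("ec2-dup-id", "node-b")) // duplicate: "node-a"
}
```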

0 commit comments