Docs: SEO updates to operations, other specs sections #25518

Open: wants to merge 3 commits into base: main

2 changes: 1 addition & 1 deletion website/content/docs/operations/aws-oidc-provider.mdx
@@ -2,7 +2,7 @@
layout: docs
page_title: Federate access to AWS with Nomad Workload Identity
description: |-
Integrate Nomad as an OpenID Connect (OIDC) provider with AWS IAM identity and federate access to AWS resources.
Integrate Nomad as an OpenID Connect (OIDC) provider with AWS IAM identity and use workload identity to federate access to AWS resources and services.
---

# Federate access to AWS with Nomad Workload Identity
6 changes: 3 additions & 3 deletions website/content/docs/operations/benchmarking.mdx
@@ -1,11 +1,11 @@
---
layout: docs
page_title: Benchmarking Nomad
page_title: Benchmark and load test Nomad
description: |-
Load testing Nomad by utilizing the Nomad Bench project.
Use the Nomad Bench project to benchmark and load test Nomad servers.
---

# Nomad Bench
# Benchmark and load test Nomad

The Nomad Bench project provides reusable infrastructure automation to run test scenarios in order
to collect metrics and data from Nomad clusters running at scale. The core goal of the project is
5 changes: 3 additions & 2 deletions website/content/docs/operations/federation/failure.mdx
@@ -1,10 +1,11 @@
---
layout: docs
page_title: Federated cluster failure scenarios
description: Failure scenarios in multi-region federated cluster deployments.
description: |-
Review failure scenarios in multi-region federated cluster deployments. Learn which Nomad features continue to work under federated and authoritative region failures.
---

# Failure scenarios
# Federated cluster failure scenarios

When running Nomad in federated mode, failure situations and impacts are different depending on
whether the authoritative region is the impacted region or not, and what the failure mode is. In
2 changes: 1 addition & 1 deletion website/content/docs/operations/federation/index.mdx
@@ -2,7 +2,7 @@
layout: docs
page_title: Federated cluster operations
description: |-
Operational considerations for running Nomad multi-region federated clusters as well as instructions for migrating the authoritative region to a federated region.
Review operational considerations for running Nomad multi-region federated clusters as well as instructions for migrating the authoritative region to a federated region.
---

# Federated cluster operations
2 changes: 1 addition & 1 deletion website/content/docs/operations/index.mdx
@@ -2,7 +2,7 @@
layout: docs
page_title: Operations
description: |-
Learn about operating Nomad.
This section contains guides, explanatory content, and reference information for running Nomad in a production environment. Topics include stateful workloads, monitoring, benchmarking, key management, IPv6 support, federation, cluster management, access control, and transport security.
---

# Operations
6 changes: 3 additions & 3 deletions website/content/docs/operations/ipv6-support.mdx
@@ -1,11 +1,11 @@
---
layout: docs
page_title: Support for IPv6
page_title: IPv6 Support in Nomad
description: |-
Nomad support for IPv6
Learn how Nomad supports IPv6. Configure Nomad to advertise IPv6 addresses. Link Nomad servers and clients that have specific IPv6 addresses. Set up Consul and Vault to use Nomad's IPv6 address. Learn how workload tasks and task drivers can use IPv6 addresses.
---

# IPv6 support in Nomad
# IPv6 Support in Nomad

Nomad supports IPv6 as long as the underlying networks, host machines,
and operating systems running it support IPv6.
15 changes: 8 additions & 7 deletions website/content/docs/operations/key-management.mdx
@@ -1,10 +1,11 @@
---
layout: docs
page_title: Key Management
description: Learn about the key management in Nomad.
page_title: Key management
description: |-
Learn how Nomad manages its keyring, which Nomad uses to encrypt variables, sign task workload identities, and sign OpenID Connect (OIDC) client assertions. Review key rotation, key decryption, and key redaction in Raft snapshots. Learn how Nomad v1.9+ can replicate keys from older Nomad versions.
---

# Key Management
# Key management

Nomad servers maintain an encryption keyring used to encrypt [Variables][],
sign task [workload identities][], and sign OIDC [client assertion JWTs][].
@@ -27,7 +28,7 @@ Under normal operations the keyring is entirely managed by Nomad, but this
section provides administrators additional context around key replication and
recovery.

## Key Rotation
## Key rotation

Only one key in the keyring is "active" at any given time, and all encryption
and signing operations happen on the leader. Nomad automatically rotates the
@@ -42,15 +43,15 @@ operator root keyring rotate -full`][]. A new "active" key will be created and
re-encrypt all variables with the new key. As each key's variables are encrypted
with the new key, the old key will be marked as "deprecated".

## Key Decryption
## Key decryption

When a leader is elected, the leader creates the keyring if it does not already
exist. When a key is added, the new wrapped key material is replicated via
Raft. As each server replicates the new key, the server starts a task to decrypt
the key material. Until this task completes, the server is not able to serve
requests that require this key.

## Key Redaction in Raft Snapshots
## Key redaction in Raft snapshots

The default AEAD `keyring` configuration stores the KEK in Raft. Raft snapshots
contain the cleartext KEK. The `nomad operator snapshot save` command has a
@@ -60,7 +61,7 @@ existing snapshot.

Redacting key material is not required when using an external KMS.

## Legacy Keystore
## Legacy keystore

Versions of Nomad prior to 1.9.0 stored only key metadata in Raft, but the
encryption key material was stored in a separate file in the `keystore`
31 changes: 16 additions & 15 deletions website/content/docs/operations/metrics-reference.mdx
@@ -1,10 +1,11 @@
---
layout: docs
page_title: Metrics Reference
description: Learn about the different metrics available in Nomad.
page_title: Metrics reference
description: |-
This page contains reference information on the gauge, counter, and timer runtime metrics data that Nomad collects. Use the metrics endpoint to access the metrics data. Learn about the key metrics for monitoring your cluster. Review client, host, allocation, job summary, job status, server, Raft BoltDB, and agent metrics fields.
---

# Metrics Reference
# Metrics reference

The Nomad agent collects various runtime metrics about the performance of
different libraries and subsystems. These metrics are aggregated on a ten
@@ -67,22 +68,22 @@ Below is sample output of a telemetry dump:
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```
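
Sample lines in the dump follow a regular shape, so a short script can pull the fields out of them. The following is a minimal sketch, assuming the `[S]` sample-line layout shown above; the `parse_sample_line` helper and its regular expression are hypothetical and are not part of Nomad.

```python
import re

# Hypothetical pattern for one "[S]" (sample) line of a telemetry dump.
# It assumes the exact field order shown in the sample output above.
SAMPLE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[S\] '(?P<name>[^']+)': "
    r"Count: (?P<count>\d+) Min: (?P<min>[\d.]+) Mean: (?P<mean>[\d.]+) "
    r"Max: (?P<max>[\d.]+) Stddev: (?P<stddev>[\d.]+) Sum: (?P<sum>[\d.]+)$"
)


def parse_sample_line(line):
    """Return a dict of fields for a sample line, or None if it does not match."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    result = {
        "timestamp": fields["timestamp"],
        "name": fields["name"],
        "count": int(fields["count"]),
    }
    # The remaining fields are decimal numbers.
    for key in ("min", "mean", "max", "stddev", "sum"):
        result[key] = float(fields[key])
    return result


line = (
    "[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': "
    "Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204"
)
parsed = parse_sample_line(line)
```

For the sample line, `parsed["sum"] / parsed["count"]` agrees with the reported mean, which is a quick sanity check on the parsed values.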

### Metric Types
### Metric types

| Type | Description | Quantiles |
| ------- | ------------------------------------------------------------------------------------------------------------------- | --------- |
| Gauge | Gauge types report an absolute number at the end of the aggregation interval | false |
| Counter | Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero | true |
| Timer | Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc per interval. | true |

### Tagged Metrics
### Tagged metrics

Nomad emits metrics in a tagged format. Each metric can support more than one
tag, meaning that it is possible to do a match over metrics for datapoints
such as a particular datacenter, and return all metrics with this tag. Nomad
supports labels for namespaces as well.
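
To illustrate matching on tags, the sketch below selects every gauge that carries a given label. It assumes a JSON payload with a `Gauges` list whose entries have `Name`, `Value`, and `Labels` fields; that shape, the sample values, and the `filter_by_labels` helper are assumptions made for illustration.

```python
import json

# Hypothetical payload. The field names and values are assumptions,
# not output captured from a real cluster.
payload = json.loads("""
{
  "Gauges": [
    {"Name": "nomad.client.uptime", "Value": 120,
     "Labels": {"datacenter": "dc1", "node_id": "a1"}},
    {"Name": "nomad.client.uptime", "Value": 45,
     "Labels": {"datacenter": "dc2", "node_id": "b2"}},
    {"Name": "nomad.client.unallocated.memory", "Value": 2048,
     "Labels": {"datacenter": "dc1", "node_id": "a1"}}
  ]
}
""")


def filter_by_labels(metrics, **wanted):
    """Return the metrics whose labels contain every wanted key/value pair."""
    return [
        metric
        for metric in metrics
        if all(metric.get("Labels", {}).get(key) == value
               for key, value in wanted.items())
    ]


# All gauges tagged with datacenter "dc1".
dc1_gauges = filter_by_labels(payload["Gauges"], datacenter="dc1")
```

Passing more than one keyword argument, for example `datacenter="dc1", node_id="a1"`, narrows the match to metrics that carry all of the given tags.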

## Key Metrics
## Key metrics

The metrics in the table below are the most important metrics for monitoring
the overall health of a Nomad cluster.
@@ -121,7 +122,7 @@ signals.
| `nomad.raft.replication.appendEntries` | Raft transaction commit time | ms / Raft Log Append | Timer |
| `nomad.license.expiration_time_epoch` | Time as epoch (seconds since Jan 1 1970) at which license will expire | Seconds | Gauge |
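
Because `nomad.license.expiration_time_epoch` is reported in seconds since the epoch, an alert rule has to convert it to remaining time. The following is a minimal sketch of that arithmetic; the helper names and the 30-day threshold are assumptions for illustration, not Nomad defaults.

```python
SECONDS_PER_DAY = 86_400


def days_until_expiry(expiration_epoch, now_epoch):
    """Days remaining before expiry; negative once the license has expired."""
    return (expiration_epoch - now_epoch) / SECONDS_PER_DAY


def should_alert(expiration_epoch, now_epoch, threshold_days=30):
    """True when the remaining time is at or below the (assumed) threshold."""
    return days_until_expiry(expiration_epoch, now_epoch) <= threshold_days


# Illustrative values: a gauge reading 45 days ahead of a fixed "now".
now = 1_700_000_000
expires = now + 45 * SECONDS_PER_DAY
remaining = days_until_expiry(expires, now)
```

In practice the current time would come from `time.time()` and the expiration value from the gauge itself.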

## Client Metrics
## Client metrics

The Nomad client emits metrics related to the resource usage of the allocations
and tasks running on it and the node itself. Operators have to explicitly turn
@@ -149,7 +150,7 @@ parameterized or periodic job respectively. For example, a dispatch job with the
| parent_id | `myjob` |
| dispatch_id | `1312323423423` |

## Host Metrics
## Host metrics

Nomad will emit [tagged metrics][tagged-metrics], in the below format:

@@ -188,7 +189,7 @@ Nomad will emit [tagged metrics][tagged-metrics], in the below format:
| `nomad.client.unallocated.memory` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge | datacenter, host, node_class, node_id, node_pool, node_scheduling_eligibility, node_status |
| `nomad.client.uptime` | Uptime of the host running the Nomad client | Seconds | Gauge | datacenter, host, node_class, node_id, node_pool, node_scheduling_eligibility, node_status |

### Client Hook Metrics
### Client hook metrics

Nomad will emit metrics allowing you to monitor and alert on allocation and task hook performance.
If you do not need these, they can be disabled via the [`disable_allocation_hook_metrics`][]
@@ -203,7 +204,7 @@ configuration parameter.
| `nomad.client.task_hook.prestart.success` | Number of hook executions that completed successfully | Integer | Counter | datacenter, host, node_class, node_id, node_pool, hook_name |
| `nomad.client.task_hook.prestart.elapsed` | The time it took the hook to run | Milliseconds | Timer | datacenter, host, node_class, node_id, node_pool, hook_name |

## Allocation Metrics
## Allocation metrics

The following metrics are emitted for each allocation if allocation metrics
are enabled. Note that allocation metrics available may be dependent on factors
@@ -234,7 +235,7 @@ such as the task driver and control group (cgroup) version in use.
| `nomad.client.allocs.restart` | Number of task restarts | Integer | Counter | alloc_id, host, job, namespace, task, task_group |
| `nomad.client.allocs.running` | Number of running allocations | Integer | Counter | alloc_id, host, job, namespace, task, task_group |

## Job Summary Metrics
## Job summary metrics

Job summary metrics are emitted by the Nomad leader server.

@@ -248,7 +249,7 @@ Job summary metrics are emitted by the Nomad leader server.
| `nomad.nomad.job_summary.running` | Number of running allocations for a job | Integer | Gauge | host, job, namespace, task_group |
| `nomad.nomad.job_summary.starting` | Number of starting allocations for a job | Integer | Gauge | host, job, namespace, task_group |

## Job Status Metrics
## Job status metrics

Job status metrics are emitted by the Nomad leader server.

@@ -258,7 +259,7 @@ Job status metrics are emitted by the Nomad leader server.
| `nomad.nomad.job_status.pending` | Number of pending jobs | Integer | Gauge | host |
| `nomad.nomad.job_status.running` | Number of running jobs | Integer | Gauge | host |

## Server Metrics
## Server metrics

The following table includes metrics for overall cluster health in addition to
those listed in [Key Metrics](#key-metrics) above.
@@ -501,7 +502,7 @@ those listed in [Key Metrics](#key-metrics) above.
| `nomad.scheduler.allocs.rescheduled.wait_until` | Time that a rescheduled allocation will be delayed | Float | Gauge | alloc_id, job, namespace, task_group, follow_up_eval_id |
| `nomad.state.snapshotIndex` | Current snapshot index | Integer | Gauge | host |

## Raft BoltDB Metrics
## Raft BoltDB metrics

Raft database metrics are emitted by the `raft-boltdb` library.

@@ -526,7 +527,7 @@ Raft database metrics are emitted by the `raft-boltdb` library.
| `nomad.raft.boltdb.txstats.write` | Count of total write operations | Integer | Counter |
| `nomad.raft.boltdb.txstats.writeTime` | Sample of write operation times | Nanoseconds | Summary |

## Agent Metrics
## Agent metrics

Agent metrics are emitted by all Nomad agents running in either client or server mode.

21 changes: 9 additions & 12 deletions website/content/docs/operations/monitoring-nomad.mdx
@@ -1,12 +1,11 @@
---
layout: docs
page_title: Monitoring Nomad
page_title: Monitor Nomad
description: |-
Overview of runtime metrics available in Nomad along with monitoring and
alerting.
Learn how to monitor the health and performance of Nomad clusters. Export data to Prometheus or DataDog. Review metrics for the Raft consensus protocol, scheduling, performance, capacity, task resource consumption, job and task status, runtime, and federated deployments.
---

# Monitoring Nomad
# Monitor Nomad

The Nomad client and server agents collect a wide range of runtime metrics.
These metrics are useful for monitoring the health and performance of Nomad
@@ -70,16 +69,14 @@ patterns.
system as appropriate. In many cases, it may be ok if a given batch job fails
occasionally, as long as it goes back to passing.

# Key Performance Indicators
## Key performance indicators

Nomad servers' memory, CPU, disk, and network usage all scales linearly with
cluster size and scheduling throughput. The most important aspect of ensuring
Nomad operates normally is monitoring these system resources to ensure the
servers are not encountering resource constraints.

The sections below cover a number of other important metrics.

## Consensus Protocol (Raft)
## Raft consensus protocol

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking
@@ -261,18 +258,18 @@ a per client basis.
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**

## Task Resource Consumption
## Task resource consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per task basis. For user facing services, it is common to alert
when the CPU is at or above the reserved resources for the task.

## Job and Task Status
## Job and task status

See [Job Summary Metrics] for monitoring the health and status of workloads
running on Nomad.

## Runtime Metrics
## Runtime metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure.
@@ -284,7 +281,7 @@ general indicators of load and memory pressure.
It is recommended to alert on upticks in any of the above, server memory usage
in particular.

## Federated Deployments (Serf)
## Serf federated deployments

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated