Commit ef02897

docs: add docs around versioning, and bump versions
1 parent 2b4ff40 commit ef02897

8 files changed: 250 additions and 9 deletions
.gitlab-ci.yml

Lines changed: 62 additions & 0 deletions
@@ -228,3 +228,65 @@ create-agent-json:
   artifacts:
     paths:
       - artifacts.json
+
+create-chart-json:
+  rules:
+    - if: '$CI_COMMIT_TAG =~ /^chart\/v\d+\.\d+\.\d+$/'
+  needs:
+    - job: bootstrap
+      artifacts: true
+  script:
+    - |
+      cat > artifacts.json << EOF
+      {
+        "skyhook": {
+          "source": {
+            "org": "${NGC_PRIVATE_ORG}",
+            "team": "skyhook"
+          },
+          "target": {
+            "org": "nvidia",
+            "team": "skyhook"
+          },
+          "artifacts": [
+            {
+              "name": "chart",
+              "version": "${CI_COMMIT_TAG#chart/}",
+              "type": "chart"
+            }
+          ],
+          "nspect_id": "${OPERATOR_NSPECT_ID}"
+        }
+      }
+      EOF
+  artifacts:
+    paths:
+      - artifacts.json
+
+publish-chart:
+  stage: deploy
+  rules:
+    ## on commit to main publish a chart to our dev registry
+    - if: $CI_COMMIT_REF_NAME == $CI_DEFAULT_BRANCH
+      needs: [bootstrap]
+      variables:
+        ENV: dev
+        VERSION: $CI_COMMIT_SHORT_SHA
+        REGISTRY: swgpu-baseos
+        URL: helm.ngc.nvidia.com/nvidian/swgpu-baseos
+    ## on tag publish a chart to the prod staging registry
+    - if: '$CI_COMMIT_TAG =~ /^chart\/v\d+\.\d+\.\d+$/'
+      needs: [bootstrap]
+      variables:
+        ENV: prod
+        REGISTRY: staging
+        URL: helm.ngc.nvidia.com/staging
+  variables:
+    USERNAME: "$$oauthtoken"
+    PASSWORD: ${NVCR_REGISTRY_PASSWORD}
+    ARGS: ""
+  script:
+    - if [ "${ENV}" == "dev" ]; then ARGS="--version $(date +%Y.%m.%d)-${CI_COMMIT_SHORT_SHA}"; fi
+    - /workspace/bin/helm repo add ${REGISTRY} https://${URL} --username="${USERNAME}" --password=${PASSWORD}
+    - /workspace/bin/helm package chart ${ARGS}
+    - /workspace/bin/helm cm-push $(ls skyhook-operator-*.tgz) ${REGISTRY}
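
The two chart jobs above rely on a tag pattern (`chart/vX.Y.Z`), a prefix-stripping expansion (`${CI_COMMIT_TAG#chart/}`), and a date-plus-SHA version for dev builds. The following is a minimal shell sketch of that string handling, not part of the commit. The function names `chart_version_from_tag` and `dev_chart_version` are hypothetical, and the regex is rewritten as a POSIX ERE because `grep -E` does not understand `\d`.

```shell
#!/bin/sh
# Illustrative sketch only; these helpers are hypothetical and not in the repository.

# Print the chart version for a chart/vX.Y.Z tag; return 1 for any other tag.
chart_version_from_tag() {
  tag=$1
  # Same shape as the CI rule /^chart\/v\d+\.\d+\.\d+$/, written as a POSIX ERE.
  if printf '%s\n' "$tag" | grep -Eq '^chart/v[0-9]+\.[0-9]+\.[0-9]+$'; then
    # Same expansion the job uses: only "chart/" is stripped, so the "v" stays.
    printf '%s\n' "${tag#chart/}"
  else
    return 1
  fi
}

# Build a dev chart version from a date (YYYY.MM.DD) and a short commit SHA.
dev_chart_version() {
  printf '%s-%s\n' "$1" "$2"
}

chart_version_from_tag "chart/v1.2.3"      # prints v1.2.3
dev_chart_version "2025.01.31" "ef02897"   # prints 2025.01.31-ef02897
```

Note that the version written into `artifacts.json` keeps its leading `v`, which matches the `v`-prefixed `version` this commit sets in `chart/Chart.yaml`.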

README.md

Lines changed: 2 additions & 0 deletions
@@ -125,6 +125,8 @@ The stages are applied in this order:
 **Semantic versioning is strictly enforced in the operator** in order to support upgrade and uninstall. Semantic versioning allows the
 operator to know which direction a package version is moving (upgrade or downgrade) while also enforcing good versioning practices.
 
+**For detailed information about our versioning strategy, git tagging conventions, and component release process, see [docs/versioning.md](docs/versioning.md) and [docs/release-process.md](docs/release-process.md).**
+
 ## Packages
 Part of how the operator works is the [skyhook-agent](agent/README.md). Packages have to be created in a way that lets the operator know how to use them. This is where the agent comes into play; more on that later. A package is a container that meets these requirements:

chart/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.7.0
+version: v0.8.0
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: "0.7.0"
+appVersion: v0.8.0
 ## TODO: not sure how we want to manage this version
 #kubeVersion: ">=1.28.0"
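
Because the chart is released by pushing a `chart/vX.Y.Z` tag, the `version` in `Chart.yaml` and the tag can drift apart. The sketch below shows one way to compare them before tagging. It is an assumption about a useful pre-tag check, not something this commit adds, and the helper names `chart_yaml_version` and `tag_matches_chart` are hypothetical. It reads the file with `sed` rather than a YAML parser, so it only handles a simple top-level `version:` line.

```shell
#!/bin/sh
# Illustrative sketch only; these helpers are hypothetical and not in the repository.

# Print the value of the top-level `version:` key from a Chart.yaml-style file.
# Lines such as `appVersion:` do not match because the key must start the line.
chart_yaml_version() {
  sed -n 's/^version:[[:space:]]*//p' "$1" | head -n 1 |
    sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' | tr -d "\"'"
}

# Succeed only if the tag (chart/vX.Y.Z) names the same version as the file.
tag_matches_chart() {
  [ "${1#chart/}" = "$(chart_yaml_version "$2")" ]
}

# Example use from the repository root:
#   tag_matches_chart "chart/v0.8.0" chart/Chart.yaml && echo "ok to tag"
```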

chart/README.md

Lines changed: 13 additions & 1 deletion
@@ -28,6 +28,10 @@ Settings | Description | Default |
 | controllerManager.manager.env.logLevel | Log level for the operator controller. If you want more or fewer logs, change this value to "debug" or "error". | "info" |
 | controllerManager.manager.env.reapplyOnReboot | Reapply the packages on reboot. This is useful for systems that are read-only. | "false" |
 | controllerManager.manager.env.runtimeRequiredTaint | This feature assumes nodes are added to the cluster with the `--register-with-taints` kubelet flag. The taint is assumed to be on all new nodes; Skyhook pods tolerate it and remove it once the node's packages are complete. | skyhook.nvidia.com=runtime-required:NoSchedule |
+| controllerManager.manager.image.repository | Where to pull the operator image from | "ghcr.io/nvidia/skyhook/operator" |
+| controllerManager.manager.image.tag | Which version of the operator to run | defaults to appVersion |
+| controllerManager.manager.agent.repository | Where to pull the agent image from | "ghcr.io/nvidia/skyhook/agent" |
+| controllerManager.manager.agent.tag | Which version of the agent to run | defaults to the most recent agent release, pinned explicitly rather than `latest` (for example, v6.1.5) |
 | imagePullSecret | the secret used to pull the operator controller image, agent image, and package images. | node-init-secret |
 | estimatedPackageCount | estimated number of packages to be installed on the cluster, this is used to calculate the resources for the operator controller. | 1 |
 | estimatedNodeCount | estimated number of nodes in the cluster, this is used to calculate the resources for the operator controller | 1 |
@@ -38,4 +42,12 @@ Settings | Description | Default |
 - **CRD**: This project currently has one CRD, and it is not managed the ["recommended" way](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/). It is part of the templates, meaning it is updated by `helm upgrade`. We decided this was the better approach for this project. Either way has consequences, and this route has worked well for upgrades in our deployments so far.
 
 ### Resource Management
-Skyhook uses Kubernetes LimitRange to set default CPU/memory requests/limits for all containers in the namespace. You can override these per-package in your Skyhook CR. Strict validation is enforced. See [../docs/resource_management.md](../docs/resource_management.md) for details and examples.
+Skyhook uses Kubernetes LimitRange to set default CPU/memory requests/limits for all containers in the namespace. You can override these per-package in your Skyhook CR. Strict validation is enforced. See [../docs/resource_management.md](../docs/resource_management.md) for details and examples.
+
+## Versioning
+
+This Helm chart is versioned independently of the operator and agent components. The chart's `appVersion` field specifies the recommended stable operator version, which serves as the default for installations. See [../docs/versioning.md](../docs/versioning.md) for more details on versioning.
+
+### Chart Version vs App Version
+- **Chart version** (`version` in Chart.yaml): Tracks changes to chart templates, values, and configuration (NOTE: the agent version is set in the values.)
+- **App version** (`appVersion` in Chart.yaml): Recommended stable operator version for this chart release

chart/templates/deployment.yaml

Lines changed: 2 additions & 3 deletions
@@ -76,11 +76,10 @@ spec:
         - name: PAUSE_IMAGE
           value: {{ quote .Values.controllerManager.manager.env.pauseImage }}
         - name: AGENT_IMAGE
-          value: {{ quote .Values.controllerManager.manager.env.agentImage }}
+          value: {{ .Values.controllerManager.manager.agent.repository }}:{{ .Values.controllerManager.manager.agent.tag}}
         - name: KUBERNETES_CLUSTER_DOMAIN
           value: {{ quote .Values.kubernetesClusterDomain }}
-        image: {{ .Values.controllerManager.manager.image.repository }}:{{ .Values.controllerManager.manager.image.tag
-          | default .Chart.AppVersion }}
+        image: {{ .Values.controllerManager.manager.image.repository }}:{{ .Values.controllerManager.manager.image.tag | default .Chart.AppVersion }}
         livenessProbe:
           httpGet:
             path: /healthz
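
The `image:` line in this template falls back to `.Chart.AppVersion` when `image.tag` is empty, because Helm's `default` function treats an empty string as unset. The shell sketch below imitates that fallback with `${tag:-...}` to make the resulting image reference concrete. The function name `resolve_image` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Illustrative sketch only; resolve_image is hypothetical and not in the repository.

# Join repository and tag, falling back to the app version when the tag is empty.
resolve_image() {
  repository=$1
  tag=$2
  app_version=$3
  # ${tag:-...} substitutes when tag is unset or empty, which is how the
  # template's `default` behaves for an empty string.
  printf '%s:%s\n' "$repository" "${tag:-$app_version}"
}

resolve_image "ghcr.io/nvidia/skyhook/operator" "" "v0.8.0"
# prints ghcr.io/nvidia/skyhook/operator:v0.8.0
resolve_image "ghcr.io/nvidia/skyhook/operator" "v0.7.0" "v0.8.0"
# prints ghcr.io/nvidia/skyhook/operator:v0.7.0
```

The agent image has no such fallback in this template: `AGENT_IMAGE` is built directly from `agent.repository` and `agent.tag`, so an empty `agent.tag` would yield a reference ending in a bare colon.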

chart/values.yaml

Lines changed: 6 additions & 3 deletions
@@ -60,11 +60,14 @@ controllerManager:
       runtimeRequiredTaint: skyhook.nvidia.com=runtime-required:NoSchedule
       ## pauseImage: is the image used for the pause container in the operator controller.
       pauseImage: registry.k8s.io/pause:3.10
-      ## agentImage: is the image used for the agent container. This image is the default for this install, but can be overridden in the CR at package level.
-      agentImage: ghcr.io/nvidia/skyhook/agent:latest
     image:
       repository: ghcr.io/nvidia/skyhook/operator
-      tag: latest
+      tag: "" ## if omitted, default to the chart appVersion
+    ## agentImage: is the image used for the agent container. This image is the default for this install, but can be overridden in the CR at package level.
+    agent:
+      repository: ghcr.io/nvidia/skyhook/agent
+      tag: "v6.1.5"
+
     # resources: If this is defined it will override the default calculation for resources
     # from estimatedNodeCount and estimatedPackageCount. The below values are
     # what will be calculated until nodes > 1000 and packages 1-2 or nodes > 500 and packages >= 4

docs/release-process.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+# Skyhook Release Process
+
+Step-by-step process for releasing Skyhook components (operator, agent, chart).
+
+## Release Workflow
+
+### Operator Release
+
+```bash
+# 1. Test thoroughly, merge all PRs to main
+# 2. Tag and push
+git checkout main && git pull origin main
+git tag operator/v1.2.3
+git push origin operator/v1.2.3
+```
+
+**Automated:** Tests → Multi-platform build → Publish to ghcr.io + nvcr.io → Attestations
+
+### Agent Release
+
+```bash
+# 1. Test agent compatibility, merge all PRs to main
+# 2. Tag and push
+git checkout main && git pull origin main
+git tag agent/v1.2.3
+git push origin agent/v1.2.3
+```
+
+**Automated:** Tests → Multi-platform build → Publish to ghcr.io + nvcr.io
+
+### Chart Release
+
+```bash
+# 1. Update Chart.yaml versions
+# chart/Chart.yaml:
+#   version: v1.2.3      # Chart version
+#   appVersion: v0.8.0   # Recommended operator version
+
+# 2. Create PR and merge
+git checkout -b release/chart-v1.2.3
+git add chart/Chart.yaml
+git commit -m "chart: bump version to v1.2.3"
+git push origin release/chart-v1.2.3
+# Review and merge PR
+
+# 3. Tag after merge
+git checkout main && git pull origin main
+git tag chart/v1.2.3
+git push origin chart/v1.2.3
+```
+
+**Automated:** Package Helm chart → Publish to chart repository (when implemented)
+
+## Release Checklist
+
+**Before tagging:**
+- [ ] All PRs merged to main
+- [ ] For charts: Chart.yaml updated and merged
+- [ ] Tests passing
+- [ ] Documentation updated
+
+**After tagging:**
+- [ ] CI/CD pipeline completes
+- [ ] Images published successfully
+- [ ] Test deployment with new version
+
+## Common Commands
+
+```bash
+# Check current tags
+git tag -l 'operator/v*' --sort=-v:refname | head -5
+git tag -l 'agent/v*' --sort=-v:refname | head -5
+git tag -l 'chart/v*' --sort=-v:refname | head -5
+
+# See what will be included in the next tag
+git log --oneline $(git tag -l 'operator/v*' --sort=-v:refname | head -1)..HEAD
+
+# Delete a tag if needed (before CI runs)
+git tag -d operator/v1.2.3
+git push origin :refs/tags/operator/v1.2.3
+```
+
+## Rollback
+
+For problematic releases:
+1. Tag a new patch release with fixes
+2. For critical issues: Update the chart `appVersion` to the previous stable version
+
+See [versioning.md](versioning.md) for version strategy details.
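
Each release in the process above starts by choosing the next `vMAJOR.MINOR.PATCH` number for a component. The shell sketch below computes that number from the current one. It is an illustration under the assumption that versions are plain three-part numbers with no pre-release suffix, and the function name `bump_version` is hypothetical rather than a script in the repository.

```shell
#!/bin/sh
# Illustrative sketch only; bump_version is hypothetical and not in the repository.

# Print the next version after bumping one part of vMAJOR.MINOR.PATCH.
bump_version() {
  version=${1#v}          # drop the leading "v"
  part=$2
  major=${version%%.*}    # text before the first dot
  rest=${version#*.}      # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  case $part in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  printf 'v%s.%s.%s\n' "$major" "$minor" "$patch"
}

bump_version v1.2.3 minor   # prints v1.3.0
```

The result can be combined with the tag prefixes used in this document, for example `git tag "operator/$(bump_version v1.2.3 patch)"`.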

docs/versioning.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# Skyhook Versioning Strategy
+
+Skyhook uses independent versioning for three components, all following [Semantic Versioning](https://semver.org/):
+
+1. **Operator** - Kubernetes operator that manages Skyhook resources
+2. **Agent** - Container that executes package operations on nodes
+3. **Chart** - Helm chart for deploying the operator
+   1. NOTE: The chart version also covers the expected agent version.
+   2. NOTE: Changing either the operator or the agent will generally involve a chart release too.
+
+## Git Tagging Convention
+
+```text
+operator/v{version} # Operator releases
+agent/v{version} # Agent releases
+chart/v{version} # Chart releases
+```
+
+## Component Versioning
+
+### Operator & Agent
+- **Independent versioning** with their own release cycles
+- **Semantic versioning**: MAJOR.MINOR.PATCH
+- **Compatibility**: Maintained through well-defined interfaces
+
+### Helm Chart
+- **Independent from operator/agent** (starting at v0.8.0, the version numbers will begin to diverge)
+- **Chart version** (`version`): Tracks chart template/config changes
+- **App version** (`appVersion`): Recommended stable operator version
+
+## Chart Behavior
+
+### Chart.yaml
+Example:
+```yaml
+version: v1.0.0 # Chart version (independent)
+appVersion: v0.7.0 # Recommended operator version
+```
+
+### Image Tag Defaults
+```yaml
+# values.yaml
+image:
+  tag: "" # Empty = defaults to Chart.AppVersion
+
+# Template renders as:
+image: "ghcr.io/nvidia/skyhook/operator:v0.7.0"
+```
+
+## Quick Reference
+
+```bash
+# Check deployed versions
+kubectl get deployment -n skyhook -o jsonpath='{.items[0].spec.template.spec.containers[0].image}'
+helm list -n skyhook
+
+# Override operator version
+helm install skyhook ./chart --set controllerManager.manager.image.tag="v0.8.0"
+```
+
+## Release Process
+
+For step-by-step instructions on how to release components, see [release-process.md](release-process.md).
+
+**CI/CD triggers on git tags:**
+- `operator/vx.y.z` → publishes operator image
+- `agent/vx.y.z` → publishes agent image
+- `chart/vx.y.z` → publishes helm chart
+
+**Chart versioning:**
+- **PATCH**: Bug fixes, docs
+- **MINOR**: New features, config options
+- **MAJOR**: Breaking changes to the chart, or to its compatibility with the agent or operator.
+
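
The tag-to-artifact mapping listed under "CI/CD triggers on git tags" can be expressed as a small lookup. The shell sketch below is illustrative only: the function name `release_target` is hypothetical, and the real routing lives in the CI configuration rather than in a script like this.

```shell
#!/bin/sh
# Illustrative sketch only; release_target is hypothetical and not in the repository.

# Print which artifact a release tag would publish; return 1 if unrecognized.
release_target() {
  tag=$1
  # Require the <component>/vX.Y.Z shape before looking at the component.
  printf '%s\n' "$tag" |
    grep -Eq '^(operator|agent|chart)/v[0-9]+\.[0-9]+\.[0-9]+$' || return 1
  # ${tag%%/*} keeps only the text before the first slash.
  case ${tag%%/*} in
    operator) echo "operator image" ;;
    agent)    echo "agent image" ;;
    chart)    echo "helm chart" ;;
  esac
}

release_target "agent/v6.1.5"   # prints agent image
```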
