docs(skills): initial conversion of GPU Operators skills #401
chenopis
left a comment
Documentation Review — 15 findings (6 critical)
This PR is a first-pass conversion of GPU Operator RST docs into Agent Skills (SKILL.md + reference .md) plus Sphinx meta:: blocks for skill-discoverable metadata on the existing RSTs. The Skill-format direction is great, but the conversion script that produced the new files has multiple systemic bugs that make the resulting content broken or misleading.
The findings below are organized by root cause so you can fix them in the converter once and re-run, rather than file-by-file. Where a class of issue recurs many times, I've posted one inline finding on a representative location and listed the full set in that comment.
Critical (broken content / wrong information)
- gpu-operator-install-ing-nvidia/SKILL.md: Lost prerequisites (only 1 of 5+ items survived the conversion).
- gpu-operator-install-ing-nvidia/SKILL.md: .. literalinclude:: content silently dropped ("Create a file with contents like the following:" then no example). Repeats in at least 11 places across 6 SKILLs.
- gpu-operator-nvidia-google/SKILL.md: YAML fragment leaking out of code block at file top.
- gpu-operator-nvidia-google/SKILL.md: Mangled table (4 columns merged into 1 header row).
- gpu-operator-references/SKILL.md: Description routes wrong for the umbrella references skill (claims confidential-containers only; loads 7 references).
- gpu-operator-install-ing-nvidia/SKILL.md: Skill name gpu-operator-install-ing-nvidia is a converter typo for "installing"; this is a top-level routing identity.
High (broken cross-refs and missing assets)
- Broken :external+...:doc: Sphinx role leaking as raw text. 9 instances across 7 SKILLs.
- pstai_ RST hyperlink target leaking in references/overview.md license table. 16 occurrences.
- references/overview.md: Missing image asset (graphics/nvidia-gpu-operator-image.jpg).
- Lost :ref: / :doc: cross-references as bare text. Pattern recurs 40+ times across SKILLs.
Medium (style/structure)
- references/security.md:26, duplicated phrase.
- references/overview.md:43, typo identifieis.
- Trigger keywords - … suffix in description frontmatter: not part of Agent Skills spec. 25 SKILLs affected.
- Step N: prefix on every H2 regardless of procedural intent. 23 SKILLs affected.
- .. note:: / .. tip:: admonitions flattened to bare **Tip:** bold. All SKILLs affected.
Pre-existing (not regressions)
The deterministic style scanner reports 42 findings (latinisms e.g., via; banned marketing words simple, simply; contractions It's, let's). These were in the source RSTs — flagging here only because once the docs are converted to Skills they'll start being executed by agents, where prose-quality regressions matter more than they did in the upstream rendered HTML.
Critical issues must be resolved before merge.
Review generated with AI assistance using DORI ::pr.
| @@ -0,0 +1,493 @@ | |||
| --- | |||
| name: "gpu-operator-install-ing-nvidia" | |||
Critical: skill name is a converter typo (install-ing) — root_cause: weird-skill-naming
The converter appears to have split "installing" on a hyphen boundary, producing the broken identifier gpu-operator-install-ing-nvidia. This is the skill's top-level routing name: it's what agents use to address the skill and what other SKILL.md files cross-reference (e.g. (use the gpu-operator-install-... skill)).
Suggested rename: gpu-operator-getting-started (matches the source page getting-started.rst and the :description-agent: text in the corresponding .. meta:: block) or gpu-operator-install (shorter, matches install-gpu-operator-* sibling pages).
Whichever you pick, the directory name (gpu-operator-install-ing-nvidia/), the name: frontmatter field, and any (use the gpu-operator-install-ing-nvidia skill) references in other SKILL.md files all need to update together.
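One way to confirm the rename landed everywhere is to scan the skills tree for the old identifier. The following is a minimal sketch, not part of this PR: the function name find_stale_references and the assumption that all skill content lives in .md files under a single root are hypothetical.

```python
from pathlib import Path


def find_stale_references(root: Path, old_name: str) -> list[tuple[Path, int]]:
    """Return (file, line_number) pairs that still mention old_name.

    A line number of 0 means the file's parent directory name contains old_name.
    """
    hits: list[tuple[Path, int]] = []
    for path in sorted(root.rglob("*.md")):
        # Directory names carry the skill identity too, so report them as well.
        if old_name in path.parent.name:
            hits.append((path, 0))
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if old_name in line:
                hits.append((path, number))
    return hits
```

An empty result after the rename suggests the directory, the name: field, and the cross-references were updated together.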
| # Prerequisites | ||
| 1. You have the `kubectl` and `helm` CLIs available on a client machine. |
Critical: prerequisites silently truncated — root_cause: lost-content
This SKILL.md was converted from gpu-operator/getting-started.rst, which lists ~5 prerequisite items:
1. kubectl and helm CLIs (kept here)
2. ClusterPolicy / driver / OS-version constraints (lost)
3. Container engine (CRI-O or containerd) on every node (lost)
4. PSA pod-security.kubernetes.io/enforce=privileged labeling (lost)
5. NFD already running and how to detect it (lost)
The converter dropped items 2–5 silently. Following this skill as written, an agent (or human) would attempt the install on a cluster missing any of those preconditions and the install would fail in confusing ways.
The original prereqs are at gpu-operator/getting-started.rst:50-90 in the source RST. Restore them in this SKILL.md.
| You can perform the following steps to deploy Jupyter Notebook in your cluster: | ||
| 1. Create a file, such as `tf-notebook.yaml`, with contents like the following example: |
Critical: .. literalinclude:: content silently dropped — root_cause: dropped-literalinclude
The conversion script lost the actual example content where the source uses .. literalinclude::. Here at line 426–428: "Create a file, such as tf-notebook.yaml, with contents like the following example:" jumps straight to step 2 ("Apply the manifest") with no manifest in between. The source RST (getting-started.rst:709) had .. literalinclude:: ./manifests/input/tf-notebook.yaml. The Jupyter Notebook tutorial in this skill is now unrunnable.
This pattern recurs in at least 11 places across 6 SKILLs:
| File | Line | Asset |
|---|---|---|
| gpu-operator-install-ing-nvidia/SKILL.md | 426 | tf-notebook.yaml |
| gpu-operator-multiinstance/SKILL.md | 357 | custom-mig-config.yaml |
| gpu-operator-nvidia-amazon/SKILL.md | 116 | cluster-config.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 179 | nvd-all.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 209 | nvd-driver-multiple.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 227 | nvd-precompiled-all.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 254 | nvd-precomiled-some.yaml |
| gpu-operator-nvidia-google/SKILL.md | 69 | gpu-operator-quota.yaml |
| gpu-operator-nvidia-google/SKILL.md | 178 | gpu-operator-quota.yaml |
| gpu-operator-timeslicing-gpus/SKILL.md | 154, 187, 351 | three time-slicing configs |
Fix the converter to inline the contents of every .. literalinclude:: target as a fenced code block (with the right language tag from the original :language: option), then re-run.
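As a rough illustration of that fix, here is a minimal sketch of a pre-processing pass that inlines each directive before the rest of the conversion runs. The converter's real structure is not visible in this PR, so the function name inline_literalincludes and the base_dir parameter are assumptions. The sketch only honors the :language: option and ignores options such as :lines: or :emphasize-lines:.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# A literalinclude directive, optionally indented, followed by its option lines.
LITERALINCLUDE = re.compile(
    r"^(?P<indent>[ \t]*)\.\. literalinclude::[ \t]+(?P<path>\S+)[ \t]*\n"
    r"(?P<opts>(?:[ \t]+:[\w-]+:.*\n)*)",
    re.MULTILINE,
)


def inline_literalincludes(rst_text: str, base_dir: Path) -> str:
    """Replace each literalinclude directive with a fenced block of the file's contents."""

    def replace(match: re.Match) -> str:
        target = base_dir / match.group("path")
        if not target.is_file():
            # Fail loudly: a silent skip is the bug being fixed.
            raise FileNotFoundError(f"literalinclude target not found: {target}")
        language = re.search(r":language:[ \t]*(\S+)", match.group("opts"))
        tag = language.group(1) if language else ""
        body = target.read_text(encoding="utf-8").rstrip("\n")
        lines = [FENCE + tag, *body.splitlines(), FENCE]
        indent = match.group("indent")
        # Keep the directive's indentation so the block stays inside its list item.
        return "\n".join(indent + line if line else line for line in lines) + "\n"

    return LITERALINCLUDE.sub(replace, rst_text)
```

Raising on a missing target, rather than skipping it, turns the silent content loss described above into a visible conversion failure.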
| * You installed and initialized the Google Cloud CLI. | ||
| - name: RUNTIME_CONFIG_SOURCE |
Critical: YAML fragment leaking out of code block — root_cause: escape-out-of-context-yaml
This line appears as bare text between the (lone) prerequisite at line 11 and the H1 at line 15:
- name: RUNTIME_CONFIG_SOURCE
It's a fragment from a YAML example that escaped its container during conversion. There's no surrounding code block, so it renders as a markdown list item with no context. Almost certainly the rest of the YAML (and the surrounding prose) was also lost.
| - name: RUNTIME_CONFIG_SOURCE |
Remove this stray line and check the source gpu-operator/google-gke.rst for the YAML block this fragment escaped from — odds are other content was lost in the same place.
| The choice depends on the operating system and whether you prefer to have the Operator manage all the software components. | ||
| | Google Driver Installer - | Container-Optimized OS | Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. | |
Critical: mangled table — root_cause: garbled-table
The table at lines 25–27 has its 4 columns collapsed into one header row. The first row reads:
| Google Driver Installer - | Container-Optimized OS | Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. |
| --- | --- | --- | --- |
| NVIDIA Driver Manager - | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. | |
The original RST list-table had columns: Approach, Operating System(s), Description. The converter has merged the approach name and the first OS column into one cell, then put the description in column 4 — and on the second row it dropped one cell entirely so the row has the wrong column count.
Fix the converter's list-table handling and re-emit. As-is, this is unreadable.
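A minimal sketch of list-table handling that would avoid this failure mode follows. It assumes the directive body uses the usual "* - cell" / "  - cell" layout and that a nested bullet list inside a cell can be flattened to a comma-separated string. The function names and the sample rows in the usage note are hypothetical, and the source table in google-gke.rst may be laid out differently.

```python
import re

ROW_START = re.compile(r"^(\s*)\* -\s*(.*)$")
CELL_START = re.compile(r"^(\s*)-\s*(.*)$")


def parse_list_table(body: str) -> list[list[str]]:
    """Parse the body of an RST list-table into rows of cell strings."""
    rows: list[list[str]] = []
    cell_col = None  # column where this table's cell markers ("- ") start
    for line in body.splitlines():
        if not line.strip():
            continue
        row = ROW_START.match(line)
        cell = CELL_START.match(line)
        if row:
            cell_col = len(row.group(1)) + 2
            rows.append([row.group(2).strip().removeprefix("- ")])
        elif cell and rows and len(cell.group(1)) == cell_col:
            rows[-1].append(cell.group(2).strip().removeprefix("- "))
        elif rows:
            # Deeper indentation: a nested bullet or a wrapped line of the current cell.
            nested = cell is not None
            piece = cell.group(2).strip() if nested else line.strip()
            current = rows[-1][-1]
            rows[-1][-1] = piece if not current else current + (", " if nested else " ") + piece
    return rows


def to_markdown_table(rows: list[list[str]]) -> str:
    """Emit a Markdown table, refusing rows whose cell count differs from the header."""
    if not rows:
        raise ValueError("list-table has no rows")
    width = len(rows[0])
    for index, cells in enumerate(rows):
        if len(cells) != width:
            raise ValueError(f"row {index} has {len(cells)} cells, expected {width}")

    def fmt(cells: list[str]) -> str:
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    return "\n".join([fmt(rows[0]), fmt(["---"] * width), *(fmt(r) for r in rows[1:])])
```

The column-count check is the important part: with it, a row that lost a cell fails the conversion instead of producing a table with mismatched columns.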
| ## CVEs | ||
| The following is a list of known CVEs in the GPU Operator or its operands. | ||
| To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. |
Medium: duplicated phrase
The sentence at line 26 reads:
To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.
| To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. | |
| To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. |
| The base images used by the software might include software that is licensed under open-source licenses such as GPL. | ||
| The source code for these components is archived on the CUDA opensource [index](https://developer.download.nvidia.com/compute/cuda/opensource/). | ||
| The following table identifieis the licenses for the Operator and software components. |
Medium: typo identifieis → identifies
| The following table identifieis the licenses for the Operator and software components. | |
| The following table identifies the licenses for the Operator and software components. |
| @@ -0,0 +1,189 @@ | |||
| --- | |||
| name: "gpu-operator-container-device" | |||
| description: "Explains how to configure CDI and NRI support for GPU workloads. Use when enabling CDI, configuring containerd, or troubleshooting CDI-based GPU injection. Trigger keywords - NVIDIA GPU Operator, CDI, NRI, containerd, Kubernetes." | |||
Medium: Trigger keywords - … suffix in description — root_cause: description-trigger-suffix
The converter appends Trigger keywords - X, Y, Z. to the end of every skill's description field. This isn't part of the Agent Skills spec: the spec describes description as a focused, single-purpose summary capped at 1024 chars, with separate triggers and tags arrays for keyword-style routing.
The suffix bloats the description (e.g. gpu-operator-references/SKILL.md:3 runs to 17+ keywords), competes with the actual sentence-form description for the model's attention during routing, and duplicates information that should live under triggers: / tags:.
25 SKILLs affected (every SKILL.md in the PR except the 7 that already had short descriptions). Recommend dropping the Trigger keywords - … suffix from every description and instead emitting triggers: and tags: arrays in the frontmatter (the upstream RSTs already supply them via the :tags: and :keywords: fields in their .. meta:: blocks).
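For descriptions that were already generated, the suffix can be split off mechanically. The sketch below is illustrative: split_description is a hypothetical helper, and it only returns the extracted keywords; which frontmatter field they should be written to is left to the spec and the converter.

```python
import re

# The "Trigger keywords - a, b, c." tail that the converter appends to a description.
TRIGGER_SUFFIX = re.compile(r"\s*Trigger keywords\s*-\s*(?P<keywords>.+?)\.?\s*$")


def split_description(description: str) -> tuple[str, list[str]]:
    """Return (description without the suffix, list of extracted keywords)."""
    match = TRIGGER_SUFFIX.search(description)
    if not match:
        return description.strip(), []
    keywords = [k.strip() for k in match.group("keywords").split(",") if k.strip()]
    return description[: match.start()].strip(), keywords
```

Applied to the description quoted above, this would leave the two sentence-form sentences in place and return the five keywords as a list.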
| # NVIDIA GPU Operator with Amazon EKS | ||
| ## Step 1: Approaches for Working with Amazon EKS |
Medium: Step N: prefix on every H2 — root_cause: awkward-step-numbering
The converter prefixes every H2 heading with Step 1:, Step 2:, … regardless of whether the H2 is procedural. Here:
## Step 1: Approaches for Working with Amazon EKS
…the section discusses two alternative approaches; it isn't "Step 1" of anything. Same pattern appears as "Step 1: About Multi-Instance GPU", "Step 1: HTTP Proxy Configuration for Openshift", "Step 1: Special Considerations for Service Meshes", etc.
23 SKILLs affected. The Step N: prefix is also inconsistent with how procedural steps are actually written inside each H2 (numbered list items 1., 2., …). Recommend dropping the auto-Step N: prefix from H2s. If a SKILL.md genuinely needs ordered top-level phases, structure them as a numbered list rather than as numbered H2s.
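If the prefix has to be removed from files that were already generated, a small post-processing pass would do. This is a sketch under assumptions: strip_step_prefixes is a hypothetical name, and it only touches H2 lines outside fenced code blocks.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
STEP_PREFIX = re.compile(r"^(## )Step \d+:[ \t]*")


def strip_step_prefixes(markdown: str) -> str:
    """Remove an auto-generated 'Step N:' prefix from H2 headings."""
    out = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced code block
        elif not in_fence:
            line = STEP_PREFIX.sub(r"\1", line)
        out.append(line)
    return "".join(out)
```

The cleaner fix is still to stop emitting the prefix in the converter; this pass is only useful for checking what the output would look like without it.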
| 1) Container images need to be pulled during GPU Operator installation. | ||
| 2) The `driver` container needs to download several OS packages prior to driver installation. | ||
| **Tip:** |
Medium: .. note:: / .. tip:: admonitions flattened to bold — root_cause: flat-bold-admonitions
The converter renders RST admonitions as a bare bold label followed by paragraph body, e.g.:
**Tip:**
Using precompiled-drivers removes the need for the `driver` containers to
download operating system packages.
Nothing visually distinguishes this from any other paragraph; the admonition's callout semantics are lost. Every SKILL.md in the PR is affected (typically 2–10 admonitions per file).
Recommend converting RST admonitions to GitHub-flavored Markdown alerts (> [!NOTE], > [!TIP], > [!WARNING], > [!IMPORTANT], > [!CAUTION]):
| **Tip:** | |
| > [!TIP] | |
| > Using precompiled-drivers removes the need for the `driver` containers to | |
| > download operating system packages. |
If this skill format is not intended to render in GitHub, pick whatever admonition syntax the target Skill viewer supports — but keep the admonition class (Note / Tip / Warning) machine-readable so downstream renderers can style them.
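A minimal sketch of that conversion, assuming the GitHub alert syntax is the chosen target. The name convert_admonitions is hypothetical, and the sketch handles only top-level (unindented) admonitions from the five classes listed above; other directives pass through unchanged.

```python
import re

ALERT_KINDS = {"note", "tip", "warning", "important", "caution"}
DIRECTIVE = re.compile(r"^\.\. (?P<kind>[a-z]+)::[ \t]*(?P<first>.*)$")


def convert_admonitions(rst_text: str) -> str:
    """Rewrite top-level RST admonitions as GitHub-flavored Markdown alerts."""
    lines = rst_text.splitlines()
    out: list[str] = []
    i = 0
    while i < len(lines):
        match = DIRECTIVE.match(lines[i])
        if not match or match.group("kind") not in ALERT_KINDS:
            out.append(lines[i])
            i += 1
            continue
        body = [match.group("first").strip()]
        i += 1
        # The directive body is every following blank or indented line.
        while i < len(lines) and (not lines[i].strip() or lines[i][0] in " \t"):
            body.append(lines[i].strip())
            i += 1
        # Drop blank lines at both ends of the body.
        while body and not body[0]:
            body.pop(0)
        while body and not body[-1]:
            body.pop()
        out.append(f"> [!{match.group('kind').upper()}]")
        out.extend(f"> {text}".rstrip() for text in body)
        out.append("")  # restore the blank line that ended the directive
    return "\n".join(out)
```

Because the alert class is taken directly from the directive name, the Note / Tip / Warning distinction stays machine-readable in the output.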
@miyoungc thanks for opening this. I'm curious about how these will be maintained going forward. Also, it looks like all new pages need the metadata now?
@a-mccarthy |