Closed
100 changes: 47 additions & 53 deletions concepts/NVIDIA-Operator.xml
@@ -1,103 +1,96 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- This file originates from the project https://github.com/openSUSE/doc-kit
-->
<!-- This file originates from the project https://github.com/openSUSE/doc-kit -->
<!-- This file can be edited downstream. -->
<!DOCTYPE topic [ <!ENTITY % entities SYSTEM "../common/generic-entities.ent">
%entities; ]>
<!DOCTYPE topic
[
<!ENTITY % entities SYSTEM "../common/generic-entities.ent">
%entities;
]>
<topic xml:id="nvidia-operator" role="concept" xml:lang="en"
xmlns="http://docbook.org/ns/docbook" version="5.2"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:trans="http://docbook.org/ns/transclusion">
xmlns="http://docbook.org/ns/docbook" version="5.2"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:trans="http://docbook.org/ns/transclusion">
<info>
<title>Introduction to the &nvoperator;</title>
<meta name="maintainer" content="[email protected]" its:translate="no"/>
<abstract>
<para>
This article explains the &nvoperator;, outlines the &nvidia; GPU
components it manages, and summarizes the benefits of using it.
This article explains the &nvoperator;. It outlines the &nvidia; GPU
components it manages. It also summarizes the benefits of using it.
</para>
</abstract>
</info>
<section xml:id="what-is-nvidia-operator">
<title>What is the &nvoperator;?</title>
<para>
The &nvoperator; is a &kube; operator that simplifies the management and
deployment of &nvidia; GPU resources in a &kube; cluster. It automates the
configuration and monitoring of &nvidia; GPU drivers, as well as
associated components like CUDA, container runtimes, and other GPU-related
software.
The &nvoperator; is a &kube; operator. It simplifies management and
deployment of &nvidia; GPUs. These resources are in a &kube; cluster. The
operator automates configuration and monitoring of &nvidia; GPU drivers.
It also manages associated components. These include CUDA, container
runtimes, and other GPU-related software.
</para>
</section>
<section xml:id="how-does-nvidia-operator-work">
<title>How does the &nvoperator; work?</title>
<title>How the &nvoperator; works</title>
<para>
The &nvoperator; follows this workflow:
</para>
<orderedlist>
<listitem>
<para>
<emphasis role="bold">Operator deployment.</emphasis> The &nvidia;
Operator is deployed as a &helm; chart or using &kube; manifests.
<emphasis role="bold">Operator deployment.</emphasis> You can deploy
the &nvoperator; as a &helm; chart. You can also use &kube; manifests.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Node labeling &amp; GPU discovery.</emphasis>
Once installed, the operator deploys the <emphasis>GPU Feature
Discovery</emphasis> (GFD) daemon, which scans the hardware on each
node for &nvidia; GPUs. It labels nodes with GPU-specific information,
making it easier for &kube; to schedule GPU workloads based on
available hardware.
<emphasis role="bold">Node labeling and GPU discovery.</emphasis> Once
installed, the operator deploys the GPU Feature Discovery (GFD)
daemon. GFD scans node hardware for &nvidia; GPUs. It labels nodes
with GPU-specific information. This makes it easier for &kube; to
schedule GPU workloads. Scheduling is based on available hardware.
</para>
</listitem>
<!--
<listitem>
<para>
<emphasis role="bold">&nvidia; driver installation.</emphasis> The
operator ensures that the appropriate &nvidia; drivers are installed
on the cluster nodes.
</para>
</listitem> -->
<listitem>
<para>
<emphasis role="bold">NVIDIA Container Toolkit
<emphasis role="bold">&nvidia; Container Toolkit
configuration.</emphasis> The GPU operator installs and configures the
&nvidia; Container Toolkit, which allows GPU-accelerated containers to
run in &kube;.
&nvidia; Container Toolkit. The toolkit allows GPU-accelerated
containers to run in &kube;.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">CUDA runtime and libraries.</emphasis> The
operator ensures that the CUDA toolkit is properly installed, making
it easier for applications requiring CUDA to work seamlessly without
manual intervention.
operator ensures the CUDA toolkit is properly installed. This makes it
easier for applications requiring CUDA to work. Applications work
seamlessly without manual intervention.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Validation and health monitoring.</emphasis>
After setting up the environment, the operator continuously monitors
the health of the GPU resources. It also exposes health metrics for
administrators to view and use for decision-making.
After setup, the operator monitors the health of GPU resources. It
also exposes health metrics. Administrators can use these metrics for
decision-making.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Scheduling GPU workloads.</emphasis> Once the
environment is configured, you can install workloads that require GPU
acceleration. &kube; will use the node labels and available GPU
resources to schedule these jobs on GPU-enabled nodes automatically.
<emphasis role="bold">Scheduling GPU workloads.</emphasis> Once
configured, you can install workloads that require GPU acceleration.
&kube; uses node labels and available GPU resources. This automates
scheduling these jobs on GPU-enabled nodes.
</para>
</listitem>
</orderedlist>
</section>
<section xml:id="nvidia-operator-benefits">
<title>Benefits of using the &nvoperator;</title>
<title>Benefits of the &nvoperator;</title>
<para>
Using the &nvoperator; has the following key benefits:
Using the &nvoperator; has these key benefits:
</para>
<itemizedlist>
<listitem>
@@ -108,20 +101,21 @@ xmlns:trans="http://docbook.org/ns/transclusion">
</listitem>
<listitem>
<para>
<emphasis role="bold">Cluster-wide management.</emphasis> Works across
the entire &kube; cluster, scaling with node additions or removals.
<emphasis role="bold">Cluster-wide management.</emphasis> It works
across the entire &kube; cluster. It scales with node additions or
removals.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Simplified updates.</emphasis> Automates
<emphasis role="bold">Simplified updates.</emphasis> It automates
updates for GPU-related components.
</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Optimized GPU usage.</emphasis> Ensures that GPU
resources are efficiently allocated and used.
<emphasis role="bold">Optimized GPU usage.</emphasis> It ensures
efficient allocation and use of GPU resources.
</para>
</listitem>
</itemizedlist>
52 changes: 26 additions & 26 deletions tasks/keylime-enable-ima-tracking.xml
@@ -16,37 +16,39 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:trans="http://docbook.org/ns/transclusion">
<info>
<title>Enabling IMA tracking for &keylime;</title>
<title>Enabling IMA tracking</title>
<meta name="maintainer" content="[email protected]" its:translate="no"/>
<abstract>
<para>
&keylime; is a remote attestation solution that enables you to monitor
the health of remote nodes. <emphasis>Integrity management
architecture</emphasis> (IMA) is a kernel integrity subsystem that
provides a means of detecting malicious changes to files.
&keylime; is a remote attestation solution. It enables you to
monitor the health of remote nodes. Integrity Measurement
Architecture (IMA) is a kernel integrity subsystem. It provides a
means of detecting malicious changes to files.
</para>
</abstract>
</info>
<para>
When using IMA, the kernel calculates a hash of accessed files. The hash is
then used to extend the PCR 10 in the TPM and also log a list of accessed
files. The verifier can request a signed quote to the agent for PCR 10 to
get the logs of all accessed files including the file hashes. Verifiers
then compare the accessed files with a local allowlist of approved files.
If any of the hashes are not recognized, the system is considered unsafe,
and a revocation event is triggered.
When using IMA, the kernel calculates a hash of accessed files. The
kernel then uses the hash to extend the Platform Configuration
Register (PCR) 10 in the Trusted Platform Module (TPM). It also logs a
list of accessed files. The verifier can request a signed quote to
the agent for PCR 10. This retrieves the logs of all accessed files,
including the file hashes. Verifiers then compare the accessed files
with a local allowlist of approved files. If the verifier does not
recognize one of the hashes, it considers the system unsafe. A
revocation event is then triggered.
</para>
<para>
Before &keylime; can collect information, IMA/EVM needs to be enabled. To
Before &keylime; can collect information, you must enable IMA/EVM. To
enable the process, boot a kernel of the agent with the
<literal>ima_appraise=log</literal> and <literal>ima_policy=tcb</literal>
parameters:
parameters.
</para>
<procedure>
<step>
<para>
Update the <option>GRUB_CMDLINE_LINUX_DEFAULT</option> option with the
parameters in <filename>/etc/default/grub</filename>:
parameters in <filename>/etc/default/grub</filename>.
</para>
<screen>GRUB_CMDLINE_LINUX_DEFAULT="ima_appraise=log ima_policy=tcb"</screen>
</step>
@@ -58,23 +60,21 @@
</step>
<step>
<para>
Reboot your system.
Reboot the system.
</para>
</step>
</procedure>
<para>
The procedure above uses the default kernel IMA policy. To avoid monitoring
too many files and therefore creating long logs, create a new custom
policy. Find more details in the
<link xlink:href="https://keylime-docs.readthedocs.io/en/latest/user_guide/runtime_ima.html">&keylime;
documentation</link>.
The procedure above uses the default kernel IMA policy. To avoid
monitoring too many files and creating long logs, create a new custom
policy. Find more details in the &keylime; documentation.
</para>
<para>
To indicate the expected hashes, use the <option>--allowlist</option>option
of the <command>keylime_tenant</command> command when registering the
agent. To view the excluded or ignored files, use the
<option>--exclude</option> option of the <command>keylime_tenant</command>
command:
To indicate the expected hashes, use the <option>--allowlist</option>
option of the <command>keylime_tenant</command> command when
registering the agent. To view the excluded or ignored files, use the
<option>--exclude</option> option of the
<command>keylime_tenant</command> command.
</para>
<screen>&prompt.root;keylime_tenant --allowlist
-v 127.0.0.1 \
60 changes: 28 additions & 32 deletions tasks/keylime-run-with-podman.xml
@@ -20,25 +20,23 @@
<title>Running the &keylime; workload using &podman;</title>
<abstract>
<para>
&keylime; is a remote attestation solution that enables you to monitor
the health of remote nodes. The <emphasis>verifier</emphasis> and
<emphasis>registrar</emphasis> are essential components of &keylime; on
remote systems to perform the registration and attestation of &keylime;
agents.
&keylime; is a remote attestation solution. It helps you monitor the
health of remote nodes. The verifier and registrar are essential
&keylime; components. You use these components on remote systems to
register and attest &keylime; agents.
</para>
</abstract>
</info>
<note>
<para>
The container described in this article delivers control plane services
<emphasis>verifier</emphasis> and <emphasis>registrar</emphasis> and a
<emphasis>tenant</emphasis> command-line tool (CLI) that are part of the
&keylime; project.
This document describes a container. The container delivers &keylime;
control plane services. These services include the verifier, registrar,
and tenant command-line interface (CLI).
</para>
</note>
<para>
Before you start installing and registering agents, prepare the verifier
and the registrar on remote hosts, as described in the following procedure.
Before installing and registering agents, prepare the verifier and registrar
on remote hosts. The following procedure describes how to prepare them.
</para>
<procedure>
<step>
@@ -60,53 +58,51 @@ registry.opensuse.org/devel/microos/containers/containerfile/opensuse/keylime-co
</step>
<step>
<para>
Create the <literal>keylime-control-plane</literal> volume to
persist the database and certificates required during the attestation
process.
Create the <literal>keylime-control-plane</literal> volume. This
persists the database and certificates for attestation.
</para>
<screen>&prompt.root;podman container runlabel install \
registry.opensuse.org/devel/microos/containers/containerfile/opensuse/keylime-control-plane:latest</screen>
</step>
<step>
<para>
Start the container and related services.
Start the container and its services.
</para>
<screen>&prompt.root;podman container runlabel run \
registry.opensuse.org/devel/microos/containers/containerfile/opensuse/keylime-control-plane:latest</screen>
<para>
The <literal>keylime-control-plane</literal> container is
created. It includes configured and running registrar and verifier
services. Internally, the container exposes ports 8881, 8890 and 8891
to the host using the default values. Validate the firewall
configuration to allow access to the ports and to allow communication
between containers, because the tenant CLI requires it.
This creates the <literal>keylime-control-plane</literal> container. The
container includes configured and running registrar and verifier
services. It internally exposes ports 8881, 8890, and 8891 to the host.
These ports use default values. Validate your firewall configuration.
Ensure it allows access to these ports and communication between
containers. The tenant CLI requires this communication.
</para>
</step>
</procedure>
<tip>
<para>
If you need to stop &keylime; services, run the following command:
To stop the &keylime; services, run this command:
</para>
<screen>&prompt.root;<command>podman kill keylime-control-plane-container</command></screen>
</tip>
<section xml:id="keylime-monitor">
<title>Monitoring &keylime; services</title>
<para>
To get the status of running containers on the host, run the following
command:
To check the status of running containers on the host, run this command:
</para>
<screen>&prompt.root;podman ps</screen>
<para>
To view the logs of &keylime; services, run the following command:
To view the &keylime; service logs, run this command:
</para>
<screen>&prompt.root;podman logs keylime-control-plane-container</screen>
</section>
<section xml:id="keylime-executing-tenant">
<title>Executing the tenant CLI</title>
<para>
The tenant CLI tool is included in the container, and if the host
firewall does not interfere with the ports exposed by &keylime; services,
you can execute it using the same image, for example:
The container includes the tenant CLI tool. If the host firewall does not
interfere with the exposed ports, you can execute the CLI using the same
image:
</para>
<screen>&prompt.root;<command>podman run --rm \
-v keylime-control-plane-volume:/var/lib/keylime/ \
@@ -116,9 +112,9 @@ keylime_tenant -v 10.88.0.1 -r 10.88.0.1 --cert default -c reglist</command></sc
<section xml:id="keylime-extract-certificates">
<title>Extracting the &keylime; certificate</title>
<para>
The first time that the &keylime; container is executed, its services
create a certificate required by several agents. You need to extract the
certificate from the container and copy it to the agent's
When you first execute the &keylime; container, its services create a
certificate. Several agents require this certificate. Extract the
certificate from the container. Then, copy it to the agent's
<filename>/var/lib/keylime/cv_ca/</filename> directory.
</para>
<screen>&prompt.root;<command>podman cp \
@@ -128,7 +124,7 @@ keylime-control-plane-container:/var/lib/keylime/cv_ca/cacert.crt
<replaceable>AGENT_HOST:/var/lib/keylime/cv_ca/</replaceable></screen>
<tip>
<para>
Find more details about installing the agent in
For more details about installing the agent, see
<xref linkend="keylime-install-agent"/>.
</para>
</tip>