<para> Secure Encrypted Virtualization (SEV) and SEV with Encrypted State (SEV-ES) technologies are available on the 5th Generation AMD
EPYC Processors when SUSE Linux Enterprise Server 15 SP4 is used both as a host and guest OS. They allow the memory of the VMs to be encrypted, enabling a high level of
confidentiality. SEV-ES is considered superior to plain SEV, as CPU registers are also encrypted when they are saved into the host memory.
To use SEV-ES for a VM, it must be enabled in the VM's own configuration file, but there are preparation steps that need to occur at
the host level. </para>
are where the mapping between vCPUs and pCPUs is established (<parameter>vcpu</parameter>
being the vCPU ID and <parameter>cpuset</parameter> being either one or a list of pCPU IDs). </para>
<para> To be able to create VMs with more than 255 vCPUs,
<para>As already mentioned, full cores must always be used. If possible, always fully use CCXes/dies too.
Since each die has 16 CPUs, a VM with 288 vCPUs will use 9 CCXes on each of the 2 nodes
(as 9 x 16 x 2 is indeed 288).
So, for example, vCPUs 0 to 15 can be assigned to Cores L#0 to L#7 (and hence to CPUs P#0 to P#7 and P#192 to P#199), on node P#0;
vCPUs 16 to 31 to Cores L#8 to L#15, and so on. No vCPU, on the other hand, will be pinned on Cores L#72 to L#95.
The same applies on node P#1.
This is what we call a coherent 1-to-1 mapping between virtual and physical topologies.</para>
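<para>As an illustration, such a coherent 1-to-1 mapping can be generated programmatically. The following is a hypothetical sketch (not part of the guide's tooling), assuming a 2-node host with 96 cores per node, SMT-2, and CPU P#(n+192) being the sibling hardware thread of CPU P#n, as in the topology above:</para>

```python
CORES_PER_NODE = 96   # assumption: 2 NUMA nodes, 96 cores each, SMT-2 (384 CPUs)
SMT_OFFSET = 192      # assumption: CPU P#(n+192) is the SMT sibling of CPU P#n

def coherent_pinning(vcpus=288):
    """Map each vCPU to the pair of hardware threads of one host core,
    filling cores in order and splitting the VM evenly across the 2 nodes."""
    pins = {}
    per_node = vcpus // 2
    for v in range(vcpus):
        node = v // per_node
        core = node * CORES_PER_NODE + (v % per_node) // 2
        pins[v] = (core, core + SMT_OFFSET)
    return pins

pins = coherent_pinning()
assert pins[0] == (0, 192)     # vCPU 0 -> Core L#0 (CPUs P#0 and P#192)
assert pins[15] == (7, 199)    # vCPU 15 -> Core L#7, as in the text
assert pins[143] == (71, 263)  # node P#0 uses Cores L#0..L#71; L#72..L#95 stay free
```

<para>From such a map, the corresponding <parameter>vcpupin</parameter> elements of the <parameter>cputune</parameter> section can be derived, for example <vcpupin vcpu='0' cpuset='0,192'/>.</para>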
them, respectively. And this is all possible without the need for any of the VMs to span
more than one host NUMA node. </para>
<para>Of course, in all these cases, VMs are sharing dies, which means they will interfere with each other via the L3 caches.
This may or may not be a problem, but there is no way around it as soon as more VMs are needed than there are available dies.
The performance impact of such sharing should, in most cases, be tolerable for these configurations, but it needs to be assessed with tests and benchmarks.
If that is not the case, then the recommendation is to not go above 24 VMs.
<bridgehead>CPU Oversubscription</bridgehead>
<para> CPU oversubscription happens when the total number of vCPUs from all VMs
becomes higher than 384. Such a situation inevitably introduces latencies, resulting in lower
performance compared to when host resources are sufficient. It is, however, impossible to tell a
priori to what extent this happens, at least not without detailed knowledge of the actual
workload. </para>
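<para>As a trivial sketch (a hypothetical helper, not part of the guide), the threshold can be expressed as:</para>

```python
HOST_CPUS = 384  # the reference host: 2 NUMA nodes x 96 cores x 2 threads

def cpu_oversubscribed(vm_vcpu_counts):
    """True when the cumulative vCPU count of all VMs exceeds the host's CPUs."""
    return sum(vm_vcpu_counts) > HOST_CPUS

assert not cpu_oversubscribed([288, 96])  # exactly 384 vCPUs: still fine
assert cpu_oversubscribed([288, 96, 1])   # 385 vCPUs: oversubscribed
```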
<para> The <parameter><topology></parameter> element specifies the CPU
characteristics. In this case, we are creating vCPUs which will be seen by the guest OS as
being arranged in 2 sockets, each of which has 9 dies, each of which has 8 cores with 2 threads (that is, 16 CPUs).
And this is how we match, for the one big VM, the topology of the host. </para>
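<para>This arithmetic can be cross-checked with a small sketch (the element is libvirt's <parameter>topology</parameter>; the Python helper itself is hypothetical):</para>

```python
def total_vcpus(sockets, dies, cores, threads):
    """Number of vCPUs produced by a <topology> specification,
    e.g. <topology sockets='2' dies='9' cores='8' threads='2'/>."""
    return sockets * dies * cores * threads

# 2 sockets x 9 dies x 8 cores x 2 threads = 288 vCPUs, matching the big VM
assert total_vcpus(sockets=2, dies=9, cores=8, threads=2) == 288
# each die exposes 8 cores x 2 threads = 16 CPUs, as stated in the text
assert total_vcpus(sockets=1, dies=1, cores=8, threads=2) == 16
```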
<para> Each <parameter><cell></parameter> element defines one virtual NUMA node,
<para>Note that, to achieve the outcome shown above, it is very important to use the following CPU topology description string in the VM configuration:</para>
<para>To verify that the appropriate type of memory is being used by the VMs,
one can check the content of <parameter>/proc/meminfo</parameter>, with the VMs running,
and observe that all the pre-allocated Huge Pages are actually occupied.</para>
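<para>For example, the relevant counters can be extracted programmatically. The sketch below parses text in the format of <parameter>/proc/meminfo</parameter>; the sample values are made up for illustration:</para>

```python
def hugepage_counters(meminfo_text):
    """Extract the HugePages_Total and HugePages_Free counters from the
    textual content of /proc/meminfo."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("HugePages_Total", "HugePages_Free"):
            counters[key] = int(value.split()[0])
    return counters

# Hypothetical sample: every pre-allocated huge page is occupied (Free == 0).
sample = "HugePages_Total:     288\nHugePages_Free:        0\n"
counters = hugepage_counters(sample)
assert counters == {"HugePages_Total": 288, "HugePages_Free": 0}
```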
</mediaobject>
</figure>
<para>The single thread results are identical between baremetal (blue rectangles) and inside of the VM (orange rectangles), for
all the operations (<parameter>Copy</parameter>, <parameter>Scale</parameter>, <parameter>Add</parameter> and
<parameter>Triad</parameter>) of the benchmark.
</para>
<para> Regarding the parallel case, remember that the VM is slightly "smaller" than the host, in terms of number of CPUs. Therefore,
we cannot run the parallel version of STREAM with as many threads as on the host. In fact, a very good result is reached, on the host, with twice as many threads as there are
Last Level Caches (LLCs), that is, 48. If we do the same inside the VM, that means 36 threads.
We can see, however, that the performance reached inside of the VM, even with 36 threads (dark red rectangles, ~616 GBytes/sec for the <parameter>Copy</parameter> operation)
2890
2890
is close enough to the values achieved on the host with 48 threads (yellow rectangles, ~640 GBytes/sec for the <parameter>Copy</parameter> operation).
2891
2891
For completeness, we also ran the benchmark on the host with 36 threads (green rectangles, ~605 GBytes/sec for the <parameter>Copy</parameter> operation); and as we
could have expected, the results of such baremetal runs and of the VM runs are very close.</para>
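<para>The thread counts above follow from the rule of thumb stated in the text. A sketch, where the LLC counts are derived from the topologies discussed earlier (12 dies per node on the host, 9 per node in the VM, is an inference, not a figure quoted by the text):</para>

```python
def best_stream_threads(llc_count):
    """Rule of thumb from the text: run STREAM with twice as many threads
    as there are Last Level Caches (LLCs)."""
    return 2 * llc_count

assert best_stream_threads(24) == 48  # host: 24 LLCs (12 dies x 2 nodes, inferred)
assert best_stream_threads(18) == 36  # the VM: 18 LLCs (9 dies x 2 nodes)
```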
<para> This clearly shows how proper tuning allows a single VM running on an AMD EPYC 7004 Series
Processor server to achieve a memory bandwidth performance that matches the one we can reach directly on the host. </para>
<note>
<para>Inside of the VM, the STREAM benchmark was configured almost identically to what has been
is selected. In fact, since using that model builds a VM with only 2 LLCs (and also leads to other problems with the cache topology),
running STREAM with twice as many threads as there are LLCs in the system results in the benchmark spawning only 4 of them (yellow and green rectangles).
And this, of course, dramatically reduces the performance. We can also see that, if we instead manually set the number of threads
to the value that we know to be the best for this VM (that is, 36; dark red and cyan rectangles), performance is restored to the level
we know it can reach from <xref linkend="fig-stream-bm-vm"/> (and from the <parameter>cpumodel</parameter> results, see the blue and orange rectangles).</para>
<para>Actually, the cyan rectangles represent the absolute best result (~620 GBytes/sec for the <parameter>Copy</parameter> operation), probably thanks to the fact that the CPUIdle
<parameter>haltpoll</parameter> governor is the most effective when coupled with <parameter>cpupassthrough</parameter>. However, the orange rectangles follow very closely
</mediaobject>
</figure>
<para>We see in <xref linkend="fig-stream-single-vms-avg"/> how the single STREAM bandwidth stays flat until (and including when) 6 VMs are used.
Then it starts to decline a bit, as more VMs are packed on the NUMA nodes and, hence, compete for the bandwidth of the memory controllers. However,
<xref linkend="fig-stream-single-vms-sum"/> reminds us how the total bandwidth achieved, if we consider all the VMs involved in each experiment, actually goes up,
until it reaches the same level that we know (for example, from <xref linkend="fig-stream-bm-vm"/>) it can attain.
</para>
<para><xref linkend="fig-stream-omp-vms-avg"/> and <xref linkend="fig-stream-omp-vms-sum"/> show the same, but when the parallel (via OpenMP) version of STREAM