Commit 80a0f54

Fixed minor style and wording issues

According to doc review and style guide policies, fixed some minor wording and style issues.

1 parent 8530b06 commit 80a0f54

2 files changed: +42 -42 lines changed
xml/MAIN-SBP-AMD-EPYC-4-SLES15SP4.xml

Lines changed: 22 additions & 22 deletions
@@ -214,7 +214,7 @@
 EPYC Processors, there are several micro-architectural differences. The <emphasis
 role="italic">Instructions Per Cycle (IPC)</emphasis> has improved by 13% on average across
 a selected range of workloads, although the exact improvement is workload-dependent. The
-improvements are due to a variety of factors including a larger L2 cache, improvements in
+improvements result from a variety of factors including a larger L2 cache, improvements in
 branch prediction, the execution engine, the front-end fetching/decoding of instructions and
 additional instructions such as supporting AVX-512. The degree to which these changes affect
 performance varies between applications.</para>
@@ -323,7 +323,7 @@ node 0 1
 (TDPs)</emphasis> differ for the AMD EPYC 9004 Series Dense processor, with different
 frequency scaling limits and generally a lower peak frequency. While each individual core may
 achieve less peak performance than the AMD EPYC 9004 Series Classic Processor, the total peak
-compute throughput available is higher due to the increased number of cores.</para>
+compute throughput available is higher because of the increased number of cores.</para>

 <para>The intended use case and workloads determine which processor is superior. The key
 advantage of the AMD EPYC 9004 Series Dense Processor is packing more cores within the same
@@ -1699,11 +1699,11 @@ epyc:~ # perf script
 9004 Series Dense, the most important task is to set expectations. While super-linear scaling
 is possible, it should not be expected. It may be possible to achieve super-linear scaling in
 Cloud Environments for the number of instances hosted without performance loss if individual
-containers or virtual machines are not utilising 100% of CPU. However, it should be planned
+containers or virtual machines are not utilizing 100% of CPU. However, it should be planned
 carefully and tested. This would be particularly true in cases where multiple instances are
 hosted that have different times of day or year for active phases. The normal expectation is a
-best case of 33% gain for CPU-intensive workloads due to the increased number of cores. But
-sub-linear scaling is common due to resource contention. Contention between SMT siblings,
+best case of 33% gain for CPU-intensive workloads because of the increased number of cores. But
+sub-linear scaling is common because of resource contention. Contention between SMT siblings,
 memory bandwidth, memory availability, memory interconnects, thread communication overhead or
 peripheral devices may prevent perfect linear scaling even for perfectly parallelized
 applications. Similarly, not all applications can scale perfectly. It is possible for
@@ -1845,7 +1845,7 @@ epyc:~ # perf script
 <sect3 xml:id="sec-allocating-resources-hostos-kvm">
 <title>Reserving CPUs and memory for the host on KVM</title>

-<para> When using KVM, sparing, for example, 24 cores (i.e., one full core for each CCX on both NUMA nodes) and 64 GB of RAM for the host OS is
+<para> When using KVM, sparing, for example, 24 cores (that is, one full core for each CCX on both NUMA nodes) and 64 GB of RAM for the host OS is
 done by no longer creating VMs when the total number of vCPUs of all VMs has reached 336
 (as each core has 2 threads) and when the total cumulative amount of allocated RAM has
 reached 690 GB. </para>
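The arithmetic behind the 336-vCPU cap in the hunk above can be sketched as follows. This is a minimal sketch: the 192-core, SMT-2 host size is inferred from the 384-vCPU oversubscription threshold and the P#192 thread numbering cited elsewhere in this diff, not stated in this hunk itself.

```python
# Host size inferred from the diff: 2 NUMA nodes, 192 physical cores,
# SMT-2, so 384 hardware threads in total.
TOTAL_CORES = 192
THREADS_PER_CORE = 2

# Reserve one full core per CCX on both nodes: 24 cores for the host OS.
RESERVED_CORES = 24

# vCPUs that may be handed out to guests before eating into the reservation.
guest_vcpu_cap = (TOTAL_CORES - RESERVED_CORES) * THREADS_PER_CORE
print(guest_vcpu_cap)  # 336, matching the limit stated in the <para>
```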
@@ -2017,7 +2017,7 @@ epyc:~ # perf script
 <title>Secure Encrypted Virtualization (SEV)</title>

 <para> Secure Encrypted Virtualization (SEV) and SEV with Encrypted State (SEV-ES) technologies are available on the 5th Generation AMD
-EPYC Processors when SUSE Linux Enterprise Server 15 SP4 is used both as an host and guest OS. They allow the memory of the VMs to be encrypted, enabling an high level of
+EPYC Processors when SUSE Linux Enterprise Server 15 SP4 is used both as a host and guest OS. They allow the memory of the VMs to be encrypted, enabling a high level of
 confidentiality. SEV-ES is considered superior to plain SEV, as CPU registers are also encrypted when they are saved into the host memory.
 To use SEV-ES for a VM, it must be enabled in the VM's own configuration file, but there are preparation steps that need to occur at
 the host level. </para>
@@ -2178,7 +2178,7 @@ Total 346060.48 346246.34 692306.82
 are where the mapping between vCPUs and pCPUs is established (<parameter>vcpu</parameter>
 being the vCPU ID and <parameter>cpuset</parameter> being either one or a list of pCPU IDs). </para>

-<para> In order to be able to create VMs with more than 255 vCPUs,
+<para> To be able to create VMs with more than 255 vCPUs,
 the following element should be added in the
 <parameter>&lt;device></parameter> section:</para>
@@ -2262,7 +2262,7 @@ node 0 1
 <para>As said already, full cores must always be used. If possible, always fully use CCXes/dies too.
 Since each die has 16 CPUs, a VM with 288 vCPUs will use 9 CCXes on each of the 2 nodes
 (as 9 x 16 x 2 is indeed 288).
-So, for instance, vCPUs 0 to 15 can be assigned to Cores L#0 to L#7 (and hence to CPUs P#0 to P#7 and P#192 to P#199), on node P#0;
+So, for example, vCPUs 0 to 15 can be assigned to Cores L#0 to L#7 (and hence to CPUs P#0 to P#7 and P#192 to P#199), on node P#0;
 vCPUs 16 to 31 to Cores L#8 to L#15, and so on. No vCPU will be pinned, on the other hand, on Cores L#72 to L#95.
 And the same on node P#1.
 In fact, this is what we call coherent 1-to-1 mapping between virtual and physical topologies.</para>
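The coherent 1-to-1 mapping described above can be generated mechanically. The sketch below is a hypothetical illustration, not part of the guide: it assumes, as the text states, that core L#c exposes hardware threads P#c and P#(c+192), and that each pair of consecutive vCPUs lands on the two threads of one core (shown here for node P#0 only).

```python
# Hypothetical generator for the coherent 1-to-1 <vcpupin> mapping:
# core L#c has SMT siblings P#c and P#(c + 192).
THREAD_OFFSET = 192

def vcpu_to_pcpu(vcpu):
    core = vcpu // 2       # two vCPUs per physical core
    sibling = vcpu % 2     # first or second SMT thread of that core
    return core + sibling * THREAD_OFFSET

# One <vcpupin> element per vCPU, for the first CCX (vCPUs 0 to 15).
pins = [f'<vcpupin vcpu="{v}" cpuset="{vcpu_to_pcpu(v)}"/>' for v in range(16)]
print(pins[0])   # vCPU 0  -> P#0
print(pins[1])   # vCPU 1  -> P#192
print(pins[15])  # vCPU 15 -> P#199, as in the text above
```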
@@ -2323,7 +2323,7 @@ node 0
 them, respectively. And this is all possible without the need for any of the VMs to span
 more than one host NUMA node. </para>

-<para>Of course, in all these cases, VMs are sharing dies, which means they will interfere among each others via the L3 caches.
+<para>Of course, in all these cases, VMs are sharing dies, which means they will interfere with each other via the L3 caches.
 This may or may not be a problem, but there is no way around it as soon as more VMs than the number of available dies are necessary.
 The performance impact of such sharing should be tolerable, in most cases, for these configurations, but this needs to be assessed with tests and benchmarks.
 If that is not the case, then the recommendation is to not go above 24 VMs.
@@ -2397,8 +2397,8 @@ node 0
 <bridgehead>CPU Oversubscription</bridgehead>

 <para> CPU oversubscription happens when the total cumulative number of vCPUs from all VMs
-becomes higher than 384. Such situation unavoidably introduces latencies, and leads to lower
-performance than when host resources are just enough. It is, however, impossible to tell a
+becomes higher than 384. Such a situation inevitably introduces latencies, resulting in lower
+performance compared to when host resources are sufficient. It is, however, impossible to tell a
 priori to what extent this happens, at least not without a detailed knowledge of the actual
 workload. </para>
@@ -2498,7 +2498,7 @@ node 0

 <para> The <parameter>&lt;topology></parameter> element specifies the CPU
 characteristics. In this case, we are creating vCPUs which will be seen by the guest OS as
-being arranged in 2 sockets, each of which has 9 dies, each of which has 8 cores with 2 threads (i.e., 16 CPUs).
+being arranged in 2 sockets, each of which has 9 dies, each of which has 8 cores with 2 threads (that is, 16 CPUs).
 And this is how we match, for the one big VM, the topology of the host. </para>

 <para> Each <parameter>&lt;cell></parameter> element defines one virtual NUMA node,
@@ -2605,7 +2605,7 @@ vm1:~ # cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
 0-15
 </screen>

-<para>Note that, in order to achieve the outcome shown above, it is very important to use the following CPU topology description string, in the VM configuration:</para>
+<para>Note that, to achieve the outcome shown above, it is very important to use the following CPU topology description string in the VM configuration:</para>

 <screen>&lt;topology sockets="2" dies="9" cores="8" threads="2"/></screen>
@@ -2715,7 +2715,7 @@ vm1:~ # modprobe cpuidle-haltpoll
 &lt;/memoryBacking>
 </screen>

-<para>In order to verify that the appropriate type of memory is being used by the VMs,
+<para>To verify that the appropriate type of memory is being used by the VMs,
 one can check the content of <parameter>/proc/meminfo</parameter>, with the VMs running,
 and observe that all the pre-allocated Huge Pages are actually occupied.</para>
@@ -2878,21 +2878,21 @@ dmesg | grep SEV
 </mediaobject>
 </figure>

-<para>The single thread results are basically identical between baremetal (blue rectangles) and inside of the VM (orange rectangles), for
+<para>The single thread results are identical between baremetal (blue rectangles) and inside of the VM (orange rectangles), for
 all the operations (<parameter>Copy</parameter>, <parameter>Scale</parameter>, <parameter>Add</parameter> and
 <parameter>Triadd</parameter>) of the benchmark.
 </para>

 <para> About the parallel case, remember that the VM is slightly "smaller" than the host, in terms of number of CPUs. Therefore,
 we cannot run the parallel version of STREAM with as many threads as on the host. In fact, a very good result is reached, on the host, with twice as many threads as there are
-Last Level Caches (LLCs), i.e., 48. If we do the same inside of the VM, it means 36 threads.
+Last Level Caches (LLCs), that is, 48. If we do the same inside of the VM, it means 36 threads.
 We can see, however, that the performance reached inside of the VM, even with 36 threads (dark red rectangles, ~616 GBytes/sec for the <parameter>Copy</parameter> operation)
 is close enough to the values achieved on the host with 48 threads (yellow rectangles, ~640 GBytes/sec for the <parameter>Copy</parameter> operation).
 For completeness, we also ran the benchmark on the host with 36 threads (green rectangles, ~605 GBytes/sec for the <parameter>Copy</parameter> operation); and as we
 could have expected, the results of such baremetal runs and of the VM runs are very close.</para>

 <para> This clearly shows how proper tuning allows a single VM running on an AMD EPYC 7004 Series
-Processor server to achieve a memory bandwidth performance that basically matches the one that we can reach directly on the host. </para>
+Processor server to achieve a memory bandwidth performance that matches the one that we can reach directly on the host. </para>

 <note>
 <para>Inside of the VM, the STREAM benchmark was configured almost identically to what has been
@@ -2907,8 +2907,8 @@ dmesg | grep SEV
 is selected. In fact, since using that model builds a VM with only 2 LLCs (in addition to other problems with the cache topology),
 running STREAM with twice as many threads as there are LLCs in the system results in the benchmark spawning only 4 of them (yellow and green rectangles).
 And this, of course, dramatically reduces the performance. We can also see that, if we instead manually set the number of threads
-to the value that we know to be the best for this VM (i.e., 36, dark red and cyan rectangles) performance are restored to how good
-we know things can be from <xref linkend="fig-stream-bm-vm"/> (and from the <parameter>cpumodel</parameter> results, i.e. the blue and orange rectangles).</para>
+to the value that we know to be the best for this VM (that is, 36, dark red and cyan rectangles), performance is restored to how good
+we know things can be from <xref linkend="fig-stream-bm-vm"/> (and from the <parameter>cpumodel</parameter> results, see the blue and orange rectangles).</para>

 <para>Actually, the cyan rectangles represent the absolute best result (~620 GBytes/sec for the <parameter>Copy</parameter> operation), probably thanks to the fact that the CPUIdle
 <parameter>haltpoll</parameter> governor is the most effective when coupled with <parameter>cpupassthrough</parameter>. However, the orange rectangles follow very closely
@@ -2990,10 +2990,10 @@ dmesg | grep SEV
 </mediaobject>
 </figure>

-<para>We see in <xref linkend="fig-stream-single-vms-avg"/> how the single STREAM bandwidth stays basically flat until (and including when) 6 VMs are used.
+<para>We see in <xref linkend="fig-stream-single-vms-avg"/> how the single STREAM bandwidth stays flat until (and including when) 6 VMs are used.
 Then it starts to decline a bit, as more VMs are packed on the NUMA nodes and, hence, compete for the bandwidth of the memory controllers. However,
 <xref linkend="fig-stream-single-vms-sum"/> reminds us how the total bandwidth achieved, if we consider all the VMs involved in each experiment, actually goes up,
-until it reaches the same level that we know (e.g., from <xref linkend="fig-stream-bm-vm"/>) it can touch.
+until it reaches the same level that we know (for example, from <xref linkend="fig-stream-bm-vm"/>) it can reach.
 </para>

 <para><xref linkend="fig-stream-omp-vms-avg"/> and <xref linkend="fig-stream-omp-vms-sum"/> show the same, but when the parallel (via OpenMP) version of STREAM
