
Commit b1122dc

fix typo
1 parent 055b2f1 commit b1122dc

1 file changed: +21 -20 lines changed

linux/network-performance-ultimate-guide.md

Lines changed: 21 additions & 20 deletions
@@ -33,15 +33,15 @@ Table of Contents:
 - [2.4.4. Accelerated Receive Flow Steering (aRFS)](#244-accelerated-receive-flow-steering-arfs)
 - [2.5. Interrupt Coalescing (soft IRQ)](#25-interrupt-coalescing-soft-irq)
 - [2.6. Ingress QDisc](#26-ingress-qdisc)
-- [2.7. Egress Disc - txqueuelen and default_qdisc](#27-egress-disc---txqueuelen-and-default_qdisc)
+- [2.7. Egress Disc - txqueuelen and default\_qdisc](#27-egress-disc---txqueuelen-and-default_qdisc)
 - [2.8. TCP Read and Write Buffers/Queues](#28-tcp-read-and-write-buffersqueues)
 - [2.9. TCP FSM and congestion algorithm](#29-tcp-fsm-and-congestion-algorithm)
 - [2.10. NUMA](#210-numa)
-- [2.11. Further more - Packet processing](#211-further-more---packet-processing)
-- [2.11.1. AF_PACKET v4](#2111-af_packet-v4)
-- [2.11.2. PACKET_MMAP](#2112-packet_mmap)
+- [2.11. Furthermore - Packet processing](#211-furthermore---packet-processing)
+- [2.11.1. `AF_PACKET` v4](#2111-af_packet-v4)
+- [2.11.2. `PACKET_MMAP`](#2112-packet_mmap)
 - [2.11.3. Kernel bypass: Data Plane Development Kit (DPDK)](#2113-kernel-bypass-data-plane-development-kit-dpdk)
-- [2.11.4. PF_RING](#2114-pf_ring)
+- [2.11.4. PF\_RING](#2114-pf_ring)
 - [2.11.5. Programmable packet processing: eXpress Data Path (XDP)](#2115-programmable-packet-processing-express-data-path-xdp)
 
 ## 1. Linux Networking stack
@@ -75,11 +75,12 @@ Source:
 
 <details>
 <summary>Click to expand</summary>
+
 - In network devices, it is common for the NIC to raise an **IRQ** to signal that a packet has arrived and is ready to be processed.
 - An IRQ (Interrupt Request) is a hardware signal sent to the processor instructing it to suspend its current activity and handle some external event, such as a keyboard input or a mouse movement.
 - In Linux, IRQ mappings are stored in **/proc/interrupts**.
 - When an IRQ handler is executed by the Linux kernel, it runs at a very, very high priority and often blocks additional IRQs from being generated. As such, IRQ handlers in device drivers must execute as quickly as possible and defer all long running work to execute outside of this context. This is why the **softIRQ** system exists.
-- **softIRQ** system is a system that kernel uses to process work outside of the device driver IRQ context. In the case of network devices, the softIRQQ system is responsible for processing incoming packets
+- **softIRQ** system is a system that kernel uses to process work outside of the device driver IRQ context. In the case of network devices, the softIRQ system is responsible for processing incoming packets
 - Initial setup (from step 1-4):
 
 ![](https://cdn.buttercms.com/hwT5dgTatRdfG7UshrAF)
@@ -103,12 +104,12 @@ Source:
 - The call to `napi_schedule` in the driver adds the driver's NAPI poll structure to the `poll_list` for the current CPU.
 - The softirq pending a bit is set so that the `ksoftirqd` process on this CPU knows that there are packets to process.
 - `run_ksoftirqd` function (which is being run in a loop by the `ksoftirq` kernel thread) executes.
-- `__do_softirq` is called which checks the pending bitfield, sees that a softIRQ is pending, and calls the handler registerd for the pending softIRQ: `net_rx_action` (softIRQ kernel thread executes this, not the driver IRQ handler).
+- `__do_softirq` is called which checks the pending bitfield, sees that a softIRQ is pending, and calls the handler registered for the pending softIRQ: `net_rx_action` (softIRQ kernel thread executes this, not the driver IRQ handler).
 - Now, data processing begins:
 - `net_rx_action` loop starts by checking the NAPI poll list for NAPI structures.
 - The `budget` and elapsed time are checked to ensure that the softIRQ will not monopolize CPU time.
 - The registered `poll` function is called.
-- The driver's `poll` functio harvests packets from the ring buffer in RAM.
+- The driver's `poll` function harvests packets from the ring buffer in RAM.
 - Packets are handed over to `napi_gro_receive` (GRO - Generic Receive Offloading).
 - GRO is a widely used SW-based offloading technique to reduce per-packet processing overheads.
 - By reassembling small packets into larger ones, GRO enables applications to process fewer large packets directly, thus reducing the number of packets to be processed.
@@ -119,12 +120,12 @@ Source:
 - If RPS is disabled:
 - 1. `netif_receive_skb` passed the data onto `__netif_receive_core`.
 - 6. `__netif_receive_core` delivers the data to any taps.
-- 7. `__netif_receive_core` delivers data to registed protocol layer handlers.
+- 7. `__netif_receive_core` delivers data to registered protocol layer handlers.
 - If RPS is enabled:
 - 1. `netif_receive_skb` passes the data on to `enqueue_to_backlog`.
 - 2. Packets are placed on a per-CPU input queue for processing.
 - 3. The remote CPU’s NAPI structure is added to that CPU’s poll_list and an IPI is queued which will trigger the softIRQ kernel thread on the remote CPU to wake-up if it is not running already.
-- 4. When the `ksoftirqd` kernel thread on the remote CPU runs, it follows the same pattern describe in the previous section, but this time, the registered poll function is `process_backlog` which harvests packets from the current CPU’s input queue.
+- 4. When the `ksoftirqd` kernel thread on the remote CPU runs, it follows the same pattern described in the previous section, but this time, the registered poll function is `process_backlog` which harvests packets from the current CPU’s input queue.
 - 5. Packets are passed on toward `__net_receive_skb_core`.
 - 6. `__netif_receive_core` delivers data to any taps (like PCAP).
 - 7. `__netif_receive_core` delivers data to registered protocol layer handlers.
@@ -141,7 +142,7 @@ Source:
 
 ![](https://img-blog.csdnimg.cn/20201025161643899.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L1JvbmdfVG9h,size_16,color_FFFFFF,t_70)
 
-> **NOTE**: Some NICs are "multiple queues" NICs. This diagram above shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings you may have mutliple queues in the system. Check [Share the load of packet processing among CPUs](#24-share-the-load-of-packet-processing-among-cpus) section for detail.
+> **NOTE**: Some NICs are "multiple queues" NICs. This diagram above shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings you may have multiple queues in the system. Check [Share the load of packet processing among CPUs](#24-share-the-load-of-packet-processing-among-cpus) section for detail.
 
 1. Packet arrives at the NIC
 2. NIC verifies `MAC` (if not on [promiscuous mode](https://unix.stackexchange.com/questions/14056/what-is-kernel-ip-forwarding)) and `FCS` and decide to drop or to continue
@@ -199,7 +200,7 @@ Source:
 ```
 
 9. NAPI polls data from the rx ring buffer.
-- NAPI was written to make processing data packets of incoming cards more efficient. HardIRQs are expensive because they can't be interrupt, we both known that. Even with _Interrupt coalesecense_ (describe later in more detail), the interrupt handler will monopolize a CPU core completely. The design of NAPI allows the driver to go into a polling mode instead of being HardIRQ for every required packet receive.
+- NAPI was written to make processing data packets of incoming cards more efficient. HardIRQs are expensive because they can't be interrupted, we both know that. Even with _Interrupt coalescence_ (described later in more detail), the interrupt handler will monopolize a CPU core completely. The design of NAPI allows the driver to go into a polling mode instead of being HardIRQ for every required packet receive.
 - Step 1->9 in brief:
 
 ![](https://i.stack.imgur.com/BKBvW.png)
@@ -257,13 +258,13 @@ Source:
 20. It finds the right socket
 21. It goes to the tcp finite state machine
 22. Enqueue the packet to the receive buffer and sized as `tcp_rmem` rules
-- If `tcp_moderate_rcvbuf is enabled kernel will auto-tune the receive buffer
+- If `tcp_moderate_rcvbuf` is enabled kernel will auto-tune the receive buffer
 - `tcp_rmem`: Contains 3 values that represent the minimum, default and maximum size of the TCP socket receive buffer.
 - min: minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. Default: 4 KB.
 - default: initial size of receive buffer used by TCP sockets. This value overrides `net.core.rmem_default` used by other protocols. Default: 131072 bytes. This value results in initial window of 65535.
 - max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP socket. This value does not override `net.core.rmem_max`. Calling `setsockopt()` with `SO_RCVBUF` disables automatic tuning of that socket’s receive buffer size, in which case this value is ignored. `SO_RECVBUF` sets the fixed size of the TCP receive buffer, it will override `tcp_rmem`, and the kernel will no longer dynamically adjust the buffer. The maximum value set by `SO_RECVBUF` cannot exceed `net.core.rmem_max`. Normally, we will not use it. Default: between 131072 and 6MB, depending on RAM size.
 - `net.core.rmem_max`: the upper limit of the TCP receive buffer size.
-- Between `net.core.rmem_max` and `net.ipv4.tcp-rmem`'max value, the bigger value [takes precendence](https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_output.c#L241).
+- Between `net.core.rmem_max` and `net.ipv4.tcp-rmem` max value, the bigger value [takes precendence](https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_output.c#L241).
 - Increase this buffer to enable scaling to a larger window size. Larger windows increase the amount of data to be transferred before an acknowledgement (ACK) is required. This reduces overall latencies and results in increased throughput.
 - This setting is typically set to a very conservative value of 262,144 bytes. It is recommended this value be set as large as the kernel allows. 4.x kernels accept values over 16 MB.
 23. Kernel will signalize that there is data available to apps (epoll or any polling system)
@@ -612,9 +613,9 @@ Once upon a time, everything was so simple. The network card was slow and had on
 ![](<http://2.bp.blogspot.com/-AnaAh45OOcI/Ulx3IWrgsNI/AAAAAAAABiQ/iq_zZUH5rOM/s1600/study+(5).png>)
 
 - RSS provides the benefits of parallel receive processing in multiprocessing environment.
-- This is NIC technology. It supprots multiple queues and integrates a hashing function (distributes packets to different queues by Source and Destination IP and if applicable by TCP/UDP source and destination ports) in the NIC. The NIC computes a hash value for each incoming packet. Based on hash values, NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues.
+- This is NIC technology. It supports multiple queues and integrates a hashing function (distributes packets to different queues by Source and Destination IP and if applicable by TCP/UDP source and destination ports) in the NIC. The NIC computes a hash value for each incoming packet. Based on hash values, NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues.
 - Check with `ethool -L` command.
-- According [Linux kernel documentation](https://github.com/torvalds/linux/blob/v4.11/Documentation/networking/scaling.txt#L80), `RSS should be enabled when latency is a concern or whenever receive interrupt processing froms a bottleneck... For low latency networking, the optimal setting is to allocate as many queues as there are CPUs in the system (or the NIC maximum, if lower)`.
+- According [Linux kernel documentation](https://github.com/torvalds/linux/blob/v4.11/Documentation/networking/scaling.txt#L80), `RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck... For low latency networking, the optimal setting is to allocate as many queues as there are CPUs in the system (or the NIC maximum, if lower)`.
 
 ![](https://learn.microsoft.com/en-us/windows-hardware/drivers/network/images/rss.png)
 
@@ -730,7 +731,7 @@ sysctl -w net.core.rps_sock_flow_entries 32768
 
 ### 2.6. Ingress QDisc
 
-- In step (14), I has mentioned `netdev_max_backlog`, it's about Per-CPU backlog queue. The `netif_receive_skb()` kernel function (step (12)) will find the corresponding CPU for a packet, and enqueue packets in that CPU's queue. If the queue for that processor is full and already at maximum size, packets will be dropped. The default size of queue - `netdev_max_backlog` value is 1000, this may not be enough for multiple interfaces operating at 1Gbps, or even a single interface at 10Gbps.
+- In step (14), I have mentioned `netdev_max_backlog`, it's about Per-CPU backlog queue. The `netif_receive_skb()` kernel function (step (12)) will find the corresponding CPU for a packet, and enqueue packets in that CPU's queue. If the queue for that processor is full and already at maximum size, packets will be dropped. The default size of queue - `netdev_max_backlog` value is 1000, this may not be enough for multiple interfaces operating at 1Gbps, or even a single interface at 10Gbps.
 - Tuning:
 - Change command:
 - Double the value -> check `/proc/net/softnet_stat`
@@ -875,11 +876,11 @@ systemctl stop irqbalance
 
 - `numadctl`: control NUMA policy for processes or shared memory.
 
-### 2.11. Further more - Packet processing
+### 2.11. Furthermore - Packet processing
 
 This section is an advance one. It introduces some advance module/framework to achieve high performance.
 
-#### 2.11.1. AF_PACKET v4
+#### 2.11.1. `AF_PACKET` v4
 
 Source:
 
@@ -932,7 +933,7 @@ Source:
 
 - In order to improve Rx and Tx performance this implementation make uses `PACKET_MMAP`.
 
-#### 2.11.2. PACKET_MMAP
+#### 2.11.2. `PACKET_MMAP`
 
 Source:
 
0 commit comments
