- [2.11.3. Kernel bypass: Data Plane Development Kit (DPDK)](#2113-kernel-bypass-data-plane-development-kit-dpdk)
- [2.11.4. PF\_RING](#2114-pf_ring)
- [2.11.5. Programmable packet processing: eXpress Data Path (XDP)](#2115-programmable-packet-processing-express-data-path-xdp)

## 1. Linux Networking stack
<details>
<summary>Click to expand</summary>
- In network devices, it is common for the NIC to raise an **IRQ** to signal that a packet has arrived and is ready to be processed.
- An IRQ (Interrupt Request) is a hardware signal sent to the processor instructing it to suspend its current activity and handle some external event, such as a keyboard input or a mouse movement.
- In Linux, IRQ mappings are stored in **/proc/interrupts**.
- When an IRQ handler is executed by the Linux kernel, it runs at a very high priority and often blocks additional IRQs from being generated. As such, IRQ handlers in device drivers must execute as quickly as possible and defer all long-running work to execute outside of this context. This is why the **softIRQ** system exists.
- The **softIRQ** system is what the kernel uses to process work outside of the device driver IRQ context. In the case of network devices, the softIRQ system is responsible for processing incoming packets.
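A quick way to observe all of this on a live system (a sketch; the interface name `eth0` is an assumption, substitute your NIC):

```bash
# Hardware IRQ -> CPU mapping for the NIC
grep eth0 /proc/interrupts
# Per-CPU softIRQ counters; NET_RX/NET_TX are the networking softIRQs
grep -E 'NET_RX|NET_TX' /proc/softirqs
# One ksoftirqd kernel thread per CPU runs the deferred work
ps -e | grep ksoftirqd
```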
- The call to `napi_schedule` in the driver adds the driver's NAPI poll structure to the `poll_list` for the current CPU.
- The softIRQ pending bit is set so that the `ksoftirqd` process on this CPU knows that there are packets to process.
- `run_ksoftirqd` function (which is being run in a loop by the `ksoftirqd` kernel thread) executes.
- `__do_softirq` is called, which checks the pending bitfield, sees that a softIRQ is pending, and calls the handler registered for the pending softIRQ: `net_rx_action` (the softIRQ kernel thread executes this, not the driver IRQ handler).
- Now, data processing begins:
- `net_rx_action` loop starts by checking the NAPI poll list for NAPI structures.
- The `budget` and elapsed time are checked to ensure that the softIRQ will not monopolize CPU time.
- The registered `poll` function is called.
- The driver's `poll` function harvests packets from the ring buffer in RAM.
- Packets are handed over to `napi_gro_receive` (GRO - Generic Receive Offloading).
- GRO is a widely used SW-based offloading technique to reduce per-packet processing overheads.
- By reassembling small packets into larger ones, GRO enables applications to process fewer large packets directly, thus reducing the number of packets to be processed.
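To check whether GRO is active (a minimal sketch, assuming the interface is `eth0`):

```bash
# Show the current GRO state among the NIC's offload features
ethtool -k eth0 | grep generic-receive-offload
# Turn it off, e.g. while capturing traffic that must show the original packets
sudo ethtool -K eth0 gro off
```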
- If RPS is disabled:
1. `netif_receive_skb` passes the data on to `__netif_receive_skb_core`.
6. `__netif_receive_skb_core` delivers the data to any taps.
7. `__netif_receive_skb_core` delivers data to registered protocol layer handlers.
- If RPS is enabled:
1. `netif_receive_skb` passes the data on to `enqueue_to_backlog`.
2. Packets are placed on a per-CPU input queue for processing.
3. The remote CPU’s NAPI structure is added to that CPU’s `poll_list` and an IPI is queued, which will trigger the softIRQ kernel thread on the remote CPU to wake up if it is not running already.
4. When the `ksoftirqd` kernel thread on the remote CPU runs, it follows the same pattern described in the previous section, but this time the registered poll function is `process_backlog`, which harvests packets from the current CPU’s input queue.
5. Packets are passed on toward `__netif_receive_skb_core`.
6. `__netif_receive_skb_core` delivers data to any taps (like PCAP).
7. `__netif_receive_skb_core` delivers data to registered protocol layer handlers.
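A configuration sketch for RPS (the queue name and CPU mask are assumptions; the mask is a hex bitmap of the CPUs allowed to process packets from that queue):

```bash
# Steer packets from rx queue 0 of eth0 to CPUs 0-3 (mask 0xf)
echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
# Per-CPU backlog counters: 1st column = processed, 2nd = dropped (hex values)
cat /proc/net/softnet_stat
```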
> **NOTE**: Some NICs are "multiple queues" NICs. The diagram above shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings you may have multiple queues in the system. Check the [Share the load of packet processing among CPUs](#24-share-the-load-of-packet-processing-among-cpus) section for details.
1. Packet arrives at the NIC
2. NIC verifies `MAC` (if not in [promiscuous mode](https://unix.stackexchange.com/questions/14056/what-is-kernel-ip-forwarding)) and `FCS`, and decides to drop or to continue.
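For reference, this is how a NIC is put into promiscuous mode so it stops filtering on destination `MAC` (a sketch; the interface name is an assumption):

```bash
# Accept frames regardless of destination MAC (what sniffers like tcpdump do)
sudo ip link set dev eth0 promisc on
# Verify: the detailed link state should show "promiscuity 1"
ip -d link show dev eth0 | grep promiscuity
```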
9. NAPI polls data from the rx ring buffer.
- NAPI was written to make processing packets from incoming cards more efficient. HardIRQs are expensive because they can't be interrupted, as we both know. Even with _interrupt coalescence_ (described later in more detail), the interrupt handler will monopolize a CPU core completely. The design of NAPI allows the driver to go into a polling mode instead of raising a hardIRQ for every packet received.
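The polling loop's CPU limits are tunable via sysctl; a sketch (the new value is illustrative, not a recommendation):

```bash
# Max packets handled per net_rx_action run and, on kernels >= 4.12,
# the max time in microseconds one run may consume
sysctl net.core.netdev_budget net.core.netdev_budget_usecs
# Example: raise the packet budget if softnet_stat shows time squeezes
sudo sysctl -w net.core.netdev_budget=600
```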
- Steps 1->9 in brief:

20. It finds the right socket
21. It goes to the TCP finite state machine
22. The packet is enqueued to the receive buffer, which is sized according to `tcp_rmem` rules
- If `tcp_moderate_rcvbuf` is enabled, the kernel will auto-tune the receive buffer
- `tcp_rmem`: Contains 3 values that represent the minimum, default and maximum size of the TCP socket receive buffer.
- min: minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. Default: 4 KB.
- default: initial size of receive buffer used by TCP sockets. This value overrides `net.core.rmem_default` used by other protocols. Default: 131072 bytes. This value results in an initial window of 65535.
- max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP sockets. This value does not override `net.core.rmem_max`. Calling `setsockopt()` with `SO_RCVBUF` disables automatic tuning of that socket’s receive buffer size, in which case this value is ignored. `SO_RCVBUF` sets the fixed size of the TCP receive buffer; it will override `tcp_rmem`, and the kernel will no longer dynamically adjust the buffer. The maximum value set by `SO_RCVBUF` cannot exceed `net.core.rmem_max`. Normally, we will not use it. Default: between 131072 and 6MB, depending on RAM size.
- `net.core.rmem_max`: the upper limit of the TCP receive buffer size.
- Between `net.core.rmem_max` and `net.ipv4.tcp_rmem`'s max value, the bigger value [takes precedence](https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_output.c#L241).
- Increase this buffer to enable scaling to a larger window size. Larger windows increase the amount of data to be transferred before an acknowledgement (ACK) is required. This reduces overall latencies and results in increased throughput.
- This setting is typically set to a very conservative value of 262,144 bytes. It is recommended this value be set as large as the kernel allows. 4.x kernels accept values over 16 MB.
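A tuning sketch for these buffers (the 16 MB figure is an assumption for a fast host, not a universal recommendation):

```bash
# Current values: TCP min/default/max, the global cap, and the auto-tuning switch
sysctl net.ipv4.tcp_rmem net.core.rmem_max net.ipv4.tcp_moderate_rcvbuf
# Allow TCP receive buffers to auto-tune up to 16 MB
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem='4096 131072 16777216'
```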
23. Kernel will signal that there is data available to apps (via epoll or any polling system)
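To watch the receive buffer and auto-tuning in action per socket:

```bash
# -t TCP sockets, -m socket memory, -i internal TCP info (cwnd, rtt, ...)
# In the skmem output, "r" is memory currently queued and "rb" is the buffer limit
ss -tmi
```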
- RSS provides the benefits of parallel receive processing in multiprocessing environments.
- This is NIC technology. It supports multiple queues and integrates a hashing function in the NIC that distributes packets to different queues by source and destination IP and, if applicable, by TCP/UDP source and destination ports. The NIC computes a hash value for each incoming packet. Based on hash values, the NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues.
- Check with the `ethtool -l` command; change with `ethtool -L`.
- According to the [Linux kernel documentation](https://github.com/torvalds/linux/blob/v4.11/Documentation/networking/scaling.txt#L80), `RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck... For low latency networking, the optimal setting is to allocate as many queues as there are CPUs in the system (or the NIC maximum, if lower)`.
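Checking and changing the queue ("channel") count, as suggested above (a sketch; 4 is an assumed example, capped at your CPU count or the NIC maximum):

```bash
# Show current and maximum rx/tx/combined queue counts
ethtool -l eth0
# Request 4 combined queues, if the hardware supports it
sudo ethtool -L eth0 combined 4
```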
- In step (14), I mentioned `netdev_max_backlog`; it's about the per-CPU backlog queue. The `netif_receive_skb()` kernel function (step (12)) will find the corresponding CPU for a packet and enqueue packets in that CPU's queue. If the queue for that processor is full and already at maximum size, packets will be dropped. The default queue size (the `netdev_max_backlog` value) is 1000, which may not be enough for multiple interfaces operating at 1Gbps, or even a single interface at 10Gbps.
- Tuning:
- Change command:
- Double the value, then check `/proc/net/softnet_stat`:
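A sketch of that tuning loop (the new value is illustrative):

```bash
# Current per-CPU backlog limit (default 1000)
sysctl net.core.netdev_max_backlog
# Double it...
sudo sysctl -w net.core.netdev_max_backlog=2000
# ...then re-check: the 2nd column of each row is packets dropped, in hex, per CPU
cat /proc/net/softnet_stat
```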
- `numactl`: control NUMA policy for processes or shared memory.
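Typical usage (a sketch; `./my_server` is a hypothetical binary):

```bash
# Show NUMA topology: nodes with their CPUs and memory
numactl --hardware
# Pin a process and its memory allocations to node 0
numactl --cpunodebind=0 --membind=0 ./my_server
```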
### 2.11. Furthermore - Packet processing
This section is an advanced one. It introduces some advanced modules/frameworks for achieving high performance.
#### 2.11.1. `AF_PACKET` v4
Source:
- In order to improve Rx and Tx performance, this implementation makes use of `PACKET_MMAP` (a ring buffer mapped between the kernel and user space, removing the need for a syscall and a copy per packet).