# Briefing on Systems Performance Analysis
Table of contents:

- [Briefing on Systems Performance Analysis](#briefing-on-systems-performance-analysis)
  - [1. Executive Summary](#1-executive-summary)
  - [2. Core Methodologies and Concepts](#2-core-methodologies-and-concepts)
    - [2.1. Fundamental Concepts](#21-fundamental-concepts)
    - [2.2. Key Analytical Methodologies](#22-key-analytical-methodologies)
    - [2.3. Observability vs. Experimentation](#23-observability-vs-experimentation)
  - [3. Modern Observability Landscape](#3-modern-observability-landscape)
    - [3.1. Core Technologies](#31-core-technologies)
    - [3.2. Instrumentation Types](#32-instrumentation-types)
    - [3.3. Key Tooling Front-ends](#33-key-tooling-front-ends)
  - [4. Analysis Across the System Stack](#4-analysis-across-the-system-stack)
    - [4.1. Applications](#41-applications)
    - [4.2. CPU](#42-cpu)
    - [4.3. Memory](#43-memory)
    - [4.4. File Systems \& Disks](#44-file-systems--disks)
    - [4.5. Network](#45-network)
  - [5. Cloud and Virtualization Performance](#5-cloud-and-virtualization-performance)
    - [5.1. Hardware Virtualization (Virtual Machines)](#51-hardware-virtualization-virtual-machines)
    - [5.2. OS Virtualization (Containers)](#52-os-virtualization-containers)
  - [6. Benchmarking and Case Study Insights](#6-benchmarking-and-case-study-insights)
    - [6.1. Principles of Effective Benchmarking](#61-principles-of-effective-benchmarking)
    - [6.2. Netflix Case Study: VM vs. Container](#62-netflix-case-study-vm-vs-container)

## 1. Executive Summary
This document synthesizes the core principles, methodologies, and tools for modern systems performance analysis, as detailed in the source materials. The central thesis is that systems performance is a complex, ongoing discipline that necessitates a structured, evidence-based approach rather than relying on guesswork. Effective analysis hinges on deep observability, rigorous methodologies, and a holistic understanding of the entire system stack, from applications down to hardware, especially within contemporary enterprise and cloud environments.
The most critical takeaways are:
- **Methodology is Paramount**: The USE Method (Utilization, Saturation, Errors) is presented as a foundational strategy for rapidly checking the health of all system resources (CPU, memory, disk, network), ensuring no bottleneck is overlooked. This is complemented by workload characterization, which investigates the load being applied, and latency analysis, which quantifies performance from the application's perspective.
- **Modern Observability is Transformative**: The evolution of observability tools, particularly the integration of Extended BPF (eBPF) into the Linux kernel, has revolutionized performance analysis. BPF front-ends like BCC and `bpftrace` enable the creation of powerful, low-overhead custom tools for deep inspection of kernel and application behavior, moving beyond the limitations of traditional counters and statistics. These are used alongside established power tools like perf (for CPU profiling and hardware counter analysis) and Ftrace (for kernel function tracing).
- **Focus on Linux and the Cloud**: The material reflects a significant industry shift towards Linux-based systems, dedicating the vast majority of its analysis to the Linux kernel and its tooling. It places a heavy emphasis on cloud computing environments, addressing unique challenges such as the performance impact of hypervisors, multi-tenant resource contention (the "noisy neighbor" problem), and the complexities of OS-level virtualization (containers), where performance is often dictated by resource controls like cgroups.
- **Analysis is Multi-Layered**: A performance issue is rarely isolated to a single component. A thorough investigation requires analyzing the system vertically through its layers (application, operating system, and hardware) and understanding their interactions. For example, an application issue might first appear as high kernel CPU time; tracing system calls can show that this time is spent issuing disk I/O, whose latency must then be measured at the block device layer. Visualizations like flame graphs are indispensable for comprehending these complex interactions, especially for CPU profiling.
## 2. Core Methodologies and Concepts
Effective performance analysis is grounded in a set of core concepts and systematic methodologies that provide a framework for investigation.
### 2.1. Fundamental Concepts
- **Latency**: The most essential metric for performance, defined as the time spent waiting for an operation to complete. It can be used to quantify the magnitude of a problem and estimate the maximum potential speedup. Unlike throughput or IOPS, latency is directly experienced by the application and end-users.
- **Utilization**: A measure of how busy a resource is over a time interval. High utilization is not inherently a problem; it may simply indicate efficient use of resources. However, it is a prerequisite for saturation.
- **Saturation**: The degree to which a resource has more work than it can service, often resulting in queued work. Saturation directly causes performance degradation through increased latency. It is a more critical indicator of a bottleneck than utilization. For example, a CPU can be 100% utilized but not saturated if there's only one thread running; it becomes saturated when multiple threads are waiting in the run queue.
- **Scalability**: A measure of how a system's performance changes as load increases. A typical system exhibits linear scalability up to a "knee point," after which resource contention causes performance to degrade. Understanding a system's scalability profile is crucial for capacity planning.
- **Trade-Offs**: Performance tuning often involves compromises. A common trade-off is between CPU and memory, where memory can cache results to reduce CPU usage, or CPU can be used to compress data to reduce memory usage. The "good, fast, cheap: pick two" principle also applies, where projects often prioritize on-time delivery and low cost, deferring performance tuning.
### 2.2. Key Analytical Methodologies
The source outlines several structured approaches to guide performance investigations.

| Methodology | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| The USE Method | A foundational, high-speed strategy for checking all system resources. For every resource, check Utilization, Saturation, and Errors. This ensures a comprehensive first-pass analysis that can quickly identify the bottlenecked resource class. |
| Workload Characterization | Focuses on the input to the system. It answers questions like: Who is causing the load? What requests are being made? Why are they being made? This methodology can reveal that the system is not broken but is simply overloaded or performing unnecessary work. |
| Drill-Down Analysis | A top-down investigative process that starts with a high-level view and progressively narrows the focus. It involves digging through deeper layers of the software stack (e.g., from application to syscalls to file system to disk driver) to find the root cause. |
| Latency Analysis | Measures the time taken for key operations to identify where time is being spent. This is critical for quantifying bottlenecks. Analysis should examine the full distribution of latencies, not just averages, to identify performance outliers. |
| Static Performance Tuning | Analyzes the configured architecture of a system when it is at rest (no load). It involves checking system and application configuration, tuning parameters, versions, and resource limits to find misconfigurations that could harm performance. |
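
As a rough illustration of a first-pass USE Method sweep, the following sketch uses standard Linux utilities (exact column names vary between sysstat versions):

```sh
# CPU: per-CPU utilization, and saturation as queued runnable threads ("r" > CPU count)
mpstat -P ALL 1
vmstat 1

# Memory: utilization, and saturation as swap activity ("si"/"so" columns)
free -m
vmstat 1

# Disks: utilization (%util) and saturation (request queue length) per device
iostat -xz 1

# Network: interface throughput, then errors and drops
sar -n DEV 1
sar -n EDEV 1
```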
### 2.3. Observability vs. Experimentation
- **Observability**: The practice of understanding a system's internal state through passive observation of its outputs. This is the preferred approach for production environments as it minimizes perturbation. It includes tools that use:
  - **Counters**: Kernel-maintained statistics (e.g., from `/proc`, `sar`).
  - **Profiling**: Sampling data to build a statistical picture of behavior (e.g., CPU sampling with `perf`).
  - **Tracing**: Recording individual events for detailed analysis (e.g., tracing syscalls with `bpftrace`).
- **Experimentation**: The practice of actively modifying a system's state to determine its performance characteristics. This is done using benchmark tools (`fio`, `iperf`) and is best suited for test or development environments, as it can be disruptive.
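
A brief sketch contrasting the two approaches (illustrative commands only; the fio test file, the iperf target address, and the availability of syscall tracepoints are assumptions):

```sh
# Observability, counters: kernel-maintained statistics
sar -u 1 5
cat /proc/meminfo

# Observability, profiling: timed sampling of stack traces (99 Hz, all CPUs)
perf record -F 99 -a -g -- sleep 10

# Observability, tracing: per-event recording of openat() syscalls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Experimentation: apply a known synthetic load and measure the result
fio --name=randread --filename=/tmp/fio.test --size=1G --rw=randread --bs=4k --runtime=30 --time_based
iperf -c 10.0.0.2 -t 30 -P 4
```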
## 3. Modern Observability Landscape
The ability to observe a system is contingent on its instrumentation. Modern Linux offers a powerful suite of instrumentation technologies and front-end tools.
### 3.1. Core Technologies
- **Extended BPF (eBPF)**: A revolutionary in-kernel execution environment that acts as a sandboxed virtual machine. It allows user-supplied programs to run safely within the kernel, enabling the creation of highly efficient and powerful performance analysis tools. It is the foundation for the BCC framework and `bpftrace`.
- **perf**: A cornerstone of Linux performance analysis, perf is a versatile tool built on the `perf_event_open()` syscall. Its primary capabilities include CPU profiling via timed sampling, accessing hardware Performance Monitoring Counters (PMCs) for deep architectural insights (like cache misses and instructions per cycle), and tracing static (tracepoints) and dynamic (kprobes) events.
- **Ftrace**: The official, built-in tracer of the Linux kernel. It is extremely efficient for tracing kernel function calls, measuring function execution time, and generating function call graphs, making it invaluable for deep kernel exploration.
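
For orientation, a minimal sketch of how these technologies are typically driven (the traced binary `./myapp` is a placeholder; the probed kernel symbol and tracefs paths depend on the kernel version, and root is required):

```sh
# perf: hardware counters (PMCs) for IPC and cache behavior
perf stat -e cycles,instructions,cache-references,cache-misses -- ./myapp

# perf: static tracepoints and dynamic kprobes
perf record -e sched:sched_switch -a -- sleep 5
perf probe --add do_sys_openat2
perf record -e probe:do_sys_openat2 -a -- sleep 5

# Ftrace: kernel function call graphs with durations
cd /sys/kernel/tracing
echo vfs_read > set_graph_function
echo function_graph > current_tracer
head -50 trace
```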
### 3.2. Instrumentation Types
| Type | Description | Examples in Linux | Stability |
| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- | ---------------------------------------------------------------- |
| Static Instrumentation | Instrumentation points that are hard-coded into the source code at logical locations. | Tracepoints, USDT probes | Stable API, intended for long-term use. |
| Dynamic Instrumentation | The ability to insert instrumentation on-the-fly into arbitrary locations in a running kernel or user-space application without recompiling. | kprobes (kernel), uprobes (user) | Unstable; function names can change between kernel/app versions. |
### 3.3. Key Tooling Front-ends
- **BCC (BPF Compiler Collection)**: A toolkit that simplifies writing BPF programs. It provides Python and Lua bindings and comes with over 80 ready-to-use performance tools for analyzing CPU scheduling, memory, disk I/O, networking, and more.
- **`bpftrace`**: A high-level tracing language that provides a simple but powerful syntax for creating custom one-liners and short scripts using BPF. It lowers the barrier to entry for custom tracing and is ideal for ad-hoc and exploratory analysis.
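
A few representative invocations as a sketch (on some distributions the BCC tools are packaged with a -bpfcc suffix; all require root):

```sh
# BCC: ready-made tools
execsnoop            # trace new process execution
opensnoop            # trace open() calls with filenames

# bpftrace: ad-hoc one-liners
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'   # syscall counts by process
bpftrace -e 'kprobe:vfs_read { @bytes = hist(arg2); }'                   # histogram of read request sizes
```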
## 4. Analysis Across the System Stack
A holistic performance investigation examines all layers of the system where an application's request may spend time.
### 4.1. Applications
- **Focus**: Performance is best tuned closest to the work. Analysis involves identifying the most frequently executed code paths (the "common case") and optimizing them.
- **Techniques**: Common performance-enhancing techniques include caching data to avoid redundant work, using concurrency (multiprocessing/multithreading) and parallelism to leverage multiple CPUs, and employing non-blocking or asynchronous I/O to avoid waiting.
- **Analysis**:
  - **CPU Profiling**: Used to analyze on-CPU time. Tools like `perf` sample stack traces of running threads to reveal which functions are consuming the most CPU.
  - **Off-CPU Analysis**: Used to analyze wait time. This involves tracing scheduler events to capture the stack traces of threads that are blocked (e.g., waiting on I/O, locks, or timers).
  - **Flame Graphs**: A powerful visualization for both on-CPU and off-CPU stack trace profiles, allowing for rapid identification of dominant code paths.
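
A minimal sketch of the CPU flame graph workflow, using perf with the open-source FlameGraph scripts (system-wide profiling at 99 Hz for 30 seconds; the output filename is arbitrary):

```sh
# Sample on-CPU stack traces across all CPUs
perf record -F 99 -a -g -- sleep 30

# Fold the stacks and render an interactive SVG flame graph
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg
```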
### 4.2. CPU
- **Metrics**: Analysis goes beyond simple utilization percentages. Key metrics include user time vs. kernel time, Instructions Per Cycle (IPC) to gauge execution efficiency, and memory stall cycles, which can reveal that a CPU is busy waiting for memory, not executing instructions.
- **Analysis**: perf is the primary tool. CPU profiling identifies software bottlenecks, while hardware counters (PMCs) can diagnose hardware-level issues like excessive cache misses or branch mispredictions.
- **Saturation**: CPU saturation is identified by a run queue length greater than the number of CPUs or by measuring scheduler latency (the time a runnable thread waits for a CPU).
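
A short sketch of these checks (runqlat is a BCC tool; PMC event availability depends on the processor and on whether the environment exposes them):

```sh
# IPC and cache misses, system-wide for 10 seconds
perf stat -a -e cycles,instructions,cache-misses -- sleep 10

# Run queue length ("r") and context switch rate ("cs")
vmstat 1

# Scheduler latency: how long runnable threads wait for a CPU
runqlat 10 1
```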
### 4.3. Memory
- **Key Issues**: Saturation is the main concern, manifesting as excessive paging/swapping or, in extreme cases, the Out-of-Memory (OOM) Killer terminating processes. Memory leaks, where memory usage grows unboundedly, are a common root cause.
- **Analysis**: Tools like `vmstat` and `sar` are used to monitor for page scanning and swapping activity. Tracing tools based on BPF can be used to track memory allocations (`malloc()`) and page faults to identify the responsible code paths.
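
For example (a sketch; memleak is a BCC tool, and the target PID 1234 is a placeholder):

```sh
# Paging and page-scanning activity (sustained scanning indicates memory pressure)
sar -B 1

# Swap-ins/outs ("si"/"so") indicate saturation
vmstat 1

# Track outstanding allocations in a suspected leaking process
memleak -p 1234 10
```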
### 4.4. File Systems & Disks
- **Key Concept**: The distinction between logical I/O (at the application/VFS layer) and physical I/O (at the block device/disk layer) is crucial. Caching, prefetching, and write-back buffering mean that logical I/O does not directly correspond to physical I/O, making disk-level stats (`iostat`) potentially misleading about application performance.
- **Analysis**: Latency is the primary metric. Investigation should start at the VFS layer to measure the latency experienced by the application. Tools like `iostat` provide high-level disk statistics, while BPF tools can provide detailed latency distributions (`biolatency`, `ext4slower`) and trace individual I/O events (`biosnoop`).
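
A sketch of working top-down from the file system to the block layer (biolatency, ext4slower, and biosnoop are BCC tools; the 10 ms threshold is arbitrary):

```sh
# File system level: ext4 operations slower than 10 ms, as applications experience them
ext4slower 10

# Block device level: latency distribution, then per-I/O events
biolatency 10 1
biosnoop

# Per-device summary statistics for context (%util, queue length, await)
iostat -xz 1
```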
### 4.5. Network
- **Metrics**: Analysis covers throughput, connection rates, and especially error events like dropped packets and TCP retransmits, which are major sources of latency.
- **Analysis**: The `ss` command offers deep insights into socket state and statistics. `tcpdump` allows for packet-level inspection. BPF-based tools like `tcplife` and `tcpretrans` provide low-overhead ways to log all TCP sessions and track retransmissions, respectively. The choice of TCP congestion control algorithm (e.g., Cubic, BBR) has a profound impact on performance.
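
For instance (a sketch; the interface name, port, and capture path are placeholders, and switching to BBR assumes the module is available):

```sh
# Socket detail: state, RTT, congestion window, retransmit counters
ss -tip

# Low-overhead session and retransmit tracing (BCC tools)
tcplife
tcpretrans

# Congestion control algorithm: inspect, then switch
sysctl net.ipv4.tcp_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Packet-level inspection (higher overhead)
tcpdump -i eth0 -w /tmp/capture.pcap 'tcp port 443'
```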
## 5. Cloud and Virtualization Performance
Cloud environments introduce layers of abstraction that create unique performance challenges and require specialized analysis techniques.
### 5.1. Hardware Virtualization (Virtual Machines)
- **Performance Impact**:
  - **Hypervisor Overhead**: Guest actions that require privilege, such as I/O or certain CPU instructions, cause an expensive exit to the hypervisor (VM-exit). Modern hardware virtualization extensions (Intel VT-x, AMD-V) minimize this but do not eliminate it.
  - **I/O Virtualization**: I/O from a guest must be handled by the hypervisor, often via a proxy process (e.g., QEMU), adding latency and CPU overhead.
  - **Memory Virtualization**: MMU virtualization requires an extra layer of page table lookups (e.g., Intel's EPT), and misses can be costly.
  - **Noisy Neighbors**: The primary challenge is contention for shared physical resources (CPU cores, LLC, memory bus, network) with other tenants on the same host.
- **Observability**: From within a guest VM, physical contention is often only visible indirectly as stolen time (`%st` in `top`/`vmstat`), which is CPU time the guest was ready to run but the hypervisor gave to another tenant. Full observability is only possible from the host hypervisor.
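
A guest-side sketch for spotting this indirect evidence of contention:

```sh
# %steal: time this guest was runnable but the hypervisor ran another tenant
mpstat 1

# Same signal as the "st" column
vmstat 1
```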
### 5.2. OS Virtualization (Containers)
- **Performance Impact**:
  - Containers share the host kernel, so CPU and memory access is near bare-metal speed. Performance is almost entirely dictated by resource contention and explicit limits.
  - **Resource Controls (cgroups)**: These are the primary mechanism for managing container performance. They enforce limits on CPU (shares and absolute caps), memory, and block I/O. A "performance problem" in a container is frequently the result of the application hitting its configured cgroup limit.
- **Observability Challenge**:
  - **Lack of Container Awareness**: Many traditional tools (`vmstat`, `iostat`), when run inside a container, report host-wide statistics, causing significant confusion.
  - **Effective Analysis**: Requires correlating information from two perspectives: from inside the container to see the application's view, and from the host to see the container's true resource consumption and whether it is being throttled by its cgroup limits.
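
A host-side sketch for checking whether a container is being limited or throttled by its cgroup (cgroup v2 layout; the scope path is a placeholder that depends on the container runtime):

```sh
CG=/sys/fs/cgroup/system.slice/docker-abc123.scope   # placeholder path

cat "$CG/cpu.max"          # CPU quota and period ("max" = unlimited)
cat "$CG/cpu.stat"         # nr_throttled / throttled_usec reveal CPU throttling
cat "$CG/memory.current"   # current memory usage
cat "$CG/memory.max"       # memory limit; also check memory.events for "oom" counts
```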
## 6. Benchmarking and Case Study Insights
### 6.1. Principles of Effective Benchmarking
- **Rigor is Essential**: "Casual benchmarking" (running a tool and trusting the output) often leads to measuring the wrong thing and drawing bogus conclusions (e.g., measuring file system cache performance when intending to measure disk performance).
- **Active Benchmarking**: The recommended approach is to use observability tools while a benchmark is running. This allows the analyst to confirm what the benchmark is actually testing and to identify the true system limiter by applying methodologies like the USE Method.
- **Common Pitfalls**: Key failures include using tests that are too short, ignoring workload variance, changing multiple factors between tests, and not performing sanity checks on the results (e.g., checking if reported throughput exceeds known hardware limits).
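
A sketch of active benchmarking for a file system read test (the file path and sizes are placeholders):

```sh
# Run the benchmark in the background...
fio --name=seqread --filename=/data/fio.test --size=10G --rw=read --bs=128k --runtime=60 --time_based &

# ...and watch what is actually being exercised while it runs
iostat -xz 1   # are the disks doing the expected I/O, or is %util near zero?
free -m        # is the test file simply being served from the page cache?
sar -n DEV 1   # is the I/O going to a network file system instead of local disks?
```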
### 6.2. Netflix Case Study: VM vs. Container
A real-world case study, which investigated why a microservice ran 3-4x faster after being moved from a dedicated VM to an isolated container, illustrates the value of a holistic analysis process. The performance difference was not due to a single factor but to a combination of issues on the VM:
1. **CPU Saturation**: The VM was heavily loaded (load average of 85 on 48 CPUs), leading to high run queue latency and a high rate of context switches (2 million/sec).
2. **Poor Cache Performance**: The application on the VM experienced a last-level cache (LLC) hit ratio of only 30%, compared to 90% in the container.
3. **Root Causes**: This was traced to a combination of a smaller LLC on the VM host, a more complex workload on the VM, and the frequent context switches preventing the CPU caches from "warming up." This confluence of factors explained the significant performance disparity and demonstrated the necessity of a multi-faceted investigation using a variety of tools (`mpstat`, `perf`, BPF).
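
The figures above come from the source case study; as a rough sketch, the same classes of signal can be gathered with standard tools (LLC event names vary by processor and may be unavailable inside guests):

```sh
uptime           # load averages relative to CPU count
vmstat 1         # "cs": context switch rate; "r": run queue length
mpstat -P ALL 1  # per-CPU utilization and %steal

# LLC hit ratio ~= 1 - (LLC-load-misses / LLC-loads)
perf stat -a -e LLC-loads,LLC-load-misses -- sleep 10
```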
