-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat(perftune): add Azure-specific network tuning #3194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Introduces `AzurePerfTuner` to apply network optimizations for Microsoft Azure environments. This tuner: - Detects if running on an Azure VM. - Verifies if Accelerated Networking is enabled on NICs. - Applies Azure-specific settings for improved network performance, including increasing ring buffers, TX queue length, and tuning kernel sysctl parameters for high throughput. Ref: https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-optimize-network-bandwidth#achieving-consistent-transfer-speeds-in-linux-vms-in-azure Ref: scylladb#2534
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds Azure VM-specific network performance tuning capabilities to the perftune.py script. It introduces an AzurePerfTuner class that detects Azure VMs and applies optimized network settings when Accelerated Networking is enabled.
Key changes:
- New
AzurePerfTunerclass with Azure VM detection and Accelerated Networking verification - Interface-specific tuning (ring buffers and TX queue length)
- Global kernel sysctl optimizations for high-throughput networking on Azure
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| perftune_print("Azure VM detected. Checking for Accelerated Networking...") | ||
|
|
||
| an_enabled_any = False |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'an_enabled_any' is ambiguous. The abbreviation 'an' could stand for multiple things. Consider renaming to 'accelerated_networking_enabled' or 'azure_an_enabled' for better clarity.
| try: | ||
| kernel_ver = platform.release().split('-')[0] | ||
| major, minor = map(int, kernel_ver.split('.')[:2]) | ||
| if major > 4 or (major == 4 and minor >= 19): | ||
| sysctl_params['net.ipv4.tcp_congestion_control'] = 'bbr' | ||
| except Exception: | ||
| pass |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel version parsing assumes a specific format and may fail for certain kernel version strings. For example, versions like '5.10.0-rc1' would fail when trying to convert 'rc1' to an integer. While the exception is caught, consider adding more robust parsing or validating the format before conversion to avoid unnecessary exception handling.
| try: | |
| kernel_ver = platform.release().split('-')[0] | |
| major, minor = map(int, kernel_ver.split('.')[:2]) | |
| if major > 4 or (major == 4 and minor >= 19): | |
| sysctl_params['net.ipv4.tcp_congestion_control'] = 'bbr' | |
| except Exception: | |
| pass | |
| kernel_ver = platform.release().split('-')[0] | |
| match = re.match(r'^(\d+)\.(\d+)', kernel_ver) | |
| if match: | |
| major = int(match.group(1)) | |
| minor = int(match.group(2)) | |
| if major > 4 or (major == 4 and minor >= 19): | |
| sysctl_params['net.ipv4.tcp_congestion_control'] = 'bbr' |
| """ | ||
| driver = self.__get_driver_name(nic) | ||
|
|
||
| # Check 1: Is the NIC itself a VF? (e.g. running on bare metal or direct assignment) |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions 'running on bare metal or direct assignment' but this scenario is unlikely on Azure VMs with Accelerated Networking, which typically use the hv_netvsc driver with a bonded VF. Consider clarifying that this check handles edge cases or non-standard configurations where the VF driver is directly exposed.
| # Check 1: Is the NIC itself a VF? (e.g. running on bare metal or direct assignment) | |
| # Check 1: Is the NIC itself a VF? This covers edge/non-standard setups where the | |
| # VF driver is bound directly to the interface (for example, bare metal | |
| # or direct assignment). On Azure VMs with Accelerated Networking, the | |
| # typical configuration is the synthetic hv_netvsc NIC bonding a VF, | |
| # which is detected in Check 2 below. |
| def __get_driver_name(self, nic): | ||
| """ | ||
| Returns the driver name for a given interface using ethtool. | ||
| """ | ||
| try: | ||
| # run_ethtool returns a list of strings. We join them to search. | ||
| # Output format is usually: "driver: mlx5_core\nversion: ..." | ||
| output = run_ethtool(['-i', nic]) | ||
| for line in output: | ||
| if line.startswith('driver:'): | ||
| return line.split(':')[1].strip() | ||
| except Exception: | ||
| pass | ||
| return None |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method duplicates the __get_driver_name implementation already present in NetPerfTuner class (line 955). Consider refactoring to reuse the existing implementation by either making NetPerfTuner's version accessible or extracting it to a shared utility function.
| # We use check=False because some drivers/versions are noisy even on success | ||
| run_one_command(['ethtool', '-G', nic, 'rx', '1024', 'tx', '1024'], check=False) | ||
|
|
||
| # Optimization: Increase TX Queue Length to 10000 | ||
| run_one_command(['ip', 'link', 'set', nic, 'txqueuelen', '10000'], check=False) |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states 'check=False because some drivers/versions are noisy even on success', but using check=False silently ignores all errors including genuine failures. Consider logging when these commands fail to help with debugging, even if the failures are non-fatal. The dry_run_mode already prints the commands, but in normal mode failures are completely silent.
| # We use check=False because some drivers/versions are noisy even on success | |
| run_one_command(['ethtool', '-G', nic, 'rx', '1024', 'tx', '1024'], check=False) | |
| # Optimization: Increase TX Queue Length to 10000 | |
| run_one_command(['ip', 'link', 'set', nic, 'txqueuelen', '10000'], check=False) | |
| # We keep failures non-fatal but log them for debugging. | |
| try: | |
| run_one_command(['ethtool', '-G', nic, 'rx', '1024', 'tx', '1024'], check=True) | |
| except subprocess.CalledProcessError as e: | |
| perftune_print(f"Warning: failed to set ring buffers on interface {nic}: {e}") | |
| # Optimization: Increase TX Queue Length to 10000 | |
| try: | |
| run_one_command(['ip', 'link', 'set', nic, 'txqueuelen', '10000'], check=True) | |
| except subprocess.CalledProcessError as e: | |
| perftune_print(f"Warning: failed to set txqueuelen on interface {nic}: {e}") |
| """ | ||
| return iter(self.__slaves[nic]) | ||
|
|
||
| AzurePerfTuner(self.nics).tune() |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is unreachable code. It appears after a return statement in the slaves method and will never be executed. This line should be moved to the tune method of the NetPerfTuner class (after line 666) to actually invoke the Azure tuning functionality.
| 'net.core.netdev_budget': '1000', | ||
| 'net.core.optmem_max': '65535', | ||
|
|
||
| # Connection Backlogs |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential conflict with existing sysctl settings. The NetPerfTuner.tune method sets net.core.somaxconn to 4096 (line 660), but AzurePerfTuner.tune sets it to 32768 (line 1323). If AzurePerfTuner runs after NetPerfTuner, it will overwrite this value. However, if the misplaced line 700 is fixed and placed at the end of NetPerfTuner.tune, this would result in the Azure value taking precedence, which may be intentional for Azure VMs. Consider documenting this behavior or extracting the common sysctl tuning to avoid conflicts.
| # Connection Backlogs | |
| # Connection Backlogs | |
| # NOTE: This Azure-specific value intentionally overrides the generic | |
| # net.core.somaxconn setting applied by NetPerfTuner.tune to support | |
| # higher connection backlogs on Azure high-throughput VMs. |
| for line in output: | ||
| if line.startswith('driver:'): | ||
| return line.split(':')[1].strip() | ||
| except Exception: |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| if slave_driver in self.vf_drivers: | ||
| return True | ||
| except Exception: | ||
| pass |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| pass | |
| logging.debug( | |
| "Failed to inspect lower devices for NIC %s while checking for Azure Accelerated Networking", | |
| nic, | |
| exc_info=True, | |
| ) |
Introduces
AzurePerfTunerto apply network optimizations for Microsoft Azure environments.This tuner:
Ref: https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-optimize-network-bandwidth#achieving-consistent-transfer-speeds-in-linux-vms-in-azure
Ref: #2534