Commit 173388c

acoifmannvidia authored and sholeksandr committed
Refactor hw-management: Split sync service into peripheral and thermal updaters
Split monolithic hw_management_sync.py into two independent services:
- hw_management_peripheral_updater.py: Handles fans, BMC, leakage, power button, ASIC chipup
- hw_management_thermal_updater.py: Handles ASIC and module temperature monitoring

Key improvements:
- ASIC chipup tracking now independent of thermal monitoring (critical fix)
- Centralized platform configuration in hw_management_platform_config.py
- Services can be stopped/started independently for better reliability

Bug #4546995

Signed-off-by: Abraham Coifman <acoifman@nvidia.com>
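
To illustrate the "stopped/started independently" point, here is a minimal sketch of a helper that builds `systemctl` commands for one updater at a time. The unit names are an assumption derived from the `--name=` values in debian/rules, and `UPDATER_UNITS`, `systemctl_command`, and `run` are hypothetical names that are not part of this commit.

```python
import subprocess

# Assumed unit names, inferred from the --name= values in debian/rules.
UPDATER_UNITS = {
    "peripheral": "hw-management-peripheral-updater.service",
    "thermal": "hw-management-thermal-updater.service",
}

ALLOWED_ACTIONS = ("start", "stop", "restart", "status")


def systemctl_command(action, updater):
    """Build the systemctl argument list for a single updater service."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    if updater not in UPDATER_UNITS:
        raise ValueError(f"unknown updater: {updater}")
    return ["systemctl", action, UPDATER_UNITS[updater]]


def run(action, updater, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = systemctl_command(action, updater)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `run("restart", "thermal")` returns the command that would restart only the thermal updater, leaving the peripheral updater (and its ASIC chipup tracking) running.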
1 parent 2697804 commit 173388c

61 files changed: +20583 -45 lines changed


.gitignore

Lines changed: 33 additions & 0 deletions
@@ -1 +1,34 @@
 output/
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
+
+# Test artifacts
+tests/logs/
+tests/__pycache__/
+.pytest_cache/
+.benchmarks/
+
+# Build artifacts
+*.egg-info/
+dist/
+build/
+
+# IDE and editor files
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# CI/CD tools
+.ngci_tool/
+
+# Temporary files
+*.log
+*.bak
+*.tmp

debian/hw-management.hw-management-sync.service renamed to debian/hw-management.hw-management-peripheral-updater.service

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 [Unit]
-Description=Hw-management events sync service of Nvidia systems
+Description=Hw-management peripheral updater service (fans, BMC, leakage sensors)
 After=hw-management.service
 Requires=hw-management.service
 PartOf=hw-management.service
@@ -8,7 +8,7 @@ StartLimitIntervalSec=1200
 StartLimitBurst=5
 
 [Service]
-ExecStart=/bin/sh -c "/usr/bin/hw_management_sync.py"
+ExecStart=/bin/sh -c "/usr/bin/hw_management_peripheral_updater.py"
 ExecStop=/bin/kill $MAINPID
 TimeoutStopSec=1

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+[Unit]
+Description=Hw-management thermal updater service for ASIC and module temperature monitoring
+After=hw-management.service
+Requires=hw-management.service
+PartOf=hw-management.service
+
+StartLimitIntervalSec=1200
+StartLimitBurst=5
+
+[Service]
+ExecStart=/bin/sh -c "/usr/bin/hw_management_thermal_updater.py"
+ExecStop=/bin/kill $MAINPID
+TimeoutStopSec=1
+
+Restart=on-failure
+RestartSec=10s
+
+[Install]
+WantedBy=multi-user.target
+
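
The restart settings in the unit above interact: `Restart=on-failure` with `RestartSec=10s` retries a failed updater after ten seconds, while `StartLimitBurst=5` within `StartLimitIntervalSec=1200` caps how many starts systemd permits in a 20-minute window. The following is a rough sketch for reading those values out of the unit text with Python's `configparser`; the helper names `load_unit` and `restart_policy` are hypothetical, and real unit files can use syntax that `configparser` does not handle.

```python
import configparser

# Unit text copied from the diff above.
UNIT_TEXT = """\
[Unit]
Description=Hw-management thermal updater service for ASIC and module temperature monitoring
After=hw-management.service
Requires=hw-management.service
PartOf=hw-management.service

StartLimitIntervalSec=1200
StartLimitBurst=5

[Service]
ExecStart=/bin/sh -c "/usr/bin/hw_management_thermal_updater.py"
ExecStop=/bin/kill $MAINPID
TimeoutStopSec=1

Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
"""


def load_unit(text):
    """Parse simple unit-file text; keeps key case and disables % interpolation."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(text)
    return parser


def restart_policy(unit):
    """Collect the settings that govern automatic restarts."""
    return {
        "restart": unit["Service"]["Restart"],
        "restart_delay": unit["Service"]["RestartSec"],
        # systemd refuses further starts once more than `burst` starts
        # occur within `interval_sec` seconds.
        "burst": unit.getint("Unit", "StartLimitBurst"),
        "interval_sec": unit.getint("Unit", "StartLimitIntervalSec"),
    }


policy = restart_policy(load_unit(UNIT_TEXT))
```

With a 10-second delay between attempts, a service that fails immediately on every start would exhaust the five permitted starts well inside the 1200-second interval, after which systemd stops retrying until the limit resets.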

debian/rules

Lines changed: 6 additions & 3 deletions
@@ -52,21 +52,24 @@ endif
 override_dh_installinit:
 	dh_installinit --name=hw-management
 	dh_installinit --name=hw-management-tc
-	dh_installinit --name=hw-management-sync
+	dh_installinit --name=hw-management-peripheral-updater
+	dh_installinit --name=hw-management-thermal-updater
 	dh_installinit --name=hw-management-sysfs-monitor
 	dh_installinit --name=hw-management-fast-sysfs-monitor
 
 override_dh_systemd_enable:
 	dh_systemd_enable --name=hw-management
 	dh_systemd_enable --name=hw-management-tc
-	dh_systemd_enable --name=hw-management-sync
+	dh_systemd_enable --name=hw-management-peripheral-updater
+	dh_systemd_enable --name=hw-management-thermal-updater
 	dh_systemd_enable --name=hw-management-sysfs-monitor
 	dh_systemd_enable --name=hw-management-fast-sysfs-monitor
 
 override_dh_systemd_start:
 	dh_systemd_start --name=hw-management
 	dh_systemd_start --name=hw-management-tc
-	dh_systemd_start --name=hw-management-sync
+	dh_systemd_start --name=hw-management-peripheral-updater
+	dh_systemd_start --name=hw-management-thermal-updater
 	dh_systemd_start --name=hw-management-sysfs-monitor
 	dh_systemd_start --name=hw-management-fast-sysfs-monitor
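
Because the same service name has to appear under all three `override_dh_*` targets, a rename like this one is easy to apply inconsistently. Below is a small sketch of a consistency check over the post-commit state of the snippet above; `names_per_target` is a hypothetical helper, not something shipped in this commit, and it only understands the simple `--name=` pattern used here.

```python
import re

# Post-commit state of the three targets, copied from the diff above.
RULES_SNIPPET = """\
override_dh_installinit:
\tdh_installinit --name=hw-management
\tdh_installinit --name=hw-management-tc
\tdh_installinit --name=hw-management-peripheral-updater
\tdh_installinit --name=hw-management-thermal-updater
\tdh_installinit --name=hw-management-sysfs-monitor
\tdh_installinit --name=hw-management-fast-sysfs-monitor

override_dh_systemd_enable:
\tdh_systemd_enable --name=hw-management
\tdh_systemd_enable --name=hw-management-tc
\tdh_systemd_enable --name=hw-management-peripheral-updater
\tdh_systemd_enable --name=hw-management-thermal-updater
\tdh_systemd_enable --name=hw-management-sysfs-monitor
\tdh_systemd_enable --name=hw-management-fast-sysfs-monitor

override_dh_systemd_start:
\tdh_systemd_start --name=hw-management
\tdh_systemd_start --name=hw-management-tc
\tdh_systemd_start --name=hw-management-peripheral-updater
\tdh_systemd_start --name=hw-management-thermal-updater
\tdh_systemd_start --name=hw-management-sysfs-monitor
\tdh_systemd_start --name=hw-management-fast-sysfs-monitor
"""


def names_per_target(rules_text):
    """Map each override_* target to the --name values in its recipe."""
    targets = {}
    current = None
    for line in rules_text.splitlines():
        header = re.match(r"^(override_\w+):", line)
        if header:
            current = header.group(1)
            targets[current] = []
            continue
        name = re.search(r"--name=(\S+)", line)
        if current and name:
            targets[current].append(name.group(1))
    return targets


targets = names_per_target(RULES_SNIPPET)
# Every target should list the same services in the same order.
consistent = len({tuple(names) for names in targets.values()}) == 1
```

A check of this kind could run in CI against the real debian/rules so that a service installed but never enabled or started is caught before packaging.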
