@dustccc dustccc commented Aug 1, 2025

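For background, see this issue.

When a new process is started inside the container (for example, through an SSH login), it does not inherit the container's environment variables by default, so the information shown is wrong: in the session below, nvidia-smi reports 81920MiB after `ssh localhost` instead of the 8192MiB limit.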

root@test:/workspace# nvidia-smi
Fri Aug  1 10:48:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:38:00.0 Off |                    0 |
| N/A   50C    P0             68W /  400W |       0MiB /   8192MiB |     19%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@test:/workspace# ssh localhost
root@localhost's password: 
root@test:~# nvidia-smi
[HAMI-core Msg(2011:139706544437056:libvgpu.c:837)]: Initializing.....
Fri Aug  1 10:47:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:38:00.0 Off |                    0 |
| N/A   52C    P0             83W /  400W |       0MiB /  81920MiB |     57%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(2011:139706544437056:multiprocess_memory_limit.c:498)]: Calling exit handler 2011

I changed the part that reads the environment variables from PID 1. If that environment is not available, the code simply returns and continues with the next step; if it exists, the code reads it and uses the injected default environment variables. After some thought, I decided not to pick individual environment variables to load, because later code may add new ones.
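The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `import_env_block` and `import_env_from_file` are hypothetical, and the sketch assumes a Linux `/proc/<pid>/environ` file, which holds NUL-separated `KEY=VALUE` entries. Variables already present in the current process are left untouched, and a missing or unreadable file is treated as "nothing to import".

```c
#define _POSIX_C_SOURCE 200809L /* for setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Apply a NUL-separated "KEY=VALUE" block (the /proc/<pid>/environ layout).
 * Variables that already exist in this process are kept as they are.
 * Returns the number of variables that were newly set. */
int import_env_block(const char *buf, size_t len)
{
    int added = 0;
    size_t pos = 0;

    while (pos < len) {
        const char *entry = buf + pos;
        const char *nul = memchr(entry, '\0', len - pos);
        size_t entry_len = nul ? (size_t)(nul - entry) : len - pos;
        pos += entry_len + 1;

        const char *eq = memchr(entry, '=', entry_len);
        if (eq == NULL || eq == entry)
            continue; /* skip entries without a key or without '=' */

        size_t key_len = (size_t)(eq - entry);
        size_t val_len = entry_len - key_len - 1;
        char *key = malloc(key_len + 1);
        char *val = malloc(val_len + 1);
        if (key != NULL && val != NULL) {
            memcpy(key, entry, key_len);
            key[key_len] = '\0';
            memcpy(val, eq + 1, val_len);
            val[val_len] = '\0';
            /* overwrite = 0: never replace a value the process already has */
            if (getenv(key) == NULL && setenv(key, val, 0) == 0)
                added++;
        }
        free(key);
        free(val);
    }
    return added;
}

/* Read a whole environ-style file and import it.
 * Returns 0 if the file cannot be opened (nothing to import),
 * -1 on allocation failure, otherwise the number of variables added. */
int import_env_from_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;

    char *buf = NULL;
    size_t len = 0, cap = 0;
    for (;;) {
        if (len == cap) {
            /* /proc files report size 0, so grow the buffer until EOF */
            size_t new_cap = cap ? cap * 2 : 4096;
            char *new_buf = realloc(buf, new_cap);
            if (new_buf == NULL) {
                free(buf);
                fclose(fp);
                return -1;
            }
            buf = new_buf;
            cap = new_cap;
        }
        size_t n = fread(buf + len, 1, cap - len, fp);
        if (n == 0)
            break;
        len += n;
    }
    fclose(fp);

    int added = import_env_block(buf, len);
    free(buf);
    return added;
}
```

Under these assumptions, a library would call `import_env_from_file("/proc/1/environ")` once during initialization. Reading another process's `environ` file is subject to kernel permission checks, so a process running as a different user may get nothing back, and the sketch then falls through without changing anything.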

root@ct-1400911962493157376:/workspace# nvidia-smi
Fri Aug  1 10:56:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:38:00.0 Off |                    0 |
| N/A   49C    P0             71W /  400W |       0MiB /   8192MiB |      7%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@test:/workspace# ssh localhost
root@localhost's password: 
root@test:~# nvidia-smi
Fri Aug  1 10:58:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:38:00.0 Off |                    0 |
| N/A   53C    P0            250W /  400W |       0MiB /   8192MiB |     76%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

hami-robot bot commented Aug 1, 2025

Welcome @dustccc! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

@hami-robot hami-robot bot added the size/S label Aug 1, 2025
@chaunceyjiang chaunceyjiang left a comment

I noticed that you used SSH for testing.

As far as I know, due to security reasons, sshd does not inherit environment variables (ENV) by default. However, you can configure sshd to inherit ENV, so that when you log in via SSH, the environment variables will be correctly loaded.

dustccc commented Sep 16, 2025

> I noticed that you used SSH for testing.
>
> As far as I know, due to security reasons, sshd does not inherit environment variables (ENV) by default. However, you can configure sshd to inherit ENV, so that when you log in via SSH, the environment variables will be correctly loaded.

I understand that this is for security reasons, but I've noticed that many issues in Project-HAMi report inaccurate GPU information, and sometimes component errors, when different users or different access methods use a container for training.
Alternatively, could I modify the submitted code so that it filters only the environment variables used by HAMi and synchronizes them globally to solve this problem?
Examples of such issues:
Project-HAMi/HAMi#1090
Project-HAMi/HAMi#604
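
The filtering idea proposed above could look something like the following sketch. It is only an illustration: the function name `is_hami_limit_var` is hypothetical, and the variable names in the allowlist (`CUDA_DEVICE_MEMORY_LIMIT`, `CUDA_DEVICE_SM_LIMIT`, optionally followed by a `_<index>` suffix) are assumptions about which variables HAMi-core reads, so the real list would need to be checked against the source.

```c
#include <stddef.h>
#include <string.h>

/* Assumed allowlist of variable names; verify against the HAMi-core source. */
static const char *const allowed_names[] = {
    "CUDA_DEVICE_MEMORY_LIMIT",
    "CUDA_DEVICE_SM_LIMIT",
};

/* Return 1 if a "KEY=VALUE" entry belongs to the allowlist, either as the
 * exact name or as the name followed by a "_<suffix>" (per-device form). */
int is_hami_limit_var(const char *entry)
{
    size_t count = sizeof(allowed_names) / sizeof(allowed_names[0]);

    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(allowed_names[i]);
        if (strncmp(entry, allowed_names[i], n) == 0 &&
            (entry[n] == '=' || entry[n] == '_'))
            return 1;
    }
    return 0;
}
```

With a check like this, only the matching entries from PID 1's environment would be copied, which keeps unrelated variables such as `PATH` out of the new session at the cost of maintaining the list whenever a new variable is introduced.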

hami-robot bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dustccc
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
