Adding rank based logging for torch distributed examples #3897
Adding rank based logging for torch distributed examples #3897 — apbose wants to merge 3 commits into abose/trt_llm_installation_dist from
Conversation
31666e3 to
52ae92a
Compare
| return device_mesh, world_size, rank | ||
| # Set C++ TensorRT runtime log level based on most verbose handler | ||
| # this is similar to set_log_level() | ||
| cpp_level = min(file_level_int, console_level_int) |
There was a problem hiding this comment.
Don't we have an API that abstracts needing to detect if the C++ runtime is available? If not, we should add one.
There was a problem hiding this comment.
I have added a function in _features.py for the above, and also moved all this to logging.py. Let me know if that function placement works.
| not is_platform_supported_for_trtllm(), | ||
| "Skipped on Windows, Jetson and CUDA13: NCCL backend is not supported.", | ||
| not is_distributed_nccl_available(), | ||
| "Skipped: NCCL backend is not available (Windows/Jetson not supported).", |
There was a problem hiding this comment.
Is it jetson or just Orin?
There was a problem hiding this comment.
Yeah, Orin. Changed it to "Jetson Orin".
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 00:37:46.920408+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 00:38:18.669710+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 07:00:24.693914+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 07:00:53.634960+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summaryaa4183e to
2ea29e4
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 15:34:05.984305+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 15:34:37.144980+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
2ea29e4 to
6833fec
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 22:41:27.269191+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 22:41:58.523912+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
6833fec to
f8befae
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 23:49:16.116928+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-02 23:49:48.815063+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-02 23:49:16.689930+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-02 23:50:00.341840+00:00
@@ -74,19 +74,19 @@
try:
size = os.path.getsize(path)
shm_files.append((path, size))
except OSError:
shm_files.append((path, -1))
-
+
# Sort by size descending
shm_files.sort(key=lambda x: x[1], reverse=True)
for path, size in shm_files:
if size >= 0:
print(f" {path}: {size / (1024 * 1024):.2f} MB")
else:
print(f" {path}: <unable to get size>")
-
+
if not shm_files:
print(" (no files found)")
except Exception as e:
print(f" Error listing /dev/shm: {e}")
f8befae to
f40e84b
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 00:44:03.183076+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 00:44:33.293930+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-03 00:44:03.634077+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-03 00:44:44.650284+00:00
@@ -67,41 +67,39 @@
# List ALL files in /dev/shm to see what's consuming space
print("\nAll files in /dev/shm (including hidden):")
try:
import subprocess
+
# Use ls -la to see all files including hidden ones
result = subprocess.run(
- ["ls", "-la", "/dev/shm"],
- capture_output=True,
- text=True,
- timeout=5
+ ["ls", "-la", "/dev/shm"], capture_output=True, text=True, timeout=5
)
print(result.stdout)
-
+
# Also run du to see actual disk usage
print("\nDisk usage breakdown (du -sh /dev/shm/*):")
result = subprocess.run(
["du", "-sh", "/dev/shm/*"],
capture_output=True,
text=True,
shell=False,
- timeout=5
+ timeout=5,
)
# du with glob needs shell=True
result = subprocess.run(
"du -sh /dev/shm/* 2>/dev/null | head -20",
capture_output=True,
text=True,
shell=True,
- timeout=5
+ timeout=5,
)
print(result.stdout if result.stdout else " (no output)")
-
+
except Exception as e:
print(f" Error listing /dev/shm: {e}")
-
+
# Also list using Python for comparison
print("\nPython os.listdir():")
try:
shm_files = []
for f in os.listdir("/dev/shm"):
@@ -109,25 +107,27 @@
try:
size = os.path.getsize(path)
shm_files.append((path, size))
except OSError:
shm_files.append((path, -1))
-
+
# Sort by size descending
shm_files.sort(key=lambda x: x[1], reverse=True)
total_listed = 0
for path, size in shm_files:
if size >= 0:
print(f" {path}: {size / (1024 * 1024):.2f} MB")
total_listed += size
else:
print(f" {path}: <unable to get size>")
-
+
print(f"\nTotal from listed files: {total_listed / (1024 * 1024):.2f} MB")
print(f"Reported used: {usage_before.get('used_mb', 'N/A')} MB")
- print(f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!")
-
+ print(
+ f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!"
+ )
+
if not shm_files:
print(" (no files found)")
except Exception as e:
print(f" Error: {e}")
@@ -135,11 +135,11 @@
"/dev/shm/nccl-*",
"/dev/shm/torch_*",
"/dev/shm/py_shared_memory_*",
"/dev/shm/*multiprocessing*",
"/dev/shm/vader_segment*", # Open MPI shared memory
- "/dev/shm/sem.*", # POSIX semaphores
+ "/dev/shm/sem.*", # POSIX semaphores
]
total_files = 0
total_bytes_freed = 0
f40e84b to
99ded8c
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 05:24:09.711666+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 05:24:42.564679+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-03 05:24:10.282669+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-12-03 05:24:54.082736+00:00
@@ -67,33 +67,31 @@
# List ALL files in /dev/shm to see what's consuming space
print("\nAll files in /dev/shm (including hidden):")
try:
import subprocess
+
# Use ls -la to see all files including hidden ones
result = subprocess.run(
- ["ls", "-la", "/dev/shm"],
- capture_output=True,
- text=True,
- timeout=5
+ ["ls", "-la", "/dev/shm"], capture_output=True, text=True, timeout=5
)
print(result.stdout)
-
+
# Also run du to see actual disk usage
print("\nDisk usage breakdown (du -sh /dev/shm/*):")
result = subprocess.run(
"du -sh /dev/shm/* 2>/dev/null | head -20",
capture_output=True,
text=True,
shell=True,
- timeout=5
+ timeout=5,
)
print(result.stdout if result.stdout else " (no output)")
-
+
except Exception as e:
print(f" Error listing /dev/shm: {e}")
-
+
# Also list using Python for comparison
print("\nPython os.listdir():")
try:
shm_files = []
for f in os.listdir("/dev/shm"):
@@ -101,25 +99,27 @@
try:
size = os.path.getsize(path)
shm_files.append((path, size))
except OSError:
shm_files.append((path, -1))
-
+
# Sort by size descending
shm_files.sort(key=lambda x: x[1], reverse=True)
total_listed = 0
for path, size in shm_files:
if size >= 0:
print(f" {path}: {size / (1024 * 1024):.2f} MB")
total_listed += size
else:
print(f" {path}: <unable to get size>")
-
+
print(f"\nTotal from listed files: {total_listed / (1024 * 1024):.2f} MB")
print(f"Reported used: {usage_before.get('used_mb', 'N/A')} MB")
- print(f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!")
-
+ print(
+ f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!"
+ )
+
if not shm_files:
print(" (no files found)")
except Exception as e:
print(f" Error: {e}")
@@ -127,11 +127,11 @@
"/dev/shm/nccl-*",
"/dev/shm/torch_*",
"/dev/shm/py_shared_memory_*",
"/dev/shm/*multiprocessing*",
"/dev/shm/vader_segment*", # Open MPI shared memory
- "/dev/shm/sem.*", # POSIX semaphores
+ "/dev/shm/sem.*", # POSIX semaphores
]
total_files = 0
total_bytes_freed = 0
99ded8c to
6e91c4e
Compare
There was a problem hiding this comment.
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 14:38:26.671953+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-12-03 14:38:59.549613+00:00
@@ -148,11 +148,11 @@
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
- filtered_includes.append(item)
+ filtered_includes.append(item)
distributed_includes.append(create_distributed_config(item))
else:
print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
# Debug: Show summary
6e91c4e to
3e42d12
Compare
…ting TRT-LLM installation fallback cases
3e42d12 to
091c2e4
Compare
| else: | ||
| logger.setLevel(level) | ||
|
|
||
| if has_torchscript_frontend(): |
There was a problem hiding this comment.
If we have the frontend we necessarily have the runtime; I don't think we need to use these APIs.
| _LOGGER.setLevel(logging.CRITICAL) | ||
|
|
||
| if ENABLED_FEATURES.torchscript_frontend: | ||
| if has_torchscript_frontend(): |
There was a problem hiding this comment.
lets just remove the has_torchscript_frontend cases
narendasan
left a comment
There was a problem hiding this comment.
Just remove the TS ones since we should be able to handle both with the runtime and then LGTM
This PR