mcrl/tt-lock

🔒 TT-Lock

A device locking system for TT hardware that prevents concurrent access to devices by multiple processes.

Overview

tt-lock ensures exclusive access to TT devices (0-3) by managing file-based locks in /tmp. It supports both Python and C++ applications and provides command-line utilities for monitoring and managing locks.

Features

  • ✅ File-based locking: Uses /tmp/tt_lock_[device]_[PID].lock format
  • ✅ Device validation: Ensures TT_VISIBLE_DEVICES is set and valid
  • ✅ Lock monitoring: View which processes hold locks and for how long
  • ✅ Automatic cleanup: Locks released on program exit or signal (SIGTERM/SIGINT)
  • ✅ Garbage collection: Automatic cleanup of stale locks from dead processes
  • ✅ Automatic reset: Optional device reset before program execution with all devices locked
  • ✅ Python & C++ support: Works with both languages
  • ✅ Race condition handling: Robust lock acquisition with retry logic
  • ✅ Signal handling: Proper cleanup on Ctrl+C or normal termination
  • ✅ System service: Can run garbage collector as systemd daemon

Installation

Python Package

cd tt-lock
pip install .

This installs:

  • Python module, run as python -m tt_lock
  • Command-line tool tt-lock
  • Garbage collector tt-lock-gc

Optional: Install Garbage Collector as System Service

# Copy and enable systemd service
sudo cp tt-lock-gc.service /etc/systemd/system/
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc

C++ Library

cd tt-lock
make

This creates:

  • libtt_lock.so - Shared library
  • test_program - Example executable

Usage

Python

Wrap your Python program with lock management:

# Set which devices to lock (comma-separated: 0,1,2,3)
export TT_VISIBLE_DEVICES=0,2

# Run your program through tt-lock
python -m tt_lock python your_script.py [args...]

Example:

TT_VISIBLE_DEVICES=0,1 python -m tt_lock python train.py --epochs 100

C++

  1. Declare the library functions in your code:
#include <iostream>

// Declare tt_lock functions
extern "C" void tt_lock_init();
extern "C" void tt_lock_cleanup();

int main() {
    // Initialize and acquire locks
    tt_lock_init();
    
    // Your program logic here
    std::cout << "Running with device locks..." << std::endl;
    
    // Clean up and release locks
    tt_lock_cleanup();
    return 0;
}
  2. Compile and link:
g++ -std=c++17 your_program.cpp -L. -ltt_lock -o your_program
  3. Run:
export TT_VISIBLE_DEVICES=1,3
LD_LIBRARY_PATH=. ./your_program

Environment Variables

TT_VISIBLE_DEVICES (Required)

Comma-separated list of device IDs to lock. Must be a subset of {0, 1, 2, 3}.

export TT_VISIBLE_DEVICES=0,2    # Lock devices 0 and 2
export TT_VISIBLE_DEVICES=1      # Lock only device 1
export TT_VISIBLE_DEVICES=0,1,2,3  # Lock all devices

❌ Error if not set:

Error: TT_VISIBLE_DEVICES environment variable is NOT set.
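
The validation described above can be sketched in Python. This is a minimal illustration, not the package's actual code: the function name parse_visible_devices and the exact messages for invalid IDs are assumptions, and only the "NOT set" message is taken from this README.

```python
VALID_DEVICES = {0, 1, 2, 3}  # device IDs named in this README


def parse_visible_devices(value):
    """Parse a TT_VISIBLE_DEVICES-style string into a set of device IDs.

    Hypothetical helper: raises ValueError for unset, malformed, or
    out-of-range values.
    """
    if value is None or not value.strip():
        raise ValueError("TT_VISIBLE_DEVICES environment variable is NOT set.")
    devices = set()
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit():
            raise ValueError(f"Invalid device ID: {token!r}")
        devices.add(int(token))
    out_of_range = devices - VALID_DEVICES
    if out_of_range:
        raise ValueError(f"Device IDs out of range: {sorted(out_of_range)}")
    return devices
```

A caller would typically pass os.environ.get("TT_VISIBLE_DEVICES") and report the ValueError text before exiting.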

TT_RESET (Optional)

If set to 1, waits for all devices to be free, runs tt-smi -r, then starts the program.

export TT_RESET=1
TT_VISIBLE_DEVICES=0,1 python -m tt_lock python train.py

Behavior:

  1. Acquires locks on all devices (0, 1, 2, 3)
  2. Executes tt-smi -r to reset devices while holding all locks
  3. Releases reset locks
  4. Acquires locks for requested devices (from TT_VISIBLE_DEVICES)
  5. Runs the program
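
The five steps above can be read as the following Python sketch. It is only an outline of the ordering: acquire and release are hypothetical callables standing in for the package's real locking functions, and reset_then_run is an illustrative name. The tt-smi -r command is the one named in this README.

```python
import subprocess

ALL_DEVICES = (0, 1, 2, 3)


def reset_then_run(requested, program, acquire, release, run=subprocess.run):
    """Follow the TT_RESET ordering and return the program's exit code.

    acquire(device_id) returns a lock handle and release(handle) frees it;
    both are placeholders (assumptions), not the package's API.
    """
    reset_locks = [acquire(d) for d in ALL_DEVICES]       # step 1: lock everything
    try:
        run(["tt-smi", "-r"], check=True)                 # step 2: reset while locked
    finally:
        for lock in reset_locks:                          # step 3: release reset locks
            release(lock)
    held = [acquire(d) for d in sorted(requested)]        # step 4: lock requested devices
    try:
        return run(list(program)).returncode              # step 5: run the program
    finally:
        for lock in held:
            release(lock)
```

Passing run as a parameter keeps the sketch testable without tt-smi installed.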

Command-Line Tools

tt-lock-gc (Garbage Collector)

Automatic cleanup of stale lock files from dead processes. Can run as:

  • System daemon (recommended for production)
  • Foreground process (for testing/debugging)
  • One-time cleanup (manual cleanup)

Usage

# Run once and exit (manual cleanup)
tt-lock-gc --once

# Run in foreground with logging (testing)
tt-lock-gc --foreground

# Run with custom interval (default: 30 seconds)
tt-lock-gc --interval 60 --foreground

# Run as background daemon (production)
tt-lock-gc  # Silent background mode

Example Output

$ tt-lock-gc --once
🔒 TT-Lock Garbage Collector - Single Run
  🧹 Cleaning stale lock: tt_lock_0_12345.lock (PID 12345 not found)
  🧹 Cleaning stale lock: tt_lock_2_12345.lock (PID 12345 not found)

Results:
  Total locks found: 4
  Active locks: 2
  Stale locks: 2
  Cleaned: 2
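
A stale-lock check of the kind shown in this output can be sketched as follows. This is an assumption about the approach, not the collector's actual code: it extracts the PID from the lock file name and probes it with signal 0. The names pid_alive and find_stale_locks are illustrative.

```python
import os
import re
from pathlib import Path

# Matches the tt_lock_[device]_[PID].lock naming convention from this README.
LOCK_NAME = re.compile(r"^tt_lock_(\d+)_(\d+)\.lock$")


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_locks(lock_dir="/tmp"):
    """List lock files whose owning PID no longer exists."""
    stale = []
    for path in sorted(Path(lock_dir).glob("tt_lock_*.lock")):
        match = LOCK_NAME.match(path.name)
        if match and not pid_alive(int(match.group(2))):
            stale.append(path)
    return stale
```

A PID can be reused by an unrelated process, so a check like this may occasionally treat a stale lock as active.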

System Service Installation (Linux)

Install as a systemd service for automatic startup:

# Copy service file
sudo cp tt-lock-gc.service /etc/systemd/system/

# Install the Python script
sudo pip install -e .
# OR manually copy:
# sudo cp tt_lock_gc.py /usr/local/bin/

# Enable and start the service
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc

# Check status
sudo systemctl status tt-lock-gc

# View logs
sudo journalctl -u tt-lock-gc -f

Service Configuration:

  • Runs as: root (required to clean locks from all users)
  • Check interval: 30 seconds (configurable)
  • Auto-restart on failure
  • Limited permissions (ProtectSystem, ProtectHome)

tt-lock show

Display current lock status for all devices:

$ tt-lock show
## 🔒 Device Lock Status
--------------------------------------------------
Device  PID     User        Held For    
--------------------------------------------------
0       FREE    -           -           
1       12345   jongwon     0:02:34     
2       FREE    -           -           
3       12345   jongwon     0:02:34     

🔒 2 device(s) locked
--------------------------------------------------

Output columns:

  • Device: Device ID (0-3)
  • PID: Process ID holding the lock (or "FREE")
  • User: Username of lock holder
  • Held For: Duration lock has been held (H:MM:SS)

tt-lock reset

Acquire locks on all devices, then reset them:

$ tt-lock reset
## 🔄 Device Reset Utility
Acquiring locks on devices 0, 1, 2, 3 for reset...
Waiting for lock on device 1...
Waiting for lock on device 3...
✓ Acquired lock on device 0
✓ Acquired lock on device 1
✓ Acquired lock on device 2
✓ Acquired lock on device 3

✓ All devices locked. Executing reset command...
Executing: tt-smi -r
✓ Device reset successful.

Releasing locks...
Released lock: tt_lock_0_12345.lock
Released lock: tt_lock_1_12345.lock
Released lock: tt_lock_2_12345.lock
Released lock: tt_lock_3_12345.lock

Use cases:

  • Safely reset all devices (prevents conflicts during reset)
  • Clear stuck devices between runs
  • Ensure clean state before critical operations

Lock File Format

Lock files are created in /tmp with the naming convention:

tt_lock_[device]_[PID].lock

Examples:

  • tt_lock_0_12345.lock - Device 0 locked by process 12345
  • tt_lock_2_98765.lock - Device 2 locked by process 98765

File contents:

[PID],[timestamp],[username]

Example: 12345,1700400000.123,jongwon
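
As an illustration, a record in this format can be parsed and turned into a "Held For" duration like the one tt-lock show prints. The sketch assumes the timestamp is Unix epoch seconds, which the example value suggests but this README does not state. The function names are hypothetical.

```python
import time
from datetime import timedelta


def parse_lock_record(text):
    """Split a '[PID],[timestamp],[username]' record into typed fields."""
    pid, timestamp, username = text.strip().split(",", 2)
    return int(pid), float(timestamp), username


def held_for(timestamp, now=None):
    """Format the elapsed time since `timestamp` as H:MM:SS.

    Assumes `timestamp` is seconds since the Unix epoch.
    """
    now = time.time() if now is None else now
    return str(timedelta(seconds=int(now - timestamp)))


pid, ts, user = parse_lock_record("12345,1700400000.123,jongwon")
print(pid, user, held_for(ts, now=ts + 154))  # prints: 12345 jongwon 0:02:34
```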


How It Works

Lock Acquisition

  1. Validate TT_VISIBLE_DEVICES environment variable
  2. Check for existing lock files: /tmp/tt_lock_[device]_*.lock
  3. Wait if locks exist (check every 5 seconds)
  4. Create lock file with PID, timestamp, and username
  5. Acquire exclusive file lock using fcntl.lockf() (Python) or flock() (C++)
  6. Keep file handle open to maintain lock
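
Steps 2 through 6 can be sketched in Python roughly as below. This is a simplified illustration under stated assumptions, not the package's implementation: the name acquire_lock_sketch and the lock_dir, poll_seconds, and username parameters are hypothetical, and the retry logic for races is omitted.

```python
import fcntl
import getpass
import glob
import os
import time


def acquire_lock_sketch(device_id, lock_dir="/tmp", poll_seconds=5, username=None):
    """Wait for the device to be free, then create and lock a lock file.

    Returns (path, handle); the caller must keep `handle` open to hold the lock.
    """
    pattern = os.path.join(lock_dir, f"tt_lock_{device_id}_*.lock")
    while glob.glob(pattern):                 # steps 2-3: wait while any lock exists
        time.sleep(poll_seconds)

    if username is None:
        username = getpass.getuser()
    path = os.path.join(lock_dir, f"tt_lock_{device_id}_{os.getpid()}.lock")
    handle = open(path, "w")                  # step 4: create the lock file
    handle.write(f"{os.getpid()},{time.time()},{username}\n")
    handle.flush()

    # Step 5: exclusive, non-blocking lock; raises OSError if already held.
    fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return path, handle                       # step 6: keep the handle open
```

Note that fcntl.lockf uses POSIX record locks, which the kernel drops when the process closes any descriptor for that file, so the returned handle should be the only one the process opens on it.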

Lock Release

  • Automatic when program exits normally (atexit handler in Python, destructor in C++)
  • Automatic on signal interruption (SIGTERM, SIGINT) via signal handlers
  • Manual via tt_lock_cleanup() in C++
  • ❌ SIGKILL (kill -9): Cannot be trapped, locks may remain (clean manually)

Race Condition Handling

If two processes try to acquire the same lock simultaneously:

  1. Both check for existing lock files (both see none)
  2. Both try to create lock file
  3. One succeeds in acquiring exclusive lock
  4. Other detects lock is held and retries

Examples

Example 1: Training Script (Python)

#!/bin/bash
export TT_VISIBLE_DEVICES=0,1
export TT_RESET=1

python -m tt_lock python train.py \
    --model resnet50 \
    --batch-size 32 \
    --epochs 100

Example 2: Inference Server (C++)

#include <iostream>
#include <unistd.h>

extern "C" void tt_lock_init();
extern "C" void tt_lock_cleanup();

int main() {
    tt_lock_init();
    
    std::cout << "Starting inference server..." << std::endl;
    // Run server logic
    sleep(3600);
    
    tt_lock_cleanup();
    return 0;
}
# Compile
g++ -std=c++17 server.cpp -L. -ltt_lock -o server

# Run on devices 2,3
TT_VISIBLE_DEVICES=2,3 LD_LIBRARY_PATH=. ./server

Example 3: Monitoring Locks

Terminal 1:

$ TT_VISIBLE_DEVICES=0,1 python -m tt_lock python long_running.py
Attempting to acquire locks for devices: {0, 1}
Successfully acquired lock on device 0.
Successfully acquired lock on device 1.
Running program...

Terminal 2:

$ tt-lock show
## 🔒 Device Lock Status
--------------------------------------------------
Device  PID     User        Held For    
--------------------------------------------------
0       45678   jongwon     0:01:23     
1       45678   jongwon     0:01:23     
2       FREE    -           -           
3       FREE    -           -           

🔒 2 device(s) locked
--------------------------------------------------

Troubleshooting

Lock Files Not Cleaning Up

Locks are automatically cleaned up on:

  • ✅ Normal program exit
  • ✅ SIGTERM (the default signal sent by kill)
  • ✅ SIGINT (Ctrl+C)
  • ✅ Garbage collector (cleans stale locks from dead processes)

Locks will NOT be cleaned up immediately on:

  • ❌ SIGKILL (kill -9 or pkill -9) - but garbage collector will clean these up

Recommended Solutions:

Option 1: Garbage Collector (Recommended)

# Run garbage collector once to clean up stale locks
tt-lock-gc --once

# Or, if the systemd service is already installed (see Option 3), start it for automatic cleanup
sudo systemctl start tt-lock-gc

Option 2: Manual Cleanup

# Manual cleanup (use with caution!)
rm /tmp/tt_lock_*.lock

# Or use tt-lock reset (waits for active locks first)
tt-lock reset

Option 3: Install GC as System Service

For production environments, install the garbage collector as a systemd service:

sudo cp tt-lock-gc.service /etc/systemd/system/
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc

This ensures stale locks are automatically cleaned every 30 seconds.

Permission Denied

Ensure /tmp is writable:

ls -ld /tmp
# Should show: drwxrwxrwt

"TT_VISIBLE_DEVICES is NOT set"

Always set the environment variable before running:

export TT_VISIBLE_DEVICES=0,1
python -m tt_lock python your_script.py

C++ Library Not Found

Add library path to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/path/to/tt-lock:$LD_LIBRARY_PATH
./your_program

Building from Source

# Clone repository
git clone <repository-url>
cd tt-lock

# Install Python package
pip install -e .

# Build C++ library
make

# Run tests
TT_VISIBLE_DEVICES=0,1 LD_LIBRARY_PATH=. ./test_program

API Reference

Python

# In __init__.py (internal use)
check_tt_visible_devices() -> set[int]
acquire_lock(device_id: int) -> tuple[Path, object]
release_locks(acquired_locks: list)
perform_reset(devices: set[int])

C++

extern "C" void tt_lock_init();      // Initialize and acquire locks
extern "C" void tt_lock_cleanup();   // Release locks

License

See LICENSE file for details.


Support

For issues or questions, please contact the TT infrastructure team or file an issue in the repository.
