A device locking system for TT hardware that prevents concurrent access to devices by multiple processes.
tt-lock ensures exclusive access to TT devices (0-3) by managing file-based locks in /tmp. It supports both Python and C++ applications and provides command-line utilities for monitoring and managing locks.
- File-based locking: Uses the /tmp/tt_lock_[device]_[PID].lock format
- Device validation: Ensures TT_VISIBLE_DEVICES is set and valid
- Lock monitoring: View which processes hold locks and for how long
- Automatic cleanup: Locks released on program exit or signal (SIGTERM/SIGINT)
- Garbage collection: Automatic cleanup of stale locks from dead processes
- Automatic reset: Optional device reset before program execution with all devices locked
- Python & C++ support: Works with both languages
- Race condition handling: Robust lock acquisition with retry logic
- Signal handling: Proper cleanup on Ctrl+C or normal termination
- System service: Can run garbage collector as systemd daemon
cd tt-lock
pip install .

This installs:
- Python module for python -m tt-lock
- Command-line tool tt-lock
- Garbage collector tt-lock-gc
Optional: Install Garbage Collector as System Service
# Copy and enable systemd service
sudo cp tt-lock-gc.service /etc/systemd/system/
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc

To build the C++ library:
cd tt-lock
make

This creates:
- libtt_lock.so - Shared library
- test_program - Example executable
Wrap your Python program with lock management:
# Set which devices to lock (comma-separated: 0,1,2,3)
export TT_VISIBLE_DEVICES=0,2
# Run your program through tt-lock
python -m tt_lock python your_script.py [args...]

Example:
TT_VISIBLE_DEVICES=0,1 python -m tt_lock python train.py --epochs 100

- Include the library in your code:
#include <iostream>
// Declare tt_lock functions
extern "C" void tt_lock_init();
extern "C" void tt_lock_cleanup();
int main() {
// Initialize and acquire locks
tt_lock_init();
// Your program logic here
std::cout << "Running with device locks..." << std::endl;
// Clean up and release locks
tt_lock_cleanup();
return 0;
}

- Compile and link:
g++ -std=c++17 your_program.cpp -L. -ltt_lock -o your_program

- Run:
export TT_VISIBLE_DEVICES=1,3
LD_LIBRARY_PATH=. ./your_program

TT_VISIBLE_DEVICES: Comma-separated list of device IDs to lock. Must be a subset of {0, 1, 2, 3}.
export TT_VISIBLE_DEVICES=0,2 # Lock devices 0 and 2
export TT_VISIBLE_DEVICES=1 # Lock only device 1
export TT_VISIBLE_DEVICES=0,1,2,3 # Lock all devices

Error if not set:
Error: TT_VISIBLE_DEVICES environment variable is NOT set.
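
The validation rule above (a comma-separated subset of {0, 1, 2, 3}, with an error when the variable is unset) can be sketched in a few lines of Python. This is only an illustration: the function name parse_visible_devices and every error message other than the "NOT set" one are assumptions, not the package's actual code.

```python
VALID_DEVICES = {0, 1, 2, 3}  # device IDs named in this README


def parse_visible_devices(value):
    """Parse a TT_VISIBLE_DEVICES string such as "0,2" into a set of ints.

    Hypothetical helper: tt-lock's own validation may differ in detail.
    """
    if value is None or not value.strip():
        raise ValueError("TT_VISIBLE_DEVICES environment variable is NOT set.")
    devices = set()
    for token in value.split(","):
        try:
            devices.add(int(token.strip()))
        except ValueError:
            raise ValueError(f"Invalid device ID: {token!r}") from None
    out_of_range = devices - VALID_DEVICES
    if out_of_range:
        raise ValueError(f"Device IDs out of range: {sorted(out_of_range)}")
    return devices
```

A caller would typically pass os.environ.get("TT_VISIBLE_DEVICES") and report the ValueError text before exiting.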
TT_RESET: If set to 1, tt-lock waits for all devices to be free, runs tt-smi -r, then starts the program.
export TT_RESET=1
TT_VISIBLE_DEVICES=0,1 python -m tt-lock python train.py

Behavior:
- Acquires locks on all devices (0, 1, 2, 3)
- Executes tt-smi -r to reset devices while holding all locks
- Releases reset locks
- Acquires locks for requested devices (from TT_VISIBLE_DEVICES)
- Runs the program
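
The ordering of these steps is what matters, so here is a minimal sketch of it. All five parameters (requested, acquire, release, reset, run) are hypothetical stand-ins supplied by the caller; the sketch does not call tt-smi itself and is not the package's implementation.

```python
ALL_DEVICES = {0, 1, 2, 3}


def run_with_reset(requested, acquire, release, reset, run):
    """Follow the TT_RESET=1 ordering described above.

    acquire(device) returns a lock token, release(tokens) drops them,
    reset() would invoke something like `tt-smi -r`, and run() starts
    the wrapped program. All four are caller-supplied assumptions.
    """
    # Lock every device so nothing else is using one during the reset.
    reset_locks = [acquire(device) for device in sorted(ALL_DEVICES)]
    try:
        reset()
    finally:
        release(reset_locks)
    # Re-acquire only the devices the program asked for, then run it.
    program_locks = [acquire(device) for device in sorted(requested)]
    try:
        return run()
    finally:
        release(program_locks)  # the final release mirrors cleanup on exit
```

Passing the operations in as callables keeps the sequence easy to exercise without real hardware.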
The garbage collector (tt-lock-gc) automatically cleans up stale lock files left by dead processes. It can run as:
- System daemon (recommended for production)
- Foreground process (for testing/debugging)
- One-time cleanup (manual cleanup)
# Run once and exit (manual cleanup)
tt-lock-gc --once
# Run in foreground with logging (testing)
tt-lock-gc --foreground
# Run with custom interval (default: 30 seconds)
tt-lock-gc --interval 60 --foreground
# Run as background daemon (production)
tt-lock-gc # Silent background mode

$ tt-lock-gc --once
TT-Lock Garbage Collector - Single Run
🧹 Cleaning stale lock: tt_lock_0_12345.lock (PID 12345 not found)
🧹 Cleaning stale lock: tt_lock_2_12345.lock (PID 12345 not found)
Results:
Total locks found: 4
Active locks: 2
Stale locks: 2
Cleaned: 2

Install as a systemd service for automatic startup:
# Copy service file
sudo cp tt-lock-gc.service /etc/systemd/system/
# Install the Python script
sudo pip install -e .
# OR manually copy:
# sudo cp tt_lock_gc.py /usr/local/bin/
# Enable and start the service
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc
# Check status
sudo systemctl status tt-lock-gc
# View logs
sudo journalctl -u tt-lock-gc -f

Service Configuration:
- Runs as: root (required to clean locks from all users)
- Check interval: 30 seconds (configurable)
- Auto-restart on failure
- Limited permissions (ProtectSystem, ProtectHome)
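
To make the idea of a stale lock concrete, the following sketch removes lock files whose PID no longer exists. It assumes liveness is probed with os.kill(pid, 0), a common POSIX idiom; the names pid_alive and collect_stale_locks are hypothetical, and the real tt-lock-gc may work differently.

```python
import os
import re
from pathlib import Path

# File name convention from this README: tt_lock_[device]_[PID].lock
LOCK_NAME = re.compile(r"^tt_lock_(\d+)_(\d+)\.lock$")


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def collect_stale_locks(lock_dir="/tmp"):
    """Delete lock files whose owning PID is gone; return the removed names."""
    removed = []
    for path in sorted(Path(lock_dir).glob("tt_lock_*.lock")):
        match = LOCK_NAME.match(path.name)
        if match and not pid_alive(int(match.group(2))):
            path.unlink(missing_ok=True)
            removed.append(path.name)
    return removed
```

A daemon mode would amount to calling collect_stale_locks() in a loop with a sleep of the configured interval between passes.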
Display current lock status for all devices:
$ tt-lock show
## Device Lock Status
--------------------------------------------------
Device  PID    User     Held For
--------------------------------------------------
0       FREE   -        -
1       12345  jongwon  0:02:34
2       FREE   -        -
3       12345  jongwon  0:02:34
2 device(s) locked
--------------------------------------------------

Output columns:
- Device: Device ID (0-3)
- PID: Process ID holding the lock (or "FREE")
- User: Username of lock holder
- Held For: Duration lock has been held (HH:MM:SS)
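
As an illustration of the Held For column, a duration like 0:02:34 can be derived from the timestamp stored in a lock file. The helper name held_for is hypothetical, and formatting through datetime.timedelta is an assumption that happens to match the sample output.

```python
import time
from datetime import timedelta


def held_for(lock_timestamp, now=None):
    """Format the time elapsed since a lock's timestamp as H:MM:SS."""
    now = time.time() if now is None else now
    elapsed = max(0, int(now - lock_timestamp))  # whole seconds, never negative
    return str(timedelta(seconds=elapsed))
```

Note that str(timedelta) prefixes a day count once a lock has been held for 24 hours or more.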
Acquire locks on all devices, then reset them:
$ tt-lock reset
## Device Reset Utility
Acquiring locks on devices 0, 1, 2, 3 for reset...
Waiting for lock on device 1...
Waiting for lock on device 3...
Acquired lock on device 0
Acquired lock on device 1
Acquired lock on device 2
Acquired lock on device 3
All devices locked. Executing reset command...
Executing: tt-smi -r
Device reset successful.
Releasing locks...
Released lock: tt_lock_0_12345.lock
Released lock: tt_lock_1_12345.lock
Released lock: tt_lock_2_12345.lock
Released lock: tt_lock_3_12345.lock

Use cases:
- Safely reset all devices (prevents conflicts during reset)
- Clear stuck devices between runs
- Ensure clean state before critical operations
Lock files are created in /tmp with the naming convention:
tt_lock_[device]_[PID].lock
Examples:
- tt_lock_0_12345.lock - Device 0 locked by process 12345
- tt_lock_2_98765.lock - Device 2 locked by process 98765
File contents:
[PID],[timestamp],[username]
Example: 12345,1700400000.123,jongwon
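
The naming convention and file contents above are enough to recover who holds a device. A minimal parsing sketch follows; LockInfo and parse_lock are hypothetical names introduced here, not part of the tt_lock module.

```python
import re
from typing import NamedTuple

# File name convention from this README: tt_lock_[device]_[PID].lock
LOCK_NAME = re.compile(r"^tt_lock_(\d+)_(\d+)\.lock$")


class LockInfo(NamedTuple):
    device: int
    pid: int
    timestamp: float
    username: str


def parse_lock(filename, contents):
    """Combine a lock file's name with its "[PID],[timestamp],[username]" body."""
    match = LOCK_NAME.match(filename)
    if match is None:
        raise ValueError(f"Not a tt-lock file name: {filename!r}")
    # Split at most twice so the last field is taken whole as the username.
    pid_text, timestamp_text, username = contents.strip().split(",", 2)
    return LockInfo(int(match.group(1)), int(pid_text), float(timestamp_text), username)
```

For example, parse_lock("tt_lock_0_12345.lock", "12345,1700400000.123,jongwon") yields device 0, PID 12345, and user jongwon.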
Lock acquisition:
- Validate the TT_VISIBLE_DEVICES environment variable
- Check for existing lock files: /tmp/tt_lock_[device]_*.lock
- Wait if locks exist (check every 5 seconds)
- Create lock file with PID, timestamp, and username
- Acquire exclusive file lock using fcntl.lockf() (Python) or flock() (C++)
- Keep file handle open to maintain lock
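
The acquisition steps can be sketched in Python as follows. This is an approximation under stated assumptions, not the package's acquire_lock(): the names try_acquire and acquire, the lock_dir and poll_seconds parameters, and reading the username from the USER environment variable are all choices made for this example.

```python
import fcntl
import os
import time
from pathlib import Path


def try_acquire(device_id, lock_dir="/tmp"):
    """Return (path, handle) on success, or None if the device looks busy."""
    lock_dir = Path(lock_dir)
    # Any tt_lock_[device]_*.lock file means some process claims the device.
    if any(lock_dir.glob(f"tt_lock_{device_id}_*.lock")):
        return None
    path = lock_dir / f"tt_lock_{device_id}_{os.getpid()}.lock"
    handle = open(path, "w")
    # Record PID, timestamp, and username (USER lookup is a simplification).
    user = os.environ.get("USER", "unknown")
    handle.write(f"{os.getpid()},{time.time()},{user}")
    handle.flush()
    try:
        # Non-blocking exclusive lock; raises OSError if it cannot be taken.
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        path.unlink(missing_ok=True)
        return None
    return path, handle  # the caller keeps the handle open to hold the lock


def acquire(device_id, lock_dir="/tmp", poll_seconds=5):
    """Retry until the lock is obtained, re-checking every poll_seconds."""
    while (lock := try_acquire(device_id, lock_dir)) is None:
        time.sleep(poll_seconds)
    return lock
```

Returning the open handle alongside the path lets the caller keep the lock alive for the lifetime of the program and close it explicitly on release.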
Lock release:
- Automatic when program exits normally (atexit handler in Python, destructor in C++)
- Automatic on signal interruption (SIGTERM, SIGINT) via signal handlers
- Manual via tt_lock_cleanup() in C++
- SIGKILL (kill -9): Cannot be trapped, so locks may remain (clean manually or let the garbage collector remove them)
If two processes try to acquire the same lock simultaneously:
- Both check for existing lock files (both see none)
- Both try to create lock file
- One succeeds in acquiring exclusive lock
- Other detects lock is held and retries
#!/bin/bash
export TT_VISIBLE_DEVICES=0,1
export TT_RESET=1
python -m tt-lock python train.py \
--model resnet50 \
--batch-size 32 \
--epochs 100

#include <iostream>
#include <unistd.h>
extern "C" void tt_lock_init();
extern "C" void tt_lock_cleanup();
int main() {
tt_lock_init();
std::cout << "Starting inference server..." << std::endl;
// Run server logic
sleep(3600);
tt_lock_cleanup();
return 0;
}

# Compile
g++ -std=c++17 server.cpp -L. -ltt_lock -o server
# Run on devices 2,3
TT_VISIBLE_DEVICES=2,3 LD_LIBRARY_PATH=. ./server

Terminal 1:
$ TT_VISIBLE_DEVICES=0,1 python -m tt-lock python long_running.py
Attempting to acquire locks for devices: {0, 1}
Successfully acquired lock on device 0.
Successfully acquired lock on device 1.
Running program...

Terminal 2:
$ tt-lock show
## Device Lock Status
--------------------------------------------------
Device  PID    User     Held For
--------------------------------------------------
0       45678  jongwon  0:01:23
1       45678  jongwon  0:01:23
2       FREE   -        -
3       FREE   -        -
2 device(s) locked
--------------------------------------------------

Locks are automatically cleaned up on:
- Normal program exit
- SIGTERM (default kill signal)
- SIGINT (Ctrl+C)
- Garbage collector (cleans stale locks from dead processes)
Locks will NOT be cleaned up immediately on:
- SIGKILL (kill -9 or pkill -9), but the garbage collector will clean these up
Option 1: Garbage Collector (Recommended)
# Run garbage collector once to clean up stale locks
tt-lock-gc --once
# Or install as system service for automatic cleanup
sudo systemctl start tt-lock-gc

Option 2: Manual Cleanup
# Manual cleanup (use with caution!)
rm /tmp/tt_lock_*.lock
# Or use tt-lock reset (waits for active locks first)
tt-lock reset

Option 3: Install GC as System Service
For production environments, install the garbage collector as a systemd service:
sudo cp tt-lock-gc.service /etc/systemd/system/
sudo systemctl enable tt-lock-gc
sudo systemctl start tt-lock-gc

This ensures stale locks are automatically cleaned every 30 seconds.
Ensure /tmp is writable:
ls -ld /tmp
# Should show: drwxrwxrwt

Always set the environment variable before running:
export TT_VISIBLE_DEVICES=0,1
python -m tt-lock python your_script.py

Add the library path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/path/to/tt-lock:$LD_LIBRARY_PATH
./your_program

# Clone repository
git clone <repository-url>
cd tt-lock
# Install Python package
pip install -e .
# Build C++ library
make
# Run tests
TT_VISIBLE_DEVICES=0,1 LD_LIBRARY_PATH=. ./test_program

# In __init__.py (internal use)
check_tt_visible_devices() -> set[int]
acquire_lock(device_id: int) -> tuple[Path, object]
release_locks(acquired_locks: list)
perform_reset(devices: set[int])

extern "C" void tt_lock_init(); // Initialize and acquire locks
extern "C" void tt_lock_cleanup(); // Release locks

See LICENSE file for details.
For issues or questions, please contact the TT infrastructure team or file an issue in the repository.