Commit 4d73595

jeremymanning and claude committed
Add --resume flag support to remote_train.sh
- Added --resume/-r flag to remote_train.sh for continuing interrupted training
- Script now passes resume mode through SSH to remote server
- Updated README with resume documentation for remote training
- Supports combining --kill and --resume flags for restart scenarios

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 0d38ef8 commit 4d73595

2 files changed: +38 -13 lines changed

README.md

Lines changed: 7 additions & 1 deletion
@@ -238,15 +238,21 @@ Once Git credentials are configured on your server, run `remote_train.sh` **from
 # From your local machine, start training on the remote GPU server
 ./remote_train.sh
 
+# Resume training from existing checkpoints
+./remote_train.sh --resume # or -r
+
 # Kill existing training sessions and optionally start new one
 ./remote_train.sh --kill # or -k
 
+# Kill and resume (restart interrupted training)
+./remote_train.sh --kill --resume
+
 # You'll be prompted for:
 # - Server address (hostname or IP)
 # - Username
 ```
 
-**What this script does:** The `remote_train.sh` script connects to your GPU server via SSH and executes `run_llm_stylometry.sh --train -y` in a `screen` session. This allows you to disconnect your local machine while the GPU server continues training.
+**What this script does:** The `remote_train.sh` script connects to your GPU server via SSH and executes `run_llm_stylometry.sh --train -y` (or `--train --resume -y` if resuming) in a `screen` session. This allows you to disconnect your local machine while the GPU server continues training.
 
 The script will:
 1. SSH into your GPU server

remote_train.sh

Lines changed: 31 additions & 12 deletions
@@ -23,16 +23,26 @@ echo "=================================================="
 echo
 echo "Usage: $0 [options]"
 echo "Options:"
-echo " --kill, -k Kill existing training sessions before starting new one"
+echo " --kill, -k Kill existing training sessions before starting new one"
+echo " --resume, -r Resume training from existing checkpoints"
 echo
 
-# Check for --kill flag
-if [ "$1" = "--kill" ] || [ "$1" = "-k" ]; then
-    echo "Kill mode: Will terminate existing training sessions"
-    KILL_MODE=true
-else
-    KILL_MODE=false
-fi
+# Parse command line arguments
+KILL_MODE=false
+RESUME_MODE=false
+
+for arg in "$@"; do
+    case $arg in
+        --kill|-k)
+            echo "Kill mode: Will terminate existing training sessions"
+            KILL_MODE=true
+            ;;
+        --resume|-r)
+            echo "Resume mode: Will continue training from existing checkpoints"
+            RESUME_MODE=true
+            ;;
+    esac
+done
 
 # Get server details
 read -p "Enter GPU server address (hostname or IP): " SERVER_ADDRESS
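
Note on the argument loop in this hunk: as committed, the `case` statement has no default branch, so an unrecognized or mistyped option (for example `--resum`) is silently ignored and training starts from scratch. The following is a minimal sketch of the same loop with a catch-all branch. The function name `parse_flags` and the error message are hypothetical and not part of `remote_train.sh`.

```shell
#!/bin/bash
# Hypothetical helper (not in remote_train.sh): same flag loop as the commit,
# plus a default branch that rejects unknown options.
parse_flags() {
    KILL_MODE=false
    RESUME_MODE=false
    for arg in "$@"; do
        case "$arg" in
            --kill|-k)
                KILL_MODE=true
                ;;
            --resume|-r)
                RESUME_MODE=true
                ;;
            *)
                # Assumption: failing fast is preferable to ignoring a typo.
                echo "Unknown option: $arg" >&2
                return 1
                ;;
        esac
    done
}

# Both flags can be combined in any order, as the commit message describes.
parse_flags --kill -r
echo "KILL_MODE=$KILL_MODE RESUME_MODE=$RESUME_MODE"
```

Whether rejecting unknown options is desirable here is a design choice; the committed loop is more permissive.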
@@ -57,7 +67,7 @@ fi
 echo
 
 # Execute the remote script via SSH
-ssh -t "$USERNAME@$SERVER_ADDRESS" "KILL_MODE='$KILL_MODE' bash -s" << 'ENDSSH'
+ssh -t "$USERNAME@$SERVER_ADDRESS" "KILL_MODE='$KILL_MODE' RESUME_MODE='$RESUME_MODE' bash -s" << 'ENDSSH'
 #!/bin/bash
 set -e
 
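
Note on how the flags reach the server: the heredoc delimiter is quoted (`'ENDSSH'`), so nothing inside the remote script is expanded locally. The values travel only as environment assignments placed in front of `bash -s` in the remote command string. The sketch below reproduces that mechanism without a network connection by substituting a local `bash -s` for the `ssh` call. The function name `run_remote_stub` is hypothetical.

```shell
#!/bin/bash
# Values that the local argument parser would have set.
KILL_MODE=false
RESUME_MODE=true

# Hypothetical local stand-in for:
#   ssh -t "$USERNAME@$SERVER_ADDRESS" "KILL_MODE='...' RESUME_MODE='...' bash -s"
# The child shell receives the two variables through its environment only.
run_remote_stub() {
    KILL_MODE="$KILL_MODE" RESUME_MODE="$RESUME_MODE" bash -s << 'ENDSSH'
# Quoted delimiter: these expansions happen in the child shell, not the parent.
echo "kill=$KILL_MODE resume=$RESUME_MODE"
ENDSSH
}

run_remote_stub
```

With the values above, the child shell prints `kill=false resume=true`.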
@@ -138,8 +148,9 @@ sleep 5
 screen -X -S llm_training quit 2>/dev/null || true
 
 # Start training in screen (use --no-confirm flag for non-interactive mode)
-# Create a script file first
-cat > /tmp/llm_train.sh << 'TRAINSCRIPT'
+# Create a script file first with RESUME_MODE variable
+echo "RESUME_MODE='$RESUME_MODE'" > /tmp/llm_train.sh
+cat >> /tmp/llm_train.sh << 'TRAINSCRIPT'
 #!/bin/bash
 set -e # Exit on error
 
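
Note on the generated file: in this hunk the `RESUME_MODE=` assignment is written first and the heredoc is appended, so `#!/bin/bash` ends up on the second line of `/tmp/llm_train.sh`, where it is an ordinary comment and no longer a shebang. Whether that matters depends on how the file is launched, which this hunk does not show. The sketch below shows one way to keep the interpreter line first. The function name `write_train_script` and the temporary path are hypothetical, and the training command is replaced by an `echo` so the example runs anywhere.

```shell
#!/bin/bash
# Hypothetical helper: write the interpreter line first, then the variable,
# then the quoted heredoc body. $1 = output path, $2 = resume mode.
write_train_script() {
    {
        echo '#!/bin/bash'
        echo "RESUME_MODE='$2'"
        cat << 'TRAINSCRIPT'
set -e
# Stand-in for the real training call; prints instead of running it.
if [ "$RESUME_MODE" = "true" ]; then
    echo "would run: ./run_llm_stylometry.sh --train --resume -y"
else
    echo "would run: ./run_llm_stylometry.sh --train -y"
fi
TRAINSCRIPT
    } > "$1"
}

SCRIPT_PATH="$(mktemp)"   # assumption: stand-in for /tmp/llm_train.sh
write_train_script "$SCRIPT_PATH" true
head -n 1 "$SCRIPT_PATH"  # first line of the generated file
bash "$SCRIPT_PATH"
rm -f "$SCRIPT_PATH"
```

Grouping the three writes in one `{ ...; } > file` block also avoids a separate truncate-then-append sequence.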
@@ -163,7 +174,15 @@ chmod +x ./run_llm_stylometry.sh
 
 # Run the training script with non-interactive flag
 echo "Starting training with run_llm_stylometry.sh..." | tee -a $LOG_FILE
-./run_llm_stylometry.sh --train -y 2>&1 | tee -a $LOG_FILE
+
+# Check if we're in resume mode
+if [ "$RESUME_MODE" = "true" ]; then
+    echo "Running in resume mode - continuing from existing checkpoints" | tee -a $LOG_FILE
+    ./run_llm_stylometry.sh --train --resume -y 2>&1 | tee -a $LOG_FILE
+else
+    echo "Running full training from scratch" | tee -a $LOG_FILE
+    ./run_llm_stylometry.sh --train -y 2>&1 | tee -a $LOG_FILE
+fi
 
 echo "Training completed at $(date)" | tee -a $LOG_FILE
 TRAINSCRIPT
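
Note on the two branches in this hunk: they differ only in the `--resume` flag, so the command line could also be assembled once and then executed. A minimal sketch follows. The function name `print_train_command` is hypothetical, and it prints the command as a dry run instead of invoking `run_llm_stylometry.sh`.

```shell
#!/bin/bash
# Hypothetical helper: build the argument list from the resume mode.
# $1 = "true" or "false". Prints the resulting command (dry run).
print_train_command() {
    resume_mode="$1"
    set -- --train                 # start the argument list
    if [ "$resume_mode" = "true" ]; then
        set -- "$@" --resume       # append only when resuming
    fi
    set -- "$@" -y                 # non-interactive flag, always last
    echo "./run_llm_stylometry.sh $*"
}

print_train_command true
print_train_command false
```

The committed `if`/`else` has the advantage of logging a distinct message per mode; the sketch trades that for a single call site.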
