Commit 949b56f

Merge pull request #18 from jeremymanning/main
Implement analysis variants feature (Issues #13, #16, #17)
2 parents 28a1041 + a59b44d commit 949b56f

773 files changed: +520,041 additions, −171 deletions


README.md

Lines changed: 121 additions & 0 deletions
@@ -143,6 +143,127 @@ python generate_figures.py --list
**Note**: The t-test calculations (Figure 2) take approximately 2-3 minutes due to statistical computations across all epochs and authors.

## Analysis Variants

The project supports three linguistic analysis variants for probing which stylistic features models learn:

### Content-Only Variant

Masks function words with a `<FUNC>` token, preserving only content words (nouns, verbs, adjectives, etc.)

- **Tests:** Whether models distinguish authors based on vocabulary and word choice
- **Example transformation:**
  - Original: `The quick brown fox jumps over the lazy dog`
  - Transformed: `<FUNC> quick brown fox jumps <FUNC> <FUNC> lazy dog`
### Function-Only Variant

Masks content words with a `<CONTENT>` token, preserving only function words (articles, prepositions, conjunctions)

- **Tests:** Whether models distinguish authors based on grammatical structure
- **Example transformation:**
  - Original: `The quick brown fox jumps over the lazy dog`
  - Transformed: `The <CONTENT> <CONTENT> <CONTENT> <CONTENT> over the <CONTENT> <CONTENT>`
### Part-of-Speech (POS) Variant

Replaces every word with its POS tag from the Universal Dependencies tagset

- **Tests:** Whether models distinguish authors based on syntactic patterns
- **Example transformation:**
  - Original: `The quick brown fox jumps over the lazy dog`
  - Transformed: `DET ADJ ADJ NOUN VERB ADP DET ADJ NOUN`
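The three transformations above can be sketched in a few lines. This is a minimal illustration only: it uses a tiny hand-picked function-word list and a hard-coded POS lookup in place of a real tagger, and none of these names come from the repository's actual preprocessing code.

```python
# Illustrative sketch of the three variants; the function-word list and
# POS lookup are toy stand-ins for a real tagger (e.g., spaCy).
FUNCTION_WORDS = {"the", "a", "an", "over", "of", "and", "in", "on", "to"}

def content_only(tokens):
    # Replace function words with <FUNC>, keep content words
    return [t if t.lower() not in FUNCTION_WORDS else "<FUNC>" for t in tokens]

def function_only(tokens):
    # Replace content words with <CONTENT>, keep function words
    return [t if t.lower() in FUNCTION_WORDS else "<CONTENT>" for t in tokens]

POS_TAGS = {"the": "DET", "quick": "ADJ", "brown": "ADJ", "fox": "NOUN",
            "jumps": "VERB", "over": "ADP", "lazy": "ADJ", "dog": "NOUN"}

def pos_only(tokens):
    # Replace every word with its Universal Dependencies POS tag
    return [POS_TAGS[t.lower()] for t in tokens]

tokens = "The quick brown fox jumps over the lazy dog".split()
print(" ".join(content_only(tokens)))   # <FUNC> quick brown fox jumps <FUNC> <FUNC> lazy dog
print(" ".join(function_only(tokens)))  # The <CONTENT> <CONTENT> <CONTENT> <CONTENT> over the <CONTENT> <CONTENT>
print(" ".join(pos_only(tokens)))       # DET ADJ ADJ NOUN VERB ADP DET ADJ NOUN
```

Running the sketch on the example sentence reproduces the three transformed strings shown above.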
### Training Variants

```bash
# Train a single variant (8 authors × 10 seeds = 80 models per variant)
./run_llm_stylometry.sh --train --content-only
./run_llm_stylometry.sh --train --function-only
./run_llm_stylometry.sh --train --part-of-speech

# Short flags
./run_llm_stylometry.sh -t -co   # content-only
./run_llm_stylometry.sh -t -fo   # function-only
./run_llm_stylometry.sh -t -pos  # part-of-speech

# Train baseline (no variant flag)
./run_llm_stylometry.sh -t  # baseline (80 models)

# To train all conditions sequentially (baseline + 3 variants = 320 models total):
./run_llm_stylometry.sh -t                   # baseline
./run_llm_stylometry.sh -t --content-only    # content variant
./run_llm_stylometry.sh -t --function-only   # function variant
./run_llm_stylometry.sh -t --part-of-speech  # POS variant
```
### Generating Variant Figures

```bash
# Generate all figures for a single variant
./run_llm_stylometry.sh --content-only
./run_llm_stylometry.sh --function-only
./run_llm_stylometry.sh --part-of-speech

# Generate a specific figure for a variant
./run_llm_stylometry.sh -f 1a --content-only
./run_llm_stylometry.sh -f 1a --function-only

# Generate baseline figures (no variant flag)
./run_llm_stylometry.sh         # all baseline figures
./run_llm_stylometry.sh -f 1a   # specific baseline figure

# To generate all figures for all conditions:
./run_llm_stylometry.sh                   # baseline
./run_llm_stylometry.sh --content-only    # content variant
./run_llm_stylometry.sh --function-only   # function variant
./run_llm_stylometry.sh --part-of-speech  # POS variant
```
### Computing Variant Statistics

```bash
# Single-variant statistics
./run_stats.sh                   # baseline (default)
./run_stats.sh --content-only    # content variant
./run_stats.sh --function-only   # function variant
./run_stats.sh --part-of-speech  # POS variant

# All statistics (baseline + all 3 variants)
./run_stats.sh --all
```
### Remote Training with Variants

```bash
# Train a single variant on a GPU server
./remote_train.sh --content-only
./remote_train.sh --function-only
./remote_train.sh --part-of-speech

# Resume variant training
./remote_train.sh --resume --content-only

# Train baseline on the remote server (no variant flag)
./remote_train.sh

# To train all conditions on the remote server, run sequentially:
./remote_train.sh                   # baseline
./remote_train.sh --content-only    # content variant
./remote_train.sh --function-only   # function variant
./remote_train.sh --part-of-speech  # POS variant
```
### Model Naming Convention

Model directory names encode the variant:

- Baseline: `{author}_tokenizer=gpt2_seed={0-9}/`
- Content: `{author}_variant=content_tokenizer=gpt2_seed={0-9}/`
- Function: `{author}_variant=function_tokenizer=gpt2_seed={0-9}/`
- POS: `{author}_variant=pos_tokenizer=gpt2_seed={0-9}/`
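A name following this convention can be unpacked with a few lines of string handling. The sketch below is purely illustrative and is not the repository's parser (the real parsing lives in `code/consolidate_model_results.py`); `parse_model_dir` is a hypothetical helper.

```python
# Hypothetical parser for the naming convention above.
def parse_model_dir(name):
    parts = name.rstrip('/').split('_')
    info = {"author": parts[0], "variant": None}  # variant stays None for baseline
    for part in parts[1:]:
        key, _, value = part.partition('=')
        info[key] = int(value) if key == "seed" else value
    return info

print(parse_model_dir("twain_variant=pos_tokenizer=gpt2_seed=3"))
# {'author': 'twain', 'variant': 'pos', 'tokenizer': 'gpt2', 'seed': 3}
```

Baseline names simply lack the `variant=` field, so `variant` stays `None`.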
### Figure Output Paths

Figure filenames include a variant suffix:

- Baseline: `paper/figs/source/all_losses.pdf`
- Content: `paper/figs/source/all_losses_content.pdf`
- Function: `paper/figs/source/all_losses_function.pdf`
- POS: `paper/figs/source/all_losses_pos.pdf`
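The suffixing rule can be expressed in one line. This is a sketch of the assumed convention, not the repository's actual figure-saving code; `figure_path` is a hypothetical helper.

```python
# Sketch of the assumed rule: append "_<variant>" before the extension
# for variant figures, and nothing for baseline.
def figure_path(variant=None, base="paper/figs/source/all_losses"):
    suffix = f"_{variant}" if variant else ""
    return f"{base}{suffix}.pdf"

print(figure_path())       # paper/figs/source/all_losses.pdf
print(figure_path("pos"))  # paper/figs/source/all_losses_pos.pdf
```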
### Using Pre-computed Results

The repository includes pre-computed results from training 80 models (8 authors × 10 random seeds). These results are consolidated in `data/model_results.pkl`.

code/compute_stats.py

Lines changed: 48 additions & 7 deletions

```diff
@@ -10,10 +10,32 @@
 from pathlib import Path
 from constants import AUTHORS

-def load_data():
-    """Load the model results data."""
-    with open('data/model_results.pkl', 'rb') as f:
-        return pickle.load(f)
+def load_data(data_path='data/model_results.pkl', variant=None):
+    """
+    Load and filter model results by variant.
+
+    Args:
+        data_path: Path to consolidated results pickle file
+        variant: One of ['content', 'function', 'pos'] or None for baseline
+
+    Returns:
+        DataFrame filtered to specified variant
+    """
+    with open(data_path, 'rb') as f:
+        df = pickle.load(f)
+
+    # Filter by variant
+    if variant is None:
+        # Baseline: exclude any models with variant column set
+        if 'variant' in df.columns:
+            df = df[df['variant'].isna()].copy()
+    else:
+        # Specific variant: filter to that variant
+        if 'variant' not in df.columns:
+            raise ValueError(f"No variant column in data. Cannot filter for variant '{variant}'")
+        df = df[df['variant'] == variant].copy()
+
+    return df


 def find_twain_threshold_epoch(df, p_threshold=0.001):
@@ -138,13 +160,32 @@ def generate_author_comparison_table(df)

 def main():
     """Main function to compute and display all statistics."""
+    import argparse
+
+    parser = argparse.ArgumentParser(description='Compute statistics for LLM stylometry')
+    parser.add_argument(
+        '--variant',
+        choices=['content', 'function', 'pos'],
+        default=None,
+        help='Analysis variant to compute stats for (default: baseline)'
+    )
+    parser.add_argument(
+        '--data',
+        default='data/model_results.pkl',
+        help='Path to model results file (default: data/model_results.pkl)'
+    )
+
+    args = parser.parse_args()
+
+    # Update header to show variant
+    variant_label = f" (Variant: {args.variant})" if args.variant else " (Baseline)"
     print("=" * 60)
-    print("LLM Stylometry Statistical Analysis")
+    print(f"LLM Stylometry Statistical Analysis{variant_label}")
     print("=" * 60)

-    # Load data
+    # Load data with variant filter
     print("\nLoading data...")
-    df = load_data()
+    df = load_data(data_path=args.data, variant=args.variant)

     # 1. Find Twain threshold epoch
     print("\n1. Twain Model P-Threshold Analysis")
```
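The filtering logic in the new `load_data` can be exercised standalone on an in-memory frame. The demo below uses the same pandas semantics as the diff above; the column values are made up.

```python
import pandas as pd

# Toy frame standing in for data/model_results.pkl (values made up)
df = pd.DataFrame({
    "author": ["twain", "twain", "austen"],
    "variant": [None, "pos", None],
    "loss_value": [1.2, 2.3, 1.5],
})

# Baseline rows: variant is missing (None registers as NaN under isna())
baseline = df[df["variant"].isna()].copy()
# Variant rows: exact match on the variant label
pos_rows = df[df["variant"] == "pos"].copy()

print(len(baseline), len(pos_rows))  # 2 1
```

Note that `isna()` matches `None` as well as `NaN`, which is why baseline rows are selected this way rather than with `== None`.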

code/consolidate_model_results.py

Lines changed: 22 additions & 9 deletions

```diff
@@ -46,15 +46,19 @@ def consolidate_model_results(models_dir='models', output_path='data/model_resul
         dir_name = model_dir.name
         parts = dir_name.split('_')

-        # Extract author and seed from directory name
-        # Format: {author}_tokenizer={tokenizer}_seed={seed}
+        # Extract author, variant, tokenizer, and seed from directory name
+        # Baseline format: {author}_tokenizer={tokenizer}_seed={seed}
+        # Variant format: {author}_variant={variant}_tokenizer={tokenizer}_seed={seed}
         author = parts[0]

-        # Find tokenizer and seed
+        # Find variant, tokenizer, and seed
+        variant = None
         tokenizer = None
         seed = None
         for part in parts[1:]:
-            if part.startswith('tokenizer='):
+            if part.startswith('variant='):
+                variant = part.split('=')[1]
+            elif part.startswith('tokenizer='):
                 tokenizer = part.split('=')[1]
             elif part.startswith('seed='):
                 seed = int(part.split('=')[1])
@@ -75,6 +79,7 @@ def consolidate_model_results(models_dir='models', output_path='data/model_resul
         # Add model metadata
         df['model_name'] = dir_name
         df['author'] = author
+        df['variant'] = variant  # None for baseline, variant name for variant models
         df['tokenizer'] = tokenizer
         df['checkpoint_path'] = str(model_dir)

@@ -103,7 +108,7 @@ def consolidate_model_results(models_dir='models', output_path='data/model_resul
     # Ensure column order matches expected format
     expected_columns = [
         'seed', 'train_author', 'epochs_completed', 'loss_dataset',
-        'loss_value', 'model_name', 'author', 'tokenizer',
+        'loss_value', 'model_name', 'author', 'variant', 'tokenizer',
         'model_config', 'generation_config', 'checkpoint_path'
     ]

@@ -128,10 +133,18 @@ def consolidate_model_results(models_dir='models', output_path='data/model_resul
     print(f"Also saved CSV for inspection: {csv_path}")

     # Print summary statistics
-    print("\nSummary by author:")
-    summary = consolidated_df.groupby('train_author')['seed'].nunique()
-    for author, num_seeds in summary.items():
-        print(f"  {author}: {num_seeds} seeds")
+    print("\nSummary by author and variant:")
+    if 'variant' in consolidated_df.columns:
+        # Use dropna=False to include None (baseline) values
+        summary = consolidated_df.groupby(['train_author', 'variant'], dropna=False)['seed'].nunique()
+        for (author, variant), num_seeds in summary.items():
+            variant_label = "baseline" if pd.isna(variant) else variant
+            print(f"  {author} ({variant_label}): {num_seeds} seeds")
+    else:
+        # Fallback for old data without variant column
+        summary = consolidated_df.groupby('train_author')['seed'].nunique()
+        for author, num_seeds in summary.items():
+            print(f"  {author}: {num_seeds} seeds")

     return consolidated_df
```
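One subtlety in the summary code above: pandas' `groupby` drops `NaN` group keys by default, so baseline rows (`variant=None`) would silently vanish from the report without `dropna=False` (available in pandas 1.1+). A quick demonstration on made-up data:

```python
import pandas as pd

# Two baseline seeds and two 'pos' seeds for one author (made-up data)
df = pd.DataFrame({
    "train_author": ["twain"] * 4,
    "variant": [None, None, "pos", "pos"],
    "seed": [0, 1, 0, 1],
})

# With dropna=False the baseline group is kept; by default it is dropped
with_na = df.groupby(["train_author", "variant"], dropna=False)["seed"].nunique()
without_na = df.groupby(["train_author", "variant"])["seed"].nunique()

print(len(with_na), len(without_na))  # 2 1
```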

code/constants.py

Lines changed: 26 additions & 0 deletions

```diff
@@ -37,3 +37,29 @@ def find_project_root():
     "fitzgerald",
     "twain",
 ]
+
+# Analysis variants
+ANALYSIS_VARIANTS = ['content', 'function', 'pos']
+
+
+def get_data_dir(variant=None):
+    """
+    Get data directory based on analysis variant.
+
+    Args:
+        variant: One of ANALYSIS_VARIANTS or None for baseline
+
+    Returns:
+        Path to data directory
+    """
+    if variant is None:
+        return CLEANED_DATA_DIR
+
+    if variant not in ANALYSIS_VARIANTS:
+        raise ValueError(f"Invalid variant: {variant}. Must be one of {ANALYSIS_VARIANTS}")
+
+    variant_dir = CLEANED_DATA_DIR / f"{variant}_only"
+    if not variant_dir.exists():
+        raise FileNotFoundError(f"Variant directory not found: {variant_dir}")
+
+    return variant_dir
```
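The lookup logic in `get_data_dir` can be tried standalone by making the data root an explicit argument. This sketch replicates the behavior outside the repository; the repo's `CLEANED_DATA_DIR` constant is replaced by a parameter here.

```python
from pathlib import Path

ANALYSIS_VARIANTS = ['content', 'function', 'pos']

def get_data_dir(cleaned_data_dir, variant=None):
    # Same behavior as constants.get_data_dir, with the root made explicit
    if variant is None:
        return cleaned_data_dir
    if variant not in ANALYSIS_VARIANTS:
        raise ValueError(f"Invalid variant: {variant}. Must be one of {ANALYSIS_VARIANTS}")
    variant_dir = cleaned_data_dir / f"{variant}_only"
    if not variant_dir.exists():
        raise FileNotFoundError(f"Variant directory not found: {variant_dir}")
    return variant_dir

# Baseline passes the root through unchanged
print(get_data_dir(Path("data/cleaned")))
```

Note that variant directories are resolved as `{variant}_only` subdirectories of the cleaned-data root, matching the function above.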
