Stylometry CLI (local/offline) — v1.0

A small, offline Python tool that extracts stylometric artifacts and patterns from text and, optionally, computes simple similarity signals between corpora using character n-grams.

It’s designed to slot into your Stylometry Orchestrator workflow by emitting SAO-style ResultBundle_*.json files plus CSV artifacts.

What it does

For each document (and each chunk of a document), it computes:

Lexical
- word count, unique word count
- average word length
- MATTR lexical diversity (more length-robust than raw TTR)
Syntactic (proxy)
- average sentence length
- sentence length variation (population SD)
Habitual
- function word frequencies (configurable list)
- punctuation rates (commas, semicolons, etc., per 1000 words and per sentence)
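
The lexical, syntactic, and punctuation measures listed above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the tokenizer regex, the sentence splitter, the short-text fallback in `mattr`, and all function names are assumptions made for the example.

```python
import re
import statistics

WORD_RE = re.compile(r"[A-Za-z']+")  # assumed tokenizer; the tool's own may differ


def tokenize(text):
    """Lowercased word tokens (hypothetical helper)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def mattr(tokens, window=500):
    """Moving-average type-token ratio: mean TTR over every sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Shorter than one window: fall back to plain TTR (an assumption).
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def sentence_length_stats(text):
    """Mean and population SD of sentence length, measured in words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lengths = [len(tokenize(s)) for s in sentences]
    lengths = [n for n in lengths if n > 0]
    if not lengths:
        return 0.0, 0.0
    return statistics.fmean(lengths), statistics.pstdev(lengths)


def per_1000_words(text, mark=","):
    """Occurrences of one punctuation mark per 1000 words."""
    n_words = len(tokenize(text))
    return 1000.0 * text.count(mark) / n_words if n_words else 0.0
```

Because every window in `mattr` has the same length, the score does not shrink as a document grows, which is why it is more length-robust than raw TTR.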

If 2+ corpora are provided and there are enough chunks, it also computes:

Char n-gram TF-IDF centroid cosine similarity across corpora (corpus_similarity_char_ngrams.csv)
Nearest-centroid chunk assignment (chunk_assignments_char_ngrams.csv)

Note: these are signals, not proof of authorship. Topic, genre, and boilerplate can dominate the similarity scores.
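
To make the two comparison outputs concrete, here is a stdlib-only sketch of character n-gram TF-IDF, per-corpus centroids, cosine similarity, and nearest-centroid assignment. It is not the tool's implementation: the tool appears to rely on scikit-learn (the quick check under Install imports sklearn), and the n-gram length, the function names, and the decision to include each chunk in its own corpus centroid are all assumptions here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(chunks, n=3):
    """L2-normalized TF-IDF vectors (as dicts) for a list of text chunks."""
    counts = [Counter(char_ngrams(c, n)) for c in chunks]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    n_docs = len(chunks)
    vectors = []
    for c in counts:
        # Smoothed IDF, the same form scikit-learn uses by default.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({g: v / norm for g, v in vec.items()})
    return vectors


def centroid(vectors):
    """Element-wise mean of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for g, v in vec.items():
            total[g] += v
    return {g: v / len(vectors) for g, v in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def compare(corpora):
    """corpora: {label: [chunk_text, ...]} -> (similarities, assignments)."""
    labels, chunks = [], []
    for label, texts in corpora.items():
        labels.extend([label] * len(texts))
        chunks.extend(texts)
    vectors = tfidf_vectors(chunks)
    centroids = {
        lab: centroid([v for v, l in zip(vectors, labels) if l == lab])
        for lab in corpora
    }
    similarity = {(a, b): cosine(centroids[a], centroids[b])
                  for a in corpora for b in corpora}
    assigned = [max(centroids, key=lambda lab: cosine(v, centroids[lab]))
                for v in vectors]
    return similarity, list(zip(labels, assigned))
```

In this sketch each chunk contributes to its own corpus centroid, so the assignments are optimistic. Whether the tool holds chunks out before assigning them is not stated here.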

Requirements

Windows, macOS, or Linux
Python 3.12+
Dependencies installed with pip (listed in requirements.txt)

Install (Windows PowerShell)

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

Quick check:

python -c "import numpy, pandas, sklearn; print('ok')"

Input formats

You provide one or more --corpus LABEL=PATH arguments.

PATH can be:

a single .txt / .md file
a folder containing .txt / .md files (recursively)
a .zip archive containing .txt / .md files (recursively)
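
One way to handle the three PATH kinds is sketched below, using only the standard library. The function name `load_corpus`, the returned mapping, and the error raised for unsupported paths are assumptions for illustration; the tool's own loader may behave differently.

```python
import zipfile
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".md"}


def load_corpus(path):
    """Return {document_name: raw_bytes} for a file, folder, or .zip path."""
    path = Path(path)
    docs = {}
    if path.is_dir():
        # rglob walks subfolders, matching the "recursively" behaviour.
        for p in sorted(path.rglob("*")):
            if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES:
                docs[p.relative_to(path).as_posix()] = p.read_bytes()
    elif path.suffix.lower() == ".zip":
        with zipfile.ZipFile(path) as zf:
            # namelist() includes nested members; directory entries have no suffix.
            for name in sorted(zf.namelist()):
                if Path(name).suffix.lower() in TEXT_SUFFIXES:
                    docs[name] = zf.read(name)
    elif path.suffix.lower() in TEXT_SUFFIXES:
        docs[path.name] = path.read_bytes()
    else:
        raise ValueError(f"Unsupported corpus path: {path}")
    return docs
```

Returning raw bytes keeps decoding as a separate step, so encoding problems can be handled in one place.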

Examples of folder layouts that work:

Single corpus

my_corpus/
  speech1.txt
  speech2.txt
  speech3.txt

Multiple corpora

corpora/
  A/
    doc1.txt
    doc2.txt
  B/
    doc3.txt
    doc4.txt

You can point each corpus to its subfolder:

--corpus A=corpora/A --corpus B=corpora/B
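
For readers curious how repeated LABEL=PATH arguments can be parsed, here is a minimal argparse sketch. It is not taken from stylometry_run.py; the helper name `corpus_arg` and the choice to split at the first `=` are assumptions.

```python
import argparse
from pathlib import Path


def corpus_arg(value):
    """Split 'LABEL=PATH' at the first '=' so the path itself may contain '='."""
    label, sep, path = value.partition("=")
    if not sep or not label or not path:
        raise argparse.ArgumentTypeError(f"expected LABEL=PATH, got {value!r}")
    return label, Path(path)


parser = argparse.ArgumentParser()
# action="append" collects every --corpus occurrence into one list.
parser.add_argument("--corpus", action="append", type=corpus_arg, required=True)

args = parser.parse_args(["--corpus", "A=corpora/A", "--corpus", "B=corpora/B"])
corpora = dict(args.corpus)  # label -> Path, in command-line order
```

Note that `dict()` keeps only the last path if a label is repeated; a real tool would probably want to reject duplicates.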

Run examples

1) Characterize a single document

python stylometry_run.py --task characterize --corpus TextA=./speech1.txt --output ./out_textA

2) Build a profile from many documents (single corpus)

python stylometry_run.py --task profile_build --corpus PersonX=./my_corpus --output ./out_personx

3) Compare two corpora

python stylometry_run.py --task compare --corpus A=./corpora/A --corpus B=./corpora/B --output ./out_compare

4) Use zip archives

python stylometry_run.py --task compare --corpus A=./A.zip --corpus B=./B.zip --output ./out_compare_zip

Outputs

The output folder contains:

manifest.json — corpus manifest (doc list + word counts + local provenance paths)
doc_metrics.csv — per-document metrics
chunk_metrics.csv — per-chunk metrics
ResultBundle_ArtifactExtractor.json — SAO-compatible bundle describing artifacts produced
run_metadata.json — parameters and reproducibility info

If 2+ corpora are provided and there are enough chunks, it also writes:

corpus_similarity_char_ngrams.csv
chunk_assignments_char_ngrams.csv
ResultBundle_Comparator.json

If matplotlib is installed and working, it also saves:

plot_avg_sentence_len_boxplot.png
plot_mattr_boxplot.png
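
After a run, a quick way to see which of the files listed above were actually written is a small presence check like the one below. The file names come from this README; the helper name `summarize_outputs` and the grouping into core, comparison, and plots are assumptions.

```python
from pathlib import Path

EXPECTED = {
    "core": [
        "manifest.json",
        "doc_metrics.csv",
        "chunk_metrics.csv",
        "ResultBundle_ArtifactExtractor.json",
        "run_metadata.json",
    ],
    "comparison": [
        "corpus_similarity_char_ngrams.csv",
        "chunk_assignments_char_ngrams.csv",
        "ResultBundle_Comparator.json",
    ],
    "plots": [
        "plot_avg_sentence_len_boxplot.png",
        "plot_mattr_boxplot.png",
    ],
}


def summarize_outputs(out_dir):
    """Map each group to {file_name: True/False} depending on whether it exists."""
    out_dir = Path(out_dir)
    return {
        group: {name: (out_dir / name).is_file() for name in names}
        for group, names in EXPECTED.items()
    }
```

Missing comparison files usually just mean a single-corpus run or too few chunks, and missing plots mean matplotlib was unavailable, so neither is necessarily an error.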

Useful options

--chunk-words 1200 — set chunk size (default 1200)
--mattr-window 500 — MATTR window size (default 500)
--function-words-file path.txt — override function word list (newline-delimited)
--include-chunk-text — include chunk text in chunk_metrics.csv (can be large)
--char-analyzer char_wb|char — default char_wb (often better for stylometry)
--max-features 50000 and --min-df 2 — control n-gram feature size
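
As a rough picture of what --chunk-words controls, the sketch below splits a text into consecutive chunks of a fixed word count. Whitespace tokenization and keeping the short final chunk are assumptions; the tool may tokenize differently or drop or merge a trailing remainder.

```python
def chunk_words(text, chunk_size=1200):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Smaller chunks give more data points for the comparison step, but each one carries a noisier estimate of the per-chunk metrics.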

Notes for political/public-figure corpora

Prepared remarks and official publications can reflect speechwriters, staff editing, or transcript normalization rather than the named author's own style. Use “channel-specific” corpora where possible (e.g., floor speeches vs press releases vs prepared remarks).

Troubleshooting

If plots aren’t produced: make sure matplotlib is installed and that you have write permission for the output folder.
If you see Unicode errors: convert the source files to UTF-8; otherwise the script falls back to forgiving decodes.
If it’s slow on huge corpora: increase --min-df, reduce --max-features, or reduce the corpus size.
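
A "forgiving decode" can take several forms. The sketch below shows one common pattern: try UTF-8 first, then a legacy encoding, then a lossy last resort. The function name and the specific fallback order (cp1252 second) are assumptions, not a description of what the script does.

```python
def read_text_forgiving(raw: bytes) -> str:
    """Decode bytes as UTF-8 if possible, otherwise fall back step by step."""
    # utf-8-sig accepts UTF-8 with or without a byte-order mark.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never fails, but substitutes U+FFFD for undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

A fallback like this keeps a run from crashing, but wrongly guessed encodings can still alter characters, which is why converting sources to UTF-8 up front is the safer route.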
