Author: Jonas Gann
Course: Generative Neural Networks, University of Heidelberg
This project implements a Transformer model from scratch to generate bash commands, serving as both an educational tool for understanding Transformer architecture and a practical bash command autocompletion system. Instead of training on natural language, we focus on shell command patterns to create a specialized generative model for command-line interactions.
- Custom Transformer Implementation: Built from scratch with multi-head self-attention
- Bash Command Generation: Trained specifically on shell command datasets
- Character-level Tokenization: 543-token vocabulary for shell commands
- Hyperparameter Optimization: Integrated Optuna for automated tuning
- Experiment Tracking: Weights & Biases (wandb) integration
- Interactive Notebooks: Easy-to-use Jupyter interfaces for training and inference
 
The implementation includes several key components:
- SelfAttentionBlock: Single attention head with causal masking (see the sketch after this list)
- MultiHeadSelfAttention: Multiple parallel attention heads with a joint output projection
- TransformerDecoder: Stack of transformer decoder layers with residual connections
- Positional Encoding: Learned positional embeddings for sequence modeling
 
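To make the pieces concrete, here is a minimal, illustrative sketch of a causal attention head and the multi-head wrapper. Class and parameter names are examples, not necessarily those used in transformer.ipynb:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One causal self-attention head (names illustrative)."""
    def __init__(self, embed_size, head_size, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # scaled dot product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                                # (B, T, head_size)

class MultiHeadSelfAttention(nn.Module):
    """Several heads in parallel, concatenated and projected back."""
    def __init__(self, num_heads, embed_size, block_size, dropout):
        super().__init__()
        head_size = embed_size // num_heads
        self.heads = nn.ModuleList(
            [SelfAttentionHead(embed_size, head_size, block_size, dropout)
             for _ in range(num_heads)]
        )
        self.proj = nn.Linear(embed_size, embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, embed_size)
        return self.dropout(self.proj(out))
```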
Key model settings:
- Block Size: 256 tokens (configurable)
- Vocabulary: 543 unique characters from shell commands
- Dropout: 0.2 for regularization
- Architecture: Decoder-only transformer (GPT-style)
 
Project layout:

```
shell-transformer/
├── README.md                           # This file
├── transformer.ipynb                   # Main training and inference notebook
├── data.ipynb                          # Data preprocessing and analysis
├── stoi                                # String-to-index vocabulary mapping
├── itos                                # Index-to-string vocabulary mapping
├── optuna.db                           # Hyperparameter optimization database
└── final_with_preprocessing/
    └── jumping-river-27/
        └── shell_transformer_23000    # Trained model checkpoint
```
Install the dependencies:

```bash
pip install torch numpy optuna wandb plotly bashlex
```

- Load and Generate Commands:

```python
# Open transformer.ipynb and run the cells to:
# - Load the pre-trained model
# - Generate new bash commands
# - Experiment with different prompts
```

- Interactive Generation (see the sampling sketch after this list): The notebook provides an easy interface to:
  - Input partial commands
  - Generate completions
  - Explore model predictions

- Data Preparation:

```python
# Use data.ipynb to:
# - Load bash command datasets
# - Clean and preprocess data
# - Create vocabulary mappings
```

- Model Training:

```python
# In transformer.ipynb:
# - Configure hyperparameters
# - Train the model
# - Monitor training with wandb
# - Save checkpoints
```
 
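Under the hood, completion is plain autoregressive sampling from the trained checkpoint. The sketch below shows the idea; how the checkpoint and the stoi/itos files are serialized is an assumption here, and the model's forward signature may differ in the notebook:

```python
import pickle
import torch

# Assumption: the stoi/itos files in the repository root are pickled dicts
with open("stoi", "rb") as f:
    stoi = pickle.load(f)   # e.g. {"a": 0, "b": 1, ...}
with open("itos", "rb") as f:
    itos = pickle.load(f)   # e.g. {0: "a", 1: "b", ...}

# Assumption: the checkpoint stores the whole model object
model = torch.load(
    "final_with_preprocessing/jumping-river-27/shell_transformer_23000",
    map_location="cpu",
)
model.eval()

@torch.no_grad()
def complete(prompt, max_new_tokens=64, block_size=256):
    """Sample one character at a time until max_new_tokens is reached."""
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -block_size:])      # assumed shape (1, T, vocab)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return "".join(itos[i] for i in idx[0].tolist())

print(complete("git add"))
```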
The model is trained on multiple bash command datasets:
- MUNI KYPO Commands: Shell commands from cybersecurity training environments
- Bash History Dataset: Real-world bash command histories
- Shell Dataset: Curated shell command examples
 
Total Commands: 100k+ bash commands
Vocabulary Size: 543 unique characters
Command Types: File operations, system commands, git operations, package management, etc.
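Because tokenization is character-level, the vocabulary is simply the sorted set of characters occurring in the corpus. A minimal sketch of how the stoi/itos mappings can be derived (the three-command corpus here is a stand-in for the real datasets):

```python
# commands: the raw bash command strings loaded in data.ipynb (tiny stand-in here)
commands = ["git add .", "ls -la", "cp file.txt backup/"]

chars = sorted(set("".join(commands)))         # 543 entries on the full corpus
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> index
itos = {i: ch for i, ch in enumerate(chars)}   # index -> string

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("ls -la")) == "ls -la"
```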
Key hyperparameters:
- Embedding Size: Configurable (typically 128-512)
- Number of Layers: Optimized via Optuna (see the objective sketch after this list)
- Attention Heads: Configurable multi-head setup
- Learning Rate: Adaptive with evaluation-based scheduling
- Batch Size: Optimized for available hardware
 
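Optuna-based tuning boils down to an objective function that samples these hyperparameters and returns a validation loss, with results persisted to optuna.db. A hedged sketch, where train_and_evaluate is a stand-in for the notebook's training code and the search ranges are illustrative:

```python
import optuna

def objective(trial):
    # Sample the hyperparameters listed above (ranges are illustrative)
    embed_size = trial.suggest_categorical("embed_size", [128, 256, 512])
    n_layers = trial.suggest_int("n_layers", 2, 8)
    n_heads = trial.suggest_categorical("n_heads", [4, 8])
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)

    # Stand-in for building the model and running the training loop
    val_loss = train_and_evaluate(embed_size, n_layers, n_heads, lr)
    return val_loss

# Persist trials to optuna.db, matching the file in the project tree
study = optuna.create_study(
    storage="sqlite:///optuna.db",
    study_name="shell-transformer",
    direction="minimize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=50)
```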
Training setup (a simplified loop is sketched after this list):
- Loss Function: Cross-entropy loss for next-token prediction
- Optimization: Adam optimizer with learning rate scheduling
- Evaluation: Regular validation on a held-out test set
- Early Stopping: Based on validation loss improvements
 
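The loop itself is a standard next-token training recipe. In this sketch, model, get_batch, and estimate_loss are assumed to be defined as in the notebook, and the constants come from the Configuration section below:

```python
import torch
import torch.nn.functional as F
import wandb

# model, get_batch, and estimate_loss are assumed notebook helpers;
# batch_size, learning_rate, max_iters, eval_interval match the config below.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(max_iters):
    xb, yb = get_batch("train")              # each (batch_size, block_size)
    logits = model(xb)                       # (B, T, vocab_size)
    # Cross-entropy over every next-token prediction in the batch
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if it % eval_interval == 0:
        val_loss = estimate_loss("val")      # mean loss over eval_iters batches
        wandb.log({"iter": it, "train": loss.item(), "val": val_loss})
```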
The model performance is evaluated on:
- Perplexity: Measure of prediction uncertainty (the exponential of the mean cross-entropy loss)
- Generation Quality: Manual assessment of generated commands
- Completion Accuracy: How well the model completes partial commands
- Syntax Validity: Whether generated commands are syntactically correct (checkable with bashlex; see the sketch after this list)
 
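Two of these metrics are mechanical to compute: perplexity is the exponential of the mean cross-entropy loss, and bashlex's parser doubles as a syntax checker. A small self-contained sketch (the loss value is made up):

```python
import math
import bashlex

# Perplexity from the mean next-token cross-entropy (natural log base)
val_loss = 1.32                      # illustrative value
perplexity = math.exp(val_loss)

def is_valid_bash(command):
    """Use bashlex's parser as a cheap syntax-validity check."""
    try:
        bashlex.parse(command)
        return True
    except Exception:  # bashlex raises parsing errors on invalid input
        return False

print(perplexity, is_valid_bash("ls -la"), is_valid_bash("if then fi ((("))
```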
Key configuration parameters in the notebooks:
```python
# Model Configuration
block_size = 256        # Maximum sequence length
dropout = 0.2           # Dropout rate
eval_interval = 500     # Evaluation frequency
eval_iters = 200        # Evaluation iterations

# Training Configuration
batch_size = 64         # Training batch size
learning_rate = 1e-3    # Initial learning rate
max_iters = 10000       # Maximum training iterations
```

Example completions:

```
Input:  "git add"
Output: "git add ."
        "git add -A"
        "git add file.py"

Input:  "ls -"
Output: "ls -la"
        "ls -lah"
        "ls -lt"

Input:  "cp "
Output: "cp file.txt backup/"
        "cp -r directory/ destination/"
```

Possible extensions:
- Context Awareness: Incorporate current directory and file listings
- Command Validation: Add syntax checking for generated commands
- Interactive CLI: Build a command-line interface for real-time completion
- Fine-tuning: Domain-specific adaptation for different environments
- Multi-modal: Incorporate command documentation and man pages
 
Dependencies:
- PyTorch: Deep learning framework
- NumPy: Numerical computations
- Optuna: Hyperparameter optimization
- Weights & Biases: Experiment tracking
- bashlex: Bash command parsing
- Plotly: Interactive visualizations
 
Hardware requirements:
- GPU: Recommended for training (CUDA support)
- RAM: 8GB+ for training, 4GB+ for inference
- Storage: 1GB+ for datasets and model checkpoints
 
This is an educational project for the Generative Neural Networks course. If you'd like to extend or improve the model:
- Fork the repository
- Create a feature branch
- Implement your improvements
- Add tests and documentation
- Submit a pull request
 
This project was developed as part of the "Generative Neural Networks" course at the University of Heidelberg. The goal was to implement a Transformer model from scratch to gain hands-on experience with:
- Attention mechanisms
- Transformer architecture
- Autoregressive generation
- Sequence modeling
- Neural language modeling
 
- Jonas Gann: GitHub Profile
 
Built with ❤️ for learning and understanding Transformer architectures