A sophisticated next-word prediction system using LSTM neural networks, trained on Shakespeare's Hamlet. This project demonstrates advanced NLP techniques with deep learning for text generation and word prediction.
- LSTM-based Neural Network: Advanced recurrent neural network architecture for sequence prediction
- Shakespeare's Hamlet Dataset: Trained on classic literature for rich linguistic patterns
- Early Stopping: Prevents overfitting with intelligent training termination
- Interactive Web Interface: Streamlit-powered UI for real-time predictions
- Model Persistence: Save and load trained models for future use
- Preprocessing Pipeline: Complete text tokenization and sequence preparation
NextWord-AI/
├── app.py # Streamlit web application
├── experiments.ipynb # Jupyter notebook with model development
├── next_word_lstm.h5 # Trained LSTM model
├── tokenizer.pickle # Saved tokenizer for text processing
├── hamlet.txt # Shakespeare's Hamlet text dataset
├── requirements.txt # Python dependencies
└── README.md # Project documentation
1. Clone the repository

   git clone https://github.com/CyberMage7/NextWord-AI.git
   cd NextWord-AI

2. Install dependencies

   pip install -r requirements.txt
Launch the Streamlit interface for interactive predictions:
streamlit run app.py

Navigate to http://localhost:8501 in your browser and enter text to get next-word predictions.
Open and run the experiments.ipynb notebook to:
- Download and preprocess the Hamlet dataset
- Train the LSTM model with early stopping
- Save the trained model and tokenizer
- Test predictions on custom text
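The save/load step above can be sketched with the standard `pickle` pattern. In the actual notebook a Keras `Tokenizer` object is pickled and the model is saved separately with `model.save("next_word_lstm.h5")`; here a plain word-index dict stands in so the sketch is self-contained:

```python
import pickle

# A plain word-index dict stands in for the Keras Tokenizer the
# notebook pickles; the save/load pattern is identical.
word_index = {"to": 2, "be": 3, "or": 4, "not": 5}

# persist the tokenizer alongside the model
# (the model itself is saved with model.save("next_word_lstm.h5"))
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(word_index, f)

# restore it later for inference
with open("tokenizer.pickle", "rb") as f:
    restored = pickle.load(f)
```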
The LSTM model features:
- Embedding Layer: 100-dimensional word embeddings
- LSTM Layers: Two LSTM layers (150 and 100 units) with dropout
- Output Layer: Softmax activation for word probability distribution
- Early Stopping: Monitors validation loss with patience of 5 epochs
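The architecture above can be sketched in Keras as follows. The `vocab_size`, `max_len`, and dropout rate are placeholder assumptions; the real values come from the tokenized Hamlet corpus and the notebook's configuration:

```python
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.models import Sequential

vocab_size = 5000  # placeholder; actual value = len(tokenizer.word_index) + 1
max_len = 14       # placeholder; actual value = longest n-gram sequence

model = Sequential([
    Embedding(vocab_size, 100),               # 100-dimensional word embeddings
    LSTM(150, return_sequences=True),         # first LSTM layer
    Dropout(0.2),                             # assumed dropout rate
    LSTM(100),                                # second LSTM layer
    Dense(vocab_size, activation="softmax"),  # word probability distribution
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```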
- Data Collection: Downloads Shakespeare's Hamlet from NLTK corpus
- Preprocessing: Tokenizes text and creates n-gram sequences
- Sequence Padding: Ensures uniform input length
- Train/Test Split: 80/20 split for model validation
- Model Training: Uses categorical crossentropy loss with Adam optimizer
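The n-gram sequence and padding steps can be sketched without Keras. Each tokenized line expands into progressively longer prefixes, left-padded to a uniform length (as `pad_sequences` does by default); the last id of each row is the target word:

```python
def make_ngram_sequences(line_ids, max_len):
    """Expand one tokenized line into left-padded n-gram training rows."""
    rows = []
    for i in range(2, len(line_ids) + 1):
        ngram = line_ids[:i]
        padded = [0] * (max_len - len(ngram)) + ngram  # pre-padding
        rows.append(padded)
    return rows

# toy example: word ids for "to be or not to"
seqs = make_ngram_sequences([2, 3, 4, 5, 2], max_len=5)
# each row's last id is the target word, the rest are the input context
X = [row[:-1] for row in seqs]
y = [row[-1] for row in seqs]
```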
Input: "To be or not to"
Prediction: "be"
Input: "To be bad is better than"
Prediction: [context-dependent prediction]
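A prediction helper along these lines produces the outputs shown above. This is a hedged sketch, not the app's exact code: `predict_next_word` and `tokenizer_index` (a word-to-id dict) are hypothetical names, and the left-padding mirrors the training-time `pad_sequences` behavior:

```python
import numpy as np

def predict_next_word(model, tokenizer_index, text, max_len):
    """Return the most probable next word for the given text (sketch)."""
    token_ids = [tokenizer_index[w] for w in text.lower().split()
                 if w in tokenizer_index]
    token_ids = token_ids[-(max_len - 1):]  # keep the last max_len-1 tokens
    padded = [0] * (max_len - 1 - len(token_ids)) + token_ids  # left-pad
    probs = model.predict(np.array([padded]))[0]  # distribution over vocab
    next_id = int(np.argmax(probs))
    id_to_word = {i: w for w, i in tokenizer_index.items()}
    return id_to_word.get(next_id, "?")
```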
- Framework: TensorFlow/Keras
- Architecture: Sequential LSTM
- Optimizer: Adam
- Loss Function: Categorical Crossentropy
- Validation: Early stopping with best weight restoration
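The validation setup above corresponds to a standard Keras callback; a minimal sketch, assuming the callback is passed to `model.fit` alongside validation data:

```python
from tensorflow.keras.callbacks import EarlyStopping

# monitor validation loss, stop after 5 epochs without improvement,
# and roll back to the best-performing weights
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)
# used as: model.fit(X, y, validation_data=..., callbacks=[early_stop])
```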
- tensorflow>=2.16.0
- pandas
- numpy
- scikit-learn
- matplotlib
- tensorboard
- streamlit
- scikeras
- nltk
- Fork the repository
- Create a feature branch (git checkout -b feature/enhancement)
- Commit changes (git commit -am 'Add new feature')
- Push to branch (git push origin feature/enhancement)
- Open a Pull Request
This project is licensed under the terms specified in the LICENSE file.
- Shakespeare's works via NLTK Gutenberg corpus
- TensorFlow team for the deep learning framework
- Streamlit for the web interface framework