This project implements a deep learning model for masked word prediction, a natural language processing task where the model predicts missing words in sentences. The current version achieves a 28% accuracy rate, with ongoing efforts to improve performance.
The model uses a recurrent neural network architecture with the following components:
- Pre-trained GloVe word embeddings (100-dimensional)
- Bidirectional GRU layers for capturing context from both directions
- Layer normalization and dropout for regularization
- Multiple stacked GRU layers for deeper feature extraction
- Dense output layer with softmax activation
Key configuration details are listed below; a minimal model sketch in Keras follows the list:
- Framework: TensorFlow/Keras
- Word Embeddings: GloVe 6B 100d
- Sequence Length: 256 tokens (padded)
- Vocabulary Size: 30,000 tokens
- Training Strategy: Masked language modeling approach
- Current Accuracy: 28%
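Below is a minimal sketch of this setup. Only the vocabulary size, sequence length, and embedding dimension come from the configuration above; the layer widths, dropout rate, and number of stacked GRU blocks are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 30_000   # vocabulary size from the configuration above
SEQ_LEN = 256         # padded sequence length
EMBED_DIM = 100       # GloVe 6B 100d

# embedding_matrix would normally be built from glove.6B.100d.txt
# (see the loading sketch further below); random values stand in here.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

model = models.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,), dtype="int32"),
    layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),  # first stacked GRU block
    layers.LayerNormalization(),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.GRU(128)),                         # second stacked GRU block
    layers.LayerNormalization(),
    layers.Dropout(0.3),
    layers.Dense(VOCAB_SIZE, activation="softmax"),                # distribution over the vocabulary
])
model.summary()
```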
The model processes input sentences by the following steps (a code sketch follows this list):
- Tokenizing text using Keras Tokenizer
- Padding sequences to uniform length
- Creating training examples by masking individual words
- Using the surrounding context to predict the masked word
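A sketch of this preprocessing is below. It assumes the mask is represented by the tokenizer's OOV slot and that one training example is created per word position; the column name `sentence` in `Train Data.csv` is also an assumption.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 30_000
SEQ_LEN = 256
MASK_ID = 1  # assumed: the index the Tokenizer reserves for its oov_token

# Column name "sentence" in Train Data.csv is an assumption.
train_sentences = pd.read_csv("Train Data.csv")["sentence"].astype(str).tolist()

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="[MASK]")
tokenizer.fit_on_texts(train_sentences)

def build_masked_examples(sentences):
    """Create (masked sequence, target word id) pairs, one per word position."""
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, maxlen=SEQ_LEN, padding="post")

    inputs, targets = [], []
    for seq in padded:
        for i, token_id in enumerate(seq):
            if token_id in (0, MASK_ID):   # skip padding and OOV positions
                continue
            masked = seq.copy()
            masked[i] = MASK_ID            # hide the word the model must predict
            inputs.append(masked)
            targets.append(token_id)
    return np.array(inputs), np.array(targets)

X_train, y_train = build_masked_examples(train_sentences)
```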
The model is trained with the following setup (a code sketch follows this list):
- Adam optimizer (learning rate: 0.001)
- Sparse categorical cross-entropy loss
- Early stopping based on validation loss
- Batch size of 64
- Maximum 10 epochs
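Continuing from the sketches above, training could look roughly like this. The optimizer, loss, batch size, and epoch count follow the list above; the early-stopping patience and the validation split are assumptions.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",   # targets are integer word ids
    metrics=["accuracy"],
)

early_stop = EarlyStopping(monitor="val_loss", patience=2,   # patience is an assumption
                           restore_best_weights=True)

history = model.fit(
    X_train, y_train,              # from the masking sketch above
    validation_split=0.1,          # assumed; a separate validation set may be used instead
    batch_size=64,
    epochs=10,
    callbacks=[early_stop],
)
model.save("gru_model.h5")
```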
The project relies on the following files:
- `Train Data.csv`: Contains training sentences
- `Test Datas.csv`: Contains test sentences with masked words
- `glove.6B.100d.txt`: Pre-trained GloVe word embeddings (see the loading sketch below)
- `gru_model.h5`: Saved model weights
- `submission.csv`: Generated predictions for evaluation
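For reference, a typical way to turn `glove.6B.100d.txt` into the embedding matrix used by the model is sketched below. It relies on the fitted tokenizer's `word_index` from the preprocessing sketch; leaving out-of-GloVe words as zero rows is an assumption.

```python
import numpy as np

EMBED_DIM = 100

def load_glove_matrix(path, word_index, vocab_size, embed_dim=EMBED_DIM):
    """Build the Embedding weight matrix from a GloVe text file.
    Words missing from GloVe keep all-zero rows."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    matrix = np.zeros((vocab_size, embed_dim), dtype="float32")
    for word, idx in word_index.items():
        if idx < vocab_size and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# word_index comes from the fitted Keras Tokenizer shown earlier.
embedding_matrix = load_glove_matrix("glove.6B.100d.txt", tokenizer.word_index, 30_000)
```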
Considering the current 28% accuracy, potential improvements include:
- Experimenting with larger embedding dimensions
- Adding attention mechanisms
- Increasing model capacity (more layers/units)
- Hyperparameter optimization
- Data augmentation techniques
- Alternative architectures (Transformer-based models)
Dependencies:
- TensorFlow 2.x
- NumPy
- Pandas
- GloVe word embeddings
To run the project (a prediction sketch follows these steps):
- Download GloVe embeddings
- Prepare training and test data in required CSV format
- Run the training script
- Generate predictions on masked sentences
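A sketch of the final step, writing `submission.csv` from `Test Datas.csv`, is shown below. The column names (`sentence`, `id`, `prediction`) and the assumption that test sentences already contain the mask token are illustrative only; it also reuses the fitted tokenizer from the preprocessing sketch.

```python
import pandas as pd
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Column names and the mask convention are assumptions;
# adjust them to the actual layout of Test Datas.csv.
test_df = pd.read_csv("Test Datas.csv")
model = load_model("gru_model.h5")

sequences = tokenizer.texts_to_sequences(test_df["sentence"].astype(str))
padded = pad_sequences(sequences, maxlen=256, padding="post")

probs = model.predict(padded, batch_size=64)
predicted_ids = probs.argmax(axis=-1)
predicted_words = [tokenizer.index_word.get(i, "[UNK]") for i in predicted_ids]

pd.DataFrame({"id": test_df.index, "prediction": predicted_words}).to_csv(
    "submission.csv", index=False
)
```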