This project implements a deep learning model for masked word prediction, a natural language processing task where the model predicts missing words in sentences. The current version achieves a 28% accuracy rate, with ongoing efforts to improve performance.
The model uses a recurrent neural network architecture with the following components:
- Pre-trained GloVe word embeddings (100-dimensional)
- Bidirectional GRU layers for capturing context from both directions
- Layer normalization and dropout for regularization
- Multiple stacked GRU layers for deeper feature extraction
- Dense output layer with softmax activation
Key configuration details are listed below; a minimal model sketch in Keras follows the list:
- Framework: TensorFlow/Keras
- Word Embeddings: GloVe 6B 100d
- Sequence Length: 256 tokens (padded)
- Vocabulary Size: 30,000 tokens
- Training Strategy: Masked language modeling approach
- Current Accuracy: 28%
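Below is a minimal sketch of this setup. Only the vocabulary size, sequence length, and embedding dimension come from the configuration above; the layer widths, dropout rate, and number of stacked GRU blocks are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 30_000   # vocabulary size from the configuration above
SEQ_LEN = 256         # padded sequence length
EMBED_DIM = 100       # GloVe 6B 100d

# embedding_matrix would normally be built from glove.6B.100d.txt
# (see the loading sketch further below); random values stand in here.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

model = models.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,), dtype="int32"),
    layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),  # first stacked GRU block
    layers.LayerNormalization(),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.GRU(128)),                         # second stacked GRU block
    layers.LayerNormalization(),
    layers.Dropout(0.3),
    layers.Dense(VOCAB_SIZE, activation="softmax"),                # distribution over the vocabulary
])
model.summary()
```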
The model processes input sentences by the following steps (a code sketch follows this list):
- Tokenizing text using Keras Tokenizer
- Padding sequences to uniform length
- Creating training examples by masking individual words
- Using the surrounding context to predict the masked word
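A sketch of this preprocessing is below. It assumes the mask is represented by the tokenizer's OOV slot and that one training example is created per word position; the column name `sentence` in `Train Data.csv` is also an assumption.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 30_000
SEQ_LEN = 256
MASK_ID = 1  # assumed: the index the Tokenizer reserves for its oov_token

# Column name "sentence" in Train Data.csv is an assumption.
train_sentences = pd.read_csv("Train Data.csv")["sentence"].astype(str).tolist()

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="[MASK]")
tokenizer.fit_on_texts(train_sentences)

def build_masked_examples(sentences):
    """Create (masked sequence, target word id) pairs, one per word position."""
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, maxlen=SEQ_LEN, padding="post")

    inputs, targets = [], []
    for seq in padded:
        for i, token_id in enumerate(seq):
            if token_id in (0, MASK_ID):   # skip padding and OOV positions
                continue
            masked = seq.copy()
            masked[i] = MASK_ID            # hide the word the model must predict
            inputs.append(masked)
            targets.append(token_id)
    return np.array(inputs), np.array(targets)

X_train, y_train = build_masked_examples(train_sentences)
```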
The model is trained with the following setup (a code sketch follows this list):
- Adam optimizer (learning rate: 0.001)
- Sparse categorical cross-entropy loss
- Early stopping based on validation loss
- Batch size of 64
- Maximum 10 epochs
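Continuing from the sketches above, training could look roughly like this. The optimizer, loss, batch size, and epoch count follow the list above; the early-stopping patience and the validation split are assumptions.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",   # targets are integer word ids
    metrics=["accuracy"],
)

early_stop = EarlyStopping(monitor="val_loss", patience=2,   # patience is an assumption
                           restore_best_weights=True)

history = model.fit(
    X_train, y_train,              # from the masking sketch above
    validation_split=0.1,          # assumed; a separate validation set may be used instead
    batch_size=64,
    epochs=10,
    callbacks=[early_stop],
)
model.save("gru_model.h5")
```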
The project relies on the following files:
- `Train Data.csv`: Contains training sentences
- `Test Datas.csv`: Contains test sentences with masked words
- `glove.6B.100d.txt`: Pre-trained GloVe word embeddings (see the loading sketch below)
- `gru_model.h5`: Saved model weights
- `submission.csv`: Generated predictions for evaluation
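For reference, a typical way to turn `glove.6B.100d.txt` into the embedding matrix used by the model is sketched below. It relies on the fitted tokenizer's `word_index` from the preprocessing sketch; leaving out-of-GloVe words as zero rows is an assumption.

```python
import numpy as np

EMBED_DIM = 100

def load_glove_matrix(path, word_index, vocab_size, embed_dim=EMBED_DIM):
    """Build the Embedding weight matrix from a GloVe text file.
    Words missing from GloVe keep all-zero rows."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    matrix = np.zeros((vocab_size, embed_dim), dtype="float32")
    for word, idx in word_index.items():
        if idx < vocab_size and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# word_index comes from the fitted Keras Tokenizer shown earlier.
embedding_matrix = load_glove_matrix("glove.6B.100d.txt", tokenizer.word_index, 30_000)
```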
Considering the current 28% accuracy, potential improvements include:
- Experimenting with larger embedding dimensions
- Adding attention mechanisms
- Increasing model capacity (more layers/units)
- Hyperparameter optimization
- Data augmentation techniques
- Alternative architectures (Transformer-based models)
Dependencies:
- TensorFlow 2.x
- NumPy
- Pandas
- GloVe word embeddings
To run the project (a prediction sketch follows these steps):
- Download GloVe embeddings
- Prepare training and test data in required CSV format
- Run the training script
- Generate predictions on masked sentences
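A sketch of the final step, writing `submission.csv` from `Test Datas.csv`, is shown below. The column names (`sentence`, `id`, `prediction`) and the assumption that test sentences already contain the mask token are illustrative only; it also reuses the fitted tokenizer from the preprocessing sketch.

```python
import pandas as pd
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Column names and the mask convention are assumptions;
# adjust them to the actual layout of Test Datas.csv.
test_df = pd.read_csv("Test Datas.csv")
model = load_model("gru_model.h5")

sequences = tokenizer.texts_to_sequences(test_df["sentence"].astype(str))
padded = pad_sequences(sequences, maxlen=256, padding="post")

probs = model.predict(padded, batch_size=64)
predicted_ids = probs.argmax(axis=-1)
predicted_words = [tokenizer.index_word.get(i, "[UNK]") for i in predicted_ids]

pd.DataFrame({"id": test_df.index, "prediction": predicted_words}).to_csv(
    "submission.csv", index=False
)
```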