Big Data Text Analysis Project - EKU Student Handbook 2022

Overview

This project implements a comprehensive text analysis pipeline for the EKU Student Handbook 2022 using three different approaches:

  1. Word Count Analysis using a MapReduce framework
  2. TF-IDF (Term Frequency-Inverse Document Frequency) analysis
  3. LDA (Latent Dirichlet Allocation) topic modeling

🚀 Features

Enhanced Components

  • Modular Design: Clean separation of concerns with reusable components
  • Error Handling: Robust error handling and logging throughout the pipeline
  • Configuration Management: Centralized configuration for easy customization
  • Testing Framework: Comprehensive test suite for validation
  • Documentation: Detailed documentation and type hints
  • Performance Optimization: Improved algorithms and data structures
  • Cross-Platform Support: Works on Windows, Linux, and macOS

Analysis Methods

1. Word Count Analysis

  • Purpose: Count frequency of words in the document
  • Implementation: MapReduce framework with preprocessing
  • Features:
    • Text cleaning and normalization
    • Stopword removal
    • Case-insensitive counting
    • Sorted output by frequency
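The mapper/reducer split above follows the classic streaming MapReduce pattern: the mapper emits tab-separated (word, 1) pairs and the reducer sums them per word. A minimal sketch of that pattern (the repository's mapper.py and reducer.py add text cleaning and stopword removal on top of it):

```python
from collections import defaultdict

def map_words(lines):
    """Mapper stage: emit a (word, 1) pair for every lowercase token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer stage: sum counts per word, most frequent first."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items(), key=lambda kv: -kv[1])

# "student" appears twice, so it comes out on top
for word, count in reduce_counts(map_words(["Student handbook student policy"])):
    print(f"{word}\t{count}")
```

In the real pipeline the two stages run as separate processes connected by a shell pipe, which is what lets Hadoop Streaming (or plain Unix pipes) parallelize the mapper.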

2. TF-IDF Analysis

  • Purpose: Evaluate term importance relative to document collection
  • Implementation: MapReduce with scikit-learn integration
  • Features:
    • Advanced text preprocessing
    • Configurable TF-IDF parameters
    • Multi-document support
    • Normalized scoring
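For reference, the underlying score for a term t in document d is tf(t, d) × idf(t) with idf(t) = log(N / df(t)). The repository delegates this to scikit-learn's TfidfVectorizer, which additionally smooths the idf and L2-normalizes each document; the textbook form can be sketched as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)) on raw token counts."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency per term
    n = len(docs)
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

scores = tf_idf([
    "academic policy applies to students",
    "campus parking policy and permits",
])
# "policy" occurs in both documents, so its idf is log(2/2) = 0;
# terms unique to one document, like "academic", score above 0.
```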

3. LDA Topic Modeling

  • Purpose: Discover latent topics in the document
  • Implementation: Enhanced with spaCy lemmatization
  • Features:
    • Lemmatization for better topic quality
    • Configurable number of topics
    • Top word extraction per topic
    • Improved preprocessing pipeline

πŸ“ Project Structure

BigDataCSC782/
├── config.py              # Configuration management
├── utils.py               # Utility functions
├── preprocess.py          # Word count preprocessing
├── mapper.py              # Word count mapper
├── reducer.py             # Word count reducer
├── preprocess_tfidf.py    # TF-IDF preprocessing
├── mapper_tfidf.py        # TF-IDF mapper
├── reducer_tfidf.py       # TF-IDF reducer
├── mapper_lda.py          # LDA mapper
├── reducer_lda.py         # LDA reducer
├── run_pipeline.sh        # Linux/Mac pipeline script
├── run_pipeline.bat       # Windows pipeline script
├── setup.py               # Linux/Mac setup script
├── setup_windows.bat      # Windows setup script
├── test_analysis.py       # Cross-platform test suite
├── test_windows.bat       # Windows test script
├── requirements.txt       # Python dependencies
├── input/                 # Input files directory
├── output/                # Intermediate outputs
├── results/               # Final analysis results
└── README.md              # This file

πŸ› οΈ Installation & Setup

Prerequisites

  • Python 3.8+
  • Virtual environment (recommended)
  • Bash shell (Linux/Mac) or Command Prompt (Windows)

Quick Start

Option 1: Automated Setup (Recommended)

Windows:

# Run the Windows setup script
setup_windows.bat

Linux/Mac:

# Run the Python setup script
python setup.py

Option 2: Manual Setup

  1. Clone and Setup:

    git clone <repository-url>
    cd BigDataCSC782
    python3 -m venv myenv
    source myenv/bin/activate  # On Windows: myenv\Scripts\activate
  2. Install Dependencies:

    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
  3. Run Tests:

    python test_analysis.py
  4. Execute Pipeline:

    # Linux/Mac
    chmod +x run_pipeline.sh
    ./run_pipeline.sh
    
    # Windows
    run_pipeline.bat

🔧 Configuration

Main Configuration (config.py)

  • Stopwords: Customizable stopword list
  • TF-IDF Parameters: Max features, document frequency thresholds
  • LDA Parameters: Number of topics, random state, iterations
  • Text Processing: Word length limits, cleanup patterns
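As an illustration only (the actual names and values in config.py may differ), a centralized configuration of this kind typically looks like:

```python
# Hypothetical sketch of the settings config.py centralizes;
# the real module's names and values may differ.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "for"}

TFIDF_PARAMS = {
    "max_features": 1000,   # cap on vocabulary size
    "min_df": 1,            # drop terms seen in fewer documents
    "max_df": 0.95,         # drop terms seen in >95% of documents
}

LDA_PARAMS = {
    "n_topics": 5,
    "random_state": 42,     # fixed seed for reproducible topics
    "max_iter": 100,
}

MIN_WORD_LENGTH = 3         # shorter tokens are discarded
```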

Environment Variables

  • LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)
  • PYTHONPATH: Ensure modules are discoverable
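A common way for the pipeline scripts to honor LOG_LEVEL (the exact wiring in this repository may differ) is to map the variable onto Python's logging levels at startup:

```python
import logging
import os

# Read LOG_LEVEL from the environment, falling back to INFO
# when it is unset or not a recognized level name.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
level = getattr(logging, level_name, logging.INFO)

logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")
logging.getLogger("pipeline").info("logging configured at %s", level_name)
```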

📊 Usage Examples

Individual Components

  1. Word Count Analysis:

    # Linux/Mac
    cat input.txt | python preprocess.py | python mapper.py | python reducer.py
    
    # Windows
    type input.txt | python preprocess.py | python mapper.py | python reducer.py
  2. TF-IDF Analysis:

    # Linux/Mac
    cat input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py
    
    # Windows
    type input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py
  3. LDA Analysis:

    # Linux/Mac
    cat input.txt | python mapper_lda.py | python reducer_lda.py
    
    # Windows
    type input.txt | python mapper_lda.py | python reducer_lda.py

Full Pipeline

# Linux/Mac
./run_pipeline.sh

# Windows
run_pipeline.bat

🧪 Testing

Run All Tests

# Linux/Mac
python test_analysis.py

# Windows
test_windows.bat

Test Individual Components

# Test preprocessing
# Linux/Mac
echo "test text" | python preprocess.py

# Windows
echo test text | python preprocess.py

# Test word count pipeline
# Linux/Mac
echo "test text" | python preprocess.py | python mapper.py | python reducer.py

# Windows
echo test text | python preprocess.py | python mapper.py | python reducer.py

📈 Output Formats

Word Count Output

word1    15
word2    12
word3    8

TF-IDF Output

Word                 Avg_TF-IDF
-----------------------------------
important_term       0.123456
another_term         0.098765

LDA Output

LDA Topic Analysis Results (5 topics):
==================================================

Topic 1:
  Top words: academic, policy, student, university, campus
--------------------------------------------------------------

πŸ” Key Improvements Made

Code Quality

  • βœ… Type hints and documentation
  • βœ… Error handling and logging
  • βœ… Modular design and reusability
  • βœ… Consistent coding standards

Performance

  • βœ… Optimized data structures
  • βœ… Efficient algorithms
  • βœ… Memory management
  • βœ… Parallel processing support

Maintainability

  • βœ… Configuration management
  • βœ… Testing framework
  • βœ… Clear documentation
  • βœ… Version control ready

Robustness

  • βœ… Input validation
  • βœ… Error recovery
  • βœ… Graceful degradation
  • βœ… Cross-platform compatibility

🚨 Troubleshooting

Common Issues & Solutions

1. Pip Upgrade Issues (Windows)

Problem: Error when trying to upgrade pip during setup

ERROR: To modify pip, please run the following command:
C:\Users\...\python.exe -m pip install --upgrade pip

Solution: Use the Windows-specific setup script: setup_windows.bat

2. Virtual Environment Issues

Problem: Virtual environment not found or activation fails

Solution:

# Linux/Mac
python3 -m venv myenv
source myenv/bin/activate

# Windows
python -m venv myenv
myenv\Scripts\activate

3. spaCy Model Installation Issues

Problem: spaCy English model fails to install

Solution:

pip install spacy
python -m spacy download en_core_web_sm

4. Import Errors

Problem: Module not found errors

Solution:

# Activate virtual environment first
source myenv/bin/activate  # Linux/Mac
myenv\Scripts\activate     # Windows

# Install missing packages
pip install -r requirements.txt

5. Permission Issues (Linux/Mac)

Problem: Permission denied when running scripts

Solution:

chmod +x run_pipeline.sh
chmod +x setup.py

6. File Not Found Errors

Problem: Input files not found

Solution:

  • Place your input files in the project directory
  • Or use the sample file: input/sample_input.txt
  • Check file paths and permissions

Debugging Steps

  1. Check Environment:

    python --version
    which python  # Linux/Mac
    where python  # Windows
    pip list
  2. Test Individual Components:

    # Test preprocessing
    echo "test text" | python preprocess.py
    
    # Test mapper
    echo "word1, word2" | python mapper.py
    
    # Test reducer
    echo -e "word1\t1\nword2\t1" | python reducer.py
  3. System Requirements:

    • RAM: Minimum 4GB, Recommended 8GB+
    • Disk Space: At least 2GB free
    • Python: 3.8 or higher
    • OS: Windows 10+, macOS 10.14+, or Linux

Quick Fix Checklist

  • Python 3.8+ installed
  • Virtual environment created and activated
  • Dependencies installed (pip install -r requirements.txt)
  • spaCy model downloaded (python -m spacy download en_core_web_sm)
  • Input files present in correct location
  • Proper permissions set (Linux/Mac)
  • Using correct commands for your OS

πŸ” Key Improvements Made

Code Quality

  • βœ… Type hints and documentation
  • βœ… Error handling and logging
  • βœ… Modular design and reusability
  • βœ… Consistent coding standards

Performance

  • βœ… Optimized data structures
  • βœ… Efficient algorithms
  • βœ… Memory management
  • βœ… Parallel processing support

Maintainability

  • βœ… Configuration management
  • βœ… Testing framework
  • βœ… Clear documentation
  • βœ… Version control ready

Robustness

  • βœ… Input validation
  • βœ… Error recovery
  • βœ… Graceful degradation
  • βœ… Cross-platform compatibility

Cross-Platform Support

  • βœ… Windows batch scripts (*.bat)
  • βœ… Linux/Mac shell scripts (*.sh)
  • βœ… OS-specific command handling
  • βœ… Automated setup for all platforms

🎯 Quick Reference Commands

Essential Commands:

# Setup
setup_windows.bat                    # Windows setup
python setup.py                      # Linux/Mac setup

# Activate environment
myenv\Scripts\activate               # Windows
source myenv/bin/activate            # Linux/Mac

# Run full analysis
run_pipeline.bat                     # Windows
./run_pipeline.sh                    # Linux/Mac

# Test components
test_windows.bat                     # Windows
python test_analysis.py              # Cross-platform

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

πŸ“ License

This project is part of the CSC 782 Big Data course at Eastern Kentucky University.

👥 Authors

  • Hamza Khattak
  • Enock Kipchumba

πŸ™ Acknowledgments

  • EKU Computer Science Department
  • Big Data course instructors
  • Open source community for libraries and tools

📚 Additional Resources

  • Configuration Guide: Edit config.py to customize parameters
  • Troubleshooting: See troubleshooting section above
  • Testing: Use platform-specific test scripts for validation
  • Documentation: All Python files include detailed docstrings and type hints
