Big Data Text Analysis Project - EKU Student Handbook 2022

Overview

This project implements a comprehensive text analysis pipeline for the EKU Student Handbook 2022 using three different approaches:

  1. Word Count Analysis using a MapReduce framework
  2. TF-IDF (Term Frequency-Inverse Document Frequency) analysis
  3. LDA (Latent Dirichlet Allocation) topic modeling

🚀 Features

Enhanced Components

  • Modular Design: Clean separation of concerns with reusable components
  • Error Handling: Robust error handling and logging throughout the pipeline
  • Configuration Management: Centralized configuration for easy customization
  • Testing Framework: Comprehensive test suite for validation
  • Documentation: Detailed documentation and type hints
  • Performance Optimization: Improved algorithms and data structures
  • Cross-Platform Support: Works on Windows, Linux, and macOS

Analysis Methods

1. Word Count Analysis

  • Purpose: Count frequency of words in the document
  • Implementation: MapReduce framework with preprocessing
  • Features:
    • Text cleaning and normalization
    • Stopword removal
    • Case-insensitive counting
    • Sorted output by frequency
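The mapper/reducer split above follows the classic streaming MapReduce pattern: the mapper emits tab-separated (word, 1) pairs and the reducer sums them per word. A minimal sketch of that pattern (the repository's mapper.py and reducer.py add text cleaning and stopword removal on top of it):

```python
from collections import defaultdict

def map_words(lines):
    """Mapper stage: emit a (word, 1) pair for every lowercase token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer stage: sum counts per word, most frequent first."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items(), key=lambda kv: -kv[1])

# "student" appears twice, so it comes out on top
for word, count in reduce_counts(map_words(["Student handbook student policy"])):
    print(f"{word}\t{count}")
```

In the real pipeline the two stages run as separate processes connected by a shell pipe, which is what lets Hadoop Streaming (or plain Unix pipes) parallelize the mapper.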

2. TF-IDF Analysis

  • Purpose: Evaluate term importance relative to document collection
  • Implementation: MapReduce with scikit-learn integration
  • Features:
    • Advanced text preprocessing
    • Configurable TF-IDF parameters
    • Multi-document support
    • Normalized scoring
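For reference, the underlying score for a term t in document d is tf(t, d) × idf(t) with idf(t) = log(N / df(t)). The repository delegates this to scikit-learn's TfidfVectorizer, which additionally smooths the idf and L2-normalizes each document; the textbook form can be sketched as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)) on raw token counts."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency per term
    n = len(docs)
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

scores = tf_idf([
    "academic policy applies to students",
    "campus parking policy and permits",
])
# "policy" occurs in both documents, so its idf is log(2/2) = 0;
# terms unique to one document, like "academic", score above 0.
```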

3. LDA Topic Modeling

  • Purpose: Discover latent topics in the document
  • Implementation: Enhanced with spaCy lemmatization
  • Features:
    • Lemmatization for better topic quality
    • Configurable number of topics
    • Top word extraction per topic
    • Improved preprocessing pipeline

πŸ“ Project Structure

BigDataCSC782/
├── config.py              # Configuration management
├── utils.py               # Utility functions
├── preprocess.py          # Word count preprocessing
├── mapper.py              # Word count mapper
├── reducer.py             # Word count reducer
├── preprocess_tfidf.py    # TF-IDF preprocessing
├── mapper_tfidf.py        # TF-IDF mapper
├── reducer_tfidf.py       # TF-IDF reducer
├── mapper_lda.py          # LDA mapper
├── reducer_lda.py         # LDA reducer
├── run_pipeline.sh        # Linux/Mac pipeline script
├── run_pipeline.bat       # Windows pipeline script
├── setup.py               # Linux/Mac setup script
├── setup_windows.bat      # Windows setup script
├── test_analysis.py       # Cross-platform test suite
├── test_windows.bat       # Windows test script
├── requirements.txt       # Python dependencies
├── input/                 # Input files directory
├── output/                # Intermediate outputs
├── results/               # Final analysis results
└── README.md              # This file

πŸ› οΈ Installation & Setup

Prerequisites

  • Python 3.8+
  • Virtual environment (recommended)
  • Bash shell (Linux/Mac) or Command Prompt (Windows)

Quick Start

Option 1: Automated Setup (Recommended)

Windows:

# Run the Windows setup script
setup_windows.bat

Linux/Mac:

# Run the Python setup script
python setup.py

Option 2: Manual Setup

  1. Clone and Setup:

    git clone <repository-url>
    cd BigDataCSC782
    python3 -m venv myenv
    source myenv/bin/activate  # On Windows: myenv\Scripts\activate
  2. Install Dependencies:

    pip install -r requirements.txt
    python -m spacy download en_core_web_sm
  3. Run Tests:

    python test_analysis.py
  4. Execute Pipeline:

    # Linux/Mac
    chmod +x run_pipeline.sh
    ./run_pipeline.sh
    
    # Windows
    run_pipeline.bat

🔧 Configuration

Main Configuration (config.py)

  • Stopwords: Customizable stopword list
  • TF-IDF Parameters: Max features, document frequency thresholds
  • LDA Parameters: Number of topics, random state, iterations
  • Text Processing: Word length limits, cleanup patterns
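As an illustration only (the actual names and values in config.py may differ), a centralized configuration of this kind typically looks like:

```python
# Hypothetical sketch of the settings config.py centralizes;
# the real module's names and values may differ.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "for"}

TFIDF_PARAMS = {
    "max_features": 1000,   # cap on vocabulary size
    "min_df": 1,            # drop terms seen in fewer documents
    "max_df": 0.95,         # drop terms seen in >95% of documents
}

LDA_PARAMS = {
    "n_topics": 5,
    "random_state": 42,     # fixed seed for reproducible topics
    "max_iter": 100,
}

MIN_WORD_LENGTH = 3         # shorter tokens are discarded
```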

Environment Variables

  • LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)
  • PYTHONPATH: Ensure modules are discoverable
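A common way for the pipeline scripts to honor LOG_LEVEL (the exact wiring in this repository may differ) is to map the variable onto Python's logging levels at startup:

```python
import logging
import os

# Read LOG_LEVEL from the environment, falling back to INFO
# when it is unset or not a recognized level name.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
level = getattr(logging, level_name, logging.INFO)

logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")
logging.getLogger("pipeline").info("logging configured at %s", level_name)
```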

📊 Usage Examples

Individual Components

  1. Word Count Analysis:

    # Linux/Mac
    cat input.txt | python preprocess.py | python mapper.py | python reducer.py
    
    # Windows
    type input.txt | python preprocess.py | python mapper.py | python reducer.py
  2. TF-IDF Analysis:

    # Linux/Mac
    cat input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py
    
    # Windows
    type input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py
  3. LDA Analysis:

    # Linux/Mac
    cat input.txt | python mapper_lda.py | python reducer_lda.py
    
    # Windows
    type input.txt | python mapper_lda.py | python reducer_lda.py

Full Pipeline

# Linux/Mac
./run_pipeline.sh

# Windows
run_pipeline.bat

🧪 Testing

Run All Tests

# Linux/Mac
python test_analysis.py

# Windows
test_windows.bat

Test Individual Components

# Test preprocessing
# Linux/Mac
echo "test text" | python preprocess.py

# Windows
echo test text | python preprocess.py

# Test word count pipeline
# Linux/Mac
echo "test text" | python preprocess.py | python mapper.py | python reducer.py

# Windows
echo test text | python preprocess.py | python mapper.py | python reducer.py

📈 Output Formats

Word Count Output

word1    15
word2    12
word3    8

TF-IDF Output

Word                 Avg_TF-IDF
-----------------------------------
important_term       0.123456
another_term         0.098765

LDA Output

LDA Topic Analysis Results (5 topics):
==================================================

Topic 1:
  Top words: academic, policy, student, university, campus
--------------------------------------------------------------

πŸ” Key Improvements Made

Code Quality

  • βœ… Type hints and documentation
  • βœ… Error handling and logging
  • βœ… Modular design and reusability
  • βœ… Consistent coding standards

Performance

  • βœ… Optimized data structures
  • βœ… Efficient algorithms
  • βœ… Memory management
  • βœ… Parallel processing support

Maintainability

  • βœ… Configuration management
  • βœ… Testing framework
  • βœ… Clear documentation
  • βœ… Version control ready

Robustness

  • βœ… Input validation
  • βœ… Error recovery
  • βœ… Graceful degradation
  • βœ… Cross-platform compatibility

🚨 Troubleshooting

Common Issues & Solutions

1. Pip Upgrade Issues (Windows)

Problem: Error when trying to upgrade pip during setup

ERROR: To modify pip, please run the following command:
C:\Users\...\python.exe -m pip install --upgrade pip

Solution: Use the Windows-specific setup script: setup_windows.bat

2. Virtual Environment Issues

Problem: Virtual environment not found or activation fails

Solution:

# Linux/Mac
python3 -m venv myenv
source myenv/bin/activate

# Windows
python -m venv myenv
myenv\Scripts\activate

3. spaCy Model Installation Issues

Problem: spaCy English model fails to install

Solution:

pip install spacy
python -m spacy download en_core_web_sm

4. Import Errors

Problem: Module not found errors

Solution:

# Activate virtual environment first
source myenv/bin/activate  # Linux/Mac
myenv\Scripts\activate     # Windows

# Install missing packages
pip install -r requirements.txt

5. Permission Issues (Linux/Mac)

Problem: Permission denied when running scripts

Solution:

chmod +x run_pipeline.sh
chmod +x setup.py

6. File Not Found Errors

Problem: Input files not found

Solution:

  • Place your input files in the project directory
  • Or use the sample file: input/sample_input.txt
  • Check file paths and permissions

Debugging Steps

  1. Check Environment:

    python --version
    which python  # Linux/Mac
    where python  # Windows
    pip list
  2. Test Individual Components:

    # Test preprocessing
    echo "test text" | python preprocess.py
    
    # Test mapper
    echo "word1, word2" | python mapper.py
    
    # Test reducer
    echo -e "word1\t1\nword2\t1" | python reducer.py
  3. System Requirements:

    • RAM: Minimum 4GB, Recommended 8GB+
    • Disk Space: At least 2GB free
    • Python: 3.8 or higher
    • OS: Windows 10+, macOS 10.14+, or Linux

Quick Fix Checklist

  • Python 3.8+ installed
  • Virtual environment created and activated
  • Dependencies installed (pip install -r requirements.txt)
  • spaCy model downloaded (python -m spacy download en_core_web_sm)
  • Input files present in correct location
  • Proper permissions set (Linux/Mac)
  • Using correct commands for your OS

πŸ” Key Improvements Made

Code Quality

  • βœ… Type hints and documentation
  • βœ… Error handling and logging
  • βœ… Modular design and reusability
  • βœ… Consistent coding standards

Performance

  • βœ… Optimized data structures
  • βœ… Efficient algorithms
  • βœ… Memory management
  • βœ… Parallel processing support

Maintainability

  • βœ… Configuration management
  • βœ… Testing framework
  • βœ… Clear documentation
  • βœ… Version control ready

Robustness

  • βœ… Input validation
  • βœ… Error recovery
  • βœ… Graceful degradation
  • βœ… Cross-platform compatibility

Cross-Platform Support

  • βœ… Windows batch scripts (*.bat)
  • βœ… Linux/Mac shell scripts (*.sh)
  • βœ… OS-specific command handling
  • βœ… Automated setup for all platforms

🎯 Quick Reference Commands

Essential Commands:

# Setup
setup_windows.bat                    # Windows setup
python setup.py                      # Linux/Mac setup

# Activate environment
myenv\Scripts\activate               # Windows
source myenv/bin/activate            # Linux/Mac

# Run full analysis
run_pipeline.bat                     # Windows
./run_pipeline.sh                    # Linux/Mac

# Test components
test_windows.bat                     # Windows
python test_analysis.py              # Cross-platform

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

πŸ“ License

This project is part of the CSC 782 Big Data course at Eastern Kentucky University.

👥 Authors

  • Hamza Khattak
  • Enock Kipchumba

πŸ™ Acknowledgments

  • EKU Computer Science Department
  • Big Data course instructors
  • Open source community for libraries and tools

📚 Additional Resources

  • Configuration Guide: Edit config.py to customize parameters
  • Troubleshooting: See troubleshooting section above
  • Testing: Use platform-specific test scripts for validation
  • Documentation: All Python files include detailed docstrings and type hints
