This project implements a comprehensive text analysis pipeline for the EKU Student Handbook 2022 using three different approaches:
- Word Count Analysis using MapReduce framework
- TF-IDF (Term Frequency-Inverse Document Frequency) analysis
- LDA (Latent Dirichlet Allocation) topic modeling
Key features:

- Modular Design: Clean separation of concerns with reusable components
- Error Handling: Robust error handling and logging throughout the pipeline
- Configuration Management: Centralized configuration for easy customization
- Testing Framework: Comprehensive test suite for validation
- Documentation: Detailed documentation and type hints
- Performance Optimization: Improved algorithms and data structures
- Cross-Platform Support: Works on Windows, Linux, and macOS
Word Count Analysis:

- Purpose: Count the frequency of each word in the document
- Implementation: MapReduce framework with preprocessing
- Features:
  - Text cleaning and normalization
  - Stopword removal
  - Case-insensitive counting
  - Output sorted by frequency
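The map and reduce stages above can be sketched in a few lines of Python. This is an illustrative simplification, not the project's actual `mapper.py`/`reducer.py`, which add fuller cleaning and stopword removal:

```python
import sys
from collections import Counter

def map_words(lines):
    """Map stage: emit a (word, 1) pair for each cleaned, lowercased token."""
    for line in lines:
        for token in line.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word:
                yield word, 1

def reduce_counts(pairs):
    """Reduce stage: sum counts per word, sorted by descending frequency."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts.most_common()

if __name__ == "__main__":
    # Streaming usage, as in the pipeline: text on stdin, "word<TAB>count" on stdout
    for word, count in reduce_counts(map_words(sys.stdin)):
        print(f"{word}\t{count}")
```

In a real MapReduce deployment the shuffle between the two stages is handled by the framework; piping mapper output into the reducer emulates it for a single machine.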
TF-IDF Analysis:

- Purpose: Evaluate term importance relative to the document collection
- Implementation: MapReduce with scikit-learn integration
- Features:
  - Advanced text preprocessing
  - Configurable TF-IDF parameters
  - Multi-document support
  - Normalized scoring
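The underlying score can be illustrated in plain Python using the classical tf × idf definition (a simplified sketch; the pipeline's scikit-learn backend additionally applies idf smoothing and normalization):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [["student", "policy", "policy"], ["campus", "policy"]]
scores = tf_idf(docs)
# "policy" occurs in every document, so its idf (and score) is 0,
# while "student" is distinctive to the first document.
```

The example shows why TF-IDF downweights handbook-wide boilerplate terms and surfaces words that characterize individual sections.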
LDA Topic Modeling:

- Purpose: Discover latent topics in the document
- Implementation: Enhanced with spaCy lemmatization
- Features:
  - Lemmatization for better topic quality
  - Configurable number of topics
  - Top-word extraction per topic
  - Improved preprocessing pipeline
```
BigDataCSC782/
├── config.py             # Configuration management
├── utils.py              # Utility functions
├── preprocess.py         # Word count preprocessing
├── mapper.py             # Word count mapper
├── reducer.py            # Word count reducer
├── preprocess_tfidf.py   # TF-IDF preprocessing
├── mapper_tfidf.py       # TF-IDF mapper
├── reducer_tfidf.py      # TF-IDF reducer
├── mapper_lda.py         # LDA mapper
├── reducer_lda.py        # LDA reducer
├── run_pipeline.sh       # Linux/Mac pipeline script
├── run_pipeline.bat      # Windows pipeline script
├── setup.py              # Linux/Mac setup script
├── setup_windows.bat     # Windows setup script
├── test_analysis.py      # Cross-platform test suite
├── test_windows.bat      # Windows test script
├── requirements.txt      # Python dependencies
├── input/                # Input files directory
├── output/               # Intermediate outputs
├── results/              # Final analysis results
└── README.md             # This file
```
- Python 3.8+
- Virtual environment (recommended)
- Bash shell (Linux/Mac) or Command Prompt (Windows)
Windows:

```bash
# Run the Windows setup script
setup_windows.bat
```

Linux/Mac:

```bash
# Run the Python setup script
python setup.py
```

Or set up manually:

- Clone and Setup:

  ```bash
  git clone <repository-url>
  cd BigDataCSC782
  python3 -m venv myenv
  source myenv/bin/activate   # On Windows: myenv\Scripts\activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
  ```

- Run Tests:

  ```bash
  python test_analysis.py
  ```

- Execute Pipeline:

  ```bash
  # Linux/Mac
  chmod +x run_pipeline.sh
  ./run_pipeline.sh

  # Windows
  run_pipeline.bat
  ```
- Stopwords: Customizable stopword list
- TF-IDF Parameters: Max features, document frequency thresholds
- LDA Parameters: Number of topics, random state, iterations
- Text Processing: Word length limits, cleanup patterns
- LOG_LEVEL: Set the logging level (DEBUG, INFO, WARNING, ERROR)
- PYTHONPATH: Ensure modules are discoverable
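The parameters above might be centralized in `config.py` along these lines. This is a hypothetical excerpt: the names and values below are illustrative, not the project's actual settings:

```python
import os

# Logging (overridable via the LOG_LEVEL environment variable)
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

# Text processing
MIN_WORD_LENGTH = 3
EXTRA_STOPWORDS = {"eku", "handbook"}

# TF-IDF parameters
TFIDF_MAX_FEATURES = 1000
TFIDF_MAX_DF = 0.95   # ignore terms appearing in more than 95% of documents

# LDA parameters
LDA_NUM_TOPICS = 5
LDA_RANDOM_STATE = 42
LDA_MAX_ITER = 10
```

Keeping all tunables in one module means the mappers and reducers can simply `import config` instead of hard-coding thresholds.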
- Word Count Analysis:

  ```bash
  # Linux/Mac
  cat input.txt | python preprocess.py | python mapper.py | python reducer.py

  # Windows
  type input.txt | python preprocess.py | python mapper.py | python reducer.py
  ```

- TF-IDF Analysis:

  ```bash
  # Linux/Mac
  cat input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py

  # Windows
  type input.txt | python preprocess_tfidf.py | python mapper_tfidf.py doc1 | python reducer_tfidf.py
  ```

- LDA Analysis:

  ```bash
  # Linux/Mac
  cat input.txt | python mapper_lda.py | python reducer_lda.py

  # Windows
  type input.txt | python mapper_lda.py | python reducer_lda.py
  ```
Run the full pipeline:

```bash
# Linux/Mac
./run_pipeline.sh

# Windows
run_pipeline.bat
```

Run the test suite:

```bash
# Linux/Mac
python test_analysis.py

# Windows
test_windows.bat
```

Test individual stages:

```bash
# Test preprocessing
# Linux/Mac
echo "test text" | python preprocess.py
# Windows
echo test text | python preprocess.py

# Test the word count pipeline
# Linux/Mac
echo "test text" | python preprocess.py | python mapper.py | python reducer.py
# Windows
echo test text | python preprocess.py | python mapper.py | python reducer.py
```

Sample word count output:

```
word1 15
word2 12
word3 8
```
Sample TF-IDF output:

```
Word                 Avg_TF-IDF
-----------------------------------
important_term       0.123456
another_term         0.098765
```
Sample LDA output:

```
LDA Topic Analysis Results (5 topics):
==================================================
Topic 1:
Top words: academic, policy, student, university, campus
--------------------------------------------------------------
```
- ✅ Type hints and documentation
- ✅ Error handling and logging
- ✅ Modular design and reusability
- ✅ Consistent coding standards
- ✅ Optimized data structures
- ✅ Efficient algorithms
- ✅ Memory management
- ✅ Parallel processing support
- ✅ Configuration management
- ✅ Testing framework
- ✅ Clear documentation
- ✅ Version control ready
- ✅ Input validation
- ✅ Error recovery
- ✅ Graceful degradation
- ✅ Cross-platform compatibility
Problem: Error when trying to upgrade pip during setup

```
ERROR: To modify pip, please run the following command:
C:\Users\...\python.exe -m pip install --upgrade pip
```

Solution: Use the Windows-specific setup script: setup_windows.bat
Problem: Virtual environment not found or activation fails
Solution:

```bash
# Linux/Mac
python3 -m venv myenv
source myenv/bin/activate

# Windows
python -m venv myenv
myenv\Scripts\activate
```

Problem: spaCy English model fails to install
Solution:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

Problem: Module not found errors
Solution:

```bash
# Activate the virtual environment first
source myenv/bin/activate   # Linux/Mac
myenv\Scripts\activate      # Windows

# Install missing packages
pip install -r requirements.txt
```

Problem: Permission denied when running scripts
Solution:

```bash
chmod +x run_pipeline.sh
chmod +x setup.py
```

Problem: Input files not found
Solution:
- Place your input files in the project directory, or use the sample file: input/sample_input.txt
- Check file paths and permissions
- Check Environment:

  ```bash
  python --version
  which python    # Linux/Mac
  where python    # Windows
  pip list
  ```

- Test Individual Components:

  ```bash
  # Test preprocessing
  echo "test text" | python preprocess.py

  # Test mapper
  echo "word1, word2" | python mapper.py

  # Test reducer
  echo -e "word1\t1\nword2\t1" | python reducer.py
  ```

- System Requirements:
  - RAM: Minimum 4GB, Recommended 8GB+
  - Disk Space: At least 2GB free
  - Python: 3.8 or higher
  - OS: Windows 10+, macOS 10.14+, or Linux
- Python 3.8+ installed
- Virtual environment created and activated
- Dependencies installed (`pip install -r requirements.txt`)
- spaCy model downloaded (`python -m spacy download en_core_web_sm`)
- Input files present in the correct location
- Proper permissions set (Linux/Mac)
- Using the correct commands for your OS
- ✅ Windows batch scripts (`*.bat`)
- ✅ Linux/Mac shell scripts (`*.sh`)
- ✅ OS-specific command handling
- ✅ Automated setup for all platforms
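The OS-specific command handling can be as simple as branching on the platform name. The helper below is an illustrative sketch (`pipeline_command` is not a function in the repository):

```python
import platform

def pipeline_command(system=None):
    """Return the platform-appropriate pipeline script invocation."""
    system = system or platform.system()
    return "run_pipeline.bat" if system == "Windows" else "./run_pipeline.sh"

if __name__ == "__main__":
    # On the current machine this prints the script you should run
    print(pipeline_command())
```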
```bash
# Setup
setup_windows.bat           # Windows
python setup.py             # Linux/Mac

# Activate environment
myenv\Scripts\activate      # Windows
source myenv/bin/activate   # Linux/Mac

# Run full analysis
run_pipeline.bat            # Windows
./run_pipeline.sh           # Linux/Mac

# Test components
test_windows.bat            # Windows
python test_analysis.py     # Cross-platform
```

To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is part of the CSC 782 Big Data course at Eastern Kentucky University.
- Hamza Khattak
- Enock Kipchumba
- EKU Computer Science Department
- Big Data course instructors
- Open source community for libraries and tools
- Configuration Guide: Edit `config.py` to customize parameters
- Troubleshooting: See the troubleshooting section above
- Testing: Use the platform-specific test scripts for validation
- Documentation: All Python files include detailed docstrings and type hints