Mini-Project: Word Frequency Statistics in Text

Objective

Write a Python program that takes any text input, preprocesses it (removing punctuation, special characters, and stopwords) to extract meaningful words, counts how often each word occurs, and prints the results or saves them to a file. Preprocessing matters because it strips out noise, so the counts reflect the significant terms in the text.

Detailed Requirements

Input

A text passage, entered directly or loaded from a .txt file.
The text can be in Vietnamese or English.

Text Processing

Convert all text to lowercase.
Remove punctuation and special characters (keep only letters and numbers).
Split the text into individual words (tokenization).
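
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual implementation: the function name `preprocess` is hypothetical, and it uses a plain regular expression and whitespace splitting instead of a language-aware tokenizer. The text is normalized to NFC first so that Vietnamese letters with diacritics are treated as single letters by `\w`.

```python
import re
import unicodedata
from typing import List


def preprocess(text: str) -> List[str]:
    """Lowercase the text, strip punctuation, and split it into words."""
    # Compose base letters and combining marks so diacritics stay attached
    normalized = unicodedata.normalize("NFC", text)
    lowered = normalized.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    # \w also matches "_", so the underscore is removed explicitly.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Splitting on whitespace is the simplest possible tokenization
    return cleaned.split()


tokens = preprocess("Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.")
print(tokens)
```

Whitespace splitting treats each Vietnamese syllable as a separate token, so a compound such as "ví dụ" becomes two words.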

Statistics

Count the number of occurrences of each word in the text.
Store the results in a suitable data structure (e.g., dict or Counter).
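
As a small illustration of the counting step, `collections.Counter` builds the frequency table in one call. The token list below is made up for the example.

```python
from collections import Counter

# Example tokens, as they might look after preprocessing
tokens = ["xin", "chào", "đây", "là", "ví", "dụ", "xin", "chào"]

# Counter maps each distinct word to its number of occurrences
word_counts = Counter(tokens)

print(word_counts["xin"])      # 2
print(word_counts["missing"])  # 0 (absent words count as zero, no KeyError)
```

A plain `dict` works as well, but `Counter` avoids the explicit "initialize to zero" branch and adds helpers such as `most_common()`.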

Output

Print the list of words with their frequencies (sorted in descending order of frequency).
Optionally: save the results to a .csv or .txt file.
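
One way to produce this output is sketched below. The helper names `sorted_frequencies` and `save_csv` are hypothetical; the `words,counts` header follows the CSV example shown later in this README. Ties are broken alphabetically here, which is an assumption made only to keep the order deterministic.

```python
import csv
from collections import Counter


def sorted_frequencies(counts):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def save_csv(rows, path):
    """Write the (word, count) pairs to a UTF-8 CSV file with a header row."""
    # newline="" stops the csv module from emitting blank rows on Windows
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["words", "counts"])
        writer.writerows(rows)


counts = Counter(["xin", "chào", "xin", "chào", "đây"])
rows = sorted_frequencies(counts)
for word, count in rows:
    print(f"{word!r}: {count}")
```

Calling `save_csv(rows, "output/word_frequency.csv")` would then write the same rows to disk, provided the output directory exists.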

Advanced Requirements (Optional)

Remove stopwords if the text is in English or Vietnamese.
Visualize the results with a chart (using matplotlib).
Count phrase frequencies (bigram/trigram).
Write functions to allow direct text input from the keyboard or from a file.
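
For the phrase-frequency requirement, a minimal sketch of n-gram counting is shown below. The function name `ngram_counts` is hypothetical, and the sketch works on a flat token list, so phrases may span sentence boundaries.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered views of the list yields consecutive n-tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


tokens = ["xin", "chào", "đây", "là", "ví", "dụ", "xin", "chào"]
bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(1))  # [('xin chào', 2)]
```

Passing `n=3` gives trigram counts with the same function.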

Example Input

Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.

Example Output

'chào': 2
'xin': 2
'đây': 1
'ví': 1
'dụ': 1

Final Requirement

Write Python code that meets the above requirements.
Ensure the code is easy to understand, with comments explaining each step.
The code should work with various texts.

CI/CD Pipeline

This project includes automated Continuous Integration and Continuous Deployment (CI/CD) workflows using GitHub Actions.

CI Pipeline (`.github/workflows/ci.yml`)

Triggers: Push to main branch or releases/** branches

Workflow Steps:

1. Run Linters
- Python version: 3.13
- Tools: black (code formatter) and flake8 (style checker)
- Ensures code quality and style consistency
2. Run Tests
- Python versions tested: 3.9, 3.10, 3.11, 3.12, 3.13
- Runs all unit tests using unittest
- Generates code coverage reports using coverage
- Matrix strategy ensures compatibility across Python versions
3. Build Docker Image
- Only runs on push to main branch
- Builds Docker image and pushes to GitHub Container Registry (ghcr.io)
- Supports multi-architecture builds (QEMU + Docker Buildx)

CD Pipeline (`.github/workflows/cd.yml`)

Triggers: Successful completion of CI workflow on main branch

Workflow Steps:

Deploy to Production
- Environment: Production
- Deployment target: Render (via deploy hook)
- Requires RENDER_DEPLOY_HOOK_URL secret in repository settings

Running Tests Locally

# Run all tests
python -m unittest discover tests

# Run tests with coverage
python -m coverage run -m unittest discover tests
python -m coverage report -m

# Run specific test file
python -m unittest tests.test_text_stats

# Run specific test case
python -m unittest tests.test_text_stats.TestTextStats.test_import_data_returns_string
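
For orientation, a test case of the kind named in the last command might look like the sketch below. The class and method names match the command above, but the body is an assumption: `import_data` here is a hypothetical stand-in defined inline, not the project's real function in `src/pipeline/text_stats.py`.

```python
import os
import tempfile
import unittest


def import_data(filename, folder):
    """Hypothetical stand-in: read a UTF-8 text file from the given folder."""
    with open(os.path.join(folder, filename), encoding="utf-8") as f:
        return f.read()


class TestTextStats(unittest.TestCase):
    def test_import_data_returns_string(self):
        # Write a small sample file into a throwaway directory
        with tempfile.TemporaryDirectory() as folder:
            path = os.path.join(folder, "sample.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write("Xin chào")
            result = import_data("sample.txt", folder)
        # The loader should hand back the file contents as a str
        self.assertIsInstance(result, str)
        self.assertEqual(result, "Xin chào")
```

Saved as a module under tests/, a class like this is picked up by the discovery command shown above.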

CI/CD Requirements

GitHub Secrets Required:

RENDER_DEPLOY_HOOK_URL: Render deployment webhook URL (for CD)

GitHub Variables Required:

RENDER_APP_URL: Production application URL (for environment tracking)

Badge Status

You can add these badges to track CI/CD status:

![CI Status](https://github.com/YOUR_USERNAME/word-frequency-mini-project/workflows/CI%20-%20Tests%20&%20Quality%20Checks/badge.svg)
![CD Status](https://github.com/YOUR_USERNAME/word-frequency-mini-project/workflows/CD%20-%20Continuous%20Deployment/badge.svg)

🚀 Getting Started - Quick Guide

Step 1: Prerequisites

Required Software:

Python 3.8 or higher (download from python.org; note that the CI pipeline tests 3.9 through 3.13)
pip (comes with Python)
Visual Studio Code (recommended) or any text editor

Check your Python installation:

python --version
pip --version

Step 2: Quick Start (Easiest!)

Just run the startup script:

# Windows
start.bat

# Or use Python directly (cross-platform)
python start.py

# Linux/Mac
chmod +x start.sh
./start.sh

This automated script will:

✅ Create all necessary directories
✅ Create required __init__.py files
✅ Check and optionally install dependencies
✅ Download NLTK data
✅ Start the FastAPI server automatically

That's it! The server will start at http://localhost:5000

Step 2 (Alternative): Manual Setup

1. Clone or download this repository:

# If using Git
git clone <repository-url>
cd word-frequency-mini-project

# Or download ZIP and extract it

2. Create virtual environment (Recommended):

# Windows
python -m venv venv
venv\Scripts\activate

# Linux/Mac
python -m venv venv
source venv/bin/activate

3. Install all dependencies:

pip install -r requirements.txt

4. Download required NLTK data:

python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

Step 3: Verify Project Structure

Ensure you have these essential __init__.py files:

word-frequency-mini-project/
├── src/
│   ├── __init__.py          ⚠️ REQUIRED!
│   ├── app/
│   │   ├── __init__.py      ⚠️ REQUIRED!
│   │   ├── main.py
│   │   ├── models.py
│   │   └── middleware.py
│   └── pipeline/
│       ├── __init__.py      ⚠️ REQUIRED!
│       └── text_stats.py
├── data/                    # Input text files
├── output/                  # Output files (auto-created)
├── requirements.txt
└── README.md

Create missing __init__.py files if needed:

# Windows PowerShell
New-Item -ItemType File -Path "src\__init__.py" -Force
New-Item -ItemType File -Path "src\app\__init__.py" -Force
New-Item -ItemType File -Path "src\pipeline\__init__.py" -Force

# Linux/Mac
touch src/__init__.py
touch src/app/__init__.py
touch src/pipeline/__init__.py
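
If you prefer a single cross-platform command, the same result can be sketched in Python with `pathlib`. The helper name `ensure_init_files` is hypothetical; the directory list comes from the structure shown above. Unlike `touch`, this sketch also creates a package directory if it is missing.

```python
from pathlib import Path

# Package directories taken from the project structure above
PACKAGE_DIRS = ["src", "src/app", "src/pipeline"]


def ensure_init_files(root="."):
    """Create any missing __init__.py files and return the paths that were created."""
    created = []
    for rel in PACKAGE_DIRS:
        package_dir = Path(root) / rel
        # Create the package directory itself if it does not exist yet
        package_dir.mkdir(parents=True, exist_ok=True)
        init_file = package_dir / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created
```

Calling `ensure_init_files()` from the project root is safe to repeat: existing files are left untouched and the second call returns an empty list.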

Step 4: Create Required Directories

# Create output directory if it doesn't exist
mkdir output

Step 5: Run the Application

Option A: Run FastAPI Web Server (Recommended)

# From project root directory
uvicorn src.app.main:app --reload --port 5000

Option B: Run using Python module

# From project root directory
python -m src.app.main

Access the API:

Open browser: http://localhost:5000/docs (Swagger UI)
API will be available at: http://localhost:5000

Step 6: Test the API

Using Swagger UI (Easiest):

Go to http://localhost:5000/docs
Try the /analyses/text endpoint
Click "Try it out"
Enter sample text and format
Click "Execute"

Using curl (Command Line):

# Test text analysis
curl -X POST "http://localhost:5000/analyses/text" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.\", \"format\": \"json\"}"

# Test file upload
curl -X POST "http://localhost:5000/analyses/file" \
  -F "file=@data/sample.txt" \
  -F "format=json"

Using Python requests:

import requests

# Analyze text
response = requests.post("http://localhost:5000/analyses/text",
    json={"text": "Hello world! This is a test.", "format": "json"})
print(response.json())

# Analyze file
with open("data/sample.txt", "rb") as f:
    response = requests.post("http://localhost:5000/analyses/file",
        files={"file": f},
        data={"format": "csv"})
    with open("output/result.csv", "wb") as out:
        out.write(response.content)

📋 Usage Guide

Method 1: Using the FastAPI Web Service

Endpoints:

POST /analyses/text - Analyze text directly

{
  "text": "Your text here",
  "format": "json"
}

The format field accepts json, csv, or png.

POST /analyses/file - Upload and analyze text file
- Upload a .txt file (UTF-8 encoded)
- Choose format: json, csv, or png

Method 2: Using Core Pipeline Functions

from src.pipeline import text_stats as ts

# Load data from file
text = ts.import_data("sample.txt", "data")

# Process text and get word frequencies
tokens = ts.preprocessing(text)
word_stats = ts.statistics(tokens)

# Export results to CSV
ts.export_results(word_stats, "word_frequency.csv", "output")

# Visualize results
ts.visualize_results(word_stats, "word_frequency.png", "output")

File Structure

word-frequency-mini-project/
├── src/
│   ├── __init__.py
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py          # FastAPI application
│   │   ├── models.py        # Pydantic models
│   │   └── middleware.py    # Custom middleware
│   └── pipeline/
│       ├── __init__.py
│       └── text_stats.py    # Core processing functions
├── data/                    # Input text files
├── output/                  # Output CSV/PNG files
├── requirements.txt         # Dependencies
└── README.md               # This file

Features

Bilingual Support: Processes both English and Vietnamese text using NLTK and underthesea
Smart Stopword Removal: Removes common words (179 English + 264+ Vietnamese stopwords)
Multiple Output Formats: JSON, CSV, and PNG visualization
RESTful API: FastAPI-based web service with Swagger documentation
File Upload Support: Process text files directly
Flexible Input: Load from .txt files or process strings directly
Encoding Support: Properly handles UTF-8 Vietnamese diacritics and special characters
Security Middleware: File size limits (5MB) and content type validation

Example

Input text:

Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.

Output CSV:

words,counts
chào,2
xin,2
đây,1
ví,1
dụ,1

🔧 Troubleshooting

Common Issues & Solutions

1. ModuleNotFoundError: No module named 'src'

Solution:

Ensure you're running from the project root directory
Make sure all __init__.py files exist
Use correct command: uvicorn src.app.main:app or python -m src.app.main

2. ImportError: attempted relative import beyond top-level package

Solution:

Check that __init__.py files exist in src/, src/app/, and src/pipeline/
Run using: python -m src.app.main (not python src/app/main.py)

3. underthesea installation fails on Windows

Solution:

Install Visual Studio Build Tools (available from Microsoft's Visual Studio downloads page)
Or install pre-built wheel: pip install underthesea --prefer-binary

4. NLTK data not found (punkt, stopwords)

Solution:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
# Newer NLTK releases (3.9 and later) may also need: nltk.download('punkt_tab')

5. Encoding errors with Vietnamese text

Solution:

Ensure input files are saved as UTF-8
In VS Code: Check bottom-right corner → should say "UTF-8"
If needed, use utf-8-sig encoding for BOM handling
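
A short sketch of the BOM handling mentioned above: the `utf-8-sig` codec reads ordinary UTF-8 files and also drops a leading byte order mark when one is present. The helper name `read_text` is hypothetical.

```python
def read_text(path):
    """Read a UTF-8 text file, ignoring a byte order mark if the file has one."""
    # utf-8-sig behaves like utf-8 but strips a leading BOM (EF BB BF)
    with open(path, encoding="utf-8-sig") as f:
        return f.read()
```

With plain `utf-8`, a file saved with a BOM would yield the invisible character `\ufeff` at the start of the text, which then sticks to the first word and distorts its count.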

6. Port 5000 already in use

Solution:

# Use different port
uvicorn src.app.main:app --port 8000

# Or kill process using port 5000 (Windows)
netstat -ano | findstr :5000
taskkill /PID <process_id> /F
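
On any platform, a quick way to check whether something is already listening on the port is to try connecting to it. The sketch below uses only the standard library; the function name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


print(port_in_use(5000))
```

If this prints True before you start the server, pick another port with `--port` as shown above.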

7. Empty output after processing

Solution:

Check if input file exists and is not empty
Verify file is UTF-8 encoded
Ensure text contains valid words (not all stopwords)
Check output/ directory permissions

8. FileNotFoundError: output directory not found

Solution:

mkdir output

9. Virtual environment not activating

Solution:

# Windows - if execution policy error
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then activate again
venv\Scripts\activate

💡 Pro Tips

Use a virtual environment to avoid dependency conflicts
Check the API docs at http://localhost:5000/docs for interactive testing
Start with JSON format for debugging, then switch to CSV/PNG
Test with small text samples before processing large files
Keep the output directory clean; old files are not deleted automatically
Monitor console logs for detailed error messages

📞 Need Help?

If you encounter issues not listed here:

Check console/terminal error messages
Verify all installation steps were completed
Ensure Python version is 3.8+
Review API documentation at /docs endpoint
Check file paths and permissions
