Write a Python program that takes any text input, preprocesses it (by removing punctuation, special characters, and stopwords) to extract meaningful words, counts the frequency of each word, and outputs the results to the screen or saves them to a file. The preprocessing step is essential to ensure accurate word frequency analysis by eliminating noise and focusing on significant terms.
- A text passage (or load from a .txt file).
- The text can be in Vietnamese or English.
- Convert all text to lowercase.
- Remove punctuation and special characters (keep only letters and numbers).
- Split the text into individual words (tokenization).
- Count the number of occurrences of each word in the text.
- Store the results in a suitable data structure (e.g., dict or Counter).
- Print the list of words with their frequencies (sorted in descending order of frequency).
- Optionally: save the results to a .csv or .txt file.
- Remove stopwords if the text is in English or Vietnamese.
- Visualize the results with a chart (using matplotlib).
- Count phrase frequencies (bigram/trigram).
- Write functions to allow direct text input from the keyboard or from a file.
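The requirements above can be sketched end to end with only the standard library. The tiny stopword set here is a placeholder; a real run would load the full English and Vietnamese lists (e.g. from NLTK and underthesea):

```python
import re
from collections import Counter

# Placeholder stopword set -- a real implementation would load the full
# English/Vietnamese lists (e.g. from NLTK and underthesea).
STOPWORDS = {"the", "is", "a", "là", "về"}

def preprocess(text):
    """Lowercase, keep only letters/digits, tokenize, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower(), re.UNICODE)
    return [t for t in tokens if t not in STOPWORDS]

def word_frequencies(text):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(preprocess(text)).most_common()

if __name__ == "__main__":
    sample = "Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người."
    for word, count in word_frequencies(sample):
        print(f"'{word}': {count}")
```

`Counter.most_common()` already returns the descending-frequency ordering the requirements ask for, so no separate sort step is needed.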
Example input:
Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.
Example output:
'chào': 2
'xin': 2
'đây': 1
'ví': 1
'dụ': 1
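The optional bigram/trigram requirement (counting phrase frequencies) can be sketched as a sliding window over the token list:

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count n-word phrases (bigrams by default) with a sliding window."""
    # zip over n staggered views of the token list yields each window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

tokens = ["xin", "chào", "mọi", "người", "xin", "chào"]
print(ngram_counts(tokens).most_common(3))
```

The same `Counter` interface works for single words, bigrams, and trigrams, so the reporting and export code does not need to change.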
- Write Python code that meets the above requirements.
- Ensure the code is easy to understand, with comments explaining each step.
- The code should work with various texts.
This project includes automated Continuous Integration and Continuous Deployment (CI/CD) workflows using GitHub Actions.
Triggers: Push to `main` branch or `releases/**` branches
Workflow Steps:
- Run Linters
  - Python version: 3.13
  - Tools: `black` (code formatter) and `flake8` (style checker)
  - Ensures code quality and style consistency
- Run Tests
  - Python versions tested: 3.9, 3.10, 3.11, 3.12, 3.13
  - Runs all unit tests using `unittest`
  - Generates code coverage reports using `coverage`
  - Matrix strategy ensures compatibility across Python versions
- Build Docker Image
  - Only runs on push to `main` branch
  - Builds Docker image and pushes to GitHub Container Registry (ghcr.io)
  - Supports multi-architecture builds (QEMU + Docker Buildx)
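A trimmed sketch of how the test job's matrix strategy might look in the workflow file (step layout here is illustrative, not the project's actual workflow):

```yaml
# Illustrative excerpt -- the real workflow file may differ.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: python -m coverage run -m unittest
      - run: python -m coverage report -m
```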
Triggers: Successful completion of the CI workflow on the `main` branch
Workflow Steps:
- Deploy to Production
  - Environment: Production
  - Deployment target: Render (via deploy hook)
  - Requires the `RENDER_DEPLOY_HOOK_URL` secret in repository settings
# Run all tests
python -m unittest discover tests
# Run tests with coverage
python -m coverage run -m unittest
python -m coverage report -m
# Run specific test file
python -m unittest tests.test_text_stats
# Run specific test case
python -m unittest tests.test_text_stats.TestTextStats.test_import_data_returns_string
GitHub Secrets Required:
- `RENDER_DEPLOY_HOOK_URL`: Render deployment webhook URL (for CD)
GitHub Variables Required:
- `RENDER_APP_URL`: Production application URL (for environment tracking)
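For reference, the `unittest` commands listed earlier assume test cases along these lines. The test name `test_import_data_returns_string` comes from the project; the body below is a self-contained guess at its shape, with a stand-in `import_data` inlined instead of importing `src.pipeline.text_stats`:

```python
import os
import tempfile
import unittest

# Stand-in for src.pipeline.text_stats.import_data, inlined so this sketch
# runs on its own; a real test would `from src.pipeline import text_stats`.
def import_data(filename, folder):
    """Read a UTF-8 text file from `folder` and return its contents."""
    with open(os.path.join(folder, filename), encoding="utf-8") as f:
        return f.read()

class TestTextStats(unittest.TestCase):
    def test_import_data_returns_string(self):
        with tempfile.TemporaryDirectory() as d:
            with open(os.path.join(d, "sample.txt"), "w", encoding="utf-8") as f:
                f.write("xin chào")
            self.assertIsInstance(import_data("sample.txt", d), str)
```

Run it with `python -m unittest` from the project root so test discovery finds the `tests/` package.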
You can add these badges to track CI/CD status:

Required Software:
- Python 3.8 or higher - Download Python
- pip (comes with Python)
- Visual Studio Code (recommended) or any text editor
Check your Python installation:
python --version
pip --version
Just run the startup script:
# Windows
start.bat
# Or use Python directly (cross-platform)
python start.py
# Linux/Mac
chmod +x start.sh
./start.sh
This automated script will:
- ✅ Create all necessary directories
- ✅ Create required `__init__.py` files
- ✅ Check and optionally install dependencies
- ✅ Download NLTK data
- ✅ Start the FastAPI server automatically
That's it! The server will start at http://localhost:5000
1. Clone or download this repository:
# If using Git
git clone <repository-url>
cd word-frequency-mini-project
# Or download ZIP and extract it
2. Create virtual environment (Recommended):
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/Mac
python -m venv venv
source venv/bin/activate
3. Install all dependencies:
pip install -r requirements.txt
4. Download required NLTK data:
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
Ensure you have these essential `__init__.py` files:
word-frequency-mini-project/
├── src/
│ ├── __init__.py ⚠️ REQUIRED!
│ ├── app/
│ │ ├── __init__.py ⚠️ REQUIRED!
│ │ ├── main.py
│ │ ├── models.py
│ │ └── middleware.py
│ └── pipeline/
│ ├── __init__.py ⚠️ REQUIRED!
│ └── text_stats.py
├── data/ # Input text files
├── output/ # Output files (auto-created)
├── requirements.txt
└── README.md
Create missing __init__.py files if needed:
# Windows PowerShell
New-Item -ItemType File -Path "src\__init__.py" -Force
New-Item -ItemType File -Path "src\app\__init__.py" -Force
New-Item -ItemType File -Path "src\pipeline\__init__.py" -Force
# Linux/Mac
touch src/__init__.py
touch src/app/__init__.py
touch src/pipeline/__init__.py
# Create output directory if it doesn't exist
mkdir output
Option A: Run FastAPI Web Server (Recommended)
# From project root directory
uvicorn src.app.main:app --reload --port 5000
Option B: Run using Python module
# From project root directory
python -m src.app.main
Access the API:
- Open browser: http://localhost:5000/docs (Swagger UI)
- API will be available at: http://localhost:5000
Using Swagger UI (Easiest):
- Go to http://localhost:5000/docs
- Try the /analyses/text endpoint
- Click "Try it out"
- Enter sample text and format
- Click "Execute"
Using curl (Command Line):
# Test text analysis
curl -X POST "http://localhost:5000/analyses/text" \
-H "Content-Type: application/json" \
-d "{\"text\": \"Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.\", \"format\": \"json\"}"
# Test file upload
curl -X POST "http://localhost:5000/analyses/file" \
-F "file=@data/sample.txt" \
-F "format=json"
Using Python requests:
import requests

# Analyze text
response = requests.post("http://localhost:5000/analyses/text",
                         json={"text": "Hello world! This is a test.", "format": "json"})
print(response.json())

# Analyze file
with open("data/sample.txt", "rb") as f:
    response = requests.post("http://localhost:5000/analyses/file",
                             files={"file": f},
                             data={"format": "csv"})
with open("output/result.csv", "wb") as out:
    out.write(response.content)
Endpoints:
- POST `/analyses/text` - Analyze text directly
  { "text": "Your text here", "format": "json" // Options: json, csv, png }
- POST `/analyses/file` - Upload and analyze a text file
  - Upload a `.txt` file (UTF-8 encoded)
  - Choose format: `json`, `csv`, or `png`
from src.pipeline import text_stats as ts
# Load data from file
text = ts.import_data("sample.txt", "data")
# Process text and get word frequencies
tokens = ts.preprocessing(text)
word_stats = ts.statistics(tokens)
# Export results to CSV
ts.export_results(word_stats, "word_frequency.csv", "output")
# Visualize results
ts.visualize_results(word_stats, "word_frequency.png", "output")
word-frequency-mini-project/
├── src/
│ ├── __init__.py
│ ├── app/
│ │ ├── __init__.py
│ │ ├── main.py # FastAPI application
│ │ ├── models.py # Pydantic models
│ │ └── middleware.py # Custom middleware
│ └── pipeline/
│ ├── __init__.py
│ └── text_stats.py # Core processing functions
├── data/ # Input text files
├── output/ # Output CSV/PNG files
├── requirements.txt # Dependencies
└── README.md # This file
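The CSV export step (`export_results` in `src/pipeline/text_stats.py`) boils down to writing the `words,counts` layout shown in the example output. This hedged sketch uses the stdlib `csv` module; the helper name `write_csv` and its signature are illustrative, not the project's actual API:

```python
import csv
import io

def write_csv(word_stats, filename=None):
    """Write (word, count) pairs as CSV with a words,counts header.

    `write_csv` is a hypothetical helper sketching the export step;
    the project's real function is export_results in text_stats.py.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["words", "counts"])
    writer.writerows(word_stats)
    data = buf.getvalue()
    if filename:
        # newline="" prevents csv's \r\n from being doubled on Windows.
        with open(filename, "w", encoding="utf-8", newline="") as f:
            f.write(data)
    return data

print(write_csv([("chào", 2), ("xin", 2)]))
```

Writing UTF-8 explicitly keeps Vietnamese diacritics intact regardless of the platform's default encoding.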
- Bilingual Support: Processes both English and Vietnamese text using NLTK and underthesea
- Smart Stopword Removal: Removes common words (179 English + 264+ Vietnamese stopwords)
- Multiple Output Formats: JSON, CSV, and PNG visualization
- RESTful API: FastAPI-based web service with Swagger documentation
- File Upload Support: Process text files directly
- Flexible Input: Load from .txt files or process strings directly
- Encoding Support: Properly handles UTF-8 Vietnamese diacritics and special characters
- Security Middleware: File size limits (5MB) and content type validation
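The security middleware described above (5 MB limit plus content-type validation) reduces to a check like this sketch; the function name, allowed-type set, and error messages are assumptions for illustration:

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024   # 5 MB limit, as described above
ALLOWED_CONTENT_TYPES = {"text/plain"}  # assumption: only .txt uploads

def validate_upload(content, content_type):
    """Raise ValueError if an upload exceeds the size limit or has a bad type."""
    if len(content) > MAX_UPLOAD_BYTES:
        raise ValueError("file too large (limit: 5MB)")
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
```

In the actual service this check would run in middleware before the request body reaches the analysis pipeline.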
Input text:
Xin chào! Đây là ví dụ về dự án nhỏ. Xin chào mọi người.
Output CSV:
words,counts
chào,2
xin,2
đây,1
ví,1
dụ,1
Solution:
- Ensure you're running from the project root directory
- Make sure all `__init__.py` files exist
- Use the correct command: `uvicorn src.app.main:app` or `python -m src.app.main`
Solution:
- Check that `__init__.py` files exist in `src/`, `src/app/`, and `src/pipeline/`
- Run using `python -m src.app.main` (not `python src/app/main.py`)
Solution:
- Install Visual Studio Build Tools: Download here
- Or install pre-built wheel:
pip install underthesea --prefer-binary
Solution:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
Solution:
- Ensure input files are saved as UTF-8
- In VS Code: Check bottom-right corner → should say "UTF-8"
- If needed, use `utf-8-sig` encoding for BOM handling
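Opening a file with `utf-8-sig` strips a leading BOM if one is present and behaves like plain UTF-8 otherwise, so it is a safe default for files saved by Windows editors:

```python
import codecs
import os
import tempfile

# Write a file with a UTF-8 BOM, as some Windows editors do.
path = os.path.join(tempfile.gettempdir(), "bom_sample.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "xin chào".encode("utf-8"))

# utf-8-sig strips the BOM transparently; plain utf-8 would leave a
# stray "\ufeff" character at the start of the text.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()

assert not text.startswith("\ufeff")
```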
Solution:
# Use different port
uvicorn src.app.main:app --port 8000
# Or kill process using port 5000 (Windows)
netstat -ano | findstr :5000
taskkill /PID <process_id> /F
Solution:
- Check if input file exists and is not empty
- Verify file is UTF-8 encoded
- Ensure text contains valid words (not all stopwords)
- Check `output/` directory permissions
Solution:
mkdir output
Solution:
# Windows - if execution policy error
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Then activate again
venv\Scripts\activate
- Use a virtual environment to avoid dependency conflicts
- Check the API docs at http://localhost:5000/docs for interactive testing
- Start with JSON format for debugging, then switch to CSV/PNG
- Test with small text samples before processing large files
- Keep output directory clean - old files are not auto-deleted
- Monitor console logs for detailed error messages
If you encounter issues not listed here:
- Check console/terminal error messages
- Verify all installation steps were completed
- Ensure Python version is 3.8+
- Review the API documentation at the /docs endpoint
- Check file paths and permissions