
Releases: huypq02/word-frequency-mini-project

Release v2.1.1

17 Jan 16:53


Release Notes - v2.1.1

Release Date: January 17, 2026
Type: Hotfix Release

Overview

This hotfix release addresses critical deployment issues encountered when running the application in containerized environments (Docker/Render) with non-root users.

Fixes

🐛 Container Permission Issues

Problem: The application failed to start in production due to permission errors when running as a non-root user.

Issues Fixed:

  1. NLTK Data Download Failures

    • Error: PermissionError: [Errno 13] Permission denied: '/home/app'
    • Root Cause: NLTK attempted to download language resources to /home/app/nltk_data at runtime, which is not writable by the app user
    • Solution: Pre-download NLTK data (punkt, punkt_tab, stopwords) during Docker build as root user to /usr/local/share/nltk_data/
    • Impact: Faster startup time, no runtime network calls, eliminates permission errors
  2. Matplotlib Cache Directory Errors

    • Error: mkdir -p failed for path /home/app/.config/matplotlib: Permission denied
    • Root Cause: Matplotlib tried to create cache directory in user home directory
    • Solution: Set MPLCONFIGDIR=/tmp/matplotlib environment variable
    • Impact: Matplotlib can now write cache files to the writable /tmp directory
  3. Fontconfig Cache Errors

    • Error: Fontconfig error: No writable cache directories
    • Root Cause: Fontconfig (used by matplotlib for font rendering) couldn't write its cache files
    • Solution: Set XDG_CACHE_HOME=/tmp/.cache environment variable
    • Impact: Eliminates font cache warnings, improves matplotlib performance
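As a quick sanity check (not part of the release itself), a container startup script could confirm that the redirected cache directories are actually writable. The helper below is an illustrative stdlib-only sketch; the function name is an assumption, not code from this project:

```python
import os
import tempfile

def is_writable_dir(path):
    """Return True if `path` exists (or can be created) and allows writes."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False

# The Dockerfile points both caches at /tmp, which any container user can write.
for var, default in [("MPLCONFIGDIR", "/tmp/matplotlib"),
                     ("XDG_CACHE_HOME", "/tmp/.cache")]:
    path = os.environ.get(var, default)
    print(var, "->", path, "writable:", is_writable_dir(path))
```

Running this once at container start surfaces permission problems immediately instead of as a mid-request matplotlib failure.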

🚀 CI/CD Improvements

CD Workflow Enhancement

  • Added conditional check to deploy only when CI tests pass
  • Change: Added if: ${{ github.event.workflow_run.conclusion == 'success' }} to deployment job
  • Impact: Prevents deploying broken builds to production
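In `.github/workflows/cd.yml`, the guard described above could look like the following fragment. Only the `if:` condition is taken from this release; the workflow name, job name, and runner are illustrative assumptions:

```yaml
# Illustrative cd.yml fragment; workflow/job names are assumptions.
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  deploy:
    # Skip deployment unless the triggering CI run succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
```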

Technical Details

Dockerfile Changes

# Pre-download NLTK data as root before switching to non-root user
RUN pip install --no-cache-dir -r requirements.txt && \
    python -c "import nltk; \
    nltk.download('punkt', download_dir='/usr/local/share/nltk_data'); \
    nltk.download('punkt_tab', download_dir='/usr/local/share/nltk_data'); \
    nltk.download('stopwords', download_dir='/usr/local/share/nltk_data')"

# Set cache directories to /tmp to avoid permission issues
ENV MPLCONFIGDIR=/tmp/matplotlib
ENV XDG_CACHE_HOME=/tmp/.cache

Files Modified

  • Dockerfile - Added NLTK pre-download and cache environment variables
  • .github/workflows/cd.yml - Added success-only deployment condition

Testing

✅ Verified on Render deployment platform
✅ Confirmed NLTK resources load successfully
✅ Matplotlib/fontconfig errors eliminated
✅ Application starts without permission errors

Upgrade Notes

  • No breaking changes
  • No database migrations required
  • No API changes
  • Simply redeploy using the updated Docker image

Full Changelog: v2.1.0...v2.1.1

Release v2.1.0

12 Jan 18:37


Release Notes - Version 2.1.0

Release Date: January 13, 2026


🎉 What's New in v2.1.0

1. CI/CD Workflow with GitHub Actions

We've implemented automated Continuous Integration and Continuous Deployment workflows to improve code quality and streamline the deployment process.

CI Pipeline (.github/workflows/ci.yml)

  • Automated Linting: Code quality checks using black and flake8
  • Multi-Version Testing: Automatic testing across Python 3.9, 3.10, 3.11, 3.12, and 3.13
  • Code Coverage: Generates coverage reports to track test coverage
  • Docker Build: Automated Docker image builds and publishing to GitHub Container Registry (ghcr.io)
  • Triggers: Runs on push to main branch or releases/** branches

CD Pipeline (.github/workflows/cd.yml)

  • Automated Deployment: Deploys to Render production environment after successful CI
  • Production Tracking: Environment tracking with deployment URLs
  • Triggers: Runs after CI workflow completes successfully on main branch

Benefits:

  • Ensures code quality before merging
  • Prevents breaking changes from reaching production
  • Automatic deployment on successful builds
  • Multi-version Python compatibility verification

2. Basic Test Cases for import_data Function

Added initial unit tests for the import_data function to ensure reliable file handling.

Test Coverage (tests/test_text_stats.py)

  • ✅ Verifies function returns a string
  • ✅ Tests file reading functionality
  • ✅ Ensures proper UTF-8 encoding support
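A test along these lines could be sketched as follows. The inline `import_data` here is a hypothetical stand-in for `src.pipeline.text_stats.import_data` (assumed to read a UTF-8 text file and return its contents), so the real tests in `tests/test_text_stats.py` may differ:

```python
import os
import tempfile
import unittest

def import_data(path):
    """Hypothetical stand-in for src.pipeline.text_stats.import_data:
    read a UTF-8 text file and return its contents as a string."""
    with open(path, encoding="utf-8") as f:
        return f.read()

class TestImportData(unittest.TestCase):
    def setUp(self):
        # Create a temporary UTF-8 file with Vietnamese content
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write("Xin chào thế giới")

    def tearDown(self):
        os.remove(self.path)

    def test_returns_string(self):
        self.assertIsInstance(import_data(self.path), str)

    def test_reads_utf8_content(self):
        self.assertIn("chào", import_data(self.path))

if __name__ == "__main__":
    unittest.main(exit=False)  # run the two checks above
```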

Benefits:

  • Validates core functionality
  • Prevents regressions
  • Foundation for expanded test coverage
  • Automated testing in CI pipeline

📚 Documentation Updates

  • Added comprehensive CI/CD Pipeline section to README.md
  • Documented testing procedures and commands
  • Added CI/CD requirements (GitHub secrets and variables)
  • Included badge status examples for repository visibility

🚀 How to Use

Run Tests Locally

# Run all tests
python -m unittest discover tests

# Run specific test
python -m unittest tests.test_text_stats

CI/CD Setup

See the CI/CD Pipeline section in README.md for complete setup instructions.


📦 Installation

No breaking changes. Simply pull the latest code:

git pull origin main
pip install -r requirements.txt


👥 Contributors

Thank you to everyone who contributed to this release!


Previous Version: 2.0.0
Current Version: 2.1.0


Full Changelog: https://github.com/huypq02/word-frequency-mini-project/commits/releases/v2.1.0

Release v2.0.0

04 Jan 17:02
694a19d


Release Notes - Word Frequency Mini Project

Version 2.0.0 (January 4, 2026)

🚀 Major Release - Web API & Enhanced Architecture

Complete rewrite with FastAPI web service, professional NLP processing, and automated setup.


🎯 What's New

🌐 Web API Service

  • RESTful API with FastAPI framework
  • Swagger UI at /docs for interactive testing
  • Two endpoints: /analyses/text (JSON input) and /analyses/file (file upload)
  • Multiple formats: JSON, CSV, PNG outputs

🔒 Security & Validation

  • File size limits (5MB default)
  • Content-Type validation (text/plain only)
  • Pydantic models for request/response validation
  • Comprehensive error handling
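The size and content-type rules could be expressed as a small pure function like the sketch below; `validate_upload` and `MAX_UPLOAD_BYTES` are illustrative names, not the app's actual API:

```python
# Illustrative sketch of the validation rules above; names are assumptions.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # 5 MB default limit

def validate_upload(content_type: str, size_bytes: int):
    """Return (ok, reason) for an incoming upload."""
    if content_type != "text/plain":
        return False, "unsupported Content-Type"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "file exceeds size limit"
    return True, "ok"

print(validate_upload("text/plain", 1024))  # accepted
print(validate_upload("image/png", 1024))   # rejected: wrong type
```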

🛠️ Developer Experience

  • One-click setup: python start.py or start.bat/start.sh
  • Automated installer: Checks dependencies, downloads NLTK data, starts server
  • Modular architecture: Clean separation (app, pipeline, models, middleware)
  • Cross-platform: Windows, Linux, macOS support

📊 Version Comparison

Feature       | v1.0.0                  | v2.0.0
------------- | ----------------------- | ------------------------
Interface     | CLI script              | FastAPI REST API
Server        | None                    | Uvicorn ASGI
Setup         | Manual                  | Automated scripts
Documentation | Basic                   | Comprehensive + API docs
File Upload   | Manual script execution | HTTP multipart upload

🆕 New Dependencies

# NEW in v2.0.0
fastapi>=0.104.0          # Web framework
uvicorn[standard]>=0.24.0 # ASGI server
python-multipart>=0.0.6   # File upload support
pandas>=1.5.0             # Data manipulation (replaces csv module)
numpy>=1.21.0             # Numerical operations
nltk>=3.8                 # English NLP (replaces re module)
underthesea>=1.3.0        # Vietnamese NLP (NEW)

# CARRIED OVER from v1.0.0
matplotlib>=3.5.0         # Visualization
setuptools>=65.0.0        # Build tools
wheel>=0.37.0             # Build tools

🚀 Quick Start

# Automated setup (recommended)
python start.py

# Manual setup
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
uvicorn src.app.main:app --reload --port 5000

Access: http://localhost:5000/docs


📡 API Usage Examples

Text Analysis:

curl -X POST "http://localhost:5000/analyses/text" \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chào! Đây là ví dụ.", "format": "json"}'

File Upload:

curl -X POST "http://localhost:5000/analyses/file" \
  -F "file=@sample.txt" \
  -F "format=csv"

Python:

import requests
response = requests.post("http://localhost:5000/analyses/text",
    json={"text": "Hello world!", "format": "json"})
print(response.json())

⚠️ Breaking Changes

Migration required from v1.0.0:

  1. Installation: Install new dependencies (fastapi, uvicorn, nltk, underthesea, pandas, numpy)
  2. Execution: Run via uvicorn or startup scripts instead of direct Python execution
  3. Interface: Primary interface is HTTP API (CLI script deprecated)
  4. Output: Files saved to output/ directory with timestamp naming
  5. Imports: Module structure changed to src.pipeline.text_stats

Direct function usage still available:

from src.pipeline import text_stats as ts
tokens = ts.preprocessing(text)
stats = ts.statistics(tokens)

🐛 Bug Fixes

  • ✅ Better handling of punctuation and special characters

📁 Project Structure

word-frequency-mini-project/
├── src/
│   ├── app/
│   │   ├── main.py          # FastAPI application
│   │   ├── models.py        # Request/Response models
│   │   └── middleware.py    # Security middleware
│   └── pipeline/
│       └── text_stats.py    # Core NLP processing
├── data/                    # Input files
├── output/                  # Generated reports
├── start.py                 # Automated setup
├── start.bat/start.sh       # Platform-specific launchers
└── requirements.txt         # Dependencies

🔮 Roadmap

v2.1.0 (Q2 2026):

  • Word cloud visualization (from v1.0.0 roadmap)
  • TF-IDF analysis (from v1.0.0 roadmap)
  • Batch file processing (from v1.0.0 roadmap)
  • Export to PDF

v3.0.0 (Q3 2026):

  • .docx and .pdf file support (from v1.0.0 roadmap)
  • Database integration
  • User authentication
  • Docker containerization
  • Web UI frontend (from v1.0.0 roadmap)

🙏 Acknowledgments

  • NLTK Team - Natural Language Toolkit
  • Underthesea Team - Vietnamese NLP
  • FastAPI - Modern web framework
  • Matplotlib - Visualization library


Full Changelog: https://github.com/huypq02/word-frequency-mini-project/commits/releases/v2.0.0/

Release v1.0.0

04 Dec 16:58
cd1454d


Release Notes

Version 1.0.0 (September 24, 2025)

🎉 Initial Release

Word Frequency Mini Project - A text analysis tool for Vietnamese and English documents.


✅ Completed Features

Input

  • Support for .txt file input
  • Support for Vietnamese text
  • Support for English text

Text Processing

  • Convert all text to lowercase
  • Remove punctuation and special characters (keep only letters and numbers)
  • Split text into individual words (tokenization)

Statistics

  • Count the number of occurrences of each word in the text

Output

  • Print the list of words with their frequencies
  • Sort results in descending order of frequency
  • Save results to .csv file
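Taken together, the processing, statistics, and output steps above amount to a small pipeline. The stdlib-only sketch below illustrates it; the function name is illustrative, not the project's actual code:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase, strip punctuation, tokenize, count, and sort by
    descending frequency -- the v1.0.0 pipeline in miniature."""
    # \w+ matches Unicode letters and digits, so Vietnamese text works too
    tokens = re.findall(r"\w+", text.lower())
    return sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))

print(word_frequencies("Học tập giúp phát triển. Học tập tốt!"))
```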

Advanced Features (Optional)

  • Remove stopwords for English text
  • Remove stopwords for Vietnamese text
  • Visualize results with bar chart (using matplotlib)
  • Count phrase frequencies (bigram analysis)
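Bigram counting can be done by pairing each token with its successor; this is an illustrative sketch, not the project's implementation:

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count adjacent word pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))

print(bigram_frequencies(["học", "tập", "học", "tập"]))
```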

📁 Output Sample

Word       | Count
---------- | -----
học tập    | 6
công nghệ  | 4
phát triển | 4
giúp       | 4
mà còn     | 2
xã hội     | 2

🛠️ Tech Stack

  • Language: Python 3.x
  • Libraries:
    • matplotlib - Data visualization
    • csv - CSV file handling
    • re - Regular expressions for text processing

📝 Notes

  • First stable release
  • Tested with Vietnamese educational content
  • CSV output compatible with Excel and data analysis tools

🔮 Future Improvements

  • Support for additional file formats (.docx, .pdf)
  • Web interface for file upload
  • Word cloud visualization
  • TF-IDF analysis
  • Multi-file batch processing