API: https://finnish-nlp-toolkit-api.onrender.com/health
A production-ready FastAPI service for Finnish natural language processing.
Scope note: This is a lightweight REST wrapper for basic Finnish morphological analysis — lemmatization, complexity scoring, and profanity detection across three domains. For the full platform with 30+ endpoints, all 15 grammatical cases, and Voikko-backed production morphology, see Finnish-nlp-2.0.
Finnish NLP Toolkit is a backend service that exposes Finnish language processing through a REST API and an interactive web interface. Built with FastAPI, it offers lemmatization, complexity analysis, and profanity detection for Finnish text, with optional Voikko, UDPipe, spaCy, and FinBERT integrations.
- ⚡ High Performance: Sub-100ms response times for single-text requests on the basic engines
- 🔄 Batch Processing: Process up to 100 texts in a single request
- 🎯 Dual-Mode Architecture: Basic (rule-based) and Advanced (ML-powered) engines
- 📊 Comprehensive Testing: 66 tests with 100% pass rate
- 🐳 Docker Ready: Full containerization with docker-compose
- 📚 Auto-Generated Docs: Interactive OpenAPI/Swagger documentation
- 🎨 Interactive UI: Streamlit-based frontend for testing
Convert Finnish words to their dictionary forms with detailed morphological analysis.
- Basic Engine: Rule-based Finnish morphology (works out-of-the-box)
- Advanced Engine: Voikko integration for production-grade accuracy
- Output: Base form, part-of-speech, case, number, person, tense, mood
Analyze linguistic complexity of Finnish sentences and texts.
- Basic Engine: Heuristic-based clause detection and metrics
- Advanced Engine: UDPipe dependency parsing + spaCy pipeline
- Metrics: Clause count, morphological depth, average word length, case distribution
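To make the basic metrics concrete, here is a minimal sketch of the kind of heuristics such an engine might use: a word count, an average word length, and a rough clause estimate. The `SUBORDINATORS` set, the tokenizer, and all function names are assumptions made for this example; they are not taken from `complexity_engine.py`, and the service's own numbers may differ.

```python
import re

# Hypothetical marker list: a few common Finnish subordinating conjunctions.
SUBORDINATORS = {"kun", "että", "koska", "jos", "vaikka", "jotta"}


def tokenize(text: str) -> list[str]:
    """Split text into word tokens, keeping letters such as ä and ö."""
    return re.findall(r"[^\W\d_]+", text)


def average_word_length(text: str) -> float:
    """Mean number of characters per word, or 0.0 for empty input."""
    words = tokenize(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def estimate_clause_count(text: str) -> int:
    """Assume one main clause plus one clause per subordinator found."""
    words = [w.lower() for w in tokenize(text)]
    if not words:
        return 0
    return 1 + sum(1 for w in words if w in SUBORDINATORS)


if __name__ == "__main__":
    sample = "Kun aurinko paistoi, lapset leikkivät"
    print(len(tokenize(sample)), estimate_clause_count(sample))
```

A real analyzer would also need sentence splitting and case detection, which this sketch leaves out.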
Detect and filter toxic content in Finnish text.
- Basic Engine: Keyword-based filtering (30+ Finnish swear words)
- Advanced Engine: FinBERT ML classification (ready for custom models)
- Features: Toxicity scoring, severity levels, flagged word identification
- Batch Processing: Process multiple texts efficiently with aggregate statistics
- Smart Caching: Redis-backed caching with in-memory LRU fallback
- Graceful Degradation: Automatic fallback to basic engines if advanced features unavailable
- Configuration Management: Feature flags via environment variables
- Comprehensive Logging: Structured logging for monitoring and debugging
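The in-memory fallback named in the caching bullet can be pictured as a small LRU cache with a time-to-live. This is a minimal sketch, assuming a hypothetical `LRUCache` class; the actual `app/utils/cache.py` may be structured differently, and the Redis side is omitted. The default TTL mirrors the `CACHE_TTL=3600` setting.

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class LRUCache:
    """Tiny in-memory LRU cache with a per-entry time-to-live."""

    def __init__(self, max_size: int = 1024, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._items: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        # Mark the key as most recently used.
        self._items.move_to_end(key)
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (time.monotonic() + self.ttl_seconds, value)
        self._items.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)


if __name__ == "__main__":
    cache = LRUCache(max_size=2)
    cache.set("kissani", {"lemma": "kissa"})
    print(cache.get("kissani"))
```

Keying the cache on the request text plus its options would let repeated requests skip the NLP engine entirely.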
- Python 3.9 or higher
- pip (Python package manager)
- Docker and Docker Compose (optional, for containerized deployment)
# Clone the repository
git clone https://github.com/Aliipou/finnish-nlp-toolkit-api finapi
cd finapi
# Install dependencies
pip install -r requirements.txt
# Start the API server
uvicorn app.main:app --reload --port 8000
# In another terminal, start the frontend
streamlit run frontend/app.py
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
| Service | URL | Description |
|---|---|---|
| REST API | http://localhost:8000 | Main API service |
| API Documentation | http://localhost:8000/docs | Interactive Swagger UI |
| Alternative Docs | http://localhost:8000/redoc | ReDoc documentation |
| Frontend UI | http://localhost:8501 | Streamlit interface |
| Health Check | http://localhost:8000/health | Service status |
| Method | Endpoint | Description |
|---|---|---|
| GET | / | API information and version |
| GET | /health | Health check endpoint |
| GET | /version | API version details |
Endpoints:
GET /api/lemmatize?text={text}&include_morphology={bool}
POST /api/lemmatize
Request:
curl "http://localhost:8000/api/lemmatize?text=kissani"
Response:
{
  "text": "kissani",
  "lemmas": [
    {
      "original": "kissani",
      "lemma": "kissa",
      "pos": "NOUN",
      "morphology": {
        "case": "Nominative",
        "number": "Singular",
        "possessive": "1Sg"
      }
    }
  ],
  "word_count": 1
}
Endpoints:
GET /api/complexity?text={text}&detailed={bool}
POST /api/complexity
Request:
curl "http://localhost:8000/api/complexity?text=Kun%20aurinko%20paistoi,%20lapset%20leikkivät&detailed=true"
Response:
{
  "text": "Kun aurinko paistoi, lapset leikkivät",
  "sentence_count": 1,
  "word_count": 5,
  "clause_count": 2,
  "morphological_depth_score": 45.2,
  "average_word_length": 7.0,
  "complexity_rating": "Moderate",
  "case_distribution": {
    "nominative": 2,
    "genitive": 0,
    "partitive": 0,
    "inessive": 0
  }
}
Endpoints:
GET /api/swear-check?text={text}&return_flagged_words={bool}&threshold={float}
POST /api/swear-check
Request:
curl "http://localhost:8000/api/swear-check?text=puhdas%20teksti&return_flagged_words=true"
Response:
{
  "text": "puhdas teksti",
  "is_toxic": false,
  "toxicity_score": 0.0,
  "severity": "None",
  "flagged_words": null
}
Process multiple texts in a single request for improved efficiency.
Endpoints:
POST /api/batch/lemmatize
POST /api/batch/complexity
POST /api/batch/swear-check
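Because a single batch request accepts at most 100 texts, a client holding a longer list has to split it first. The sketch below shows one way to do that and to build the JSON bodies for `POST /api/batch/lemmatize`. The helper names `chunk_texts` and `build_batch_payloads` are hypothetical, and the sketch only prepares payloads; sending them is left to whichever HTTP client you use.

```python
import json
from typing import Iterator

MAX_BATCH_SIZE = 100  # Limit stated for batch requests.


def chunk_texts(texts: list[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def build_batch_payloads(texts: list[str], include_morphology: bool = True) -> list[str]:
    """Return one JSON request body per chunk of texts."""
    return [
        json.dumps(
            {"texts": chunk, "include_morphology": include_morphology},
            ensure_ascii=False,  # Keep ä and ö readable in the payload.
        )
        for chunk in chunk_texts(texts)
    ]


if __name__ == "__main__":
    for body in build_batch_payloads(["kissa", "koira", "lintu"]):
        print(body)
```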
Request:
curl -X POST http://localhost:8000/api/batch/lemmatize \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["kissa", "koira", "lintu"],
    "include_morphology": true
  }'
Response:
{
  "results": [
    {
      "text": "kissa",
      "lemmas": [...],
      "word_count": 1
    },
    {
      "text": "koira",
      "lemmas": [...],
      "word_count": 1
    },
    {
      "text": "lintu",
      "lemmas": [...],
      "word_count": 1
    }
  ],
  "total_texts": 3,
  "total_words": 3,
  "average_words_per_text": 1.0
}
finapi/
├── app/  # Main application package
│   ├── main.py  # FastAPI application entry point
│   ├── config.py  # Configuration management
│   │
│   ├── routers/  # API endpoint definitions
│   │   ├── lemmatizer.py  # Lemmatization endpoints
│   │   ├── complexity.py  # Complexity analysis endpoints
│   │   ├── profanity.py  # Profanity detection endpoints
│   │   └── batch_processing.py  # Batch processing endpoints
│   │
│   ├── services/  # Business logic and NLP engines
│   │   ├── lemma_engine.py  # Basic lemmatizer
│   │   ├── complexity_engine.py  # Basic complexity analyzer
│   │   ├── profanity_model.py  # Basic profanity detector
│   │   ├── advanced_lemma_engine.py  # Voikko-powered lemmatizer
│   │   ├── advanced_complexity_engine.py  # UDPipe + spaCy analyzer
│   │   └── advanced_profanity_model.py  # ML-based detector
│   │
│   ├── models/  # Data models and schemas
│   │   └── schemas.py  # Pydantic request/response models
│   │
│   ├── utils/  # Utility modules
│   │   └── cache.py  # Caching layer (Redis + LRU)
│   │
│   └── tests/  # Test suite
│       ├── test_lemmatizer.py  # Lemmatizer tests (10 tests)
│       ├── test_complexity.py  # Complexity tests (15 tests)
│       ├── test_profanity.py  # Profanity tests (16 tests)
│       └── test_api_integration.py  # Integration tests (25 tests)
│
├── data/  # Data directory
│   ├── models/  # NLP models (download required)
│   ├── corpus/  # Text corpora for training/testing
│   ├── datasets/  # Training datasets
│   ├── cache/  # Runtime cache storage
│   └── scripts/
│       └── download_models.py  # Model download automation
│
├── frontend/  # Web interface
│   ├── app.py  # Streamlit application
│   └── Dockerfile  # Frontend container
│
├── docs/  # Documentation
│   ├── API_REFERENCE.md  # Complete API reference
│   └── DEPLOYMENT.md  # Deployment guide
│
├── Dockerfile  # API container definition
├── docker-compose.yml  # Multi-service orchestration
├── requirements.txt  # Python dependencies
├── requirements-advanced.txt  # Advanced NLP dependencies
├── pytest.ini  # Test configuration
├── .env.example  # Environment variables template
└── .gitignore  # Git ignore rules
Create a .env file based on .env.example:
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=False
LOG_LEVEL=INFO
# Advanced NLP Features (optional)
USE_VOIKKO=false # Requires system libvoikko installation
USE_UDPIPE=false # Requires model download
USE_SPACY=false # Requires spaCy model
USE_TRANSFORMERS=false # Requires FinBERT model
USE_REDIS=false # Requires Redis server
# Model Paths
UDPIPE_MODEL_PATH=data/models/finnish-tdt-ud-2.5-191206.udpipe
TOXICITY_MODEL_PATH=data/models/finnish-toxicity-bert
# Redis Configuration
REDIS_URL=redis://localhost:6379/0
CACHE_TTL=3600
# CORS Settings
CORS_ORIGINS=*
The application supports graceful degradation through feature flags:
| Flag | Default | Requires | Description |
|---|---|---|---|
| USE_VOIKKO | false | libvoikko | Real Finnish morphological analysis |
| USE_UDPIPE | false | Model file | Dependency parsing |
| USE_SPACY | false | spaCy model | Modern NLP pipeline |
| USE_TRANSFORMERS | false | BERT model | ML-based toxicity detection |
| USE_REDIS | false | Redis server | Distributed caching |
When advanced features are unavailable, the system automatically falls back to basic implementations.
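The fallback described above can be sketched as a flag check followed by a guarded import. This is an illustration only: `flag_enabled` and `select_engine` are hypothetical helpers, not functions from `app/config.py`, and the project may detect missing dependencies in a different way.

```python
import importlib
import os


def flag_enabled(name: str, default: str = "false") -> bool:
    """Read a boolean feature flag from the environment."""
    return os.environ.get(name, default).strip().lower() in {"1", "true", "yes"}


def select_engine(flag: str, module_name: str) -> str:
    """Return "advanced" only if the flag is on and the module imports."""
    if not flag_enabled(flag):
        return "basic"
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Dependency missing: degrade instead of failing at startup.
        return "basic"
    return "advanced"


if __name__ == "__main__":
    # Assumes the Voikko Python bindings are importable as `libvoikko`.
    print(select_engine("USE_VOIKKO", "libvoikko"))
```

Checking the import once at startup keeps the per-request path free of try/except logic.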
Run the automated download script:
python data/scripts/download_models.py
This will download:
- UDPipe Finnish Model (~21MB) - Dependency parsing
- spaCy Finnish Model - Modern NLP pipeline
- Sample Corpus - Finnish text examples
After downloading models, update .env:
USE_UDPIPE=true
USE_SPACY=true
For production deployments with multiple instances:
# Install and start Redis
docker run -d -p 6379:6379 redis:alpine
# Enable in .env
USE_REDIS=true
REDIS_URL=redis://localhost:6379/0
# Run all tests
pytest app/tests/ -v
# Run specific test file
pytest app/tests/test_lemmatizer.py -v
# Run with coverage
pytest app/tests/ --cov=app --cov-report=html
# View coverage report
open htmlcov/index.html
✓ 66 tests passing (100% pass rate)
├── Lemmatizer: 10 tests
├── Complexity: 15 tests
├── Profanity: 16 tests
└── Integration: 25 tests
Execution time: ~2.6s
| Operation | Basic Engine | Advanced Engine | With Cache |
|---|---|---|---|
| Lemmatization | ~50ms | ~100ms (Voikko) | ~5ms |
| Complexity | ~30ms | ~150ms (UDPipe) | ~5ms |
| Profanity | ~20ms | ~300ms (FinBERT) | ~5ms |
| Batch (10 texts) | ~200ms | ~1s | ~50ms |
Benchmarks on 2.5GHz processor, average over 100 requests
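For readers who want to reproduce this kind of measurement, the sketch below averages wall-clock latency over a fixed number of calls. The `average_latency_ms` helper is hypothetical, and the project's own benchmark method is not documented here, so treat the results as comparable only in spirit.

```python
import statistics
import time
from typing import Callable


def average_latency_ms(operation: Callable[[], object], runs: int = 100) -> float:
    """Call `operation` `runs` times and return the mean latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


if __name__ == "__main__":
    # Stand-in workload; swap in a function that calls the API.
    print(f"{average_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```

Passing a function that issues one HTTP request to a running instance would measure end-to-end latency, including network overhead.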
| Document | Description |
|---|---|
| API Reference | Complete endpoint documentation with examples |
| Deployment Guide | Production deployment on Railway, Render, Heroku |
| Interactive Docs | Swagger UI (when server running) |
| ReDoc | Alternative API documentation |
- FastAPI - Modern async web framework
- Pydantic - Data validation using Python type annotations
- Uvicorn - Lightning-fast ASGI server
- Voikko - Finnish morphological analyzer
- UDPipe - Universal Dependencies parsing
- spaCy - Industrial-strength NLP
- Transformers - State-of-the-art ML models
- Docker - Containerization
- Redis - In-memory caching (optional)
- Pytest - Testing framework
- Streamlit - Interactive web UI
# Build and start services
docker-compose up -d
# View logs
docker-compose logs -f api
# Scale API service
docker-compose up -d --scale api=3
# Stop services
docker-compose down
# Install Railway CLI
npm install -g @railway/cli
# Deploy
railway login
railway init
railway up
- Connect your GitHub repository
- Select "Docker" as environment
- Deploy automatically from Dockerfile
# Login to Heroku
heroku login
heroku container:login
# Create app and deploy
heroku create finnish-nlp-api
heroku container:push web
heroku container:release web
See DEPLOYMENT.md for detailed instructions.
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (pytest app/tests/)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Turku Dependency Treebank - Finnish linguistic resources
- Universal Dependencies - Treebank annotation standards
- Voikko Project - Open-source Finnish language tools
- TurkuNLP - Finnish NLP research group
Made with ❤️ for the Finnish NLP community




