A production-ready Flask-based API service for serving machine learning models to predict exoplanet characteristics. This service supports multiple model types including Random Forest, XGBoost, and Neural Networks with ensemble capabilities.
## Features

- Multiple Model Support: Random Forest, XGBoost, and PyTorch Neural Networks
- Ensemble Predictions: Combine multiple models for improved accuracy
- Batch Processing: Process multiple files in a single request
- Data Validation: Comprehensive input validation and preprocessing
- Feature Alignment: Automatic feature matching between training and inference
- Configurable: Flexible configuration through JSON files or environment variables
- Production Ready: Logging, error handling, and monitoring capabilities
- RESTful API: Clean, well-documented endpoints
- Type Safety: Comprehensive type hints throughout the codebase
## Project Structure

```text
back_end_1_0/
├── __init__.py       # Legacy Flask application
├── app.py            # Main Flask application with all improvements
├── config.py         # Configuration management
├── logger.py         # Logging utilities
├── io_utils.py       # Input/output and data handling
├── preprocess.py     # Data preprocessing and feature engineering
├── models.py         # Model loading and prediction utilities
├── requirements.txt  # Python dependencies
└── tests/            # Unit tests
    ├── test_io_utils.py
    ├── test_preprocess.py
    └── test_models.py
```
## Prerequisites

- Python 3.8 or higher
- pip package manager
- (Optional) CUDA-capable GPU for PyTorch acceleration
## Installation

- Clone the repository:

```bash
git clone <repository-url>
cd back_end_1_0
```

- Create a virtual environment:

```bash
python -m venv venv

# On Windows
venv\Scripts\activate

# On Linux/Mac
source venv/bin/activate
```

- Install dependencies:

```bash
# Install core dependencies
pip install -r back_end_1_0/requirements.txt

# For development (includes testing and linting tools)
pip install -r back_end_1_0/requirements.txt
```

- Create a configuration file (optional):
```bash
# Create a config.json file in the project root
cat > config.json << EOF
{
  "api": {
    "host": "0.0.0.0",
    "port": 5000,
    "debug": false,
    "max_file_size": 104857600,
    "enable_cors": true
  },
  "models": {
    "rf": {
      "path": "path/to/rf_model.pkl",
      "aligner_path": "path/to/rf_aligner.pkl",
      "threshold": 0.5,
      "enabled": true
    },
    "xgb": {
      "path": "path/to/xgb_model.pkl",
      "aligner_path": "path/to/xgb_aligner.pkl",
      "threshold": 0.5,
      "enabled": true
    },
    "nn": {
      "path": "path/to/nn_model.pt",
      "aligner_path": "path/to/nn_aligner.pkl",
      "threshold": 0.5,
      "batch_size": 1024,
      "enabled": true
    }
  },
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "enable_console": true,
    "enable_file": true
  }
}
EOF
```

## Running the Service

### Development

Using the Flask development server:
```bash
python -m back_end_1_0.app

# Or with environment variables
export FLASK_APP=back_end_1_0.app
export FLASK_ENV=development
flask run
```

### Production

```bash
# Using Gunicorn (recommended for production); quote the factory call so the
# shell does not interpret the parentheses
gunicorn -w 4 -b 0.0.0.0:5000 "back_end_1_0.app:create_app()"

# With custom configuration
gunicorn -w 4 -b 0.0.0.0:5000 --timeout 300 --max-requests 1000 "back_end_1_0.app:create_app()"
```

## Environment Variables

You can configure the service using environment variables:
```bash
# API Configuration
export API_HOST=0.0.0.0
export API_PORT=5000
export API_DEBUG=false

# Model Paths
export MODEL_RF_PATH=/path/to/rf_model.pkl
export MODEL_RF_ALIGNER_PATH=/path/to/rf_aligner.pkl
export MODEL_XGB_PATH=/path/to/xgb_model.pkl
export MODEL_XGB_ALIGNER_PATH=/path/to/xgb_aligner.pkl
export MODEL_NN_PATH=/path/to/nn_model.pt
export MODEL_NN_ALIGNER_PATH=/path/to/nn_aligner.pkl

# Logging
export LOG_LEVEL=INFO
export LOG_FILE=logs/app.log

# Configuration File
export CONFIG_FILE=config.json
```

## API Endpoints

### GET /health

Returns service health status and available models.
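As a quick smoke test, the health endpoint can be queried with Python's standard library. This is a minimal sketch; the `fetch_health` helper is illustrative, and the `models` key is an assumption about the payload shape rather than a documented field:

```python
import json
from urllib.request import urlopen

def fetch_health(base_url="http://localhost:5000"):
    """GET /health and decode the JSON body (assumes the service is running locally)."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.loads(resp.read())

def models_available(health_payload):
    """Pull the list of available models out of a health payload.
    The "models" key is a guess at the response shape, not a documented field."""
    return health_payload.get("models", [])
```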
### GET /info

Returns service capabilities and configuration.
### POST /validate

Content-Type: multipart/form-data

Parameters:
- file: CSV or ZIP file
- is_zip: "true" if file is ZIP (optional)

Validates uploaded data without making predictions.
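For example, a client can derive the `is_zip` flag from the file extension before uploading. This is a sketch; the `validation_form` helper and the use of the `requests` library are illustrative, not part of the service:

```python
def validation_form(path):
    """Build the extra multipart fields for POST /validate:
    set is_zip to "true" only when uploading a ZIP archive."""
    data = {}
    if path.lower().endswith(".zip"):
        data["is_zip"] = "true"
    return data

# Usage with the requests library (assumes the service is running locally):
# import requests
# with open("candidates.zip", "rb") as f:
#     resp = requests.post("http://localhost:5000/validate",
#                          files={"file": f}, data=validation_form("candidates.zip"))
#     resp.raise_for_status()
```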
### POST /predict/rf

Content-Type: multipart/form-data

Parameters:
- file: CSV file with features
- model_path: Path to model file (optional, uses config)
- aligner_path: Path to feature aligner (optional)
- label_col: Name of label column for metrics (optional)
- threshold: Classification threshold (optional, default: 0.5)

### POST /predict/xgb
Content-Type: multipart/form-data
Parameters:
- file: CSV or ZIP file
- is_zip: "true" if file is ZIP
- model_path: Path to model file (optional)
- aligner_path: Path to feature aligner (optional)
- label_col: Name of label column (optional)
- threshold: Classification threshold (optional)

### POST /predict/nn
Content-Type: multipart/form-data
Parameters:
- file: CSV file
- model_path: Path to PyTorch model (optional)
- aligner_path: Path to feature aligner (optional)
- label_col: Name of label column (optional)
- threshold: Classification threshold (optional)
- batch_size: Batch size for inference (optional)

### POST /predict/ensemble
Content-Type: multipart/form-data
Parameters:
- file: CSV file
- models: Comma-separated model names (default: "rf,xgb,nn")
- weights: Comma-separated weights (optional)
- voting: "soft" or "hard" (default: "soft")
- label_col: Name of label column (optional)

### POST /predict
Content-Type: multipart/form-data
Parameters:
- file: CSV file
- label_col: Name of label column (optional)

Returns predictions from all configured models.
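Given the documented response structure (a `predictions` array of 0/1 labels), a client can post a CSV and summarize each model's output. The `summarize` helper is a sketch, and the per-model keying of the combined `/predict` response is an assumption:

```python
def summarize(result):
    """Compute the fraction of positive predictions from a prediction response.
    Relies only on the documented "predictions" field."""
    preds = result["predictions"]
    return sum(preds) / len(preds) if preds else 0.0

# Usage with the requests library (assumes the service is running locally):
# import requests
# with open("candidates.csv", "rb") as f:
#     resp = requests.post("http://localhost:5000/predict", files={"file": f})
# for name, result in resp.json().items():  # assumed per-model keying of the response
#     print(name, summarize(result))
```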
### POST /batch
Content-Type: multipart/form-data
Parameters:
- files: Multiple CSV files
- model: Model type to use (default: "ensemble")

## Response Format

All prediction endpoints return JSON responses with the following structure:

```json
{
  "model": "model_name",
  "predictions": [0, 1, 0, ...],
  "probabilities": [0.23, 0.87, 0.15, ...],
  "threshold": 0.5,
  "n_samples": 100,
  "n_features": 50,
  "metrics": {
    "accuracy": 0.85,
    "precision": 0.82,
    "recall": 0.88,
    "f1": 0.85
  }
}
```

## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=back_end_1_0 --cov-report=html

# Run specific test file
pytest tests/test_models.py

# Run with verbose output
pytest -v

# Run only fast tests (exclude slow tests)
pytest -m "not slow"
```

## Code Quality

```bash
# Format code with Black
black back_end_1_0/

# Check code style with flake8
flake8 back_end_1_0/

# Type checking with mypy
mypy back_end_1_0/

# Sort imports with isort
isort back_end_1_0/
```

## Training Compatible Models

To train models compatible with this service:
- Feature Aligner: Save the feature names from training:

```python
from back_end_1_0.preprocess import FeatureAligner

# During training
aligner = FeatureAligner(feature_names=X_train.columns.tolist())
aligner.save("model_aligner.pkl")
```

- Scikit-learn Models: Use joblib to save:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
joblib.dump(model, "rf_model.pkl")
```

- PyTorch Models: Save the state dict:

```python
import torch

# Save the model weights; the architecture must match the ExoplanetModel class
torch.save(model.state_dict(), "nn_model.pt")
```

## Docker Deployment

Create a Dockerfile:
```dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY back_end_1_0/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY back_end_1_0/ ./back_end_1_0/
COPY config.json .

EXPOSE 5000

CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "back_end_1_0.app:create_app()"]
```

Build and run:

```bash
docker build -t exoplanet-backend .
docker run -p 5000:5000 -v /path/to/models:/models exoplanet-backend
```

## Performance Tips

- Use batch processing for multiple files
- Enable model caching in production
- Use GPU acceleration for PyTorch models when available
- Implement request rate limiting for public APIs
- Use a reverse proxy (nginx) in production
- Enable response compression for large predictions
## Troubleshooting

- Model not found error:
  - Check model paths in config.json
  - Ensure model files exist and are readable
  - Verify file permissions
- Memory errors with large files:
  - Adjust `max_file_size` in the configuration
  - Use batch processing for large datasets
  - Increase available RAM
- Slow predictions:
  - Enable GPU acceleration for PyTorch
  - Reduce batch size for neural networks
  - Use the ensemble only when necessary
- Feature mismatch errors:
  - Ensure the feature aligner is properly configured
  - Check that input data has the expected columns
  - Verify the preprocessing pipeline matches training
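When chasing a feature mismatch, it can help to diff the incoming columns against the feature names the aligner was saved with. A minimal sketch; `expected_features` stands in for whatever list your aligner stores:

```python
def diff_features(expected_features, incoming_columns):
    """Report which training-time features are missing from the input
    and which input columns the model has never seen."""
    expected = set(expected_features)
    incoming = set(incoming_columns)
    return {
        "missing": sorted(expected - incoming),
        "unexpected": sorted(incoming - expected),
    }

# Example (hypothetical column names):
# diff_features(["period", "radius", "depth"], ["period", "radius", "snr"])
# -> {"missing": ["depth"], "unexpected": ["snr"]}
```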
## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Support

For issues and questions:
- Create an issue on GitHub
- Contact the development team
- Check the documentation
## Changelog

- Initial release with multi-model support
- Comprehensive refactoring and improvements
- Added ensemble predictions
- Implemented configuration management
- Added logging and monitoring
- Created comprehensive test suite
- Added batch processing capabilities
- Improved error handling and validation