# Work Absence Hours ML Project
This project builds a machine learning system to forecast employee absenteeism hours using structured HR and productivity data.
The goal is to identify patterns in employee behavior and workplace conditions to anticipate absenteeism risk, enabling better planning and cost reduction.
Set up a virtual environment and install dependencies:

```bash
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
pip install -r requirements.txt
```
Run the MLflow server for experiment tracking:

```bash
mlflow server --backend-store-uri sqlite:///my.db --default-artifact-root ./mlruns --host 0.0.0.0 --port 5000
```

The MLflow UI will be available at http://localhost:5000.
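Training scripts can then log runs to this server. One common way to point them at it (assuming the default host and port above) is to set the tracking URI in the environment:

```shell
# Point MLflow clients at the local tracking server started above
export MLFLOW_TRACKING_URI=http://localhost:5000
```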
The project includes a FastAPI-based REST API for making absenteeism predictions.
Build and start the API:

```bash
docker-compose up -d
```

Stop the API:

```bash
docker-compose down
```

The API will be available at http://localhost:8000.
- `GET /` - API information and available endpoints
- `GET /health` - Health check endpoint
- `POST /predict` - Make absenteeism predictions
- `GET /docs` - Interactive Swagger UI documentation
Health check:

```bash
curl http://localhost:8000/health
```

Get API information:

```bash
curl http://localhost:8000/
```

Make a prediction:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "reason_for_absence": 23,
    "month_of_absence": 7,
    "day_of_the_week": 3,
    "seasons": 1,
    "transportation_expense": 289,
    "distance_from_residence_to_work": 36,
    "service_time": 13,
    "age": 33,
    "work_load_average/day": 239.554,
    "hit_target": 97,
    "disciplinary_failure": 0,
    "education": 1,
    "son": 2,
    "social_drinker": 1,
    "social_smoker": 0,
    "pet": 1,
    "weight": 90,
    "height": 172
  }'
```

Expected response:

```json
{
  "prediction": 1,
  "prediction_label": "High",
  "confidence": 0.7343
}
```

Once the server is running, access the interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
These interfaces allow you to:
- View all available endpoints
- See request/response schemas
- Test the API directly from your browser
- Download OpenAPI specification
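The curl call above can also be made from Python. Below is a minimal client sketch using only the standard library; the `predict` helper is hypothetical (not part of the project's code) and assumes the API is running at `localhost:8000`:

```python
import json
import urllib.request

def predict(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST a feature payload to the /predict endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Same feature payload as the curl example above
sample = {
    "reason_for_absence": 23,
    "month_of_absence": 7,
    "day_of_the_week": 3,
    "seasons": 1,
    "transportation_expense": 289,
    "distance_from_residence_to_work": 36,
    "service_time": 13,
    "age": 33,
    "work_load_average/day": 239.554,
    "hit_target": 97,
    "disciplinary_failure": 0,
    "education": 1,
    "son": 2,
    "social_drinker": 1,
    "social_smoker": 0,
    "pet": 1,
    "weight": 90,
    "height": 172,
}

# With the server up: predict(sample) returns a dict with "prediction",
# "prediction_label", and "confidence" keys, as in the expected response above.
```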
```
├── LICENSE
├── Makefile             <- Makefile with commands like `make data` or `make train`
├── README.md            <- The top-level README for developers using this project.
├── Dockerfile           <- Docker configuration for the API server
├── docker-compose.yml   <- Docker Compose configuration for easy deployment
├── requirements.txt     <- Full requirements file for development and training
├── requirements-api.txt <- Minimal requirements for the API server
│
├── data
│   ├── external         <- Data from third party sources.
│   ├── interim          <- Intermediate data that has been transformed.
│   ├── processed        <- The final, canonical data sets for modeling.
│   └── raw              <- The original, immutable data dump.
│
├── docs                 <- A default Sphinx project; see sphinx-doc.org for details
│
├── models               <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks            <- Jupyter notebooks. Naming convention is a number (for ordering),
│                           the creator's initials, and a short `-` delimited description, e.g.
│                           `1.0-jqp-initial-data-exploration`.
│
├── references           <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports              <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures          <- Generated graphics and figures to be used in reporting
│
├── setup.py             <- Makes project pip installable (`pip install -e .`) so src can be imported
├── src                  <- Source code for use in this project.
│   ├── __init__.py      <- Makes src a Python module
│   │
│   ├── api              <- FastAPI REST API for predictions
│   │   └── server.py    <- API server implementation
│   │
│   ├── data             <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features         <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models           <- Scripts to train models and then use trained models to make predictions
│   │   ├── predict_model.py
│   │   ├── train_model.py
│   │   └── preprocessors.py
│   │
│   └── visualization    <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── tests                <- Unit and integration tests
│   ├── unit/            <- Unit tests for individual components
│   └── integration/     <- End-to-end integration tests
│
└── tox.ini              <- tox file with settings for running tox; see tox.readthedocs.io
```
Project based on the cookiecutter data science project template. #cookiecutterdatascience
The project includes a comprehensive test suite with both unit and integration tests:
```
tests/
├── __init__.py                       # Package initialization
├── conftest.py                       # Shared pytest fixtures (root level)
│
├── unit/                             # Unit tests
│   ├── __init__.py
│   ├── conftest.py                   # Unit-specific fixtures
│   ├── test_preprocessors.py         # Tests for custom transformers
│   ├── test_train_model.py           # Tests for training pipeline
│   ├── test_predict_model.py         # Tests for prediction pipeline
│   ├── test_data_utils.py            # Tests for data utilities
│   └── test_evaluation.py            # Tests for model evaluation
│
└── integration/                      # Integration tests
    ├── __init__.py
    ├── conftest.py                   # Integration-specific fixtures
    └── test_pipeline_integration.py  # End-to-end pipeline tests
```

Additional files:

```
├── pytest.ini                # Pytest configuration
├── Dockerfile.test           # Docker image for testing
└── docker-compose.test.yml   # Docker compose for running tests
```
- Preprocessors (`test_preprocessors.py`)
  - DropColumnsTransformer: column dropping functionality
  - IQRClippingTransformer: outlier handling using the IQR method
  - ToStringTransformer: type conversion to strings
  - Integration with sklearn pipelines
- Model Training (`test_train_model.py`)
  - Data loading and preparation
  - Pipeline construction
  - Model creation (Logistic Regression, Random Forest, Neural Network)
  - Training and evaluation
  - Multiple model training
  - Model persistence
- Model Prediction (`test_predict_model.py`)
  - Model loading
  - Making predictions on new data
  - Data handling in the prediction pipeline
- Data Utilities (`test_data_utils.py`)
  - CSV file loading
  - Column name normalization
  - Data shape validation
  - Data value preservation
- Model Evaluation (`test_evaluation.py`)
  - Metrics calculation (accuracy, F1, recall, precision)
  - Classification reports
  - Confusion matrix creation
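As an illustration of what these tests exercise, an IQR-based clipping transformer might look roughly like the sketch below. This is not the project's actual implementation (which presumably subclasses sklearn's `BaseEstimator`/`TransformerMixin`); it is a standalone, dependency-free approximation of the same idea:

```python
def _percentile(sorted_vals, q):
    """Linear-interpolated percentile of a pre-sorted list (q in [0, 100])."""
    k = (len(sorted_vals) - 1) * q / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = k - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

class IQRClippingTransformer:
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR] (sklearn-style API)."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        self.bounds_ = []
        for col in zip(*X):  # iterate columns of the row-major 2D data
            vals = sorted(col)
            q1 = _percentile(vals, 25)
            q3 = _percentile(vals, 75)
            iqr = q3 - q1
            self.bounds_.append((q1 - self.factor * iqr, q3 + self.factor * iqr))
        return self

    def transform(self, X):
        return [
            [min(max(v, lo), hi) for v, (lo, hi) in zip(row, self.bounds_)]
            for row in X
        ]

# The outlier 100 is clipped to Q3 + 1.5*IQR = 4 + 1.5*2 = 7.0
X = [[1], [2], [3], [4], [100]]
clipped = IQRClippingTransformer(factor=1.5).fit(X).transform(X)
```

Because it exposes `fit`/`transform`, such a duck-typed transformer drops into an sklearn `Pipeline` without further adaptation.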
End-to-End Pipeline (`test_pipeline_integration.py`)

- Complete ML Workflow (`test_realistic_ml_workflow`)
  - Data loading and preparation
  - Train/test split
  - Preprocessing pipeline creation
  - Model training
  - Model persistence (save/load)
  - Prediction on new data
  - Metrics evaluation
  - Confusion matrix generation
  - File artifact verification
This project uses GitHub Actions to automatically run tests on every push and pull request.
The CI pipeline runs:
- Unit tests on all test files in `tests/unit/`
- Integration tests on all test files in `tests/integration/`
- Coverage reports with XML and HTML output
- Tests on Python 3.9
- Navigate to the Actions tab in the GitHub repository
- Click on any workflow run to see detailed test results
- Coverage reports are uploaded as artifacts (available for 30 days)
The workflow is defined in `.github/workflows/tests.yml` and triggers on:
- Pushes to `main` and `develop` branches
- Pull requests to `main` and `develop` branches
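The workflow file itself is not reproduced here; a minimal sketch consistent with the triggers and steps described above (action versions and step names are assumptions) might look like:

```yaml
name: tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v
      - run: pytest tests/integration/ -v
      - run: pytest --cov=src --cov-report=xml --cov-report=html
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: htmlcov/
          retention-days: 30
```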
Build the Docker image that contains the required environment to run the tests:

```bash
docker build -f Dockerfile.test -t work-absenteeism-test:latest .
```

Run all tests (unit + integration):

```bash
docker-compose -f docker-compose.test.yml run --rm test
```

Run only unit tests:

```bash
docker-compose -f docker-compose.test.yml run --rm test pytest tests/unit/ -v
```

Run only integration tests:

```bash
docker-compose -f docker-compose.test.yml run --rm test pytest tests/integration/ -v
```

Run tests with coverage report:

```bash
docker-compose -f docker-compose.test.yml run --rm test-coverage
```

Use pytest markers to run specific test categories:
```bash
# Run only unit tests
pytest -m unit

# Run only integration tests
pytest -m integration

# Run only slow tests
pytest -m slow
```
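For `-m` selection to work without warnings, the markers must be registered in `pytest.ini`. The marker descriptions below are assumptions; the project's actual configuration file is not shown here:

```ini
[pytest]
markers =
    unit: unit tests for individual components
    integration: end-to-end integration tests
    slow: tests that take a long time to run
testpaths = tests
```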