Automated data cleaning and validation pipeline for the sub-Saharan Africa feeds composition database (SSA Feeds).
This pipeline includes two components:
- Data Cleaning: Processes feed composition records through 9 sequential cleaning steps.
- Data Validation: Validates new data entries against 6 validation rules before database import.
Data Cleaning Pipeline:
source("run_SSAfeeds_data_cleaning.R")
Data Validation:
source("run_SSAfeeds_data_validation.R new_data.csv")
Cleaning Steps:
- Data Quality Flagging (01_data_quality_flagging.R) - Remove duplicates, mixtures, trial codes, generic feeds
- Naming Standardisation (02_naming_standardisation.R) - Standardize crop and feed names
- Scientific Name Population (03_scientific_name_population.R) - Add taxonomic information
- Plant Parts Population (04_plant_parts_population.R) - Extract plant parts from names
- Feed Type Mapping (05a_feed_type_mapping.R) - Classify into 9 feed categories
- Biological Validation (06_biological_validation.R) - Check biological constraints
- Nutritional Range Validation (07_nutritional_range_validation.R) - Validate parameter ranges
- Column Selection (08a_column_selection.R) - Export final dataset
- Visualization (09_boxplots_by_feedtype.R) - Generate boxplots by feed type
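As a rough illustration of what "sequential" means here, the sketch below chains two stand-in steps over a data frame, passing the output of one step to the next. The function `run_steps`, the two step functions, and the columns `crop_name` and `dm` are hypothetical; the actual logic lives in the numbered scripts and is not reproduced in this README.

```r
# Minimal sketch of a sequential cleaning driver (base R only).
# All names below are illustrative assumptions, not the pipeline's real API.

run_steps <- function(df, steps) {
  for (step_name in names(steps)) {
    rows_before <- nrow(df)
    df <- steps[[step_name]](df)  # each step takes and returns a data frame
    message(sprintf("%s: %d -> %d rows", step_name, rows_before, nrow(df)))
  }
  df
}

steps <- list(
  # Stand-in for part of the quality flagging step: drop exact duplicate rows
  drop_exact_duplicates = function(df) df[!duplicated(df), , drop = FALSE],
  # Stand-in for naming standardisation: lower-case and trim crop names
  standardise_names = function(df) {
    df$crop_name <- trimws(tolower(df$crop_name))
    df
  }
)

feeds <- data.frame(
  crop_name = c("Maize ", "Maize ", "cowpea"),
  dm        = c(88.0, 88.0, 90.1),
  stringsAsFactors = FALSE
)

cleaned <- run_steps(feeds, steps)
```

Keeping each step as a function that takes and returns a data frame makes it easy to log row counts between steps and to rerun a single step in isolation.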
Validation Rules:
- Reference ID Format - Must be exactly 6 digits and unique
- Numeric Values Only - All nutritional parameters must be numeric
- Nutritional Parameter Ranges - Values within acceptable biological ranges
- Biological Constraints - ADF ≤ NDF, DM/OM ≤ 100%
- Required Fields - Crop name and at least one nutritional parameter
- Reference Data Validation - Valid feed types, plant parts, countries, crop names, genus
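The sketch below shows one way four of these rules (reference ID format, numeric values, biological constraints, required fields) could be expressed in base R. It is an illustration under assumptions, not the code in `run_SSAfeeds_data_validation.R`: the function `validate_feeds` and the column names `reference_id`, `crop_name`, `dm`, `om`, `adf`, `ndf` are hypothetical, "DM/OM ≤ 100%" is read as each of DM and OM being at most 100, and the range and reference-data rules are left out because their thresholds and lookup tables are not given here.

```r
# Hypothetical validator; column names and messages are assumptions.
validate_feeds <- function(df, nutrient_cols = c("dm", "om", "adf", "ndf")) {
  n <- nrow(df)
  as_text  <- function(x) trimws(as.character(x))
  is_blank <- function(x) is.na(x) | as_text(x) == ""
  to_num   <- function(x) suppressWarnings(as.numeric(as_text(x)))

  # Rule: reference ID is exactly 6 digits and unique.
  # IDs are best read as character so leading zeros are preserved.
  id <- as_text(df$reference_id)
  bad_id <- is.na(id) | !grepl("^[0-9]{6}$", id) |
    duplicated(id) | duplicated(id, fromLast = TRUE)

  # Rules: numeric values only, and at least one nutritional parameter present
  non_numeric <- rep(FALSE, n)
  all_blank   <- rep(TRUE, n)
  for (col in nutrient_cols) {
    blank       <- is_blank(df[[col]])
    non_numeric <- non_numeric | (!blank & is.na(to_num(df[[col]])))
    all_blank   <- all_blank & blank
  }

  # Rule: biological constraints (only checked where both values are numeric)
  adf <- to_num(df$adf); ndf <- to_num(df$ndf)
  dm  <- to_num(df$dm);  om  <- to_num(df$om)
  adf_gt_ndf <- !is.na(adf) & !is.na(ndf) & adf > ndf
  over_100   <- (!is.na(dm) & dm > 100) | (!is.na(om) & om > 100)

  checks <- list(
    "invalid or duplicate reference ID" = bad_id,
    "non-numeric nutritional value"     = non_numeric,
    "ADF greater than NDF"              = adf_gt_ndf,
    "DM or OM above 100%"               = over_100,
    "missing crop name or no nutritional parameter" =
      is_blank(df$crop_name) | all_blank
  )

  # Collect every failed rule per row into one readable string
  df$failure_reason <- vapply(seq_len(n), function(i) {
    failed <- vapply(checks, function(flag) flag[i], logical(1))
    paste(names(checks)[failed], collapse = "; ")
  }, character(1))
  df$passed <- df$failure_reason == ""
  df
}

new_data <- data.frame(
  reference_id = c("100001", "100002", "10003", "100004"),
  crop_name    = c("Maize", "Cowpea", "Sorghum", ""),
  dm           = c("88.5", "91.0", "89.2", "90.0"),
  om           = c("94.1", "102.0", "93.0", "92.5"),
  adf          = c("4.2", "18.0", "n/a", "30.1"),
  ndf          = c("12.3", "15.0", "40.0", "55.0"),
  stringsAsFactors = FALSE
)

result <- validate_feeds(new_data)
```

Recording all failed rules per record, rather than stopping at the first, lets a data contributor correct a row in a single pass.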
Data Cleaning outputs:
- Final dataset: Clean records with standardized columns
- Metadata completeness: High completion rates for feed types and taxonomic information
- Visualizations: Boxplots for 8 nutritional parameters by feed type
Data Validation outputs:
- Passed records: filename_passed.csv - Records ready for import
- Failed records: filename_failed.csv - Records requiring correction
- Validation report: Console output with detailed failure reasons
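A minimal sketch of how such outputs could be produced is shown below. It assumes that `filename` stands for the input file's base name and that validation has already added a logical `passed` column and a `failure_reason` column; both column names and the helper `write_validation_outputs` are hypothetical and may differ from what the validation script actually does.

```r
# Hypothetical helper: split validated records into *_passed.csv and
# *_failed.csv next to the input file, and print failure reasons.
write_validation_outputs <- function(validated, input_path,
                                     out_dir = dirname(input_path)) {
  stem        <- tools::file_path_sans_ext(basename(input_path))
  passed_path <- file.path(out_dir, paste0(stem, "_passed.csv"))
  failed_path <- file.path(out_dir, paste0(stem, "_failed.csv"))

  is_pass   <- validated$passed
  data_cols <- setdiff(names(validated), c("passed", "failure_reason"))

  # Passed records keep only the data columns, ready for import
  write.csv(validated[is_pass, data_cols, drop = FALSE],
            passed_path, row.names = FALSE)
  # Failed records keep the reason so they can be corrected
  write.csv(validated[!is_pass, c(data_cols, "failure_reason"), drop = FALSE],
            failed_path, row.names = FALSE)

  # Console report: one line per failed record
  if (any(!is_pass)) {
    message(paste(sprintf("Reference %s: %s",
                          validated$reference_id[!is_pass],
                          validated$failure_reason[!is_pass]),
                  collapse = "\n"))
  }
  invisible(c(passed = passed_path, failed = failed_path))
}

validated <- data.frame(
  reference_id   = c("100001", "100002"),
  crop_name      = c("Maize", "Cowpea"),
  passed         = c(TRUE, FALSE),
  failure_reason = c("", "ADF greater than NDF"),
  stringsAsFactors = FALSE
)

# Written to a temporary directory here so the example has no side effects
paths <- write_validation_outputs(validated,
                                  file.path(tempdir(), "new_data.csv"))
```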
Requirements:
- R 4.0+
- Required packages: dplyr, ggplot2, readr