The process of building, using, and maintaining machine learning models and the data they use is very different from many other development workflows. In this lesson, we will demystify the process and outline the main techniques you need to know. You will:
- Understand the processes underpinning machine learning at a high level.
- Explore base concepts such as 'models', 'predictions', and 'training data'.
🎥 Click the image above for a short video working through this lesson.
At a high level, the craft of creating machine learning (ML) processes consists of a number of steps:
- Decide on the question. Most ML processes start by asking a question that cannot be answered by a simple conditional program or rules-based engine. These questions often revolve around predictions based on a collection of data.
- Collect and prepare data. To answer your question, you need data. The quality and, sometimes, the quantity of your data determine how well you can answer the question. Visualizing the data is an important part of this phase. This phase also includes splitting the data into training and testing groups to build a model.
- Choose a training method. Depending on your question and the nature of your data, you choose how to train a model so that it best reflects the data and makes accurate predictions. This part of the ML process requires specific expertise and, often, a considerable amount of experimentation.
- Train the model. Using your training data, you use various algorithms to train a model to recognize patterns in the data. The model may use internal weights that can be adjusted to favor certain parts of the data over others and so build a better model.
- Evaluate the model. You use data the model has never seen before (your testing data) to check how the model performs.
- Parameter tuning. Based on how the model performs, you can redo the process with different parameters, or variables, that control the behavior of the algorithms used to train the model.
- Predict. Use new inputs to test the accuracy of your model.
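The steps above can be sketched end to end in a few lines. This is only a minimal illustration, assuming scikit-learn is installed; the synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and the 25% test split are assumptions made for the example, not part of the lesson.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Collect and prepare data: a synthetic stand-in (assumed: 200 rows, 4 features).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold back 25% of the rows as testing data the model never sees while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Choose a training method and train the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model on the held-out testing data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Predict for a new input (here the first test row stands in for fresh data).
new_input = X_test[:1]
prediction = model.predict(new_input)
print(f"test accuracy: {test_accuracy:.2f}, prediction: {prediction[0]}")
```

Parameter tuning is left out of this sketch; it would mean repeating the train and evaluate steps with different settings.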
Computers are particularly good at discovering hidden patterns in data. This ability is very helpful for researchers who have questions about a given domain that cannot easily be answered by building a conditionally-based rules engine. Given an actuarial task, for example, a data scientist might be able to construct handcrafted rules around the mortality of smokers vs non-smokers.
When many other variables are brought into the equation, however, an ML model might prove more efficient at predicting future mortality rates based on past health history. Another example might be predicting the weather in a given location for the month of April based on data that includes latitude, longitude, climate change, proximity to the ocean, patterns of the jet stream, and more.
✅ This slide deck on weather models offers a historical perspective on using ML in weather analysis.
Before you start building your model, there are several tasks you need to complete. To test your question and form a hypothesis based on the model's predictions, you need to identify and configure several elements.
To answer your question with any kind of certainty, you need a good amount of data of the right type. There are two things to do at this stage:
- Collect data. Keeping in mind the previous lesson on fairness in data analysis, collect your data with care. Be aware of the sources of the data and any inherent biases it might have, and document its origin.
- Prepare data. The data preparation process has several steps. You might need to collate data and normalize it if it comes from diverse sources. You can improve the data's quality and quantity through methods such as converting strings to numbers (as we do in Clustering). You might also generate new data based on the original (as we do in Classification). You can clean and edit the data (as we will before the Web App lesson). Finally, you might need to randomize and shuffle it, depending on your training techniques.
✅ After collecting and processing your data, take a moment to check whether its shape will allow you to address your intended question. It may be that the data will not perform well in your given task, as we discover in our Clustering lessons!
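A few of the preparation steps mentioned above (converting strings to numbers, normalizing, shuffling, and checking the resulting shape) might look like the sketch below. It assumes NumPy only; the pumpkin-style records and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical raw records: a color as a string and a size in centimeters.
colors = np.array(["orange", "white", "orange", "striped", "white", "orange"])
sizes_cm = np.array([30.0, 22.0, 35.0, 18.0, 25.0, 40.0])

# Convert strings to numbers: np.unique returns the sorted categories
# and, for every row, the integer index of its category.
categories, color_codes = np.unique(colors, return_inverse=True)

# Normalize the numeric column to zero mean and unit variance so that
# values from differently scaled sources become comparable.
sizes_scaled = (sizes_cm - sizes_cm.mean()) / sizes_cm.std()

# Join the columns into one table, then shuffle the rows with a seeded
# generator so the shuffle is reproducible.
data = np.column_stack([color_codes, sizes_scaled])
rng = np.random.default_rng(seed=42)
shuffled = data[rng.permutation(len(data))]

# Check the shape: 6 rows, 2 columns.
print(categories, shuffled.shape)
```

Scikit-learn also ships encoders and scalers for these steps; plain NumPy is used here only to keep the arithmetic visible.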
A feature is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your feature variable, usually represented as X in code, is the input variable used to train the model.
A target is the thing you are trying to predict. The target, usually represented as y in code, is the answer to the question you are asking of your data: in December, what color of pumpkin will be cheapest? In San Francisco, which neighborhoods will have the best real estate prices? Sometimes the target is also referred to as the label attribute.
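As a small illustration of how features and a target are usually laid out in code, the sketch below separates a table into X and y. The column names and values are hypothetical and NumPy is assumed.

```python
import numpy as np

# Hypothetical pumpkin records; names and numbers are invented for illustration.
column_names = ["month", "size_cm", "price"]
table = np.array([
    [9, 30.0, 4.5],
    [10, 35.0, 5.0],
    [12, 22.0, 3.0],
    [12, 40.0, 6.5],
])

# Features (X): the input columns, one row per example.
X = table[:, :2]  # month and size_cm
# Target (y): the single column we want to predict.
y = table[:, 2]   # price

print(X.shape, y.shape)
```

X is two-dimensional (rows by features) while y has one value per row, which is the layout most ML libraries expect.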
🎓 Feature Selection and Feature Extraction How do you know which variables to choose when building a model? You will probably go through a process of feature selection or feature extraction to choose the right variables for the best-performing model. They are not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." (source)
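The difference can be made concrete with a short sketch. It assumes scikit-learn and uses its bundled iris dataset; `SelectKBest` with `f_classif` and `PCA` are just one possible choice for each approach, picked here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 rows, 4 original features

# Feature selection: keep the 2 original columns that score highest
# against the target. The kept values are unchanged copies of the input.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 2 new features, each a linear combination
# of all 4 original columns.
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)

print("selected original columns:", kept_columns)
print("shapes:", X_selected.shape, X_extracted.shape)
```

Both results have two columns, but only the selected ones can be traced back to a single original column heading.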
An important part of the data scientist's toolkit is the power to visualize data using libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you uncover bias or unbalanced data (as we discover in Classification).
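A minimal sketch of both uses, spotting a correlation and spotting unbalanced classes, is shown below. It assumes Matplotlib and NumPy are installed; the data is randomly generated and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: price loosely follows size, and the labels are unbalanced.
rng = np.random.default_rng(0)
size = rng.uniform(10, 40, 100)
price = 0.2 * size + rng.normal(0, 1, 100)
labels = np.array(["cheap"] * 85 + ["expensive"] * 15)

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# A scatter plot makes a relationship between two columns visible.
ax_scatter.scatter(size, price)
ax_scatter.set_xlabel("size")
ax_scatter.set_ylabel("price")

# A bar chart of label counts makes unbalanced data visible.
classes, counts = np.unique(labels, return_counts=True)
ax_bar.bar(classes, counts)
ax_bar.set_ylabel("rows")

fig.tight_layout()
correlation = np.corrcoef(size, price)[0, 1]
```

In a notebook you would drop the `Agg` line and display the figure directly; `fig.savefig` writes it to a file instead.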
Before training, you need to split your dataset into two or more parts of unequal size that still represent the data well.
- Training. This part of the dataset is fit to your model to train it. It makes up the majority of the original dataset.
- Testing. A test dataset is an independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
- Validating. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. Depending on the size of your data and the question you are asking, you might not need to build this third set (as we note in Time Series Forecasting).
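One common way to produce the three groups above is to call scikit-learn's `train_test_split` twice. The sketch below assumes a 60/20/20 split, which is an arbitrary choice for illustration, and uses placeholder arrays in place of real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: set aside 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which is 20% of the original data. The rest is the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

The `stratify` argument keeps the class proportions similar in every part, which is one way of making each part "represent the data well".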
Using your training data, your goal is to build a model, or a statistical representation of your data, using various algorithms to train it. Training a model exposes it to data and allows it to make assumptions about the patterns it perceives, which it then validates and accepts or rejects.
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through Scikit-learn's documentation (we use Scikit-learn in this course), you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process in which data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
Armed with your training data, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you find the code 'model.fit'. It is at this point that you pass in your feature variable as an array of values (usually 'X') and a target variable (usually 'y').
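A tiny example of such a call is sketched below, assuming scikit-learn's `LinearRegression`. The four data points are invented and follow y = 2x + 1 exactly, so the fitted values are easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature matrix X: one row per example, one column per feature.
# It stays two-dimensional even with a single feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Target vector y: one value per row of X. Here y = 2x + 1.
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)  # training happens here

print(model.coef_, model.intercept_)
```

After fitting, `model.predict` accepts new rows with the same number of columns as X.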
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you can evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
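One way to print such a table is scikit-learn's `classification_report`, sketched below on the bundled iris dataset. The `KNeighborsClassifier` model and the 30% test split are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep 30% of the rows aside; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Evaluate only on the held-out rows.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)  # per-class precision, recall, f1-score, and support
```

For regression tasks the table would hold different metrics, such as mean squared error.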
🎓 Model fitting
In machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data it is not familiar with.
🎓 Underfitting and overfitting are common problems that degrade the quality of a model, as the model fits either not well enough or too well. This causes the model to make predictions that are either too closely or too loosely aligned with its training data. An overfit model predicts its training data too well because it has learned the data's details and noise too thoroughly. An underfit model is not accurate because it can accurately analyze neither its training data nor data it has not yet 'seen'.
Infographic by Jen Looper
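The two failure modes can be reproduced on a small synthetic problem. The sketch below assumes scikit-learn and uses decision trees of different depths on a noisy sine curve; the data, the depths, and the names are illustrative choices, not the lesson's own example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# depth 1 is too simple (underfit); no depth limit lets the tree
# memorize every training point, noise included (overfit).
depths = {"underfit": 1, "moderate": 3, "overfit": None}
scores = {}
for name, depth in depths.items():
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns R^2; higher is better, 1.0 is a perfect fit.
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name}: train R^2={scores[name][0]:.2f}, test R^2={scores[name][1]:.2f}")
```

The overfit tree scores perfectly on its own training data but noticeably worse on the test half, while the underfit tree scores poorly on both.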
Once your initial training is complete, observe the quality of the model and consider improving it by tweaking its 'hyperparameters'. Read more about the process in the documentation.
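One way to automate this tweaking is a grid search with cross-validation, sketched below with scikit-learn's `GridSearchCV`. The model, the candidate values for `n_neighbors`, and the 5-fold setting are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep a test set out of the search so it can still serve as unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate values for one hyperparameter: the number of neighbors.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each candidate is trained and scored with 5-fold cross-validation
# on the training data only.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

The cross-validation folds play the role of a validation set here, so the test set is only used once, at the end.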
This is the moment when you can use completely new data to test your model's accuracy. In an 'applied' ML setting, where you are building web assets to use the model in production, this process might involve gathering user input (a button press, for example) to set a variable and send it to the model for inference or evaluation.
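A minimal sketch of that flow follows: a value arrives from a user as text, is turned into a feature row, and is sent to a trained model. The function `predict_from_user_input`, the sunlight data, and the choice of `LogisticRegression` are all hypothetical, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: hours of sunlight -> flowered (1) or not (0).
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)


def predict_from_user_input(raw_value: str) -> int:
    """Hypothetical handler: parse text typed by a user and run inference."""
    # The model expects a 2-D array: one row, one feature.
    features = np.array([[float(raw_value)]])
    return int(model.predict(features)[0])


print(predict_from_user_input("7.5"))
print(predict_from_user_input("1.5"))
```

In a real web app the string would come from a form field or request body, and invalid input would need handling before the call to `float`.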
In these lessons, you will discover how to use these steps to prepare, build, test, evaluate, and predict: all the gestures of a data scientist and more, as you progress on your journey to become a 'full stack' ML engineer.
Draw a flow chart reflecting the steps of an ML practitioner. Where do you see yourself right now in the process? Where do you predict you will find difficulty? What seems easy to you?
Search online for interviews with data scientists who discuss their daily work. Here is one.
Disclaimer:
This document was translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that machine translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.

