Commit 839406d

init converted lesson with converted episode files, updated yaml file, and updated image paths
1 parent 3660a7a commit 839406d

89 files changed

Lines changed: 28923 additions & 120 deletions

config.yaml

Lines changed: 14 additions & 6 deletions
@@ -15,32 +15,32 @@
1515
carpentry: 'incubator'
1616

1717
# Alt-text description of the lesson.
18-
carpentry_description: 'Lesson Description'
18+
carpentry_description: 'Beginner-friendly introduction to machine learning in Python using scikit-learn.'
1919

2020
# Overall title for pages.
21-
title: 'Lesson Title' # FIXME
21+
title: 'Introduction to Machine Learning with Scikit-Learn'
2222

2323
# Date the lesson was created (YYYY-MM-DD, this is empty by default)
24-
created: ~ # FIXME
24+
created: '2013-07-13' # FIXME
2525

2626
# Comma-separated list of keywords for the lesson
2727
keywords: 'software, data, lesson, The Carpentries' # FIXME
2828

2929
# Life cycle stage of the lesson
3030
# possible values: pre-alpha, alpha, beta, stable
31-
life_cycle: 'pre-alpha' # FIXME
31+
life_cycle: 'beta' # FIXME
3232

3333
# License of the lesson
3434
license: 'CC-BY 4.0'
3535

3636
# Link to the source repository for this lesson
37-
source: 'https://github.com/carpentries/workbench-template-md' # FIXME
37+
source: 'https://github.com/UW-Madison-DataScience/machine-learning-novice-sklearn-v2'
3838

3939
# Default branch of your lesson
4040
branch: 'main'
4141

4242
# Who to contact if there are any issues
43-
contact: 'team@carpentries.org' # FIXME
43+
contact: 'endemann@wisc.edu' # FIXME
4444

4545
# Navigation ------------------------------------------------
4646
#
@@ -67,6 +67,14 @@ contact: 'team@carpentries.org' # FIXME
6767
# Order of episodes in your lesson
6868
episodes:
6969
- introduction.md
70+
- 02-regression.md
71+
- 03-classification.md
72+
- 04-ensemble-methods.md
73+
- 05-clustering.md
74+
- 06-dimensionality-reduction.md
75+
- 07-neural-networks.md
76+
- 08-ethics.md
77+
- 09-learn-more.md
7078

7179
# Information for Learners
7280
learners:

episodes/01-introduction.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
1+
---
2+
title: Introduction
3+
teaching: 30
4+
exercises: 10
5+
---
6+
7+
:::::::::::::::::::::::::::::::::::::: questions
8+
9+
- What is machine learning?
10+
- What are some useful machine learning techniques?
11+
12+
::::::::::::::::::::::::::::::::::::::::::::::::
13+
14+
::::::::::::::::::::::::::::::::::::: objectives
15+
16+
- Gain an overview of what machine learning is and the techniques available.
17+
- Understand how machine learning, deep learning, and artificial intelligence differ.
18+
- Be aware of some caveats when using machine learning.
19+
20+
::::::::::::::::::::::::::::::::::::::::::::::::
21+
22+
# What is machine learning?
23+
24+
Machine learning is a set of techniques that enable computers to use data to improve their performance in a given task. This is similar in concept to how humans learn to make predictions based upon previous experience and knowledge. Machine learning is "data-driven", meaning that it uses the underlying statistics of a set of data to achieve a task.
25+
26+
Machine learning encompasses a wide range of tasks and activities, but broadly speaking it can be used to: find trends in a dataset, classify data into groups or categories, make predictions based upon data, and even "learn" how to interact with an environment when provided with goals to achieve.
27+
28+
### Artificial intelligence vs machine learning
29+
30+
The term machine learning (ML) is often mentioned alongside artificial intelligence (AI) and deep learning (DL). Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence.
31+
32+
AI is increasingly being used as a catch-all term to describe things that encompass ML and DL systems - from simple email spam filters, to more complex image recognition systems, to large language models such as ChatGPT. The more specific term "Artificial General Intelligence" (AGI) is used to describe a system possessing a "general intelligence" that can be applied to solve a diverse range of problems, often mimicking the behaviour of intelligent biological systems. Modern attempts at AGI are getting close to fooling humans, but while there have been great advances in AI research, human-like intelligence is only possible in a few specialist areas.
33+
34+
ML refers to techniques where a computer can "learn" patterns in data, usually by being shown many training examples. While ML algorithms can learn to solve specific problems, or multiple similar problems, they are not considered to possess a general intelligence. ML algorithms often need hundreds or thousands of examples to learn a task and are confined to activities such as simple classifications. A human-like system could learn much more quickly than this, and potentially learn from a single example by using its knowledge of many other problems.
35+
36+
DL is a particular field of machine learning where algorithms called neural networks are used to create highly complex systems. Large collections of neural networks are able to learn from vast quantities of data. Deep learning can be used to solve a wide range of problems, but it can also require huge amounts of input data and computational resources to train.
37+
38+
The image below shows the relationships between artificial intelligence, machine learning and deep learning.
39+
40+
![An infographic showing some of the relationships between AI, ML, and DL](fig/introduction/AI_ML_DL_differences.png)
41+
The image above is by Tukijaaliwa, CC BY-SA 4.0, via Wikimedia Commons, original source
42+
43+
44+
### Machine learning in our daily lives
45+
46+
Machine learning has quickly become an important technology and is now frequently used to perform services we encounter in our daily lives. Here are just a few examples:
47+
48+
* Banks look for trends in transaction data to detect outliers that may be fraudulent
49+
* Email inboxes use text to decide whether an email is spam or not, and adjust their rules based upon how we flag emails
50+
* Travel apps use live and historic data to estimate traffic, travel times, and journey routes
51+
* Retail companies and streaming services use data to recommend new content we might like based upon our demographic and historical preferences
52+
* Image, object, and pattern recognition is used to identify humans and vehicles, capture text, generate subtitles, and much more
53+
* Self-driving cars and robots use object detection and performance feedback to improve their interaction with the world
54+
55+
::::::::::::::::::::::::::::::::::::: challenge
56+
57+
58+
## Where else have you encountered machine learning already?
59+
Now that we have explored machine learning in a bit more detail, discuss with the person next to you:
60+
1. Where else have I seen machine learning in use?
61+
2. What kind of input data does that machine learning system use to make predictions/classifications?
62+
3. Is there any evidence that your interaction with the system contributes to further training?
63+
4. Do you have any examples of the system failing?
64+
65+
::::::::::::::::::::::::::::::::::::::::::::::::
66+
67+
68+
### Limitations of machine learning
69+
70+
Like any other system, machine learning has limitations, caveats, and "gotchas" that may affect the accuracy and performance of a machine learning system.
71+
72+
#### Garbage in = garbage out
73+
74+
There is a classic expression in computer science: "garbage in = garbage out". This means that if the input data we use is garbage then the output will be too. If, for example, we try to use a machine learning system to find a link between two unlinked variables then it may well manage to produce a model attempting this, but the output will be meaningless.
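As a rough illustration of the point above, the sketch below fits a model to two variables that are unrelated by construction. The random data, the choice of `LinearRegression`, and the seed values are assumptions made for this example only. The model fits without complaint, but its score on held-out data shows that it has learned nothing useful.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: X and y are drawn independently, so there is no real link.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 1))
y = rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The fit succeeds even though there is nothing meaningful to learn.
model = LinearRegression().fit(X_train, y_train)

# R^2 on unseen data: values near zero mean the model explains almost nothing.
r2_test = model.score(X_test, y_test)
print("R^2 on held-out data:", round(r2_test, 3))
```

An R^2 close to zero (it can even be slightly negative) is the numerical signature of a model that produces output without capturing any real relationship.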
75+
76+
#### Biases due to training data
77+
78+
The performance of an ML system depends on the breadth and quality of input data used to train it. If the input data contains biases or blind spots then these will be reflected in the ML system. For example, if we collect data on public transport use from only high socioeconomic areas, the resulting input data may be biased due to a range of factors that may increase the likelihood of people from those areas using private transport rather than public options.
79+
80+
#### Extrapolation
81+
82+
We can only make reliable predictions about data that lies within the same range as our training data. If we try to extrapolate beyond the boundaries of the training data we cannot be confident in our results. As we shall see, some algorithms are better suited (or less suited) to extrapolation than others.
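To make the difference between algorithms concrete, here is a minimal sketch that trains two regressors on the same narrow range and then asks both for a prediction far outside it. The toy relationship `y = 2x`, the training range of 0 to 9, and the query point of 20 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Training data only covers x = 0..9, with the exact relationship y = 2x.
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 2 * X_train.ravel()

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Ask both models about a point well outside the training range.
X_new = np.array([[20.0]])
linear_pred = linear.predict(X_new)[0]
tree_pred = tree.predict(X_new)[0]

print("Linear regression at x=20:", linear_pred)
print("Decision tree at x=20:", tree_pred)
```

The linear model continues the trend it learned, while the decision tree can only return a value it saw during training (here the value for the largest training `x`). Neither behaviour is guaranteed to be right: the linear prediction is only correct because the toy relationship really does continue beyond the training range.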
83+
84+
#### Overfitting
85+
86+
Sometimes ML algorithms become over-trained: they fit the training data so closely that they don't perform well when presented with real data. It's important to consider how many rounds of training an ML system has received and whether or not it may have become over-trained.
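One common way to spot an over-trained model is to compare its score on the training data with its score on data it has never seen. The sketch below is an illustration under stated assumptions: the noisy sine-wave data is made up, and an unrestricted decision tree stands in for "too much training" (model flexibility rather than rounds of training), since it can memorise every training point.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a sine wave plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

# Hold back a quarter of the data that the model never sees while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# With no depth limit the tree can memorise the training set, noise included.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_score = deep_tree.score(X_train, y_train)
test_score = deep_tree.score(X_test, y_test)

print("R^2 on training data:", round(train_score, 3))
print("R^2 on held-out data:", round(test_score, 3))
```

A large gap between the two scores suggests the model has learned the noise in the training data rather than the underlying pattern.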
87+
88+
#### Inability to explain answers
89+
90+
Machine learning techniques will return an answer based on the input data and model parameters even if that answer is wrong. Most systems are unable to explain the logic used to arrive at that answer. This can make detecting and diagnosing problems difficult.
91+
92+
93+
# Getting started with Scikit-Learn
94+
95+
### About Scikit-Learn
96+
97+
[Scikit-Learn](http://github.com/scikit-learn/scikit-learn) is a Python package designed to give access to well-known machine learning algorithms within Python code, through a clean application programming interface (API). It has been built by hundreds of contributors from around the world, and is used across industry and academia.
98+
99+
Scikit-Learn is built upon Python's [NumPy (Numerical Python)](http://numpy.org) and [SciPy (Scientific Python)](http://scipy.org) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is [some work](https://github.com/ogrisel/parallel_ml_tutorial) in this area. For this introduction to ML we are going to stick to processing small to medium datasets with Scikit-Learn, without the need for a graphics processing unit (GPU).
100+
101+
Like any other Python package, we can import Scikit-Learn and check the package version using the following Python commands:
102+
103+
~~~
104+
import sklearn
105+
print('scikit-learn:', sklearn.__version__)
106+
~~~
107+
{: .language-python}
108+
109+
### Representation of Data in Scikit-learn
110+
111+
Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer.
112+
113+
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either NumPy arrays or, in some cases, `scipy.sparse` matrices. The shape of the array is expected to be `[n_samples, n_features]`.
114+
115+
We typically have a "Features Matrix" (usually assigned to the code variable `X`), which holds the feature data we wish to train on.
116+
117+
* n_samples: The number of samples. A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
118+
* n_features: The number of features (variables) that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.
119+
120+
If we want our ML models to make predictions or classifications, we also provide "labels" as our expected "answers/results". The model will then be trained on the input features to try to match our provided labels. This is done by providing a "Target Array" (usually assigned to the code variable `y`), which contains the labels or values that we wish to predict using the feature data.
121+
122+
![Types of Machine Learning](fig/introduction/sklearn_input.png)
123+
Figure from the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
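The features matrix and target array described above can be sketched with a few lines of NumPy. The numbers below are made-up toy values (the targets follow `y = x0 + 2 * x1`), and `LinearRegression` is used only as one example of an estimator that accepts this layout.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features matrix X: 5 samples (rows) described by 2 features (columns).
X = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])

# Target array y: one value per sample, here y = x0 + 2 * x1.
y = np.array([5.0, 3.0, 6.0, 10.0, 10.0])

print("X shape (n_samples, n_features):", X.shape)
print("y shape (n_samples,):", y.shape)

# Estimators take X and y in this layout when fitting.
model = LinearRegression().fit(X, y)

# New data must have the same number of features (columns) as X.
prediction = model.predict(np.array([[6.0, 1.0]]))
print("Prediction for [6.0, 1.0]:", prediction[0])
```

Note that `X` is always two-dimensional, even with a single feature, whereas `y` is usually a one-dimensional array with one entry per row of `X`.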
124+
125+
# What will we cover today?
126+
127+
This lesson will introduce you to some of the key concepts and sub-domains of ML such as supervised learning, unsupervised learning, and neural networks.
128+
129+
The figure below provides a nice overview of some of the sub-domains of ML and the techniques used within each sub-domain. We recommend checking out the Scikit-Learn [webpage](https://scikit-learn.org/stable/index.html) for additional examples of the topics we will cover in this lesson. We will cover topics highlighted in blue: classical learning techniques such as regression, classification, clustering, and dimension reduction, as well as ensemble methods and a brief introduction to neural networks using perceptrons.
130+
131+
![Types of Machine Learning](fig/introduction/ML_summary.png)
132+
[Image from Vasily Zubarev via their blog](https://vas3k.com/blog/machine_learning/) with modifications in blue to denote lesson content.
133+
134+
{% include links.md %}
135+
136+
::::::::::::::::::::::::::::::::::::: keypoints
137+
138+
- Machine learning is a set of tools and techniques that use data to make predictions.
139+
- Artificial intelligence is a broader term that refers to making computers show human-like intelligence.
140+
- Deep learning is a subset of machine learning.
141+
- All machine learning systems have limitations to be aware of.
142+
143+
::::::::::::::::::::::::::::::::::::::::::::::::
