`1_datasets/guide.md`: 320 additions & 0 deletions
@@ -5,3 +5,323 @@ Store your local datasets in this folder (`.csv`, `.xlsx`, `.json`, `.sqlite`, .
One of the primary goals of this repository is that anyone can clone and replicate your research. To make this possible, **DO NOT modify or overwrite your raw datasets**! Keep them _exactly_ as they were when you downloaded them; you may even want to name them `dataset.raw.ext` (e.g. `daily_temperatures.raw.csv`).

When cleaning and processing your datasets, save the prepared data to a _new_ file with a descriptive name. This approach will result in many dataset files, but that's ok!

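As a minimal sketch of the "raw in, new file out" workflow described above: the function name `prepare_dataset`, the cleaning rule (drop rows with empty cells), and the `.clean.csv` suffix are all assumptions made for illustration, not conventions defined by this repository.

```python
import csv
from pathlib import Path


def prepare_dataset(raw_path, prepared_path):
    """Read a raw CSV (assumed to have a header row) and write a cleaned copy.

    The raw file is only ever opened for reading.
    Returns the number of rows that were dropped.
    """
    raw_path, prepared_path = Path(raw_path), Path(prepared_path)

    # Guard: refuse to write on top of the raw dataset.
    if prepared_path.resolve() == raw_path.resolve():
        raise ValueError("prepared_path must be different from raw_path")

    with raw_path.open(newline="") as src:
        rows = list(csv.reader(src))
    header, body = rows[0], rows[1:]

    # Hypothetical cleaning step: keep rows where every cell is filled in.
    cleaned = [row for row in body if row and all(cell.strip() for cell in row)]

    with prepared_path.open("w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(cleaned)

    return len(body) - len(cleaned)
```

You might call it as `prepare_dataset("daily_temperatures.raw.csv", "daily_temperatures.clean.csv")`. The raw file stays byte-for-byte identical, and anyone who clones the repository can re-run the same step.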
## Types of Dataset

A dataset is "simply" a collection of related measurements or observations. To create a good model of your problem using data, you must understand what _kinds_ of data exist, how to interpret them, and the best ways to analyze each one. The kind of data you choose impacts:

- The tools you use for exploration and analysis
- How you visualize the data
- The statistical methods you can apply
- The type of conclusions you can draw
- How confident you can be in your conclusions

Below is an overview of different kinds of dataset you will encounter:

1. [Classification by Data Type](#classification-by-data-type)
2. [Classification by Structure](#classification-by-structure)
3. [Classification by Collection Method](#classification-by-collection-method)
4. [Classification by Size and Complexity](#classification-by-size-and-complexity)
5. [Classification by Access Type](#classification-by-access-type)
6. [Classification by Purpose](#classification-by-purpose)
7. [Classification by Format](#classification-by-format)

### Quantitative (Numerical) Data

Data that represents quantities and can be represented as numbers.

#### Continuous Data

- **Definition**: Can take any value within a range (including fractions and decimals)
- **Examples**: Height, weight, temperature, time, distance
- **Analysis**: Mean, median, standard deviation, histograms, scatter plots
- **Real-world example**: Recording daily temperature over a month (72.5°F, 68.3°F, etc.)

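The summary statistics listed for continuous data can be sketched with Python's standard `statistics` module. The temperature readings below are made-up sample values (only 72.5 and 68.3 come from the example above), and `summarize_continuous` is a hypothetical helper name.

```python
import statistics


def summarize_continuous(values):
    """Return the mean, median, and sample standard deviation of the values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


# Hypothetical daily temperatures in °F.
temperatures = [72.5, 68.3, 70.1, 75.0, 69.1]
summary = summarize_continuous(temperatures)
print(summary)
```

Because continuous values can fall anywhere in a range, averages and spreads are meaningful here. The same calculation would not make sense for categories such as colors.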
#### Discrete Data

- **Definition**: Countable values, typically whole numbers
- **Examples**: Number of children, items sold, count of occurrences
- **Analysis**: Frequency tables, bar charts, mode
- **Real-world example**: Number of customers visiting a store each day (45, 52, 38, etc.)

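A frequency table and the mode for discrete counts can be sketched with the standard library. The visitor counts are invented for illustration, extending the (45, 52, 38) example above.

```python
import statistics
from collections import Counter

# Hypothetical number of customers per day over one week.
daily_customers = [45, 52, 38, 45, 52, 45, 41]

# Frequency table: how many days each count was observed.
frequency = Counter(daily_customers)
for value, days in sorted(frequency.items()):
    print(f"{value} customers: {days} day(s)")

# Mode: the count that occurred on the most days.
most_common_count = statistics.mode(daily_customers)
print("mode:", most_common_count)
```

The sorted frequency table is also exactly what you would feed into a bar chart, with one bar per distinct count.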
### Qualitative (Categorical) Data

Data that describes qualities or characteristics of what you want to study.

#### Nominal Data

- **Definition**: Categories with no inherent order or ranking
- **Examples**: Gender, blood type, country, color, product type
- **Analysis**: Frequency counts, mode, chi-square tests, pie charts
- **Real-world example**: Survey responses for favorite color (red, blue, green, etc.)

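For nominal data, counting is the main tool. The sketch below tallies hypothetical survey responses, finds the mode, and computes a chi-square goodness-of-fit _statistic_ against the assumption that every color is equally popular. It stops at the statistic: turning it into a p-value needs a chi-square distribution, which a statistics library such as SciPy can provide.

```python
import statistics
from collections import Counter


def chi_square_uniform(counts):
    """Chi-square statistic comparing observed counts to equal expected counts."""
    expected = sum(counts.values()) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts.values())


# Hypothetical survey responses for favorite color.
responses = ["red", "blue", "green", "blue", "red", "blue"]

counts = Counter(responses)                # frequency counts per category
favorite = statistics.mode(responses)      # most frequent category
chi_square = chi_square_uniform(counts)    # 0 would mean perfectly even counts

print(counts, favorite, chi_square)
```

Note that there is no mean or median here: the categories have no order, so only counts and comparisons between counts are meaningful.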
#### Ordinal Data

- **Definition**: Categories with a meaningful order or ranking
- **Examples**: Education level, satisfaction ratings (1-5), economic status
- **Analysis**: Median, percentiles, rank correlations, stacked bar charts
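
Because ordinal categories have an order but no fixed distance between them, medians and percentiles are usually safer than means. The sketch below uses the nearest-rank percentile method so that every result is an actual category. The education levels, their ordering, and the responses are assumptions chosen for illustration.

```python
import math

# Hypothetical ordered categories, lowest to highest.
LEVELS = ["high school", "bachelor", "master", "doctorate"]
RANK = {level: position for position, level in enumerate(LEVELS)}


def ordinal_percentile(responses, p):
    """Nearest-rank percentile (0 < p <= 100), returned as a category label."""
    ranks = sorted(RANK[response] for response in responses)
    index = max(math.ceil(p / 100 * len(ranks)) - 1, 0)
    return LEVELS[ranks[index]]


responses = [
    "bachelor", "high school", "master", "bachelor",
    "doctorate", "high school", "bachelor",
]

median_level = ordinal_percentile(responses, 50)
upper_quartile = ordinal_percentile(responses, 75)
print(median_level, upper_quartile)
```

Mapping categories to integer ranks is only a sorting aid here. Averaging those ranks would quietly assume the gap between each pair of levels is equal, which ordinal data does not guarantee.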