To do meaningful research in a domain, you need to learn what others already do
and don't understand in this area. Use this folder to organize your group's
understanding of your research domain including: your own summaries, helpful
PDFs, links you found helpful, ...

This folder is different from `/notes` because it contains _only_ information
about your research domain. When deciding what goes here, ask yourself this
question: _Would someone need to know this to understand our research?_

File: `1_datasets/guide.md` (50 additions, 22 deletions)

# Datasets: Guide

Store your local datasets in this folder (`.csv`, `.xlsx`, `.json`, `.sqlite`,
...). You can use the README to document each dataset (where it's from, what
data & types it contains, what you use it for, ...).

One of the primary goals of this repository is that anyone can clone and
replicate your research. To make this possible, **DO NOT modify or overwrite
your raw datasets**! Keep them _exactly_ as they were when you downloaded them.
You may even want to name them `dataset.raw.ext` (e.g.,
`daily_temperatures.raw.csv`).

When cleaning and processing your datasets, save the prepared data to a _new_
file with a descriptive name. This approach will result in many dataset files,
but that's ok!

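The naming advice above can be sketched in Python. This is a minimal illustration, not part of the repository: the helper names `processed_path` and `write_new_file` are hypothetical, and the sketch assumes raw files follow the `dataset.raw.ext` pattern.

```python
from pathlib import Path


def processed_path(raw_path: Path, label: str) -> Path:
    """Build a sibling file name, swapping the `.raw` marker for `label`.

    `daily_temperatures.raw.csv` with label `cleaned` becomes
    `daily_temperatures.cleaned.csv`.
    """
    stem, marker, ext = raw_path.name.rpartition(".raw")
    if not marker:
        raise ValueError(f"{raw_path.name} does not follow the dataset.raw.ext pattern")
    return raw_path.with_name(f"{stem}.{label}{ext}")


def write_new_file(path: Path, text: str) -> None:
    """Write `text` to `path`, failing if the file already exists."""
    # Mode "x" is exclusive creation: it raises FileExistsError instead of overwriting.
    with path.open("x", encoding="utf-8") as handle:
        handle.write(text)
```

Opening with mode `"x"` is one way to make an accidental overwrite fail loudly; a script that is meant to be re-run would need to delete or rename its earlier output first.
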
## Types of Dataset

A dataset is "simply" a collection of related measurements or observations. To
create a good model of your problem using data, you must understand what
_kinds_ of data exist, how to interpret them, and the best ways to analyze each
one. The kind of data you choose impacts:

- The tools you use for exploration and analysis
- How you visualize the data

Data that represents quantities and can be represented as numbers.

#### Continuous Data

- **Definition**: Can take any value within a range (including fractions and
  decimals)
- **Examples**: Height, weight, temperature, time, distance
- **Analysis**: Mean, median, standard deviation, histograms, scatter plots
- **Real-world example**: Recording daily temperature over a month (72.5°F,
  68.3°F, etc.)

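As a rough sketch of the summary statistics listed for continuous data, the snippet below uses Python's standard `statistics` module. The temperature values are made up for illustration, extending the two readings in the example above.

```python
import statistics

# Hypothetical daily temperatures in °F (illustrative values only).
temperatures = [72.5, 68.3, 70.1, 74.8, 69.9, 71.2, 73.0]

mean_temp = statistics.mean(temperatures)
median_temp = statistics.median(temperatures)
# Sample standard deviation (divides by n - 1).
stdev_temp = statistics.stdev(temperatures)

print(f"mean={mean_temp:.2f} median={median_temp:.2f} stdev={stdev_temp:.2f}")
```
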
#### Discrete Data

- **Definition**: Countable values, typically whole numbers
- **Examples**: Number of children, items sold, count of occurrences
- **Analysis**: Frequency tables, bar charts, mode
- **Real-world example**: Number of customers visiting a store each day (45, 52,
  38, etc.)

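A frequency table and mode for discrete data might be computed as sketched below with `collections.Counter`. The customer counts are hypothetical and only extend the example values above.

```python
from collections import Counter
import statistics

# Hypothetical daily customer counts (illustrative values only).
daily_customers = [45, 52, 38, 45, 60, 52, 45]

# Frequency table: how many days each count occurred, most common first.
frequency = Counter(daily_customers)
for count, days in frequency.most_common():
    print(f"{count} customers: {days} day(s)")

# The mode is the most frequently observed value.
most_common_count = statistics.mode(daily_customers)
```
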
### Qualitative (Categorical) Data

Data that describes qualities or characteristics of what you want to study.

- **Definition**: Categories with no inherent order or ranking
- **Examples**: Gender, blood type, country, color, product type
- **Analysis**: Frequency counts, mode, chi-square tests, pie charts
- **Real-world example**: Survey responses for favorite color (red, blue, green,
  etc.)

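To make the chi-square test mentioned above more concrete, here is a minimal sketch that computes Pearson's chi-square statistic by hand for a goodness-of-fit check. The survey responses and the "all colors equally popular" null hypothesis are assumptions made for illustration; turning the statistic into a p-value requires a chi-square table or a statistics library, which is not shown.

```python
from collections import Counter

# Hypothetical survey responses (illustrative values only).
responses = ["red"] * 18 + ["blue"] * 30 + ["green"] * 12

observed = Counter(responses)
# Null hypothesis assumed here: every color is equally popular.
expected = len(responses) / len(observed)

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())
degrees_of_freedom = len(observed) - 1

print(f"chi-square={chi_square:.2f}, df={degrees_of_freedom}")
```
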
#### Ordinal Data

- **Definition**: Categories with a meaningful order or ranking
- **Examples**: Education level, satisfaction ratings (1-5), economic status
- **Analysis**: Median, percentiles, rank correlations, stacked bar charts

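Because ordinal categories have an order but no numeric distance, one common approach is to map each category to its rank and take the median of the ranks. The sketch below assumes a hypothetical five-level satisfaction scale; the labels and answers are invented for illustration.

```python
import statistics

# Assumed ordering for a hypothetical satisfaction scale, lowest to highest.
LEVELS = ["very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied"]
rank = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical survey answers (illustrative values only).
answers = ["satisfied", "neutral", "very satisfied", "satisfied", "unsatisfied"]

# Convert answers to ranks, then use median_low so the result is always
# an actual category rather than a value halfway between two of them.
ranks = sorted(rank[answer] for answer in answers)
median_level = LEVELS[statistics.median_low(ranks)]
```

Using `median_low` rather than `median` avoids averaging two ranks, which would not correspond to any category on the scale.
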

This folder is for any Python scripts or notebooks you use to clean & prepare
your datasets. These files should:

1. Read in datasets from `0_datasets`.
2. Clean, reformat, or otherwise process the datasets for later.
3. Write the processed dataset into `0_datasets` with a helpful file name.

**DO NOT modify an existing dataset in `0_datasets`! Instead, save your
processed data to a _new_ file.** This is critical to open research: Someone
should be able to clone this repository and run your scripts to replicate your
research. If you modify an original dataset, others cannot replicate your work.

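The three steps above could look roughly like the following sketch. The function name `prepare_dataset` and the cleaning rule (dropping rows with blank or missing cells) are assumptions chosen for illustration; the sketch also assumes a CSV file with a header row.

```python
import csv
from pathlib import Path


def prepare_dataset(raw_path: Path, prepared_path: Path) -> int:
    """Copy complete rows from `raw_path` into a new file at `prepared_path`.

    Returns the number of data rows written. The raw file is only ever read.
    """
    with raw_path.open(newline="", encoding="utf-8") as source:
        reader = csv.reader(source)
        header = next(reader)
        # Example cleaning rule (an assumption): keep only rows that have a
        # non-blank value for every column.
        rows = [
            row
            for row in reader
            if len(row) == len(header) and all(cell.strip() for cell in row)
        ]

    # Mode "x" refuses to overwrite an existing file.
    with prepared_path.open("x", newline="", encoding="utf-8") as target:
        writer = csv.writer(target)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

A script might call this as `prepare_dataset(Path("0_datasets/daily_temperatures.raw.csv"), Path("0_datasets/daily_temperatures.cleaned.csv"))`, where both file names are placeholders.
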

This folder is for any Python scripts or notebooks you use to _explore and
understand_ your datasets. These files should:

1. Read in prepared datasets from `0_datasets`
2. Explore and understand the dataset without running a deep analysis:
   - Generate some visualizations (in a notebook, or in a separate image file
     saved to this folder)
   - Run some descriptive statistics
     (_[beware](https://www.researchgate.net/publication/316652618_Same_Stats_Different_Graphs_Generating_Datasets_with_Varied_Appearance_and_Identical_Statistics_through_Simulated_Annealing)
     the
     [Datasaurus Dozen](https://www.research.autodesk.com/publications/same-stats-different-graphs/)!_)
   - ... let your curiosity guide you, but _avoid_ running any inferential
     statistics or using any machine learning at this stage.

**DO NOT modify an existing dataset in `0_datasets`!** This is critical to open
research: Someone should be able to clone this repository and run your scripts
to replicate your research. If you modify an original dataset, others cannot
replicate your work.

> [Chapter 4 - Exploratory Data Analysis](https://bookdown.org/rdpeng/artofdatascience/exploratory-data-analysis.html)
> from the Art of Data Science is a good starting reference.

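The warning about descriptive statistics can be illustrated with a small sketch. The two samples below are hypothetical and were constructed so that their mean and standard deviation match even though their shapes differ, which is why a visualization is worth generating alongside the numbers.

```python
import statistics

# Two hypothetical samples built to share a mean and standard deviation.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]

for name, sample in [("spread_out", spread_out), ("two_clusters", two_clusters)]:
    mean = statistics.mean(sample)
    # Population standard deviation (divides by n).
    spread = statistics.pstdev(sample)
    print(f"{name}: mean={mean} pstdev={spread:.1f} distinct={sorted(set(sample))}")
```
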

This folder is for any Python scripts or notebooks you use to gain insights from
your data through modeling, inferential statistics, and other analytical
techniques. These files should:

1. Read in prepared datasets from `0_datasets`
2. Learn from your datasets using methods that are appropriate to your research
   question, dataset and team's constraints.

**DO NOT modify an existing dataset in `0_datasets`!** This is critical to open
research: Someone should be able to clone this repository and run your scripts
to replicate your research. If you modify an original dataset, others cannot
replicate your work.

> [Chapters 5-8](https://bookdown.org/rdpeng/artofdatascience) from the Art of
> Data Science are a good starting reference.

This folder is here to organize the communication strategy for your research
findings. You can use it however you like. Your communication artefact doesn't
need to be stored here: it could be a video hosted on YouTube, a social media
campaign, ... Don't constrain yourself to something that can be stored on
GitHub!