Commit 3baef42

types of data, individual rerospectives
1 parent 2803ea1 commit 3baef42

2 files changed

+329
-1
lines changed

1_datasets/guide.md

Lines changed: 320 additions & 0 deletions
@@ -5,3 +5,323 @@ Store your local datasets in this folder (`.csv`, `.xlsx`, `.json`, `.sqlite`, .
55
One of the primary goals of this repository is that anyone can clone and replicate your research. To make this possible **DO NOT modify or overwrite your raw datasets**! Keep them _exactly_ as they were when you downloaded them. You may even want to name them `dataset.raw.ext` (e.g. `daily_temperatures.raw.csv`).
66

77
When cleaning and processing your datasets, you should save the prepared data to a _new_ file with a descriptive name. This approach will result in many dataset files, but that's ok!
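
The workflow above could be sketched as follows. This is only an illustration: the file names, the `temp_f` column, and the cleaning rule (drop rows with a missing temperature) are assumptions, not part of this repository.

```python
import csv
from pathlib import Path


def prepare_dataset(raw_path: Path, prepared_path: Path) -> int:
    """Copy rows with a non-empty `temp_f` value into a new file.

    The raw file is only ever opened for reading. Returns the number
    of rows written to the prepared file.
    """
    if prepared_path.resolve() == raw_path.resolve():
        raise ValueError("refusing to overwrite the raw dataset")

    with raw_path.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        # Hypothetical cleaning rule: keep rows that have a temperature.
        rows = [row for row in reader if (row.get("temp_f") or "").strip()]

    with prepared_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    return len(rows)
```

Calling `prepare_dataset(Path("daily_temperatures.raw.csv"), Path("daily_temperatures.no_missing.csv"))` would leave the raw file untouched and put the cleaned rows in a second, descriptively named file.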
8+
9+
## Types of Dataset
10+
11+
A dataset is "simply" a collection of related measurements or observations. To create a good model of your problem using data, you must understand what _kinds_ of data exist, how to interpret them, and the best ways to analyze each one. The kind of data you choose impacts:
12+
13+
- The tools you use for exploration and analysis
14+
- How you visualize the data
15+
- The statistical methods you can apply
16+
- The type of conclusions you draw
17+
- How confident you can be in your conclusions
18+
19+
Below is an overview of the different kinds of datasets you will encounter:
20+
21+
1. [Classification by Data Type](#classification-by-data-type)
22+
2. [Classification by Structure](#classification-by-structure)
23+
3. [Classification by Collection Method](#classification-by-collection-method)
24+
4. [Classification by Size and Complexity](#classification-by-size-and-complexity)
25+
5. [Classification by Access Type](#classification-by-access-type)
26+
6. [Classification by Purpose](#classification-by-purpose)
27+
7. [Classification by Format](#classification-by-format)

## Classification by Data Type
28+
29+
### Quantitative (Numerical) Data
30+
31+
Data that represents quantities and can be represented as numbers.
32+
33+
#### Continuous Data
34+
35+
- **Definition**: Can take any value within a range (including fractions and decimals)
36+
- **Examples**: Height, weight, temperature, time, distance
37+
- **Analysis**: Mean, median, standard deviation, histograms, scatter plots
38+
- **Real-world example**: Recording daily temperature over a month (72.5°F, 68.3°F, etc.)
39+
40+
#### Discrete Data
41+
42+
- **Definition**: Countable values, typically whole numbers
43+
- **Examples**: Number of children, items sold, count of occurrences
44+
- **Analysis**: Frequency tables, bar charts, mode
45+
- **Real-world example**: Number of customers visiting a store each day (45, 52, 38, etc.)
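
As a rough illustration of the analyses listed for continuous and discrete data, the sketch below uses only Python's standard library. All values are made up for the example.

```python
import statistics
from collections import Counter

# Continuous: daily temperatures in °F (made-up values)
temperatures = [72.5, 68.3, 70.1, 75.4, 69.8]
mean_temp = statistics.mean(temperatures)
median_temp = statistics.median(temperatures)
stdev_temp = statistics.stdev(temperatures)  # sample standard deviation

# Discrete: customers per day (made-up values)
customers = [45, 52, 38, 45, 52, 45]
frequency = Counter(customers)  # frequency table: value -> count
most_common = statistics.mode(customers)

print(f"mean={mean_temp:.2f} median={median_temp} stdev={stdev_temp:.2f}")
print(f"frequency={dict(frequency)} mode={most_common}")
```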
46+
47+
### Qualitative (Categorical) Data
48+
49+
Data that describes qualities or characteristics of what you want to study.
50+
51+
#### Nominal Data
52+
53+
- **Definition**: Categories with no inherent order or ranking
54+
- **Examples**: Gender, blood type, country, color, product type
55+
- **Analysis**: Frequency counts, mode, chi-square tests, pie charts
56+
- **Real-world example**: Survey responses for favorite color (red, blue, green, etc.)
57+
58+
#### Ordinal Data
59+
60+
- **Definition**: Categories with a meaningful order or ranking
61+
- **Examples**: Education level, satisfaction ratings (1-5), economic status
62+
- **Analysis**: Median, percentiles, rank correlations, stacked bar charts
63+
- **Real-world example**: Customer satisfaction ratings (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied)
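
The difference between nominal and ordinal data shows up in which summaries make sense. In this sketch (made-up survey responses) the colors are only counted, while the satisfaction labels are mapped to ranks so that a median can be taken.

```python
import statistics
from collections import Counter

# Nominal: no order, so count categories and report the most frequent one
favorite_colors = ["red", "blue", "green", "blue", "red", "blue"]
color_counts = Counter(favorite_colors)
top_color, top_count = color_counts.most_common(1)[0]

# Ordinal: map labels to ranks so order-based statistics are meaningful
SCALE = ["very dissatisfied", "dissatisfied", "neutral", "satisfied", "very satisfied"]
rank = {label: position for position, label in enumerate(SCALE)}

responses = ["satisfied", "neutral", "very satisfied", "satisfied", "dissatisfied"]
ranks = [rank[response] for response in responses]

# median_low always returns an actual data point, so it maps back to a label
median_label = SCALE[statistics.median_low(ranks)]
```

Averaging the ranks is deliberately avoided here: the gaps between ordinal categories are not guaranteed to be equal, so a mean can be misleading.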
64+
65+
### Binary Data
66+
67+
- **Definition**: Data with only two possible values
68+
- **Examples**: Yes/no questions, pass/fail outcomes, true/false conditions
69+
- **Analysis**: Proportions, odds ratios, logistic regression
70+
- **Real-world example**: Email spam classification (spam/not spam)
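
Proportions and odds ratios for binary data can be computed by hand. The labels and the 2x2 table below are hypothetical numbers chosen for illustration.

```python
def proportion(flags):
    """Share of True values in a sequence of booleans."""
    return sum(flags) / len(flags)


def odds_ratio(exposed_yes, exposed_no, unexposed_yes, unexposed_no):
    """Odds ratio from the four cells of a 2x2 table."""
    return (exposed_yes / exposed_no) / (unexposed_yes / unexposed_no)


# Made-up labels: is each email spam?
is_spam = [True, False, False, True, False, False, False, True]
spam_rate = proportion(is_spam)

# Hypothetical table: emails containing a link (exposed) vs. not,
# split by spam (yes) vs. not spam (no)
ratio = odds_ratio(exposed_yes=30, exposed_no=70, unexposed_yes=10, unexposed_no=90)
```

An odds ratio above 1 would suggest that, in this invented table, emails with a link have higher odds of being spam than emails without one.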
71+
72+
### Time Series Data
73+
74+
- **Definition**: Sequential data points collected at specific time intervals
75+
- **Examples**: Stock prices, weather data, website traffic
76+
- **Analysis**: Trend analysis, seasonal decomposition, forecasting
77+
- **Real-world example**: Monthly sales figures over several years
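
One simple form of trend analysis is a moving average, which smooths out short-term fluctuations. The sketch below assumes a made-up list of twelve monthly sales figures and a three-month window.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Made-up monthly sales figures for one year
monthly_sales = [120, 135, 128, 150, 162, 158, 171, 180, 176, 195, 210, 204]
trend = moving_average(monthly_sales, window=3)
```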
78+
79+
## Classification by Structure
80+
81+
### Structured Data
82+
83+
- **Definition**: Organized in a consistent, predefined format
84+
- **Examples**: Relational databases, spreadsheets, CSV files
85+
- **Characteristics**:
86+
  - Follows a schema
87+
  - Easy to search and analyze
88+
  - Typically stored in rows and columns
89+
- **Real-world example**: Customer information in a CRM database
90+
91+
### Semi-structured Data
92+
93+
- **Definition**: Has some organizational properties but not rigid schema
94+
- **Examples**: JSON, XML, email, HTML
95+
- **Characteristics**:
96+
  - Flexible format
97+
  - Contains tags or markers to separate elements
98+
  - Self-describing structure
99+
- **Real-world example**: JSON response from a web API
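
To analyze semi-structured data with tabular tools, it usually has to be flattened first. The sketch below assumes a hypothetical API response (the field names are invented) in which records do not all share the same keys.

```python
import json

# Hypothetical API response: the two records have different fields
api_response = """
{
  "customers": [
    {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
    {"id": 2, "name": "Grace", "tags": ["vip"]}
  ]
}
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(customer) for customer in json.loads(api_response)["customers"]]

# The union of keys becomes the column set; missing keys are the gaps
# you would need to handle when loading this into a table
columns = sorted({key for row in rows for key in row})
```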
100+
101+
### Unstructured Data
102+
103+
- **Definition**: No predefined format or organization
104+
- **Examples**: Text documents, images, audio, video, social media posts
105+
- **Characteristics**:
106+
  - Difficult to process with traditional tools
107+
  - Often requires specialized techniques (NLP, computer vision)
108+
  - Commonly estimated to make up roughly 80-90% of all data generated
109+
- **Real-world example**: Customer reviews or feedback in free text format
110+
111+
## Classification by Collection Method
112+
113+
### Primary Data
114+
115+
- **Definition**: Collected firsthand for a specific purpose
116+
- **Examples**: Surveys, experiments, interviews, direct observations
117+
- **Advantages**: Tailored to research needs, higher control over quality
118+
- **Disadvantages**: Time-consuming, potentially expensive
119+
- **Real-world example**: Market research survey designed specifically for a new product
120+
121+
### Secondary Data
122+
123+
- **Definition**: Data previously collected for other purposes
124+
- **Examples**: Census data, published studies, company records
125+
- **Advantages**: Cost-effective, time-saving, often larger sample sizes
126+
- **Disadvantages**: May not perfectly fit current research needs
127+
- **Real-world example**: Using government census data for demographic analysis
128+
129+
### [Proxy Data](https://centerforgov.gitbooks.io/benchmarking/content/Proxy.html)
130+
131+
- **Definition**: Data that is used as an indirect stand-in for something that is difficult or impossible to measure directly
132+
- **Examples**: Tree rings to proxy historical weather patterns, tax data to proxy incomes
133+
- **Advantages**: Helps you understand phenomena that are difficult or impossible to study directly.
134+
- **Disadvantages**: You cannot draw conclusions with the same confidence as you could from direct measurements.
135+
- **Real-world example**: Using stock market performance and unemployment rates as a proxy for the state of the economy.
136+
137+
### Experimental Data
138+
139+
- **Definition**: Generated from controlled experiments with manipulated variables
140+
- **Examples**: A/B tests, clinical trials, laboratory experiments
141+
- **Characteristics**:
142+
  - Control and treatment groups
143+
  - Controlled conditions
144+
  - Designed to establish causality
145+
- **Real-world example**: Testing whether a new website design increases conversion rates
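
One common way to analyze an A/B test like the website example is a pooled two-proportion z-test. This guide does not prescribe a specific test, so treat the sketch below as one reasonable option. The visitor and conversion counts are made up.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: A is the current design (control), B is the new design
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z={z:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone, provided visitors were randomly assigned to the two designs.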
146+
147+
### Observational Data
148+
149+
- **Definition**: Collected through observation without direct intervention
150+
- **Examples**: Traffic patterns, wildlife behavior, market trends
151+
- **Characteristics**:
152+
  - Natural setting
153+
  - No manipulation of variables
154+
  - Good for establishing correlation (not causation)
155+
- **Real-world example**: Observing and recording consumer shopping behaviors in a store
156+
157+
## Classification by Size and Complexity
158+
159+
### Small Data
160+
161+
- **Definition**: Datasets manageable with traditional tools and methods
162+
- **Characteristics**:
163+
  - Can fit in the memory of a typical computer
164+
  - Processable with standard software (Excel, SPSS)
165+
  - Usually no larger than a few gigabytes
166+
- **Analysis**: Standard statistical methods, desktop tools
167+
- **Real-world example**: Survey responses from 500 participants
168+
169+
### Big Data
170+
171+
- **Definition**: Datasets too large or complex for traditional processing
172+
- **Characterized by the 5 Vs**:
173+
  - **Volume**: Extremely large size
174+
  - **Velocity**: Generated at high speed
175+
  - **Variety**: Various formats and types
176+
  - **Veracity**: Uncertainty and reliability concerns
177+
  - **Value**: Extracting meaningful insights
178+
- **Analysis**: Specialized tools (Hadoop, Spark), distributed computing
179+
- **Real-world example**: Social media data from millions of users
180+
181+
### High-dimensional Data
182+
183+
- **Definition**: Many variables or features per observation
184+
- **Examples**: Genomic data, image data, readings from complex sensor arrays
185+
- **Challenges**:
186+
  - Curse of dimensionality
187+
  - Feature selection becomes important
188+
  - Difficult to visualize
189+
- **Analysis**: Dimension reduction techniques (PCA, t-SNE), specialized algorithms
190+
- **Real-world example**: Gene expression data with thousands of genes measured for each sample
191+
192+
## Classification by Access Type
193+
194+
### Public Data
195+
196+
- **Definition**: Freely available to anyone
197+
- **Examples**: Government data portals, open datasets, public research data
198+
- **Characteristics**:
199+
  - No access restrictions
200+
  - Often licensed for reuse
201+
  - May have usage guidelines
202+
- **Real-world example**: World Bank development indicators
203+
204+
### Private Data
205+
206+
- **Definition**: Access restricted to authorized users
207+
- **Examples**: Company internal data, personal health records, proprietary research
208+
- **Characteristics**:
209+
  - Security measures required
210+
  - Often subject to privacy regulations
211+
  - May require anonymization for broader use
212+
- **Real-world example**: Patient medical records in a hospital database
213+
214+
### Proprietary Data
215+
216+
- **Definition**: Owned by organizations and often commercially valuable
217+
- **Examples**: Nielsen ratings, credit scores, market research data
218+
- **Characteristics**:
219+
  - Commercial value
220+
  - Legal protections
221+
  - Often licensed for a fee
222+
- **Real-world example**: Credit bureau consumer data
223+
224+
## Classification by Purpose
225+
226+
### Transactional Data
227+
228+
- **Definition**: Records of business or system transactions
229+
- **Examples**: Sales records, banking transactions, server logs
230+
- **Characteristics**:
231+
  - High volume
232+
  - Time-stamped
233+
  - Operation-focused
234+
- **Real-world example**: Point-of-sale data from retail stores
235+
236+
### Analytical Data
237+
238+
- **Definition**: Processed and organized for analysis and decision-making
239+
- **Examples**: Data warehouses, OLAP cubes, aggregated reports
240+
- **Characteristics**:
241+
  - Often derived from transactional data
242+
  - Optimized for querying and analysis
243+
  - May include historical perspectives
244+
- **Real-world example**: Quarterly sales performance dashboard
245+
246+
### Master Data
247+
248+
- **Definition**: Core business entities that rarely change
249+
- **Examples**: Customer database, product catalog, employee records
250+
- **Characteristics**:
251+
  - Reference data
252+
  - Shared across systems
253+
  - Requires governance
254+
- **Real-world example**: Product master list with SKUs, descriptions, and categories
255+
256+
### Metadata
257+
258+
- **Definition**: Data about data
259+
- **Examples**: File creation dates, database schema, data dictionaries
260+
- **Characteristics**:
261+
  - Describes structure and context
262+
  - Essential for data management
263+
  - Facilitates data discovery
264+
- **Real-world example**: Column names and descriptions for a dataset
265+
266+
## Classification by Format
267+
268+
### Tabular Data
269+
270+
- **Definition**: Organized in tables with rows and columns
271+
- **Examples**: CSV, Excel files, database tables
272+
- **Characteristics**:
273+
  - Most common format for analysis
274+
  - Each row is an observation, each column a variable
275+
- **Real-world example**: Excel spreadsheet of monthly expenses
276+
277+
### Hierarchical Data
278+
279+
- **Definition**: Organized in a tree-like structure with parent-child relationships
280+
- **Examples**: XML, JSON, file systems
281+
- **Characteristics**:
282+
  - Nested structure
283+
  - Good for representing complex relationships
284+
- **Real-world example**: Organization chart
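
A hierarchical dataset such as an organization chart can be represented as nested dictionaries and traversed recursively. The roles and the `name`/`reports` keys below are invented for illustration.

```python
# Hypothetical org chart: each node has a name and a list of direct reports
org_chart = {
    "name": "CEO",
    "reports": [
        {"name": "CTO", "reports": [{"name": "Engineer", "reports": []}]},
        {"name": "CFO", "reports": []},
    ],
}


def walk(node, depth=0):
    """Yield (depth, name) pairs depth-first, parents before children."""
    yield depth, node["name"]
    for child in node["reports"]:
        yield from walk(child, depth + 1)


flattened = list(walk(org_chart))

# Indentation makes the parent-child relationships visible
for depth, name in flattened:
    print("  " * depth + name)
```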
285+
286+
### Network Data
287+
288+
- **Definition**: Represents connections between entities
289+
- **Examples**: Social networks, transportation systems, web links
290+
- **Characteristics**:
291+
  - Consists of nodes and edges
292+
  - Focus on relationships
293+
- **Real-world example**: LinkedIn connections network
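
Network data is often stored as a list of edges and converted to an adjacency structure for analysis. The names and connections below are made up, and the connections are assumed to be undirected.

```python
from collections import defaultdict

# Each edge is an undirected connection between two people (made-up data)
edges = [("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dee")]

# Adjacency list: node -> set of neighbours
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree: how many connections each node has
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
most_connected = max(degree, key=degree.get)
```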
294+
295+
### Spatial Data
296+
297+
- **Definition**: Contains geographic or geometric information
298+
- **Examples**: GIS data, maps, satellite imagery
299+
- **Characteristics**:
300+
  - Contains coordinates or shape information
301+
  - Often requires specialized tools
302+
- **Real-world example**: Census data with geographic coordinates
303+
304+
### Temporal Data
305+
306+
- **Definition**: Emphasizes the time dimension
307+
- **Examples**: Time series, event logs, historical records
308+
- **Characteristics**:
309+
  - Time-stamped
310+
  - May show patterns over time
311+
- **Real-world example**: Server logs with a timestamp for each entry
312+
313+
## Key Considerations for Beginners
314+
315+
### Data Quality Assessment
316+
317+
- **Completeness**: Missing values, coverage
318+
- **Accuracy**: Errors, outliers, validity
319+
- **Consistency**: Internal contradictions, logical issues
320+
- **Timeliness**: How recent is the data?
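
Some of these checks can be automated. The sketch below is a starting point rather than a complete audit. It assumes rows loaded as dictionaries of strings (as `csv.DictReader` would produce), and the column names and the plausible temperature range are invented for the example.

```python
from collections import Counter

# Made-up rows with one missing value, one implausible value, one repeated date
rows = [
    {"date": "2024-01-01", "temp_f": "72.5"},
    {"date": "2024-01-02", "temp_f": ""},
    {"date": "2024-01-03", "temp_f": "680.3"},
    {"date": "2024-01-03", "temp_f": "68.3"},
]


def completeness(rows, column):
    """Fraction of rows with a non-empty value in the column."""
    filled = sum(1 for row in rows if (row.get(column) or "").strip())
    return filled / len(rows)


def out_of_range(rows, column, low, high):
    """Rows whose numeric value falls outside [low, high]."""
    flagged = []
    for row in rows:
        raw = (row.get(column) or "").strip()
        if raw and not (low <= float(raw) <= high):
            flagged.append(row)
    return flagged


def duplicated_values(rows, column):
    """Values that appear more than once in a column expected to be unique."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


print(completeness(rows, "temp_f"))
print(out_of_range(rows, "temp_f", low=-80, high=140))
print(duplicated_values(rows, "date"))
```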
321+
322+
### Ethical Considerations
323+
324+
- **Privacy**: Personally identifiable information (PII)
325+
- **Consent**: Was data collected with proper consent?
326+
- **Bias**: Is the sample representative?
327+
- **Transparency**: Can methods and sources be disclosed?

collaboration/retrospective.md

Lines changed: 9 additions & 1 deletion
@@ -10,7 +10,7 @@
1010

1111
## Lessons Learned
1212

13-
______________________________________________________________________
13+
---
1414

1515
## Strategy vs. Board
1616

@@ -21,3 +21,11 @@ ______________________________________________________________________
2121
### Did you need to add things that weren't in your strategy?
2222

2323
### Or remove extra steps?
24+
25+
---
26+
27+
## Individual Retrospectives
28+
29+
### Name
30+
31+
<!-- write a 2-3 sentence reflection on your contributions, challenges and progress in this milestone -->
