Commit c4a09f6

Merge pull request #101 from ihmeuw/develop
Release v0.6.0
2 parents af4eda4 + 68400c5 commit c4a09f6

52 files changed: +1193 -700 lines changed

.github/CODEOWNERS

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
 # default owners
-* @ihmeuw/vivarium-dev
+* @albrja @hussain-jafari @mattkappel @ramittal @rmudambi @stevebachmeier
 /docs/* @ihmeuw/vivarium-research @ihmeuw/vivarium-dev
-*.rst @ihmeuw/vivarium-research @ihmeuw/vivarium-dev
+*.rst @ihmeuw/vivarium-research @ihmeuw/vivarium-dev @zmbc @pletale

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+---
+name: 'Feature request'
+about: Suggest an idea for improving Pseudopeople
+title: '[FEAT] <title>'
+labels: 'enhancement'
+assignees: ''
+---
+
+### Is your proposal related to a problem?
+<!--
+Provide a clear and concise description of what the problem is. For example, "I'm always frustrated when..."
+-->
+
+
+
+### Describe the solution you'd like
+<!--
+Provide a clear and concise description of what you want to happen.
+-->
+
+
+
+### Describe alternatives you've considered
+<!--
+Let us know about other solutions you've tried or researched.
+-->
+
+
+
+### Additional context
+<!--
+Is there anything else you can add about the proposal? You might want to link to related issues here, if you haven't already.
+-->
+

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+name: Bug report
+description: Create a report to help us improve
+body:
+  - type: textarea
+    attributes:
+      label: What happens?
+      description: A short, clear and concise description of what the bug is.
+    validations:
+      required: true
+
+  - type: textarea
+    attributes:
+      label: To Reproduce
+      description: Steps to reproduce the behavior.
+    validations:
+      required: true
+
+  - type: markdown
+    attributes:
+      value: "# Environment (please complete the following information):"
+  - type: input
+    attributes:
+      label: "OS:"
+      placeholder: e.g. iOS
+    validations:
+      required: true
+  - type: input
+    attributes:
+      label: "Pseudopeople version:"
+      placeholder: e.g. 0.5.1
+    validations:
+      required: true

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -131,3 +131,6 @@ dmypy.json
 
 # Local user jupyter notebooks directory
 notebooks/
+
+# Mac OS stuff
+.DS_Store

CHANGELOG.rst

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+**0.6.0 - 04/19/23**
+ - Update documentation (landing page, datasets section, quickstart)
+ - Update zipcode miswriting function to act on each digit independently
+ - Modify config key names
+ - Update sample datasets to include all GQ types
+ - Scale household survey data to account for oversampling
+ - Implement user config value validation
+ - Change the term "Form" to "Dataset" throughout
+ - Update the default config values
+ - Change "american_communities_survey" to "american_community_survey"
+ - Implement config interface and get_config function
+ - Add a github issues template
+
 **0.5.1 - 04/14/23**
  - Formatting of noised dates implemented
  - Moved from pd.NA to np.nan
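
The 0.6.0 changelog entry about zipcode miswriting acting "on each digit independently" can be illustrated with a minimal sketch. This is not the Pseudopeople implementation: the function name `miswrite_zipcode_digits`, the parameter `digit_probability`, and the choice to draw the replacement uniformly from 0-9 are all assumptions made for illustration.

```python
import random


def miswrite_zipcode_digits(
    zipcode: str, digit_probability: float, rng: random.Random
) -> str:
    """Hypothetical sketch: noise each digit of a ZIP code independently.

    Every digit gets its own random draw, so one digit being replaced
    has no effect on whether its neighbours are replaced.
    """
    noised = []
    for char in zipcode:
        if char.isdigit() and rng.random() < digit_probability:
            # Assumption: the replacement is uniform over 0-9, so it can
            # coincide with the original digit.
            noised.append(str(rng.randrange(10)))
        else:
            noised.append(char)
    return "".join(noised)


rng = random.Random(0)  # seeded so repeated runs give the same result
unchanged = miswrite_zipcode_digits("98101", 0.0, rng)
noised = miswrite_zipcode_digits("98101", 0.5, rng)
```

With `digit_probability=0.0` the ZIP code is returned unchanged; with any other value the result has the same length and still consists only of digits.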
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.. automodule:: pseudopeople.configuration.interface

docs/source/concepts/datasets.rst

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+.. _datasets_concept:
+
+==============
+Datasets
+==============
+
+.. contents::
+   :depth: 2
+   :local:
+   :backlinks: none
+
+
+
+
+What is a dataset?
+------------------
+
+A dataset in the Pseudopeople framework contains un-noised simulated data
+representing specific real-life data, e.g., a census survey or tax document.
+The types of datasets that are compatible with the Pseudopeople framework include:
+
+.. list-table:: **Types of Datasets**
+   :header-rows: 1
+   :widths: 20
+
+   * - Name
+   * - | Decennial census
+   * - | American communities survey
+   * - | Current population survey
+   * - | Women, infants, and children survey
+   * - | Social security
+   * - | Tax W2 and 1099 forms
+   * - | Tax 1040 form

docs/source/concepts/forms.rst

Lines changed: 0 additions & 33 deletions
This file was deleted.

docs/source/concepts/noise_functions.rst

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ What is a noise function?
 
 A noise function is ultimately where the configuration (add link) provided is
 applied to the raw data which is then noised or altered and returned to the user
-in a state where real world data error have been added to each form (add link).
+in a state where real world data error have been added to each dataset (add link).
 Noise functions will be applied to datasets by column or by row. There are
 several noise functions that are applied to the raw data which include:
 

docs/source/datasets/index.rst

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
+.. _datasets_main:
+
+========
+Datasets
+========
+Here we cover the realistic simulated datasets, which are analogous to 'real world' administrative records such as tax documents
+and census surveys, that users can generate using :code:`pseudopeople` for developing and testing Entity Resolution algorithms
+and software.
+
+The table below lists the datasets that can be generated. Each row of a given dataset represents
+an individual simulant, with the columns representing different simulant attributes, such as name, age, sex, et cetera.
+
+
+.. contents::
+   :depth: 2
+   :local:
+   :backlinks: none
+
+
+.. list-table:: **Available Datasets**
+   :header-rows: 1
+   :widths: 20
+
+   * - Name
+   * - | US Decennial Census
+   * - | American Communities Survey (ACS)
+   * - | Current Population Survey (CPS)
+   * - | Women, Infants, and Children (WIC) Administrative Data
+   * - | Social Security Administration (SSA) Data
+   * - | Tax W2 and 1099 Forms
+   * - | Tax 1040 Form
+
+
+US Decennial Census
+-------------------
+The Decennial Census dataset is a simulated enumeration of the US Census Bureau's Decennial Census Survey. The years
+that have been simulated are 2020, 2030, and 2040.
+
+The following simulant attributes are included in this dataset:
+
+.. list-table:: **Simulant attributes**
+   :header-rows: 1
+
+   * - Attribute Name
+     - Column Name
+     - Notes
+   * - Unique simulant ID
+     - :code:`simulant_id`
+     - Not affected by noise functions; intended use is 'ground truth' for PRL tracking.
+   * - First name
+     - :code:`first_name`
+     -
+   * - Middle initial
+     - :code:`middle_initial`
+     -
+   * - Last name
+     - :code:`last_name`
+     -
+   * - Age
+     - :code:`age`
+     - Rounded down to an integer.
+   * - Date of birth
+     - :code:`date_of_birth`
+     - Formatted as MM/DD/YYYY.
+   * - Physical address street number
+     - :code:`street_number`
+     -
+   * - Physical address street name
+     - :code:`street_name`
+     -
+   * - Physical address unit
+     - :code:`unit_number`
+     -
+   * - Physical address city
+     - :code:`city`
+     -
+   * - Physical address state
+     - :code:`state`
+     -
+   * - Physical address ZIP code
+     - :code:`zipcode`
+     -
+   * - Relationship to person 1 (head of household)
+     - :code:`relationship_to_household_head`
+     - 'Person 1', 'head of household', and 'Reference person' are all synonymous in this context. Possible values for this indicator include:
+       Reference person; Biological child; Adopted child; Stepchild; Sibling; Parent; Grandchild; Parent-in-law; Child-in-law; Other relative;
+       Roommate; Foster child; and Other nonrelative.
+   * - Sex
+     - :code:`sex`
+     - Binary; 'male' or 'female'.
+   * - Race/ethnicity
+     - :code:`race_ethnicity`
+     - The exhaustive and mutually exclusive categories for the single composite 'race/ethnicity' indicator are as follows:
+       White; Black; Latino; American Indian and Alaskan Native (AIAN); Asian; Native Hawaiian and Other Pacific Islander (NHOPI); and
+       Multiracial or Some Other Race.
+
+Household Surveys: ACS and CPS
+------------------------------
+There are two simulated household survey datasets that can be used: the American
+Communities Survey (ACS) and the Current Population Survey (CPS).
+
+
+.. list-table:: **Simulant attributes**
+   :header-rows: 1
+
+   * - Attribute Name
+     - Column Name
+     - Notes
+   * - Unique simulant ID
+     - :code:`simulant_id`
+     - Not affected by noise functions; intended use is 'ground truth' for PRL tracking.
+   * - Household ID
+     - :code:`household_id`
+     - Not affected by noise functions; intended use is 'ground truth' for PRL tracking.
+   * - First name
+     - :code:`first_name`
+     -
+   * - Middle initial
+     - :code:`middle_initial`
+     -
+   * - Last name
+     - :code:`last_name`
+     -
+   * - Age
+     - :code:`age`
+     - Rounded to nearest integer.
+   * - Date of birth
+     - :code:`date_of_birth`
+     - Formatted as MM/DD/YYYY.
+   * - Physical address street number
+     - :code:`street_number`
+     -
+   * - Physical address street name
+     - :code:`street_name`
+     -
+   * - Physical address unit
+     - :code:`unit_number`
+     -
+   * - Physical address city
+     - :code:`city`
+     -
+   * - Physical address state
+     - :code:`state`
+     -
+   * - Physical address ZIP code
+     - :code:`zipcode`
+     -
+   * - Relationship to person 1
+     - :code:`relationship_to_household_head`
+     - 'Person 1', 'head of household', and 'Reference person' are all synonymous in this context. Possible values for this indicator include:
+       Reference person; Biological child; Adopted child; Stepchild; Sibling; Parent; Grandchild; Parent-in-law; Child-in-law; Other relative;
+       Roommate; Foster child; and Other nonrelative.
+   * - Sex
+     - :code:`sex`
+     - Binary; 'male' or 'female'.
+   * - Race/ethnicity
+     - :code:`race_ethnicity`
+     - The exhaustive and mutually exclusive categories for the single composite 'race/ethnicity' indicator are as follows:
+       White; Black; Latino; American Indian and Alaskan Native (AIAN); Asian; Native Hawaiian and Other Pacific Islander (NHOPI); and
+       Multiracial or Some Other Race.
+
+
+WIC
+---
+
+
+Social Security
+---------------
+
+
+Tax W-2 & 1099
+--------------
+
+
+Tax 1040
+--------