-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathcensusmatchr_algo.html
69 lines (59 loc) · 4.27 KB
/
censusmatchr_algo.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
<!DOCTYPE HTML>
<html>
<head>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-46774844-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<title>Steven Ellis</title>
<link rel="stylesheet" type="text/css" href="self.css">
</head>
<link rel="icon"
type="image/png"
href="/favicon.png">
<link rel="icon"
type="image/png"
href="favicon.png">
<body>
<a href="index.html"><img id="logo" src="ideographic_description_character_overlaid.svg" type="image/svg+xml">
</img></a>
<div id="wrapdiv">
<div id="links"><h2>
<a href="index.html">Profile</a>
<a id="sel" href="research.html">Research</a>
<a href="visualization.html">Visualization</a>
</h2></div>
<br>
<div id="story">
<h1>CensusMatchR - How it Works</h1>
<p>Most comparisons you make in statistics are looking for differences. Odds are that every time you've heard something about "statistical significance", the "significant" aspect was that one group of people was <i>different</i> from another in some way.
For a regional research project, I needed to do the opposite - find regions of the United States where people were exactly the same as my group's target region for launching our new service. To use traditional methods would have meant comparing the average, aggregate-level measurements of one geographic region to another, looking for regions that were merely <i>not</i> statistically different from one another, as below:</p>
<div class="image"><img src="old_method.png" width="500"></div>
<p>However, I wanted to look for similarity, and I wanted to do it on an individual level, not at the aggregate level as in the above example. I found a paper describing such a method in the context of neuroscience research <a href="http://www-stat.wharton.upenn.edu/~rosenbap/crossmatch.pdf">(Rosenbaum, 2005)</a>, and intuited that I could apply this approach to the census data I'd worked with in my think-tank days, as below:</p>
<div class="image"><img src="new_method.png" width="500"></div>
<p>Each circle represents a Census respondent. With the lines connecting respondents, we're looking to make the closest connection possible - even if that means the connections end up being within the same geographic region. We use non-bipartite matching, as it doesn't bind you to the assumption of even connections between groups. The more connections that are within groups, as opposed to between them, the less likely it is that they're exact matches. The below image shows this matching occurring in one dimension:</p>
<div class="image"><img src="nonbipartite_matching.png" width="500"></div>
The algorithm I produced (<a href="https://github.com/stevenae/CensusMatchR">Github</a>):
<ol>
<li>takes in census data (the <a href="http://en.wikipedia.org/wiki/American_Community_Survey">ACS</a>)</li>
<li>takes in your target zip code(s)</li>
<li>matches every individual from your target zip code up against one of the ~41,000 other zip codes in the US</li>
<li>determines whether the groups are statistically <i>the same</i> and saves that answer, along with a rank-able number measure the degree of similarity</li>
<li>repeats this for every single zip code in the country</li>
</ol>
<p>All of this can be done in 9 lines of R, inluding specifying all of the necessary file locations:</p>
<script src="https://gist.github.com/stevenae/8186962.js"></script>
<p>You end up with a list of every zip code that is statistically the same as your target zip code, along with a score indicating how close to perfect the match is.
As you prepare to use this algorithm for a project, you can set a lower sampling rate to speed up algorithm completion, raising it when you want your final results.</p>
<p>I've also provided code (<a href="https://github.com/stevenae/CensusMatchR/blob/master/ama_art_map.R">Github</a>) to easily visualize the results of CensusMatchR's output, as below (dark dots are exact matches to Chicago):
</p>
<div class="image"><img src="ama_art_map.png" width="500"></div>
</div>
</body>
</html>