What is the name of your project?
Comparing the accuracy and speed of Match*Pro, fastLink, and splink at the Florida Cancer Data System
What is the purpose of your project?
Use the pseudopeople Python package to compare the accuracy and speed of Match*Pro, fastLink, and splink for linkage data requests at the Florida Cancer Data System (FCDS). Medium-to-large linkages at the FCDS mean input datasets of approximately 250,000 × 4,000,000 records, which in size corresponds to the simulated "Rhode Island" pseudopeople population of 1,000,000 people. The FCDS reports to the Florida Department of Health (FDOH).
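To give a rough sense of why speed matters at this scale, the arithmetic can be sketched as follows. This is only an illustration: it assumes the figures above describe one file of about 250,000 records linked to another of about 4,000,000 records, and the `candidate_pairs` helper with its evenly sized blocks is a hypothetical simplification, not how any of the three packages actually blocks.

```python
# Sizes taken from the description above; reading them as
# "file A linked to file B" is an assumption.
N_REQUEST = 250_000
N_REGISTRY = 4_000_000

# Without blocking, every record in one file is compared
# with every record in the other.
full_pairs = N_REQUEST * N_REGISTRY


def candidate_pairs(n_a: int, n_b: int, n_blocks: int) -> int:
    """Pairs left if both files split evenly into n_blocks blocks.

    Hypothetical simplification: real blocking keys produce
    blocks of very uneven size.
    """
    return (n_a * n_b) // n_blocks


print(f"All pairs:           {full_pairs:,}")
print(f"With 10,000 blocks:  {candidate_pairs(N_REQUEST, N_REGISTRY, 10_000):,}")
```

Under these assumptions, the unblocked comparison space is on the order of a trillion record pairs, so the blocking strategy is likely to affect the measured speed of each package.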
The FCDS uses Match*Pro for the NAACCR Virtual Pool Registry Cancer Linkage System (VPR-CLS) "Phase 1" linkages, which do not use clerical review. In contrast, "VPR-CLS Phase 2" linkages use clerical review for greater accuracy. The FCDS uses fastLink for regular linkage data requests, including but not limited to "VPR-CLS Phase 2" linkages, because fastLink was more accurate in prior FCDS testing. The FCDS recently used preliminary "Rhode Island"-sized data from Abraham Flaxman, which was very helpful in determining that splink is a feasible alternative to fastLink.
Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?
Anders Alexandersson - Senior Research Associate (FCDS) - Direct access (main person to work with the data)
Brad Wohler - Manager of Statistics (FCDS) - Direct access
David Lee - Project Director and Principal Investigator (FCDS) - Direct access
Gary Levin - Deputy Project Director (FCDS) - Direct access
Mark Rudolph - Manager of Computers/Systems Programmer (FCDS) - Direct access (for security reasons)
Heather Lake-Burger - Registries and Surveillance Administrator (FDOH) - NO direct access (will receive the report with findings)
Contact info at https://fcds.med.miami.edu/inc/staff.shtml.
What funding is the project under? What expectations with respect to open access and access to data come with that funding?
The FCDS is funded by FDOH and the Centers for Disease Control and Prevention’s National Program of Cancer Registries (CDC-NPCR). The project is funded by FDOH (Contract CODJU) and CDC through the NPCR (DP003872-04).
The funding does not come with stated expectations with respect to open access and access to data. However, the end goal of the FCDS is to have open access (public) data for fully transparent comparisons of the accuracy and speed of probabilistic record linkage software using simulated (artificial) data. Therefore, if the project is successful, it is realistic to expect that the FCDS and FDOH would like 1) to share the findings in a report with the Match*Pro, fastLink, and splink developers, and 2) for the report to include some individual-level data, for example a listing of the first or last 5 records.
We commit to:
- be responsive to further questions from interested parties
- deprecate and replace our version of the pseudopeople input data when a new version is released
What data would you like to request?
- Full US
- Rhode Island
- Other (may not be available immediately)
Other data - more explanation
The FCDS needs more noise added to the already provided "noisy" pre-pseudopeople "Rhode Island"-level data from Abraham Flaxman, in at least two ways:
- The major limitation of the provided data for the FCDS is that the simulated Social Security Number (SSN) has no errors, only 15% missingness. The FCDS needs to compare partial SSN matches using the Damerau-Levenshtein string distance in the three software packages, because the FCDS often has incomplete access to SSN data and SSN values contain, say, 3-5% noise errors, such as typos (including transpositions) and fake or wrong use, such as the SSN of a family member.
- The FCDS also needs some noise in date of birth, which currently has none in the provided "noisy" dataset. That is not realistic.
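To make the SSN request concrete, here is a minimal sketch of the kind of noise and partial matching described above. It is not pseudopeople's noise model, and it is not the implementation used by Match*Pro, fastLink, or splink. The function names (`osa_distance`, `transpose_adjacent`, `add_transpositions`) are hypothetical, and the distance shown is the optimal string alignment variant of Damerau-Levenshtein, which counts a swap of two adjacent characters as a single edit.

```python
import random


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "56" typed as "65".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def transpose_adjacent(value: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_transpositions(values, rate, rng):
    """Apply a transposition to roughly `rate` of the values (illustrative only)."""
    return [transpose_adjacent(v, rng) if rng.random() < rate else v for v in values]


# Obviously fake SSN, used only for illustration.
clean = "123456789"
print(osa_distance(clean, "123465789"))  # one transposed pair -> 1
print(osa_distance(clean, "123456780"))  # one substituted digit -> 1
```

With noise like this at a rate of, say, 0.03 to 0.05, an exact SSN comparison would treat the affected records as non-matching on SSN, while a distance threshold of 1 would still count them as partial matches.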
Update: We are using Python 3.11. Currently, pseudopeople is not compatible with Python 3.11.