
Module 3A: Advanced Cleaning with Internet Archive Data

Scotty Carlson edited this page Mar 12, 2019 · 15 revisions

If you've been able to work through the previous two modules, nice work. This module isn't mandatory, but it is recommended for people who are interested in more advanced data matching and cleaning.

As mentioned earlier, Refine’s reconciliation function was actually designed to match local data to external "linked data", so that URIs (uniform resource identifiers) could be pulled out and added to local data. URIs represent the “linked” part of Linked Data -- the most common form of a URI is a URL. By pulling URIs out of this process and adding them to local data, Refine adds an extra dimension to what was once static data.

But what if we didn't care about enriching data with URIs? What if we just cared about matching values in our data, and using non-matched data to clean up or analyze what's already there?

The Internet Archive

If you're a Deadhead, I probably don't need to explain the Internet Archive/Live Music Archive to you. For everyone else, read on!

Created in 2004, the Internet Archive's Live Music Archive hosts thousands of audience and soundboard recordings of Grateful Dead concerts. The collection is a natural extension of the band's openness to fans taping shows. After more than 10 years, the IA has posted 13,610 Grateful Dead recordings (as of March 2019). Of course, there is significant overlap, since many different tapes exist for the same shows -- especially after 1984, the year the Dead officially sanctioned fan taping by giving tapers their own section.

As with the other collections of the Internet Archive, all of the metadata for these shows is searchable and downloadable through either the IA's REST API or its advanced search page. In the interest of saving time, I have already downloaded information on all of the LMA's Grateful Dead recordings (circa mid-2016). The resulting data (which can be found in the file IA-Data.csv) contains the following Internet Archive metadata fields:

collection
creator
date of concert
description
identifier
title
year of concert
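
For anyone curious how a download like this might be requested, here is a minimal sketch that only builds an advanced search URL and does not contact the site. The endpoint path, the collection name GratefulDead, the csv output option, and the short field names date and year are all assumptions on my part, so check them against the Internet Archive's own documentation before relying on them. build_search_url is a hypothetical helper, not part of any IA library.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Internet Archive's documentation.
SEARCH_URL = "https://archive.org/advancedsearch.php"

# Assumed field names corresponding to the metadata fields listed above.
FIELDS = ["collection", "creator", "date", "description",
          "identifier", "title", "year"]


def build_search_url(collection="GratefulDead", rows=100, page=1):
    """Return a search URL asking for FIELDS from one collection (hypothetical helper)."""
    params = [("q", f"collection:{collection}")]
    # One fl[] parameter per requested field.
    params += [("fl[]", field) for field in FIELDS]
    params += [("rows", rows), ("page", page), ("output", "csv")]
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url(rows=50))
```

Because the function returns a plain string, you can paste the result into a browser to see what comes back before scripting anything further.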

So what is there for us to do with this data? We could upload it and facet on year to find out which years have the highest concentration of uploads... But we could just as easily look at the IA site and tell that as well.
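
If you would rather count uploads per year outside of Refine, a rough equivalent of a year facet might look like the sketch below. The column name year and the three sample rows are invented for illustration; the real header names in IA-Data.csv may differ.

```python
import csv
import io
from collections import Counter


def uploads_per_year(csv_file, year_column="year"):
    """Count rows per year, skipping rows where the year is missing or blank."""
    reader = csv.DictReader(csv_file)
    return Counter(
        row[year_column].strip()
        for row in reader
        if row.get(year_column)
    )


# Invented sample rows standing in for IA-Data.csv.
sample = io.StringIO(
    "identifier,date,year\n"
    "gd-show-a,1977-05-08,1977\n"
    "gd-show-b,1977-05-08,1977\n"
    "gd-show-c,1989-07-07,1989\n"
)

counts = uploads_per_year(sample)
for year, n in counts.most_common():
    print(year, n)
```

To run this against the real file, pass an open file object instead of the sample, for example open("IA-Data.csv", newline="", encoding="utf-8").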


A data collection of this scale calls for something more involved and unique. What if we wanted to see what percentage of known Dead concert dates are represented with recordings in the IA?

Such a project would likely proceed as follows:

1. Identify a baseline source of data on Grateful Dead concerts that we will use as the basis of our reconciliation.

2. Using that source, match up the date values in our IA data and clean any errors or discrepancies using Refine.

3. With the IA data cleaned, export the unique dates of represented concerts and calculate a percentage of coverage.
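
The arithmetic behind the last step is simple enough to sketch ahead of time. This is only an illustration of the calculation, assuming both sources express dates as YYYY-MM-DD strings; the dates below are made-up sample values, not verified concert dates, and coverage_percentage is a hypothetical helper.

```python
def coverage_percentage(baseline_dates, recorded_dates):
    """Return (percent of baseline dates with a recording, unmatched recorded dates)."""
    baseline = set(baseline_dates)
    if not baseline:
        raise ValueError("baseline_dates must not be empty")
    recorded = set(recorded_dates)  # collapses multiple tapes of the same show
    matched = baseline & recorded
    # Recorded dates absent from the baseline are candidates for cleanup.
    unmatched = sorted(recorded - baseline)
    return 100 * len(matched) / len(baseline), unmatched


# Made-up sample values for illustration only.
baseline = ["1977-05-07", "1977-05-08", "1977-05-09", "1977-05-11"]
recorded = ["1977-05-08", "1977-05-08", "1977-05-09", "1977-05-10"]

pct, unmatched = coverage_percentage(baseline, recorded)
print(f"{pct:.1f}% of baseline dates have at least one recording")
print("Dates needing review:", unmatched)
```

Using sets means the duplicate tapes of a single show count once, and the unmatched list gives you the date values that most likely need cleaning before the percentage can be trusted.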

To get started, we'll need to find our baseline source of Grateful Dead concert data.

