Module 2D: IMDB Data (Extending and Enriching Data)
We've learned how to parse our data, and before that, we learned how to clean and normalize it. But is that all we can do with our data?
Of course not. There's always room for improvement. In this section, we're going to learn how to further normalize our data, as well as extend it. And we're going to do that by reconciling our data.
From OpenRefine's GitHub wiki:
Reconciliation is a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgement is essential. For example, given "Ocean's Eleven" as the name of a film, should it be matched to the original 1960 "Ocean's Eleven" (Freebase entry), or the 2001 remake "Ocean's Eleven" starring George Clooney (Freebase entry)?
While the wiki talks about databases, a more accurate term would be "name registries" -- in libraries, we call them authority files. Library data is described almost entirely using names (of people, places, organizations, etc) and conceptual topics (subject headings), so to keep track of everything, we rely on name authorities and subject authorities, respectively. The Library of Congress has its own Name Authority and Subject Authority files; there are others as well, such as the Virtual International Authority File.
Nowadays, as libraries move toward linked data, the concept of an authorized form of a name or subject heading can be expressible as a Uniform Resource Identifier (URI) -- that is, a permanent link that signifies that concept. The reason these name registries (and permanent URIs signifying them) are so important is so that when we tag or reference Jerry Garcia in our data, we are sure that we mean the former member of the Grateful Dead, and not someone else with a similar name.
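The core idea can be sketched as a lookup against a toy name registry. This is only an illustration of the concept: the entries below use Wikipedia links as stand-in URIs, not real authority records, and real reconciliation scores fuzzy candidate matches rather than doing an exact lookup.

```python
# A toy "name registry": authorized names mapped to permanent URIs.
# These entries are illustrative stand-ins, not real authority records.
REGISTRY = {
    "Jerry Garcia": "https://en.wikipedia.org/wiki/Jerry_Garcia",
    "Mickey Hart": "https://en.wikipedia.org/wiki/Mickey_Hart",
}

def reconcile(name):
    """Return the registry URI for a name, or None if no match is found.

    Real reconciliation is fuzzier than an exact dictionary lookup:
    it ranks candidate matches and may need a human to pick one.
    """
    return REGISTRY.get(name.strip())

print(reconcile("Jerry Garcia"))   # -> https://en.wikipedia.org/wiki/Jerry_Garcia
print(reconcile("Jerry Garia"))    # -> None (no authorized form on file)
```

The payoff is the second case: a name with no authorized form comes back empty instead of silently matching the wrong person.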
Right now, we have static data in the actor, director, and writer columns, so let's extend them by matching them to an external name registry and giving them permanent URIs.
NOTE: if you're using OpenRefine 3.0 or later, skip this section, as the Named Entity Recognition extension will not run with newer Refine builds.
One would think that we would start with the Reconciliation function that can be found in the column dropdowns. Right? Well, not really. We're going to start by installing an open-source extension that will analyze our data using named-entity recognition.
Start by downloading the ZIP package found here. Once downloaded, unzip it; you should find a folder entitled named-entity-recognition.
Then, in the opening Refine window (remember, you can always get there by clicking on the Refine logo in the upper left), click on Browse workspace directory in the lower lefthand corner.
This should open up the local folder on your hard drive where Refine data is stored. Here's what mine looks like:
If there isn't a folder called 'extensions', then create one. Put the folder from the ZIP file into this directory.
You'll have to restart Refine to see the change take effect. Once you restart, you should see a new button in the upper righthand corner:
Click the new button and select Configure Services. A new window will open with a selection of external N-ER services.
Unfortunately, for most, you'll need a special API key to access the N-ER services. So let's get one.
The easiest N-ER key to get is from Dandelion, formerly known as dataTXT. Click here to see Dandelion's entity recognition in action. Otherwise, to get an API key, go here and register for a free account. Once you register, you'll see a page with your App ID and App Key codes. Paste them into the N-ER window in Refine and click Update.
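Behind the scenes, the extension sends your text and credentials to a web API. The sketch below only builds the request URL, without sending anything; the endpoint and parameter names are assumptions based on Dandelion's public Entity Extraction (datatxt/nex) API and may differ from what your version of the service expects, so check Dandelion's own documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed Dandelion Entity Extraction endpoint; verify against their
# current docs (newer accounts may use a single "token" credential).
NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1/"

def build_nex_url(text, app_id, app_key, min_confidence=0.6):
    """Build a request URL for Dandelion's entity-extraction service.

    app_id / app_key are the credentials shown on your account page.
    Nothing is sent over the network here; we only construct the URL.
    """
    params = {
        "text": text,
        "$app_id": app_id,       # parameter names are assumptions
        "$app_key": app_key,
        "min_confidence": min_confidence,
    }
    return NEX_ENDPOINT + "?" + urlencode(params)

url = build_nex_url("Mickey Hart played drums.", "MY_APP_ID", "MY_APP_KEY")
```

The service replies with JSON listing each recognized entity, a confidence score, and a link -- which is what the extension turns into the matched column you'll see below.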
Now we're ready to start looking for Named Entities.
The data we'll be reconciling is the Actor column we parsed from the JSON, so be sure all multi-value cells are split apart (the JSON data is separated with a comma). When you're ready, click the dropdown box and select Extract Named Entities.
Select dataTXT in the Extract window. Since some of the JSON results included the value "N/A" for records without actors, the only change we'll make to the default settings is to set the minimum length of words to analyze to 4 characters. Click Start. The extraction will run for a while.
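The two preparation rules above -- split multi-value cells on commas, and skip values shorter than 4 characters so the "N/A" placeholder is ignored -- can be mimicked like this. This is a sketch of the logic, not what the extension actually runs:

```python
def prepare_actor_values(cell, min_length=4):
    """Split a multi-value Actor cell on commas and drop short values.

    The 4-character minimum matches the setting chosen above; it
    filters out the "N/A" placeholder found in some JSON results.
    """
    names = [name.strip() for name in cell.split(",")]
    return [name for name in names if len(name) >= min_length]

print(prepare_actor_values("Mickey Hart, Bruce Hornsby, N/A"))
# -> ['Mickey Hart', 'Bruce Hornsby']
```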
When finished, you should see a new column next to the Actor data. This is the result of the process, where the extension has automatically analyzed your data, found (what it believes to be) the correct entity, and created a result that links to a Wikipedia page. For example, the N-ER has found correct records for Mickey Hart, Bruce Hornsby, and Jeff Chimenti.
Clicking on Mickey's result takes us to his Wikipedia page:
(Note: Wikipedia should not be considered a permanent identifier for anything right now -- this is just an example.)
Sometimes, this N-ER analysis works very well; other times, not so much. In this instance, the name of anthropologist Wade Davis has been split into two unrelated names. The problem here was most likely the three records in Wikipedia for people named Wade Davis:
In another, sound technician Jimmy 'Coach' Armstrong (featured in the documentary A Warehouse on Tchoupitoulas) has been split into a subject and a name, likely because there is no Wikipedia page for him:
These results are a reminder that Named-Entity Recognition is imperfect and requires human intervention. At the same time, remember that when we set up the dataTXT search, we left the Confidence level at 0.6 -- which means that if the N-ER extension and the dataTXT results are 60 percent sure they have found the correct entity, that entity is automatically selected as the match. Raising the Confidence level will prevent many of these false positives.
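The effect of that threshold can be sketched with a simple filter over some hypothetical results (the names, links, and confidence scores below are made up for illustration):

```python
def accept_matches(candidates, threshold=0.6):
    """Keep only entity matches at or above the confidence threshold.

    candidates: list of (name, uri, confidence) tuples -- hypothetical
    results shaped loosely like what an N-ER service returns. Raising
    the threshold trades false positives for more unmatched cells.
    """
    return [(name, uri) for name, uri, conf in candidates if conf >= threshold]

results = [
    ("Mickey Hart", "https://en.wikipedia.org/wiki/Mickey_Hart", 0.95),
    ("Wade Davis", "https://en.wikipedia.org/wiki/Wade_Davis", 0.61),
]
print(accept_matches(results, threshold=0.6))  # both pass at 0.6
print(accept_matches(results, threshold=0.9))  # only Mickey Hart survives
```

Note the trade-off: at 0.9 the shaky Wade Davis match disappears, but so would any genuine match the service was merely unsure about.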
To remove false positive results, click on 'Choose New Match' for each cell, and then click on the 'Edit' button that appears in its corner. As before, you can delete the entry and leave it blank, cancelling the match.
We're not done just yet. The 'dataTXT' column is itself ephemeral data -- we need to extract the links each match represents. When you've finished double-checking your matches and removing false positives, click on the dropdown for 'dataTXT' and add a column based on this column. Name the new column 'ActorIDs' and use this GREL expression:
cell.recon.match.id
The transform window's preview should show that, for any cell with an N-ER result, a Wikipedia link (standing in for a URI) is placed into the new column.
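In plainer terms, the GREL expression walks into the cell's reconciliation data and pulls out the matched entity's identifier. A rough Python equivalent looks like this; the dict layout is a hypothetical stand-in for Refine's internal cell object, not its actual storage format:

```python
def extract_match_id(cell):
    """Mimic GREL's cell.recon.match.id: return the matched entity's
    identifier, or None when the cell has no reconciliation match.

    The nested-dict layout is a hypothetical stand-in for Refine's
    internal cell object, used here only to show the traversal.
    """
    recon = (cell or {}).get("recon") or {}
    match = recon.get("match") or {}
    return match.get("id")

matched = {"value": "Mickey Hart",
           "recon": {"match": {"id": "https://en.wikipedia.org/wiki/Mickey_Hart"}}}
print(extract_match_id(matched))           # -> the Mickey Hart link
print(extract_match_id({"value": "N/A"}))  # -> None (unmatched cell)
```

This is why unmatched cells simply come out blank in the new ActorIDs column: there is no match to take an id from.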
You can try using the named-entity recognition matching function on the Writer and Director data we parsed, but the real power of N-ER lies in its ability to analyze large amounts of text for recognized entities. On Dandelion's website, users can try their Entity Extraction demo with user-submitted text. Here is what happened when I entered the first two paragraphs of the Wikipedia article on Workingman's Dead for analysis:
To see the full power of N-ER on untagged data for yourself, try this:
- Parse the JSON column to move the Plot summary data to a new column.
- Set the N-ER match with either a high or low confidence rating.
- Click start and watch the tags fly.
End of Module 2.