Module 1E: ASCAP Data (Further Cleaning and Enriching)
So, does this mean we're done? If you want, we could be. But I'm kind of picky, and I'm not ready to call it quits yet.
Earlier, I wrote that the power of Refine lies within its Facet capabilities. This was actually only partly true; the full power lies with Clustering, a tool you can access on facets.
Clustering is Refine’s way of comparing the data in a column against itself to look for inconsistencies. Refine uses two methods (Key Collision and Nearest Neighbor) with different functions to look for inconsistencies in the data. These are a lot of fun to play around with, but for our purposes, we will stick with the default Cluster method, Key Collision/Fingerprint, which is designed to give as few false-positive results as possible.
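To get a feel for what a key-collision method does, here is a rough Python sketch of the idea. It is a simplified approximation, not Refine's actual code: the `fingerprint` and `cluster` functions and the sample titles are made up for illustration. Each value is boiled down to a normalized key (trimmed, lowercased, punctuation removed, words sorted and de-duplicated), and values that end up with the same key are grouped together.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified approximation of a fingerprint key (not Refine's own implementation)."""
    # Trim surrounding whitespace and lowercase everything
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the words, then rejoin
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by key; keep only keys shared by more than one distinct spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Hypothetical titles, for illustration only
titles = ["Blue Moon", "Moon, Blue", "blue moon", "Blue Monday"]
clusters = cluster(titles)
print(clusters)
```

Because the key only ignores case, punctuation, and word order, two values have to be nearly identical to collide, which is why this method produces so few false positives.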
Let's start by separating the data in the 'Writers' column. Select Edit Cells > Split Multi-Valued Cells and tell Refine to look for the double-pipe.
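If you're curious what that split does to the data, here is a small Python sketch under a few assumptions: rows are plain dictionaries, the sample songs and writers are invented, and the `split_multi_valued` helper is hypothetical. It mimics the way the extra values land on new rows beneath the original, with the other columns left blank.

```python
def split_multi_valued(rows, column, separator="||"):
    """Sketch of splitting one multi-valued cell into several rows."""
    out = []
    for row in rows:
        parts = str(row[column]).split(separator)
        for i, part in enumerate(parts):
            if i == 0:
                # The first value stays on the original row
                new_row = dict(row)
            else:
                # Later values get their own rows; other columns are left blank
                new_row = {key: None for key in row}
            new_row[column] = part
            out.append(new_row)
    return out


# Hypothetical rows, for illustration only
rows = [
    {"Title": "Example Song", "Writers": "WRITER ONE||WRITER TWO"},
    {"Title": "Another Song", "Writers": "WRITER THREE"},
]
split_rows = split_multi_valued(rows, "Writers")
for r in split_rows:
    print(r)
```

With one writer per row, a Text Facet on the column counts each name on its own instead of treating every combination of co-writers as a separate value.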
Now create a Text Facet on the column. In the upper corner of the facet window, click the 'Cluster' button.
The clustering window that comes up should be empty. Why is this? It's because there aren't any overt inconsistencies between songwriter names, which suggests that ASCAP is using a fairly robust set of name authorities. (Good for them!)
What happens if we close that facet and cluster a new one on 'Title'?
We are seeing inconsistencies between titles of song records -- titles that, in some cases, are probably the same song.
Or are they?
If you hang out long enough on ASCAP's ACE Repertory search, you'll notice there's a huge piece of information missing from data downloads. Each song in ASCAP's database is given a unique Work ID to differentiate it from other songs with the same (or similar) title.
For some reason, this data, which is intrinsic to ASCAP's database records, is not available in the download. Without it, we could mistakenly assume that Slipknot and Slip Knot need to be folded into each other, when in fact they are considered separate records (for reasons the downloaded data doesn't reveal). This missing data is vital to each record in the database.
So let's add it.
One of the files in the ZIP archive you downloaded for this project is called ASCAP-Data-Extended.google-refine.tar.gz. This is an exported Refine project. When you click on the Export button in the upper righthand corner of the Refine interface, exporting the project in this form is the first selection you will see. The project is packaged as a TAR file so that you can easily share or distribute it in the state you last left it.
Let's install this project in your local Refine directory. From the Refine welcome screen, click on 'Import Project'. Choose the TAR file and click 'Import Project'. (Don't change the name.)
Presto -- a new Refine project, containing song titles and work IDs. If we click on the Refine logo in the upper lefthand corner, we can see both projects living together peacefully.
Let's go back to our original ASCAP project. On the 'Title' column, choose Edit Column > Add Column Based on This Column. In our transformation window, we're going to add this GREL expression:
cell.cross("ASCAP Data Extended", "Title").cells["WorkID"].value[0]
Let's also change the name of the new column to 'ASCAP Work ID'.
A couple of things are happening here:
- cell.cross is telling Refine that we're going to be looking in another project.
- "ASCAP Data Extended" is the project that we're going to look into for data, and "Title" is the column we're looking into on that project.
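- cells["WorkID"] is the value we want to copy over, should the data match what's in our current project.
- .value[0] takes the first result. cell.cross returns every matching row in the other project, so if a title matches more than one row, only the first Work ID is copied.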
(For the spreadsheet-minded, this is not unlike a VLOOKUP command.)
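For the code-minded, the same lookup can be sketched in Python. This is only an illustration of the idea, not how Refine implements it: the rows, the placeholder Work IDs, and the `build_index` and `lookup_first` helpers are all hypothetical. The other project is indexed by title, every matching row is collected, and the first match's Work ID is copied over.

```python
from collections import defaultdict


def build_index(rows, key_column):
    """Map each key value to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_column]].append(row)
    return index


def lookup_first(index, key, value_column):
    """Return value_column from the first matching row, or None if nothing matches."""
    matches = index.get(key, [])
    return matches[0][value_column] if matches else None


# Stand-in for the extended project (placeholder IDs, not real ASCAP Work IDs)
extended = [
    {"Title": "EXAMPLE SONG", "WorkID": "ID-001"},
    {"Title": "EXAMPLE SONG", "WorkID": "ID-002"},
    {"Title": "ANOTHER SONG", "WorkID": "ID-003"},
]
index = build_index(extended, "Title")

# Stand-in for the project we are enriching
songs = [{"Title": "EXAMPLE SONG"}, {"Title": "ANOTHER SONG"}, {"Title": "MISSING SONG"}]
for song in songs:
    song["ASCAP Work ID"] = lookup_first(index, song["Title"], "WorkID")
    print(song)
```

Notice that "EXAMPLE SONG" appears twice in the stand-in data but only the first ID is copied. A title-only lookup has the same limitation, so it's worth spot-checking any titles that occur more than once.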
Once applied, we should see our Work IDs now living in our cleaned ASCAP data.







