Module 2B: IMDB Data (Getting Our Hands Dirty)
In your ZIP package of tutorial data, there is a file labelled IMDB-titles.txt. The file contains 32 titles of works found in the IMDB that contain the words "Grateful Dead" in either the cast listings, titles, plot summaries, or other places. We will use this list as our starting point to acquire more data on each title.
First, copy the contents of the text file. In Refine, start a new project, and select Get data From Clipboard. Paste the text file contents there and create the project, called IMDB.
One of the first things you should notice is that there is errant white space before and after some of the lines of data -- a common occurrence for data copied and pasted from different web sites and other textual sources. Luckily, there is a quick fix for this. On the column dropdown menu, select Edit Cells > Common Transformations > Trim Leading and Trailing Whitespace.
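For readers who want to see what the trim step does outside of Refine, here is a minimal Python sketch of the same idea. The function name `trim_cells` is hypothetical, and the sample value is simply one of the tutorial's titles with extra white space added for illustration.

```python
def trim_cells(cells):
    """Strip leading and trailing whitespace from every cell value."""
    return [cell.strip() for cell in cells]


# A title padded with stray spaces and a tab, as pasted text often is.
pasted = ["  Peyote to LSD: A Psychedelic Odyssey \t"]
print(trim_cells(pasted))
```

Only the white space at the ends of each value is removed; spaces inside the title are left alone.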
Now we should be good to go.
Now it's time for the good stuff. On the column of titles, go to the dropdown and select Edit Column > Add Column By Fetching URLs.
To get the data out of the OMDB, we're going to create a GREL expression that uses the data in our title column as a search string, and will (hopefully) return the matching record for each title.
The request URL to get the data looks like this:
http://www.omdbapi.com/?t=[title]&y=&plot=full&r=json
Let's break down what's going on here:
1. ?t=[title] is what we'll be focused on -- the actual title search.
2. &y= is the year of the title in question, and since we don't have years, this will stay blank.
3. &plot=full tells the API to give us the long form of the stored plot metadata.
4. &r=json is the other key command, telling the API to return our results in JSON format. More on this later.
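To make the breakdown concrete, the following Python sketch splits a request URL of this shape into its individual parameters using the standard library. `TITLE` is a placeholder standing in for a real title, not a value from the data set.

```python
from urllib.parse import urlsplit, parse_qs

# Placeholder request URL in the same shape as the one above.
request_url = "http://www.omdbapi.com/?t=TITLE&y=&plot=full&r=json"

# keep_blank_values=True keeps the empty year parameter in the result.
params = parse_qs(urlsplit(request_url).query, keep_blank_values=True)

for name, values in params.items():
    print(name, "=", repr(values[0]))
```

Each parameter comes back as a name with a list of values, and the blank `y` shows up as an empty string.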
Now that we know what the search will look like, we can build the GREL expression:
'http://www.omdbapi.com/?t=' + escape(value, 'url') + '&y=&plot=full&r=json'
The first and last parts of the GREL expression are enclosed in single quotes, telling Refine that they are fixed text that won't change from search to search. Only the data in our column will change, represented by value.
But because the title is being sent out as a URL, our title data needs to be encoded so the API can understand it -- this is what escape is doing. When our data is encoded for URL, instead of:
Peyote to LSD: A Psychedelic Odyssey
...Refine will actually be sending out this encoded version:
Peyote+to+LSD%3A+A+Psychedelic+Odyssey
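As a cross-check outside of Refine, this Python sketch builds the same request URL. It assumes that `quote_plus` from the standard library is a reasonable stand-in for GREL's `escape(value, 'url')`; the helper name `build_request_url` is hypothetical.

```python
from urllib.parse import quote_plus

# The fixed parts of the request, matching the single-quoted
# strings in the GREL expression.
PREFIX = "http://www.omdbapi.com/?t="
SUFFIX = "&y=&plot=full&r=json"


def build_request_url(title):
    """URL-encode the title and place it between the fixed parts."""
    return PREFIX + quote_plus(title) + SUFFIX


print(quote_plus("Peyote to LSD: A Psychedelic Odyssey"))
print(build_request_url("Peyote to LSD: A Psychedelic Odyssey"))
```

Spaces become `+` and the colon becomes `%3A`, which matches the encoded title shown above.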
Once you have the GREL expression entered, you should see a preview of what Refine will send out:
If your window looks like this, click OK. Refine will send out to the API and retrieve the data. (This could take a while.) You should see a progress bar as it works:
When Refine is finished, your project should look like this:
This may look like a jumbled mess, but it isn't. This is the data returned as JSON. JSON stands for JavaScript Object Notation, and is a data-interchange format that is easy for humans to read and easy for machines to parse. Most web APIs can return query results in JSON format.
Take a look at your data. You'll notice that some searches died:
{"Response":"False","Error":"Movie not found!"}
Such is life when using the OMDB -- the data is good, but since it is scraped from IMDB, it isn't necessarily perfect. Let's remove the failed searches. We'll star them as we did in the last module, and then in the All column, we'll select Facet > Facet By Star. With the starred rows (those that are true in the facet) selected, we'll return to the All column dropdown and select Edit Rows > Remove All Matching Rows.
You should now be left with 24 rows to work with.
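The same clean-up can be sketched in Python for readers who want to see the logic behind the star-and-remove step. The error string is the one shown above; the "successful" record is a made-up placeholder, and the helper name `is_failed_search` is hypothetical. Treating blank or unparseable cells as failures is an assumption of this sketch, not something the tutorial specifies.

```python
import json


def is_failed_search(raw):
    """Return True when a fetched cell reports a failed lookup."""
    try:
        record = json.loads(raw)
    except (TypeError, ValueError):
        # Assumption: blank or non-JSON cells count as failures.
        return True
    return isinstance(record, dict) and record.get("Response") == "False"


fetched = [
    '{"Response":"False","Error":"Movie not found!"}',   # from the tutorial
    '{"Response":"True","Title":"Placeholder Title"}',   # illustrative only
]

# Keep only the rows whose lookup succeeded.
kept = [raw for raw in fetched if not is_failed_search(raw)]
print(len(kept), "row(s) kept")
```

Checking the `Response` field rather than searching for the error text keeps the filter working even if the wording of the error message differs.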