Module 2B: IMDB Data (Getting Our Hands Dirty)

Scotty Carlson edited this page Jan 27, 2017 · 10 revisions

In your ZIP package of tutorial data, there is a file labelled IMDB-titles.txt. The file contains 32 titles of works found in the IMDB that contain the words "Grateful Dead" in either the cast listings, titles, plot summaries, or other places. We will use this list as our starting point to acquire more data on each title.

Prepping the Data

First, copy the contents of the text file. In Refine, start a new project, and select Get data From Clipboard. Paste the text file contents there and create the project, called IMDB.

One of the first things you should notice is that there is errant white space before and after some of the lines of data, a common occurrence for data copied and pasted from web sites and other textual sources. Luckily, there is a quick fix for this. On the column dropdown menu, select Edit Cells > Common Transformations > Trim Leading and Trailing Whitespace.
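
For readers curious about what the trim step does to each cell, here is a minimal Python sketch of the same idea outside Refine. It assumes that Python's `str.strip()` behaves closely enough to Refine's menu command for illustration. The helper name `trim_rows` and the sample rows are hypothetical; only the title itself comes from this tutorial.

```python
def trim_rows(rows):
    """Strip leading and trailing whitespace from each row; interior spaces are kept."""
    return [row.strip() for row in rows]


# Hypothetical pasted rows: the same title with stray whitespace on either side.
raw_rows = [
    "  Peyote to LSD: A Psychedelic Odyssey",
    "Peyote to LSD: A Psychedelic Odyssey \t",
]

clean_rows = trim_rows(raw_rows)
print(clean_rows)
```

After trimming, both rows hold the identical string, which matters later because the title is used verbatim as the search term.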

Now we should be good to go.

Calling Out For Data

Now it's time for the good stuff. On the column of titles, go to the dropdown and select Edit Column > Add Column By Fetching URLs.

To get the data out of the OMDB (Open Movie Database) API, we're going to create a GREL expression that uses the data in our title column as a search string and will (hopefully) return the matching record for each title in a structured format.

The request URL to get the data looks like this:

http://www.omdbapi.com/?t=[title]&y=&plot=full&r=json

Let's break down what's going on here:

1. ?t=[title] is what we'll be focused on: the actual title search.
2. &y= is the year of the title in question. Since we don't have years, this will stay blank.
3. &plot=full tells the API to give us the long form of the stored plot metadata.
4. &r=json is the other key parameter, telling the API to return our results in JSON format. More on this later.

Now that we know what the search will look like, we can build the GREL expression:

'http://www.omdbapi.com/?t=' + escape(value, 'url') + '&y=&plot=full&r=json'

The first and last parts of the GREL expression are enclosed in single quotes, telling Refine that they are fixed text that won't change from search to search. Only the data in our column will change, represented by value.

But because the title is being sent out as a URL, our title data needs to be encoded so the API can understand it -- this is what escape is doing. When our data is encoded for URL, instead of:

Peyote to LSD: A Psychedelic Odyssey

...Refine will actually be sending out this encoded version:

Peyote+to+LSD%3A+A+Psychedelic+Odyssey
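
To see the encoding at work outside Refine, the following Python sketch mirrors the GREL expression above. It assumes that `urllib.parse.quote_plus` encodes this title the same way Refine's escape does; the two may differ on other characters. The function name `build_request_url` is hypothetical, and the sketch only builds the string without sending a request.

```python
from urllib.parse import quote_plus

# Fixed parts of the request, copied from the GREL expression.
PREFIX = "http://www.omdbapi.com/?t="
SUFFIX = "&y=&plot=full&r=json"


def build_request_url(title):
    """Return the request URL for one title, with the title URL-encoded."""
    return PREFIX + quote_plus(title) + SUFFIX


url = build_request_url("Peyote to LSD: A Psychedelic Odyssey")
print(url)
# http://www.omdbapi.com/?t=Peyote+to+LSD%3A+A+Psychedelic+Odyssey&y=&plot=full&r=json
```

Spaces become plus signs and the colon becomes %3A, while the fixed prefix and suffix pass through untouched.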

Once you have the GREL expression entered, you should see a preview of what Refine will send out:

[Screenshot: preview of the encoded request URLs in the Add Column By Fetching URLs dialog]

If your window looks like this, click OK. Refine will send a request to the API for each row and retrieve the data. (This could take a while.) You should see a progress bar as it works:

[Screenshot: progress bar while Refine fetches the URLs]

Receiving

When Refine is finished, your project should look like this:

[Screenshot: project with the new column of returned JSON]

This may look like a jumbled mess, but it isn't. This is the data returned as JSON. JSON stands for JavaScript Object Notation, a data-interchange format that is easy for humans to read and easy for machines to parse. Most web APIs can return their results in JSON format.

Take a look at your data. You'll notice that some searches died:

{"Response":"False","Error":"Movie not found!"}

Such is life when using the OMDB: the data is good, but since it is scraped from IMDB, it isn't necessarily perfect. Let's remove the failed searches. We'll star them as we did in the last module, and then in the All column, we'll select Facet > Facet By Star. With the starred rows selected in the facet (true), we'll return to the All column dropdown and select Edit Rows > Remove All Matching Rows.
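
The same filtering can be sketched in Python, as a rough illustration of what separates a failed search from a usable one. The failed body is the one shown above. The successful body is a hypothetical stand-in that assumes a good response carries "Response":"True", and the function name `is_failed_lookup` is made up for this example.

```python
import json


def is_failed_lookup(raw):
    """Return True when a response body reports Response == "False"."""
    record = json.loads(raw)
    return record.get("Response") == "False"


# Failed body copied from the tutorial.
failed_body = '{"Response":"False","Error":"Movie not found!"}'
# Hypothetical successful body; a real one would carry many more fields.
ok_body = '{"Response":"True"}'

responses = [failed_body, ok_body]
kept = [body for body in responses if not is_failed_lookup(body)]
print(len(kept))
```

Checking the Response field rather than matching the whole string keeps the test working if the error message wording differs.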

You should now be left with 24 rows to work with.
