This repo contains datasets I've created or collected, mostly via web scraping.
While re-watching parts of the MCU series during paternity leave, I compiled a dataset measuring things like budget, box office sales, and Rotten Tomatoes rating for the 23 movies. Using this data, I created an interactive visual in Tableau allowing comparison of measures across the films in different orders, like release date and chronological order.
Contains a list of Billboard Hot 100 Artists from 2005 to 2019, scraped from billboard.com on July 19, 2020. There are three columns: year, rank, and artist.
A list of URLs for Pitchfork's reviews of rap albums. There were 2,256 reviews when the script was run on June 23, 2020.
Spotify Charts exposes the current top 200 tracks, as well as a date drop-down to view historical chart data. Since Spotify Charts has built-in CSV download functionality, a simple R script helped compile and aggregate the daily chart data, which stretches back to early 2017.
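As a rough illustration of the compile-and-aggregate step, here is a minimal Python sketch. It is not the R script used for this dataset. The column names (`position`, `track`, `artist`, `streams`), the sample rows, and both helper functions are assumptions made up for the example; the real CSV exports may be laid out differently.

```python
import csv
import io


def combine_daily_charts(daily_csvs):
    """Stack per-day chart CSVs into one list of rows, tagging each row with its date.

    daily_csvs maps an ISO date string to the CSV text downloaded for that day.
    """
    combined = []
    for chart_date in sorted(daily_csvs):
        reader = csv.DictReader(io.StringIO(daily_csvs[chart_date]))
        for row in reader:
            row["date"] = chart_date  # record which daily chart the row came from
            combined.append(row)
    return combined


def total_streams_by_artist(rows):
    """Sum the (assumed) streams column per artist across all days."""
    totals = {}
    for row in rows:
        totals[row["artist"]] = totals.get(row["artist"], 0) + int(row["streams"])
    return totals


# Made-up sample data standing in for two daily downloads.
sample = {
    "2017-01-02": "position,track,artist,streams\n1,Song A,Artist X,100\n2,Song B,Artist Y,80\n",
    "2017-01-01": "position,track,artist,streams\n1,Song B,Artist Y,60\n2,Song A,Artist X,70\n",
}

rows = combine_daily_charts(sample)
totals = total_streams_by_artist(rows)
```

Keeping the date on every row means the combined table can later be aggregated by day, week, or year without going back to the individual files.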
As stated on their website, the purpose of kids-in-mind.com is "to provide parents and other adults with objective and complete information about a film’s content so that they can decide, based on their own value system, whether they should watch a movie with or without their kids, or at all." This dataset contains data for 5,525 movies, including title, year, mpaa_rating, and ratings assigned by the editors. It was scraped on March 5, 2022.
Last.fm is one of the best ways to track the music you listen to. Last.fm connects to music streaming services and tracks listening behavior via "scrobbling". Using Ben Foxall's convenient lastfm-to-csv service, I exported the list of 52,036 tracks I've listened to on Spotify between April 2017 and January 2023. Looking to visualize personal genre trends over time, I enriched this data (adding an artist_primary_genre field) using the Spotify API and the spotifyr package.
With over 15 million listeners, Spotify’s RapCaviar has been called “the most influential playlist in music.” RapCaviar is curated by Spotify’s editorial team and updated daily to represent the latest and greatest hip-hop and rap tracks. For the last year, I’ve saved a daily snapshot of the playlist using the Spotify API to empirically determine the biggest rappers in hip hop today.
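One possible way to turn daily snapshots into a ranking is to count "track-days", meaning how many times an artist's tracks appear across all snapshots. The Python sketch below assumes that metric and a simple snapshot layout (date mapped to a list of track and artist pairs). Both are illustrative assumptions, not the method or data format actually used here, and the artists and dates are invented.

```python
from collections import Counter


def artist_track_days(snapshots):
    """Count how many (day, track) appearances each artist has across snapshots."""
    counts = Counter()
    for tracks in snapshots.values():
        for _track, artist in tracks:
            counts[artist] += 1
    return counts


# Made-up snapshots: each date maps to the playlist contents on that day.
snapshots = {
    "2023-01-01": [("Track 1", "Artist A"), ("Track 2", "Artist B")],
    "2023-01-02": [("Track 1", "Artist A"), ("Track 3", "Artist A")],
}

# Artists ordered from most to fewest track-days.
ranking = artist_track_days(snapshots).most_common()
```

Track-days reward both placing many tracks on the playlist and staying on it for a long time; other measures, such as distinct tracks or average playlist position, would weight those factors differently.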
As one of the most visited car shopping sites in the United States, CarGurus tracks prices for millions of used car listings every year. Looking to get acquainted with prices in the used minivan market, I scraped 20 years’ worth of monthly average price data from CarGurus for five minivan models: Kia Sedona, Toyota Sienna, Chrysler Pacifica, Honda Odyssey, and Dodge Grand Caravan.
In 2018 my wife and I moved to New York for the start of a new job. Initially overwhelmed by the scope and pace of the NYC housing market, we were given the very generous and unexpected opportunity by a family friend to live in a house north of the city in Westchester County. Built in the early 1930s, the historic home is situated in central Scarsdale, an affluent suburban town known for high-achieving schools and extravagant real estate. Wishing to analyze the houses of Scarsdale in a more systematic way, I contacted the Scarsdale Village administration and was sent an Excel file with the complete set of residential properties, rich with detail and with few missing values.
My friend @bryanwhiting published an R package that compiles ~50 years' worth of speeches from sessions of General Conference of the Church of Jesus Christ of Latter-day Saints.
Current as of November 11, 2023, when the data was scraped.