
CMU Workshop on Cascading plus City of Palo Alto Open Data

We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/

Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.

We will also draw some introductory material from these two previous talks:

Example App

We used some of the CoPA open data for parks, roads, trees, etc., and have shown how to use Cascading and Hadoop to clean up the raw, unstructured download. That initial ETL workflow yields geolocation + metadata for each item of interest (a minimal sketch of this kind of flow follows the list below):

  • trees w/ species
  • road pavement w/ traffic conditions
  • parks
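
As a rough illustration of this kind of ETL step, here is a minimal Cascading 2.x sketch (not the app's actual flow) that reads the raw CoPA CSV and keeps only the tree records, writing them out as TSV. The "category" field name, the regex, and the argument handling are assumptions for illustration.

// Minimal Cascading 2.x ETL sketch: filter tree records out of the raw CoPA CSV.
// Field names and paths are hypothetical; see the repository for the real flow.
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CopaEtlSketch {
  public static void main(String[] args) {
    String inPath = args[0];   // e.g. data/copa.csv
    String outPath = args[1];  // e.g. output/tree

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopaEtlSketch.class);

    // source: comma-delimited export with a header row; sink: TSV with a header row
    Tap copaTap = new Hfs(new TextDelimited(true, ","), inPath);
    Tap treeTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // keep only records whose (hypothetical) "category" field matches "Tree"
    Pipe copaPipe = new Pipe("copa");
    Pipe treePipe = new Each(copaPipe, new Fields("category"), new RegexFilter("^Tree$"));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("tree ETL sketch")
        .addSource(copaPipe, copaTap)
        .addTailSink(treePipe, treeTap);

    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}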

One use case could be “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” In other words, we could determine estimates for albedo vs. relative shade. Perhaps as the starting point for a mobile killer app. Or something.
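
To make "albedo vs. relative shade" concrete, here is one rough, hypothetical way to score a road segment for shadiness once the tree and road data are joined: weight the fraction of the segment under tree canopy against the pavement albedo. The weights and parameter names are assumptions, not taken from the app.

// Hypothetical shade heuristic: higher score = cooler, shadier walk.
// The 0.7/0.3 weights are arbitrary placeholders for illustration.
public class ShadeScore {

  // canopyFraction: fraction of the segment under tree canopy, 0.0-1.0
  // roadAlbedo: pavement albedo, 0.0 (dark asphalt) to 1.0 (bright concrete)
  public static double score(double canopyFraction, double roadAlbedo) {
    // more canopy means more shade; higher albedo means the pavement stays cooler
    return 0.7 * canopyFraction + 0.3 * roadAlbedo;
  }

  public static void main(String[] args) {
    System.out.println(score(0.8, 0.2));  // leafy street, dark asphalt
    System.out.println(score(0.1, 0.6));  // exposed street, bright concrete
  }
}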

Additional data is included here, to be joined with the cleaned-up CoPA data about trees and roads. We will also use log data collected using GPS Tracks.

Note that this example blends the key elements of great Data Science apps:

  • ETL of unstructured data (CoPA GIS export)
  • curated metadata: tree species dataset, road albedo dataset
  • log files: iPhone personalized mobile coordinates
  • calibration and testing based on R
  • algorithms: geospatial search, Bayesian point estimates (see the geospatial search sketch after this list)
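
As a minimal sketch of the geospatial search element, the following brute-force radius query finds tree records near a GPS reading using the haversine distance; the Tree record shape and sample coordinates are assumptions for illustration. In the app, equivalent logic would run inside a Cascading operation over the cleaned-up tree output.

import java.util.ArrayList;
import java.util.List;

// Brute-force geospatial radius search; fine for a city-sized dataset.
public class NearbyTrees {
  private static final double EARTH_RADIUS_M = 6371000.0;

  // hypothetical shape of a cleaned-up tree record
  static class Tree {
    final String species;
    final double lat, lng;
    Tree(String species, double lat, double lng) {
      this.species = species; this.lat = lat; this.lng = lng;
    }
  }

  // great-circle distance between two lat/lng points, in meters (haversine)
  static double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLng = Math.toRadians(lng2 - lng1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
               * Math.sin(dLng / 2) * Math.sin(dLng / 2);
    return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
  }

  static List<Tree> within(List<Tree> trees, double lat, double lng, double radiusMeters) {
    List<Tree> hits = new ArrayList<Tree>();
    for (Tree t : trees) {
      if (haversineMeters(lat, lng, t.lat, t.lng) <= radiusMeters) {
        hits.add(t);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<Tree> trees = new ArrayList<Tree>();
    trees.add(new Tree("Quercus agrifolia", 37.4443, -122.1615));
    trees.add(new Tree("Magnolia grandiflora", 37.4292, -122.1381));
    // GPS reading near downtown Palo Alto, 500 m radius
    for (Tree t : within(trees, 37.4443, -122.1600, 500.0)) {
      System.out.println(t.species + " is within 500 m");
    }
  }
}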

Caveats

  • data quality: some species names have spelling errors or misclassifications
  • missing data
  • needs: common names for trees, photos, natives vs. invasives, toxicity, etc.

Enriching Data

We could combine this CoPA open data with access to external APIs:

Other Potential Use Cases

Trulia:

  • estimate allergy zones, for real estate preferences
  • optimize sales leads: target sites for conversion to residential solar
  • optimize sales leads: target sites for an urban agriculture venture

Calflora:

  • report observations of natives on endangered species list
  • report new observations of invasives / toxicology
  • infer regions of affinity for beneficial insects

City of Palo Alto:

  • premium payment / bid system for an open parking spot in the shade
  • welcome services for visitors (ecotourism, translated park info, etc.)
  • city planning: expected rates for tree replanting, natives vs. invasives, etc.
  • liabilities: e.g., oleander (common, highly toxic) near day care centers
  • epidemiology: tracking outbreaks of destructive tree diseases, which can have a big impact on property values

Community organizations:

  • volunteer events: harvest edibles to donate to shelters

Start-ups:

  • some invasive species are valuable in Chinese medicine, while others can be converted to biodiesel -- a potential win-win for targeted harvest services

Extending The Data

Looks like this data would be even more valuable if it included ambient noise levels. Somehow.

Question: How could your new business obtain data for ambient noise levels in Palo Alto?

  • infer from road data (see the sketch after this list)
  • infer from bus lines, rail schedule
  • sample/aggregate from mobile devices in exchange for micropayments
  • buy/aggregate data from home security networks
  • fly nano quadrotors, DIY "Street View" for audio
  • fly micro aerostats, with Arduino-based accelerometer and positioned parabolic mic
  • partner with City of Palo Alto to deploy a simple audio sensor grid
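
For the first option, one hypothetical starting point is to map the road traffic-condition categories already present in the CoPA data onto an ordinal noise proxy. The category names and ranks below are assumptions for illustration; any real noise estimate would need measured data for calibration.

import java.util.HashMap;
import java.util.Map;

// Hypothetical ordinal noise proxy derived from road traffic categories.
// Ranks are relative placeholders, not decibel measurements.
public class NoiseProxy {
  private static final Map<String, Integer> RANK = new HashMap<String, Integer>();
  static {
    RANK.put("local street", 1);  // presumably quietest
    RANK.put("collector", 2);
    RANK.put("arterial", 3);      // presumably noisiest
  }

  static int rank(String trafficClass) {
    Integer r = RANK.get(trafficClass.toLowerCase());
    return (r != null) ? r : 0;   // 0 = unknown category
  }

  public static void main(String[] args) {
    System.out.println("noise rank for an arterial segment: " + rank("Arterial"));
  }
}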

App Development Process

  1. Clean up the raw, unstructured data from the CoPA download… a.k.a. ETL
  2. Perform sampling before modeling
  3. Perform visualization and summary statistics in RStudio
  4. Ideation and research for potential use cases
  5. Iterate on the business process for the app workflow (a simple test sketch follows this list)
     5.1. TDD at scale
     5.2. best practices
  6. Integrate with end use cases as the workflow endpoints
  7. PROFIT!
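
As a minimal sketch of step 5.1, a test at the simplest level can assert that a local standalone run produced output with the expected tab-separated schema. The path, expected field count, and class name are assumptions; Cascading also provides test support for exercising operations and flows directly.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.junit.Test;

// Smoke test over locally generated output; run the flow first (see Build Instructions).
public class TreeOutputTest {
  private static final int EXPECTED_FIELDS = 5;  // hypothetical column count

  @Test
  public void treeOutputHasExpectedSchema() throws Exception {
    File partFile = new File("output/tree/part-00000");
    assertTrue("expected output/tree/part-00000 to exist", partFile.exists());

    BufferedReader reader = new BufferedReader(new FileReader(partFile));
    try {
      String header = reader.readLine();
      assertTrue("output should not be empty", header != null);

      String line;
      while ((line = reader.readLine()) != null) {
        // the -1 limit keeps trailing empty fields so the count is honest
        assertEquals(EXPECTED_FIELDS, line.split("\t", -1).length);
      }
    } finally {
      reader.close();
    }
  }
}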

Some caveats:

  • Arguably, this is not a large data set; however, it is early days for the open data initiative, and Palo Alto has a population of only about 65K.
  • This provides a good area for a POC, prior to deploying in other, larger metro areas.
  • This example helps illustrate that, when it comes to “Big Data”, complexity is often more important to consider than sheer size.

Build Instructions

To generate an IntelliJ project use:

gradle ideaModule

To build the sample app from the command line use:

gradle clean jar

Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the output directory and run on a desktop/laptop with Apache Hadoop in standalone mode:

rm -rf output
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv output/trap output/tsv output/tree output/road output/park output/shade output/reco

To view the results (for example, the output recommendations in reco):

ls output
more output/reco/part-00000

An example log captured from a successful build and run is at https://gist.github.com/3660888

About Cascading

There is a tutorial about getting started with Cascading in the blog post series called Cascading for the Impatient. Other documentation is available at http://www.cascading.org/documentation/.

For more discussion, see the cascading-user email forum. We have also started a meetup.
