
CMU Workshop on Cascading plus City of Palo Alto Open Data

We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/

Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.

We will also draw some introductory material from these two previous talks:

Example App

We used some of the CoPA open data for parks, roads, trees, etc., and have shown how to use Cascading and Hadoop to clean up the raw, unstructured download. That initial ETL workflow yields geolocation + metadata for each item of interest (a minimal sketch of this kind of flow follows the list below):

  • trees w/ species
  • road pavement w/ traffic conditions
  • parks
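
As a rough illustration of this kind of ETL step, here is a minimal Cascading 2.x sketch (not the app's actual flow) that reads the raw CoPA CSV and keeps only the tree records, writing them out as TSV. The "category" field name, the regex, and the argument handling are assumptions for illustration.

// Minimal Cascading 2.x ETL sketch: filter tree records out of the raw CoPA CSV.
// Field names and paths are hypothetical; see the repository for the real flow.
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CopaEtlSketch {
  public static void main(String[] args) {
    String inPath = args[0];   // e.g. data/copa.csv
    String outPath = args[1];  // e.g. output/tree

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopaEtlSketch.class);

    // source: comma-delimited export with a header row; sink: TSV with a header row
    Tap copaTap = new Hfs(new TextDelimited(true, ","), inPath);
    Tap treeTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // keep only records whose (hypothetical) "category" field matches "Tree"
    Pipe copaPipe = new Pipe("copa");
    Pipe treePipe = new Each(copaPipe, new Fields("category"), new RegexFilter("^Tree$"));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("tree ETL sketch")
        .addSource(copaPipe, copaTap)
        .addTailSink(treePipe, treeTap);

    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}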

One use case could be “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” In other words, we could determine estimates for albedo vs. relative shade. Perhaps as the starting point for a mobile killer app. Or something.
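
To make "albedo vs. relative shade" concrete, here is one rough, hypothetical way to score a road segment for shadiness once the tree and road data are joined: weight the fraction of the segment under tree canopy against the pavement albedo. The weights and parameter names are assumptions, not taken from the app.

// Hypothetical shade heuristic: higher score = cooler, shadier walk.
// The 0.7/0.3 weights are arbitrary placeholders for illustration.
public class ShadeScore {

  // canopyFraction: fraction of the segment under tree canopy, 0.0-1.0
  // roadAlbedo: pavement albedo, 0.0 (dark asphalt) to 1.0 (bright concrete)
  public static double score(double canopyFraction, double roadAlbedo) {
    // more canopy means more shade; higher albedo means the pavement stays cooler
    return 0.7 * canopyFraction + 0.3 * roadAlbedo;
  }

  public static void main(String[] args) {
    System.out.println(score(0.8, 0.2));  // leafy street, dark asphalt
    System.out.println(score(0.1, 0.6));  // exposed street, bright concrete
  }
}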

Additional data is included here, to be joined with the cleaned-up CoPA data about trees and roads. We will also use log data collected using GPS Tracks.

Note that this example blends the key elements of great Data Science apps:

  • ETL of unstructured data (CoPA GIS export)
  • curated metadata: tree species dataset, road albedo dataset
  • log files: iPhone personalized mobile coordinates
  • calibration and testing based on R
  • algorithms: geospatial search, Bayesian point estimates (see the geospatial search sketch after this list)
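
As a minimal sketch of the geospatial search element, the following brute-force radius query finds tree records near a GPS reading using the haversine distance; the Tree record shape and sample coordinates are assumptions for illustration. In the app, equivalent logic would run inside a Cascading operation over the cleaned-up tree output.

import java.util.ArrayList;
import java.util.List;

// Brute-force geospatial radius search; fine for a city-sized dataset.
public class NearbyTrees {
  private static final double EARTH_RADIUS_M = 6371000.0;

  // hypothetical shape of a cleaned-up tree record
  static class Tree {
    final String species;
    final double lat, lng;
    Tree(String species, double lat, double lng) {
      this.species = species; this.lat = lat; this.lng = lng;
    }
  }

  // great-circle distance between two lat/lng points, in meters (haversine)
  static double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLng = Math.toRadians(lng2 - lng1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
               * Math.sin(dLng / 2) * Math.sin(dLng / 2);
    return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
  }

  static List<Tree> within(List<Tree> trees, double lat, double lng, double radiusMeters) {
    List<Tree> hits = new ArrayList<Tree>();
    for (Tree t : trees) {
      if (haversineMeters(lat, lng, t.lat, t.lng) <= radiusMeters) {
        hits.add(t);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<Tree> trees = new ArrayList<Tree>();
    trees.add(new Tree("Quercus agrifolia", 37.4443, -122.1615));
    trees.add(new Tree("Magnolia grandiflora", 37.4292, -122.1381));
    // GPS reading near downtown Palo Alto, 500 m radius
    for (Tree t : within(trees, 37.4443, -122.1600, 500.0)) {
      System.out.println(t.species + " is within 500 m");
    }
  }
}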

Caveats

  • data quality: some species names have spelling errors or misclassifications
  • missing data
  • needs: common names for trees, photos, natives vs. invasives, toxicity, etc.

Enriching Data

We could combine this CoPA open data with access to external APIs:

Other Potential Use Cases

Trulia:

  • estimate allergy zones, for real estate preferences
  • optimize sales leads: target sites for conversion to residential solar
  • optimize sales leads: target sites for an urban agriculture venture

Calflora:

  • report observations of natives on endangered species list
  • report new observations of invasives / toxicology
  • infer regions of affinity for beneficial insects

City of Palo Alto:

  • premium payment / bid system for an open parking spot in the shade
  • welcome services for visitors (ecotourism, translated park info, etc.)
  • city planning: expected rates for tree replanting, natives vs. invasives, etc.
  • liabilities: e.g., oleander (common, highly toxic) near day care centers
  • epidemiology: tracking outbreaks of destructive tree diseases, which can have a big impact on property values

Community organizations:

  • volunteer events: harvest edibles to donate to shelters

Start-ups:

  • some invasive species are valuable in Chinese medicine, while others can be converted to biodiesel -- a potential win-win for targeted harvest services

Extending The Data

Looks like this data would be even more valuable if it included ambient noise levels. Somehow.

Question: How could your new business obtain data for ambient noise levels in Palo Alto?

  • infer from road data (see the sketch after this list)
  • infer from bus lines, rail schedule
  • sample/aggregate from mobile devices in exchange for micropayments
  • buy/aggregate data from home security networks
  • fly nano quadrotors, DIY "Street View" for audio
  • fly micro aerostats, with Arduino-based accelerometer and positioned parabolic mic
  • partner with City of Palo Alto to deploy a simple audio sensor grid
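
For the first option, one hypothetical starting point is to map the road traffic-condition categories already present in the CoPA data onto an ordinal noise proxy. The category names and ranks below are assumptions for illustration; any real noise estimate would need measured data for calibration.

import java.util.HashMap;
import java.util.Map;

// Hypothetical ordinal noise proxy derived from road traffic categories.
// Ranks are relative placeholders, not decibel measurements.
public class NoiseProxy {
  private static final Map<String, Integer> RANK = new HashMap<String, Integer>();
  static {
    RANK.put("local street", 1);  // presumably quietest
    RANK.put("collector", 2);
    RANK.put("arterial", 3);      // presumably noisiest
  }

  static int rank(String trafficClass) {
    Integer r = RANK.get(trafficClass.toLowerCase());
    return (r != null) ? r : 0;   // 0 = unknown category
  }

  public static void main(String[] args) {
    System.out.println("noise rank for an arterial segment: " + rank("Arterial"));
  }
}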

App Development Process

  1. Clean up the raw, unstructured data from the CoPA download… a.k.a. ETL
  2. Perform sampling before modeling
  3. Perform visualization and summary statistics in RStudio
  4. Ideation and research for potential use cases
  5. Iterate on the business process for the app workflow (a simple test sketch follows this list)
     5.1. TDD at scale
     5.2. best practices
  6. Integrate with end use cases as the workflow endpoints
  7. PROFIT!
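
As a minimal sketch of step 5.1, a test at the simplest level can assert that a local standalone run produced output with the expected tab-separated schema. The path, expected field count, and class name are assumptions; Cascading also provides test support for exercising operations and flows directly.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.junit.Test;

// Smoke test over locally generated output; run the flow first (see Build Instructions).
public class TreeOutputTest {
  private static final int EXPECTED_FIELDS = 5;  // hypothetical column count

  @Test
  public void treeOutputHasExpectedSchema() throws Exception {
    File partFile = new File("output/tree/part-00000");
    assertTrue("expected output/tree/part-00000 to exist", partFile.exists());

    BufferedReader reader = new BufferedReader(new FileReader(partFile));
    try {
      String header = reader.readLine();
      assertTrue("output should not be empty", header != null);

      String line;
      while ((line = reader.readLine()) != null) {
        // the -1 limit keeps trailing empty fields so the count is honest
        assertEquals(EXPECTED_FIELDS, line.split("\t", -1).length);
      }
    } finally {
      reader.close();
    }
  }
}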

Some caveats:

  • Arguably, this is not a large data set; however, it is early days for the open data initiative, and Palo Alto has a population of only about 65K.
  • This provides a good area for a POC, prior to deploying in other, larger metro areas.
  • This example helps illustrate that, when it comes to “Big Data”, complexity is often more important to consider than sheer size.

Build Instructions

To generate an IntelliJ project use:

gradle ideaModule

To build the sample app from the command line use:

gradle clean jar

Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the output directory and run on a desktop/laptop with Apache Hadoop in standalone mode:

rm -rf output
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv output/trap output/tsv output/tree output/road output/park output/shade output/reco

To view the results (for example, the output recommendations in reco):

ls output
more output/reco/part-00000

An example log captured from a successful build and run is at https://gist.github.com/3660888

About Cascading

There is a tutorial about getting started with Cascading in the blog post series called Cascading for the Impatient. Other documentation is available at http://www.cascading.org/documentation/.

For more discussion, see the cascading-user email forum. We have also started a meetup.
