Bird Feed is an app that lets you see bird activity near Central Park in real time. You can find 'trending' birds (most sightings over the last minute), see which birds are within a 50-meter radius of your location, and view bird hotspots. Additionally, Bird Feed provides historical data for the total and unique number of sightings over the past seven days. This information can be viewed for all birds, by family, or for a specific bird.
Bird Feed is powered by:
- Kafka
- HDFS
- Spark
- Flink
- Cassandra
- Elasticsearch
- Flask
A video of the project in action can be seen here.
All of the technologies listed above are connected as follows.
Since no existing data source has enough velocity and volume to thoroughly test the infrastructure, all records are manufactured. Bird names, families, and images were scraped from the Cornell Lab of Ornithology's website, All About Birds.
Manufactured data in the form of JSON records containing name, family, latitude, longitude, and timestamp is generated by randomly selecting a bird out of 250,000 unique candidates. Each sample is taken from a Gaussian distribution over the indices of the array of candidates, and the array is periodically shuffled so that the dynamics change throughout the day. Once a bird is selected, a latitude and longitude are drawn uniformly at random between bounds that surround Central Park. The timestamp is generated at creation time and given a random (Gaussian) delay to simulate out-of-order sensor data.
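A minimal sketch of the generator is shown below; the bounding box, delay parameters, and the tiny hard-coded candidate list are illustrative stand-ins, not the values used in the project.

```python
import json
import random
import time

# Approximate bounding box around Central Park (illustrative values).
LAT_BOUNDS = (40.764, 40.800)
LON_BOUNDS = (-73.982, -73.949)

# The real candidate array holds 250,000 entries scraped from All About Birds;
# a few hard-coded names stand in for it here.
candidates = [
    {"name": "Northern Cardinal", "family": "Cardinalidae"},
    {"name": "Blue Jay", "family": "Corvidae"},
    {"name": "American Robin", "family": "Turdidae"},
]

def generate_record():
    # Gaussian sample over the indices of the (periodically shuffled) array,
    # clamped to the valid range.
    idx = int(random.gauss(len(candidates) / 2, len(candidates) / 6))
    idx = max(0, min(len(candidates) - 1, idx))
    bird = candidates[idx]

    return json.dumps({
        "name": bird["name"],
        "family": bird["family"],
        # Uniformly random point inside the bounding box.
        "latitude": random.uniform(*LAT_BOUNDS),
        "longitude": random.uniform(*LON_BOUNDS),
        # Gaussian delay simulates out-of-order sensor data.
        "timestamp": time.time() - abs(random.gauss(0, 2)),
    })

if __name__ == "__main__":
    while True:
        print(generate_record())
        # Occasionally reshuffle so the sighting dynamics drift over the day.
        if random.random() < 0.001:
            random.shuffle(candidates)
        time.sleep(0.01)
```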
Each record is published to a Kafka topic, which broadcasts it to the rest of the pipeline. The data is also stored in HDFS in sequential 20 MB files for daily batch processing.
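The producer side might look like the sketch below, using the kafka-python client; the broker address, topic name, and sample record are assumptions, and the HDFS archival step is not shown.

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name are assumptions for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

record = {
    "name": "Northern Cardinal",
    "family": "Cardinalidae",
    "latitude": 40.78,
    "longitude": -73.97,
    "timestamp": 1496318400.0,
}

producer.send("bird-sightings", record)
producer.flush()
```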
The batch processing for Bird Feed uses Spark's DataFrame API. Since the total and unique sighting counts by family and overall can be derived from the per-bird (name-level) results, those intermediate results are cached to speed up the calculations. Every night the metrics are recalculated from the raw logs, which makes it easy to change the calculations later and avoids losing data to potential mistakes. The results are persisted in Cassandra for access by the front end. A sketch of the job appears after the list below.
Batch processes include:
- Calculating total sightings overall, by family, and by bird for each day
- Calculating unique sightings overall, by family, and by bird for each day
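A minimal PySpark sketch of the nightly job, assuming hypothetical HDFS paths and Cassandra keyspace/table names, and treating distinct coordinates as the definition of a 'unique' sighting since the exact definition is not spelled out here:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bird-feed-batch").getOrCreate()

# Path, keyspace, and table names are assumptions for this sketch.
sightings = spark.read.json("hdfs:///bird_feed/raw/2017-06-01/*")

# Per-bird results are cached because the family-level and overall
# numbers can be derived from them.
by_bird = (sightings
           .groupBy("name", "family")
           .agg(F.count("*").alias("total_sightings"),
                F.countDistinct("latitude", "longitude").alias("unique_sightings"))
           .cache())

by_family = (by_bird
             .groupBy("family")
             .agg(F.sum("total_sightings").alias("total_sightings"),
                  F.sum("unique_sightings").alias("unique_sightings")))

overall = by_bird.agg(F.sum("total_sightings").alias("total_sightings"),
                      F.sum("unique_sightings").alias("unique_sightings"))

# Persist the per-bird results to Cassandra via the spark-cassandra-connector;
# the family-level and overall tables would be written the same way.
(by_bird.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="bird_feed", table="daily_by_bird")
    .mode("append")
    .save())
```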
Real-time processing is handled by Flink. Flink allows you to impose windows on the stream and perform mini-calculations before persisting data, which reduces the load on the database. A window segments the data by time (processing time, ingestion time, or event time) and waits until the window closes to run the calculation on the rolling stream.
Windowing also allows you to set watermarks to handle out-of-order data in a reliable fashion. Watermarks are inserted into the stream and indicate when a calculation on a window should occur. This gives the user a sense of accuracy in the calculation as long as the watermarks are reflective of the data. For example, if you know that 99.9% of your data always arrives within 5 seconds of the rest, you can set a periodic event-time watermark with a delay of 5 seconds. A sketch of this setup follows.
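Below is a minimal PyFlink sketch of a bounded-out-of-orderness watermark combined with a sliding event-time window; the project may well use Flink's Java/Scala API, and the in-memory source, tuple layout, and Kafka/Cassandra wiring (not shown) are assumptions.

```python
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.time import Time
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows


class SightingTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        # value = (bird_name, count, event_time_millis)
        return value[2]


env = StreamExecutionEnvironment.get_execution_environment()

# The real stream comes from Kafka; a small in-memory collection keeps
# this sketch self-contained.
sightings = env.from_collection([
    ("Northern Cardinal", 1, 1_000),
    ("Blue Jay", 1, 2_000),
    ("Northern Cardinal", 1, 6_500),
])

# Periodic event-time watermark that tolerates 5 seconds of lateness.
watermarks = (WatermarkStrategy
              .for_bounded_out_of_orderness(Duration.of_seconds(5))
              .with_timestamp_assigner(SightingTimestampAssigner()))

# Total sightings per bird over a one-minute window sliding every 5 seconds.
trending = (sightings
            .assign_timestamps_and_watermarks(watermarks)
            .key_by(lambda s: s[0])
            .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(5)))
            .reduce(lambda a, b: (a[0], a[1] + b[1], max(a[2], b[2]))))

trending.print()
env.execute("trending_birds")
```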
Data is passed to both Cassandra and Elasticsearch for short-term persistence.
Real-time processing includes:
- Calculating total sightings by bird over a one-minute window that slides every five seconds.
- Updating the current position of each bird in Elasticsearch for hotspot and 'near me' queries (see the sketch after this list).
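For the position updates, the streaming job could issue an upsert like the following for every sighting, using the elasticsearch-py client; the host, index name, and document layout are assumptions, and the location field would need a geo_point mapping.

```python
from elasticsearch import Elasticsearch

# Host, index name, and document layout are assumptions for this sketch.
es = Elasticsearch("http://localhost:9200")

es.index(
    index="bird-positions",
    id="Northern Cardinal",  # one document per bird, keyed by name
    body={
        "name": "Northern Cardinal",
        "family": "Cardinalidae",
        "location": {"lat": 40.78, "lon": -73.97},  # mapped as geo_point
        "timestamp": "2017-06-01T12:00:00Z",
    },
)
```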
Both Cassandra and Elasticsearch are used to display data on the front end. Cassandra handles the large volume of writes needed to keep an up-to-date view of which birds are trending, and all trending writes carry a TTL (time to live) so that any record selected is fresh. Elasticsearch, on the other hand, easily handles geographic and time filtering with its robust query language. Geohashes are also calculated on recent data to determine hotspots across the park.
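A sketch of these access patterns, assuming hypothetical keyspace, table, index, and field names: a TTL'd trending write in Cassandra, a 50-meter geo_distance filter for 'near me', and a geohash_grid aggregation for hotspots.

```python
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

# Keyspace, table, index, and field names below are assumptions.

# Trending writes carry a TTL so anything read back is still fresh.
session = Cluster(["localhost"]).connect("bird_feed")
session.execute(
    "INSERT INTO trending (name, sightings) VALUES (%s, %s) USING TTL 60",
    ("Northern Cardinal", 42),
)

es = Elasticsearch("http://localhost:9200")

# 'Near me': birds within 50 meters of the user's location.
near_me = es.search(index="bird-positions", body={
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "50m",
                    "location": {"lat": 40.78, "lon": -73.97},
                }
            }
        }
    }
})

# Hotspots: bucket recent positions into geohash cells and rank by count.
hotspots = es.search(index="bird-positions", body={
    "size": 0,
    "aggs": {
        "hotspots": {
            "geohash_grid": {"field": "location", "precision": 7}
        }
    }
})
```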