A Spark application that builds a machine learning model for a real-world problem using real-world data: predicting the arrival delay of commercial flights.
- Setting up a Spark machine learning project with Scala, sbt and MLlib
- Building a Big Data Machine Learning Spark Application for Flight Delay Prediction
- Linear Regression
- Random Forest
- Gradient-Boosted Trees
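
The three model families listed above correspond to the `LinearRegression`, `RandomForestRegressor` and `GBTRegressor` estimators in Spark MLlib. The following is only a rough sketch of how they could be trained and compared on the flight data, not the project's actual code. The object name, the input path `data/2008.csv`, the choice of feature columns, the `NA` missing-value marker and the default hyperparameters are all assumptions made for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{GBTRegressor, LinearRegression, RandomForestRegressor}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ModelComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlightDelaySketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file; "NA" is assumed to mark missing values.
    val raw = spark.read
      .option("header", "true")
      .option("nullValue", "NA")
      .csv("data/2008.csv")

    // Illustrative feature choice: only variables not marked as forbidden.
    val featureCols = Array("DepDelay", "TaxiOut", "Distance")
    val data = raw
      .select(("ArrDelay" +: featureCols).map(c => col(c).cast("double")): _*)
      .na.drop() // drops rows with a missing label or feature, e.g. cancelled flights

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    // One regressor per model family, all with default hyperparameters.
    val regressors: Seq[(String, PipelineStage)] = Seq(
      "Linear Regression" ->
        new LinearRegression().setLabelCol("ArrDelay").setFeaturesCol("features"),
      "Random Forest" ->
        new RandomForestRegressor().setLabelCol("ArrDelay").setFeaturesCol("features"),
      "Gradient-Boosted Trees" ->
        new GBTRegressor().setLabelCol("ArrDelay").setFeaturesCol("features")
    )

    val evaluator = new RegressionEvaluator()
      .setLabelCol("ArrDelay")
      .setPredictionCol("prediction")
      .setMetricName("rmse")

    // Fit each pipeline on the training split and report RMSE on the test split.
    regressors.foreach { case (name, regressor) =>
      val model = new Pipeline()
        .setStages(Array[PipelineStage](assembler, regressor))
        .fit(train)
      val rmse = evaluator.evaluate(model.transform(test))
      println(s"$name: RMSE = $rmse")
    }

    spark.stop()
  }
}
```

Wrapping each regressor in a `Pipeline` together with the `VectorAssembler` keeps the feature assembly identical across the three models, so the RMSE values are comparable.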
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
$ sudo apt-get install scala
Download the data from here.
Extract it and save it to /flightdelaypredictor/data/
$ cd flightdelaypredictor
$ sbt
sbt:regressionTree> run
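
The `sbt:regressionTree>` prompt above implies an sbt project named `regressionTree`. For readers setting up a similar project from scratch, a minimal `build.sbt` might look like the sketch below. The Scala and Spark versions are assumptions chosen for illustration and are not taken from this project; use the versions that match your own Spark installation.

```scala
// build.sbt: a minimal sketch; all version numbers are assumptions
name := "regressionTree" // matches the project name shown in the sbt prompt
version := "0.1.0"
scalaVersion := "2.12.18"

val sparkVersion = "3.5.1" // assumed; align with your Spark installation

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```

The `%%` operator appends the Scala binary version to the artifact name, so the Spark artifacts resolved always match `scalaVersion`.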
| No. | Forbidden | Name | Description |
|---|---|---|---|
| 1 | | Year | 1987-2008 |
| 2 | | Month | 1-12 |
| 3 | | DayofMonth | 1-31 |
| 4 | | DayOfWeek | 1 (Monday) - 7 (Sunday) |
| 5 | | DepTime | actual departure time (local, hhmm) |
| 6 | | CRSDepTime | scheduled departure time (local, hhmm) |
| 7 | x | ArrTime | actual arrival time (local, hhmm) |
| 8 | | CRSArrTime | scheduled arrival time (local, hhmm) |
| 9 | | UniqueCarrier | unique carrier code |
| 10 | | FlightNum | flight number |
| 11 | | TailNum | plane tail number |
| 12 | x | ActualElapsedTime | in minutes |
| 13 | | CRSElapsedTime | in minutes |
| 14 | x | AirTime | in minutes |
| 15 | | ArrDelay | arrival delay, in minutes |
| 16 | | DepDelay | departure delay, in minutes |
| 17 | | Origin | origin IATA airport code |
| 18 | | Dest | destination IATA airport code |
| 19 | | Distance | in miles |
| 20 | x | TaxiIn | taxi-in time, in minutes |
| 21 | | TaxiOut | taxi-out time, in minutes |
| 22 | | Cancelled | was the flight cancelled? |
| 23 | | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
| 24 | x | Diverted | 1 = yes, 0 = no |
| 25 | x | CarrierDelay | in minutes |
| 26 | x | WeatherDelay | in minutes |
| 27 | x | NASDelay | in minutes |
| 28 | x | SecurityDelay | in minutes |
| 29 | x | LateAircraftDelay | in minutes |
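
The variables marked `x` in the table are forbidden as model inputs, presumably because their values are only known once the flight has landed. One way to enforce this before any feature selection is to filter them out of the column list up front. The sketch below uses plain Scala; the object and method names are hypothetical and do not come from this project.

```scala
object ForbiddenColumns {
  // The ten columns marked "x" in the table above.
  val forbidden: Set[String] = Set(
    "ArrTime", "ActualElapsedTime", "AirTime", "TaxiIn", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay"
  )

  // Keep only the columns a model may use, preserving their original order.
  def allowed(header: Seq[String]): Seq[String] =
    header.filterNot(forbidden.contains)
}
```

For example, `ForbiddenColumns.allowed(Seq("Year", "ArrTime", "DepDelay", "TaxiIn", "Distance"))` returns `Seq("Year", "DepDelay", "Distance")`. With a Spark DataFrame `df`, the same set could be passed to `df.drop(ForbiddenColumns.forbidden.toSeq: _*)`.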