Author: Peter McMaster, 2021
Traffic sucks. Technology can help by providing insight for policy, alleviating road congestion and decreasing the frequency and severity of accidents. This project is a continuation of research performed by the Research and Development Team at Lyft to investigate the factors surrounding car accidents, and an effort to extract meaningful insights for public policy by investigating some Big Data on the topic.
The data used here consist in a 3.5-million observation dataset of US Car Accidents from 2016 to 2020 and include environmental factors such as:
- location
- road signage
- time of day
- weather
The goal is to explore this data to uncover insights and correlations which may guide public policy and increase road safety. The original authors of the dataset conducted experiments to predict accident likelihood of any given intersection at any given time within city blocks. This project celebrates and continues their research by experimenting with various models on a binary classification task - to predict accident severity (high or low) given environmental parameters using:
- Logistic Regression
- K-Nearest Neighbors
- Naive Bayes
- Decision Tree
- Random Forest
- Support Vector Machines (linear & rbf kernels)
- XGBoost
XGBoost outperformed the rest, achieving an improvement of 21 percentage points over the baseline of a random classifier.
- Import Data
- Inspect Data
- Data Cleaning (Round 1)
- Data Cleaning (Round 2)
- #EDA 1: Full Dataset
- Pie Plots
- Cat Plots
- Line Plots
- Box and Whisker Plots
- EDA 2: Data Subsets
- Create Subsets
- Pie Plots
- Multiline Plots
- Probability Density Functions of continuous variables compared
- Skewness
- Maps (Los Angeles, CA)
- Base Map
- Accidents Markers
- Zipcode Choropleth Map
- Feature Sets
- df50: (X, y)
- df: (Xall, yall)
- dfsvm: (Xsvm, ysvm)
- Models
- Decision Tree Classifier
- Random Forest Classifier
- K Nearest Neighbors
- Naive Bayes (Guassian)
- Linear Regression
- Logistic Regression
- Support Vector Machine
- XGBoost
Data Source: https://osu.app.box.com/v/us-accidents-june20
This dataset consists of car accidents reports within 49 states of the U.S. in the date range from February 2016 through June 2020. These data have been sourced from several data providers, including 2 APIs (Mapquest, Bing) and a host of related entities (the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks). There are approximately 3.5 million accident observations and exactly 49 columns of heterogeneous data in this dataset. For additional information follow this link to https://smoosavi.org/datasets/us_accidents.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, arXiv preprint arXiv:1906.05409 (2019).
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

