Crashlytics is a data science project that analyzes and models US traffic accident data to explore accident patterns and predict accident severity based on environmental, temporal, and infrastructural factors.
The notebook walks through data exploration, feature engineering, preprocessing, model training, and evaluation, producing insights that can help guide safety improvements.
Traffic accidents have major human and economic costs.
This project uses the US Accidents Dataset to:
- Analyze accident patterns across locations, times, and weather conditions.
- Identify environmental and road features most related to severe accidents.
- Build machine learning models to predict accident severity.
The workflow includes EDA, data cleaning, feature selection, and model evaluation.
- Language: Python
- Data Processing: Pandas, NumPy
- Data Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn
- Tools: Jupyter Notebook
- Dataset: US Accidents Dataset (2016–2020)
📦 Crashlytics
┣ 📁 data/ – Dataset files (download separately)
┣ 📓 crashlytics_notebook.ipynb – Main Jupyter notebook (EDA + ML pipeline)
┣ 📄 requirements.txt – Python dependencies
┗ 📄 README.md – Project documentation
- Accident counts by state (heatmap & bar chart).
- Text analysis: most frequent words in severity 4 accident descriptions.
- Most common road features present during accidents.
- Relationship between accident distance and severity.
- Accident counts by weather condition and weekday.
- Temporal trends highlighting rush-hour peaks and weekday/weekend differences.
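The counts listed above can be reproduced with a few pandas aggregations. The following is a minimal sketch on a hand-made toy frame, not the notebook's actual code; the column names `State`, `Weather_Condition`, and `Start_Time` are assumed to match the dataset's schema, and the rows are invented purely for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; the real data is loaded from data/.
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX", "FL", "CA", "TX"],
    "Weather_Condition": ["Clear", "Rain", "Clear", "Fog", "Clear", "Rain"],
    "Start_Time": pd.to_datetime([
        "2019-03-04 08:10:00",
        "2019-03-04 17:30:00",
        "2019-03-05 08:45:00",
        "2019-03-09 13:00:00",
        "2019-03-06 08:20:00",
        "2019-03-07 17:05:00",
    ]),
})

# Accident counts per state, sorted from most to fewest.
by_state = toy["State"].value_counts()

# Accident counts for each (weather condition, weekday) pair.
by_weather_weekday = pd.crosstab(
    toy["Weather_Condition"], toy["Start_Time"].dt.day_name()
)

# Accident counts per hour of day, useful for spotting rush-hour peaks.
by_hour = toy["Start_Time"].dt.hour.value_counts().sort_index()

print(by_state)
print(by_weather_weekday)
print(by_hour)
```

Each result is a plain Series or DataFrame, so it can be passed directly to Matplotlib or Seaborn for the bar charts and heatmaps.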
- Temporal feature extraction from Start_Time (year, month, day, weekday, hour, minute).
- Correlation analysis to detect and drop redundant features (e.g., End_Lat, End_Lng, Wind_Chill).
- Removal of irrelevant identifiers and redundant time/location variables.
- Duplicate removal and handling of erroneous/missing values.
- Encoding of categorical variables.
- Dropped constant-value columns and variables with little predictive power.
- Retained only impactful features identified via EDA and correlation matrix.
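As an illustration of the temporal extraction and correlation-based pruning described above, here is a minimal sketch. The helper names `add_time_features` and `highly_correlated_columns`, the 0.9 threshold, and the toy rows are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame, column: str = "Start_Time") -> pd.DataFrame:
    """Return a copy of df with calendar fields derived from `column`."""
    out = df.copy()
    ts = pd.to_datetime(out[column], errors="coerce")  # unparseable values become NaT
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.weekday  # Monday is 0, Sunday is 6
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute
    return out


def highly_correlated_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List numeric columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy rows invented for illustration; End_Lat duplicates Start_Lat on purpose.
toy = pd.DataFrame({
    "Start_Time": ["2019-03-04 08:15:00", "2019-03-09 17:40:00", "2020-01-01 00:05:00"],
    "Start_Lat": [34.0, 29.7, 25.8],
    "End_Lat": [34.0, 29.7, 25.8],
    "Distance(mi)": [0.5, 0.1, 2.0],
})

features = add_time_features(toy)
redundant = highly_correlated_columns(toy)
cleaned = features.drop(columns=redundant + ["Start_Time"])
print(redundant)
print(cleaned.columns.tolist())
```

Checking only the upper triangle keeps the first column of each correlated pair and flags the later one, so a pair such as Start_Lat and End_Lat loses only one member.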
- Split dataset into train, validation, and test sets.
- Trained and evaluated multiple classifiers:
  - Logistic Regression
  - Decision Tree Classifier
  - Random Forest Classifier
- Evaluated models with:
  - Accuracy
  - Precision, Recall, and F1-score
  - Confusion matrices
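The split, training, and evaluation steps above might look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification` rather than the accident dataset, and the 60/20/20 split, the hyperparameters, and the choice to select the model by validation accuracy are all assumptions for illustration, not the notebook's confirmed settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features; the four classes
# play the role of the four severity levels.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# Assumed 60/20/20 split: hold out the test set first, then carve the
# validation set out of the remaining 80%.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the training set and score it on the validation set.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))

# Report test metrics only for the model that did best on validation.
best_name = max(val_accuracy, key=val_accuracy.get)
test_pred = models[best_name].predict(X_test)
test_confusion = confusion_matrix(y_test, test_pred)

print(val_accuracy)
print(best_name)
print(classification_report(y_test, test_pred, zero_division=0))
print(test_confusion)
```

Selecting on the validation set and touching the test set only once keeps the reported test metrics from being biased by model selection.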
- California, Texas, and Florida have the highest accident counts.
- Severe accidents often occur during poor visibility and adverse weather.
- Accidents are more frequent during weekday rush hours.
- Junctions, crossings, and nearby stations are common in severe accident locations.
git clone https://github.com/your-username/Crashlytics.git
cd Crashlytics
pip install -r requirements.txt
jupyter notebook crashlytics_notebook.ipynb
- Highest accuracy achieved with Random Forest Classifier among tested models (Logistic Regression, Decision Tree, Random Forest).
- Identified key spatio-temporal and environmental factors influencing accident severity, enabling data-informed safety strategies.
- Found that adverse weather, poor visibility, and peak traffic hours correlate strongly with severe accidents.
- Analysis highlights California, Texas, and Florida as priority states for targeted road safety measures.
- Produced a modular, reusable ML pipeline for future scaling and integration with traffic monitoring systems.
Contributions are welcome!
- Fork the repository
- Create a new branch for your feature or bug fix
- Submit a pull request for review