Real-time crash prediction of urban highways using machine learning algorithms






Motor vehicle crashes in the United States continue to be a serious safety concern for state highway agencies, with over 30,000 fatal crashes reported each year. The World Health Organization (WHO) reported in 2016 that vehicle crashes were the eighth leading cause of death globally. Crashes on roadways are rare and random events that result from the complex relationship between the driver, vehicle, weather, and roadway. A significant breadth of research has been conducted to predict and understand why crashes occur through spatial and temporal analyses, information about the driver and roadway, and identification of hazardous locations through geographic information system (GIS) applications. Previous research studies have also investigated the effectiveness of safety devices designed to reduce the number and severity of crashes. Today, data-driven traffic safety studies are becoming an essential aspect of the planning, design, construction, and maintenance of the roadway network. This can only be done with the assistance of state highway agencies that collect and synthesize historical crash data, roadway geometry data, and environmental data every day at a resolution that will help researchers develop powerful crash prediction tools. The objective of this research study was to predict vehicle crashes in real-time. This exploratory analysis compared three well-known machine learning methods: logistic regression, random forest, and support vector machine. Additionally, another methodology was developed in which variables selected from random forest models were inserted into the support vector machine model. The review of the literature found the selected methods to be more effective in terms of prediction power. A total of 475 crashes were identified from the selected urban highway network in Kansas City, Kansas.
For each of the 475 identified crashes, six no-crash events were collected at the same location. This was necessary so that the predictive models could distinguish a crash-prone traffic operational condition from regular traffic flow conditions. Multiple data sources were fused to create a database including traffic operational data from the KC Scout traffic management center, crash and roadway geometry data from the Kansas Department of Transportation, and weather data from NOAA. Data were downloaded from five separate roadway radar sensors close to the crash location. This enabled an understanding of the traffic flow along the roadway segment (upstream and downstream) during the crash. Additionally, operational data from each radar sensor were collected in five-minute intervals up to 30 minutes prior to a crash occurring. Although six no-crash events were collected for each crash observation, the ratio of crash to no-crash events was then reduced to 1:4 (four no-crash events) and 1:2 (two no-crash events) to investigate possible effects of class imbalance on crash prediction. Also, 60%, 70%, and 80% of the data were used for training to develop each model. The remaining data were then used for model validation. The training ratios were varied to identify possible effects of the amount of training data on prediction power. Additionally, a second database was developed in which variables were log-transformed to reduce possible skewness in the distribution. Model results showed that a larger dataset increased the overall accuracy of crash prediction. The dataset with a higher observation count could classify more data accurately. The highest accuracies in all three models were observed using the dataset with a 1:6 ratio (one crash event for six no-crash events). The datasets with a 1:2 ratio had overall accuracies 13% to 18% lower than the 1:6 ratio dataset. However, the sensitivity (true positive rate) was highest for the dataset with a 1:2 ratio.
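The dataset variants described above (1:6, 1:4, and 1:2 crash to no-crash ratios, each split 60%, 70%, or 80% for training) can be sketched as follows. This is a minimal illustration only: the event identifiers are synthetic placeholders, the function name `build_dataset` is hypothetical, and random selection of no-crash events and a simple shuffled split are assumptions, since the abstract does not state how the events were chosen or how the split was drawn.

```python
import random


def build_dataset(crashes, no_crashes_by_crash, ratio, train_fraction, seed=0):
    """Keep `ratio` no-crash events per crash, then split into training and validation rows.

    Assumption: no-crash events are dropped at random and the split is a plain
    shuffled split; the study may have used a different procedure.
    """
    rng = random.Random(seed)
    rows = []
    for crash_id in crashes:
        rows.append((crash_id, 1))  # label 1 = crash event
        for control_id in rng.sample(no_crashes_by_crash[crash_id], ratio):
            rows.append((control_id, 0))  # label 0 = no-crash event
    rng.shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Synthetic placeholder IDs: 475 crashes, six no-crash events each.
crashes = [f"crash_{i}" for i in range(475)]
no_crashes_by_crash = {
    c: [f"{c}_nocrash_{j}" for j in range(6)] for c in crashes
}

for ratio in (6, 4, 2):
    for train_fraction in (0.6, 0.7, 0.8):
        train, valid = build_dataset(crashes, no_crashes_by_crash, ratio, train_fraction)
        print(f"1:{ratio} ratio, {train_fraction:.0%} training: "
              f"{len(train)} training rows, {len(valid)} validation rows")
```

Under these assumptions, the 1:2 dataset has well under half as many rows as the 1:6 dataset, so changing the class ratio also changes the dataset size.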
It was found that by reducing the response class imbalance, the sensitivity could be increased, at the cost of a reduction in overall prediction accuracy. The split ratio did not significantly affect overall accuracy. However, the sensitivity was found to increase with an increase in training data. The logistic regression models predicted an average of 30.79% (with a standard deviation of 5.02) accurately. The random forest models predicted an average of 13.36% (with a standard deviation of 9.50) accurately. The support vector machine models predicted an average of 29.35% (with a standard deviation of 7.34) accurately. The hybrid approach of random forest and support vector machine models predicted an average of 29.86% (with a standard deviation of 7.33) accurately. The significant variables found in this study included the difference between the posted speed limit and the average roadway traffic speed around the crash location. The differences in speed and vehicles per hour between upstream and downstream traffic of a crash location in the five minutes before a crash occurred were found to be significant as well. This study provided an important step in real-time crash prediction and complemented many previous research studies found in the literature review. Although the models investigated were somewhat inconclusive, this study provided an investigation of data, variables, and combinations of variables that have not been investigated previously. Real-time crash prediction is expected to assist with the ongoing development of connected and autonomous vehicles as the fleet mix begins to change, new variables can be collected, and data resolution becomes greater. Real-time crash prediction models will also continue to advance highway safety as metropolitan areas continue to grow and congestion continues to increase.
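The trade-off between overall accuracy and sensitivity noted above can be illustrated with a small sketch. It is not the study's evaluation code: the helper `accuracy_and_sensitivity` is a hypothetical name, and the "classifier" is a deliberately trivial one that always predicts no-crash, used only to show why accuracy alone can look high on imbalanced data while no crashes are detected.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (overall accuracy, sensitivity) for binary labels where 1 = crash."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity (true positive rate): share of actual crashes predicted as crashes.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


# Illustrative baseline: always predict "no crash" (label 0).
for ratio in (6, 2):
    y_true = [1] * 475 + [0] * (475 * ratio)
    y_pred = [0] * len(y_true)
    acc, sens = accuracy_and_sensitivity(y_true, y_pred)
    print(f"1:{ratio} ratio -> accuracy {acc:.3f}, sensitivity {sens:.3f}")
# prints:
# 1:6 ratio -> accuracy 0.857, sensitivity 0.000
# 1:2 ratio -> accuracy 0.667, sensitivity 0.000
```

Because this baseline scores higher accuracy on the more imbalanced dataset without identifying a single crash, sensitivity is the more informative measure when the goal is to flag crash-prone conditions.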



Traffic safety, Real-time crash prediction, Machine learning, Logistic Regression, Random Forest, Support Vector Machine



Doctor of Philosophy


Department of Civil Engineering

Major Professor

Eric J. Fitzsimmons