Ohio State update: the model had them going all the way, but they lost in the first round. Upsets are a great part of college basketball, but it always hurts to be on the losing side. We will readjust and re-run our model from the Sweet Sixteen onwards.
It’s that time of year again when millions of people fill out March Madness brackets to compete against friends or co-workers in an office pool. Everyone has a different methodology: there’s the pick-higher-seeds bracket, the 12-5-upset-bracket, the pick-my-alma-mater-all-the-way bracket, and of course, the mascot bracket.
March Madness is a perfect application for AI. Traditionally, March Madness bracket predictions are made by experts who watch a lot of college basketball. They rely on win/loss records, head-to-head records, and what they see in games. They’re also influenced by a team’s seed, which is really just how another panel of experts rank the teams.
But increasingly, pundits are using data to predict March Madness matchups. And as the amount of data available expands, better prediction quality follows.
At Akkio, we make a no-code machine learning platform that lets you create and deploy machine learning (ML) prediction models without needing to be a software engineer or data scientist. We used over 30 years of historical data on the NCAA Tourney to create a model that predicts this year’s bracket. In this article, we’ll show you how we did it so you can use AI and data you can assemble to make one too.
Machine learning is most useful when you have a large amount of historical data with associated outcomes. You can use that data to train a model that predicts the outcomes for new data.
Sports generate a massive and ever-increasing amount of game data along with a very clean result — a win/loss and score. 2003’s Moneyball by Michael Lewis brought sports analytics and modelling into the public eye for the first time. In the ensuing years, every professional team in every sport has come to rely on a staff of data scientists to generate an edge.
Basketball is no exception. Teams have looked at metrics like field goal percentage and points per minute for decades, but now they also track win shares, player fatigue, and player efficiency rating to create optimal lineups.
But most March Madness pundit brackets are still driven by narratives, not data. Machine learning is particularly useful for areas where you’re swimming in data, so much so that you can’t reasonably examine all of it, and there are clear outcomes. That sounds like a perfect description of March Madness to us.
The strength of any machine learning model depends heavily on the quality of the historical data you use to train it. With an application like basketball, there is plenty of data, so the upfront work was choosing what data to include in the model and then building the dataset. We decided to keep it simple for our first year doing this and only gather team-level data.
We found that Sports Reference had data going back to 1960 that included key stats and team averages for the year leading up to each tournament as well as matchups and outcomes. We decided to use post-1987 data on the thesis that it is both more complete and more representative of the modern game, while still containing enough matchup results to train a robust model.
Unfortunately, while Sports Reference has great data, they have not made it particularly easy to download, so we had to do some work to extract it into a usable form (we may cover this in another blog post). This step was not fun, so to save you the work you can download the training dataset we put together here: NCAA data.
As you can see from the data, for each historical NCAA tourney team we have many types of season averages covering both offense and defense. Stats include field goal percentage, 3-pointers, free throws, blocks, rebounds (defensive and offensive), etc. The data also includes the matchup results for each team in each round of the tourney: the score and who won the game. Finally, the data includes each team’s seed going into the tournament (which reflects the selection panel’s aggregate best guess at relative team strength).
We used the Akkio ML engine to train a model that learns from the past outcomes of NCAA matchups and team season stats. We ignored the team names when training the model, which is a debatable choice but avoids attaching any weight to a given team’s history in the tournament and instead looks only at the team’s performance over that season.
If you download the training dataset you will also notice that we have duplicated every game played and swapped the order of “team 1” and “team 2”. This is because the initial dataset had a biased order (higher seeds were always team 1). By changing up the order of the teams, we force the model to learn only the predictive power of each team’s full-season statistics and their seed.
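To make the duplicate-and-swap step concrete, here is a minimal sketch in plain Python. It is not the code we used to build the dataset. Apart from "G1" and "G2", the column names ("Seed1", "Seed2", "team1_won") are assumptions for illustration, and the example row uses made-up values. The sketch also assumes that every per-team column ends in "1" or "2".

```python
def swap_teams(game):
    """Return a copy of a game row with team 1 and team 2 exchanged."""
    swapped = {}
    for key, value in game.items():
        if key.endswith("1"):
            swapped[key[:-1] + "2"] = value  # team 1's stat becomes team 2's
        elif key.endswith("2"):
            swapped[key[:-1] + "1"] = value  # team 2's stat becomes team 1's
        else:
            swapped[key] = value  # columns that are not per-team stay as-is
    # The label flips too: if team 1 won originally, team 1 loses after the swap.
    swapped["team1_won"] = 1 - game["team1_won"]
    return swapped


def augment(games):
    """Keep every original row and add its mirrored copy."""
    augmented_rows = []
    for game in games:
        augmented_rows.append(game)
        augmented_rows.append(swap_teams(game))
    return augmented_rows


# Made-up example row: the higher seed is listed as team 1 and won.
games = [
    {"G1": 35, "Seed1": 1, "G2": 33, "Seed2": 16, "team1_won": 1},
]
augmented = augment(games)
for row in augmented:
    print(row)
```

After augmentation, each game appears once with the higher seed as team 1 and once as team 2, so the column position alone no longer tells the model who won.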
After that small bit of data wrangling, we uploaded the training dataset and built a model to predict which team would win. When we train a model, we do a sensitivity analysis to determine which variables are most important to the predictive outcome. Here are the results:
This makes sense — “G1” is the number of games played during the season by Team 1 (and “G2” the games played by Team 2). If a team did well in exhibition tournaments and their conference tournament this number will be higher; if a team did poorly it will be lower. Next up is the seed position, which also makes sense as it includes the selection committee’s best take on team ability. After the number of games and seed, defensive rebounds are the next most important statistic.
The interesting thing about this model is that the remaining variables account for over 54% of the predictive power of the ML model, so everything is being taken into account here. The one thing that we would have liked to include (and will consider for next year) is going a level deeper into player-specific stats for each matchup, including any injuries. The ML engine can handle the complexity, but getting the right data will take some work.
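If you want to run a similar variable-importance check on your own model, one common technique is permutation importance: shuffle one column at a time and measure how much accuracy drops. The sketch below is an assumption-laden illustration, not a description of how the Akkio engine computes its sensitivity analysis. The toy model, the "Noise" column, and the randomly generated rows are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, features, n_repeats=20, seed=0):
    """Average drop in accuracy when each feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importance = {}
    for feature in features:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[feature] for row in rows]
            rng.shuffle(shuffled)
            permuted = [dict(row, **{feature: value})
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importance[feature] = sum(drops) / n_repeats
    return importance


# Hypothetical toy model: pick team 1 whenever it has the better (lower) seed.
def toy_model(row):
    return 1 if row["Seed1"] < row["Seed2"] else 0


# Randomly generated rows, for illustration only (not real tournament data).
data_rng = random.Random(42)
rows = [
    {"Seed1": data_rng.randint(1, 16),
     "Seed2": data_rng.randint(1, 16),
     "Noise": data_rng.random()}
    for _ in range(200)
]
labels = [toy_model(row) for row in rows]  # the toy model is perfect on its own labels

importance = permutation_importance(toy_model, rows, labels, ["Seed1", "Seed2", "Noise"])
for feature, drop in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{feature}: {drop:.3f}")
```

A column the model ignores, like "Noise" here, scores zero, while columns the model leans on show a clear drop in accuracy when shuffled.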
So how good is the model really? One way to get a sense of its performance is to back-test it. We used the model to predict the 2019 NCAA tournament:
We correctly predicted Virginia to win, and the model had Michigan State (correct), Gonzaga, and Auburn (correct) in the Final Four. That’s a pretty good performance on the back-test: 3 out of 4 Final Four teams, and the correct winner!
For reference, let’s compare that to Nate Silver's pre-tournament forecast from 2019 (you will need to scroll down and select forecast from “pre-tournament”). Nate had Duke winning with Virginia, Gonzaga, and UNC rounding out the Final Four in that order of probability. Virginia actually won, with Michigan State, Texas Tech, and Auburn rounding out the Final Four. Nate did not correctly predict the winner and correctly predicted only one of the Final Four teams.
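The comparison above boils down to simple set arithmetic. Here is a small sketch that scores both 2019 forecasts using only the teams named in this article; the function and variable names are our own choices for illustration.

```python
# 2019 results and forecasts, as described in the text above.
actual_final_four = {"Virginia", "Michigan State", "Texas Tech", "Auburn"}
actual_champion = "Virginia"

forecasts = {
    "Akkio model": {
        "final_four": {"Virginia", "Michigan State", "Gonzaga", "Auburn"},
        "champion": "Virginia",
    },
    "Nate Silver (pre-tournament)": {
        "final_four": {"Duke", "Virginia", "Gonzaga", "UNC"},
        "champion": "Duke",
    },
}


def score(forecast):
    """Return (number of correct Final Four teams, whether the champion was right)."""
    hits = len(forecast["final_four"] & actual_final_four)
    return hits, forecast["champion"] == actual_champion


results = {name: score(forecast) for name, forecast in forecasts.items()}
for name, (hits, champion_correct) in results.items():
    print(f"{name}: {hits}/4 Final Four teams, champion correct: {champion_correct}")
```

A single back-tested season is a small sample, so treat a comparison like this as a sanity check rather than proof that one approach is better.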
So without further ado, here’s what our model predicts!
Our bracket isn’t boring! We predict Ohio State will beat Iowa by one point in a final game between a pair of 2 seeds. We only have one 1 seed, Illinois, making the Final Four (sorry Gonzaga fans). And our model has 1 seed Baylor getting knocked out by Wisconsin and not even making it to the Sweet Sixteen.
We predict two classic 12-5 upsets by Oregon State (over Tennessee) and Winthrop (over Villanova) in the first round. In fact, our model predicts deep tournament runs for Cinderellas like Oregon State and Morehead State in the Midwest bracket, both of which eventually give way to the most conventional showdown of all, 1 seed versus 2 seed (Illinois vs Houston), in that region’s Elite Eight.
Here are the odds that any given team wins its matchup in each round:
So how will it play out? We’ll be able to evaluate how predictive this model is over the next few weeks. We based this model on relatively basic data. If you’d like to generate different predictions with AI by using a more in-depth dataset, we’d love to have you do so with Akkio. It’s free to try and you only need to supply an email address to use our 10K action tier for a month.
We are going to update our model for each round of the tournament, evaluating how the model performed, and re-running any matchups that go differently than predicted. We will share the results and predictions for the next round via Twitter and LinkedIn, so follow us if you are interested in seeing how the model performs throughout the tournament.