DATA STORIES | SPORT ANALYTICS | KNIME ANALYTICS PLATFORM
Who is the best team in the NFL?
A data-driven journey to reveal the true leaders on the gridiron and evaluate NFL Teams with Logistic Regression in the open-source KNIME Analytics Platform
Yodime, the genetic cross between a Star Wars guru (Yoda) and a data science guru (KNIME), has already shown in the past that he can predict the winner of a sports tournament using simple statistical methods.
Will he be able to do the same for the NFL?
The way to the data-driven approach
The history of NFL power rankings reflects the evolution of sports analytics and data-driven decision-making. Before the advent of formal power rankings, fans and sports commentators informally ranked teams based on their observations and subjective judgments. These rankings were often based on factors like win-loss records or point differentials.
With the advancement of computing and statistical analysis in the late 20th century, some experts started to create power rankings that incorporated statistical data.
Today, power rankings are an integral part of the NFL fan experience, providing a more objective perspective on team performance and helping fans engage in informed debates about the league’s best teams.
I’ll show you how to calculate NFL team rankings yourself using the free data science tool KNIME Analytics Platform.
The visual programming language of KNIME is largely self-explanatory and therefore easy to learn. Here is a good “Getting Started Guide” where you can also download the open-source software for free.
But before we start evaluating the teams, we first need to understand the NFL season format.
The NFL Season Format
The heart of the NFL season is the regular season, which usually spans 18 weeks, with each of the 32 teams playing 17 games.
The season typically starts in September and concludes in late December or early January. Teams are divided into two conferences: the American Football Conference (AFC) and the National Football Conference (NFC). Each conference is further divided into four divisions (North, East, South and West).
Throughout the season, each team plays its three divisional rivals home and away (6 games in total), while 11 other predetermined matchups complete the 17-game schedule.
At the end of the regular season, the team with the best record in each division qualifies for the playoffs, along with the three teams with the next best records in each conference, the wild card teams.
Finally, the winners of each conference’s playoffs meet in the Super Bowl, the biggest annual American sports event. The following video from the NFL YouTube channel explains the playoffs briefly and very well.
The strength of schedule
American football is a very physical game, so all teams benefit from a bye week during the 17-game regular season. With 32 teams in the league, a full round-robin schedule like in European soccer leagues would require each team to play 31 other teams.
This would result in an excessively long and impractical regular season.
The NFL therefore places great emphasis on divisional games, as teams play a large portion of their schedule against divisional rivals.
This makes local rivalries more exciting and boosts the overall competition in the league. However, it also means that teams can face schedules of significantly different difficulty. As a result, a win-loss record of 9–8 could mean something very different for one team than for another.
This leads to the following:
- Teams with a bye week should perform better in their next game
- Divisional games should be more competitive
- The win-loss record of a team could lead to wrong conclusions
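To make the last point concrete, here is a minimal sketch of one common, simple way to quantify schedule difficulty: the average win percentage of a team's opponents. This is not part of the KNIME workflow; the team names and records below are invented purely for illustration.

```python
def win_pct(wins, losses):
    """Share of games won (ties are ignored in this sketch)."""
    return wins / (wins + losses)


def strength_of_schedule(opponents, records):
    """Average win percentage of the opponents a team has faced."""
    return sum(win_pct(*records[o]) for o in opponents) / len(opponents)


# Hypothetical records as (wins, losses); not real NFL data.
records = {"W": (12, 5), "X": (11, 6), "Y": (5, 12), "Z": (4, 13)}

# Two teams could share the same win-loss record but have faced very different opponents.
hard = strength_of_schedule(["W", "X"], records)
easy = strength_of_schedule(["Y", "Z"], records)

print(f"hard schedule: {hard:.3f}, easy schedule: {easy:.3f}")
```

The same record earned against the "hard" opponents says more about a team than one earned against the "easy" ones, which is exactly what a rating model has to account for.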
The science behind ranking teams
In order to accurately predict as many NFL games as possible, determining the optimal ranking order for the 32 teams would require evaluating a staggering number of possibilities. In combinatorics, this is referred to as the number of permutations.
The number of permutations for ranking 32 teams is 32! ≈ 2.63130837×10³⁵.
In other words, 32! is a 36-digit number, roughly 263 followed by 33 zeros. Checking that many orderings is out of reach, even with today’s computing power.
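The size of this number is easy to check yourself. The short Python snippet below (outside of KNIME, purely as a sanity check) computes the exact value and its number of digits.

```python
import math

# Number of possible orderings (permutations) of 32 teams.
n_teams = 32
n_rankings = math.factorial(n_teams)

print(n_rankings)             # exact integer value of 32!
print(f"{n_rankings:.4e}")    # the same value in scientific notation
print(len(str(n_rankings)))   # how many digits the number has
```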
On the other hand, we could simply compare the win-loss records of the teams with each other. That would be much easier. But we have already seen above that the strength of the schedule can lead to wrong conclusions.
There must be something in between, and indeed statisticians have long known that Logistic Regression is very good at estimating how the probability of an event depends on independent variables.
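We assume that the probability p of the home team winning the game can be determined from:

$$\ln\left(\frac{p}{1-p}\right) = H + \text{hrating} - \text{arating}$$

Here, hrating and arating are the respective ratings of the home team and the away team, and H is the home advantage.
Rearranging the equation above, we find:

$$p = \frac{e^{H + \text{hrating} - \text{arating}}}{1 + e^{H + \text{hrating} - \text{arating}}}$$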
To estimate the ratings for each team, we use the method of maximum likelihood: we choose the ratings and the home edge that maximize the probability of the wins and losses we observed.
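As a rough illustration of what "maximizing the probability of the observed results" means, the sketch below assumes the standard logistic form, in which the log-odds of a home win equal H + hrating − arating. The function names, the three teams and their results are invented for this example and are not taken from the workflow.

```python
import math


def home_win_probability(hrating, arating, home_edge):
    """Logistic probability that the home team wins."""
    z = home_edge + hrating - arating
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(games, ratings, home_edge):
    """Sum of the log-probabilities of the observed results.

    games: list of (home_team, away_team, home_won) tuples.
    """
    total = 0.0
    for home, away, home_won in games:
        p = home_win_probability(ratings[home], ratings[away], home_edge)
        total += math.log(p if home_won else 1.0 - p)
    return total


# Made-up results for three hypothetical teams.
games = [("A", "B", True), ("B", "C", True), ("C", "A", False), ("A", "C", True)]

flat = {"A": 0.0, "B": 0.0, "C": 0.0}       # every team rated the same
ordered = {"A": 1.0, "B": 0.0, "C": -1.0}   # ratings that match the results

print(log_likelihood(games, flat, home_edge=0.0))
print(log_likelihood(games, ordered, home_edge=0.0))
```

The second set of ratings explains the results better and therefore gets the higher (less negative) log-likelihood. Maximum likelihood simply searches for the ratings and home edge with the highest such value.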
Building the model
The first step is to get the data for this year’s games. We have seen in our previous articles how to scrape data from the web.
For our purposes, we will use the games from the nflverse GitHub repository (https://github.com/nflverse/nfldata/tree/master).
The reason for this is that, in addition to the results and odds, we also have access to information such as wind and weather and the number of rest days of the teams.
The following KNIME workflow will do the whole job.
The KNIME workflow with all the following examples can be found on my KNIME Community Hub space.
First we load the games into KNIME using the CSV Reader node. All we have to do is enter the link where the games are located.
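For readers who want to follow along outside of KNIME, reading the games and keeping a single season can be sketched in plain Python as follows. The column names (season, home_team, away_team and so on) are assumed to match the nflverse data dictionary, and the few hand-typed rows stand in for the real file so that the snippet runs without a download.

```python
import csv
import io

# Hand-typed sample rows standing in for the real games file (assumption).
SAMPLE_CSV = """season,week,home_team,away_team,home_score,away_score
2022,1,LA,BUF,10,31
2023,1,KC,DET,20,21
2023,1,BAL,HOU,25,9
"""


def load_season(csv_text, season):
    """Parse CSV text and keep only the rows of the requested season."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["season"]) == season]


games_2023 = load_season(SAMPLE_CSV, 2023)
for game in games_2023:
    print(game["week"], game["away_team"], "at", game["home_team"])
```

In the KNIME workflow the same two steps are handled by the CSV Reader node and a filter on the season.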
Next, we keep only the games of this year (2023) and then build a matrix with one column per team. For each game, a team’s column takes the value:
- 0: if the team did not play in the game
- 1: if the team played at home
- -1: if the team played away
These are our independent variables whose parameters we will estimate using logistic regression.
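The encoding described above can be sketched in a few lines of Python. The helper name build_design_rows and the two pairings are made up for illustration; in the workflow the matrix is built with KNIME nodes.

```python
def build_design_rows(games, teams):
    """Encode each game as one row: +1 home team, -1 away team, 0 otherwise."""
    rows = []
    for home, away in games:
        row = {team: 0 for team in teams}
        row[home] = 1
        row[away] = -1
        rows.append(row)
    return rows


teams = ["BAL", "DET", "KC", "PHI"]
games = [("KC", "DET"), ("PHI", "BAL")]  # (home, away) pairings, for illustration only

rows = build_design_rows(games, teams)
for row in rows:
    print(row)
```

Every row contains exactly one +1 and one -1, so a team's fitted coefficient can be read as its rating relative to the other teams.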
Next, in our Logistic Regression model, we set the target column “Winner”, which we created earlier in a node and which has the value “HomeTeam” or “AwayTeam”, and include the independent team columns that we created above (marked in red in the illustration below).
With this model we score an accuracy of 69.79%. This means that almost 70% of the games can be correctly classified based on the team ratings found.
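To show what fitting the ratings and measuring accuracy involves, here is a small self-contained sketch. It uses plain gradient ascent with a small ridge penalty on the ratings, which is an assumption of this example and not necessarily the solver used by KNIME's Logistic Regression node. The teams and results are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_ratings(games, teams, steps=2000, lr=0.1, ridge=0.01):
    """Estimate team ratings and the home edge by gradient ascent."""
    ratings = {t: 0.0 for t in teams}
    home_edge = 0.0
    for _ in range(steps):
        # The ridge term keeps ratings finite and pins down their overall level.
        grad = {t: -ridge * ratings[t] for t in teams}
        grad_edge = 0.0
        for home, away, home_won in games:
            p = sigmoid(home_edge + ratings[home] - ratings[away])
            err = (1.0 if home_won else 0.0) - p
            grad[home] += err
            grad[away] -= err
            grad_edge += err
        for t in teams:
            ratings[t] += lr * grad[t]
        home_edge += lr * grad_edge
    return ratings, home_edge


def accuracy(games, ratings, home_edge):
    """Share of games whose winner the fitted model classifies correctly."""
    hits = 0
    for home, away, home_won in games:
        predicted_home_win = sigmoid(home_edge + ratings[home] - ratings[away]) >= 0.5
        hits += predicted_home_win == home_won
    return hits / len(games)


# Invented results: (home_team, away_team, home_won).
games = [
    ("A", "B", True), ("B", "A", False), ("A", "C", True),
    ("C", "A", False), ("B", "C", True), ("C", "B", True),
]
ratings, home_edge = fit_ratings(games, ["A", "B", "C"])
print(ratings, home_edge)
print(accuracy(games, ratings, home_edge))
```

Note that this accuracy is measured on the same games that were used for fitting, so it describes how well the ratings explain past results rather than how well they predict future ones.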
The team ratings look very promising: Philly, Detroit, Baltimore and the reigning champions, the Kansas City Chiefs, are at the top.
These are the results after week 10. You can update the values after each game or each week.
Let us now try to improve our model with two further indicators.
Let’s take the div_games column, which indicates whether it is a divisional game, and the df_rest_days column, which shows the difference in rest days between the two teams since their last games. This is how we take the bye week into account. The accuracy now improves to 70.47%.
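A possible way to derive these two inputs is sketched below. It assumes that the raw data provides the columns home_rest, away_rest and div_game, as listed in the nflverse data dictionary, and that the difference is taken as home minus away. Both points are assumptions of this sketch rather than confirmed details of the workflow.

```python
def schedule_features(game):
    """Derive the two extra model inputs from one raw game record."""
    return {
        # 1 for a divisional game, 0 otherwise.
        "div_games": int(game["div_game"]),
        # Positive value: the home team had more rest than the away team.
        "df_rest_days": int(game["home_rest"]) - int(game["away_rest"]),
    }


# Hypothetical game: the home team comes off a bye week (14 days of rest).
game = {"div_game": "1", "home_rest": "14", "away_rest": "7"}
features = schedule_features(game)
print(features)
```

These two values are then added as further independent variables next to the team columns.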
The top 10 teams are now ranked as follows. The SF 49ers and CLE Browns have switched places. Furthermore, the AFC North seems to be the toughest division this season with BAL, PIT, CLE and CIN.
But football is not just a game of skill. The weather, especially the wind, can play an important role.
Fortunately, the dataset also contains some of this information. Where the wind information was missing, we imputed it retrospectively from the stadium information (dome or closed roof = no wind); otherwise we used the mean value.
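The imputation rule can be sketched as follows. The roof values ("dome", "closed", "open", "outdoors") are assumed to follow the nflverse data dictionary, and the mean is taken over all games with a known wind value, which is one plausible reading of the rule. The sample games are invented.

```python
from statistics import mean

INDOOR_ROOFS = {"dome", "closed"}


def impute_wind(games):
    """Fill missing wind: 0 for indoor games, otherwise the mean of known values."""
    known = [g["wind"] for g in games if g["wind"] is not None]
    fallback = mean(known) if known else 0.0
    filled = []
    for g in games:
        g = dict(g)  # copy, so the input records stay untouched
        if g["wind"] is None:
            g["wind"] = 0.0 if g["roof"] in INDOOR_ROOFS else fallback
        filled.append(g)
    return filled


# Invented sample: two games with known wind, two with missing values.
games = [
    {"roof": "outdoors", "wind": 12.0},
    {"roof": "outdoors", "wind": 4.0},
    {"roof": "dome", "wind": None},
    {"roof": "open", "wind": None},
]
filled = impute_wind(games)
print([g["wind"] for g in filled])
```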
The accuracy now increases to 76.5%! And surprise: our ranking list looks different now, too. We have a new leader.
Detroit is now at the top; its only away loss came against Baltimore in a very windy game. The wind seems to give the home team an extra advantage. The Seahawks also improve from ninth to third place. Not surprising, on the other hand, are CAR and NE at the bottom. Nothing seems to work out for these teams this year.
Further improvements to the model are certainly possible.
Warning! The spread line in the dataset is reversed: it shows the handicap of the away team instead of that of the home team.
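If you want to use the spread as a model input from the home team's point of view, flipping the sign is enough. The sketch below assumes the reading given in the warning, namely that the column holds the away team's handicap. The function name is made up for this example.

```python
def home_handicap(spread_line):
    """Convert the dataset's spread (away team's handicap) to the home team's."""
    return -spread_line


# Away team gets +3.5 points, so the home team is favored and carries -3.5.
print(home_handicap(3.5))
# Away team gives 7 points, so the home team is the underdog at +7.
print(home_handicap(-7.0))
```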
Try it out for yourself!
Material for this project:
- KNIME workflow: KNIME Community Hub
- NFL data: https://github.com/nflverse/nfldata/tree/master
- NFL data dictionary: https://nflreadr.nflverse.com/articles/dictionary_schedules.html
Thanks for reading and may the Data Force be with you!