DATA STORIES | SPORT ANALYTICS | KNIME ANALYTICS PLATFORM

Who is the best team in the NFL?

A data-driven journey to reveal the true leaders on the gridiron and evaluate NFL Teams with Logistic Regression in the open-source KNIME Analytics Platform

Dennis Ganzaroli
Low Code for Data Science
8 min read · Nov 13, 2023


Fig 1: Yodime is scoring a Touchdown (image by author).

Yodime, the genetic cross between a Star Wars guru (Yoda) and a data science guru (KNIME), has already shown in the past that he can predict the winner of a sports tournament using simple statistical methods.

Will he be able to do the same for the NFL?

The way to the data-driven approach

The history of NFL power rankings reflects the evolution of sports analytics and data-driven decision-making. Before the advent of formal power rankings, fans and sports commentators informally ranked teams based on their observations and subjective judgments. These rankings were often based on factors like win-loss records or point differentials.

With the advancement of computing and statistical analysis in the late 20th century, some experts started to create power rankings that incorporated statistical data.

Today, power rankings are an integral part of the NFL fan experience, providing a more objective perspective on team performance and helping fans engage in informed debates about the league’s best teams.

I’ll show you how to calculate NFL team rankings yourself using the free data science tool KNIME Analytics Platform.

KNIME's visual programming language is largely self-explanatory and therefore easy to learn. Here is a good “Getting Started Guide” where you can also download the open-source software for free.

Fig 2: Getting Set Up with KNIME Analytics Platform (image from KNIME).

But before we start evaluating the teams, we first need to understand the NFL season format.

The NFL Season Format

The heart of the NFL season is the regular season, which usually spans 18 weeks, with each of the 32 teams playing 17 games.

The season typically starts in September and concludes in late December or early January. Teams are divided into two conferences: the American Football Conference (AFC) and the National Football Conference (NFC). Each conference is further divided into four divisions (North, East, South and West).

Fig 3: NFL teams in their Division and Conference (image by NFL).

Throughout the season, each team plays its three divisional rivals home and away (6 games), while 11 other predetermined matchups complete its 17-game schedule.

At the end of the regular season, the team with the best record in each division qualifies for the playoffs, along with the three teams with the next-best records in each conference, the wild card teams.

Finally, the winners of each conference’s playoffs meet in the Super Bowl, the biggest annual sporting event in the United States. The following video from the NFL YouTube channel explains the playoffs briefly and very well.

How the NFL playoff picture works (video by NFL).

The strength of schedule

American football is a very physical game, so every team benefits from a bye week during the 17-game regular season. With 32 teams in the league, a full round-robin schedule like in European soccer leagues would require each team to play 31 other teams.

This would result in an excessively long and impractical regular season.
The NFL therefore places great emphasis on divisional games, as teams play a large portion of their schedule against divisional rivals.

This makes local rivalries more exciting and boosts the overall competition in the league. However, it also means that teams can face schedules of significantly different difficulty. As a result, a win-loss record of 9–8 could mean something very different for one team than for another.

This leads to the following:

  • Teams coming off a bye week should perform better in their next game
  • Divisional games should be more competitive
  • The win-loss record of a team could lead to wrong conclusions

The science behind ranking teams

In order to accurately predict as many NFL games as possible, determining the optimal ranking order for the 32 teams would require evaluating a staggering number of possibilities. In combinatorics, this is referred to as the number of permutations.

The number of permutations for ranking 32 teams is 32! ≈ 2.63130837×10³⁵.
So, 32! is a 36-digit number, roughly 263 followed by 33 zeros. That is far too many orderings to check one by one, even with today’s computing power.
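The size of this number is easy to check yourself. The short Python sketch below is only an illustration (the article itself uses KNIME); the variable names are my own.

```python
import math

# Number of possible orderings (permutations) of 32 teams
n_teams = 32
orderings = math.factorial(n_teams)

print(orderings)            # the exact integer
print(f"{orderings:.4e}")   # scientific notation: 2.6313e+35
print(len(str(orderings)))  # number of digits: 36
```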

On the other hand, we could simply compare the win-loss records of the teams with each other. That would be much easier. But we have already seen above that the strength of the schedule can lead to wrong conclusions.

There must be something in between, and indeed statisticians have long known that Logistic Regression is very good at estimating how the probability of an event depends on independent variables.

We assume the probability p of the home team winning the game can be determined from:

Fig 4: the logit function (image by author)

Here, hrating and arating are the ratings of the home team and the away team, respectively, and H is the home advantage.

Rearranging the equation above we find:
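Written out with the symbols defined above, and assuming the model takes the standard logit form (my reconstruction, since the formulas appear only as images in the original), the two equations would read:

```latex
% Logit form: log-odds of a home win
\ln\!\left(\frac{p}{1-p}\right) = H + \mathit{hrating} - \mathit{arating}

% Rearranged for the probability p of a home win
p = \frac{e^{\,H + \mathit{hrating} - \mathit{arating}}}
         {1 + e^{\,H + \mathit{hrating} - \mathit{arating}}}
```

If both teams have the same rating, the exponent reduces to H, so any value of H above zero pushes p above 50% for the home team.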

To estimate the ratings for each team, we use the method of maximum likelihood: we choose the ratings and the home edge that maximize the probability of the wins and losses we actually observed.
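To make the idea concrete, here is a minimal Python sketch of such a maximum likelihood fit. It is not the KNIME implementation: the toy results, the function names and the plain gradient-ascent optimizer are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy results: (home_team, away_team, home_won)
games = [
    ("A", "B", 1), ("B", "C", 1), ("C", "A", 1),
    ("A", "C", 1), ("B", "A", 0), ("C", "B", 0),
]

def win_prob(home_rating, away_rating, home_edge):
    """Probability that the home team wins under the logit model."""
    return 1.0 / (1.0 + math.exp(-(home_edge + home_rating - away_rating)))

def log_likelihood(ratings, home_edge, games):
    """Log of the probability of the observed wins and losses."""
    total = 0.0
    for home, away, home_won in games:
        p = win_prob(ratings[home], ratings[away], home_edge)
        total += math.log(p) if home_won else math.log(1.0 - p)
    return total

def fit(games, steps=2000, lr=0.1):
    """Maximize the log-likelihood with plain gradient ascent."""
    teams = sorted({team for game in games for team in game[:2]})
    ratings = {team: 0.0 for team in teams}
    home_edge = 0.0
    for _ in range(steps):
        grad = {team: 0.0 for team in teams}
        grad_edge = 0.0
        for home, away, home_won in games:
            # Gradient of the log-likelihood: observed minus predicted
            err = home_won - win_prob(ratings[home], ratings[away], home_edge)
            grad[home] += err
            grad[away] -= err
            grad_edge += err
        for team in teams:
            ratings[team] += lr * grad[team]
        home_edge += lr * grad_edge
    return ratings, home_edge

ratings, home_edge = fit(games)
for team, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(team, round(rating, 3))
print("home edge:", round(home_edge, 3))
```

In this toy schedule team A wins three of four games, B two and C one, so the fitted ratings come out in that order. Only rating differences matter in the model; starting from zero, the ratings stay centered around zero.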

Building the model

The first step is to get this year’s game data. In previous articles we saw how to scrape such data from the web.

For our purpose, we will use the games from this GitHub repository:

The reason for this is that, in addition to the results and odds, we also have access to information such as wind and weather and the number of rest days of the teams.

The following KNIME workflow will do the whole job.

The KNIME workflow with all the following examples can be found on my KNIME Community Hub space.

Fig 5: KNIME Workflow to rank NFL teams (image by author).

First we load the games into KNIME using the CSV Reader node. All we have to do is enter the link where the games are located.

Fig 6: Loading the games in KNIME (image by author).

Next, we keep only the games of this season (2023) and build a matrix with one column per team and one row per game, coded as follows:

0: if the team is not playing in the game
1: if the team is playing at home
-1: if the team is playing away

These are our independent variables whose parameters we will estimate using logistic regression.

Fig 7: The team matrix for the logistic regression model (image by author).
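Outside of KNIME, the same coding can be sketched in a few lines of Python. The mini-schedule and the function name below are hypothetical and serve only to illustrate the 1 / -1 / 0 scheme.

```python
# Hypothetical mini-schedule: (home_team, away_team)
schedule = [("KC", "DET"), ("BAL", "CLE"), ("DET", "BAL")]

def build_team_matrix(schedule):
    """One row per game, one column per team: 1 = home, -1 = away, 0 = not playing."""
    teams = sorted({team for game in schedule for team in game})
    rows = []
    for home, away in schedule:
        row = {team: 0 for team in teams}
        row[home] = 1
        row[away] = -1
        rows.append(row)
    return teams, rows

teams, matrix = build_team_matrix(schedule)
print(teams)  # ['BAL', 'CLE', 'DET', 'KC']
for row in matrix:
    print([row[team] for team in teams])
# [0, 0, -1, 1]
# [1, -1, 0, 0]
# [-1, 0, 1, 0]
```

Every row contains exactly one 1 and one -1, so the fitted coefficient of a team column plays the role of that team's rating.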

Next, in our Logistic Regression model, we set the target column to “Winner”, which we created in an earlier node and which takes the value “HomeTeam” or “AwayTeam”. As independent variables we include the team columns created above (marked in red in the illustration below).

Fig 8: Setting the logistic model (image by author).
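A target column like “Winner” can be derived from the final scores. The sketch below assumes the scores are available as two numbers per game; the function name and the handling of ties (returning None so that such games can be dropped) are my own assumptions, not something the workflow description specifies.

```python
def winner_label(home_score, away_score):
    """Return the target label used by the model, or None for a tie."""
    if home_score > away_score:
        return "HomeTeam"
    if away_score > home_score:
        return "AwayTeam"
    return None  # assumption: tied games are excluded from training

print(winner_label(27, 20))  # HomeTeam
print(winner_label(17, 24))  # AwayTeam
print(winner_label(20, 20))  # None
```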

This model achieves an accuracy of 69.79%. This means that almost 70% of the games can be correctly classified based on the team ratings found.

Fig 9: Evaluating the first model (image by author).
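Accuracy here is simply the share of games whose winner the model classifies correctly. As a rough sketch, assuming a game is assigned to the home team whenever the predicted home-win probability is at least 50% (the threshold and the example numbers are assumptions):

```python
def accuracy(home_win_probs, winners, threshold=0.5):
    """Share of games where the predicted label matches the actual winner."""
    predictions = [
        "HomeTeam" if p >= threshold else "AwayTeam" for p in home_win_probs
    ]
    hits = sum(pred == actual for pred, actual in zip(predictions, winners))
    return hits / len(winners)

# Hypothetical predicted probabilities and actual outcomes for four games
probs = [0.71, 0.42, 0.55, 0.30]
winners = ["HomeTeam", "AwayTeam", "AwayTeam", "AwayTeam"]
print(accuracy(probs, winners))  # 0.75
```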

The team ratings look very promising: Philadelphia, Detroit, Baltimore and the reigning champions, the Kansas City Chiefs, are at the top.

These are the results after week 10. You can update the values after each game or each week.

Fig 10: Top 10 of NFL teams after week 10, first try (image by author).

Let us now try to improve our model with two further indicators.

Let’s take the div_games column, which indicates whether a game is a divisional game, and the column df_rest_days, which holds the difference between the two teams’ days of rest since their last games. This is how we take the bye week into account. The accuracy now improves to 70.47%.

Fig 11: Slightly better model with more indicators (image by author).
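The two extra indicators can be sketched as follows. Only the names div_games and df_rest_days come from the workflow; the input fields home_rest and away_rest, the partial division lookup and the sign convention (home minus away) are assumptions for this illustration.

```python
# Partial lookup for the example only
DIVISION = {
    "BAL": "AFC North", "CLE": "AFC North",
    "DET": "NFC North", "KC": "AFC West",
}

# Hypothetical games with the days of rest each team had before kickoff
games = [
    {"home": "BAL", "away": "CLE", "home_rest": 7, "away_rest": 14},
    {"home": "KC", "away": "DET", "home_rest": 7, "away_rest": 7},
]

def add_indicators(game):
    """Add the divisional-game flag and the rest-day difference to a game."""
    enriched = dict(game)
    same_division = DIVISION[game["home"]] == DIVISION[game["away"]]
    enriched["div_games"] = int(same_division)
    # Assumed sign convention: positive means the home team is better rested
    enriched["df_rest_days"] = game["home_rest"] - game["away_rest"]
    return enriched

for game in games:
    print(add_indicators(game))
```

In the first example the away team comes off a bye week (14 days of rest against 7), so df_rest_days is -7.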

The top 10 teams are now ranked as follows. The SF 49ers and the CLE Browns have switched places. Furthermore, the AFC North seems to be the toughest division this season with BAL, PIT, CLE and CIN.

Fig 12: Ranking of NFL teams after week 10, second try (image by author).

But football is not just a game of skill. The weather, especially the wind, can play an important role.
Fortunately, the dataset also contains some of this information. Where the wind information was missing, we imputed it from the stadium information (dome or closed roof = no wind); in all other cases we used the mean value.

Fig 13: Imputing the wind column (image by author).
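The imputation rule described above can be written down in a few lines. In this sketch the field names roof and wind, the roof labels and the use of the mean over all observed values are assumptions.

```python
from statistics import mean

# Hypothetical games; None marks a missing wind measurement
games = [
    {"roof": "outdoors", "wind": 12.0},
    {"roof": "dome", "wind": None},
    {"roof": "closed", "wind": None},
    {"roof": "outdoors", "wind": None},
    {"roof": "open", "wind": 6.0},
]

def impute_wind(games):
    """Fill missing wind: 0 under a dome or closed roof, else the observed mean."""
    observed = [g["wind"] for g in games if g["wind"] is not None]
    fallback = mean(observed)
    result = []
    for g in games:
        wind = g["wind"]
        if wind is None:
            wind = 0.0 if g["roof"] in ("dome", "closed") else fallback
        result.append({**g, "wind": wind})
    return result

print([g["wind"] for g in impute_wind(games)])
# [12.0, 0.0, 0.0, 9.0, 6.0]
```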

The accuracy now increases to 76.5%! And surprise! Our ranking list looks different now too. We have a new leader.

Fig 14: Accuracy after implementation of wind (image by author).

Detroit is now at the top; its only away loss came at Baltimore in a very windy game. The wind seems to give the home team an extra advantage. The Seahawks also improve from ninth to third place. Not surprising, on the other hand, are CAR and NE at the bottom. Nothing seems to work out for these teams this year.

Fig 15: Ranking of NFL teams after week 10, final try (image by author).

Further improvements to the model are certainly possible.

Warning! The spread line is reversed in the dataset: it shows the handicap of the away team instead of that of the home team.

Try it out for yourself!

Material for this project:

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group “Data Science with Yodime”.

Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.