DATA STORIES | SOCCER ANALYTICS | KNIME ANALYTICS PLATFORM

FIFA World Cup: Rating Teams with KNIME

Let’s predict the winner of the FIFA World Cup 2022 by calculating the ratings with a linear regression model and forecasting the outcomes of the games using the KNIME Analytics Platform.

Dennis Ganzaroli
Low Code for Data Science
11 min read · Nov 19, 2022


Photo by Fauzan Saari on Unsplash.

Sports tournaments always involve a degree of randomness that makes them hard to predict. This is especially true for soccer matches.

Nevertheless, last time we saw that you can predict the UEFA Euro 2020 finalists using only past matches and a simple linear regression.

The corresponding article can be found here:

For the data preparation and the rating model of the soccer teams, we used the KNIME Analytics Platform, and we will use it again here.

Why a visual programming language like KNIME is the better choice for data engineering and analytics projects is explained in detail in the following article:

Lessons learned from UEFA Euro 2020

The approach used a linear regression to rate each team in the tournament. Each game was a row in the dataset, where each team was encoded as -1, 1, or 0, depending on whether the team was playing from home (1), away (-1), or not playing at all (0), with an additional column for friendly (1) or official (0) games.

So the following table with the playing teams and scores…

Fig. 1: Soccer games before coding for linear regression (image by author).

was transformed into the following table:

Fig. 2: Soccer games after coding for linear regression (image by author).
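For readers who want to see the coding step outside of KNIME, here is a minimal Python sketch of it. The function name `encode_game` and the three-team list are my own illustration and are not part of the workflow; only the 1 / -1 / 0 scheme and the friendly flag come from the description above.

```python
from typing import Dict, List


def encode_game(home: str, away: str, friendly: bool, teams: List[str]) -> Dict[str, int]:
    """Encode one game as a regression row: 1 = home, -1 = away, 0 = not playing."""
    row = {team: 0 for team in teams}
    row[home] = 1
    row[away] = -1
    row["friendly"] = 1 if friendly else 0  # 1 = friendly, 0 = official game
    return row


# Illustrative subset of teams, not the full dataset.
teams = ["Italy", "Spain", "England"]
row = encode_game("Italy", "Spain", friendly=False, teams=teams)
print(row)  # {'Italy': 1, 'Spain': -1, 'England': 0, 'friendly': 0}
```

Each game becomes one such row, and the score difference of that game is the target value.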

A linear regression, with as many input features as teams in the tournament plus one, was trained on 684 past games to predict the score difference. The coefficients of this linear regression model were used to rate each team.
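To make the idea of "coefficients as ratings" concrete, here is a small sketch in plain Python. It is not the KNIME node: it solves the least-squares problem through the normal equations, and it adds a tiny ridge term, which is my own assumption. The ridge is needed because every row contains one +1 and one -1, so the ratings are only determined up to a common constant. The three teams, the games and the friendly column being left out are all simplifications for illustration.

```python
from typing import Dict, List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ratings(X: List[List[float]], y: List[float], ridge: float = 1e-6) -> List[float]:
    """Least squares via (X^T X + ridge * I) w = X^T y; ridge is an assumption."""
    n = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(n)]
    return solve(xtx, xty)


# Made-up games: columns are teams A, B, C (1 = home, -1 = away, 0 = not playing).
teams = ["A", "B", "C"]
X = [[1, -1, 0], [0, 1, -1], [1, 0, -1]]
y = [2, 1, 3]  # score difference from the home team's point of view
ratings: Dict[str, float] = dict(zip(teams, fit_ratings(X, y)))
print(ratings)
```

In this toy example the difference between two ratings reproduces the expected score difference, for example A minus B is about 2.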

Fig. 3: Rating the UEFA Euro 2020 Soccer Teams with a linear regression in KNIME (image by author).

As you can see in the figure above, the model correctly predicted 3 of the 4 semifinalists (England, Italy and Spain).

On the other hand, the model did not correctly predict the surprising advance of Denmark into the semifinals.

Was that just part of the randomness, or were we just lucky with our predictions?

Fig 4: UEFA Euro 2020 Knockout Phase (image by Wikipedia).

If we take a closer look at the knockout games of the tournament, we notice that according to the ratings, the Netherlands should have advanced against the Czech Republic (see picture above left). But instead they lost the game unexpectedly.

According to the ratings, Denmark was stronger than Wales and the Czech Republic (see picture above), so in the end the qualification for the semifinals was justified anyway.

For Belgium, however, the same logic applies in reverse. Actually, they would have been one of the four best teams of the tournament, but they were already eliminated in the quarterfinals because they had to play against the stronger rated Italians.

The Italians won the tournament in the end, but had to go all the way to penalties against both Spain in the semifinals and England in the final.
And everyone knows that in the penalty shootout there is also a lot of luck involved.

So can we expect to find the best teams of the tournament again at the current FIFA World Cup in Qatar if we use the same method as for UEFA Euro 2020?

Finding a model to explain the games played before the tournament

The first question that arises is which games we should consider for our analysis and how far back in time we should go.

It quickly becomes evident that we have to deal with several constraints:

1. Problem of few intercontinental games

To calculate our ratings, we need enough games to evaluate the strength of the teams reliably.
We do have the World Cup qualifiers, but these matches are only played between teams in the same region or confederation. So even if a team plays a lot of matches, its rating will not be reliable if none of them are against teams from other regions.

Fig 5: Qualified teams by region, with FIFA Men’s World Ranking before the tournament (image by Wikipedia).

To overcome this issue we need games between the different regions. But competitive games of this kind are rare: they mostly take place every four years at the World Cup. On the other hand, there are also friendly matches between teams from different confederations, such as Argentina vs Italy.

The following table shows the games played between teams of different confederations in the period after the last World Cup in 2018.
As you can see, there was just one game (Argentina vs Italy) between a member of the South American CONMEBOL confederation and the European UEFA.

Fig 6: Games played between teams of different confederations since WC 2018 (image by author).

If we use a linear regression to calculate the ratings, we depend on the matches between CONMEBOL and CONCACAF teams, and between CONCACAF and UEFA teams, being sufficiently informative, as they are the bridge that lets us compare the ratings of South American and European teams.
This, in turn, requires that we have enough matches between each of these pairs of confederations. And this fact leads us to the next point:

2. Problem of selected time window

The most recent games are likely to be more informative, as the teams are more comparable than in games that took place 3 years ago.
But as we have already seen above, we need to include enough intercontinental games in our analysis, and these take place relatively rarely. So where is the compromise that includes enough games without going back too far?

This question is difficult to answer. We will take a pragmatic approach and compare three different time windows:

  • Time Window A: from World Cup 2018 till today
  • Time Window B: from Nov 2019 (the last 3 years)
  • Time Window C: from Nov 2020 (the last 2 years)

We get the following ratings. Some teams, such as Brazil and the USA, keep their rankings consistently across all three time windows. Other teams fluctuate more, but not dramatically.

Fig 7: Ratings on different time windows (from WC 2018, last 3 years and last 2 years) (image by author).

The correlations between the score differences and the predicted values are:

  • Time Window A: from World Cup 2018 till today: 0.731
  • Time Window B: from Nov 2019 (the last 3 years): 0.730
  • Time Window C: from Nov 2020 (the last 2 years): 0.745

Therefore, Time Window C seems to be the most promising.
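The comparison above boils down to a Pearson correlation per time window. As a rough sketch in Python: the three correlation values are the ones reported above, while the `actual` and `predicted` lists are made-up numbers that only show how such a value is computed.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlations reported above for the three time windows.
reported = {"A": 0.731, "B": 0.730, "C": 0.745}
best_window = max(reported, key=reported.get)
print(best_window)  # C

# Made-up score differences and predictions, for illustration only.
actual = [2, -1, 0, 3, -2]
predicted = [1.5, -0.5, 0.2, 2.0, -1.0]
r = pearson(actual, predicted)
print(round(r, 3))
```

Note that the differences between the three windows are small, so this criterion alone should not be over-interpreted.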

But since the ratings are calculated from the goal difference, an outlier effect may have occurred here. And that is the content of the next point:

3. Problem of different motivation of teams in high scoring games

Different countries have different mentalities. The Germans or the English but also the Japanese play very intensively even against a very weak opponent until the end of the game. Therefore, results like a 10:0 cannot be excluded here.

Fig 8: The most high scoring games of the last three years (image by author).

The Italians are quite different here. Once they are really in the lead, they take it easy until the end of the game. You will rarely see Italy play 10:0 even against such a weak opponent as San Marino.

Therefore, to make these high-scoring games more comparable, we will truncate the score difference at a maximum value of 3 or -3.
After 3 goals difference, the game is usually over, and each additional goal for the winning team no longer has much importance for our ratings.

Fig 9: Truncating the score difference to max 3 or min -3 (image by author).
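In code, the truncation is a simple clipping of the score difference. A minimal Python sketch (the function name and the sample values are my own illustration):

```python
def truncate_score_diff(diff: int, limit: int = 3) -> int:
    """Clip the score difference to the range [-limit, limit]."""
    return max(-limit, min(limit, diff))


diffs = [10, 2, 0, -1, -5]  # made-up score differences
print([truncate_score_diff(d) for d in diffs])  # [3, 2, 0, -1, -3]
```

A 10:0 and a 3:0 therefore count the same for the ratings.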

However, in order to be able to compare the quality of our rating with the previous ones, we will no longer use the correlation to the goal difference, but we will take a different approach.

We take the predicted difference in results (pred_dfres) and use it as an input variable for a logistic model. This logistic model is used to predict the Final Time Result (FTR [1,X,2]).

Since home wins are almost twice as frequent as away wins or draws, the distribution of the outcomes is unequal.
To compensate for this, we will double the number of games with away wins “2” and draws “X” in our dataset to get a roughly equal distribution of outcomes.

There is a node in KNIME, the Equal Size Sampling node, that balances the classes automatically. However, it does so by reducing the sample size, and since we do not have much data to spare, we prefer to increase the smaller classes manually.
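The manual balancing can be sketched in a few lines of Python. This is only an illustration of the idea, not the KNIME workflow itself; the function name `oversample` and the four toy games are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def oversample(games: List[Dict[str, str]],
               labels_to_double: Sequence[str] = ("X", "2")) -> List[Dict[str, str]]:
    """Return a copy of the games in which draws and away wins appear twice."""
    balanced = []
    for game in games:
        balanced.append(game)
        if game["FTR"] in labels_to_double:
            balanced.append(dict(game))  # duplicate the row
    return balanced


# Toy data: home wins ("1") are twice as frequent as draws or away wins.
games = [{"FTR": "1"}, {"FTR": "1"}, {"FTR": "X"}, {"FTR": "2"}]
balanced = oversample(games)
counts = Counter(game["FTR"] for game in balanced)
print(counts["1"], counts["X"], counts["2"])  # 2 2 2
```

Duplicating rows adds no new information; it only stops the model from favoring the most frequent outcome.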

The accuracy of the model increases with this adjustment from 62.747% to 63.495% (see image below).

Fig 10: Fitting of Logistic Model before and after truncating high scoring games (image by author).

Our ratings have changed again. Spain, the Netherlands and France have improved. Belgium, Denmark, England and Japan have become worse. This also seems intuitively in line with recent performance.

Fig 11: Ratings with KNIME before and after truncating high scoring games (image by author).

4. Problem of big surprises and outliers

But we are still not finished! Soccer is a low scoring game. That means that statistically big surprises and outliers happen here and there.
What is the impact of removing these outliers from our dataset, recalculating the model, and then reapplying it to the whole data set?

We define outliers as games that were predicted to be home wins (“1”) and became away wins (“2”) and vice versa.

Fig 12: Recalculation of ratings by removing outliers increases accuracy (image by author).

Recalculating the ratings after removing the outliers increases the accuracy from 63.495% to 64.511%.
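The outlier rule itself is easy to express. The following Python sketch covers only the filtering step (the refit and the re-scoring on the whole dataset happen afterwards); the column name `pred` and the sample rows are hypothetical.

```python
from typing import Dict, List


def is_outlier(predicted: str, actual: str) -> bool:
    """A game is an outlier if a predicted home win became an away win or vice versa."""
    return {predicted, actual} == {"1", "2"}


# Made-up predictions ("pred") and actual results ("FTR").
games: List[Dict[str, str]] = [
    {"pred": "1", "FTR": "2"},  # outlier
    {"pred": "1", "FTR": "X"},  # wrong, but not an outlier
    {"pred": "2", "FTR": "2"},  # correct
    {"pred": "2", "FTR": "1"},  # outlier
]
kept = [game for game in games if not is_outlier(game["pred"], game["FTR"])]
print(len(kept))  # 2
```

Games that were predicted as a win and ended in a draw stay in the training data, because they are not counted as big surprises here.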

Among the removed outlier matches we find again our example from UEFA Euro 2020 (Netherlands vs. Czech Republic), but also the game in which the reigning European champion Italy was eliminated by North Macedonia.

5. Including Homefield Advantage

Soccer tournaments such as the World Cup are hosted by one country, in the present case Qatar. Qualifying matches, on the other hand, are played in the country of the home team or of the visiting team. Friendly matches can be played either on home ground or on neutral ground.

Home field advantage is a very well-known factor in football and other sports. Therefore, we will add it to our logistic model with 1 for home advantage and 0 for no home advantage. We assume that each team will have the same home advantage on average.
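To illustrate how a home advantage flag enters a three-class logistic model, here is a small Python sketch. The coefficients are hypothetical values chosen for illustration and are NOT the fitted values of the model in this article; the draw “X” is used as the reference class, which is also an assumption.

```python
from math import exp
from typing import Dict

# Hypothetical coefficients: (intercept, weight of pred_dfres, weight of home flag).
# "X" is the reference class, so all of its coefficients are zero.
COEFS = {
    "1": (0.0, 0.8, 0.4),
    "X": (0.0, 0.0, 0.0),
    "2": (0.0, -0.8, -0.4),
}


def outcome_probabilities(pred_dfres: float, home_advantage: int) -> Dict[str, float]:
    """Softmax over the three outcomes 1, X, 2."""
    scores = {label: b0 + b1 * pred_dfres + b2 * home_advantage
              for label, (b0, b1, b2) in COEFS.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


neutral = outcome_probabilities(0.5, home_advantage=0)
at_home = outcome_probabilities(0.5, home_advantage=1)
print(round(neutral["1"], 3), round(at_home["1"], 3))
```

With the same predicted score difference, the first-named team gets a higher win probability when the flag is 1 than on neutral ground.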

And again our model accuracy increases from 64.511% to 65.954%.

Fig 13: Recalculation of ratings by adding homefield advantage increases accuracy (image by author).

Predicting the group stage

Let’s look at the final ratings for the tournament and map them to the teams of the appropriate qualifying groups.

Fig 14: Teams ordered by ratings in the qualifying groups (image by author).

In Group A, the Netherlands and Ecuador should advance to the next round. But Qatar is not far behind and, as the organizer of the tournament, will also be able to benefit from the homefield advantage, which is about 0.63 points.

In Group B, England and the United States are the favorites to make it to the next round.

In Group C we find the big tournament favorites Argentina. But everything still seems to be open for the runner-up.

In Group D, Denmark and France are on top.

Group E seems to be very equal. This is quite surprising, as one would not expect Japan to be so strong. This also has to do with the fact that we do not have enough intercontinental games available. That’s why the past games Japan-Serbia, Japan-Paraguay and Japan-Brazil seem to have such a big impact on the ratings of Japan. Japan lost just 1–0 to Brazil and beat Paraguay 4–1, a result not far off Brazil’s 4–0 win over Paraguay.

Fig 15: Games of Japan in the last 2 years (image by author).

Group F is led by Belgium, and Canada seems to be better than expected.

In Group G, there is another favorite for the title: Brazil. According to the ratings, Serbia seems to be better than Switzerland.

Finally, in Group H we find the third favorite for the title: Portugal.
Uruguay should finish second here.

If you compare the final ratings with the average market value of the teams’ players, you can see where the ratings show surprising values:
namely, Japan, Canada, Ecuador and even Denmark.

Fig 16: Ratings and average market value of players in Mio. Eur (image by author).

Predicting the group stage with the above ratings and feeding them into our logistic model leads to the following result:

Fig 17: Predicting the group stage (image by author).
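The group stage logic can be sketched as a small round-robin simulation in Python. Please note the assumptions: the ratings are invented, all games are treated as played on neutral ground, and a simple threshold on the rating difference (`draw_band`) stands in for the logistic model, so this is a simplification of the procedure above.

```python
from itertools import combinations
from typing import Dict, List


def predict_outcome(diff: float, draw_band: float = 0.25) -> str:
    """Stand-in for the logistic model: small rating differences count as a draw."""
    if diff > draw_band:
        return "1"
    if diff < -draw_band:
        return "2"
    return "X"


def group_table(ratings: Dict[str, float]) -> List[str]:
    """Play every pairing once and rank by points, then by rating."""
    points = {team: 0 for team in ratings}
    for first, second in combinations(ratings, 2):
        outcome = predict_outcome(ratings[first] - ratings[second])
        if outcome == "1":
            points[first] += 3
        elif outcome == "2":
            points[second] += 3
        else:
            points[first] += 1
            points[second] += 1
    return sorted(points, key=lambda team: (points[team], ratings[team]), reverse=True)


# Invented ratings for a hypothetical group.
group = {"Team A": 1.2, "Team B": 0.9, "Team C": 0.2, "Team D": -0.5}
table = group_table(group)
advancing = table[:2]
print(advancing)  # ['Team A', 'Team B']
```

The first two teams of each table advance to the round of 16.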

In Group E with Germany and in Group F with Croatia, we would have expected these two teams to reach the next round. Instead, Japan and Canada are predicted to qualify, which is a surprise. This result is not really intuitive. But in football, anything is possible.

Predicting the knockouts and the winner of the World Cup

If one continues to calculate according to the same procedure, the following picture emerges:

The favorites succeed as expected and reach the quarterfinals.
But here, too, we still have the “Japan” effect. Belgium also fails against these strong Japanese. In the quarterfinals, however, it’s all over for Japan.

Fig 18: Predicting the knockout games (image by beinsports).

Finally, the four teams with the highest ratings will play against each other in the semifinals:

Brazil, Argentina, Portugal and Denmark.

On this last point, we will move away a little from our numbers and calculations. Brazil has the best rating of all the teams. But since it will probably be the last World Cup for the two superstars Messi and Ronaldo, a showdown between Argentina and Portugal would be the perfect match.

So the winner of the World Cup 2022 in Qatar will be decided between Argentina and Portugal.

Fig 19: Messi vs Ronaldo in World Cup 2022.

Material for this project:

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.


Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.