Introduction
Welcome to Set Theory - a blog where I’ll talk about tennis data analysis. I am not a professional tennis coach, nor am I anything more than a social tennis player. However, I do have a passion for sports and data, so hopefully you’ll discover some interesting insights with me along the way.
In the first post of this series, we’ll explore a simple framework to analyse different ATP Men’s players by the types of rallies they win.
We’ll cover:
- Why we should look at individual points
- The data available from Tennis Abstract
- Choosing basic metrics to segment the data
- Future opportunities to expand this analysis
Tennis and its scoring system - why points mean prizes
Tennis has a unique scoring system - you only need 2 more points than your opponent to win a game, 2 more games than your opponent to win a set, and 1 more set than your opponent to win a match.
The implications of this are interesting. Consider a fairly extreme case:
- You could win a game from 40-30 up (therefore getting 4 points to your opponent’s 2, a 67% point winrate)
- You could win 6 games to your opponent’s 4 to win the set. Assuming that you win your 6 games from 40-30 up, and lose your 4 games from 30-40 down, you will have a point winrate of just 53%.
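The set-level arithmetic can be checked with a few lines of Python. This is only a sketch: `point_winrate` is a name made up for this post, and it assumes a game won from 40-30 finishes 4 points to 2 (and a game lost from 30-40 finishes 2 points to 4).

```python
def point_winrate(games_won, games_lost, pts_in_win=(4, 2), pts_in_loss=(2, 4)):
    """Share of points won over a set, given the points score of each game.

    pts_in_win  = (your points, opponent's points) in a game you win
    pts_in_loss = (your points, opponent's points) in a game you lose
    The defaults assume every game is decided from 40-30 or 30-40.
    """
    won = games_won * pts_in_win[0] + games_lost * pts_in_loss[0]
    lost = games_won * pts_in_win[1] + games_lost * pts_in_loss[1]
    return won / (won + lost)


# A 6-4 set where every game goes to 40-30 (or 30-40) before being decided:
# 32 points won, 28 points lost.
close_set = point_winrate(6, 4)
print(f"{close_set:.1%}")  # 53.3%
```

The set is won comfortably on the scoreboard, yet only 32 of the 60 points went to the winner.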
The maths is examined in much more detail in this blog post.
The bottom line is:
If you win slightly more than 50% of your points, you will win 70-80% of your matches.
This explains why some players can be so dominant in Grand Slams for long periods of time despite being only marginally better than their opponents on a per-point basis. Djokovic, Nadal and Federer all have point winrates around 54%, yet have won over 80% of their matches. In our small sample of their matches, we see similar numbers (but lower, as this sample of matches is after 2010).
Player | Point Winrate | Match Winrate |
---|---|---|
Novak Djokovic | 54% | 78% |
Rafael Nadal | 53% | 76% |
Roger Federer | 53% | 75% |
It is for this reason that our analysis will focus entirely on how to improve “point winrate” as much as possible.
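To get a feel for how a small per-point edge compounds, here is a rough sketch of the standard closed-form probability of winning a single game. It assumes every point is won independently with the same probability `p`, which real tennis only approximates (serve and return points behave very differently), and `game_win_prob` is a name chosen just for this illustration.

```python
def game_win_prob(p):
    """Probability of winning a game if each point is won with probability p.

    Assumes points are independent and identically distributed.
    """
    q = 1 - p
    # Win to love, to 15 or to 30: the winner takes the last point, so the
    # loser's 0, 1 or 2 points can be arranged in 1, 4 or 10 ways.
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (3 points each, 20 possible orderings) ...
    reach_deuce = 20 * p**3 * q**3
    # ... then win two points in a row before the opponent does.
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce


for p in (0.50, 0.53, 0.55):
    print(f"point winrate {p:.0%} -> game winrate {game_win_prob(p):.1%}")
```

The same kind of amplification happens again from games to sets and from sets to matches, which is why a few percentage points per point can turn into a very large gap in match winrate.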
The Data
Free sports data can be hard to come by, and tennis is no exception. Whilst there are a few proprietary providers, for the average fan there isn’t much accessible data around.
Tennis Abstract is an excellent community project by Jeff Sackmann that aims to rectify this. At the time of writing, there are over 16,000 charted matches available for free. This means over 2.5 million points were manually labelled by volunteers with information such as serve direction (4=wide, 5=body, 6=down the T), shot type (forehand, backhand, volley, slice etc.) and metadata such as unforced errors.
This is truly an incredible project and is the only reason I can start this blog.
This analysis will focus on the data for all of the charted Men’s singles matches during the 2020s (around 2800 matches available).
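As a small illustration of working with the serve direction codes mentioned above (4=wide, 5=body, 6=down the T), here is a minimal sketch. It assumes the direction digit is the first character of each charted point string; treat that, and the function name `serve_direction`, as assumptions for illustration rather than a full description of the charting format.

```python
# Serve direction codes as described in the post.
SERVE_DIRECTIONS = {"4": "wide", "5": "body", "6": "T"}


def serve_direction(point_code):
    """Return "wide", "body" or "T" for a charted point, or None if unknown.

    Assumes (for this sketch) that the first character of the point string
    is the serve direction digit.
    """
    if not point_code:
        return None
    return SERVE_DIRECTIONS.get(point_code[0])


print(serve_direction("4"))  # wide
print(serve_direction("6"))  # T
```

Anything that does not start with one of the three known digits comes back as `None`, so unusual or incomplete rows can be filtered out before analysis.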
Choosing some metrics
Remember, our goal is to look for ways to increase “point winrate”. The first question we can ask is:
What type of points do certain players win?
To answer this, I tried a variety of complex methods, including using word embeddings to see if we could “encode” rallies in the same way LLMs encode words. In the end, a far simpler approach yielded much more interesting results. I’ll cover the complex stuff in a later post.
The simple approach was to take every point in the data (over 400 thousand) and use simple rules to label them so that each point had a few categories. The idea would then be to analyse which “categories” of points certain players are above and below average at winning.
The most basic type of tennis point is an ace, a 1 shot rally where the server hits the ball and the receiver does not get a racket on it. This is a good metric to separate different players as, by intuition, we know certain players tend to hit more aces than others.
Now that we’ve labelled the aces, we can repeat this to identify other metrics that might indicate different types of points that certain players win. The ones I came up with from intuition were:
- Long rallies (>5 shots)
- Forehand/backhand rallies (3 forehands/backhands in a row somewhere during the point)
- Serve position (wide, body or down the T)
- Point ending with a winner (this ended up not proving very useful for now)
As we’ll see in a moment, some of these vary significantly between players, whilst others are not common in the data and therefore do not distinguish players and playstyles very well.
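The labelling rules listed above can be sketched roughly as follows. The `Point` class is a hypothetical, simplified stand-in for a charted point (it is not the real Tennis Abstract format), and the rules are my reading of the list: an ace is a one-shot point won by the server, a long rally has more than 5 shots, and a forehand/backhand rally has 3 of the same shot in a row, counting both players’ shots.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical, simplified view of one charted point."""
    server: str
    serve_number: int        # 1 = first serve, 2 = second serve
    serve_direction: str     # "wide", "body" or "T"
    shots: list              # shots[0] is the serve, e.g. ["serve", "forehand"]
    ended_with_winner: bool
    server_won: bool


def has_run(shots, kind, length=3):
    """True if `kind` appears `length` times in a row anywhere in `shots`."""
    run = 0
    for shot in shots:
        run = run + 1 if shot == kind else 0
        if run >= length:
            return True
    return False


def label_point(point):
    """Apply the simple rules from the list above and return a dict of labels."""
    rally = point.shots[1:]  # everything after the serve
    return {
        # One shot and the server won; a double fault is excluded by server_won.
        "ace": len(point.shots) == 1 and point.server_won,
        "long_rally": len(point.shots) > 5,
        "forehand_rally": has_run(rally, "forehand"),
        "backhand_rally": has_run(rally, "backhand"),
        "serve_direction": point.serve_direction,
        "winner": point.ended_with_winner,
    }


# Made-up example points, not real data.
ace = Point("Player A", 1, "T", ["serve"], True, True)
rally = Point("Player B", 2, "body",
              ["serve", "backhand", "forehand", "forehand", "forehand", "backhand"],
              False, True)
print(label_point(ace)["ace"])           # True
print(label_point(rally)["long_rally"])  # True
```

Keeping each rule as a separate boolean label means a point can belong to several categories at once, which is what lets us compare winrates category by category later.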
The Results
To keep things simple, we’ll only analyse from the server’s perspective, and we’ll focus on the points served by 5 arbitrarily chosen players - Alcaraz, Djokovic, Sinner, Berrettini and Medvedev.
Let’s start by looking at the ace rates of each player (the percentage of 1st serves that were aces).
Interestingly, Berrettini and Medvedev seem to hit aces on more than 20% of their first serves.
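For reference, an ace rate like this can be computed along the following lines. This is a sketch on made-up data: the `(server, serve_number, is_ace)` tuple layout is a stand-in I invented for the example, and the denominator here is points that started on a first serve, which is one reasonable reading of “percentage of 1st serves”.

```python
from collections import defaultdict


def ace_rate_by_server(points):
    """Percentage of first-serve points that were aces, per server.

    `points` is an iterable of (server, serve_number, is_ace) tuples;
    this layout is a made-up stand-in for the real charted data.
    """
    first_serves = defaultdict(int)
    aces = defaultdict(int)
    for server, serve_number, is_ace in points:
        if serve_number != 1:
            continue  # only first serves count towards this metric
        first_serves[server] += 1
        aces[server] += int(is_ace)
    return {s: 100 * aces[s] / first_serves[s] for s in first_serves}


# Made-up sample, not real data.
sample = [
    ("Player A", 1, True),
    ("Player A", 1, False),
    ("Player A", 2, False),
    ("Player A", 1, False),
    ("Player A", 1, False),
    ("Player B", 1, False),
    ("Player B", 1, False),
]
print(ace_rate_by_server(sample))  # {'Player A': 25.0, 'Player B': 0.0}
```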
Aces are mostly a first serve weapon - after a fault, players typically opt for a safer second serve. Let’s compare these players’ point winrates by whether they’re hitting a first or second serve.
There are a number of important things to note here.
- First is that all of these players have winrates around 60-70% when serving. This is important as we need to keep an average of >50% of points won. We’ll cover receiving winrates in a later blog, but of course they will be around 30-40% for most players since, averaged across all players, the serve winrate and receive winrate must sum to 100%.
- Second, Berrettini is sitting around 53% for his 2nd serve winrate, which is about 7 percentage points lower than everyone else. This suggests he is bleeding a lot of points due to a weak second serve. He makes up for this by having a strong first serve ace percentage. His playstyle is the most distinctive in the group on these metrics.
- Third, Alcaraz wins relatively few points off his first serve, 71%, compared to the others, whereas his second serve is on par with everyone else’s. It’s worth noting this data is only a small sample, and not necessarily fully representative of his stats on the whole. Moreover, this includes some junior games and games from earlier in his career. We’ll need to examine many more metrics to explain why he is such a strong player.
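To connect the first observation back to the overall point winrate, here is a tiny sketch of how serve and return winrates blend together. The numbers are purely illustrative (not taken from the data), and the 50/50 split between serve and return points is an assumption, since real matches are rarely perfectly balanced.

```python
def overall_point_winrate(serve_winrate, return_winrate, serve_share=0.5):
    """Blend serve and return winrates into one overall point winrate.

    serve_share is the fraction of all points played on the player's own
    serve; 0.5 is an assumption made for this sketch.
    """
    return serve_share * serve_winrate + (1 - serve_share) * return_winrate


# Illustrative only: 65% of points won on serve, 40% won on return.
print(f"{overall_point_winrate(0.65, 0.40):.1%}")  # 52.5%
```

A player can lose most of their return points and still sit above 50% overall, as long as the serve winrate is high enough to cover the gap.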
Serve Position
Let’s see where our players choose to aim their serves. The data is broken down into 3 categories, Wide Serves, Body Serves, and narrow ones that aim down the T. Note that I have not separated Deuce and Ad courts in this case - we’ll revisit this in a later blog.
A seasoned tennis fan might have been able to guess this information, but as someone who has not watched a significant amount of tennis, the contrast between players was quite interesting to me.
Starting with first serves, we see generally the Body Serve is unfavoured by most players, but Alcaraz utilises it the most (he had the weakest first serve point winrate). The other players show slight preference for serving Wide, but generally mix and match fairly well between the two sides, rarely opting for a Body Serve.
The contrast is really apparent when we examine second serve strategy. The highly unpopular Body Serve is now suddenly the overwhelming choice for Sinner and Berrettini. Djokovic is the only player who opts for a roughly even split between the 3 serve types on his second serve, but all 3 players utilise Body Serves at the second attempt. The logical conclusion we would draw from this is that Wide and T serves tend to be aces, or lead to easier points, but are riskier. Players therefore make a trade-off: they go for a risky Wide/T serve first and, if they miss, they opt for the safe Body Serve to ensure they have a higher chance of winning the point. We’ll examine whether this safe second serve is actually a good strategy in a later blog post.
The next thing to do, then, is to test our conclusion:
Do players have higher point winrates when going for Wide/T serves?
The answer is yes, but certainly not by as much as I was expecting. Players opt to use a first serve to the body less than 10% of the time, but it still has a 60-70% point winrate. Perhaps they opt for this against weaker players, where they prefer playing safely and winning the point in the rally. We may take a deeper look at this later.
Previously, we noted that Berrettini and Sinner have a strong preference for the Body Serve on 2nd. This makes sense for Berrettini, who wins those 2nd Body Serves slightly more often than Wide/T ones. Sinner, however, actually wins more often when serving Wide or down the T on 2nd, yet chooses the Body Serve 55% of the time! There’s probably a valid reason for this, and it may be the subject of some further digging.
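A comparison like this boils down to grouping one player’s service points by serve number and direction, then looking at how often each option is chosen and how often it wins. Below is a rough sketch on made-up data; the `(serve_number, direction, server_won)` tuple layout and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def winrate_by_serve(points):
    """Usage share and server winrate for each (serve number, direction).

    `points` is an iterable of (serve_number, direction, server_won) tuples
    for a single server; the layout is made up for this sketch.
    """
    played = defaultdict(int)
    won = defaultdict(int)
    per_serve_number = defaultdict(int)
    for serve_number, direction, server_won in points:
        key = (serve_number, direction)
        played[key] += 1
        won[key] += int(server_won)
        per_serve_number[serve_number] += 1
    return {
        key: {
            # How often this direction is chosen among serves of this number.
            "share": played[key] / per_serve_number[key[0]],
            # How often the server wins the point when choosing it.
            "winrate": won[key] / played[key],
        }
        for key in played
    }


# Made-up sample, not real data.
sample = [
    (2, "body", True),
    (2, "body", False),
    (2, "wide", True),
    (2, "T", True),
    (1, "wide", True),
]
stats = winrate_by_serve(sample)
print(stats[(2, "body")])  # {'share': 0.5, 'winrate': 0.5}
```

Putting share and winrate side by side is what exposes cases like the one above, where the most frequently chosen second serve is not the one that wins most often.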
Conclusions
In this relatively short analysis, we have:
- Understood why maximising point winrate should be the primary goal for any tennis player looking to improve
- Shown that a number of interesting insights can be gained from a basic analysis of community-sourced data
- Begun to identify different players’ styles, and started to see where data-driven improvements can be made to their games
Next steps
We were able to achieve quite a lot with just this fairly straightforward approach. As part of this work I have already identified several potential avenues to explore in the future, and many of these may become part of this blog series.
- Break down serve types based on whether the server is in the Deuce or Ad court
- Look at unreturned serves as well as aces to assess server strength
- Measure the receiver’s point win rate
- Examine the same data for the women’s singles game (I chose the men’s game as there is generally more data available as the games are longer)
- Measure the impact of strategic decisions, e.g. does it always make sense to do a slower second serve?
- Create a Tennis Rally search engine that is able to group similar rallies together so that coaches can quickly analyse points
- Compare playstyles and winrates on different surfaces