Set Theory #2: Creating a Rally Search Engine

September 3, 2025

Introduction

Today, we’ll look at how we can use some nifty algorithms from search engines to create a tool to identify similar rally sequences. I initially tried this in the context of using a clustering algorithm to find similar rallies to identify which players win certain types of rallies. As we’ll see, this top down approach was not fruitful, however the bottom up approach presented in this article is much more promising.

We’ll cover:

  • A couple of examples of similar or identical rallies from the 2025 Roland Garros Final
  • A UI, where a coach can enter a specific rally sequence and get the Point #s of all similar sequences
  • How to repurpose BK Trees and Levenshtein distances to build your own tennis rally search engine

Why do we need a search engine?

First, let’s talk about why this is useful. The data from Tennis Abstract is community sourced, meaning it is not always 100% consistent.

Here’s a couple of examples of shot codes

Shot TypeShot CodesOptional Modifiers
Serve4,5,6 (Wide, Body, T)* = ace, #=unreturnable
Forehand (f)f1 (to forehand), f2 (middle), f3 (to backhand)Depth on Return (7 is short, 8 is mid-court, 9 is deep)

For example, “6, f17, f1*” represents: a serve down the T, a short return to forehand, a forehand winner from the server. The full notation is available at the end of this article.

Since f, f1, f17 and f1* are all valid ways to denote a forehand return, it’s hard to just search for a rally sequence and find exact sequence matches, especially when other filters are applied such as surface, server, court and so on.

In longer rallies, the problem is worse. The odds of two rallies being perfectly identical is highly unlikely. Say you had two 10-shot rallies, where the first 9 shots are the same but the last is slightly different - a simple exact matching algorithm would not pair these two together, but a coach may want to look at both and compare them.

The solution: use fuzzy matching to rank all other rallies by “closeness” to the target search rally, to find similar patterns of play. As a bonus, this technique can also be used to cluster similar rallies. A human can quite easily tell that “6, f17, f1*” and “6, f18, f1*” are almost identical rallies, but doing this programatically is a little trickier.

Some examples of similar rallies

For the rest of this article, I limited myself to only look at the 2025 Roland Garros Final between Jannik Sinner and Carlos Alcaraz. I “searched” every rally in this match to identify all similar rallies from the same match. I have deliberately only shown pairs of rallies which are not perfectly identical, therefore would not show up in a basic search algorithm looking for an exact match.

Long Rallies

Consider the following two points, which our search engine classed as similar. Both points have a serve to body, a 3 shot backhand rally sequence, and end with an unforced error.

Point 5: 5, b37, b3, b3, b1d@

Point 220: 5, b37, b3, b3, b1w@

Most tennis rallies are actually shorter than 5 shots, meaning it’s difficult to find longer rallies that are similar, especially with only 1 match of data to search through. Nonetheless, this is still a neat result as we would not be able to find this pairing by searching for exactly identical patterns.

Short Rallies

Looking at the shorter rallies, I found one interesting rally type that happened 5 times in the match

Wide serve, backhand to middle, forehand winner

This type of rally was played out on Alcaraz’s serve 5 times, including a second serve.

2 such rallies are shown below (Points 144 and 229)

Point 144: 4, b28, f1*

Point 229: 4, b27, f1*

Sample Search Engine

Below is a sample app where you can search through all the simplified rally sequences from the 2025 French Open final. Note that I have limited it to only show those rallies that have at least 1 fuzzy match in the data, meaning those rallies which were unique have been omitted.

You can search for the above rallies with the code ‘4b2f1*’. A coach may use this interface after a match to see how their player did. For example, Alcaraz might use this to watch his movement to re-watch what he did well, or Sinner might use this to find how he can return Alcaraz’s wide serve better next time.



Searching for the answer

Whenever you mistype something into Google, you’ll get greeted with an auto-corrected form of the answer. E.g. If I search for “Cst food” it will actually show me the results for “Cat food” instead. Google knows that “Cat” is close to “Cst” so can probably guess what I meant. Let’s see how.

The Levenshtein Distance

A simple metric to compute the distance between two sequences, or strings, is the minimum number of insertions, deletions and substitutions to be made to go from one sequence to the other. For example, to go from

  • “Cat” -> “Cst” would be 1 substitution of the middle letter
  • “Cat” -> “Cast” would require 1 insertion
  • “Cat” -> “Dog” would be 3 substitutions

and so on.

This allows us to compute the “Levenshtein Distance” between any two strings. Applying this to tennis, we could extend this to find the distance (and therefore similarity) between any two rallies.

For example, consider the Alcaraz winners from earlier:

  • “4,b2,f1*” -> “4,b2,f1@” would be 1 substitution, as the final forehand winner was substituted with a forehand error. This pair of rallies would have a similiarity score of 1.

I also modified this such that swapping from “b2” -> “b3” would count as a substitution cost of 0.5, whereas “b2” -> “f2” would have substitution cost of 1. This makes the search slightly more “fuzzy” in order to generate even more potential similar rallies.

BK Trees

Computing this metric for every possible pair of rallies is incredibly time consuming, as it has complexity O(n^2), however there is a shortcut. If we don’t care about knowing all possible distances up front, we can instead build a data structure to very quickly find all similar rallies for a given input rally - this would, in essence, be our “search engine”. The graph produced is called a BK Tree - a weighted unidirectional graph showing the value of a metric (in our case the Levenshtein distance) between two nodes (strings).

BK-Tree

Source: Mamiemando - Own work, CC BY-SA 4.0, Link

Conclusions

In this article, we’ve shown

  • How a search engine can help us identify not only those rallies which are exactly the same in the data, but also those which are close to each other
  • Why this opens the door to further analysis and questions that a coach may ask after a match

Next steps

We can extend this work in the following ways

  • Use this algorithm to build clusters of similar rallies, and see which types of rallies players prefer to use
  • Make the search work for incomplete rallies, e.g. if I wanted to find all shots with a 3 shot backhand sequence
  • Find more complex shot types, such as serve and volley, in the data so that a coach can analyse all situations where a player lost this type of point and find commonalities

Notation

Serves

  • Direction: 4=wide, 5=body, 6=T, 0=unknown
  • Faults: n=net, w=wide, d=deep, x=wide+deep, g=foot fault, e=unknown, !=shank
  • Other codes: c=let, V=time violation, +=serve-and-volley, *=ace, #=unreturnable, @=unforced return error

Rally Shots

  • Shot letters: f/b=FH/BH groundstroke, r/s=FH/BH slice, v/z=FH/BH volley, h/i=FH/BH half-volley, j/k=FH/BH swing volley, o/p=FH/BH smash, u/y=FH/BH drop shot, l/m=FH/BH lob, t=trick, q=unknown
  • Direction (optional): 1=to FH, 2=middle, 3=to BH, 0=unknown
  • Depth on return only (optional): 7=short, 8=mid-court, 9=deep

Rally Endings

  • *=winner
  • Error types: n/w/d/x/!/e
  • @=unforced error, #=forced error

Optional Modifiers

  • +=approach shot
  • –=hit at net, = =hit from baseline
  • ;=net-cord
  • ^=stop/drop volley