This study used 102 German professional football matches from the Bundesliga (82) and Bundesliga 2 (20), covering 31 of the 34 rounds of the 2017/18 season. Each match contains player (x/y) and ball (x/y/z) position data recorded at 25 fps, event data and basic match information, e.g. names of players/teams and playing positions. Position data was collected semi-automatically using an optical tracking system (TRACAB, ChyronHego, NY) and in-game information was collected manually by human observers23. The accuracy of the tracking system was validated by Linke et al.12. Each of the 36 teams is represented with between two and twenty matches. Every team had home and away games available except one, for which only away games were available. We split the dataset into training (45 matches), validation (10) and test (47) sets, with only Bundesliga matches in the training and validation sets; all Bundesliga 2 matches were therefore included in the test set. To analyze our dataset in terms of class distribution, we compared the number of stoppages per game, their total duration and the types of stoppages (kick-off, goal kick, free kick, corner kick, penalty kick and throw-in) with other studies24,25 and found them to be similar. The dataset split was designed in a stratified manner so that the number of stoppages, their duration and their types occur with similar probability in each split.
Four ML algorithms were used: Logistic Regression (LR), as a baseline binary classifier, and three tree-based classifiers: Decision Trees (DT), Random Forests (RF) and AdaBoost (AD), as they have only a few hyperparameters, are fast to train and are known for their good results26,27. For the experiments, implementations of these algorithms were taken from the scikit-learn (v0.20.3) library for Python28. We ran our experiments on a 32-core CPU with 64 GB of RAM. Except for LR, no limits on training time or number of iterations were imposed, so the algorithms ran until they terminated naturally. For LR, the maximum number of iterations was left at its default (100).
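The four classifiers can be instantiated from scikit-learn as below; this is a minimal sketch with illustrative settings (only the LR iteration cap is stated in the text), not the tuned configurations from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# The four binary classifiers compared in this study. random_state values
# are illustrative assumptions for reproducibility, not study settings.
models = {
    "LR": LogisticRegression(max_iter=100),  # default iteration cap from the text
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AD": AdaBoostClassifier(random_state=0),
}
```

Each model exposes the same `fit`/`predict` interface, so the downstream training and evaluation pipeline is identical for all four.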
From the raw data, we derived a set of features for each frame of a match at 25 fps. First, we reduced the dataset for each frame to contain only data from the 22 players on the pitch. When fewer than 22 active players are present, the remaining values are filled with zeros. The players were sorted according to the match sheet, which depends on the players' positions and whether they play for the home team. The spatial coordinates of all players were mirrored in the second half to compensate for the side change after halftime. Additionally, the training data was augmented by swapping home and away team information, as home team players are always listed first. This doubles the available training data and removes home advantage, which could otherwise influence predictions. Each sample now contains, for each of the 22 players, their position, speed and acceleration.
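These per-frame preprocessing steps can be sketched as follows; shapes and the zero-filling convention follow the text, while the pitch-centred coordinate origin and the exact helper names are assumptions for illustration.

```python
import numpy as np

def frame_features(players, n_slots=22):
    """Build one frame's feature rows (x, y, speed, acceleration).

    `players` is a list of (x, y, speed, acc) tuples, already sorted by
    match sheet order with home team players first; missing players are
    zero-filled, as described in the text.
    """
    feats = np.zeros((n_slots, 4))
    for i, p in enumerate(players[:n_slots]):
        feats[i] = p
    return feats

def mirror_second_half(feats):
    """Mirror spatial coordinates to compensate for the halftime side change."""
    out = feats.copy()
    out[:, :2] *= -1  # assumes a coordinate origin at the pitch centre
    return out

def swap_home_away(feats, n_per_team=11):
    """Augmentation: list away team players first instead of home players."""
    return np.vstack([feats[n_per_team:], feats[:n_per_team]])
```

Applying `swap_home_away` to every training sample doubles the training data while neutralizing the home/away ordering, as described above.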
Then we augmented each sample with features from past and future frames to provide more contextual information to the models. Specifically, each sample contained additional information from frames 0.4, 0.8, 2, 4, 20, 40 and 80 s in the past and future. This choice of values provides information about player actions immediately before and after the current situation as well as long-term context, leaving the selection of the important information to the learning algorithm. We assessed their impact in the "Evaluation of ML models" section.
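The time-shift augmentation can be sketched as below. The offsets in seconds are taken from the text; the clamping of out-of-range shifts to the first/last frame is an assumption, as the study does not specify its border handling.

```python
import numpy as np

FPS = 25
SHIFTS_S = [0.4, 0.8, 2, 4, 20, 40, 80]  # offsets in seconds, from the text

def add_time_context(X):
    """Concatenate past and future frames to each sample.

    X has shape (n_frames, n_base_features); the result has
    n_base_features * 15 columns (current frame + 7 past + 7 future).
    """
    n = len(X)
    offsets = [0] + [s for dt in SHIFTS_S for s in (-int(dt * FPS), int(dt * FPS))]
    cols = [X[np.clip(np.arange(n) + o, 0, n - 1)] for o in sorted(offsets)]
    return np.hstack(cols)
```

With the 88 base features per frame (22 players × 4 characteristics), this yields the 1320-dimensional samples described below.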
We provided unscaled data to the DT, RF and AD models. For LR, the values of each feature were linearly transformed so that their empirical mean and standard deviation are zero and one, respectively. Decision trees, by contrast, are based on comparisons of values29. Since the aforementioned transformation does not change the ordering of values, it does not affect the predictions of a decision tree. Therefore, normalization was not applied for DT, RF and AD.
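In scikit-learn, this asymmetric preprocessing is naturally expressed with a pipeline; the sketch below is illustrative, not the study's exact setup.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Standardization (zero mean, unit variance) is applied only for LR;
# the tree-based models receive the raw features, since a monotone
# rescaling cannot change the outcome of their threshold comparisons.
lr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100))
dt_model = DecisionTreeClassifier()  # trained on unscaled features
```

Training the decision tree on raw and on standardized inputs yields identical predictions, which is the scale-invariance argument made above.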
In summary, all models used the same input features: the base feature set of each frame sample was augmented with 14 time shifts, 7 from the past and 7 from the future. The four individual characteristics of each of the 22 players on the pitch are the x- and y-coordinate in meters, speed in m/s and acceleration in m/s². This results in each sample consisting of 1320 (15 × 4 × 22) input features.
Hyperparameter search
Each learning algorithm requires a parameterization that influences the training process and can significantly affect performance. Since no single configuration works best for every model, we performed a hyperparameter search for each model individually. We used a grid search for LR and AD, and a random search30 for RF and DT due to their many configurable hyperparameters, both through the scikit-learn library28. All parameters covered by the hyperparameter search, as well as the best candidates, are described in detail in the appendix.
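Both search strategies are available in scikit-learn; the search spaces below are illustrative placeholders only, since the actual grids and the best candidates are given in the appendix.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search for LR (and AD); the parameter values here are assumptions.
lr_search = GridSearchCV(
    LogisticRegression(max_iter=100),
    param_grid={"C": [0.01, 0.1, 1, 10]},
)

# Random search for RF (and DT), which have many configurable
# hyperparameters; distributions here are assumptions.
rf_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": randint(10, 200),
                         "max_depth": randint(2, 20)},
    n_iter=20,
    random_state=0,
)
```

After `fit`, the tuned configuration is available as `best_params_` and the refitted model as `best_estimator_`.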
Another hyperparameter, part of the post-processing, is the kernel size of the median filter (see "Evaluation of ML models"), which is model-independent and applied only for the stoppage-level evaluation. In other words, the search for the optimal kernel size is performed for each model with the best hyperparameters found for the frame-by-frame prediction. The inspected kernel size range is 1 to 901 frames.
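The post-processing step can be sketched with SciPy's median filter; note that `scipy.signal.medfilt` requires an odd kernel size, which is consistent with the inspected range of 1 to 901 frames.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_predictions(frame_preds, kernel_size):
    """Median-filter the per-frame ball status predictions (0/1).

    Short spurious runs of the opposite label are removed, so a single
    misclassified frame no longer splits one stoppage into two.
    """
    return medfilt(np.asarray(frame_preds, dtype=float), kernel_size).astype(int)
```

A kernel size of 1 leaves the predictions unchanged, so the search range includes the unfiltered baseline.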
We evaluated the performance of the final models using four approaches. First, a frame-by-frame evaluation, which compares the ground-truth ball status to the prediction in each frame. As performance measures, we calculated the accuracy and the F1-score for each model. Additionally, we compared the prediction to a per-match guessing approach based on the percentage of in-play frames; we call this the knowledge gain. It is calculated for each match as the prediction accuracy minus the percentage of frames in play. However, performance per frame does not necessarily translate to correctly identified match interruptions, since all consecutive frames of a stoppage must have the same value. For example, a single in-play detection within a long series of ball-out labels creates two stoppages instead of one.
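The frame-by-frame measures can be computed as below, assuming label 1 means "ball in play"; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def frame_metrics(y_true, y_pred):
    """Frame-by-frame evaluation (1 = ball in play, 0 = ball out of play).

    Knowledge gain = accuracy minus the share of in-play frames, i.e. the
    improvement over always guessing the majority 'in play' state.
    """
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    gain = acc - np.mean(y_true)
    return acc, f1, gain
```

A model that always predicts "in play" thus has a knowledge gain of zero for that match.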
Consequently, and second, the model performance was evaluated at the stoppage level. The basic idea is to extract the stoppages from the ground truth and the prediction and to find matching pairs between them. Stoppages are extracted by identifying the frames at which the match was paused and resumed, where the first and last frames with the ball-out label define the start and end of a stoppage, respectively. A suitable metric for comparing predicted stoppages with actual stoppages is the intersection over union (IoU), which is common in object detection benchmarks31,32. The IoU is calculated for a pair of actual and predicted stoppages as the overlap time between them divided by the total time spanned by the two stoppages, i.e. the overlap time plus the sum of the non-overlapping durations. For our article, two stoppages are matched if their IoU is at least 50%, which ensures that each actual stoppage is paired with at most one predicted stoppage and vice versa.
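The extraction and matching steps can be sketched as follows; intervals are represented as inclusive (start, end) frame index pairs, which is an assumption about the representation, not the study's code.

```python
def extract_stoppages(ball_out):
    """Extract (start, end) frame index pairs of consecutive ball-out runs."""
    stops, start = [], None
    for i, out in enumerate(ball_out):
        if out and start is None:
            start = i                     # match paused
        elif not out and start is not None:
            stops.append((start, i - 1))  # match resumed
            start = None
    if start is not None:                 # stoppage runs to the last frame
        stops.append((start, len(ball_out) - 1))
    return stops

def iou(a, b):
    """Temporal intersection over union of two stoppages (inclusive frames)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    # overlapping intervals: their union is the contiguous span they cover
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union
```

Pairs with `iou(...) >= 0.5` are then counted as matched stoppages.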
A good model should be applicable to analysis tasks. Therefore, the results of performance measurements using the ball status prediction should be comparable to those using the ground truth. The IoU metric assigns a predicted stoppage to the corresponding ground-truth stoppage even when the overlap is imperfect. Consequently, a deviation from the correct start and end points of each stoppage may remain, which affects its application in video analysis tasks.
Third, we checked whether the start and end points of the predicted stoppages deviated substantially from the ground truth. We calculated the offsets between the actual and predicted start and end points. A fundamental task in video analysis is to analyze standard situations; thus, we assumed that our predictions should be within ±2 s. In this case, a practitioner could easily find and add temporal cues to analyze the execution of a standard situation.
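Under the stoppage representation sketched earlier (inclusive frame index pairs at 25 fps), the tolerance check reduces to a few lines; the function name and interval format are illustrative assumptions.

```python
FPS = 25

def within_tolerance(true_stop, pred_stop, tol_s=2.0):
    """Check that the predicted start and end points each deviate
    at most ±tol_s seconds from the ground-truth stoppage."""
    start_off = abs(pred_stop[0] - true_stop[0]) / FPS
    end_off = abs(pred_stop[1] - true_stop[1]) / FPS
    return start_off <= tol_s and end_off <= tol_s
```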
Fourth, we assessed the quality of the Total Distance Traveled (TDC) in Effective Playing Time (TDC) performance indicator.E). TDC is one of the most common performance indicators for estimating workload33,34,35. PMHE represents running activity when the ball is in play and can be interpreted as match intensity. Since the ball state prediction error introduces an error at TDCE, we checked if this error is acceptable for performance analysis. Therefore, we calculated the TDCE for each player three times per game using ball status based on (1) ground truth, (2) AD prediction and 3) naïve approach and compared them. For the naive approach, we calculated an approximation for TDCE taking the actual TDC per player for the entire match and the average percentage of effective playing time over all matches. In our test matches, the average percentage of effective playing time was 59.6 ± 6.0% (Min. = 47.4%, Max. = 69.7%). To reduce noise, only outfield players who played a full match were included in the analysis.
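The two ways of computing the indicator can be sketched as follows; the per-frame distance representation is an assumption, while the 59.6% average effective-playing-time share is taken from the text.

```python
def tdc_effective(distances_per_frame, in_play):
    """TDC_E: distance covered in frames where the ball is in play.

    `in_play` is the per-frame ball status, taken either from the
    ground truth or from a model prediction.
    """
    return sum(d for d, p in zip(distances_per_frame, in_play) if p)

def tdc_effective_naive(total_tdc, avg_effective_share=0.596):
    """Naive approximation: whole-match TDC scaled by the average
    percentage of effective playing time (59.6% in the test matches)."""
    return total_tdc * avg_effective_share
```

Comparing `tdc_effective` under predicted versus ground-truth ball status quantifies how much the prediction error propagates into the indicator.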