A strategy in sports betting with the nearest neighbours search and genetic algorithms

The point of sports betting is not merely to correctly predict the outcome of a game, but to actually win on a bet. We propose a model of sports betting that uses the nearest neighbours search and genetic algorithms to do the job. It uses data on the teams playing, their respective formations, individual players, results of previous games, as well as odds offered by bookmakers. The model has been trained using the data from the seasons 2002/03 until 2008/09 of the English Premier League and tested against the already played games of the seasons 2009/10 and 2010/11.


Introduction
Sports betting is becoming more and more popular, as evidenced by the increasing number of players registered with online bookmakers. Interest in this area is also growing in the scientific community, and the literature on forecasting the results of sporting events is vast.
Of all sports, football enjoys the greatest interest in Europe. Many papers on predicting results in this discipline have been published, see for example [1,2,3]. The model presented in [3] uses the offensive and defensive strengths of the teams participating in the game to predict the result, with the data on previous games providing the input. The model presented in [2] uses fuzzy logic optimized by a genetic algorithm and a neural network. The results of the last five games of each team and two direct matches between them, a total of 12 games, serve as the input. The model presented in [1] is based on rule-based reasoning and Bayesian networks. In addition to the historical data on previous games, it also uses expert knowledge to construct the priors. All these models focus on correctly predicting the result of a game, without paying attention to the financial outcome of a bet. This may be misleading, since the payoff of a string of "obviously winning" bets can easily be offset by a couple of unfortunate ones.
The model presented in this paper uses much more data than just the historical results. In contrast to [1], it does not use any expert knowledge or subjective information. The principal novelty of the present approach is that it tries to optimize the actual profits from betting, not just to correctly predict the results of a series of games.

Data
When analyzing the chances of Team A winning against Team B, it is important to take into account as many factors that may influence the result as possible. Analyzing the results of previous games is not enough. For example, players change teams, and somebody who played for Team A in 2005 may be playing for Team B in 2010. For this reason, the analysis should include information on individual players. It is also important to consider which players actually participate in a given game, whether in the starting lineup or on the bench, with an emphasis on the starting lineup. Random and unexpected weaknesses of various formations also provide valuable information, as do the current form of the whole team, judged from the results of the games immediately preceding the current one, the results of direct matches between both teams, etc. Humans consider all sorts of such information before placing a bet, and a betting algorithm should do the same.
In the model presented here, each game is represented by 134 parameters. To test the model, all games from the seasons 2002/03 until 2009/10 of the English Premier League, a total of 3040 games, have been analyzed.

The model
The basic strategy employed in our model is the k-nearest neighbours (KNN) search [4]. The probability of winning a bet is calculated on the basis of how many "similar" bets, found by the KNN search, have been won. The objective is not to predict the result of a game, but to optimize the profits from the actual bet.

Bets
The bets that our model admits include:
• $t_{over}$: total goals above 2.5,
• $t_{under}$: total goals under 2.5.
The distance $d_m$ serves as the metric in the KNN algorithm, with the size of the neighbourhood, k, set at 15.
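The KNN estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: since the definition of the metric is not given here, a weighted Euclidean distance is assumed, and the names `weighted_distance` and `knn_win_probability` as well as the toy data are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Assumed metric: weighted Euclidean distance (the paper's own metric may differ)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def knn_win_probability(query, history, won, weights, k=15):
    """Share of won bets among the k past games closest to `query`.

    history: list of parameter vectors of past games
    won:     list of booleans, True if the bet was won in that game
    """
    # sorted() is stable, so ties keep the original order of the games
    order = sorted(range(len(history)),
                   key=lambda i: weighted_distance(query, history[i], weights))
    nearest = order[:k]
    return sum(1 for i in nearest if won[i]) / len(nearest)

# Toy data with 2 parameters per game instead of 134
history = [[0, 1], [1, 0], [5, 5], [0, 2]]
won = [True, False, True, True]
p = knn_win_probability([0, 0], history, won, weights=[1, 1], k=2)
print(p)  # 0.5: the two nearest games are the first (won) and the second (lost)
```

Setting a weight to 0 removes the corresponding parameter from the distance, which is how a 0-1 weight vector can select the data used in the search.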

Decision-making
After calculating the probability of each of the five types ($t_1$, $t_x$, $t_2$, $t_{over}$, $t_{under}$), this module decides whether a particular type of bet should be placed or not. This decision depends on whether the type is worth the risk. Here we use the following notation:
• $\mathit{forecast}_t$: probability of winning the bet t resulting from the forecasting,
• $\mathit{bookieOdd}_t$: odds on bet t set by the bookmaker,
• $\mathit{forecastOdd}_t$: odds on bet t resulting from the forecasting,

$$\mathit{forecastOdd}_t = \frac{1}{\mathit{forecast}_t}$$

Under the assumption that we have correctly predicted the result of the game, the bet t is worth the risk if

$$\mathit{bookieOdd}_t \geq \mathit{forecastOdd}_t$$

However, as the forecast might not be perfect, a bet is placed only if

$$\mathit{bookieOdd}_t \geq 1.5 \cdot \mathit{forecastOdd}_t$$
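The decision rule can be illustrated with a short sketch. It assumes that the forecast odds are the reciprocal of the forecast probability. The function names `forecast_odd` and `should_bet` and the example numbers are hypothetical, chosen only for illustration.

```python
def forecast_odd(forecast):
    """Odds implied by a forecast probability (assumed to be its reciprocal)."""
    if not 0.0 < forecast <= 1.0:
        raise ValueError("forecast must be a probability in (0, 1]")
    return 1.0 / forecast

def should_bet(forecast, bookie_odd, margin=1.5):
    """Place the bet only if the bookmaker's odds exceed the forecast odds by the safety margin."""
    if forecast <= 0.0:
        return False  # a bet that is never won among the neighbours is never placed
    return bookie_odd >= margin * forecast_odd(forecast)

# A bet forecast to win with probability 0.5 has forecast odds of 2.0,
# so it is placed only at bookmaker odds of 3.0 or more.
print(should_bet(0.5, 3.2))  # True
print(should_bet(0.5, 2.4))  # False
```

The margin of 1.5 means that a bet is skipped even when the bookmaker's odds look slightly favourable, which trades small winnings for protection against forecast errors.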

Assessment
This module is designed to evaluate the performance of the algorithm over a fixed time interval. To this end, each match within the time interval is subjected to typing, and the suggested types are then compared with the actual results. On this basis the values of yield and maxloss are determined. We use the following notation: T, the set of types within a fixed time interval.
Because maxloss > 0, we cannot invest all available capital in each type, because we would eventually lose it. We should not invest 1/9 of the capital either, because if we keep the stake fixed after each loss we may also go bankrupt. We therefore invest 1/10 of the available capital in each type. The yield shows that, on average, we gain 20% of the funds invested, that is, 0.2 · 1/10 = 0.02 of the capital per type. If we substitute these values into the formula above, we can see that after investing in 20 types we hold, under the assumptions specified, $1.02^{20} \approx 1.49$ times our initial investment. Starting from 100 units, our net gain is therefore 49 units.
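The arithmetic of this example can be checked with a few lines of code. The compounding formula used below is an assumption inferred from the numbers in the text (a stake of 1/10 of the capital, a yield of 20%, and 20 types). The function name `capital_multiplier` is hypothetical.

```python
def capital_multiplier(yield_per_bet, stake_fraction, n_bets):
    """Growth of the capital when a fixed fraction of it is staked on each bet
    and every bet returns the average yield (an idealised assumption)."""
    return (1.0 + yield_per_bet * stake_fraction) ** n_bets

growth = capital_multiplier(0.20, 1 / 10, 20)
print(round(growth, 2))              # 1.49
print(round(100 * (growth - 1.0)))   # 49 units gained on 100 units of initial capital
```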

Optimization
Optimization is crucial for a good performance of the algorithm presented here. The point of the optimization is an appropriate selection of data used to calculate the distance. The optimization is based on a genetic algorithm [5].
Coding: A chromosome consists of 134 genes. Each gene of the chromosome bears the information on the explanatory variable it represents, and its weight, w, set to either 0 or 1. The weight w = 0 means that this particular explanatory variable is not used in calculating the distance.
Selection: Selection is based on the roulette method.
Crossing: Crossing occurs with a probability of 70%; when it occurs, it takes place at a randomly chosen site.
Mutation: Weights can switch randomly from 0 to 1 and vice versa with a probability of 0.1%.
The fitness function is the same f(T) that was used in the Assessment module. Each generation consisted of 10 chromosomes, and the algorithm was run for 45 generations.
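The genetic algorithm with the parameters listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the mutation probability is applied per gene, the best chromosome seen so far is tracked separately, fitness values are assumed to be non-negative, and `toy_fitness` is a hypothetical stand-in for f(T), which would require the full betting simulation.

```python
import random

N_GENES, POP_SIZE, N_GENERATIONS = 134, 10, 45
P_CROSS, P_MUT = 0.70, 0.001  # crossing 70%, mutation 0.1% (assumed per gene)

def roulette_select(population, fitnesses, rng):
    """Pick a chromosome with probability proportional to its (non-negative) fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running > pick:
            return chrom
    return population[-1]

def crossover(a, b, rng):
    """Single-point crossover at a random site, applied with probability P_CROSS."""
    if rng.random() < P_CROSS:
        site = rng.randrange(1, len(a))
        return a[:site] + b[site:], b[:site] + a[site:]
    return a[:], b[:]

def mutate(chrom, rng):
    """Flip each 0-1 weight independently with probability P_MUT."""
    return [1 - g if rng.random() < P_MUT else g for g in chrom]

def evolve(fitness, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_GENES)] for _ in range(POP_SIZE)]
    best = max(population, key=fitness)
    for _ in range(N_GENERATIONS):
        fitnesses = [fitness(c) for c in population]
        children = []
        while len(children) < POP_SIZE:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            c1, c2 = crossover(p1, p2, rng)
            children += [mutate(c1, rng), mutate(c2, rng)]
        population = children[:POP_SIZE]
        best = max(population + [best], key=fitness)
    return best

def toy_fitness(chrom):
    """Hypothetical fitness: rewards switching on the first ten variables."""
    return 1 + sum(chrom[:10])

best = evolve(toy_fitness)
print(len(best), toy_fitness(best))
```

In the model itself, evaluating a chromosome means running the KNN forecast and the decision rule with the selected variables over the training seasons, so the fitness evaluation dominates the running time.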

Results
Finding the optimal strategy
The model was tested three times against the results of the English Premier League seasons 2007/08 and 2008/09. Fig. 1 shows the average fitness of each of the 45 generations in each of the three tests. Note the logarithmic scale of the vertical axis. One can see that, from roughly the 20th generation onwards, the whole generation brings returns in all tests. One can also see how quickly the model improves. Fig. 2 shows the fitness of the best chromosome in each generation. A profitable strategy is found as early as after the 4th generation of the genetic algorithm. It can also be seen that the algorithm gets stuck in a local optimum. The fitness of the best strategy identified during the tests equals 843.
Verification
The very promising results obtained during the tests have been verified against the games of the season 2009/10, for the 7 best strategies only. These results are presented in Table 1. The best strategies identified during the tests actually turned out to be the worst, but some strategies still brought profits. It appears that in-sample training, that is, optimization of the strategies to the results of the seasons 2007/08 and 2008/09, performed poorly out-of-sample, as some external conditions might have changed.
The four strategies that brought profits for the season 2009/10 have been further verified against the games of the 2010/11 season that had been played by January 1, 2011. The results are presented in Table 2. These four strategies are also profitable in the 2010/11 season.

Conclusions
An original, fully automatic algorithm for investing in sports betting has been presented. Unlike most previous studies on similar subjects, the presented model uses a lot more publicly available information on football matches. It also focuses not merely on correctly predicting results of games, but rather on profiting from actual bets. If a bet is deemed not to be profitable, it is not taken, and the gambler does not sustain losses, at the price of abstaining from minute winnings. Our results show that, contrary to popular opinion, a profitable strategy for investing in sports bets is possible.
Several possibilities for improving our model present themselves. For example, introducing a non-Euclidean metric, using separate strategies for different types of bets, and changing the weights from 0-1 to fractional values may help identify even more profitable strategies. Several programming issues need to be solved in order to make the software run faster. Eventually we want to be able to use our model in different leagues and in sports other than football.
Downloaded from the journal Annales AI-Informatica, http://ai.annales.umcs.pl, UMCS. Date: 06/04/2022 11:30:21