## Predicting the World Champion

A couple of years ago I built a model to evaluate how Froome and Dumoulin would have matched up, if they had not avoided racing against each other over the 2017 season. As we approach the 2019 World Championships Road Race in Yorkshire, I have adopted a more sophisticated approach to try to predict the winner of the men’s race. The smart money could be going on Sam Bennett.

### Deep learning

With only two races outstanding, most of this year’s UCI world tour results are available. I decided to broaden the data set with 2.HC classification European Tour races, such as the OVO Energy Tour of Britain. In order to help with prediction, I included each rider’s weight and height, as well as some meta-data about each race, such as date, distance, average speed, parcours and type (stage, one-day, GC, etc.).

The key question was what exactly to predict. The UCI allocates points for race results, using a non-linear scale. For example, Mathieu Van Der Poel was awarded 500 points for winning Amstel Gold, while Simon Clarke won 400 for coming second and Jakob Fuglsang picked up 325 for third place, continuing down to 3 points for coming 60th. I created a target variable called PosX, defined as a negative exponential of the rider’s position in any race, equating to 1.000 for a win, 0.834 for second, 0.695 for third, decaying down to 0.032 for 20th. This has a similar profile to the points scheme, emphasising the top positions, and handles races with different numbers of riders.
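As a rough illustration of the PosX target, the sketch below uses a simple negative exponential of finishing position. The decay constant of 0.1818 is an assumption, picked only because it reproduces the values quoted above; the exact constant used in the model is not stated.

```python
import math

# Assumed decay constant: chosen to match the quoted values
# (1.000, 0.834, 0.695, ... 0.032); not taken from the original code.
DECAY = 0.1818

def pos_x(position: int) -> float:
    """Negative-exponential score: 1.0 for a win, decaying with position."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    return math.exp(-DECAY * (position - 1))

# Print the score for the positions mentioned in the text
for place in (1, 2, 3, 20):
    print(place, round(pos_x(place), 3))
```

Because the score depends only on position, a top-ten finish carries the same weight whether the race had 100 or 200 starters.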

A random forest would be a typical choice of model for this kind of data set, which included a mixture of continuous and categorical variables. However, I opted for a neural network, using embeddings to encode the categorical variables, with two hidden layers of 200 and 100 activations. This was very straightforward using the fast.ai library. Training was completed in a handful of seconds on my MacBook Pro, without needing a GPU.

After some experimentation on a subset of the data, it was clear that the model was coming up with good predictions on the validation set and the out-of-sample test set. With a bit more coding, I set up a procedure to load a start list and the meta-data for a future race, in order to predict the result.

### Predictions

With the final start list for the World Championships Road Race looking reasonably complete, I was able to generate the predicted top 10. The parcours obviously has an important bearing on who wins a race. With around 3600m of climbing, the course was clearly hilly, though not mountainous. Although the finish was slightly uphill, it was not ridiculously steep, so I decided to classify the parcours as rolling with a flat finish.

It was encouraging to see that the model produced a highly credible list of potential top 10 riders, agreeing with the bookies in rating Mathieu Van Der Poel as the most likely winner. Sagan was ranked slightly below Kristoff and Bennett, who are seen as outsiders by the pundits. The popular choice of Philippe Gilbert did not appear in my top 10 and Alaphilippe was only 9th, in spite of their recent strong performances in the Vuelta and the Tour, respectively. Riders in positions 5 to 10 would all be expected to perform well in the cycling classics, which tend to be long and arduous, like the Yorkshire course.

For me, 25/1 odds on Sam Bennett are attractive. He has a strong group of teammates, in Dan Martin, Eddie Dunbar, Conor Dunne, Ryan Mullen and Rory Townsend, who will work hard to keep him with the lead group in the hillier early part of the race. He will then face an extremely strong Belgian team that is likely to play the same game that Deceuninck-QuickStep successfully pulled off in stage 17 of the Vuelta, won by Gilbert. But Bennett was born in Belgium and he was clearly the best sprinter out in Spain. He should be able to handle the rises near the finish.

A similar case can be made for Kristoff, while Matthews and Van Avermaet both had recent wins in Canada. Nevertheless it is hard to look past the three-times winner Peter Sagan, though if Van Der Poel launches one of his explosive finishes, there is no one to stop him pulling on the rainbow jersey.

### Appendix

After the race, I checked the predicted position of the eventual winner, Mads Pedersen. He was expected to come 74th. Clearly the bad weather played a role in the result, favouring the larger riders, who were able to keep warmer. The Dane clearly proved to be the strongest rider on the day.

### References

Code used for this project

## Cycling Physique

It is easy to assume that successful professional cyclists are all skinny little guys, but if you look at the data, it turns out that they have an average height of 1.80m and an average weight of around 68kg. If we are to believe the figures posted on ProCyclingStats, hardly any professional cyclists would be considered underweight. In fact, they would struggle to perform at the required level if they did not maintain a healthy weight.

### Taller than you might think

According to a study published in 2013 and updated in 2019, the global average height of adult males born in 1996 was 1.71m, but there is considerable regional variation. The vast majority of professional cyclists come from Europe, North America, Russia and the Antipodes where men tend to be taller than those from Asia, Africa and South America. For the 41 Colombians averaging 1.73m, there are 85 Dutch riders with a mean height of 1.84m. See chart below.

Furthermore, road cycling involves a range of disciplines, including sprinting and time trialling, where size and raw power provide an advantage. The peloton includes larger sprinters alongside smaller climbers.

### Not as light as expected

While 68kg for a 1.80m male is certainly slim, it equates to a body mass index of 21 (BMI = weight / (height)²), which is towards the middle of the recommended healthy range. BMI is not a sophisticated measure, as it does not distinguish between fat and muscle. Since muscle is denser than fat and cyclists tend to have a higher percentage of lean body mass, they will look slimmer than a lay person of equivalent height and weight. Nevertheless, doctors use BMI as a guide and become concerned when it falls below 18.5.
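The BMI arithmetic is easy to check. This is a minimal sketch; the function and constant names are illustrative only.

```python
UNDERWEIGHT_THRESHOLD = 18.5  # BMI level below which doctors become concerned

def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# The average professional cyclist described above: 68kg and 1.80m
average_pro = bmi(68, 1.80)
print(round(average_pro, 1), average_pro < UNDERWEIGHT_THRESHOLD)
```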

The chart includes over 1,100 professional cyclists, but very few pros would be considered underweight. The majority of riders have a BMI of between 20 and 22. Although Colombian riders (red) tend to be smaller, specialising in climbing, their average BMI of 20.8 is not that different from larger Dutch riders (orange) with a mean BMI of 21.2. The taller Colombians include the sprinters Hodeg, Gaviria and Molano.

### Types of rider

This chart shows the names of a sample of top riders. All-out sprinters tend to have a BMI of around 24, even if they are small like Caleb Ewan. Sprints at the end of more rolling courses are likely to be won by riders with a BMI of 22, such as Greipel, Van Avermaet, Sagan, Gaviria, Groenewegen, Bennett and Kwiatkowski. Time trial specialists like Dennis and Thomas have similar physiques, though Dumoulin and Froome are significantly lighter and remarkably similar to each other.

GC contenders Roglic, Kruijswijk and Gorka Izagirre are near the centre of the distribution with a BMI around 21, close to Viviani, who is unusually light for a sprinter. Pinot, Valverde, Dan Martin, the Yates brothers and Pozzovivo appear to be light for their heights. Interestingly, climbers such as Quintana, Uran, Alaphilippe, Carapaz and Richie Porte all have a BMI of around 21, whereas Lopez is a bit heavier.

If the figures reported on ProCyclingStats are accurate, George Bennett and Emanuel Buchmann are significantly underweight. Weighing 58kg for a height of 1.80m does not seem to be conducive to strong performance, unless they are extraordinary physical specimens.

### Conclusions

Professional cyclists are lean, but they would not be able to achieve the performance required if they were underweight. It is possible that the weights of individual riders might vary over time by a couple of kilos, moving them a small amount vertically on the chart, but scientific approaches are increasingly employed by expert nutritionists to avoid significant weight loss over longer stage races. The Jumbo Foodcoach app was developed alongside the Jumbo-Visma team and, working with Team Sky, James Morton strove to ensure that athletes fuel for the work required. Excessive weight loss can lead to a range of problems for health and performance.

## Betting on the Tour

On the eve of the Tour de France, the pundits have made their predictions, but when the race is over, they will be long forgotten. One way of checking your own forecasts is to take a look at the odds offered on the betting markets. These are interesting, because they reflect the actions of people who have actually put money behind their views. In an efficient and liquid market, the latest prices ought to reflect all information available. This blog takes a look at the current odds, without wishing to encourage gambling in any way.

The website oddschecker.com collates the odds from a number of bookmakers across a large range of bets. It is helpful to convert the odds into predicted probabilities. Focussing on the overall winner, Egan Bernal is the favourite at 5/2 (equating to a 29% probability of taking the yellow jersey), followed by Geraint Thomas at 7/2 (22%) and Jakob Fuglsang at 6/1 (14%). This gives a 51% chance of the winner being one of the two Team Ineos riders. The three leading contenders are some distance ahead of Adam Yates, Richie Porte, Thibaut Pinot and Nairo Quintana. Less fancied riders include Romain Bardet, Steven Kruijswijk, Rigoberto Uran, Mikel Landa, Enric Mas and Vincenzo Nibali. Anyone else is seen as an outsider.
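Fractional odds of a/b imply a probability of b / (a + b), or equivalently 1 / (1 + a/b). The sketch below applies this to the three favourites; the function name is illustrative, and the calculation ignores the bookmakers' margin, which means implied probabilities across a whole field typically sum to more than 100%.

```python
from fractions import Fraction

def implied_probability(fractional_odds: str) -> float:
    """Convert fractional odds such as '5/2' into an implied probability."""
    odds = Fraction(fractional_odds)  # Fraction parses strings like '5/2'
    return float(1 / (1 + odds))

favourites = {"Bernal": "5/2", "Thomas": "7/2", "Fuglsang": "6/1"}
for rider, odds in favourites.items():
    print(f"{rider}: {odds} -> {implied_probability(odds):.0%}")
```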

### Ups and downs

The odds change over time, as the markets evaluate the performance and changing fortunes of the riders. The following chart shows the fluctuations in the average daily implied winning chances of the three current favourites since the beginning of the year, according to betfair.com.

The implied probability that Geraint Thomas would repeat last year’s win has hovered between 20% and 30%, spiking up a bit during the Tour of Romandie. Unfortunately, Chris Froome’s odds are no longer available, as he was most likely the favourite earlier this year. However, his crash on 11 June instantaneously improved the odds for other riders, particularly Thomas and Bernal, though expectations for the Welshman declined after he crashed out of the Tour de Suisse on 18 June.

The betting on Fuglsang spiked up sharply during the Tirreno Adriatico, where he won a stage and came 3rd on GC, and the Tour of the Basque Country, where he finished strongly. Apparently, his three podium results in the Ardennes Classics had no effect on his chances of a yellow jersey, whereas his victory in the Critérium du Dauphiné had a significant positive impact.

Egan Bernal appeared from the shadows. At the beginning of the year, he was seen as a third string in Team Ineos. His victory in Paris-Nice hardly registered on his odds for the Tour. But since Froome’s crash and Thomas’s departure from the Tour de Suisse, he has become the bookies’ favourite.

With 65% of the money on the three main contenders, there are some pretty good odds available on other riders. A couple of crashes, an off day or a bit of bad luck could turn the race on its head. Clearly the Ineos and Astana teams are capable of protecting their GC contenders, but so too are Movistar, EF Education First, Mitchelton-Scott, Groupama-FDJ, Bahrain Merida and others.

### References

Code I used can be found here

## Strava – Tour de Richmond Park Clockwise

Following my recent update on the Tour de Richmond Park leaderboard, a friend asked about the ideal weather conditions for a reverse lap, clockwise around the park. This is a less popular direction, because it involves turning right at each mini-roundabout, including Cancellara corner, where the great Swiss rouleur crashed in the 2012 London Olympics, costing him a chance of a medal.

An earlier analysis suggested that apart from choosing a warm day and avoiding traffic, the optimal wind direction for a conventional anticlockwise lap was a moderate easterly, offering a tailwind up Sawyers Hill. It does not immediately follow that a westerly wind would be best for a clockwise lap, because trees, buildings and the profile of the course affect the extent to which the wind helps or hinders a rider.

Currently there are over 280,000 clockwise laps recorded by nearly 35,000 riders, compared with more than a million anticlockwise laps by almost 55,000 riders. As before, I downloaded the top 1,000 entries from the leaderboard and then looked up the wind conditions when each time was set on a clockwise lap.

In the previous analysis, I took account of the prevailing wind direction in London. If wind had no impact, we would expect the distribution of wind directions for leaderboard entries to match the average distribution of winds over the year. I defined the wind direction advantage to be the difference between these two distributions and checked if it was statistically significant. These are the results for the clockwise lap.
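The comparison of the two distributions can be sketched as follows. The counts and baseline shares are made-up numbers for illustration, eight compass points are used for brevity, and the chi-squared goodness-of-fit statistic is only one plausible way to test significance; the test actually used is not stated here.

```python
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

# Hypothetical data, for illustration only:
# share of the year the wind blows from each direction ...
baseline_share = [0.10, 0.10, 0.10, 0.05, 0.10, 0.25, 0.20, 0.10]
# ... and the number of top-1,000 leaderboard times set in each direction
leaderboard_counts = [90, 80, 90, 40, 80, 180, 320, 120]

def wind_direction_advantage(counts, baseline):
    """Leaderboard share minus baseline share, for each direction."""
    total = sum(counts)
    return [count / total - share for count, share in zip(counts, baseline)]

def chi_square_statistic(counts, baseline):
    """Goodness-of-fit statistic against the counts expected from the baseline."""
    total = sum(counts)
    expected = [share * total for share in baseline]
    return sum((c - e) ** 2 / e for c, e in zip(counts, expected))

advantage = wind_direction_advantage(leaderboard_counts, baseline_share)
for direction, value in zip(DIRECTIONS, advantage):
    print(f"{direction:>2}: {value:+.3f}")
print("chi-squared:", round(chi_square_statistic(leaderboard_counts, baseline_share), 1))
```

A p-value would follow from comparing the statistic with a chi-squared distribution with one fewer degrees of freedom than the number of directions, for instance using `scipy.stats.chisquare`.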

The wind direction advantage was significant (at p=1.3%). Two directions stand out. A westerly provides a tailwind on the more exposed section of the park between Richmond Gate and Roehampton, which seems to be a help, even though it is largely downhill. A wind blowing from the NNW would be beneficial between Roehampton and Robin Hood Gate, but apparently does not provide much hindrance on the drag from Kingston Gate up to Richmond, perhaps because this section of the park is more sheltered. The prevailing southwesterly wind was generally unfavourable to riders setting PBs on a clockwise lap.

The excellent mywindsock web site provides very good analysis for avid wind dopers. This confirms that the wind was blowing predominantly from the west for the top ten riders on the leaderboard, including the KOM, though the wind strength was generally light.

The interesting thing about this exercise is that it demonstrates a convergence between our online and our offline lives, as increasing volumes of data are uploaded from mobile sensors. A detailed analysis of each section of the million laps riders have recorded for Richmond Park could reveal many subtleties about how the wind flows across the terrain, depending on strength and direction. This could be extended across the country or globally, potentially identifying local areas where funnelling effects might make a wind turbine economically viable.

### References

Jupyter notebook for calculations

## Creating artistic images from Strava rides

When you upload a ride, Strava draws a map using the longitude and latitude coordinates recorded by your GPS device. This article explores ways in which these numbers, along with other metrics, can be used to create interesting images that might have some artistic merit.

The idea was motivated by the huge advances made in the field of Deep Learning, particularly applications for image recognition. However, since datasets come in all shapes and forms, researchers have explored ways of converting different types of data into images. In a paper published in 2015, the authors achieved success in identifying standard time series by converting them into images.

GPS bike computers typically record snapshots of information every second. What kind of images could these time series generate? It turns out that there are several ways to convert a time series into an image.

### Spectrogram

Creating a spectrogram is a standard approach from signal processing that is particularly useful for analysing acoustic files. The spectrogram is a heat map that shows how the underlying frequencies contributing to the signal change over time. Technically, it is derived by calculating the discrete Fourier transform of a window that slides across the time series. I applied this to my regular Saturday morning club ride of four laps around Richmond Park. The image changes a bit once the ride gets going after about 1200 seconds (20 minutes), but, frankly, the result was not particularly illuminating. There is no obvious reason to consider cycling power data as a superposition of frequencies.
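To make the sliding-window idea concrete, here is a minimal pure-Python sketch of a spectrogram. The window length, step size and Hann taper are assumptions chosen for illustration, and a real analysis would more likely use an FFT routine from a numerical library.

```python
import cmath
import math

def spectrogram(signal, window=64, step=32):
    """Power of each frequency bin for a window sliding across the signal."""
    frames = []
    for start in range(0, len(signal) - window + 1, step):
        segment = signal[start:start + window]
        # Hann taper (an assumed choice) to reduce leakage at the window edges
        tapered = [
            x * 0.5 * (1 - math.cos(2 * math.pi * i / (window - 1)))
            for i, x in enumerate(segment)
        ]
        frame = []
        for k in range(window // 2 + 1):
            # Discrete Fourier transform coefficient for frequency bin k
            coefficient = sum(
                x * cmath.exp(-2j * math.pi * k * i / window)
                for i, x in enumerate(tapered)
            )
            frame.append(abs(coefficient) ** 2)
        frames.append(frame)
    return frames  # frames[time_index][frequency_bin]

# Synthetic test signal: one cycle every 8 samples
signal = [math.sin(2 * math.pi * t / 8) for t in range(256)]
frames = spectrogram(signal)
print(len(frames), "time slices,", len(frames[0]), "frequency bins")
```

Each row of `frames` becomes one column of the heat map, with colour representing power.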

### Ah! Now we are getting somewhere

The authors of the referenced paper took a different approach to produce things called Gramian Angular Summation Field (GASF), Gramian Angular Difference Field (GADF), and Markov Transition Field (MTF). Read the paper if you want to know the details. I created these and something called a Recurrence Plot. All of these methods generate a matrix, by combining every element in the time series with every other element. The underlying observations occurring at times $t_{1}$ and $t_{2}$ determine the colour of the pixel at position ($t_{1}$, $t_{2}$). Images are symmetric along the lower-left to upper-right diagonal, apart from GADF, which is antisymmetric.
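For the two Gramian fields, the recipe in the paper is to rescale the series to the range [-1, +1], convert each value to an angle using the inverse cosine, and then take the cosine of the sum (GASF) or the sine of the difference (GADF) of every pair of angles. The following is a minimal sketch under those definitions, assuming a series that is not constant; the function names are illustrative.

```python
import math

def rescale(series):
    """Linearly rescale a non-constant series to the range [-1, +1]."""
    low, high = min(series), max(series)
    return [2 * (x - low) / (high - low) - 1 for x in series]

def gramian_angular_fields(series):
    """Return the (GASF, GADF) matrices as nested lists."""
    # Clamp before acos to guard against rounding just outside [-1, 1]
    angles = [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in angles] for a in angles]
    gadf = [[math.sin(a - b) for b in angles] for a in angles]
    return gasf, gadf

gasf, gadf = gramian_angular_fields([1.0, 3.0, 2.0, 5.0, 4.0])
print(len(gasf), "x", len(gasf[0]), "pixels")
```

Swapping the two angles leaves the cosine of their sum unchanged but flips the sign of the sine of their difference, which is why GASF is symmetric and GADF is antisymmetric.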

Let’s see how they look for four laps of Richmond Park. We have six time series, with corresponding sets of images below. The segmentation of the images is due to periodicity of the data. This is particularly clear in the geographic data (longitude, latitude and altitude). The higher intensity of the main part of the ride is most obvious in the heart rate data. The MTF plots are quite interesting. Scroll down through the images to the next section.

### From cycle ride to art

It is one thing to create an image of each item, but how can we combine these to summarise a ride in a single image? I considered two methods of combining time series into a single image: a) create a new image where the vertical and horizontal axes represent different series and b) create a new image by simply adding the corresponding values from two underlying images.

One problem is that some cyclists don’t have gadgets like heart rate monitors and power meters, so I initially restricted myself to just the longitude, latitude and altitude data. Nevertheless, as noted in an earlier blog, it is possible to work out speed, because the time interval is one second between each reading. Furthermore, one can estimate power, from the speed and changes in elevation.

Another problem is that rides differ in length. For this I split the ride into, say, 128 intervals and took the last observation in each interval. So for a 3 hour ride, I’d be sampling about once every 84 seconds.
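The resampling step might look like the sketch below, which assumes one reading per second and at least as many readings as intervals. The function name is illustrative.

```python
def downsample(series, n_intervals=128):
    """Split a series into n_intervals chunks and keep the last value of each."""
    n = len(series)
    if n < n_intervals:
        raise ValueError("series is shorter than the number of intervals")
    # End index (exclusive) of each interval, spread evenly across the series
    ends = [(i + 1) * n // n_intervals for i in range(n_intervals)]
    return [series[end - 1] for end in ends]

# A 3 hour ride recorded once per second gives 10,800 readings
three_hour_ride = list(range(3 * 60 * 60))
samples = downsample(three_hour_ride)
print(len(samples), "samples, roughly", len(three_hour_ride) / len(samples), "seconds apart")
```

Every ride then produces an image of the same size, regardless of its duration.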

The chart at the top of this blog was created by first normalising each series to a standard range (-1, +1). Method a) was used to create two images: longitude was added to latitude and altitude was multiplied by speed. These were added using method b). Using these measures will produce pretty much the same chart each time the ride is done. In contrast, an image that is totally unique to the ride can be produced using data relating to the individual rider. The image below uses the same recipe to combine speed, heart rate, power and cadence. If this had been a particularly special ride, the image would be a nice personal memento.

For anyone interested in the underlying code, I have posted a Jupyter notebook here.

### References

Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks, Wang Z, Oates T, https://www.aaai.org/ocs/index.php/WS/AAAIW15/paper/viewFile/10179/10251

## Machine learning for a medical study of cyclists

This blog provides a technical explanation of the analysis underlying the medical paper about male cyclists described previously. Part of the skill of a data scientist is to choose from the arsenal of machine learning techniques the tools that are appropriate for the problem at hand. In the study of male cyclists, I was asked to identify significant features of a medical data set. This article describes how the problem was tackled.

### Data

Fifty road racing cyclists, riding at the equivalent of British Cycling 2nd category or above, were asked to complete a questionnaire, provide a blood sample and undergo a DXA scan – a low intensity X-ray used to measure bone density and body composition. I used Python to load and clean up the data, so that all the information could be represented in Pandas DataFrames. As expected, this time-consuming but essential step required careful attention and cross-checking, combined with the perseverance that is always necessary to be sure of working with a clean data set.

The questionnaire included numerical data and text relating to cycling performance, training, nutrition and medical history. As a result of interviewing each cyclist, a specialist sports endocrinologist identified a number of individuals who were at risk of low energy availability (EA), due to a mismatch between nutrition and training load.

Bone density was measured throughout the body, but the key site of interest was the lumbar spine (L1-L4). Since bone density varies with age and between males and females, it was logical to use the male, age-adjusted Z-score, expressing values in standard deviations above or below the comparable population mean.

The measured blood markers were provided in the relevant units, alongside the normal range. Since the normal range is defined to cover 95% of the population, I assumed that the population could be modelled by a gaussian distribution in order to convert each blood result into a Z-score. This aligned the scale of the blood results with the bone density measures.
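Under the gaussian assumption, the midpoint of the normal range is the population mean and the range spans about 1.96 standard deviations either side of it. A minimal sketch of the conversion follows; the function name and the example range of 50 to 150 are hypothetical, and the calculation assumes the range is symmetric about the mean.

```python
from statistics import NormalDist

def blood_z_score(value, range_low, range_high, coverage=0.95):
    """Z-score of a blood result, given a normal range covering `coverage` of the population."""
    # Number of standard deviations from the mean to the edge of the range (about 1.96)
    z_edge = NormalDist().inv_cdf(0.5 + coverage / 2)
    mean = (range_low + range_high) / 2
    std_dev = (range_high - range_low) / (2 * z_edge)
    return (value - mean) / std_dev

# Hypothetical marker with a normal range of 50 to 150 units
for result in (50, 100, 125):
    print(result, round(blood_z_score(result, 50, 150), 2))
```

A result at the bottom of the normal range maps to a Z-score of about -1.96, which puts it on the same scale as the DXA bone density Z-scores.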

### Analysis

I decided to use the Orange machine learning and data visualisation toolkit for this project. It was straightforward to load the data set of 46 features for each of the 50 cyclists. The two target variables were lumbar spine Z-score (bone health) and 60 minute FTP watts per kilo (performance). The statistics confirmed the researchers’ suspicion that the lumbar spine bone density of the cyclists would be below average, partly due to the non-weight-bearing nature of the sport. Some of the readings were extremely low (verging on osteoporosis) and the question was why.

Given the relatively small size of the data set (a sample of 50), the most straightforward approach for identifying the key explanatory variables was to search for an optimal Decision Tree. Interestingly, low EA turned out to be the most important variable in explaining lumbar spine bone density, followed by prior participation in a weight-bearing sport and levels of vitamin D (which were, in most cases, below the ideal level for athletes). Since I had used all the data to generate the tree, I made use of Orange’s data sampler to confirm that these results were highly robust. This had some similarities with the Random Forest approach. Although Orange produces some simple graphical tools like the following, I used Python to generate my own versions for the final publication.

Finding a robust decision tree is one thing, but it was essential to verify whether the decision variables were statistically significant. For this, Orange provides box plots for discrete variables. For my own peace of mind, I recalculated all of the Student’s T-statistics to confirm that they were correct and significant. The charts below show an example of an Orange box plot and the final graphic used in the publication.

The Orange toolkit includes other nice data visualisation tools. I particularly liked the flexibility available to make scatter plots. This inspired the third figure in the publication, which showed the most important variable explaining performance. This chart highlights a cluster of three cyclists with low EA, whose FTP watts/kg were lower than expected, based on their high training load. I independently checked the T-statistics of the regression coefficients to identify relationships that were significant, like training load, or insignificant, like percentage body fat.

### Conclusions

The Orange toolkit turned out to be extremely helpful in identifying relationships that fed directly into the conclusions of an important medical paper highlighting potential health risks and performance drivers for high level cyclists. Restricting nutrition through diet or fasted rides can lead to low energy availability, which can cause endocrine responses in the body that reduce lumbar spine bone density, resulting in vulnerability to fracture and slow recovery. This is known as Relative Energy Deficiency in Sport (RED-S). Despite the obsession of many cyclists with reducing body fat, the key variable explaining functional threshold power watts/kg was weekly training load.

### References

Low energy availability assessed by a sport-specific questionnaire and clinical interview indicative of bone health, endocrine profile and cycling performance in competitive male cyclists, BMJ Open Sport & Exercise Medicine, https://doi.org/10.1136/bmjsem-2018-000424

Relative Energy Deficiency in Sport, British Association of Sports and Exercise Medicine

Synergistic interactions of steroid hormones, British Journal of Sports Medicine

Cyclists: Make No Bones About It, British Journal of Sports Medicine

Male Cyclists: bones, body composition, nutrition, performance, British Journal of Sports Medicine

## Strava – Automatic Lap Detection

As you upload your data, you accumulate a growing history of rides. It is helpful to find ways of classifying different types of activities. Races and training sessions often include laps that are repeated during the ride. Many GPS units can automatically record laps as you pass the point where you began your ride or last pressed the lap button. However, if the laps were not recorded on the device, it is tricky to recover them. This article investigates how to detect laps automatically.

First consider the simple example of a 24 lap race around the Hillingdon cycle circuit. Plotting the GPS longitude and latitude against time displays repeating patterns. It is even possible to see the “omega curve” in the longitude trace. So it should be possible to design an algorithm that uses this periodicity to calculate the number of laps.

This is a common problem in signal processing, where the Fourier Transform offers a neat solution. This effectively compares the signal against all possible frequencies and returns values with the best fit in the form of a power spectrum. In this case, the frequencies correspond to the number of laps completed during the race. In the bar chart below, the power spectrum for latitude shows a peak around 24. The high value at 25 probably shows up because I stopped my Garmin slightly after the finish line. A “harmonic” also shows up at 49 “half laps”. Focussing on the peak value, it is possible to reconstruct the signal using a frequency of 24, with all others filtered out.
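The lap-counting idea can be sketched with a plain discrete Fourier transform. The latitude trace here is synthetic, built from a 24-cycle wave plus a weaker harmonic, and the function names are illustrative rather than taken from the original code.

```python
import cmath
import math

def power_spectrum(signal, max_frequency):
    """Power at 1..max_frequency cycles per ride, after removing the mean."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    spectrum = []
    for k in range(1, max_frequency + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum.append(abs(coefficient) ** 2)
    return spectrum  # spectrum[k - 1] is the power at k cycles per ride

def estimate_laps(signal, max_frequency=60):
    """Number of laps, taken as the frequency with the highest power."""
    spectrum = power_spectrum(signal, max_frequency)
    return spectrum.index(max(spectrum)) + 1

# Synthetic stand-in for a latitude trace: 24 laps plus a weaker harmonic
n_samples = 1200
latitude = [
    51.5
    + 0.001 * math.sin(2 * math.pi * 24 * t / n_samples)
    + 0.0003 * math.sin(2 * math.pi * 48 * t / n_samples)
    for t in range(n_samples)
]
print(estimate_laps(latitude))
```

Subtracting the mean first removes the large constant offset of the latitude, which would otherwise dominate the spectrum at frequency zero.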

So we’re done – we can use a Fourier Transform to count the laps! Well not quite. The problem is that races and training sessions do not necessarily start and end at exactly the starting point of a lap. As a second example, consider my regular Saturday morning club run, where I ride from home to the meeting point at the centre of Richmond Park, then complete four laps before returning home. As shown in the chart below, a simple Fourier Transform approach suggests that the ride covered 5 laps, because, by chance, the combined time for me to ride south to the park and north back home almost exactly matches the time to complete a lap of the park. Visually it is clear that the repeating pattern only holds for four laps.

Although it seems obvious where the repeating pattern begins and ends, the challenge is to improve the algorithm to find this automatically. A brute force method would compare every GPS location with every other location on the ride, which would involve about 17 million comparisons for this ride. You would then need to exclude the points recorded shortly before or after each reading, depending on the speed of the rider. Furthermore, the distance between two GPS points involves a complex calculation, the haversine formula, which accounts for the curvature of the Earth.

Fortunately, two tricks can make the calculation more tractable. Firstly, the peak in the power spectrum indicates roughly how far ahead of the current time point to look for a location potentially close to the current position. Given a generous margin of, say, 15% variation in lap times, this reduces the number of comparisons by a whole order of magnitude. Secondly, since we are looking for points that are very close together, we only need to multiply the longitudes by the cosine of the latitude (because lines of longitude meet at the poles) and then a simple Euclidean sum of the squares of the differences locates points within a desired proximity of, say, 10 metres. This provides a quicker way to determine the points where the rider was “lapping”. These are shaded in yellow in the upper chart and shown in red on a longitude/latitude plot below. The orange line on the upper chart shows, on the right hand scale, the rolling lap time, i.e. the number of seconds to return to each point on the lap, from which the average speed can be derived.
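The second trick can be sketched by comparing the full haversine formula with the flat approximation. The coordinates, function names and the use of the mean latitude for the cosine are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius of the Earth in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def flat_distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance: scale longitude by cos(latitude), then use Pythagoras."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)

def is_lapping(point_a, point_b, threshold_m=10.0):
    """True if two (latitude, longitude) points lie within threshold_m of each other."""
    return flat_distance_m(*point_a, *point_b) <= threshold_m

# Two hypothetical readings a few metres apart
first_pass = (51.44000, -0.27000)
second_pass = (51.44005, -0.27005)
print(round(haversine_m(*first_pass, *second_pass), 2),
      round(flat_distance_m(*first_pass, *second_pass), 2),
      is_lapping(first_pass, second_pass))
```

Over distances of a few metres the two methods agree closely, so the cheaper calculation is sufficient for deciding whether two readings are within 10 metres.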

Two further refinements were required to make the algorithm more robust. One might ask whether it makes a difference using latitude or longitude. If the lap involved riding back and forth along a road that runs due East-West, the laps would show up on longitude but not latitude. This can be solved by using a 2-dimensional Fourier Transform and checking both dimensions. This, in turn, leads to the second refinement, exemplified by the final example of doing 12 ascents of the Nightingale Lane climb. The longitude plot includes the ride out to the West, 12 reps and the Easterly ride back home.

The problem here was that the variation in longitude/latitude on the climb was tiny compared with the overall ride. Once again, the repeating section is obvious to the human eye, but more difficult to unpick from its relatively low peak in the power spectrum. A final trick was required: to consider the amplitude of each frequency in decreasing order of power and look out for any higher frequency peaks that appear early on the list. This successfully identified the relevant part of the ride, while avoiding spurious observations for rides that did not include laps.

The ability of an algorithm to tag rides that include laps is helpful for classifying different types of sessions. Automatically marking the laps would allow riders and coaches to compare laps against each other over a training session or a race. A potential AI-powered robo-coach could say “Ah, I see you did 12 repeats in your session today… and apart from laps 9 and 10, you were getting progressively slower….”