Last time I tried to predict a race, I trained up a neural network on past race results, ahead of the World Championships in Harrogate. The model backed Sam Bennett, but it did not take account of the weather conditions, which turned out to be terrible. Fortunately the forecast looks good for tomorrow’s Milan Sanremo.
This time I have tried using a Random Forest, based on the results of the UCI races that took place in 2020 and so far in 2021. The model took account of each rider’s past results, team, height and weight, together with key statistics about each race, including date, distance, average speed and type of parcours.
One of the nice things about this type of model is that it is possible to see how the factors contribute to the overall predictions. The following waterfall chart explains why the model uncontroversially has Wout van Aert as the favourite.
The largest positive contribution comes from being Wout van Aert. This is because he has a lot of good results. His height and weight favour Milan Sanremo. He also has a strong positive coming from his team. This distance and race type make further positive contributions.
We can contrast this with the model’s prediction for Mathieu van der Poel, who is ranked 9th.
We see a positive personal contribution from being van der Poel, but having raced fewer UCI events, he has a less strong set of results than van Aert. According to the model the Alpecin Fenix team contribution is not as strong as Jumbo Visma, but the long distance of the race works in favour of the Dutchman. The day of year gives a small negative contribution, suggesting that his road results have been stronger later in the year, but this could be due to last year’s unusual timing of races.
Each of the other riders in the model’s top 10 is in with a shout.
It’s taken me all afternoon to set up this model, so this is just a short post.
Post race comment
Where was Jasper Stuyven?
Like Mads Pedersen in Harrogate back in 2019, Jasper Stuyven was this year’s surprise winner in Sanremo. So what had the model expected for him? Scrolling down the list of predictions, Stuyven was ranked 39th.
His individual rider prediction was negative, perhaps because he has not had many good results so far this year, though he did win Omloop Het Nieuwsblad last year and had several top 10 finishes. The model assessed that his greatest advantage came from the length of the race, suggesting that he tends to do well over greater distances.
The nice thing about this approach is that it identifies factors that are relevant to particular riders, in a quantitative fashion. This helps to overcome personal biases and the human tendency to overweight and project forward what has happened most recently.
It was shocking to see footage of Remco Evenepoel’s horrific crash in Il Lombardia. Reports indicate that he broke his pelvis after falling from a bridge into a ravine. This follows the injuries sustained by his Deceuninck-QuickStep team mate Fabio Jakobsen in the Tour of Poland.
The video above shows the repairs to my pelvis carried out by the specialist team at St George’s Hospital. My accident was less spectacular than Remco’s, I just hit a large pothole, while riding in the Kent lanes last March. It took the ambulance two and a half hours to arrive, as this was just at the beginning of the COVID-19 crisis. In fact, lock-down was announced on the evening of my crash. There was a lot of uncertainty about the virus back then, so it was a pretty scary time to be in hospital. Nevertheless I have immense respect and gratitude for the NHS staff who looked after me.
I was given crutches the day after the operation and returned home the day after that, with strict instructions to remain non-weight-bearing on the injured leg for six weeks and then only partial weight-bearing for the next six weeks. An NHS physiotherapist contacted me and regularly provided a progression of exercises. I set myself additional challenges, like doing extra press-ups.
After six weeks of doing no proper exercise, I had lost 4kg. The circumference of my left thigh was 5cm less than the right. However, following a review at the hospital, I was given permission to start gentle exercise on my static turbo trainer. I began by removing the left pedal and performing single leg drills, but after a couple of days it was easier to put my injured leg on the pedal as a passenger. This also gave the hip some mobility.
After a week on the turbo, I was up to one hour a day at about 160 watts. It took a long time to increase this above 200 watts. I watched a lot of old cycling films, without any particular urge to go on Zwift. I started riding outside in mid-June, 12 weeks post op. My Garmin pedals allowed me to monitor the left-right balance as well as average power. The following chart shows that 21 weeks after my accident, balance is hovering around 48:52 and five minute power is back over 300 watts.
The psychological aspect of rehabilitation has been very important. I have focussed on targets and deadlines, marking each little achievement as a milestone. I am now walking without a limp, though running is still off limits. I even went kitesurfing a couple of weeks ago (don’t tell my surgeon about that one). I have been busy learning Italian, composing music and programming in Python.
Since heading back out on the roads, I have been riding cautiously, as my hip will not regain full strength until next spring. I plan to enter a couple of time trials to rekindle a sense of competition, without the danger of riding in a peloton. Racing again next season remains a goal.
Probably the most important mental aspect has been to stay positive at all times and never to spend time feeling sorry for oneself. This has been difficult as, inevitably, there have been a couple of set-backs when progress has seemed to reverse. But on the whole, my recovery has been astounding and, like Chris Froome, I remain optimistic about regaining my peak.
Remco will be back on the road next season, with the potential to pick up some results later in the year.
As a growing number of people seek to educate themselves on coronavirus COVID-19, while confined to their homes, a better understanding can be gained by taking a look at how to model an epidemic.
Researchers have created highly complex models of the spread of infections. For example, BlueDot’s disease-tracking model, described in this podcast, monitors the Internet with AI language translators and evaluates the network effects of transmission based on air travel itineraries. However, a surprising amount of insight can be gained from a very simple approach called the SIR model.
The SIR model divides the population into three classes. The susceptible class (S) includes everyone who can catch the infection. In the case of a novel virus like corona, it seems that the entire global population was initially susceptible. The infected class (I) includes all those currently infected and able to transmit the virus to susceptible people. The removed class (R) includes everyone who has recovered from the virus or, unfortunately, died. In the model, these people no longer transmit the disease nor are they susceptible. The idea is that people move from the susceptible class to the infected class to the removed class.
Although there is much focus in the media on the exponential rise of the total number of cases of coronavirus, this figure includes recoveries and deaths. In one sense this is a huge underestimate, because the figures only include people who have taken a test and returned a positive result. As explained by Tomas Pueyo, many people do not display symptoms until around 5 days after infection and for over 90% these symptoms are mild, so there could be ten times more people infected than the official figures suggest. In another sense, the figures are a huge exaggeration, because people who have recovered are unlikely to be infectious, because their immune systems have fought off the virus.
The SIR model measures the number of infectious people. On the worldometers site these are called “active cases”. The critical insight of the SIR model, shown in the diagram above, is that the class of infected people grows if the daily number of new cases exceeds the number of closed cases.
Closed cases – removal rate
In a real epidemic, experts don’t really know how many people are infected, but they can keep track of those who have died or recovered. So it is best to start by considering the rate of transfer from infected to removed. After some digging around, it appears that the average duration of an infection is about 2 weeks. So in a steady state situation, about one person in 14, or 7% of those infected, would recover every day. This percentage would be a bit lower if the epidemic is spreading fast, because there would be more people who have recently acquired the virus, so let’s call it 6%. For the death rate, we need the number of deaths divided by the (unknown) total number of people infected. This is likely to be lower than the “case fatality rate” reported on worldometers because that divides the number of deaths only by the number of positive tests. The death rate is estimated to be 2-3%. If we add 3% to the 6% of those recovering, the removal rate (call it “a”) is estimated to be 9%.
In the absence of a cure or treatment for the virus, it is unlikely that the duration of infectiousness can be reduced. As long as hospitals are not overwhelmed, those who might otherwise have died may be saved. However, there is not much that governments or populations can do to speed up the daily rate of “closed cases”. The only levers available are those to reduce the number of “new cases” below 9%. This appeared to occur as a result of the draconian actions taken in China in the second half of February, but the sharp increase in new cases that became apparent over the weekend of 6/7 March spooked the financial markets.
New cases – infection rate
In the SIR model, the number of new infections depends on three factors: the number of infectious people, the number of susceptible people and something called the infection rate (r), which measures the probability that an infected person passes on the virus to a susceptible person either through direct contact or indirectly, for example, by contaminating a surface, such as a door handle.
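Written as equations, this description corresponds to the standard SIR system. This is only a sketch in the notation used here: S, I and R are the numbers of people in each class, r is the infection rate and a is the removal rate. Treating time as continuous is an assumption made for convenience.

$$
\begin{aligned}
\frac{dS}{dt} &= -r\,S\,I \\
\frac{dI}{dt} &= r\,S\,I - a\,I \\
\frac{dR}{dt} &= a\,I
\end{aligned}
$$

The term r S I is the daily flow of new cases and a I is the daily flow of closed cases, so the infected class grows only while r S I exceeds a I.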
Governments can attempt to reduce the number of new infections by controlling each of the three driving factors. Clearly, hand washing and avoiding physical contact can reduce the infection rate. Similarly infected people are encouraged to isolate themselves in order to reduce the proportion of susceptible people exposed to direct or indirect contact. Guidance to UK general practitioners is to advise patients suffering from mild symptoms to stay at home and call 111, while those with serious symptoms should call 999.
When will it end?
As more people become infected, the number of susceptible people naturally falls. Until there is a vaccine, there is nothing governments can do to speed up this decline. Eventually, when enough people have caught the current strain of the virus, so-called “herd immunity” will prevail. It is not necessary for everyone to have come into contact with the virus, rather it is sufficient for the number of susceptible people to be smaller than a critical value. Beyond this point infected people, on average, recover before they encounter a susceptible person. This is how the epidemic will finally start to die down.
When people refer to the transmission rate or reproduction rate of a virus, they mean the number of secondary infections produced by a primary infection across a susceptible population. This is equal to the number of susceptible people times the infection rate divided by the removal rate. This determines the threshold number of susceptible people below which the number of infections falls. The critical value is equal to the removal rate relative to the infection rate (a/r). When the number of susceptible people falls to this critical value, the number of infected people will reach a peak and subsequently decline. More susceptible people will still continue to be infected, but at a decreasing rate, until the infection dies out completely, by which time a significant part of the population will have been infected.
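The two quantities in this paragraph are simple enough to compute directly. The following is a minimal sketch, with the population scaled to 1 so that the number of susceptible people becomes a fraction. The function names are my own, and the infection rate of 0.15 per day is an illustrative assumption rather than a measured value.

```python
def reproduction_number(susceptible, infection_rate, removal_rate):
    """Secondary infections per primary infection: S * r / a."""
    return susceptible * infection_rate / removal_rate


def critical_susceptible(infection_rate, removal_rate):
    """Susceptible level (a / r) below which the number infected declines."""
    return removal_rate / infection_rate


# Illustrative values only, with the population scaled to 1.
a = 0.09  # removal rate per day, as estimated above
r = 0.15  # assumed infection rate per day in a fully susceptible population

r0 = reproduction_number(1.0, r, a)
threshold = critical_susceptible(r, a)

print(f"Reproduction number at the start: {r0:.2f}")
print(f"Infections peak once the susceptible fraction falls to {threshold:.2f}")
```

At the threshold itself the reproduction number is exactly 1, which is another way of saying that each infected person is, on average, replaced by just one more.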
This looks very scary indeed
Running the figures through the SIR model produces some extremely scary predictions. At the time of writing, new cases of infection were rising at a rate of about 15%. At this rate, the virus could spread to 2/3 of the population before it dies out. If three percent of those infected die, the virus would kill 2 percent of the population. Based on results so far these would be largely elderly people or those suffering from complications, so it is extremely important that they are protected from infection. If the virus continues to run out of control, the number of deaths could run into the millions before the epidemic ends.
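For anyone wanting to reproduce figures of this kind, here is a rough sketch of a day-by-day SIR simulation. It assumes a population scaled to 1, a removal rate of 9%, an infection rate of 15% per day in a fully susceptible population and a tiny initial seed of infection. The daily time step and the function name are my own simplifications.

```python
def simulate_sir(infection_rate, removal_rate, initial_infected=1e-6, days=2000):
    """Step a simple SIR model forward one day at a time.

    The population is scaled to 1, so all classes are fractions.
    Returns the peak infected fraction and the fraction ever infected.
    """
    susceptible = 1.0 - initial_infected
    infected = initial_infected
    peak = infected
    for _ in range(days):
        new_cases = infection_rate * susceptible * infected
        closed_cases = removal_rate * infected
        susceptible -= new_cases
        infected += new_cases - closed_cases
        peak = max(peak, infected)
    return peak, 1.0 - susceptible


peak, ever_infected = simulate_sir(infection_rate=0.15, removal_rate=0.09)
deaths = 0.03 * ever_infected  # assumes three percent of those infected die

print(f"Peak share infected at one time: {peak:.1%}")
print(f"Share of population ever infected: {ever_infected:.1%}")
print(f"Share of population dying: {deaths:.1%}")
```

Under these assumptions the share ever infected comes out at roughly two thirds, and lowering the infection rate towards the removal rate shrinks it dramatically.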
It is absolutely essential to reduce the infection rate and keep it low, particularly among the elderly and vulnerable groups
We should watch China carefully for new cases. If none arise, it suggests that a large proportion of the population gained immunity through infection, even though lower numbers of infections were reported. However, if the imposition of constraints temporarily reduced the infection rate, leaving a large susceptible pool still vulnerable, the epidemic could re-emerge once the constraints are relaxed.
What to do?
The imposition of government restrictions on travel and large gatherings is forcing people to rethink their options. Where possible, office workers, university students, schoolchildren and sportsmen may find themselves congregating online in virtual environments rather than in the messy and dangerous real world.
Among cyclists, this ought to be good news for Zwift and other online platforms. Zwift seems to be particularly well-positioned, due to its strong social aspect, which allows riders to meet and race against each other in virtual races. It also has the potential to allow world tour teams to compete in virtual races.
In fact the ability to meet friends and do things together virtually would have applications across all walks of life. Sports fans need something between the stadium and the television. Businesses need a medium that fills the gap between a physical office and a conference call. Schools and universities require better ways to ensure that students can learn, while classrooms and lecture theatres are closed. These innovations may turn out to be attractive long after the coronavirus scare has subsided.
While the prospect of going to the pub or a crowded nightclub loses its appeal, cycling offers the average person a very attractive alternative way to meet friends while avoiding close proximity with large groups of people. As the weather improves, the chance to enjoy some exercise in the fresh air looks ever more enticing.
Since my blog about Strava Fitness and Freshness has been very popular, I thought it would be interesting to demonstrate a simple model that can help you use these metrics to improve your cycling performance.
As a quick reminder, Strava’s Fitness measure is an exponentially weighted average of your daily Training Load, over the last six weeks or so. Assuming you are using a power meter, it is important to use a correctly calibrated estimate of your Functional Threshold Power (FTP) to obtain an accurate value for the Training Load of each ride. This ensures that a maximal-effort one hour ride gives a value of 100. The exponential weighting means that the benefit of a training ride decays over time, so a hard ride last week has less impact on today’s Fitness than a hard ride yesterday. In fact, if you do nothing, Fitness decays at a rate of about 2.5% per day.
Although Fitness is a time-weighted average, a simple rule of thumb is that your Fitness Score equates to your average daily Training Load over the last month or so. For example, a Fitness level of 50 is consistent with an average daily Training Load (including rest days) of 50. It may be easier to think of this in terms of a total Training Load of 350 per week, which might include a longer ride of 150, a medium ride of 100 and a couple of shorter rides with a Training Load of 50.
How to get fitter
The way to get fitter is to increase your Training Load. This can be achieved by riding at a higher intensity, increasing the duration of rides or including extra rides. But this needs to be done in a structured way in order to be effective. Periodisation is an approach that has been tried and tested over the years. A four-week cycle would typically include three weekly blocks of higher training load, followed by an easier week of recovery. Strava’s Fitness score provides a measure of your progress.
Modelling Fitness and Fatigue
An exponentially weighted moving average is very easy to model, because it evolves like a Markov Process, having the following property, relating to yesterday’s value and today’s Training Load.
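$$F_t = \lambda F_{t-1} + (1 - \lambda)\,\mathrm{TL}_t$$
where $F_t$ is Fitness or Fatigue on day $t$, $\mathrm{TL}_t$ is the Training Load on day $t$, and $\lambda = e^{-1/42}$ for Fitness or $\lambda = e^{-1/7}$ for Fatigue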
This is why your Fitness falls by about 2.4% and your Fatigue eases by about 13.3% after a rest day. The formula makes it straightforward to predict the impact of a training plan stretching out into the future. It is also possible to determine what Training Load is required to achieve a target level of Fitness improvement over a specific time period.
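Here is a minimal Python sketch of that daily update. It assumes the usual exponentially weighted form, in which today’s score is a decay factor times yesterday’s score plus one minus that factor times today’s Training Load. The time constants of 42 days for Fitness and 7 days for Fatigue are assumptions chosen to be consistent with the rest-day figures quoted above, and the function names are my own.

```python
import math

FITNESS_DAYS = 42  # assumed time constant, about six weeks
FATIGUE_DAYS = 7   # assumed time constant, about one week


def decay(time_constant_days):
    """Daily decay factor: exp(-1 / time constant)."""
    return math.exp(-1.0 / time_constant_days)


def update(previous, training_load, time_constant_days):
    """Today's score from yesterday's score and today's Training Load."""
    lam = decay(time_constant_days)
    return lam * previous + (1.0 - lam) * training_load


# A rest day (Training Load of zero), starting from a score of 50.
fitness_after_rest = update(50.0, 0.0, FITNESS_DAYS)
fatigue_after_rest = update(50.0, 0.0, FATIGUE_DAYS)

print(f"Fitness falls by {100 * (1 - fitness_after_rest / 50.0):.1f}%")
print(f"Fatigue eases by {100 * (1 - fatigue_after_rest / 50.0):.1f}%")
```

Because each day depends only on the previous day’s score and the new Training Load, a whole training plan can be projected forward by calling update once per day.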
Ramping up your Fitness
The change in Fitness over the next seven days is called a weekly “ramp”. Aiming for a weekly ramp of 5 would be very ambitious. It turns out that you would need to increase your daily Training Load by 33. That is a substantial extra Training Load of 231 over the next week, particularly because Training Load automatically takes account of a rider’s FTP.
Interestingly, this increase in Training Load is the same, regardless of your starting Fitness. However, stepping up an average Training Load from 30 to 63 per day would require a doubling of work done over the next week, whereas for someone starting at 60, moving up to 93 per day would require a 54% increase in effort for the week.
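The required increase can be worked out in closed form. This sketch assumes a 42-day time constant for Fitness, a constant daily Training Load through the week, and a rider whose previous daily load matched their Fitness score. The function name is hypothetical.

```python
import math

FITNESS_DAYS = 42  # assumed Fitness time constant


def extra_daily_load_for_ramp(ramp, days=7):
    """Constant rise in daily Training Load, above current Fitness,
    needed to lift Fitness by `ramp` over `days` days."""
    lam = math.exp(-1.0 / FITNESS_DAYS)
    return ramp / (1.0 - lam ** days)


extra = extra_daily_load_for_ramp(5)
print(f"Extra daily Training Load: {extra:.0f}")
print(f"Extra weekly Training Load: {7 * round(extra)}")

for fitness in (30, 60):
    print(f"Starting Fitness {fitness}: {100 * extra / fitness:.0f}% more work")
```

The starting Fitness drops out of the formula, which is why the extra load is the same for every rider, while the percentage increase in work is much larger for the rider starting from a lower base.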
In both cases, a cyclist would typically require two additional hard training rides, resulting in an accumulation of fatigue, which is picked up by Strava’s Fatigue score. This is a much shorter term moving average of your recent Training Load, over the last week or so. If we assume that you start with a Fatigue score equal to your Fitness score, an increase of 33 in daily Training Load would cause your Fatigue to rise by 21 over the week. If you managed to sustain this over the week, your Form (Fitness minus Fatigue) would fall from zero to -16. Here’s a summary of all the numbers mentioned so far.
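These figures can be checked by stepping both scores forward a day at a time. The sketch below assumes time constants of 42 days for Fitness and 7 days for Fatigue, the same Training Load on each of the seven days, and equal starting scores. The starting value of 50 is arbitrary.

```python
import math


def week_of_constant_load(start, daily_load, time_constant_days, days=7):
    """Score after `days` days of the same daily Training Load."""
    lam = math.exp(-1.0 / time_constant_days)
    value = start
    for _ in range(days):
        value = lam * value + (1.0 - lam) * daily_load
    return value


start = 50.0             # arbitrary starting Fitness and Fatigue
daily_load = start + 33  # the extra daily Training Load discussed above

fitness = week_of_constant_load(start, daily_load, 42)
fatigue = week_of_constant_load(start, daily_load, 7)
form = fitness - fatigue

print(f"Fitness change: {fitness - start:+.0f}")
print(f"Fatigue change: {fatigue - start:+.0f}")
print(f"Form at end of week: {form:+.0f}")
```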
Whilst it might be possible to do this for a week, the regime would be very hard to sustain over a three-week block, particularly because you would be going into the second week with significant accumulated fatigue. Training sessions and race performance tend to be compromised when Form drops below -20. Furthermore, if you have increased your Fitness by 5 over a week, you will need to increase Training Load by another 231 for the following week to continue the same upward trajectory, then increase again for the third week. So we conclude that a weekly ramp of 5 is not sustainable over three weeks. Something of the order of 2 or 3 may be more reasonable.
A steady increase in Fitness
Consider a rider with a Fitness level of 30, who would have a weekly Training Load of around 210 (7 times 30). This might be five weekly commutes and a longer ride on the weekend. A periodised monthly plan could include a ramp of 2, steadily increasing Training Load for three weeks followed by a recovery week of -1, as follows.
This gives a net increase in Fitness of 5 over the month. Fatigue has also risen by 5, but since the rider is fitter, Form ends the month at zero, ready to start the next block of training.
To simplify the calculations, I assumed the same Training Load every day in each week. This is unrealistic in practice, because all athletes need a rest day and training needs to mix up the duration and intensity of individual rides. The fine tuning of weekly rides is a subject for another blog.
A tougher training block
A rider engaging in a higher level of training, with a Fitness score of 60, may be able to manage weekly ramps of 3, before the recovery week. The following Training Plan would raise Fitness to 67, with sufficient recovery to bring Form back to positive at the end of the month.
A general plan
The interesting thing about this analysis is that the outcomes of the plans are independent of a rider’s starting Fitness. This is a consequence of the Markov property. So if we describe the ambitious plan as [3,3,3,-2], a rider will see a Fitness improvement of 7, from whatever initial value prevailed: starting at 30, Fitness would go to 37, while the rider starting at 60 would rise to 67.
Similarly, if Form begins at zero, i.e. the starting values of Fitness and Fatigue are equal, then the [3,3,3,-2] plan will always result in a net change of 6 in Fatigue over the four weeks.
In the same way, (assuming initial Form of zero) the moderate plan of [2,2,2,-1] would give any rider a net increase of Fitness and Fatigue of 5.
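To experiment with other plans, something like the following sketch could be used. It assumes time constants of 42 and 7 days, the same Training Load every day within a week, and initial Form of zero unless a starting Fatigue is supplied. The function run_plan is my own illustration, not part of Strava.

```python
import math

LAM_FITNESS = math.exp(-1.0 / 42)  # assumed daily decay factor for Fitness
LAM_FATIGUE = math.exp(-1.0 / 7)   # assumed daily decay factor for Fatigue


def run_plan(weekly_ramps, start_fitness, start_fatigue=None):
    """Apply a list of weekly Fitness ramps, one constant daily load per week.

    Returns the final (fitness, fatigue, form).
    """
    fitness = start_fitness
    fatigue = start_fitness if start_fatigue is None else start_fatigue
    for ramp in weekly_ramps:
        # Constant daily load that changes Fitness by `ramp` over seven days.
        daily_load = fitness + ramp / (1.0 - LAM_FITNESS ** 7)
        for _ in range(7):
            fitness = LAM_FITNESS * fitness + (1.0 - LAM_FITNESS) * daily_load
            fatigue = LAM_FATIGUE * fatigue + (1.0 - LAM_FATIGUE) * daily_load
    return fitness, fatigue, fitness - fatigue


for plan in ([2, 2, 2, -1], [3, 3, 3, -2]):
    for start in (30, 60):
        fitness, fatigue, form = run_plan(plan, start)
        print(plan, "from", start,
              "Fitness change", round(fitness - start, 1),
              "Fatigue change", round(fatigue - start, 1),
              "Form", round(form, 1))
```

Running each plan from both starting points shows the same changes in Fitness and Fatigue, because every step of the calculation is linear in the starting value.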
A couple of years ago I built a model to evaluate how Froome and Dumoulin would have matched up, if they had not avoided racing against each other over the 2017 season. As we approach the 2019 World Championships Road Race in Yorkshire, I have adopted a more sophisticated approach to try to predict the winner of the men’s race. The smart money could be going on Sam Bennett.
With only two races outstanding, most of this year’s UCI world tour results are available. I decided to broaden the data set with 2.HC classification European Tour races, such as the OVO Energy Tour of Britain. In order to help with prediction, I included each rider’s weight and height, as well as some meta-data about each race, such as date, distance, average speed, parcours and type (stage, one-day, GC, etc.).
The key question was what exactly are you trying to predict? The UCI allocates points for race results, using a non-linear scale. For example, Mathieu Van Der Poel was awarded 500 points for winning Amstel Gold, while Simon Clarke won 400 for coming second and Jakob Fuglsang picked up 325 for third place, continuing down to 3 points for coming 60th. I created a target variable called PosX, defined as a negative exponential of the rider’s position in any race, equating to 1.000 for a win, 0.834 for second, 0.695 for third, decaying down to 0.032 for 20th. This has a similar profile to the points scheme, emphasising the top positions, and handles races with different numbers of riders.
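As an illustration, a target of this shape can be written in a couple of lines of Python. The decay scale of 5.5 positions is my assumption, chosen because it reproduces the values quoted above, and the function name is hypothetical.

```python
import math

DECAY_POSITIONS = 5.5  # assumed scale, chosen to match the quoted values


def pos_x(position):
    """Negative exponential of finishing position, equal to 1.0 for a win."""
    return math.exp(-(position - 1) / DECAY_POSITIONS)


for position in (1, 2, 3, 20):
    print(position, f"{pos_x(position):.3f}")
```

Because the value depends only on finishing position, it can be applied to any race, whatever the size of the field.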
A random forest would be a typical choice of model for this kind of data set, which included a mixture of continuous and categorical variables. However, I opted for a neural network, using embeddings to encode the categorical variables, with two hidden layers of 200 and 100 activations. This was very straightforward using the fast.ai library. Training was completed in a handful of seconds on my MacBook Pro, without needing a GPU.
After some experimentation on a subset of the data, it was clear that the model was coming up with good predictions on the validation set and the out-of-sample test set. With a bit more coding, I set up a procedure to load a start list and the meta-data for a future race, in order to predict the result.
With the final start list for the World Championships Road Race looking reasonably complete, I was able to generate the predicted top 10. The parcours obviously has an important bearing on who wins a race. With around 3600m of climbing, the course was clearly hilly, though not mountainous. Although the finish was slightly uphill, it was not ridiculously steep, so I decided to classify the parcours as rolling with a flat finish.
Mathieu Van Der Poel
Edvald Boasson Hagen
Greg Van Avermaet
It was encouraging to see that the model produced a highly credible list of potential top 10 riders, agreeing with the bookies in rating Mathieu Van Der Poel as the most likely winner. Sagan was ranked slightly below Kristoff and Bennett, who are seen as outsiders by the pundits. The popular choice of Philippe Gilbert did not appear in my top 10 and Alaphilippe was only 9th, in spite of their recent strong performances in the Vuelta and the Tour, respectively. Riders in positions 5 to 10 would all be expected to perform well in the cycling classics, which tend to be long and arduous, like the Yorkshire course.
For me, 25/1 odds on Sam Bennett are attractive. He has a strong group of teammates, in Dan Martin, Eddie Dunbar, Conor Dunne, Ryan Mullen and Rory Townsend, who will work hard to keep him with the lead group in the hillier early part of the race. He will then face an extremely strong Belgian team that is likely to play the same game that Deceuninck-QuickStep successfully pulled off in stage 17 of the Vuelta, won by Gilbert. But Bennett was born in Belgium and he was clearly the best sprinter out in Spain. He should be able to handle the rises near the finish.
A similar case can be made for Kristoff, while Matthews and Van Avermaet both had recent wins in Canada. Nevertheless it is hard to look past the three-times winner Peter Sagan, though if Van Der Poel launches one of his explosive finishes, there is no one to stop him pulling on the rainbow jersey.
After the race, I checked the predicted position of the eventual winner, Mads Pedersen. He was expected to come 74th. Clearly the bad weather played a role in the result, favouring the larger riders, who were able to keep warmer. The Dane clearly proved to be the strongest rider on the day.
It is easy to assume that successful professional cyclists are all skinny little guys, but if you look at the data, it turns out that they have an average height of 1.80m and an average weight of around 68kg. If we are to believe the figures posted on ProCyclingStats, hardly any professional cyclists would be considered underweight. In fact, they would struggle to perform at the required level if they did not maintain a healthy weight.
Taller than you might think
According to a study published in 2013 and updated in 2019, the global average height of adult males born in 1996 was 1.71m, but there is considerable regional variation. The vast majority of professional cyclists come from Europe, North America, Russia and the Antipodes where men tend to be taller than those from Asia, Africa and South America. For the 41 Colombians averaging 1.73m, there are 85 Dutch riders with a mean height of 1.84m. See chart below.
Furthermore, road cycling involves a range of disciplines, including sprinting and time trialling, where size and raw power provide an advantage. The peloton includes larger sprinters alongside smaller climbers.
Not as light as expected
While 68kg for a 1.80m male is certainly slim, it equates to a body mass index of 21 (BMI = weight / (height)²), which is towards the middle of the recommended healthy range. BMI is not a sophisticated measure, as it does not distinguish between fat and muscle. Since muscle is more dense than fat and cyclists tend to have a higher percentage of lean body mass, they will look slimmer than a lay person of equivalent height and weight. Nevertheless doctors use BMI as a guide and become concerned when it falls below 18.5.
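The calculation is easy to reproduce. This short sketch uses the average figures quoted above and the 18.5 threshold, with function and variable names of my own choosing.

```python
UNDERWEIGHT_BMI = 18.5  # level below which doctors become concerned


def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


average_pro = bmi(68, 1.80)
underweight_below_kg = UNDERWEIGHT_BMI * 1.80 ** 2

print(f"BMI of a 68kg, 1.80m rider: {average_pro:.1f}")
print(f"A 1.80m rider drops under 18.5 below {underweight_below_kg:.1f}kg")
```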
The chart includes over 1,100 professional cyclists, but very few pros would be considered underweight. The majority of riders have a BMI of between 20 and 22. Although Colombian riders (red) tend to be smaller, specialising in climbing, their average BMI of 20.8 is not that different from larger Dutch riders (orange) with a mean BMI of 21.2. The taller Colombians include the sprinters Hodeg, Gaviria and Molano.
Types of rider
This chart shows the names of a sample of top riders. All-out sprinters tend to have a BMI of around 24, even if they are small like Caleb Ewan. Sprints at the end of more rolling courses are likely to be won by riders with a BMI of 22, such as Greipel, van Avermaet, Sagan, Gaviria, Groenewegen, Bennett and Kwiatkowski. Time trial specialists like Dennis and Thomas have similar physiques, though Dumoulin and Froome are significantly lighter and remarkably similar to each other.
GC contenders Roglic, Kruijswijk and Gorka Izagirre are near the centre of the distribution with a BMI around 21, close to Viviani, who is unusually light for a sprinter. Pinot, Valverde, Dan Martin, the Yates brothers and Pozzovivo appear to be light for their heights. Interestingly climbers such as Quintana, Uran, Alaphilippe, Carapaz and Richie Porte all have a BMI of around 21, whereas Lopez is a bit heavier.
If the figures reported on ProCyclingStats are accurate, George Bennett and Emanuel Buchmann are significantly underweight. Weighing 58kg for a height of 1.80m does not seem to be conducive to strong performance, unless they are extraordinary physical specimens.
Professional cyclists are lean, but they would not be able to achieve the performance required if they were underweight. It is possible that the weights of individual riders might vary over time by a couple of kilos, moving them a small amount vertically on the chart, but scientific approaches are increasingly employed by expert nutritionists to avoid significant weight loss over longer stage races. The Jumbo Foodcoach app was developed alongside the Jumbo-Visma team and, working with Team Sky, James Morton strove to ensure that athletes fuel for the work required. Excessive weight loss can lead to a range of problems for health and performance.
On the eve of the Tour de France, the pundits have made their predictions, but when the race is over, they will be long forgotten. One way of checking your own forecasts is to take a look at the odds offered on the betting markets. These are interesting, because they reflect the actions of people who have actually put money behind their views. In an efficient and liquid market, the latest prices ought to reflect all information available. This blog takes a look at the current odds, without wishing to encourage gambling in any way.
The website oddchecker.com collates the odds from a number of bookmakers across a large range of bets. It is helpful to convert the odds into predicted probabilities. Focussing on the overall winner, Egan Bernal is the favourite at 5/2 (equating to a 29% probability of taking the yellow jersey), followed by Geraint Thomas at 7/2 (22%) and Jakob Fuglsang at 6/1 (14%). This gives a 51% chance of the winner being one of the two Team Ineos riders. The three leading contenders are some distance ahead of Adam Yates, Richie Porte, Thibaut Pinot and Nairo Quintana. Less fancied riders include Romain Bardet, Steven Kruijswijk, Rigoberto Uran, Mikel Landa, Enric Mas and Vincenzo Nibali. Anyone else is seen as an outsider.
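The conversion from fractional odds to an implied probability is one divided by one plus the odds. Here is a small sketch using the three prices quoted above. It ignores the bookmakers’ margin, which is a simplifying assumption, and the function name is my own.

```python
from fractions import Fraction


def implied_probability(fractional_odds):
    """Convert fractional odds such as '5/2' into an implied probability."""
    odds = Fraction(fractional_odds)
    return float(1 / (1 + odds))


odds = {
    "Egan Bernal": "5/2",
    "Geraint Thomas": "7/2",
    "Jakob Fuglsang": "6/1",
}
probabilities = {rider: implied_probability(price) for rider, price in odds.items()}

for rider, probability in probabilities.items():
    print(f"{rider}: {probability:.0%}")

ineos = probabilities["Egan Bernal"] + probabilities["Geraint Thomas"]
print(f"Either Team Ineos rider: {ineos:.0%}")
```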
Ups and downs
The odds change over time, as the markets evaluate the performance and changing fortunes of the riders. The following chart shows the fluctuations in the average daily implied winning chances of the three current favourites since the beginning of the year, according to betfair.com.
The implied probability that Geraint Thomas would repeat last year’s win has hovered between 20% and 30%, spiking up a bit during the Tour de Romandie. Unfortunately, Chris Froome’s odds are no longer available, though he was most likely the favourite earlier this year. However, his crash on 11 June instantaneously improved the odds for other riders, particularly Thomas and Bernal, though expectations for the Welshman declined after he crashed out of the Tour de Suisse on 18 June.
The betting on Fuglsang spiked up sharply during the Tirreno Adriatico, where he won a stage and came 3rd on GC, and the Tour of the Basque Country, where he finished strongly. Apparently, his three podium results in the Ardennes Classics had no effect on his chances of a yellow jersey, whereas his victory in the Critérium du Dauphiné had a significant positive impact.
Egan Bernal appeared from the shadows. At the beginning of the year, he was seen as a third string in Team Ineos. His victory in Paris Nice hardly registered on his odds for the Tour. But since Froome’s crash and Thomas’s departure from the Tour de Suisse, he has become the bookies’ favourite.
With 65% of the money on the three main contenders, there are some pretty good odds available on other riders. A couple of crashes, an off day or a bit of bad luck could turn the race on its head. Clearly the Ineos and Astana teams are capable of protecting their GC contenders, but so too are Movistar, EF Education First, Mitchelton-Scott, Groupama-FDJ, Bahrain Merida and others.
An earlier analysis suggested that apart from choosing a warm day and avoiding traffic, the optimal wind direction for a conventional anticlockwise lap was a moderate easterly, offering a tailwind up Sawyers Hill. It does not immediately follow that a westerly wind would be best for a clockwise lap, because trees, buildings and the profile of the course affect the extent to which the wind helps or hinders a rider.
Currently there are over 280,000 clockwise laps recorded by nearly 35,000 riders, compared with more than a million anticlockwise laps by almost 55,000 riders. As before, I downloaded the top 1,000 entries from the leaderboard and then looked up the wind conditions when each time was set on a clockwise lap.
In the previous analysis, I took account of the prevailing wind direction in London. If wind had no impact, we would expect the distribution of wind directions for leaderboard entries to match the average distribution of winds over the year. I defined the wind direction advantage to be the difference between these two distributions and checked if it was statistically significant. These are the results for the clockwise lap.
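A minimal sketch of this comparison follows. The counts and prevailing shares are made up for illustration, four compass sectors stand in for a finer breakdown, and the Pearson chi-squared statistic is an assumption about how significance might be checked rather than a description of the test actually used.

```python
# Hypothetical data: number of leaderboard times set with the wind from each
# sector, and the long-run share of wind from each sector over the year.
leaderboard_counts = {"N": 90, "E": 110, "S": 250, "W": 550}
prevailing_share = {"N": 0.15, "E": 0.15, "S": 0.30, "W": 0.40}

total = sum(leaderboard_counts.values())

# Advantage = observed share of leaderboard entries minus the share that
# would be expected if wind direction made no difference.
advantage = {
    sector: leaderboard_counts[sector] / total - prevailing_share[sector]
    for sector in leaderboard_counts
}

# Pearson chi-squared statistic of observed counts against expected counts.
chi_squared = sum(
    (leaderboard_counts[sector] - total * prevailing_share[sector]) ** 2
    / (total * prevailing_share[sector])
    for sector in leaderboard_counts
)
degrees_of_freedom = len(leaderboard_counts) - 1

for sector, value in advantage.items():
    print(f"{sector}: {value:+.1%}")
print(f"chi-squared = {chi_squared:.2f} on {degrees_of_freedom} degrees of freedom")
```

A p-value would come from comparing the statistic with a chi-squared distribution with the stated degrees of freedom, for instance using a statistics library.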
The wind direction advantage was significant (at p=1.3%). Two directions stand out. A westerly provides a tailwind on the more exposed section of the park between Richmond Gate and Roehampton, which seems to be a help, even though it is largely downhill. A wind blowing from the NNW would be beneficial between Roehampton and Robin Hood Gate, but apparently does not provide much hindrance on the drag from Kingston Gate up to Richmond, perhaps because this section of the park is more sheltered. The prevailing southwesterly wind was generally unfavourable to riders setting PBs on a clockwise lap.
The excellent mywindsock web site provides very good analysis for avid wind dopers. This confirms that the wind was blowing predominantly from the west for the top ten riders on the leaderboard, including the KOM, though the wind strength was generally light.
The interesting thing about this exercise is that it demonstrates a convergence between our online and our offline lives, as increasing volumes of data are uploaded from mobile sensors. A detailed analysis of each section of the million laps riders have recorded for Richmond Park could reveal many subtleties about how the wind flows across the terrain, depending on strength and direction. This could be extended across the country or globally, potentially identifying local areas where funnelling effects might make a wind turbine economically viable.
When you upload a ride, Strava draws a map using the longitude and latitude coordinates recorded by your GPS device. This article explores ways in which these numbers, along with other metrics, can be used to create interesting images that might have some artistic merit.
The idea was motivated by the huge advances made in the field of Deep Learning, particularly applications for image recognition. However, since datasets come in all shapes and forms, researchers have explored ways of converting different types of data into images. In a paper published in 2015, the authors achieved success in identifying standard time series by converting them into images.
GPS bike computers typically record snapshots of information every second. What kind of images could these time series generate? It turns out that there are several ways to convert a time series into an image.
Creating a spectrogram is a standard approach from signal processing that is particularly useful for analysing acoustic files. The spectrogram is a heat map that shows how the underlying frequencies contributing to the signal change over time. Technically, it is derived by calculating the discrete Fourier transform of a window that slides across the time series. I applied this to my regular Saturday morning club ride of four laps around Richmond Park. The image changes a bit once the ride gets going after about 1200 seconds (20 minutes), but, frankly, the result was not particularly illuminating. There is no obvious reason to consider cycling power data as a superposition of frequencies.
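The sliding-window calculation can be illustrated with a deliberately simple sketch using only the standard library. The function name, window size and test signal are hypothetical. A practical version would use a fast Fourier transform from a numerical library and would usually taper each window.

```python
import cmath
import math


def spectrogram(signal, window_size, step):
    """Return one magnitude spectrum per window position."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, step):
        window = signal[start:start + window_size]
        spectrum = []
        # For real-valued input, only frequencies up to Nyquist are distinct.
        for k in range(window_size // 2 + 1):
            coefficient = sum(
                x * cmath.exp(-2j * math.pi * k * n / window_size)
                for n, x in enumerate(window)
            )
            spectrum.append(abs(coefficient))
        frames.append(spectrum)
    return frames


# Hypothetical signal: a sine wave completing 4 cycles in every 32 samples
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(128)]
frames = spectrogram(signal, window_size=32, step=16)

# Each frame becomes one column of the heat map; here the energy sits in bin 4.
peak_bin = frames[0].index(max(frames[0]))
print(len(frames), "frames, peak frequency bin:", peak_bin)
```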
Ah! Now we are getting somewhere
The authors of the referenced paper took a different approach to produce things called Gramian Angular Summation Field (GASF), Gramian Angular Difference Field (GADF), and Markov Transition Field (MTF). Read the paper if you want to know the details. I created these and something called a Recurrence Plot. All of these methods generate a matrix, by combining every element in the time series with every other element. The underlying observations occurring at times i and j determine the colour of the pixel at position (i, j). Images are symmetric along the lower-left to upper-right diagonal, apart from GADF, which is antisymmetric.
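As a rough sketch of how such matrices arise, the Gramian fields rescale the series to the range -1 to +1, treat each value as the cosine of an angle and then combine the angles for every pair of time steps. The function names, the short example series and the thresholded form of the recurrence plot are assumptions for illustration. The MTF is omitted.

```python
import math


def rescale(series):
    """Linearly rescale a series to [-1, 1] (assumes it is not constant)."""
    low, high = min(series), max(series)
    return [2 * (x - low) / (high - low) - 1 for x in series]


def gramian_angular_fields(series):
    """Return the GASF and GADF matrices as nested lists."""
    # Clamping guards against rounding nudging a value just outside [-1, 1].
    angles = [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in angles] for a in angles]
    gadf = [[math.sin(a - b) for b in angles] for a in angles]
    return gasf, gadf


def recurrence_plot(series, threshold):
    """Mark pixel (i, j) when the values at times i and j are close."""
    return [[1 if abs(a - b) <= threshold else 0 for b in series] for a in series]


series = [0.0, 1.0, 3.0, 2.0, 4.0]  # hypothetical observations
gasf, gadf = gramian_angular_fields(series)
recurrence = recurrence_plot(series, threshold=1.0)
```

Swapping the two angles leaves cos(a + b) unchanged but flips the sign of sin(a - b), which is why GASF is symmetric and GADF is antisymmetric.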
Let’s see how they look for four laps of Richmond Park. We have six time series, with corresponding sets of images below. The segmentation of the images is due to periodicity of the data. This is particularly clear in the geographic data (longitude, latitude and altitude). The higher intensity of the main part of the ride is most obvious in the heart rate data. The MTF plots are quite interesting. Scroll down through the images to the next section.
From cycle ride to art
It is one thing to create an image of each item, but how can we combine these to summarise a ride in a single image? I considered two methods of combining time series into a single image: a) create a new image where the vertical and horizontal axes represent different series and b) create a new image by simply adding the corresponding values from two underlying images.
One problem is that some cyclists don’t have gadgets like heart rate monitors and power meters, so I initially restricted myself to just the longitude, latitude and altitude data. Nevertheless, as noted in an earlier blog, it is possible to work out speed, because the time interval is one second between each reading. Furthermore, one can estimate power, from the speed and changes in elevation.
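A minimal sketch of both steps is shown below. Speed comes from the great-circle distance between consecutive GPS fixes. The power estimate is a simple steady-state model, and its rider mass, drag area, rolling resistance and air density are illustrative assumptions, not values taken from the earlier blog. It ignores acceleration, wind and drivetrain losses.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speeds_m_per_s(latitudes, longitudes, interval_s=1.0):
    """Speed between consecutive fixes recorded at a fixed time interval."""
    return [
        haversine_m(latitudes[i], longitudes[i], latitudes[i + 1], longitudes[i + 1]) / interval_s
        for i in range(len(latitudes) - 1)
    ]


def estimated_power_w(speed, climb_rate, mass_kg=80.0, cda=0.3, crr=0.005, air_density=1.2):
    """Rolling + climbing + aerodynamic power in watts (assumed parameters)."""
    gravity = 9.81
    rolling = crr * mass_kg * gravity * speed
    climbing = mass_kg * gravity * climb_rate  # climb_rate in metres per second
    drag = 0.5 * air_density * cda * speed ** 3
    return rolling + climbing + drag


# Hypothetical fixes one second apart, heading due north
latitudes = [51.4400, 51.4401, 51.4402]
longitudes = [-0.2700, -0.2700, -0.2700]
speeds = speeds_m_per_s(latitudes, longitudes)
flat_power = estimated_power_w(speed=10.0, climb_rate=0.0)
```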
Another problem is that rides differ in length. For this I split the ride into, say, 128 intervals and took the last observation in each interval. So for a 3 hour ride, I’d be sampling about once every 84 seconds.
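The resampling step might look like the following sketch, where the function name is hypothetical and the series is split into nearly equal chunks using integer arithmetic.

```python
def downsample_last(series, n_intervals=128):
    """Split a series into n_intervals chunks and keep the last value of each."""
    length = len(series)
    if length < n_intervals:
        raise ValueError("series is shorter than the number of intervals")
    # Chunk k ends (exclusive) at index (k + 1) * length // n_intervals.
    return [series[(k + 1) * length // n_intervals - 1] for k in range(n_intervals)]


# A 3 hour ride recorded once per second, using the index as a stand-in value
ride = list(range(10_800))
sampled = downsample_last(ride)
print(len(sampled), sampled[:3], sampled[-1])
```

Whatever the length of the ride, the result always has 128 points, so every ride produces an image of the same size.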
The chart at the top of this blog was created by first normalising each series to a standard range (-1, +1). Method a) was used to create two images: longitude was added to latitude and altitude was multiplied by speed. These were added using method b). Using these measures will produce pretty much the same chart each time the ride is done. In contrast, an image that is totally unique to the ride can be produced using data relating to the individual rider. The image below uses the same recipe to combine speed, heart rate, power and cadence. If this had been a particularly special ride, the image would be a nice personal memento.
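The recipe can be sketched as follows, using tiny made-up series in place of real ride data. The helper names are hypothetical, and the choice of which series runs along which axis is an assumption, since the text does not specify it.

```python
from operator import add, mul


def normalise(series):
    """Rescale a series linearly to the range -1 to +1."""
    low, high = min(series), max(series)
    return [2 * (x - low) / (high - low) - 1 for x in series]


def combine_axes(vertical, horizontal, operation):
    """Method a): pixel (i, j) combines vertical[i] with horizontal[j]."""
    return [[operation(v, h) for h in horizontal] for v in vertical]


def add_images(first, second):
    """Method b): add two images of the same size pixel by pixel."""
    return [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(first, second)]


# Hypothetical, already resampled series
longitude = normalise([0.0, 1.0, 2.0])
latitude = normalise([2.0, 1.0, 0.0])
altitude = normalise([10.0, 20.0, 30.0])
speed = normalise([5.0, 5.0, 10.0])

position_image = combine_axes(longitude, latitude, add)
effort_image = combine_axes(altitude, speed, mul)
image = add_images(position_image, effort_image)
```

The nested list can then be passed to any plotting library that renders a matrix as a heat map.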
For anyone interested in the underlying code, I have posted a Jupyter notebook here.
This blog provides a technical explanation of the analysis underlying the medical paper about male cyclists described previously. Part of the skill of a data scientist is to choose from the arsenal of machine learning techniques the tools that are appropriate for the problem at hand. In the study of male cyclists, I was asked to identify significant features of a medical data set. This article describes how the problem was tackled.
Fifty road racing cyclists, riding at the equivalent of British Cycling 2nd category or above, were asked to complete a questionnaire, provide a blood sample and undergo a DXA scan – a low intensity X-ray used to measure bone density and body composition. I used Python to load and clean up the data, so that all the information could be represented in Pandas DataFrames. As expected, this time-consuming but essential step required careful attention and cross-checking, combined with the perseverance that is always necessary to be sure of working with a clean data set.
The questionnaire included numerical data and text relating to cycling performance, training, nutrition and medical history. As a result of interviewing each cyclist, a specialist sports endocrinologist identified a number of individuals who were at risk of low energy availability (EA), due to a mismatch between nutrition and training load.
Bone density was measured throughout the body, but the key site of interest was the lumbar spine (L1-L4). Since bone density varies with age and between males and females, it was logical to use the male, age-adjusted Z-score, expressing values in standard deviations above or below the comparable population mean.
The measured blood markers were provided in the relevant units, alongside the normal range. Since the normal range is defined to cover 95% of the population, I assumed that the population could be modelled by a Gaussian distribution in order to convert each blood result into a Z-score. This aligned the scale of the blood results with the bone density measures.
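Under the Gaussian assumption, the normal range spans roughly 1.96 standard deviations either side of the mean, so the conversion can be sketched as below. The function name and the example range are hypothetical, and the sketch assumes the range is symmetric about the population mean.

```python
from statistics import NormalDist

# 97.5th percentile of the standard normal distribution, roughly 1.96
Z_95 = NormalDist().inv_cdf(0.975)


def blood_z_score(value, lower, upper):
    """Convert a blood result into a Z-score using its 95% normal range."""
    mean = (lower + upper) / 2
    standard_deviation = (upper - lower) / (2 * Z_95)
    return (value - mean) / standard_deviation


# Hypothetical marker with a normal range of 10 to 30 units
print(round(blood_z_score(20, 10, 30), 2))  # mid-range result
print(round(blood_z_score(30, 10, 30), 2))  # result at the top of the range
```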
I decided to use the Orange machine learning and data visualisation toolkit for this project. It was straightforward to load the data set of 46 features for each of the 50 cyclists. The two target variables were lumbar spine Z-score (bone health) and 60 minute FTP watts per kilo (performance). The statistics confirmed the researchers’ suspicion that the lumbar spine bone density of the cyclists would be below average, partly due to the non-weight-bearing nature of the sport. Some of the readings were extremely low (verging on osteoporosis) and the question was why.
Given the relatively small size of the data set (a sample of 50), the most straightforward approach for identifying the key explanatory variables was to search for an optimal Decision Tree. Interestingly, low EA turned out to be the most important variable in explaining lumbar spine bone density, followed by prior participation in a weight-bearing sport and levels of vitamin D (which were, in most cases, below the ideal level for athletes). Since I had used all the data to generate the tree, I made use of Orange’s data sampler to confirm that these results were highly robust. This had some similarities with the Random Forest approach. Although Orange produces some simple graphical tools like the following, I used Python to generate my own versions for the final publication.
Decision Tree from Orange
Published Decision Tree
Finding a robust decision tree is one thing, but it was essential to verify whether the decision variables were statistically significant. For this, Orange provides box plots for discrete variables. For my own peace of mind, I recalculated all of the Student’s T-statistics to confirm that they were correct and significant. The charts below show an example of an Orange box plot and the final graphic used in the publication.
Orange Box Plot
Final Box Plots
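For readers who want to repeat this kind of check, a two-sample Student's T-statistic with pooled variance can be computed with the standard library alone. The function name and the two groups of lumbar spine Z-scores below are made up for illustration and are not the study data.

```python
import math
from statistics import mean, variance


def students_t(group_a, group_b):
    """Return the pooled-variance T-statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    degrees_of_freedom = n_a + n_b - 2
    # Pooled estimate of the common variance of the two groups
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, degrees_of_freedom


# Hypothetical Z-scores for cyclists with low and adequate energy availability
low_ea = [-2.0, -1.5, -2.5]
adequate_ea = [-0.5, 0.0, -1.0, 0.5]
t_statistic, degrees_of_freedom = students_t(low_ea, adequate_ea)
print(f"t = {t_statistic:.2f} on {degrees_of_freedom} degrees of freedom")
```

The statistic is then compared with the T-distribution for the given degrees of freedom to judge significance.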
The Orange toolkit includes other nice data visualisation tools. I particularly liked the flexibility available to make scatter plots. This inspired the third figure in the publication, which showed the most important variable explaining performance. This chart highlights a cluster of three cyclists with low EA, whose FTP watts/kg were lower than expected, based on their high training load. I independently checked the T-statistics of the regression coefficients to identify relationships that were significant, like training load, or insignificant, like percentage body fat.
Orange Scatter Plot
Published Scatter Plot
The Orange toolkit turned out to be extremely helpful in identifying relationships that fed directly into the conclusions of an important medical paper highlighting potential health risks and performance drivers for high level cyclists. Restricting nutrition through diet or fasted rides can lead to low energy availability, which can cause endocrine responses in the body that reduce lumbar spine bone density, resulting in vulnerability to fracture and slow recovery. This is known as Relative Energy Deficiency in Sport (RED-S). Despite the obsession of many cyclists with reducing body fat, the key variable explaining functional threshold power watts/kg was weekly training load.