Machine learning for a medical study of cyclists

Screen Shot 2018-10-11 at 15.28.46

This blog provides a technical explanation of the analysis underlying the medical paper about male cyclists described previously. Part of the skill of a data scientist is to choose from the arsenal of machine learning techniques the tools that are appropriate for the problem at hand. In the study of male cyclists, I was asked to identify significant features of a medical data set. This article describes how the problem was tackled.


Fifty road racing cyclists, riding at the equivalent of British Cycling 2nd category or above, were asked to complete a questionnaire, provide a blood sample and undergo a DXA scan – a low intensity X-ray used to measure bone density and body composition. I used Python to load and clean up the data, so that all the information could be represented in Pandas DataFrames. As expected this time-consuming, but essential step required careful attention and cross-checking, combined with the perseverance that is always necessary to be sure of working with a clean data set.

The questionnaire included numerical data and text relating to cycling performance, training, nutrition and medical history. As a result of interviewing each cyclist, a specialist sports endocrinologist identified a number of individuals who were at risk of low energy availability (EA), due to a mismatch between nutrition and training load.

Bone density was measured throughout the body, but the key site of interest was the lumbar spine (L1-L4). Since bone density varies with age and between males and females, it was logical to use the male, age-adjusted Z-score, expressing values in standard deviations above or below the comparable population mean.

The measured blood markers were provided in the relevant units, alongside the normal range. Since the normal range is defined to cover 95% of the population, I assumed that the population could be modelled by a gaussian distribution in order to convert each blood result into a Z-score. This aligned the scale of the blood results with the bone density measures.


I decided to use the Orange machine learning and data visualisation toolkit for this project. It was straightforward to load the data set of 46 features for each of the 50 cyclists. The two target variables were lumbar spine Z-score (bone health) and 60 minute FTP watts per kilo (performance). The statistics confirmed the researchers’ suspicion that the lumbar spine bone density of the cyclists would be below average, partly due to the non-weight-bearing nature of the sport. Some of the readings were extremely low (verging on osteoporosis) and the question was why.

Given the relatively small size of the data set (a sample of 50), the most straightforward approach for identifying the key explanatory variables was to search for an optimal Decision Tree. Interestingly, low EA turned out to be the most important variable in explaining lumbar spine bone density, followed by prior participation in a weight-bearing sport and levels of vitamin D (which was, in most cases, below the ideal level of athletes). Since I had used all the data to generate the tree, I made use of Orange’s data sampler to confirm that these results were highly robust. This had some similarities with the Random Forest approach. Although Orange produces some simple graphical tools like the following, I use Python to generate my own versions for the final publication.


Finding a robust decision tree is one thing, but it was essential to verify whether the decision variables were statistically significant. For this, Orange provides box plots for discrete variables. For my own peace of mind, I recalculated all of the Student’s T-statistics to confirm that they were correct and significant. The charts below show an example of an Orange box plot and the final graphic used in the publication.

The Orange toolkit includes other nice data visualisation tools. I particularly liked the flexibility available to make scatter plots. This inspired the third figure in the publication, which showed the most important variable explaining performance. This chart highlights a cluster of three cyclists with low EA, whose FTP watts/kg were lower than expected, based on their high training load. I independently checked the T-statistics of the regression coefficients to identify relationships that were significant, like training load, or insignificant, like percentage body fat.


The Orange toolkit turned out to be extremely helpful in identifying relationships that fed directly into the conclusions of an important medical paper highlighting potential health risks and performance drivers for high level cyclists. Restricting nutrition through diet or fasted rides can lead to low energy availability, that can cause endocrine responses in the body that reduce lumbar spine bone density, resulting in vulnerability to fracture and slow recovery. This is know as Relative Energy Deficiency in Sport (RED-S). Despite the obsession of many cyclists to reduce body fat, the key variable explaining functional threshold power watts/kg was weekly training load.


Low energy availability assessed by a sport-specific questionnaire and clinical interview indicative of bone health, endocrine profile and cycling performance in competitive male cyclists, BMJ Open Sport & Exercise Medicine,

Relative Energy Deficiency in Sport, British Association of Sports and Exercise Medicine

Synergistic interactions of steroid hormones, British Journal of Sports Medicine

Cyclists: Make No Bones About It, British Journal of Sports Medicine

Male Cyclists: bones, body composition, nutrition, performance, British Journal of Sports Medicine


Strava Ride Statistics

If you ride with a power meter and a heart rate monitor, Strava’s premium subscription will display a number of summary statistics about your ride. These differ from the numbers provided by other software, such as Training Peaks. How do all these numbers relate to each other?

A tale of two scales

Over the years, coaches and academics have developed statistics to summarise the amount of physiological stress induced by different types of endurance exercise. Two similar approaches have gained prominence. Dr Andrew Coggan has registered the names of several measures used by Training Peaks. Dr Phil Skiba has developed as set of metrics used in the literature and by PhysFarm Training Systems. These and other calculations are available on Golden Cheetah‘s excellent free software.

Although it is possible to line up metrics that roughly correspond to each other, the calculations are different and the proponents of each scale emphasise particular nuances that distinguish them. This makes it hard to match up the figures.

Here is an example for a recent hill session. The power trace is highly variable, because the ride involved 12 short sharp climbs.

Metric Coggan TrainingPeaks Skiba Literature Strava
Power equivalent physiological cost of ride Normalised Power 282 xPower 252 Weighted Avg Power 252
Power variability of ride Variability Index 1.57 Variability Index 1.41
Rider’s sustainable power Functional Threshold Power 312 Critical Power 300 FTP 300
Power cost / sustainable power Intensity Factor 0.9 Relative Intensity 0.84 Intensity 0.84
Assessment of intensity and duration of ride Training Stress Score 117 BikeScore 101 Training Load 100
Training Impulse based on heart rate Suffer Score 56

Weighted Average Power

According to Strava, Weighted Average Power takes account of the variability of your power reading during a ride. “It is our best guess at your average power if you rode at the exact same wattage the entire ride.” That sounds an awful lot like Normalized Power, which is described on Training Peaks as “an estimate of the power that you could have maintained for the same physiological “cost” if your power output had been perfectly constant (e.g., as on a stationary cycle ergometer), rather than variable”. But it is apparent from the table above that Strava is calculating Skiba’s xPower.

The calculations of Normalized Power and xPower both smooth the raw power data, raise these observations to the fourth power, take the average over the whole ride and obtain the fourth root to give the answer.

Normalized Power or xPower = (Average(Psmoothed4))1/4

The only difference between the calculations is the way that smoothing accounts for the body’s physiological delay in reacting to rapid changes in pedalling power. Normalized Power uses a 30 second moving average, whereas xPower uses a “25 second exponential average”. According to Skiba, exponential decay is better than Coggan’s linear decay in representing the way the body reacts to changes in effort.

The following chart zooms into part of the hill reps session, showing the raw power output (in blue), moving average smoothing for Normalised Power (in green), exponential smoothing for xPower (in red), with heart rate shown in the background (in grey). Two important observations can be made. Firstly, xPower’s exponential smoothing is more highly correlated with heart rate, so it could be argued that it does indeed correspond more closely with the underlying physiological processes. Secondly, the smoothing used for xPower is less volatile, therefore xPower will always be lower than Normalized Power (because the fourth-power scaling is dominated by the highest observations).


Why do both metrics take the watts and raise them to the fourth power? Coggan states that many of the body’s responses are “curvilinear”. The following chart is a good example, showing the rapid accumulation of blood lactate concentration at high levels of effort.

Screen Shot 2017-04-20 at 15.08.31

Plotting the actual data from a recent test on a log-log scale, I obtained a coefficient of between 3.5 and 4.7, for the relation between lactate level and watts. This suggests that taking the average of smoothed watts raised to the power 4 gives an indication of the average level of lactate in circulation during the ride.

The hill reps ride included multiple bouts of high power, causing repeated accumulation of lactate and other stress related factors. Both the Normalised Power of 282W and xPower of 252W were significantly higher than the straight average power of 179W. The variability index compares each adjusted power against average power, resulting in variability indices of 1.57 and 1.41 respectively. These are very high figures, due to the hilly nature of the session. For a well-paced time trial, the variability index should be close to 1.00.

Sustainable Power

It is important for a serious cyclist to have a good idea of the power that he or she can sustain for a prolonged period. Functional Threshold Power and Critical Power measure slightly different things. The emphasis of FTP is on the maximum power sustainable for one hour, whereas CP is the power theoretically sustainable indefinitely. So CP should be lower than FTP.

Strava allows you to set your Functional Threshold Power under your personal performance settings. The problem is that if Strava’s Weighted Average Power is based on Skiba’s xPower, it would be more consistent to use Critical Power, as I did in the table above. This is important because this figure is used to calculate Intensity and Training Load. If you follow Strava’s suggestion of using FTP, subsequent calculations will underestimate your Training Load,  which, in turn, impacts your Fitness & Freshness curves.


The idea of intensity is to measure severity of a ride, taking account of the rider’s individual capabilities.  Intensity is defined as the ratio of the power equivalent physiological cost of the ride relative to your sustainable power. For Coggan, the Intensity Factor is NP/FTP; for Skiba the Relative Intensity is xPower/CP; and for Strava the Intensity is Weighted Average Power/FTP.

Training Load

An overall assessment of a ride needs to take account of the intensity and the duration of a ride. It is helpful to standardise this for an individual rider, by comparing it against a benchmark, such as an all-out one hour effort.

Coggan proposes the Training Stress Score that takes the ratio the work done at Normalised Power, scaled by the Intensity Factor, relative to one hour’s work at FTP. Skiba defines the BikeScore as the ratio the work done at xPower, scaled by the Relative Intensity, relative to one hour’s work at CP. And finally, Strava’s Training Load takes the ratio the work done at Weighted Average Power, scaled by Intensity, relative to one hour’s work at FTP.

Note that for my hill reps ride, the BikeScore of 101, was considerably lower than the TSS of 117. Although my estimated CP is 12W lower than my FTP, xPower was 30W lower than NP. Using my CP as my Strava FTP, Strava’s Training Load is the same as Skiba’s Bike Score (otherwise I’d get 93).

Suffer Score

Strava’s Suffer Score was inspired by Eric Banister’s training-impulse (TRIMP) concept. It is derived from the amount of time spent in each heart rate zone, so it can be calculated for multiple sports. You can set your Strava heart rate zones in your personal settings, or just leave then on default, based on your maximum heart rate.

A non-linear relationship is assumed between effort and heart rate zone. Each minute in Zone 1, Endurance, is worth 12 seconds; Moderate Zone 2 minutes are worth 24 seconds; Zone 3 Tempo minutes are worth 45 seconds; Zone 4 Threshold minutes are worth 100 seconds; and Anaerobic Zone 5 minutes are worth 120 seconds. The Suffer Score is the weighted sum of minutes in each zone.

The next blog will comment on the Fitness & Freshness charts available on Strava Premium.