## Strava – Automatic Lap Detection

As you upload your data, you accumulate a growing history of rides. It is helpful to find ways of classifying different types of activities. Races and training sessions often include laps that are repeated during the ride. Many GPS units can automatically record laps as you pass the point where you began your ride or last pressed the lap button. However, if the laps were not recorded on the device, it is tricky to recover them. This article investigates how to detect laps automatically.

First consider the simple example of a 24 lap race around the Hillingdon cycle circuit. Plotting the GPS longitude and latitude against time displays repeating patterns. It is even possible to see the “omega curve” in the longitude trace. So it should be possible to design an algorithm that uses this periodicity to calculate the number of laps.

This is a common problem in signal processing, where the Fourier Transform offers a neat solution. This effectively compares the signal against all possible frequencies and returns values with the best fit in the form of a power spectrum. In this case, the frequencies correspond to the number of laps completed during the race. In the bar chart below, the power spectrum for latitude shows a peak around 24. The high value at 25 probably shows up because I stopped my Garmin slightly after the finish line. A “harmonic” also shows up at 49 “half laps”. Focussing on the peak value, it is possible to reconstruct the signal using a frequency of 24, with all others filtered out.

So we’re done – we can use a Fourier Transform to count the laps! Well not quite. The problem is that races and training sessions do not necessarily start and end at exactly the starting point of a lap. As a second example, consider my regular Saturday morning club run, where I ride from home to the meeting point at the centre of Richmond Park, then complete four laps before returning home. As show in the chart below, a simple Fourier Transform approach suggests that ride covered 5 laps, because, by chance, the combined time for me to ride south to the park and north back home almost exactly matches the time to complete a lap of the park. Visually it is clear that the repeating pattern only holds for four laps.

Although it seems obvious where the repeating pattern begins and ends, the challenge is to improve the algorithm to find this automatically. A brute force method would compare every GPS location with every other location on the ride, which would involve about 17 million comparisons for this ride, then you would need to exclude the points closely before or after each recording, depending on the speed of the rider. Furthermore, the distance between two GPS points involves a complex formula called the haversine rule that accounts for the curvature of the Earth.

Fortunately, two tricks can make the calculation more tractable. Firstly, the peak in the power spectrum indicates roughly how far ahead of the current time point to look for a location potentially close to the current position. Given a generous margin of, say, 15% variation in lap times, this reduces the number of comparisons by a whole order of magnitude. Secondly, since we are looking for points that are very close together, we only need to multiply the longitudes by the cosine of the latitude (because lines of longitude meet at the poles) and then a simple Euclidian sum the squares of the differences locates points within a desired proximity of, say, 10 metres.  This provides a quicker way to determine the points where the rider was “lapping”. These are shaded in yellow in the upper chart and shown in red on a long/latitude plot below. The orange line on the upper chart shows, on the right hand scale, the rolling lap time, i.e. the number of seconds to return to each point on the lap, from which the average speed can be derived.

Two further refinements were required to make the algorithm more robust. One might ask whether it makes a difference using latitude or longitude. If the lap involved riding back and forth along a road that runs due East-West, the laps would show up on longitude but not latitude. This can be solved by using a 2-dimensional Fourier Transform and checking both dimensions. This, in turn, leads to the second refinement, exemplified by the final example of doing 12 ascents of the Nightingale Lane climb. The longitude plot includes the ride out to the West, 12 reps and the Easterly ride back home.

The problem here was that the variation in longitude/latitude on the climb was tiny compared with the overall ride. Once again, the repeating section is obvious to the human eye, but more difficult to unpick from its relatively low peak in the power spectrum. A final trick was required: to consider the amplitude of each frequency in decreasing order of power and look out for any higher frequency peaks that appear early on the list. This successfully identified the relevant part of the ride, while avoiding spurious observations for rides that did not include laps.

The ability for an algorithm to tag rides if they include laps is helpful for classifying different types of sessions. Automatically marking the laps would allow riders and coaches to compare laps against each other over a training session or a race. A potential AI-powered robo-coach could say “Ah, I see you did 12 repeats in your session today… and apart from laps 9 and 10, you were getting progressively slower….”

## Strava Power Curve

If you use a power meter on Strava premium, your Power Curve provides an extremely useful way to analyse your rides. In the past, it was necessary to perform all-out efforts, in laboratory conditions, to obtain one or two data points and then try to estimate a curve. But now your power meter records every second of every ride. If you have sustained a number of all-out efforts over different time intervals, your Power Curve can tell you a lot about what kind of rider you are and how your strengths and weaknesses are changing over time.

Strava provides two ways to view your Power Curve: a historical comparison or an analysis of a particular ride. Using the Training drop-down menu, as shown above, you can compare two historic periods. The curves display the maximum power sustained over time intervals from 1 second to the length of your longest ride. The times are plotted on a log scale, so that you can see more detail for the steeper part of the curve. You can select desired time periods and choose between watts or watts/kg.

The example above compares this last six weeks against the year to date. It is satisfying to see that the six week curve is at, or very close to, the year to date high, indicating that I have been hitting new power PBs (personal bests) as the racing season picks up. The deficit in the 20-30 minute range indicates where I should be focussing my training, as this would be typical of a breakaway effort. The steps on the right hand side result from having relatively few very long rides in the sample.

Note how the Power Curve levels off over longer time periods: there was a relatively small drop from my best hour effort of 262 watts to 243 watts for more than two hours. This is consistent with the concept of a Critical Power that can be sustained over a long period. You can make a rough estimate of your Functional Threshold Power by taking 95% of your best 20 minute effort or by using your best 60 minute effort, though the latter is likely to be lower, because your power would tend to vary quite a bit due to hills, wind, drafting etc., unless you did a flat time trial. Your 60 minute normalised power would be better, but Strava does not provide a weighted average/normalised power curve. An accurate current FTP is essential for a correct assessment of your Fitness and Freshness.

Switching the chart to watts/kg gives a profile of what kind of rider you are, as explained in this Training Peaks article. Sprinters can sustain very high power for short intervals, whereas time trial specialists can pump out the watts for long periods. Comparing myself against the performance table, my strengths lie in the 5 minutes to one hour range, with a lousy sprint.

The other way to view your Power Curve comes under the analysis of a particular ride. This can be helpful in understanding the character of the ride or for checking that training objectives have been met. The target for the session above was to do 12 reps on a short steep hill. The flat part of the curve out to about 50 seconds represents my best efforts. Ideally, each repetition would have been close to this. Strava has the nice feature of highlighting the part of the course where the performance was achieved, as well as the power and date of the historic best. The hump on the 6-week curve at 1:20 occurred when I raced some club mates up a slightly longer steep hill.

If you want to analyse your Power Curve in more detail, you should try Golden Cheetah. See other blogs on Strava Fitness and Freshness, Strava Ride Statistics or going for a Strava KOM.

## What are you looking at?

In a recent blog, I described an experiment to train a deep neural network to distinguish between photographs of Vincenzo Nibali and Alejandro Valverde, using a very small data set of images. In the conclusion, I suggested that the network was probably basing its decisions more on the colours of the riders’ kit rather than on facial recognition. This article investigates what the network was actually “looking at”, in order to understand better how it was making decisions.

The issues of accountability and bias were among the topics discussed at the last NIPS conference. As machine learning algorithms are adopted across industry, it is important for companies to be able to explain how conclusions are reached. In many instances, it is not acceptable simply to rely on an impenetrable black box. AI researchers and developers need to be able to explain what is going on inside their models, in order to justify decisions taken. In doing so, some worrying instances of bias have been revealed in the selection of data used to train the algorithms.

I went back to my rider recognition model and used an approach called “Class Activation Maps” to identify which parts of the images accounted for the network’s choice of rider. Making use of the code provided in lesson 7 of the course offered by fast.ai, I took advantage of my existing small set of training, validation and test images of the two famous cyclists. Starting with a pre-trained version of ResNet34, the idea was to replace the last two layers with four new ones, the crucial one being a convolutional layer with two outputs, matching the number of cyclists in the classification task. The two outputs of this layer were 7×7 matrix representations of the relevant image.

The final predictions of the model came from a softmax of a flattened average pooling of these 7×7 representations. The softmax output gave the probabilities of Nibali and Valverde respectively. Since there was no learning beyond the final convolution, the activations of the two 7×7 matrices represented the “Nibali-ness” and “Valverde-ness” of the image. This could be displayed as a heat map on top of the image.

Examples are shown below for the validation set of 10 images of Nibali followed by 10 of Valverde. The yellow patch of the heat map highlights the part of the image that led to the prediction displayed above each image. Nine out of ten were correct for Nibali and six for Valverde.

The heat maps were very helpful in understanding the model’s decision making process. It seemed that for Nibali, his face and helmet were important, with some attention paid to the upper part of his blue Astana kit. In contrast, the network did a very good job at identifying the M on Valverde’s Moviestar kit. It was interesting to note that the network succeeded in spotting that Nibali was wearing a Specialized helmet whereas Valverde had a Catlike design. Three errors arose in the photos of his face, which was mistaken for Nibali’s. In fact, any picture of a face led to a prediction of Nibali, as demonstrated by the cropped image below that was used for training.

Why should that be? Looking back at the training set, it turned out that, by chance, there were far more mugshots of Nibali, while there were more photos of Valverde riding his bike, with his face obscured by sunglasses. This was an example of unintentional bias in the training data, providing a very useful lesson.

The final set of pictures shows the predictions made on the out-of-sample test set. All the predictions are correct, except the first one, where the model failed to spot the green M on Valverde’s chest and mistook the blurred background for Nibali. Otherwise the results confirmed that the network looked at Nibali’s face, the rider’s helmet or Valverde’s kit. It also remembered seeing an image of Nibali holding the Giro trophy in the training set.

In conclusion, Class Activation Maps provide a useful way of visualising the activations of hidden laters in a deep neural network. This can go some way to accounting for the decisions that appear in the output. The approach can also help identify unintentional bias in the training set.

## Suddenly Summer in Richmond Park

### Tour de Richmond Park Leaderboard – year to date 2018

This week’s dramatic change in the weather has seen a string of quick laps recorded for the Tour de Richmond Park. Twelve out of the fastest fifteen efforts were completed on 18/19 April. Apart from the sheer pleasure of finally being able to ride in short sleeves, two meteorological factors came into play: higher temperatures and a favourable wind direction.

As noted in an earlier blog, changes in temperature have a far greater impact on air density than variations in atmospheric pressure and humidity. When I completed a lap last week, the temperature was 6oC, but on 19 April it was closer to 26oC. The warmer weather had the effect of reducing air density by more than 7%. Theoretically, this should allow you to ride about 2% faster for the same effort. Using a physics model I built last year to analyse Strava segments, it is possible to estimate the effect of variations in the factors that determine your position on the leaderboard. Based on an average power of 300W and some reasonable estimates of other variables, this rise in temperature would reduce your time from 16:25 to 16:04 (as expected, 2% quicker).

The other key factor is the wind. On 18/19 April, it was blowing from the south or southeast. This was not the mythical easterly that provides a tailwind up Sawyers Hill, but according to the analysis in another earlier blog, it is generally beneficial for doing a quick lap around the park.

I clocked up a decent time this morning, to reach 15th place on the year-to-date leaderboard, but I failed to take my own advice on the best time of day. The traffic tends to be lighter first thing in the morning or in the evening, when the park closes. After waiting until mid-morning for the temperature to rise, I ended up being blocked by slow-moving vehicles on two occasions.

Although it was frustrating having to brake for traffic, the really puzzling thing was an average power reading of 254W. This is much lower than the other riders on the leaderboard. Last week, I did a lap in 16:44 at an average power of 313W, which seems much more reasonable. Admittedly, I was wearing a skin suit today, but that would not have saved 50W. It is possible that I had some drafting benefit from the numerous cars in the park and some favourable gusts of wind. However, my suspicion is that my Garmin Vector pedals had not calibrated correctly, after I switched them from my road bike, before today’s ride.

The concluding message is get on your bike and enjoy the sunshine. And why not try to beat your best time for the Tour de Richmond Park?

## Which team is that?

My last blog explored the effectiveness of deep learning in spotting the difference between Vincenzo Nibali and Alejandro Valverde. Since the faces of the riders were obscured in many of the photos, it is likely that the neural network was basing its evaluations largely on the colours of their team kit. A natural next challenge is to identify a rider’s team from a photograph. This task parallels the approach to the kaggle dog breed competition used in lesson 2 of the fast.ai course on deep learning.

Eighteen World Tour teams are competing this year. So the first step was to trawl the Internet for images, ideally of riders in this year’s kit. As before, I used an automated downloader, but this posed a number of problems. For example, searching for “Astana” brings up photographs of the capital of Kazakhstan. So I narrowed things down by searching for  “Astana 2018 cycling team”. After eliminating very small images, I ended up with a total of about 9,700 images, but these still included a certain amount of junk that I did have the time to weed out, such as photos of footballers or motorcycles in the “Sky Racing Team”,.

The following small sample of training images is generally OK, though it includes images of Scott bikes rather than Mitchelton-Scott riders and  a picture of  Sunweb’s Wilco Kelderman labelled as FDJ. However, with around 500-700 images of each team, I pressed on, noting that, for some reason, there were only 166 of Moviestar and these included the old style kit.

For training on this multiple classification problem, I adopted a slightly more sophisticated approach than before. Taking a pre-trained Resnet50 model, I performed some initial fine-tuning, on images rescaled to 224×224. I settled on an optimal learning rate of 1e-3 for the final layer, while allowing some training of lower layers at much lower rates. With a view to improving generalisation, I opted to augment the training set with random changes, such as small shifts in four directions, zooming in up to 10%, adjusting lighting and left-right flips. After initial training, accuracy was 52.6% on the validation set. This was encouraging, given that random guesses would have achieved a rate of 1 in 18 or 5.6%.

Taking a pro tip from fast.ai, training proceeded with the images at a higher resolution of 299×299. The idea is to prevent overfitting during the early stages, but to improve the model later on by providing more data for each image. This raised the accuracy to 58.3% on the validation set. This figure was obtained using a trick called “test time augmentation”, where each final prediction is based on the average prediction of five different “augmented” versions of the image in question.

Given the noisy nature of some of the images used for training, I was pleased with this result, but the acid test was to evaluate performance on unseen images. So I created a test set of two images of a lead rider from each squad and asked the model to identify the team. These are the results.

The trained Resnet50 correctly identified the teams of 27 out of 36 images. Interestingly, there were no predictions of MovieStar or Sky. This could be partly due to the underrepresentation of MovieStar in the training set. Froome was mistaken for AG2R and Astana, in column 7, rows 2 and 3. In the first image, his 2018 Sky kit was quite similar to Bardet’s to the left and in the second image the sky did appear to be Astana blue! It is not entirely obvious why Nibali was mistaken for Sunweb and Astana, in the top and bottom rows. However, the huge majority of predictions were correct. An overall  success rate of 75% based on an afternoon’s work was pretty amazing.

The results could certainly be improved by cleaning up the training data, but this raises an intriguing question about the efficacy of artificial intelligence. Taking a step back, I used Bing’s algorithms to find images of cycling teams in order to train an algorithm to identify cycling teams. In effect, I was training my network to reverse-engineer Bing’s search algorithm, rather than my actual objective of identifying cycling teams. If an Internet search for FDJ pulls up an image of Wilco Kelderman, my network would be inclined to suggest that he rides for the French team.

In conclusion, for this particular approach to reach or exceed human performance, expert human input is required to provide a reliable training set. This is why this experiment achieved 75%, whereas the top submissions on the dog breeds leaderboard show near perfect performance.

## Valverde or Nibali?

Alejandro Valverde has kicked off the 2018 season with an impressive series of wins. Meanwhile Vincenzo Nibali delighted the tifosi with his victory in Milan San Remo. It is pretty easy to tell these two riders apart in the pictures above, but could computer distinguish between them?

Following up on my earlier blogs about neural networks, I have been taking a look at the updated version of fast.ai’s course on deep learning. With the field advancing at a rapid pace, this provides a good way to staying up to date with the state of the art. For example, there are now a couple of cheaper alternatives to AWS for accessing high powered GPUs, offered by Paperspace and Crestle. The latest fast.ai libraries include many new tools that work extremely well in practice.

There’s a view that deep learning requires hours of training on high-powered supercomputers, using thousands (or millions) of labelled examples, in order to learn to perform computer vision tasks. However, newer architectures, such as ResNet, are able to run on much smaller data sets. In order to test this, I used an image downloader to grab photos of Nibali and Valverde and manually selected about 55 decent pictures of each one.

I divided the images into a training set with about 40 images of each rider, a validation set with 10 of each and a test set containing the rest. Nibali appears in a range of different coloured jerseys, though the Astana blue is often present. Valverde is mainly wearing the old dark blue Movistar kit with a green M. There were more close-up shots of Nibali’s face than Valverde.

I was able to fine-tune a pre-trained ResNet neural network to this task, using some of the techniques from the fast.ai tool box, each designed to improve generalisation. The first trick was to augment the training set by performing minor transformations of the images at random, such as taking a mirror image, shifting left or right and zooming in a bit. The second set of tricks varied the rate of learning as the algorithm iterated repeatedly through the training set. A final useful technique created a set of variants of each test image and took the average of the predictions. Everything ran at lightning speed on a Paperspace GPU. After a run time of just a few minutes, the ResNet was able to  score 17 out of 20 on the following validation set.

The confusion matrix shows that the model correctly identified all the Nibali images, but it was wrong on three pictures of Valverde. The first incorrect image (below) shows Valverde in the red leader’s jersey of the Tour of Murcia, which is not dissimilar to Nibali’s new Bahrain Merida kit, though he was wearing red in two of his training images. In the second instance, the network was fooled by the change in colour of Moviestar’s kit, which had become rather similar to Astana’s light blue. The figure of 0.41 above the close-up image indicates that the model assigned only a 41% probability that the image was Valverde. It probably fell below the critical 50% level, in spite of the blue/green colours, because there were were far more close-up shots of Nibali than Valverde in the training set.

Overall of 17 out of 20 on the validation set is impressive. However, the network had access to the validation set during training, so this result is “in sample”. A proper  “out of sample” evaluation of the model’s ability made use the following ten images, comprising the test set that was kept aside.

Amazingly, the model correctly identified 9 out of the 10 pictures it had not seen before. The only error was the Valverde selfie shown in the final image. In order to work better in practice, the training set would need to include more examples of the riders’ 2018 kit. A variant of the problem would be to identify the team rather than the rider. The same network can be trained for multiple classes rather than just two.

This experiment shows that it is pretty straightforward to run state of the art image recognition tools remotely on a GPU somewhere in the cloud and come up with pretty impressive results, even with a small data set.

The next blog describes how to identify a rider’s team.

## Froome’s data on Strava

Chris Froome has been logging data on Strava since the beginning of the year. He had already completed over 4,500km, around Johannesburg, in the first four weeks of January. The weather has been hot and he has been based at an altitude of around 1350m. Some have speculated that he has been replicating the conditions of a grand tour, so that measurements can be made that may assist in his defence against the adverse analytical finding made at last year’s Vuelta.

Whatever the reasons, Froome chose to “Empty the tank” with epic ride on 28 January, completing 271km in just over six hours at an average of 44.8kph. The activity was flagged on Strava, presumably because he completed it suspiciously fast. For example, he rode the 20km Back Straight segment at 50.9kph, finishing in 24:24, nearly four minutes faster than holder of the the KOM: a certain Chris Froome. Since there was no significant wind blowing, one can only assume he was being motor-paced.

One interesting thing about rides displayed publicly on Strava is that anyone can download a GPX file of the route, which shows the latitude, longitude and altitude of the rider, typically at one second intervals. Although Froome is one of the professional riders who prefer to keep their power data private, this blog explores the possibility of estimating power from the  GPX file. The plan is similar to the way Strava estimates power.

1. Calculate the rider’s speed from changes in position
3. Estimate air density from historic weather reports
4. Make assumptions about rider/bike mass, aerodynamic drag, rolling resistance
5. Estimate power required to ride at estimated speed

## Knowledge is power

An interesting case study is Froome’s TT Bike Squeeeeze from 6 January, which included a sustained 2 hour TT effort. Deriving speed and gradient from the GPX file is straightforward, though it is helpful to include smoothing (say, a five second average) to iron out noise in the recording. It is simple to check the average speed and charts against those displayed on Strava.

Several factors affect air density. Firstly, we can obtain the local weather conditions from sources, such as Weather Underground. Froome set off at 6:36am, when it was still relatively cool, but he Garmin shows that it warmed up from 18 degrees to 40 degrees during the ride. Taking the average of 29 for the whole ride simplifies matters. Air pressure remained constant at around 1018hPa, but this is always quoted for sea level, so the figure needs to be adjusted for altitude. Froome’s GPS recorded an altitude range from 1242m to 1581m. However we can see that his starting altitude was recorded as 1305m, when the actual altitude of this location was 1380m. We conclude that his average altitude for the ride, recorded at 1436m, needs to be corrected by 75m to 1511m and opt to use this as an elevation adjustment for the whole ride. This is important because the air is sufficiently less dense at this altitude to have a noticeable impact on aerodynamic drag.

An estimate of power requires some additional assumptions. Froome uses his road bike, TT bike and mountain bike for training, sometimes all in the same ride, and we suspect some rides are motor-paced. However, he indicates that the 6 January ride was on the TT bike. So a CdA of 0.22 for drag and a Crr of 0.005 for rolling resistance seem reasonable. Froome weighs about 70kg and fair assumptions were taken for the spec of his bike. Finally, the wind was very light, so it was ignored in the calculations.

Under these assumptions, Froome’s estimated average power was 205W. The red shaded area marks a 2 hour effort completed at 43.7kph, with a higher average power of 271W. His maximal average power sustained over one hour was 321W or 4.58W/kg. There is nothing adverse about these figures; they seem to be eminently within the expected capabilities of the multiple grand tour winner.

Of course, quite a few assumptions went into these calculations, so it is worth identifying the most important ones. The variation of temperature had a small effect: the whole ride at 18 degrees would have required an average of 209W or, at 40 degrees, 201W. Taking account of altitude was important: the same ride at sea level would have required 230W, but the variations in altitude during the ride were not significant. At the speeds Froome was riding, aerodynamics were important: a CdA of 0.25 would have needed 221W, whereas a super-aero CdA of 0.20 rider could have done 195W. This sensitivity analysis suggests that the approach is robust.

Running the same analysis over the “Empty the tank” ride gives an average power requirement of 373W for six hours, which is obviously suspect. However, if he was benefiting from a 50% reduction in drag by following a motor vehicle, his estimated average power for the ride would have been 244W – still pretty high, but believable.

Posting rides on Strava provides an independently verifiable adjunct to a biological passport.

## Cycling Data Science – building models

In the previous blog, I explored the structure of a data set of summary statistics from over 800 rides recorded on my Garmin device. The K-means algorithm was an example of unsupervised learning that identified clusters of similar observations without using any identifying labels. The Orange software, used previously, makes it extremely easy to compare a number of simple models that map a ride’s statistics to its type: race, turbo trainer or just a training ride. Here we consider Decision Trees, Random Forests and Support Vector Machines.

### Decision Trees

Perhaps the most basic approach is to build a Decision Tree. The algorithm finds an efficient way to make a series of binary splits of the data set, in order to arrive at a set of criteria that separates the classes, as illustrated below.

The first split separates the majority of training rides from races and turbo trainer sessions, based on an average speed of 35.8km/h. Then Average Power Variance helped identify races, as observed in the previous blog. After this, turbo trainer sessions seemed to have a high level of TISS Aerobicity, which relates to the percentage of effort done aerobically. Pedalling balance, fastest 500m and duration separated the remaining rides. An attractive way to display these decisions is to create a Pythagorean Tree, where the sides of each triangle relate to the number of observations split by each decision.

### Random Forests

Many alternative sets of decisions could separate the data, where any particular tree can be quite sensitive to specific observations. A Random Forest addresses this issue by creating a collection of different decision trees and choosing the class by majority vote. This is the Pythagorean Forest representation of 16 trees, each with six branches.

### Support Vector Machines

A Support Vector Machine (SVM) is a widely used model for solving this kind of categorisation problem. The training algorithm finds an efficient way to slice the data, that largely separates the categories, while allowing for some overlap. The points that are closest to the slices are called support vectors. It is tricky to display the results in such a high dimensional space, but the following scatter plot displays Average Power Variance versus Average Speed, where the support vectors are shown as filled circles.

### Comparison of results

A Confusion Matrix provides a convenient way to compare the accuracy of the models. This correlates the predictions versus the actual category labels. Out of the 809 rides, only 684 were labelled. The Decision Tree incorrectly labelled 20 races and 7 turbos as training rides. The Random Forest did the best job, with only six misclassifications, while the SVM made 11 errors.

Looking at the classification errors can be very informative. It turns out that the two training rides classified as races by the SVM had been accidentally mislabelled – they were in fact races! Furthermore, looking at the five races the that SVM classified as training rides, I punctured in one, I crashed in another and in a third race, I was dropped from the lead group, but eventually rolled in a long way behind with a grupetto. The Random Forest also found an alpine race where my Garmin battery failed and classified it as a training ride. So the misclassifications were largely understandable.

After correcting the data set for mislabelled rides, the Random Forest improved to just two errors and the SVM dropped to just eight errors. The Decision Tree deteriorated to 37 errors, though it did recognise that the climbing rate tends to be zero on a turbo training session.

### Prediction

Having trained three models, we can take a look at the sample of 125 unlabelled rides. The following chart shows the predictions of the Random Forest model. It correctly identified one race and suggested several turbo trainer sessions. The SVM also found another race.

### Conclusions

Several lessons can be learned from these experiments. Firstly, it is very helpful to start with a clean data set. But if this is not the case, looking at the misclassified results of a decent model can be useful in catching mislabelled data. The SVM seemed to be good for this task, as it had more flexibility to fit the data than the Decision Tree, but it was less prone to overfit the data than the Random Forest.

The Decision Tree was helpful in quickly identifying average speed and power variance (chart below) as the two key variables. The SVM and Random Forest were both pretty good, but less transparent. One might improve on the results by combining these two models.

The next blog will explore this topic further.

## Cycling Data Science – clusters

Data Science is a hot topic that is impacting a range of diverse areas from business to sport. With so many cyclists collecting and uploading their data, there is plenty of raw material from which to draw interesting insights. This is the first in a series of articles exploring applications of data science in the field of cycling, beginning with the concept of clustering.

As a data set, I took all my Garmin files covering 2014-2017. Having previously uploaded them onto Golden Cheetah (GC), I took advantage of the API that allows external programmes, such as Python, to retrieve data. I also used a Python library to download the same rides from Strava, where I had recorded additional information about the rides.
After a certain amount of (rather time-consuming) tidying up, I ended up with over 800 rides. Each ride had over 200 summary statistics calculated by GC, as well as other meta-data, such as whether the ride was a race or turbo session. The metrics included all the standard items, such as time, distance, speed, heart rate, power, elevation gain, TSS, normalised power, as well as more esoteric metrics like “Time expended when Power is above CP and W’ bal is between 50% and 75% of W'”. When each ride is represented by a point in 200-dimensional space, it is easy to be overwhelmed. As a coach or an informed rider, which metrics are the most meaningful? This is precisely where data science steps in.
I decided to use some open source machine learning and data visualisation software called Orange. This makes it very straightforward to set up simple pipelines using a toolbox of standard approaches, as illustrated above.
One of the first things to do was to ask the computer to look for clusters of rides with similar characteristics. Orange has a useful feature that finds informative projections of the data that can be displayed on a scatter plot. As a first cut, the K-means algorithm categorised the data into four clusters that were largely explained by the time of day and the duration of the ride.

Although this makes a pretty graph, it simply tells us that I start a lot of rides in the morning, but do quite a few in the afternoon and evening. The green cluster includes my longer rides that rather obviously have to start earlier in the day. The scale is annoyingly shown in seconds, so a duration of 1800 would be a five hour ride. The blue band runs from about 1:30pm to about 6:30pm.

Grouping rides by time of day was not very helpful, so I filtered out that variable and searched again for rides that were similar in terms of effort. This made the results much more interesting. Distance and Average Power Variance (APV) were among the most informative metrics. The following scatter plot does a very good job of separating out races (shown in green), from normal rides and turbo trainer sessions (red). The points I did not have time to label are shown in grey.
Average Power Variance measures the mean power deviation with respect to its 30 second moving average. This will be high when power output is continually changing sharply, as it does on very short town centre courses or the Crystal Palace loop, where you are repeatedly sprinting out of corners. When racing on the Hillingdon and Dunsfold circuits or longer Surrey League routes, power is still much more variable than on a club ride. The band of Saturday club riders is very obvious at 53km: four laps of Richmond Park, with varying levels of APV depending on how aggressively the group was riding. You can also see that I quite often do only one or two laps, at about 19km and 30km. Short TTs and hill climb races tend to have less power variability. This was also the case on the endlessly long climbs encountered on the Haute Route. Lastly, turbo sessions have much lower APV because, even if target power levels vary, they tend to be sustained at the same level for each segment.
It is worth noting that APV is not correlated with the Variability Index, which is the ratio of normalised power to average power. APV is affected by continual changes in power output, whereas the Variability Index is strongly affected by power peaks, even if they a relatively few. The two power files below illustrate the difference.

## Conclusions

This analysis draws attention to Average Power Variance as a useful metric that is high for circuit and road races, but lower for TTs and long hilly races. The key observation for me is that relatively little of my training has a high APV.

The next part in this series zooms in on the races, to identify metrics associated with good and bad results.

## Kings and Queens of the Mountains

I guess that most male cyclists don’t pay much attention to the women’s leaderboards on Strava. And if they do it might just be to make some puerile remark about boys being better than girls. From a scientific perspective the comparison of male and female times leads to some interesting analysis.

Assuming both men and women have read my previous blogs on choosing the best time, weather conditions and wind directions for the segment that suits their particular strengths, we come back to basic physics.

KOM or QOM time = Work done / Power = (Work against gravity + Drag x Distance + Rolling resistance x Distance) / (Mass x Watt/kg)

Of the three components of work done, rolling resistance tends to be relatively insignificant. On a very steep hill, most of the work is done against gravity, whereas on a flat course, aerodynamic drag dominates.

The two key factors that vary between men and women are mass and power to weight ratio (watts per kilo).  A survey published by the ONS in 2010, rather shockingly reported that the average British man weighed 83.6kg, with women coming in at 70.2kg. This gives a male/female ratio of 1.19. KOM/QOM cyclists would tend to be lighter than this, but if we take 72kg and 60kg, the ratio is still 1.20.

Males generate more watts per kilogram due to having a higher proportion of lean muscle mass. Although power depends on many factors, including lungs, heart and efficiency of circulation, we can estimate the relative power to weight ratio by comparing the typical body composition of males and females. Feeding the ONS statistics into the Boer formula gives a lean body mass of 74% for men and 65% for women, resulting in a ratio of 1.13. This can be compared against the the useful table on Training Peaks showing maximal power output in Watts/kg, for men and women, over different time periods and a range of athletic abilities. The table is based on the rows showing world record performances and average untrained efforts.  For world champion five minute efforts and functional threshold powers, the ratios are consistent with the lean mass ratio. It makes sense that the ratio should be higher for shorter efforts, where the male champions are likely to be highly muscular. Apparently the relative performance is precisely 1.21 for all durations in untrained people.

On a steep climb, where the work done against gravity dominates, the benefit of additional male muscle mass is cancelled by the fact that this mass must be lifted, so the difference in time between the KOM and the QOM is primarily due to relative power to weight ratio. However, being smaller, women suffer from the disadvantage that the inert mass of bike represents a larger proportion of the total mass that must be raised against gravity. This effect increases with gradient. Accounting for a time difference of up to 16% on the steepest of hills.

In contrast, on a flat segment, it comes down to raw power output, so men benefit from advantages in both mass and power to weight ratio. But power relates to the cube of the velocity, so the elapsed time scales inversely with the cube root of power. Furthermore, with smaller frames, women present a lower frontal area, providing a small additional advantage. So men can be expected to have a smaller time advantage of around 9%. In theory the advantage should continue to narrow as the gradient shifts downhill.

## Theory versus practice

Strava publishes the KOM and QOM leaderboards for all segments, so it was relatively straightforward to check the basic model against a random selection of 1,000 segments across the UK. All  leaderboards included at least 1,666 riders, with an overall average of 637 women and 5,030 men. One of the problems with the leaderboards is that they can be contaminated by spurious data, including unrealistic speeds or times set by groups riding together. To combat this, the average was taken of the top five times set on different dates, rather than simply to top KOM or QOM time.

The average segment length was just under 2km, up a gradient of 3%. The following chart plots the ratio of the QOM time to the KOM time versus gradient compared with the model described above. The red line is based on the lean body mass/world record holders estimate of 1.13, whereas the average QOM/KOM ratio was 1.32. Although there is a perceivable upward slope in the data for positive gradients, clearly this does not fit the data.

Firstly, the points on the left hand side indicate that men go downhill much more fearlessly than women, suggesting a psychological explanation for the observations deviating from the model. To make the model fit better for positive gradients, there is no obvious reason to expect the weight ratio of male to female Strava riders to deviate from the general population, so this leaves only the relative power to weight ratio. According to the model the QOM/KOM ratio should level off to the power to weight ratio for steep gradients. This seems to occur for a value of around 1.40, which is much higher than the previous estimates of 1.13 or the 1.21 for untrained people. How can we explain this?

A notable feature of the data set was that sample of 1,000 Strava segments was completed by nearly eight times as many men as women. This, in turn reflects the facts that there are more male than female cyclists in the UK and that men are more likely to upload, analyse, publicise and gloat over their performances than women.

Having more men than women, inevitably means that the sample includes more high level male cyclists than equivalent female cyclists. So we are not comparing like with like. Referring back to the Training Peaks table of expected power to weight ratios, a figure of 1.40 suggests we are comparing women of a certain level against men of a higher category, for example, “very good” women against “excellent” men.

A further consequence of having far more men than women is that is much more likely that the fastest times were recorded in the ideal conditions described in my previous blogs listed earlier.

## Conclusions

There is room for more women to enjoy cycling and this will push up the standard of performance of the average amateur rider. This would enhance the sport in the same way that the industry has benefited as more women have joined the workforce.