## Strava – Automatic Lap Detection

As you upload your data, you accumulate a growing history of rides. It is helpful to find ways of classifying different types of activities. Races and training sessions often include laps that are repeated during the ride. Many GPS units can automatically record laps as you pass the point where you began your ride or last pressed the lap button. However, if the laps were not recorded on the device, it is tricky to recover them. This article investigates how to detect laps automatically.

First consider the simple example of a 24 lap race around the Hillingdon cycle circuit. Plotting the GPS longitude and latitude against time displays repeating patterns. It is even possible to see the “omega curve” in the longitude trace. So it should be possible to design an algorithm that uses this periodicity to calculate the number of laps.

This is a common problem in signal processing, where the Fourier Transform offers a neat solution. This effectively compares the signal against all possible frequencies and returns values with the best fit in the form of a power spectrum. In this case, the frequencies correspond to the number of laps completed during the race. In the bar chart below, the power spectrum for latitude shows a peak around 24. The high value at 25 probably shows up because I stopped my Garmin slightly after the finish line. A “harmonic” also shows up at 49 “half laps”. Focussing on the peak value, it is possible to reconstruct the signal using a frequency of 24, with all others filtered out.

So we’re done – we can use a Fourier Transform to count the laps! Well not quite. The problem is that races and training sessions do not necessarily start and end at exactly the starting point of a lap. As a second example, consider my regular Saturday morning club run, where I ride from home to the meeting point at the centre of Richmond Park, then complete four laps before returning home. As show in the chart below, a simple Fourier Transform approach suggests that ride covered 5 laps, because, by chance, the combined time for me to ride south to the park and north back home almost exactly matches the time to complete a lap of the park. Visually it is clear that the repeating pattern only holds for four laps.

Although it seems obvious where the repeating pattern begins and ends, the challenge is to improve the algorithm to find this automatically. A brute force method would compare every GPS location with every other location on the ride, which would involve about 17 million comparisons for this ride, then you would need to exclude the points closely before or after each recording, depending on the speed of the rider. Furthermore, the distance between two GPS points involves a complex formula called the haversine rule that accounts for the curvature of the Earth.

Fortunately, two tricks can make the calculation more tractable. Firstly, the peak in the power spectrum indicates roughly how far ahead of the current time point to look for a location potentially close to the current position. Given a generous margin of, say, 15% variation in lap times, this reduces the number of comparisons by a whole order of magnitude. Secondly, since we are looking for points that are very close together, we only need to multiply the longitudes by the cosine of the latitude (because lines of longitude meet at the poles) and then a simple Euclidian sum the squares of the differences locates points within a desired proximity of, say, 10 metres.  This provides a quicker way to determine the points where the rider was “lapping”. These are shaded in yellow in the upper chart and shown in red on a long/latitude plot below. The orange line on the upper chart shows, on the right hand scale, the rolling lap time, i.e. the number of seconds to return to each point on the lap, from which the average speed can be derived.

Two further refinements were required to make the algorithm more robust. One might ask whether it makes a difference using latitude or longitude. If the lap involved riding back and forth along a road that runs due East-West, the laps would show up on longitude but not latitude. This can be solved by using a 2-dimensional Fourier Transform and checking both dimensions. This, in turn, leads to the second refinement, exemplified by the final example of doing 12 ascents of the Nightingale Lane climb. The longitude plot includes the ride out to the West, 12 reps and the Easterly ride back home.

The problem here was that the variation in longitude/latitude on the climb was tiny compared with the overall ride. Once again, the repeating section is obvious to the human eye, but more difficult to unpick from its relatively low peak in the power spectrum. A final trick was required: to consider the amplitude of each frequency in decreasing order of power and look out for any higher frequency peaks that appear early on the list. This successfully identified the relevant part of the ride, while avoiding spurious observations for rides that did not include laps.

The ability for an algorithm to tag rides if they include laps is helpful for classifying different types of sessions. Automatically marking the laps would allow riders and coaches to compare laps against each other over a training session or a race. A potential AI-powered robo-coach could say “Ah, I see you did 12 repeats in your session today… and apart from laps 9 and 10, you were getting progressively slower….”

## Cycling Data Science – clusters

Data Science is a hot topic that is impacting a range of diverse areas from business to sport. With so many cyclists collecting and uploading their data, there is plenty of raw material from which to draw interesting insights. This is the first in a series of articles exploring applications of data science in the field of cycling, beginning with the concept of clustering.

As a data set, I took all my Garmin files covering 2014-2017. Having previously uploaded them onto Golden Cheetah (GC), I took advantage of the API that allows external programmes, such as Python, to retrieve data. I also used a Python library to download the same rides from Strava, where I had recorded additional information about the rides.
After a certain amount of (rather time-consuming) tidying up, I ended up with over 800 rides. Each ride had over 200 summary statistics calculated by GC, as well as other meta-data, such as whether the ride was a race or turbo session. The metrics included all the standard items, such as time, distance, speed, heart rate, power, elevation gain, TSS, normalised power, as well as more esoteric metrics like “Time expended when Power is above CP and W’ bal is between 50% and 75% of W'”. When each ride is represented by a point in 200-dimensional space, it is easy to be overwhelmed. As a coach or an informed rider, which metrics are the most meaningful? This is precisely where data science steps in.
I decided to use some open source machine learning and data visualisation software called Orange. This makes it very straightforward to set up simple pipelines using a toolbox of standard approaches, as illustrated above.
One of the first things to do was to ask the computer to look for clusters of rides with similar characteristics. Orange has a useful feature that finds informative projections of the data that can be displayed on a scatter plot. As a first cut, the K-means algorithm categorised the data into four clusters that were largely explained by the time of day and the duration of the ride.

Although this makes a pretty graph, it simply tells us that I start a lot of rides in the morning, but do quite a few in the afternoon and evening. The green cluster includes my longer rides that rather obviously have to start earlier in the day. The scale is annoyingly shown in seconds, so a duration of 1800 would be a five hour ride. The blue band runs from about 1:30pm to about 6:30pm.

Grouping rides by time of day was not very helpful, so I filtered out that variable and searched again for rides that were similar in terms of effort. This made the results much more interesting. Distance and Average Power Variance (APV) were among the most informative metrics. The following scatter plot does a very good job of separating out races (shown in green), from normal rides and turbo trainer sessions (red). The points I did not have time to label are shown in grey.
Average Power Variance measures the mean power deviation with respect to its 30 second moving average. This will be high when power output is continually changing sharply, as it does on very short town centre courses or the Crystal Palace loop, where you are repeatedly sprinting out of corners. When racing on the Hillingdon and Dunsfold circuits or longer Surrey League routes, power is still much more variable than on a club ride. The band of Saturday club riders is very obvious at 53km: four laps of Richmond Park, with varying levels of APV depending on how aggressively the group was riding. You can also see that I quite often do only one or two laps, at about 19km and 30km. Short TTs and hill climb races tend to have less power variability. This was also the case on the endlessly long climbs encountered on the Haute Route. Lastly, turbo sessions have much lower APV because, even if target power levels vary, they tend to be sustained at the same level for each segment.
It is worth noting that APV is not correlated with the Variability Index, which is the ratio of normalised power to average power. APV is affected by continual changes in power output, whereas the Variability Index is strongly affected by power peaks, even if they a relatively few. The two power files below illustrate the difference.

## Conclusions

This analysis draws attention to Average Power Variance as a useful metric that is high for circuit and road races, but lower for TTs and long hilly races. The key observation for me is that relatively little of my training has a high APV.

The next part in this series zooms in on the races, to identify metrics associated with good and bad results.