Many commentators have been licking their lips at the prospect of head-to-head combat between Chris Froome and Tom Dumoulin at next year’s Tour de France. It is hard to make a comparison based on their results in 2017, because they managed to avoid racing each other over the entire season of UCI World Tour races, meeting only in the World Championship Individual Time Trial, where the Dutchman was victorious. But it is intriguing to ask how Dumoulin might have done in the Tour de France and the Vuelta or, indeed, how Froome might have fared in the Giro.
Inspiration for addressing these hypothetical questions comes from an unexpected source. In 2009 Netflix awarded a $1million prize to a team that improved the company’s technique for making film recommendations to its users, based on the star ratings assigned by viewers. The successful algorithm exploited the fact that viewers may enjoy the films that are highly rated by other users who have generally agreed on the ratings of the films they have seen in common. Initial approaches sought to classify films into genres or those starring particular actors, in the hope of grouping together viewers into similar categories. However, it turned out to be very difficult to identify which features of a film are important. An alternative is simply to let the computer crunch the data and identify the key features for itself. A method called Collaborative Filtering became one of the most popular employed for recommender systems.
Our cycling problem shares certain characteristics with the Netflix challenge: instead of users, films and ratings, we have riders, races and results. Riders enter a selection of races over the season, preferring those where they hope to do well. Similar riders, for example sprinters, tend to finish high in the results of races where other sprinters also do well. Collaborative filtering should be able to exploit the fact that climbers, sprinters or TTers tend to finish close to each other, across a range of races.
This year’s UCI World Tour concluded with the Tour of Guangxi, completing the data set of results for 2017. After excluding team time trials, 883 riders entered 174 races, resulting in 26,966 finishers. Most races have up to 200 participants , so if you imagine a huge table with all the racers down the rows and all the races across the columns, the resulting matrix is “sparse” in the sense that there are lots of missing values for the riders who were not in a particular race. Collaborative Filtering aims to fill in the spaces, i.e. to estimate the position of a rider who did not enter a specific race. This is exactly what we would like to do for the Grand Tours.
It took a couple of minutes to fit a matrix factorisation Collaborative Filtering model, using keras, on my MacBook Pro. Some experimenting suggested that I needed about 50 hidden factors plus a bias to come up with a reasonable fit for this data set. Taking at random the Milan San Remo one day stage race, it did a fairly good job of predicting the top ten riders for this long, hilly race with a flat finish.
|Model fit (prediction)||Rider||Actual result|
The following figure visualises the primary factors the model derived for classifying the best riders. Sprinters are in the lower part of chart, with climbers towards the top and allrounders in the middle. Those with a lot of wins are towards the left.
Now we come to the interesting part: how would Tom Dumoulin and Chris Froome have compared in the other’s Grand Tours? Note that this model takes account of the results of all the riders in all the races, so it should be capable of detecting the benefit of being part of a strong team.
Tour de France
The model suggested that Tom Dumoulin would have beaten Chris Froome in stages 1(TT), 2, 5, 6, 10 and 21, but the yellow jersey winner would have been stronger in the mountains and won overall.
The model suggested that Chris Froome would have been ahead in the majority of stages, leaving stages 4, 5, 6, 9, 10(TT), 14 and 21(TT) to Dumoulin. The Brit would have most likely claimed the pink jersey.
Vuelta a España
The model suggested that Tom Dumoulin would have beaten Chris Froome in stages 2, 4, 12, 18, 19 and 21. In spite of a surge by the Dutchman towards the end of the race, the red jersey would have remained with Froome.
Based on a Collaborative Filtering approach, the results of 2017 suggest that Chris Froome would have beaten Tom Dumoulin in any of the Grand Tours.