What are you looking at?

Screen Shot 2018-05-06 at 18.48.40.png

In a recent blog, I described an experiment to train a deep neural network to distinguish between photographs of Vincenzo Nibali and Alejandro Valverde, using a very small data set of images. In the conclusion, I suggested that the network was probably basing its decisions more on the colours of the riders’ kit than on facial recognition. This article investigates what the network was actually “looking at”, in order to understand better how it was making decisions.

The issues of accountability and bias were among the topics discussed at the last NIPS conference. As machine learning algorithms are adopted across industry, it is important for companies to be able to explain how conclusions are reached. In many instances, it is not acceptable simply to rely on an impenetrable black box. AI researchers and developers need to be able to explain what is going on inside their models, in order to justify decisions taken. In doing so, some worrying instances of bias have been revealed in the selection of data used to train the algorithms.

I went back to my rider recognition model and used an approach called “Class Activation Maps” to identify which parts of the images accounted for the network’s choice of rider. Making use of the code provided in lesson 7 of the course offered by fast.ai, I took advantage of my existing small set of training, validation and test images of the two famous cyclists. Starting with a pre-trained version of ResNet34, the idea was to replace the last two layers with four new ones, the crucial one being a convolutional layer with two outputs, matching the number of cyclists in the classification task. The two outputs of this layer were 7×7 matrix representations of the relevant image.

The final predictions of the model came from a softmax of a flattened average pooling of these 7×7 representations. The softmax output gave the probabilities of Nibali and Valverde respectively. Since there was no learning beyond the final convolution, the activations of the two 7×7 matrices represented the “Nibali-ness” and “Valverde-ness” of the image. This could be displayed as a heat map on top of the image.
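The pooling and softmax steps described above can be sketched in plain Python. This is only a minimal illustration, not the fast.ai lesson code: the function names and the toy activation values are made up, and a real model would work on tensors rather than nested lists.

```python
import math

def average_pool(feature_map):
    """Mean of all cells in one 7x7 activation map (global average pooling)."""
    cells = [value for row in feature_map for value in row]
    return sum(cells) / len(cells)

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_heat_map(feature_maps, class_names):
    """Return the predicted class, the probabilities and that class's map."""
    probs = softmax([average_pool(fm) for fm in feature_maps])
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs, feature_maps[best]

# Toy activations (invented): the "Nibali" map fires in the top-left cell only.
nibali_map = [[0.0] * 7 for _ in range(7)]
nibali_map[0][0] = 49.0
valverde_map = [[0.0] * 7 for _ in range(7)]

label, probs, heat_map = predict_with_heat_map(
    [nibali_map, valverde_map], ["Nibali", "Valverde"]
)
```

Because nothing is learned after the final convolution, the map returned for the winning class can be stretched to the size of the photograph and drawn over it as the heat map.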

Examples are shown below for the validation set of 10 images of Nibali followed by 10 of Valverde. The yellow patch of the heat map highlights the part of the image that led to the prediction displayed above each image. Nine out of ten were correct for Nibali and six for Valverde.

Screen Shot 2018-05-06 at 18.10.00.png
Class Activation Maps applied to the validation set

The heat maps were very helpful in understanding the model’s decision making process. It seemed that for Nibali, his face and helmet were important, with some attention paid to the upper part of his blue Astana kit. In contrast, the network did a very good job of identifying the M on Valverde’s Movistar kit. It was interesting to note that the network succeeded in spotting that Nibali was wearing a Specialized helmet whereas Valverde had a Catlike design. Three errors arose in the photos of Valverde’s face, which was mistaken for Nibali’s. In fact, any picture of a face led to a prediction of Nibali, as demonstrated by the cropped image below that was used for training.

Screen Shot 2018-05-06 at 18.21.58

Why should that be? Looking back at the training set, it turned out that, by chance, there were far more mugshots of Nibali, while there were more photos of Valverde riding his bike, with his face obscured by sunglasses. This was an example of unintentional bias in the training data, providing a very useful lesson.

The final set of pictures shows the predictions made on the out-of-sample test set. All the predictions are correct, except the first one, where the model failed to spot the green M on Valverde’s chest and mistook the blurred background for Nibali. Otherwise the results confirmed that the network looked at Nibali’s face, the rider’s helmet or Valverde’s kit. It also remembered seeing an image of Nibali holding the Giro trophy in the training set.

Screen Shot 2018-05-06 at 18.34.38.png
Class Activation Maps applied to the test set

In conclusion, Class Activation Maps provide a useful way of visualising the activations of hidden layers in a deep neural network. This can go some way to accounting for the decisions that appear in the output. The approach can also help identify unintentional bias in the training set.

Which team is that?

Screen Shot 2018-04-11 at 11.18.09

My last blog explored the effectiveness of deep learning in spotting the difference between Vincenzo Nibali and Alejandro Valverde. Since the faces of the riders were obscured in many of the photos, it is likely that the neural network was basing its evaluations largely on the colours of their team kit. A natural next challenge is to identify a rider’s team from a photograph. This task parallels the approach to the Kaggle dog breed competition used in lesson 2 of the fast.ai course on deep learning.

Eighteen World Tour teams are competing this year. So the first step was to trawl the Internet for images, ideally of riders in this year’s kit. As before, I used an automated downloader, but this posed a number of problems. For example, searching for “Astana” brings up photographs of the capital of Kazakhstan. So I narrowed things down by searching for “Astana 2018 cycling team”. After eliminating very small images, I ended up with a total of about 9,700 images, but these still included a certain amount of junk that I did not have the time to weed out, such as photos of footballers or motorcycles in the “Sky Racing Team”.

The following small sample of training images is generally OK, though it includes images of Scott bikes rather than Mitchelton-Scott riders and a picture of Sunweb’s Wilco Kelderman labelled as FDJ. However, with around 500-700 images of each team, I pressed on, noting that, for some reason, there were only 166 of Movistar and these included the old style kit.

Screen Shot 2018-04-11 at 10.18.54.png
Small sample of training images

For training on this multi-class classification problem, I adopted a slightly more sophisticated approach than before. Taking a pre-trained ResNet50 model, I performed some initial fine-tuning, on images rescaled to 224×224. I settled on an optimal learning rate of 1e-3 for the final layer, while allowing some training of lower layers at much lower rates. With a view to improving generalisation, I opted to augment the training set with random changes, such as small shifts in four directions, zooming in up to 10%, adjusting lighting and left-right flips. After initial training, accuracy was 52.6% on the validation set. This was encouraging, given that random guesses would have achieved a rate of 1 in 18 or 5.6%.
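To give a feel for the random changes listed above, here is a rough sketch of three of them (flip, shift and lighting) on a tiny image stored as nested lists of brightness values between 0 and 1. The function names, the one-pixel shift range and the 0.9 to 1.1 lighting range are illustrative assumptions, not the settings of the fast.ai library, and zooming is left out for brevity.

```python
import random

def flip_left_right(image):
    """Mirror every row of the image."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0.0):
    """Shift content right by `pixels` (negative shifts left), padding with
    `fill`. Assumes abs(pixels) is smaller than the image width."""
    width = len(image[0])
    if pixels >= 0:
        return [[fill] * pixels + row[: width - pixels] for row in image]
    return [row[-pixels:] + [fill] * (-pixels) for row in image]

def adjust_lighting(image, factor):
    """Scale brightness by `factor`, clipping values to the range 0..1."""
    return [[min(1.0, max(0.0, v * factor)) for v in row] for row in image]

def augment(image, rng):
    """Apply a random flip, a small horizontal shift and a lighting change."""
    out = flip_left_right(image) if rng.random() < 0.5 else image
    out = shift_right(out, rng.randint(-1, 1))
    return adjust_lighting(out, rng.uniform(0.9, 1.1))

rng = random.Random(0)
image = [[0.1, 0.5, 0.9], [0.2, 0.6, 1.0]]
augmented = augment(image, rng)
```

Each pass through the training set then sees a slightly different version of every photo, which makes it harder for the network simply to memorise the originals.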

Taking a pro tip from fast.ai, training proceeded with the images at a higher resolution of 299×299. The idea is to prevent overfitting during the early stages, but to improve the model later on by providing more data for each image. This raised the accuracy to 58.3% on the validation set. This figure was obtained using a trick called “test time augmentation”, where each final prediction is based on the average prediction of five different “augmented” versions of the image in question.
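The averaging behind test time augmentation is simple enough to sketch. In the snippet below, `toy_predict` and `random_flip` are hypothetical stand-ins for a trained network and a real augmentation pipeline. Only the averaging over five versions reflects the description above.

```python
import random

def tta_predict(predict, image, augment, n_versions=5, seed=0):
    """Average the class probabilities over several augmented copies."""
    rng = random.Random(seed)
    all_probs = [predict(augment(image, rng)) for _ in range(n_versions)]
    n_classes = len(all_probs[0])
    return [sum(p[c] for p in all_probs) / n_versions for c in range(n_classes)]

def random_flip(image, rng):
    """Stand-in augmentation: mirror the image half of the time."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def toy_predict(image):
    """Stand-in two-class model: confident only if the top-left pixel is lit."""
    p = 0.9 if image[0][0] > 0 else 0.5
    return [p, 1.0 - p]

image = [[1, 0], [0, 0]]
averaged = tta_predict(toy_predict, image, random_flip)
```

Averaging smooths out predictions that depend too heavily on the exact framing of a single image.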

Given the noisy nature of some of the images used for training, I was pleased with this result, but the acid test was to evaluate performance on unseen images. So I created a test set of two images of a lead rider from each squad and asked the model to identify the team. These are the results.

75 percent right.png
75% accuracy on the test set

The trained ResNet50 correctly identified the team in 27 of the 36 images. Interestingly, there were no predictions of Movistar or Sky. This could be partly due to the underrepresentation of Movistar in the training set. Froome was mistaken for AG2R and Astana, in column 7, rows 2 and 3. In the first image, his 2018 Sky kit was quite similar to Bardet’s to the left and in the second image the sky did appear to be Astana blue! It is not entirely obvious why Nibali was mistaken for Sunweb and Astana, in the top and bottom rows. However, the large majority of predictions were correct. An overall success rate of 75% based on an afternoon’s work was pretty amazing.

The results could certainly be improved by cleaning up the training data, but this raises an intriguing question about the efficacy of artificial intelligence. Taking a step back, I used Bing’s algorithms to find images of cycling teams in order to train an algorithm to identify cycling teams. In effect, I was training my network to reverse-engineer Bing’s search algorithm, rather than my actual objective of identifying cycling teams. If an Internet search for FDJ pulls up an image of Wilco Kelderman, my network would be inclined to suggest that he rides for the French team.

In conclusion, for this particular approach to reach or exceed human performance, expert human input is required to provide a reliable training set. This is why this experiment achieved 75%, whereas the top submissions on the dog breeds leaderboard show near perfect performance.

Valverde or Nibali?

Alejandro Valverde has kicked off the 2018 season with an impressive series of wins. Meanwhile Vincenzo Nibali delighted the tifosi with his victory in Milan-San Remo. It is pretty easy to tell these two riders apart in the pictures above, but could a computer distinguish between them?

Following up on my earlier blogs about neural networks, I have been taking a look at the updated version of fast.ai’s course on deep learning. With the field advancing at a rapid pace, this provides a good way of staying up to date with the state of the art. For example, there are now a couple of cheaper alternatives to AWS for accessing high powered GPUs, offered by Paperspace and Crestle. The latest fast.ai libraries include many new tools that work extremely well in practice.

There’s a view that deep learning requires hours of training on high-powered supercomputers, using thousands (or millions) of labelled examples, in order to learn to perform computer vision tasks. However, newer architectures, such as ResNet, are able to run on much smaller data sets. In order to test this, I used an image downloader to grab photos of Nibali and Valverde and manually selected about 55 decent pictures of each one.

I divided the images into a training set with about 40 images of each rider, a validation set with 10 of each and a test set containing the rest. Nibali appears in a range of different coloured jerseys, though the Astana blue is often present. Valverde is mainly wearing the old dark blue Movistar kit with a green M. There were more close-up shots of Nibali’s face than of Valverde’s.
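A split like this takes only a few lines. The sketch below assumes exactly 55 images per rider and uses invented file names. Shuffling with a fixed seed is my choice for reproducibility, not something the course prescribes.

```python
import random

def split_images(paths, n_train=40, n_valid=10, seed=42):
    """Shuffle the paths, then cut them into training, validation and test."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left over
    return train, valid, test

# Hypothetical file names for one rider.
nibali = [f"nibali_{i:02d}.jpg" for i in range(55)]
train, valid, test = split_images(nibali)
```

Running the same function on each rider’s folder keeps the two classes balanced across the three sets.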

Screen Shot 2018-04-03 at 18.30.08.png

I was able to fine-tune a pre-trained ResNet neural network to this task, using some of the techniques from the fast.ai tool box, each designed to improve generalisation. The first trick was to augment the training set by performing minor transformations of the images at random, such as taking a mirror image, shifting left or right and zooming in a bit. The second set of tricks varied the rate of learning as the algorithm iterated repeatedly through the training set. A final useful technique created a set of variants of each test image and took the average of the predictions. Everything ran at lightning speed on a Paperspace GPU. After a run time of just a few minutes, the ResNet was able to score 17 out of 20 on the following validation set.

Screen Shot 2018-04-03 at 18.49.27.png
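As an aside on the learning rate tricks mentioned above, one widely used schedule is cosine annealing with restarts, in which the rate decays smoothly within a cycle and then jumps back up. I have not recorded the exact schedule used here, so treat this as an illustrative assumption. The function name and the maximum rate of 1e-2 are made up for the example.

```python
import math

def cosine_annealing(step, steps_per_cycle, max_lr=1e-2, min_lr=0.0):
    """Learning rate at `step`, restarting every `steps_per_cycle` steps."""
    position = (step % steps_per_cycle) / steps_per_cycle  # 0 <= position < 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * position))

# Two cycles of four steps each: the rate falls, then restarts at max_lr.
schedule = [cosine_annealing(step, steps_per_cycle=4) for step in range(8)]
```

The idea behind the restarts is to shake the weights out of narrow minima that might not generalise well.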

The confusion matrix shows that the model correctly identified all the Nibali images, but it was wrong on three pictures of Valverde. The first incorrect image (below) shows Valverde in the red leader’s jersey of the Tour of Murcia, which is not dissimilar to Nibali’s new Bahrain Merida kit, though he was wearing red in two of his training images. In the second instance, the network was fooled by the change in colour of Movistar’s kit, which had become rather similar to Astana’s light blue. The figure of 0.41 above the close-up image indicates that the model assigned only a 41% probability that the image was Valverde. It probably fell below the critical 50% level, in spite of the blue/green colours, because there were far more close-up shots of Nibali than Valverde in the training set.
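For readers unfamiliar with the idea, a confusion matrix simply counts how often each actual class received each predicted label. The sketch below rebuilds the counts described above (ten Nibali images all correct, three of ten Valverde images labelled as Nibali). The helper function is my own illustration, not the one used in the course notebook.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["Nibali", "Valverde"]
actual = ["Nibali"] * 10 + ["Valverde"] * 10
predicted = ["Nibali"] * 10 + ["Nibali"] * 3 + ["Valverde"] * 7

matrix = confusion_matrix(actual, predicted, classes)   # [[10, 0], [3, 7]]
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(actual)                        # 17 / 20
```

The diagonal holds the correct predictions, so the off-diagonal 3 is the trio of Valverde photos mistaken for Nibali.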

An overall score of 17 out of 20 on the validation set is impressive. However, the network had access to the validation set during training, so this result is “in sample”. A proper “out of sample” evaluation of the model’s ability made use of the following ten images, comprising the test set that was kept aside.

Screen Shot 2018-04-03 at 21.21.59

Amazingly, the model correctly identified 9 out of the 10 pictures it had not seen before. The only error was the Valverde selfie shown in the final image. In order to work better in practice, the training set would need to include more examples of the riders’ 2018 kit. A variant of the problem would be to identify the team rather than the rider. The same network can be trained for multiple classes rather than just two.

This experiment shows that it is pretty straightforward to run state of the art image recognition tools remotely on a GPU somewhere in the cloud and come up with pretty impressive results, even with a small data set.

The next blog describes how to identify a rider’s team.