Hello and welcome to the OVO Tech Blog advent calendar! If all goes according to plan we'll be blogging every day until Christmas, with each entry written by a different member of the OVO team. Stay tuned for exciting posts about technology, UI/UX, data science, management and more.
Without further ado, let's get started.
Background
I'm Chris, a principal engineer working on the Identity team at OVO Energy.
My wife enjoys searching for Wally and his friends in the Where's Wally books. I don't, mostly because she is much better at it than I am.
I am lazy and weak-willed, so I decided to cheat. I felt like this was the sort of thing a computer should be able to do for me.
Preliminary research
My first problem was that I wasn't sure of the appropriate machine-learning jargon for this particular problem, namely finding a known entity (Wally) in a complex image (a crowd of non-Wallies).
A cry for help on Twitter was answered within minutes by my friend Hamish:
You probably want "object localisation and detection"
— Hamish Dickson (@_mishy) November 20, 2017
Now that I had a phrase to google, I was ready to start my research.
After an hour or so, I'd ascertained that a Convolutional Neural Network (CNN or ConvNet) was probably a good tool for the job. I'd also compiled a list of dozens of books and papers I needed to read, and was starting to lose hope of ever getting anything implemented.
Convolutional Neural Networks
Neural networks are made up of layers of neurons. There is an input layer, an output layer and one or more hidden layers in between. In most neural networks, the layers are fully interconnected: a given neuron in layer N will receive weighted inputs from every single neuron in layer N-1.
This works well for many machine learning problems, but doesn't scale well for image processing. For a 600x400 pixel image with three colour channels, every neuron in the first hidden layer would have 600 * 400 * 3 = 720000 inputs!
A CNN is designed specifically for image processing. Its neurons are arranged in a pattern that matches the way images are encoded: as a grid of x pixels * y pixels * three colour channels per pixel. So the input layer is a set of neurons arranged in 3D, with each neuron representing one colour channel of a single pixel. Using the same example of a 600x400 image with three colour channels, the input layer would have its neurons arranged in a 600x400x3 pattern.
By arranging the neurons in 3D in this way, a CNN can avoid the problem of excessive interconnectedness by only connecting a given neuron to a small number of 'nearby' neurons in the previous layer. For example, in my CNN, each neuron in the first hidden layer was connected only to the 3x3 patch of pixels surrounding its position.
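The arithmetic behind those two wiring schemes is easy to check directly. This snippet just restates the numbers from the text:

```python
# Back-of-the-envelope comparison of the two wiring schemes described above.

height, width, channels = 400, 600, 3

# Fully connected: every first-layer neuron receives one weighted input
# per colour channel of every pixel in the image.
inputs_per_dense_neuron = height * width * channels
print(inputs_per_dense_neuron)  # 720000

# Convolutional: each neuron only sees a 3x3 patch of pixels,
# across all three colour channels.
inputs_per_conv_neuron = 3 * 3 * channels
print(inputs_per_conv_neuron)  # 27
```

A reduction from 720,000 inputs per neuron to 27 is what makes training on images tractable.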
If you'd like to learn more about CNNs, there's a huge amount of material available online. The lecture notes for the Stanford "Convolutional Neural Networks for Visual Recognition" course are a good place to start.
Mature poets steal
After a little more research I stumbled across this blog post by Henrik Tünnermann about using CNNs to detect and locate cars in images.
This is exactly what I wanted to do, except for the fact that he was looking for cars and I was looking for Wally. His blog post even had an accompanying GitHub repo containing simple, well-documented code to train a neural network and then use it to locate cars.
Maybe, I thought, I could use the code he wrote, but train the neural network using pictures of Wally instead of pictures of cars?
Immature poets imitate; mature poets steal.
T. S. Eliot
Channelling the spirit of T. S. Eliot, I lifted Henrik's code pretty much verbatim and tweaked it slightly to look for Wally. My version of the code is available on GitHub here.
Training data
I knew from the start that my biggest problem would be lack of training data.
Ordinarily neural networks are trained using thousands or even millions of data points; as a general rule of thumb, the more training data you have, the more accurate your neural network will be.
But there are only a few dozen Where's Wally posters in the world, so I needed to get creative in order to produce as much training data as I could.
Detecting a Wally in any given 64x64 pixel chunk of an image is a binary classification problem (it's either Wally or it's not), so to train the CNN I needed to provide lots of 64x64 pixel images, each one labelled as either a Wally or a non-Wally.
non-Wallies
Producing the non-Wallies was easy:
- Take a few Where's Wally posters, easily discovered via Google
- Ask wife to find Wally
- Take any large chunk of the picture that doesn't contain Wally
- Split it into 64x64 pixel pieces
This gave me thousands of non-Wallies, which look like this:
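The tiling step can be sketched with numpy array slicing. (`tile_image` is a name I've made up for illustration, not code from the repo, which works on actual image files.)

```python
import numpy as np

def tile_image(img, size=64):
    """Split an H x W x 3 image array into non-overlapping size x size tiles,
    discarding any partial tiles at the right and bottom edges."""
    h, w = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

# A 512x640 Wally-free region yields (512 // 64) * (640 // 64) = 80 tiles.
region = np.zeros((512, 640, 3), dtype=np.uint8)
tiles = tile_image(region)
print(len(tiles))  # 80
```

Each tile then gets the label "non-Wally" for training.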
Wallies
Coming up with a useful number of pictures of Wally was more troublesome. I decided to try a very simple trick, which turned out to work surprisingly well:
- Using the same Where's Wally posters as before, extract 94x94 pixel images containing just Wally and his surroundings
- Pan a 64x64 pixel window across the 94x94 pixel image, moving one pixel at a time, and extract a sub-image at each position.
This gave me 900 very similar images of Wally for each poster, which look like this:
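The panning trick amounts to a one-pixel-stride sliding window. Here's a minimal sketch (`sliding_crops` is a hypothetical helper, not the repo's code), using 30 window positions per axis to match the 900 crops per poster mentioned above:

```python
import numpy as np

def sliding_crops(img, size=64):
    """Extract a size x size crop at each of the first (h - size) x (w - size)
    positions of a window panned one pixel at a time across the image."""
    h, w = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(h - size)
            for x in range(w - size)]

# Panning over a 94x94 patch gives 30 * 30 = 900 near-duplicate 64x64 crops.
patch = np.zeros((94, 94, 3), dtype=np.uint8)
crops = sliding_crops(patch)
print(len(crops))  # 900
```

Every crop still contains Wally, just shifted by a few pixels, which is what makes this a cheap form of data augmentation.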
I did this for three posters, giving me a total of 2700 Wallies and 1468 non-Wallies.
Results
Here's what the CNN came up with when I tried it on a Where's Wally poster. Note that this is not one of the images I used for training - the CNN had never seen this image before.
First, here's a heatmap showing the probability of Wally being in each part of the image. Dark blue is low probability, yellow-green is high:
Now let's look at the places that the CNN thinks have at least a 99.9% chance of being Wally:
Finally we can superimpose those on the original image:
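The thresholding step works like this in principle (illustrative numpy only; `probs` here is a stand-in for the real grid of per-window probabilities, not actual model output):

```python
import numpy as np

# Stand-in for the model's per-window Wally probabilities (not real output).
probs = np.zeros((50, 80))
probs[12, 34] = 0.9995  # a strong candidate
probs[40, 10] = 0.5     # stripy but unconvincing; below the threshold

# Keep only the window positions the CNN is at least 99.9% sure about.
candidates = np.argwhere(probs > 0.999)
print(candidates)  # [[12 34]]
```

The surviving coordinates are the ones drawn onto the original poster.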
Of the seven Wallies that the CNN found, six are false positives and one is the correct answer. I'll take that!
I tried it on a couple of other images and the results were less impressive. In one image the CNN found about 20 potential Wallies (it seemed to be tricked by red and white stripy things, just like humans are), and in the other it couldn't find him at all.
But the fact that this worked at all on even a single image is miraculous enough to satisfy me.
Code
The code is available here in the form of Jupyter notebooks: https://github.com/cb372/theres-wally
The trained neural network is also in the GitHub repo so you can try it out on your own Where's Wally images if you like.
Conclusion
Even a machine-learning newbie like me can train a neural network to do something (sort of) useful. You should try it too.