Teaching computers to understand what they see is the problem that keeps computer vision engineers awake at night. Although a great deal of progress has been made in image recognition over the past few years, many puzzle pieces are still missing before we have a complete picture of how to teach machines to make sense of what they see. For a long time, image classification was not treated as a statistical problem, until a partial solution came from the field of Machine Learning under the name of Neural Networks, in particular Convolutional Neural Networks (CNNs). A CNN is a special type of Artificial Neural Network that delivers human-like results on image classification tasks.
This article explains how the human brain reconstructs the visual world, how machines learn to interpret images, and what the applications of Image Classification are.
Image Recognition is the process of identifying what an image depicts. For humans, interpreting the visual world comes easily: when we see something, we have an inherent understanding of what it is, and in most cases no conscious study of the object is needed to make sense of it. For computers, however, it is an extremely difficult task, because they can only manipulate numbers. For example, a 3x3-pixel square on Albert Einstein’s forehead is, to a computer, a 3x3x3 matrix of numbers, where the third dimension holds the intensities of the three primary colors: red, green and blue.
Illustration 1. The matrix represents how a computer sees a square area on Albert Einstein’s forehead
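To make this concrete, here is a minimal sketch of how such a patch looks as data. The pixel values below are made up for illustration, not taken from the actual photograph:

```python
import numpy as np

# A 3x3-pixel patch of an RGB image: for each pixel the computer stores
# three intensities (red, green, blue), so the patch is a 3x3x3 array
# of numbers -- nothing more. The values here are illustrative.
patch = np.array([
    [[158, 112,  49], [159, 101,  50], [121,  77,  15]],
    [[206, 151, 118], [218, 158, 138], [125,  65,  22]],
    [[226, 199, 168], [151,  77,  38], [ 97,  36,  32]],
], dtype=np.uint8)

print(patch.shape)  # (3, 3, 3): height, width, color channels
print(patch[0, 0])  # the RGB triple of the top-left pixel
```

Everything a classifier does downstream is arithmetic on arrays like this one.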
Even though humans interpret images in a fraction of a second, a complex cognitive process takes place in the visual cortex of the brain. The visual cortex is divided into layers (V1-V8) that process the visual information coming from the eyes. When a stimulus appears in the receptive field, its representation first reaches the V1 layer; in other words, the neurons in V1 fire first. This layer acts as a map that preserves the spatial layout of the stimulus and also detects its edges. V1 is strongly connected to V2, which in turn is involved in discriminating shapes, orientations, colors and other low-level features. Higher-level visual features, which involve the brain’s understanding of context and the relationships between objects, are only perceived in the higher layers, such as V6-V8.
Illustration 2. Human visual cortex and its different layers
Let’s say the perceived stimulus is your dad. The detection of the object itself happens in layer V1; the semantic information, however, is only perceived in layers V6-V8.
It is important to stress that exactly what each layer is responsible for remains a matter of debate, as research keeps producing new discoveries. It is a fact, however, that the higher the layer, the more abstract the representation becomes.
Apart from this high-level architecture, the mechanics of the individual neuron have been used to simulate the processes in the visual cortex layers on the micro level. Each neuron receives input through its dendrites, applies a non-linearity to the summed input, and fires if the result exceeds some threshold. Although this explanation is very simplified, it was enough for researchers to invent the first Artificial Neural Networks.
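This simplified firing rule can be written in a few lines. The sketch below uses a hard threshold as the non-linearity; real networks use smooth functions such as a sigmoid or ReLU, and the weights and bias here are arbitrary values chosen for illustration:

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """A crude artificial neuron: sum the weighted inputs, add a bias,
    and 'fire' (output 1) only if the result exceeds the threshold."""
    activation = np.dot(inputs, weights) + bias
    return 1 if activation > threshold else 0

# With these (arbitrary) weights the neuron fires only when both
# inputs are strongly active -- an AND-like behaviour.
w = np.array([0.6, 0.6])
print(neuron(np.array([1.0, 1.0]), w, bias=-1.0))  # 1: fires
print(neuron(np.array([1.0, 0.0]), w, bias=-1.0))  # 0: stays silent
```

Stacking many such units, and learning the weights from data instead of picking them by hand, is what turns this toy into a neural network.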
Inspired by the human visual system, engineers tried to replicate this process with machines. To enable computers to understand objects, a system was needed that could extract high-level features from visual “stimuli” using only numerical manipulations. That is where Convolutional Neural Networks come into play. When fed with enough clean, well-defined data, a CNN can extract the high-level features common to each category in the data.
The representations a CNN learns are similar to how the human visual layers represent visual information: the first convolutional layers extract low-level features, such as edges and blobs, while the later layers capture the semantics of the image. For example, the picture of a cat below is shown as represented by the second and the tenth hidden convolutional layers. As we can see, the second layer still preserves the shape of the cat, but in the tenth layer the features are far more abstract.
Illustration 3. Representations from the 2nd hidden layer vs the 10th hidden layer
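The edge detection that early convolutional layers perform can be sketched directly. Below is a plain numpy implementation of the sliding-window operation (strictly speaking cross-correlation, as in most deep-learning libraries) applied to a tiny synthetic image with a vertical dark-to-bright boundary; a trained CNN learns kernels like this one rather than having them hand-coded:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution: slide the kernel over the image and
    take the weighted sum of the pixels under it at every position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny grayscale "image": dark left half, bright right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# A Sobel kernel that responds to vertical edges -- the kind of
# low-level feature the first convolutional layers detect.
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]])

response = convolve2d(img, vertical_edge)
print(response)  # non-zero only around the dark/bright boundary
```

A CNN stacks many such filtered maps, feeding each layer's output into the next, which is how the representations grow progressively more abstract.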
These abstract features are then assigned to the class “cat”; this is what the training process consists of. Its output is called a model: a set of abstract features, each assigned to a particular class.
When the model is then applied to a completely new image of a cat, it will correctly classify the image as a cat if it finds features the image has in common with the model.
If we provide the CNN with images of both cats and dogs, the resulting model will be able to differentiate between cats and dogs with a certain level of accuracy.
Illustration 4. The image of a cat on the right was not used during the training process, but because the model learned the common features of cats, it is able to predict with high accuracy that this image depicts a cat and not a dog
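The idea of a model as “common features per class” can be illustrated with a deliberately simplified toy: instead of gradient-based CNN training, the sketch below stores the average feature vector of each class and classifies a new example by whichever average is closest. The feature vectors are synthetic stand-ins for what a real network would extract from images:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are high-level feature vectors extracted from training
# images (synthetic here; a real CNN would produce them in its last
# hidden layer). Cats and dogs cluster around different points.
cat_features = rng.normal(loc=[1.0, 0.0, 0.5], scale=0.1, size=(20, 3))
dog_features = rng.normal(loc=[0.0, 1.0, 0.5], scale=0.1, size=(20, 3))

# "Training": the model stores the common (average) features per class.
model = {
    "cat": cat_features.mean(axis=0),
    "dog": dog_features.mean(axis=0),
}

def classify(features, model):
    """Assign the class whose stored common features are closest."""
    return min(model, key=lambda cls: np.linalg.norm(features - model[cls]))

# A new, unseen image whose features resemble the cat cluster.
new_image_features = np.array([0.9, 0.1, 0.45])
print(classify(new_image_features, model))  # -> cat
```

A real CNN learns both the features and the decision boundary jointly, but the core idea is the same: unseen images are matched against what the model learned to be common for each class.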
All in all, for a computer, image classification translates into the problem of identifying common features by “looking” at the numbers and performing mathematical manipulations to find a function (i.e., a model) that generalizes to unseen data.
The state-of-the-art performance of Convolutional Neural Networks on image classification tasks can match human performance, but only if several conditions are met: plenty of data is provided (gigabytes of it), enough training time is allotted, and an appropriate network architecture is in place.
Doculayer is about smart content management, and Image Recognition is part of the broad chain of Machine Learning solutions we offer. Although a number of open APIs are available for gaining insights from images, Doculayer develops its own, unique classifier to protect clients’ sensitive data. Using open image classification services such as Google’s implies sharing clients’ data with third parties. That is not an issue when you need to classify images of cats and dogs, but it becomes a compliance problem when IDs and credit cards have to be classified.
Maintaining a high level of security while keeping the ML classifier accurate comes with some challenges. Since Doculayer does not own large image libraries, it relies heavily on clients’ data or open-source datasets, which are usually not ready for direct use: cleaning and manually labeling them takes a lot of time. Another challenge is finding the right architecture. In most cases building an in-house architecture is more efficient; when there is not enough data, however, using a pre-trained architecture is the better option. By investing in dedicated hardware we were able to overcome these challenges and significantly reduce the time required to train a model.