The field of Computer Vision has gained significant momentum since the advent of Deep Neural Networks, and in particular Convolutional Neural Networks (CNNs). Although the ideas behind CNNs date back decades (the biological inspiration comes from Hubel and Wiesel's 1968 studies of the visual cortex), their full potential remained hidden until recently, when computationally powerful hardware made it possible to experiment with CNNs and tap into their real value.
In 2012, Alex Krizhevsky designed a CNN called AlexNet, which was trained on the large-scale ImageNet dataset using GPUs. The results were so promising that Deep Neural Networks have dominated Computer Vision research ever since. In fact, many new CNN architectures are introduced every year, and Deep Learning has become a buzzword.
Given that designing a well-performing CNN architecture is not a trivial problem but requires solid scientific knowledge, the progress witnessed over the last few years demonstrates the importance of this technology.
Illustration 1. Overview of CNN architectures performance in classification task 
In particular, computer vision problems such as image tagging, object detection, and image generation have improved tremendously thanks to Convolutional Neural Networks. First, this new approach eliminated the need to hand-engineer the features that were previously used to solve these problems. Second, the results produced by Deep Neural Networks outperformed the older techniques.
So, let’s take a look at the most common technologies that are powered by CNNs.
Image tagging is a CNN-based technology that enables a computer to assign a category (tag) to an image.
Image tagging can be used to bring structure to unstructured image datasets.
Illustration 2. The architecture of Convolutional Neural Networks 
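The core building block of such an architecture is the convolution itself: a small filter slides over the image and produces a feature map. A minimal sketch in plain Python (the tiny image and the vertical-edge filter below are made up for illustration; real networks learn the filter weights, and most frameworks implement this operation as cross-correlation, as done here):

```python
def convolve2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most deep learning
    frameworks): slide the kernel over the image and sum the element-wise
    products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            output[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return output

# A tiny image with a vertical edge, and a filter that responds to it
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
edge_filter = [
    [1, -1],
    [1, -1],
]
feature_map = convolve2d(image, edge_filter)  # strongest response at the edge
```

Stacking many such learned filters, interleaved with pooling and followed by fully connected layers, yields the classification architectures shown above.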
Companies seeking to organize their massive datasets into categories that are meaningful to them can take advantage of this technology. Its applications are extensive, from identifying defects on a product line to diagnosing diseases from MRI scans. Another example is applying image tagging to improve product discovery. Content management platforms, like Doculayer.ai, leverage machine vision to streamline the labeling of large visual datasets for retail companies.
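The final step of a tagging network is typically a softmax layer that turns raw scores into tag probabilities. A minimal sketch (the tag names and scores below are hypothetical stand-ins for a trained classifier's output):

```python
import math

def softmax(logits):
    """Convert raw output scores from the final layer into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three tags from a product-line inspector
tags = ["defect", "no_defect", "unclear"]
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted = tags[probs.index(max(probs))]  # the assigned tag
```

The image receives whichever tag gets the highest probability; thresholding that probability is a common way to route uncertain images to human review.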
Reverse Image Search is a method that uses CNNs to extract image representations and compare them with one another to find conceptually similar images.
Reverse Image Search is used to find similar images in an unstructured data space.
Reverse Image Search extracts image representations from the last convolutional layer of a neural network. These representations are then compared to each other using a distance metric.
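The comparison step can be sketched as follows, assuming the embeddings have already been extracted; cosine similarity is one common choice of metric, and the vectors and filenames below are made up:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings taken from the last convolutional layer
query = [0.9, 0.1, 0.4]
database = {
    "cat_photo.jpg": [0.8, 0.2, 0.5],
    "invoice_scan.jpg": [0.1, 0.9, 0.0],
}

# Rank database images by similarity to the query image
ranked = sorted(
    database,
    key=lambda name: cosine_similarity(query, database[name]),
    reverse=True,
)
```

At scale, the pairwise comparison is usually replaced by an approximate nearest-neighbor index, but the principle is the same.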
Illustration 3. The architecture of Reverse Image Search
Reverse Image Search is a fast and simple way to group image datasets into conceptually “correct” categories. It can also be viewed as a way to cluster images.
Image Captioning enables computers to generate image descriptions.
Image Captioning can be used when we are interested in representing the image content in words.
Image Captioning can be framed in the encoder-decoder paradigm. First, image embeddings are extracted using a pre-trained CNN (the encoding step); these embeddings are then fed as input to a Long Short-Term Memory network (LSTM, a type of neural network that can process sequences of data and is therefore well suited to text), which learns to decode the embeddings into text.
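The decoding side can be sketched as a greedy loop: the decoder repeatedly proposes the most likely next word given the image embedding and the words so far, until it emits an end token. In this sketch a trivial stub stands in for the trained CNN encoder and LSTM decoder, since the point is the structure of the loop, not a trained model:

```python
def decode_caption(image_embedding, next_word_fn, max_len=20):
    """Greedy decoding: repeatedly ask the decoder for the most likely
    next word, conditioned on the image embedding and the words
    generated so far, until it emits an end token."""
    caption = ["<start>"]
    while len(caption) < max_len:
        word = next_word_fn(image_embedding, caption)
        if word == "<end>":
            break
        caption.append(word)
    return caption[1:]  # drop the start token

# Stub standing in for a trained LSTM decoder (illustration only)
def toy_next_word(embedding, words_so_far):
    script = ["a", "dog", "on", "grass", "<end>"]
    return script[len(words_so_far) - 1]

caption = decode_caption([0.3, 0.7], toy_next_word)
```

Real systems typically replace greedy decoding with beam search, which keeps several candidate captions alive at each step.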
Illustration 4. The architecture of Image Captioning 
Image Captioning can be used in assistance systems for the blind, image metadata generation systems, and robotics.
Object Detection is a technology that identifies not only what object is depicted in an image or video but also where it is positioned.
Object Detection is used in cases where the position of a particular object or subject is needed, for example in tracking systems.
CNNs are the primary technology here: they extract regions of interest, which are then classified, and bounding boxes are derived around the detected objects.
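A standard ingredient in this pipeline is comparing bounding boxes via intersection over union (IoU), used both to evaluate predictions against ground truth and to suppress duplicate boxes. A minimal sketch, with boxes given in a hypothetical (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format.
    Returns 1.0 for identical boxes and 0.0 for disjoint ones."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box partially overlapping a ground-truth box
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

Detectors typically keep a prediction only when its IoU with the ground truth exceeds a threshold such as 0.5.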
Illustration 5. The architecture of Object Detection 
It is important to note that this approach is only one of many that exist for object detection.
Facial detection is one of the most common use cases of Object Detection technology. It can be utilized as a security measure to let only certain people into an office building, or to recognize and tag your friends on Facebook. Last year Instagram added a new feature based on this technology, designed to make the platform easier to use for visually impaired people. The feature uses object recognition to generate a description of photos, so that anyone scrolling through the app with a screen reader can hear the list of items a photo contains.
Image Segmentation is a technology that segments an image into conceptual parts. Unlike object detection, however, every pixel in the image is assigned a category.
Image Segmentation can be used to locate objects and their boundaries.
Usually, the algorithms employed in such tasks are based on convolution-deconvolution methods. For example, one approach uses CNNs to create feature maps while introducing subsampling layers to keep the whole process computationally feasible. The computational burden stems from the fact that a classification decision is made for every pixel, so reducing the number of neurons improves efficiency. The next step is to apply a transposed convolution, during which the network is trained to reconstruct the previously reduced feature maps.
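The upsampling step can be illustrated with a one-dimensional transposed convolution (the signal and kernel values below are toy numbers; real segmentation networks operate in 2D and learn the kernel weights):

```python
def transposed_conv1d(signal, kernel, stride=2):
    """1D transposed convolution: each input value 'stamps' a scaled copy
    of the kernel into the output, spaced by the stride, which upsamples
    the signal back toward the original resolution."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    return output

# Upsample a subsampled feature map of length 3 back to length 6
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 0.5], stride=2)
```

With stride 2, each input position spreads its kernel-weighted value over two output positions, doubling the resolution; the network learns weights that make the reconstruction as faithful as possible.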
Illustration 6. Architecture of Image Segmentation
This technology is mainly used in medical imaging, GeoSensing, and precision agriculture.
Image Denoising is a technology that uses self-supervised learning methods to generate images free of noise or blurring. It is based on autoencoder algorithms, which learn to encode images into a lower-dimensional feature space and decode them back, generating the data distribution of interest.
Image Denoising can be used with some success to remove noise or blurring from images.
The algorithm first encodes the input data into a lower-dimensional latent representation (compression) and then reconstructs the input from that representation (decoding). In more formal language, the autoencoder learns to approximate the identity function while passing the data through fewer dimensions, which also makes the technique suitable for dimensionality reduction. In the context of image denoising, we can set up a convolutional autoencoder to learn to generate high-quality images by training it on low-quality inputs paired with ground-truth high-quality images. In this way, the decoder learns to represent the input at a higher quality.
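The self-supervised training setup can be sketched by constructing (noisy input, clean target) pairs; the autoencoder itself is omitted here, since the point is that the corruption is applied only to the input while the target stays clean. The tiny "images" are flattened pixel lists made up for illustration:

```python
import random

def make_denoising_pairs(clean_images, noise_std=0.3, seed=0):
    """Pair each clean image with a noisy copy: the noisy version is the
    network input, and the untouched original is the training target."""
    rng = random.Random(seed)
    pairs = []
    for image in clean_images:
        noisy = [pixel + rng.gauss(0.0, noise_std) for pixel in image]
        pairs.append((noisy, image))
    return pairs

# Two tiny "images" flattened to pixel lists
clean = [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
pairs = make_denoising_pairs(clean)
```

Because the targets come from the data itself rather than from human labels, arbitrarily large training sets can be generated this way.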
Illustration 7. The example of how Image Denoising technology works. Top, the noisy digits fed to the network, and bottom, the digits are reconstructed by the network 
Applications, like Let’sEnhance.io, use this technology to improve the quality and resolution of images.
Generative Adversarial Networks (GANs) are a type of unsupervised learning that learns to generate realistic images.
This technique can be used in applications which generate photorealistic images. For example, it can be used in interior or industrial design or computer games scenes.
When generating an image, we want to be able to sample from a complex, high-dimensional space, which is impossible to do directly. Instead, we can approximate this space using CNNs. GANs do this in the manner of a game between two networks: a generator that produces images and a discriminator that tries to tell generated images from real ones.
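In the standard GAN formulation, this game is written as a minimax objective between a generator $G$ and a discriminator $D$:

```latex
\min_G \max_D \; V(D, G) =
    \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator $D$ is trained to assign high probability to real images $x$ and low probability to generated images $G(z)$, while the generator is trained to fool it; at equilibrium, the generated distribution matches the data distribution.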
Illustration 8. Generative Adversarial Networks training process 
With proper training, GANs can produce more precise and sharper 2D textures of higher quality, while the level of detail and the colors remain intact. NVIDIA uses this technology to transform sketches into photorealistic landscapes.