Overview of Style Transfer (Deep Harmonization)
Last Updated: 16 Jul, 2020
Ever since humans began making sense of the world around them, painting has remained a salient way of expressing emotion and understanding. For example, the image of the tiger below takes its content from real-world tigers, but notice that the style of texturing and colouring depends entirely on the creator.
What is Style Transfer in Neural Networks?
Suppose you have a photograph (P), captured on your phone, and you want to stylize it as shown below.

This process of taking the content of one image (P) and the style of another image (A) to generate an image (X) that matches the content of P and the style of A is called Style Transfer or Deep Harmonization. Note that you cannot obtain X by simply overlaying P on A.
Architecture & Algorithm
Gatys et al. showed in 2015 that it is possible to separate the content and style of an image, and hence to combine the content and style of different images. They used a convolutional neural network (CNN) called vgg-19 (VGG stands for Visual Geometry Group), which is 19 layers deep (16 convolutional layers and 3 fully connected (FC) layers).
vgg-19 comes pre-trained on the ImageNet dataset (curated by the Stanford Vision Lab at Stanford University). Gatys et al. used average pooling in place of max pooling and discarded the FC layers.
Pooling is typically used to reduce the spatial size of the feature maps, which in turn reduces the amount of computation. There are two types of pooling, as depicted below:
Pooling Process
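To make the distinction concrete, here is a minimal PyTorch sketch (PyTorch is an assumption here; the article's own code is linked at the end) contrasting the two pooling operations on a toy feature map:

```python
import torch
import torch.nn.functional as F

# A 1x1x4x4 feature map with illustrative values
x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

# Max pooling keeps the largest activation in each 2x2 window
print(F.max_pool2d(x, kernel_size=2))  # [[ 6.,  8.], [14., 16.]]

# Average pooling (the variant used by Gatys et al.) averages each window
print(F.avg_pool2d(x, kernel_size=2))  # [[ 3.5,  5.5], [11.5, 13.5]]
```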
Losses in Style Transfer:
- Content Loss
Let us select a hidden layer (l) of vgg-19 at which to compute the content loss. Let p be the original image and x the generated image, and let P^l and F^l denote the feature representations of the respective images at layer l. The content loss is then defined as:
L_{\text{content}}(p, x, l) = \frac{1}{2} \sum_{i,j} \left(F_{ij}^{l} - P_{ij}^{l}\right)^{2}
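Translated into code, the formula is a one-liner. Below is a minimal sketch; it assumes you have already extracted the feature maps P_l and F_l from a chosen vgg-19 layer (the extraction step itself is omitted):

```python
import torch

def content_loss(P_l, F_l):
    # P_l: feature maps of the original image p at layer l
    # F_l: feature maps of the generated image x at the same layer
    # Returns 0.5 * sum of squared element-wise differences,
    # matching the formula above.
    return 0.5 * torch.sum((F_l - P_l) ** 2)
```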
- Style Loss
For this, we first have to calculate the Gram matrix. The correlation between different filters/channels at layer l is computed as the dot product between the vectorized feature maps i and j; the matrix of all such dot products is called the Gram matrix (G):

G_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l}

The style loss at layer l is the normalized sum of squared differences between the Gram matrix A^l of the style image and the Gram matrix G^l of the generated image, and the total style loss sums these contributions over the chosen layers:

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left(G_{ij}^{l} - A_{ij}^{l}\right)^{2}, \qquad L_{\text{style}} = \sum_{l} w_{l} E_{l}

Here N_l is the number of feature maps at layer l, M_l is their spatial size, and w_l is the weight given to each layer's contribution.
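The Gram matrix and the per-layer style loss are equally short in code. A sketch under the assumption that feature maps arrive with shape (channels, height, width), extracted from one vgg-19 layer:

```python
import torch

def gram_matrix(F_l):
    # F_l: feature maps at layer l, shape (channels, height, width).
    # Vectorize each channel, then take dot products between channels:
    # G[i, j] = sum_k F[i, k] * F[j, k]
    C, H, W = F_l.shape
    F_flat = F_l.reshape(C, H * W)  # each row is one vectorized feature map
    return F_flat @ F_flat.t()      # (C, C) Gram matrix

def style_loss_layer(F_style, F_gen):
    # Per-layer style loss E_l from the formula above, with
    # N_l = number of channels and M_l = spatial size of the maps.
    C, H, W = F_style.shape
    A_l = gram_matrix(F_style)      # Gram matrix of the style image
    G_l = gram_matrix(F_gen)        # Gram matrix of the generated image
    return torch.sum((G_l - A_l) ** 2) / (4 * C**2 * (H * W)**2)
```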
- Total Loss
It is defined by the formula below, where α and β are hyperparameters set as per requirement:
L_{\text{total}}(P, A, X) = \alpha \, L_{\text{content}}(P, X) + \beta \, L_{\text{style}}(A, X)
The generated image X, in theory, is the image for which both the content loss and the style loss are smallest, meaning that X matches the content of P and the style of A at the same time. In practice, X is found by iteratively minimizing the total loss, which produces the desired output.
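Concretely, the minimization is gradient descent on the pixels of X while the vgg-19 weights stay frozen. A hedged sketch, assuming hypothetical helpers content_loss(p, x) and style_loss(a, x) that extract the features internally and return the two terms defined above (layer choices, weights, and hyperparameter values vary):

```python
import torch

# p, a: image tensors for the content photograph and the style image.
# alpha, beta: hyperparameters; beta is usually much larger than alpha.
alpha, beta = 1.0, 1e3

x = p.clone().requires_grad_(True)          # start X from the photograph P
optimizer = torch.optim.Adam([x], lr=0.01)  # optimize pixels, not weights

for step in range(300):
    optimizer.zero_grad()
    total = alpha * content_loss(p, x) + beta * style_loss(a, x)
    total.backward()                        # gradients flow into the pixels of x
    optimizer.step()
```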
Note: This is a very exciting new field, made practical by hardware optimizations, parallelism with CUDA (Compute Unified Device Architecture), and Intel's hyper-threading.
Code & Output
You can find the entire code, data files and outputs of Style Transfer (bonus for sticking around: it has code for audio styling as well!) here: __CA__'s Github Repo.