What is neural style transfer?

Overview

Neural style transfer (NST) is a deep learning application that uses convolutional neural networks (CNNs) to transform a content image so that it appears as if viewed through the lens of a style image. The model is supplied with two input images: a content image and a style image.

Note: The content image defines the content that the output image should preserve. The style image captures the patterns according to which the content image should be changed to produce the output image.

An example of neural style transfer

How does NST work?

Convolutional networks

Convolutional networks consist of layers of filters, where each filter has weights/parameters. In NST, these weights remain constant: the network is pretrained and frozen. Each filter performs a convolution with a section of the input image representation, and so each layer produces a filtered output representation that serves as input to the next layer. Non-linear activation functions (such as ReLU) applied after the convolutions introduce non-linearity into the model.

An example of a layered convolutional net architecture
  • Maxpool layers: In the above-illustrated architecture, we use these layers to cut down the spatial dimensions of the feature maps, which helps reduce overfitting. They also provide the added benefit of reducing computational cost.
A demonstration of how maxpool layers work to reduce dimensions
  • Lower-level layers: We use the lower filter layers to capture finer input image features, such as lines, edges, and other small details.
  • Higher-level layers: We use the higher filter layers to build on the feature maps produced by the lower layers and extract more complex image details, such as shapes and objects (see the sketch after this list).
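To make this concrete, here is a minimal sketch of extracting intermediate feature maps from a pretrained, frozen CNN. It assumes PyTorch and torchvision are available and uses VGG19, a network commonly chosen for NST; the specific layer indices are illustrative choices, not fixed by NST itself.

```python
import torch
import torchvision.models as models

# Load a pretrained VGG19 and freeze its weights: in NST the network
# is never trained, only used as a fixed feature extractor.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(image, layer_indices):
    """Run `image` through VGG19 and collect the outputs of the
    requested layers (indices into vgg are illustrative)."""
    features = {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_indices:
            features[i] = x
    return features

# A random tensor stands in for a real image here.
img = torch.rand(1, 3, 224, 224)
# A lower layer (e.g., index 1) captures edges and lines; a higher
# layer (e.g., index 30) captures more complex structures.
feats = extract_features(img, layer_indices={1, 30})
for i, f in feats.items():
    print(i, f.shape)
```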

Content loss

We don't alter the parameters or weights of the filters in any layer. Instead, we update the generated image itself, using content loss as a distance metric between the feature representations of the content image and the generated image.

Content loss is defined by the equation below, where $O$ represents the output (generated) image, $I$ represents the input (content) image, and $F_{ij}$ is the activation of the $i^{th}$ filter at position $j$ in a chosen layer:

$$L_{content}(I, O) = \frac{1}{2} \sum_{i,j} \left( F_{ij}(O) - F_{ij}(I) \right)^2$$
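A minimal sketch of this content loss follows, assuming `F_I` and `F_O` are feature maps of the content and generated images taken from the same layer (for example, with the `extract_features` helper sketched above):

```python
import torch

def content_loss(F_I, F_O):
    """Content loss: half the sum of squared differences between the
    layer activations of the input (content) and output images."""
    return 0.5 * torch.sum((F_O - F_I) ** 2)

# Toy activations standing in for real VGG feature maps.
F_I = torch.rand(1, 64, 112, 112)
F_O = torch.rand(1, 64, 112, 112, requires_grad=True)
loss = content_loss(F_I, F_O)
loss.backward()  # gradients flow back toward the generated image
print(loss.item())
```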

Style loss

The style of an image lies in its finer details (such as textures and brush strokes) rather than in the higher-level content it depicts. To capture style, we therefore use the outputs of lower-level filters, which better capture the minute, fine-grained details required to mimic the style.

To better comprehend this: the output of each filter indicates the presence of a particular detail. When these outputs are correlated (taking the scalar product of two vectors, each representing the output of a filter channel), the result is the corresponding Gram matrix.

The Gram matrix for the input style image $S$ can be written as follows, where $i$ and $j$ respectively index the height and width of a channel, $l$ is one of the $L$ total layers, and $c$ and $c'$ are two different filter channels:

$$G^{l}_{cc'}(S) = \sum_{i,j} F^{l}_{ij,c}(S) \, F^{l}_{ij,c'}(S)$$

Similarly, the Gram matrix for the generated image $G$ can be calculated as follows:

$$G^{l}_{cc'}(G) = \sum_{i,j} F^{l}_{ij,c}(G) \, F^{l}_{ij,c'}(G)$$

Consequently, the total style loss is the weighted accumulation of the mean squared errors between the two Gram matrices, where $N_l$ is the number of feature maps at layer $l$, $M_l$ is the length of the flattened feature vector at layer $l$, and $w_l$ is the weight applied to layer $l$:

$$L_{style} = \sum_{l=0}^{L} \frac{w_l}{4 N_l^2 M_l^2} \sum_{c,c'} \left( G^{l}_{cc'}(S) - G^{l}_{cc'}(G) \right)^2$$
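The following sketch implements the Gram matrix and the layer-weighted style loss above; the layer keys and weights `w_l` are illustrative assumptions:

```python
import torch

def gram_matrix(F):
    """Gram matrix of a feature map F with shape (1, C, H, W):
    scalar products between every pair of filter channels."""
    _, C, H, W = F.shape
    flat = F.view(C, H * W)   # each row is one flattened channel
    return flat @ flat.t()    # (C, C) matrix of channel correlations

def style_loss(style_feats, gen_feats, layer_weights):
    """Sum over layers of w_l * squared Gram differences, scaled by
    1 / (4 * N_l^2 * M_l^2) as in the equation above."""
    loss = 0.0
    for l, w in layer_weights.items():
        S, G = style_feats[l], gen_feats[l]
        _, N, H, W = S.shape  # N_l feature maps, M_l = H * W
        M = H * W
        diff = gram_matrix(S) - gram_matrix(G)
        loss = loss + w * torch.sum(diff ** 2) / (4 * N**2 * M**2)
    return loss

# Toy feature maps for two layers (keys are hypothetical layer ids).
style_feats = {1: torch.rand(1, 64, 112, 112), 6: torch.rand(1, 128, 56, 56)}
gen_feats = {1: torch.rand(1, 64, 112, 112), 6: torch.rand(1, 128, 56, 56)}
print(style_loss(style_feats, gen_feats, layer_weights={1: 0.5, 6: 0.5}).item())
```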

Total loss

The style and content losses are combined and weighted to compute the total loss as follows, where $\alpha$ and $\beta$ are the weights applied to each of the losses:

$$L_{total} = \alpha \, L_{content} + \beta \, L_{style}$$

We can tune these weights to shift the relative importance of each component in the output image. For instance, larger values of $\alpha$ give higher importance to content preservation, whereas larger values of $\beta$ place greater emphasis on style in the final output.

The total loss is computed and back-propagated to the pixels of the generated image, which are updated to lower the loss. The process is repeated until convergence, eventually producing an artistic, aptly transformed image.
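Putting the pieces together, here is a hedged sketch of that optimization loop: gradients flow to the generated image's pixels, not to the network weights, and `alpha`/`beta` are the weights from the total-loss equation. It reuses the illustrative `extract_features`, `content_loss`, and `style_loss` helpers sketched above; the layer choices and weight values are assumptions, not a canonical implementation.

```python
import torch

content_img = torch.rand(1, 3, 224, 224)  # stand-ins for real images
style_img = torch.rand(1, 3, 224, 224)

content_layer = {30}             # illustrative layer choices
style_layers = {1: 0.5, 6: 0.5}  # illustrative layer weights w_l

# Precompute the fixed targets once; only the generated image changes.
target_content = extract_features(content_img, content_layer)[30]
target_style = extract_features(style_img, set(style_layers))

# Optimize the pixels of the generated image, initialized from the content.
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.01)

alpha, beta = 1.0, 1e4  # content/style trade-off weights
for step in range(200):
    optimizer.zero_grad()
    feats = extract_features(generated, content_layer | set(style_layers))
    loss = (alpha * content_loss(target_content, feats[30])
            + beta * style_loss(target_style, feats, style_layers))
    loss.backward()   # gradients w.r.t. the image pixels
    optimizer.step()  # update the image, not the network
```

A larger `beta` relative to `alpha` pushes the result toward the style image's textures, at the cost of content fidelity; in practice these weights usually differ by several orders of magnitude because the two losses have very different scales.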
