Experimental Setup: Image Synthesis and Manipulation

Learn how to perform image-to-image translation with pix2pix and pix2pixHD.

In this lesson, we will investigate image-to-image translation using the pix2pix and pix2pixHD models. We will synthesize shoes from shoe outlines and urban cityscapes from instance and semantic maps. We will start with the simpler model: pix2pix.


To train pix2pix, we are going to use the UT Zappos 50K dataset, also known as UT-Zap50K. It consists of approximately 50,000 catalog images collected from the Zappos website, together with their respective edge maps. All images are centered on a white background and pictured in the same orientation. The dataset covers four types of footwear: shoes, sandals, slippers, and boots.

For the improved model (pix2pixHD), we will use the Cityscapes dataset. It provides dense, pixel-level semantic and instance-wise annotations for 30 classes, such as cars, trees, and pedestrians. It has 5,000 images with high-quality annotations and 20,000 images with coarse annotations.

Let’s walk through the helper functions that we will use to load the data and iterate over mini-batches when training pix2pix:

  1. We begin by importing the various libraries:
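
As a rough illustration of the kind of helpers described here, the following minimal sketch uses only the Python standard library. The function names `list_images` and `iterate_minibatches`, their parameters, and the idea of batching file paths rather than decoded images are all assumptions made for illustration; they are not the lesson's actual code, which may rely on a deep learning framework and an image library instead.

```python
import random
from pathlib import Path


def list_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Hypothetical helper: return a sorted list of image files under `root`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def iterate_minibatches(items, batch_size, shuffle=True, seed=None):
    """Hypothetical helper: yield lists of at most `batch_size` items.

    Each item appears exactly once per pass; the last batch may be smaller.
    """
    indices = list(range(len(items)))
    if shuffle:
        # A local Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [items[i] for i in indices[start:start + batch_size]]
```

In a training loop, each batch of paths would then be decoded into image arrays and passed to the model, for example `for batch in iterate_minibatches(paths, batch_size=4): ...`, where `paths` would come from `list_images` on the dataset directory.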
