Unsupervised image-to-image translation using generative models






In recent years, deep learning has achieved great success in various computer vision tasks, such as image classification and segmentation. Unsupervised image-to-image (I2I) translation, which models how to translate images from one domain to another without paired data, still lacks systematic and thorough study. In this dissertation I illustrate the significance of studying unsupervised I2I translation, review relevant theories, and propose approaches to address drawbacks and shortcomings in existing work.

This dissertation introduces four new contributions to unsupervised I2I translation. The first contribution is a unified framework for unsupervised I2I translation. The second contribution provides fine-grained control over I2I translation, where current approaches fall short. The third contribution incorporates a module for controlling shapes when translating certain types of images whose shapes must be preserved after translation. Lastly, this dissertation proposes a new I2I translation framework that learns, in an unsupervised manner, to translate only objects of interest and leave everything else unaltered.

The first contribution of this work addresses the open problem of multimodal unsupervised I2I translation using a generative adversarial network. Previous works, such as MUNIT and DRIT, are able to translate images among multiple domains, but they generate images of inferior quality and diversity. Moreover, they require training n(n-1) generators and n discriminators to learn translation among n domains. Therefore, I propose a simpler yet more effective framework for multimodal unsupervised I2I translation. The new approach consists of only a mapping network, an encoder-decoder pair (generator), and a discriminator.
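The parameter-count argument and the role of the mapping network can be illustrated with a toy sketch. This is not the dissertation's actual architecture (which uses convolutional networks); the linear mapping network and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DOMAINS = 5
STYLE = 16

# A pairwise scheme (as in MUNIT/DRIT-style setups) needs one generator
# per ordered domain pair: n(n-1) generators for n domains.
pairwise_generators = N_DOMAINS * (N_DOMAINS - 1)   # 20 generators for n = 5

# The unified scheme needs a single encoder-decoder pair plus one mapping
# network that turns random noise and a target-domain index into a
# domain-specific style code (toy per-domain weights stand in for the net).
W_map = rng.normal(size=(N_DOMAINS, STYLE))

def mapping_network(noise, domain):
    """Map random noise and a domain index to a domain-specific style code."""
    return np.tanh(noise * W_map[domain])

z = rng.normal(size=STYLE)
styles = [mapping_network(z, d) for d in range(N_DOMAINS)]
# One shared set of weights serves all domains; the style codes still differ
# per domain, so a single generator can target any of them.
```

Sampling fresh noise `z` for the same target domain yields different style codes, which is what makes the translation multimodal.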
The framework assumes that the latent space can be decomposed by the encoder into content and style sub-spaces, where the content space is deemed domain-invariant and the style space is domain-dependent. Unlike MUNIT and DRIT, which simply sample style codes from a standard normal distribution when translating, I employ a mapping network to learn style codes of different domains. Translation is performed by the decoder, keeping the content codes and exchanging the style codes. To encourage diversity in translated images, I employ style regularizations and inject Gaussian noise in the decoder. Extensive experiments show that the new framework is superior or comparable to state-of-the-art baselines.

The second contribution of this dissertation adds fine-grained control to I2I translation. This framework likewise assumes that the latent space can be decomposed into content and style sub-spaces. Instead of naively exchanging style codes when translating, it uses an interpolator that guides the transformation and produces sequences of intermediate results under different strengths of transformation. Domain-specific information, which may persist in the content code and degrade translated images if the code is simply treated as domain-invariant, is excluded in this framework. I establish theoretical foundations that support the framework's key assumptions. Extensive experiments show that images translated with the new framework are superior or comparable to state-of-the-art baselines.

This dissertation also proposes a new I2I translation framework that is shape-aware. Attribute transfer is more challenging when the source and target domains have different shapes, and this new model is able to preserve shape when transferring attributes.
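The exchange-and-interpolate mechanics described above can be sketched with toy linear maps. Real encoders and decoders are convolutional networks; the matrices `E` and `G` and the linear interpolator here are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D, CONTENT, STYLE = 32, 24, 8

# Toy linear encoder that splits a latent vector into content and style
# parts, and a toy linear decoder that recombines them.
E = rng.normal(size=(CONTENT + STYLE, D)) / np.sqrt(D)
G = rng.normal(size=(D, CONTENT + STYLE)) / np.sqrt(CONTENT + STYLE)

def encode(x):
    h = E @ x
    return h[:CONTENT], h[CONTENT:]          # (content code, style code)

def decode(content, style):
    return G @ np.concatenate([content, style])

x_a, x_b = rng.normal(size=D), rng.normal(size=D)
c_a, s_a = encode(x_a)
c_b, s_b = encode(x_b)

# Translation: keep the source content code, take the target style code.
x_ab = decode(c_a, s_b)

# Fine-grained control: interpolate between styles so that alpha sets the
# strength of the transformation, yielding a sequence of intermediates.
def interpolate(alpha):
    return decode(c_a, (1 - alpha) * s_a + alpha * s_b)

sequence = [interpolate(a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

At `alpha = 0` the output is the reconstruction with the original style, and at `alpha = 1` it coincides with the full translation `x_ab`.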
Compared to other state-of-the-art GAN-based image-to-image translation models, the new model generates more visually appealing results while maintaining the quality of results obtained via transfer learning.

The last part of this work learns to translate only objects of interest and keep the background unaltered, which produces more visually pleasing results than other approaches. Previous works, such as CycleGAN, MUNIT, and StarGAN2, are able to translate images among multiple domains and generate diverse images, but they often introduce unwanted changes to the background. To improve on this, I propose a simple yet effective attention-based framework for unsupervised I2I translation. The framework not only translates objects of interest while leaving the background unaltered, but also generates images for multiple domains simultaneously. Unlike recent studies on unsupervised I2I with attention mechanisms that require ground truth for learning attention maps, the new approach learns attention maps in an unsupervised manner. Extensive experiments show that the new framework is superior to state-of-the-art baselines.
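The attention-based composition described above can be sketched as a convex combination of the input and the generator output. The hard-coded mask and single-channel toy image below are illustrative assumptions; in practice the attention map is predicted by a network and learned without ground truth.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W = 4, 4

x = rng.uniform(size=(H, W))            # toy single-channel input image
translated = rng.uniform(size=(H, W))   # generator output for the target domain

# Attention map in [0, 1]: close to 1 on objects of interest, 0 on background.
attention = np.zeros((H, W))
attention[1:3, 1:3] = 1.0               # pretend the object occupies the center

# Compose: translate only the attended region, copy the background through.
out = attention * translated + (1.0 - attention) * x
```

Because background pixels receive zero attention, they are copied from the input exactly, which is what keeps the background unaltered after translation.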



Unsupervised image-to-image translation, Generative models




Doctor of Philosophy


Department of Computer Science

Major Professor

William H. Hsu