Qicheng Lao, Mohammad Havaei, Francis Dutil, Ahmad Pesaranghader, Lisa Di Jorio and Thomas Fevens. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 7567-7576.
Synthesizing images from a given text description involves engaging two types of information: the content, which includes information explicitly described in the text (e.g., color, composition, etc.), and the style, which is usually not well described in the text (e.g., location, quantity, size, etc.). However, previous works typically treat the task as a process of generating images only from the content, i.e., without considering learning meaningful style representations. In this paper, we aim to learn two variables that are disentangled in the latent space, representing content and style respectively. We achieve this by augmenting current text-to-image synthesis frameworks with a dual adversarial inference mechanism. Through extensive experiments, we show that our model learns, in an unsupervised manner, style representations that correspond to meaningful information present in the image but not well described in the text. The new framework also improves the quality of synthesized images when evaluated on the Oxford-102, CUB, and COCO datasets.
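To make the content/style disentanglement concrete, the sketch below illustrates the general shape of such a framework: a generator conditioned on a content code derived from the text embedding plus a separately sampled style code, an inference network mapping an image back to estimates of both codes, and a discriminator over (image, content, style) triples for adversarial inference. This is a minimal illustration under assumed dimensions and module names, not the authors' actual architecture or training procedure.

```python
# Illustrative sketch only (not the paper's implementation): a text-to-image
# generator with disentangled "content" (text-derived) and "style" (sampled)
# latents, an inference network, and a joint discriminator in the spirit of
# adversarial inference. All sizes and names are assumptions.
import torch
import torch.nn as nn

TEXT_DIM, CONTENT_DIM, STYLE_DIM, IMG_DIM = 256, 128, 64, 64 * 64 * 3

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Content is the information explicitly described in the text.
        self.content_from_text = nn.Linear(TEXT_DIM, CONTENT_DIM)
        self.decode = nn.Sequential(
            nn.Linear(CONTENT_DIM + STYLE_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG_DIM), nn.Tanh(),
        )

    def forward(self, text_emb, style_z):
        content_z = self.content_from_text(text_emb)
        return self.decode(torch.cat([content_z, style_z], dim=1)), content_z

class Inference(nn.Module):
    """Maps an image back to (content, style) code estimates."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(IMG_DIM, 512), nn.ReLU())
        self.to_content = nn.Linear(512, CONTENT_DIM)
        self.to_style = nn.Linear(512, STYLE_DIM)

    def forward(self, img):
        h = self.encode(img)
        return self.to_content(h), self.to_style(h)

class JointDiscriminator(nn.Module):
    """Scores (image, content, style) triples so that generated and inferred
    joint distributions can be matched adversarially."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + CONTENT_DIM + STYLE_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, img, content_z, style_z):
        return self.net(torch.cat([img, content_z, style_z], dim=1))

# One illustrative pass: generate from (text, sampled style), infer codes
# from a real image, and score both directions with the discriminator.
G, E, D = Generator(), Inference(), JointDiscriminator()
text_emb = torch.randn(4, TEXT_DIM)
style_z = torch.randn(4, STYLE_DIM)
fake_img, content_z = G(text_emb, style_z)
real_img = torch.rand(4, IMG_DIM) * 2 - 1
real_content, real_style = E(real_img)
score_fake = D(fake_img, content_z, style_z)
score_real = D(real_img, real_content, real_style)
```

In such a setup, varying style_z while holding the text fixed would change attributes not specified by the description, which is the kind of unsupervised style behavior the abstract reports.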