CLIP-guided StyleGAN inversion for text-driven real image editing

dc.contributor.authorid: 0000-0002-0249-5858
dc.contributor.authorid: N/A
dc.contributor.authorid: 0000-0002-6280-8422
dc.contributor.authorid: 0000-0002-7039-0046
dc.contributor.coauthor: Ceylan, Duygu
dc.contributor.coauthor: Erdem, Erkut
dc.contributor.department: N/A
dc.contributor.department: N/A
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: Department of Computer Engineering
dc.contributor.kuauthor: Baykal, Ahmet Canberk
dc.contributor.kuauthor: Anees, Abdul Basit
dc.contributor.kuauthor: Erdem, Aykut
dc.contributor.kuauthor: Yüret, Deniz
dc.contributor.kuprofile: PhD Student
dc.contributor.kuprofile: Master Student
dc.contributor.kuprofile: Faculty Member
dc.contributor.kuprofile: Faculty Member
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.yokid: N/A
dc.contributor.yokid: N/A
dc.contributor.yokid: 20331
dc.contributor.yokid: 179996
dc.date.accessioned: 2025-01-19T10:29:09Z
dc.date.issued: 2023
dc.description.abstract: Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to specific editing directions in the latent space. However, these approaches have inherent limitations: the former is inefficient, while the latter often struggles to handle multi-attribute changes effectively. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the Contrastive Language-Image Pre-training (CLIP) embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains, including human faces, cats, and birds, as shown by our qualitative and quantitative results.
dc.description.indexedby: WoS
dc.description.indexedby: Scopus
dc.description.issue: 5
dc.description.openaccess: Bronze, Green Submitted
dc.description.publisherscope: International
dc.description.sponsors: This work has been partially supported by AI Fellowships to A. C. Baykal and A. Basit Anees provided by the KUIS AI Center, by BAGEP 2021 Award of the Science Academy to A. Erdem, and by an Adobe research gift.
dc.description.volume: 42
dc.identifier.doi: 10.1145/3610287
dc.identifier.eissn: 1557-7368
dc.identifier.issn: 0730-0301
dc.identifier.quartile: Q1
dc.identifier.scopus: 2-s2.0-85174729821
dc.identifier.uri: https://doi.org/10.1145/3610287
dc.identifier.uri: https://hdl.handle.net/20.500.14288/25840
dc.identifier.wos: 1086833300011
dc.keywords: Generative adversarial networks
dc.keywords: Image-to-image translation
dc.keywords: Image editing
dc.language: en
dc.publisher: Association for Computing Machinery
dc.relation.grantno: AI Fellowships; KUIS AI Center; BAGEP 2021 Award of the Science Academy
dc.source: ACM Transactions on Graphics
dc.subject: Computer science
dc.subject: Software engineering
dc.title: CLIP-guided StyleGAN inversion for text-driven real image editing
dc.type: Journal Article
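
The abstract above outlines a two-stage pipeline: an inversion encoder whose adapter layers are conditioned on the CLIP embedding of the target description to produce a residual latent code, followed by a CLIP-guided refinement step before decoding with a frozen StyleGAN generator. The following is a minimal PyTorch sketch of that flow; all module names, shapes, and the stub refiner/generator are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

CLIP_DIM = 512             # dimensionality of CLIP text embeddings (ViT-B/32)
W_DIM, N_LAYERS = 512, 18  # StyleGAN2 W+ code: one 512-d style vector per layer

class TextConditionedAdapter(nn.Module):
    """FiLM-style adapter: modulates encoder features with a CLIP text embedding."""
    def __init__(self, feat_ch: int):
        super().__init__()
        self.to_scale = nn.Linear(CLIP_DIM, feat_ch)
        self.to_shift = nn.Linear(CLIP_DIM, feat_ch)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W); text_emb: (B, CLIP_DIM)
        s = self.to_scale(text_emb)[:, :, None, None]
        b = self.to_shift(text_emb)[:, :, None, None]
        return feat * (1 + s) + b

class AdapterEncoder(nn.Module):
    """Stand-in for a pretrained GAN-inversion encoder with adapter layers."""
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, kernel_size=3, stride=2, padding=1)
        self.adapter = TextConditionedAdapter(feat_ch)
        self.head = nn.Linear(feat_ch, N_LAYERS * W_DIM)

    def forward(self, img, text_emb=None):
        feat = self.backbone(img)
        if text_emb is not None:
            # Conditioning the inversion features on the target description
            # yields a residual edit direction rather than a plain inversion.
            feat = self.adapter(feat, text_emb)
        return self.head(feat.mean(dim=(2, 3))).view(-1, N_LAYERS, W_DIM)

@torch.no_grad()
def edit_image(img, text_emb, encoder, refiner, generator):
    w_inv = encoder(img)                  # (1) invert the real image into W+
    delta_w = encoder(img, text_emb)      # (2) text-conditioned residual code
    delta_w = refiner(delta_w, text_emb)  # (3) CLIP-guided refinement step
    return generator(w_inv + delta_w)     # (4) decode with the frozen StyleGAN

# Toy usage with stubs in place of the pretrained refiner and generator.
refiner = lambda dw, t: dw                                # identity placeholder
generator = lambda w: torch.rand(w.size(0), 3, 256, 256)  # StyleGAN stand-in
img = torch.rand(1, 3, 256, 256)
text_emb = torch.rand(1, CLIP_DIM)  # CLIP embedding of e.g. "a smiling face"
edited = edit_image(img, text_emb, AdapterEncoder(), refiner, generator)
print(edited.shape)  # torch.Size([1, 3, 256, 256])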

Files

Original bundle

Name: IR05144.pdf
Size: 30.15 MB
Format: Adobe Portable Document Format