Publication:
Leveraging vision-language models to select trustworthy super-resolution samples generated by diffusion models

dc.contributor.PhDKorkmaz, Cansu
dc.contributor.departmentDepartment of Electrical and Electronics Engineering
dc.contributor.departmentKUIS AI (Koç University & İş Bank Artificial Intelligence Center)
dc.contributor.facultymemberYes
dc.contributor.kuauthorKorkmaz, Cansu
dc.contributor.kuauthorTekalp, Ahmet Murat
dc.contributor.kuauthorDoğan, Zafer
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.schoolcollegeinstituteResearch Center
dc.date.accessioned2026-04-22T10:23:28Z
dc.date.available2026-02-01
dc.date.issued2026
dc.description.abstractSuper-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On the one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics such as PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.
dc.description.fulltextYes
dc.description.harvestedfromOpenAire API
dc.description.indexedbyWOS
dc.description.indexedbyScopus
dc.description.openaccessGold OA
dc.description.peerreviewstatusNon-Peer-Reviewed
dc.description.publisherscopeInternational
dc.description.sponsoredbyTubitakEuTÜBİTAK
dc.description.sponsorshipThe work of Cansu Korkmaz was supported by the Koc University Artificial Intelligence (KUIS AI) Center Fellowship. The work of A. Murat Tekalp was supported in part by TUBITAK 2247-A under Award 120C156 and in part by Turkish Academy of Sciences (TUBA). The work of Zafer Dogan was supported by the TUBITAK 2232 International Fellowship for Outstanding Researchers under Award 118C337.
dc.description.studentonlypublicationNo
dc.description.studentpublicationYes
dc.description.versionPublished Version
dc.identifier.doi10.1109/tcsvt.2025.3585092
dc.identifier.eissn1558-2205
dc.identifier.embargoNo
dc.identifier.endpage1432
dc.identifier.filenameinventorynoIR06906
dc.identifier.grantno118C337
dc.identifier.issn1051-8215
dc.identifier.issue2
dc.identifier.openairedoi_dedup___::e509a2607a5fd8f01fe04f69b819daaa
dc.identifier.quartileQ1
dc.identifier.scopus2-s2.0-105010047435
dc.identifier.startpage1419
dc.identifier.urihttps://hdl.handle.net/20.500.14288/32598
dc.identifier.urihttps://doi.org/10.1109/tcsvt.2025.3585092
dc.identifier.volume36
dc.identifier.wos001687411500018
dc.keywordsFOS: computer and information sciences
dc.keywordsArtificial intelligence (cs.AI)
dc.keywordsComputer vision and pattern recognition (cs.CV)
dc.language.isoeng
dc.publisherIEEE
dc.relation.affiliationKoç University
dc.relation.collectionKoç University Institutional Repository
dc.relation.ispartofIEEE Transactions on Circuits and Systems for Video Technology
dc.relation.openaccessYes
dc.rightsCC BY (Attribution)
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectComputer vision
dc.titleLeveraging vision-language models to select trustworthy super-resolution samples generated by diffusion models
dc.typeJournal Article
dspace.entity.typePublication
relation.isOrgUnitOfPublication21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isOrgUnitOfPublication77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isOrgUnitOfPublication.latestForDiscovery21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isParentOrgUnitOfPublication8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublicationd437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication.latestForDiscovery8e756b23-2d4a-4ce8-b1b3-62c794a8c164
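
Illustrative sketch (not taken from the paper): the abstract states that the top-ranked SR candidates are ensembled into a single output, but it does not say how. The Python snippet below assumes the simplest possible choice, a pixel-wise mean of the k highest-scoring candidates. The function name `ensemble_top_k`, the parameter `k`, the toy images, and the scores are all hypothetical; in the described framework the scores would come from the VLM ranking.

```python
from typing import List, Sequence

# A grayscale image as rows of pixel intensities (assumption for this sketch).
Image = List[List[float]]


def ensemble_top_k(candidates: Sequence[Image],
                   scores: Sequence[float],
                   k: int = 3) -> Image:
    """Average the k highest-scoring candidates pixel by pixel.

    Pixel-wise averaging is an assumed ensembling rule; the source
    abstract does not specify the actual one.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")

    # Rank candidate indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = [candidates[i] for i in ranked[:k]]

    height, width = len(top[0]), len(top[0][0])
    return [
        [sum(img[r][c] for img in top) / k for c in range(width)]
        for r in range(height)
    ]


# Toy example: three constant 2x2 "SR candidates" with hypothetical scores.
candidates = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
scores = [0.1, 0.9, 0.8]
fused = ensemble_top_k(candidates, scores, k=2)
```

With these toy inputs, the second and third candidates are selected and averaged. A practical pipeline would likely operate on image arrays and could weight candidates by score rather than averaging them uniformly.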

Files

Original bundle

Name:
IR06906.pdf
Size:
2.81 MB
Format:
Adobe Portable Document Format