Publication:
Leveraging vision-language models to select trustworthy super-resolution samples generated by diffusion models

dc.contributor.department: KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
dc.contributor.department: Department of Electrical and Electronics Engineering
dc.contributor.kuauthor: Korkmaz, Cansu
dc.contributor.kuauthor: Tekalp, Ahmet Murat
dc.contributor.kuauthor: Doğan, Zafer
dc.contributor.schoolcollegeinstitute: Research Center
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.date.accessioned: 2026-07-02T07:03:29Z
dc.date.available: 2026-03-27
dc.date.issued: 2026
dc.description.abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions that are consistent with a given low-resolution image. On the one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often leads to artifacts that introduce ambiguity in information-critical applications such as identifying digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set then becomes a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to evaluate semantic correctness, visual quality, and the presence of artifacts. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity using CLIP embeddings, structural integrity via SSIM on edge maps, and artifact sensitivity measured through a multi-level wavelet decomposition. We empirically demonstrate that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR, LPIPS, and DISTS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning model outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR tasks.
dc.description.fulltext: No
dc.description.harvestedfrom: Manual
dc.description.indexedby: WOS
dc.description.indexedby: Scopus
dc.description.openaccess: Green Submitted
dc.description.publisherscope: International
dc.description.readpublish: N/A
dc.description.sponsoredbyTubitakEu: TÜBİTAK
dc.description.sponsorship: The work of Cansu Korkmaz was supported by the Koç University Artificial Intelligence (KUIS AI) Center Fellowship. The work of A. Murat Tekalp was supported in part by TUBITAK 2247-A under Award 120C156 and in part by the Turkish Academy of Sciences (TUBA). The work of Zafer Doğan was supported by the TUBITAK 2232 International Fellowship for Outstanding Researchers under Award 118C337.
dc.description.version: Published Version
dc.identifier.WoSQuartile: Q1
dc.identifier.doi: 10.1109/TCSVT.2025.3585092
dc.identifier.eissn: 1558-2205
dc.identifier.embargo: No
dc.identifier.endpage: 1432
dc.identifier.grantno: 118C337
dc.identifier.grantno: 120C156
dc.identifier.issn: 1051-8215
dc.identifier.issue: 2
dc.identifier.scopus: 2-s2.0-105010047435
dc.identifier.startpage: 1419
dc.identifier.uri: https://doi.org/10.1016/j.nsa.2026.106990
dc.identifier.uri: https://hdl.handle.net/20.500.14288/32850
dc.identifier.volume: 36
dc.identifier.wos: 001687411500018
dc.keywords: Training
dc.keywords: Semantics
dc.keywords: Diffusion models
dc.keywords: Measurement
dc.keywords: Accuracy
dc.keywords: Visualization
dc.keywords: Image reconstruction
dc.keywords: Circuits and systems
dc.keywords: Image edge detection
dc.keywords: Generative adversarial networks
dc.keywords: Super-resolution
dc.keywords: Trustworthy SR
dc.keywords: Vision-language models
dc.keywords: Human evaluation
dc.language: eng
dc.publisher: IEEE
dc.relation.affiliation: Koç University
dc.relation.collection: Koç University Institutional Repository
dc.relation.ispartof: IEEE Transactions on Circuits and Systems for Video Technology
dc.relation.openaccess: N/A
dc.rights: N/A
dc.rights.uri: N/A
dc.subject: Engineering
dc.title: Leveraging vision-language models to select trustworthy super-resolution samples generated by diffusion models
dc.type: Journal Article
dspace.entity.type: Publication
relation.isOrgUnitOfPublication: 77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isOrgUnitOfPublication: 21598063-a7c5-420d-91ba-0cc9b2db0ea0
relation.isOrgUnitOfPublication.latestForDiscovery: 77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isParentOrgUnitOfPublication: d437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery: d437580f-9309-4ecb-864a-4af58309d287
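
Illustrative note on the abstract: the Trustworthiness Score (TWS) is described as combining semantic similarity from CLIP embeddings, SSIM on edge maps, and artifact sensitivity from a multi-level wavelet decomposition. The Python sketch below shows one way such a three-part score could be assembled. It is not the authors' implementation. The record does not give the formula, the weights, the edge detector, or the wavelet, so the equal weights, the gradient-magnitude edge map, the single-window (global) SSIM, the Haar wavelet, the comparison against a reference image, and every function name are assumptions made for illustration only. The embeddings are taken as precomputed vectors, which would come from a CLIP image encoder in practice.

```python
import numpy as np

def cosine_similarity(a, b):
    """Semantic term: cosine similarity of two embedding vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def edge_map(img):
    """Gradient-magnitude edge map (assumed stand-in for an edge detector)."""
    gy, gx = np.gradient(np.asarray(img, dtype=np.float64))
    return np.hypot(gx, gy)

def global_ssim(x, y, data_range=1.0):
    """SSIM over the whole image as one window (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def haar_detail_energies(img, levels=2):
    """Mean detail-band energy at each level of a 2-D Haar decomposition."""
    a = np.asarray(img, dtype=np.float64)
    energies = []
    for _ in range(levels):
        h, w = a.shape
        a = a[: h - h % 2, : w - w % 2]          # crop to even size
        p, q = a[0::2, 0::2], a[0::2, 1::2]
        r, s = a[1::2, 0::2], a[1::2, 1::2]
        lh = (p + q - r - s) / 2                 # vertical differences
        hl = (p - q + r - s) / 2                 # horizontal differences
        hh = (p - q - r + s) / 2                 # diagonal differences
        energies.append(float(np.mean(lh ** 2 + hl ** 2 + hh ** 2)))
        a = (p + q + r + s) / 2                  # approximation for next level
    return np.array(energies)

def artifact_score(sr, ref, levels=2):
    """Artifact term: 1 minus the mean relative gap in detail energy."""
    e_sr = haar_detail_energies(sr, levels)
    e_ref = haar_detail_energies(ref, levels)
    rel = np.abs(e_sr - e_ref) / (e_sr + e_ref + 1e-12)
    return float(1.0 - rel.mean())

def trust_score_sketch(sr, ref, emb_sr, emb_ref, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical weighted sum of the three terms; weights are assumed."""
    terms = np.array([
        cosine_similarity(emb_sr, emb_ref),
        global_ssim(edge_map(sr), edge_map(ref)),
        artifact_score(sr, ref),
    ])
    return float(np.asarray(weights) @ terms)
```

With grayscale images scaled to [0, 1], an SR candidate identical to its reference scores approximately 1 under this sketch, and added noise in either the image or the embedding lowers the score.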
