Benchmarking AI chatbots for maternal lactation support: a cross-platform evaluation of quality, readability, and clinical accuracy

dc.contributor.coauthor	Aslan, Ilke Ozer
dc.contributor.department	KUH (Koç University Hospital)
dc.contributor.kuauthor	Aslan, Mustafa Törehan
dc.contributor.schoolcollegeinstitute	KUH (KOÇ UNIVERSITY HOSPITAL)
dc.date.accessioned	2025-09-10T04:55:47Z
dc.date.available	2025-09-09
dc.date.issued	2025
dc.description.abstract	Background and Objective: Large language model (LLM)-based chatbots are increasingly used by postpartum individuals seeking guidance on breastfeeding. However, the quality, readability, and guideline alignment of their content remain uncertain. This study evaluated and compared the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots (ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro) when prompted with common maternal questions related to breast-milk supply. Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: Ensuring Quality Information for Patients (EQIP), the Simple Measure of Gobbledygook (SMOG), and the Global Quality Scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. Results: ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot Pro showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. Conclusions: ChatGPT-4o Pro outperforms the other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, AI hallucinations, in which chatbots generate factually incorrect or fabricated information, remain a critical risk that must be addressed. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communication.
dc.description.fulltext	Yes
dc.description.harvestedfrom	Manual
dc.description.indexedby	WOS
dc.description.indexedby	Scopus
dc.description.indexedby	PubMed
dc.description.openaccess	Gold OA
dc.description.publisherscope	International
dc.description.readpublish	N/A
dc.description.sponsoredbyTubitakEu	N/A
dc.description.version	Published Version
dc.description.volume	13
dc.identifier.doi	10.3390/healthcare13141756
dc.identifier.eissn	2227-9032
dc.identifier.embargo	No
dc.identifier.filenameinventoryno	IR06367
dc.identifier.issue	14
dc.identifier.quartile	Q2
dc.identifier.scopus	2-s2.0-105011646514
dc.identifier.uri	https://doi.org/10.3390/healthcare13141756
dc.identifier.uri	https://hdl.handle.net/20.500.14288/30105
dc.identifier.wos	001536667600001
dc.keywords	Artificial intelligence
dc.keywords	Breastfeeding
dc.keywords	Chatbot
dc.keywords	Clinical accuracy
dc.keywords	Patient education
dc.keywords	Lactation support
dc.keywords	Large language models
dc.language.iso	eng
dc.publisher	MDPI
dc.relation.affiliation	Koç University
dc.relation.collection	Koç University Institutional Repository
dc.relation.ispartof	Healthcare
dc.relation.openaccess	Yes
dc.rights	CC BY (Attribution)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Health care sciences and services
dc.subject	Health policy services
dc.title	Benchmarking AI chatbots for maternal lactation support: a cross-platform evaluation of quality, readability, and clinical accuracy
dc.type	Journal Article
dspace.entity.type	Publication
person.familyName	Aslan
person.givenName	Mustafa Törehan
relation.isOrgUnitOfPublication	f91d21f0-6b13-46ce-939a-db68e4c8d2ab
relation.isOrgUnitOfPublication.latestForDiscovery	f91d21f0-6b13-46ce-939a-db68e4c8d2ab
relation.isParentOrgUnitOfPublication	055775c9-9efe-43ec-814f-f6d771fa6dee
relation.isParentOrgUnitOfPublication.latestForDiscovery	055775c9-9efe-43ec-814f-f6d771fa6dee
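
Note on the SMOG readability score named in the abstract: the record does not say how the authors computed it, so the following is only a minimal sketch of the standard published SMOG formula, grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291. The function names and the vowel-run syllable counter are illustrative assumptions, not the study's tooling; validated readability software uses more careful syllable counting.

```python
import math
import re


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Standard SMOG formula, normalised to a 30-sentence sample."""
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    return 1.0430 * math.sqrt(polysyllable_count * (30 / sentence_count)) + 3.1291


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_from_text(text: str) -> float:
    """Estimate the SMOG grade of a text using the heuristic counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    # SMOG counts words with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return smog_grade(polysyllables, len(sentences))
```

A lower grade indicates text that is easier to read; a value near 10, as reported in the abstract, corresponds roughly to a tenth-grade reading level.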