Publication:
Benchmarking AI chatbots for maternal lactation support: a cross-platform evaluation of quality, readability, and clinical accuracy

Co-Authors

Aslan, Ilke Ozer

Embargo Status

No

Abstract

Background and Objective: Large language model (LLM)-based chatbots are increasingly used by postpartum individuals seeking breastfeeding guidance, yet the quality, readability, and guideline alignment of their content remain uncertain. This study evaluated and compared the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots (ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro) when prompted with common maternal questions about breast-milk supply. Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: Ensuring Quality Information for Patients (EQIP), the Simple Measure of Gobbledygook (SMOG), and the Global Quality Scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user-experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. Results: ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), whereas Copilot more frequently omitted or oversimplified content. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. Conclusions: ChatGPT-4o Pro outperformed the other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information.
However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, AI hallucination, in which chatbots generate factually incorrect or fabricated information, remains a critical risk. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communication.

Publisher

MDPI

Subject

Health care sciences and services, Health policy services

Source

Healthcare

DOI

10.3390/healthcare13141756

Rights

CC BY (Attribution)

Copyrights Note

Creative Commons license

Except where otherwise noted, this item's license is described as CC BY (Attribution)
