Publication:
Benchmarking AI chatbots for maternal lactation support: a cross-platform evaluation of quality, readability, and clinical accuracy

Co-Authors

Aslan, Ilke Ozer

Embargo Status

No

Abstract

Background and Objective: Large language model (LLM)-based chatbots are increasingly used by postpartum individuals seeking breastfeeding guidance, yet the quality, readability, and guideline alignment of their content remain uncertain. This study evaluated and compared the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots (ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro) when prompted with common maternal questions about breast-milk supply. Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: Ensuring Quality Information for Patients (EQIP), the Simple Measure of Gobbledygook (SMOG), and the Global Quality Scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user-experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. Results: ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), whereas Copilot more frequently omitted or oversimplified content. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. Conclusions: ChatGPT-4o Pro outperformed the other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information.
However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, AI hallucination, in which chatbots generate factually incorrect or fabricated information, remains a critical risk. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communication.

Publisher

MDPI

Subject

Health care sciences and services, Health policy services

Source

Healthcare

DOI

10.3390/healthcare13141756

Rights

CC BY (Attribution)

Copyrights Note

Creative Commons license

Except where otherwise noted, this item's license is described as CC BY (Attribution)
