Evaluation of the accuracy, consistency, and scientific reliability of AI-powered Chatbots in endodontic practice

Authors

DOI:

https://doi.org/10.2340/aos.v85.46148

Keywords:

artificial intelligence, chatbots, endodontics, clinical decision-making, guideline adherence

Abstract

Objective: This study aimed to evaluate the accuracy, consistency, and scientific reliability of two AI-powered chatbots, ChatGPT-3.5 and ChatGPT-4o, in clinical endodontic decision-making, using the recently published European Society of Endodontology (ESE) S3-level Clinical Practice Guideline as the gold-standard reference.

Material and Methods: Twenty-five dichotomous (yes/no) questions were developed based on the ESE guideline and presented to both chatbots across three time intervals, yielding 300 total responses. Each response was evaluated for accuracy and consistency, and the quality of the supporting references was assessed according to their journal ranking (Q1, Q2, others).

Results: Both ChatGPT versions demonstrated high internal consistency across repeated measurements. ChatGPT-3.5 showed 94.4% agreement (κ = 0.824; 95% confidence interval [CI]: 0.786–0.898; p < 0.001), whereas ChatGPT-4o demonstrated 98.9% agreement (κ = 0.937; 95% CI: 0.893–0.965; p < 0.001). The accuracy of ChatGPT-3.5 relative to the guideline-based answers was 81.4%, 88.9%, and 82.2% in the morning, afternoon, and evening sessions, respectively, while ChatGPT-4o achieved 82.9%, 83.3%, and 85.4%, respectively. No statistically significant differences were observed between the models across the three time intervals (p > 0.05). The proportion of Q1/Q2-ranked references was high and comparable between ChatGPT-3.5 (74–82%) and ChatGPT-4o (76–84%).
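As an illustration of the kind of statistics reported above, the following minimal sketch computes per-session accuracy against a guideline answer key and a chance-corrected agreement coefficient across three repeated sessions. It assumes Fleiss' kappa for the repeated measurements; the abstract does not state which kappa variant was used. The function names and the toy yes/no answers are hypothetical and are not the study's data.

```python
from collections import Counter


def accuracy(answers, key):
    """Fraction of answers that match the guideline-based answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def fleiss_kappa(ratings, categories=("yes", "no")):
    """Fleiss' kappa for repeated categorical answers.

    ratings: one list per question, holding one answer per session.
    Assumption: every question has the same number of sessions (>= 2).
    """
    n_items = len(ratings)
    n_sessions = len(ratings[0])

    # How many sessions gave each category for each question.
    counts = [Counter(r) for r in ratings]

    # Observed agreement per question, then averaged over questions.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_sessions)
        / (n_sessions * (n_sessions - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_sessions)
        for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cat)

    if p_e == 1.0:
        # Every answer fell in a single category; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 questions, each asked in 3 sessions.
toy_ratings = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
]
toy_key = ["yes", "no", "yes", "no"]

# Accuracy of the first session against the hypothetical key.
morning_answers = [r[0] for r in toy_ratings]
print(accuracy(morning_answers, toy_key))
print(fleiss_kappa(toy_ratings))
```

Raw percentage agreement and kappa can diverge noticeably when one answer category dominates, which is why both are usually reported together.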

Conclusion: Both ChatGPT-3.5 and ChatGPT-4o demonstrated substantial alignment with the ESE S3-level clinical practice guideline. However, these findings should not be interpreted as definitive assessments of current clinical conversational AI systems, and further evaluation of evolving models is required.


References

McCarthy J, Minsky ML, Rochester N, Shannon CE. A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine. 2006;27(4):12.

Deng L. Artificial intelligence in the rising wave of deep learning: the historical path and future outlook [Perspectives]. IEEE Signal Process Mag. 2018;35(1):180–177. DOI: https://doi.org/10.1109/MSP.2017.2762725

Aminoshariae A, Kulild J, Nagendrababu V. Artificial intelligence in endodontics: current applications and future directions. J Endod. 2021;47(9):1352–7. DOI: https://doi.org/10.1016/j.joen.2021.06.003

Chae YM, Yoo KB, Kim ES, Chae H. The adoption of electronic medical records and decision support systems in Korea. Healthc Inform Res. 2011;17(3):172–7. DOI: https://doi.org/10.4258/hir.2011.17.3.172

Schleyer TK, Thyvalikakath TP, Spallek H, Torres-Urquidy MH, Hernandez P, Yuhaniak J. Clinical computing in general dentistry. J Am Med Inform Assoc. 2006;13(3):344–52. DOI: https://doi.org/10.1197/jamia.M1990

Shortliffe EH. Testing reality: the introduction of decision-support technologies for physicians. Methods Inf Med. 1989;28(1):1–5. DOI: https://doi.org/10.1055/s-0038-1635546

Adamopoulou E, Moussiades L. Chatbots: history, technology, and applications. Artif Intell Appl Innov. 2020;584:373–83. DOI: https://doi.org/10.1016/j.mlwa.2020.100006

Duncan HF, Kirkevang LL, Peters OA, El-Karim I, Krastl G, Del Fabbro M, et al. Treatment of pulpal and apical disease: the European Society of Endodontology (ESE) S3-level clinical practice guideline. Int Endod J. 2023;56 Suppl 3:238–95. DOI: https://doi.org/10.1111/iej.13974

Gheisari M, Ebrahimzadeh F, Rahimi M, Moazzamigodarzi M, Liu Y, Dutta Pramanik PK, et al. Deep learning: applications, architectures, models, tools, and frameworks: a comprehensive survey. CAAI Trans Intell Technol. 2023;8(3):581–606. DOI: https://doi.org/10.1049/cit2.12180

Mohammad-Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J. 2024;57(3):305–14. DOI: https://doi.org/10.1111/iej.14014

Plebani M. ChatGPT: Angel or Demond? Critical thinking is still needed. Clin Chem Lab Med. 2023;61(7):1131–2. DOI: https://doi.org/10.1515/cclm-2023-0387

Cadamuro J, Cabitza F, Debeljak Z, De Bruyne S, Frans G, Perez SM, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin Chem Lab Med. 2023;61(7):1158–66. DOI: https://doi.org/10.1515/cclm-2023-0355

Li X, Chan S, Zhu X, Pei Y, Ma Z, Liu X, et al. Are ChatGPT and GPT-4 general-purpose solvers for financial text analytics? A study on several typical tasks. In Proceedings of the 2023 conference on empirical methods in natural language processing: industry track (pp. 408–422). Singapore: Association for Computational Linguistics; 2023. DOI: https://doi.org/10.18653/v1/2023.emnlp-industry.39

OpenAI. GPT-4 and GPT-4o model documentation. OpenAI; 2024 [cited 2026 Jan 5]. Available from: https://platform.openai.com/docs/models

Jalali P, Mohammad-Rahimi H, Wang F-M, Sohrabniya F, AmirHossein Ourang S, Tian Y, et al. Performance of seven artificial intelligence Chatbots on board-style endodontic questions. J Endod. 2025;51(10):1413–9. DOI: https://doi.org/10.1016/j.joen.2025.06.014

Aljamani S, Hassona Y, Fansa HA, Saadeh HM, Jamani KD. Evaluating large language models in addressing patient questions on endodontic pain: a comparative analysis of accessible Chatbots. J Endod. 2025;51(11):1617–24. DOI: https://doi.org/10.1016/j.joen.2025.04.015

de Moura JDM, Fontana CE, da Silva Lima VHR, de Souza Alves I, de Melo Santos PA, de Almeida Rodrigues P. Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: a cross-sectional study. Comput Biol Med. 2024;183:109332. DOI: https://doi.org/10.1016/j.compbiomed.2024.109332

Büker M, Mercan G. Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: a comparative assessment. Int J Med Inform. 2025;201:105948. DOI: https://doi.org/10.1016/j.ijmedinf.2025.105948

Arılı Öztürk E, Turan Gökduman C, Çanakçi BC. Evaluation of the performance of ChatGPT-4 and ChatGPT-4o as a learning tool in endodontics. Int Endod J. 2026;59:1057–69. DOI: https://doi.org/10.1111/iej.14217

Ahmad B, Saleh K, Alharbi S, Alqaderi H, Jeong YN. Artificial intelligence in periodontology: performance evaluation of ChatGPT, Claude, and Gemini on the in-service examination. medRxiv [Preprint]. 2024. DOI: https://doi.org/10.1101/2024.05.29.24308155

Krishna S, Bhambra N, Bleakney R, Bhayana R. Evaluation of reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 on a radiology board–style examination. Radiology. 2024;311(2):e232715. DOI: https://doi.org/10.1148/radiol.232715

Ozden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol. 2024;40(6):722–9. DOI: https://doi.org/10.1111/edt.12965

Published

2026-06-04