Evaluation of the accuracy, consistency, and scientific reliability of AI-powered Chatbots in endodontic practice
DOI: https://doi.org/10.2340/aos.v85.46148
Keywords: artificial intelligence, chatbots, endodontics, clinical decision-making, guideline adherence
Abstract
Objective: This study aimed to evaluate the accuracy, consistency, and scientific reliability of two AI-powered chatbots—ChatGPT-3.5 and ChatGPT-4o—in clinical endodontic decision-making, using the recently published European Society of Endodontology (ESE) S3-level Clinical Practice Guideline as the gold standard reference.
Material and Methods: Twenty-five dichotomous (yes/no) questions were developed based on the ESE guideline and presented to both chatbots across three time intervals, yielding 300 total responses. Each response was evaluated for accuracy and consistency, and the quality of the supporting references was assessed according to their journal ranking (Q1, Q2, others).
Results: Both ChatGPT versions demonstrated high internal consistency across repeated measurements. ChatGPT-3.5 showed 94.4% agreement (κ = 0.824; 95% confidence interval [CI]: 0.786–0.898; p < 0.001), whereas ChatGPT-4o demonstrated 98.9% agreement (κ = 0.937; 95% CI: 0.893–0.965; p < 0.001). The accuracy of ChatGPT-3.5 relative to the guideline-based answers was 81.4%, 88.9%, and 82.2% in the morning, afternoon, and evening sessions, respectively, while ChatGPT-4o achieved 82.9%, 83.3%, and 85.4%, respectively. No statistically significant differences were observed between the models across the three time intervals (p > 0.05). The proportion of Q1/Q2-ranked references was high and comparable between ChatGPT-3.5 (74–82%) and ChatGPT-4o (76–84%).
Conclusion: Both ChatGPT-3.5 and ChatGPT-4o demonstrated substantial alignment with the ESE S3-level clinical practice guideline. However, these findings should not be interpreted as definitive assessments of current clinical conversational AI systems, and further evaluation of evolving models is required.
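The agreement statistics in the Results (percentage agreement and κ) can be illustrated with a minimal sketch. It assumes a pairwise Cohen's kappa between two sessions of yes/no answers; the abstract does not state which kappa variant was used across the three sessions, and the function names and sample answers below are hypothetical, not study data.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of questions answered identically in two sessions."""
    if len(a) != len(b) or not a:
        raise ValueError("sessions must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the marginal frequency of each answer category.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both sessions gave one identical answer throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical answers to ten yes/no questions in two sessions (not study data).
morning = ["yes"] * 6 + ["no"] * 4
evening = ["yes"] * 5 + ["no"] * 5

print(f"agreement = {percent_agreement(morning, evening):.1%}")
print(f"kappa     = {cohen_kappa(morning, evening):.3f}")
```

With more than two sessions, one option is to average the pairwise values; another is a multi-rater statistic such as Fleiss' kappa. Either choice would be an assumption here, since the abstract does not specify the procedure.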
License
Copyright (c) 2026 Saba Kilimci, Elif Delve Başer Can, Jale Tanalp

This work is licensed under a Creative Commons Attribution 4.0 International License.
Acta Odontologica Scandinavica publishes original research papers as well as critical reviews relevant to the diagnosis, epidemiology, health service, prevention, aetiology, pathogenesis, pathology, physiology, microbiology, development and treatment of diseases affecting tissues of the oral cavity and associated structures including papers on cause and effect or explanatory/associative relationships for experimental or observational studies.