1. The UCSF Health problem
Treatment decisions in obstetrics and gynecology often involve several reasonable options. A good decision depends on communicating evidence-based risks and benefits in a manner the patient can understand. Patients may also need support clarifying the values, preferences, and risk tolerance that should drive their choice. This process is limited by clinic time constraints, language concordance, health literacy, and the ability of providers to integrate an evolving evidence base.1,2 Expanding access to high-quality in-clinic counseling is constrained by a shortage of providers and limited appointment lengths, with an average wait time of one month to receive care for conditions that significantly affect quality of life.3,4
Patients have begun to turn to general-purpose large language models (LLMs) for medical information, but these tools are not validated in the clinical setting, and their guidance is not personalized to the patient’s data.5 This underscores the value of developing a validated AI-enhanced counseling tool to augment human-led care.
2. How might AI help?
LLMs have shown immense promise in diagnostic and management reasoning tasks, and their ability to translate between languages and adjust reading level is well-suited to a patient-facing role.6–8 Retrieval-augmented generation from reliable sources has been shown to mitigate hallucinations and inaccuracies that can occur with general-purpose models.9
An LLM chatbot could tailor options counseling to the patient’s specific scenario and provide interactive, language- and reading-level-concordant counseling to augment in-clinic discussions. The chatbot would be integrated into UCSF MyChart and would use retrieval-augmented generation from the patient’s clinical data, including history and symptomatology (now more detailed and accurate thanks to ambient scribing), physical exam, lab and imaging studies, and provider assessments and plans, as well as from high-quality literature and clinical guidelines. Context-rich, language- and reading-level-concordant interactions between the patient and the chatbot would improve patient understanding of their clinical course and treatment options, enabling truly informed treatment decisions. In addition, chatbot interactions can help patients clarify and document their values and preferences as these evolve with their understanding of their condition and options, helping patients and providers prepare for future clinical encounters.
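To make this architecture concrete, here is a minimal sketch in Python of how a retrieval-grounded counseling prompt might be assembled. The data structure, field names, and helper function are hypothetical placeholders; the actual chart extraction, guideline retrieval, and MyChart delivery would be designed with the health AI team.

```python
from dataclasses import dataclass

@dataclass
class PatientContext:
    """Hypothetical structured summary pulled from the chart."""
    history: str           # history and symptomatology (e.g., from ambient scribe notes)
    exam_and_studies: str  # physical exam, lab and imaging results
    assessment_plan: str   # provider's assessment and plan
    language: str          # preferred language, e.g., "Spanish"
    reading_level: str     # target reading level, e.g., "6th grade"

def build_counseling_prompt(patient: PatientContext,
                            guideline_passages: list[str],
                            question: str) -> str:
    """Assemble a grounded prompt: the model is told to draw only on the
    supplied chart summary and retrieved guideline excerpts, and to answer
    in the patient's language at the target reading level."""
    sources = "\n\n".join(f"[Source {i + 1}] {p}"
                          for i, p in enumerate(guideline_passages))
    return (
        "You are a patient-counseling assistant. Ground every statement in the "
        "chart summary and cited sources below; if they do not cover the "
        "question, say so and suggest raising it with the care team.\n\n"
        f"Chart summary:\n{patient.history}\n{patient.exam_and_studies}\n"
        f"{patient.assessment_plan}\n\n"
        f"Guideline excerpts:\n{sources}\n\n"
        f"Respond in {patient.language} at a {patient.reading_level} reading level.\n"
        f"Patient question: {question}"
    )
```

The essential design choice is that the model is explicitly restricted to the supplied context, which is the grounding behavior that has been shown to reduce hallucinations.9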
For example, a patient with abnormal uterine bleeding might be counseled on a variety of medical and surgical treatment options in clinic, then spend time interacting with the chatbot at home, in their own language and at their own reading level, to further review the risks and benefits in the context of their goals and come to a decision that could be communicated to the provider, who would then facilitate treatment. As the patient trials the therapy over the next three months, they can document their progress and any questions about side effects; if the chatbot cannot adequately answer these questions, the patient can escalate to scheduling a clinic appointment. If the patient’s medical conditions outside of Ob/Gyn evolve over time, introducing new relative or absolute contraindications to treatment options, the chatbot can flag these changes to trigger a new discussion between the provider and patient.
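The contraindication flagging in this example could begin as a simple, deterministic rule layer rather than relying on the model itself. A sketch, assuming a clinician-curated contraindication table (the entries below are illustrative, not a complete list):

```python
# Hypothetical, clinician-curated table: treatment -> contraindicating problems.
CONTRAINDICATIONS = {
    "combined_oral_contraceptive": {"venous_thromboembolism", "migraine_with_aura"},
    "estrogen_hormone_therapy": {"venous_thromboembolism"},
}

def flags_for_review(active_treatment: str, new_problems: set[str]) -> set[str]:
    """Return newly documented problems that contraindicate the patient's
    current treatment, so the chatbot prompts a provider discussion rather
    than advising a change itself."""
    return CONTRAINDICATIONS.get(active_treatment, set()) & new_problems

# Example: a new VTE diagnosis is documented during a treatment trial.
print(flags_for_review("combined_oral_contraceptive", {"venous_thromboembolism"}))
# {'venous_thromboembolism'} -> trigger a provider-patient discussion
```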
This context-rich, longitudinal, patient-centered interaction, grounded in relevant, curated literature, would strongly augment brief and infrequent clinical encounters. This approach addresses the need for high-quality, language- and reading-level-concordant counseling without increasing clinical resource utilization. Changes in the treatment plan would necessarily go through a provider, ensuring there is always a human in the loop responsible for confirming that next steps are medically sound.
I propose piloting the chatbot on one or two clinical scenarios with discrete treatment options. However, a strength of my proposal is that this approach can be scaled to scenarios across Ob/Gyn and other specialties. Possible scenarios in Ob/Gyn include:
1) Trial of labor after cesarean vs. planned repeat cesarean: Choosing between trial of labor after cesarean (TOLAC) and planned repeat cesarean requires in-depth discussion of the preferred birth experience, the risk of complications in the context of lifetime reproductive goals, and tolerance for different types of risk.
2) Abnormal uterine bleeding and pelvic pain: Patients with heavy, irregular, and/or painful menstrual periods can choose among several hormonal and non-hormonal medications as well as surgical procedures with varying contraindications, probability of success, and risk.
3) Labor and delivery interventions: Labor and delivery decisions must often be made quickly to balance maternal and fetal well-being, by on-call providers who have sometimes only just met the patient. Unexpected interventions, even with informed consent, can lead to birth trauma and distrust. A typical prenatal visit lasts 20 minutes or less and covers routine screening, management of acute pregnancy and non-pregnancy issues, and anticipatory guidance, making it challenging to proactively counsel patients on the wide variety of possible labor and delivery experiences.
4) Menopause management: There are numerous hormonal and nonhormonal treatments for menopause symptoms, which can have a profound impact on quality of life. The data on the risks associated with these options are complex, making the choice of appropriate, effective treatment a challenge. Patients with cancer-related or cardiovascular risk factors are often given a blanket recommendation to avoid treatment rather than being supported to make an informed decision that incorporates a nuanced understanding of risk.
Other scenarios that require similarly complex counseling include pregnancy options counseling, contraception counseling, and treatment options for pelvic floor disorders.
3. How would an end-user find and use it?
Providers would inform patients about the chatbot during in-clinic counseling. The chatbot could be activated by the provider or automatically triggered for specified diagnoses. The chatbot would be embedded in the MyChart app, which could prompt the patient to use the tool as part of the after-visit summary. Data from the chatbot about patient preferences and treatment progress could be fed into the patient’s chart to enable shared decision-making.
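A sketch of the activation logic, assuming a configurable mapping from encounter diagnosis codes to pilot scenarios (the ICD-10 codes below are illustrative and would be finalized by the clinical team):

```python
# Hypothetical trigger list for the pilot scenarios.
SUPPORTED_SCENARIOS = {
    "N92.0": "abnormal_uterine_bleeding",  # excessive and frequent menstruation
    "O34.21": "tolac_vs_repeat_cesarean",  # maternal care for prior cesarean scar
}

def scenario_for_visit(diagnosis_codes: list[str]) -> str | None:
    """Return the counseling scenario to offer in the after-visit summary,
    or None if the encounter carries no supported diagnosis. A provider
    could also activate a scenario manually, overriding this lookup."""
    for code in diagnosis_codes:
        if code in SUPPORTED_SCENARIOS:
            return SUPPORTED_SCENARIOS[code]
    return None
```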
4. Embed a picture of what the AI tool might look like.
See the attached chat-window mock-up and an example dialogue in Versa chat. Use of retrieval-augmented generation from the medical record and literature (or integration with a model fine-tuned on medical literature, such as OpenEvidence, which has been proposed in other projects for clinical decision support) would make this dialogue even more patient-centered and grounded in evidence.
5. What are the risks of AI errors?
The primary risk for this intervention is hallucination of inaccurate or inappropriate information for a specific patient, such as suggesting an estrogen-containing contraceptive to a patient with risk factors for venous thromboembolic disease. This risk will be mitigated by the following: 1) personalizing counseling using patient-specific medical history and retrieval-augmented generation from reputable literature and guidelines; 2) having expert clinicians evaluate pilot chatbot counseling sessions prior to widespread rollout of this feature; and 3) maintaining a human in the loop so that treatment decisions, prescriptions, and surgical plans are finalized with a human provider.
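As one illustration of these mitigations, a deterministic screen could backstop the model before any draft reply reaches the patient. This keyword-based sketch is intentionally crude, a floor rather than the full mitigation; the contraindicated-term list would be derived from the patient-specific chart data described in 1):

```python
def screen_draft_reply(draft: str, contraindicated_terms: set[str]) -> tuple[bool, str]:
    """Withhold a draft reply that mentions an option on the patient's
    contraindicated list, deferring to the care team instead."""
    flagged = {t for t in contraindicated_terms if t.lower() in draft.lower()}
    if flagged:
        return False, ("I'd like your care team to weigh in on this option, "
                       "since your medical history may affect whether it is safe for you.")
    return True, draft

# Example: estrogen-containing options are blocked for a patient with VTE risk factors.
ok, reply = screen_draft_reply(
    "A combined estrogen pill could help regulate your cycles.", {"estrogen"})
print(ok, "|", reply)  # False | I'd like your care team to weigh in...
```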
6. How will we measure success?
a. A list of measurements using data that is already being collected in APeX
- Proportion of patients with supported medical conditions who are using the chatbot
- Comparison of treatments utilized between chatbot users and nonusers
- Comparisons of the number of clinical interactions (visits, clinician advice messages, phone calls) between users and nonusers of the chatbot (see the sketch following this list)
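A minimal sketch of how the utilization comparison might be computed from an APeX extract; the column names and 90-day window are illustrative assumptions:

```python
import pandas as pd

# Hypothetical APeX extract: one row per patient with a supported diagnosis.
cohort = pd.DataFrame({
    "used_chatbot": [True, True, False, False, False],
    "visits_90d":   [1, 2, 3, 2, 4],   # clinic visits in 90-day follow-up
    "messages_90d": [2, 1, 5, 4, 3],   # advice messages + phone calls
})

# Uptake: share of eligible patients who used the chatbot.
print(f"uptake: {cohort['used_chatbot'].mean():.0%}")

# Utilization: mean clinical interactions, chatbot users vs. nonusers.
print(cohort.groupby("used_chatbot")[["visits_90d", "messages_90d"]].mean())
```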
b. A list of other measurements you might ideally have to evaluate success of the AI
- Accuracy of chatbot advice given during the pilot phase, as graded by expert clinicians. LLMs have also shown promise in automated, large-scale evaluation of LLM tools, and their use could be considered during the evaluation phase (see the sketch after this list).10
- Patient satisfaction scores compared between chatbot users and nonusers
- Provider satisfaction and qualitative perspective on the utility of the chatbot
- Validated measures of decision-quality from pilot users
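For the automated evaluation mentioned above, a second, independent LLM could grade pilot transcripts against a rubric, with a sample of its grades validated against expert clinician review before it is trusted at scale.10 A sketch of such a grader prompt (the rubric criteria are illustrative):

```python
RUBRIC = """Rate the transcript 1-5 on each criterion and justify briefly:
1. Factual accuracy against the cited guideline excerpts
2. Consistency with the patient's chart (no contraindicated suggestions)
3. Appropriate deferral to the care team for treatment changes
Return JSON: {"accuracy": n, "consistency": n, "deferral": n, "notes": "..."}"""

def build_grader_prompt(transcript: str, chart_summary: str, sources: str) -> str:
    """Prompt for an independent grader model; it sees the same grounding
    material the counseling chatbot saw, plus the full transcript."""
    return (f"{RUBRIC}\n\nChart summary:\n{chart_summary}\n\n"
            f"Guideline excerpts:\n{sources}\n\nTranscript:\n{transcript}")
```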
High-quality counseling, high uptake by patients, patient and provider satisfaction, high decision quality, and efficient utilization of clinic resources would support continuing the project. Negative trends in any of these metrics could be grounds to abandon it.
Ammar Joudeh, MD: I am an Assistant Professor in the Department of Obstetrics, Gynecology, and Reproductive Sciences and an Ob/Gyn generalist clinician. I have expertise in prenatal care, labor and delivery care, and the medical and surgical management of benign gynecologic conditions. In addition, I have a track record of managing complex operational research and quality improvement initiatives in settings as diverse as rural Zambian health centers, Indian public schools, and California labor and delivery wards. As the quality improvement chief of my residency program, I focused on labor management and received an award from the midwives in my program for my commitment to patient-centered care. I believe every patient should have access to in-depth, empathetic, and personalized counseling, and I recognize that our time-constrained clinical environment challenges this ideal. I believe this chatbot would be the first of its kind to integrate personalized clinical data and robust evidence to augment patient counseling. I am eager to contribute part of my existing academic time, in addition to the time supported by the initiative, to developing this tool.
Miriam Kuppermann, PhD, MPH will utilize 2% of the 10% effort to co-lead this project. Dr. Kuppermann is a recognized expert in developing and evaluating informed decision-making support tools. She has led numerous intervention studies aimed at improving informed decision making in obstetrics and gynecology among racially, ethnically, and socioeconomically diverse patient populations.11,12 In her own research, she has recognized the limitations of existing patient-facing decision support tools and the difficulty of designing a tool that can cater to the breadth of patient values, preferences, languages, and literacy levels while evolving with changing guidelines. The flexibility and adaptability that AI brings would be a game changer, and Dr. Kuppermann’s expertise in decision support interventions will be invaluable for both the design and evaluation of this tool.
8. Summary of open improvement edits:
In open improvement, I refined the proposal in collaboration with project co-lead Dr. Kuppermann, added Dr. Kuppermann’s bio and a letter of support from my clinical division chief, added references from the literature that support the value of the proposed pilot, and responded to Dr. Pletcher’s comment by adding an example patient dialogue using Versa Chat (see attachments).
References:
1. Truong S, Foley OW, Fallah P, et al. Transcending Language Barriers in Obstetrics and Gynecology: A Critical Dimension for Health Equity. Obstetrics and Gynecology. 2023;142(4):809-817. doi:10.1097/AOG.0000000000005334
2. Yahanda AT, Mozersky J. What’s the Role of Time in Shared Decision Making? AMA J Ethics. 2020;22(5):E416-E422. doi:10.1001/amajethics.2020.416
3. Gravel K, Légaré F, Graham ID. Barriers and facilitators to implementing shared decision-making in clinical practice: A systematic review of health professionals’ perceptions. Implementation Science. 2006;1:16. doi:10.1186/1748-5908-1-16
4. Corbisiero MF, Tolbert B, Sanches M, et al. Medicaid coverage and access to obstetrics and gynecology subspecialists: findings from a national mystery caller study in the United States. Am J Obstet Gynecol. 2023;228(6):722.e1-722.e9. doi:10.1016/j.ajog.2023.03.004
5. Why ChatGPT Has a Better Bedside Manner Than Your Doctor. Bloomberg. Accessed April 21, 2025. https://www.bloomberg.com/news/articles/2025-04-11/why-chatgpt-has-a-bet...
6. Goh E, Gallo RJ, Strong E, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med. 2025;31(4):1233-1238. doi:10.1038/s41591-024-03456-y
7. Brodeur PG, Buckley TA, Kanjee Z, et al. Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician. Preprint posted online 2024. arXiv.
8. Dzuali F, Seiger K, Novoa R, et al. ChatGPT May Improve Access to Language-Concordant Care for Patients With Non–English Language Preferences. JMIR Med Educ. 2024;10:e51435. doi:10.2196/51435
9. Zakka C, Shad R, Chaurasia A, et al. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI. 2024;1(2). doi:10.1056/AIoa2300068
10. Johri S, Jeong J, Tran BA, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med. 2025;31(1):77-86. doi:10.1038/s41591-024-03328-5
11. Dehlendorf C, Fitzpatrick J, Fox E, et al. Cluster randomized trial of a patient-centered contraceptive decision support tool, My Birth Control. Am J Obstet Gynecol. 2019;220(6):565.e1-565.e12. doi:10.1016/j.ajog.2019.02.015
12. Kuppermann M, Kaimal AJ, Blat C, et al. Effect of a Patient-Centered Decision Support Tool on Rates of Trial of Labor After Previous Cesarean Delivery: The PROCEED Randomized Clinical Trial. JAMA. 2020;323(21):2151-2159. doi:10.1001/jama.2020.5952
Comments
I like your idea. Do you have any sort of prototype for this chatbot yet? Have you tried engineering a prompt to launch a chatbot session with Versa?
Thanks for your comment Dr. Pletcher! I added an example dialogue with Versa as an attached document. I am hopeful that, in collaboration with the health AI team, we could integrate the patient's clinical note into the prompt to provide context for the counseling (rather than the patient having to describe their clinical scenario, as I have shown in the Versa example). In addition, I think using a model tuned to medical literature, or building in retrieval-augmented generation from literature curated for specific clinical scenarios, would be an additional step toward robust advice with citations. I see that other submissions have discussed integrating OpenEvidence into APeX to facilitate clinical decision/diagnostic reasoning support, and I think this proposal could also take advantage of that integration for patient-facing decision support.