Artificial Intelligence / Machine Learning Demonstration Projects 2025

Crowdsourcing ideas to bring advances in data science, machine learning, and artificial intelligence into real-world clinical practice.

Improving Surgical Site Infection Reporting using AI


The UCSF Health Problem
Surgical site infections (SSIs) remain the leading surgical complication, contributing to increased morbidity, mortality, extended hospital stays, and significant healthcare costs. Manual surveillance methods, while essential for accurate SSI identification, are resource-intensive and hinder infection preventionists from focusing on prevention efforts. Current semi-automated methods, relying on structured data triggers such as microbiology results or reoperations, have low positive predictive value, leading to extensive manual chart reviews. Given the national priority to improve healthcare-associated infection (HAI) surveillance, there is an urgent need for an automated, AI-driven approach to improve efficiency and accuracy. The primary end-users of the proposed AI solution are the infection preventionists who conduct SSI surveillance at UCSF Health and other healthcare institutions.

How Might AI Help?
Artificial intelligence, particularly generative AI and large language models (LLMs), can transform SSI surveillance by synthesizing complex, unstructured clinical data into meaningful insights. Our AI solution will analyze electronic health record (EHR) data, including provider notes, microbiology results, and interventional radiology reports, to identify deep incisional and organ-space SSIs with high accuracy. The AI will reduce the number of surgical cases requiring any manual chart review; for patients identified as likely to meet the definition of a deep incisional or organ-space SSI, it will also shorten the time required per chart by generating a concise clinical synopsis of key infection-related events. Importantly, in the long term, automating this process has the potential to improve the timeliness of infection detection and reduce the number of infection preventionists required to conduct surveillance. The latter offers both cost savings and, given the national shortage of infection preventionists, potentially critical workforce relief. AI models will be trained using NHSN criteria and validated against historical UCSF SSI outcome data to ensure reliability and clinical utility, and they will be tested prospectively side by side with the current EHR-embedded tool over a defined evaluation period.
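To make the screening step concrete, here is a minimal sketch of how the LLM call might be structured: chart text is paired with a prompt that encodes the NHSN deep incisional and organ-space criteria and asks for a structured determination plus a short reviewer-facing synopsis. The `call_versa` function, the prompt wording, and the JSON schema are illustrative assumptions, not the validated production interface or prompt.

```python
import json

def call_versa(prompt: str) -> str:
    # Placeholder for the approved UCSF Versa/LLM endpoint (assumption, not a real API).
    raise NotImplementedError

NHSN_SSI_PROMPT = """You are assisting with NHSN surgical site infection surveillance.
Using only the chart excerpts provided, decide whether this surgical encounter meets
NHSN criteria for a DEEP INCISIONAL or ORGAN/SPACE SSI within the surveillance window.
Return JSON with keys:
  "determination": one of "deep_incisional", "organ_space", "no_ssi", "insufficient_data"
  "evidence": a list of short quotes supporting the determination
  "synopsis": a 3-5 sentence summary of key infection-related events for the reviewer
Chart excerpts:
{chart_text}
"""

def screen_encounter(chart_text: str) -> dict:
    """Run one surgical encounter through the screening prompt and parse the structured output."""
    raw = call_versa(NHSN_SSI_PROMPT.format(chart_text=chart_text))
    result = json.loads(raw)
    # Anything not clearly negative is routed to the infection preventionist workqueue.
    result["needs_review"] = result["determination"] != "no_ssi"
    return result
```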

 

How Would an End-User Find and Use It?
The AI tool will be integrated within UCSF’s APeX EHR system, presenting infection preventionists with a streamlined workflow. The tool could be automated to run in the background on a daily basis; when a potential SSI is identified, a flag and a short AI-generated infection summary would be added to the patient’s chart. These patients could then go into a workqueue for infection preventionists to review and either confirm or override the AI’s determination. This reduces the time spent per patient while maintaining a human-in-the-loop approach for final decision-making. It will be particularly useful during routine surveillance activities, allowing infection preventionists to focus on true positive cases rather than spending time on false leads. Because the current infection preventionist workflow already depends on reviewing cases in a workqueue, the proposed workflow does not significantly alter it, and the training required for adoption should be minimal.

 

Example of AI output

 

What Are the Risks of AI Errors?
AI-based SSI detection introduces risks such as false positives, false negatives, and AI hallucinations (erroneous conclusions). False positives may result in unnecessary investigations, while false negatives could lead to missed SSIs, impacting patient safety and regulatory compliance. However, thus far, this tool has been tested on a subset of NHSN cases from 2024 and maintains a 100% negative predictive value when compared to current SSI reporting.

To mitigate these risks, our model will undergo rigorous validation with retrospective and prospective datasets. Continuous performance monitoring, bias assessment, and iterative refinements will ensure accuracy. The AI tool will always function as an assistive technology rather than an autonomous decision-maker, with infection preventionists retaining ultimate control over SSI determinations.

How Will We Measure Success?
We will assess success based on adoption, impact on infection preventionist workflow efficiency, and improvements in SSI detection rates. Metrics include:

a. Measurements using existing APeX data:

  • Number of SSIs detected pre- and post-AI implementation
  • Time spent per chart review (EHR audit logs)
  • Positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity of AI-generated determinations (see the illustrative calculation following this list)
    • Given that this is intended to be a screening tool, we will prioritize maximizing sensitivity and negative predictive value to ensure that potential SSI cases are not missed. We will also work closely with the UCSF infection prevention group to ensure any potential tradeoffs are balanced and reasonable for their workload.
  • Infection preventionist workload reduction (number of cases reviewed)

b. Additional measurements for evaluating success:

  • Infection preventionist user satisfaction surveys
  • Qualitative analysis of AI explanation trustworthiness
  • Analysis of bias across demographic and clinical subgroups
  • Impact of tool on SSI standardized infection ratios
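
To illustrate how the determination metrics in list (a) could be computed once AI outputs are paired with adjudicated NHSN outcomes, the sketch below derives sensitivity, specificity, PPV, and NPV from simple confusion-matrix counts; the input format is a hypothetical pairing of per-case AI and gold-standard labels.

```python
def screening_metrics(ai_positive: list[bool], truth_positive: list[bool]) -> dict:
    """Compute screening performance from paired AI determinations and adjudicated SSI outcomes."""
    pairs = list(zip(ai_positive, truth_positive))
    tp = sum(a and t for a, t in pairs)
    fp = sum(a and not t for a, t in pairs)
    fn = sum(not a and t for a, t in pairs)
    tn = sum(not a and not t for a, t in pairs)
    safe = lambda num, den: num / den if den else None
    return {
        "sensitivity": safe(tp, tp + fn),  # priority metric: missed SSIs are the costliest error
        "specificity": safe(tn, tn + fp),
        "ppv": safe(tp, tp + fp),
        "npv": safe(tn, tn + fn),          # priority metric for a screening tool
    }
```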

Describe Your Qualifications and Commitment
This project is spearheaded by Elizabeth Wick, MD (Colorectal Surgery, UCSF Vice Chair of Quality and Safety), Deborah Yokoe, MD, MPH (Infectious Diseases, UCSF Medical Director for Hospital Epidemiology and Infection Prevention), and Logan Pierce, MD (Hospital Medicine/DoC-IT, Managing Director of Data Core). This interdisciplinary team, comprising infection prevention experts, surgeons, informaticists, and infection preventionists, is committed to refining and deploying this AI-assisted surveillance tool.

Elizabeth Wick, MD, is a Professor of Surgery and Vice Chair for Quality and Safety in the Department of Surgery at UCSF. She is an expert in surgical quality improvement and SSI prevention and has led multiple national initiatives focused on improving surgical outcomes. Dr. Wick has extensive experience with NHSN surveillance methods and has been instrumental in developing strategies to reduce the burden of manual data collection. She will provide leadership in integrating AI solutions into infection prevention workflows and ensure that the project aligns with national quality improvement priorities.

Deborah Yokoe, MD, MPH, is an international leader in healthcare epidemiology and infection prevention. As the Medical Director of Infection Prevention and Control at UCSF Health, she has a deep understanding of NHSN surveillance definitions and infection preventionist workflows. Dr. Yokoe has played a key role in shaping national infection prevention strategies and brings critical expertise in evaluating AI-generated SSI determinations. She will ensure that AI implementation is clinically sound and supports infection preventionists in making accurate and timely SSI identifications.

Logan Pierce, MD, is board-certified in both clinical informatics and internal medicine. He is the Managing Director of UCSF Data Core, a team of physician data scientists dedicated to utilizing EHR data to improve healthcare outcomes. He has experience using large language models to extract data from clinical text. Dr. Pierce will actively contribute throughout the development lifecycle, ensuring alignment with UCSF Health priorities and participating in regular progress reviews with Health AI and AER teams.

 


GPT-4-DCS: A Large Language Model Pilot to Reduce Hospital Discharge Summary Documentation Burden and to Enhance Quality and Safety


 

Section 1. The UCSF Health Problem 

Problem Statement: Across UCSF Health, nearly 59,000 hospital discharge summary (DCS) narratives are manually written each year. As one of the longest, yet most important, forms of clinical documentation, the DCS places substantial documentation burden on inpatient providers. Unfortunately, as both the literature and inpatient provider experience have shown,1 producing a high quality DCS in a timely manner is challenging. Whether providers are pressed for time or are simply not aware of all of the details of a patient’s hospital encounter (the discharging physician at UCSF is on average the last of 3 sequential physicians having cared for a patient, and therefore may not be aware of all events throughout the encounter), physician-written discharge summaries are not error free, as we recently demonstrated in a UCSF study accepted by JAMA Internal Medicine and currently available as a preprint.2 (Figure 1) Poor quality discharge summaries may then have multiple downstream ripple effects impacting subsequent quality of care and patient safety. 

Background: The hospital DCS is an accounting of important hospital events and treatments that must be identified, synthesized, and composed by the discharging provider. A substantial contributor to documentation burden (often described as an epidemic3) is the DCS, a uniquely time intensive source of burden affecting not only all inpatient providers directly but, depending on the quality of the summary, downstream providers too (e.g. primary care physicians, Skilled Nursing Facility physicians, and subsequent inpatient providers who commonly rely on reviewing the prior DCS when readmitting a patient).

The sequelae of this documentation burden include reduced face-to-face time on inpatient care, increased medical error rates, reduced document quality, physician burnout, and attrition. In one study, over 44% of hospital physicians reported not having sufficient time to compose high quality discharge summaries.1 Burden that is unique to the DCS derives from the need to review notes, procedures, and events throughout the hospital encounter (the longer the encounter, the more difficult the task); synthesize; reconstruct; and manually compose a problem-by-problem narrative of important hospital events and treatments. The discharging physician is often in the position of trying to reconstruct events that occurred before his or her care for the patient. Furthermore, because the hospital is a busy setting, with the physician taking care of multiple patients and being paged on average every 15 minutes (internal study), composing a summary is done either while the physician is on service in a highly interruptive environment or after hours in “pajama time.” Ultimately, not only is documentation burden recognized by leading healthcare organizations as a critical systemic problem, but it has also been designated by UCSF Health as a high-priority IT Portfolio Initiative for FY2025. Additionally, increasing the efficiency of DCS narrative production offers an important secondary benefit: facilitating discharge readiness earlier in the day, thereby opening inpatient beds earlier and decompressing the number of patients boarding in the emergency department (also a UCSF priority). Therefore, this proposal aligns with several existing UCSF priorities.

Section 2. How might AI help? 

AI assistants are already being piloted in many health systems to reduce documentation burden through ambient scribing (UCSF is currently piloting this technology) and AI-generated draft responses to inbox messages, with statistically significant reductions in burden and burnout.4 Although AI’s potential for higher-stakes use cases such as medical decision making is still being explored, LLMs are well known to excel in lower-stakes use cases such as medical summarization.5 We have already demonstrated the safety and feasibility of using Versa to generate the DCS narrative by reading through all hospital encounter notes and comparing GPT-4 Turbo’s output to physician-generated summary narratives in a retrospective study that has been accepted for publication by JAMA Internal Medicine. According to the processes outlined by the UCSF AI Governance Committee, the next step following the retrospective analysis we did for the JAMA IM paper would be a prospective pilot, which we are hereby proposing.

While the proposed pilot targets DCS for the Hospital Medicine service (as in our JAMA IM paper), the value of an LLM-drafted discharge summary has tremendous potential to scale across all inpatient specialty services at UCSF.

Section 3. How would an end-user find and use it?

An end-user would invoke the LLM-generated DCS narrative directly in the existing discharge summary workflow in APeX. (Figure 2A) As discussed with the APeX Enabled Research (AER) team on March 10, 2025, any inpatient provider writing a hospital DCS could optionally launch the LLM simply by following the usual workflow and clicking on “Discharge Summary,” which would then be followed by the option to choose an LLM-generated draft for provider review. Because Versa is HIPAA-compliant, UCSF is uniquely positioned to do this work and become a national leader in LLMs for the DCS.

Section 4. Embed a picture of what the AI tool might look like (Figure 2)

Figure 2A shows the existing APeX workflow for generating a discharge summary. The inpatient provider selects “Discharge Summary” from within the Discharge Navigator tab, opening a Discharge Summary template. Many of the items in the template (e.g. medication list, referrals, etc.) will auto-populate from APeX. However, the summary narrative (“History (with Chief Complaint)” and “Brief Hospital Course by Problem”) must currently be manually composed by the inpatient provider. This is the section that involves the most documentation burden. Figure 2B shows sample LLM output where the drafted discharge summary narrative would be placed and available for provider review. Furthermore, Epic has the ability to optionally include citations/links connecting LLM-generated statements to information sources from the hospital encounter, thereby facilitating review and verification by the provider.

Section 5. What are the risks of AI errors

The proposed use of AI for the DCS involves a human-in-the-loop. Namely, inpatient providers must review LLM-drafted DCS much in the same manner as they must review drafted inbox replies to patients. It remains the inpatient provider’s responsibility to ensure the accuracy of the DCS. Nevertheless, in our retrospective study, not only did we find that providers reviewing the summaries (blinded as to whether the summary was LLM- or physician-generated) had equal preference for both (χ2 = 5.2, p=0.27) and gave them similar overall quality ratings (3.67 [SD 0.49] vs 3.77 [SD 0.57]; p=0.674), but also that there was no difference in the harmfulness scores associated with LLM vs. physician errors. While accurate medical summarization is important (we found that physicians also made errors of omission, inaccuracy, and even hallucinations), the stakes of AI use for medical summarization with a human-in-the-loop are lower than for AI used in medical decision making. Finally, identifying errors in the LLM-generated content could optionally be accomplished by enabling a reporting button within APeX with which inpatient providers could report potential errors for further investigation and mitigation. We also hope to be one of the first use cases for UCSF’s Impact Monitoring Platform for AI in Clinical Care (IMPACC), which is being built for continuous, automated, longitudinal AI monitoring.

Section 6. How will we measure success

There are many potential approaches to deploying the intervention and metrics to measure success. In this high-level overview, it is critical that the deployment approach minimizes friction to existing clinical workflows that could limit adoption and user satisfaction. We therefore propose an observational cohort approach allowing each inpatient provider the option to choose an LLM-drafted summary for each patient. Although a randomized controlled trial (RCT) – either randomized at the provider level (certain providers opted in for all of their patients) or at the patient level (the same provider could use the LLM drafts for some patients but not others) – would offer the most rigorous evidence, an RCT could cause substantial workflow friction.

Therefore, we will measure success in an observational cohort study on two main domains:

  1. Measurements using data already being collected in APeX: Leveraging the APeX audit logs, an area of deep expertise in our research group, we will assess the burden on inpatient providers using the LLM-drafted content by (1) measuring the amount of manual editing of the LLM-drafted narrative the provider makes (character count and % of narrative text edited as proxies for burden, illustrated in the sketch after this list), and (2) measuring time savings, defined as the amount of time spent either editing LLM-drafted narratives or composing provider-generated narratives.
  2. Measurements not necessarily available in APeX, but ideal to have: In any new technology deployment, user satisfaction is important to capture. If possible, we will attempt to capture user satisfaction with the LLM-generated narrative process by collecting Net Promoter Scores from inpatient providers using the LLM-drafted narratives, as well as from downstream recipients (e.g. PCPs, SNF physicians). Also ideal, if possible, would be to capture in APeX inpatient provider flags of potential errors (inaccuracies, omissions, hallucinations), not only as a measure of quality and safety but also so that investigators can identify concerns and develop mitigation strategies. Optionally, the investigator team can quantify these error types, or (in a more scalable approach) we could potentially embed a multi-agentic platform such as CrewAI (crewai.com) to allow multiple LLMs to check one another for errors.
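
As a sketch of how the editing-burden proxy in item 1 might be computed from the draft and signed narratives, the example below uses Python’s difflib to estimate the percentage of an LLM-drafted narrative that the provider changed; this is an illustrative calculation, not a finalized metric definition.

```python
from difflib import SequenceMatcher

def percent_edited(llm_draft: str, final_note: str) -> float:
    """Estimate the % of the LLM-drafted narrative changed by the provider (0 = unedited, 100 = fully rewritten)."""
    similarity = SequenceMatcher(None, llm_draft, final_note).ratio()
    return round((1.0 - similarity) * 100, 1)

def character_delta(llm_draft: str, final_note: str) -> int:
    """Net change in character count between the LLM draft and the signed narrative."""
    return len(final_note) - len(llm_draft)
```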

Section 7. Describe your qualifications and commitment:

This project is led by Dr. Benjamin Rosner, MD, PhD, FAMIA. Dr. Rosner is a hospitalist, a clinical informaticist, an AI researcher within DoC-IT, and the Faculty Lead for AI in Medical Education at the School of Medicine. He has years of experience developing, testing, and deploying digital technologies into healthcare, and he was the lead author of the retrospective study that underpinned the evidence for this solution. Dr. Rosner has worked directly with AER through the Digital Diagnostics and Therapeutics Committee (DD&T) for 6 years.

Charu Raghu Subramanian, MD is a Clinical Informatics Fellow, a hospitalist, and a co-lead author on the retrospective research study that underpinned the evidence for this pilot. She holds certifications in Epic Clarity and as an Epic Builder.

Citations 

1.    Sorita A, Robelia PM, Kattel SB, et al. The ideal hospital discharge summary: A survey of U.S. physicians. J Patient Saf. September 6, 2017. doi:10.1097/PTS.0000000000000421

2.    Williams CYK, Subramanian CR, Ali SS, et al. Physician- and Large Language Model-Generated Hospital Discharge Summaries: A Blinded, Comparative Quality and Safety Study. medRxiv. September 30, 2024. doi:10.1101/2024.09.29.24314562 https://www.medrxiv.org/content/10.1101/2024.09.29.24314562v1 (also accepted for publication in JAMA-IM, with anticipated publication in May, 2025)

3.    Hobensack M, Levy DR, Cato K, et al. 25 × 5 Symposium to Reduce Documentation Burden: Report-out and Call for Action. Appl Clin Inform. 2022;13(2):439-446. doi:10.1055/s-0042-1746169

4.    Garcia P, Ma SP, Shah S, et al. Artificial Intelligence-Generated Draft Replies to Patient Inbox Messages. JAMA Netw Open. 2024;7(3):e243201. doi:10.1001/jamanetworkopen.2024.3201

5.    Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30(4):1134-1142. doi:10.1038/s41591-024-02855-5

 

Summary of Open Improvement Edits

  • Added suggested measurement concepts to allow inpatient providers to flag potential errors (inaccuracies, omissions, hallucinations)
  • Added option for investigator team to quantify errors
  • Added option for multi-agentic platform to identify and quantify errors
  • Added Medrxiv link to manuscript in citation #2 (to be published in JAMA-IM in May, 2025)

Adaptive closed-loop large language model platform to improve imaging surveillance of intracranial tumors


1. The UCSF Health problem: Imaging surveillance represents a cornerstone of brain tumor management and includes surveillance of incidental lesions that may require future treatment as well as post-treatment follow-up to ensure disease control. Intracranial lesions are relatively common on magnetic resonance imaging and are incidentally seen in 0.7-1.6% of the general population.1,2 These typically include “benign” tumors that are slow growing and may require years of follow-up. Timely detection of tumor growth on imaging surveillance represents a clinical opportunity for intervention, potentially with less invasive methods such as radiosurgery. Similarly, adherence to follow-up for patients with previously treated non-malignant brain tumors is critical, as these individuals often have decades of expected life with an ongoing need for imaging surveillance to detect recurrence. Follow-up non-compliance remains a hurdle to care, yet prior studies evaluating rates of follow-up loss and the impact of missed opportunities for intervention are sparse,3–6 with some reporting non-compliance rates of more than 20%.3,4 Thus, there is a need for novel healthcare interventions that can improve compliance with imaging surveillance for patients with brain tumors and minimize follow-up loss over long periods of time.

This proposal aims to create a closed-loop adaptive AI system in Epic that incorporates a large language model (LLM) to 1) identify provider-requested imaging modality and follow-up time period and 2) predict the probability of follow-up loss to generate tailored reminders for follow-up scheduling. This system could also adapt and modify reminder frequency longitudinally based on patient compliance. End-users include neurosurgical and neuro-oncology providers who evaluate brain tumor patients as well as patients who will receive reminders about imaging and clinic follow-up. Closed-loop systems without the use of LLMs have been used in other clinical contexts to encourage follow-up.7,8 However, this would be a novel approach to this clinical challenge.

2. How might AI help? LLMs provide the opportunity to improve imaging and clinical follow-up through several avenues. An LLM can assess the “assessment and plan” section in clinical documentation to identify provider-recommended follow-up time-period and imaging modality. More importantly, the model may assess clinical data (e.g. past medical history, age, functional status, etc), demographic data (e.g. distance traveled, insurance status, race/ethnicity, etc), and social history (family support, employment status, etc) at the time of the encounter to aid in prediction of risk of follow-up loss. “Non-compliance” risk assessment by the model could provide the ability for an adaptive and tailored closed-loop reminder system with risk assessment and reminder schedules updated at each subsequent visit.
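
As one illustration of the extraction task, a minimal sketch is shown below: the “assessment and plan” text is sent to the model with a prompt requesting the recommended follow-up interval and imaging modality in structured JSON, and a light validation step gates what feeds into the reminder framework. The `call_llm` placeholder, the prompt wording, and the output schema are assumptions for illustration only.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder for the Epic/Versa-integrated LLM call (assumption, not a real API).
    raise NotImplementedError

EXTRACTION_PROMPT = """From the assessment and plan below, extract the provider-recommended
imaging follow-up. Respond with JSON containing three keys: "modality" (e.g. "MRI brain with
and without contrast"), "interval_months" (an integer), and "confidence" ("high" or "low").

Assessment and plan:
"""

ALLOWED_MODALITIES = ("mri", "ct")  # illustrative whitelist of expected imaging modalities

def extract_followup(note_text: str) -> dict | None:
    """Return the extracted follow-up plan, or None (route to manual review) if validation fails."""
    parsed = json.loads(call_llm(EXTRACTION_PROMPT + note_text))
    modality_ok = any(m in str(parsed.get("modality", "")).lower() for m in ALLOWED_MODALITIES)
    interval = parsed.get("interval_months")
    interval_ok = isinstance(interval, int) and 1 <= interval <= 120
    return parsed if (modality_ok and interval_ok) else None
```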

Currently, follow-up compliance relies on clinical staff booking an appointment, ordering follow-up imaging, and reaching out to patients. Staff often set Epic self-reminders that will notify them to initiate the next appointment. However, this still relies on personnel following up on these reminders with potential oversight due to human errors and staff turnover. Additionally, there are difficulties tracking this long-term especially when follow-up may extend out to more than 10 years from diagnosis or treatment. This proposed LLM-based closed-loop system could produce automated reminders (MyChart, email, text, Epic Letters) in a tailored, data-driven fashion to serve as an automated aid to clinical staff and ensure imaging follow-up. The model could also be expanded to continue to assess risk of non-compliance and adapt reminders longitudinally along a patient’s imaging surveillance course.

3. How would an end-user find and use it? The AI support program could be activated through APeX at the time of a clinic visit by any neurosurgical or neuro-oncology provider. At this timepoint, the model would be able to 1) identify the imaging follow-up timepoint and modality placed in the note by the provider, 2) provide an LLM-based risk assessment of follow-up non-compliance, which would be reported to providers and fed forward into a reminder system framework, and 3) initiate a tailored set of reminders to patients at pre-specified time points based on risk profile and tumor type. Additionally, the model could initiate reminders to providers to ensure that imaging orders are placed in preparation for clinic appointments. As the visit approaches, the model could detect when appointments are created and whether patients attend those appointments. If compliance is not met, the model could trigger additional patient reminders and incorporate this information into future improvements of its predictive performance. Reminders will be mediated through MyChart digital letters, text messages, and automated telephone outreach, depending on patient-selected contact preference. There is an opportunity to tailor the outreach modality as well.

4. Embed a picture of what the AI tool might look like: We anticipate the AI tool can initially be an “opt in” patient care option for practitioners. There could be a “Yes/No” option in the “Wrap up” section of APeX during an outpatient encounter. Once this is selected, the model would be activated and report a “risk profile” group for that patient, display the provider’s recommended imaging/clinic follow-up pulled from the note for practitioner review, and display a tailored reminder schedule for the patient with the ability for minor customizations by the practitioner to reminder templates. As the next follow-up date nears, the reminders would be sent and the model would adapt based on whether the patient schedules follow-up or not. If no appointment is scheduled within the specified follow-up time period, then the model will initiate additional reminders with feedback to the LLM for risk prediction refinement. If compliance is met with scheduling of a visit, the model can re-initiate new reports on provider-specific imaging follow-up for that visit, an updated risk profile, and new reminder schedules. See supporting Figures 1 and 2 for mockups of the AI tool and workflow.

5. What are the risks of AI errors? AI errors in the context of this platform could lead to misinterpretation of planned imaging follow-up, secondarily leading to incorrectly timed follow-up reminders. This could lead to patient confusion and a potential need for clarification/error correction from clinical staff. The model may incorrectly identify “at risk” patients for follow-up non-compliance, which may change practitioner interactions or management strategy considerations. However, at this time, clinical management strategies will not be changed based on the perceived risk of loss of follow-up. There are several methods to mitigate these errors. Standard practice in neurosurgery clinics is for patient navigators to set reminders on scheduling imaging follow-up. This could be continued on initial implementation of the platform to serve as a “backstop” to ensure that patients are appropriately scheduled. Providers would also be able to check the AI-provided imaging modality and time period for follow-up, as this information would be displayed in the Epic function. Additionally, reminders will be a template outreach with contact information provided so that patients may call back with any questions or requests for clarification.

6. How will we measure success? Success with this proposal can be measured in several ways, which relate to model development, completion of a closed-loop adaptive framework, and implementation in the clinical setting with the possibility of a prospective, randomized, interventional study. Here are the specific aims/goals of the proposal:

Aim 1: Develop an LLM to reliably extract the practitioner-recommended follow-up time period and the corresponding follow-up imaging modality from clinical documentation, identify patients at risk of non-compliance, and provide these data as outputs. The target population will include those who undergo upfront imaging surveillance or post-treatment imaging surveillance for brain tumors within the neurosurgery practice. The initial model can be developed based on de-identified clinical documentation in a retrospective fashion, with further model performance assessment and refinement based on prospective data collection feedback from Aim 3. We will quantify the accuracy of LLM identification of recommended follow-up time and modality, with a target of over 98% accuracy. The performance of the model for risk assessment should include an AUC of greater than 0.7.

Aim 2: Develop a closed-loop framework in Epic to incorporate LLM-identified imaging follow-up timepoint and risk profile to implement tailored patient reminders with adaptive capabilities in the setting of non-compliance. Patient reminders will be mediated through MyChart digital letters, email, text messages, and automated telephone outreach. Reminders to practitioners for imaging orders can also be triggered within this framework. We will quantify the accuracy of execution of reminders based on risk assessment by the model as a marker of success.

Aim 3: Conduct a prospective, randomized, interventional study with a comparison of follow-up adherence between the closed-loop LLM adaptive reminder system (intervention arm) and standard clinical practice reminders (control arm) for patients undergoing upfront imaging surveillance or post-treatment imaging surveillance for brain tumors within the neurosurgery practice. The study endpoint will be to examine the rate of follow-up within 1 year of a specified follow-up time point between the closed loop LLM-based system reminder intervention arm and the standard clinical practice reminder system. Additionally, prospective validation of the model’s ability to identify patients at risk of follow-up non-compliance will be assessed with refinement of the model to improve predictive capabilities.

7. Describe your qualifications and commitment: I am an Assistant Professor and clinical faculty within the UCSF Department of Neurological Surgery with a surgical practice focused on brain tumors and skull base lesions. Many of the patients I manage have tumors that are considered “benign” or slow growing and require long periods of follow-up, often over 5-10 years. Many of these patients will either require future treatment with an initial period of imaging surveillance to assess for tumor growth or will need long periods of follow-up after upfront treatment. There are limited consensus recommendations for duration of imaging follow-up for non-malignant tumors, either for patients who undergo imaging surveillance as an imaging strategy or in the postoperative period. In general, for many of these patients, imaging every 1-3 years is required. I have been working directly with Dr. Madhumita Sushil, who is an Assistant Professor in the Division of Clinical Informatics and Digital Transformation (DoC-IT) - Department of Medicine, the Department of Neurological Surgery, and the Bakar Computational Health Sciences Institute (BCHSI). Dr. Sushil has expertise in the development of LLMs and will provide guidance on model training, development, and implementation. I have support from my department to participate in regular work-in-progress sessions and collaborate with the Health AI and AER teams to develop and implement this proposal. The framework from this proposal could be more broadly implemented in other disease contexts across the institution (outside of brain tumors), improve the utilization of UCSF-based imaging centers, provide for improved patient quality of care, and streamline outpatient workflow.

Harnessing Artificial Intelligence to Develop an Interdisciplinary Approach to Reduce Hospital-Acquired Pressure Injury (HAPI)


The UCSF Health Problem: Hospital Acquired Pressure Injury (HAPI) is a preventable injury to skin or soft tissue that is acquired during a patient’s hospital stay. Reducing HAPI rates is a top priority for UCSF Health leadership as the occurrence of HAPI is detrimental to patient experience and outcomes, results in significant costs (estimated cost to the health system for 1 HAPI is $18,000-$27,000), and is a critical quality measure in the evaluation of hospital performance. The creation of interdisciplinary workflows, utilization of structured problem-solving, and electronic dashboards to monitor HAPI rates and risk have reduced HAPI rates at Benioff Children’s Hospital (BCH) by 64% over the last 15 months. However, poor HAPI bundle compliance continues to be a driver of HAPI as current HAPI rates are 27% higher than national benchmarks. Critical care units account for 95% of HAPI at BCH.

The Gap: There is a lack of real-time electronic medical record (EMR) tools that assist interdisciplinary bedside teams to more effectively support efforts to prevent HAPI.

The objective of this proposal is to develop an AI report to reduce HAPI in critical care units at BCH (Oakland and SF) by improving bundle compliance through an interdisciplinary approach.

Generalizability: The AI report can be leveraged to support HAPI prevention and management across UCSF hospitals (including adult patients) and could also be adapted to prevent other top harms (e.g., catheter-associated bloodstream infections).

How might AI help? Bedside clinical teams document hundreds of clinical observations related to evidence-based risk factors for HAPI prevention daily in flowsheet rows and notes for each patient, making it challenging to assess if care provided meets expected standards. For example, routine aspects of patient care (such as nutrition delivery) are often disrupted due to procedures, feeding intolerance, or other factors, leading to gaps in knowledge of actual vs ideal state of delivered nutritional goals. Lastly, there are knowledge gaps regarding how to modify care goals when patients with risk factors for developing HAPI are identified.

Multimodal generative AI is ideal for creating a report that can summarize clinical data because it can seamlessly process and integrate diverse data types—such as structured data and free-text notes—into a unified model. By combining techniques like natural language processing for unstructured text and machine learning for structured data, the AI report can extract meaningful insights across all data formats. Importantly, its generative capabilities enable it to summarize clinical data and propose actionable content (such as tailored recommendations) in real-time.

Data Sources: All data elements requested are documented either in flowsheet rows or in templated notes in APeX at the individual patient level. Data requested for this report are based on published, validated risk factors for HAPI development.

  • Structured data: Nutrition (e.g. formula or TPN prescribed, rate of formula/TPN delivery, hours over which nutrition is delivered, regular diet order, or percent of regular diet finished); HAPI bundle elements (e.g. repositioning [time of turn and position in which the patient is repositioned], perfusion, skin hygiene, mobility promotion, device rotation, application of barrier creams); Medications (e.g. administration of vasoactive medications, medications administered through central lines); Medical devices/LDAs (e.g. endotracheal tube, noninvasive positive pressure device).
  • Unstructured data: Goal nutritional intake (found in the templated registered dietitian note); HAPI injury status (found in templated wound care notes).

How would an end-user find and use it?

Location: The AI report will be found in the “Summary Tab,” as nursing, respiratory therapy, and physicians/advanced practice providers (APPs) routinely access this tab when viewing patient charts in the inpatient setting. The proposed alert to remind bedside staff to perform key HAPI bundle components will trigger when a bundle element is overdue at the time a patient’s chart is opened.

Timing of Support: Support will be most effective in real-time, as HAPI prevention efforts are carried out as frequently as every two hours in critical care units. Reminders to comply with unit policies (bundle assessment/documentation) and suggested recommendations will be made available as the need for improvement is identified.

What might end users see? The AI tool will generate a report summarizing compliance with best practices for the current day and the past week. It will also provide recommendations to improve bundle compliance. This alert can be temporarily silenced if the patient’s clinical status prevents compliance. For physicians and APPs, a suggestion box with action items will appear when they click on the hyperlink under the suggested interventions section. For immediate action items (e.g., placing a wound care consult), an alert will appear as soon as the chart is opened.

How are recommendations explained: The non-compliant bundle element and the reason for non-compliance will be visually highlighted, along with a brief suggestion for getting back into compliance. Some interventions need only brief explanations (e.g., unit policy is to turn the patient every 2 hours, but it has been 4 hours since the last turn). For more nuanced suggestions (such as optimizing nutrition), the report will describe why nutrition is not at goal and for how long, and propose actionable suggestions (such as possible means of nutritional delivery).
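
For the simpler rule-based explanations, such as the repositioning example above, the compliance check can be expressed directly against flowsheet timestamps. Below is a minimal sketch assuming a two-hour unit turn policy and hypothetical field names; the production logic and policy intervals would be set with the clinical teams.

```python
from datetime import datetime, timedelta

TURN_INTERVAL = timedelta(hours=2)  # illustrative unit policy; would be configurable per unit

def repositioning_status(last_turn_documented: datetime, now: datetime) -> dict:
    """Flag an overdue turn and draft the brief explanation shown in the report."""
    elapsed = now - last_turn_documented
    overdue = elapsed > TURN_INTERVAL
    hours_since_turn = round(elapsed.total_seconds() / 3600, 1)
    policy_hours = int(TURN_INTERVAL.total_seconds() // 3600)
    if overdue:
        message = (f"Repositioning overdue: last documented turn was {hours_since_turn} h ago "
                   f"(unit policy: every {policy_hours} h).")
    else:
        message = "Repositioning bundle element in compliance."
    return {"element": "repositioning", "overdue": overdue, "message": message}
```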

Picture of Embedded AI Tool:

 

What are the risks of AI error? 

AI errors could be caused by missing data and hallucinations. Missing data could lead to both under- and overestimation of HAPI risk. However, because a key component of this report is to increase bundle compliance (and, by association, documentation of bundle compliance), we plan to mitigate errors from missing data by improving data entry. To validate the AI model, domain experts will regularly review its outputs and compare them with existing data reports (HAPI dashboard, nurse audit data) and analysis of component data (e.g., skin assessment, patient positioning, etc.) through Clarity queries. Additionally, the accuracy of information extracted from note text will be assessed by comparing AI output with text matching (text mined from templated notes [existing standard]). The project lead has extensive experience analyzing structured and unstructured EMR data (Mahendra et al, Pulm Circ, 2023; Mahendra et al, Crit Care Explor, 2021). A feedback button will also be available for end-users to report inaccuracies.

Feedback: A feedback button will be available for end-users to report inaccuracies at any time in the EMR.  Additionally, we will build an active user-centered design process working group to seek and incorporate iterative input from end users as the report is developed and implemented in the clinical setting. Feedback will be used to 1. Continuously retrain and refine the model for improved accuracy and reliability 2. Optimize bedside use of this tool 3. Identify other factors (staffing, equipment availability, etc) that may be identified as important factors to monitor to improve HAPI prevention effectiveness. 

How will success be measured? Success will be defined by both outcome and process measures. The primary outcome measure will be the BCH rate of Stage 2 and greater HAPIs per 1,000 patient days. The goal is a sustained reduction of HAPI rates to below national benchmarks across BCH. HAPI rates are readily available, as they are closely monitored across UCSF Health in the zero-harm dashboard. The secondary outcome is improved adherence to the HAPI prevention bundle (a key measure of adoption and utilization of the tool). Documentation of bundle compliance will be assessed through queries of structured flowsheet row data and in-person audits (routinely performed by nursing and documented in the digital rounding tool). An automated dashboard to report bundle compliance is also being built by UCSF Health IT.

Qualifications: Malini Mahendra MD (project lead) is a pediatric intensivist and data scientist. As a certified Clarity data analyst with expertise in the use of machine learning and natural language processing algorithms, she has extensive experience analyzing UCSF EMR data for quality improvement. She is also the local quality improvement and informatics lead for the Mission Bay pediatric intensive care unit.

Deborah Franzon MD, MHA (collaborator) is the Executive Medical Director for Quality and Safety at BCH with over 20 years of clinical expertise and implementation science experience. Dr. Franzon has been the recipient of prior UCSF open proposal awards, including Learning Health System Innovations (2017-2019), building a machine learning predictive model for extubation in PICU patients that is soon to be integrated into Epic, and Caring Wisely (2021), to reduce delirium in PICU patients across BCH SF and Oakland. She developed, implemented, and published on the impact of an EHR-enhanced dashboard in reducing CLABSI rates in the PICU (Pediatrics, 2014). She has a proven track record of reducing harm, improving clinical outcomes, and driving data-informed, innovative change through collaborative, patient-centered leadership.

This project is supported by BCH Leadership: Nicholas Holmes MD MBA (President of BCH), Joan Zoltanski MD MBA (Chief Medical Officer of BCH), Judie Boehmer RN MN (Chief Nursing Officer BCH), Michael Lang MD (Chief Medical Information Officer for Children's Services), Jeff Fineman MD (Pediatric Critical Care Division Chief), Shan Ward MD and Loren Sacks MD (MB Pediatric ICU and Pediatric Cardiac ICU Medical Directors), Mary Nottingham RN and Lori Fineman RN (MB Clinical Nurse Specialists, Pediatric Critical Care and Pediatric Cardiac Critical Care), Mandeep Chadha MD (BCH-Oakland Pediatric Critical Care Quality Lead).

 

 

Selected References:

 

Mahendra M, Chu P, Amin EK, Nawaytou H, Duncan JR, Fineman JR, Smith-Bindman R. Associated radiation exposure from medical imaging and excess lifetime risk of developing cancer in pediatric patients with pulmonary hypertension. Pulm Circ. 2023 Aug 21;13(3):e12282. doi: 10.1002/pul2.12282. PMID: 37614831; PMCID: PMC10442605.

Mahendra M, Luo Y, Mills H, Schenk G, Butte AJ, Dudley RA. Impact of Different Approaches to Preparing Notes for Analysis With Natural Language Processing on the Performance of Prediction Models in Intensive Care. Crit Care Explor. 2021 Jun 11;3(6):e0450. doi: 10.1097/CCE.0000000000000450. PMID: 34136824; PMCID: PMC8202578.

Pageler NM, Longhurst CA, Wood M, Cornfield D, Suermondt J, Sharek P, Franzon D. Use of electronic medical record-enhanced checklist and electronic dashboard to decrease CLABSIs. Pediatrics. 2014;133(3):e738-e746. doi:10.1542/peds.2013-2249

 

Solutions for Patient Safety Operational Definition and Prevention Bundle. (rev. 2020). https://static1.squarespace.com/static/62e034b9d0f5c64ade74e385/t/636177d9b97cd87674b58f18/1667332057726/PI-Bundle-Op+Def.pdf

Padula WV, Delarmente BA. The national cost of hospital-acquired pressure injuries in the United States. Int Wound J. 2019;16(3):634-640. doi:10.1111/iwj.13071

Johnson AK, Kruger JF, Ferrari S, et al. Key Drivers in Reducing Hospital-acquired Pressure Injury at a Quaternary Children's Hospital. Pediatr Qual Saf. 2020;5(2):e289. Published 2020 Apr 7. doi:10.1097/pq9.0000000000000289

 


SNAP Into Action: Building an EHR Dashboard to Track Delayed Antibiotic Prescriptions in Pediatric Acute Otitis Media


The UCSF Health Problem

Acute otitis media (AOM) affects millions of children each year and is the number one indication for antibiotic use in pediatrics,1 despite evidence that 85% of cases self-resolve.2 To combat antibiotic overuse and the potential for adverse events, the American Academy of Pediatrics3 and Centers for Disease Control4 recommend Safety Net Antibiotic Prescriptions (SNAPs). SNAPs are prescribed during the encounter with the intention that the prescription will be filled and used within 1-3 days only if the child’s symptoms fail to resolve or worsen. However, SNAPs are difficult to track because there is no structured designation in the order data that distinguishes them from standard prescriptions meant to start immediately (treatment today prescriptions, TTP). Thus, identifying SNAPs has historically required burdensome manual chart review, which makes it difficult to assess their epidemiological value.5

SNAPs are commonly used in pediatric care at UCSF, both in ambulatory and emergency settings, yet their use is not systematically tracked. Quality improvement initiatives require insight into how often SNAPs are prescribed and filled, which patient populations are more likely to receive or use them, and how prescribing patterns vary across clinicians in order to ensure equitable and effective antibiotic stewardship.

How Might AI Help?
Versa has already shown that it can help analyze disparities in pediatric antibiotic prescribing by automating labor-intensive chart review. In our pilot retrospective cohort study of pediatric AOM cases from 2021 to 2024, we found that Versa (utilizing GPT-4o) was able to accurately categorize 98% of treatment plans into “SNAP,” “TTP,” or “Other” from physician notes as compared to the gold standard of human review by two board-certified pediatricians.6 The model achieved a sensitivity of 95.9% and specificity of 99.1% for SNAP detection. We found that overall, 76.8% of antibiotic prescriptions were TTPs, while 23.2% were SNAPs, and that non-English-speaking patients and those in the lowest SES quartiles were significantly less likely to receive SNAPs. These results indicate substantial opportunity to expand the use of SNAPs.

Building on this foundation, we propose the development of a live Versa/Epic-integrated dashboard that enables physicians to visualize their own prescribing patterns in real time. Versa will automatically review provider notes for all pediatric encounters (patients < 19 years old) that are flagged with an AOM ICD-10 diagnosis code (H65, H66, or H67).7,8 Each encounter will be classified as SNAP, TTP, or Other, and the results will populate the dashboard.
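
A minimal sketch of this weekly classification step is shown below: each qualifying encounter note is sent to the model with a three-way prompt and the returned label is parsed for the dashboard tables. The prompt text, the `classify_with_versa` placeholder, and the label-cleanup logic are illustrative assumptions rather than the validated study prompt.

```python
CLASSIFICATION_PROMPT = """You are reviewing a pediatric acute otitis media encounter note.
Classify the antibiotic treatment plan as exactly one of:
  SNAP  - safety-net antibiotic prescription, to be filled only if symptoms persist or worsen
  TTP   - treatment-today prescription, intended to be started immediately
  OTHER - no antibiotic prescribed, or the plan cannot be determined
Answer with the single label only.

Encounter note:
"""

def classify_with_versa(prompt: str) -> str:
    # Placeholder for the Versa (GPT-4o) call used in the pilot study (assumption, not a real API).
    raise NotImplementedError

def classify_encounter(note_text: str) -> str:
    """Classify one AOM encounter note; unparseable model output defaults to OTHER."""
    label = classify_with_versa(CLASSIFICATION_PROMPT + note_text).strip().upper()
    return label if label in {"SNAP", "TTP", "OTHER"} else "OTHER"
```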

The primary visualization will display an individual provider’s SNAP versus TTP prescribing rates, benchmarked against their clinical department and the broader institution. Users will be able to toggle between percentage-based and absolute patient counts and view trends over time. Additional features will include filtering by English vs. non-English-speaking patients, race, ethnicity, and additional demographics. Based on their chart selections, users will be able to click to generate a workbench report that shows the individual encounters and the classifications that make up the data points they are viewing.

We have also linked pharmacy dispense data to patient encounters, enabling us to track whether and when prescriptions were filled. An additional dashboard view, should resources allow, could display individual provider SNAP and TTP pickup rates compared to clinic- and institution-level benchmarks, offering insight into the effectiveness of individual provider’s patient counseling and education. This allows providers to assess whether their patients are filling prescriptions as intended, particularly in the case of SNAPs which are education-intensive, and to refine communication strategies accordingly.

Finally, should Versa prove to be too expensive for this use case, our team also trained a small, local model, Clinical LongFormer, that demonstrated 93% accuracy. This model could be deployed on Wynton (UCSF’s high-performance compute cluster) in place of Versa should a more cost-efficient solution be desired.

While we are actively exploring developing a structured SmartSet in APeX to distinguish SNAP from TTP prescriptions, implementing and achieving consistent adoption across pediatrics, family medicine, and emergency medicine at multiple UCSF sites poses significant operational challenges. Our proposed AI-based approach offers a scalable, low-burden solution that enables both retrospective analysis for quality improvement and equity monitoring, and future validation of SmartSet use against what the AI determines from clinical documentation.

How Would an End-User Find and Use It?
The dashboard will be integrated directly into UCSF’s APeX EHR system for the relevant pediatric departments that consent to its use. Users will be able to see it on a tab adjacent to their inbox for quick and easy access between seeing patients. The default view will present each user’s prescribing patterns in the context of the clinical department in which they are currently logged in. The underlying LLM will run every seven days on all pediatric encounters associated with an AOM ICD-10 code in enabled clinical settings, and the dashboard will automatically update with new data every week. Users will be able to toggle on relevant patient demographics such as language, insurance, etc. They may also select between seeing absolute patient numbers or percentages when comparing their AOM treatment plan patterns to others’ and reviewing how often their patients are filling their prescriptions.

Embed a Picture of the AI Tool

 

 

Risks of AI Errors
There are risks such as false positives (incorrectly labeling a non-SNAP as a SNAP) and false negatives (failing to identify a SNAP). False positives could promote a sense of successful antibiotic stewardship when there is in fact a gap. Conversely, false negatives may result in missed opportunities to identify disparities and support quality improvement. While the model is not directly involved in treatment decisions, misclassification could impact provider benchmarking and the efficacy of equity-focused interventions.

To mitigate these risks, ongoing evaluation will be implemented to monitor for model drift, ensuring the AI tool maintains accuracy over time. This includes continuous performance monitoring and validation procedures to detect and address any degradation in model accuracy. A feedback mechanism will be established for clinicians to report discrepancies through the workbench reporting interface that shows individual encounters and their Versa classifications. This feedback will be reviewed and used to refine the AI model.

How Will We Measure Success?
Success will be measured through provider adoption and any observed impact on SNAP prescribing and dispense behavior.

 Measures using data already being collected in APeX:

  • Overall clinical rates of TTP and SNAP being prescribed before AI tool implementation
  • Rate of TTP vs. SNAP use by provider
  • Patient pickup rates of SNAP vs. TTP by demographic

 Measures using other measurements ideally needed:

  • Provider engagement metrics, including frequency and duration of dashboard use
  • Equity metrics, i.e., changes in SNAP prescribing rates among non-English-speaking and low-SES patients
  • Change in SNAP prescribing rates among dashboard exposed vs. non-exposed providers (based on APeX audit logs when a clinician opens the dashboard more than once in the set time period)
  • Provider satisfaction with the AI tool through Qualtrics surveys

 Describe Your Qualifications and Commitment

Jessica Pourian, MD, is a Clinical Informatics Fellow and urgent-care pediatrician with experience in LLMs, operational AI implementations, antibiotic stewardship, and healthcare disparities. She is a physician builder and is Clarity certified. She is transitioning to a faculty position as Assistant Professor in the Department of Pediatrics and will have a 40% health system operational role as Physician Lead for Pediatric Informatics. She has led the SNAP Study, which uses AI to identify and address inequities in antibiotic prescribing for pediatric AOM. She is committed to the success of this project and will dedicate protected time to participate in regular work-in-progress sessions, collaborate closely with the Health AI and AER teams, and support the development, validation, and implementation of the AI algorithm in clinical workflows.

Valerie Flaherman, MD, MPH, is a Professor of Pediatrics and Epidemiology & Biostatistics with expertise in EHR-based research and clinical decision support. She brings expertise in clinical research, health services, and informatics, with a particular focus on leveraging EHR data for clinical decision support. Dr. Flaherman led the development of the Newborn Weight Tool (NEWT), a widely used digital tool built from EHR data on over 160,000 infants, and served as PI for the Healthy Start trial integrating CDS into Epic. She is also the Managing Director of the BORN Network, a national research collaborative. Her background in EHR-based intervention design and pediatric care makes her a key advisor on the development and implementation of the SNAP prescribing dashboard.

Raman Khanna, MD, MAS, is a Professor of Clinical Medicine and Medical Director of Inpatient Informatics. He co-chairs the Digital Diagnostics and Therapeutics Committee and leads efforts to integrate digital tools into the EHR, with a focus on clinical communication, decision support, and API-based innovation. Dr. Khanna is also Program Director of the Clinical Informatics Fellowship. His experience in deploying operational informatics tools across UCSF Health makes him a key collaborator in the development and implementation of the SNAP dashboard.

 

 References

1.         Hersh AL, Shapiro DJ, Pavia AT, Shah SS. Antibiotic prescribing in ambulatory pediatrics in the United States. Pediatrics. 2011;128(6):1053-1061. doi:10.1542/peds.2011-1337

2.         Venekamp RP, Sanders SL, Glasziou PP, Del Mar CB, Rovers MM. Antibiotics for acute otitis media in children. Cochrane Database Syst Rev. 2015;2015(6):CD000219. doi:10.1002/14651858.CD000219.pub4

3.         The Diagnosis and Management of Acute Otitis Media | Pediatrics | American Academy of Pediatrics. Accessed August 13, 2024. https://publications.aap.org/pediatrics/article/131/3/e964/30912/The-Dia...

4.         CDC. Ear Infection Basics. Ear Infection. April 23, 2024. Accessed August 13, 2024. https://www.cdc.gov/ear-infection/about/index.html

5.         Daggett A, Wyly DR, Stewart T, et al. Improving Emergency Department Use of Safety-Net Antibiotic Prescriptions for Acute Otitis Media. Pediatr Emerg Care. 2022;38(3):e1151-e1158. doi:10.1097/PEC.0000000000002525

6.         Flaherman, V, Pourian, J. A SNAPpy Use of Large Language Models: Using LLMs to Classify Treatment Plans in Pediatric Acute Otitis Media. Under Review.

7.         Vojtek I, Nordgren M, Hoet B. Impact of pneumococcal conjugate vaccines on otitis media: A review of measurement and interpretation challenges. International Journal of Pediatric Otorhinolaryngology. 2017;100:174-182. doi:10.1016/j.ijporl.2017.07.009

8.         Hu T, Done N, Petigara T, et al. Incidence of acute otitis media in children in the United States before and after the introduction of 7- and 13-valent pneumococcal conjugate vaccines during 1998-2018. BMC Infect Dis. 2022;22(1):294. doi:10.1186/s12879-022-07275-9

 

AI-Augmented Fall Prevention Tool for Nurses


The UCSF Health problem

Falls among hospitalized adult patients are a significant patient safety concern, associated with higher morbidity, prolonged hospital stays, and approximately $62,000 in non-reimbursable costs per fall.1 While our fall rate is below the national benchmark, falls are considered a never event.2 Moreover, they are a quality indicator under the UCSF Quality and Safety True North Pillar. Nurses play a critical role in fall prevention3 and are the intended end-users of our proposed AI-Augmented Fall Prevention Tool.

UCSF Health currently employs an evidence-based Fall Prevention Program that incorporates Fall TIPS4, including risk assessment, care planning, and the delivery of tailored interventions. Nurses assess and document fall risk every shift using the STRATIFY tool5, which is based on five features: recent history of falls and observations of agitation, visual impairment, toileting, and mobility. However, STRATIFY uses a fixed threshold for identifying “high risk” patients, which limits its utility in rapidly changing clinical situations. Additionally, many of its components are already documented in other nursing assessments but are not easily integrated with other data.

A UCSF pilot in 2020 evaluated the Epic AI Fall Risk Prediction model6 with 38 variables, but it did not outperform STRATIFY and was not adopted into the workflow. It also lacked actionable recommendations for end-users, limiting its utility in patient care. Despite this, AI remains a promising avenue for integrating broader clinical features into fall risk prediction.

Major limitations of the current fall prevention approach are 1) over-reliance on an overall risk score rather than attention to individual risk factors, 2) the high cognitive load required to assess how medical history and recent changes in status should inform tailored care planning, and 3) the burden of documentation that takes away from direct patient care.

How might AI help?

The primary goal of the proposed AI-Augmented Fall Prevention Tool is to reduce inpatient falls through timely and targeted prevention. By leveraging real-time clinical data, the tool will assist nurses with care planning by providing tailored, actionable recommendations. It will also reduce documentation burden by synthesizing existing information into a user-friendly format. The tool will incorporate both risk prediction and large language models and will be capable of reading from and writing to APeX. The tool will consist of a risk prediction and clinical decision support system with an integrated end-user feedback loop (Figure 1). These interdependent components will be built across three phases.

Phase 1: The Risk Prediction model will generate a dynamic profile that presents 1) a risk score and corresponding stratification (e.g., high, medium, low) and 2) the clinical features underlying fall risk. We will use APeX and Incident Report (IR) data to label cases. Fall events are documented in a separate IR database, but as of January 2024, they are also documented in APeX in the Post-Fall Flowsheet and Note. Since January 2022, there have been 1,148 documented falls.

Predictive features will be selected based on research evidence7,8,9 and their availability in APeX. Our prior work has shown that nurses already document STRATIFY elements in other daily flowsheets. In addition to pulling these data, we will include clinical, environmental, and time-dependent variables such as time of day, unit type, demographics, medical history, admission assessments, vital signs, medications, lab values, new diagnoses/procedures, and functional status. The model will automatically update the risk profile as new data become available.

We will begin by focusing on structured data from the Admission Navigator, Diagnoses, Problem List, Medication List/MAR, and Flowsheets. We will later explore the added value of unstructured data (e.g., gait assessments in Physical Therapy (PT) notes) and the technical costs associated with their inclusion.
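To make the Phase 1 plan concrete, the sketch below shows one way a first-pass risk model could be trained on structured APeX features. It is a minimal illustration, not the final pipeline: the file name, column names, label definition, and risk thresholds are all hypothetical placeholders.

```python
# Minimal sketch of a fall-risk classifier on structured features.
# All file and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# One row per patient-shift; the label would be derived from Incident Report
# and Post-Fall Flowsheet documentation.
df = pd.read_csv("fall_risk_training_data.csv")

features = [
    "age", "unit_type", "hour_of_day", "recent_fall_history",
    "agitation_score", "toileting_assistance", "mobility_score",
    "opioid_in_last_24h", "sodium", "hemoglobin",
]
X = pd.get_dummies(df[features], columns=["unit_type"])
y = df["fell_next_24h"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, probs))

# Map the continuous score to the risk strata described above.
def to_risk_stratum(p, low=0.05, high=0.20):  # thresholds are illustrative only
    return "low" if p < low else ("medium" if p < high else "high")
```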

 How would an end-user find and use it?

Phase 2: The Clinical Decision Support System (CDSS) will be embedded in the patient’s chart. Nurses would interact with it at the beginning of the shift and when a patient’s clinical status changes. The tool will visualize each patient’s fall risk, display trendlines, and highlight the top contributors to any change. A clinical advisory will appear when a new high-risk patient is identified or when a patient’s risk sharply increases. It would recommend tailored interventions or adjustments to the current Falls Care Plan based on research evidence10,11. Recommendations will incorporate patient preferences and conditions (e.g., language, visual impairment) and be presented in natural, LLM-generated language to ensure clarity and increase nurses’ trust in the AI.
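As one illustration of how that advisory text could be generated, the sketch below assembles an LLM prompt from a hypothetical risk profile. The field names, example factors, and prompt wording are placeholders that would be co-designed with the Falls Committee and nursing stakeholders.

```python
# Illustrative only: turning a risk profile into an LLM prompt for a tailored,
# plain-language fall prevention advisory. Fields and wording are placeholders.
def build_advisory_prompt(risk_profile: dict) -> str:
    return (
        "You are helping a bedside nurse prevent an inpatient fall.\n"
        f"Risk level: {risk_profile['risk_level']}\n"
        f"Top contributing factors: {', '.join(risk_profile['top_factors'])}\n"
        f"Patient considerations: {', '.join(risk_profile['considerations'])}\n"
        "Recommend 2-3 specific, evidence-based fall precautions for the current "
        "Falls Care Plan, in plain language a nurse can act on this shift."
    )

example = {
    "risk_level": "high",
    "top_factors": ["new opioid started overnight", "unsteady gait on PT assessment"],
    "considerations": ["Spanish-preferred", "visual impairment"],
}
print(build_advisory_prompt(example))
```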

Phase 3 will include End-user Feedback and a Tracking Dashboard.  The nurse will be able to accept or reject the clinical advisory recommendation. If accepted, the AI tool would automatically update the Care Plan, reducing the need for manual documentation. It would also generate a reminder to document the performed intervention by the end of shift, if not done already. If the nurse rejects the advisory based on their clinical judgement, they can provide a rationale using a standardized list of options that reduce the cognitive burden. This input will support reinforcement learning with a "nurse in the loop" model. A Tracking Dashboard built in Tableau will display model and user metrics, monitored by the Falls Committee. This interprofessional committee, which reviews all fall incidents, has been actively involved in developing this proposal.

Key stakeholders — including Fall TIPS leaders, Falls Champions, and nurse informaticists — will co-design the CDSS interface to ensure usability, minimal clicks, and integration into workflows. We will apply the Consolidated Framework for Implementation Research12 and the NASSS model (Non-adoption, Abandonment, Scale-up, Spread, and Sustainability)13 to guide adoption and sustainability.

What are the risks of AI errors?

Primary risks include false positives and false negatives. False positives may flag too many patients as high risk and lead to alert fatigue from unnecessary clinical advisory alerts. False negatives may increase falls due to missed opportunities for fall prevention. We plan to track model fit statistics to reduce these risks. Additionally, we will review clinical advisory rejections and their rationale, as frequent rejections may suggest excessive false positives. We aim to mitigate these issues by optimizing thresholds, refining the model, and engaging our stakeholders. Moreover, to reduce bias, we plan to periodically recalibrate the model across age, race/ethnicity, and diagnosis groups.

How will we measure success?

This multi-phase project requires significant clinical and technical investment, but it has the potential to improve patient safety and documentation efficiency and to reduce EHR-related burnout — offsetting its costs. It may also be extended to other preventable harms and serve as a model for nursing-led AI innovation. Lastly, it would generate pilot data for a larger extramural grant.

Aim 1: Evaluate the risk prediction model fit and compare its ability to identify high-risk patients to STRATIFY. Model metrics will include False Positive Rate, False Negative Rate, Positive Predictive Value, and falls occurring in patients classified as “low risk”. We will adopt the model if it performs as well as or better than STRATIFY. If successful, we propose to replace STRATIFY with this AI tool and embed it in the current Fall TIPS program.
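A minimal sketch of how the Aim 1 comparison could be computed is below. The fall and high-risk flag vectors are toy placeholders standing in for the validation cohort; in practice they would come from labeled APeX and Incident Report data.

```python
# Confusion-matrix summary of a high-risk flag against observed falls,
# applied to both the AI model and STRATIFY. Input vectors are toy examples.
import numpy as np

def fall_risk_metrics(fell, flagged_high_risk):
    fell = np.asarray(fell, dtype=bool)
    flagged = np.asarray(flagged_high_risk, dtype=bool)
    tp = np.sum(flagged & fell)
    fp = np.sum(flagged & ~fell)
    fn = np.sum(~flagged & fell)
    tn = np.sum(~flagged & ~fell)
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "positive_predictive_value": tp / (tp + fp),
        "falls_in_low_risk_patients": int(fn),
    }

# Toy vectors: observed falls, AI high-risk flags, STRATIFY high-risk flags.
fell          = [0, 0, 1, 0, 1, 0, 0, 1]
ai_flag       = [0, 1, 1, 0, 1, 0, 0, 1]
stratify_flag = [1, 1, 1, 0, 0, 1, 0, 1]

print("AI model:", fall_risk_metrics(fell, ai_flag))
print("STRATIFY:", fall_risk_metrics(fell, stratify_flag))
```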

Aim 2: Evaluate and continuously refine the user experience with brief surveys and focus groups with nurses, as well as general AI tool usage via APeX metadata. We will also use APeX metadata to measure the change in time spent on fall-related documentation.

Aim 3: Evaluate the impact on the number of inpatient falls per 1,000 patient days and the rate of falls with injury. These measures are available through the Zero Harms Dashboard. We will consider abandoning the project if there is a persistent increase in falls (especially among patients the AI flagged as “low risk”), a high advisory rejection rate, or low user satisfaction. The thresholds for these outcomes will be decided with stakeholders.

  • A list of measurements using data that is already being collected in APeX: AI tool usage (views of the risk profile, clickthrough rate on advisory recommendations, advisories accepted and rejected, rationale for rejection), documentation (time spent on fall-related documentation, time to completed documentation after the advisory reminder).
  • A list of other measurements you might ideally have to evaluate success of the AI: nurse satisfaction and trust in the AI tool (focus groups), nurse agreement with AI-generated explanations (surveys/focus groups), use of fall prevention resources (e.g., safety attendants, PT consults), rates of falls and falls with injury, and potential cost savings.

Qualifications and commitment

The project leads are Maria Yefimova, PhD, RN (5% effort) and Sasha Binford, PhD, MS, RN, PHN, AGCNS-BC (5% effort). Both are faculty in the Department of Physiological Nursing at the UCSF School of Nursing. Dr. Yefimova is the lead nurse scientist with UCSF Health. With her implementation science background and experience evaluating remote patient monitoring programs, she will oversee the development and evaluation. Dr. Binford is the Nursing Clinical Quality Specialist. She is the Fall Subject Matter Expert and has served as tri-Chair of the Falls Committee. Previously, she led the development of the UCSF AI delirium risk prediction model. She will contribute her expertise on model development and validation and will liaise with clinical and operational stakeholders.

Falls Committee members, including Clinical Nurse Specialist and committee chair Melissa Lee, MS, RN, PCCN, GCNS-BC, NEA-BC; Nursing Quality (Meghan Sweis, MSN, RN, CNL, CPHQ); Continuing Quality Improvement (Adam Cooper, DNP, RN, NPD-BC, EBP-C); and Nursing Informatics (Kay Burke, MBA, BSN, RN, NE-BC), will provide in-kind effort on model validation and evaluation.

References

  1. Dykes, P. C., Carroll, D. L., McColgan, K., Hurley, A. C., Lipsitz, S. R., & Bates, D. W. (2023). Cost of inpatient falls and cost-benefit analysis of implementation of an evidence-based fall prevention program. JAMA Health Forum, 4(1), e225125. https://doi.org/10.1001/jamahealthforum.2022.5125
  2. Centers for Medicare & Medicaid Services. (2006, May 18). Eliminating serious, preventable, and costly medical errors - never events.
  3. Horta, R. S. T. (2024). Falls prevention in older people and the role of nursing. British Journal of Community Nursing, 29(7), 335–339. https://doi.org/10.12968/bjcn.2024.0005
  4. Dykes, P. C., Carroll, D. L., Hurley, A. C., Lipsitz, S., Benoit, A., Chang, F., Meltzer, S., Tsurikova, R., Zuyov, L., & Middleton, B. (2010). Fall prevention in acute care hospitals: A randomized trial. JAMA, 304(17), 1912–1918. https://doi.org/10.1001/jama.2010.1567
  5. Oliver, D., Britton, M., Seed, P., Martin, F. C., & Hopper, A. H. (1997). Development and evaluation of evidence-based risk assessment tool (STRATIFY) to predict which elderly inpatients will fall: case-control and cohort studies. BMJ, 315(7115), 1049–1053. https://doi.org/10.1136/bmj.315.7115.1049
  6. Cognitive Computing Model Brief: Inpatient Risk of Falls. Epic Systems Corporation.
  7. Appeadu, M. K., & Bordoni, B. (2023). Falls and fall prevention in older adults. In StatPearls. StatPearls Publishing. Retrieved April 1, 2025, from https://www.ncbi.nlm.nih.gov/books/NBK560761/
  8. Mao, A., Su, J., Ren, M., Chen, S., & Zhang, H. (2025). Risk prediction models for falls in hospitalized older patients: A systematic review and meta-analysis. BMC Geriatrics, 25, Article 29. https://doi.org/10.1186/s12877-025-05688-0
  9. Ghosh, M., O’Connell, B., Afrifa-Yamoah, E., Kitchen, S., & Coventry, L. (2022). A retrospective cohort study of factors associated with severity of falls in hospital patients. Scientific Reports, 12, 12266. https://doi.org/10.1038/s41598-022-16403-z
  10. Pierre-Lallemand, W., Coughlin, V., Brown-Tammaro, G., & Williams, W. (2025). Nursing-led targeted strategies for preventing falls in older adults. Geriatric Nursing. https://doi.org/10.1016/j.gerinurse.2025.02.005
  11. Ojo, E. O., & Thiamwong, L. (2022). Effects of nurse-led fall prevention programs for older adults: A systematic review. Pacific Rim International Journal of Nursing Research, 26(3), 415–429. https://he02.tci-thaijo.org/index.php/PRIJNR/article/view/258061
  12. Damschroder, L. J., Aron, D. C., Keith, R. E., Kirsh, S. R., Alexander, J. A., & Lowery, J. C. (2009). Fostering implementation of health services research findings into practice: A consolidated framework for advancing implementation science. Implementation Science, 4, 50. https://doi.org/10.1186/1748-5908-4-50
  13. Greenhalgh, T., Wherton, J., Papoutsi, C., Lynch, J., Hughes, G., A’Court, C., Hinder, S., Fahy, N., Procter, R., & Shaw, S. (2017). Beyond adoption: A new framework for theorizing and evaluating nonadoption, abandonment, and challenges to the scale-up, spread, and sustainability of health and care technologies. Journal of Medical Internet Research, 19(11), e367. https://doi.org/10.2196/jmir.8775

Improving Access to Advance Care Planning Goals of Care Documentation in the EHR Using AI

Proposal Status: 

1. The UCSF Health Problem
Advance care planning (ACP) is a process that allows patients to name a surrogate decision-maker and to discuss and document their preferences for medical care.1,2 The Centers for Medicare and Medicaid Services (CMS) and other national organizations recommend ACP, and it is an important quality metric often tied to reimbursement.2 When ACP information in the electronic health record (EHR) cannot be found during a medical crisis, patients often receive care that is not aligned with their goals, which is widely recognized as an important patient safety issue.3,4

To improve ACP documentation, with our team’s guidance, UCSF adopted “clinically meaningful” ACP as a quality metric, which includes both documented discussions and legal forms (e.g., advance directives). UCSF has also created and adopted an “.ACP” SmartPhrase and ACP note title that, along with ACP documents, are “pulled into” a central location in the EHR for easy clinician retrieval.

However, threats to patient safety and adherence to patients’ goals of care abound, particularly in inpatient medicine and surgical settings.5,6 At UCSF, both inpatient and outpatient clinicians and ACP leaders have spent many hours, year after year, attempting to educate clinicians about using the .ACP SmartPhrase and the ACP note titles to pull this information into the central ACP location in the EHR. With frequent trainee and inpatient clinician turnover, despite ongoing education, clinician use of these documentation innovations is low (e.g., < 5% in surgery). As our team has shown, important ACP information is often buried in clinical notes and ignored during medical crises.3,7 While semi-automated natural language processing (NLP) can be used to manually search for a list of known ACP terms in the EHR, this manual abstraction is time-consuming, resource-intensive, and not practical during a medical crisis. 

Given national guidelines for ACP, quality metrics tied to reimbursements, clinician burden of documentation, and the need to honor patients’ medical wishes, it is imperative to be able to efficiently gather ACP-related information into a central usable location. Artificial Intelligence (AI) and large language models (LLM) have tremendous promise to identify all documented goals of care conversations in the medical record and ensure this critical information is available to all clinicians in a central ACP location in the EHR. We propose that an AI model would run autonomously in the background without the need for additional clinician or staff support. 

2. How Might AI Help?
Generative AI and large language models (LLMs) can transform the identification of goals of care documentation in the EHR by analyzing unstructured clinical data. Our AI solution will analyze EHR data, including all inpatient and outpatient notes, to identify documented ACP discussions by any provider type, including clinicians, social workers, nurses, nurse practitioners, chaplains, and healthcare navigators. The AI tool will function as an assistive technology to find ACP information that clinicians can use at the bedside to help patients and families make medical decisions. The AI and LLM solution will increase the number of documented goals of care conversations listed in the central ACP activity in APeX, decrease the time required per chart to complete surveillance that would otherwise require NLP and manual chart review, and decrease clinician burden in finding and documenting ACP information. Importantly, automating this process has the potential to improve patient safety by ensuring that patients’ stated wishes for medical care are honored and respected. This initiative will improve care for patients across clinical disciplines and across inpatient and outpatient care and will be scalable in multiple other care settings.

3. How Would an End-User Find and Use It?
The AI tool will be designed to run automatically in the background on a daily basis. When documented goals of care conversations are identified in the EHR, they will be added to the established central ACP location in the EHR. The goals of care information will then be made available to clinicians without the need for NLP search queries or manual chart review. In addition, as the model improves over time, it will obviate the need for ongoing clinician education to use the .ACP SmartPhrase and ACP note titles when documenting goals of care.

4. Example of AI Output

Using UCSF Versa with the prompt: “Pull wording from notes related to advance care planning verbatim.”

Discussed with XXXX regarding XXXX’s code status and XXXX would not like to pursue heroic measures should he have a cardiac arrest, ie. Prefers XXXX to be DNR/DNI. XXXX worries that given his already poor baseline quality of life, an event like that may make this worse for him, and it may not be in his best interest to pursue full life measures. XXXX is yet to fill a POLST form but will do so soon.

Surrogate decision maker: XXXX (Conservator)

Life sustaining treatment preferences (i.e. Code): DNR/DNI
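For illustration, the snippet below shows how this verbatim-extraction prompt might be issued programmatically. It assumes Versa (or another LLM) can be reached through an OpenAI-compatible chat endpoint; the base URL, model name, and key handling are placeholders, not the production Versa configuration.

```python
# Sketch of issuing the verbatim-extraction prompt to an LLM.
# Assumes an OpenAI-compatible chat API; base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-versa-endpoint/v1", api_key="...")

PROMPT = (
    "Pull wording from notes related to advance care planning verbatim. "
    "If no advance care planning content is present, reply 'NONE'."
)

def extract_acp_text(note_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,   # deterministic extraction, no paraphrasing
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content.strip()
```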

5. What Are the Risks of AI Errors?
AI-based identification may introduce risks such as false negatives, false positives, and AI hallucinations. False negatives could lead to a failure to identify ACP documentation in medical notes, impacting patient safety and quality metric compliance. However, any improvement over the current state, in which most ACP information in notes is missed, has the potential to improve patient safety. False positives may result in non-ACP information being included in the EHR central activity (e.g., “noise”). However, because ACP information is so crucial in a medical crisis, it is generally agreed that false positives are preferable to false negatives, as this documentation can be quickly screened by clinicians if it is readily accessible in a central location. Through prompt engineering, we will ask Versa to return note text verbatim to lessen the risk of hallucinations, and we will continue to check for hallucinations and refine the model.

To mitigate these risks, the model will undergo rigorous validation to ensure reliability and clinical utility: (a) retrospective validation – the AI models will be developed and then validated against historical goals of care notes with .ACP SmartPhrases as the gold standard; (b) among 1000 patients over 6 months we will conduct a prospective validation through manual chart review of clinical notes; (c) qualitative patient queries (n=20) about the accuracy of ACP documentation (i.e., 10 patients on the medical and 10 on the surgical inpatient services representing the average self-reported race/ethnicity background of UCSF inpatients; 10 White (50%), 4 Asian (20%), 3 Black (15%), and 3 Hispanic/Latinx (15%) patients); and (d) clinician feedback. We will continually monitor and manually validate the model to correct for bias and refine it to ensure accuracy. 

6. How Will We Measure Success?
We will follow a cohort of hospitalized patients, 65 years of age and older, admitted to the inpatient medical and surgical services over 6 months. Success will be measured by:

  • The increase in documented goals of care discussions in the central ACP EHR activity 6 months after applying the AI/LLM tool compared to a demographically matched comparison cohort 6 months prior to applying the tool
  • Using chart review:
    • The positive predictive value and negative predictive value of AI-generated ACP documentation  
    • The time saved using the AI/LLM tool compared to using NLP ACP term queries and manual chart review
    • Qualitative analysis of potential bias of ACP documentation across demographic and clinical subgroups
    • A survey of 20 patients (i.e., 10 medicine and 10 surgical inpatients) to query whether goals of care documentation in the EHR is aligned with their preferences
    • A survey of 20 inpatient clinicians (10 medicine and 10 surgical clinicians) about their satisfaction (5-point Likert scale) with documenting, finding, and using ACP information in the inpatient setting with open-ended questions on how to improve 

7. Describe Your Qualifications and Commitment
This project is co-led by Rebecca Sudore, MD (Division of Geriatrics) and Elizabeth Wick, MD (Colorectal Surgery), and includes Logan Pierce, MD (Hospital Medicine). This interdisciplinary team comprises ACP experts, surgery and hospital medicine clinicians, and an informaticist with a proven successful track record of improving ACP workflows at UCSF. Currently, this team holds a large, multi-center NIH Pragmatic Trial Collaboratory project to test patient-facing, automated ACP EMR interventions. The current proposal builds on their successful collaboration and the glaring need they have identified to improve ACP documentation, decrease clinician burden, and improve patient safety. Please see the letter of support provided by our clinical and operational partners, Drs. Michelle Mourad and Molly Cantor.

Rebecca Sudore, MD, is a geriatrician, palliative medicine physician, implementation scientist, Professor of Medicine at UCSF, and co-director of the Vulnerable Aging Research Core of the NIH-funded Pepper Center. Her research focuses on aging, health literacy, and developing and testing tools to facilitate ACP, particularly for historically marginalized older adults. Dr. Sudore has created ACP interventions and clinical workflows, including automated EHR-based patient interventions, that have been tested in randomized trials.8-10 In close collaboration with Population Health and Primary Care Strategies, these automated workflows have been adopted by UCSF Health. She will provide leadership in ACP documentation and measurement and ensure the project aligns with the priorities of UCSF Population Health, the inpatient ACP team (please see the letter of support), and national guidelines for ACP.

Elizabeth Wick, MD, is a Professor of Surgery and Vice Chair for Quality and Safety in the Department of Surgery at UCSF. She is an expert in surgical quality improvement and has led multiple national initiatives focused on improving surgical outcomes. Dr. Wick has extensive experience with ACP in surgery and, along with Dr. Sudore, has been instrumental in developing strategies to increase ACP documentation at UCSF. She will provide leadership in integrating AI solutions into surgical inpatient workflows. Dr. Wick will work closely with Dr. Rochelle Dicker, Section Chief of Acute Care Surgery at UCSF Health and the operational leader supporting this project.

Logan Pierce, MD, is board-certified in both clinical informatics and internal medicine. He is the Managing Director of UCSF Data Core, a team of physician data scientists dedicated to utilizing EHR data to improve healthcare outcomes. He has experience using large language models to extract data from clinical text. Dr. Pierce will actively contribute throughout the development lifecycle, ensuring alignment with UCSF Health priorities and participating in regular progress reviews with Health AI and AER teams.

References:

1.         Sudore RL, Lum HD, You JJ, et al. Defining Advance Care Planning for Adults: A Consensus Definition From a Multidisciplinary Delphi Panel. J Pain Symptom Manage. May 2017;53(5):821-832 e1. doi:10.1016/j.jpainsymman.2016.12.331

2.         Hickman SE, Lum HD, Walling AM, Savoy A, Sudore RL. The care planning umbrella: The evolution of advance care planning. J Am Geriatr Soc. Feb 25 2023;doi:10.1111/jgs.18287

3.         Walker E, McMahan R, Barnes D, Katen M, Lamas D, Sudore R. Advance Care Planning Documentation Practices and Accessibility in the Electronic Health Record: Implications for Patient Safety. J Pain Symptom Manage. Feb 2018;55(2):256-264. doi:10.1016/j.jpainsymman.2017.09.018

4.         Allison TA, Sudore RL. Disregard of patients' preferences is a medical error: comment on "Failure to engage hospitalized elderly patients and their families in advance care planning". JAMA Intern Med. May 13 2013;173(9):787. doi:10.1001/jamainternmed.2013.203

5.         Colley A, Lin J, Pierce L, et al. Experiences with targeting inpatient advance care planning for emergency general surgery patients: A resident-led quality improvement project. Surgery. Oct 2023;174(4):844-850. doi:10.1016/j.surg.2023.04.031

6.         Colley A, Lin JA, Pierce L, Finlayson E, Sudore RL, Wick E. Missed Opportunities and Health Disparities for Advance Care Planning Before Elective Surgery in Older Adults. JAMA Surg. Oct 1 2022;157(10):e223687. doi:10.1001/jamasurg.2022.3687

7.         McMahan RD, Tellez I, Sudore RL. Deconstructing the Complexities of Advance Care Planning Outcomes: What Do We Know and Where Do We Go? A Scoping Review. J Am Geriatr Soc. Jan 2021;69(1):234-244. doi:10.1111/jgs.16801

8.         Sudore RL, Schillinger D, Katen MT, et al. Engaging Diverse English- and Spanish-Speaking Older Adults in Advance Care Planning: The PREPARE Randomized Clinical Trial. JAMA Intern Med. Dec 1 2018;178(12):1616-1625. doi:10.1001/jamainternmed.2018.4657

9.         Sudore RL, Walling AM, Gibbs L, Rahimi M, Wenger NS, Team UCHCPS. Implementation Challenges for a Multisite Advance Care Planning Pragmatic Trial: Lessons Learned. J Pain Symptom Manage. Aug 2023;66(2):e265-e273. doi:10.1016/j.jpainsymman.2023.04.022

10.       Walling AM, Sudore RL, Bell D, et al. Population-Based Pragmatic Trial of Advance Care Planning in Primary Care in the University of California Health System. J Palliat Med. Sep 2019;22(S1):72-81. doi:10.1089/jpm.2019.0142

TRACE: An AI-Integrated Tool for Early Management in Patients with Pregnancy of Unknown Location

Proposal Status: 

Section 1. The UCSF Health problem

Vaginal bleeding with a positive pregnancy test is a common reason for emergency department visits.1 The most critical concern in this context is ectopic pregnancy (EP), which occurs in 1–2% of pregnancies and, if undiagnosed, can result in serious morbidity and loss of fertility.2,3 Diagnosis is challenging because the presenting symptoms are non-specific and often overlap with other early pregnancy complications.4 Evidence-based clinical guidelines recommend early use of transvaginal ultrasound (TVS) and serial β-hCG testing to identify intrauterine pregnancy (IUP) or EP.1 However, when the pregnancy location cannot be determined on initial presentation—a situation termed Pregnancy of Unknown Location (PUL)—further testing and follow-up are required to reach a final diagnosis.5

Numerous diagnostic algorithms have been proposed to improve classification and outcomes for women with PUL.6 While a 2006 consensus statement laid the foundation for PUL management, inconsistencies remain due to varying definitions and classifications of at-risk populations and outcome categories.1 More recently, the M6 prediction model was developed and modified to include β-hCG, ultrasound, and clinical characteristics, with and without progesterone measures. This model has been published and externally validated, serving as a starting point for this proposal. (REF) A renewed effort is needed to standardize and streamline approaches and improve real-world implementation (including diagnosis and management) of evidence-based PUL care.4

At UCSF/ZSFG, a robust clinical guideline for the management of Pregnancy of Unknown Location (PUL) was updated in 2024 and is currently used by residents and attendings to guide risk assessment, triage, and follow-up decisions (ZSFG PUL Guidelines, 2024). Despite this evidence base, implementation remains manual, variable, and burdensome, as exploratory conversations with residents and attendings have uncovered. Right now:

  • Gyn interns rotate every 5 weeks and spend 1–3 hours daily managing a “floater list” of 15–30+ PUL patients with supervision by a daily rotating gyn attending physician.
  • Follow-up plans are documented in free text in APeX and vary based on attending style, leading to inefficiencies and inconsistencies.
  • There is no embedded clinical decision support, no structured tool for follow-up tracking, and no patient-facing communication platform. Interns spend hours calling floater-list patients daily.
  • Management depends on interpretation of lab results, ultrasound, and evolving history, leaving room for error and for delayed diagnosis and intervention in this high-stakes condition.

Section 2. How might AI help? 

Recent developments in artificial intelligence (AI) and machine learning (ML) have shown strong potential to improve PUL management, particularly in outcome prediction, risk stratification, ultrasound interpretation, and individualized care. Key advances include:

  • Risk-prediction models such as M6 and modified M6 with clinical characteristics, which use serial β-hCG with and without progesterone values to estimate likelihood of ectopic pregnancy, have demonstrated high diagnostic accuracy (AUCs up to 0.89).7–10 More complex models—such as neural networks and support vector machines trained on clinical, lab, and imaging data—have shown comparable or better performance.11
  • These models now power decision-support algorithms, including the validated two-step protocol (progesterone + M6), which reduces unnecessary follow-up while maintaining high sensitivity for ectopic detection.6,12
  • AI in ultrasound image interpretation has been piloted using deep learning to identify subtle features of ectopic pregnancies that may be missed by less experienced clinicians.13,14
  • AI is also enabling personalized management strategies, including expectant management for stable ectopic cases.6,10,12

Despite this promising evidence base, AI tools for PUL remain largely absent from routine clinical workflows and are rarely integrated with patient-facing communication systems. UCSF has the opportunity to lead in this space for this complex, high-stakes condition, which often requires weeks of individual follow-up, by developing an embedded AI tool built on its rich historical EHR data, the existing M6 model, and an up-to-date care algorithm, enhancing efficiency, accuracy, and equity in PUL care.

We propose TRACE (Triage, Risk Assessment, and Communication Engine), a 3-in-1 AI tool embedded in APeX and MyChart to support both clinicians and patients across the PUL pathway:

  1. Prediction Model: We will train a machine learning model, building on the published and validated M6 prediction model, working closely with UCSF data scientists on model validation (see the sketch after this list). We will use data from the UCSF EMR, following the procedures described in the referenced publications. We will identify cases using the ICD-10 diagnosis code for PUL and extract data on model inputs, including β-hCG, ultrasound results, and clinical characteristics (risk factors, symptoms, sociodemographics), to assess the model’s ability to predict ectopic pregnancy, viable intrauterine pregnancy, miscarriage, or persistent PUL in our population. Initially, we expect to replicate the model using only the initial visit and the first follow-up; going forward, we hope to improve the predictive model by utilizing information from all follow-up visits so that it can inform decisions (such as when to return for the next visit/lab test/ultrasound) at each visit, as described in step 2 below.
  2. Decision Support + Triage Algorithm: We will translate the ZSFG PUL Guidelines (2024 update) into a structured AI-driven triage algorithm, which will be updated with the prediction model results. The tool will generate next-step clinical recommendations based on the predicted outcome and patient-specific factors. Example outputs: “Repeat hCG in 48 hours”, “Schedule transvaginal ultrasound in 7 days”, “Consider methotrexate”, “Continue expectant management with safety counseling”.
  3. Patient-Facing Communication Assistant: A chatbot within MyChart will deliver timely, consistent, and language-accessible communication to patients, reducing follow-up delays and clinical workload:
  • Confirm follow-up appointments and lab tests
  • Notify patients of test results and recommended next steps
  • Deliver safety information and education (e.g., signs of ectopic pregnancy)
  • Provide reminders and check-ins
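As a rough sketch of how the prediction model in item 1 above could begin, the code below refits a multinomial logistic regression in the spirit of M6 on an extracted PUL cohort. It is illustrative only: the file name, columns, outcome labels, and ICD-10 filter are placeholders for the UCSF/ZSFG extraction, and the published M6 model would first be validated on UCSF data as described above before any refitting.

```python
# Illustrative refit of an M6-style model on a locally extracted PUL cohort.
# File, column, and outcome-label names are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

PUL_ICD10_CODES = ["<local PUL ICD-10 code>"]   # placeholder for the code(s) used at UCSF/ZSFG

draws = pd.read_csv("pul_hcg_draws.csv")        # patient_id, draw_time, hcg, icd10, final_outcome
cohort = draws[draws["icd10"].isin(PUL_ICD10_CODES)].sort_values(["patient_id", "draw_time"])

# Keep the initial visit and first follow-up, mirroring the initial replication plan.
cohort["visit_number"] = cohort.groupby("patient_id").cumcount()
wide = (cohort[cohort["visit_number"] < 2]
        .pivot(index="patient_id", columns="visit_number", values="hcg"))
wide.columns = ["hcg_0", "hcg_48"]

df = wide.assign(
    log_hcg_0=np.log(wide["hcg_0"]),
    hcg_ratio=wide["hcg_48"] / wide["hcg_0"],                  # ~48-hour hCG ratio
    outcome=cohort.groupby("patient_id")["final_outcome"].first(),
).dropna()

# Multinomial logistic regression over the PUL outcome categories.
model = LogisticRegression(max_iter=1000)
model.fit(df[["log_hcg_0", "hcg_ratio"]], df["outcome"])

# Discrimination for ectopic pregnancy, for comparison against the published AUCs.
ep_index = list(model.classes_).index("ectopic")               # assumes this label exists in the extract
p_ep = model.predict_proba(df[["log_hcg_0", "hcg_ratio"]])[:, ep_index]
print("Ectopic AUROC:", roc_auc_score(df["outcome"] == "ectopic", p_ep))
```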

This end-to-end system will augment the current time-intensive, manual “floater list” process, creating a centralized, streamlined, consistent, and patient-centered workflow for PUL management at UCSF.

Section 3. How would an end-user find and use it?

The TRACE AI tool would be most useful when the gyn team is called to consult on a patient with a positive pregnancy test and early pregnancy symptoms (e.g., bleeding, pain), either in the emergency department or outpatient settings (a frequent occurrence). When the consulting gyn clinician initiates hCG testing or flags a case as suspected Pregnancy of Unknown Location (PUL), the AI tool would automatically populate a PUL Management Panel in the APeX EHR.

As shown in Figure 1 (left panel), the tool would provide:

  • A predicted outcome classification (ectopic, intrauterine pregnancy, miscarriage, or persistent PUL)
  • A risk level (e.g., 43% ectopic), generated by the model
  • A visual timeline of serial hCG and US results
  • Management recommendations based on embedded clinical guidelines (e.g., "Repeat hCG in 48 hrs," "Order transvaginal ultrasound")
  • Action buttons to pend orders or insert a templated clinical note

The patient-facing chatbot interface (see Figure 2, right panel) would operate in parallel via MyChart. Patients would receive:

  • Timely updates about lab results and next steps
  • Appointment scheduling assistance
  • Safety warnings (e.g., symptoms of ectopic rupture)
  • Follow-up reminders and options for clarification

Both interfaces are designed to reduce cognitive burden on providers, support standardized care, and ensure that patients, particularly those at high risk, receive clear, timely, and actionable information.
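To illustrate how the guideline-driven recommendations shown in the PUL Management Panel could be generated, the sketch below maps a predicted ectopic risk and hCG trend to a next-step string. The thresholds and wording are placeholders, not the ZSFG PUL guideline values, which would be encoded with the clinical team.

```python
# Illustrative rule skeleton for the decision-support layer; thresholds and
# branching are placeholders, not the ZSFG PUL guideline values.
def pul_next_step(p_ectopic: float, hcg_ratio: float, viable_iup_seen: bool) -> str:
    if viable_iup_seen:
        return "Routine obstetric follow-up; no further PUL tracking needed"
    if p_ectopic >= 0.30:                      # placeholder high-risk cutoff
        return "High risk of ectopic pregnancy: transvaginal ultrasound and gyn review today"
    if hcg_ratio < 0.5:                        # falling hCG suggests a resolving/failed pregnancy
        return "Continue expectant management with safety counseling; repeat hCG in 7 days"
    return "Repeat hCG in 48 hours and reassess risk"

# Example: a patient with a 43% predicted ectopic risk, as in Figure 1.
print(pul_next_step(p_ectopic=0.43, hcg_ratio=1.1, viable_iup_seen=False))
```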

 

Section 4. Embed a picture of what the AI tool might look like

Section 5. What are the risks of AI errors

  • False negatives, such as classifying a high-risk ectopic pregnancy as low-risk, could result in delayed or missed diagnosis and potentially serious patient harm.
  • False positives, such as overclassifying a low-risk case, could lead to unnecessary labs, visits, or patient anxiety.
  • Equity risks are possible if the model underperforms for subpopulations (e.g., patients with limited English proficiency, Medicaid insurance, or historically marginalized groups) due to biased training data.

These risks will be mitigated through the following approaches:

  • Begin with validated models (e.g., M6) and retrain using UCSF historical data
  • Evaluate and report model performance stratified by race/ethnicity, language, and insurance status
  • Keep AI recommendations advisory, allowing clinician discretion
  • Include transparent outputs and user-facing explanation of AI logic (e.g., “based on hormone trends and symptoms, this is considered high risk”)
  • Build an audit dashboard to track tool usage, override frequency, and outcome correlation

Section 6. How will we measure success

We will draw on our expertise as implementation researchers to evaluate the effectiveness of this AI tool and will seek additional funding to do so. Prior to introduction, we will survey residents and conduct retrospective chart review to examine the measures described below before and after introduction. We will estimate time demands for both clinicians and patients before and after introduction. We will assess whether end-users (clinicians and patients) are using the system as intended, whether it leads to improved decision-making and workflow efficiency, and whether it improves timely and accurate diagnosis of PUL. We will also examine whether the tool supports equity, reduces unnecessary follow-ups, and minimizes the risk of delayed ectopic diagnosis.

We will monitor adoption, usage patterns, and clinical outcomes using a combination of APeX-derived metrics and ideal supplemental measures at ZSFG. These metrics will help determine whether to continue expanding the tool within UCSF or to modify or abandon implementation.

A. Measurements using data that is already being collected in APeX:

  • Time from first positive pregnancy test to definitive PUL diagnosis
  • Number of hCG draws, pelvic ultrasounds, and follow-up visits per PUL patient
  • Proportion of ectopic pregnancies diagnosed before rupture
  • Proportion of PUL patients receiving timely follow-up
  • Frequency of AI tool usage by eligible clinicians (e.g., OB/Gyn residents, Gyn attendings, ED providers)
  • Frequency with which AI-generated recommendations are followed vs. overridden
  • Number and type of orders pended via the AI interface
  • MyChart chatbot message delivery and open rates

B. Additional measurements ideally needed to evaluate success of the AI:

  • Clinician-reported satisfaction, time savings, trust, and perceived utility (via surveys or focus groups)
  • Patient understanding of follow-up instructions (e.g., short MyChart-based surveys)
  • Time spent by interns managing the floater list before and after implementation
  • Disparities in model performance and outcomes stratified by insurance status, language, and race/ethnicity
  • Rate of errors or safety concerns identified through audit (e.g., missed follow-up for high-risk cases)
  • Longitudinal reduction in unnecessary follow-up testing for low-risk cases

Success would be indicated by high tool adoption, improved timeliness and consistency of PUL management, reduced follow-up burden for clinicians and patients, cost-effectiveness, and equitable model performance across subgroups. Low adoption, significant workflow disruption, or evidence of biased or unsafe predictions would be grounds for modification or discontinuation. 

Qualifications and commitment.

This project will be led by Dilys Walker, MD, FACOG. Dr. Walker is a practicing obstetrician-gynecologist at ZSFG and a member of the leadership team at the Bixby Center for Global Reproductive Health. Her expertise is in implementation research across the life course. Dr. Walker has decades of experience managing PUL in various settings. She was awarded Gates Grand Challenges and Johnson & Johnson funding to build and test a Virtual Mentor chatbot for postpartum hemorrhage. Her team has attended three works-in-progress sessions with the AER team.

 

References

1.         Barnhart KT. Ectopic Pregnancy. N Engl J Med. 2009;361(4):379-387. doi:10.1056/NEJMcp0810384

2.         ACOG Practice Bulletin No. 193: Tubal Ectopic Pregnancy - PubMed. Accessed April 1, 2025. https://pubmed.ncbi.nlm.nih.gov/29470343/

3.         Marion LL, Meeks GR. Ectopic pregnancy: History, incidence, epidemiology, and risk factors. Clin Obstet Gynecol. 2012;55(2):376-386. doi:10.1097/GRF.0b013e3182516d7b

4.         Kirk E, Papageorghiou AT, Condous G, Tan L, Bora S, Bourne T. The diagnostic effectiveness of an initial transvaginal scan in detecting ectopic pregnancy. Hum Reprod Oxf Engl. 2007;22(11):2824-2828. doi:10.1093/humrep/dem283

5.         Condous G, Kirk E, Lu C, et al. Diagnostic accuracy of varying discriminatory zones for the prediction of ectopic pregnancy in women with a pregnancy of unknown location. Ultrasound Obstet Gynecol Off J Int Soc Ultrasound Obstet Gynecol. 2005;26(7):770-775. doi:10.1002/uog.2636

6.         Bobdiwala S, Christodoulou E, Farren J, et al. Triaging women with pregnancy of unknown location using two-step protocol including M6 model: clinical implementation study. Ultrasound Obstet Gynecol Off J Int Soc Ultrasound Obstet Gynecol. 2020;55(1):105-114. doi:10.1002/uog.20420

7.         Maheut C, Panjo H, Capmas P. Diagnostic accuracy validation study of the M6 model without initial serum progesterone (M6NP) in triage of pregnancy of unknown location. Eur J Obstet Gynecol Reprod Biol. 2024;296:360-365. doi:10.1016/j.ejogrb.2024.03.010

8.         Kyriacou C, Ledger A, Bobdiwala S, et al. Updating M6 pregnancy of unknown location risk-prediction model including evaluation of clinical factors. Ultrasound Obstet Gynecol Off J Int Soc Ultrasound Obstet Gynecol. 2024;63(3):408-418. doi:10.1002/uog.27515

9.         Christodoulou E, Bobdiwala S, Kyriacou C, et al. External validation of models to predict the outcome of pregnancies of unknown location: a multicentre cohort study. Bjog. 2021;128(3):552-562. doi:10.1111/1471-0528.16497

10.       Hou L, Liang X, Zeng L, Wang Q, Chen Z. Conventional and modern markers of pregnancy of unknown location: Update and narrative review. Int J Gynaecol Obstet Off Organ Int Fed Gynaecol Obstet. 2024;167(3):957-967. doi:10.1002/ijgo.15807

11.       Rueangket P, Rittiluechai K, Prayote A. Predictive analytical model for ectopic pregnancy diagnosis: Statistics vs. machine learning. Front Med. 2022;9:976829. doi:10.3389/fmed.2022.976829

12.       Jurman L, Brisker K, Ruach Hasdai R, et al. Enhancing decision-making in tubal ectopic pregnancy using a machine learning approach to expectant management: a clinical article. BMC Pregnancy Childbirth. 2024;24:825. doi:10.1186/s12884-024-07035-4

13.       Training and testing performance of ectopic pregnancy prediction model... ResearchGate. Accessed April 1, 2025. https://www.researchgate.net/figure/Training-and-testing-performance-of-...

14.       An automated ectopic pregnancy prediction system using ultrasound images with the aid of a deep learning technique. ResearchGate. Published online December 16, 2024. doi:10.1007/s00500-024-10333-w

 

Summary of Open Improvement Edits

The 17 comments provided on our proposal raised valuable points that have been used to strengthen it. The majority of the comments came from clinical faculty and residents who manage the problem of PUL on a daily basis and believe this tool could be transformational on multiple levels. We have tried to strengthen Section 2 (“How might AI help?”) in response to these helpful comments.

Summary of comments:

Comments supportive of the TRACE tool:

  • Quality of care- The tool will improve the quality of patient care for this high-stakes condition. Though guidelines exist and have been validated in various populations, they are not always followed at ZSFG (or around the country, for that matter). As one reader stated, “I have found as a resident that these guidelines are not always followed, often due to providers simply not having enough time to refer to them every time, and instead going with their own personal preference when it comes to PUL management, and that leads to a lot of variation in practice and sometimes near misses.”
  • Administrative Burden- residents spend up to 3 hours daily following the floater list of PUL patients.
  • Costs- The majority of patients with PUL are covered by Medicaid.  By streamlining follow up, costs will be saved while maintaining or improving quality.
  • Patient-centered- Many patients must be followed for days to weeks with serial blood draws and ultrasounds to determine the location and viability of their pregnancy. This is disruptive and anxiety provoking, with the potential for errors and missed ectopic pregnancies. By creating a patient-facing platform, communication can be automated and consolidated. Additionally, one reader commented on the value of creating language-accessible messaging for the multilingual population we serve; calls currently often require 10–15 minutes each to secure an interpreter and make certain the message is understood.

 In summary

One reader wrote, “AI can handle complex algorithmic decision-making that’s difficult for (tired) human brains to do quickly and it can complete important administrative tasks better than physicians (tracking, patient communication and reminders) making it a win for patients and physicians alike. If successful, TRACE would be a highly sought-after tool across the United States.”

Comments to strengthen the proposal:

  • One reader commented, how much of this could you do without AI? 

The fact is, all of it is currently done without AI, which is suboptimal, time-intensive, and risky. We have added clarification to the description of the problem in Section 1 to better justify the benefits of the TRACE tool. One of the readers, who is an ObGyn fellow, said it best: “PUL is a challenging clinical diagnosis because it carries significant uncertainty, is managed differently by different clinicians despite robust clinical guidelines on best practices, and is highly time and resource intensive for the trainees at our institution who do the patient-facing work of helping patients navigate the process of determining pregnancy location. This proposal has the potential to revolutionize care for patients facing an uncertain prognosis which can be confusing, scary and very burdensome, and also to revolutionize clinicians' work of managing PULs”

  • In response to the comment- It sounds as if developing or refining the initial prediction model is key -- driving the two other pieces. Can you expand a bit on how you might work with the validated tools you've mentioned? Has any preliminary work been done with UCSF data?

This is correct: validating and potentially modifying the M6 prediction model with UCSF data would be the first step. We have added additional clarification and references to the description of risk prediction models and the development pathway we describe. Specifically, we will use the modified M6 model from the UK/Belgium and validate its performance with UCSF data. The M6 model has been published and validated externally. We will work closely with UCSF data scientists on model validation using data points available from the UCSF EMR, following the procedures described in the publications listed below. We will identify cases using the ICD-10 diagnosis code for PUL and extract data on β-hCG, ultrasound results, and clinical characteristics (risk factors, symptoms) for our population. Initially, we expect to replicate the model using the initial visit and the first follow-up; going forward, we hope to improve the predictive model by utilizing information from all follow-up visits so that it can inform decisions (such as when to return for the next visit/lab test/US) at each visit.

 Kyriacou C, Ledger A, Bobdiwala S, Ayim F, Kirk E, Abughazza O, Guha S, Vathanan V, Gould D, Timmerman D, Van Calster B, Bourne T; Collaborators. Updating M6 pregnancy of unknown location risk-prediction model including evaluation of clinical factors. Ultrasound Obstet Gynecol. 2024 Mar;63(3):408-418. doi: 10.1002/uog.27515. PMID: 37842861.

 Maheut C, Panjo H, Capmas P. Diagnostic accuracy validation study of the M6 model without initial serum progesterone (M6NP) in triage of pregnancy of unknown location. Eur J Obstet Gynecol Reprod Biol. 2024 May;296:360-365. doi: 10.1016/j.ejogrb.2024.03.010. Epub 2024 Mar 8. PMID: 38552504.

 

 

 

 

 

 

 

Reducing Diagnostic Delays in Inflammatory Bowel Disease Through Machine Learning Approaches

Primary Author: Vivek Rudrapatna
Proposal Status: 

1-The UCSF Health Problem

According to the National Academy of Medicine (NAM), about 30% of healthcare activities are generally wasted on unnecessary services and other inefficiencies. Although the true magnitude of wasteful activity at UCSF is not precisely known, it is likely that waste not only exists but meaningfully harms UCSF in an increasingly competitive environment. Diagnostic and treatment errors harm patients and undercut our mission of advancing health worldwide. They also harm the financial health of UCSF, particularly in the setting of risk-bearing contracts with payors. Diagnostic and treatment errors also harm UCSF in many indirect ways, such as 1) reducing healthcare access for other patients, 2) harming our reputation as a global leader in medicine, 3) increasing the risk of staff burnout, and 4) creating medico-legal risk from diagnostic or treatment delays.

2- How Might AI Help?

One of the great promises of AI lies in its ability to improve medical decision making, enhancing both patient outcomes and hospital efficiency by reducing waste. While recent years have seen many advances in general AI solutions (e.g., ChatGPT), there remains a significant paucity of models trained to correctly interpret healthcare data. Dedicated AI models for interpreting the EHR are likely necessary given its significant complexity, as well as its significant structural differences from the internet-scale text used to train most general AI solutions. In the near term, we foresee a need for “homegrown” AI solutions, given 1) the significant challenges that general AI companies face in accessing sensitive EHR data, and 2) their relative lack of healthcare expertise compared to what is available at UCSF.

We propose to develop a general-purpose system to reduce diagnostic delays, piloting this system in IBD. The choice of this disease reflects our clinical expertise and directly extends recent work from our group that grapples with and partially addresses the challenges of building systems intended for real-world deployment. However, we envision that a successful pilot will be the first step on the road to extending this very broadly across diseases. 

The system we envision will likely be a data-driven approach (using known cases and controls to train a longitudinal classifier), but we can also explore models that utilize published guidelines and diagnostic algorithms to enhance the model and eventually reduce errors from human clinicians.

Our model will be trained and validated using readily available, free, de-identified EHR datasets. It will analyze patients' past healthcare interactions, integrating ICD codes (historical diagnoses, misdiagnoses, and symptoms), medication history (previous prescriptions and treatment patterns), laboratory results (CBC, inflammatory markers, etc.), imaging reports (radiology findings linked to IBD suspicion), clinical notes (NLP-based insights from physician documentation), and demographics (age, sex, race, and lifestyle factors).
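A simplified sketch of this case/control training approach is below. The file names, columns, and feature construction are hypothetical stand-ins for the de-identified extract; the production model would use a richer longitudinal representation than the simple counts and last-value summaries shown here.

```python
# Simplified sketch: bag-of-codes plus lab summaries, trained on known IBD
# cases vs. matched controls. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

codes = pd.read_csv("diagnosis_codes.csv")    # patient_id, icd10, days_before_index
labs = pd.read_csv("labs.csv")                # patient_id, lab_name, value, days_before_index
labels = pd.read_csv("labels.csv")            # patient_id, has_ibd (assumed complete for the extract)

# How often each pre-index diagnosis code appears per patient.
code_counts = (codes.groupby(["patient_id", "icd10"]).size()
                    .unstack(fill_value=0))

# Most recent pre-index value per lab (smallest days_before_index first).
last_labs = (labs.sort_values("days_before_index")
                 .groupby(["patient_id", "lab_name"])["value"].first()
                 .unstack())

X = code_counts.join(last_labs, how="outer").fillna(0)
y = labels.set_index("patient_id").loc[X.index, "has_ibd"]

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("Cross-validated AUROC:", cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())
```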

3- How would an end-user find and use it?

The AI model will generate a probability score for IBD based on these factors, offering transparent and explainable predictions. The output will display the likelihood (%) that a patient has IBD, the key contributing factors behind the prediction (e.g., chronic diarrhea, weight loss, anemia, previous gastroenterology-related complaints), and suggested next steps (e.g., referral to a gastroenterologist or additional non-invasive screening).
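As a sketch of how that output could be assembled, assuming a tree-based classifier like the one outlined in the previous section, the function below packages the probability, a rough list of contributing features, and a suggested next step. The threshold and wording are placeholders.

```python
# Sketch of the display payload, assuming a fitted tree-based classifier `clf`
# and a per-patient pandas Series of features. Threshold and text are placeholders.
def ibd_alert(clf, feature_row, feature_names, referral_threshold=0.7):
    prob = float(clf.predict_proba(feature_row.values.reshape(1, -1))[0, 1])
    # Rank this patient's non-zero features by global importance as a rough
    # stand-in for per-patient explanations.
    importances = dict(zip(feature_names, clf.feature_importances_))
    contributors = sorted(
        (f for f in feature_names if feature_row[f] != 0),
        key=lambda f: importances[f], reverse=True,
    )[:5]
    next_step = ("Direct GI referral / endoscopy" if prob >= referral_threshold
                 else "Non-invasive work-up (fecal calprotectin, CRP) and GI e-consult")
    return {"ibd_probability": round(prob, 2),
            "key_contributors": contributors,
            "suggested_next_step": next_step}
```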

The model will be deployed within the Electronic Health Record (EHR) system and triggered during a patient’s visit to their physician. Specifically, it will be:

  • Automatically activated when a patient with relevant symptoms visits a physician.
  • Displayed in real-time within the physician’s workflow, ensuring immediate usability.

When the AI detects a high risk of IBD, the clinician will receive an alert within the EHR interface. The alert will display:

  • The probability score (%) of IBD suspicion for that patient.
  • An optional, expandable menu showing a breakdown of key contributing factors, such as persistent gastrointestinal complaints, abnormal lab findings, or prior treatments.
  • Suggested next steps, such as direct referral for a GI procedure, vs further non-invasive testing.

The tool will allow clinicians to acknowledge the AI recommendation and either follow or override it; order additional tests (e.g., fecal calprotectin, CRP, stool culture) within the same interface; directly refer the patient for a GI procedure or a GI e-consultation; and view explainability details to understand why the AI made the recommendation.

4- Embed a picture of what the AI tool might look like. 

5- What Are the Risks of AI Errors?

While AI can improve early IBD detection, it comes with potential risks that must be managed.

1. False Positives (Overdiagnosis)

  • Risk: AI may flag non-IBD patients, leading to unnecessary referrals and lab tests.
  • Mitigation: Use explainability tools, confidence thresholds, and allow clinician override to avoid excessive false alarms.

2. False Negatives (Missed Diagnoses)

  • Risk: AI might miss true IBD cases, delaying treatment.
  • Mitigation: Regular model updates, feedback loops, and a human-in-the-loop approach ensure accuracy.

Risk Monitoring & Continuous Improvement

  • Track error rates (false positives/negatives).
  • Collect clinician feedback for refinement.
  • Regularly retrain AI models to improve reliability.

While false positives may lead to minor inefficiencies, the benefits of early IBD detection far outweigh the risks when managed properly.

6- How Will We Measure Success?

To evaluate the AI model’s effectiveness, we will perform an embedded randomized controlled trial of the tool, clustered by clinician, and measure the following:

1. Existing APeX Data Metrics

✔ Reduction in time to diagnosis (from first symptoms to confirmed IBD diagnosis).
✔ Proportion of direct-to-endoscopy referral cases with positive findings
✔ Percentage of clinicians interacting with the AI tool in practice (only in the intervention group)

2. Additional Ideal Metrics

✔ Clinician trust and adherence (based on a timed survey of the intervention group).
✔ Improvement in patient outcomes (fewer emergency visits, earlier treatment initiation).
✔ Cost savings from fewer unnecessary tests and delayed diagnoses.

When to Continue or Abandon?

  • Success: AI adoption increases, diagnostic delays decrease, and patient outcomes improve.
  • Failure: AI shows persistent false positives/negatives, low clinician adoption, or no impact on diagnosis speed.

If successful, we will seek long-term integration into UCSF’s APeX system.

7- Describe your qualifications and commitment

I am an IBD physician and a researcher in clinical data science. As a clinician, I am often on the receiving end of referrals for new or suspected cases of IBD and often note delays in timely diagnosis. As a researcher, I work on computational methods for using EHR data to improve clinical decision making. We have developed, published, and patented methods for reducing diagnostic delays in rare diseases using machine learning (PMID 38946554). More recently, in unpublished work, we have enhanced these methods using novel ML architectures for longitudinal predictive modeling and incorporated domain expertise using automated methods. If selected for this pilot, I commit 15% of my effort to working with the UCSF AI team to develop, deploy, and test this new tool.

Supporting Documents: 

Enhancing IBD Flare Inpatient Management: Integrating AI Tools into the ApeX EHR System for Improving Protocol Adherence, Patient Outcomes, and Reducing Healthcare Costs

Proposal Status: 

The UCSF Health problem

Inflammatory bowel disease (IBD), including Crohn's disease and ulcerative colitis, is a chronic inflammatory condition affecting 1.5 million people in the United States, with a significant hospitalization rate of 9.24 per 100 IBD patients annually [1]. Flares of the disease are a common cause of these hospitalizations. Optimal management of acute IBD flares necessitates timely surgical consultations, endoscopic evaluations, initiation of anticoagulant therapy, and possible surgical interventions [2]. However, adherence to these protocols is often compromised by heavy clinical workloads and oversight, leading to delays that diminish patient care quality, extend hospital stays, and increase healthcare costs [3].

UCSF, a tertiary medical center with a comprehensive IBD program, is committed to providing extensive care for patients with complex IBD cases necessitating hospitalization due to flares. Strict adherence to clinical protocols is vital for enhancing the quality of patient care, reducing the duration of hospital stays, and decreasing healthcare costs [3]. Despite these high standards, adherence to these protocols at UCSF, particularly at newly integrated sites such as Saint Francis Memorial Hospital and St. Mary's Medical Center, and during the absence of clinical fellows who assist the service, can fall short of expectations. This is largely due to the challenges posed by heavy clinical workloads and oversight.

How might AI help?

Our goal is to enhance the Advancing Patient-Centered Excellence (APeX) electronic health record (EHR) system at UCSF to improve monitoring and management of patients admitted with IBD flares. By integrating large language model (LLM) AI tools, such as VERSA at UCSF, both structured data (e.g., orders, lab values) and unstructured data (e.g., physician notes) collected during patient admissions can be effectively processed. This AI tool will actively monitor adherence to the established IBD management protocol and alert gastroenterology/IBD providers and primary team providers, such as hospitalists, by displaying a reminder window when they access a patient’s chart in APeX and by providing SmartPhrases that providers can use when writing notes and when signing off to the next provider during hand-off. These prompts aim to prevent delays in care by ensuring timely adherence to necessary clinical actions.

Key Areas Where AI Is Essential:

  1. Imaging Interpretation: Determining whether imaging studies indicate colonic dilation requires parsing radiology reports. AI, particularly natural language processing (NLP) tools, can extract such findings from unstructured text, facilitating timely surgical consultations when necessary (a minimal extraction sketch follows this list).
  2. Treatment Response Assessment: Evaluating a patient's response to intravenous glucocorticoids or biologic therapies involves synthesizing symptom descriptions, lab trends, and provider assessments documented in narrative form. AI models can integrate these data points to identify non-responders and prompt appropriate management adjustments.​
  3. Abscess Monitoring: Assessing the resolution of a Crohn’s-related abscess post-antibiotic therapy and drainage requires tracking symptom improvement and imaging findings over time. AI can correlate these unstructured data elements to determine if further intervention is warranted, especially when the drainage is not adequate.​
  4. Discharge Planning: Ensuring readiness for discharge encompasses verifying smoking cessation counseling, monitoring stool characteristics, and confirming nutritional tolerance—all typically noted in free-text clinical documentation. AI can collate this information to identify potential safety gaps prior to discharge, and prompt early initiation of discharge planning.
  5. Admission Reasoning: Identifying admissions in which an IBD flare was not the primary reason for hospitalization, or during which an IBD flare developed, requires analyzing provider notes before ICD codes become available from the medical coding team. AI can detect such nuances and initiate IBD flare management monitoring, as above, at a very early stage.
  6. External Data Integration: Patients often receive care at multiple institutions, leading to fragmented records. AI can reconcile external documents, such as vaccination records or prior treatments, that are embedded in unstructured formats.
  7. Reducing Unnecessary Alerts and Enhancing Clinical Workflow: Traditional electronic health record (EHR) systems often generate numerous alerts based solely on structured data, leading to alert fatigue among clinicians. AI can mitigate this by analyzing unstructured data to provide context-aware alerts. In the situation described above, for example, if a vaccination was administered at an out-of-network facility and documented only in clinical notes, AI can recognize this and prevent redundant alerts. By tailoring notifications to the physician's preferences and clinical context, AI enhances the usability of EHR systems, reducing the cognitive burden caused by irrelevant or excessive alerts.
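
To make the imaging-interpretation example in item 1 concrete, here is a minimal, illustrative sketch of prompting an LLM to extract a single protocol-relevant finding (colonic dilation) from a radiology report. The gateway URL, model name, and JSON output contract below are assumptions for illustration only, not a confirmed Versa or APeX interface.

```python
# Illustrative sketch only: extracting a protocol-relevant finding from a
# radiology report with an LLM. The base_url, credential handling, and model
# name below are hypothetical placeholders, not a confirmed Versa interface.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://versa.example.ucsf.edu/v1",  # hypothetical gateway URL
    api_key="REPLACE_WITH_CREDENTIAL",
)

PROMPT = """You are assisting with inpatient IBD flare protocol monitoring.
Read the radiology report below and answer in JSON with two keys:
"colonic_dilation" (true/false/unknown) and "evidence" (a short quote).

Report:
{report_text}
"""

def assess_colonic_dilation(report_text: str) -> dict:
    """Ask the LLM whether the report describes colonic dilation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(report_text=report_text)}],
        temperature=0,
    )
    # A production pipeline would validate this output before acting on it.
    return json.loads(response.choices[0].message.content)

# Example usage with a synthetic report snippet:
# assess_colonic_dilation("Transverse colon measures 7.2 cm, concerning for dilation.")
```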

How would an end-user find and use it?

Our objective is to enhance the Advancing Patient-Centered Excellence (ApeX) electronic medical records (EHR) system at UCSF, aiming to improve the monitoring and management of patients admitted with IBD flares. By incorporating large language model (LLM) AI tools, such as VERSA at UCSF, we can effectively process both structured data (e.g., orders, lab values) and unstructured data (e.g., physician notes) collected during patient admissions. This AI tool will actively monitor adherence to the established IBD management protocol. It will alert gastroenterology/IBD providers and primary team providers, such as hospitalists, by displaying a reminder window when they access a patient’s chart in the ApeX EHR system. Additionally, SmartPhrases will be made available to providers for use when writing notes and during the hand-off process to the next provider, ensuring seamless communication. These prompts are designed to prevent delays in care by ensuring timely adherence to necessary clinical actions.

Embed pictures of what the AI tool might look like

Figure 1 shows the reminder window for an IBD flare inpatient milestone that has not been met: failure to respond to first-line therapy without initiation of secondary medical or surgical therapy. The failure to respond is recognized by the AI tool (a large language model) from unstructured data, such as symptoms documented in provider and nursing notes. The absence of secondary medical or surgical therapy is identified from structured data (medication orders) and from notes documenting surgical evaluation and procedure plans. A link to the most up-to-date guideline on the milestones is included for user education. If the provider accepts the reminder, the relevant order set opens directly for convenience; however, the provider is not forced to take any action, in case the reminder is a false positive. The tool functions as a reminder and allows the provider to verify. The “Decline” button lets the user leave a comment to help improve the tool.

Figure 2 depicts a SmartPhrase showing the current status of the IBD flare inpatient milestones. It will contain the following menus, allowing the provider to make selections and edits as needed. The SmartPhrase will also work in the Handoff section, supporting sign-out and information sharing with other teams, such as the overnight on-call or consulting teams.

What are the risks of AI errors?

AI tools may occasionally misinterpret data or generate errors, leading to false positives or false negatives. For instance, a milestone may not be met, yet the AI might incorrectly indicate that it has been met (false positive). However, this is unlikely to cause additional harm, since a provider who would have recognized the unmet milestone independently would not be misled into overlooking it simply because the AI suggested otherwise. The only scenario in which a milestone would still be missed is if both the provider and the AI tool fail to recognize it—essentially the same outcome as not using the AI tool at all. Conversely, the AI might fail to recognize that a milestone has been met, resulting in a false negative. However, these errors can often be corrected through provider review. Given that these milestones are generally straightforward for providers to verify, but may be overlooked due to heavy clinical workloads, the AI tool primarily functions as a reminder rather than a decision-maker.

How will we measure success?

The implementation and evaluation of the AI tool for IBD flare management necessitate a strategic approach that measures effectiveness, minimizes disruption to clinical workflow, and maximizes provider adoption and satisfaction. The evaluation process will be structured into two primary components:

  1. Evaluate the Effectiveness of Protocol Adherence Improvement: We will conduct a prospective cohort study in which patients admitted with IBD flares before the tool is deployed and patients admitted after deployment, whose providers use the tool, are assigned to two groups. This approach allows us to directly compare outcomes between the groups.
    • Primary Outcome—Protocol Adherence: Adherence will be quantitatively assessed using a specifically designed scoring system that evaluates how closely care teams follow established management protocols.
    • Secondary Outcomes—Health Costs and Hospital Stay Duration: We will measure the economic impact by considering potential costs associated with AI-related errors, such as false negatives that could lead to unnecessary workups, and potential savings from reducing complications due to delays in care. Additionally, the length of hospital stays will be tracked to assess any efficiencies gained through improved management adherence.
  2. Evaluate Provider Satisfaction:
    • User Feedback and Scoring System: The acceptance and satisfaction of inpatient providers using the AI tool are critical for its long-term success. We will collect Net Promoter Scores (NPS) to quantify user satisfaction. Additionally, we will gather detailed feedback for further improvements, capturing both qualitative and quantitative aspects of user experience.

Existing APeX Data Metrics: Protocol Adherence, Hospital Stay Duration.

Additional Ideal Metrics: Health Costs, User Feedback, and Scoring.

Describe your qualifications and commitment

This project is spearheaded by Yuntao Zou, MD, a seasoned hospitalist with extensive experience in inpatient medical management. Dr. Zou is also an accomplished AI researcher, with a clinical and research focus on leveraging AI tools, such as large language models (LLMs), to enhance healthcare delivery and clinical decision-making processes. I am responsible for designing and overseeing the entire project. If selected, I will devote at least 10% effort (or more as required) for 1 year to ensure the success of this proposal.

Vivek Rudrapatna, MD, PhD, serves as the co-lead for this initiative, with a specific focus on enhancing the IBD management protocol and developing the AI tool. As a physician-scientist and specialist in inflammatory bowel disease, Dr. Rudrapatna brings a wealth of experience in IBD management. He also leads a research group dedicated to developing methods for analyzing healthcare data, aiming to enhance clinical decision-making processes.

Reference:

1.           Buie, M.J., S. Coward, A.A. Shaheen, J. Holroyd-Leduc, L. Hracs, C. Ma, et al., Hospitalization Rates for Inflammatory Bowel Disease Are Decreasing Over Time: A Population-based Cohort Study. Inflamm Bowel Dis, 2023. 29(10): p. 1536-1545.

2.           Lewin, S. and F.S. Velayos, Day-by-Day Management of the Inpatient With Moderate to Severe Inflammatory Bowel Disease. Gastroenterol Hepatol (N Y), 2020. 16(9): p. 449-457.

3.           Burisch, J., M. Zhao, S. Odes, P. De Cruz, S. Vermeire, C.N. Bernstein, et al., The cost of inflammatory bowel disease in high-income settings: a Lancet Gastroenterology & Hepatology Commission. Lancet Gastroenterol Hepatol, 2023. 8(5): p. 458-492.

 

 Summary of Open Improvement Edits

  • Changed Figure 1 and its explanation to an example more closely tied to the need for AI tools.
  • Thanks to Dr. Pletcher for the reminder! Key features requiring AI tools were added to the "How might AI help?" section.
  • Thanks to Dr. Xue for the reminder! The impact of a false positive result (missing a milestone that was not met) was added to the "What are the risks of AI errors?" section.
Supporting Documents: 

RAPIDDx: A Tale of 2 LLMs. Real time, AI-enabled, Point-of-care Intelligence for Differential Diagnosis

Proposal Status: 

Section 1. The UCSF Health Problem 

Diagnostic error, the failure to establish or communicate an accurate and timely explanation of a patient’s health problem, affects 12 million people in the U.S. annually, leading to delays in treatment, potentially avoidable healthcare utilization, and increased morbidity and mortality.1

As a tertiary and quaternary referral center, UCSF and its clinicians face the ever-growing challenge of providing accurate and timely diagnosis for patients with high medical complexity. The task of achieving diagnostic excellence has grown more difficult not only because the corpus of knowledge for physicians to command continues to expand quickly, but because - as patients become more sick and more complex, and as the volume of information about them already contained in the electronic health record (EHR) increases - even the most capable of physicians are challenged to incorporate all of it into their diagnostic heuristics. Instead, many will examine just a limited set of recent EHR encounters to quickly familiarize themselves with the patient. This, however, can leave historical blind spots that might otherwise offer important contextual clues to aid in diagnosis. This is particularly important for complex patients, who not only have a greater co-morbidity burden and higher rates of polypharmacy and adverse drug events that may underlie or contribute to diagnoses, but may also have cognitive changes such as delirium and dementia that leave them less able to describe important elements of the history of present illness. LLMs, on the other hand, can examine vast quantities of the medical history from the EHR and synthesize, in near real-time, tailored diagnoses based on that information and on new, evolving information associated with the current encounter.

Nationally, the diagnostic excellence movement, led by the Agency for Healthcare Research and Quality (AHRQ), and the recently defunct Society to Improve Diagnosis in Medicine (SIDM), has hungered for decades for tools with the potential to dramatically transform the field of diagnostic excellence. Large language models (LLMs), if appropriately integrated into the electronic health record and able to review the entirety of a patient’s history, lab results, imaging, medications, and problem lists, offer an unprecedented opportunity to advance diagnostic excellence and healthcare quality in a generational leap forward.

Section 2. How might AI help? 

Although differential diagnosis (DDx) generators - machine learning, artificial intelligence (AI), and rules-based tools meant to aid the clinician in the diagnostic process by suggesting a list of potential diagnoses based on the information at hand - are not new,2,3 their performance has limited their widespread adoption. Many of these tools, often based on probabilistic models, utilize only patient symptoms for assessment, not taking advantage of documented physical exam, labs, imaging results, and other structured and unstructured data typically available in an EHR.4 LLMs, on the other hand, including those using retrieval-augmented generation (RAG) that benefit from access to a large corpus of medical domain-specific knowledge, have demonstrated impressive diagnostic capabilities in clinical vignettes and in limited studies.5–9

A Tale of 2 LLMs not Realizing their Full Potential: While UCSF established national leadership by creating a HIPAA-compliant LLM gateway (Versa), and a RAG-enriched version based on content from one of the largest medical text publishers in the world (Versa Curate), the lack of APeX/HIPAC integration of these tools has limited the realization of their full clinical potential. Users wishing to take advantage of these tools at the point of care, for example, must open Versa in one application window, APeX in another, and copy/paste limited quantities of information from one to the other. Another widely used LLM tool, Open Evidence (OE), enjoys widespread adoption at UCSF, but is a standalone tool that is not yet HIPAA-compliant, nor integrated into APeX. As such, users often type in generic prompts to get relatively generic answers. (Table 1) If these LLMs were integrated into APeX and could examine the entirety of a patient’s hospital encounter or outpatient records, rather than relying on copy/pasting just a fraction of information about a patient from APeX into Versa Curate or manually typing generic information into OE, the potential opportunities for differential diagnosis (as well as a vast array of other use cases that could be governed with “prompt order sets”) would be immense and scalable across nearly every UCSF specialty.

Primary Use Case: The primary use case for this proposal is to examine Versa Curate and OE as differential diagnosis generators that automatically review inpatient EHR data without manual prompting and passively offer daily updated sidebar differential diagnoses. HIPAA compliance is an inclusion requirement for OE; if it is not met, the proposal will proceed with Versa Curate alone. This use case impacts several high priority areas (Table 2).

Secondary Use Case: Once Versa Curate and OE are integrated into APeX, the secondary use cases, which will not be specifically explored in this proposal, are immense, ranging from guideline-based medical management recommendations and identification of care gaps to assessment of and feedback on learners’ clinical notes.

Section 3. How would an end-user find and use it?

The target end-user, Hospital Medicine clinicians, would see the LLM-generated DDx list passively residing as a sidebar directly in the APeX workflow. (Figure 1) Because it is present to support the clinician, it would be purposefully visible but non-intrusive and non-interruptive. 

Section 4. Embed a picture of what the AI tool might look like

Figure 1 shows the existing APeX workflow in which an LLM sidebar would contain a daily updated DDx list with the option to paste the list into the Progress Note and to like or dislike the DDx suggestions. This list would be generated and updated automatically on a daily basis without prompting. A user could choose from assistants or models (e.g. Versa Curate or OE).

Section 5. What are the risks of AI errors

Although LLMs may make errors of omission, inaccuracy, and hallucination, the proposed use of AI for DDx generation poses relatively few risks because the DDx list is offered to the clinician as a suggestion of diagnoses to consider. It remains up to the clinician to decide whether and how to use the suggested information. However, users may report concerns with the “Report a concern” button to help investigators identify potential issues with the LLM DDx lists.

Section 6. How will we measure success

Within the two domains listed below, we will measure success across three categories (People, Process, and Outcomes) and compare Versa Curate and OE. Although the specifics of the deployment will be discussed in detail with AER if funded, the goal will be to minimize friction to the existing clinical workflow that could impact adoption and user satisfaction. For this reason, we propose enabling the LLM-generated DDx list as a passive feature during the pilot for all Hospital Medicine providers, with the option to minimize the sidebar for those who prefer (see Figure 1). We will measure success over the pilot period as follows:

  1. Measurements using data already being collected in APeX:
    1. Person and Process: As a measure of both adoption and usage, we will measure, via the Clarity audit logs, the rate at which users copy/paste elements from the DDx list into their progress notes (copy/paste instances / total number of progress notes in which a DDx list is visible). We will compare this rate between Versa Curate and OE.
    2. Outcomes: Although the potential impact on healthcare outcomes remains speculative, we will measure the length of hospital stay for patients with at least one copy/paste event relative to patients with none, and compare the two groups using the Wilcoxon rank-sum (Mann-Whitney U) test, with p<.05 considered significant (a minimal analysis sketch follows this list). Secondarily, in anticipation of important potential future uses beyond the pilot (e.g., future incorporation of LLM-generated DDx workup recommendations), we will silently run prompts in the background for LLM-based DDx workup recommendations and compare them for accuracy to actual actions taken by clinicians in APeX.
  2. Measurements not necessarily available in APeX, but ideal to have:
    1. Person: As part of the suggested build (See Figure 1), we will measure the rate of Like/Dislike clicks made in response to the DDx list for each of Versa Curate and OE. We will also capture the rates of “Report a concern” for each model.
    2. Process and Outcomes: As above
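
For the length-of-stay comparison described in measurement 1.2 above, a minimal analysis sketch might look like the following; the LOS arrays are hypothetical placeholders for values that would be extracted from APeX/Clarity.

```python
# Illustrative sketch: comparing length of stay between encounters with and
# without a DDx copy/paste event. The LOS values below are placeholders.
from scipy import stats

los_with_paste = [3.1, 4.0, 2.5, 6.2, 5.0]      # hypothetical LOS (days)
los_without_paste = [4.5, 3.8, 7.1, 5.9, 4.2]   # hypothetical LOS (days)

# Two independent groups -> Wilcoxon rank-sum / Mann-Whitney U test
statistic, p_value = stats.mannwhitneyu(
    los_with_paste, los_without_paste, alternative="two-sided"
)
print(f"Mann-Whitney U = {statistic:.1f}, p = {p_value:.3f}")
```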

Figure 1. Mockup of LLM-generated DDx list in APeX.

Section 7. Describe your qualifications and commitment:

  • Dr. Benjamin Rosner, MD, PhD is a hospitalist, a clinical informaticist, an AI researcher within DoC-IT, and the Faculty Lead for AI in Medical Education at the School of Medicine.
  • Dr. Ralph Gonzales, MD, MSPH is the Associate Dean for Clinical Innovation and Chief Innovation Officer at UCSF Health.
  • Dr. Brian Gin, MD, PhD is a pediatric hospitalist, visiting scholar at University of Illinois Chicago, and chief architect/developer of Versa Curate.
  • Ki Lai is VP, Chief Data & Analytics Officer of UCSF Health.
  • Dr. Christy Boscardin, PhD, is the Director of Artificial Intelligence and Student Assessment at the School of Medicine and the champion behind Versa Curate.
  • Dr. Sumant Ranji, MD, is a hospitalist and the Director of the UCSF Coordinating Center for Diagnostic Excellence (CoDEx).
  • Dr. Travis Zack, MD, PhD is an Assistant Professor of Medicine in Hematology-Oncology, and is a Senior Medical Adviser to Open Evidence.

 

Citations 

1.    Singh H, Meyer AND, Thomas EJ. The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Qual Saf. 2014;23(9):727-731. doi:10.1136/bmjqs-2013-002627

2.    Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ. 2015;351:h3480. doi:10.1136/bmj.h3480

3.    Schmieding ML, Kopka M, Schmidt K, Schulz-Niethammer S, Balzer F, Feufel MA. Triage Accuracy of Symptom Checker Apps: 5-Year Follow-up Evaluation. J Med Internet Res. 2022;24(5):e31810. doi:10.2196/31810

4.    Chishti S, Jaggi KR, Saini A, Agarwal G, Ranjan A. Artificial Intelligence-Based Differential Diagnosis: Development and Validation of a Probabilistic Model to Address Lack of Large-Scale Clinical Datasets. J Med Internet Res. 2020;22(4):e17550. doi:10.2196/17550

5.    Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform. 2023;11:e48808. doi:10.2196/48808

6.    Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78-80. doi:10.1001/jama.2023.8288

7.    Balas M, Ing EB. Conversational AI Models for ophthalmic diagnosis: Comparison of ChatGPT and the Isabel Pro Differential Diagnosis Generator. JFO Open Ophthalmology. 2023;1:100005. doi:10.1016/j.jfop.2023.100005

8.    Mizuta K, Hirosawa T, Harada Y, Shimizu T. Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician? Diagnosis (Berl). March 12, 2024. doi:10.1515/dx-2024-0027

9.    Eriksen AV, Möller S, Ryg J. Use of GPT-4 to Diagnose Complex Clinical Cases. NEJM AI. 2023;1(1). doi:10.1056/AIp2300031

 

Supporting Documents: 

Precision Summarization: Knowledge-Grounded AI for High-Impact Specialty Care

Proposal Status: 

The UCSF Health Problem

We aim to automate specialty-specific chart review, a high-burden, error-prone process that costs clinicians 15–45 minutes per patient and leads to 6–12 hours of weekly administrative overhead, contributing to pajama time. This burden impairs clinic throughput, contributes to burnout, and compromises care quality, especially in specialties like Gastroenterology where fragmented, longitudinal data, including scanned outside records, are the norm.

Despite advances in interoperability, no existing solutions address the upstream need to synthesize relevant clinical history for decision-making. Our innovation lies in generating real-time, specialty-aware clinical state summaries tailored to how specialists reason about patient care in several GI subspecialties.

Our end users are specialty physicians and advanced practice providers managing chronic, complex conditions who need accurate, longitudinal insights to deliver collaborative, high-quality care.

How Might AI Help?

Our AI system analyzes structured and unstructured EHR data—including scanned documents, provider notes, imaging and procedure reports, pathology results, and medication histories—to generate a clinically accurate, specialty-specific “current state” summary spanning all prior encounters, with in-line references to the original source text. This enables providers to begin each visit fully informed, without time-intensive manual review. While transformer models have shown promise in summarization, they often produce hallucinations or omit critical context. Our architecture will involve technical collaboration with Acucare AI, a company that addresses these limitations using retrieval-augmented generation, domain-specific knowledge graphs, and specialty-tuned pipelines to ensure high clinical reliability of the outputs.

By automating a task that currently takes clinicians hours, our AI compresses chart review to minutes—improving care efficiency, quality, and clinician well-being.

How Would an End-User Find and Use It?

Our AI tool will first be deployed as a standalone web application (link provided to authorized users) within UCSF’s secure cloud infrastructure, allowing rapid iteration and clinician feedback. It will generate a structured “current state” summary from EHR data, optimized for specialist workflows and presented in a clean, reviewable interface. A small subset of physicians will have the opportunity to validate, edit, and improve the AI-generated content during this testing phase. Once validated, the tool will be fully integrated into the EHR and surface within the existing pre-charting section—requiring no significant change to current clinical workflows. The AI summary will appear where clinicians already look for relevant patient history, pre-populated and linked to source data for easy verification.

The tool is most valuable just before the visit, replacing hours of manual chart review with an AI-generated, physician-ready summary that can be accepted as-is or modified. Because the AI output fits seamlessly into current documentation practices, end-users are not required to learn new systems or processes. They simply review the summary, just as they review past medical records today, only faster, more comprehensively, and with greater confidence in data accuracy.

This approach enables >30% reduction in pajama time and cognitive burden while preserving existing clinical workflows and autonomy.

Example of AI output


What Are the Risks of AI Errors?

The primary risks of most AI solutions include hallucinations (fabricated or unsupported information), omissions (missing key clinical details), and misclassification of clinical relevance, each of which can compromise clinical decision-making if not addressed and can diminish trust in AI outputs.

To mitigate these risks, we are collaborating with Acucare AI, which is developing a hybrid architecture combining large language models (LLMs) with a domain-specific clinical knowledge graph. This approach improves clinical specialty precision and recall, enabling AI outputs to be validated against structured medical knowledge in a measurable and reproducible way.

In addition, we will conduct rigorous physician-led evaluations to qualitatively assess the accuracy, completeness, and clinical usability of the summaries. These outputs will be benchmarked against off-the-shelf LLMs to quantify improvements in reliability and clinical relevance from our novel hybrid approach.

This rigorous testing and validation framework, including quantitative validation using knowledge-grounded AI and qualitative feedback from domain experts, ensures high-fidelity outputs suitable for clinical deployment. With this approach, errors are minimized, identified early, and iterated upon in a controlled environment prior to scale-up.

How Will We Measure Success?

We will evaluate the success of this pilot using three core metrics:
  1. Precision and Recall of Clinical Concepts: We will quantitatively compare the performance of our hybrid Knowledge Graph + LLM pipeline against baseline LLM models alone, assessing the accuracy and completeness of extracted clinical concepts across patient records (a minimal computation sketch follows this list).
  2. Qualitative Clinical Evaluation: Specialists will assess the completeness, accuracy, and clinical utility of AI-generated summaries for their own patients. This expert review will be captured for each summary through a live evaluation feature (such as a thumbs-up/down button) and ensures alignment with real-world expectations and safety in specialty care.
  3. Time Saved Per Encounter: We will measure reductions in administrative and “pajama time” through pre- and post-intervention analysis. This can be done in an automated fashion by using APeX metadata (amount of time spent in a patient’s chart prior to their visit).
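
To clarify how metric (1) could be computed, the following is a minimal sketch that scores extracted clinical concepts against a physician-curated gold standard for a single record; the concept strings are illustrative, and exact-match comparison after normalization is a simplifying assumption.

```python
# Illustrative sketch: precision and recall of extracted clinical concepts
# against a physician-adjudicated gold standard (exact match after
# lower-casing; a real evaluation would likely map concepts to a terminology).
def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    extracted = {c.lower().strip() for c in extracted}
    gold = {c.lower().strip() for c in gold}
    true_positives = extracted & gold
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example
extracted = {"Crohn's disease", "prior ileocecal resection", "anemia"}
gold = {"Crohn's disease", "prior ileocecal resection", "perianal fistula"}
print(precision_recall(extracted, gold))  # (0.666..., 0.666...)
```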

If the pilot fails to demonstrate adequate performance on metrics (1) or (2), we will either explore alternative approaches to improve the technology or halt the program, prioritizing safety, reliability, and clinical value.

Describe Your Qualifications and Commitment

I am a GI physician, IBD specialist, and data science researcher at UCSF. As a physician I can personally attest to the substantial burden of chart review in my clinical practice. This is particularly a challenge for the IBD patients I see. UCSF is a tertiary care center of excellence, and we receive many outside referrals from the greater Northern California area for IBD diagnosis and management. These patient records are frequently fragmented and often in formats that are especially difficult to review (scanned clinical documents, outside faxes, Care Everywhere data). A tool such as the one we propose to develop with Acucare AI, with the support of the UCSF AI and AER teams, will be a huge benefit to providers and patients at UCSF.

This work also aligns closely with my research lab’s work on data science methods for EHR data. We have developed and evaluated different computational methods for abstracting and summarizing clinical data, particularly clinical notes. Over the past year I have been working closely with Acucare AI to extend these approaches using advanced AI engineering methods to retrieve and summarize clinical data for clinicians. Acucare AI is led by Ganesh Krishnan and Chandan Kaur, who together bring technical depth in AI/ML, specialty-specific knowledge modeling, healthcare products and 35+ years of experience in product innovation.

I am confident that over a 1-year pilot we will be able to rigorously test our approach and make a go/no-go decision on whether to fully deploy it and scale it across diseases. If selected I will commit at least 15% effort for 1 year towards this project to ensure its success.

 

Supporting Documents: 

AI Automated Audiogram Interpretation and Cochlear Implant Referral

Primary Author: Yew Song Cheng
Proposal Status: 

Section 1: The UCSF Health Problem

Cochlear implants (CIs) restore speech comprehension and auditory perception for individuals with moderate to profound sensorineural hearing loss, with studies consistently demonstrating significantly improved quality of life [1]. Despite the potential benefits of CIs, there is a huge disparity in identifying and referring patients for cochlear implant consultations: in 2022, it was estimated that fewer than 10% of individuals with qualifying hearing loss utilized cochlear implants [2]. This is due to a poor understanding of CI candidacy criteria and a lack of knowledge about CIs among primary care physicians, otolaryngologists, and audiologists, leading to under-referral of eligible patients and geographic disparities in access [3, 4].

Previous efforts have relied primarily on clinician judgment without systematic integration of audiometric data or objective measures into decision-making workflows, limiting effective referrals. Only within the last five years have the first applications of AI and machine learning models for evaluating cochlear implantation candidacy been proposed outside of UCSF. However, these models have yet to be expanded into tools usable by both patients and medical professionals, such as electronic health record (EHR) alert systems [5,6,7]. For example, one current issue with audiometric data storage in the APeX EHR is its inability to automatically generate visually useful digital audiograms for providers reading patient notes.

The primary end-users of the proposed AI solution would include primary care providers, audiologists, otolaryngologists, and other clinicians involved in the early stages of hearing loss assessment, as well as any patients undergoing audiology testing.

Section 2: How Might AI Help?

Artificial intelligence offers a promising solution by automating the interpretation of audiometric data to enhance the accuracy and consistency of CI referrals. This project will utilize audiogram results and Consonant-Nucleus-Consonant (CNC) test scores from the de-identified UCSF Commons Database (2011-2024) to train machine learning (ML) models. In addition, factors such as insurance coverage, history of prior hearing aid use, and specialist referral history, along with demographic data such as age, sex, and preferred language, will be used as inputs.

The AI tool will be built with several modalities in mind. First, it will be designed to navigate the complex storage and organization of audiometric data in APeX, conveying the most important test values and explaining their significance even to primary care providers who may not have experience interpreting audiometric results. The highlight of this feature will be the automated conversion of audiometric data into an easy-to-read chart.

In addition to this visualization modality, the AI model will interpret the audiometric data to determine whether a patient should be considered for CI candidacy. This may take the form of an alert in the EHR system for providers at a follow-up appointment, or an automated educational message discussing what a cochlear implant is and how one may benefit a patient flagged as a potential CI candidate. As a whole, the AI model will generate actionable, interpretable recommendations, thereby addressing the current gap in clinician knowledge and enhancing equitable patient referrals.
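
As a rough sketch of the modeling step (assuming audiometric thresholds, CNC scores, and demographic fields have been assembled into a tabular dataset; the file name, feature names, and label column below are hypothetical), a gradient-boosted classifier could be trained and evaluated as follows.

```python
# Illustrative sketch: training a tabular classifier for CI-candidacy
# screening. The CSV file, feature names, and label column are hypothetical
# placeholders for data derived from the de-identified UCSF Commons Database.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("ci_candidacy_training_data.csv")  # hypothetical extract
features = ["pta_4freq_db", "cnc_word_score", "age", "prior_hearing_aid_use"]
X, y = df[features], df["ci_candidate"]  # label derived from historical referrals

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = HistGradientBoostingClassifier().fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```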

Section 3: How Would an End-User Find and Use It?

1)    The AI tool will integrate directly within existing Electronic Health Record (EHR) systems (e.g., APeX). When audiometric test results are inputted, the AI model will automatically analyze these data points and generate a simple dashboard referencing updated referral criteria and predictive visualizations of patient audiograms, highlighting thresholds relevant to CI candidacy.

2)    Clinicians will receive concise recommendations regarding CI candidacy and suggested actions, such as initiating a formal CI evaluation or providing patient education materials. The integration will occur at the point of audiometric data entry, streamlining workflow without additional steps required by clinicians.

3)    Patients will receive short, informational notifications delivered to their Epic inbox when an audiology appointment yields results that do not meet expected hearing thresholds, with recommendations to follow up to see whether they meet CI criteria. These messages would also provide information on CIs and address misconceptions or frequently asked questions about CIs.

What the AI tool might look like:

What Are the Risks of AI Errors?

The implementation of an AI-driven audiometric assessment tool inherently presents several potential risks, including false positives, false negatives, and potential AI hallucinations. False positives, resulting in erroneous recommendations for CI referral, may lead to unwarranted patient anxiety, unnecessary clinical evaluations, increased healthcare costs, and decreased patient trust. Conversely, false negatives represent a critical failure to identify eligible candidates, thus perpetuating the current gap in CI referrals, delaying intervention, and negatively impacting patient outcomes.

Importantly, in this proposed application, AI is harnessed to organize data from audiometric testing and address knowledge gaps by applying well-defined CI criteria that will provide parameters that will limit the risk of hallucinations. Patients who are referred for CI consultations will be evaluated by human experts, providing accurate feedback to identify errors made by the AI model.

To systematically measure and mitigate these risks of AI hallucinations, the AI model will undergo rigorous retrospective and prospective validation against expert clinician recommendations. Performance metrics, including sensitivity, specificity, positive predictive value, and negative predictive value, will be monitored continuously. Regular audits and data-quality assessments will further identify discrepancies and anomalies indicative of AI hallucinations or other errors. Structured clinician feedback loops and ongoing model retraining and recalibration will ensure continuous performance improvements, minimizing errors and optimizing patient safety and clinical effectiveness.
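
As a small illustration of how these monitoring metrics relate to one another, the counts below are hypothetical placeholders for a comparison of AI referral flags against expert adjudication.

```python
# Illustrative sketch: computing sensitivity, specificity, PPV, and NPV from
# hypothetical confusion-matrix counts (AI referral flag vs. expert judgment).
tp, fp, fn, tn = 42, 8, 5, 145  # hypothetical counts

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
print(f"Sens {sensitivity:.2f}, Spec {specificity:.2f}, PPV {ppv:.2f}, NPV {npv:.2f}")
```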

Section 4: How Will We Measure Success?

To determine whether the AI-driven audiometric tool achieves its intended impact, success will be evaluated using a comprehensive measurement framework, addressing usage, clinical behavior change, patient outcomes, and equity considerations.

Measurements using data already collected in APeX:

  • Frequency and percentage of clinicians interacting with the AI-driven recommendations after audiometric data entry.
  • Rate of CI referrals generated pre-/post-AI implementation.
  • Compliance rates with AI-generated recommendations among clinicians.
  • Analysis of referral accuracy compared to historical clinician referral patterns.
  • Time from initial audiometric evaluation to formal CI assessment referral.
  • Demographic breakdown of referrals pre-/post-implementation to assess equitable changes in referral patterns across age, race/ethnicity, geographical location, and insurance type.

Additional ideal measurements to evaluate success:

  • Clinician satisfaction and perception of AI tool usability, accuracy, and clinical value (via surveys and qualitative feedback).
  • Patient satisfaction with the referral process and understanding of CI candidacy.
  • Assessment of clinician knowledge about CI eligibility criteria pre-/post-AI implementation.
  • Disparities in referral rates and clinical outcomes among different demographic groups (age, ethnicity, geography, and socioeconomic status).

Evidence required to sustain UCSF Health leadership support includes consistent improvement in CI referral accuracy, increased clinician adoption rates, improved patient outcomes, and enhanced equitable access to cochlear implantation services. Conversely, abandonment of the tool would be considered if substantial AI-driven inaccuracies persist, referral disparities worsen, clinician and patient dissatisfaction is consistently high, or negligible improvements in clinical outcomes are observed.

Section 5: Describe Your Qualifications and Commitment

Y. Song Cheng, BM BCh, is a fellowship-trained neuro-otologist and CI surgeon at the UCSF Cochlear Implant Center. His research interests include innovative CI technology, CI outcomes, and CI candidacy in the elderly. He has been actively publishing on cochlear implant outcomes and within the field of hearing science for the past decade.

Nicole T. Jiam, MD, is the Director of the UCSF Otolaryngology Innovation Center and a neuro-otologist. She has extensively published on audiology disparities, AI-driven referral optimization, holds digital health patents, and currently serves on health tech advisory boards and guides AI tool development for cochlear implant candidate screening at UCSF. Dr. Jiam will actively guide the development of AI tools for cochlear implant candidate screening and participate in regular progress reviews with UCSF’s AER team.

Connie Chang-Chien, BS, is a UCSF medical student with a computational medicine research background from Johns Hopkins University, the Mayo Clinic, and Fulbright research in Japan. She has helped with preliminary analysis of audiometric data and the design of the EHR mockup, and going forward will develop the models for audiogram interpretation and predictive CI referral.


Works Cited

1. McRackan TR, Bauschard M, Hatch JL, Franko-Tobin E, Droghini HR, Nguyen SA, Dubno JR. Meta-analysis of quality-of-life improvement after cochlear implantation and associations with speech recognition abilities. Laryngoscope. 2018 Apr;128(4):982-990. doi: 10.1002/lary.26738. Epub 2017 Jul 21. PMID: 28731538; PMCID: PMC5776066.

2. Marinelli JP, Sydlowski SA, Carlson ML. Cochlear Implant Awareness in the United States: A National Survey of 15,138 Adults. Semin Hear. 2022 Dec 1;43(4):317-323. doi: 10.1055/s-0042-1758376. PMID: 36466559; PMCID: PMC9715307.

3. Naz T, Butt GA, Shahid R, Jabbar U, Mirza HM, Kanwal S. Awareness of Health Professionals about Candidacy of Cochlear Implant. Journal of Health and Rehabilitation Research. 2024;4(1):1417–1424. https://doi.org/10.61919/jhrr.v4i1.678

4. Nassiri AM, Marinelli JP, Sorkin DL, Carlson ML. Barriers to Adult Cochlear Implant Care in the United States: An Analysis of Health Care Delivery. Semin Hear. 2021 Dec 9;42(4):311-320. doi: 10.1055/s-0041-1739281. PMID: 34912159; PMCID: PMC8660164.

5. Carlson ML, Carducci V, Deep NL, DeJong MD, Poling GL, Brufau SR. AI model for predicting adult cochlear implant candidacy using routine behavioral audiometry. Am J Otolaryngol. 2024 Jul-Aug;45(4):104337. doi: 10.1016/j.amjoto.2024.104337. Epub 2024 Apr 23. PMID: 38677145.

6. Patro A, Perkins EL, Ortega CA, Lindquist NR, Dawant BM, Gifford R, Haynes DS, Chowdhury N. Machine Learning Approach for Screening Cochlear Implant Candidates: Comparing With the 60/60 Guideline. Otol Neurotol. 2023 Aug 1;44(7):e486-e491. doi: 10.1097/MAO.0000000000003927. Epub 2023 Jun 29. PMID: 37400135; PMCID: PMC10524241.

7. Shafieibavani E, Goudey B, Kiral I, Zhong P, Jimeno-Yepes A, Swan A, Gambhir M, Buechner A, Kludt E, Eikelboom RH, Sucher C, Gifford RH, Rottier R, Plant K, Anjomshoa H. Predictive models for cochlear implant outcomes: Performance, generalizability, and the impact of cohort size. Trends Hear. 2021 Jan-Dec;25:23312165211066174. doi: 10.1177/23312165211066174. PMID: 34903103; PMCID: PMC8764462.

 

Supporting Documents: 

Automated Knee Osteoarthritis Grading Decision Support Tool

Primary Author: Yuntong Ma
Proposal Status: 

The UCSF Health problem

Plain knee radiographs are a cornerstone in evaluating osteoarthritis (OA), and at UCSF, nearly all such studies include Kellgren-Lawrence (KL) grading to assess severity. While essential, this grading task is repetitive, time-consuming, and inherently subjective—especially in borderline cases1,2. In fact, studies show only moderate inter-reader reliability in KL grading, creating inconsistencies in diagnosis and downstream care decisions3,4.

KL grading adds to radiologist workload5, reducing time available for complex cases and clinical consults. Despite its clinical importance, few innovations have addressed this burden at scale. Although machine learning methods for KL grading have shown promise in research settings, these tools have yet to be integrated meaningfully into clinical workflows.

Our proposed solution targets a high-volume, repetitive task to reduce radiologist burden, improve grading consistency, and support radiology education. Our goal is not to replace clinical judgment, but to reinforce it with consistent, reproducible assessments of OA severity.

How might AI help?

We propose integrating an automated AI model for KL grading directly into the radiology workflow at UCSF. Our model has been trained on thousands of local clinical knee radiographs and has demonstrated strong agreement with expert radiologist grading. The tool uses a deep learning pipeline to automatically detect knee joints and assign KL grades (0 to 4, or total knee replacement). This structured information would be embedded directly into the radiology report and supported with visual overlays to guide interpretation. We hypothesize that this system will improve inter-reader reliability, assist radiologists in ambiguous cases, and reduce the clinical workload associated with routine KL scoring.

Automating KL grading using deep learning has the potential to significantly enhance clinical practice by improving grading consistency and reducing the workload of radiologists. Despite promising research, automatic KL grading has not yet been integrated into clinical workflows. Our proposed study would demonstrate the feasibility and effectiveness of such integration, showing that an automated system can streamline workflow, reduce reporting burden, and provide decision support for ambiguous cases.

Key beneficiaries of this tool are:

- Radiologists: With KL grading offloaded to the algorithm, radiologists can devote more time to more complex interpretations and high-value consults. Structured outputs integrated into PowerScribe reduce cognitive load and speed reporting.

- Patients: Consistent and reproducible KL scores reduce the risk of diagnostic variability. More available radiologist time means better patient communication and faster turnaround.

- Trainees: The system’s visual outputs and attention maps can serve as an educational aid for residents and fellows learning musculoskeletal imaging, helping them learn to identify KL grades more reliably and confidently.

By deploying this AI system at the point of care, we can support more efficient workflows, better diagnostic consistency, and enhanced trainee learning.

As a proof-of-concept for automatic KL grading for clinical knee OA scoring at UCSF, we have developed a deep learning pipeline which effectively detects the knee joint on clinical radiographs and automatically classifies OA severity. In this project we trained two models: an object detection model to first perform cropping around the knee joint, and a classification model to identify KL grade (0-4) or presence of total knee replacement (TKR).

For object detection, 814 unilateral AP knee radiographs from the Osteoarthritis Initiative (OAI)9 were used to train cropping around the knee joint. A You Only Look Once (YOLO, version 8)10 object detection model was trained on the OAI radiographs with an 80/10/10 split for training, validation, and test sets. Mean intersection-over-union values between predicted and ground-truth bounding boxes were 0.8947, 0.8845, and 0.8635 for the test, training, and validation sets, respectively. Applying object detection to UCSF clinical knee radiographs, at least one knee joint was detected and cropped in 94.73% (n = 10,842) of radiographs.
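
For context on the detection step, the following is a minimal sketch of how a YOLOv8 model might be trained with the ultralytics package and how bounding-box intersection-over-union can be computed; the dataset configuration file name is a placeholder, not the actual training configuration.

```python
# Illustrative sketch: knee-joint detection training and IoU evaluation.
# The dataset YAML path is a hypothetical placeholder for an OAI-derived set.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained YOLOv8 weights
model.train(data="oai_knee_boxes.yaml", epochs=100, imgsz=640)

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(box_iou((10, 10, 110, 110), (20, 20, 120, 120)))  # ~0.68
```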

For classification, 9,166 anonymized clinical knee radiographs (4,978 bilateral and 4,188 unilateral) were acquired from UCSF PACS AIR11 for training, validation, and prediction. KL labels were extracted from the corresponding UCSF radiology reports using regular expressions and used to train a pretrained EfficientNet-B712 classification model on an 80/10/10 training/validation/test split of the cropped, KL-labeled UCSF knee radiographs. Weighted Cohen's Kappa showed substantial agreement (0.74 for the validation set and 0.76 for the test set).
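
As an illustration of the label-extraction step, a simplified regular expression for pulling KL grades from report text might look like the following; actual UCSF report phrasing varies, so this pattern is an assumption.

```python
# Illustrative sketch: extracting Kellgren-Lawrence grades from radiology
# report text with a regular expression. Real reports use varied phrasing,
# so the pattern below is a simplified assumption.
import re

KL_PATTERN = re.compile(
    r"Kellgren[-\s]?Lawrence\s+(?:grade\s*)?([0-4])", re.IGNORECASE
)

def extract_kl_grade(report_text: str) -> int | None:
    match = KL_PATTERN.search(report_text)
    return int(match.group(1)) if match else None

print(extract_kl_grade("Findings consistent with Kellgren-Lawrence grade 3 osteoarthritis."))  # 3
```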

How would an end-user find and use it?

The AI system will be integrated with the clinical radiology tools Visage Picture Archiving and Communication System (PACS) and Nuance PowerScribe reporting software. Radiological images will be routed to the KL grading software based on a set of filters, such as modality and body part. The models will detect left and right knee joints and generate numerical KL scores of osteoarthritis severity: 0 – No OA, 1 – Doubtful OA, 2 – Mild OA, 3 – Moderate OA, 4 – Severe OA, 5 – Hardware, such as artificial joints from total knee replacement.

When a knee radiograph is opened for interpretation, suggested grades, along with brief descriptors of severity (e.g., "mild OA"), will be pre-populated into the PowerScribe radiology report template. Radiologists can edit or accept the AI-suggested grades. Additionally, an annotated image with overlaid KL scores and saliency maps (indicating the most relevant areas of the image used for classification) will be available for review in the Visage PACS viewer software. Trainees can use these visualizations to better understand the features driving each grade. This process occurs passively and seamlessly, saving time and offering real-time decision support. See example mock-up of the end-user interface below:

What are the risks of AI errors?

There are two primary categories of errors:

Failure to detect the knee joint: This would result in the system being unable to suggest a KL grade. However, our preliminary data show that the model successfully detects the knee joint in nearly 95% of cases, making this scenario unlikely. When it does occur, the fallback is manual grading by the radiologist, as is currently done.

Misclassification of osteoarthritis severity: This includes both false negatives (e.g., missed severe OA), which may lead to underestimation of disease severity and undertreatment, and false positives (e.g., overgrading of mild cases), which may lead to unnecessary further evaluation or intervention.

The rate of these failures will be measured by comparing the record of model output to the final radiology reports. Our preliminary data showed good agreement between generated KL grades and ground-truth values assigned by radiologists. To mitigate these risks:

  • AI-generated grades will never bypass radiologist review. Only radiologist-approved grades will be included in the final report.
  • We will conduct a quality assurance study comparing AI grades to independent radiologist adjudication.
  • Continuous monitoring of agreement between AI predictions and final report grades will be conducted, and discrepancies will be reviewed to guide system refinement.

How will we measure success?

Initial evaluation: In a randomized adjudication study, two radiologists will independently compare prior clinical KL grades and AI-generated grades. We will measure inter-rater and model agreement using Cohen’s Kappa, with success defined as the model matching or exceeding inter-radiologist agreement.
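
For example, the agreement analysis could be computed with scikit-learn's weighted Cohen's kappa; the grade lists below are hypothetical, and the choice of quadratic weighting is an assumption to be finalized during study design.

```python
# Illustrative sketch: weighted Cohen's kappa between AI-suggested KL grades
# and radiologist-adjudicated grades (hypothetical example values).
from sklearn.metrics import cohen_kappa_score

ai_grades = [0, 2, 2, 3, 4, 1, 3, 2]
radiologist_grades = [0, 2, 3, 3, 4, 1, 2, 2]

kappa = cohen_kappa_score(ai_grades, radiologist_grades, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```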

Clinical impact: We will evaluate how frequently AI-generated grades are accepted without edits, and whether AI usage improves intra- and inter-reader consistency across reports. We will assess radiologist-reported satisfaction with the tool and trainee confidence and accuracy in KL grading via structured surveys and explore whether AI integration leads to measurable efficiency gains, including reduced interpretation time per case and an increase in the number of radiographs read per day.

Continuous Evaluation: We will implement a dashboard for continuous monitoring of the model performance by recording the generated grades and comparing them with the final grades entered into the reports. Any significant decrease in the rate of agreement will be followed up with model fine-tuning with additional data.

Describe your qualifications and commitment

This project is led by Dr. Yuntong (Lorin) Ma, MD, Assistant Professor of Radiology at UCSF, and Eugene Ozhinsky, PhD, Associate Professor of Radiology at UCSF.

Dr. Ma’s work focuses on developing deep learning solutions for musculoskeletal imaging, including automated classification of osteoarthritis severity and diagnosis of inflammatory arthropathy. Her research emphasizes practical clinical integration, and she actively contributes to shaping standards for the responsible deployment of imaging AI. Her ongoing research focuses on advancing the clinical implementation of AI tools in ways that are practical and relevant to patient care. If selected, she will commit 10% effort for at least 1 year towards this project to ensure its success.

Dr. Ozhinsky’s research focuses on applying advanced image acquisition and machine learning techniques to improve diagnosis, predict disease progression, and guide therapy—particularly in musculoskeletal conditions. He has developed AI models for tasks such as hip fracture detection, automated OA grading, and MRI protocol optimization. His long-term goal is to translate these novel techniques into routine clinical care so that they result in meaningful improvements in patient outcomes.

Drs. Ma and Ozhinsky will oversee a multidisciplinary team of scientists and engineers in close collaboration with the UCSF Center for Intelligent Imaging. The team meets weekly to review progress, troubleshoot challenges, and plan next steps. Regular engagement with key clinical stakeholders will guide implementation, including the Radiology AI Governance Committee, UCSF Health AI, and AER leadership.

References

1.         Kohn MD, Sassoon AA, Fernando ND. Classifications in Brief: Kellgren-Lawrence Classification of Osteoarthritis. Clin Orthop Relat Res. 2016;474(8):1886-1893. doi:10.1007/s11999-016-4732-4

2.         Braun HJ, Gold GE. Diagnosis of osteoarthritis: imaging. Bone. 2012;51(2):278-288. doi:10.1016/j.bone.2011.11.019

3.         Wright RW, MARS Group. Osteoarthritis Classification Scales: Interobserver Reliability and Arthroscopic Correlation. J Bone Joint Surg Am. 2014;96(14):1145-1151. doi:10.2106/JBJS.M.00929

4.         Köse Ö, Acar B, Çay F, Yilmaz B, Güler F, Yüksel HY. Inter- and Intraobserver Reliabilities of Four Different Radiographic Grading Scales of Osteoarthritis of the Knee Joint. The Journal of Knee Surgery. 2017;31:247-253. doi:10.1055/s-0037-1602249

5.         Tiulpin A, Thevenot J, Rahtu E, Lehenkari P, Saarakkala S. Automatic Knee Osteoarthritis Diagnosis from Plain Radiographs: A Deep Learning-Based Approach. Sci Rep. 2018;8:1727. doi:10.1038/s41598-018-20132-7

6.         Lee LS, Chan PK, Wen C, et al. Artificial intelligence in diagnosis of knee osteoarthritis and prediction of arthroplasty outcomes: a review. Arthroplasty. 2022;4:16. doi:10.1186/s42836-022-00118-7

7.         Swiecicki A, Li N, O’Donnell J, et al. Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists. Computers in Biology and Medicine. 2021;133:104334. doi:10.1016/j.compbiomed.2021.104334

8.         Norman B, Pedoia V, Noworolski A, Link TM, Majumdar S. Applying Densely Connected Convolutional Neural Networks for Staging Osteoarthritis Severity from Plain Radiographs. J Digit Imaging. 2019;32(3):471-477. doi:10.1007/s10278-018-0098-3

9.         NIMH Data Archive - OAI. Accessed November 4, 2024. https://nda.nih.gov/oai

10.       Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. Published online May 9, 2016. doi:10.48550/arXiv.1506.02640

11.       AIR Overview. UCSF Radiology. May 29, 2018. Accessed March 20, 2025. https://radiology.ucsf.edu/research/core-services/PACS-air

12.       Tan M, Le QV. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Published online September 11, 2020. doi:10.48550/arXiv.1905.11946

Supporting Documents: 

SPICE-LD: Supporting Personalized, Inclusive, Culturally-appropriate Eating in Liver Diseases

Proposal Status: 

The UCSF Health Problem

Patients with chronic liver diseases, including those with cirrhosis and those awaiting transplantation, have unique and often stringent nutritional requirements.  Malnutrition and sarcopenia are prevalent among this patient population, directly impacting clinical outcomes, transplant candidacy, and overall quality of life.  While clinical nutrition guidelines and registered dietitian consults are invaluable, standard recommendations frequently overlook the patient’s ethnic and cultural food preferences.  This lack of cultural tailoring can result in suboptimal adherence, missed opportunities to optimize nutritional intake, and diminished patient satisfaction.  Therefore, there is an unmet need to provide treating clinicians with culturally-appropriate, evidence-based, and patient-specific nutritional recommendations that resonate with each patient’s dietary norms and preferences.

 

How Might AI Help?

We propose to leverage a Retrieval-Augmented Generation (RAG) approach powered by a Large Language Model (LLM) that references established nutrition and liver disease guidelines.  RAG is an approach in which LLMs are enhanced by integrating them with vetted external data sources, allowing them to generate more accurate and contextually relevant responses.  In addition to standard clinical and lab data that would be transferred via Fast Healthcare Interoperability Resources (FHIR) application programming interface (API) calls, we will also leverage patient-reported cultural and ethnic background information to tailor nutritional recommendations.  Specifically, the generative AI approach would:

• Offer culturally appropriate recommendations
  • By integrating a repository of traditional food items, cooking methods, and culturally specific dietary patterns, the LLM can suggest meal plans or dietary adjustments that align with both the patient’s clinical needs and cultural norms.
• Provide personalized nutritional guidance
  • Patient clinical factors, such as liver disease severity, comorbidities (e.g., renal impairment), and body anthropometric measurements, will be gathered via FHIR calls and fed into the model to generate customized nutrition strategies.
  • Patient cultural/dietary preferences, such as cultural identification and dietary limitations, would be gathered through semi-structured forms entered by the treating clinician.
• Conduct adaptive learning and continuous updates
  • The LLM can continuously learn from new guidelines, emerging research, and aggregated nutritional data.

This solution has the potential to improve adherence to nutritional recommendations and enhance the overall patient experience working with the hepatology and liver transplantation teams.
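To make the RAG mechanics concrete, the following Python sketch illustrates the retrieval-and-prompting step under stated assumptions: the guideline passages, the retrieval method (TF-IDF similarity, used purely for illustration), and the downstream LLM call are placeholders rather than the production design.

# Minimal sketch, assuming a small library of guideline passages and a separate LLM endpoint.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guideline_passages = [
    "Cirrhosis: target 1.2-1.5 g/kg/day protein; a late-evening snack limits overnight catabolism.",
    "Ascites: restrict dietary sodium to <2 g/day; avoid high-sodium preserved foods.",
    "Sarcopenia: distribute protein across meals; encourage resistance activity as tolerated.",
]

def retrieve_passages(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k guideline passages most similar to the query."""
    vectorizer = TfidfVectorizer().fit(passages + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(passages)).ravel()
    return [passages[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(clinical_summary: str, cultural_preferences: str, question: str) -> str:
    """Assemble retrieved guideline context plus patient-specific information into a prompt."""
    context = "\n".join(retrieve_passages(question, guideline_passages))
    return (
        "You are assisting a hepatology dietitian.\n"
        f"Guideline excerpts:\n{context}\n"
        f"Patient clinical summary: {clinical_summary}\n"
        f"Cultural/dietary preferences: {cultural_preferences}\n"
        f"Question: {question}\n"
        "Recommend culturally appropriate foods consistent with the excerpts."
    )

prompt = build_prompt(
    clinical_summary="Decompensated cirrhosis with ascites, MELD 18, mild sarcopenia.",
    cultural_preferences="South Asian vegetarian diet, low-sodium requirement.",
    question="What protein sources fit a low-sodium South Asian vegetarian meal plan?",
)
print(prompt)  # this prompt would then be sent to the institution-approved LLM endpoint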

         

How Would an End-User Find and Use It?

This RAG-LLM tool would be embedded as a specialized tab within ambulatory encounters in the APeX EHR.  The tab will provide the end-user, defined as a clinical provider or dietitian, with a display of relevant clinical data along with standardized and free-text selections for the patient’s cultural background and nutritional preferences.  There would also be a “Generate” button, which will combine the EHR information and user-entered preferences to produce personalized dietary recommendations.  This information could then be populated into the progress note and the patient instructions.

• FHIR-Based Data Retrieval and Processing
  • The system will automatically retrieve relevant patient data (e.g., labs, medications, severity of liver disease, comorbidities) via FHIR calls.
  • The system will also populate the patient’s demographics along with semi-structured input regarding the patient’s stated cultural background.
• Interaction with the RAG-LLM
  • There will be a semi-structured box with selections for common dietary restrictions (e.g., “low-sodium,” “renal,” “nut allergy,” “lactose intolerant”) and free-text to pose specific nutritional questions (e.g., “What are culturally-appropriate protein sources for this patient who follows a South Asian diet and requires a low-sodium meal plan?”).
• AI-Generated Guidance
  • The LLM synthesizes the patient’s clinical profile, cultural data, and evidence-based nutritional guidelines, returning a concise set of meal plans or tips that align with recognized best practices in liver disease nutrition.  This information could then be pushed into the progress note and patient instructions, as sketched below.
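As an illustration of the FHIR-based retrieval step described above, the sketch below pulls the most recent value of a few labs for a patient via standard FHIR Observation searches. The base URL and the specific LOINC codes shown are assumptions for demonstration, not UCSF's actual configuration.

# Illustrative sketch only: generic FHIR R4 Observation queries for a patient summary.
import requests

FHIR_BASE = "https://fhir.example.org/R4"  # placeholder endpoint, not a real UCSF URL
LOINC_CODES = {
    "1751-7": "albumin",
    "1975-2": "total bilirubin",
    "2160-0": "creatinine",
}

def latest_labs(patient_id: str, token: str) -> dict[str, float]:
    """Fetch the most recent value for each lab of interest."""
    headers = {"Authorization": f"Bearer {token}"}
    results: dict[str, float] = {}
    for code, name in LOINC_CODES.items():
        resp = requests.get(
            f"{FHIR_BASE}/Observation",
            params={"patient": patient_id, "code": f"http://loinc.org|{code}",
                    "_sort": "-date", "_count": 1},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        for entry in resp.json().get("entry", []):
            value = entry["resource"].get("valueQuantity", {}).get("value")
            if value is not None:
                results[name] = value
    return results

These values, together with the clinician-entered cultural and dietary preferences, would populate the patient summary passed to the RAG-LLM prompt described earlier.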
               

Example of AI Output

               

               

What Are the Risks of AI Errors?

Implementing an LLM for culturally-aware nutritional guidance introduces several potential risks: 1. Incomplete or Inaccurate Recommendations, 2. Overreliance on AI, 3. Bias or Cultural Misalignment, and 4. Privacy and Security Concerns (due to integrating demographic and cultural information into the RAG-LLM API pipelines).  To mitigate these risks, we will conduct rigorous validation, ensure continuous updates to the guideline library, and maintain a clear disclaimer that final decisions rest with licensed professionals.

               

How Will We Measure Success?

We will evaluate this solution using a mix of clinical, operational, and patient-centered metrics:

• Clinical Outcomes: Caloric intake reported by the patient, nutrition labs (prealbumin, albumin), and measurements of liver disease severity (MELD) before and after RAG-LLM deployment.
• Implementation Outcomes: Feedback through patient surveys on cultural relevance, dietary adherence, and overall satisfaction with their nutritional care plan.  Tool utilization and time saved in creating personalized meal plans.

               

Describe Your Qualifications and Commitment

• This project is led by Dr. Jin Ge, MD, MBA.  Dr. Ge is a transplant hepatologist and a data science and AI researcher within the Division of Gastroenterology and DoC-IT.  He serves as the Director of Clinical AI for the Division of Gastroenterology.  He has extensive experience in building and deploying specialized LLMs using retrieval augmented generation (RAG), such as LiVersa for liver diseases.  He has previously worked closely with both APeX-Enabled Research and the AI Tiger Team on various digital health projects.  If selected, he will commit at least 10% effort for 1 year towards this project to ensure its success.
• Jennifer C. Lai, MD, MBA is a Professor of Medicine, In Residence, in the Division of Gastroenterology.  She serves as the Director of UCSF Health Advancing Research in Clinical Hepatology (ARCH), the research arm of the Division of Gastroenterology.  In addition, she is a board-certified Physician Nutrition Specialist (PNS).  She will contribute her unique expertise and experience at the intersection of nutritional sciences and transplant hepatology.
• Kathy Pariani, RD is a Registered Dietitian Nutritionist and Certified Diabetes Care and Education Specialist.  Kathy provides nutrition assessments, education, and counseling to patients with chronic diseases, specifically those with liver disease, diabetes, obesity, and cardiovascular risk factors.  Kathy previously practiced as an acupuncturist and herbalist before earning her Master’s degree in Nutrition and Dietetics from Bastyr University.  Kathy is also a current Fellow in the Evidence Based Fellowship Program at UCSF, researching effective nutrition interventions in patients with Metabolic Associated Steatotic Liver Disease (MASLD).

Supporting Documents: 

LiVersa-CirrhosisRx: Integrating a Liver Disease Specific LLM within Clinical Decision Support System for Cirrhosis Care

Proposal Status: 

The UCSF Health Problem

Despite the availability of clinical practice guidelines from the American Association for the Study of Liver Diseases (AASLD) and the American Gastroenterological Association (AGA), adherence to recommended quality measures for patients with cirrhosis remains suboptimal.  This gap leads to a high burden of readmissions and inpatient mortality borne by patients with cirrhosis, as well as avoidable healthcare costs.  Cirrhosis is a dynamic condition that affects multiple organ systems outside of the liver, e.g. brain (encephalopathy), hematology (cytopenias), renal (hepatorenal syndrome), cardiology (cirrhosis-related cardiomyopathy), and infectious diseases (spontaneous bacterial peritonitis); this complexity is reflected in the electronic health record, which typically includes relevant clinical data across multiple parts of the patient record.  Additionally, busy clinicians may find it challenging to quickly reference and integrate multiple clinical guidelines.  There is an urgent need for a streamlined solution that ensures guideline-concordant care tailored to each patient’s specific clinical profile and decompensations.

               

How Might AI Help?

We have previously constructed a liver disease specific Large Language Model (LLM) called “LiVersa” based on integration of AASLD clinical practice guidelines via Retrieval Augmented Generation (RAG-LLM).  LiVersa is the first clinical assistant in the UCSF Versa platform and is currently available for select clinicians in hepatology and liver transplantation (PMID 38451962, PMID 38935858).  We have also constructed “CirrhosisRx,” a rule-based (non-AI) clinical decision support system for guideline-adherent inpatient cirrhosis care, built on the EngageRx platform using SMART-on-FHIR.  CirrhosisRx is currently deployed in a pragmatic randomized clinical trial (NCT05967273, PMID 38407255) at UCSF Medical Center.  In this proposal, we plan to link the two technologies, specifically integrating LiVersa into CirrhosisRx via Fast Healthcare Interoperability Resources (FHIR) application programming interface (API) calls within the APeX EPIC EHR system, to provide a patient-personalized, dynamic information retrieval system that delivers guideline-based clinical recommendations to end-users.
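To illustrate how the rule-based and generative components could coexist, the following sketch runs deterministic quality-measure reminders from structured data and routes open-ended questions to the RAG-LLM with patient context. The field names ("ascites", "active_meds") and the ask_liversa() function are illustrative assumptions, not the deployed CirrhosisRx logic.

# Hypothetical hybrid sketch: rule checks plus a placeholder RAG-LLM call.
def ask_liversa(prompt: str) -> str:
    """Stand-in for the LiVersa RAG-LLM endpoint; returns a canned string here."""
    return f"[LiVersa draft answer for: {prompt[:60]}...]"

def rule_based_checks(patient: dict) -> list[str]:
    """Return guideline-derived reminders computed directly from structured data."""
    reminders = []
    if patient.get("ascites") and not patient.get("diagnostic_paracentesis_done"):
        reminders.append("Ascites on admission without a documented diagnostic paracentesis.")
    if patient.get("variceal_bleed") and "octreotide" not in patient.get("active_meds", []):
        reminders.append("Variceal hemorrhage without vasoactive therapy ordered.")
    return reminders

def answer_free_text(patient_summary: str, question: str) -> str:
    """Ground the clinician's free-text question in the patient summary before querying the LLM."""
    prompt = (
        f"Patient summary: {patient_summary}\n"
        f"Question: {question}\n"
        "Answer using current AASLD guidance and cite the relevant guideline section."
    )
    return ask_liversa(prompt)

example_patient = {"ascites": True, "diagnostic_paracentesis_done": False, "active_meds": ["lactulose"]}
print(rule_based_checks(example_patient))
print(answer_free_text("62M with decompensated cirrhosis, new ascites, MELD-Na 24.",
                       "Does this patient need SBP prophylaxis?"))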

The integrated LiVersa-CirrhosisRx system would provide the following advantages:

• Hybrid Rule-Based and Generative Recommendations: Integration of the LiVersa RAG-LLM into CirrhosisRx will allow providers to query and access up-to-date AASLD clinical guidelines without navigating outside of the EHR.
• Personalized Medicine: The proposed application would automatically incorporate relevant patient-specific data, enabling real-time, personalized recommendations specific to the patient and clinical scenario (“Right Patient, Right Time”).
• Streamlined Workflow: Embedding the LLM in a user-friendly CDS interface eliminates the need to toggle between multiple resources, reducing cognitive load and time constraints on clinicians.

               

How Would an End-User Find and Use It?

The combined LiVersa-CirrhosisRx application will be deployed within the existing “CirrhosisRx” tab that is enabled for select EPIC user contexts under the ongoing pragmatic randomized clinical trial.  See screenshot of CirrhosisRx:

In the LiVersa-CirrhosisRx integration, we will use the FHIR data connected with LiVersa to generate a summary of the patient with displayed data.  A customized free-text box will be built into the CirrhosisRx application that allows the end user to ask open-ended questions regarding the management of the patient.  The embedded LiVersa RAG-LLM will return answers that reference the latest AASLD guidelines while reflecting the patient’s unique clinical situation.  Future expansions could include automated pending of orders and order sets consistent with LiVersa’s recommendations.

               

Example of AI Output

 

What Are the Risks of AI Errors?

Several potential risks arise when introducing an LLM-based solution into clinical workflows: 1. Hallucinations or Misinformation, 2. Clinician bias due to overreliance on AI, 3. Biases in recommendations, and 4. Clinical scenarios unaccounted for by the RAG-LLM model.  To mitigate these risks, we propose a robust pilot testing phase embedded within the existing CirrhosisRx trial.  We will record both clinical outcomes, defined as adherence to AASLD/AGA practice guidelines for inpatient cirrhosis care, and implementation outcomes through structured usability assessments and semi-structured interviews.  We have been developing LiVersa for a hepatology e-consultation use case and have developed a 12-question survey to evaluate its effectiveness versus human-written consultation recommendations in 54 previous e-consult cases.  The preliminary data from our analyses demonstrate that the e-consultation drafts produced by LiVersa were helpful 71% of the time with a 4% potential harm rate.  Given that all clinical actions resulting from the LiVersa-CirrhosisRx application would have to be confirmed/finalized by a human clinician, there is an integral “human-in-the-loop” mechanism for this proposal.

               

How Will We Measure Success?

We will track both clinical outcomes and implementation outcomes to determine the effectiveness of our LLM-enhanced CDS tool:

• Clinical Outcomes:
  • Guideline Adherence: Measure adherence to five AASLD/AGA guideline recommendations, comparing EHR-based metrics before and after the tool’s introduction.
  • Rates of Hospital Readmission: Evaluate changes in 90-day readmissions among patients with cirrhosis.
  • Mortality and Morbidity Trends: Evaluate changes in inpatient mortality rates for patients with cirrhosis.
• Implementation Outcomes:
  • Time Savings: Track time spent in chart review and decision-making tasks.
  • Frequency of CDS Usage: Evaluate how often providers use the LiVersa-CirrhosisRx interface relative to the number of encounters in which it could have been used.
  • Clinician Surveys: Gather feedback on trust in AI recommendations, ease of use, and perceived impact on practice.
• Equity and Bias Analysis:
  • Demographic Subgroup Performance: Examine whether the AI recommendations remain consistent across patient populations with varied demographic backgrounds.

                     

Describe Your Qualifications and Commitment

• This project is led by Dr. Jin Ge, MD, MBA.  Dr. Ge is a transplant hepatologist and a data science and AI researcher within the Division of Gastroenterology and DoC-IT.  He serves as the Director of Clinical AI for the Division of Gastroenterology.  He has experience in developing, testing, and deploying digital technologies for liver disease care and is the principal investigator for the CirrhosisRx pragmatic randomized controlled trial and the LiVersa liver-disease specific LLM.  He has worked closely with both APeX-Enabled Research and the AI Tiger Team.  If selected, he will commit at least 10% effort for 1 year towards this project to ensure its success.
• Dr. Valy Fontil is a former faculty member at UCSF and currently serves as the Director of Research at Family Health Centers at NYU Langone Health.  He is a primary care physician, health services researcher, and digital health entrepreneur who specializes in innovations for high-risk, low-income populations.  He is the innovator of the EngageRx platform, on which CirrhosisRx is based, and has expertise in building digital health interventions in the EHR.  He will serve as an advisor to this effort.

CLEAR-CHE: Covert Liver Encephalopathy Assessment using Recorded Clinical Health Encounters

Proposal Status: 

The UCSF Health Problem

Hepatic encephalopathy (HE) is a common complication in patients with cirrhosis and is associated with significantly increased morbidity, mortality, and healthcare utilization.  Early-stage or covert hepatic encephalopathy (CHE), which can be present in up to 60% of patients with cirrhosis, often goes unrecognized by both patients and clinicians because its symptoms are subtle.  Without timely identification and intervention, such as initiating or titrating lactulose or rifaximin, patients can rapidly progress to overt HE – a complication of cirrhosis that is associated with poor clinical outcomes.  Early intervention in the subclinical stage of the disease (CHE) could potentially prevent complications, improve quality of life, and reduce the burden on healthcare systems.  Current clinical methods for detecting CHE, however, are resource-intensive (due to the use of validated psychometric testing) and therefore not done in routine clinical practice.  This underscores the need for an efficient, potentially AI-driven approach to screen and detect CHE in routine clinical practice.

                     

How Might AI Help?

Our group is currently exploring novel methodologies for detection of CHE.  There is strong emerging evidence suggesting that a patient’s voice can serve as a biomarker for CHE (PMID 35861546, PMID 39264936).  By leveraging AI-driven analysis of speech characteristics, it may be possible to detect subtle changes, such as altered speed, pitch, or articulation patterns, that correlate with CHE.  At UCSF, the implementation of ambient AI scribes/recordings in clinical settings provides a potentially rich source of audio data.  Our proposed AI solution seeks to analyze these ambient recordings obtained through routine clinical care, applying algorithms (including large language models and specialized speech-processing models) to detect early signals of hepatic encephalopathy.  This approach would allow for scalable and efficient implementation, encourage early intervention by flagging potential CHE prior to overt symptoms, and reduce burdens on clinicians, as it would be implemented as part of routine clinical care.
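As a rough illustration of the speech-analysis step, the sketch below extracts a few coarse prosodic features (pitch statistics, a pause fraction, and an onset-rate proxy for articulation speed) from an encounter recording using the open-source librosa library. The feature set is an assumption for demonstration only and is not a validated CHE detector; a classifier would be trained separately on labeled recordings.

# Exploratory sketch: coarse prosodic features from a clinical audio file.
import numpy as np
import librosa

def prosodic_features(path: str) -> dict[str, float]:
    """Compute pitch statistics, pause fraction, and a speech-rate proxy from an audio file."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    # Fundamental frequency (pitch) track over voiced frames
    f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    f0_voiced = f0[~np.isnan(f0)]
    # Energy-based pause estimate: fraction of low-energy frames
    rms = librosa.feature.rms(y=y).ravel()
    pause_fraction = float(np.mean(rms < 0.5 * np.median(rms)))
    # Onset rate as a rough articulation-rate proxy (onsets per second)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration_s = len(y) / sr
    return {
        "pitch_mean_hz": float(np.mean(f0_voiced)) if f0_voiced.size else 0.0,
        "pitch_sd_hz": float(np.std(f0_voiced)) if f0_voiced.size else 0.0,
        "pause_fraction": pause_fraction,
        "onset_rate_per_s": float(len(onsets) / duration_s) if duration_s > 0 else 0.0,
    }

In practice, features like these (or learned speech embeddings) would be compared against psychometric ground truth in the pilot studies described below.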

                     

How Would an End-User Find and Use It?

The AI tool would ideally be integrated within existing tabs/dashboards for ambient AI recording and documentation systems (e.g. Abridge and Ambience).  The system would analyze the recording in the background, either in real time or after the encounter has ended.  If the recording has patterns (in speed, pitch, or articulation) that are suggestive of CHE, an alert would be generated and sent to the treating clinician (envisioned to be hepatologists and advanced practice providers in the liver diseases clinic).  A summary report with insights summarizing the speech abnormalities and their potential clinical significance would then be sent as an inbox result to the treating clinician.

                     

Example of AI Output

 

What Are the Risks of AI Errors?

AI-based detection of covert hepatic encephalopathy introduces several potential pitfalls:

• False Positives: Overestimating risk could lead to unnecessary patient anxiety.
• False Negatives: Missing early HE signs could delay treatment, leading to worse outcomes.
• Bias and Variations: Voice and speech patterns can vary by accent, language, and comorbid conditions, necessitating robust training sets.
• Data Privacy Concerns: Data use and protection policies for ambient recordings to be used in a research or quality improvement context are still in active development.

To mitigate these risks, we will conduct rigorous validation using retrospective and prospective data in limited pilot studies within the UCSF Hepatology and Liver Diseases outpatient clinics.  Continuous performance monitoring and iterative model refinement will ensure high sensitivity, specificity, and generalizability.  Given that this is an assistive AI tool, final clinical decision-making authority resides with the treating provider, providing a “human-in-the-loop” check for deployment.

                     

How Will We Measure Success?

We will assess the efficacy and impact through quantitative and qualitative metrics:

• Clinical Impact and Detection Accuracy
  • Sensitivity / Specificity: Proportion of true CHE cases correctly identified versus psychometric testing in pilot studies.
  • Positive Predictive Value (PPV) / Negative Predictive Value (NPV): Assessing how often the tool is correct when detecting or ruling out CHE.
  • Change in Overt HE Rates: Monitoring whether early identification reduces the incidence of overt HE among the population of patients with cirrhosis seen at UCSF.
• Workflow Efficiency and Adoption
  • Time to Diagnosis: Tracking time from first clinical visit to diagnosis and treatment pre- and post-implementation.
  • Provider Utilization: Assess EHR usage time related to summary statements/results generated by the AI tool.
• Equity and Bias Analysis
  • Demographic Subgroup Performance: Ensuring consistent tool accuracy across different demographic, language, and cultural groups.

                           

Describe Your Qualifications and Commitment

• This project is led by Dr. Jin Ge, MD, MBA.  Dr. Ge is a transplant hepatologist and a data science and AI researcher within the Division of Gastroenterology and DoC-IT.  He serves as the Director of Clinical AI for the Division of Gastroenterology.  He has experience in developing, testing, and deploying digital technologies for liver disease care.  He is currently running a pilot study to detect covert hepatic encephalopathy by using chatbots – this study is being extended to include ambient voice recordings in support of this proposal.  If selected, he will commit at least 10% effort for 1 year towards this project to ensure its success.
• Dr. Irene Y. Chen, SM, PhD is an Assistant Professor of Computational Precision Health at UC Berkeley and UCSF.  Her research is focused on safe clinical deployments of artificial intelligence and machine learning models in clinical settings.  She is a close collaborator of Dr. Ge’s and is particularly interested in the potential of utilizing voice biomarkers as a clinical diagnostic tool.

Optimizing New Patient Self-Scheduling Pathways with AI/ML

Proposal Status: 

Section 1: The UCSF Health Problem

At UCSF and other leading academic medical centers, the referral intake and triage process for new patients is strained, leading to long delays and high rates of incomplete referrals. A review of more than 100,000 referral scheduling attempts showed wide variability in wait times—from 8 to 73 days—with an average of 22 days [1]. Additionally, ~50% of referrals are never completed, largely due to operational bottlenecks and limited triage capacity [2,3].

These delays are not just administrative—they directly affect patient outcomes. In a consecutive cohort study of 648 patients with squamous cell carcinoma, the majority of patients developed significant signs of tumor progression within a wait time of 4 weeks [4]. For time-sensitive conditions like lung, kidney, and pancreatic cancer, each week of delayed treatment increases mortality risk by 1.2% to 3.2% [5]. In early-stage cases, delays in cancer care can raise 5-year and 10-year mortality rates by as much as 47.6% and 72.8%, respectively [5].

A key challenge lies in how referrals are routed and scheduled. At UCSF, the status quo workflow is labor-intensive. Referral documents arrive from outside providers as faxes or PDFs that must be manually reviewed. Currently, this triage process falls to practice coordinators, who work hard to manage a high volume of incoming cases while juggling many responsibilities. However, they typically do not have clinical training, and determining the right subspecialist and urgency level for complex cancer cases often requires input from nurse navigators or physicians. Even when marked urgent, referrals can take days to reach the right destination—such as in head and neck oncology, where the average triage time is over 4 days.

To improve access, UCSF has introduced a patient self-scheduling portal for some specialties. While this is a step in the right direction, the current system is limited in scope, with varying levels of oversight. It relies on a series of 3 generic questions and lacks integration with predictive models or clinical context. As a result, it does not capture the complexity of real-world triage, which can lead to patients being misrouted. Specialty clinics may also be scheduled suboptimally, with patients who may have benefited from additional work-up (e.g., APP visit) or additional medical record reconciliation. These challenges lead to financial losses, reduced market competitiveness, and prevent subspecialists from focusing on high-value, "top-of-license" care.

While past efforts—like hiring additional staff or sending reminders—have helped incrementally, they don’t address the core issue: patients need better tools to navigate the referral process, and staff need support to triage more efficiently.

Our solution is designed to bridge this gap by embedding intelligent triage and decision support into the patient self-scheduling experience—helping patients land in the right clinic faster while reducing the load on care teams and improving access to timely treatment. Importantly, this project also aligns with 1 of the 4 UCSF Ambulatory Services Health IT Portfolio Initiatives for FY2025.

Section 2: How might AI help? 

We want to enable safe, accurate, and, most notably, intelligent self-scheduling for new specialty patients (Figure 1). Our solution consists of an AI-powered triage algorithm developed in collaboration with IIAM Corporation that taps into their existing software platform’s advanced capabilities. Once deployed, our algorithm will assist with the patient self-scheduling workflow, thereby reducing dependence on manual triage and improving the timeliness of care.

Figure 1. UCSF Self-Scheduling Workflow and Proposed Intervention. The top row demonstrates UCSF’s current online self-scheduling portal. Patients are prompted to answer 3 nondescript questions about whether or not they have cancer, but many patients are unsure of their diagnosis. In the bottom row, we propose embedding IIAM’s algorithm into the online patient self-scheduling portal and using the associated referral documents to showcase provider names and appointment availability in accordance with the suspected etiology and urgency, maximizing clinical optimization (“top of license work”) and resource utilization.

Our algorithm accepts previous medical records (e.g., clinical notes, radiology/pathology reports) as input data, including external documents. To ensure a smooth workflow, our algorithm will be flexible enough to accept multiple file formats, including PDFs, images, and plain text files. The uploaded information is then reviewed by machine learning models to identify the patient’s current medical need and output the best matching subspecialty physician(s). For instance, a patient with a suspicious thyroid nodule will be matched with the appointment times of head & neck surgeons specializing in thyroid surgery. This approach will increase the rate of correct patient-physician matching and one-contact resolution, reduce time to treatment, increase surgical conversion rates in clinic, and maximize “top of the license work” among clinical providers.
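For illustration only, the following sketch shows the general shape of a document-to-subspecialty matching step using a simple bag-of-words classifier; the example texts, labels, and model choice are invented stand-ins, not IIAM's proprietary algorithm.

# Hypothetical referral-matching sketch: TF-IDF features plus logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Ultrasound shows 2 cm thyroid nodule, TI-RADS 4; biopsy recommended.",
    "Enlarging parotid mass with new facial weakness.",
    "Chronic rhinosinusitis refractory to medical therapy.",
]
labels = ["endocrine head & neck surgery", "head & neck oncology", "rhinology"]

classifier = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), classifier)
pipeline.fit(texts, labels)  # in practice: thousands of labeled historical referrals

new_referral = "FNA of right thyroid nodule shows papillary thyroid carcinoma."
probabilities = pipeline.predict_proba([new_referral])[0]
ranked = sorted(zip(classifier.classes_, probabilities), key=lambda p: p[1], reverse=True)
print(ranked)  # ranked subspecialty matches with model confidence scores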

The IIAM software platform has achieved a referral accuracy rate nearly 30% higher than labor-intensive, personnel-driven status-quo workflows (90% vs 60%), even with limited patient information (90% accuracy rates with 30-70% of patient records missing). This clinical effectiveness was confirmed using patient data from the two largest national cancer databases (SEER, NCDB) and three tertiary healthcare centers (UCSF, Hopkins, MGB). At UCSF, our product has undergone both retrospective and prospective validation. Our algorithm was 100% accurate in identifying malignancy. Our team hopes to utilize the UCSF AI Pilots program to develop a solution that is more tailored to the institution’s specific self-scheduling needs and refine algorithm performance via training on UCSF clinical data.

Section 3: How would an end-user find and use it? 

The AI tool will be integrated into UCSF’s existing online scheduling system, where it will prompt new patients seeking a specialty visit to upload relevant medical documents—such as referral letters, imaging results, pathology reports, or lab work.

Once documents are submitted, IIAM’s AI analyzes the content to understand the underlying condition, clinical urgency, and appropriate subspecialty. Within seconds, the patient receives a tailored list of providers with real-time appointment availability, ranked by clinical fit and urgency, based on the content of their documents (Figure 2). The patient then selects from the AI-filtered list of appointment options, enabling them to self-schedule with an appropriate provider without waiting for manual triage, provider verification, or callbacks. If the AI determines that a more urgent evaluation is needed, it may recommend an earlier visit slot or flag the case for real-time escalation to nurse triage.

The system is designed to require minimal effort from the patient while maximizing the value of any existing work-up they have completed. Importantly, the AI acts as a behind-the-scenes assistant—not a gatekeeper. Patients retain the ability to view other available providers or request help if needed. For internal users (e.g., nurse navigators or access center staff), a clinical summary of the AI’s triage decision can also be displayed to assist in complex case management.

To minimize errors, we intend to retain a human in the loop during triage and prior to scheduling for the first three months of the live pilot. If success metrics and accuracy rates remain high, the team will discuss and consider gradually scaling back the human-in-the-loop involvement. The algorithm provides a confidence score with every referral, and any referral associated with a low confidence score will automatically be flagged for human review.
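The confidence-based gate could look roughly like the following sketch; the threshold value is an assumed placeholder that would be tuned against pilot data rather than a specification of the production system.

# Sketch of the confidence-based human-review gate described above.
CONFIDENCE_THRESHOLD = 0.80  # illustrative value, to be tuned during the pilot

def route_referral(predicted_subspecialty: str, confidence: float) -> dict:
    """Offer self-scheduling when the model is confident; otherwise flag for human triage."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"action": "offer_self_scheduling", "subspecialty": predicted_subspecialty}
    return {"action": "flag_for_human_review", "subspecialty": predicted_subspecialty}

print(route_referral("endocrine head & neck surgery", 0.93))
print(route_referral("head & neck oncology", 0.41))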

Section 4: Embed a picture of what the AI tool might look like. 

Figure 2. Proposed UCSF Self-Scheduling Workflow. Based on the patient’s pre-existing work-up and documents, the UCSF patient self-scheduling portal will provide the patient with an AI-filtered list of appointment options with subspecialists that treat the patient’s existing condition.

Figure 3. UCSF Self-Scheduling User Interface. A picture of the patient user interface with the AI-filtered list of appointment options with appropriate subspecialists.

Section 5: What are the risks of AI errors? 

False negatives—cases where urgent conditions are missed—can lead to harmful delays in care. In contrast, false positives may cause patients to be seen earlier than necessary, potentially burdening provider schedules. To mitigate this, the algorithm is intentionally designed to favor false positives over false negatives when evaluating the urgency of a patient’s chief complaint. This approach was developed in close collaboration with oncology providers, based on the shared principle that it is preferable to evaluate a benign lesion too early than to delay care for a potential cancer patient.

Encouragingly, the algorithm has demonstrated a high level of accuracy across leading healthcare systems (Figure 4). It significantly outperformed the traditional personnel-based call center at Johns Hopkins, achieving 87% accuracy using a random forest model, compared to 60% under current workflows. At UCSF, the algorithm achieved approximately 90% accuracy when benchmarked against physician assessments and pathology-confirmed diagnoses (Figure 5). These real-world results suggest a reliable and scalable solution for streamlining referrals and improving timely access to specialty care.

Figure 4. IIAM performance at both UCSF and JHMI. IIAM performance on incoming referrals during 2024 at both Johns Hopkins and UCSF. At JHMI, status-quo referral workflows involve a centralized call center and EPIC-based random forest algorithms, with a baseline performance of 60%. The physician’s assessment/plan and any surgical pathology reports served as the ground truth.

Figure 5. Confusion matrix for IIAM ML algorithm and UCSF Head and Neck Pathologies. Pathologies were determined from physician assessment and any related pathology results during a patient visit. Of the pathologies, 96% of non-endocrine neoplasm pathologies matched algorithm pathology predictions; 85% of benign lesion pathologies matched algorithm pathology predictions; 96% (26/27) of thyroid pathologies matched algorithm pathology predictions; 100% of parathyroid pathologies matched algorithm pathology predictions; and 100% of salivary gland pathologies matched algorithm pathology predictions.

Section 6. How will we measure success?

Measurements using data that is already being collected in APeX: 

• New patient referral volume
• Number of patients scheduled per month via the self-scheduling portal
• Time to triage
• Time to schedule
• Time to treatment

Other measurements you might ideally have to evaluate the success of the AI: 

• Percentage of appropriately scheduled referrals (etiology and urgency)
• Satisfaction scores (patients, providers, practice coordinators)
• Percentage of incoming referrals scheduled (UCSF H&N baseline: 62%)
• Surgical conversion rate, aka optimal clinical utilization (e.g., cancer patients who are non-surgical candidates see medical oncology first; non-biopsied patients see ENT first; benign lesions or incomplete work-ups see an APP first)

Section 7: Describe your qualifications and commitment: 

Our team combines deep clinical expertise with a proven track record of applying AI/ML solutions to real-world healthcare challenges. With a shared commitment to improving patient access and outcomes—particularly in cancer care—we are uniquely positioned to lead this initiative.

Katherine Wai, MD is a head and neck cancer surgeon-scientist in the UCSF Department of Otolaryngology–Head and Neck Surgery. She has published peer-reviewed research on the use of AI/ML to improve triage and referral processes for cancer patients, reflecting her dedication to innovation in care delivery. Dr. Wai is the principal investigator for UCSF’s pilot studies involving IIAM technology and will ensure the project meets and exceeds national quality improvement benchmarks.

Patrick Ha, MD is the Medical Director of UCSF Mission Bay Adult Services and holds the Irwin Mark Jacobs and Joan Klein Jacobs Distinguished Professorship in Head and Neck Surgery. An international expert in head and neck cancer research and outcomes, Dr. Ha brings deep expertise in NCCN and AJCC cancer guidelines. He will oversee the integration of AI-driven solutions into UCSF’s self-scheduling workflows and ensure alignment with the UCSF Cancer Center's patient access and strategic goals.

Nicole Jiam, MD is the Director of the UCSF Otolaryngology Innovation Center and Chief Executive Officer of IIAM Health. A clinical informaticist with experience collaborating across leading academic health systems—including Mass Eye and Ear and Johns Hopkins—Dr. Jiam has authored multiple peer-reviewed publications on AI/ML in healthcare and holds patents from both institutions. She brings a rare blend of clinical, academic, and entrepreneurial experience, having served on advisory boards for health tech companies nationwide and as a former fellow with digital health-focused VC firms. Dr. Jiam will actively guide product development and participate in regular progress reviews with UCSF’s AER team.

Together, Drs. Wai, Ha, and Jiam lead a cross-disciplinary effort grounded in clinical excellence and operational insight. With active involvement from UCSF leadership, their work reflects a sustained commitment to transforming cancer care access through scalable, AI-powered solutions.

Supporting Documents: 

AI Generated Discharge Instructions to Improve Patient Care Transitions

Proposal Status: 

The UCSF Health problem
The After Visit Summary (AVS) is the sole document patients receive after an inpatient hospitalization. It contains vital information including a summary of their hospitalization, medication changes, follow-up appointments, future labs/imaging needed, and important points of contact. For many patients we care for at UCSF with significant medical complexity and followed by multiple subspecialties, this document can become lengthy and burdensome, sometimes dozens of pages in length.

Over the past few months, we have engaged with patients, their families, nurses, and physicians to learn about their experiences with the AVS. We have amassed a list of issues within the AVS needing improvement, with numerous stakeholders identifying artificial intelligence (AI) as a possible solution to these issues:

1. From a physician perspective, generating patient-facing discharge instructions (one component of the AVS) is of variable priority, complicated by a non-standardized view based on the physician’s personal experiences, a fragmented understanding of the individual patient's health literacy in the setting of heavy physician turnover/discontinuity, and little time to complete this task thoughtfully amid competing clinical demands.

2. From a nursing perspective, delivering the AVS to the patient and family requires devoting precious minutes to clarifying the details in this document. Especially in times of limited staffing, this can represent a significant time burden and can introduce frustration around some of the areas of ambiguity in the document. Standardization, accuracy, debulking, and streamlining of information reduces questions at the point of discharge and improves nursing experience and satisfaction.

3. From a patient and family perspective, various AVS components including hospital course summaries, key medication changes, follow-up appointments, and return precautions are presented in a variety of formatting styles, often creating bloat and obfuscating details as a result. In addition, components including specialty-specific scheduling information and diagnosis-specific resources can add significant length to the AVS with no regard to a patient’s health literacy level and preferences in communication style. In isolation, this abundance of information can be helpful, but patients/families have expressed that it can be exceedingly burdensome and conducive to disregarding other important aspects of the document. The literature reveals similar findings, with patient data from other institutions revealing themes of feeling overwhelmed by the amount and length of information in discharge paperwork and having poor clarity on follow-up plans. This can only be made worse by ambiguity or discrepancies in the provider-created discharge instructions within the document.

How might AI help?
Our vision is for AI to draft discharge instructions based on APeX notes and information currently templated in the AVS and Discharge Summary. The large language model (LLM) will be trained to present information in a fashion that suits the patient’s stated preferences regarding health communication, a practice not currently standardized at UCSF but which represents a significant patient-centered advance in AVS design. Some possible preferences to be taken into account might be expected level of detail, inclusion of holistic health practices, use of visual information, and references to specific outpatient provider names for the patient rather than specialty names. It would use scheduled appointments and referrals placed within APeX, imaging/labs ordered, post-discharge orders, and existing progress/consult notes to display information in the best format for that particular patient.

These instructions could potentially include a brief hospitalization summary and medication changes, as well as explaining the stage of scheduling for various discharge follow-up appointments and next steps to take, numbers to call, and the likely topics at each appointment. For example, "The Cardiology appointment with UCSF Cardiology on May 1, 2025 at 9:30 am will be to discuss your recently diagnosed heart failure. You can expect to discuss these medications: carvedilol, spironolactone, losartan, and furosemide." or "The Neurology Infusion clinic will contact you for scheduling your immunosuppressive Cytoxan for vasculitis. They will complete a prior authorization with your insurance company. For questions, call 415-514-****." By providing information in a succinct, standardized manner (a brief drafting sketch follows the list below), we believe this will achieve: 

• Significantly shortened and more patient-friendly AVS documents for the patient/family
• Less time spent by physicians and nurses on duplicating and sorting information in discharge documentation
• Better understanding of the purpose of upcoming visits and relevance of certain medications
• Subsequently, better adherence to documented medical plans
• Greater patient activation and engagement in completing treatment plans
• Greater patient trust in the health system
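As referenced above, the sketch below illustrates one way the drafting step could be structured: structured follow-up appointment data and the patient's stated communication preference are assembled into a prompt for the LLM, whose draft would then be reviewed by clinicians before entering the AVS. The data structure and field names are illustrative assumptions, not APeX objects.

# Minimal drafting sketch, assuming structured appointment data extracted from the EHR.
from dataclasses import dataclass

@dataclass
class Appointment:
    clinic: str
    date: str
    reason: str
    medications: list[str]

def draft_followup_instructions(appointments: list[Appointment], reading_preference: str) -> str:
    """Build a plain-language prompt asking the LLM to explain each follow-up visit."""
    lines = [
        f"- {a.clinic} on {a.date}: likely to discuss {a.reason}; "
        f"relevant medications: {', '.join(a.medications) or 'none listed'}"
        for a in appointments
    ]
    return (
        "Rewrite the following follow-up plan as patient-facing discharge instructions.\n"
        f"Patient communication preference: {reading_preference}.\n"
        "Use short sentences, name the clinic and date, and explain the purpose of each visit.\n"
        + "\n".join(lines)
    )

example = draft_followup_instructions(
    [Appointment("UCSF Cardiology", "May 1, 2025 9:30 am", "newly diagnosed heart failure",
                 ["carvedilol", "spironolactone", "losartan", "furosemide"])],
    reading_preference="plain language, minimal medical jargon",
)
# `example` would be sent to the LLM; clinicians review the draft before it enters the AVS.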
How would an end-user find and use it?
This would be a highly visible intervention. The AVS (Figure 1) is often the sole document all patients receive after hospitalization, and it represents a written memorandum of high-importance points from the clinician team to the patients and caregivers. The current generators (physicians) (Figure 2), messengers (nurses), and recipients (patients/families) would all immediately see a different document design, one that balances patient-centered communication with the need for a standardized workflow to ensure smooth clinical operations.

Embed a picture of what the AI tool might look like.
Figure 1: Discharge Instructions, typically on page 2 of AVS, where AI-generated information would go (patient view)

Figure 2: Discharge Instructions tab with sample AI-generated information displayed (provider view)
                           
What are the risks of AI errors? What are the different types of risks relevant to the proposed solution and how might we measure whether they are occurring and mitigate them?
The various components of this AI tool will have varying degrees of risk and consequences:

1. Some aspects of the medication management plan (including contingency plans for symptoms like chest pain, weight gain, headaches, or other possible harbingers of serious disease) have the potential to be misrepresented by the LLM or miss important nuances.

2. Hospital course summaries can be significantly briefer than a Discharge Summary, but there is a chance of generating misunderstandings through oversimplification.

3. Follow-up appointments are presented as structured data but nevertheless could be erroneously reported or displayed by the LLM. Safeguards against this error include already-existing MyChart and phone reminder processes that occur prior to appointments.

4. Reason for visit and medications to be discussed: this data would be AI-generated using existing progress/consultant notes within APeX. There is more room for hallucination to occur in this aspect of the AI tool.

Current practice patterns emphasize direct clinician (physician, nurse) clarification of the AVS with patients as they prepare to discharge, which may protect against some errors. Importantly, human error is a notable flaw of the current AVS generation platform, with several areas (including discharge instructions) requiring manual data entry to present information found in other areas of the same document.

How will we measure success?
Measurements using data that is already being collected in APeX:
Initial success would be measured by assessing for the presence of high-consequence errors (inaccurate medication dose, falsified hospital course information, erroneous appointment time or location) by comparing the AI-generated AVS with the currently utilized version in APeX.

Additional measurements ideally present to evaluate success of the AI tool:
Subsequent success would be measured by assessing for more low-consequence errors (e.g., incorrect reason generated for a follow-up visit) and the burden of generating the AI-summarized instructions in the AVS. We have previously used some of these methods to help inform our work with AVS improvement.

• Qualitative feedback from patients and families (e.g., from the Patient and Family Advisory Council)
• Quantitative measurements of follow-up appointments/labs/imaging successfully scheduled/attended versus missed
• Quantitative measurement of the number of phone calls/questions received regarding follow-up appointments by the Care Transitions Outreach Team (who contact patients post-discharge)
• HCAHPS survey data regarding patient satisfaction on post-discharge transition
Describe your qualifications and commitment:
We are academic hospitalists at UCSF who are motivated by our commitment to ensuring the patients we discharge have the best understanding possible of their care plan and next steps. With allocated protected time and a desire to collaboratively, systematically, and thoughtfully solve problems just like this, we are well equipped to dedicate the necessary effort to ensuring this project’s success.
                           
References
Omonaiye, O., Ward-Stockham, K., Darzins, P., Kitt, C., Newnham, E., Taylor, N.F. and Considine, J., 2024. Hospital discharge processes: Insights from patients, caregivers, and staff in an Australian healthcare setting. PLoS One, 19(9), p.e0308042.

Schwarz, C.M., Hoffmann, M., Smolle, C., Borenich, A., Fürst, S., Tuca, A.C., Holl, A.K., Gugatschka, M., Grogger, V., Kamolz, L.P. and Sendlhofer, G., 2024. Patient-centered discharge summaries to support safety and individual health literacy: a double-blind randomized controlled trial in Austria. BMC Health Services Research, 24(1), p.789.
                           
Supporting Documents: 

Implementation of an AI-Powered Platform for Scalable, Real-Time Lab Monitoring in Immunosuppressive Treatment

Proposal Status: 

1. The UCSF Health Problem

Patients at UCSF Health who receive immunosuppressive medications (e.g., disease-modifying antirheumatic drugs (DMARDs) like methotrexate, or biologics) require frequent laboratory monitoring to detect potential toxicities such as liver enzyme elevation, cytopenias, elevated blood pressure, organ dysfunction, or opportunistic infections. Although guidelines recommend regular laboratory monitoring, these processes rely on manual workflows during clinic visits and are unfortunately ad hoc in nature, leading to significant patient safety risks. Further complicating monitoring, tests may be performed in outside laboratories and therefore only captured in clinical media in PDF form (or discussed during the visit) within the EHR. National studies done by our team also demonstrate that critical safety monitoring for immunosuppressive medications is frequently delayed or not done at all.1,2 As a result, patients may face delayed therapy adjustments and unnecessary risk of medication-induced harm, with crucial lab values or infection screenings failing to surface in real time. Existing methods, such as manual spreadsheets or nurse phone reminders, are labor-intensive and difficult to scale. Operational leadership in rheumatology at UCSF has identified development of improved lab monitoring systems as a high-priority target for improvement, noting that automated, comprehensive approaches could significantly increase efficiency, reduce legal risks to the institution, and most importantly, enhance patient safety.

2. How Might AI Help?

We propose an AI tool to automatically monitor drug safety gaps. The tool would identify patients overdue for recommended labs based on medications, laboratory data, outside lab tests contained in PDF documents, and labs that were entered by clinicians in notes. We envision this tool as a system with a dashboard as the frontend; the dashboard would streamline tracking of required safety tests while minimizing false alarms regarding missed screenings and reducing the need for manual chart review, both increasing safety and decreasing administrative burdens. Specifically (a simplified sketch of this loop follows the list):

• The tool will select patients who are currently receiving a medication of interest that requires periodic testing, including oral DMARDs and biologic drugs. Required tests and their limit dates will be kept in a table the tool has access to, such as a CSV file, which is updated daily.
• The tool will then retrieve all relevant structured EHR data (laboratory results and other relevant screenings) for patients and flag those with significant delays in monitoring. By using existing branching algorithms and prior experience with Clarity pulls, we could quickly implement this real-time data pull using LOINC codes. For example, methotrexate requires laboratory monitoring every 3 months; the tool will flag all patients who do not have evidence of monitoring in the last 5 months. This grace period will identify patients who are clearly outside guideline-recommended screening windows.
• For flagged patients, the tool will then leverage Versa through the API, screening notes and patient portal messages from recent visits for mention of labs and tests of interest that may have been performed externally. For PDF documents in clinical media, the tool will use an OCR algorithm (or Python extraction if digital) followed by Versa. If information is found, the tool will update its data, removing the flag, indicating that the patient did receive their periodic testing in time, and presenting lab results in the dashboard. This will allow us to avoid running Versa unnecessarily. We have prior experience in prompt design and in-context learning methods using Versa for information extraction, even when the information is not explicitly available but requires model reasoning, and will use this expertise here.3
• The AI tool will raise a warning for all patients that remain flagged (i.e., no testing data found in either structured data, notes, or clinical media PDF files).
• The loop will be repeated as needed to ensure patients continue to receive their safety testing. The loop will also include criteria for exclusion (for example, the medication is no longer prescribed, or the patient is no longer seen at the clinic, with functionality for staff to update this latter component in the platform). We will explore how often to re-send alerts that are not acted upon, and re-run the loops, through qualitative interviews with the care team.
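As referenced above, the following simplified sketch shows the core gap-detection step: monitoring rules are read from a maintained table, each patient's most recent relevant lab date is compared against the allowed interval plus grace period, and patients without timely results are flagged for the downstream note/PDF check. The column names and helper structures are assumptions for illustration, not the actual Clarity schema.

# Simplified gap-detection sketch; real data would come from Clarity/FHIR pulls.
from datetime import date, timedelta
import csv

def load_monitoring_rules(path: str) -> dict[str, int]:
    """Map medication name -> maximum allowed days between safety labs (interval plus grace)."""
    rules: dict[str, int] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: medication, interval_days, grace_days
            rules[row["medication"].lower()] = int(row["interval_days"]) + int(row["grace_days"])
    return rules

def find_gaps(patients: list[dict], rules: dict[str, int], today: date) -> list[dict]:
    """Flag patients whose last relevant lab is older than the allowed window (or missing)."""
    flagged = []
    for p in patients:  # each dict: patient_id, medication, last_lab_date (date or None)
        window = rules.get(p["medication"].lower())
        if window is None:
            continue  # medication not subject to periodic monitoring
        last = p["last_lab_date"]
        if last is None or (today - last) > timedelta(days=window):
            flagged.append({"patient_id": p["patient_id"], "medication": p["medication"],
                            "last_lab_date": last})
    return flagged

# Flagged patients would next be checked against notes and scanned PDFs (via OCR and the
# Versa API) before an alert is surfaced on the dashboard, as described in the list above.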

3. How Would an End-User Find and Use It?

The AI tool would run in the background and require no UI. Interaction is limited to: (1) MAs receiving alerts, who will then communicate with other members of the care team and patients as needed, and (2) loop results saved in a dashboard the care team can consult. The dashboard, which will be iteratively designed with clinical staff, may also include search functionality to customize checks for individual patients (e.g., a patient who has a comorbidity that requires more frequent monitoring) as well as a feedback option, so that the study team can monitor problems as they come up. The figure below presents a diagram of our proposed model.

Figure 1: Model Diagram

4. What Are the Risks of AI Errors?

• False Positives: The system might raise a false alarm by missing documentation of a lab that has already been completed and documented in the EHR. To mitigate this risk, we will do extensive testing with human annotators and staff during pilot testing, specifically asking if they have observed any false positives, and if so, we will retrain/adjust the model. We will ensure this risk is minimal by conducting chart review and model evaluation during an initial proof-of-concept phase. Once fully implemented, model maintenance will include regularly scheduled validation to ensure the model continues to function properly, preventing model drift. Feedback collection will be integrated in the dashboard.
• False Negatives: The system fails to detect a truly missed test. The feedback system described above will also detect and mitigate this risk if present. We will also assess the prevalence of false negatives during the proof-of-concept phase and regularly throughout implementation.
• Hallucinations/Omissions: Versa incorrectly extracts lab data from notes (leading to either a false negative or a false positive). We consider this a special case of both risks defined above, as retraining Versa would not be possible if it occurs. While the risk of misreading old results as current is low, since we will only add notes from the relevant time period, past labs may still be discussed in those notes. If this risk is observed, we will try to mitigate it through prompting techniques such as metaprompting and chain of thought, or by using a smaller, local model we are capable of fine-tuning, such as ClinicalBERT.

We will further mitigate these risks by tracking real-time use and outcomes; for example, how many flagged alerts led to ordered labs, indicating a lab was indeed past due. We also place a human in the loop, as MAs will interact with the dashboard and report to clinicians and nurses.

                          5. How Will We Measure Success?

                          Our plan is to pilot the AI tool in the UCSF Rheumatology clinic during the first year. We aim for a model that is fully implemented and collecting use data by the end of the first year. If effective, we will seek to extend to other UCSF clinics using medications that require toxicity monitoring, such as gastroenterology, dermatology and nephrology. We propose:

                          During model design and proof of concept (Months 1-3): We will manually chart review a statistically representative sample of our dataset of potential gaps identified by the AI tool with 2-3 UCSF rheumatologists, indicating how many identified gaps were true positives, using classic machine learning metrics (F1, precision, recall, AUROC), as well as correctness of retrieved data. This will also serve as an evaluation set for any future changes to the model. We will continuously add data to this evaluation set throughout pilot testing.

                          During implementation and pilot testing (Months 4-12):

                          • Changes in lab completion rates: Proportion of patients on target medications who complete missed labs detected by the AI tool within guideline-specific intervals.
                          • Number of missed tests identified by Versa missing in structured data.
                          • Performance of different prompts and models, in clinical media and in notes, using classic ML metrics, including correctness of retrieved lab values from notes and PDFs.
                          • Time-to-order: Time, in days, from gap detection to lab order and to lab completion.
• Provider feedback: We will interview 2 clinical staff during the first week of implementation, and every other week thereafter until the conclusion of the pilot or until false positives, false negatives, and hallucinations have been eliminated.
                          • Provider satisfaction: At the conclusion of the pilot test, we will interview 5-6 UCSF rheumatologists and nurses (or until thematic saturation) on their perspective of the AI tool using semi-structured interviews following the technology acceptance model.

                          If we achieve excellent accuracy (>97% correct assessments) we will explore the possibility of further automating the process by generating pended lab orders for clinicians to sign.

                          6. Qualifications and Commitment

                          Project co-leads: Augusto Garcia-Agundez, PhD. Postdoctoral researcher at the Division of Rheumatology. Expert in AI methods and NLP. Dr. Garcia-Agundez will commit 10% effort to implement and validate AI methods and conduct continuous evaluation. Jinoos Yazdany, MD MPH. Chief of the Division of Rheumatology at ZSFG and the Alice Betts Endowed Professor of Medicine at UCSF. Dr. Yazdany has ample experience in quality improvement projects and prior experience designing EHR add-on tools such as a dashboard for Rheumatoid Arthritis. Dr. Yazdany will commit in-kind effort for the year to collaborate with UCSF Health AI and APeX teams. Co-Is: Gabriela Schmajuk, MD, MSc. Chief of Rheumatology at SFVAHC and Professor of Medicine in the Division of Rheumatology at UCSF. Andrew Gross, MD. Dr. Gross is Medical Director of Rheumatology at UCSF. Diana Ung, PharmD, APh, CSP. Dr. Ung is the UCSF Specialty Pharmacist for Rheumatology. Nathan Karp, MD. Director of QI for Rheumatology at UCSF.

                          7. Summary of Open Improvement Edits

                          Added clarification about existing preliminary work including branching algorithms for required labs and timeframes, and knowledge of where to pull the required data from Clarity.

                          A Multimodal Foundation Model for Enhanced Prostate Cancer Care at UCSF Health

                          Proposal Status: 

                          1.  The UCSF Health Problem

                          Prostate cancer is the most commonly diagnosed malignancy among men in the United States. Accurate diagnosis, staging, and monitoring are critical for effective treatment planning and improving patient outcomes. Imaging modalities, such as Prostate-Specific Membrane Antigen (PSMA) Positron Emission Tomography (PET), CT, and MRI, provide valuable insights but often lack the integration necessary to fully capture the complexity of prostate cancer progression. Current challenges include:

• Fragmented Data Interpretation: Clinicians must manually synthesize information from disparate imaging modalities (PSMA PET, CT, MRI) and clinical records, which can be time-consuming and prone to variability.
• Variability in Imaging Assessments: Interpretations of imaging studies can vary among radiologists, leading to inconsistencies in diagnosis and treatment planning.
                          • Diagnostic Workflow Inefficiency: Radiologists face growing workloads with increasingly complex multimodal cases. Generating comprehensive, standardized reports is labor-intensive and varies by provider.

AI methods have exceptional potential to better leverage the full spectrum of available data, including imaging and clinical measures, to improve diagnostic performance. The intended end-users of this project are medical oncologists, radiation oncologists, radiologists, and urologists, who currently lack integrated tools to synthesize these multimodal data.

                          2.  How might AI help?

                          A multimodal foundation model leveraging PSMA PET/CT/MRI datasets and textual clinical data from UCSF prostate cancer patients offers a transformative solution. PSMA PET has recently emerged as an extremely powerful tool for more accurately identifying prostate cancer cells, particularly in the metastatic setting. It is combined with CT and/or MRI to provide anatomical reference, allowing for identification of localization patterns (e.g. metastases to the bones vs liver vs lymph nodes) as well as removing false positives. MRI has the additional benefit of potentially incorporating additional contrasts depicting perfusion and cell density that also can reveal tumor characteristics. Complementing this, textual clinical data encompasses a wealth of information regarding patient history, symptoms, PSA levels, biopsy results (including Gleason scores), treatment regimens, and follow-up outcomes. By integrating imaging and clinical data, a multimodal vision foundation model can potentially uncover intricate relationships and patterns that remain hidden when each data type is analyzed separately. Furthermore, a significant advantage of multimodal foundation models lies in their ability to learn from large-scale unlabeled or naturally paired data, which is particularly beneficial given the challenges and costs of annotating large, multimodal datasets.

                          Phase 1 will focus on developing a pre-trained multimodal foundation model using a substantial corpus of PSMA PET/CT/MRI images and corresponding textual clinical data from prostate cancer patients at UCSF.  We have a PSMA PET database of over 2000 studies at UCSF which will be used to identify the datasets.  This will be done with self-supervised learning. Phase 2 will involve fine-tuning the pre-trained foundation model for two downstream high-impact applications that directly assist clinical workflow:

                          1. Lesion Detection & Segmentation: Automate identification and outlining of cancerous lesions on scans, to assist in consistent tumor detection.
                          2. AI-Generated Radiology Reports: Automatically produce a draft radiology report from the multimodal data, reducing the reporting workload on radiologists.

                          By project end, we will deliver (1) a pre-trained multimodal foundation model, (2) a fine-tuned lesion segmentation AI tool, and (3) a fine-tuned report generation AI tool validated on UCSF data.
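To make the Phase 1 self-supervised pretraining concrete, below is an illustrative sketch of a CLIP-style contrastive objective that aligns imaging-volume embeddings with clinical-text embeddings. The toy 3D encoder, embedding dimension, and random batch are placeholders, not the project's actual architecture.

```python
# Illustrative sketch only: CLIP-style contrastive pretraining pairing PET/CT volumes
# with report/text embeddings. Encoders and dimensions are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VolumeEncoder(nn.Module):
    """Toy 3D CNN standing in for a real imaging backbone (e.g., a 3D ViT)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(2, 16, 3, stride=2, padding=1), nn.ReLU(),   # 2 channels: PET + CT
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(32, dim)

    def forward(self, x):
        return self.proj(self.conv(x).flatten(1))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/report pairs in a batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Placeholder batch: 4 paired PET/CT volumes and pre-computed clinical-text embeddings.
volumes = torch.randn(4, 2, 32, 32, 32)
text_embeddings = torch.randn(4, 256)
loss = clip_loss(VolumeEncoder()(volumes), text_embeddings)
loss.backward()
```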

                          3.  How would an end-user find and use it?

                          When deployed, clinicians will access the AI tool through familiar systems (the APeX EHR and PACS imaging viewer). For example, when a prostate cancer patient’s PSMA PET/CT or MRI is opened, an AI Decision Support panel will display the model’s outputs. These include visual lesion overlays on the scans (highlighting detected tumors or metastases) and an AI-generated draft report summarizing key findings (e.g. lesion locations and characteristics). The radiologist can review the highlighted lesions, adjust or accept the draft text, and be notified of any critical findings flagged by the AI. Explanations (e.g. confidence levels or reference images for each prediction) will accompany the results to help the clinician trust and understand the recommendations. This AI support is embedded seamlessly into the existing workflow via a tab in APeX/PACS, so end-users can incorporate the tool without needing to switch to a separate application.

                          4.  Embed a picture of what the AI tool might look like.

                           

Figure: AI Decision Support interface for prostate cancer care, showing multimodal imaging (CT and PSMA PET) with cancerous lesion detection and classification overlays, alongside an auto-generated report summarizing the findings.

                          5.  What are the risks of AI errors?

                          Risks include false positives (benign findings flagged as cancer), false negatives (missed cancer lesions), hallucinations (incorrect information in the report), algorithmic bias (disparities in care), and overreliance on AI (overlooking clinical judgment). We will measure error rates by comparing AI predictions to ground truth data (histopathology, outcomes) and through clinician feedback. Mitigation strategies include using high-quality training data, rigorous validation, uncertainty estimation, transparent presentation of limitations, human oversight and clinician override options. Continuous monitoring and retraining will be essential.

                          6.  How will we measure success?

                          a. Using Existing APeX Data:

                          • Diagnostic Accuracy: Comparison of AI-assisted diagnoses with traditional methods using metrics such as sensitivity, specificity, and area under the curve (AUC).
                          • Workflow Efficiency: Time taken for report generation and decision-making processes before and after AI implementation.
                          • Number of unique clinicians using the tool and frequency of use.

                          b. Additional Measurements:

                          • Surveys and feedback from clinicians on usability and clinical utility.  
                          • Prospective evaluation of AI accuracy in downstream tasks (e.g., checking segmentation accuracy and report correctness in a pilot setting).

                          Continued support from UCSF Health leadership will require demonstrating significant clinician adoption, positive feedback, improvements in intermediate outcomes or workflow efficiencies, and evidence of the AI model's accuracy.

                          7.  Describe your qualifications and commitment:

                          I am Mansour Abtahi, a Specialist at the University of California San Francisco, where I am currently developing AI/ML models on prostate cancer using clinical and imaging data (PET/CT/MRI). I bring extensive expertise in developing innovative AI/ML models within healthcare, as evidenced by my publications on analyses of ophthalmology imaging data in top-tier journals. My skills encompass a wide range of deep learning architectures, including CNNs, Transformers, ViTs, and Large Multimodal Models, along with proficiency in computer vision techniques for medical image analysis, particularly in 3D imaging (MRI, CT, PET) relevant to this project. I am well-positioned to co-lead with esteemed UCSF faculty such as Dr. Thomas Hope and Dr. Peder Larson. Dr. Hope, Vice Chair of Clinical Operations in Radiology, has led the translation of PSMA PET imaging into the clinic, creating a paradigm shift in prostate cancer assessment, and the use of theranostic agents for precision treatment of prostate cancer. Dr. Larson, Director of the Advanced Imaging Technology Research Group, is a biomedical engineer who specializes in imaging physics and multimodality imaging, with projects improving prostate cancer imaging with PET/MRI and hyperpolarized metabolic MRI.  Collaborating with them aligns perfectly with the goals of this project. I am fully committed to dedicating up to 10% of my time as a co-lead, participating in progress sessions, and collaborating closely with UCSF’s Health AI and AER teams. This effort will drive the successful integration of AI models into the APeX EHR system, with the goal of improving prostate cancer patient outcomes at UCSF Health.

                          Supporting Documents: 

                          AI-Driven Endometriosis Symptom and Risk Assessment Tool for Personalized Patient Management

                          Proposal Status: 

                          The UCSF Health Problem: Endometriosis is a chronic, debilitating condition affecting approximately 10% of reproductive-age women worldwide. It is characterized by the growth of endometrial-like tissue outside the uterus, leading to severe pelvic pain, dysmenorrhea, dyspareunia, bowel and bladder dysfunction, and infertility. Despite its prevalence, endometriosis remains significantly underdiagnosed, resulting in profound patient suffering and substantial economic burden.

                          The diagnostic journey for endometriosis is protracted, with patients often experiencing a delay of 8-10 years from symptom onset to diagnosis. This delay is compounded by the fact that patients consult an average of 10 physicians before receiving a correct diagnosis. This protracted process leads to years of untreated pain and suffering, impacting patients' quality of life, mental health, and productivity.

The economic impact of endometriosis is substantial. Studies have estimated the annual cost of endometriosis-related healthcare and lost productivity to be in the range of $10,000 to $20,000 per patient. This includes direct medical costs (e.g., physician visits, imaging, surgery, medications) and indirect costs (e.g., lost wages, reduced work productivity, absenteeism). A study published in Human Reproduction Update estimated the annual cost of endometriosis in the US alone to be over $69 billion.

                          Given the significant burden of endometriosis, there is an urgent need for improved diagnostic and management strategies. The current reliance on subjective symptom assessment and non-standardized history taking contributes to diagnostic delays and inconsistencies in care. An AI-driven tool that can standardize patient history collection, assess symptom severity, and predict treatment response has the potential to: reduce diagnostic delays and improve patient access to appropriate care, enhance clinician understanding of endometriosis and improve patient-clinician communication, optimize treatment planning and improve patient outcomes, and reduce the economic burden of endometriosis by minimizing unnecessary healthcare utilization and lost productivity.

How Might AI Help? AI can help by developing a dynamic, interactive questionnaire integrated into the patient's history within the APeX EHR. This questionnaire would cover key symptom domains (pelvic pain, dysmenorrhea, dyspareunia, bowel symptoms, etc.), patient medical history, family history, and lifestyle factors. Machine learning algorithms would analyze patient responses to generate: (1) a symptom severity score quantifying the patient's endometriosis burden, (2) risk stratification for specific endometriosis subtypes (e.g., deep infiltrating endometriosis, ovarian endometrioma), and (3) personalized recommendations for diagnostic workup along with the predicted likelihood of symptom improvement with various treatment options.
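As a rough illustration only, the sketch below shows how encoded questionnaire responses could feed a supervised classifier that outputs a subtype probability; the features, labels, and choice of scikit-learn gradient boosting are assumptions for demonstration, not the proposed algorithm.

```python
# Illustrative sketch: map structured questionnaire responses to a probability of
# deep infiltrating endometriosis (DIE). Features and labels are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Encoded responses: [pelvic_pain_0_10, dysmenorrhea_0_10, dyspareunia_0_10,
#                     bowel_symptoms_0_1, family_history_0_1]
X = np.array([
    [9, 8, 7, 1, 1],
    [3, 2, 0, 0, 0],
    [7, 9, 5, 1, 0],
    [2, 1, 1, 0, 1],
])
y = np.array([1, 0, 1, 0])   # 1 = surgically confirmed DIE (hypothetical labels)

model = GradientBoostingClassifier().fit(X, y)
new_patient = np.array([[8, 7, 6, 1, 0]])
print("Predicted DIE probability:", model.predict_proba(new_patient)[0, 1])
```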

How Would an End-User Find and Use It? Within the patient's history in the APeX EHR, an "Endometriosis Symptom and Risk Assessment" button would be available. Clicking this button would initiate the interactive questionnaire. Patients could complete the questionnaire in the clinic or remotely via the patient portal. Once completed, the AI would generate a report summarizing the symptom severity score, risk stratification, and personalized recommendations. The report would be displayed within the patient's chart, with visual aids to enhance understanding. Clinicians could use the report to: guide discussions with patients about diagnostic and treatment options, tailor treatment plans based on the predicted likelihood of symptom improvement, and pend orders for recommended labs and imaging directly from the report interface.

                          Embed a picture of what the AI tool might look like:

                          "Endometriosis Symptom and Risk Assessment"

                          Patient: [Patient Name]

                          Date: [Date]

                          Symptom Severity Score: 18 (Moderate-Severe)

                           

                          Risk Stratification:

                          - Deep Infiltrating Endometriosis: 60% probability

                          - Ovarian Endometrioma: 30% probability

                           

                          Recommended Diagnostic Workup:

                          - Pelvic MRI with endometriosis protocol

                          - Pelvic ultrasound

                           

                          Predicted Treatment Response:

                          - Hormonal Therapy (GnRH agonist): 70% likelihood of symptom improvement

                          - Laparoscopic Surgery: 85% likelihood of symptom improvement

                           

                          [Symptom Severity Graph: Showing patient's pain scores over time]

                          [Risk Probability Chart: Visualizing risk of different endometriosis subtypes]

                          [Buttons: “Order MRI,” “Order Labs,” “Discuss Treatment Options”]

What are the Risks of AI Errors? 1. False Negatives: The AI might underestimate symptom severity or miss subtle indicators of endometriosis, leading to delayed diagnosis. 2. False Positives: The AI might overemphasize certain symptoms, leading to unnecessary investigations or treatments. 3. Algorithmic Bias: The AI might exhibit biases based on the training data, potentially impacting care for minority populations. 4. Data Misinterpretation: Clinicians might misinterpret AI-generated risk scores or recommendations.

Mitigation strategies include rigorous validation of the AI algorithm on diverse patient populations, clear communication of the AI's limitations and the importance of clinical judgment, and continuous monitoring of AI performance and user feedback.

                          How Will We Measure Success?

                          This project's success will be evaluated based on the following framework, addressing real-world uptake, meaningful impact, safety, and equity/fairness:

1. Real-World Uptake (Process Metrics): Measurements using data already collected in APeX include frequency of tool utilization by clinicians (number of assessments completed), time taken to complete the assessment questionnaire, integration of AI-generated recommendations into patient care plans (e.g., orders for recommended labs/imaging), and changes in referral patterns to the Comprehensive Endometriosis Center. Other measurements ideally needed: clinician satisfaction surveys regarding tool usability and integration into workflow, and patient feedback on the clarity and helpfulness of the AI-generated report.

                          2. Meaningful Impact (Health Outcome Metrics): Measurements using data already collected in APeX, such as reduction in the time from symptom onset to diagnosis, changes in the utilization of diagnostic procedures (e.g., number of MRIs, laparoscopies), changes in the utilization of treatment modalities (hormonal therapy, surgery), patient-reported outcome measures (PROMs) for pain, quality of life, and symptom severity, reduction in complications related to endometriosis. Other measurements ideally needed include longitudinal data on symptom control and disease progression, cost-effectiveness analysis of the AI-driven approach, and patient reported measures of improvement in specific symptoms.

3. Safety and Equity/Fairness: Measurements using data already collected in APeX: analysis of potential disparities in tool utilization and outcomes across different patient demographics (e.g., race, ethnicity, socioeconomic status), and tracking of adverse events related to diagnostic or treatment decisions influenced by the AI tool. Other measurements ideally needed: validation of the AI algorithm's performance on diverse patient populations to ensure equitable outcomes, and patient feedback on perceived fairness and trust in the AI-driven assessment.

                          Evidence for Continued Support/Abandonment: Continued Support: Significant reduction in time to diagnosis, demonstrable improvement in patient-reported outcomes, and high clinician/patient satisfaction would warrant continued support. Abandonment: Evidence of significant algorithmic bias, lack of clinical uptake, or adverse patient outcomes would necessitate re-evaluation or abandonment.

                          Describe your qualifications and commitment: This project addresses a high-priority area within UCSF Health, specifically the significant delays and disparities in endometriosis diagnosis and management. The AI-driven tool has the potential for large-scale impact by improving patient outcomes and streamlining clinical workflows.

Jeannette Lager: As the Section Chief for Minimally Invasive Gynecologic Surgery (MIGS) at UCSF, and formerly the Medical Director and Interim Chief of the Urogynecology and MIGS division, I have clinical and leadership expertise directly relevant to endometriosis care. My role as Associate Director of the Comprehensive Endometriosis Center provides me with deep insight into the patient population and the challenges they face. I have a strong understanding of the clinical workflows and can design the AI tool to integrate seamlessly into existing practice.

                          My co-investigator, Zaineh Khalil, is a highly experienced Nurse Practitioner with a specialized focus on endometriosis care. Her extensive clinical experience, coupled with her deep understanding of patient needs, makes her an invaluable asset to this project. She has been actively involved in quality improvement initiatives within the MIGS and Urogynecology clinic. Her expertise in navigating the complexities of endometriosis management, combined with her understanding of clinical workflow, will be instrumental in ensuring the tool's practical application and successful integration into the clinical setting.

                          Furthermore, I have a strong track record of collaboration with UCSF administration and faculty, ensuring engagement and support from critical operational and clinical champions. Together with my co-investigator, we are positioned to effectively lead this project and ensure its successful implementation.

                          AI Chatbot Integration into APeX MyChart for Enhancing Patient Education and Reducing Provider Burden in Chronic Autoimmune Disease

                          Proposal Status: 

                          1. What problem are you trying to solve?  

In chronic autoimmune diseases such as Myasthenia Gravis (MG), as in several other conditions across specialties, patients experience fluctuating symptoms and highly individualized disease courses. This leads to numerous questions and a high demand for information, with patients often getting overwhelmed and confused by numerous internet search results. Patients often realize after their visit that they forgot to ask or clarify several questions pertaining to their diagnosis, leading to new chains of messages. Access to subspecialized healthcare providers is significantly limited. While electronic health records (EHRs) such as APeX MyChart are an interface to bridge care gaps, frequent MyChart messages can overwhelm providers and staff.

One current approach by health systems, including UCSF Health, is to warn patients about possible charges when providers respond to MyChart messages. This may especially dissuade patients of marginalized socioeconomic status from messaging providers. We thus propose a relational AI chatbot to reduce provider burnout and enhance patient care by offering a reliable source of answers to patient questions about their specific diagnosis or diagnoses.

Effective education and support are crucial for patients with chronic autoimmune diseases to manage symptoms, adhere to treatment plans, and make informed lifestyle choices. Without targeted guidance, patients feel overwhelmed and disconnected from healthcare providers, leading to suboptimal outcomes. Lack of readily available guidance provokes anxiety, with MyChart messages being one of the few available recourses. Additionally, reducing the burden on healthcare providers and their ancillary staff is essential for care quality and to prevent burnout.

Previous efforts have included telehealth-based education programs and mobile health (m-health) apps. However, these solutions often require high patient involvement and may not provide personalized, continuous support. Traditional educational methods like brochures and websites can lead to information overload, require fact-checking, and do not offer answers tailored to individual questions. While APeX/Epic allows providers to include disease-specific educational content in the After Visit Summary, the material is often too lengthy, making it difficult for patients to find key information and ultimately contributing to information overload. MyChart messaging thus remains the predominant method for patient-provider communication outside of clinic visits, which tends to overburden health systems.

                          The primary end-users for this pilot project are patients with MG, particularly those newly diagnosed. Secondary end-users include healthcare providers, such as neurologists and specialists, who can leverage the relational AI chatbot to enhance patient education and engagement while managing their workload more effectively. 

We will first pilot this for providers at UCSF Health treating Myasthenia Gravis. The tool and infrastructure can be extended and adapted to other autoimmune diseases such as Multiple Sclerosis and Rheumatoid Arthritis.

                          2. How might AI help?  

We propose the development and evaluation of a relational AI chatbot [1,2] aimed at delivering patient-centered education to support behavior change, improve disease literacy, and promote treatment adherence. This intervention is designed to enhance patient outcomes while alleviating the growing workload on providers and care teams tasked with responding to generic disease questions. In brief, this will make patient care more efficient and reduce redundancies.

AI chatbots have the potential to provide a scalable approach to supporting chronic disease management by offering flexibility and enabling reliable, equitable conversations. Research shows that patients often feel more comfortable disclosing sensitive information to nonjudgmental AI chatbots [3-5], leading to improved clinical outcomes. Unlike traditional one-way interventions, the chatbot gives patients more agency in their learning process. While delivering structured information, it adapts to patients' needs, answering questions, directing them to verified sources, and connecting them with healthcare providers as needed.

                          We will develop the chatbot through a phased, safety-first approach using GPT-4 via Azure OpenAI, which offers secure, HIPAA-aligned infrastructure. In Phase 1, we will curate a trusted knowledge base drawing from UCSF-approved materials, validated educational sources, and original content authored by Dr. Paul. In parallel, we will compile approximately 100 high-quality doctor–patient conversations to support few-shot prompting and our retrieval-augmented generation (RAG) system. This foundation will enable the chatbot to generate accurate, contextually grounded responses in a patient-friendly tone aligned with institutional standards.

                          In Phase 2, we will implement key guardrails, including confidence thresholds that trigger fallback responses or human referrals when uncertainty is detected, as well as a human-in-the-loop review process to audit interactions and inform iterative improvements. In Phase 3, we plan to explore deployment on fine-tunable architectures (e.g., open-source LLMs) to further enhance domain adaptation while preserving transparency, interpretability, and clinical safety. This modular development strategy allows us to launch with a robust, secure system and expand its capabilities responsibly over time.
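A minimal sketch of the retrieval-augmented generation and confidence-threshold fallback described above is shown below. The knowledge-base snippets, embedding model, Azure OpenAI endpoint, deployment name, and similarity threshold are all placeholders, and the real system would draw on the curated UCSF knowledge base.

```python
# Sketch: RAG with a simple retrieval-confidence fallback. Endpoint, key, deployment,
# documents, and threshold are placeholders for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import AzureOpenAI

docs = [
    "Pyridostigmine is a symptomatic treatment for myasthenia gravis...",
    "Warning signs of myasthenic crisis include worsening breathing or swallowing...",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

client = AzureOpenAI(api_key="...", api_version="2024-02-01",
                     azure_endpoint="https://example-placeholder.openai.azure.com")

def answer(question, threshold=0.35):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    sims = doc_vecs @ q
    if sims.max() < threshold:   # low retrieval confidence -> fallback to care team
        return "I'm not sure about this. Please message your care team through MyChart."
    context = docs[int(sims.argmax())]
    resp = client.chat.completions.create(
        model="gpt-4o-placeholder-deployment",
        messages=[
            {"role": "system", "content":
             "Answer patient education questions about myasthenia gravis using only the "
             "provided context. Defer emergencies to 911 and clinical decisions to the care team."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What does pyridostigmine do?"))
```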

Ultimately, this innovation can pave the way for disease-specific modules and educational materials across multiple chronic autoimmune diseases that patients (through their providers) can access upon diagnosis, engaging in conversations over the initial 3-6 months to enhance their disease knowledge and management skills. This can potentially reduce some of the messaging volume on MyChart, which currently represents a substantial burden for UCSF Health as well as other tertiary health systems. After successful deployment in Myasthenia Gravis clinics at UCSF Health, our medium-term vision extends to applying this relational AI chatbot to other chronic conditions, especially ALS and Rheumatoid Arthritis, where disease knowledge, management efficacy, and skills are critical. This approach ensures that patients are not only informed but also feel supported and motivated throughout their healthcare journey, ultimately driving better outcomes.

                          3. How would an end-user find and use it? 

                          The AI chatbot would be integrated directly into APeX MyChart, the same interface that patients use to send messages to their providers. Once a patient receives a confirmed diagnosis, the provider would inform them about the AI chatbot and its capabilities. If the patient agrees, the chatbot can be enabled for them from the “Wrap-Up” section, allowing them to begin using it immediately. Patients would interact with the chatbot through text-based conversations within APeX MyChart, receiving real-time responses and support. They could also connect with healthcare providers if needed, ensuring that the chatbot complements rather than replaces direct provider communication. The AI support would be most useful during the initial diagnosis and early stages of disease management, when patients are seeking information about the disease and guidance for daily life. By being embedded in APeX MyChart, the chatbot becomes a seamless part of the patient's existing communication and management tools. 

Our proposed chatbot will have three core capacities: education, relational engagement, and persuasive (nudging) abilities. These are grounded in Co-PI Zhang's AI Chatbot Behavior Change Model [1] and the RESPECT model [2] from UCSF for improving patient-provider communication. The chatbot will provide comprehensive information on MG symptoms, medications, side effects, and early warning signs, while fostering a supportive relationship to boost patient confidence and behavioral efficacy. It will also offer practical guidance for symptom monitoring and daily challenges at home and work, as well as suggestions in the realm of preventive health. Further, it will provide links directing patients to reliable, physician-trusted sources such as the Myasthenia Gravis Foundation of America (MGFA).

                          4. Embed a picture of what the AI tool might look like. 

Figure 1: A simple illustration of a patient logging in to MyChart and being prompted to chat with the AI chatbot within the message center, along with an example of a message exchange in which the chatbot answers their question.

                          5. What are the risks of AI errors? 

Potential AI errors include false positive or false negative predictions, and "hallucinations" from generative AI models. False positives could lead to unnecessary anxiety or interventions, while false negatives might result in missed symptoms or delayed treatment. Hallucinations could provide incorrect or misleading information. To measure and mitigate these risks, we would implement continuous monitoring and validation of the AI outputs, involve healthcare professionals (specialty physicians) in reviewing critical information, and provide clear disclaimers about the AI's limitations. Regular feedback from users would also help identify and address errors. The first 10 patients using the chatbot will be the PI's own clinic patients; after 3 months, we will offer the tool to other MG providers. While it engages users in interactive, personalized conversations, the chatbot will be programmed to deliberately defer any questions related to acute symptom management, emergencies, or health crises, promptly advising patients to call 911 or contact their healthcare provider. It will also include routine reminders encouraging patients to message their care team if their questions are not fully addressed, reinforcing that the chatbot is intended as an educational tool rather than a substitute for clinical care.

                          6. How will we measure success? 

                          We will take a multipart approach to measure the success of the AI chatbot incorporating both patient and provider feedback as well as log data analysis from Apex. In the initial phase, we will conduct feasibility testing to assess technical stability, response accuracy, and integration within clinical or research workflows.

a) Physician and Patient Feedback: In one-on-one interviews, physicians will respond to questions about their enthusiasm for using the tool, their ability to provide it to patients, and their perception of whether the tool has the potential to reduce the provider and staff burden of responding to messages. Since the pilot is aimed at Myasthenia Gravis patients, a short patient survey will use specific measures to assess patients' confidence in disease knowledge, symptom awareness, and treatment adherence, along with a scale measuring patients' satisfaction with the chatbot, their willingness to continue using it, and whether the chatbot helps answer questions they would otherwise want to reach their provider for. These surveys can be integrated into MyChart.

b) Outcome Metrics based on log data analysis from APeX:

(i) Reduction in Provider Messages: Analyzing provider-patient dyads, we will measure the reduction in MyChart messages received by providers' offices, comparing data before and after AI chatbot implementation using both "within provider" and "between providers" analyses.

(ii) Predictors of Bot Usage: We will assess factors predicting greater usage of the bot to ensure typical health disparities based on race, ethnicity, and gender are not exacerbated.

(iii) Healthcare Utilization: We will compare measures such as ER visits, hospital admissions, and urgent care visits over a one-year period to assess tangible reductions.

(iv) Evidence for Leadership: We will analyze data showing increased patient engagement, improved health outcomes, reduced provider messages, and positive feedback from users to convince UCSF Health leadership to continue supporting the AI implementation.

Abandonment Criteria: If the AI fails to improve patient outcomes, shows high rates of errors, or receives consistently negative feedback, we will consider abandoning the implementation.

                          7. Describe your qualifications and commitment: 

PI Dr. Pritikanta Paul is a Neuromuscular Neurologist and currently a Health Sciences Assistant Professor of Neurology at UCSF. His specific clinical and research interests lie in immune-mediated muscle and nerve diseases. His recent experiences caring for medically underserved patient populations have led to an interest in health disparities as they affect outcomes in neuromuscular diseases, and he recently developed an educational intervention utilizing text messages to improve outcomes in myasthenia gravis.

Co-PI: Dr. Jingwen Zhang, Associate Professor, Department of Communication, also affiliated with the Department of Public Health Sciences at the University of California, Davis (UCD). Dr. Zhang's research focuses on understanding, designing, and testing emerging persuasive technologies in shaping public health attitudes and behaviors. Her research has been supported by NIH, USDA, the Robert Wood Johnson Foundation, and the University of California. During the past five years, she has focused on understanding and developing conversational AI and chatbots for persuasion and health promotion. She has collaborated with scholars from UCSF's School of Nursing to develop an AI chatbot for promoting physical activity, and is currently working on developing a chatbot to promote heart health awareness and knowledge among underserved minority women in the U.S.

This project directly aligns with research priorities for both the PI and Co-PI. Both PIs are committed to dedicating effort to this project, participating in regular work-in-progress sessions, and collaborating with the Health AI and AER teams on development and implementation of the AI algorithm. The PI, Dr. Paul, has the assurance of release time from their academic department.

                          References 

1. Zhang J, Oh YJ, Lange P, Yu Z, Fukuoka Y. Artificial Intelligence Chatbot Behavior Change Model for Designing Artificial Intelligence Chatbots to Promote Physical Activity and a Healthy Diet: Viewpoint. J Med Internet Res. 2020 Sep 30;22(9):e22845. doi: 10.2196/22845. PMID: 32996892; PMCID: PMC7557439.

2. Mutha S, Allen C, Welch M. Toward Culturally Competent Care: A Toolbox for Teaching Communication Strategies. San Francisco, CA: Center for Health Professions, University of California, San Francisco; 2002.

3. Lee YC, Yamashita N, Huang Y. Designing a chatbot as a mediator for promoting deep self-disclosure to a real mental health professional. Proceedings of the ACM on Human-Computer Interaction. 2020;4(CSCW1):1-27.

4. Branley-Bell D, Brown R, Coventry L, Sillence E. Chatbots for embarrassing and stigmatizing conditions: could chatbots encourage users to seek medical advice? Frontiers in Communication. 2023;8:1275127.

5. Liang KH, Shi W, Oh YJ, Wang HC, Zhang J, Yu Z. Dialoging resonance in human-chatbot conversation: how users perceive and reciprocate recommendation chatbot's self-disclosure strategy. Proceedings of the ACM on Human-Computer Interaction. 2024;8(CSCW1):1-28. doi: 10.1145/3653691.

                          Supporting Documents: 

                          AI-Augmented Lung Nodule Tracking

                          Proposal Status: 

                          The UCSF Problem:

                          Lung nodules are a common finding on CT and are often benign but rarely may represent malignancy. To monitor for potential malignancy, follow-up imaging is often recommended. If there is concern for malignancy, adequate follow-up allows earlier diagnosis and treatment, while lack of follow-up may lead to progression and worse prognosis.  

Lung nodules are found in two different ways: lung cancer screening CTs, which are CT scans to screen patients who are at high risk for lung cancer, and nodules incidentally found on other CT scans. At UCSF, we perform around 400 screening CTs a year, a number that is dwarfed by the number of incidentally found nodules. Approximately 65,000 chest CTs for other indications were performed in 2024, and population data indicate that around 31% of chest CT scans contain a nodule. At UCSF, around 4,000 CTs in 2024 had the word "nodule" or "follow-up" in their impression (1). While not all of these may represent lung nodules requiring follow-up, the number is likely in the thousands.

Here at UCSF, we have a lung nodule nurse coordinator who follows up nodules found on lung cancer screening CTs. These are tracked on an Excel spreadsheet. Unlike many peer institutions such as the VA and SFGH, we have no system-wide tracking program for incidentally found nodules. Concerning nodules must be tracked by the ordering physician or a patient's primary care doctor, which may be challenging without a centralized system for tracking these nodules and risks a patient not obtaining adequate follow-up.

                           

                          How might AI help?

Manually reviewing all 65,000 CT scans per year to find concerning nodules would be time-consuming and cost prohibitive. However, large language models could efficiently and cost-effectively parse through reports and extract the key information needed to determine the optimal follow-up time. This information can then be used to populate a dashboard, enabling coordinators to track suspicious nodules, identify if a patient is overdue for follow-up, and easily reach out to patients.

                           

                          How would an end-user find it and use it?

Our AI tool would be limited to users who are part of the pulmonary nodule program, including lung nodule coordinators and pulmonologists. We will directly coordinate with the relevant individuals so that they are aware of the AI-enabled dashboard and understand how to use it.

                           

                          Picture of the AI tool:

Using the Versa API, we will extract the information needed for clinical decision-making about lung nodules from radiology reports. This includes the number of nodules and the recommended follow-up time if suggested by the radiologist. For nodules 6 mm or greater, which are considered higher risk, we will extract the size of the nodule, its location, and characteristics such as ground-glass, solid, or subsolid, as these would change the recommended follow-up time per the guidelines. This data will be extracted as structured data using OpenAI's structured outputs and then used to build a dashboard. The dashboard will contain key patient information, including age, MRN, date of CT, the characteristics noted above, whether a repeat CT is scheduled and when, and whether a referral to a relevant subspecialty has been ordered, as shown in Figure 1. This will allow for easy filtering of which patients have been connected to care and which patients may benefit from outreach.
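The extraction step might look like the sketch below, assuming the Versa API can be reached through an OpenAI-compatible client that supports structured outputs; the base URL, model name, and schema fields are illustrative placeholders rather than the final build.

```python
# Sketch: structured extraction of nodule findings from a radiology report text.
# The client configuration and model name are placeholder assumptions.
from typing import List, Literal, Optional
from openai import OpenAI
from pydantic import BaseModel

class Nodule(BaseModel):
    size_mm: float
    location: str
    attenuation: Literal["solid", "subsolid", "ground-glass"]

class NoduleReport(BaseModel):
    nodule_count: int
    nodules_6mm_or_larger: List[Nodule]
    radiologist_recommended_followup: Optional[str]

client = OpenAI(base_url="https://versa-placeholder.ucsf.edu/v1", api_key="...")

def extract(report_text: str) -> NoduleReport:
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-placeholder",
        messages=[
            {"role": "system", "content":
             "Extract lung nodule findings from this radiology report. "
             "Only report nodules explicitly described by the radiologist."},
            {"role": "user", "content": report_text},
        ],
        response_format=NoduleReport,
    )
    return resp.choices[0].message.parsed
```

The parsed fields would then populate the dashboard rows alongside patient identifiers and scheduling data pulled from structured EHR tables.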

                           

                          Figure 1: Mock-up of dashboard 

                          What are the risks of AI errors?

                          There are two primary risks to introducing AI for lung nodule tracking:

                          1. A significant nodule requiring follow-up is not reported: The nodule would then not be included in the monitoring dashboard. However, CT scan results would be sent to the ordering user for follow-up. At present, incidental lung nodules are not tracked and so this workflow would be the same as the current state.
                          2. Inaccurate description of the nodule or its characteristics: It is possible that the model may describe a nodule when one does not exist or incorrectly state the characteristics of the nodule. The dashboard will include a copy of the original report to verify that the nodule exists and the characteristics of the lung nodule.

To mitigate these risks, we will perform robust testing prior to launch on retrospective data to assess accuracy and identify any characteristics that lead to a higher risk of hallucination.

Our preliminary testing on 200 chest CT reports from 2024 containing the word "nodule" shows 96.4% accuracy and a 1.5% hallucination rate. We achieved 95% sensitivity and 97.1% specificity in identifying nodules 6 mm or greater, a key cutoff point in the lung nodule guidelines and when estimating the risk of malignancy. Our precision was 93.5% and F1 score was 94.3%. Continuous performance monitoring will be performed to ensure persistently high accuracy and a low hallucination rate. A lung nodule coordinator will also have access to the full CT scan report and will serve as the human-in-the-loop to verify all characteristics.

                           

                          How will we measure success?

                          We will measure success by:

                          -       Sensitivity and specificity: We will assess the sensitivity and specificity of the model in detecting nodules requiring follow-up.

                          -       Percentage of patients who obtain recommended follow-up images: We will assess what percentage of patients with lung nodules obtain the recommended follow-up imaging within 3 months of the recommended time frame.

                          -       Frequency of outreach: We will track how often the lung nodule monitoring team reaches out to patients with lung nodules.

-       Qualitative feedback: We will obtain feedback and measure tool satisfaction, usability, helpfulness, and perception of accuracy from those involved in the lung nodule monitoring program.

-       Time to referral: We will measure the time from identification of the nodule to the next recommended step of care, including interventional pulmonary, oncology, thoracic surgery, and interventional radiology.

                          -       Equity and bias analysis

                           

Describe your qualifications and commitment:

Cat Blebea, MD: Dr. Blebea is a current pulmonary and critical care fellow and clinical informatics fellow. She frequently manages patients with pulmonary nodules in clinic and has first-hand experience with pulmonary nodule monitoring programs at the VA and SFGH. She also has experience using large language models to improve patient care as a member of the prompt engineering team for the Intelligent InBasket project, which uses large language models to create draft responses to patient messages.

                          Leo Liu, MD: Dr. Liu is an informaticist and hospitalist at UCSF. He serves as the physician lead for inpatient informatics at St. Mary’s and St. Francis, associate program director for the clinical informatics fellowship, and director of the GME Clinical Informatics, Data Science and Artificial Intelligence Pathway.

                          References:

                          1.         Gould MK, Tang T, Liu ILA, et al. Recent Trends in the Identification of Incidental Pulmonary Nodules. Am J Respir Crit Care Med. 2015;192(10):1208-1214. doi:10.1164/rccm.201505-0990OC

                           

                          Summary of Open Improvement Edits: 

                          Updated proposal with recent preliminary data findings. 

                          Supporting Documents: 

                          Proposal for an AI Triage Aide (ATriA) Tool to Improve Referral Processing in Neurology

                          Proposal Status: 

                          1. The UCSF Health Problem 

UCSF Neurology receives approximately 125 outpatient referrals daily, with average patient wait times for some subspecialties ranging from 6 to 9 months. Wait times for patients to see a neurologist at UCSF are significantly longer than the national average; a recent study found that the average wait time for an appointment is 34 days for Medicare participants (1). While there are many factors that explain this discrepancy, including geographic location, the specific neurological conditions managed, and healthcare system capacity, there is significant room for improvement in the current referral processing system.

To schedule a patient in the appropriate subspecialty clinic, the current pathway for external referrals relies on non-clinicians to manually review unstructured and often incomplete referral packets and then use the APeX scheduling decision trees to select the single most appropriate chief complaint input to ultimately schedule patients with the appropriate subspecialty clinic. Accommodating rapidly changing clinical/paraclinical criteria for acceptance of referrals in this rules-based system requires labor-intensive collaboration between clinicians and specialized programmer support.

These factors contribute to the scheduling of lower-priority referrals, improper utilization of subspecialty clinic slots, extended wait times for appropriate referrals, and increased no-show/cancellation rates, as patients opt to seek care elsewhere. Delays in diagnosis and management can lead to poor health outcomes, increased acute care utilization, and increased healthcare expenditures. Furthermore, mis-triage negatively impacts patient satisfaction, increases healthcare system burden, leads to staff and clinician burnout, and fosters moral injury.

                          2. How Might AI Help?  

The administrative (and clinical) staff require the proper tools to efficiently identify and interpret complex medical information and improve the precision and accuracy of referral routing. Large Language Models (LLMs) are increasingly being used for clinical decision support and administrative automation, leading to better workflow efficiency and improved access to specialized care. An AI-driven process can enhance efficiency in neurology referrals by:

                          • Extracting essential demographic and insurance information from referral documents  

                          • Identifying missing referral information  

                          • Summarizing relevant clinical information in a standardized format  

                          • Determining the primary chief complaint to guide appropriate routing   

                          • Supporting the existing decision tree for neurology clinic assignment  

                          • Identifying time-sensitive conditions requiring expedited care  

                          • Identifying cases appropriate for specialty clinics that span multiple divisions. 

                          Since many referrals do not have a clear clinical indication or may cite multiple clinical indications, an AI tool could also be employed to gather information outside referral letters to better prioritize and categorize referrals (2). AI-based solutions have helped to streamline referral processing and have shown promise in decreasing time to treatment for oncology patients and improving overall patient experience (3).   

                          We are proposing a staggered two-step process to integrate ATriA: 

                          1. Initial Enhancement: Versa API support that connects to department-specific databases (Excel) with the underlying logic for the APeX scheduling decision trees, allows for inclusion of clinical guidelines and provider preferences that could not be built into the decision trees, and uses optical character recognition (OCR) to convert scanned referral documents into machine-readable text for querying by staff. 

2. APeX Build: Use of an LLM that incorporates the aforementioned data (e.g., demographic, clinical, logistical) for decision support to categorize and prioritize referrals as well as recommend updates.

                          3. How Would an End-User Find and Use It?  

                          Versa API Enhancement  

Versa is now available to all staff who take the prerequisite AI training and request access. The individual would select the triaging assistant for their department (API, department-specific database), upload referral PDFs, and then provide queries related to that staff member's role.

APeX Build for ATriA

                          The ATriA tool would integrate directly into the existing APeX system, processing referral documents through an automated pipeline:  

                          • Natural Language Processing (NLP): Use UCSF’s Versa LLM to organize extracted text into structured sections  

                          • Named Entity Recognition (NER): Extracts clinically relevant data, including financial and insurance-related information 

                          • Standardization in Output: The AI tool will generate a templated report populated with relevant clinical and paraclinical information for review 

                          • Decision Support: The AI tool recommends the appropriate neurology subspecialty, assigns urgency, and flags incomplete referrals for human review.   

There will be an actionable button labeled "Process referral" near the top of the Appointment Request interface in APeX. There will be another actionable button labeled "Generate indication" under the "Indications" section of the Appointment Request interface in APeX.
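As an illustration of the NER step in the pipeline above, the sketch below uses a general-purpose spaCy model plus a rule-based matcher as stand-ins; a production version would substitute a clinical or biomedical NER model and referral-specific patterns, and the example text and pattern here are hypothetical.

```python
# Sketch: extracting entities and an insurance identifier from referral text.
# Requires: python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Hypothetical rule: capture identifiers written as "Member ID: AB123456".
matcher.add("INSURANCE_ID", [[{"LOWER": "member"}, {"LOWER": "id"},
                              {"IS_PUNCT": True, "OP": "?"}, {"IS_ASCII": True}]])

referral_text = ("Referral for 58-year-old with progressive right-hand tremor since 2023. "
                 "Insurance: Acme Health, Member ID: AB123456.")
doc = nlp(referral_text)

entities = [(ent.text, ent.label_) for ent in doc.ents]            # general NER output
insurance_ids = [doc[start:end].text for _, start, end in matcher(doc)]
print(entities)
print(insurance_ids)
```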

                          4. AI Tool Visualization 

                          Versa Enhancement Request 

[Screenshot mock-up]

                          ATriA Apex Build Request 

[Screenshot mock-up]

                          5. What Are the Risks of AI Errors?  

Limitations to the implementation of LLMs include concerns regarding accuracy, reliability, and potential bias (4). While AI can streamline referral processing, potential risks include misinterpretation of clinical data, AI hallucinations, bias in model training leading to potential underperformance in underrepresented populations or rare neurological disorders, and breaches in regulatory compliance, particularly with HIPAA compliance and data security. To mitigate risks, a human-in-the-loop approach will be used, ensuring patient coordinators review AI-generated summaries before finalizing referrals. Additionally, an initial validation phase will be conducted to assess AI accuracy and regulatory compliance before full implementation.

                          6. How Will We Measure Success?  

                          Initial validation and workflow optimization will focus on a cohort of internal referrals.  Subsequent efforts will focus on external referrals, which are often more fragmented in terms of referral information.  

Key Metrics: Collected in APeX

                          • Time to decline or accept referral  

                          • Time to new patient appointment scheduling: Reduction in average wait times.  

                          • Time to diagnosis and treatment: Faster access to specialized care.  

                          Key Metrics: Ideal 

                          • Referral triage accuracy: Concordance between AI-generated classifications and standard clinical workflows. If the ATriA tool is unable to achieve at least 0.8 concordance with standard clinical workflows, then we will reassess feasibility of this tool implementation.  

                          • Health professional satisfaction: Surveys (e.g., modified System Usability Scale (5)) for patient coordinators, nurses, and physicians  

                          • Patient satisfaction: Surveys assessing perceived efficiency and experience.  

                          • Operational efficiency: Reduction in time spent by coordinators on referral triaging.  

                          • Cost savings: Lower administrative costs and improved resource allocation.  

                          • Role evolution: Shift in patient coordinator duties towards patient engagement rather than manual triaging.  

                          • Other key metrics will be added as we solicit constructive feedback from our stakeholders.

This AI-driven approach represents a hybrid model between rule-based and machine-learning methodologies, offering a scalable and sustainable solution to the challenges of neurology referral processing. We anticipate that, with optimization, these methodologies can be utilized in non-neurological specialties.

                          7. Describe Your Qualifications and Commitment  

                               The project will be spearheaded by UCSF General Neurology Division Technology and Division Chiefs (Pierre Martin, Maggie Waung) in collaboration with the UCSF Clinical Administrative Director (Mark Datuin). We request salary support for the academic co-leads, Pierre Martin and Maggie Waung.   

                               Pierre Martin is an outpatient general neurologist with SmartUser certification for Epic and has an academic focus on educational technology.  He helps to develop SmartTools and other resources for colleagues to improve clinical efficiency.  He recently secured Innovations Funding for Education to develop an immersive mobile application for medical trainees to learn clinical neuroanatomy via an interactive 3D model.  

As the General Neurology Division Chief, Maggie Waung has been intimately involved in clinical operations over the past 3 years. She assisted with development of the clinical decision tree and has worked closely with the Ambulatory Clinical Informatics lead, Katie Grouse, on optimizing patient triage for General Neurology over the past year. She also works closely with all stakeholders who might benefit from this project, including clinical providers (MDs and APPs), LVNs, RNs, patient coordinators, clinic managers, other Neurology Division Chiefs, and the Neurology Vice Chair for Clinical Affairs, John Engstrom.

                               We will plan on weekly meetings and incorporate time for work-in-progress sessions with the Health AI and AER teams. We plan to provide monthly updates to the General Neurology Division and Clinical Division Chief meetings to solicit feedback as needed.   

                          Headache Evaluation and Diagnosis - with Generative Artificial INtelligence (HEAD-GAIN): Improving Access

                          Proposal Status: 

                          Section 1: The UCSF Health Problem 

Headache disorders affect a wide swath of the population, ranking as the third highest cause of disability-adjusted life years worldwide (1), and often impact people during their peak productive years, exacting a significant financial toll of upwards of 20 billion dollars annually (2,3). Accurate and rapid diagnosis of secondary headaches is imperative to prevent neurological morbidity. Moreover, early identification and treatment of primary headache disorders improves outcomes by preventing headaches from progressing into debilitating, chronic conditions (4). A critical decision point in the accurate diagnosis of headaches is whether brain imaging is needed. If every person with headache received a brain MRI, this would place unnecessary strain on the health system (5). However, not obtaining an MRI in a patient with secondary headache can be devastating.

While neurologists are well-equipped to evaluate and diagnose headache disorders, primary and acute care providers are usually the first line of care for patients with headache (4,6). Quality and depth of training for first-line providers on the management of headache disorders is highly variable (7-11). Furthermore, a shortage of neurologists results in limited access to specialty headache care (12). The UCSF General Neurology division receives over 12,000 referrals annually, and on average, 22-25% of these referrals are headache-related. Of these headache referrals, 1 in 6 are secondary headaches, whose diagnosis may be delayed due to long referral wait times. Based on these data, approximately 3,000 patients with headaches currently stand to benefit from implementation of the HEAD-GAIN tool each year. Furthermore, validation of the HEAD-GAIN tool could inspire the development of similar technologies in other neurological or medical subspecialties; for instance, our team is currently working on an analogous tool for patients with neurodegenerative disease.

                           

                          Section 2: How AI Addresses the Problem  

LLMs like OpenAI's generative pre-trained transformers (e.g., ChatGPT) are being increasingly studied and implemented throughout medicine, from virtual assistants to clinical decision support (13–16). Headache classification is a common application of generative AI, with pre-visit assessments (PVAs) serving as a source of rich phenotypic data (17). AI systems can be used to differentiate secondary headaches, with one machine learning-based prediction model demonstrating an accuracy of 0.74 (17,18). Some research has already suggested that application of LLMs to headache classification and diagnostic work-up may help to reduce overcrowding in emergency departments and allow providers to improve patient triaging (19).

Despite the promise of applying LLMs to patient care, data on the diagnostic accuracy of generative AI compared to physicians is mixed (20–28). Additionally, generative AI runs the risk of providing biased diagnostic suggestions impacting clinical care (27), especially when trained with a non-diverse and uniform patient population (28). Given these concerns, research into how to appropriately implement these systems into clinical workflows (29,30), including for headache diagnosis and management, remains critical.

                          There is great need for a scalable AI tool that 1) helps distinguish primary from secondary headaches, 2) supports physician medical decision-making, and 3) facilitates efficient patient care delivery. Through this study, we hope to develop and validate a generative AI-based tool that can be applied to the diverse patient population at UCSF who need headache care. 

To help us more accurately and efficiently triage and diagnose headache patients, we propose to validate and implement a diagnostic and management tool using a large language model (LLM) coupled with a Qualtrics-based pre-visit assessment (PVA). This tool is intended to identify patients at risk of secondary headaches so they can be scheduled sooner and to recommend first-line treatments for primary headaches to referring providers.

For this study, participants will complete the Qualtrics-based PVA once. Versa, a HIPAA-compliant LLM, will review the PVA and output a high-level summary, likely diagnosis, and imaging recommendation. Each participant will then see a neurologist who will perform a detailed clinical history and neurologic physical examination. If the AI tool, designated as Headache Evaluation And Diagnosis-with Generative Artificial INtelligence (HEAD-GAIN), demonstrates high efficacy and usability, we will pilot this program on APeX, where we can first apply it to all headache referrals to UCSF Neurology, with the ultimate goal of making this tool accessible to primary care doctors.

                           

                          Section 3: End-User Workflow  

                          The HEAD-GAIN tool is intended to expedite in-person encounters with headache patients. After validation in the UCSF General Neurology Clinics, this tool may be deployed in primary care settings, where physicians have less specialized training in the diagnosis and management of headache disorders. 

We envision the end-user workflow as follows: APeX automatically scans the referral request for keywords such as "head pain," "headache," or "migraine." If detected, a button becomes visible on the patient's chart. The patient coordinator will click the button to send the patient an automated MyChart message and SMS text message via CipherHealth containing instructions to complete the PVA. The patient clicks on the link in the text or MyChart message, which prompts them to decide whether to consent to engagement with the HEAD-GAIN tool. If the patient consents, they are directed to the Qualtrics-based PVA. The PVA responses are saved to a secure, HIPAA-compliant server. A Python notebook automates the remaining steps: it feeds the PVA responses into UCSF Versa with an engineered prompt and requests a high-level summary, the most likely diagnosis, an imaging recommendation, and a categorization of the headache as primary or secondary. If a secondary headache is likely, the referral will be marked "urgent" and expedited so the patient is seen by a neurologist within 5-7 business days. If the AI-determined diagnosis is a primary headache condition, a curated list of potential first-line treatments for the identified condition will be provided back to the referring provider while the referral is pending neurology evaluation.
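A minimal sketch of this pipeline is shown below, under stated assumptions: the keyword list comes from the workflow description, while `call_versa` is a hypothetical stand-in for the secure Versa endpoint and the field names are illustrative, not the actual build.

```python
# Sketch of the HEAD-GAIN workflow: keyword scan, Versa prompt, urgency routing.
import json

HEADACHE_KEYWORDS = ("head pain", "headache", "migraine")

def referral_mentions_headache(referral_text: str) -> bool:
    """Keyword scan over the referral request text."""
    text = referral_text.lower()
    return any(kw in text for kw in HEADACHE_KEYWORDS)

def call_versa(prompt: str) -> str:
    """Hypothetical wrapper around the HIPAA-compliant Versa endpoint (not a real API)."""
    raise NotImplementedError

def triage_pva(pva_responses: dict) -> dict:
    """Feed PVA responses to Versa and route based on primary vs secondary headache."""
    prompt = (
        "From the pre-visit assessment below, return JSON with keys: summary, "
        "likely_diagnosis, headache_type ('primary' or 'secondary'), "
        "imaging_recommendation.\n"
        f"PVA responses: {json.dumps(pva_responses)}"
    )
    result = json.loads(call_versa(prompt))
    # Secondary headaches are expedited; primary headaches get first-line suggestions.
    result["urgent"] = result["headache_type"] == "secondary"
    return result
```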

                          To improve patient access and survey completion, future iterations of this tool will incorporate the use of OpenAI's Whisper, which is an AI/machine learning model for speech recognition and transcription, to deploy the PVA. 

                          Section 4: Image of AI Tool 

                           

[Screenshot placeholder]

                           

                          Section 5: Possible Sources of Error 

                          The major concern associated with the HEAD-GAIN tool is the diagnostic error rate for secondary headache. Furthermore, inaccurate diagnosis of primary headaches could also lead to erroneous recommendations for headache treatment.  

To mitigate these risks, a study team member will review discrepancies between Versa-determined headache diagnoses and the gold standard of neurologist headache diagnosis. Discrepancies will be reviewed individually and analyzed by prompting Versa to elucidate its clinical chain of thought. We will then modify the PVA to try to improve the performance of the HEAD-GAIN tool. We will also conduct prompt engineering to optimize the accuracy of data output by UCSF Versa. The concordance rate for the diagnosis of primary versus secondary headache will be calculated, with the aim of achieving a concordance rate of 0.85 after modifications to the PVA and prompt engineering.
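The sketch below illustrates one way this discrepancy review and concordance check might be scripted. It is a sketch under stated assumptions: `call_versa` is the same hypothetical stand-in for the secure Versa endpoint, and the case fields are illustrative rather than the actual data model.

```python
# Sketch: re-prompt Versa on discrepant cases and compute the concordance rate.

def call_versa(prompt: str) -> str:
    """Hypothetical wrapper for UCSF Versa; not a real API."""
    raise NotImplementedError

def review_discrepancies(cases: list[dict]) -> list[dict]:
    """For cases where Versa and the neurologist disagree, ask Versa to
    explain its reasoning so the PVA and prompt can be revised."""
    discrepant = [c for c in cases if c["versa_dx"] != c["neurologist_dx"]]
    for case in discrepant:
        prompt = (
            f"Your diagnosis was '{case['versa_dx']}' but the neurologist "
            f"diagnosed '{case['neurologist_dx']}'. Using the PVA responses "
            "below, explain your clinical chain of thought step by step.\n"
            f"PVA responses: {case['pva_responses']}"
        )
        case["versa_rationale"] = call_versa(prompt)
    return discrepant

def primary_vs_secondary_concordance(cases: list[dict]) -> float:
    """Agreement on primary vs secondary classification; target >= 0.85."""
    agree = sum(c["versa_type"] == c["neurologist_type"] for c in cases)
    return agree / len(cases)
```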

                          Optimization of the PVA and iterative prompt engineering will be combined to increase the sensitivity for HEAD-GAIN to identify secondary headaches before implementation into clinical workflows.

                          Section 6: Metrics of Success 

                          Collected in APeX 

• Validation stage: Number of referrals identified and selected for HEAD-GAIN intervention; wait times for headache consultations from referral date; utilization of MRI brain imaging before and after headache referral; concordance between AI diagnosis and management and neurologist diagnosis and management.

• Implementation stage: Time to diagnosis of secondary headache starting from referral date; patient and referring provider satisfaction pre- and post-intervention; time to initiation of first primary prevention or abortive medication; proportion of providers that use the HEAD-GAIN tool.

                          Other Measurements 

                          • Validation stage: Concordance of headache diagnosis and imaging recommendation by Versa compared to consultant neurologist. 

• Implementation stage: Acceptability and usability of the HEAD-GAIN tool for referring providers and consulting neurologists.

                          Section 7: Qualifications   

The project will be led by UCSF General Neurology Division's Technology Chief Pierre Martin. He is an outpatient general neurologist with SmartUser certification for Epic and has an academic focus that involves the design and development of educational technology. He recently secured Innovations Funding for Education to develop an immersive mobile application for medical trainees to learn clinical neuroanatomy via an interactive 3D model. The HEAD-GAIN project team includes two UCSF neurologists, an external neurologist collaborator from Emory University, an AI expert from the UCSF Memory and Aging Center, a UCSF medical student, and a graduate programming student from UCSC. We currently maintain biweekly meetings and plan to incorporate time for work-in-progress sessions with the Health AI and AER teams.

Enhancing Orthodontic Care Through Automated Reminders for Radiographs and Cleanings

                          Primary Author: Mona Bajestan
                          Proposal Status: 

                          1. The UCSF Health Problem
Orthodontic treatment requires regular radiographic imaging to monitor tooth movement, ensure proper bracket positioning, and evaluate the risk of root resorption [1]. However, because orthodontic treatment is long (24 months on average), it is not uncommon for orthodontists to occasionally overlook the timing of radiographs, potentially leading to delayed treatment adjustments and prolonged overall treatment duration. Additionally, orthodontic patients should receive dental checkups and cleanings every six months, or more frequently if necessary, to maintain oral health and prevent issues such as interproximal caries. Unfortunately, orthodontists may overlook these routine reminders, increasing the risk of caries development during treatment [2,3].
                          A solution is needed to automatically track patient radiographic and dental cleaning records, issuing timely reminders to ensure adherence to these critical steps in the orthodontic workflow. Previous attempts at addressing similar issues include manual tracking and reminders from administrative staff, but these approaches are prone to human error. The intended end-users of this solution include orthodontists, dental assistants, and administrative staff who manage patient records and appointments.
                          2. How Might AI Help?
                          AI can assist in monitoring and analyzing patient records by identifying patterns and issuing timely reminders. The AI system would utilize patient electronic health records, including past radiographs and dental cleaning history, to track due dates for necessary imaging and oral hygiene appointments.
                          The AI model would produce automated reminders when a patient is due for a radiograph or dental cleaning, ensuring timely intervention. It would help solve the problem by reducing the likelihood of missed imaging or hygiene appointments, thereby improving treatment efficiency and preventing oral health complications.

                          3. How Would an End-User Find and Use It?
The AI support would be integrated into the existing orthodontic patient management system (APeX), functioning within the standard workflow. When a patient's record is accessed, the system would check whether an X-ray or cleaning is due based on predefined intervals. If due, a pop-up notification would alert the orthodontist or staff, prompting immediate action.
                          End-users would see these notifications as part of the patient’s digital chart. The recommendations would be explained through a brief summary indicating the last recorded radiograph or cleaning date and the recommended next steps. Users can then schedule the necessary appointments directly from the interface. Additionally, the system may allow customization of reminder frequency based on patient-specific needs. This AI-driven automation would enhance patient care by ensuring timely interventions and reducing reliance on manual tracking.
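The due-date logic described above is essentially an interval check against the last recorded radiograph and cleaning dates. A minimal sketch follows; the six-month intervals and field names are illustrative assumptions, not UCSF policy.

```python
# Sketch: interval-based reminder check triggered when a chart is opened.
from datetime import date, timedelta
from typing import Optional

RADIOGRAPH_INTERVAL = timedelta(days=180)   # assumed progress-radiograph interval
CLEANING_INTERVAL = timedelta(days=180)     # routine cleaning every ~6 months

def due_items(last_radiograph: date, last_cleaning: date,
              today: Optional[date] = None) -> list[str]:
    """Return which reminders should fire for this patient."""
    today = today or date.today()
    reminders = []
    if today - last_radiograph >= RADIOGRAPH_INTERVAL:
        reminders.append("Progress radiograph due")
    if today - last_cleaning >= CLEANING_INTERVAL:
        reminders.append("Dental cleaning due")
    return reminders

# Toy example: radiograph overdue, cleaning not yet due
print(due_items(date(2024, 9, 1), date(2025, 2, 10), today=date(2025, 4, 1)))
```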
                                

5. What are the risks of AI errors?

                          • False Positive Errors: The system may incorrectly remind a patient as needing an X-ray or dental cleaning when they are not actually due. This could lead to unnecessary imaging or redundant appointments, increasing patient costs and exposure to radiation.
                          • Mitigation: Upon receiving a reminder, the orthodontist needs to determine whether an X-ray should be taken or if the patient should be referred for a dental cleaning. Additionally, if the patient’s general dentist does not use the APeX system and the cleaning was performed at another clinic but is not recorded in the system, the orthodontist can select the “refuse with a comment” option and manually enter the cleaning date.
                          • False Negative Errors: The AI may fail to flag a patient who is due for an X-ray or dental cleaning, leading to missed appointments and potential treatment delays or oral health complications.
                          • Mitigation: Regular audits of AI predictions against actual clinical schedules and patient histories can help identify and correct systematic under-reporting.
                          • Hallucination Errors: Generative AI components, if used, may provide incorrect or misleading recommendations based on incomplete or misinterpreted data.
                          • Mitigation: Restrict AI-generated outputs to structured data analysis rather than free-form text generation, and ensure recommendations are based on verified clinical guidelines.
                          • To measure and mitigate these risks, continuous monitoring of AI performance metrics—such as accuracy, recall, and precision—will be necessary. Additionally, user feedback from orthodontists and dental staff should be collected to refine AI predictions and minimize errors over time.

6. How will we measure success?
                          To determine whether the AI system is being effectively used and is achieving its intended goals, we will measure outcomes in two categories: data already being collected in APeX and ideal supplementary measurements.

                            a. A list of measurements using data that is already being collected in APeX

                          • Reminder Utilization Rate: Track how often the AI-generated reminders are triggered and viewed within the system.
                          • Follow-up Compliance: Measure the percentage of patients who complete dental cleanings and radiographs within recommended intervals.
                          • Caries Incidence: Compare caries incidence rates before and after orthodontic treatment among patients with and without AI-supported reminders (if general dentists also use APeX).
                          • Treatment Plan Audits: Evaluate completed treatment plans for documentation of oral hygiene checkups and imaging completion.
                          • Time-to-Treatment Adjustment: Assess whether timely radiographic imaging correlates with more prompt bracket repositioning and treatment plan modifications.

                            b. A list of other measurements you might ideally have to evaluate success of the AI

                          • User Feedback and Satisfaction: Surveys and interviews with orthodontists and staff about AI usability, helpfulness, and alert fatigue.
                          • Clinical Outcome Improvement: Reduction in the number of delayed treatments or undiagnosed caries due to missed cleanings or imaging.
                          • Reduction in Manual Tracking: Quantify decrease in staff workload related to manual scheduling or tracking of appointments.
                          • Patient Satisfaction Scores: Assess patient perception of care quality, particularly regarding timely follow-ups and preventive care.
                          • False Alert Rate: Monitor and categorize false positives and negatives to continuously refine AI accuracy.
                          • To convince UCSF Health leadership to continue supporting the AI, we would need to demonstrate improved patient care outcomes (e.g., reduced caries, faster treatment completion), higher adherence to recommended dental care timelines, and increased provider satisfaction. If the AI results in low adoption, excessive false alerts, or no measurable improvement in clinical outcomes, it may indicate a need to reevaluate or discontinue implementation.
                          7.  Qualifications and commitment:
                          Dr. Mona Bajestan, DDS, MS, is an associate clinical professor in the Division of Orthodontics at UCSF, with extensive expertise in orthodontic treatment and oral health science. As a diplomate of the American Board of Orthodontics and an active participant in national professional organizations, she remains engaged with the latest advancements in orthodontic research and clinical practice.
                          With her background in both clinical care and academic research, Dr. Bajestan is well-positioned to contribute to the development and implementation of this AI-driven solution. She is committed to dedicating effort to this project over the coming year, including actively participating in regular work-in-progress sessions, collaborating with the Health AI and AER teams, and ensuring that the AI algorithm aligns with clinical needs. Her role as a faculty member at UCSF and her involvement in orthodontic education and patient care will facilitate the integration of this tool into real-world clinical workflows, ultimately improving patient outcomes and operational efficiency.
                          References:
                          1.     Heboyan A, Avetisyan A, Karobari MI, Marya A, Khurshid Z, Rokaya D, Zafar MS, Fernandes GVO. Tooth root resorption: A review. Sci Prog. 2022 Jul-Sep;105(3):368504221109217. doi: 10.1177/00368504221109217
                          2.     Liu Q, Song Z. Incidence, severity, and risk factors for white spot lesions in adolescent patients treated with clear aligners. Orthod Craniofac Res. 2024 Oct;27(5):704-713. doi: 10.1111/ocr.12791
                          3.      Walsh LJ, Healey DL. Prevention and caries risk management in teenage and orthodontic patients. Aust Dent J. 2019 Jun;64 Suppl 1:S37-S45. doi: 10.1111/adj.12671
                          Supporting Documents: 

                          Using LLMs to identify opportunities to improve diagnostic processes and reduce patient harms

                          Proposal Status: 

                          1.  The UCSF Health problem.   

                          Diagnostic errors are common and harmful. Previous work from our group suggests a missed or delayed diagnosis takes place in 25% of patients who die or who are transferred to the ICU; a diagnostic error is the direct cause of death in as many as 8% of deaths.

Our research team, as part of a multicenter study called Achieving Diagnostic Excellence through Prevention and Teamwork (ADEPT) led by UCSF, has been prospectively gathering information about delayed or missed diagnoses at UCSF Health since 2023, focusing this time on RRT calls, ICU transfers, and deaths on the Medicine service. ADEPT randomly samples 10 cases per month, using a two-physician adjudication approach to develop gold-standard case reviews that yield not only an assessment of whether an error took place but also the harms related to the error and any underlying diagnostic process faults.

Our current data suggest a significant opportunity for improvement: 12% of patients who died, had an ICU transfer, or had an RRT call experienced a diagnostic error. Harms were substantial, with 17% of errors thought to be a cause of death and 40% producing temporary or permanent harm (such as the need for additional monitoring or testing, longer length of stay, or additional therapies). The most common diagnostic process faults were problems with assessment (e.g., anchoring on a diagnosis or failing to recognize a stronger alternative) and testing problems (e.g., not choosing the right test in a timely fashion). If we were to extrapolate the results from ADEPT to all 17,000 admissions to UCSF Health, these numbers would represent more than 500 errors per year, of which 83 would be a likely cause of death. Even when not causing death, 200 patients would suffer longer hospital stays and additional, potentially unnecessary treatments.

                          2. How might AI help? 

The field of patient safety in general, but diagnostic errors in particular, suffers from an inability of health systems to detect errors and opportunities for improvement at scale. Uniquely, diagnostic errors are extraordinarily hard to detect using administrative data, and determining underlying causes with any accuracy requires chart reviews such as those our team carries out. AI tools, and LLMs in particular, are a potentially groundbreaking approach to solving both problems.

Our current chart-based approach, while appropriate for research purposes as part of our AHRQ-funded ADEPT work, is time-consuming, taking up to 45 minutes per case to finalize. A lighter-weight approach could be folded into existing case review programs (such as M&Ms) but would still need substantial support in determining underlying causes and developing an actionable set of priorities for leaders in Safety, or for providers seeking to improve performance.

Preliminary data: We have recently begun work applying Llama-3 and GPT-4o LLMs with prompts derived by the ADEPT diagnostic process review team (Figure). Prototype ADEPT Diagnostic Process Reviews (ADPRs) utilize Llama-3 and GPT-4o models (running in a secure AWS cloud environment) against a Clarity-derived data model composed of all notes, results, orders, and vital signs (including intake/output) for the hospital stay, concatenated. Preliminary LLM-ADPRs are producing results that replicate our case reviews but are fairly unsophisticated; we are undertaking work now to improve clinical utility and to determine the correlation between our reviewers' assessments and those of the LLM-ADPR.
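The sketch below illustrates the shape of this prototype flow under stated assumptions: the hospital-stay data model is concatenated into a single context string and passed, with a review prompt, to the model. `run_llm` is a hypothetical stand-in for the secure AWS-hosted Llama-3 / GPT-4o deployment, and the prompt wording and section names are illustrative.

```python
# Sketch of an LLM-ADPR prototype: concatenate the stay, apply a review prompt.

ADPR_PROMPT = (
    "You are reviewing a hospital stay for diagnostic process faults. Identify: "
    "(1) whether a diagnostic error likely occurred, (2) any harm attributable "
    "to it, and (3) the process faults involved (assessment, testing, history, "
    "follow-up). Cite the supporting notes or results."
)

def build_context(encounter: dict) -> str:
    """Concatenate notes, results, orders, and vital signs for the stay."""
    sections = [
        "NOTES:\n" + "\n".join(encounter["notes"]),
        "RESULTS:\n" + "\n".join(encounter["results"]),
        "ORDERS:\n" + "\n".join(encounter["orders"]),
        "VITALS (incl. intake/output):\n" + "\n".join(encounter["vitals"]),
    ]
    return "\n\n".join(sections)

def run_llm(system_prompt: str, context: str) -> str:
    """Hypothetical call into the secure AWS-hosted model; not a real API."""
    raise NotImplementedError

def generate_adpr(encounter: dict) -> str:
    """Produce a diagnostic-process-focused case summary for one encounter."""
    return run_llm(ADPR_PROMPT, build_context(encounter))
```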

Project plan: We propose to leverage the expertise of our case review teams, a large sample of more than 300 'gold standard' cases, and pilot work currently underway using Llama-3.1 and GPT-4o models with EHR data gathered as part of ADEPT, to create an AI-powered diagnostic excellence learning cycle targeting Medicine patients.

By using ADEPT's existing case review process as a starting place to engineer prompts that replicate clinical adjudication, and by monitoring LLM-ADPR agreement with expert and user assessments, we will be able to generate diagnostic process-focused case summaries that replicate the evidence-based framework used in our research studies at a much lower time cost. These summaries can then be used for at least three purposes:

                          1) As part of an enhanced UCSF mortality review process, with the LLM summary added to existing reviews as a way to identify opportunities for physician and system performance improvement,

                          2) As part of an automated diagnosis cross-check/diagnostic time out provided electronically to clinicians after an RRT takes place.

                          3) As part of an automated self-assessment and feedback system employed after an ICU transfer, or patient death has taken place. This summary may be presented in individual cases, or as a summary of all events which took place during a clinical block.

As a first step, we will map our current data model onto HIPAC data sources and then link HIPAC EHR data to data from our chart reviews to create our validation dataset (we anticipate 375-400 cases being available). Using these data, we will iteratively test prompts against selected encounters while also identifying methods (e.g., RAG) for dealing with longer hospital stays. Once we achieve a high level of clinical reliability and accuracy in the derivation set, we will proceed to deploying the enhanced Mortality Review, diagnostic cross-check, and self-assessment pilots focused on Medicine patients, leveraging the existing partnerships between our ADEPT team and UCSF Health leadership. Emblematic of this partnership, our ADEPT review group is now a subgroup of the UCSF Patient Safety Committee, and ADEPT cases with harmful errors are submitted as Incident Reports.
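One way the retrieval (RAG) approach mentioned above might handle long stays is sketched below: chunk the encounter text, score each chunk against the diagnostic-review query, and keep only the most relevant chunks in the prompt. This is a minimal sketch using TF-IDF similarity as a stand-in; a production version would more likely use a dedicated embedding model, and the chunk size is an illustrative assumption.

```python
# Sketch: keep only the most relevant chunks of a long hospital stay.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, size: int = 2000) -> list[str]:
    """Split the concatenated encounter text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(encounter_text: str, query: str, k: int = 8) -> list[str]:
    """Return the k chunks most similar to the diagnostic-review query."""
    chunks = chunk(encounter_text)
    vec = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
    ranked = sorted(zip(sims, chunks), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:k]]
```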

                          3. How would an end-user find and use it?  

We do not plan to embed these case summaries in the medical record but would generate ADEPT-LLM diagnostic process summaries as part of existing case review processes (Option 1, above), deliver them linked to a REDCap survey within 12 hours of an RRT call (Option 2, above), or deliver them via a REDCap survey within 1 week of an ICU transfer or death (Option 3). For each use case, the LLM-ADPR will provide a framework for reconsidering the case and diagnostic process, along with survey questions that will permit the end-user to agree or disagree with the summary and offer clarifications.

                          4.  Embed a picture of what the AI tool might look like

A preliminary version of the LLM-ADPR summary is provided; the LLM-ADPR would be accompanied by a REDCap survey (or embedded in the survey itself). A key step in our pilot/validation work will be to increase the usability of our LLM-ADPR output in addition to its clinical accuracy.

                          5. What are the risks of AI errors?    

Given that our AI output will not be used to actively direct clinical care, the risks of hallucinations (e.g., false positive results) or unclear/imprecise results affecting patient care are nearly zero. However, it is possible that these same issues could produce a higher work burden for patient safety reviewers or end recipients.

We do not envision this case review summary being represented in APeX; it will be presented to providers and Patient Safety review staff separately from clinical workflows. This design increases the safety of our program, increases feasibility (since compute power for entire encounters is limited, particularly for real-time computations), and maps toward a future state where diagnostic cross-check and case feedback tools can be inserted into workflows or clinical scenarios where context length, clinical need, and patient factors increase overall effectiveness.

                          6. How will we measure success?   

                          While the ultimate goal of this program is to reduce diagnostic errors, harms of diagnostic errors, and in turn mortality, ICU transfers, and RRT calls, it is unlikely we will show an effect on these outcomes in a year.

For this study period, we will measure agreement between LLM-ADPR documents and our recipients (e.g., patient safety staff, clinicians whose patients have a trigger event) in terms of their agreement with the identified opportunities for diagnostic improvement, as well as the usability of the LLM-ADPR. In the 10-12 cases/month reviewed by ADEPT case reviewers (ADEPT will run through September 2026), we will be able to calculate agreement statistics compared to gold-standard adjudications. Finally, we will estimate the work-hours saved for patient safety staff and clinical reviewers, time which can eventually be better used to address care gaps rather than gathering chart review data.
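As an illustration, agreement against the gold-standard adjudications could be summarized with a raw agreement proportion plus a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal example with made-up binary error/no-error labels; the choice of kappa (rather than another statistic) is an assumption.

```python
# Sketch: agreement between LLM-ADPR determinations and gold-standard adjudications.
from sklearn.metrics import cohen_kappa_score

gold = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # two-physician adjudications (toy data)
llm  = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]   # LLM-ADPR determinations (toy data)

raw_agreement = sum(g == m for g, m in zip(gold, llm)) / len(gold)
kappa = cohen_kappa_score(gold, llm)
print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```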

                           

                          7. Describe your qualifications and commitment:

                          This program will be led by Andrew Auerbach MD MPH, PI of ADEPT and a leader of several informatics and implementation science-focused programs at UCSF, as well as our existing ADEPT Project Coordinator Tiffany Lee BA.  Our team includes our current ADEPT review team (Drs. Molly Kantor, Peter Barish, Armond Esmaili), as well as informatics and LLM experts (Drs. Charumathi Subramanian and Madhumita Sushil);  Dr. Sushil led development of the LLM-ADPR shown in Figure 2. Finally, our program has strong endorsement and support of Chief Quality Officer Dr. Amy Lu.

                          Supporting Documents: 

Creation and Implementation of AI-enabled Survivorship Care Plans for Adult Cancer Survivors

                          Primary Author: Niharika Dixit
                          Proposal Status: 

The UCSF Health problem: Cancer survivors at UCSF Health have unmet needs that significantly impact their health-related quality of life (HRQOL). These include managing ongoing side effects of treatment, surveillance for cancer recurrence, informational needs, health promotion, and coordination of care. A Survivorship Care Plan (SCP) is a document completed at the end of curative cancer treatment, when the patient is transitioning to survivorship care. SCPs are a patient activation tool that provides patients with important information about the details of their cancer care, follow-up plan, delayed and long-term side effects, and health maintenance. SCPs are considered an important component of comprehensive survivorship care by the NCI(1) and the Commission on Cancer (CoC), which accredit cancer programs. Furthermore, SCPs can also serve as a communication tool between primary care and oncology(2). However, completion of SCPs by healthcare providers has been low, despite the fact that cancer survivors and primary care physicians (PCPs) consider the SCP an important tool for addressing the informational needs of cancer survivors. At UCSF Health, most patients do not receive SCPs. The major barriers to SCP implementation are: 1. creation of SCPs is time-consuming for the healthcare provider; 2. these complex documents require manual entry of many details, leading to possible inaccuracies; and 3. SCPs are static documents and cannot be easily tailored to cancer types or updated as guidelines advance with cancer care(3). Current literature suggests that it takes approximately 40 minutes to create a survivorship care plan, which leaves very little time for discussion of the actual plan in a clinic encounter(4). The discussion of the SCP between patient and provider is the most important part of the SCP process and is unfortunately often shortchanged in SCP delivery(5).


How might AI help?  AI/LLMs are increasingly being used in many aspects of health care, including medical documentation, radiology and pathology diagnostic reporting, patient communications, and decision support. We expect that the time-consuming process of SCP/treatment history generation could be taken over by AI, which would allow clinicians to spend more time with patients discussing the SCP. We can optimize Epic tools to automate patient information such as demographics, health care team members, diagnosis and staging, and treatment received. We would utilize AI to provide multiple patient-friendly, personalized components of the SCP. These components would include: 1. long-term side effects of the specific treatments received and of the cancer diagnosis itself (e.g., cardiotoxicity, neuropathy, lymphedema, fear of recurrence); 2. UCSF resources relevant to the patient's diagnosis and demographics, including websites and phone numbers; 3. current NCCN surveillance guidelines based on the patient's diagnosis; 4. risk reduction strategies and resources based on the patient's diagnoses; and 5. USPSTF age-appropriate healthcare maintenance screening recommendations based on the patient's demographics, such as age and gender. We expect that this will include information from reputable organizations such as the American Cancer Society, NIH, NCCN, USPSTF, and local UCSF resources. We expect that AI may be able to create these plans in five to ten minutes instead of 40, allowing the encounter to focus on discussion between patient and provider.
                          How would an end-user find and use it?
The SCPs will be created by survivorship/oncology healthcare providers who will be trained in the AI-enabled process. The SCP will be delivered in the clinic and will be available to patients via MyChart. We anticipate the SCP will be accessible to all providers in the Survivorship section of the oncology snapshot page and attached to the relevant problem in the Problem List. AI will allow these plans to be dynamic and easily updated with the most current resources and guidelines. AI support will be most impactful at the time of SCP creation, which occurs at the transition to survivorship care after completion of treatment for patients treated with curative intent. We expect two sets of end users. The first will be patients, who will be sent the SCP via MyChart; we aim to empower patients to manage their surveillance schedule and health care maintenance and to take steps to optimize their wellbeing. The other end users will be the PCPs, who will be sent the SCP via MyChart or fax and will use the information to provide tailored, appropriate primary care to cancer survivors. Finally, we hope that the AI-enabled process will allow us to easily translate these plans into multiple languages, allowing more patients to benefit.

                          Embed a picture of what the AI tool might look like.   

We anticipate two distinct steps in the creation and implementation of the tool. Step 1 is knowledge base creation: we will create a knowledge base to supplement existing data in Epic/APeX, including key guidelines for cancer survivorship, relevant medical literature, and a chemotherapy database covering expected side effects and their management(6). Based on the current literature, we will divide this knowledge into the domains of survivorship care: 1. surveillance, 2. management of side effects of chemotherapy, 3. health maintenance, 4. management of comorbidities, and 5. coordination of care. Step 2 is creation and implementation of the tool: once a clinician identifies a patient for SCP creation, the APeX tool will prepopulate the SCP template with details of treatments and side effects from the problem list, and the AI-enabled SCP will be created using the APeX data and the knowledge base. Once the SCP is generated, the clinician will verify its accuracy and share it with the patient. We also expect a task-creation step, including surveillance imaging, screening for cancer, and nudges for referrals for services. We will continue to iterate on the SCP creation and delivery process to make it patient-friendly and helpful to PCPs. We will actively seek feedback from the survivorship expert group at UCSF and a patient advisory board to improve the SCP. We expect the SCP to be available in the Oncology Snapshot or problem list in APeX (see below). For patients, it will be available through MyChart.
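A minimal sketch of Step 2 is shown below, under stated assumptions: structured APeX data prepopulates the SCP template, and a hypothetical `call_llm` helper drafts the patient-friendly sections from the curated knowledge base. The section list, field names, and knowledge-base lookup are illustrative, not the final design; the clinician review step remains the final check.

```python
# Sketch: prepopulate the SCP from structured data, draft sections from the knowledge base.

SCP_SECTIONS = ["Long-term side effects", "UCSF resources",
                "Surveillance plan", "Risk reduction", "Health maintenance"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the AI service used to draft SCP text."""
    raise NotImplementedError

def draft_scp(patient: dict, knowledge_base: dict) -> dict:
    # Prepopulated directly from structured APeX data
    scp = {
        "Demographics": patient["demographics"],
        "Care team": patient["care_team"],
        "Diagnosis and stage": patient["diagnosis"],
        "Treatment received": patient["treatments"],
    }
    # AI-drafted, patient-friendly sections grounded in the curated knowledge base
    for section in SCP_SECTIONS:
        prompt = (
            f"Write a patient-friendly '{section}' section of a survivorship care "
            f"plan for a patient with {patient['diagnosis']} treated with "
            f"{', '.join(patient['treatments'])}. Use only this source material:\n"
            f"{knowledge_base.get(section, '')}"
        )
        scp[section] = call_llm(prompt)   # clinician verifies accuracy before delivery
    return scp
```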

Once the SCP generation process is standardized, we will generate and deliver the first 20 SCPs and evaluate the process with our survivorship expert group and patient advisory board. We will iteratively improve the process to optimize SCP generation and delivery.

What are the risks of AI errors?  The biggest risk would be hallucinations, as AI may generate recommendations that are not evidence-based. We will train the clinicians who participate in the pilot to double-check recommendations and links to ensure that recommendations are within guidelines. Site-specific survivorship providers will vet all initial AI-generated SCPs to ensure that all SCPs adhere to the guidelines for surveillance and follow-up. Finally, the investigators will assess an initial set of 20 SCPs for accuracy and consistency.

How will we measure success?  We will use specific metrics to evaluate success: 1. evaluation of the AI-SCP for accuracy and consistency by the investigators and end users, 2. the number of SCPs created and delivered to patients, and 3. the time needed to create SCPs. Delivery of SCPs will help us meet the NCI and CoC requirements for cancer survivorship care. Finally, we will evaluate SCP implementation success using the four-item Acceptability of Intervention Measure (AIM), Intervention Appropriateness Measure (IAM), and Feasibility of Intervention Measure (FIM)(7). This will be done as a brief survey for providers creating the SCPs.

                          6.A. Measurements below are already being collected in APeX  

                          1. Number of patients eligible for SCPs: those treated with curative intent 

2. SCP delivery: the numerator is patients receiving SCPs; the denominator is eligible patients. We will evaluate current overall numbers and site-specific data, including which disease sites are not currently using SCPs.

                          3. Capturing treatment details including prior chemotherapy, surgery, radiation oncology.  

                          4. Capturing prevalence of persistent side effects.   

6.B. A list of other measurements you might ideally have to evaluate success of the AI: To evaluate success, we would ideally include measures such as cancer surveillance, HRQOL measures, needs assessments, and comorbidity care measures such as hemoglobin A1C, lipid profile, and screening for other cancers in the last year for patients who have received SCPs.

7. Describe your qualifications and commitment: Our applicant team includes Dr. Niharika Dixit, NP Angela Laffan, and the cancer survivorship provider group. Dr. Dixit is a medical oncologist and Professor of Medicine focused on breast cancer and survivorship care. She provides clinical care at ZSFG and is the physician lead of the UCSF Survivorship and Wellness Institute. Dr. Dixit focuses on improving the care of cancer survivors across UCSF and affiliate sites. She has conducted several research projects related to cancer survivorship and cancer survivorship care planning and is committed to optimal delivery of cancer survivorship care using innovative approaches in diverse clinical settings. Dr. Dixit additionally serves as co-chair of a Cancer Survivorship Task Force for ASCO and is the co-lead for the UCSF Survivorship and Symptom Science Research Hub, promoting research in cancer survivorship and symptom science. NP Laffan is an Oncology Nurse Practitioner in the GI Medical Oncology and GI Survivorship program at UCSF. NP Laffan is the co-creator of the GI Oncology Survivorship Program and has an active clinical practice providing survivorship care to patients who have completed treatment for GI-related cancers. NP Laffan is also the Clinical Lead in the UCSF Survivorship and Wellness Institute, focusing on program development, survivorship research initiatives, and clinician education. We are also supported by other survivorship care clinicians who have site-specific expertise (e.g., lung, breast, and colon cancer). This group meets once a month to discuss survivorship care across UCSF, and we will present our planned intervention at these meetings to solicit ongoing feedback. As part of this endeavor, we will work with a patient advisory board to iteratively refine SCP creation and delivery. Together, we believe that we are a strong team with extensive knowledge of survivorship care, including its barriers and exciting areas of opportunity. We are energized by this opportunity to work with the Health AI team and the APeX-enabled research team to address a health care issue that has been limited predominantly by the time intensity and complexity of the task. We are confident that AI can assist in rapidly creating SCPs that are patient-friendly, easy to access, tailored, and actionable, which will lead to improved survivorship care.

                          References:  

                           

                          1.    Mollica MA, McWhirter G, Tonorezos E, Fenderson J, Freyer DR, Jefford M, et al. Developing national cancer survivorship standards to inform quality of care in the United States using a consensus approach. J Cancer Surviv. 2024 Aug;18(4):1190–9.

                          2.    Hayes BD, Young HG, Atrchian S, Bennett EV, Haynes EMK, Loader A, et al. Optimizing the integration of family physicians into cancer survivorship care in the BC Interior: a mixed methods study of physicians’ opinions and experiences. J Cancer Surviv. 2025 Feb 4;

                          3.    Birken SA, Deal AM, Mayer DK, Weiner BJ. Determinants of survivorship care plan use in US cancer programs. J Cancer Educ. 2014 Dec;29(4):720–7.

                          4.    Birken SA, Mayer DK, Weiner BJ. Survivorship care plans: prevalence and barriers to use. J Cancer Educ. 2013 Jun;28(2):290–6.

                          5.    Dixit N, Sarkar U, Trejo E, Couey P, Rivadeneira NA, Ciccarelli B, et al. Catalyzing Navigation for Breast Cancer Survivorship (CaNBCS) in Safety-Net Settings: A Mixed Methods Study. Cancer Control. 2021;28:10732748211038734.

                          6.    Pradeepkumar J, Pankaj Kumar S, Reamer CB, Dreyer M, Patel J, Liebovitz D, et al. Survivorship Navigator: Personalized Survivorship Care Plan Generation using Large Language Models. medRxiv. 2025 Mar 28;

                          7.    Weiner BJ, Lewis CC, Stanick C, Powell BJ, Dorsey CN, Clary AS, et al. Psychometric assessment of three newly developed implementation outcome measures. Implement Sci. 2017 Aug 29;12(1):108.

                           

                           

                          A Clinical Decision Support Tool for Personalized Tacrolimus Dosing in Solid Organ Transplantation

                          Proposal Status: 
                          1. The UCSF Health problem.

                          Tacrolimus is a major immunosuppressive drug in solid organ transplantation.1 Due to its narrow therapeutic index, however, tacrolimus administration requires strict monitoring and adjustment to achieve optimal therapeutic dosages. Excessive dosages may increase the risk of nephrotoxicity while insufficient dosages can lead to acute rejection.2 Consequently, transplant patients require lifelong monitoring of tacrolimus trough concentrations.

When trough levels fall outside the desired range, clinicians must account for a patient's last dosage, including the dosage amount, the route and timing of administration, and concomitant medications. Other factors, such as gene polymorphisms, comorbidities, and patient demographics, impact tacrolimus pharmacokinetics.3 As a result, dosages required to reach target whole blood concentrations of tacrolimus vary among individuals. This variability imposes substantial time burdens on practitioners who may struggle to account for all relevant covariates. Patients also face a significant cumulative time burden.4 Following their first admission, patients must travel to the hospital twice weekly as outpatients for venous blood sampling, which is required for therapeutic drug monitoring.5

While previous studies have explored ML and tacrolimus trough levels in kidney6,7 and liver8,9 transplant recipients, data on tacrolimus dosage prediction are limited. Yoon et al.9 developed a long short-term memory (LSTM) model in liver transplant recipients based on trough levels up to 14 days following transplantation. However, the model did not adjust for patients with once-daily dosing, concomitant medications, or comorbidities. Additionally, few studies have explored machine learning approaches to tacrolimus dosing in lung10 and heart transplant recipients. Our model intends to fill this existing gap while improving patient safety and reducing costs associated with dosing errors.

                          2. How might AI help?

                          Long short-term memory (LSTM) is a recurrent neural network that processes sequential data to generate predictions, and LSTM models have been previously used to predict therapeutic tacrolimus concentrations.

By proposing accurate tacrolimus dosages, AI will save the time clinicians spend working through individual patient data in the immediate postoperative period, reduce the high level of expertise needed to make these dosing decisions, and decrease the number of additional patient lab checks, which will ultimately save clinical costs. Moreover, by preventing potential over- and underdosing, AI will help reduce patient hospital length of stay and readmissions. Indeed, previous studies demonstrate that personalized tacrolimus dosing over time leads to shorter median hospital stays compared to conventional dosing.9

We propose developing an LSTM model that predicts tacrolimus concentrations and proposes dosage adjustments using data from the EHR "Patient Synopsis" and "Transplant Summaries" from APeX. Data will be extracted using Versa with assistance from the UCSF Data Core or through Transplant Insights, which can be used to scan both UCSF and external EHRs. We will begin by implementing tacrolimus dosage monitoring for kidney transplant recipients, given the large cohort size, number of touchpoints, and the quality of data available, which will be sufficient to train our models. Patients must complete daily trough-level testing during the initial 10-day hospitalization and twice-weekly outpatient testing for months. With nearly 400 annual kidney transplants, this imposes significant financial burdens and demands substantial time commitments. Our model will then be extrapolated to lung, liver, and heart transplant recipients.
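To illustrate the kind of model described above, a minimal PyTorch sketch follows: each timestep carries dose, trough level, and covariates, and the network predicts the next trough concentration. The feature set, dimensions, and architecture choices are illustrative assumptions, not the final model.

```python
# Sketch: sequence model predicting the next tacrolimus trough from daily features.
import torch
import torch.nn as nn

class TacrolimusLSTM(nn.Module):
    def __init__(self, n_features: int = 8, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # predicted next trough (ng/mL)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, timesteps, n_features), e.g. dose, trough, weight,
        # creatinine, and concomitant-medication flags per day
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :]).squeeze(-1)

model = TacrolimusLSTM()
example = torch.randn(4, 10, 8)       # 4 patients, 10 days of features (toy data)
predicted_trough = model(example)     # shape: (4,)
```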

                          Our goal is to implement a clinical decision support tool to guide providers in adjusting tacrolimus dosages in the immediate postoperative phase and during ongoing immunosuppression maintenance of solid organ transplant recipients. This would be a practice-changing technology that would save time for physicians, patients, and caregivers by (1) assisting in clinical decision making and ultimately (2) limiting patient trips to the lab. Additionally, by accounting for all covariates, LSTM models will help reduce potential human error and improve patient safety by preventing over- or underdosing.

                          3. How would an end-user find and use it?

                           

                          The AI tool will be integrated with UCSF’s APeX EHR system in a unique tab. This tab will present predictive tacrolimus dosage with a streamlined workflow. The workflow will show trough trends, current tacrolimus medication regimens, correlation with clinical stays, and allow clinicians to quickly adjust covariates (e.g., concomitant medication modifications and discontinuations). Using the model, clinicians can specify the desired tacrolimus level for a patient, and the appropriate dose will be recommended. Additionally, clinicians will be able to pull out and examine aggregate trends for their patients.

                           4. Embed a picture of what the AI tool might look like.

                          Figure 1 depicts what the AI tool may look like. This will be available as a separate tab in APeX. The transplanted organ will be shown. Directly below the transplant information, previous model predictions can be tracked to verify accuracy and whether they were followed. The predicted tacrolimus projected dose is shown as a star and will estimate the expected trough. As users input information in the right panel, “Adjust for Your Next Dosage”, they will have the option to alter data as necessary. Weight, height, and age will be automated, but users will have the option to enter manual data, if necessary, as indicated by the “More” button. Concomitant medications can be selected from a patient’s current medication. Labs will be automatically entered in the left panel.

                           

                           

                          Figure 1. Proposed model. Please note that hospital stays are not reported, but the model would include this feature.

                          5. What are the risks of AI errors?

AI errors in the context of LSTM models include overfitting or underfitting on training data. This may lead to inaccurate tacrolimus dosage recommendations that compromise patient safety through toxicity or insufficient immunosuppression. Previous data, however, suggest that LSTM models can accurately predict tacrolimus dosages to achieve actual concentrations within the therapeutic range when sufficient training data are provided.9

                          Despite these risks, UCSF maintains complete data on a large cohort of transplant recipients, which will be used to validate the model. The model will undergo continuous performance monitoring, bias assessment, and iterative refinements. We plan to incorporate algorithms to enhance model transparency and observe the effects of covariates on tacrolimus concentrations. Additionally, our tool will assist clinical decision making; clinicians will use their discretion when administering tacrolimus dosages.

                          6. How will we measure success?

                          We will measure the success of our model for tacrolimus dosages based on adoption, the impact on clinician efficiency, and patient outcomes. These metrics include the following:

Primary Outcome (Therapeutic Accuracy): We will track how consistently patients' tacrolimus levels remain within the desired therapeutic range, as well as the rate of dosing adjustments that adhere to AI recommendations.
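A minimal sketch of these two calculations is shown below; the therapeutic range bounds and the example values are illustrative assumptions, not protocol targets.

```python
# Sketch: percent of troughs in range, and adherence to AI-recommended doses.

def time_in_range(troughs: list[float], low: float, high: float) -> float:
    """Fraction of trough measurements within the therapeutic range."""
    return sum(low <= t <= high for t in troughs) / len(troughs)

def adherence_rate(adjustments: list[dict]) -> float:
    """Fraction of dose adjustments that matched the AI recommendation."""
    followed = sum(a["ordered_dose"] == a["recommended_dose"] for a in adjustments)
    return followed / len(adjustments)

# Toy example: 3 of 4 troughs fall within an assumed 5-10 ng/mL range
print(time_in_range([6.2, 9.8, 11.5, 7.4], low=5.0, high=10.0))   # 0.75
```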

                          Secondary Outcomes:

                          Healthcare Costs: We will analyze overall costs related to tacrolimus therapy, including additional laboratory tests, ICU/hospital length of stay, pharmacist and clinician charting durations, and hospital readmissions associated with suboptimal dosing (toxicity or organ rejection). These measures will be compared pre- and post-implementation.

                          Clinical Efficiency: We will examine the frequency of clinic visits for dosage adjustments, readmission rates due to organ rejection or nephrotoxicity, and overall hospital length of stay will be compared pre- and post-implementation.

                          User Feedback and Engagement: We will solicit feedback from transplant clinicians on their experiences integrating the AI tool into daily practice. This may include quarterly Qualtrics surveys focused on usability, perceived accuracy, and impact on workflow.

                          Adoption Metrics: We will measure how often the AI-driven recommendations are accepted or overridden, as well as the reasons behind provider decisions.

                          Continuous Improvement: Feedback loops will be established to refine the AI model, ensuring that user input guides enhancements and fosters long-term adoption.

                          7. Describe your qualifications and commitment:

                          Dr. Steven Hays is a pulmonologist and director of the UCSF Lung Transplant Program and Medical Director of the Transplant Digital Health team. The team includes Anna Mello, the Manager of Transplant Quality and Digital Health; an acute care nurse practitioner; transplant coordinators; and Logan Pierce. Dr. Pierce serves as Managing Director for the Department of Medicine Data Core, which focuses on clinical data extraction, analysis, and visualization.

                          The Transplant Digital Health team has launched several projects, including the UCSF Health Home Spirometry Kit, a self-sustaining project that generates $1.2 million per year. Additional projects include Transplant Insights, a tool that pulls information from 95% of EMRs to create clinical summaries; and The Kidney Pre-List Chat Program, which enabled data-driven triage of online referrals and an annual savings of upwards of $580,000.

                          Additional team members include Dr. John Roberts, a board-certified surgeon who specializes in abdominal transplantation, and Dr. David Quan, a transplant pharmacist. Dr. Roberts has published nearly 170 papers on topics such as immunosuppression. Dr. Quan oversees UCSF's transplant pharmacist group and serves as program director for the UCSF Medical Center's specialized residency in solid organ transplant.

                           

                          References

                          1.         Araya AA, Tasnif Y. Tacrolimus. In: StatPearls. StatPearls Publishing; 2025. Accessed April 3, 2025. http://www.ncbi.nlm.nih.gov/books/NBK544318/

                          2.         Randhawa PS, Starzl TE, Demetris AJ. Tacrolimus (FK506)-Associated Renal Pathology. Adv Anat Pathol. 1997;4(4):265-276.

                          3.         Staatz CE, Tett SE. Clinical Pharmacokinetics and Pharmacodynamics of Tacrolimus in Solid Organ Transplantation. Clin Pharmacokinet. 2004;43(10):623-653. doi:10.2165/00003088-200443100-00001

                          4.         Veenhof H, van Boven JFM, van der Voort A, Berger SP, Bakker SJL, Touw DJ. Effects, costs and implementation of monitoring kidney transplant patients’ tacrolimus levels with dried blood spot sampling: A randomized controlled hybrid implementation trial. Br J Clin Pharmacol. 2020;86(7):1357-1366. doi:10.1111/bcp.14249

                          5.         Leard LE, Blebea C. The transformation of transplant medicine with artificial intelligence-assisted tacrolimus dosing. J Heart Lung Transplant. 2025;44(3):362-363. doi:10.1016/j.healun.2024.11.029

                          6.         Mok S, Park SC, Yun SS, Park YJ, Sin D, Hyun JK. Optimizing Tacrolimus Dosing During Hospitalization After Kidney Transplantation: A Comparative Model Analysis. Ann Transplant. 2025;30:e947768. doi:10.12659/AOT.947768

                          7.         Zhang Q, Tian X, Chen G, et al. A Prediction Model for Tacrolimus Daily Dose in Kidney Transplant Recipients With Machine Learning and Deep Learning Techniques. Front Med. 2022;9:813117. doi:10.3389/fmed.2022.813117

                          8.         Li ZR, Li RD, Niu WJ, et al. Population Pharmacokinetic Modeling Combined With Machine Learning Approach Improved Tacrolimus Trough Concentration Prediction in Chinese Adult Liver Transplant Recipients. J Clin Pharmacol. 2023;63(3):314-325. doi:10.1002/jcph.2156

                          9.         Yoon SB, Lee JM, Jung CW, et al. Machine-learning model to predict the tacrolimus concentration and suggest optimal dose in liver transplantation recipients: a multicenter retrospective cohort study. Sci Rep. 2024;14(1):19996. doi:10.1038/s41598-024-71032-y

                          10.       Choshi H, Miyoshi K, Tanioka M, et al. Long short-term memory algorithm for personalized tacrolimus dosing: A simple and effective time series forecasting approach post-lung transplantation. J Heart Lung Transplant. 2025;44(3):351-361. doi:10.1016/j.healun.2024.10.026

                          GIVersa-Endoscopy: A Large Language Model (LLM) based AI Assistant for Endoscopy Sedation Triage

                          Proposal Status: 

                          The UCSF Health Problem:

Sedation planning prior to endoscopic procedures is an important quality metric and an essential step in the procedural workflow. Triage of which patients require higher levels of anesthesia support is critical to maximizing patient safety and allocating limited anesthesia resources. At most institutions, patients are assessed for sedation risk from pre-existing medical conditions via manual chart review by clinical staff and endoscopists: this is a time-consuming and labor-intensive task.

                          At UCSF Health, patients are triaged for endoscopy procedures across five locations, based on their anesthesia risk and required sedation type. As our health system expands across hospital systems and ambulatory care centers throughout San Francisco and the greater Bay Area, the decision tree for appropriate triage becomes increasingly complex. Currently, the gastroenterology office reviews up to 150 direct endoscopy referrals per week. This workflow of manual review by staff and faculty diverts valuable time away from direct patient care and contributes to administrative burdens, which in turn leads to physician burnout. 

                          Although this proposal is initially tailored to gastroenterology-specific procedures, the administrative challenges of peri-operative sedation triage are widespread across the health system in many divisions, highlighting the larger potential uses for an AI-based sedation triage assistant. 

                          How might AI help?

We propose the development of an LLM-based assistant (UCSF Versa) customized via the Retrieval-Augmented Generation (RAG) methodology. This approach adds the UCSF anesthesia endoscopy guidelines and American Society for Gastrointestinal Endoscopy (ASGE) clinical guidelines to the assistant's retrieval database, thereby reducing hallucinations by controlling the data sources used for response generation. This customization, named "GIVersa-Endoscopy," would serve as the chat interface intended to triage sedation levels (moderate sedation versus anesthesia) and recommend the appropriate procedure location stratified by anesthesia risk level (Parnassus operating room, Parnassus endoscopy unit, Mission Bay endoscopy unit, Mount Zion endoscopy unit, or ambulatory care centers) based on clinical patient data extracted from the electronic medical record. This design has been successfully tested by our team in a pilot retrospective cohort study using a custom Epic SmartPhrase to extract relevant clinical data for the assistant.
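As a rough illustration of the RAG pattern described above (not the production GIVersa-Endoscopy build), the Python sketch below retrieves the guideline passages most similar to a patient summary and asks the model to ground its sedation recommendation in those passages. The embed_text and call_versa functions are hypothetical placeholders for whichever embedding model and PHI-secure Versa endpoint are ultimately used.

# Illustrative sketch of retrieval-augmented sedation triage; placeholders are hypothetical.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder: return a dense vector for `text` from an embedding model."""
    raise NotImplementedError

def call_versa(prompt: str) -> str:
    """Placeholder: send `prompt` to the PHI-secure LLM and return its reply."""
    raise NotImplementedError

def retrieve(query: str, guideline_chunks: list[str], k: int = 3) -> list[str]:
    """Return the k guideline passages most similar to the query (cosine similarity)."""
    q = embed_text(query)
    scored = []
    for chunk in guideline_chunks:
        v = embed_text(chunk)
        scored.append((float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), chunk))
    return [c for _, c in sorted(scored, reverse=True)[:k]]

def triage_sedation(patient_summary: str, guideline_chunks: list[str]) -> str:
    """Compose a grounded prompt from the retrieved passages and the patient data."""
    context = "\n\n".join(retrieve(patient_summary, guideline_chunks))
    prompt = (
        "Using ONLY the UCSF anesthesia and ASGE guideline excerpts below, recommend a "
        "sedation level (moderate sedation vs. anesthesia) and a procedure location, and "
        "cite the excerpt supporting each recommendation.\n\n"
        f"Guideline excerpts:\n{context}\n\nPatient data:\n{patient_summary}"
    )
    return call_versa(prompt)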

                          How would an end-user find and use it?

The GIVersa-Endoscopy LLM assistant will be integrated into the existing Epic referrals and pre-procedural planning dashboards. When processing a direct procedure referral, the end-users (the administrative staff and the faculty member responsible for reviewing and scheduling new referrals) will receive an alert with an option to generate the AI-augmented triage recommendation. The interface will allow the user to “Approve” or “Decline” the AI’s recommendation. The final decision on sedation level and procedure location, along with the assistant’s underlying reasoning for that decision, will be recorded in the patient’s chart. This integration ensures that the tool is actionable, easily discoverable, and fits seamlessly into the existing workflow.

                          Example AI tool output

                          See attached figure. 

                          What are the risks of AI errors?

The risks of an AI-augmented preprocedural sedation triage assistant include:

                          1) Patient safety: A false negative would occur when a patient is recommended a lower level of sedation support by the assistant when the patient should have been triaged to a higher level of sedation support. 

                           2) Overdependence on the AI: Excess reliance on the AI system may compromise decision making in complex clinical cases. 

                          To mitigate these risks, we will conduct rigorous validation tests and continuously monitor performance metrics such as sensitivity, specificity, and error rates. The system is designed to require human verification of the AI’s recommendations prior to scheduling the patient.  All AI assistant outputs will be logged in the patient’s chart for audit and review. Licensed professionals will maintain primary accountability for patient care decisions. 

                          How will we measure success? 

                          We plan to evaluate the AI solution through a prospective cohort study of patients referred for direct endoscopy at UCSF Health. Success metrics will include: 

                          Measurements Using Existing APeX Data:

                          1)    The volume of referrals processed by the AI tool.

                          2)    The percentage of referrals where AI recommendations are accepted after clinician review.

3)    Reduction in manual review time for each direct procedure referral, as recorded within current workflow data.

                          4)    Correlation between AI-augmented sedation risk stratification and patient outcomes.  This will be measured via post-procedural audits which are performed to ensure successful and safe completion of patient procedures. This workflow is already in place for endoscopic procedures at our health system. 

                          Additional Measurements:

                          1)    Review of AI recommendation logs versus human overrides to identify error rates and opportunities for improvement in assistant performance. 

                          2)    End-user satisfaction surveys to assess overall usability, reduction in administrative load, and time efficiency. 

                          3)    Impact on patient wait times to schedule procedures (time from referral placement to receipt of procedure appointment confirmation). 

                          4)    Impact on administrative clinic staff and ease of integration within existing workflows.

                          Describe your qualifications and commitment 

Dr. Lakshmi Subbaraj, MD, a current second-year gastroenterology fellow, has career aspirations in the integration of AI applications into clinical practice to enhance patient care and improve clinical operations. Her technical background in computer science, completion of an “AI in Healthcare: From Strategies to Implementation” course, and recognition as an ASGE AI scholar underscore her commitment to this goal early in her career. The blend of her clinical acumen, technical expertise, and communication skills has been pivotal in spearheading this project.

Dr. Jin Ge, MD, MBA, an NIH-funded clinical researcher and transplant hepatologist, also serves as Director of Clinical AI for the Division of Gastroenterology and Hepatology. He is the co-lead for this project and has a proven track record in building and implementing AI initiatives. His professional background in healthcare administration, data science, and artificial intelligence, along with his previous collaborations with the UCSF AER Team and AI Tiger Team, makes him exceptionally qualified to lead this project through the design, testing, and implementation stages.

                          ProUCare - An AI-Preventive Medicine Extender

                          Proposal Status: 

                          The UCSF Health problem

Patients seek preventive health recommendations from their primary care providers (PCPs). Health topics include, but are not limited to, nutrition, musculoskeletal health, injury prevention, and aging well. These broad topics extend beyond the standard 20-minute annual visit and the current APeX-derived healthcare maintenance banner topics.

PCPs are ill-equipped to provide tailored recommendations at scale. For instance, one survey showed that 73% of physicians believe nutrition guidance should be part of patient visits, but only 15% feel fully prepared to provide it. Moreover, insurance coverage rules for nutrition services vary by plan: some plans cover registered dietitians and nutritionists, while others, like Medicare Part B, cover only limited qualifying conditions such as diabetes, chronic kidney disease, and transplant. Similarly, PCPs' confidence in providing exercise and injury prevention recommendations is only moderate, ranging between 50% and 67%.

The knowledge gaps extend to trainees. A 2018 survey reported that only 29% of US medical schools met the recommended minimum of 25 hours of nutrition education; by 2023, only 7.8% of US medical schools met that 25-hour minimum. Similarly, while 83% of US medical schools require some form of musculoskeletal education, explicit didactics on injury prevention are limited.

PCPs and clinicians in training could use the ProUCare AI tool to generate tailored, shareable preventive medicine guidance that adapts throughout the arc of our patients' lives.

                          How might AI help?

                          My vision is for the ProUCare AI tool to generate tailored recommendations based on the patients’ EHR inputs complemented by evidence-based nutrition science, physical medicine, and occupational medicine foundations.

                          The recommendations would (1) replace the current BMI-generated AVS nutrition handouts, (2) empower patients in recognizing and reducing occupation-related health impacts, (3) improve PCP confidence in creating preventive health advice at scale, (4) promote patient-clinician practice shifts to wellness from chronic disease management, and (5) improve the medical education foundation across these intersections.

                          How would an end-user find and use it?

PCPs and trainees would access the ProUCare AI tool in specific settings, e.g., the annual visit (pull), and scenarios, e.g., patient-led preventive health inquiries (push). The tool could be embedded on the left side of the screen near the Care Gaps and SDoH icons. Alternatively, the tool could be added to the order section of the Annual Exam SmartSet (Fig 1, see attachment).

Piloting primary care users (faculty, APPs, trainees) would enter their nuanced question in the ProUCare question field. ProUCare would compile data sourced from the patient's record (structured and free text), EHR questionnaire inputs, social determinants of health (SDoH) data, and insurance plans, coupled with UC resources and references (society guidelines, publications) vetted by the project's panel of experts from Nutrition, Osher, and Physical Therapy (Fig 2, see attachment).
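A minimal sketch of how the inputs described above might be assembled into a single prompt is shown below; the field names, section headings, and build_proucare_prompt helper are illustrative assumptions rather than the actual APeX integration.

# Minimal sketch of assembling ProUCare inputs into one prompt; names are placeholders.
def build_proucare_prompt(question, structured_fields, sdoh_answers, insurance_plan, references):
    """Compose the clinician's question with patient context and expert-vetted references."""
    context = "\n".join(
        [f"{name}: {value}" for name, value in structured_fields.items()]
        + [f"SDoH - {name}: {value}" for name, value in sdoh_answers.items()]
        + [f"Insurance plan: {insurance_plan}"]
    )
    sources = "\n".join(f"- {ref}" for ref in references)  # vetted guidelines only
    return (
        "You are a preventive-medicine extender. Answer the clinician's question using the "
        "patient context and ONLY the vetted sources below. Structure the reply into three "
        "sections suitable for the after-visit summary: Nutrition, Physical health, "
        "Occupational/injury prevention.\n\n"
        f"Question: {question}\n\nPatient context:\n{context}\n\nVetted sources:\n{sources}"
    )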

ProUCare's outputs could populate the after-visit summary (AVS) (Fig 3, see attachment). Providers could copy/paste relevant subsections into the assessment/plan section within the “Problem List,” and into portal message replies for scenario-based inquiries. The three locations are visible to the patient via MyChart between visits and to their care teams across UCSF.

                          What are the risks of AI errors? What are the different types of risks relevant to the proposed solution and how might we measure whether they are occurring and mitigate them?

                          Risk 1. Inaccurate recommendations due to incomplete data, and inconsistent data entry practices.

                          Measurement and mitigation strategies:

                          • Conduct an inventory of annual exam EHR-based questionnaires, medical, and social history fields.
• Perform a consensus review with nutrition, physical medicine, and occupational medicine experts to define pertinent components from the inventory that require revision.
                          • Optimize for structured field versus free-text entry. 
                          • Educate clinicians and care teams on EHR annual exam workflow changes.
                          • Perform bimonthly to monthly checks for data completeness and AI tool output accuracy. 

                          Risk 2.  Misdirection. Recommendations could lead to unnecessary interventions.

                          Measurement and mitigation strategies:

                          • Educate and remind PCPs and trainee users of the AI tool’s role as an extender.
                          • Create a feedback system which enables provider and patient feedback for reporting and analyzing errors in preventive health recommendations. Use the generated feedback to refine input parameters. 
                          • Perform an analysis of the order and referral pattern changes and outcomes based on the AI tool’s recommendations. 

Risk 3. Bias. AI-generated recommendations may not account for cultural, regional, or socioeconomic dietary patterns. Similarly, dataset underrepresentation of certain occupations and functional statuses could introduce bias and hallucinations.

                          Measurement and mitigation strategies:

                          • Use diverse datasets that include varied demographics, occupations, and dietary patterns
                          • Employ techniques like Retrieval-Augmented Generation (RAG) to ensure outputs are grounded in patient-specific contexts.
                          • Use the same feedback system outlined above to incorporate clinical and patient feedback loops to refine recommendations based on usability and relevance.

                          How will we measure success?  What can we measure that will tell us whether intended end-users are using the tool, changing what they do, and improving health for patients or otherwise solving the problem you described above? What evidence would you need to convince UCSF Health leadership to continue supporting the AI in APeX? What results would make you consider abandoning the implementation? Please include 2 subsections:

                           
                          Measurements, quantitative and qualitative:
• ProUCare usage patterns. Number of times opened, timing (during clinic, after clinic hours), setting (annual, Medicare wellness, MyChart message, recommendations based on test results), and time spent accessing the tool.
                          • ProUCare education assessment. Users and patients rate the output utility (thumbs up/down, Likert scale 0-5), number of times reference links are opened per category and by audience (clinician type - faculty, APP, trainee vs patient).

• Practice changes. Measure the amount and rate of change in the completeness of information collected on SDoH questionnaires. Track the number of wellness, preventive-focused APeX tickets submitted as a result of ProUCare's inclusion, e.g., social history structured fields for occupation categories, commute (expressed in hours), screen time (expressed in hours, reasons), cultural influences, budget ranges, stores related to food procurement.
                          • Visit metrics. Number of completed annual and Medicare wellness visits, referral numbers, wait times.

                          • Population health metrics. Preventive test completion rates (short term during the pilot year), percentages with well-controlled chronic conditions or precursor conditions which normalized e.g. prediabetes to normal.
• Patient satisfaction scores: Add a wellness question on practice surveys to gauge ProUCare's utility to patients.

• User surveys. During the development phase and post-deployment:
  • Assess changes in confidence levels, use of external tools, resource knowledge, and time spent creating tailored preventive medicine recommendations.
  • Assess AI usability, design, and alert fatigue.

What evidence would you need to convince UCSF Health leadership to continue supporting the AI in APeX?

                          The AI-Preventive Medicine Extender tool could provide the following benefits to UCSF Health leadership
                          • Population health
                            • Preventive health.
                              • Increased annual exam and Medicare wellness completion rates.
                              • Improved preventive health test and referral completion rates.
    • Identification and provision of early detection guidance for nuanced and under-represented patient populations missed by the current care gaps banner.
                            • Chronic disease management.
                              • Reduce the long-term incidence and severity of metabolic and musculoskeletal diseases like diabetes, hypertension, obesity and MASH. 
                              • Reduce total healthcare spend on medications, labs, imaging. 
                          • Education
  • Improve providers' and trainees' evidence-based foundations in nutrition, complementary and alternative medicine, and physical health.
                            • Encourage patients to seek advice from vetted, compiled, evidence-based sources.
                            • Community engagement 1. Publicize current and future UCSF resources e.g. condition-focused nutrition group classes, webinars. 
  • Community engagement 2. Strengthen clinical team and patient links to community-based organizations that may provide low-cost services and assistance programs, e.g., food banks, sliding-scale chiropractors.
                          • Clinical team satisfaction
                            • Reduce administrative tasks required for referrals, appointment tracking. 
  • Reduce providers' non-billable time spent in chart review and researching information from different sources to devise tailored recommendations.
• Patient satisfaction. CG-CAHPS, Press Ganey.
  • Service access improvements. Reduce long wait times for impacted referral services like neurology and physical therapy.
                            • Care coordination 1. Increase patients' awareness of insurance and employer contracted services based on ProUCare's recommendations.
                            • Care coordination 2. Improve patients' confidence in using ProUCare's advice as a bridge until the specialist appointment. 
                          Describe your qualifications and commitment:

Dr. Chan Tack is a dual-boarded internal medicine and obesity medicine specialist. She is trained in human-centered design and public health management. She successfully led Primary Care Service preventive health colorectal cancer screening initiatives; the program's foundations extended across primary care sites and were featured by the California Healthcare Safety Net Institute.

Dr. Chan Tack is an experienced, Epic-certified leader in digital health innovations, from pilots to system-wide launches, across video visits and remote patient monitoring. Her work as an informed physician leader enabled UCSF's 3,500+ clinicians to deliver more than 4 million ambulatory video visits.

                          She is supported by colleagues in the Nutrition Counseling, Osher and Physical Therapy departments. Dr. Chan Tack is committed to collaboration with the Health AI and AER teams for development and implementation of the AI algorithm. 

                           
                          References
                          1. https://www.pcrm.org/news/news-releases/poll-most-doctors-want-discuss-nutrition-patients-feel-unprepared
                          2. Lee AK, Muhamad RB, Tan VPS. Physically active primary care physicians consult more on physical activity and exercise for patients: A public teaching-hospital study. Sports Med Health Sci. 2023 Nov 20;6(1):82-88. doi: 10.1016/j.smhs.2023.11.002. PMID: 38463668; PMCID: PMC10918360.
                          3. Papastratis, I., Konstantinidis, D., Daras, P. et al. AI nutrition recommendation using a deep generative model and ChatGPT. Sci Rep 14, 14620 (2024). https://doi.org/10.1038/s41598-024-65438-x
                          4. https://blogs.und.edu/cnpd/2024/09/diet-related-diseases-are-the-no-1-ca...
                          5. Special thanks to clinic managers and practitioners in UCSF Nutrition Counseling, Osher Center, and Physical Therapy departments. 

                           

                          Enhancing Efficiency and Impact: AI-Powered Eligibility Assessment for Adult Complex Care Management

                          Proposal Status: 
                          1. The UCSF Health problem.    

The UCSF Office of Population Health (OPH) Complex Care Management (CCM) team provides advanced care management services to adult patients with high medical and/or psychosocial complexity who are high utilizers of inpatient or emergency department (ED) services. The CCM program involves essential high-touch support such as assessing individual patients’ healthcare challenges, developing targeted care plans, providing health education and coaching, coordinating linkage to care, and connecting to other community resources. Prior analysis of this program’s outcomes showed a statistically significant decrease in ED and observation encounters for patients enrolled in the program. The impact of this program therefore has significant potential to help address UCSF Health’s current ED crowding and bed capacity challenges, reduce readmission rates, and help meet quality metrics associated with specific patient populations.

Currently, one of the most time-consuming challenges faced by the CCM team is determining individual patient eligibility for the CCM program. Despite using a reporting workbench that identifies patients meeting initial objective criteria, the team must still manually chart review to determine eligibility, which can consume up to 30 minutes per patient. This manual process can also be inconsistent, as it has been noted that different reviewers assess medical and social complexities differently based on their training and clinical background.

                          Ideally, much of the current time spent by the OPH CCM team on manual chart review could instead be spent on higher-yield patient care activities, expanding the team’s ability to manage a higher volume of patients without requiring additional staff, as well as allowing its team members to practice at the top of their licenses. Additionally, standardizing definitions across patient enrollment criteria and workflows around the patient identification process would reduce variability between team members in applying inclusion and exclusion criteria.  

We believe that AI can be used to optimize these areas of need and have done preliminary work to show how this can apply. Through the AI demonstration project, we hope to validate our initial prototype; build an automated pipeline at scale; design, implement, and study appropriate clinical workflows; and work with appropriate governance committees to ensure safe and effective use. While this proposal is initially focused on addressing the needs of the CCM program in OPH, we believe that this model can be scaled to multiple other programs within OPH that would benefit from improved efficiency in patient identification processes.

                          This work is particularly important given the current environment for UCSF Health. As our health system grows, we may not be able to hire more staff, and so will need to be more efficient. If successful, this initiative will allow CCM team members to re-allocate approximately one day per week to seeing additional patients instead of performing chart reviews. 

2. How might AI help?

                          To address these challenges, we built an initial prototype using prompt engineering in Versa to optimize the CCM patient eligibility determination process. This AI system mimics the current workflow of CCM staff, reviewing relevant notes for complex patients with high inpatient and ED utilization within a set time period for specific inclusion and exclusion criteria. Specifically, patients must meet at least one of six indicators of medical or psychosocial complexity. These indicators include:

                          1. Uncontrolled chronic medical conditions
                          2. Use of high-risk medications
                          3. Evidence of barriers to medical adherence or limited health literacy
                          4. Lack of social support
                          5. Older age with frailty or poor functional status or cognitive impairment
                          6. Social determinants of health (SDoH) challenges

                          We completed an initial assessment comparing LLM-generated eligibility determinations with those from manual reviews using 27 distinct patient charts. Our findings revealed 100% sensitivity and 77% specificity in eligibility determination. The lower specificity was attributed to over-inclusion, as the initial prompt lacked explicit instructions on defining medical and social complexity. Patients with lower complexity levels were incorrectly classified as eligible. We expect that further refinements will lead to better specificity. During this initial assessment, we also observed the AI system’s ability to standardize the review process, as it identified 3 patients who were initially misclassified for eligibility upon re-review by the CCM team. 
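For reference, the validation comparison described above reduces to a simple confusion-matrix calculation; the Python sketch below shows one way to compute sensitivity and specificity of AI eligibility calls against the CCM team's manual determinations. The example labels are made up and do not reproduce the 27-chart pilot data.

# Minimal sketch of the AI-vs-manual comparison; labels below are hypothetical.
def sensitivity_specificity(ai_calls, manual_calls):
    """Both inputs are lists of booleans: True = eligible for CCM."""
    tp = sum(a and m for a, m in zip(ai_calls, manual_calls))
    tn = sum((not a) and (not m) for a, m in zip(ai_calls, manual_calls))
    fp = sum(a and (not m) for a, m in zip(ai_calls, manual_calls))
    fn = sum((not a) and m for a, m in zip(ai_calls, manual_calls))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example only; the pilot itself reported 100% sensitivity and 77% specificity.
ai     = [True, True, True, True, False, False]
manual = [True, True, True, False, False, False]
print(sensitivity_specificity(ai, manual))  # (1.0, 0.67) for this toy example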

                          The initial results from this AI system were received positively by the CCM team and OPH leadership. Together, we believe that this can be used as an adjunct tool to significantly streamline the time spent on and improve the overall quality of eligible patient review. 

3. How would an end-user find and use it?

Twice a month, the CCM team reviews a list of patients who have met certain utilization thresholds and other simple inclusion criteria, and subsequently performs chart review of the preceding 6 months of data to determine whether these patients are eligible for enrollment in the CCM program. Approximately 73 patients are reviewed every month. For each patient, the AI system described above would review and synthesize relevant patient notes to summarize findings for the CCM team. It would generate the following outputs: the eligibility determination, the specific inclusions and exclusions that led to that determination, and which notes informed those inclusion and exclusion decisions (see question 4). This setup increases transparency in the AI system, as it can show how it reached its conclusions as well as where to find relevant information for further review. This output should ideally be integrated into the existing Epic Reporting Workbench report used by the CCM team, as it includes patient data and patient outreach tools. CCM team members would then use this information to assist in their review.

4. Embed a picture of what the AI tool might look like

Please see above for an example of what the output from the AI system might look like. This data would ideally be integrated into an Epic Reporting Workbench report. Here, we show only one exclusion criterion for illustrative purposes; in reality, the output would include many more criteria currently utilized by the CCM team.

5. What are the risks of AI errors?

                          False negatives, false positives, and hallucinations are critical errors we will monitor for in this AI system. These errors can arise at multiple levels, such as when the LLM is outputting answers to individual inclusion and exclusion criteria, and when it is giving a final eligibility assignment for each patient.  As we envision this system being used to augment the review process rather than automate it, it is particularly important to monitor the false negative, false positive, and hallucination rates in its responses to individual inclusion and exclusion criteria, as we anticipate the CCM team to primarily rely on these responses to optimize their review. Large error rates in any of these categories will limit the usefulness of the AI system.  

Including notes from CareEverywhere in the AI system will also be fundamental to ensuring accuracy of the AI output, as these notes are also included in the CCM staff's current review processes.

The only way to measure whether these errors are occurring is to retain some number of complete manual chart reviews per month (e.g., 10-20) and compare the AI system's performance to this gold standard. In terms of mitigation, there are a few possible strategies: (1) for criteria that involve math, logic, or structured data, leverage traditional reporting rather than generative AI; (2) prioritize low false negative and hallucination rates over false positive rates, where possible; and (3) when new patterns of error are discovered, use iterative cycles of prompt engineering to resolve the error.
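The first mitigation strategy, routing structured-data criteria to deterministic logic and reserving the LLM for unstructured chart review, is sketched below. The thresholds, field names, and ask_llm placeholder are illustrative assumptions, not the CCM program's actual criteria or implementation.

# Sketch of a hybrid check: rule-based structured criteria plus LLM review of notes.
def meets_utilization_threshold(ed_visits_6mo: int, inpatient_stays_6mo: int) -> bool:
    """Structured-data criterion handled by traditional reporting, never by the LLM (placeholder thresholds)."""
    return ed_visits_6mo >= 3 or inpatient_stays_6mo >= 2

def ask_llm(question: str, notes: str) -> bool:
    """Placeholder: ask the LLM a yes/no question about the concatenated notes."""
    raise NotImplementedError

def assess_eligibility(patient):
    # Deterministic screen first; only escalate to the LLM when structured criteria are met.
    if not meets_utilization_threshold(patient["ed_visits_6mo"], patient["ip_stays_6mo"]):
        return {"eligible": False, "reason": "utilization threshold not met (rule-based)"}
    complexity = ask_llm(
        "Does the chart document uncontrolled chronic conditions, high-risk medications, "
        "adherence barriers, lack of social support, frailty, or SDoH challenges?",
        patient["notes"],
    )
    return {"eligible": complexity, "reason": "LLM review of unstructured notes"}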

6. How will we measure success?

                          Please see below for lists of measures being tracked already, as well as measures that we want to track as part of the AI deployment. For this project to be successful, the AI system should be accurate, increase efficiency/satisfaction, increase the capacity of the CCM team, and be low cost to deploy and maintain relative to its benefits.  

Measurements using data that are already being collected:
                          • Number of enrolled patients per CCM staff 
                          • Number of San Francisco Health Plan patient referrals and acceptances per month 
                          • Outcomes of patients deemed eligible (e.g. completed, declined, dropped engagement, etc) 
Other measurements to evaluate success of the AI:

• Rates of sensitivity, false positives, false negatives, and hallucinations of the AI system per month, compared to the manual review gold standard
• Average and total time required per patient to review the chart to determine enrollment eligibility (pre/post)
                          • Adoption rate of Versa API per clinical staff member per month 
                          • Versa API cost per month 
                          • Staff satisfaction (pre/post)  
                          • Number of inappropriate eligibility determinations requiring secondary review per month  
7. Describe your qualifications and commitment:

                          Executive Sponsor: Tim Judson, MD, Interim Chief Population Health Officer 

Co-project leads: Esther Hsiang, MD, MBA, Interim Medical Director of Care Delivery Transformation, UCSF Office of Population Health; and Leo Liu, MD, MS, FAMIA, Associate Program Director, Clinical Informatics Fellowship.

                          Esther Hsiang leads the Innovations team in OPH in designing, implementing, and evaluating strategic initiatives related to new models of care delivery at scale. She is involved in partnering with OPH clinical programs to assess program outcomes and assessing how technology can improve clinical workflows and expand patient capacity. Her clinical experience as a hospital medicine physician with internal medicine training allows for a deep understanding of challenges in navigating inpatient and outpatient systems for medically complex patients. 

                          Leo Liu is an Applied ML Scientist who has developed many AI tools at UCSF, including a sepsis prediction model that is now in silent study in APeX. Leo has helped mentor CI fellows Sara Faghihi Kashani and Abi Fadairo-Azinge in the development of the pilot AI system above. In addition, Leo has published on evaluating ML models for clinical practice[1], as well as created the concept of Sabercaremetrics[2] – novel metrics to better measure clinical performance.  

                          Esther and Leo are both committed to dedicate effort to this project during the upcoming year, including participating in regular work-in-progress sessions and collaborating with the Health AI and AER teams for development and implementation of the AI system.  

                          Additional Team Members 
                          Robin Andersen, MSN, PHN, NP, Manager, Complex Care Management, OPH 
                          Joshua Munday, MSN, MPH, RN, FNP, Complex Care Management, OPH 
                          Kristin Gagliardi, Implementation Specialist, Innovations Team, OPH 
                          Sara Faghihi Kashani, MD, MPH, Clinical Informatics Fellow 
                          Abi Fadairo-Azinge, MD, MPH, Clinical Informatics Fellow 

                          Open edit period edit(s)
Added list of inclusion/exclusion criteria that the pilot prompt evaluates for

                          References
                          [1] Liu X, Anstey J, Li R, Sarabu C, Sono R, Butte AJ. Rethinking PICO in the Machine Learning Era: ML-PICO. Appl Clin Inform. 2021 Mar;12(2):407-416.
                          [2] Liu X, Auerbach AD. A new ballgame: Sabercaremetrics and the future of clinical performance measurement. J Hosp Med. 2025 Apr;20(4):411-413.

                          Supporting Documents: 

                          The Whole Person Care Navigator: An AI-Powered Copilot for Social and Behavioral Risk-Informed Care

                          Proposal Status: 

                          1. The UCSF Health Problem

                          The profound impact of social and behavioral determinants of health (SBDH) on patient outcomes is well-established, with some studies citing that as much as 80% of health outcomes are driven by SBDH rather than medical care itself[1]. In response, CMS has launched a 3-step roadmap, where step 1 is to increase awareness of SBDH factors (by rewarding/penalizing patient screening rates or SBDH-1), step 2 is to address SBDH factors (by rewarding/penalizing screening positivity rates or SBDH-2), and step 3 is to improve clinical outcomes that are heavily impacted by SBDH factors (by rewarding clinical outcomes)[2]. The City and County of San Francisco have similarly prioritized addressing SBDH—particularly opioid use—due to their substantial local impact[3].

Although UCSF Health has made significant workflow changes to meet the goals of the first screening step, the latter two steps (reducing SBDH positivity rates and improving health outcomes in the presence of SBDH factors) are substantially more difficult, as they require a deep understanding of a patient's social and behavioral needs and how those needs interplay with their medical needs. While the health system has always targeted these latter steps, gaps remain with current approaches. Current workflows require structured screening for at-risk SBDH; however, the Department of Care Management and Patient Transitions estimates that this screening workflow may miss up to 85% of patients with at-risk SBDH. Therefore, the current workflow depends on (i) social work teams abstracting SBDH-relevant information across fragmented, unstructured charts from multiple encounters and often multiple years (akin to “finding a needle in a haystack”), searching for relevant resources that a patient is eligible for, and connecting patients to these social resources; and (ii) medical teams consulting with the social work teams to understand the social context, perhaps even conducting their own chart review, and adjusting medical care accordingly. Our vision is to drastically scale the delivery of whole-person care (i.e., addressing both social and medical needs in an integrated fashion) by building an AI-powered copilot that reduces clerical burden and facilitates targeted referrals and treatment plans concordant with patients' social and behavioral needs.

                          2.  How Might AI Help

The Whole Person Care Navigator builds upon UCSF Health’s existing commitment to addressing SBDH and complements recent efforts to address capacity challenges by identifying and prioritizing patients with social needs. Current SBDH workflows at UCSF are triggered by structured data elements summarized in electronic health record (EHR) tools or obtained through manual chart review. However, most SBDH information is embedded and scattered across unstructured data elements in highly nuanced notes from the current medical encounter and historical record. In addition, SBDH information is often uncovered during unstructured discussions and may not be fully documented due to time constraints. With the recent advancements in large language models, we can develop an AI tool to extract SBDH information so that social workers and healthcare providers can maximize their time providing direct patient care and trigger downstream workflows. The proposed tool will (i) generate concise, patient-specific summaries highlighting social risks, previous interventions, and care gaps from both current and historical encounters, thereby reducing cognitive load and saving time previously spent on manual chart reviews; (ii) extract structured data elements for use by downstream workflows, e.g., flagging high-risk patients experiencing significant social and behavioral adversity or finding patient contacts hidden in the notes; (iii) offer personalized recommendations for relevant hospital and community resources tailored to each patient’s current and historical social circumstances and prior referral experiences; (iv) provide clinicians with actionable guidance to adjust medical treatment plans according to identified social and behavioral needs, e.g., “Given the patient’s transportation barrier, a travel voucher may help the patient attend their follow-up appointments” or “Given the patient’s financial instability, consider a 90-day supply of medications provided on discharge”[5]; and (v) provide a structured data flag to the newly launched Early Screening for Discharge Planning (ESDP) and Prioritization Tool to enhance the Social Work structured data tool with the added power of AI. Given the versatility of LLMs, this tool will be easily customizable by end-users and across clinical settings.

Our team, in close collaboration with the ZSFG social work team, has already prototyped a tool with features (i) and (ii) using PHI-compliant GPT-4o with in-context learning and internal verification via self-judging. We will work with UCSF to integrate existing databases of local and regional social resources for feature (iii) and, for feature (iv), work with UCSF experts in social risk-informed care to develop safe and actionable suggestions for adjusting treatment in light of patients' social needs.
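A minimal sketch of the quote-grounding step behind the verification approach described above is shown below: each SBDH finding returned by the model carries a supporting quotation that must appear verbatim in the source note, so reviewers can quickly confirm the summary and hallucinated content is easy to spot. The finding structure is an assumption about the prototype's output format, not its actual schema.

# Minimal sketch of verifying that each extracted SBDH finding is grounded in the note text.
def verify_quotes(findings: list[dict], note_text: str) -> list[dict]:
    """Mark each finding as verified only if its supporting quote is found in the note."""
    normalized_note = " ".join(note_text.split()).lower()
    for finding in findings:
        quote = " ".join(finding.get("quote", "").split()).lower()
        finding["verified"] = bool(quote) and quote in normalized_note
    return findings

# Hypothetical example: the second quote is not in the note, so it fails verification.
note = "Pt reports sleeping in his car for the past month and missing doses of insulin."
findings = [
    {"domain": "housing", "quote": "sleeping in his car for the past month"},
    {"domain": "food insecurity", "quote": "has not eaten in two days"},
]
print(verify_quotes(findings, note))  # housing verified=True, food insecurity verified=False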

                          3.  How would an end-user find and use it?

                          The Whole Person Care Navigator decision support tool will be embedded within a custom APeX navigator as an interactive webpage. Social workers and hospital medicine teams will receive a daily updated summary of each patient’s SBDH, including referral recommendations and decision support for social risk-informed care. This summary will appear directly beneath the patient one-liner in the Epic Sidebar, enabling situational awareness during pre-rounding and facilitating integration into medical decision-making and multidisciplinary rounds. Additionally, a direct link to the custom navigator will be accessible within the "Social Drivers" section, located below the FindHelp link in APeX.  Finally, high risk flags for SBDH will be surfaced to providers via the Early Screening for Discharge Planning and Prioritization Tool.

                          4.  Embed a picture of what the AI tool might look like

                          Figure 1.  Anatomy of the proposed decision support tool seen embedded within a custom navigator within APeX.  An interactive demonstration of the Whole Person Care Navigator tool can be found at: https://zsfg-prospect.github.io/social-wayfinder/

                          Figure 2.  Example of a high risk flag generated by the Whole Person Care Navigator surfaced in the Early Screening for Discharge Planning and Prioritization Tool

                          5. What Are the Risks of AI Errors?

                          Potential risks from the Whole Person Care Navigator tool with our mitigation strategies are as follows:  (i) Hallucinations and inaccuracies: To mitigate this risk, all AI-generated summaries are accompanied with direct quotations from the clinical note that can be easily verified by the user. (ii) Unhelpful social-risk informed care suggestions: To mitigate this risk, the tool will be assessed on retrospective patient cases, where its recommended adjustments will be scored by social-risk informed care experts and healthcare providers as being unhelpful vs helpful; (iii) Inequitable performance: In addition to standard validation metrics[6] for AI-generated summaries, we will test the tool within protected subgroups to ensure equitable performance. We will adjust the AI algorithm if performance differences are found. (iv) Performance drift: Our group has been designing robust monitoring systems for LLM-based tools in collaboration with IMPACC to detect performance drift. We will apply our monitoring systems to this tool as well. We will then create avenues for end users to continuously provide feedback and create a standardized training program to onboard new users. 

                          6. How Will We Measure Success?

To validate the Whole Person Care Navigator tool, we will first conduct a pilot study focused on iterative improvement via rapid PDSA cycles with a small group of inpatient hospital medicine social workers and resident-led hospital medicine teams, and then a cluster randomized trial with social workers, medicine residents, and hospitalists across UCSF Health sites. The primary endpoints will be (i) time spent on chart review by social workers, as assessed through audit logs; (ii) whether the treatment plan was concordant with the patient's social needs, as assessed through a standardized rubric using clinical notes and referrals; and (iii) the CMS screening positivity rate measure (SBDH-2). All endpoint data are available within APeX. Secondary endpoints include the tool's utilization rate and user satisfaction. Reasons for discontinuation would include limited adoption, negative user feedback, or persistent inaccuracies in the tool's recommendations.

                          7. Describe your qualifications and commitment 

This project is co-led by Dr. Lucas Zier, MD, MS, an Associate Professor in the Division of Cardiology and DoC-IT and an AI researcher with clinical appointments at ZSFG and UCSF Health, and Dr. Jean Feng, PhD, Assistant Professor of Epidemiology and Biostatistics. They co-direct the PROSPECT Lab, which aims to improve health outcomes in underserved populations through AI-based tools and whose work has received national recognition, including the CAPH/SNI Quality Leaders Award and the Joint Commission Tyson Award for Excellence in Health Equity. Drs. Zier and Feng have developed and deployed five large-scale EHR-integrated decision support tools and led retrospective, prospective, and randomized trials at both ZSFG and UCSF Health assessing these tools. This project is supported and will be advised by Sarah Imershein, Senior Strategy Consultant for the Care Management and Patient Transitions Department; Natalia Kelley, Nurse Informaticist; and UCSF social work leads Meher Singh and Timothy Chamberlain.

8. Summary of Open Improvement Edits

                          We have made the following revisions to the proposal based on feedback and comments:

1. We have clarified that the Whole Person Care Navigator will draw upon internal UCSF SBDH resource databases to provide patient-specific recommendations.
2. We have added Figure 2 to demonstrate how high-risk flags for SBDH will be surfaced to providers via the Early Screening for Discharge Planning and Prioritization Tool.

Additional clarifications about the proposal in response to comments and questions:

1. We appreciate the feedback about adding qualitative interviews to assess facilitators and barriers to use of the tool. Our build process includes iterative human-centered design with subject matter experts in the development phase, and we have experience performing rapid qualitative interviews in the post-deployment/optimization phase. Therefore, we could include this both for providers using the tool and for patients.
2. We appreciate the question about the flexibility and customization of this tool. We agree that this could be expanded into the Emergency Department to assist with the management of opioid use disorder with small changes to the underlying prompts. We believe this flexibility and customization is a strength of this approach.
3. With regard to questions about end-user testing and the importance of this issue within SBDH workflows: we have been performing active user testing at ZSFG for over six months and began the process at UCSF this month. Our initial testing and interviews with subject matter experts suggest that social workers spend approximately 60 minutes per patient manually reviewing fragmented charts across multiple encounters to build their understanding of the patient's social and behavioral needs and treatment history. With an average daily caseload of 10 to 15 patients, this represents roughly 40 to 60 hours per week of highly skilled professional time spent on chart navigation and documentation rather than direct patient management, which significantly limits direct patient care. Additional time and effort are then needed to identify resources and make referrals. Because of these workflow limitations, we estimate that social workers are unable to participate in the care of 25% of patients on inpatient teams, many of whom are at high risk of adverse outcomes because of social and behavioral needs. Thus, this is a substantial problem. Through a human-centered, iterative design process, we believe we have developed a deployment strategy, outlined in our proposal, that complements their workflow, reduces cognitive burden, and improves efficiency.

                          References

                          [1] Magnan, S. (2017). Social Determinants of Health 101 for Health Care: Five Plus Five. NAM Perspectives. Discussion Paper, National Academy of Medicine, Washington, DC. https://doi.org/10.31478/201710c

                          [2] https://qpp.cms.gov/docs/QPP_quality_measure_specifications/CQM-Measures...

                          [3] https://www.sf.gov/news-mayor-lurie-unveils-breaking-the-cycle-vision-fo...

                          [4] https://www.findhelp.org/

                          [5] Gold, Rachel, et al. "Using electronic health record–based clinical decision support to provide social risk–informed care in community health centers: protocol for the design and assessment of a clinical decision support tool." JMIR Research Protocols 10.10 (2021): e31733.

                          [6] Yuan, Dong, et al. "Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM." arXiv preprint arXiv:2412.19906 (2024)

                           

                           

                           

                          Supporting Documents: 

                          TPS-Select: An Artificial Intelligence Approach to Guide Transitional Pain Service Referrals for UCSF Neuro-Spine Patients

                          Primary Author: Andrew Bishara
                          Proposal Status: 

                          Section 1. The UCSF Health Problem

                          While pain after spine surgery is expected to resolve within a few weeks, a substantial number of patients experience post-operative pain for over 6 months1. At UCSF Health, nearly 15% of patients need opioids three months after spine surgery—an indicator of this chronic post-surgical pain (CPSP) that adds significant burden to both the surgical care team and the broader health system. When poorly managed, CPSP after spine surgeries leads to new or escalated chronic opioid use2, higher costs of care3, and worse post-operative outcomes4. However, only a small fraction of these patients at UCSF have been evaluated by the pain service for either pre-surgical optimization or post-operative pain management.

To address this problem, the Pain Medicine Department launched a transitional pain service (TPS), a specialized program providing multidisciplinary perioperative care with a tailored combination of preoperative optimization, perioperative pain management planning, and ongoing support during the outpatient transition. Several other institutions have successfully implemented a TPS, resulting in significantly improved pain control and opioid use in targeted patient cohorts5-7. However, given the limited capacity of this engaged service, a better method to select high-risk patients is essential to maximize the efficacy of a TPS8.

                          Current perioperative guidelines provide simple selection criteria, such as the O-NET+ classification, that rely on a few indicators like psychiatric conditions or previous opioid use to guide patient selection9. Given the interplay of the various biopsychosocial factors that contribute to the development of CPSP, these criteria fall short in identifying vulnerable patients with modifiable risk factors. An intelligent selection mechanism is needed to help our intended end-users—UCSF neuro-spine and pain providers—easily evaluate and identify high-risk patients for referral to the UCSF TPS that can improve patient outcomes and reduce health system costs.

                          Section 2. How might AI help?

                          Given the constrained clinical capacity of a TPS, it is essential to identify patients with the highest likelihood of developing CPSP. Excessive enrollment of patients at low risk of developing CPSP can dilute TPS efficacy, allocating care to patients unlikely to benefit and diverting resources from patients with true CPSP. We believe that an artificial intelligence (AI) tool is the best approach to identify TPS patients as AI can automatically analyze multiple factors simultaneously and provide clinicians with a more comprehensive risk profile. While numerous machine-learning models have been developed to predict post-surgical opioid use and pain trajectories10, none have been rigorously evaluated for their ability to guide referrals to a TPS.

We previously retrospectively validated a decision-tree-based model using an eXtreme Gradient Boosting (XGBoost) algorithm on Neuro-Spine cases at UCSF from 2015 to 2023, evaluated at the pre-operative time of the surgical booking date. This model analyzes data pulled from UCSF Clarity, incorporating important features such as procedural details, prior medications, prior pain-related ICD diagnosis codes, and clinician prescribing patterns. The target population includes patients who were prescribed opioids between 30 and 180 days post-discharge and exhibited severe acute post-surgical pain trajectories, as determined by a simple linear regression classification of their in-patient pain score trajectory. This outcome definition was developed with pain physicians as the ideal TPS population, and we are now validating this rule-based criterion retrospectively against pain providers' clinical assessments of post-surgical pain syndrome. Using this outcome, the model showed superior performance at identifying patients with challenging postoperative pain control and opioid usage at the surgical booking date when compared to the O-NET+ tool from perioperative guidelines9 (Figure 1, ROC-AUC curve of the model). For comparison, the positive predictive value was 0.97 with our XGBoost model and 0.43 with the O-NET+ tool at the 0.68 threshold selected based on prospective validation results (Figure 5), highlighting our model's superior specificity in identifying true high-risk patients compared to broader, less discriminating traditional criteria. Using the AI Pilot Award, we aim to silently and prospectively validate this model to demonstrate that it better identifies high-risk Neuro-Spine patients for TPS referral in the clinical setting compared to traditional methods.
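For illustration, the threshold-based comparison described above can be expressed as a simple positive-predictive-value calculation; the sketch below contrasts model probabilities dichotomized at the 0.68 operating point with a generic rule-based flag standing in for O-NET+. The arrays are toy values, not the validation cohort.

# Sketch of PPV at a chosen operating point versus a rule-based flag; data are illustrative.
import numpy as np

def ppv(flags: np.ndarray, outcomes: np.ndarray) -> float:
    """Positive predictive value: fraction of flagged patients who develop the outcome."""
    flagged = flags.astype(bool)
    return float(outcomes[flagged].mean()) if flagged.any() else float("nan")

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # hypothetical CPSP labels
model_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.7, 0.75, 0.1])  # hypothetical model output
rule_flag = np.array([1, 1, 1, 0, 1, 1, 1, 0])                    # hypothetical O-NET+-style flags

print("model PPV at 0.68 threshold:", ppv(model_prob >= 0.68, y_true))
print("rule-based PPV:             ", ppv(rule_flag, y_true))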

After confirming that the accuracy of prospective validation matches the retrospective results, we plan to initiate an intervention using model-generated probabilities to refer Neuro-Spine patients to the previously established UCSF TPS, collaborating with UCSF pain physicians to conduct a pilot randomized controlled trial of an AI-driven TPS. The model is critical to this trial's success because a prior RCT of the TPS7 lacked power for long-term outcomes, largely due to inadequate patient identification using basic screening methods. In the trial, patients receiving the TPS intervention will be compared to standard-of-care controls matched on outcome probabilities predicted by the model. Assuming a conservative estimate that 50% of the patients identified by the model will develop post-surgical pain, a stratified power analysis with an alpha of 0.05, 80% power, and a 10% absolute event reduction indicates that 104 participants per arm are needed (Figure 2, implementation study design). Ultimately, we hope that this pilot randomized controlled trial of an AI-enabled TPS for UCSF Neuro-Spine patients will generate evidence of the clinical safety, equity, and benefit of our approach.

                          Section 3. How would an end-user find and use it?

The model will be embedded into the clinical workflow by running each night on all surgical cases booked that day for future surgical dates, using real-time data from UCSF's Epic Clarity instance. When a patient exceeds the risk threshold, an automated notification will be sent to both the neuro-spine surgical team and the TPS team. The exact notification method, whether via email, APeX in-basket message, secure chat, or phone call, will be determined in collaboration with the development team and through planned user testing. Regardless of the modality, the notification will deliver a TPS referral recommendation and support pre-surgical planning for patient optimization. As an interpretability component to promote clinician oversight, the notification will include a report generated by UCSF Versa GPT, a PHI-secure large language model (Figure 3, Versa explanations). The report explains the rationale behind the model's prediction using LLM-generated descriptions of the most important features for that patient. The surgical and TPS teams will review this information and act on its recommendation, specifically by initiating a referral to the TPS and coordinating perioperative pain management strategies. This low-friction, automated approach ensures the AI support is easily discoverable, actionable, and aligned with existing clinical workflows.
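A minimal sketch of this nightly batch job is shown below. The helper functions, threshold constant, and notification transport are hypothetical placeholders; the Versa GPT call in particular stands in for whatever explanation service is ultimately built.

```python
# Sketch of the nightly screening job described above, with hypothetical helper functions:
# pull the day's new surgical bookings from Clarity, score them, and notify the neuro-spine
# and TPS teams when the 0.68 threshold is exceeded.
THRESHOLD = 0.68

def nightly_tps_screen(model, clarity_conn, notify):
    bookings = fetch_todays_bookings(clarity_conn)            # hypothetical Clarity query
    features = build_booking_features(bookings)               # same features as model training
    bookings["cpsp_risk"] = model.predict_proba(features)[:, 1]

    for _, case in bookings[bookings["cpsp_risk"] >= THRESHOLD].iterrows():
        explanation = versa_explain(case)                      # PHI-secure LLM report (Figure 3)
        notify(teams=["neuro-spine", "TPS"],
               patient_id=case["patient_id"],
               message=f"CPSP risk {case['cpsp_risk']:.2f}: TPS referral recommended.\n{explanation}")
```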

                          Section 4. Embed a picture of what the AI tool might look like.

Figure 4. (A) APeX build; (B) interactive dashboard.

In APeX, we plan to have a simple interface that includes the predicted risk score along with the key features driving the prediction (Figure 4A). While this approach will support clinical decision-making around the patient's risk for CPSP, we ideally also envision an interactive, in-depth dashboard that would require more implementation resources (Figure 4B). This dashboard includes three core components: (1) a visual risk bar indicating the predicted CPSP risk level, (2) a direct-action link enabling clinicians to place an order for TPS referral when appropriate, and (3) a display of the top features influencing the prediction for each patient, explained to the clinician by the LLM-generated report, with the option to adjust key variables and re-run the model as needed.
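One way the per-patient "top features" panel could be derived is with SHAP values from the XGBoost model, which the LLM-generated report would then describe in plain language. The snippet below is a sketch under that assumption; variable names and the display format are not part of the proposal.

```python
# Sketch: per-patient top features for the dashboard (Figure 4) via SHAP values
# from the XGBoost model trained in the earlier sketch. Names are illustrative.
import shap

explainer = shap.TreeExplainer(model)            # model from the training sketch above
shap_values = explainer.shap_values(X_patient)   # one row of booking-date features

top = sorted(zip(X_patient.columns, shap_values[0]),
             key=lambda kv: abs(kv[1]), reverse=True)[:5]
for feature, contribution in top:
    print(f"{feature}: {contribution:+.3f}")     # signed contribution to predicted risk
```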

Section 5. What are the risks of AI errors?

Several types of AI errors could impact the effectiveness of our model and an AI-driven TPS. First, classification errors. Excessive false positives may dilute the TPS's impact by allocating resources to lower-risk patients, and false negatives may inappropriately reassure care teams, leading to undertreatment of pain and missed patient concerns. We plan to monitor month-to-month variations in model performance, identifying features or patient clusters that may contribute to suboptimal performance using our previously published toolkit of statistical methods11. To avoid confounding from the presently active TPS, we will also exclude TPS-referred patients from prospective validation; their low volume makes them unlikely to bias results. Second, inequities in model performance. Disparities in prediction accuracy across race, gender, or opioid history could limit equitable access to care. We will track performance metrics stratified by these groups and aim to keep PPV differences within 10%. If differences exceed this threshold, we will address them using the statistical toolkit described above. Third, poor calibration. A model that poorly balances opioid-naïve and opioid-experienced patients may reduce TPS effectiveness. Using the distribution of prediction scores in the prospective validation cohort, we can simulate month-to-month sampling to extrapolate how fluctuations in performance and cohort make-up will affect our likelihood of demonstrating a statistical benefit. Combined with the effectiveness observed in previous TPS trials, this simulation can inform a power analysis for a pilot TPS trial and its predicted efficacy (Figure 5, simulation).
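The subgroup-equity check described above could be as simple as the monthly stratified PPV comparison sketched below; the column names and the 10-percentage-point trigger are illustrative.

```python
# Sketch of the monthly fairness check: positive predictive value stratified by subgroup,
# flagging differences greater than 10 percentage points. Column names are hypothetical.
import pandas as pd

def monthly_ppv_by_group(scored: pd.DataFrame, group_col: str, threshold: float = 0.68) -> pd.Series:
    flagged = scored[scored["cpsp_risk"] >= threshold]
    return flagged.groupby(group_col)["confirmed_cpsp"].mean()   # PPV per subgroup

ppv = monthly_ppv_by_group(scored_cohort, "race_ethnicity")
if ppv.max() - ppv.min() > 0.10:
    print("PPV gap exceeds 10%; trigger review with the statistical toolkit (ref 11).")
```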

Section 6. How will we measure success?

                          We designed specific success criteria for the prospective validation of the model and the clinical trial. For model validation, measurements can be derived using data that is already being collected in APeX. The primary endpoint for clinical readiness is achieving a sensitivity of at least 0.4 while maintaining a PPV of at least 0.67. Currently, we far exceed the goal PPV and sensitivity at a threshold of 0.68, which allows us to flag a minimum of 8 patients per month. Secondary endpoints include: (1) chart review confirmation by the pain service of patients flagged as positive, (2) other traditional model performance metrics such as overall ROC-AUC and specificity, and (3) fairness measures, such as PPV differences across subgroups. We began prospective data collection in February 2025, and we will need AER and HIPAC support for future model implementation. Even in the absence of a formal trial, pain medicine clinicians may use the model’s predictions to inform care planning and patient optimization. In this case, we will also collect implementation metrics, such as the proportion of ML-identified high-risk patients referred to pain medicine. For the clinical trial of an AI-driven TPS, additional measurements will ideally need to be collected to evaluate success of the AI. The primary endpoint will be a reduction in incidence of post-surgical pain at 90 days by at least 10% among patients enrolled in the TPS compared to those receiving standard of care. Secondary outcomes include (1) patient-reported pain levels, (2) functional assessments, (3) opioid use (in MMEs), (4) healthcare costs at 30-, 60-, and 90-days post-discharge, and (5) clinician usage and feedback of the model’s APeX interface.

                          Section 7. Describe your qualifications and commitment:

                          Dr. Andrew Bishara, the project lead, has received a research career development (NIH K23) award to develop and implement AI models that predict a range of perioperative outcomes, with current efforts for validation and implementation underway at UCSF. Dr. Bishara will commit 10% of his time to supporting the development and implementation of this AI algorithm with the Health AI and AER teams. Dr. Christopher Abrecht, the co-lead, works closely with UCSF neurosurgery, seeing patients alongside them at the Spine Center and launching a preliminary TPS to support their patients. Dr. Abrecht is also committed to advising the implementation of the machine-learning model and supporting the development of an AI-driven TPS.

                          AI chatbot to improve treatment options counseling in Ob/Gyn

                          Proposal Status: 

                          1. The UCSF Health problem

Treatment plans in obstetrics and gynecology can contain numerous reasonable options. A good decision depends on communicating evidence-based risks and benefits in a manner the patient can understand. Patients may also need support clarifying their values, preferences, and risk tolerance as they make treatment decisions. This process is limited by clinic time constraints, language concordance, health literacy, and the ability of providers to integrate an evolving evidence base.1,2 Expanding access to high-quality in-clinic counseling is limited by a shortage of providers and limited appointment lengths, with an average wait time of one month to receive care for conditions that significantly impact quality of life.3,4

Patients have begun to turn to general-purpose large language models (LLMs) for medical information, but these are not validated in the clinical setting and cannot personalize guidance to the patient's own data.5 This underscores the value of developing a validated AI-enhanced counseling tool to augment human-led care.

                          2. How might AI help?

                          LLMs have shown immense promise in diagnostic and management reasoning tasks, and their ability to translate between languages and adjust reading level is well-suited to a patient-facing role.6–8 Retrieval-augmented generation from reliable sources has been shown to mitigate hallucinations and inaccuracies that can occur with general-purpose models.9

                          An LLM chatbot could tailor options counseling to the patient’s specific scenario and provide interactive, language and reading-level concordant counseling to augment in-clinic discussions. This chatbot would be integrated into UCSF MyChart and would use retrieval-augmented generation from the patient’s clinical data including history/symptomatology (now made more detailed and more accurate by ambient scribing), physical exam, lab and imaging studies, and provider assessment and plans, as well as high-quality literature and clinical guidelines. Context-rich language and reading-level concordant interactions between the patient and the chatbot would improve patient understanding of their clinical course and treatment options to enable truly informed treatment decisions. In addition, chatbot interactions can help patients clarify and document their values and preferences as they evolve with their understanding of their condition and options, to help patients and providers prepare for future clinical encounters.
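As one concrete illustration of the retrieval-augmented approach described above, a counseling prompt might be assembled as follows. This is a sketch, not a committed design; every function name is a hypothetical placeholder, and the grounding and escalation rules would be developed with clinical input.

```python
# Minimal sketch of assembling a retrieval-augmented prompt for the MyChart chatbot:
# patient-specific chart context plus retrieved guideline passages, delivered at the
# language and reading level the patient requests. Helper names are hypothetical.
def build_counseling_prompt(patient, question, language="English", reading_level="6th grade"):
    chart_context = summarize_chart(patient)                 # history, exam, labs, imaging, plan
    guideline_passages = retrieve_guidelines(question, k=5)  # vector search over curated literature

    return f"""You are a patient-counseling assistant. Answer in {language} at a {reading_level} reading level.
Use ONLY the patient context and cited guideline passages below; if the answer is not supported, say so
and recommend discussing with the care team. Do not make treatment decisions.

PATIENT CONTEXT:
{chart_context}

GUIDELINE PASSAGES (cite by number):
{guideline_passages}

PATIENT QUESTION:
{question}"""
```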

For example, a patient with abnormal uterine bleeding might be counseled on a variety of medical and surgical treatment options in clinic, then spend time interacting with the chatbot at home, in their own language and at their own reading level, to further review the risks and benefits in the context of their goals and come to a decision that could be communicated to the provider, who would then facilitate treatment. As the patient trials the therapy over the next 3 months, they can document their progress and any questions that come up about side effects; if these are not adequately answered by the chatbot, the patient can escalate to scheduling a clinic appointment. If the patient's medical conditions outside the field of Ob/Gyn evolve over time, introducing new relative or absolute contraindications to treatment options, these can be flagged in the chatbot to trigger a new discussion between the provider and patient.

                          This context-rich, longitudinal, patient-centered interaction, grounded in the relevant, curated literature would strongly augment brief and infrequent clinical interactions. This approach addresses the problem of high-quality language and reading-level concordant counseling without increasing clinical resource utilization. Changes in treatment plan would necessarily go through a provider, ensuring there is always a human-in-the-loop with the responsibility to ensure next steps are medically sound.

                          I propose piloting the chatbot on one or two clinical scenarios with discrete treatment options. However, a strength of my proposal is that this approach can be scaled to scenarios across Ob/Gyn and other specialties. Possible scenarios in Ob/Gyn include:

1)    Trial of labor after cesarean vs. planned repeat cesarean: Choosing between trial of labor after cesarean (TOLAC) and planned repeat cesarean requires in-depth discussion of the preferred birth experience and the risk of complications in the context of lifetime reproductive goals and tolerance for different types of risk.

                          2)    Abnormal uterine bleeding and pelvic pain: Patients with heavy, irregular, and/or painful menstrual periods can choose among several hormonal and non-hormonal medications as well as surgical procedures with varying contraindications, probability of success, and risk.

3)    Labor and delivery interventions: Labor and delivery decisions often must be made quickly to balance maternal and fetal well-being, by on-call providers who have sometimes just met the patient. Unexpected interventions, even with informed consent, can lead to birth trauma and distrust. A typical prenatal visit lasts 20 minutes or less and covers routine screening, management of acute pregnancy and non-pregnancy issues, and anticipatory guidance, making it challenging to proactively counsel patients on the wide variety of possible labor and delivery experiences.

                          4)    Menopause management: There are numerous hormonal and nonhormonal treatments for menopause symptoms that can have a profound impact on quality of life. The data on risks associated with these options is complex, making choosing appropriate, effective treatments a challenge. Patients with cancer-related or cardiovascular risk factors are often given a blanket recommendation to avoid treatment instead of making an informed decision that incorporates a nuanced understanding of risk.

Other scenarios that require similarly complex counseling include pregnancy options counseling, contraception counseling, and treatment options for pelvic floor disorders.

                          3. How would an end-user find and use it?

                          Providers would inform patients about the chatbot during in-clinic counseling. The chatbot could be activated by the provider or automatically triggered for specified diagnoses. The chatbot would be embedded in the MyChart app, which could prompt the patient to use the tool as part of the after-visit summary. Data from the chatbot about patient preferences and treatment progress could be fed into the patient’s chart to enable shared decision-making.

                          4. Embed a picture of what the AI tool might look like.

                          See the attached chat window mock-up and an example dialogue in Versa chat. Use of retrieval augmented generation from the medical record and literature (or integration with a model fine-tuned on medical literature such as OpenEvidence, which has been proposed in other projects for clinical decision support) would make this dialogue even more patient-centered and grounded in evidence.

                          5. What are the risks of AI errors?

The primary risk for this intervention is hallucination of inaccurate or inappropriate information for a specific patient (for example, suggesting an estrogen-containing contraceptive to a patient with risk factors for venous thromboembolic disease). This risk will be mitigated by: 1) personalizing counseling using patient-specific medical history data and retrieval-augmented generation from reputable literature and guidelines; 2) evaluation of pilot chatbot counseling sessions by expert clinicians prior to widespread rollout of this feature; and 3) maintaining a human-in-the-loop such that treatment decisions, prescriptions, and surgical plans are finalized with a human provider.
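Beyond the mitigations listed above, a deterministic safety layer could screen draft responses before they reach the patient. The sketch below is an assumption for illustration, not part of the proposal: a simple rule-based check that blocks an estrogen-related suggestion when the problem list carries VTE risk factors, independent of the LLM. Terms and helper names are hypothetical.

```python
# Illustrative rule-based guardrail run on draft chatbot responses; flagged drafts would
# be routed to provider review instead of being sent. Term lists are hypothetical.
VTE_RISK_TERMS = {"venous thromboembolism", "pulmonary embolism", "deep vein thrombosis",
                  "factor v leiden", "antiphospholipid syndrome"}

def flag_contraindicated_advice(draft_response: str, problem_list: list[str]) -> bool:
    text = draft_response.lower()
    mentions_estrogen = "estrogen" in text or "combined oral contraceptive" in text
    has_vte_risk = any(term in problem.lower() for problem in problem_list for term in VTE_RISK_TERMS)
    return mentions_estrogen and has_vte_risk    # True -> hold for provider review
```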

                          6. How will we measure success?

                          a. A list of measurements using data that is already being collected in APeX 

                          - Proportion of patients with supported medical conditions who are using the chatbot

                          - Comparison of treatments utilized between chatbot users and nonusers

- Comparisons of the number of clinical interactions (visits, clinician advice messages, phone calls) between users and nonusers of the chatbot

b. A list of other measurements you might ideally have to evaluate success of the AI

- Accuracy of chatbot advice given during the pilot phase as graded by expert clinicians. LLMs have also shown promise in automated, large-scale evaluation of LLM tools, and their use could be considered during the evaluation phase.10

                          - Patient satisfaction scores compared between chatbot users and nonusers

                          - Provider satisfaction and qualitative perspective on the utility of the chatbot

                          - Validated measures of decision-quality from pilot users

                          High quality counseling, high uptake by patients, patient/provider satisfaction, high decision quality, and efficient utilization of clinic resources would support continuing the project. Negative trends in any of these metrics could be grounds to abandon the project.

                          7. Describe your qualifications and commitment

                          Ammar Joudeh, MD: I am an Assistant Professor in the Department of Obstetrics, Gynecology, and Reproductive Sciences and an Ob/Gyn generalist clinician. I have expertise in prenatal care, labor and delivery care, and the medical and surgical management of benign gynecologic conditions. In addition, I have a track record of managing complex operational research and quality improvement initiatives in settings as diverse as rural Zambian health centers, Indian public schools, and California labor and delivery wards. As the quality improvement chief of my residency program, I focused on labor management and received an award from the midwives in my program for my commitment to patient-centered care. I believe every patient should have access to in-depth, empathetic, and personalized counseling, and I recognize that our time-constrained clinical environment challenges this ideal. I believe this chatbot would be the first of its kind to integrate personalized clinical data and robust evidence to augment patient counseling. I am eager to contribute part of my existing academic time, in addition to the time supported by the initiative, to developing this tool.

Miriam Kuppermann, PhD, MPH will utilize 2% of the 10% effort to co-lead this project. Dr. Kuppermann is a recognized expert in developing and evaluating informed decision-making support tools. She has led numerous intervention studies aimed at improving informed decision making in obstetrics and gynecology among racially, ethnically, and socioeconomically diverse populations of patients.11,12 In her own research, she has recognized the limitations of existing patient-facing decision support tools and the difficulty of designing a tool that can cater to the breadth of patient values, preferences, languages, and literacy levels while evolving with changing guidelines. The flexibility and adaptability that AI brings to this tool are a game changer, and Dr. Kuppermann's expertise in decision support interventions will be invaluable for both the design and evaluation of this tool.

                          8. Summary of open improvement edits:

In open improvement, I refined the proposal in collaboration with project co-lead Dr. Kuppermann, added Dr. Kuppermann's bio and a letter of support from my clinical division chief, added references from the literature that support the value of the proposed pilot, and responded to Dr. Pletcher's comment by adding an example patient dialogue using Versa Chat (see attachments).

                          References:

                          1.        Truong S, Foley OW, Fallah P, et al. Transcending Language Barriers in Obstetrics and Gynecology: A Critical Dimension for Health Equity. Obstetrics and Gynecology. 2023;142(4):809-817. doi:10.1097/AOG.0000000000005334

                          2.        Yahanda AT, Mozersky J. What’s the Role of Time in Shared Decision Making? AMA J Ethics. 2020;22(5):E416-E422. doi:10.1001/AMAJETHICS.2020.416

                          3.        Gravel K, Légaré F, Graham ID. Barriers and facilitators to implementing shared decision-making in clinical practice: A systematic review of health professionals’ perceptions. Implementation Science. 2006;1(1):1-12. doi:10.1186/1748-5908-1-16/TABLES/3

                          4.        Corbisiero MF, Tolbert B, Sanches M, et al. Medicaid coverage and access to obstetrics and gynecology subspecialists: findings from a national mystery caller study in the United States. Am J Obstet Gynecol. 2023;228(6):722.e1-722.e9. doi:10.1016/j.ajog.2023.03.004

                          5.        Why ChatGPT Has a Better Bedside Manner Than Your Doctor - Bloomberg. Accessed April 21, 2025. https://www.bloomberg.com/news/articles/2025-04-11/why-chatgpt-has-a-bet...

                          6.        Goh E, Gallo RJ, Strong E, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med. 2025;31(4):1233-1238. doi:10.1038/s41591-024-03456-y

                          7.        Brodeur PG, Buckley TA, Kanjee Z, et al. Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician.

8.        Dzuali F, Seiger K, Novoa R, et al. ChatGPT May Improve Access to Language-Concordant Care for Patients With Non–English Language Preferences. JMIR Med Educ. 2024;10:e51435. doi:10.2196/51435

                          9.        Zakka C, Shad R, Chaurasia A, et al. Almanac-Retrieval-Augmented Language Models for Clinical Medicine. Published online 2024. doi:10.1056/AIoa2300068

                          10.      Johri S, Jeong J, Tran BA, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med. 2025;31(1):77-86. doi:10.1038/s41591-024-03328-5

                          11.      Dehlendorf C, Fitzpatrick J, Fox E, et al. Cluster randomized trial of a patient-centered contraceptive decision support tool, My Birth Control. Am J Obstet Gynecol. 2019;220(6):565.e1-565.e12. doi:10.1016/J.AJOG.2019.02.015

                          12.      Kuppermann M, Kaimal AJ, Blat C, et al. Effect of a Patient-Centered Decision Support Tool on Rates of Trial of Labor After Previous Cesarean Delivery: The PROCEED Randomized Clinical Trial. JAMA. 2020;323(21):2151-2159. doi:10.1001/JAMA.2020.5952

                          Integrating OpenEvidence LLM into UCSF Epic for Evidence-Based Clinical Decision Support

                          Proposal Status: 

                          1. The UCSF Health problem

                          Large language models (LLMs) have rapidly emerged as powerful tools in clinical medicine, offering advanced capabilities in information retrieval, decision support, and diagnostic reasoning. Many clinicians (including at UCSF) are already using publicly available LLMs (ChatGPT, Claude, OpenEvidence) for clinical queries. Beyond challenges related to security and patient privacy, because the LLM does not interface directly with patient data, clinicians cannot easily obtain fully contextualized, patient-specific responses (taking into account age, past medical history, medications, etc.). The lack of an integrated solution means clinicians must manually enter curated data into an LLM to achieve the best results – a process with significant error risk.

Moreover, ensuring the reliability of AI-generated responses to clinical queries is paramount: mechanisms that ensure transparency of information sources are critical because they allow for physician oversight. Clinical AI tools must provide evidence-backed answers with clear citations to ensure human-in-the-loop medical decision making. With the rapid adoption of LLMs, UCSF clinicians will soon require an AI tool that delivers trustworthy, patient-tailored evidence within their workflow. OpenEvidence is an AI tool designed and trained specifically to deliver evidence-based answers and is already used widely in the UCSF and broader medical community (reportedly by over 20% of U.S. doctors).

                          2. How might AI help?

                          We see an opportunity to integrate OpenEvidence within UCSF’s clinical environment to enhance its utility. We propose a light integration via a SMART on FHIR application that can securely access and extract relevant structured patient data (e.g., demographics, active medications, comorbidities, allergies) from the EHR. By automatically incorporating such context, an in-EHR OpenEvidence tool could deliver more precise and personalized real-time clinical support, reducing clinician burden and minimizing the risk of context omissions. For example, when a hospitalist asks a question (such as, “What is the optimal anticoagulation strategy for a patient with a DVT?”), the system would access structured data like the patient’s demographics (age, sex), active problem list (such as history of GI bleed), current medications, lab results, and other key context via the FHIR API. Using these details, the AI can interpret the question in light of the specific patient (e.g. “78-year-old female with atrial fibrillation, chronic kidney disease stage 3, on metformin…”). This context inclusion helps the LLM formulate an answer that is relevant to the individual patient, rather than a one-size-fits-all response and reduces reliance on the clinician to know every covariate that would change or contextualize the specific question at hand.

                          For this pilot, the app will pull key structured fields (patient demographics, medication list, allergy list, active diagnoses, and possibly recent vital signs or lab values) but not free-text notes, to limit complexity. With these inputs, OpenEvidence’s backend LLM can retrieve matching evidence (guidelines, clinical trial data, etc.) and generate an evidence-based answer. Crucially, the output will include transparent citations. By having the specific patient context, the AI can filter out irrelevant information and highlight applicable evidence. We believe this integration would produce outputs that are highly specific and grounded in evidence, closing the gap between raw EHR data and medical knowledge at the bedside.
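A sketch of that "light integration" data pull is shown below. The base URL, token handling, and exact field selection are placeholders; the real build would follow UCSF AI/APeX team guidance and the SMART on FHIR authorization flow, which is omitted here.

```python
# Sketch: retrieving the structured fields named above through standard FHIR REST queries
# after a SMART on FHIR launch. Endpoint and parameters are illustrative, not a committed design.
import requests

def fetch_patient_context(fhir_base: str, patient_id: str, token: str) -> dict:
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/fhir+json"}

    def get(resource, params):
        r = requests.get(f"{fhir_base}/{resource}", params=params, headers=headers, timeout=10)
        r.raise_for_status()
        return r.json()

    return {
        "demographics": get(f"Patient/{patient_id}", {}),
        "conditions":   get("Condition", {"patient": patient_id, "clinical-status": "active"}),
        "medications":  get("MedicationRequest", {"patient": patient_id, "status": "active"}),
        "allergies":    get("AllergyIntolerance", {"patient": patient_id}),
        "recent_labs":  get("Observation", {"patient": patient_id, "category": "laboratory",
                                            "_sort": "-date", "_count": "20"}),
    }
```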

                          3. How would an end-user find and use it?

Our primary end-users during this pilot will be inpatient hospitalists – specifically those on the UCSF Goldman service (attending-only) and the Medicine Teaching service (attending physicians and residents on the general medicine wards). We plan to deploy OpenEvidence as an embedded SMART on FHIR app, either within Epic or externally, depending on the UCSF AI teams' recommendations (the OpenEvidence engineering team is willing to facilitate either). The interface would include a list of all patients on a user's service list and allow selection of one patient at a time. Once launched, the user interface would be straightforward and familiar. The clinician is presented with a text input field labeled something like "Ask a clinical question about this patient." They might type, for instance, "Does this patient's diabetes and heart failure change the target hemoglobin for transfusion?" and hit submit (Figure 1; Section 4 below). The answer would consist of a few paragraphs of explanation, written in concise clinical language, and will include small reference superscripts or bracketed numbers corresponding to citations. The UI will prominently show these citations (e.g., a sidebar or footnote list of reference titles), as well as the ability to mark the accuracy of both the answer and the provided citations.

                           4. Concept figure

                           

                          5. What are the risks of AI errors?

                          No AI is perfect – especially in high-stakes clinical settings – so we have identified several risks and failure modes, along with mitigation strategies for each:

                          • Overfitting to unique patient context: Evidence from the published literature is limited and often much simpler than highly complex clinical practice. If fed full patient chart information, the LLM might give an answer that overemphasizes those idiosyncratic details (e.g. focusing on a rare comorbidity that isn’t relevant to the question). This could lead to answers that are too narrow or not generalizable. Mitigation: We will work with the OpenEvidence team during initial testing to make sure it uses patient context appropriately. If a pattern of over-narrow answers is seen, we can adjust the algorithm to find a more appropriate balance.
                          • Incorrect answers or reasoning errors: The LLM could get the answer wrong, or even if an answer is correct, it could be derived from a suboptimal citation. It’s possible that integrating complex patient data may lead to either inaccuracies or suboptimal citation retrieval compared to what OpenEvidence is currently trained to do. Mitigation: We plan a rigorous accuracy review process. In the pilot, every answer the system gives will be logged, and a sample will be reviewed by a team of UCSF hospitalists for clinical correctness. We will track the percentage of answers with significant errors. If above an acceptable threshold, we will iterate on the model (for instance, by providing more training examples or adding rule-based checks for certain high-risk questions). Users will also have an easy way to report a “bad answer” from within the UI, which will alert our team to review that case. Furthermore, the tool will carry a disclaimer reinforcing that it is an assistant – final decisions rest on the clinician. Encouraging users to verify critical suggestions via the citations is another safety check (and one of the reasons we insist on transparent sources).

                          6. How will we measure success?

                          We will define success using both quantitative metrics from system logs and qualitative/clinical outcomes. Our evaluation will include:

                          1. Metrics from existing APeX data: Because the SMART app is launched within Epic (APeX), we can leverage backend analytics to capture usage patterns. Key metrics will include: the number of times the OpenEvidence app is opened (overall and per user per shift), the timing and duration of use, and the number of questions posed. We will also track how often users click on citation links to view more details – a proxy for how much they trust but verify the content. If possible, we will also examine workflow integration metrics such as: do clinicians tend to open the app on certain patients more (perhaps sicker patients or those on teaching teams) and at what points in the day (e.g. spikes during morning rounds)? A successful integration would show consistent usage (e.g. most hospitalists using it daily) and repeat usage by the same clinicians, indicating they found it valuable enough to come back.

2. Additional ideal measurements: We will complement the usage data with direct user feedback and clinical outcome assessments. This will involve:

                          • User surveys and interviews: We will administer brief surveys to the hospitalists and residents after they’ve used the tool for a few weeks. These surveys will gauge perceived impact on decision-making confidence, time saved, and trust in the AI’s recommendations.
                          • Comparative analysis (control groups): Ideally, we will design the pilot to include a comparison between different modes of information access:
                            • OpenEvidence with EHR context – the full integrated tool as described.
                            • OpenEvidence without EHR access – e.g. using the OpenEvidence app or web without patient context (so the clinician can ask questions but must manually input any patient details).
                            • UpToDate (or usual care) – clinicians relying on standard resources like UpToDate, guidelines, or consults, but not using OpenEvidence at all.

                          We can assign different ward teams or time blocks to different strategies and examine differences. Success would be reflected if the teams with EHR-integrated OpenEvidence report better qualitative outcomes than either control group.
                           

                          7. Describe your qualifications and commitment

This project will be led by a UCSF faculty hospitalist with significant clinical and operational experience. Dr. Peter Barish is an Associate Clinical Professor of Medicine and the Medicine Clerkship and Medicine Acting Internship director at UCSF St. Mary's Hospital. Dr. Barish has a long-standing interest in clinical excellence; he co-leads a biweekly DHM conference, "Cases and Conundrums," and publishes a monthly newsletter (The CKC Dispatch) on current Hospital Medicine literature. He is also the UCSF co-site-lead of the national research study Achieving Diagnostic Excellence Through Prevention and Teamwork (ADEPT), where he and colleagues in DHM are currently investigating the use of LLMs to better understand and prevent diagnostic error.

He will be supported by Dr. Travis Zack, an assistant professor in the Division of Hematology/Oncology, an affiliate of the Department of Clinical Informatics and Technology (DoC-IT) and the Computational Precision Health Program (CPH), and the Senior Medical Advisor for OpenEvidence. Additionally, we have had discussions with OpenEvidence cofounder Zack Ziegler, who has enthusiastically expressed support for this project and will provide the engineers and technical expertise required to build the SMART on FHIR application here at UCSF (this would be the first integration of OpenEvidence within an EHR).

                          AI-Enhanced Clinical Decision Support Tool to Personalize HIV Treatment and Address Care Delivery Gaps

                          Proposal Status: 

Section 1. The UCSF Health Problem. Despite major advancements in antiretroviral therapy (ART), challenges remain for people with HIV (PWH) who are sub-optimally engaged in care due to treatment misalignment, stigma, unstable housing, mental health needs, and other barriers that compromise adherence.1,2 Clinical decisions often follow standardized protocols, overlooking patient preferences and psychosocial factors that are typically buried in unstructured electronic health record (EHR) notes. This gap impedes the delivery of personalized HIV care and contributes to persistent disparities in treatment engagement and viral suppression, particularly among underserved populations. Previous approaches to improving care engagement at UCSF have leveraged structured data to flag patients at risk for non-adherence, but such models offer little insight into why disengagement is occurring or how to respond effectively. We propose a novel clinical decision support (CDS) tool that combines natural language processing (NLP) of unstructured clinical notes with real-time visualization of key structured data to support individualized treatment planning for PWH. The tool will extract and display patient-stated treatment preferences and barriers, overlaid with ART history, resistance, lab trends, adherence, and engagement, within a collapsible interface. End-users will be HIV and primary care providers at UCSF and affiliated sites, who need timely, actionable insights to tailor care for the patients who need it most. By highlighting this information at the point of care, the tool aims to transform hidden insights into proactive, equitable, and patient-centered HIV care.

Section 2. How might AI help? Artificial intelligence and NLP can highlight patient-centered insights buried in unstructured EHR notes.4 While structured EHR data can flag PWH at risk for non-adherence, it rarely reveals why engagement is faltering or how best to intervene in a patient-specific, actionable way.5 Predictive models offer promise in identifying insights into retention in care,6 and our approach will use NLP to extract tailored treatment preferences, psychosocial barriers, and care facilitators from both provider documentation and electronic patient intake and feedback forms (e.g., treatment delivery preferences, concerns about stigma or side effects, housing needs, food support, or transportation assistance). These extracted insights will be integrated into a provider-facing CDS tool that also synthesizes structured data elements critical to HIV care. By combining intelligent data extraction with thoughtful interface design, this tool will empower providers to deliver personalized and equity-informed care to PWH, particularly those who are sub-optimally engaged.
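One plausible shape for the extraction step, consistent with the extraction-only constraint described in Section 5, is sketched below. The prompt schema, the PHI-secure LLM call, and the verbatim-span filter are assumptions for illustration, not a committed design.

```python
# Sketch of extraction of preferences, barriers, and facilitators from a clinical note.
# Outputs are constrained to verbatim spans so the interface can link each insight back
# to its source note and show a confidence score. Endpoint and schema are illustrative.
import json

EXTRACTION_PROMPT = """Extract patient-stated treatment preferences, psychosocial barriers,
and care facilitators from the note below. Return a JSON list; each item must be:
{"category": "preference|barrier|facilitator", "text": "<verbatim span from the note>",
 "confidence": <0-1>}. Do not paraphrase and do not infer anything not written in the note.

NOTE:
"""

def extract_insights(note_text: str, note_id: str, llm_complete) -> list[dict]:
    raw = llm_complete(EXTRACTION_PROMPT + note_text)          # hypothetical PHI-secure LLM call
    items = json.loads(raw)
    for item in items:
        item["source_note_id"] = note_id                       # link every insight to its source
    return [i for i in items if i["text"] in note_text]        # keep only verbatim spans
```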

                          Section 3. How would an end-user find and use it? The AI-enabled CDS tool will be embedded within the provider-facing EHR interface and designed for ease of discovery and use during routine workflows. It will be most valuable during pre-visit planning, in-visit decision-making, and multidisciplinary case reviews—particularly when caring for PWH who are sub-optimally engaged or have a detectable viral load. The tool will appear as a collapsible sidebar or within a dedicated clinical tab, requiring no change to existing documentation practices. When opened, the interface will display key structured and unstructured patient insights. NLP-derived summaries of treatment and care delivery preferences, barriers (e.g., stigma, confidentiality concerns, mental health, unstable housing), and facilitators (e.g., mobile clinic access, social work support) will appear alongside structured data visualizations such as an ART medication history timeline (with individual drugs and resistance overlays), a hoverable tooltip summarizing the current regimen and resistance concordance, HIV viral load and CD4 trends, a refill-based adherence timeline, appointment adherence percentage, and a timestamped record of counseling on comprehensive support services. High-yield features such as a traffic light–style engagement risk score and a follow-up alert for patients without upcoming appointments provide rapid, actionable context. Each AI-derived item will include a confidence score and a link to its original source (e.g., provider note, intake form, lab result). Providers will be able to validate, dismiss, or amend insights with a single click; trigger referrals; or initiate preference-aligned ART discussions directly from the interface. Hover tools and minimal clicks keep the CDS intuitive—not disruptive. Designed around principles of efficient interpretability and low cognitive burden, the tool transforms traditionally siloed insights into a unified, actionable view that enables providers to deliver more personalized, responsive, and equitable HIV care.

                          Section 4. Embed a picture of what the AI tool might look like. See Supplement Figure 1. CDS Tool Example

                          Section 5. What are the risks of AI errors?  This project involves several potential risks related to AI-driven summarization and data integration, including false negatives (missed insights), false positives (misinterpreted or overstated insights), hallucinations (unsupported or fabricated content), and inconsistent summary performance across populations.7 The AI may fail to detect or highlight key patient preferences—such as a desire for long-acting ART or a transportation barrier—limiting personalization. It may also misread clinical context (e.g., flag outdated issues or produce summaries not grounded in documentation). There is also a risk that the model underperforms for certain populations, particularly those with sparse or non-standard documentation styles, which could reinforce existing health disparities.8 To mitigate these risks, we will take a multi-layered approach. First, the model will be designed as an extraction (not generative) tool, constrained to surface directly documented concepts with confidence scoring and source-linked outputs for transparency. Second, we will develop and test the model against a gold-standard annotated dataset of clinical notes, allowing us to benchmark performance using recall, precision, and false discovery metrics. Manual audits and provider/patient validation and feedback during the pilot phase will enable real-time refinement. Third, performance will be evaluated across demographic and clinical subgroups (e.g., race/ethnicity, gender identity, language) to identify and address equity gaps. Finally, the interface will be designed to support human review and easy correction of outputs, keeping the clinician-patient partnership in control at all times. This layered strategy ensures the tool remains safe, transparent, and equitable.
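The benchmarking against a gold-standard annotated dataset mentioned above could be computed as in the sketch below; the tuple representation of an annotation is an assumption made for illustration.

```python
# Sketch: precision, recall, F1, and false discovery rate against gold-standard annotations.
# Each annotation is represented as a (note_id, category, normalized_span) tuple (assumed format).
def benchmark(predicted: set, gold: set) -> dict:
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_discovery_rate": 1 - precision if (tp + fp) else 0.0}
```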

Section 6. How will we measure success? We will assess success using an evaluation approach informed by the RE-AIM framework, focusing on adoption (provider usage), effectiveness (impact on clinical decisions and patient outcomes), and maintenance (potential for integration into long-term clinical workflows).9 We will compare results pre- and post-implementation, track variation across provider groups and patient subpopulations, and incorporate both quantitative and qualitative metrics. If the tool demonstrates high clinical utility and provider trust, we will pursue further scaling within UCSF Health. If adoption or impact is limited, we will revise or discontinue implementation. Data measurements already collected in APeX include: 1) CDS tool interaction logs: Frequency of tool access, hover expansions, link clicks, and edits to summaries; 2) Documentation changes: Increase in capture of preferences, barriers, and facilitators post-implementation; 3) ART regimen adjustments: Number of preference-aligned regimen changes made after tool use; 4) Referral patterns: Uptake of case management, behavioral health, and social services; 5) Appointment engagement: Change in visit attendance and no-show rates among flagged patients; 6) Clinical outcomes: Viral suppression among sub-optimally engaged patients over time; 7) Provider usage patterns: Tool usage by role (e.g., HIV PCPs, ID specialists, case managers). Additional ideal measurements to evaluate success include: 1) Provider satisfaction and utility (via surveys/interviews); 2) Time saved: Reduction in pre-visit preparation or chart review time; 3) Accuracy and trustworthiness: Validated through clinician review and audit of AI outputs; 4) Equity analysis: Tool performance and outcome disparities by race, ethnicity, language, or insurance status; 5) Patient-reported experience: Patient perspectives on care personalization, engagement, and satisfaction; 6) Annotation benchmarking: Recall, precision, and F1 scores against a gold-standard annotated dataset.

                          Section 7. Describe your qualifications and commitment: This project will be co-led by Jose Gutierrez, PhD, FNP-BC, AAHIVS, Malcolm John, MD, and Michelle Cohen, FNP-BC, MPH, AAHIVS— an interdisciplinary group of UCSF clinician-researchers with complementary expertise in HIV care, health equity, informatics, and implementation science. Dr. Gutierrez is an Assistant Professor in the UCSF Department of Family Health Care Nursing. His research focuses on understanding patient preferences to develop AI-enabled decision support tools for personalized HIV treatment. He is also co-founder of the UCSF School of Nursing Artificial Intelligence Interest Group and serves on the UCSF CRIO Advisory Council. Dr. Malcolm John is a Professor of Medicine and Director of the UCSF Health Black Health Initiative. A long-standing leader in HIV care and health equity, Dr. John brings deep expertise in operational innovation and culturally responsive care for marginalized communities to shape system-level strategies to reduce disparities and improve care engagement across UCSF Health. Michelle Cohen is a family nurse practitioner and HIV specialist at both the UCSF 360 Positive Care Center and UCSF Women’s HIV Program. In addition to her HIV clinical knowledge, Michelle brings prior experience in global HIV prevention advocacy with AVAC and frontline HIV clinical research, including the landmark iPrEx PrEP study in Peru and HIV cure studies based at ZSFG. Together, they bring expertise in HIV care, informatics, and system-level implementation. All team members are committed to active participation in regular work-in-progress meetings and ongoing collaboration with the UCSF Health AI and AER teams throughout the project year to ensure rigorous, real-world integration of the CDS tool.

                          References

                          1.             Marcus R, Tie Y, Dasgupta S, et al. Characteristics of Adults With Diagnosed HIV Who Experienced Housing Instability: Findings From the Centers for Disease Control and Prevention Medical Monitoring Project, United States, 2018. J Assoc Nurses AIDS Care 2022; 33(3): 283-94.

                          2.             Rooks-Peck CR, Adegbite AH, Wichser ME, et al. Mental health and retention in HIV care: A systematic review and meta-analysis. Health Psychol 2018; 37(6): 574-85.

                          3.             Shade SB, Marseille E, Kirby V, et al. Health information technology interventions and engagement in HIV care and achievement of viral suppression in publicly funded settings in the US: A cost-effectiveness analysis. PLOS Medicine 2021; 18(4): e1003389.

                          4.             Patra BG, Sharma MM, Vekaria V, et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J Am Med Inform Assoc 2021; 28(12): 2716-27.

                          5.             Ridgway JP, Uvin A, Schmitt J, et al. Natural Language Processing of Clinical Notes to Identify Mental Illness and Substance Use Among People Living with HIV: Retrospective Cohort Study. JMIR Med Inform 2021; 9(3): e23456.

                          6.             Oliwa T, Furner B, Schmitt J, Schneider J, Ridgway JP. Development of a predictive model for retention in HIV care using natural language processing of clinical notes. J Am Med Inform Assoc 2021; 28(1): 104-12.

                          7.             Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language Generation; 2022.

                          8.             Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366(6464): 447-53.

                          9.             Glasgow RE, Vogt TM, Boles SM. Evaluating the public health impact of health promotion interventions: the RE-AIM framework. Am J Public Health 1999; 89(9): 1322-7.

                           

                          Leveraging AI for patient-centered post-discharge checklists

                          Proposal Status: 

                          The UCSF Health problem: Hospital discharges are among the most vulnerable transitions in patient care, where errors and miscommunication can lead to missed follow-up, patient harm, and readmissions. National data has shown that nearly 1-in-5 Medicare patients are readmitted within 30 days of hospital discharge, often due to missed appointments, unfilled prescriptions, or unrecognized clinical deterioration (Jencks et al., NEJM 2009;360:1418-28). Effective discharge instructions are essential for patient understanding and self-management post-hospitalization. However, research indicates that discharge summaries often lack key components, such as clear communication of pending follow-up tasks and pending test results, which are vital for the continuity of care (Chatterton et al., J Gen Intern Med. 2024;39(8):1438-1443 and Al-Damluju et al., Circ Cardiovasc Qual Outcomes. 2015;8(1):77-86). Although discharge summaries, after-visit summary (AVS) instructions and follow-up plans are documented, UCSF currently lacks a scalable, AI-powered solution to ensure patients adhere to these instructions and complete their post-discharge tasks, leading to potential gaps in post-discharge care.

                          We propose an AI-powered post-discharge checklist generator that:

1. Parses discharge AVS instructions, discharge summaries, and the most recent consultant notes using a large language model (LLM) to identify actionable tasks for both the patient and the provider team. The goal includes cross-referencing these tasks with the standard of care to reduce inconsistencies and omissions.
                          2. Generates a real-time post-discharge checklist to ensure structured follow-up actions are clearly tracked within APeX.
                          3. Shares post-discharge checklist updates with CTOP (Care Transitions Outreach Program) and outpatient providers to improve care continuity for recently discharged patients.
                          4. Facilitates communication with the patient directly via both automated outreach and CTOP-specific communication to improve adherence to follow-up instructions.

Previous attempts using electronic health record (EHR) tools or manual discharge audits have failed to scale due to reliance on provider team memory or incomplete tracking. Our solution will augment–not replace–existing discharge processes by introducing structure and accountability, and ultimately aims to enhance patient satisfaction and safety, reduce readmissions, and improve coordination between hospital and outpatient teams.

                          How might AI help?  Our approach will leverage LLMs and EHR data to parse discharge AVS instructions and discharge summaries to extract follow-up tasks, and automatically generate structured checklists to streamline discharge follow-up and improve care coordination. The system will:

1. Parse discharge AVS instructions, discharge summaries, and the most recent consultant notes using natural language processing (NLP) to identify tasks (e.g., attend follow-up appointments, pick up medications, complete labs); see the sketch after this list.
                          2. Track completion of the “discharge checklist” by querying the EHR for relevant data (e.g., appointment attendance, lab or radiology orders, confirmation of prescription pick-up–where available)
3. Provide automated real-time reports to CTOP and care coordination teams (discharging providers and primary care providers with access to APeX can opt in for their own auditing and quality improvement), to prompt targeted patient outreach when the checklist is incomplete and outstanding tasks remain.
4. Future extensions include customizable patient nudges via interactive patient-facing chatbot support, as well as augmenting LLM-generated discharge summaries for outpatient providers and LLM-generated AVS summaries (see separate proposals submitted by the Division of Hospital Medicine).
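The sketch below (referenced in step 1 of the list above) illustrates one way the parsing step could produce a structured checklist that APeX can then track. The prompt schema and the llm() call are placeholders for whatever LLM service is ultimately used, not a committed design.

```python
# Sketch: turn AVS instructions, the discharge summary, and recent consultant notes into a
# structured checklist. Schema and helper names are illustrative placeholders.
import json

CHECKLIST_PROMPT = """From the discharge documents below, list every actionable post-discharge
task. Return JSON: [{"task": "...", "owner": "patient|provider", "due": "YYYY-MM-DD or null",
"source": "AVS|discharge_summary|consult_note"}]. Include only tasks explicitly documented."""

def generate_checklist(avs_text: str, discharge_summary: str, consult_notes: str, llm) -> list:
    documents = (f"AVS:\n{avs_text}\n\nDISCHARGE SUMMARY:\n{discharge_summary}\n\n"
                 f"CONSULT NOTES:\n{consult_notes}")
    tasks = json.loads(llm(CHECKLIST_PROMPT + "\n\n" + documents))
    # Tracking (step 2) would then query APeX for appointment attendance, orders, and fills
    # to mark each task complete or prompt CTOP outreach for anything still outstanding.
    return tasks
```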

                          How would an end-user find and use it? Discharging physicians and interprofessional care team members can view the AI-generated post-discharge checklist embedded in the APeX workflow in the discharge documentation interface while completing their discharge summary. This real-time post-discharge checklist generation will:

                          ●      Help discharging physicians identify missing or unclear discharge instructions from pre-existing primary team and consultant notes.

                          ●      Cross-reference the AI-generated post-discharge checklist with the AVS to ensure consistency and completeness.

●      Both outpatient primary care and the CTOP team will have real-time access to the updated post-discharge checklist. Key features include:

                          ●      Customizable patient tracking lists

                          ●      Ongoing monitoring: daily tracking of discharge checklist completion will be semi-automated (AI updates the checklist at least daily; secondary layer of confirmation by human users)

What the AI tool might look like. In our initial pilot, we envision adding a toggle button in the “Hospital Medicine SVC” contexts for the Parnassus campus. The initial mock-up of this function is shown below with possible examples of AI-generated post-discharge checklist tasks:

                          Once the post-discharge tasks are confirmed by the discharging provider, we envision these tasks auto-populating via SmartLink into the discharge summary. Additionally, these post-discharge tasks will be added to the “To Do” column in the “PARN - 24h D/C” and “PARN - 4d D/C” System Lists, to ensure appropriate monitoring for the outpatient team and CTOP.

                            

What are the risks of AI errors? Our AI-generated post-discharge checklist is designed as an augmentative support system rather than a provider of clinical guidance. Its primary role is to streamline best-practice, patient-centered communication after discharge. The post-discharge checklists will be available in real time, allowing providers to review, validate, and identify potential discrepancies, including:

                          ●      Ambiguous language in discharge summaries

                          ●      Inconsistencies between discharge summaries and the AVS

                          Potential AI Errors and Their Impact

                          1. False Positives – the LLM may hallucinate or incorrectly interpret text in the discharge summary or AVS as a follow-up task when no such action is actually needed; if not verified by the discharging provider, this can lead to unnecessary CTOP outreach or patient notifications.
2. False Negatives – the LLM may overlook tasks described in vague or ambiguous text in the discharge summary or AVS; if not verified by the discharging provider, this could create additional confusion for both the patient and outpatient providers.

                          Error Mitigation Strategy

                          ●      Provider oversight: real-time visibility allows discharging physicians to validate AI-generated post-discharge checklists and address discrepancies by editing prior to the patient’s discharge.

●      Manual audits and feedback loops: review will help identify systematic errors and refine AI model performance.

●      Task prioritization: the AI is anticipated to adapt over time, prioritizing high-value checklist items (e.g., follow-up visit attendance) to improve clinical relevance.

                          How will we measure success? Once implemented, the tool’s effectiveness will be assessed by comparing pre- and post-implementation data, focusing on:

                          ●      Clarity and accuracy of checklist documentation via manual review

                          ●      Improvements in checklist completion rates compared to baseline risk model

                          ●      Impact on readmission rate compared to baseline risk model.

                          Qualifications and commitment: Seth Blumberg and Prashant Patel are attending hospitalists. Seth runs a research group focused on computational modeling in medicine, including modeling the clinical progression of hospitalized patients using EHR data. Prashant is a physician-educator with experience improving provider experience and interprofessional team collaboration to improve the efficiency of patient discharges.

                          Supporting Documents: