Artificial Intelligence / Machine Learning Demonstration Projects 2025

Crowdsourcing ideas to bring advances in data science, machine learning, and artificial intelligence into real-world clinical practice.

Automated Knee Osteoarthritis Grading Decision Support Tool

Primary Author: Yuntong Ma
Proposal Status: 

The UCSF Health problem

Plain knee radiographs are a cornerstone in evaluating osteoarthritis (OA), and at UCSF, nearly all such studies include Kellgren-Lawrence (KL) grading to assess severity. While essential, this grading task is repetitive, time-consuming, and inherently subjective, especially in borderline cases [1,2]. In fact, studies show only moderate inter-reader reliability in KL grading, creating inconsistencies in diagnosis and downstream care decisions [3,4].

KL grading adds to radiologist workload [5], reducing time available for complex cases and clinical consults. Despite its clinical importance, few innovations have addressed this burden at scale. Although machine learning methods for KL grading have shown promise in research settings [6-8], these tools have yet to be integrated meaningfully into clinical workflows.

Our proposed solution targets a high-volume, repetitive task to reduce radiologist burden, improve grading consistency, and support radiology education. Our goal is not to replace clinical judgment, but to reinforce it with consistent, reproducible assessments of OA severity.

How might AI help?

We propose integrating an automated AI model for KL grading directly into the radiology workflow at UCSF. Our model has been trained on thousands of local clinical knee radiographs and has demonstrated strong agreement with expert radiologist grading. The tool uses a deep learning pipeline to automatically detect knee joints and assign KL grades (0 to 4, or total knee replacement). This structured information would be embedded directly into the radiology report and supported with visual overlays to guide interpretation. We hypothesize that this system will improve inter-reader reliability, assist radiologists in ambiguous cases, and reduce the clinical workload associated with routine KL scoring.

Automating KL grading using deep learning has the potential to significantly enhance clinical practice by improving grading consistency and reducing the workload of radiologists. Despite promising research, automatic KL grading has not yet been integrated into clinical workflows. Our proposed study would demonstrate the feasibility and effectiveness of such integration, showing that an automated system can streamline workflow, reduce reporting burden, and provide decision support for ambiguous cases.

Key beneficiaries of this tool are:

- Radiologists: With KL grading offloaded to the algorithm, radiologists can devote more time to complex interpretations and high-value consults. Structured outputs integrated into PowerScribe reduce cognitive load and speed reporting.

- Patients: Consistent and reproducible KL scores reduce the risk of diagnostic variability. More available radiologist time means better patient communication and faster turnaround.

- Trainees: The system’s visual outputs and attention maps can serve as an educational aid for residents and fellows learning musculoskeletal imaging, helping them learn to identify KL grades more reliably and confidently.

By deploying this AI system at the point of care, we can support more efficient workflows, better diagnostic consistency, and enhanced trainee learning.

As a proof of concept for automatic KL grading of clinical knee OA at UCSF, we have developed a deep learning pipeline that detects the knee joint on clinical radiographs and automatically classifies OA severity. In this project we trained two models: an object detection model that first crops around the knee joint, and a classification model that assigns a KL grade (0-4) or identifies a total knee replacement (TKR).
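As a minimal sketch of how the two stages could be chained at inference time (the weight filenames, six-class head, and preprocessing are illustrative assumptions, not the exact production configuration):

```python
# Sketch: detect knee joints, crop, then classify KL grade per joint.
# Weight paths and the 6-class head (KL 0-4 + TKR) are illustrative assumptions.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("knee_detector.pt")            # hypothetical trained YOLOv8 weights
classifier = models.efficientnet_b7(num_classes=6)
classifier.load_state_dict(torch.load("kl_classifier.pt", map_location="cpu"))
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def grade_radiograph(path: str) -> list[int]:
    """Return one predicted grade (0-4, 5 = TKR) per detected knee joint."""
    image = Image.open(path).convert("RGB")
    grades = []
    for box in detector(image)[0].boxes.xyxy:  # one bounding box per detected knee
        crop = image.crop(tuple(box.int().tolist()))
        with torch.no_grad():
            logits = classifier(preprocess(crop).unsqueeze(0))
        grades.append(int(logits.argmax(dim=1)))
    return grades
```

For a bilateral study, `grade_radiograph("knee.png")` would return one grade per detected joint, e.g., `[2, 3]`.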

For object detection, 814 unilateral AP knee radiographs from the Osteoarthritis Initiative (OAI) [9] were used to train cropping around the knee joint. A You Only Look Once (YOLO, version 8) [10] object detection model was trained on the OAI radiographs with an 80/10/10 training/validation/test split. Mean intersection-over-union between predicted and ground-truth bounding boxes was 0.8845 for the training set, 0.8635 for the validation set, and 0.8947 for the test set. When the detector was applied to UCSF clinical knee radiographs, at least one knee joint was detected and cropped in 94.73% of radiographs (n = 10,842).
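For reference, the reported mean IoU could be computed with a small helper like the following (the box format and function names are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_iou(predicted, ground_truth):
    """Mean IoU over matched prediction/ground-truth box pairs."""
    return sum(iou(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)
```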

For classification, 9,166 anonymized clinical knee radiographs (4,978 bilateral and 4,188 unilateral) were acquired from UCSF PACS AIR [11] for training, validation, and prediction. KL labels were extracted from the corresponding UCSF radiology reports using regular expressions and used to train a pretrained EfficientNet-B7 [12] classification model on an 80/10/10 training/validation/test split of the cropped, KL-labeled UCSF knee radiographs. Weighted Cohen's kappa showed substantial agreement (0.74 for the validation set and 0.76 for the test set).
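As an illustration of the label extraction step, a pattern along these lines could pull KL grades from report text; the exact phrasing in UCSF reports varies, so the regular expression shown is an assumption to be refined against real reports:

```python
import re

# Illustrative pattern; actual report phrasing (e.g., "Kellgren-Lawrence grade 3",
# "KL grade: 2") varies and would need tuning against real report text.
KL_PATTERN = re.compile(r"(?:Kellgren[- ]Lawrence|KL)\s*grade\s*:?\s*([0-4])", re.IGNORECASE)

def extract_kl_grade(report_text: str) -> int | None:
    """Return the first KL grade mentioned in a report, or None if absent."""
    match = KL_PATTERN.search(report_text)
    return int(match.group(1)) if match else None

assert extract_kl_grade("IMPRESSION: Kellgren-Lawrence grade 3 osteoarthritis.") == 3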

How would an end-user find and use it?

The AI system will be integrated with the clinical radiology tools already in use at UCSF: the Visage Picture Archiving and Communication System (PACS) and Nuance PowerScribe reporting software. Radiographs will be routed to the KL grading software based on a set of filters, such as modality and body part. The models will detect the left and right knee joints and generate numerical KL scores of osteoarthritis severity: 0 (no OA), 1 (doubtful OA), 2 (mild OA), 3 (moderate OA), 4 (severe OA), or 5 (hardware, such as an artificial joint from total knee replacement).
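As an illustration, the routing filter could key off standard DICOM header attributes; the following sketch assumes pydicom, and the specific attribute values are illustrative:

```python
import pydicom

# Illustrative routing filter on standard DICOM attributes; the exact values
# used in production (e.g., BodyPartExamined codes) are assumptions.
def should_route_to_kl_grading(dicom_path: str) -> bool:
    """Decide whether a study should be sent to the KL grading pipeline."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    is_radiograph = getattr(ds, "Modality", "") in {"CR", "DX"}
    is_knee = "KNEE" in str(getattr(ds, "BodyPartExamined", "")).upper()
    return is_radiograph and is_knee
```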

When a knee radiograph is opened for interpretation, suggested grades, along with brief descriptors of severity (e.g., "mild OA"), will be pre-populated into the PowerScribe radiology report template. Radiologists can accept or edit the AI-suggested grades. Additionally, an annotated image with overlaid KL scores and saliency maps (highlighting the image regions most relevant to the classification) will be available for review in the Visage PACS viewer. Trainees can use these visualizations to better understand the features driving each grade. This process occurs passively within the existing workflow, saving time and offering real-time decision support.
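The saliency overlays could be generated with a standard attribution technique such as Grad-CAM; here is a compact sketch against the torchvision EfficientNet-B7 assumed earlier (the layer choice and function name are illustrative, not the production implementation):

```python
import torch
import torch.nn.functional as F

def grad_cam(classifier, image_tensor, target_class=None):
    """Grad-CAM heatmap from the last convolutional block of EfficientNet-B7.

    Assumes `classifier` is torchvision's efficientnet_b7, whose convolutional
    trunk is exposed as `classifier.features`. Returns a heatmap in [0, 1] at
    feature-map resolution.
    """
    activations, gradients = [], []
    layer = classifier.features[-1]
    fwd = layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        logits = classifier(image_tensor.unsqueeze(0))
        target = int(logits.argmax(dim=1)) if target_class is None else target_class
        classifier.zero_grad()
        logits[0, target].backward()
    finally:
        fwd.remove()
        bwd.remove()
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = F.relu((weights * activations[0]).sum(dim=1)).squeeze(0)
    return (cam / (cam.max() + 1e-8)).detach()
```

The resulting heatmap would be upsampled to the radiograph's resolution and alpha-blended over the image before being sent to the PACS viewer. See example mock-up of the end-user interface below: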

What are the risks of AI errors?

There are two primary categories of errors:

Failure to detect the knee joint: This would result in the system being unable to suggest a KL grade. However, our preliminary data show that the model successfully detects the knee joint in approximately 95% of cases, making this scenario uncommon. When it does occur, the fallback is manual grading by the radiologist, as is currently done.

Misclassification of osteoarthritis severity: This includes both false negatives (e.g., missed severe OA), which may lead to underestimation of disease severity and undertreatment, and false positives (e.g., overgrading of mild cases), which may lead to unnecessary further evaluation or intervention.

The rate of these failures will be measured by comparing the recorded model output to the final radiology reports. Our preliminary data showed good agreement between generated KL grades and ground-truth values assigned by radiologists. To mitigate these risks:

- AI-generated grades will never bypass radiologist review. Only radiologist-approved grades will be included in the final report.
- We will conduct a quality assurance study comparing AI grades to independent radiologist adjudication.
- Continuous monitoring of agreement between AI predictions and final report grades will be conducted, and discrepancies will be reviewed to guide system refinement.

How will we measure success?

Initial evaluation: In a randomized adjudication study, two radiologists will independently compare prior clinical KL grades and AI-generated grades. We will measure inter-rater and radiologist-model agreement using Cohen's kappa, with success defined as the model matching or exceeding inter-radiologist agreement.
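For reference, a minimal sketch of the agreement computation, assuming scikit-learn and quadratic weighting (a common choice for ordinal scales; the grade lists are placeholders):

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder grade lists; in the study these would come from the adjudication records.
radiologist_grades = [0, 2, 3, 1, 4, 2]
model_grades = [0, 2, 2, 1, 4, 3]

# Quadratic weighting penalizes large disagreements (e.g., KL 0 vs. 4) more heavily.
kappa = cohen_kappa_score(radiologist_grades, model_grades, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```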

Clinical impact: We will evaluate how frequently AI-generated grades are accepted without edits, and whether AI usage improves intra- and inter-reader consistency across reports. We will assess radiologist-reported satisfaction with the tool, as well as trainee confidence and accuracy in KL grading, via structured surveys. We will also explore whether AI integration leads to measurable efficiency gains, including reduced interpretation time per case and an increase in the number of radiographs read per day.

Continuous evaluation: We will implement a dashboard for continuous monitoring of model performance by recording the generated grades and comparing them with the final grades entered into reports. Any significant decrease in the rate of agreement will be followed by model fine-tuning on additional data.
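A minimal sketch of the dashboard's agreement check, assuming a log table of model and final report grades (the column names and alert threshold are illustrative assumptions, to be set by the QA study):

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

KAPPA_ALERT_THRESHOLD = 0.70  # illustrative threshold, to be calibrated by the QA study

def monthly_agreement(log: pd.DataFrame) -> pd.Series:
    """Weighted kappa between model and final report grades, per calendar month.

    Expects columns 'date', 'model_grade', 'report_grade' (names are assumptions).
    """
    by_month = log.groupby(pd.to_datetime(log["date"]).dt.to_period("M"))
    return by_month.apply(
        lambda g: cohen_kappa_score(g["model_grade"], g["report_grade"], weights="quadratic")
    )

def months_needing_review(log: pd.DataFrame) -> list:
    """Months whose agreement falls below the alert threshold."""
    kappas = monthly_agreement(log)
    return kappas[kappas < KAPPA_ALERT_THRESHOLD].index.tolist()
```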

Describe your qualifications and commitment

This project is led by Dr. Yuntong (Lorin) Ma, MD, Assistant Professor of Radiology at UCSF, and Eugene Ozhinsky, PhD, Associate Professor of Radiology at UCSF.

Dr. Ma’s work focuses on developing deep learning solutions for musculoskeletal imaging, including automated classification of osteoarthritis severity and diagnosis of inflammatory arthropathy. Her research emphasizes practical clinical integration of imaging AI in ways that are relevant to patient care, and she actively contributes to shaping standards for its responsible deployment. If selected, she will commit 10% effort for at least one year to ensure the project's success.

Dr. Ozhinsky’s research focuses on applying advanced image acquisition and machine learning techniques to improve diagnosis, predict disease progression, and guide therapy, particularly in musculoskeletal conditions. He has developed AI models for tasks such as hip fracture detection, automated OA grading, and MRI protocol optimization. His long-term goal is to translate these novel techniques into routine clinical care so that they result in meaningful improvements in patient outcomes.

Drs. Ma and Ozhinsky will oversee a multidisciplinary team of scientists and engineers in close collaboration with the UCSF Center for Intelligent Imaging. The team meets weekly to review progress, troubleshoot challenges, and plan next steps. Regular engagement with key clinical stakeholders will guide implementation, including the Radiology AI Governance Committee, UCSF Health AI, and AER leadership.

References

1. Kohn MD, Sassoon AA, Fernando ND. Classifications in Brief: Kellgren-Lawrence Classification of Osteoarthritis. Clin Orthop Relat Res. 2016;474(8):1886-1893. doi:10.1007/s11999-016-4732-4

2. Braun HJ, Gold GE. Diagnosis of osteoarthritis: imaging. Bone. 2012;51(2):278-288. doi:10.1016/j.bone.2011.11.019

3. Wright RW, MARS Group. Osteoarthritis Classification Scales: Interobserver Reliability and Arthroscopic Correlation. J Bone Joint Surg Am. 2014;96(14):1145-1151. doi:10.2106/JBJS.M.00929

4. Köse Ö, Acar B, Çay F, Yilmaz B, Güler F, Yüksel HY. Inter- and Intraobserver Reliabilities of Four Different Radiographic Grading Scales of Osteoarthritis of the Knee Joint. J Knee Surg. 2017;31:247-253. doi:10.1055/s-0037-1602249

5. Tiulpin A, Thevenot J, Rahtu E, Lehenkari P, Saarakkala S. Automatic Knee Osteoarthritis Diagnosis from Plain Radiographs: A Deep Learning-Based Approach. Sci Rep. 2018;8:1727. doi:10.1038/s41598-018-20132-7

6. Lee LS, Chan PK, Wen C, et al. Artificial intelligence in diagnosis of knee osteoarthritis and prediction of arthroplasty outcomes: a review. Arthroplasty. 2022;4:16. doi:10.1186/s42836-022-00118-7

7. Swiecicki A, Li N, O'Donnell J, et al. Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists. Comput Biol Med. 2021;133:104334. doi:10.1016/j.compbiomed.2021.104334

8. Norman B, Pedoia V, Noworolski A, Link TM, Majumdar S. Applying Densely Connected Convolutional Neural Networks for Staging Osteoarthritis Severity from Plain Radiographs. J Digit Imaging. 2019;32(3):471-477. doi:10.1007/s10278-018-0098-3

9. NIMH Data Archive - OAI. Accessed November 4, 2024. https://nda.nih.gov/oai

10. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. Published online May 9, 2016. doi:10.48550/arXiv.1506.02640

11. AIR Overview. UCSF Radiology. May 29, 2018. Accessed March 20, 2025. https://radiology.ucsf.edu/research/core-services/PACS-air

12. Tan M, Le QV. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Published online September 11, 2020. doi:10.48550/arXiv.1905.11946

Supporting Documents: 

Comments

It looks like you've done a fairly extensive retrospective analysis to develop and validate the AI. You are proposing a prospective validation: can you describe what you think the incremental value of the prospective validation is? What additional implementation problems will you face? Do you expect the accuracy to differ when implemented prospectively?