Algorithms that use only International Classification of Diseases (ICD)-9 codes do not perform well in identifying systemic sclerosis (SSc), according to study results published in Arthritis Research & Therapy. The highest-performing algorithms were those that incorporated clinical data with billing codes.
Research on SSc has been limited by the rarity of the disease and by small sample sizes, making electronic health records (EHRs) a potentially powerful tool for studying this patient population; however, validated methods for identifying cases are necessary. Investigators of the current study developed and validated EHR-based algorithms that incorporate clinical data and billing codes to identify patients with SSc.
A de-identified EHR containing more than 3 million patients was used to identify 1899 potential patients with SSc, 200 of whom were randomly selected for chart review as the training set. An expert panel of rheumatologists selected the following algorithm components based on clinical knowledge and EHR-accessible data: ICD-9 (710.1) and ICD-10-CM codes for SSc; positive antinuclear antibody (ANA) (titer ≥1:80); and the keyword “Raynaud phenomenon.”
Researchers developed algorithms using both rule-based and machine learning techniques and calculated positive predictive values (PPVs), sensitivities, and F-scores (which combine PPV and sensitivity into a single measure) for each algorithm. The algorithm with the highest F-score was then validated in a set of 100 participants who were randomly selected from the database of 1899 and were not part of the original training set.
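The F-score is commonly computed as the harmonic mean of PPV and sensitivity. This summary does not state the exact formula the investigators used, so the following is a minimal sketch that assumes the standard F1 definition; the function name `f_score` is illustrative only.

```python
def f_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall).

    Assumes the standard F1 definition; the study's exact formula is
    not given in this summary.
    """
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# A PPV of 84% with a sensitivity of 92% gives an F-score of about 88%.
print(round(f_score(0.84, 0.92), 2))  # 0.88
```

Because the harmonic mean is pulled toward the smaller of the two values, an algorithm with a perfect PPV but only 50% sensitivity would score about 67% under this definition.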
Of the 200 potential patients in the training set, 85 were classified as true cases on chart review, 70 did not have an SSc diagnosis, 24 had an uncertain SSc diagnosis, and 21 had missing specialist notes. Systemic lupus erythematosus (SLE) was the most common diagnosis among the study participants (n=38) who did not have SSc.
Algorithms using only 1 count of the SSc ICD-9 code had a low PPV (52%), but the PPV rose as the required number of code counts increased (63% for ≥2 counts, 79% for ≥3 counts, and 86% for ≥4 counts). PPVs were higher with the ICD-10-CM codes than with the ICD-9 code: 82% for ≥1 count, 84% for ≥2 counts, 88% for ≥3 counts, and 91% for ≥4 counts.
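A count-threshold rule of this kind can be sketched in a few lines. The example below is illustrative only: the records are invented toy data, not study data, and the names `RECORDS` and `ppv_at_threshold` are hypothetical. It shows how PPV might be computed against chart-review labels as the required number of code counts rises.

```python
from typing import List, Optional, Tuple

# Hypothetical chart-reviewed records: (SSc code counts, true case on review).
# These values are invented for illustration and are NOT study data.
RECORDS: List[Tuple[int, bool]] = [
    (1, False), (1, False), (1, True), (2, False), (2, True),
    (3, True), (3, False), (4, True), (5, True), (6, True),
]


def ppv_at_threshold(records: List[Tuple[int, bool]],
                     min_count: int) -> Optional[float]:
    """PPV of the rule 'flag a patient with at least min_count codes'."""
    flagged = [is_case for count, is_case in records if count >= min_count]
    if not flagged:
        return None  # the rule flags nobody, so PPV is undefined
    return sum(flagged) / len(flagged)


for threshold in (1, 2, 3, 4):
    print(threshold, ppv_at_threshold(RECORDS, threshold))
```

In this toy set the PPV climbs with the threshold because patients with a single stray code are more often false positives, which is the same direction of effect the investigators reported.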
Algorithms that combined ≥3 or ≥4 counts of the ICD-9 or ICD-10-CM codes with ANA positivity had the highest PPV (100%) but low sensitivity (50%). The algorithm with the highest F-score (91%) required ≥4 counts of the ICD-9 or ICD-10-CM codes and had an internally validated PPV of 90%. A random forest machine learning method, using 500 trees and sampling 2 random variables per tree, yielded an algorithm with a PPV of 84%, a sensitivity of 92%, and an F-score of 88%. The Raynaud phenomenon keyword was the most important feature for this algorithm.
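To illustrate why adding ANA positivity to a code-count rule can raise PPV while lowering sensitivity, here is a minimal sketch of such a combined rule. It is not the investigators' implementation: the `Patient` fields, the titer string format, the helper names, and the toy cohort are all assumptions, and the cohort was invented to mirror the reported pattern rather than drawn from study data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Patient:
    code_count: int            # counts of SSc ICD-9 or ICD-10-CM codes
    ana_titer: Optional[str]   # assumed format "1:160"; None if never tested
    is_true_case: bool         # chart-review label


def ana_positive(titer: Optional[str], cutoff: int = 80) -> bool:
    """True when the titer dilution is at least 1:cutoff."""
    if titer is None:
        return False
    _, _, dilution = titer.partition(":")
    dilution = dilution.strip()
    return dilution.isdigit() and int(dilution) >= cutoff


def flag(p: Patient, min_count: int = 3) -> bool:
    """Combined rule: enough SSc codes AND a positive ANA."""
    return p.code_count >= min_count and ana_positive(p.ana_titer)


def ppv_and_sensitivity(
    patients: List[Patient],
) -> Tuple[Optional[float], Optional[float]]:
    tp = sum(1 for p in patients if flag(p) and p.is_true_case)
    fp = sum(1 for p in patients if flag(p) and not p.is_true_case)
    fn = sum(1 for p in patients if not flag(p) and p.is_true_case)
    ppv = tp / (tp + fp) if tp + fp else None
    sens = tp / (tp + fn) if tp + fn else None
    return ppv, sens


# Invented toy cohort, NOT study data.
COHORT = [
    Patient(5, "1:320", True),   # flagged, true case
    Patient(4, "1:80", True),    # flagged, true case
    Patient(2, "1:160", True),   # missed: too few codes
    Patient(6, None, True),      # missed: no ANA result on file
    Patient(3, "1:40", False),   # not flagged: titer below cutoff
    Patient(1, "1:160", False),  # not flagged: too few codes
]

print(ppv_and_sensitivity(COHORT))  # (1.0, 0.5)
```

Every flagged patient in the toy cohort is a true case, but half of the true cases are missed because they lack either enough codes or a recorded positive ANA, which is the trade-off a strict combined rule tends to produce.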
“By bridging the gaps between administrative databases and prospective cohort studies, EHR-based cohorts represent powerful tools. Researchers can accurately and efficiently identify [patients with] SSc, examine outcomes longitudinally, and ask clinically important questions about the disease,” the researchers concluded.
Jamian L, Wheless L, Crofford LJ, Barnado A. Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record. Arthritis Res Ther. 2019;21(1):305.