The aim of this study was to externally validate, in real clinical practice, the diagnostic ability of an AI algorithm for chest X-rays against the reference radiologist, as well as to detect potential diagnoses that the algorithm missed, including those for which it had not been trained. The overall accuracy of the algorithm was 0.95 (95% CI 0.92–0.98), sensitivity was 0.48 (95% CI 0.30–0.66) and specificity was 0.98 (95% CI 0.97–0.99). These results reinforce the position of various expert groups26,28,29 that external validation of AI algorithms in a real clinical context is needed to establish the necessary measures and adaptations to ensure safety and effectiveness in any given environment. Therefore, in the context of the developed model, it is important to understand and interpret what each of the obtained results refers to.
High accuracy values were observed in most cases (between 0.7 and 1). Accuracy is the ratio of correctly classified cases to the total number of cases tested. This value was high because, for each condition and for groups of conditions, the ability to detect true negatives was good, and no abnormalities were found in most of the images analyzed (51.8%). An AI algorithm that quickly determines that there is no abnormality can serve as a triage tool, streamlining the diagnostic process and allowing professionals to focus on other tests, which in turn may shorten waiting lists and waiting times for evaluations and lower the cost of secondary tests.
Sensitivity refers to the ability to detect an abnormality when one is actually present. High sensitivity values were observed for physical findings or abnormalities such as sternal cables, enlarged heart, abnormal ribs, spinal implants, cardiac valves, or interstitial markings. On the other hand, low sensitivity values were observed for most conditions, indicating that the algorithm had a limited ability to detect certain conditions, such as those affecting the mediastinum, vessels or bones. These results are consistent with those of a study that externally validated a similar algorithm in the emergency department.35 Furthermore, the algorithm demonstrated low sensitivity in detecting pulmonary emphysema, linear atelectasis, and hilar prominence, which are prevalent conditions in the primary care setting.31
Low sensitivity was also observed when detecting nodules, with the algorithm finding more nodules than the reference radiologist, in most cases confusing them with mammary areolae. Although it is important to be able to detect any warning signs, and the professional remains in charge of making clinical judgments and determining the need for complementary tests, this external validation may have detected a potential gender bias in the training of the algorithm. In breast imaging, it is important to distinguish the anatomical features of breast tissue, and the changes it may undergo during different stages of life, from symptomatic conditions or abnormalities.36 Other studies have also found high false-positive rates in nodule detection due to other causes such as fat, pleura or interstitial lung disease.37
Finally, specificity is the ability to correctly identify images with no radiological abnormalities. The results showed high values for all condition groupings, as the algorithm was able to correctly identify images without any abnormalities.
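The three measures discussed above can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from this study; the class and function names are illustrative only. The sketch shows how a condition with few positive cases can yield high accuracy and specificity while sensitivity remains low, because true negatives dominate the total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of algorithm results against the reference radiologist."""
    tp: int  # abnormality present, algorithm flagged it
    fp: int  # abnormality absent, algorithm flagged it
    tn: int  # abnormality absent, algorithm did not flag it
    fn: int  # abnormality present, algorithm did not flag it


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, failing loudly when the measure is undefined."""
    if denominator == 0:
        raise ValueError("measure undefined: denominator is zero")
    return numerator / denominator


def accuracy(c: ConfusionCounts) -> float:
    """Correct classifications over all cases tested."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Detected abnormalities over all abnormalities actually present."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Correctly cleared images over all images without the abnormality."""
    return _ratio(c.tn, c.tn + c.fp)


# Hypothetical example (NOT study data): 1000 images, 30 with the condition.
counts = ConfusionCounts(tp=12, fp=19, tn=951, fn=18)

print(f"accuracy    = {accuracy(counts):.3f}")
print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
```

With these assumed counts, more than half of the images with the condition are missed, yet accuracy stays above 0.95 because most images are true negatives. This is why accuracy alone does not characterize how well an algorithm detects an infrequent condition.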
In keeping with the authors’ desire to contribute to the improvement of the AI model, some radiologist findings were identified that had not been covered during algorithm training, particularly those related to bronchial conditions, including chronic bronchopathy, bronchiectasis, and bronchial wall thickening. The algorithm also missed common chronic conditions often seen in primary care, including chronic pulmonary abnormalities, COPD, and fibrocystic abnormalities. In addition, it was noted that the names of specific conditions within the AI algorithm should be adjusted to align with the terms used in radiology: interstitial markings could be renamed interstitial abnormalities, consolidation renamed condensation, aortic sclerosis renamed valvular sclerosis, and rib abnormalities renamed rib fractures.
Having discussed the main variables characterizing the algorithm’s capability, we note that the results obtained differ from those of most published studies, most of which report high algorithm performance. However, most of these are internal validations and have not been tested in actual clinical practice settings.38,39,40
A study in Korea performed internal and external validation of an AI algorithm able to detect the 10 most prevalent chest X-ray abnormalities and found differences in sensitivity and specificity values. Internal validation yielded sensitivity and specificity values of 0.87–0.94 and 0.81–0.98, respectively, whereas external validation yielded values of 0.61–1.00 and 0.71–0.98, respectively.41 This difference can also be seen in a study in Michigan, where an AI algorithm capable of detecting the most common chest X-ray abnormalities was internally and externally validated,42 and in a study at Seoul University School of Medicine, where a lung cancer detection algorithm was validated in population screening.43
Therefore, the results obtained from the external validation show the need to increase the sensitivity of the algorithm for most conditions. Considering that AI should serve as a diagnostic support tool and that the ultimate responsibility for clinical decisions rests with the practitioner, it is ideal for the algorithm to flag potential abnormalities for the practitioner to confirm. This ensures maximum diagnostic effectiveness. Recent studies have shown that the use of AI algorithms to assist the practitioner significantly improves diagnostic sensitivity and specificity and reduces image reading time.20,44
Improved sensitivity may help to address the global shortage of specialist radiologists, particularly in the primary care setting of Central Catalonia, where the algorithm was validated.45,46 Increasingly, general practitioners are entrusted with interpreting X-rays. In this context, the development of these tools can be a valuable asset in the assessment process.