Performance Aspects of Automated Rhythm Detection Capabilities for AEDs

Victor Krauthamer, PhD, Scientist, Office of Science and Engineering Labs, Center for Devices and Radiological Health (CDRH), Food and Drug Administration (FDA); Shanti Gomatam, PhD, Mathematical Statistician, Office of Surveillance and Biometrics, CDRH, FDA; and Oscar Tovar, MD, Medical Officer, Office of Device Evaluation, CDRH, FDA
Automated detection of shockable and non-shockable arrhythmias is an essential function of automatic external defibrillators (AEDs). The American Heart Association (AHA) provided a set of sensitivities and specificities for arrhythmia detection in AEDs in 1997.1 These recommendations have also been incorporated into national and international standards by the Association for the Advancement of Medical Instrumentation (AAMI) and the International Electrotechnical Commission (IEC). This paper explains how standard diagnostic concepts and terminology apply to AEDs. We review and discuss the concepts of sensitivity, specificity, positive predictive value, negative predictive value, overall accuracy, and the receiver-operating-characteristic (ROC) as they apply to AED arrhythmia detection.

Detection Performance – A Simple View of Shockable and Non-shockable Rhythms with Sensitivity and Specificity

The simplest way of looking at the diagnostic capability of an AED is with a binary classification: shockable versus non-shockable rhythms. The variety of shockable and non-shockable rhythms, even though they have different origins and consequences, can be viewed simply as those needing a therapeutic electric shock and those not needing one. Broadly, the 1997 AHA recommendations1 for specifying and reporting arrhythmia analysis algorithm performance describe the shockable and non-shockable rhythms: the shockable rhythms are coarse ventricular fibrillation and rapid ventricular tachycardia, and the notable non-shockable rhythms are normal sinus rhythm (NSR), asystole, atrial fibrillation, supraventricular tachycardia, sinus bradycardia, atrial flutter, and pulseless electrical activity. Fine ventricular fibrillation was excluded from both categories and listed as an intermediate rhythm. For simplicity, we do not include the intermediate rhythms in the discussion below, and we use the simple binary view of shockable and non-shockable rhythms to define diagnostic performance.
We define two fundamental terms to characterize detection performance:

• sensitivity = probability of a shock advised for patients who truly have a shockable rhythm
• specificity = probability of no shock advised for patients who truly have a non-shockable rhythm

In order to determine sensitivity and specificity, it is essential to know the actual rhythm presented. The AHA1 recommended agreement by three experts as to the identity of each rhythm used for testing. If the population sensitivity and specificity are known, they can be applied to every rhythm presented for any patient population. When a shock is correctly advised for a shockable rhythm, this is termed a true positive (TP); a shock incorrectly advised for a non-shockable rhythm is known as a false positive (FP). Similarly, when no shock is advised for a non-shockable rhythm, it is termed a true negative (TN); a false negative (FN) is when no shock is advised for a shockable rhythm. Table 1 shows these four possible results for the binary detector. For a complete fundamental discussion of diagnostic performance, we recommend the review by Zweig and Campbell.2 A visual approach to understanding sensitivity and specificity has also been presented.3

Using Table 1, sensitivity can be defined as the proportion of true positives among all shockable rhythms. Stated as above, sensitivity is the chance of a shockable rhythm being correctly identified by the device:

Sensitivity = true positives ÷ (true positives + false negatives)

Specificity is defined as the proportion of true negatives among all non-shockable rhythms. Also, as stated above, specificity is the probability that a non-shockable rhythm is correctly identified by the device:

Specificity = true negatives ÷ (true negatives + false positives)

The terms specificity and sensitivity refer to all possible rhythms of interest in the entire patient population.
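The two definitions above translate directly into a short computation. The following is our own illustrative sketch (function names are ours), not code from the AHA recommendations:

```python
def sensitivity(tp, fn):
    """Probability that a truly shockable rhythm gets a shock advised."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Probability that a truly non-shockable rhythm gets no shock advised."""
    return tn / (tn + fp)
```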
In practice, estimates of sensitivity and specificity can be derived empirically from the frequencies obtained by testing a collection of sample rhythms and categorizing them as in Table 1. Table 2 provides an example of how this is done with hypothetical data from the performance of an imaginary device. The estimated sensitivity of this device is 95% (95/100), and the estimated specificity is 99% (990/1000). The AHA reporting guidelines1 provide recommendations for reporting algorithm performance and list performance goals for arrhythmia analysis algorithms. Although the AHA updated the guidelines for defibrillation in 2005,4 these guidelines did not revise the 1997 performance reporting guidelines.

Overall accuracy is a term often used to describe AED rhythm detection performance. It is the probability of correct detection of both shockable and non-shockable rhythms. From Table 1, it is the proportion of true positives and true negatives in the population:

Overall Accuracy = (true positives + true negatives) ÷ (true positives + true negatives + false positives + false negatives)

In contrast to sensitivity or specificity, overall accuracy describes the overall performance of a test; it does not capture the relative differentiation between shockable and non-shockable rhythms. By itself, overall accuracy is an inadequate description of the diagnostic value of the AED. Both sensitivity and specificity need to be considered because the consequences of low sensitivity and low specificity are different. Overall accuracy also depends on the prevalence of shockable rhythms in the patient population (for example, if the prevalence of shockable rhythms is high, a test that classifies all rhythms as shockable will have good overall accuracy, but it is of no practical value because it does not distinguish the non-shockable rhythms).
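As a sketch using the hypothetical counts from Table 2, overall accuracy and its dependence on prevalence can be computed directly (the 90% prevalence figure below is ours, chosen only to illustrate the caveat):

```python
# Hypothetical Table 2 counts: TP = 95, FN = 5, TN = 990, FP = 10.
tp, fn, tn, fp = 95, 5, 990, 10
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)  # 1085/1100, about 0.986

# Caveat from the text: a detector that calls every rhythm shockable has
# 100% sensitivity and 0% specificity, yet its overall accuracy simply
# equals the prevalence of shockable rhythms in the tested population.
prevalence = 0.9  # assumed high prevalence, for illustration only
accuracy_shock_everything = prevalence * 1.0 + (1 - prevalence) * 0.0
```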
When studies employ an enrichment technique (selection of a fraction of either shockable or non-shockable rhythms larger than prevalent in the population), overall accuracy cannot be meaningfully interpreted.

Predictive Value – Measured or Calculated From Prevalence, Sensitivity and Specificity

The intrinsic performance of a device for distinguishing shockable from non-shockable rhythms is characterized by its sensitivity and specificity. The value of a test also depends upon the prevalence of these rhythms in the intended population. The predictive value depends upon prevalence as well as the intrinsic performance of the detection system. Positive predictive value (PPV) is the percentage of correctly detected shockable rhythms (true positive test results) relative to the total number of all positive test results. It can be measured empirically from Table 1 as:

PPV = true positives ÷ (true positives + false positives)

Its value is that it predicts the chance of a shockable rhythm, given a positive test result, at a particular prevalence in the population. Similarly, the negative predictive value (NPV) gives the chance of a non-shockable rhythm given a negative result and the prevalence. The NPV can be determined empirically for a population from Table 1:

NPV = true negatives ÷ (true negatives + false negatives)

Estimated predictive values can be calculated from sample test results. Using the sample results in Table 2, the PPV is 90.48% (95/105) and the NPV is 99.50% (990/995). Alternatively, when specificity, sensitivity, and prevalence are known, Bayes theorem can be used to calculate the predictive values. Bayes theorem conveniently allows the predictive value for any prevalence to be calculated given the intrinsic sensitivity and specificity of the device.
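The empirical predictive values quoted above can be reproduced from the hypothetical Table 2 counts; this is our own illustrative sketch:

```python
# Hypothetical Table 2 counts, grouped by test result.
tp, fp = 95, 10   # positive test results (shock advised)
tn, fn = 990, 5   # negative test results (no shock advised)

ppv_empirical = tp / (tp + fp)  # 95/105, about 90.48%
npv_empirical = tn / (tn + fn)  # 990/995, about 99.50%
```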
From Bayes theorem, the probability of having a shockable rhythm, given a shock advised, is:

PPV = (prevalence × sensitivity) ÷ [(prevalence × sensitivity) + (1 – specificity) × (1 – prevalence)]

NPV = [(1 – prevalence) × specificity] ÷ [(1 – prevalence) × specificity + (1 – sensitivity) × prevalence]

An example of the relative effects of prevalence: an AED may have 90.0% sensitivity and 99.0% specificity for shockable rhythms. In one in-hospital study, adults with apparent cardiac arrest had a prevalence of 23% for shockable rhythms, and a pediatric cohort had a 14% prevalence.5 From Bayes theorem, the probability that a patient has a shockable rhythm when a shock is advised would be 96.4% for an adult and 93.6% for a child. The NPV would be 97.1% for adults and 98.4% for children. Therefore, when a shock is not advised, the chance of having a shockable rhythm is 2.9% for adults and 1.6% for children. (This calculation assumes that detection sensitivity and specificity are the same in children as in adults.)

Comparing and Selecting Detection Performance

For binary diagnostic tests, as in the case described above, there is always a sensitivity paired with a specificity. This information describes the intrinsic performance of the algorithm. In a diagnostic algorithm, sensitivity and specificity can be “traded.” In the extreme example given above in relation to overall accuracy, an AED that advises shocks for every patient would have 100% sensitivity; however, its specificity would be 0%. For every device, the sensitivity and specificity can be traded in the design. Comparison of performance between devices is not always obvious. For example, is a device (say Device 1) with 95% sensitivity and 88% specificity superior in performance to one (say Device 2) that has 85% sensitivity and 92% specificity?
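The prevalence example above can be reproduced with a few lines of Python. This is our own sketch of the Bayes-theorem formulas (function names are ours), not code from the article:

```python
def ppv(prev, sens, spec):
    """P(shockable rhythm | shock advised), from Bayes theorem."""
    return (prev * sens) / (prev * sens + (1 - spec) * (1 - prev))

def npv(prev, sens, spec):
    """P(non-shockable rhythm | no shock advised), from Bayes theorem."""
    return ((1 - prev) * spec) / ((1 - prev) * spec + (1 - sens) * prev)

sens, spec = 0.90, 0.99            # device performance from the example
ppv_adult = ppv(0.23, sens, spec)  # adult prevalence 23%, about 96.4%
ppv_child = ppv(0.14, sens, spec)  # pediatric prevalence 14%, about 93.6%
npv_adult = npv(0.23, sens, spec)  # about 97.1%
npv_child = npv(0.14, sens, spec)  # about 98.4%
```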
A useful graphical method for comparing sensitivity and specificity between detection systems has been given by Biggerstaff.6 This method divides the sensitivity-specificity space into four regions with respect to the performance of a particular device: (I) a region that indicates performance superior to the device; (II) a region that indicates superior performance for confirming presence of the shockable rhythms; (III) a region that indicates superior performance for confirming absence of the shockable rhythms, which amounts to superiority for detecting non-shockable rhythms; and (IV) a region that indicates performance inferior to the device. This concept is illustrated in Figure 1. The sensitivity-specificity space is divided into four regions based on the performance of Device 2 (denoted by the red asterisk). The figure shows that Device 1 (blue triangle), while not superior in overall performance to Device 2, is superior for detecting non-shockable rhythms.

When the AED performance can be varied using different algorithmic thresholds, the different sets of sensitivities and specificities selected can be plotted to show the receiver-operating-characteristic (ROC). In an ROC curve, the vertical axis represents the sensitivity and the horizontal axis represents (1 – specificity). Each point on the graph represents the performance at a particular cut-off. The ideal ROC curve is one that rises sharply to the top left corner of the graph and stays at the 100% sensitivity level for all other values of (1 – specificity). A detection system whose ROC curve is completely above that of a second device is superior in performance. The ROC describes the detection capability over a range of cut-offs. It can be used for comparing devices or adjusting algorithms when comparison over multiple cut-offs is relevant. The interested reader may refer to other papers2,7 for further discussion.
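The four regions in Biggerstaff's graphic are bounded by the two likelihood-ratio lines through the reference device's point in ROC space. The function below is our own sketch of that classification (names and region strings are ours), checked against the Device 1 / Device 2 example from the text:

```python
def classify_region(sens_ref, spec_ref, sens_new, spec_new):
    """Place a candidate device into one of Biggerstaff's four regions
    relative to a reference device, in ROC space (x = 1 - specificity,
    y = sensitivity). Assumes 0 < sens_ref < 1 and 0 < spec_ref < 1."""
    x0, y0 = 1 - spec_ref, sens_ref
    x1, y1 = 1 - spec_new, sens_new
    lr_pos = y0 / x0               # slope of the LR+ line through the origin
    lr_neg = (1 - y0) / (1 - x0)   # slope of the LR- line through (1, 1)
    above_pos = y1 > lr_pos * x1             # higher LR+: better at confirming presence
    above_neg = y1 > 1 + lr_neg * (x1 - 1)   # lower LR-: better at confirming absence
    if above_pos and above_neg:
        return "I: superior overall"
    if above_pos:
        return "II: superior for confirming shockable rhythms"
    if above_neg:
        return "III: superior for confirming non-shockable rhythms"
    return "IV: inferior overall"

# Device 2 (reference): 85% sensitivity, 92% specificity.
# Device 1 (candidate): 95% sensitivity, 88% specificity.
region = classify_region(0.85, 0.92, 0.95, 0.88)
```

Consistent with Figure 1 as described, Device 1 falls in region III: superior for detecting non-shockable rhythms, but not superior overall.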
In Figure 2, a collection of ROC curves is illustrated using three hypothetical algorithms. A single cut-off can be varied for each algorithm so that both sensitivity and specificity vary. For example, one algorithm may rely on a rate cut-off, another may rely on waveform duration, and a third may rely on amplitude. For each algorithm, performance at many cut-offs can be tested, and the performance results of each test (i.e., pairs of estimated sensitivities and specificities) are the points on each curve. By convention, (1-specificity) or the false positive fraction, not specificity itself, is plotted on the horizontal axis. The green curve represents an algorithm with perfect performance, i.e., it has 100% sensitivity and 100% specificity for some cut-off, and thus perfectly separates shockable and non-shockable rhythms. The solid black line represents a statistically uninformative algorithm, which results in the sensitivity and false positive fraction being the same at each cut-off value; however, an ROC curve is not necessarily clinically useful just because it is above the solid black line. ROC curves that are closer to the top left corner of the graph represent algorithms with better performance. The algorithm represented by the purple circles does better than either of the two remaining algorithms (blue triangles and red plusses) as it is above both of their ROC curves. For the remaining two algorithms (blue triangles and red plusses), the ROC curves cross, so neither is better than the other overall, although the blue curve shows superior performance in the clinically relevant part of the curve on the left. The trade-off between sensitivity and specificity on the ROC can be made according to a preference. If there is a preference to not allow any shockable rhythms to be missed, then the trade would be for higher sensitivity with lower specificity. 
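The cut-off sweep described above can be sketched in a few lines. The rhythm data here are entirely made up for illustration (rate in bpm, label 1 = truly shockable); a real algorithm would operate on ECG waveforms, not a single rate number:

```python
# Hypothetical labelled test rhythms: (heart rate in bpm, truth label).
rhythms = [(190, 1), (210, 1), (175, 1), (160, 0),
           (80, 0), (140, 0), (200, 0), (185, 1)]

# Sweep a single rate cut-off; each cut-off yields one ROC point.
roc_points = []
for cutoff in sorted({rate for rate, _ in rhythms}):
    tp = sum(1 for r, y in rhythms if r >= cutoff and y == 1)
    fn = sum(1 for r, y in rhythms if r < cutoff and y == 1)
    fp = sum(1 for r, y in rhythms if r >= cutoff and y == 0)
    tn = sum(1 for r, y in rhythms if r < cutoff and y == 0)
    # Plot convention: (1 - specificity, sensitivity).
    roc_points.append((fp / (fp + tn), tp / (tp + fn)))
```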
If the desire is to avoid inappropriate shocks of non-shockable rhythms, then higher specificity would be traded at the expense of sensitivity. It should be noted, however, that if the intent is to operate the device with a single cut-off, only a single point on the ROC curve is of interest. Most current AED devices have a single sensitivity and specificity that is set by the manufacturer.

AHA Recommendations for Specifying and Reporting Arrhythmia Analysis Algorithm Performance and Industry Standards – Not All Shockable or Non-shockable Rhythms are Treated the Same

The AHA recommendations for AED performance name two shockable rhythms and several non-shockable ones.1 They provide guidelines for the proportion of correctly detected rhythms for each of several rhythm classes. The difference in performance goals for each class of rhythms relates to the risk/benefit of shocking these different rhythms. For example, for NSR, a “specificity” of 99% is recommended. The high specificity prevents a normal cardiac rhythm from being disrupted by an inappropriate shock. Other perfusing rhythms are treated with less “specificity.” Coarse VF is treated with higher sensitivity than rapid VT in the recommendations. Note that in these guidelines, the same detection algorithm is tested with samples of each rhythm. Each rhythm class is tested and listed separately in the guidelines. The rhythms are tested without the presence of artifact (e.g., from mechanical or electromagnetic sources). The design and testing of detection algorithms with artifacts or noise present is an additional and important challenge for standardized testing.

Additional Considerations

For simplicity, intermediate rhythms have been excluded from our performance discussions above; we have assumed that both the rhythm truth and the device differentiate all rhythms into one of two classes – shockable or non-shockable.
The question of how intermediate rhythms (fine VF and non-shockable VT) should be addressed in performance estimates is a non-trivial one. Brief reference to how to deal with intermediate results (referred to as equivocals) is made in an FDA guidance document.8 In the sections above we have introduced the concepts of sensitivity, specificity, PPV, NPV, and the ROC curve via a specific, simple case. Detailed statistical discussions of these and other useful diagnostic concepts can be found in a book on the subject.9 While we have touched upon the ideas of “population” values and empirical estimates obtained by testing samples of rhythms, we have not discussed variability or confidence intervals. Clearly, these elements are critical for assessing performance goals based on samples of rhythms. The interpretation of confidence intervals for estimated sensitivity and specificity depends on how the test rhythms are sampled. For example, when multiple rhythms are taken from the same patient, these rhythms may be correlated with one another, and the statistical confidence interval estimates would be affected.
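As one illustration of the confidence-interval point (our own sketch; the article does not specify a method), a Wilson score interval can be computed for the estimated sensitivity of 95/100 from Table 2. Note the assumption of independent test rhythms, which fails when multiple rhythms come from the same patient:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.
    Assumes the n test rhythms are independent; correlated rhythms
    (e.g., several strips from one patient) would invalidate this."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Estimated sensitivity 95/100 from Table 2: interval of roughly 0.89 to 0.98.
lo, hi = wilson_interval(95, 100)
```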