
Interpretation of the Outputs of a Deep Learning Model Trained with a Skin Cancer Dataset

Open Archive. Published: June 1, 2018. DOI: https://doi.org/10.1016/j.jid.2018.05.014

Abbreviations: AUC (area under the curve), BCC (basal cell carcinoma), CNN (convolutional neural network), ISIC (International Skin Imaging Collaboration), SCC (squamous cell carcinoma)
      To the Editor
We have made our algorithm publicly available so that our model can be tested with various custom datasets and to create an environment in which researchers can discuss our results. We appreciate the comments by Navarrete-Dechent et al. (2018), who tested our model with a publicly available dataset from the International Skin Imaging Collaboration (ISIC). This letter aims to explain why their evaluation yielded low sensitivity and to discuss the considerations necessary when interpreting the outputs of deep learning models.
There are two common methods of evaluating the performance of a model: Top-n accuracy and the area under the curve (AUC) of the receiver operating characteristic curve. Top-n accuracy is the percentage of cases in which a model renders the correct diagnosis within its first n choices. The AUC measures a model's sensitivity and specificity across varying thresholds for a disease of interest. Top-1 accuracy has often been used in general object recognition research, whereas the AUC has been widely used in medical research (Esteva et al., 2017; Gulshan et al., 2016; He et al., 2015; Rajpurkar et al., 2017; Russakovsky et al., 2014).
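As an illustrative sketch (not code from our study), Top-n accuracy can be computed from a matrix of model outputs as follows; the array values below are hypothetical:

    import numpy as np

    def top_n_accuracy(probs, labels, n=1):
        # probs: (num_samples, num_classes) array of model outputs.
        # labels: the true class index for each sample.
        top_n = np.argsort(probs, axis=1)[:, -n:]  # indices of the n largest outputs
        return float(np.mean([label in row for row, label in zip(top_n, labels)]))

    probs = np.array([[0.1, 0.6, 0.2, 0.1],
                      [0.5, 0.2, 0.2, 0.1],
                      [0.4, 0.3, 0.2, 0.1]])
    print(top_n_accuracy(probs, [1, 2, 0], n=1))  # 0.667: sample 2 is missed at Top-1
    print(top_n_accuracy(probs, [1, 2, 0], n=2))  # 1.0: all correct within two choices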
One reason most medical deep learning research has used the AUC instead of Top-1 accuracy is a practical limitation of deep learning models. When the training dataset is small (roughly 1,000 or fewer images per disease) and unbalanced, the outputs of the convolutional neural network (CNN) tend to tilt to one side (Jeni et al., 2013; Van Hulse et al., 2007). Because the outputs of a CNN sum to 1.0, the output of each class competes with the others. For example, if some feature of wart images strongly influences the algorithm, the output of the model will tilt toward wart, producing a high wart probability for all given images. One way to address this issue is to add more training images for the classes that yield low (weak) probabilities; however, it is rarely possible to collect a sufficient number of images in the medical field.
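This competition among softmax outputs can be illustrated with a minimal Python sketch (the logits below are hypothetical, not outputs of our model):

    import numpy as np

    def softmax(logits):
        z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
        return z / z.sum()

    # Four hypothetical classes: [wart, melanoma, BCC, nevus].
    print(softmax(np.array([1.0, 1.0, 1.0, 1.0])))  # [0.25 0.25 0.25 0.25]
    print(softmax(np.array([4.0, 1.0, 1.0, 1.0])))  # ~[0.87 0.04 0.04 0.04]
    # A feature that strongly raises the wart logit suppresses every other
    # class's probability, even for images that are not warts.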
To alleviate this issue, we used the AUC to report the sensitivity and specificity of the model while setting a threshold for each diagnosis. An added benefit of using the AUC is the ability to select the desired balance of sensitivity and specificity by adjusting the thresholds. When our model (70617.caffemodel) was tested with the 10 disorders of the Edinburgh dataset, the Top-1 accuracies for basal cell carcinoma (BCC) and melanoma were largely dissimilar (67% for BCC and 32% for melanoma), whereas the AUCs were similar (0.8848 for BCC and 0.9110 for melanoma).
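A sketch of how a per-diagnosis threshold can be read off an ROC curve (assuming scikit-learn; y_true and y_score are hypothetical labels and outputs for one class, and Youden's J is used here as one common definition of an "optimal" threshold, which may differ from the definition in our previous study):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = disease of interest
    y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC =", auc(fpr, tpr))

    i = np.argmax(tpr - fpr)  # maximize Youden's J = sensitivity + specificity - 1
    print("threshold =", thresholds[i],
          "sensitivity =", tpr[i], "specificity =", 1 - fpr[i])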
The sensitivity of 16 dermatologists in diagnosing BCC and melanoma with the Asan and Edinburgh datasets was approximately 40% in our previous report (Han et al., 2018a), which may seem much lower than the sensitivity trained dermatologists are expected to possess. Had the task been discriminating malignancy from nonmalignancy, they would have achieved much higher sensitivity; the task, however, was to classify 12 diseases. If the answer given was squamous cell carcinoma (SCC) when the correct diagnosis was intraepithelial carcinoma, it was counted as incorrect. In addition, the test dataset contained a mixture of Caucasian and Asian patients, which may have added extra difficulty; for example, it is challenging to distinguish pigmented BCC in an Asian patient from melanoma in a Caucasian patient with a single image and no history.
To test the generalizability of our algorithm, we evaluated our model on the ISIC dataset. To create receiver operating characteristic curves, the melanoma (n = 37), BCC (n = 39), and SCC (n = 24) images from ISIC were tested together with six benign disorders from the Edinburgh dataset (n = 819), because no benign counterparts existed in ISIC. The single-crop AUC values of our model were 0.7432 for melanoma and 0.7988 for BCC (Figure 1). The sensitivities for melanoma and BCC were 59.4% and 64.1%, respectively, when we applied the optimal Caucasian thresholds calculated in our previous work (Han et al., 2018a). In the letter by Navarrete-Dechent et al. (2018), only the first (Top-1) or first through fifth (Top-5) predictions from the web-DEMO were used when reporting sensitivity; however, every case whose output exceeded the threshold should have been counted when calculating sensitivity.
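The difference between the two counting rules can be made concrete with a small sketch (the outputs below are hypothetical, and threshold stands in for the per-class optimal threshold described above):

    # Hypothetical melanoma outputs for five true melanoma images.
    scores = [0.30, 0.12, 0.08, 0.02, 0.01]
    threshold = 0.05  # per-class optimal threshold

    # Threshold rule: every case whose output exceeds the threshold counts,
    # even if melanoma was not the Top-1 prediction for that image.
    sensitivity = sum(s > threshold for s in scores) / len(scores)
    print(sensitivity)  # 0.6 -- three of five cases exceed the threshold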
Figure 1. Results with nine disorders: three malignancies from ISIC (n = 100) plus six benign disorders from Edinburgh (n = 819). (a) The AUC for melanoma from ISIC: (#70617; single-crop) = 0.7275, (#70617; multicrop) = 0.8333, (#New online; multicrop) = 0.8515. (b) The AUC for basal cell carcinoma from ISIC: (#70617; single-crop) = 0.8216, (#70617; multicrop) = 0.8599, (#New online; multicrop) = 0.9640. We tested the "70617.caffemodel" created and released in our previous study, which was used in the web-DEMO, and we also tested the new model (built on March 4, 2018). The test code is available at http://api.modelderm.com for further validation with custom datasets. The single-crop method resizes the photograph to 224 × 224 pixels and analyzes the resized image. The multicrop method magnifies the center of the picture by up to 400% (100%, 200%, 300%, and 400%); the results from all magnifications are arithmetically averaged. Squamous cell carcinoma (SCC) and intraepithelial carcinoma (IEC) were not separated in the ISIC validation dataset, and therefore we could not compute their AUCs. When we analyzed the SCC and IEC outputs separately (#70617; single-crop), the sensitivity of SCC was 37.5% (9/24) with the optimal SCC threshold (0.0096), and the sensitivity of IEC was 100% (24/24) with the optimal IEC threshold (0.0076). AUC, area under the curve; ISIC, International Skin Imaging Collaboration.
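A sketch of the multicrop procedure as described in the caption above (assuming PIL for image handling; predict stands in for a hypothetical forward pass of the model on a 224 × 224 image):

    import numpy as np
    from PIL import Image

    def center_crop(img, scale):
        # Magnifying the center by `scale`x keeps the central 1/scale of each side.
        w, h = img.size
        cw, ch = int(w / scale), int(h / scale)
        left, top = (w - cw) // 2, (h - ch) // 2
        return img.crop((left, top, left + cw, top + ch))

    def multicrop_predict(img, predict):
        # Average the class outputs over 100%, 200%, 300%, and 400% magnifications.
        crops = [center_crop(img, s).resize((224, 224)) for s in (1, 2, 3, 4)]
        return np.mean([predict(c) for c in crops], axis=0)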
These AUC values (0.7432 and 0.7988) can be further improved by additional preprocessing and interpretation methods. When we applied a multicrop preprocessing step to the same model, the AUC values improved to 0.8333 (melanoma) and 0.8599 (BCC), improvements of 12% and 8%, respectively. To assess the performance of the model on a "malignancy vs. nonmalignancy" task, we made the model discriminate between Edinburgh nevus (n = 331) and ISIC melanoma (n = 37). The AUC for melanoma of the same model jumped from 0.6944 to 0.9129 (single-crop) (Figure 2).
Figure 2. Results with two disorders: melanoma from ISIC (n = 37) plus nevus from Edinburgh (n = 331). (a) The AUC for melanoma using [melanoma output/(melanoma output + nevus output)]: (#70617; single-crop) = 0.9129, (#70617; multicrop) = 0.9087, (#New online; multicrop) = 0.9508. (b) The AUC for melanoma using the melanoma output alone: (#70617; single-crop) = 0.6944, (#70617; multicrop) = 0.8089, (#New online; multicrop) = 0.9106. For the "melanoma vs. nevus" problem, we used the ratio of the melanoma output to the sum of the melanoma and nevus outputs, [melanoma output/(melanoma output + nevus output)]. The AUC for melanoma using this ratio is much better than the result using the melanoma output alone, because probability assigned to other malignancies no longer lowers the melanoma score. AUC, area under the curve; ISIC, International Skin Imaging Collaboration.
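A sketch of the output-ratio calculation used in Figure 2 (probs is a hypothetical dictionary of the model's softmax outputs for one image):

    def melanoma_ratio(probs):
        # Renormalize over the two classes of interest so that probability
        # assigned to other malignancies no longer lowers the melanoma score.
        m, n = probs["melanoma"], probs["nevus"]
        return m / (m + n)

    print(melanoma_ratio({"melanoma": 0.20, "nevus": 0.05, "BCC": 0.60}))  # 0.8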
There are several ways to further improve the performance of the model: (1) collect more images for training; (2) reduce the number of training classes, because the amount of training data required to maintain accuracy increases exponentially with the number of classes; and (3) remove improperly tagged photographs and unusually atypical cases. The AUCs for melanoma and BCC from ISIC using the updated model, which is available online (http://api.modelderm.com), were 0.8785 and 0.9576, respectively. For discriminating "melanoma vs. nevus," the AUC for melanoma also improved, to 0.9508.
One limitation of our study is a potentially high false-positive rate for undefined classes (disorders not included in training, or even normal skin that is folded or shaded). In practice, it is appropriate to use a threshold that is higher than the determined optimal threshold and that yields a sensitivity similar to that of dermatologists. There can be false positives even for the trained classes when the optimal threshold is used, as was seen with the web-DEMO. Another limitation of our model is the influence of image composition on the outcome. This issue may be alleviated by using a region-based CNN to crop a nodular lesion automatically; we will consider applying this method in the future, as we previously developed a unified image-composition method for creating an onychomycosis dataset (Girshick et al., 2014; Han et al., 2018b; Ren et al., 2015).

      Conflict of Interest

WL is employed by SK Telecom, but the company did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

      References

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115–8.

Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. IEEE conference on computer vision and pattern recognition; 2014. p. 580–7.

Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402–10.

Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J Invest Dermatol 2018a;138:1529–38.

Han SS, Park GH, Lim W, Kim MS, Im Na J, Park I, et al. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: automatic construction of onychomycosis datasets by region-based convolutional deep neural network. PLoS One 2018b;13:e0191493.

He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. 2015. arXiv:1502.01852.

Jeni LA, Cohn JF, De La Torre F. Facing imbalanced data—recommendations for the use of performance metrics. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII). IEEE; 2013. p. 245–51.

Navarrete-Dechent C, Dusza SW, Liopyris K, Marghoob AA, Halpern AC, Marchetti MA. Automated dermatological diagnosis: hype or reality? J Invest Dermatol 2018;138:2277–9.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. 2017. arXiv:1711.05225.

Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. 2015. arXiv:1506.01497.

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. 2014. arXiv:1409.0575.

Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. In: Proc. 24th international conference on machine learning. ACM; 2007. p. 935–42.
