Development and Validation of an Algorithm to Identify Patients with Advanced Cutaneous Squamous Cell Carcinoma from Pathology Reports

To facilitate nationwide epidemiological research on advanced cutaneous squamous cell carcinoma (cSCC), that is, locally advanced, recurrent, or metastatic cSCC, we sought to develop and validate a rule-based algorithm that identifies advanced cSCC from pathology reports. The algorithm was based on both hierarchical histopathological codes and free text from pathology reports recorded in the National Pathology Registry. Medical files from the Erasmus Medical Center of 186 patients with stage III/IV/recurrent cSCC and 184 patients with stage I/II cSCC were selected and served as the gold standard to assess the performance of the algorithm. The rule-based algorithm showed a sensitivity of 91.9% (95% confidence interval = 88.0-95.9), a specificity of 96.7% (95% confidence interval = 94.2-99.3), and a positive predictive value of 78.5% (95% confidence interval = 74.2-82.8) for all advanced cSCC combined. The sensitivity was lower per subgroup: locally advanced (52.3-86.2%), recurrent cSCC (23.3%), and metastatic cSCC (70.0%). The specificity per subgroup was above 97%, and the positive predictive value was above 78%, with the exception of metastatic cSCC, which had a positive predictive value of 62%. This algorithm can be used to identify patients with advanced cSCC from pathology reports and will facilitate large-scale epidemiological studies of advanced cSCC in the Netherlands and internationally after external validation.


INTRODUCTION
Cutaneous squamous cell carcinoma (cSCC) is one of the most common cancers in humans, and its incidence is still increasing (Lomas et al., 2012; Tokez et al., 2020). Despite the high incidence rates, cSCC is excluded from many national cancer registries, including those of the United States (Wehner, 2020). Even if data on the primary tumor are registered, no country collects data on follow-up (Adalsteinsson et al., 2021; Guorgis et al., 2020; Stang et al., 2019; Tokez et al., 2022; Venables et al., 2019). The rationale is that given the high incidence rates and relatively low occurrence of metastatic cSCC, manually reviewing all pathology reports and patient files for disease progression is not feasible. However, owing to the high overall cSCC incidence rates, the absolute number of patients with advanced cSCC is also substantial. These patients are at risk of death (Tokez et al., 2022), but no national data on them are currently being collected. A good estimate of the probability of and risk factors for disease recurrence and of the likelihood of local and systemic progression would help to improve treatment decisions and surveillance recommendations. Automated identification of advanced cSCC (i.e., locally advanced, recurrent, or metastatic cSCC) could therefore represent a feasible, cost-effective solution for cancer registries to target this subgroup of patients with cSCC.
The use of automated extraction from free-text pathology reports to select patients for cancer registries has been previously reported in the literature (Glaser et al., 2018; Hanauer et al., 2007; Jouhet et al., 2012; Nguyen et al., 2015). Two studies used natural language processing (i.e., the application of computational techniques that aid computers in comprehending, interpreting, and manipulating human language) to automatically identify keratinocyte cancers but did not concentrate on advanced cSCC (Eide et al., 2012; Thompson et al., 2020).
We aimed to develop and validate an algorithm to identify patients with advanced cSCC from pathology reports. Automatic identification of patients with advanced cSCC using this algorithm will facilitate research on advanced cSCC at a population-based level.

Development of the algorithm
Identification of locally advanced primary cSCC.
We identified locally advanced primary tumors staged as T3 or T4 according to the American Joint Committee on Cancer, eighth edition (AJCC8) tumor classification using three criteria, all of which had to be met: (i) a hierarchical histopathological code from the Nationwide Pathology Registry (i.e., Nationwide Network and Registry of Histo- and Cytopathology [PALGA] code) indicating a primary cSCC combined with a PALGA code for skin or subcutis or a PALGA sublocalization code that likely indicates a cSCC; (ii) the absence of a PALGA localization code in the first position that likely indicates another type of squamous cell carcinoma (SCC) (mucosa, cheek, maxilla, mandible, larynx, floor of the mouth); and (iii) the presence of any of the following high-risk features within the free-text conclusion of the pathology report: T3 or T4, tumor diameter >4 cm, invasion depth not precisely known but at a minimum of 5.5 mm or an exact invasion depth >6.0 mm, invasion beyond the subcutaneous fat, invasion in muscles, invasion in deep structures, bone erosion or invasion, invasion of nerves ≥0.1 mm, any perineural invasion (excluding that in nerves <0.1 mm), angioinvasive growth, and invasion depth reaching the bottom of the excision (Table 1). A few criteria, such as a minimum invasion depth of 5.5 mm, perineural invasion, angioinvasion, and invasion depth reaching the bottom of the excision, are not official AJCC8 criteria but were included because they increased the algorithm's sensitivity without a large decrease in the positive predictive value (PPV) in preliminary analyses (data not shown).
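The three-criteria rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the SAS implementation used in the study: the code labels (`PRIMARY_CSCC`, `SKIN`, and so on), the `Report` structure, and the English regular expressions are hypothetical stand-ins for the PALGA codes and free-text patterns in Table 1, and only a few of the listed high-risk features are covered.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels; the actual PALGA codes are listed in Table 1.
PRIMARY_CSCC = "PRIMARY_CSCC"
SKIN_CODES = {"SKIN", "SUBCUTIS"}
NON_CUTANEOUS_SITES = {"MUCOSA", "CHEEK", "MAXILLA", "MANDIBLE",
                       "LARYNX", "FLOOR_OF_MOUTH"}

# Illustrative English patterns for a subset of the high-risk features.
HIGH_RISK_PATTERNS = [
    r"(?<![A-Za-z0-9])p?T[34](?![0-9])",      # T3 or T4 stage mention
    r"bone (erosion|invasion)",
    r"invasion (in|into|of) (the )?muscle",
    r"beyond (the )?subcutaneous fat",
    r"angioinvasi",
]


@dataclass
class Report:
    codes: List[str]   # diagnostic codes in their recorded order
    conclusion: str    # free-text conclusion of the pathology report


def invasion_depth_mm(text: str) -> Optional[float]:
    """Return the first invasion depth in mm mentioned in the text, if any."""
    m = re.search(r"invasion depth[^0-9]*(\d+(?:\.\d+)?)\s*mm", text, re.I)
    return float(m.group(1)) if m else None


def is_locally_advanced_ajcc8(report: Report) -> bool:
    # (i) primary cSCC code combined with a skin or subcutis code
    has_primary = PRIMARY_CSCC in report.codes and any(
        c in SKIN_CODES for c in report.codes)
    # (ii) first code must not point to another type of SCC
    other_site_first = bool(report.codes) and report.codes[0] in NON_CUTANEOUS_SITES
    # (iii) at least one high-risk feature in the free-text conclusion
    depth = invasion_depth_mm(report.conclusion)
    high_risk = any(re.search(p, report.conclusion, re.I)
                    for p in HIGH_RISK_PATTERNS)
    high_risk = high_risk or (depth is not None and depth > 6.0)
    return has_primary and not other_site_first and high_risk
```

A real implementation would also need patterns for the remaining features (for example, the minimum depth of 5.5 mm and perineural invasion) and would operate on the Dutch report text.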
We identified locally advanced primary tumors staged as T2b or T3 according to the Brigham and Women's Hospital (BWH) alternative T-classification system by integrating three criteria: (i) a PALGA code for primary cSCC combined with a PALGA code for skin or subcutis or a PALGA sublocalization code that likely indicates a cSCC; (ii) the absence of a PALGA localization code in the first position with a low likelihood of being a cSCC (mucosa, cheek, maxilla, mandible, larynx, floor of the mouth); and (iii) bone invasion or at least two of the following high-risk features within the pathology report's free-text conclusion: tumor diameter ≥2 cm, poor differentiation, perineural invasion in nerves ≥0.1 mm, invasion beyond the subcutaneous fat, invasion in deep structures, or invasion in muscles (Table 1).
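The free-text part of the BWH rule (criterion iii) is a simple counting rule, sketched below in Python. The field and function names are hypothetical, and the sketch assumes the features have already been extracted from the report conclusion. It also assumes that invasion beyond the subcutaneous fat, in deep structures, and in muscles together count as a single risk factor, in line with the four BWH risk factors; the text does not state how the algorithm counts them.

```python
from dataclasses import dataclass


@dataclass
class BwhFeatures:
    """Hypothetical container for features extracted from one report."""
    diameter_ge_2cm: bool = False
    poorly_differentiated: bool = False
    perineural_ge_0_1mm: bool = False
    beyond_subcutaneous_fat: bool = False
    deep_structure_invasion: bool = False
    muscle_invasion: bool = False
    bone_invasion: bool = False


def meets_bwh_free_text_criterion(f: BwhFeatures) -> bool:
    # Assumption: the three invasion items form one BWH risk factor.
    deep_invasion = (f.beyond_subcutaneous_fat
                     or f.deep_structure_invasion
                     or f.muscle_invasion)
    risk_count = sum([f.diameter_ge_2cm,
                      f.poorly_differentiated,
                      f.perineural_ge_0_1mm,
                      deep_invasion])
    # Bone invasion alone suffices; otherwise at least two risk factors.
    return f.bone_invasion or risk_count >= 2
```

If each invasion item were instead counted separately, a report mentioning only muscle and deep-structure invasion would also be flagged, so this choice affects both sensitivity and PPV.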

Identification of recurrent cSCC.
For the identification of recurrent cSCC, two criteria had to be met: (i) a PALGA code for primary cSCC combined with either a PALGA code for skin or subcutis or a mention of skin or subcutis in the free-text pathology conclusion and (ii) free text in the pathology conclusion indicating a recurrence (Table 1).

Identification of metastasis.
We identified metastasis in three ways: (i) a PALGA code for metastatic SCC in combination with a PALGA code for skin or subcutis; (ii) a PALGA code for primary SCC, metastatic SCC, metastatic carcinoma, or possible metastasis in combination with a PALGA code for parotid, salivary, or submandibular gland; and (iii) a free-text algorithm that identifies the terms metastatic or malignant cells in the pathology conclusion in combination with the term squamous in the pathology conclusion or a primary SCC PALGA code (Table 1). Subsequently, we excluded pathology reports showing metastatic disease unlikely to have originated from an SCC of the skin (e.g., oral and pharyngeal cancers, lung cancer, or SCC of unknown origin). All of the foregoing principles were included to enhance the algorithm's capabilities. This way, patients with multiple advanced cSCCs could still be identified even if one of their advanced cSCC reports was missing.
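The three routes and the subsequent exclusion step can be sketched as follows in Python. All code labels and regular expressions are hypothetical placeholders for the entries in Table 1, and the sketch assumes that the exclusion is applied to the free-text conclusion; the text does not specify whether codes, free text, or both were used for that step.

```python
import re
from typing import Iterable

# Hypothetical labels standing in for the PALGA codes in Table 1.
SKIN = {"SKIN", "SUBCUTIS"}
SALIVARY = {"PAROTID", "SALIVARY_GLAND", "SUBMANDIBULAR_GLAND"}
ROUTE2_MORPHOLOGY = {"PRIMARY_SCC", "METASTATIC_SCC",
                     "METASTATIC_CARCINOMA", "POSSIBLE_METASTASIS"}

# Assumed free-text exclusion of non-cutaneous origins.
EXCLUDED_ORIGIN = re.compile(
    r"\boral\b|pharyn|\blung\b|unknown (origin|primary)", re.I)


def is_cscc_metastasis(codes: Iterable[str], conclusion: str) -> bool:
    code_set = set(codes)
    # (i) metastatic SCC code plus skin or subcutis code
    route1 = "METASTATIC_SCC" in code_set and bool(code_set & SKIN)
    # (ii) SCC or metastasis code plus salivary gland localization
    route2 = bool(code_set & ROUTE2_MORPHOLOGY) and bool(code_set & SALIVARY)
    # (iii) free text: metastatic or malignant cells, plus squamous
    #       in the text or a primary SCC code
    mentions_metastasis = bool(
        re.search(r"metasta|malignant cells", conclusion, re.I))
    mentions_squamous = bool(re.search(r"squamous", conclusion, re.I))
    route3 = mentions_metastasis and (
        mentions_squamous or "PRIMARY_SCC" in code_set)
    if not (route1 or route2 or route3):
        return False
    # Exclude metastases unlikely to originate from an SCC of the skin.
    return EXCLUDED_ORIGIN.search(conclusion) is None
```

Word boundaries around `oral` and `lung` keep terms such as "temporal" from triggering the exclusion, which illustrates why free-text rules of this kind need careful testing against real reports.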

Study population.
We included 186 patients with advanced cSCC treated at the Erasmus MC Cancer Institute (Rotterdam, The Netherlands) between May 18, 2018 and October 9, 2020 (Table 2). The majority of patients had locally advanced primary cSCCs, of which 116 were classified as T3/T4 according to AJCC8, and 63 were classified as T2b/T3 according to BWH. There were 30 local recurrent tumors and 40 metastases. In addition, we included 184 patients treated at the Erasmus MC Cancer Institute between January 16, 2016 and September 23, 2020 with a T1/T2 cSCC according to AJCC8 and who were not T2b/T3 according to BWH (Table 2).

Measures of performance
Sensitivity.
The algorithm correctly identified 171 of 186 patients with advanced cSCC, which resulted in an overall sensitivity of 91.9% (95% confidence interval [CI] = 88.0-95.9) (Table 3). The majority of false negatives were caused by clinically identified features of advanced tumors that were not described or seen during pathological assessment, such as a clinical diameter >4 cm or imaging-detected bone invasion. All false negatives are summarized in Supplementary Table S1. The sensitivity per subgroup was lower, ranging from 23.3% for recurrent cSCC to 86.2% for T3/T4 (AJCC8) locally advanced primary cSCC, with 52.3% for T2b/T3 (BWH) locally advanced primary cSCC and 70.0% for metastases in between. The sensitivity of the algorithm for locally ...

Specificity.
The algorithm falsely identified six patients as advanced cSCC among 184 patients with low-stage cSCC, whereas 178 were correctly categorized as low stage, resulting in a specificity of 96.7% (95% CI = 94.2-99.3) (Table 3). All false-positive cases are summarized in Supplementary Table S4. Stratified analysis revealed a specificity >97% for all subgroups.
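The overall sensitivity and specificity follow directly from the counts reported above (171 of 186 and 178 of 184). The Python sketch below recomputes them with a normal-approximation (Wald) interval. The choice of interval is an assumption, because the method used for the CIs is not stated here, so the last digit of a bound may differ slightly from the reported values. The function name is illustrative.

```python
import math
from typing import Tuple


def proportion_with_wald_ci(successes: int, total: int,
                            z: float = 1.96) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for a proportion with a Wald CI."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Sensitivity: advanced cSCC patients correctly identified by the algorithm.
sensitivity = proportion_with_wald_ci(171, 186)
# Specificity: low-stage patients correctly left unflagged.
specificity = proportion_with_wald_ci(178, 184)

for name, (est, lo, hi) in [("sensitivity", sensitivity),
                            ("specificity", specificity)]:
    print(f"{name}: {100 * est:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

The same function can be applied to each subgroup once the per-subgroup counts from Table 3 are available.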

DISCUSSION
In this study, we have developed and validated a rule-based algorithm on the basis of hierarchical histopathological codes and free text that automatically identifies patients with advanced cSCC from pathology reports with a favorable sensitivity of 91.9% and a specificity of 96.7%. The PPV, that is, the percentage of identified pathology reports that truly concern advanced cSCC, was almost 80% for all advanced cSCC combined. Such a high PPV is critical when the algorithm is used to identify patients with advanced cSCC for cancer registries or other observational studies because it avoids reading many files of low-risk patients and thereby wasting registration time.
The sensitivity of specific subgroups was lower. For example, if only the part of the algorithm to detect metastasis was used, 70% of all metastatic cSCC would be detected instead of the more than 90% detected by the combined algorithm. The combined algorithm has a higher sensitivity because most patients with metastatic cSCC also have a pathology report for a locally advanced primary or recurrent cSCC and will be identified in this manner when a pathology report for metastasis is missing, for example, in the case of imaging-detected metastasis without histological confirmation. Thus, when this algorithm is used to assess the prevalence of specific subgroups of advanced cSCC, it should be taken into account that the stratified sensitivity was lower and that, for example, 30% of metastatic cSCC may have been missed. The stratified PPV was similarly high for most subgroups, except for metastasis. Of all patients who were identified as having cSCC metastasis, 38% were false positives. Reasons for this included that the algorithm misidentified reports of patients with mucosal SCC metastasis and reports that described the tumor as either a new primary cSCC or a skin metastasis when it was in fact a new primary cSCC. Nevertheless, the algorithm is expected to save a substantial amount of time. Even if it is used only to detect metastases, reviewing the identified reports takes far less time than opening the files of all patients who develop cSCC every year.

Comparison with literature
Various computational techniques have been explored to extract cancer-related information from pathology free text (Spasić et al., 2014). The majority focused on colorectal, breast, prostate, and lung cancer (Buckley et al., 2012; Coden et al., 2009; Currie et al., 2006). To the best of our knowledge, this is the only automated pathology algorithm that concentrates on advanced cSCC. Eide et al. (2012) used free-text retrieval from pathology reports to identify keratinocyte cancers and thereby validate medical claims data algorithms but did not address (high-risk) cSCC in particular. Thompson et al. (2020) used supervised learning methods to build a web application that automatically extracts diagnostic information for keratinocyte cancers, such as (subtype) diagnosis and site, from free-text pathology. Their objective was to estimate incidences accurately in the absence of nationwide registration, not specifically to identify or extract cSCC high-risk features (Thompson et al., 2020).

Strengths and limitations
Strengths of the study include the manual registration of 186 patients with locally advanced primary (stage III), recurrent, or metastatic cSCC (stage III/IV) and 184 patients with stage I/II cSCC from the medical patient files of the Erasmus MC Cancer Institute. This dataset included patients with clinically diagnosed advanced cSCC but no histological confirmation (e.g., imaging-detected bone invasion), which was critical for an accurate estimation of the algorithm's sensitivity. Furthermore, it was of vital importance to be able to retrieve the complete history of all pathology reports by linking them to a nationwide database of pathology reports (PALGA) to include primary cSCC of referred patients. Information from pathology reports is complex to retrieve automatically because most reports are written in narrative format and because the nomenclature for describing a diagnosis or lack thereof varies greatly between pathologists. The sensitivity of our algorithm could have been higher had several high-risk features that were present during pathological assessment, such as tumor diameter, been reported. Nationwide implementation of synoptic reporting for tumor characteristics would therefore greatly improve data quality and collection. Synoptic reporting is currently used in 29% of all cSCC pathology reports in the Netherlands, but more laboratories have agreed to use it in the near future (Swillens et al., 2019).
However, even when synoptic reporting rates are low, the algorithm can still accurately identify patients with advanced cSCC from pathology reports. Given that SNOMED-CT was reported to be utilized in over 50 countries in 2013 and that synoptic reporting is likely to grow in the future, we believe that our rule-based algorithm can be used globally after external validation (Lee et al., 2013).
Another obstacle that we encountered during the analysis was that the algorithm identified more patients with advanced cSCC than those we had initially included in our selection. All pathology reports of patients who were identified by the algorithm but not included in the sensitivity dataset were therefore manually reviewed. However, scoring a pathology report as a true positive was done in a conservative way. For example, if it was unclear from the pathology report whether it was a new primary cSCC or a skin metastasis, we included the report as false positive, whereas if we had had the clinical information, this may have been a true positive. This is likely to have resulted in an underestimation of the PPV. The algorithm has yet to be externally validated, which will require data from both a nationwide pathology registry and a single institution. To enable external validation and thereby increase its international applicability, the algorithm has been translated into the corresponding international SNOMED-CT codes and English free text.
This study shows that patients with advanced cSCC can be accurately identified from pathology reports, allowing cost-effective, targeted surveillance of patients with advanced cSCC. Although external validation still has to take place, this rule-based algorithm opens up future large-scale epidemiological research on advanced cSCC.

Definition of advanced cSCC
In this study, advanced cSCC was defined as locally advanced primary cSCC (either T3/T4 according to AJCC8 or T2b/T3 according to BWH), recurrent cSCC, or metastatic cSCC (skin, nodal, or distant metastasis). Tumors that had been staged according to the seventh edition of the American Joint Committee on Cancer classification were included if they fulfilled the AJCC8 T3/T4 criteria (i.e., T3/T4 or T1/T2 with perineural invasion, tumor depth >6 mm, invasion beyond subcutaneous fat, or minor bone erosion).

Study population and data sources
Sensitivity dataset.
To determine sensitivity, we retrieved data on patients with advanced cSCC from the clinical patient files of the Erasmus MC Cancer Institute. These patients were identified by reviewing the records from the multidisciplinary skin cancer board meetings between May 18, 2018 and October 9, 2020, where all patients with advanced cSCC were discussed weekly. Subsequently, we retrieved all pathology reports that met the criteria related to cSCC (either primary, recurrent, or metastatic) from these patients from PALGA (see Supplementary Table S5) (Casparie et al., 2007). Pathology reports from other pathology laboratories were also included because patients may have been diagnosed with advanced cSCC in another hospital before being sent to the Erasmus MC Cancer Institute.

Specificity dataset.
To determine the specificity of the algorithm, we selected a random sample of patients with low-stage (nonadvanced) cSCC, that is, stage I or II according to AJCC8 and not T2b/T3 according to BWH, from the Erasmus MC Cancer Institute between January 16, 2016 and September 23, 2020. These patients were identified by reviewing all patient records with a Diagnostic Related Group code for skin cancer in combination with a specific diagnosis of cSCC. These patients were also linked to PALGA to retrieve the same selection of pathology reports as previously mentioned.

PPV dataset.
To determine the PPV, we retrieved all pathology reports of cSCC in the Erasmus MC Cancer Institute from PALGA during the same time period. To identify all patients with cSCC on the basis of pathology reports, we applied the selection criteria presented in Supplementary Table S5. Thereafter, all cSCC-related pathology reports in the Erasmus MC Cancer Institute were retrieved using the same criteria as those used for patients in the sensitivity and specificity dataset (see Supplementary Table S5).

PALGA data
The data from PALGA included the report's conclusion, which was either free-text based or automatically generated if synoptic reporting was used. In addition to the conclusion, the pathologist assigned one or more diagnostic rules to each report as a standard, which consisted of a combination of diagnostic terms (localization, procedure, disease) from the PALGA thesaurus (https://www.palga.nl/palga-on-line-thesaurus.html). The diagnostic terms are automatically translated into one or more PALGA codes from a hierarchical coding system on the basis of SNOMED-CT, a well-established international terminology system that allows language-based data exchange both nationally and internationally. Examples of PALGA free-text conclusions and diagnostic rules can be found in Supplementary Table S6. We provided the PALGA codes as well as the SNOMED-CT for our algorithm and translated Dutch free text into English to facilitate external validation, thereby increasing the international applicability of our rule-based algorithm.

Data extraction from Erasmus MC Cancer Institute medical files
Data from the medical files of patients with advanced cSCC included type of advanced cSCC (i.e., locally advanced, recurrent, and metastatic cSCC); tumor location; tumor diameter (cm); pathology features (e.g., tumor differentiation, invasion depth [mm], presence of invasion beyond subcutaneous fat, perineural invasion ≥0.1 mm, lymphovascular invasion, and bone invasion); presence of in-transit, regional, or distant metastasis; date of pathology diagnosis; and pathology record number. Clinical factors, such as imaging-detected bone invasion, were also recorded.
For stage I/II cSCC, the following data were retrieved: tumor location, tumor diameter (cm), pathology features (tumor differentiation, invasion depth [mm]), date of pathology diagnosis, and pathology record number. Patients were excluded if invasion depth was unreported or if it reached the bottom of the biopsy unless an invasion depth >6 mm was thought to be very unlikely (e.g., a superficial biopsy of a tumor <1 cm in clinical diameter). Similarly, patients with an unreported clinical tumor diameter were only included in stage I/II selection if the postoperative defect size suggested that the tumor should have been <4 cm.

Statistical analyses
To estimate a sensitivity or specificity of 85% as a single proportion with a 95% CI between 80% and 90%, we aimed to include 193 advanced cSCC and 193 stage I/II cSCC. Patient and tumor characteristics were presented as means and proportions.
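A sample size of this kind can be approximated with the standard normal-approximation formula for a single proportion, n = z²·p(1 − p)/d², where d is the half-width of the CI. The Python sketch below is an illustration under that assumption. It yields a figure close to, but not identical to, the 193 per group stated above, which suggests a somewhat different method was used; that method is not described here.

```python
import math


def sample_size_single_proportion(p: float, half_width: float,
                                  z: float = 1.96) -> int:
    """Normal-approximation sample size for estimating one proportion."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)


# Expected proportion of 85% with a 95% CI of 80% to 90% (half-width 5%).
n_per_group = sample_size_single_proportion(0.85, 0.05)
print(n_per_group)
```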
The sensitivity, specificity, and PPV of the algorithm with a 95% CI were calculated. Measures of performance were stratified by the type of advanced cSCC (i.e., locally advanced primary cSCC according to AJCC8 or BWH, recurrent cSCC, and metastatic cSCC). The algorithm was developed using SAS 9.1.3 (SAS Institute, Cary, NC). Descriptive statistics were used to characterize the study cohort and were performed using Statistical Package for the Social Sciences 25.0 statistical software (SPSS, Chicago, IL). This study was approved by the scientific committees of the Erasmus MC Cancer Institute (MEC-2020-0054), PALGA, and the Dutch Clinical Research Foundation (W20.048/NMWO20.02.007) and was conducted with waived informed consent.

Data availability statement
The data used to support the findings of this study are available from the Erasmus MC Cancer Institute and Nationwide Network and Registry of Histo- and Cytopathology, but they are under license and hence not publicly available. The authors can provide data on reasonable request and with permission from the Erasmus MC Cancer Institute and Nationwide Network and Registry of Histo- and Cytopathology.

The single high-risk factor was clinical tumor diameter (>4 cm), which was recorded in the patient file but not indicated in the pathology report.
106, T3/T4: Perineural growth was seen during MMS, which was recorded in the patient file but not in the pathology report.
212, T3/T4: Bone invasion clinically detected by CT scan (not during pathological assessment).
225, T3/T4: Perineural growth was seen during MMS, which was recorded in the patient file but not in the pathology report.
245, T3/T4: Bone invasion clinically detected by CT scan (not during pathological assessment).
247, T3/T4: The single high-risk factor was clinical tumor diameter (>4 cm), which was recorded in the patient file but not indicated in the pathology report.
321, T3/T4: Perineural growth was seen during MMS, which was recorded in the patient file but not in the pathology report.
352, T3/T4: The single high-risk factor was a clinical tumor diameter (>4 cm), which was recorded in the patient file but not indicated in the pathology report.
229, Recurrence: No mention of recurrence in the pathology report.
367, Recurrence: No mention of recurrence in the pathology report.
315, Skin metastasis: The skin metastasis is described in the pathology report as a large-cell malignancy matching the SCC localization.
213, Lymph node metastasis: Detected by imaging, not by histopathology (FNA was inconclusive).
357, Lymph node metastasis: This pathology report was missing in our selection because it lacked a morphology code of SCC. Only a code for carcinoma was included.