Latent class analysis (LCA) is a statistical technique that allows for identification, in a population characterized by a set of predefined features, of hidden clusters or classes, that is, subgroups that have a given probability of occurrence and are characterized by a specific and predictable combination of the analyzed features. Compared with other methods of so-called data segmentation, such as hierarchical clustering, LCA derives clusters using a formal probabilistic approach and can be used in conjunction with multivariate methods to estimate parameters. The optimal number of classes is the one that minimizes the degree of relationship among cases belonging to different classes, and it is decided by relying on methods such as the Bayesian Information Criterion that capitalize on the value of the negative log-likelihood function, a well-established measure of the goodness of fit of a statistical model. LCA has not been extensively used in dermatology. The areas of application are manifold, from phenotype classification to the analysis of behavior in relation to risk factors to the performance of diagnostic tests.
Latent class analysis (LCA) is a statistical way to uncover hidden clusters in data by grouping subjects with a number of prespecified multifactorial features or manifest variables into latent classes (LCs), that is, subgroups with similar characteristics based on unobservable membership. The assumption is that, theoretically, any combination of a set of features could happen, but in reality, only a few of them do happen, forming a limited set of clusters where the individual features have a specific probability of occurrence, exactly the LCs. For example, one may question if expert clinicians confronted with a series of patients with different clinical features are similarly consistent when posing a diagnosis of psoriatic arthritis. In a study using LCA, it was documented that expert clinicians group together into two clusters labeled as high and low diagnosers of psoriatic arthritis. No intermediate category was found (Developing classification criteria for peripheral joint psoriatic arthritis. Step I. Establishing whether the rheumatologist's opinion on the diagnosis can be used as the "gold standard"). Less experienced clinicians or residents might have been clustered in a larger number of categories, expressing diagnostic uncertainty.
Summary Points
What is Latent Class Analysis?
Latent class analysis (LCA) is a statistical way to uncover hidden clusters in data. This technique divides a set of observations (cases) characterized by several variables into mutually exclusive groups or classes, such that the observed variables are unrelated to each other within each class (local independence) and observations are similar in each class but different from those in other classes. Technically speaking, LCA is a special kind of finite mixture model (FMM). FMMs are unsupervised learning models that represent a statistical distribution as a mixture (or weighted sum) of other distributions and group similar data together based on selected parameters (i.e., data segmentation).
The optimal number of clusters in the set of observations is decided based on explicit probabilistic rules such as the Akaike Information Criterion and the Bayesian Information Criterion.
One advantage of FMMs compared with other methods of data segmentation, such as cluster analysis, is that they can be used in conjunction with multivariate methods.
Variables in LCA should be qualitative and nominal. When continuous variables are used, alone or in combination with categorical variables, the term latent profile analysis is usually preferred.
What are the major applications of LCA in clinical research?
LCA has a potentially extensive field of applications in clinical research, where qualitative and nominal variables are frequently used, ranging from phenotype classification to the analysis of behavior in relation to risk factors to the performance of diagnostic tests. The actual applications in dermatology have been, however, rather limited.
Terminology matters and we refer to the Glossary for definitions (see Supplementary Materials). The term latent implies that the analysis is based on an error-free underlying variable that is not directly measurable or observable but that can cause effects, for example, the diagnostic attitude of clinicians confronted with a given clinical scenario.
How to Use LCA
Clustering and the concept of finite mixture modeling
Technically speaking, LCA is a special kind of finite mixture model (FMM), which assumes that an observed set of data derives from several underlying subpopulations and makes statistical inferences about the properties of these subpopulations having information on the pooled population only. In the previous example, the only distribution available is the distribution of the diagnostic decisions by experienced clinicians. How these clinicians group together, considering their propensity to make a diagnosis of psoriatic arthritis when confronted with specific clinical features, is not known. Unlike other clustering techniques such as hierarchical clustering, which try to find clusters with some arbitrarily chosen distance measure, FMM derives clusters using a probabilistic approach.
There are two sets of parameters in an LCA. The first is the set of inclusion probabilities (or class membership probabilities), that is, the probability that any random case in a population will be included in any LC. In the previous example, there are two classes, high and low diagnosers, and each experienced clinician has a given probability of belonging to one or the other class. The second is the set of conditional probabilities, that is, the probability that, given a specific class, a variable takes a certain value, for example, the probability that a patient with a specific feature, being assessed by a clinician in the class of low diagnosers, is classified as a patient with psoriatic arthritis. These probabilities are usually presented in a tabular format, as in Table 3, showing data from a study of hidradenitis suppurativa (HS). In the table, for example, the inclusion probabilities were 0.48 and 0.26 for LC1 and LC2, respectively. Comedones had a conditional probability of 0.25 in LC1 and 0.74 in LC2.
Table 1. Details of the Main Available Software Packages for LCA

| Software name | License | Package/plugin | Covariates | Polytomous manifest variables | Continuous manifest variables | Longitudinal LCA | Other features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R (R Foundation for Statistical Computing, Vienna, Austria) | Open source | poLCA | Yes | Yes | No | No | Results visualization, dataset simulation |
| | | e1071 (lca) | No | No | No | No | — |
| | | BayesLCA | No | No | No | No | Bayesian setting LCA |
| | | RandomLCA | No | No | No | Yes | Random effects LCA |
| | | LCAvarsel | Yes | Yes | No | No | Variable selection framework |
| SAS (SAS Institute Inc, Cary, NC) | Commercial | proc LCA | Yes | Yes | No | Yes (with proc LTA) | Accounting for sampling weights and clusters |
| STATA (StataCorp LLC, College Station, TX) | Commercial | LCA plugin | Yes | Yes | No | No | Accounting for sampling weights and clusters |
| MPLUS (Muthén & Muthén Computer Software, Los Angeles, CA) | Commercial | — | Yes | Yes | Yes | Yes | Ordinal, censored, and count manifest variables; FMMs and mixture regression; Random effects LCA |
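Parameters of the subdistributions are usually determined by maximum likelihood estimation with the expectation-maximization algorithm; the resulting likelihood is a well-established measure of the goodness of fit of a statistical model. The typical equation is:

$$P(\mathbf{Y} = \mathbf{y}) = \sum_{k=1}^{K} \gamma_k \prod_{j=1}^{J} \rho_{j, y_j \mid k}$$

where $P(\mathbf{Y} = \mathbf{y})$ is the probability of observing a particular combination of responses $\mathbf{y} = (y_1, \ldots, y_J)$ in a group of $J$ variables, $\gamma_k$ is the probability of membership in LC $k$, and $\rho_{j, y_j \mid k}$ is the probability of response $y_j$ to variable $j$, conditional on membership in LC $k$.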
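As an illustration of the mixture formula described above, the following minimal sketch computes the probability of one response pattern and the resulting posterior class membership for binary manifest variables. The function names, the two-class structure, and all numeric values are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
import numpy as np

def pattern_probability(y, class_probs, item_probs):
    """Probability of one response pattern of binary items under an LCA model.

    y           : sequence of 0/1 responses, length J
    class_probs : array of shape (K,), class membership probabilities
    item_probs  : array of shape (K, J), P(item j = 1 | class k)
    """
    y = np.asarray(y)
    # Local independence: within a class, item probabilities multiply.
    per_class = np.prod(np.where(y == 1, item_probs, 1.0 - item_probs), axis=1)
    # Mixture: weight each class by its membership probability and sum.
    return float(np.dot(class_probs, per_class))

def posterior_membership(y, class_probs, item_probs):
    """Posterior probability of each class given the response pattern (Bayes rule)."""
    y = np.asarray(y)
    per_class = np.prod(np.where(y == 1, item_probs, 1.0 - item_probs), axis=1)
    joint = class_probs * per_class
    return joint / joint.sum()

# Hypothetical two-class, three-item model (illustrative values only).
class_probs = np.array([0.6, 0.4])
item_probs = np.array([[0.9, 0.8, 0.7],
                       [0.2, 0.1, 0.3]])

p_all_present = pattern_probability([1, 1, 1], class_probs, item_probs)
post_all_present = posterior_membership([1, 1, 1], class_probs, item_probs)
```

Summing `pattern_probability` over all possible response patterns should give 1, which is a quick sanity check on any set of parameters.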
As already mentioned, LCA divides a set of observations (cases) into mutually exclusive groups, or classes, such that manifest variables are unrelated to each other within each class (local independence) and observations are similar in each class but different from those in other classes. Additional covariates can also be used to predict class membership (Figure 1). Going back to the initial example, each clinician is confronted with a set of variables characterizing the patient as a case or noncase of psoriatic arthritis. These manifest variables, such as family history of psoriasis, rheumatoid factor (RF) titer, nail dystrophy, and toenails dactylitis, distribute differently in the two classes of high versus low diagnosers of psoriatic arthritis. In the analysis, to adjust for covariates that may affect classification, such as sex or age of the clinician, multivariate methods can be employed.
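The expectation-maximization estimation mentioned earlier can be sketched for the simplest case of binary manifest variables without covariates. This is a bare-bones illustration under stated assumptions, not the algorithm implemented in any of the packages in Table 1: the function name `fit_lca_em`, the random starting values, the stopping rule, and the simulated dataset are all hypothetical.

```python
import numpy as np

def fit_lca_em(X, n_classes, n_iter=200, tol=1e-8, seed=0):
    """Fit a latent class model to binary data X (n cases x J items) by EM.

    Returns class membership probabilities (K,), conditional item
    probabilities (K, J), and the log-likelihood recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, J = X.shape
    class_probs = np.full(n_classes, 1.0 / n_classes)
    item_probs = rng.uniform(0.25, 0.75, size=(n_classes, J))  # random start
    trace = []
    for _ in range(n_iter):
        # E-step: log of (class probability x pattern probability within class).
        log_joint = (np.log(class_probs)
                     + X @ np.log(item_probs).T
                     + (1.0 - X) @ np.log(1.0 - item_probs).T)  # shape (n, K)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        post = np.exp(log_joint - log_norm[:, None])  # posterior memberships
        trace.append(float(log_norm.sum()))  # log-likelihood at current parameters
        # M-step: update both parameter sets from the posterior memberships.
        class_probs = post.mean(axis=0)
        item_probs = (post.T @ X) / post.sum(axis=0)[:, None]
        item_probs = np.clip(item_probs, 1e-6, 1.0 - 1e-6)  # keep logs finite
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
    return class_probs, item_probs, trace

# Simulated data from a hypothetical two-class model (illustration only).
rng = np.random.default_rng(42)
true_item_probs = np.array([[0.9, 0.8, 0.9, 0.8],
                            [0.1, 0.2, 0.1, 0.2]])
labels = rng.choice(2, size=1000, p=[0.6, 0.4])
X = (rng.random((1000, 4)) < true_item_probs[labels]).astype(int)

class_probs, item_probs, trace = fit_lca_em(X, n_classes=2)
```

In practice, EM can converge to a local maximum, so dedicated software typically repeats the estimation from several random starts and keeps the solution with the highest log-likelihood.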
Figure 1. Representation of an LCA model. Var n represents the manifest variables; Cov z, the additional covariates; and Class k, the latent classes predicted by LCA. LCA, latent class analysis.
In LCA, measurable variables and covariates should be qualitative and nominal, with two or more categories per variable (e.g., sex [males and females] and age categories). Depending on the software, variables can be directly handled as nominal entities or must be entered as dichotomous (dummy) variables. For example, RF titers were not taken as a continuous variable in the analysis we mentioned previously, but two dummy variables were employed, RF titer higher than 40 and RF titer higher than 80.
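The dummy coding described above can be sketched as follows. The thresholds of 40 and 80 come from the text; the function name, the variable names, and the example titers are hypothetical.

```python
def rf_titer_dummies(titer):
    """Code a continuous RF titer as two dichotomous (dummy) indicators."""
    return {
        "rf_gt_40": int(titer > 40),  # 1 if the titer is higher than 40
        "rf_gt_80": int(titer > 80),  # 1 if the titer is higher than 80
    }

# Hypothetical titers for three patients (illustration only).
titers = [12, 55, 130]
coded = [rf_titer_dummies(t) for t in titers]
```

Note that the two indicators are nested: a titer higher than 80 is necessarily higher than 40, so only three of the four possible combinations can occur.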
When manifest variables are taken as continuous, alone or in combination with categorical variables, the term latent profile analysis is the preferred one (Mixture models: latent profile and latent class analysis. in: Robertson J. Kaptein M. Modern statistical methods for HCI (Human–computer interaction series). Springer International Publishing Switzerland, Cham, Switzerland 2016: 275-290; in: Latent class and latent transition analysis with applications in the social, behavioral, and health sciences. John Wiley & Sons, Inc, Hoboken, NJ 2010: 181-224). By examining repeats of the same categorical indicator, repeated measures LCA (RMLCA) makes it possible to see how many common patterns of change over time emerge and what the probability of a target outcome is for each repeat in each class.
Different free and commercial software packages are available for LCA. Table 1 presents a list of the main available ones.
Choosing the number of classes
LCA usually provides several options for data grouping, and a crucial problem is to choose the optimal number of classes (k). The decision should be based on statistical grounds. The methods most frequently adopted, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), involve finding the k value that minimizes the negative log-likelihood function, increased by some penalty function that reflects the complexity of the model.
Interpretability is also an important issue but may give rise to some arbitrary decisions. In a study, comorbidity patterns were assessed in a series of more than 110,000 incident patients with psoriasis. The value of BIC was 10,320 for a four-class model and 7,814 for a five-class model. In spite of a higher BIC value, the four-class model was chosen because it was more easily interpretable based on the following classes: multi-comorbid class (patients with a variety of conditions), metabolic syndrome class, hypertension and chronic obstructive pulmonary disease class, and relatively healthy class. The decision to disregard BIC values appears questionable, and further validation is needed.
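The penalized-likelihood rule discussed in this section can be sketched with the standard definitions AIC = -2 ln L + 2p and BIC = -2 ln L + p ln(n), where p is the number of free parameters and n the number of cases. The function names are hypothetical, and the parameter count assumes an LCA without covariates. The two BIC values in the usage lines are those quoted above for the psoriasis comorbidity study.

```python
import math

def n_free_parameters(n_classes, n_categories):
    """Free parameters of an LCA without covariates.

    (K - 1) class membership probabilities plus, for each class,
    (C_j - 1) conditional probabilities for every manifest variable j.
    """
    return (n_classes - 1) + n_classes * sum(c - 1 for c in n_categories)

def aic(log_lik, n_params):
    """Akaike Information Criterion: penalty of 2 per free parameter."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_cases):
    """Bayesian Information Criterion: penalty grows with ln(sample size)."""
    return -2.0 * log_lik + n_params * math.log(n_cases)

def best_k(criterion_by_k):
    """Return the number of classes with the smallest criterion value."""
    return min(criterion_by_k, key=criterion_by_k.get)

# BIC values quoted in the text for the four- and five-class models.
reported_bic = {4: 10320, 5: 7814}
preferred = best_k(reported_bic)  # 5, the model with the lower BIC
```

On purely statistical grounds the rule selects the five-class model here, which is why the choice of the four-class model in that study rests on interpretability rather than on BIC.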
LCA in Dermatology
LCA has not been extensively used in dermatology. We searched Medline up to 28 February 2020 and retrieved a total of 6,159 papers using an LCA methodology. Out of these, only 37 papers dealt with dermatological conditions (see Supplementary Material).
The areas of application were in rank order of frequency: the phenotype classification of allergic diseases and eczema (n = 12), the analysis of behavior in relation to several different risk factors (n = 10; out of these, six studies were dealing with sexually transmitted disease), the phenotype classification of skin diseases other than eczema (n = 8, including psoriasis, dermatomyositis, vitiligo, HS, chronic skin ulcers, and psychodermatological conditions), the performance of diagnostic tests (n = 4), and the pattern of response to drugs and adverse reactions (n = 3). Overall, the absence of studies in the area of cutaneous oncology and the limited number of studies dealing with chronic inflammatory diseases other than eczema is remarkable.
The use of LCA is well established when assessing patterns of behavior: a total of 2,301 (39%) studies in our search were dealing with behavioral issues. This high prevalence is partly because a number of leading researchers have published papers encouraging the use of LCA in behavioral studies, because multiple aspects of individual functioning in mental health can be studied holistically.
Fatigue, sleep disturbance, and allergic disorders
Although poor sleep quality has been well documented in childhood eczema, few studies have examined the quality of sleep in the adult eczema population. Despite the high variability of presentation, is it possible to define consistent patterns of association of fatigue, sleep disturbance, and allergic disease in the adult population? The question was addressed in a study analyzing data obtained in the context of the 2012 National Health Interview Survey. The data analyzed pertained, in particular, to history of eczema, sleep problems, and overall health. BIC and AIC were used to select the best fitting model. The model had five classes, LC1–5 (Figure 2). Two classes presented high probabilities of sleep disturbance: LC4, characterized by high probabilities of eczema, asthma, hay fever, and food allergy, and LC3 with low probabilities of these disorders. LC1 had an intermediate probability of insomnia but not fatigue or sleepiness and an intermediate probability of eczema (Table 2). The study presented data from a cross-sectional study; patterns of changes over time may represent an interesting issue to explore in future studies by using RMLCA.
Figure 2. Identification of five classes of sleep disturbance in allergic disease and the conditional probabilities of the items studied (redrawn from the cited study).
Considerable variability occurs in the clinical presentation and disease severity of HS. In a cross-sectional study, LCA was applied to a series of 618 consecutive patients with HS with the aim of building an empirical classification scheme without any a priori hypotheses. Ten indicators pertaining to clinical features, namely, sites involved, lesion type (nodules, hypertrophic scars, comedones, papules and folliculitis, epidermal cysts, macrocysts, and pilonidal sinuses), severity assessment (by Sartorius score and Hurley stage), family history, and previous history of severe acne, were chosen to inform the clinical classification.
A classification into three LCs (LC1–3) provided the best fit of data as estimated by using BIC (Figure 3). LC1 patients (n = 299, 48%) had high probabilities for breast and armpit involvement and for hypertrophic scars; LC2 patients (n = 161, 26%) had high probabilities for involvement of the ears, chest, back, or legs and also for follicular lesions and a history of severe acne; and LC3 patients (n = 158, 26%) were characterized by gluteal involvement, follicular papules, and folliculitis (Table 3). Significant differences were found among the three LCs for sex, body mass index, smoking status, severity scores, age at disease onset, and HS duration. The identification of subgroups may allow for further investigation of matters such as biological markers, class changes over time, and prognosis.
Figure 3. Identification of three main clinical patterns of HS and conditional probabilities of the individual features in each class (redrawn from the cited study).
With the progress of information systems in medicine, a huge amount of data can be routinely collected, that is, big data. LCA can be used to analyze these data to find clusters, especially when rare LCs (with <5% data) are present. However, when a lot of information is available, a possible drawback is that redundant or noninformative variables present in the dataset may potentially introduce biases or reduce the efficiency of clustering algorithms. For this reason, standard stepwise selection or more complex search procedures via genetic algorithms may be used to create a reduced dataset for the subsequent analysis. Alternatively, more complex models, such as artificial neural networks, able to find data clusters without assumptions about data distribution or parameters can be considered.
There are several areas where LCA can be efficiently employed in dermatological research. For example, little has been done for profiling endotypes of complex disorders such as atopic dermatitis or psoriasis; LCA can be used in a way similar to what has been proposed for asthma where clinical data, functional data, comorbidities, and inflammatory parameters were considered together. Similarly, an area of potential development is the better characterization of symptoms such as itching or pain where clinical, psychological, and behavioral factors may interact. Finally, large opportunities for the use of LCA exist in oncology to analyze patterns of presentation and/or progression of cancer and to assess variables affecting the impact of preventive measures or treatment.
5. How does cluster analysis differ from LCA?
A. There is no difference. The two terms are synonyms.
B. At variance with cluster analysis, LCA can be used in conjunction with multivariate methods, avoiding a two-step approach in estimating parameters.
C. Cluster analysis works well with any kind of data, whereas LCA works with continuous data only.
D. Cluster analysis requires a preliminary assumption of the number of classes present in the data at hand.
E. Cluster analysis is a nonparametric technique.
See online version of this article for a detailed explanation of correct answers.
Detailed Answers
1. What is meant by the term latent classes?
Answer: B. Subgroups of data with similar characteristics based on unobservable membership
Latent classes are subgroups of data or subjects, that is, clusters, with similar characteristics based on unobservable (latent) membership, where the individual features have a specific probability of occurrence.
2. What is the Bayesian Information Criterion (BIC)?
Answer: E. A measure of corrected model fit based on the value of the negative log-likelihood augmented by a penalty function
BIC is a measure of corrected model fit. It reflects the negative log-likelihood augmented by a penalty function depending both on the complexity of the model (represented by the number of free estimated parameters) and the logarithm of the sample size. It is also the commonly used decision rule to select the optimal number of classes in latent class analysis (LCA).
3. What is the simplest way to validate results in latent class analysis (LCA)?
Answer: C. Splitting procedure
As for any classification procedure, validation on a separate sample is required. The splitting procedure is the simplest method to validate the results of LCA. It involves a random splitting of the dataset into two equal groups: one is used to build the LCA model and the other to test its stability in terms of number of classes, classification accuracy, and predicted posterior probabilities.
4. In a study, the BIC values for different numbers of classes (k) are reported in the table. Which classification would you prefer?
The k value that minimizes the BIC score is the preferred one. The BIC score reflects the negative log-likelihood function, increased by a penalty function influenced by the number of independent parameters estimated and by the sample size.
5. How does cluster analysis differ from LCA?
Answer: B. At variance with cluster analysis, LCA can be used in conjunction with multivariate methods, avoiding a two-step approach in estimating parameters
One advantage of LCA compared with other methods of data segmentation, such as cluster analysis, is that it can be used in conjunction with multivariate methods, avoiding a two-step approach in estimating parameters.