
Databases for Clinical Research

  • Katrina Abuabara
    Correspondence
    Department of Dermatology, 1468 Penn Tower, One Convention Avenue, Philadelphia, Pennsylvania 19103, USA
    Affiliations
    Department of Dermatology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  • David J. Margolis
    Affiliations
    Department of Dermatology, University of Pennsylvania, Philadelphia, Pennsylvania, USA

    Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA


      The growing availability of digital health data offers many opportunities for clinical research. Studies drawing on electronic data are often efficient, although the usefulness and validity of the data depend on the research question. We briefly review types of epidemiologic study designs commonly used with patient databases and then describe the types of electronic databases available, outline considerations for the ad hoc design of new databases, and discuss potential limitations to consider when performing database research.

      What is the Research Question?

      Epidemiologic questions are often framed around an exposure and an outcome, chosen to answer a predefined question. The exposure can be an environmental exposure, medication, risk factor, or disease state. For example, does isotretinoin (exposure) cause inflammatory bowel disease (outcome)? Or is severe psoriasis (exposure) associated with an increased risk of myocardial infarction (outcome)? Outcomes may refer to the onset of a disease (incidence), presence of a disease (prevalence), or severity or duration of a disease or symptom. The research question and the nature of the exposure and outcome variables should guide the choice of epidemiologic study design.
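
      As a minimal sketch of how two of these outcome measures differ, the Python snippet below computes a point prevalence and an incidence rate. The function names and all counts are invented for illustration and do not come from any study.

```python
def point_prevalence(existing_cases: int, population: int) -> float:
    """Proportion of the population that has the disease at one time point."""
    if population <= 0:
        raise ValueError("population must be positive")
    return existing_cases / population


def incidence_rate(new_cases: int, person_years: float) -> float:
    """New cases per person-year of follow-up among people at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return new_cases / person_years


# Hypothetical counts, for illustration only.
prevalence = point_prevalence(existing_cases=200, population=10_000)
rate = incidence_rate(new_cases=30, person_years=15_000)

print(f"Point prevalence: {prevalence:.1%}")
print(f"Incidence rate: {rate * 1000:.1f} per 1,000 person-years")
```

      Prevalence counts everyone who has the disease at a given moment, whereas the incidence rate counts only new onsets and divides by the time at risk, so the two can diverge widely for chronic conditions.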

      Study Designs

      Epidemiologic study designs can be broadly categorized into descriptive and analytical studies (Figure 1). Descriptive studies, such as case reports and case series, tend to be hypothesis generating, and they ask questions about what, where, who, and when. Alternatively, analytical studies tend to test hypotheses and answer questions about why and how. They include both experimental studies (i.e., clinical trials) and observational studies (cross-sectional, cohort, and case–control designs) (Vandenbroucke et al., 2007). Cross-sectional studies assess all individuals in a sample at the same time point; the downside is that they cannot ascertain the temporality of events and therefore cannot be used to draw conclusions about causation. Cohort and case–control designs follow individuals over time to ascertain the relationship between an exposure and an outcome and differ in terms of whether the population is selected based on the exposure (cohort study) or the outcome (case–control study). Study design selection should be guided by the suitability of the design for the research question at hand and by feasibility constraints. For a more complete description of epidemiologic study designs, we recommend an introductory textbook (Gordis, 2013). Electronic databases are most commonly used for observational studies, but they can also be used for experimental studies (e.g., randomizing an intervention for patients within an electronic medical record (EMR)) or descriptive studies (searching an EMR for a case series).
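
      The difference between selecting on exposure and selecting on outcome affects which measure of association can be estimated. The sketch below is an illustration under stated assumptions, not a method described in this article: it computes a risk ratio, which needs the exposure-based sampling of a cohort study, and an odds ratio, the usual measure when sampling is outcome-based as in a case–control study. The 2x2 counts are hypothetical.

```python
def risk_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Risk in the exposed divided by risk in the unexposed (cohort design)."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return risk_exposed / risk_unexposed


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product ratio of a 2x2 table (usual case-control measure)."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical cohort: 1,000 exposed and 1,000 unexposed people.
rr = risk_ratio(30, 970, 10, 990)
or_ = odds_ratio(30, 970, 10, 990)

print(f"Risk ratio: {rr:.2f}")
print(f"Odds ratio: {or_:.2f}")
```

      When the outcome is rare, as in these invented counts, the odds ratio is close to the risk ratio.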
      Figure 1. Common epidemiologic study designs. RCT, randomized controlled trial.

      Types of Electronic Databases

      Electronic databases can be categorized by the source of the data (Table 1). One major distinction is whether the data are repurposed (that is, originally generated for purposes other than clinical research) or the result of an ad hoc design specific to an individual study. Data that can be used for clinical studies but that were originally designed for a different research question fall somewhere in between and are referred to as “hybrid data.”
      Table 1. Categories of electronic databases
      Category: Examples (with selected dermatology-specific references)
      Repurposed data
        Claims data
          Government insurers: US Medicare, US Medicaid, national health insurers (Huang et al., 2012)
          Commercial insurers: United HealthCare, Pharmetrics (Arellano et al., 2007), Humana, Aetna
        Electronic medical record data: UK general practice research databases (Gelfand et al., 2009; Langan et al., 2012); institution-specific databases (e.g., Kaiser Permanente, Veterans Affairs Computerized Patient Record System)
        Registry data: Surveillance, Epidemiology, and End Results (Linos et al., 2009); Swedish Family Cancer Database (Chen et al., 2014)
      Ad hoc data: Pediatric Eczema Elective Registry; EuroSCAR (Mockenhaupt et al., 2008)
      Hybrid data: Nurses’ Health Study, National Health Interview Survey, Veterans Affairs Million Veteran Program, HMO Research Network, PatientsLikeMe

      Repurposed data

      Repurposed data include both administrative claims data, which are generated for billing purposes, and EMR data, which are generated for the purposes of patient care. Additionally, repurposed data include public health registry data such as cancer registries like the Surveillance, Epidemiology, and End Results (SEER) program in the United States and death registries that may be linked to claims or EMRs.
      Administrative claims data. Administrative claims data include inpatient and outpatient medical record codes, and these may be linked with pharmacy prescriptions and laboratory values. They often contain limited demographic and risk factor information, and they may have variable follow-up, especially in the United States, because patients frequently change insurers. Claims have been widely used in health services research and pharmacoepidemiology, and they are best suited to study outcomes that are easily captured by diagnostic codes such as procedures or acute events.
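
      To illustrate what "easily captured by diagnostic codes" can mean in practice, the sketch below finds each patient's first claim carrying a qualifying diagnosis code. The claim rows, field names, and outcome definition are hypothetical; the ICD-10 prefix I21 (acute myocardial infarction) is used only as an example, and a real study would validate its code list against chart review.

```python
from datetime import date

# Hypothetical claims rows; field names and values are invented for illustration.
claims = [
    {"patient_id": "A", "service_date": date(2012, 3, 1), "dx_code": "L40.0"},
    {"patient_id": "A", "service_date": date(2013, 7, 9), "dx_code": "I21.4"},
    {"patient_id": "B", "service_date": date(2012, 5, 2), "dx_code": "L40.0"},
    {"patient_id": "C", "service_date": date(2014, 1, 15), "dx_code": "I21.9"},
    {"patient_id": "C", "service_date": date(2014, 6, 3), "dx_code": "I21.9"},
]

# Assumed outcome definition: any ICD-10 code beginning with I21.
OUTCOME_PREFIXES = ("I21",)


def first_outcome_dates(rows, prefixes):
    """Return the earliest qualifying claim date for each patient."""
    first = {}
    for row in rows:
        if row["dx_code"].startswith(prefixes):
            pid = row["patient_id"]
            if pid not in first or row["service_date"] < first[pid]:
                first[pid] = row["service_date"]
    return first


outcomes = first_outcome_dates(claims, OUTCOME_PREFIXES)
for pid, when in sorted(outcomes.items()):
    print(pid, when.isoformat())
```

      Taking the earliest qualifying date per patient gives a candidate event date; patients with no qualifying claim, such as patient B here, simply do not appear in the result.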
      EMR data. EMR data are essentially paperless, digital versions of patient charts generated for the purposes of clinical care. The number of office-based practices and hospitals using EMR systems is increasing, yet there is a lack of standardization and interoperability. Like claims data, EMR data are best suited to study outcomes that are easily captured by diagnostic codes, yet they may offer the possibility of more detailed data via manual review of physician notes or natural language processing systems. They are also likely to lack routinely collected social and behavioral variables, although efforts are underway to improve collection of these types of data (Adler and Stead, 2015). Some EMRs may be representative of the general population and capture all of a patient’s health-care interactions, such as the Clinical Practice Research Datalink or the Health Improvement Network, both large UK general-practice research databases. Others may include only inpatient or specialty patient care.

      Ad hoc data

      Ad hoc data are generally designed for a particular study, and they often take the form of a prospective cohort study in which patients are selected for inclusion on the basis of a particular diagnosis or exposure. For this reason, they are often disease specific and may lack a control group.

      Hybrid data

      Large prospective cohort studies such as the Framingham Heart Study and the Nurses’ Health Study may be considered hybrids between repurposed and ad hoc data because they represent large amounts of data that have been used to test many hypotheses beyond the one they were originally designed to study. Similarly, national surveys such as the National Health and Nutrition Examination Survey (NHANES) or the National Health Interview Survey typically offer cross-sectional snapshots of patient-reported risk factors and health outcomes. Finally, a newer type of electronic health data is becoming available via crowdsourcing (Ranard et al., 2014; Wicks et al., 2010). These data are particularly useful for rare exposures or outcomes, but they are prone to selection bias and information bias because they rely on patient self-reporting.

      Design of a New Ad Hoc Patient Database

      The Patient-Centered Outcomes Research Institute has outlined a number of general considerations for the design of a patient database or registry (Table 2) (PCORI, 2013). Another important consideration concerns patient selection. If possible, researchers should enroll all individuals who meet the case definition (or a random selection of these individuals) to ensure the external validity or generalizability of the results. They may also consider using incident cases to help differentiate between exposures prior to and after the onset of disease. Finally, researchers should carefully consider whether a comparison group will be enrolled and strive to ensure that the comparison group is randomly selected from a comprehensive listing of the target population.
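
      Random selection of a comparison group from a complete listing can be sketched as follows. This is a minimal illustration that assumes a full list of identifiers for the target population is available; the function name, identifiers, and seed are hypothetical.

```python
import random


def select_comparison_group(target_population, case_ids, n_controls, seed=0):
    """Draw a simple random sample of non-cases from a full population listing."""
    case_set = set(case_ids)
    # Sort so the result depends only on the seed, not on input order.
    eligible = sorted(pid for pid in target_population if pid not in case_set)
    if n_controls > len(eligible):
        raise ValueError("not enough eligible people to sample from")
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(eligible, n_controls)


# Hypothetical listing of 100 people, three of whom are cases.
population = [f"P{i:03d}" for i in range(1, 101)]
cases = ["P005", "P017", "P042"]
controls = select_comparison_group(population, cases, n_controls=6, seed=2015)
print(controls)
```

      Recording the seed alongside the listing makes the selection auditable, because anyone with the same inputs can reproduce the same draw.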
      Table 2. Considerations for the design of a patient database
      Consistent data collection: Provide clear, operational definitions of data elements. Create and distribute standard instructions to data collectors. Use standardized data element definitions and/or data dictionaries whenever possible; review the literature to identify existing, widely used definitions before drafting new definitions.
      Systematic patient enrollment and follow-up: Enroll patients systematically and follow them in as unbiased a manner as possible, using similar procedures at all participating sites. Describe how patients and providers were recruited into the study. Monitor and minimize loss to follow-up. Develop a patient retention plan that documents when a patient will be considered lost to follow-up and what actions will be taken to minimize such loss.
      Data quality assurance: Create structured training tools for data abstractors. Perform data quality checks for ranges and logical consistency for key exposure and outcome variables.
      Data safety and security: Provide transparency by describing data use agreements, informed consent, data security, and approaches to protecting security, including risk of identification of patients.
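
      The range and logical-consistency checks listed under data quality assurance might look like the sketch below. The record layout, field names, and the permitted score range are all assumptions made for this example; a real registry would take them from its own data dictionary.

```python
from datetime import date

# Hypothetical registry records; field names and values are invented.
records = [
    {"id": 1, "birth_date": date(2005, 4, 2), "enroll_date": date(2010, 6, 1), "severity_score": 12},
    {"id": 2, "birth_date": date(2008, 1, 9), "enroll_date": date(2007, 3, 1), "severity_score": 30},
    {"id": 3, "birth_date": date(2001, 9, 30), "enroll_date": date(2011, 2, 14), "severity_score": 95},
]

SEVERITY_RANGE = (0, 72)  # assumed valid range for the severity score


def check_record(rec):
    """Return a list of data-quality problems found in one record."""
    problems = []
    low, high = SEVERITY_RANGE
    # Range check: the value must fall inside the permitted interval.
    if not low <= rec["severity_score"] <= high:
        problems.append("severity_score out of range")
    # Logical-consistency check: enrollment cannot precede birth.
    if rec["enroll_date"] < rec["birth_date"]:
        problems.append("enrolled before birth")
    return problems


issues = {}
for rec in records:
    found = check_record(rec)
    if found:
        issues[rec["id"]] = found

for rec_id, found in sorted(issues.items()):
    print(rec_id, "; ".join(found))
```

      Running such checks at data entry, rather than only at analysis time, lets abstractors correct errors while the source chart is still at hand.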

      Discussion

      In using electronic data, several potential limitations must be considered, including imprecision, potential sources of bias, and the generalizability of the results.
      Imprecision may arise from the study size or from the measurement of exposures, confounders, or outcomes. A variety of strategies, including detailed chart review and physician query, may be used to evaluate the validity of measurements in electronic databases. Dermatologic outcomes, in particular, tend to be less conducive to precise measurement in electronic databases because few diagnoses are based on routinely collected data. A researcher studying hypertension, for example, is likely to find more standardized data than a researcher hoping to study changes in acne lesion counts. Therefore, current databases are generally more useful for studying the incidence or prevalence of dermatologic disease than for studying disease resolution or changes in disease severity over time. Standardization of outcome scales and/or photographic assessments at regular intervals offer potential for improvement.
      Bias is a systematic deviation of a study’s result from a true value. Typically, it is introduced during the design of a study through flawed information or subject selection. There are many types of bias; two that may be particularly relevant to database studies are information bias and selection bias. Information bias occurs when there are systematic differences in the accuracy or completeness of data, leading to differential misclassification of individuals regarding exposures or outcomes. For example, patients with a family history of melanoma may be more likely to receive skin checks and biopsies, making it appear that they have higher rates of atypical nevi. One potential way to assess the influence of some types of information bias is to measure the intensity of medical surveillance in the different study groups and to adjust for this in statistical analyses. Selection bias may be introduced if the probability of including subjects in the study (or probability of subjects being lost to follow-up) is associated with exposure or outcome. For example, a study of patients followed in clinics may overestimate the severity of a disease because patients with mild disease who seek medical advice less frequently are underrepresented. Selection bias affects the internal validity of a study, but it is often related to the external validity or generalizability of the results.
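
      One simple way to adjust for surveillance intensity is to stratify on it. The sketch below uses a Mantel-Haenszel summary risk ratio, which is a technique chosen for this illustration and not one named by the article, and compares it with the crude risk ratio. The strata and counts are hypothetical and were constructed so that the exposed group is concentrated in the high-surveillance stratum.

```python
def crude_risk_ratio(strata):
    """Risk ratio after pooling all strata into a single 2x2 table."""
    a = sum(s["exposed_cases"] for s in strata)
    n1 = sum(s["exposed_total"] for s in strata)
    c = sum(s["unexposed_cases"] for s in strata)
    n0 = sum(s["unexposed_total"] for s in strata)
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Summary risk ratio that holds the stratifying variable constant."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n1 = s["exposed_total"]
        n0 = s["unexposed_total"]
        total = n1 + n0
        numerator += s["exposed_cases"] * n0 / total
        denominator += s["unexposed_cases"] * n1 / total
    return numerator / denominator


# Hypothetical counts, stratified by how often patients receive skin checks.
strata = [
    {"label": "high surveillance", "exposed_cases": 80, "exposed_total": 800,
     "unexposed_cases": 20, "unexposed_total": 200},
    {"label": "low surveillance", "exposed_cases": 4, "exposed_total": 200,
     "unexposed_cases": 16, "unexposed_total": 800},
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"Crude risk ratio: {crude:.2f}")
print(f"Surveillance-adjusted risk ratio: {adjusted:.2f}")
```

      In these invented data the risk of a recorded diagnosis is the same for exposed and unexposed patients within each surveillance stratum, so the apparent association in the crude analysis disappears after stratification.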
      Generalizability refers to how representative the results from the study population are of the general population. Studies that only enroll patients from tertiary-care centers or that only include patients with particular demographic characteristics may have limited generalizability.
      The ideal database depends on the research question. Generically speaking, an ideal database might include linked records from inpatient and outpatient care, emergency care, mental-health care, all laboratory and radiological tests, and all prescribed and over-the-counter treatments, as well as alternative therapies. The population would be large enough to permit discovery of rare events and interactions, would be stable over time, and would be representative of the general population from which it was drawn. It would include genetic, social, and physiologic information on all members, and there would be the ability to gather additional information, either from physicians or from patients themselves, to confirm outcomes.

      CME Accreditation

      This activity has been planned and implemented in accordance with the Essential Areas and Policies of the Accreditation Council for Continuing Medical Education through the joint sponsorship of the Duke University School of Medicine and Society for Investigative Dermatology. The Duke University School of Medicine is accredited by the ACCME to provide continuing medical education for physicians. To participate in the CME activity, follow the link provided. Physicians should only claim credit commensurate with the extent of their participation in the activity.

      SUPPLEMENTARY MATERIAL

      A PowerPoint slide presentation appropriate for journal club or other teaching exercises is available at http://dx.doi.org/10.1038/jid.2015.213

      References

        • Adler N.E.
        • Stead W.W.
        Patients in context—EHR capture of social and behavioral determinants of health.
        N Engl J Med. 2015; 372: 698-701
        • Arellano F.M.
        • Wentworth C.E.
        • Arana A.
        • et al.
        Risk of lymphoma following exposure to calcineurin inhibitors and topical steroids in patients with atopic dermatitis.
        J Invest Dermatol. 2007; 127: 808-816
        • Chen T.
        • Fallah M.
        • Kharazmi E.
        • et al.
        Effect of a detailed family history of melanoma on risk for other tumors: a cohort study based on the nationwide Swedish Family-Cancer Database.
        J Invest Dermatol. 2014; 134: 930-936
        • Gelfand J.M.
        • Dommasch E.D.
        • Shin D.B.
        • et al.
        The risk of stroke in patients with psoriasis.
        J Invest Dermatol. 2009; 129: 2411-2418
        • Gordis L.
        Epidemiology.
        Elsevier/Saunders: Philadelphia. 2013; (5th edn): 398
        • Huang Y.H.
        • Kuo C.F.
        • Chen Y.H.
        • et al.
        Incidence, mortality, and causes of death of patients with pemphigus in Taiwan: a nationwide population-based study.
        J Invest Dermatol. 2012; 132: 92-97
        • Langan S.M.
        • Groves R.W.
        • Card T.R.
        • et al.
        Incidence, mortality, and disease associations of pyoderma gangrenosum in the United Kingdom: a retrospective cohort study.
        J Invest Dermatol. 2012; 132: 2166-2170
        • Linos E.
        • Swetter S.M.
        • Cockburn M.G.
        • et al.
        Increasing burden of melanoma in the United States.
        J Invest Dermatol. 2009; 129: 1666-1674
        • Mockenhaupt M.
        • Viboud C.
        • Dunant A.
        • et al.
        Stevens–Johnson syndrome and toxic epidermal necrolysis: assessment of medication risks with emphasis on recently marketed drugs. The EuroSCAR-study.
        J Invest Dermatol. 2008; 128: 35-44
        • PCORI
        Patient-Centered Outcomes Research Institute Methodology Committee.
        Research methodology. 2013;
        • Ranard B.L.
        • Ha Y.P.
        • Meisel Z.F.
        • et al.
        Crowdsourcing—harnessing the masses to advance health and medicine, a systematic review.
        J Gen Intern Med. 2014; 29: 187-203
        • Vandenbroucke J.P.
        • von Elm E.
        • Altman D.G.
        • et al.
        Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration.
        PLoS Med. 2007; 4: e297
        • Wicks P.
        • Massagli M.
        • Frost J.
        • et al.
        Sharing health data for better outcomes on PatientsLikeMe.
        J Med Internet Res. 2010; 12: e19