
Evaluating the Strength of Clinical Recommendations in the Medical Literature: GRADE, SORT, and AGREE

      The medical community relies on scientific evidence to guide clinical practice. Evidence from systematic reviews, randomized controlled clinical trials (RCTs), case–control or cohort studies, observational studies, and expert opinions is used to make disease-specific practice recommendations. More than 100 grading systems are used to rate the strength of these recommendations (West et al., 2013). A centralized and transparent method for evaluating and comparing these studies, with the goal of translating evidence-based medicine into clinical practice guidelines, is the cornerstone of two such validation scales: the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) and the Strength of Recommendation Taxonomy (SORT).
      In the GRADE system, one frames a question, chooses critical and important outcomes by which to judge the existing body of evidence, rates the quality for each outcome, and finally decides on the direction (for or against) and strength (strong or weak) of the recommendation considered. The SORT method is a simpler rating scale that judges study quality and strength of recommendation based on patient-oriented evidence (Table 1). Whereas GRADE and SORT evaluate the body of evidence to establish sound guidelines, the AGREE (Appraisal of Guidelines Research and Evaluation) instrument provides a framework for assessing the quality of development of clinical practice guidelines. AGREE is a generic instrument that provides an assessment of the validity of a guideline and the likelihood that it will achieve its intended outcome.
      Table 1. Comparison between GRADE and SORT with regard to the strength of recommendation and the quality of evidence

      GRADING OF RECOMMENDATIONS ASSESSMENT, DEVELOPMENT, AND EVALUATION (GRADE)

      The GRADE international group is composed of guideline developers, systematic reviewers, clinicians, public health officers, researchers, methodologists, and other health professionals from around the world (Mustafa et al., 2013). The GRADE approach has been adopted by more than 65 organizations worldwide, including the World Health Organization, the US Centers for Disease Control and Prevention, the Cochrane Collaboration, and the American College of Chest Physicians, and it has become an international standard for guideline development (Guyatt et al., 2013).
      The GRADE process begins with a clinically relevant, well-designed clinical question composed of four elements: a patient, problem, or population; an intervention; a comparison intervention; and an outcome. The second step is to gather the best evidence to answer the question. The third step is to assess the quality of evidence and the confidence in the estimates of the treatment effect. The fourth step evaluates the trade-off between risks and benefits, reflecting the best assessment of patients’ perspective of the evidence, before the final recommendation is made (Guyatt et al., 2013) (Figure 1).
      Figure 1. The GRADE process.
      Adapted with permission from Guyatt et al., 2013.
      The study design determines the initial quality of evidence rating. RCTs start as high-quality evidence, whereas observational studies begin as low-quality evidence. This ranking can be upgraded or downgraded based on specific factors that can affect the quality of evidence. Factors that can lower the quality of evidence include study limitations, inconsistencies in the results, indirectness of evidence, imprecision in the estimates, and publication bias. The rating can be upgraded if the study shows the presence of a dose–response effect or a large magnitude of the estimated effect.
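The rating logic in this paragraph can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the GRADE guidance itself: the function name `grade_quality`, the factor labels, and in particular the rule that each factor moves the rating by exactly one level are assumptions made for the example. The four levels are the ones used for the final per-outcome rating (high, moderate, low, very low).

```python
# Illustrative sketch of the GRADE quality-of-evidence rating.
# Assumption: every downgrading or upgrading factor shifts the rating by one level.
LEVELS = ["very low", "low", "moderate", "high"]

# Starting point is set by study design, as stated in the text.
START = {"rct": "high", "observational": "low"}

DOWNGRADE_FACTORS = {
    "study limitations", "inconsistency", "indirectness",
    "imprecision", "publication bias",
}
UPGRADE_FACTORS = {"dose-response", "large effect"}


def grade_quality(design, concerns=(), strengths=()):
    """Return the quality level for one outcome.

    design    -- "rct" or "observational"
    concerns  -- downgrading factors judged to be present
    strengths -- upgrading factors judged to be present
    """
    concerns, strengths = set(concerns), set(strengths)
    unknown = (concerns - DOWNGRADE_FACTORS) | (strengths - UPGRADE_FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")

    index = LEVELS.index(START[design]) - len(concerns) + len(strengths)
    # Clamp to the four available levels.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


print(grade_quality("rct", concerns=["imprecision"]))  # moderate
```

In practice a panel's judgment about each factor is qualitative and is recorded explicitly, so a function like this would at most document the arithmetic behind a rating rather than replace the judgment.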
      After all the domains have been assessed, the body of evidence per outcome is categorized as high (++++), moderate (+++), low (++), or very low (+) (Mustafa et al., 2013). The quality of evidence rating is summarized in the Evidence Profile (EP) table, which includes an explicit judgment of each factor that determines the quality of evidence. Table 2 is an example of a transparent and concise way of showing the guideline panel's judgments about the domains. It also contains the Summary of Findings (SoF) table. The SoF table is a quantitative assessment of the confidence in the estimates of effects (e.g., relative risk), without the qualitative judgment of the evidence rating that is provided in the EP table. The EP and SoF tables serve different purposes and are directed toward different audiences. EP tables are intended for review authors and anyone who questions a quality assessment. SoF tables are intended for a broader audience, such as users of systematic reviews and guidelines (Guyatt et al., 2011).
      Table 2. Evidence to recommendation framework
      The fourth step of the process is assessing the values and preferences of the target population regarding their beliefs and expectations for their health and life. This step refers to the process in which individuals weigh the potential benefits, harms, costs, limitations, and inconveniences of treatment options in relation to one another. With this information, the panel is better equipped to define the trade-off between the benefits (desirable outcomes) and risks (undesirable consequences) of a particular intervention. Ideally, the panel (guideline developers) will conduct a systematic review summarizing relevant studies regarding patients’ values and preferences. The greater the variability or uncertainty in values and preferences, the more likely a weak recommendation is warranted (Andrews et al., 2013).
      The overall strength of recommendation is based on the balance of risks and benefits, the quality of evidence, the values and preferences of the patients, and the costs required for the treatment. Each component is given equal weight in relation to the other components. The strength of recommendation ranges on a continuum of categories from “strongly for” to “strongly against” the intervention (Table 1). If the panel is highly confident of the balance between desirable and undesirable consequences, they make a strong recommendation for (desirable outweighs undesirable) or against (undesirable outweighs desirable) an intervention (Andrews et al., 2013).
      Guideline panels may also choose to make special recommendations when there is insufficient evidence, for example, an “only-in-research” recommendation. This recommendation is used when further research may reduce uncertainty about the intervention and is considered of good value for the anticipated costs. Alternatively, the panel may decide not to make a recommendation for or against a particular strategy if they find that the confidence in the estimate is too low, the trade-off between risks and benefits is too close, or values, preferences, and resource implications are not known (Andrews et al., 2013). The main limitation of GRADE is that it is a complex methodology with a steep learning curve.

      STRENGTH OF RECOMMENDATION TAXONOMY (SORT)

      SORT was developed by the editors of U.S. family medicine and primary care journals and the Family Practice Inquiries Network as an initiative to construct a unified taxonomy that allows authors to rate individual studies or bodies of evidence (Ebell et al., 2004). The SORT approach is the main methodology that the American Academy of Dermatology uses in its guideline development process.
      The SORT process addresses the quality, quantity, and consistency of evidence, and it emphasizes the use of patient-oriented outcomes that measure changes in morbidity or mortality (Ebell et al., 2004). The expert panel reviews the bodies of evidence for each of the recommendations and assigns a strength of recommendation on a scale of A through C. For example, consistent and good-quality evidence for treatment at an A-level rating would include a systematic review/meta-analysis with consistent results or a high-quality, large individual RCT.
      An A-level recommendation is based on consistent and good-quality, patient-oriented evidence. A B-level recommendation is based on inconsistent or limited-quality patient-oriented evidence. A C-level recommendation is based on consensus, usual practice, opinion, disease-oriented evidence, or case series for studies of diagnosis, treatment, prevention, or screening (Table 1). The main limitation of SORT is that it is an overly simplified instrument that is not applied internationally.
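The three SORT levels defined above amount to a small decision rule, sketched here in Python. The function name `sort_strength` and its three boolean parameters are hypothetical simplifications introduced for illustration; evidence that rests on consensus, usual practice, opinion, or case series is represented simply as not being patient-oriented study evidence.

```python
def sort_strength(patient_oriented, good_quality, consistent):
    """Map a body of evidence to a SORT strength of recommendation (A, B, or C).

    patient_oriented -- evidence measures outcomes that matter to patients
                        (False for disease-oriented evidence, consensus,
                        usual practice, opinion, or case series)
    good_quality     -- the underlying studies are of good quality
    consistent       -- the findings agree across studies
    """
    if not patient_oriented:
        return "C"
    if good_quality and consistent:
        return "A"
    # Patient-oriented, but inconsistent or of limited quality.
    return "B"


# A high-quality, large RCT with patient-oriented outcomes:
print(sort_strength(patient_oriented=True, good_quality=True, consistent=True))  # A
```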

      APPRAISAL OF GUIDELINES RESEARCH AND EVALUATION (AGREE)

      Whereas GRADE and SORT evaluate the body of evidence to establish sound guidelines, the AGREE instrument assesses the quality of the development of clinical practice guidelines. The quality of guidelines is based on the confidence that potential biases have been addressed adequately, that recommendations are both internally and externally valid, and that they are feasible for practice. New guidelines, existing guidelines, and updates of existing guidelines may be appraised with AGREE. It is a validated tool with a 4-point numerical scoring system, ranging from 1 (representing strongly disagree) to 4 (strongly agree). Scores reflecting inadequate quality are assigned a score 2. This instrument can be applied to any disease area, including those in diagnosis, health promotion, and treatment.
      AGREE is composed of 23 key items encompassed within six domains. Each domain is intended to capture a different dimension of guideline quality: scope and purpose, stakeholder involvement, rigor of development, clarity and presentation, applicability, and editorial independence. The domain score is calculated by adding all of the individual item scores in a domain and standardizing the total as a percentage of the maximum possible score for that domain. Each domain score may be useful for comparing guidelines and will aid in the decision whether to use that guideline. There is no set threshold for the domain score by which to define a “good” or “bad” guideline. Finally, an overall assessment is made as to the quality of the guideline, taking each of the appraisal criteria into account and rating it as “strongly recommend,” “recommend (with provisos or alteration),” “would not recommend,” or “unsure” (AGREE Collaboration).
      Recently, AGREE was modified to AGREE II. The purpose of this updated version was to improve reliability, validity, and supporting documentation. The newer version continues to have 23 items and six domains, whereas the rating scale for each domain has become more detailed, using a 7-point rather than a 4-point scale. A score of 1 is assigned when there is no relevant information; scores between 2 and 6 are given when the domain does not meet the full criteria; and a maximum score of 7 is given to exceptional reports (AGREE Next Steps Consortium).
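The domain score calculation can be illustrated with a short Python sketch. The text describes the score as a percentage of the maximum possible score; the version below additionally subtracts the minimum possible score so that the result runs from 0% to 100%, which is how the AGREE user manuals are generally understood to standardize it and should be treated as an assumption here. The function name `agree_domain_score`, its parameters, and the sample ratings are hypothetical.

```python
def agree_domain_score(ratings, scale_min=1, scale_max=4):
    """Standardized score (0-100) for one AGREE domain.

    ratings   -- one list of item scores per appraiser, all for the same domain
    scale_min -- lowest score an item can receive (1 in both versions)
    scale_max -- highest score an item can receive (4 for AGREE, 7 for AGREE II)
    """
    n_items = len(ratings[0])
    if n_items == 0 or any(len(r) != n_items for r in ratings):
        raise ValueError("every appraiser must score the same, non-empty item set")
    for appraiser in ratings:
        for score in appraiser:
            if not scale_min <= score <= scale_max:
                raise ValueError(f"score {score} is outside the rating scale")

    obtained = sum(sum(appraiser) for appraiser in ratings)
    n_scores = n_items * len(ratings)
    lowest = scale_min * n_scores    # every item rated at the minimum
    highest = scale_max * n_scores   # every item rated at the maximum
    return 100.0 * (obtained - lowest) / (highest - lowest)


# Four appraisers rating a three-item domain on the 7-point AGREE II scale.
ratings = [[5, 6, 6], [6, 6, 7], [2, 4, 3], [3, 3, 2]]
print(round(agree_domain_score(ratings, scale_max=7)))  # 57
```

Because there is no set threshold for a "good" or "bad" guideline, the resulting percentage is best read comparatively, for example across guidelines on the same topic.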

      SUMMARY

      The AGREE instrument has been applied to the critical appraisal of clinical practice guidelines for adaptation in the evidence-based guideline “prevention of skin cancer” by the German Guideline Program in Oncology. The rigorous inclusion criteria required by the AGREE instrument narrowed the 480 citations related to the topic “prevention of skin cancer” to only 12 studies. The strict criteria that the tool requires demonstrate that methodological flaws are an important obstacle in the development of practical guidelines (Petrarca et al., 2013). The AGREE instrument was also chosen as the appraisal tool for evaluating the quality of clinical practice guidelines for treatment of psoriasis vulgaris, 2006-2009 (Tan et al., 2010).
      GRADE and SORT are two methods of evaluating a body of evidence and the quality of studies to create a comprehensive recommendation. The AGREE instrument is a validated quantitative scoring method created to systematically assess the quality of practice guidelines. Familiarity with these commonly applied grading systems helps dermatologists and other clinicians interpret recommendations in clinical practice and contribute to guideline development.

      CME ACCREDITATION

      This CME activity has been planned and implemented in accordance with the Essential Areas and Policies of the Accreditation Council for Continuing Medical Education through the Joint Sponsorship of ScientiaCME and Educational Review Systems. ScientiaCME is accredited by the ACCME to provide continuing medical education for physicians. ScientiaCME designates this educational activity for a maximum of one (1) AMA PRA Category 1 Credit. Physicians should claim only credit commensurate with the extent of their participation in the activity.

      SUPPLEMENTARY MATERIAL

      A PowerPoint slide presentation appropriate for teaching purposes is available at http://dx.doi.org/10.1038/jid.2014.335.

      REFERENCES

        • AGREE Collaboration. The Appraisal of Guidelines Research and Evaluation (AGREE) instrument.
        • AGREE Next Steps Consortium. The AGREE II instrument.
        • Andrews J.C., Schunemann H.J., Oxman A.D., et al. GRADE Guidelines: 15. Going from evidence to recommendations: the significance and presentation of recommendations. J Clin Epidemiol. 2013; 66: 726-735
        • Ebell M.H., Siwek J., Weiss B.D., et al. Strength of recommendation taxonomy (SORT): a patient-centered approach to grading evidence in the medical literature. Am Fam Physician. 2004; 69: 548-556
        • Guyatt G., Oxman A.D., Akl E.A., et al. GRADE guidelines: 1. Introduction–GRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011; 64: 383-394
        • Guyatt G., Eikelboom J.W., Akl E.A., et al. A guide to GRADE guidelines for readers of JTH. J Thromb Haemost. 2013; 11: 1603-1608
        • Mustafa R.A., Santesso N., Brozek J., et al. The GRADE approach is reproducible in assessing the quality of evidence of quantitative evidence syntheses. J Clin Epidemiol. 2013; 66: 736-742
        • Petrarca S., Follman M., Breitbard E.W., et al. Critical appraisal of clinical practice guidelines for adaptation in the evidence-based guideline “prevention of skin cancer”. JAMA Dermatol. 2013; 149: 466-471
        • Tan J.K.L., Wolfe B.J., Bulatovic R., et al. Critical appraisal of quality of clinical practice guidelines for treatment of psoriasis vulgaris. J Invest Dermatol. 2010; 130: 2389-2395
        • West S., King V., Carey T.S., et al. 47. Systems to rate the strength of scientific evidence: summary. AHRQ Evidence Report Summaries. Agency for Healthcare Research and Quality. Rockville, MD. 2013