Is the Statistical Analysis Appropriate to Answer the Research Questions or Hypotheses?

Descriptive Statistics
Readers need to be given sufficient data in the form of tables and figures. These findings are often summarized by using descriptive statistics. Measures of central tendency describe what is typical for the sample. The most commonly used measure of central tendency is the mean, which is simply the arithmetic average of a set of numbers. The mean is ideal for describing a variable that has been measured on a continuous or interval scale (e.g., height, weight, or age). Two other measures of central tendency are the median and the mode. The median is the middle value in a set of ranked data (50th percentile), while the mode is the most frequently occurring value in a data set. The median is often preferable for describing the typical response for an ordinal variable (e.g., an ordered scale with five levels ranging from greatest pain to least pain upon probing). The mode is often best for describing what is "typical" for a categorical variable (e.g., sex, race, eye color).
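As a minimal sketch of how the three measures of central tendency map onto the three kinds of variable described above, the following uses Python's standard `statistics` module. All values and variable names are invented for illustration and do not come from any study.

```python
import statistics

# Hypothetical data, invented purely for illustration.
ages = [23, 25, 25, 31, 46]                                  # continuous variable
pain_scores = [1, 2, 2, 3, 5]                                # ordinal: 1 = least pain, 5 = greatest pain
eye_colors = ["brown", "blue", "brown", "green", "brown"]    # categorical variable

mean_age = statistics.mean(ages)                # arithmetic average
median_pain = statistics.median(pain_scores)    # middle value of the ranked data
modal_eye_color = statistics.mode(eye_colors)   # most frequently occurring value

print(mean_age, median_pain, modal_eye_color)
```

Note that the single high age (46) pulls the mean above the median age of 25, which is one reason a report may give both.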

In conjunction with measures of central tendency, one must know how observations within a sample vary around that "typical" value. The standard deviation expresses the variability around the mean. It tells us roughly how much a member of the sample, on average, deviates from the mean. A large standard deviation indicates the scores of a data set are widely dispersed around the mean, while a small standard deviation indicates tight clustering around the mean. In a normally distributed population, the individuals who fall between one standard deviation below and one above the mean make up approximately 68% of the population; similarly, plus or minus two standard deviations covers about 95% of the population and is commonly used to define the limits for what is "normal" for a particular population. The variance is simply the standard deviation squared. The standard deviation provides a much better summary of the variability within the data than does the observed range (the span between the lowest and highest values), as the latter tells us nothing about where the majority of observations fall. For the median, quartiles function analogously to the standard deviation; they divide the ranked data into four groups, each containing 25% of the observations. The inter-quartile range (IQR), from the 25th to the 75th percentile, is a useful way to characterize the range of values that are most typical for a population. If a research report provides only means (or medians) without measures of variability, this is a serious omission and should raise a red flag as to the quality of the research.
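The measures of variability above can be sketched with the same standard `statistics` module. The scores are hypothetical. The sample standard deviation (n - 1 denominator) is assumed here, and the quartiles use Python's default interpolation rule, which may differ slightly from other software.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)            # sample standard deviation (n - 1 denominator)
variance = statistics.variance(scores)   # the standard deviation squared
data_range = max(scores) - min(scores)   # span between lowest and highest values

# Quartiles: packages use slightly different interpolation rules,
# so reported quartiles can differ a little between programs.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1                            # 25th to 75th percentile

# Share of a normal distribution lying within 1 and 2 SDs of the mean.
standard_normal = statistics.NormalDist()
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)   # about 0.68
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)   # about 0.95

print(mean, round(sd, 2), data_range, iqr)
print(round(within_1_sd, 3), round(within_2_sd, 3))
```

The range here is 7, yet the middle half of the observations spans only the IQR, which illustrates why the range alone says little about where most values fall.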

Inferential Statistics
The techniques of inferential statistics enable the researcher to infer or generalize from a sample to the larger population from which it was drawn. With a few exceptions, all scientific studies that seek to answer questions such as “is there a difference between therapies” or “is one superior to another” require inferential statistics. Selecting the statistical tests before the investigation begins often requires consultation with a statistician. Each method of analysis, such as chi-square, t-test, and ANOVA (analysis of variance), requires certain assumptions about the data, such as whether or not the scores are normally distributed around the mean. Violations of these assumptions may yield distorted results.

In this fundamental process of statistical reasoning, sample statistics are computed to estimate their counterparts for the population, known as parameters. The parameters, or the true characteristics of the population, are represented by Greek letters: μ for the mean and σ for the standard deviation. The corresponding sample statistics are often represented with capital Roman letters: N for the sample size, X̄ for the mean, and SD for the standard deviation. Standard errors (SE) can be computed for sample statistics; these reflect the precision with which the parameters have been estimated. As the sample size becomes very large, and therefore more representative of the population, the standard errors will become very small; this indicates great precision. Confidence intervals can also be computed to reflect the precision of estimates. For example, a study reported a mean DMFT score (decayed, missing, and filled teeth) of 2.23 with a 95% confidence interval from 2.03 to 2.43 for a sample of children. This indicates that we can be 95% confident the interval contains the true population mean, μ; that is, if the study were repeated many times, about 95% of the intervals computed in this way would contain μ.
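To make the link between the standard error and the confidence interval concrete, here is a rough sketch using the large-sample normal approximation (small samples would normally call for the t distribution instead). Only the mean of 2.23 is taken from the example above. The SD of 2.04 and the sample size of 400 are assumed values, chosen solely so that the result lands close to the quoted interval; they are not reported in the source. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Large-sample (normal approximation) confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    margin = z * standard_error(sd, n)
    return mean - margin, mean + margin

# Mean 2.23 is from the text; SD = 2.04 and N = 400 are ASSUMED values,
# picked only so that the interval comes out near 2.03 to 2.43.
lower, upper = mean_confidence_interval(mean=2.23, sd=2.04, n=400)
print(round(lower, 2), round(upper, 2))

# Quadrupling the sample size halves the standard error (greater precision).
se_400 = standard_error(2.04, 400)
se_1600 = standard_error(2.04, 1600)
print(round(se_400, 3), round(se_1600, 3))
```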

Measures of Risk from Epidemiology
Certain measures are increasingly used to characterize oral disease, its causes, and the effectiveness of treatments. These terms and measures are derived from population-oriented research (epidemiology) and are somewhat different from those employed in traditional laboratory research. Prevalence is the proportion of individuals with a particular disease or condition at a point in time. In contrast, incidence rate is the number of new cases, typically of a disease, in a population over a specified period of time; incidence reflects the rate of increase (or decrease) of a disease in a population. Risk can be characterized in a number of ways that express the association between an exposure and an outcome. The exposure can be a disease, a treatment, a behavior, or an environmental factor. Relative risk (RR) is the ratio of the risk of an outcome among those exposed to the risk among those who are unexposed. For instance, the RR computed in a hypothetical prospective study that looked at fluorosis in those exposed to well water with an over-abundance of fluoride versus those with typical municipal tap water was 2.5. This estimate of RR tells us that those with the over-fluoridated water run 2.5 times the risk of developing fluorosis compared with those drinking municipal water. Along with RR, investigators should present its 95% confidence interval, e.g., CI: 1.8-5.0. This tells us that we are 95% confident that the true value of RR falls between 1.8 and 5.0. Importantly, in this example, the interval does not include 1. Had it included 1, we could not rule out that the risk was equivalent in the two groups, i.e., that there was no association between water supply and fluorosis.
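The relative risk calculation can be sketched as follows. The counts are invented so that RR comes out at 2.5, as in the hypothetical fluorosis example; the source gives no counts, so the confidence interval produced here will not match the illustrative 1.8-5.0. The interval uses a common log-scale approximation, and the function name is hypothetical.

```python
import math

def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total, z=1.96):
    """Relative risk with an approximate 95% CI computed on the log scale."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# ASSUMED counts: 50 of 200 well-water users and 20 of 200 municipal-water
# users develop fluorosis (risks of 25% and 10%).
rr, lower, upper = relative_risk(50, 200, 20, 200)

# If the interval contained 1, equal risk in the two groups could not be ruled out.
includes_one = lower <= 1 <= upper
print(round(rr, 2), round(lower, 2), round(upper, 2), includes_one)
```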

Another, more widely used measure of risk is the odds ratio (OR); it too should be presented with its 95% CI, and it is interpreted in the same way: an OR of 1, or a CI that includes 1, is consistent with no association. OR is the ratio of the odds of an outcome occurring in the exposed group to the odds of that outcome in the unexposed group; this is the preferred measure for retrospective studies. Both RR and OR are computed from categorical (nominal) variables. OR is often calculated in the context of logistic regression; the advantage of this approach is that an adjusted odds ratio (AOR) can be computed that expresses the association between two categorical variables while adjusting for other potential confounders.
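As a rough illustration of an unadjusted odds ratio, the sketch below computes OR and an approximate 95% CI (log-scale method) from a 2x2 table. The counts are hypothetical and the function name is an assumption. An adjusted odds ratio would require fitting a logistic regression model, which is not shown here.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a = exposed with outcome,    b = exposed without outcome
    c = unexposed with outcome,  d = unexposed without outcome
    """
    or_value = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log_or)
    upper = math.exp(math.log(or_value) + z * se_log_or)
    return or_value, lower, upper

# ASSUMED counts: 50 of 200 exposed and 20 of 200 unexposed have the outcome.
or_value, lower, upper = odds_ratio(a=50, b=150, c=20, d=180)

# For comparison, the relative risk from the same table.
rr_same_table = (50 / 200) / (20 / 200)
print(or_value, round(lower, 2), round(upper, 2), round(rr_same_table, 2))
```

With these counts the OR (3.0) is larger than the RR (2.5); the two measures are close only when the outcome is rare.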

Number Needed to Treat (NNT) has gained some traction in studies on new therapies. NNT expresses the number of patients who need to be treated with the therapy over a specified period of time in order to achieve one additional good outcome. Practitioners want this to be as low as possible, ideally one. On the other hand, in studies concerned with adverse side effects of new or experimental therapies, the measure of interest is Number Needed to Harm (NNH). Specifically, NNH is the number of patients who, if they received the treatment, would lead to one additional patient being harmed during a specified period. We typically want this number to be very large, particularly if the harm is serious relative to the benefit of the treatment.
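Both measures are conventionally computed as the reciprocal of an absolute risk difference: NNT is 1 divided by the absolute risk reduction, and NNH is 1 divided by the absolute risk increase. The sketch below assumes that convention and rounds up to a whole patient. The event counts and the function name are hypothetical, and exact fractions are used to avoid floating-point rounding surprises.

```python
import math
from fractions import Fraction

def number_needed(events_high, n_high, events_low, n_low):
    """Return 1 / (absolute risk difference), rounded up to a whole patient.

    The first group must be the one with the higher event rate.
    """
    risk_difference = Fraction(events_high, n_high) - Fraction(events_low, n_low)
    if risk_difference <= 0:
        raise ValueError("first group must have the higher event rate")
    return math.ceil(1 / risk_difference)

# ASSUMED data: a poor outcome occurs in 30 of 100 control patients
# and in 20 of 100 treated patients (absolute risk reduction of 10%).
nnt = number_needed(30, 100, 20, 100)

# ASSUMED data: an adverse effect occurs in 5 of 100 treated patients
# and in 3 of 100 control patients (absolute risk increase of 2%).
nnh = number_needed(5, 100, 3, 100)

print(nnt, nnh)
```

In this invented scenario, ten patients must be treated to prevent one poor outcome, while one additional patient is harmed for every fifty treated.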

Hypothesis Testing
The purpose of this area of inferential statistics is to provide an objective means for determining whether the findings of a study (e.g., differences between treatments) are real or due to chance. It is not sufficient merely to state that the mean of one group is larger (or smaller) than another and thereby conclude that a treatment effect was present. Hypothesis testing quantifies the risk that conclusions drawn from the data are wrong.

It is important that the text state clearly which statistical tests were used to answer which questions or hypotheses, and whether or not these tests yielded statistically significant results (a p value less than the predetermined significance level). The significance level (α), often set at 0.05, must also be stated. In the case of a study concluding an experimental mouthwash prevented gingivitis, a result of p < 0.05 means that, if the mouthwash truly had no effect, a difference at least as large as the one observed would be expected to arise by chance less than 5% of the time. Strictly speaking, this is not the probability that the effect of the mouthwash is real; rather, it indicates that the observed difference would be unlikely if chance alone were operating. In this example, the results are statistically significant, since the p value was less than 0.05. Sometimes other significance levels (e.g., smaller and more stringent) are required, and the author should indicate why these are being used. Contemporary statistical software packages for personal computers will compute exact p values for each test (e.g., p = 0.0378). Ideally, the authors will have reported these for all statistically significant tests, rather than just stating that p < 0.05, as the magnitude of p values provides additional useful information. For example, p = 0.001 means that a result this extreme would occur by chance about 1 time in 1000 if there were truly no effect, whereas p = 0.05 corresponds to about 1 time in 20. Other items typically reported are the numerical value of the computed test statistic (e.g., χ² for chi-square, F for ANOVA) and the corresponding degrees of freedom (df), which reflect the relationship between the sample size and the number of estimations involved in the test.
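To show what a reported test statistic, degrees of freedom, and exact p value look like in practice, here is a minimal sketch of a Pearson chi-square test (without continuity correction) on a 2x2 table, using only the standard library. The mouthwash counts are invented for illustration. For df = 1, the p value can be obtained from the complementary error function, which is what the code relies on.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table, df = 1.

    a, b = group 1 with / without the outcome
    c, d = group 2 with / without the outcome
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# ASSUMED data: gingivitis in 20 of 100 mouthwash users and 35 of 100 placebo users.
chi2, p_value = chi_square_2x2(a=20, b=80, c=35, d=65)

alpha = 0.05                     # significance level chosen before the study
significant = p_value < alpha

print(f"chi-square = {chi2:.2f}, df = 1, p = {p_value:.4f}, significant: {significant}")
```

A report based on this invented table would state the test used, the chi-square value with its df, and the exact p value rather than only "p < 0.05".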

For a good presentation of the fundamentals of descriptive and inferential statistics, see Dawson and Trapp,9 Elston and Johnson,10 and Glantz.11 For a thorough discussion of evidence-based dentistry, see Hackshaw et al.12 Epidemiological measures and related concepts are outlined comprehensively in Guyatt et al.13