At Alimentiv Statistics, we place a high priority on educating our audience and clients about clinical statistical analysis and trial design. As such, we decided it was time to create a reference guide that could serve as a glossary of clinical study design & statistical analysis terms. From sample size to regression and everything in between, there are a lot of statistical trial design terms to understand. Terminology can seem like jargon, but many of these terms describe fundamental concepts.
We hope you find this glossary of value and bookmark this article so you can refer to it at any time.
Clinical trial design is the framework of methods and procedures used to collect and analyze data to address a specific research question or problem, with the aim of deriving valid and meaningful scientific conclusions through appropriate statistical methods. According to the Dictionary of Epidemiology, study design is the formulation of trials and experiments, as well as observational studies, in medical, clinical, and other types of research involving human beings.
The goal of a clinical study is to assess the safety, efficacy, and/or the mechanism of action of an investigational medicinal product or procedure, or a new drug or device that is in development but potentially not yet approved by a health authority. It can also be to investigate a drug, device, or procedure that has already been approved but is still in need of further investigation, typically with respect to long-term effects or cost-effectiveness.
The conclusions derived from a study can either improve health care or inadvertently harm patients. Therefore, a well-designed clinical study that rests on detailed methodology and is governed by ethical principles is key.
The entire group of people (animals, devices, etc.) about which you would like to draw conclusions.
Example: People with Stage I cancer, or healthy adults in Europe.
A subset of the overall target population that qualifies for the study design being considered. This is often defined by factors such as the study’s inclusion/exclusion criteria.
Example: People with Stage I cancer who meet the study inclusion/exclusion criteria.
The specific group chosen from the study population for a specific study. Ideally these are chosen as randomly as possible from the study population.
Example: Out of all the people with stage I cancer who meet the study inclusion/exclusion criteria, the 100 subjects who actually took part in the study are the sample.
The number of participants in a given sample, or the number desired for a given sample (depending on whether the study has been completed or is still being planned or conducted).
Example: For the study in the previous example, this would be 100.
Using random assignment to decide which treatments/interventions are given to participants in a study/trial. For true randomization, this is done in an impartial, probabilistic manner (such as via random number generation) such that neither participants nor investigators decide treatment assignment. This is done to reduce possible bias in treatment assignment and to attempt to balance demographic and other factors among treatments.
There may be cases where randomization is not appropriate; for example, a study on the effects of cigarette smoking could not ethically randomly assign subjects to begin smoking cigarettes.
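As a rough sketch (the helper and arm names below are illustrative, not a production randomization system), simple randomization can be implemented with a seeded random number generator:

```python
import random

def randomize(subject_ids, arms=("treatment", "control"), seed=42):
    """Assign each subject to an arm uniformly at random.

    A seeded generator makes the assignment list reproducible for audit,
    while keeping assignment outside the control of participants and
    investigators.
    """
    rng = random.Random(seed)
    return {sid: rng.choice(arms) for sid in subject_ids}

assignments = randomize([f"S{i:03d}" for i in range(1, 101)])
```

Note that simple randomization does not guarantee a perfect 50/50 split between arms; blocked or stratified randomization can be used when tighter balance is needed.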
A controlled factor in a study that is introduced to participants. This may be a medication, a device, a form of therapy, or even may be no intervention at all, depending on the study design.
Example: A new investigational treatment; a placebo; no intervention; a standard of care intervention where the physician treats the patient as they usually would.
A comparator for an experimental treatment or intervention. May be a placebo, or may be an active control (e.g. a clinically-accepted and validated treatment for the same indication).
A measured variable in a study.
Examples: Blood pressure; pain rating; time from treatment to cancer progression; whether a subject experienced myocardial infarction or not.
A specific outcome for analysis in a study. Typically, endpoints are more strictly defined than outcomes, with not just a quantity (e.g. diastolic blood pressure) but also an associated timepoint or period for collection and analysis.
Note: It is fairly common for some to use “outcome” and “endpoint” interchangeably. Usually when this occurs, the meaning is the same as for “outcome” above.
Examples: Blood pressure 3 months after treatment; percent change in pain rating from baseline to week 6; time from treatment to cancer progression as measured for up to a year after treatment; whether a subject experienced myocardial infarction in the first two years post-treatment.
The primary endpoint of a study is the key result or outcome of greatest concern. While it is possible to have more than one primary endpoint, frequently a study only has one primary endpoint or a very limited number of co-primary endpoints (see multiplicity). This is often the main marker of the “success” or efficacy of a treatment/intervention. This is typically the endpoint that is considered when determining the sample size or power of a study.
Non-primary endpoints are frequently separated into categories by type (e.g. efficacy, safety) and/or priority (e.g. secondary, exploratory).
Note: Cross-link to “multiplicity / multiple comparisons” and “power”.
An endpoint that is meant to substitute for a result that is difficult or impossible to capture.
Example: A study in cancer may be most concerned with the length of participant survival as an endpoint, but a study to monitor all participants for the rest of their lives may be extremely onerous. In this case, a surrogate endpoint may be measuring participants’ tumor sizes, or measuring the length of time from remission until disease recurrence. The best surrogate endpoints are closely correlated to the “true” desired endpoint.
Note: Cross-link to “correlation”
A determination by subject matter experts (often in conjunction with validated research) regarding the minimum quantifiable effect of a treatment/intervention that is considered to have noticeable importance or effect. The definition of clinical meaningfulness may vary by therapeutic area, disease context, and between subject matter experts. Distinct from statistical significance.
Example: A pain medication shows a small but statistically significant decrease in pain over another treatment, but participants showed no change in their perceptions of pain or quality of life, so the change is not considered clinically meaningful.
Cross-link to “Statistical significance”
A Statistical Analysis Plan is a technical document that details the statistical analyses and techniques that will be used to examine data from a study. While a study protocol provides details regarding design and conduct, it generally only outlines the study’s statistical methods. A SAP is a companion document that provides this specificity. A thorough SAP provides a clear picture of all statistical outputs that will be provided, as well as detailing any statistical models that will be used.
SAPs are particularly important in clinical trials, where it is vital to pre-specify analyses in order to prevent the appearance of bias in analytical decisions.
An analysis (often of limited scope) that occurs during a study, rather than after the study is complete. Reasons for an interim analysis include verifying participant safety, re-considering the planned enrollment for the study, or stopping the study early due to either a lack of efficacy or overwhelming evidence of efficacy.
Interim analyses should be pre-specified before the study begins and developed with careful statistical consideration, as they may impact other aspects of the study (such as the final analyses).
When a treatment is being investigated for public use and regulatory approval, the overall research lifecycle for the treatment is often considered in phases that are defined by the US Food and Drug Administration. There is some flexibility in the definitions and goals of these phases, especially in the earlier phases, but they are useful guidelines. Each phase’s trial designs should be based on information from the previous phases: decisions about endpoints, monitored safety events, treatment timing, and many other aspects should be directly informed by data collected earlier in the treatment lifecycle.
Pre-clinical – Pre-clinical studies are those that take place during a treatment’s development. These may include studies in animals or cell cultures.
Phase 0 – Not commonly discussed. Proposed by the US FDA, Phase 0 trials are optional, exploratory first-in-human trials meant to help quickly distinguish treatments that are worth further evaluation, based on their pharmacodynamics and pharmacokinetics. A Phase 0 trial is in a very small number of patients (up to 15) and uses microdoses of treatment that are far below therapeutic levels for a short period of time.
Phase I – Phase I trials are early in the lifecycle of a new treatment. Phase I trials are in limited numbers of participants (less than 100) and are often in healthy participants (except in the case of cancer treatments). The goals of Phase I are to establish preliminary evidence of safety and to attempt to find a dosage that balances efficacy and safety. Pharmacodynamic/pharmacokinetic data are often useful to collect in this phase, though these may also be investigated in Phase II.
Phase II – Phase II trials are larger in scale than Phase I, both in recruitment and length. Phase II trials may span several months to 1-2 years, and often enroll 100-300 participants. Phase II trials have several goals: reaching final decisions about the most desirable dosage(s); safety evaluation in a wider range of participants for longer periods of time; documentation regarding side effects; finalization of pharmacodynamic/pharmacokinetic understanding of the treatment; and preliminary proof of efficacy. Phase II trials serve as footholds to allow for successful Phase III trial design; the design of Phase II trials should be oriented toward giving the most useful data to understand how the treatment functions and is best measured.
Phase III – Phase III trials are generally the key trials for any necessary regulatory approval for a treatment. These trials are often “pivotal trials,” intended to demonstrate beyond reasonable doubt the efficacy and safety of a treatment. As such, Phase III trials are large in scale, with hundreds or thousands of participants and timelines of one to several years. Due to the large scope and long timelines, these trials are meant to closely examine any possible safety signals in the treatment. If targeted for regulatory approval, these trials are almost always randomized and contain a control (placebo or active) for comparison to demonstrate the efficacy of the treatment. These are often the culmination of all that is learned from the previous phases to ensure that the significant resource commitment for such a large undertaking is well-spent.
Phase IV – Phase IV refers to trials that take place after regulatory approval. These trials are often observational in nature and are meant to track the safety and efficacy of the treatment under realistic treatment conditions and in more diverse populations. These trials may use supplementary databases of information (such as insurance claims or governmental healthcare tracking) in addition to or in place of direct participant observation.
Phase 0 trial reference:
FDA reference for other phases:
Many statistical analyses have specific assumptions that are required to be met for the analyses’ results to be valid. Sensitivity analyses evaluate how deviations from these assumptions may affect the results of the important analyses of the study. Typically, it is encouraged to plan sensitivity analyses for at least the primary analysis of a design.
Note: Many people use “sensitivity analysis” to have a similar meaning to what is described for “supplementary analysis” below. The International Council for Harmonization (ICH) draws a distinction between these two, and this glossary follows that precedent.
Example: The primary analysis for a study requires that any missing data are missing at random. A sensitivity analysis is planned where any missing data are tested with a range of different values to see how the analysis results would change if this were violated.
Cross-link to “Missing at random”
Link to the ICH E9R1 addendum:
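As a toy illustration of the idea (the data and helper function below are hypothetical), one simple sensitivity approach re-runs an analysis with missing values replaced by a range of plausible values:

```python
# Hypothetical pain scores (0-10); None marks missing values.
observed = [4.0, 5.5, 3.0, None, 6.0, None, 4.5, 5.0]

def mean_with_imputed(data, fill):
    """Replace each missing value with `fill` and return the mean."""
    complete = [fill if x is None else x for x in data]
    return sum(complete) / len(complete)

# Re-run the analysis across a range of imputed values to see how
# strongly the estimate leans on the missing-at-random assumption.
sensitivity = {fill: mean_with_imputed(observed, fill)
               for fill in (0.0, 2.5, 5.0, 7.5, 10.0)}
```

If the conclusions hold across the whole range of imputed values, the result is robust to the missing-data assumption; if they flip, the assumption matters.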
Additional analyses meant to add insight to the primary or main analyses but that do not directly test the assumptions of the original analysis.
Example: Repeating the primary analysis but with a different set of covariates in the model; repeating an analysis with the Per Protocol population.
Cross-link to Per Protocol population
Analyses for a study that are not pre-specified. Typically, these are analyses that are specified and performed after data have been collected and unblinded (if applicable). These analyses may provide important information but generally are not considered for regulatory approval or interpreting the study results. Post-hoc analyses run risks of being uninformative and/or manipulative by modifying and adding analyses in response to apparent trends in the data or by trying many analyses until one succeeds (see multiplicity).
Cross-link to multiple comparison/multiplicity.
Methods of designing a study and analyzing the results that attempt to map cause/effect relationships onto collected variables. These methods use correlations between collected variables and event timing (e.g. does one variable precede another), as well as statistical techniques to model relationships between variables. By examining variability in the covariates, how changes in one relate to changes in others, and by using models with multiple levels, it can be possible to obtain some understanding of the causal links between variables. These links may be expressed in a visual form, such as causal maps, which express the relationships in a flowchart-like manner.
Example: A large health insurance database of various demographic and lifestyle factors is used in a study of myocardial infarction. Causal analyses could be used to attempt to discover the ways in which the collected variables affect one another and may lead to increases or decreases in heart attack risk.
When establishing a cause/effect relationship between two variables (A and B), a mediating variable is one that fully or partially explains that relationship. If it seems that a causal relationship exists such that A causes B, M is a mediator if the reality is that A causes M which causes B. If controlling for M fully explains the cause/effect relationship between A and B, then M is a full mediator, otherwise it is a partial mediator. You can think of A, M, and B as dominos in a row, where M is a necessary part of the chain causing B.
For example: A study finds that people who are older tend to be paid more and determines using causal analysis methods that age (A) tends to cause higher income (B). However, when number of years of experience in their employment field (M) is introduced to the model, the relationship is shown to be that older people (A) tend to have more years of experience in their jobs (M) and people with more years of experience in their jobs tend to be paid more (B). In this case, years of experience in employment is a mediating variable in this model.
When establishing a cause/effect relationship between two variables (A and B), a moderating variable is one that modifies their relationship. If a mediating variable is like a midway stop on the trip from A to B, a moderating variable is more like a decision that changes what route is taken from A to B. With a moderating variable, A still causes B (instead of A causing M causing B), but the effect of A on B changes depending on M.
Example: A clinical trial determines that physiotherapy (A) reduces pain (B) in subjects with arthritis. Examining the data, however, shows that subjects in warmer climates (M) show greater pain relief with physiotherapy than subjects in colder climates. Climate temperature (M), then, could be considered a moderating variable for the physiotherapy (A) and pain (B) causal relationship.
Events that occur after a subject has started treatment that either affect the interpretation of collected results or affect whether those results exist at all. This is not to be confused with missing data, where results exist and could be interpreted as usual but were not collected.
Proper planning for and handling of intercurrent events is an important topic beyond the scope of this glossary. The ICH addendum E9(R1) should be consulted at a minimum, and skilled clinical and statistical expertise should be used in study planning for these.
Examples: Subject death during study; subject switching treatments during study; subject taking rescue medication during study.
Not examples: Subject missed a visit; subject blood sample lost or damaged; subject withdrew from study.
Cross-reference with missing data
Link to the ICH E9R1 addendum:
A pre-specified quantity that describes the treatment effect targeted by the clinical question and trial objective. It comprises several attributes: the treatment of interest and comparators (if any); the population of subjects being targeted; the endpoint that will be collected to address the question; the strategies for handling intercurrent events that will be utilized; and the population-level summary that will be used. As with intercurrent events, this is a topic requiring more nuance and information than can be covered in this glossary.
Refer to the ICH E9(R1) addendum for more information.
Link to the ICH E9R1 addendum:
A study where there is no treatment or intervention under the control of the investigators or study sponsors. There may be active observation (e.g. taking laboratory samples) but no choice or influence of treatments or interventions by those involved with the study.
A study where there is a treatment or intervention under the control of the investigators or study sponsors.
A form of interventional study where a treatment/intervention is compared to a control and participants are randomized in their assignment to treatment or control. There may be several treatments under investigation and/or controls in the trial, but for an RCT there should be at least one of each.
A design where participants are randomized within groups for certain pre-specified key characteristics. Essentially randomization is handled separately within each “block” specified by the intersection of these key characteristic(s).
Example: A study is being done both within the U.S. and in Europe with a blocked design that treats “region” as a block. Thus, this study randomizes participants separately within the U.S. and within Europe so that the balance of treatments/interventions is approximately the same within each region. If a blocked design weren’t used, it could be possible for all participants in Europe to receive an experimental treatment while all participants in the U.S. receive a control, which may make inference more difficult.
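A minimal sketch of randomizing separately within blocks, using hypothetical subject lists and arm names:

```python
import random

def randomize_within_blocks(subjects_by_block,
                            arms=("treatment", "control"), seed=7):
    """Randomize separately within each block (e.g. region).

    Each block receives a shuffled, near-equal split of the arms, so
    treatment balance is maintained within every block.
    """
    rng = random.Random(seed)
    assignments = {}
    for block, subjects in subjects_by_block.items():
        # Build a balanced list of arms for this block, then shuffle it.
        arm_list = [arms[i % len(arms)] for i in range(len(subjects))]
        rng.shuffle(arm_list)
        assignments.update(zip(subjects, arm_list))
    return assignments

blocks = {"US": [f"US{i}" for i in range(10)],
          "Europe": [f"EU{i}" for i in range(10)]}
result = randomize_within_blocks(blocks)
```

Because the arm list is balanced before shuffling, each region here ends up with exactly five subjects per arm.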
Similar to block design, stratified designs are based on defining participants by some key characteristics. In a stratified design, however, the strata defined by these characteristics are used to determine which participants are put into the sample rather than to define treatment assignment.
Example: A study stratified by region (U.S. or Europe) limits enrollment so that 50% of the sample are participants from the U.S. and 50% are participants from Europe; this stratification would not influence randomization to treatment at all, unless the study was also a block design.
A study design meant to test or examine two or more independent, overlapping treatments simultaneously. When planned and executed properly, such a design can provide two or more studies’ worth of information in a single study, but it requires care and reasonable confidence that the treatments do not interact heavily.
Example: A pain study that attempts to distinguish between Drug X and Drug Y for pain, as well as examining the effects when subjects are provided regular physiotherapy or not. Subjects can therefore be assigned to one of four conditions: Drug X + Physiotherapy; Drug Y + Physiotherapy; Drug X + No physiotherapy; Drug Y + No physiotherapy.
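The four conditions in this example can be enumerated programmatically; the factor names below simply mirror the example:

```python
from itertools import product

# Factor names and levels mirror the pain-study example above.
factors = {"drug": ["Drug X", "Drug Y"],
           "physiotherapy": ["Physiotherapy", "No physiotherapy"]}

# A full factorial design crosses every level of every factor.
conditions = [dict(zip(factors, levels))
              for levels in product(*factors.values())]
```

Adding a third factor with two levels would double the number of conditions to eight, which is why factorial designs are usually kept to a small number of factors.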
A study design that is somewhat similar to block designs. In a minimization randomization, the goal is to minimize the imbalance in each randomized arm based on pre-specified strata. When a new subject enters the study, the subject is hypothetically added to each arm, one at a time, and the imbalance score of that arm calculated based on a pre-determined equation. The arms are then weighted so that the subject is more likely (or required) to be randomized to the arm(s) with the lowest imbalance score.
Example: A study is performing minimization randomization based on age (< 50, ≥ 50) and region (North America or South America). A new subject that is 60 years old and lives in South America is enrolled; the algorithm will temporarily assign the subject to each treatment arm, one at a time, and calculate an imbalance score. Generally (depending on the specific equation used), arms that have fewer subjects aged ≥ 50 and in South America will have lower imbalance scores, and it is more likely this subject will be randomized to those arms.
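A simplified sketch of a minimization assignment. The imbalance score used here (a count of already-enrolled subjects in each arm who share a stratum with the new subject) is one simple choice; real trials pre-specify the exact equation:

```python
import random

def minimization_assign(new_subject, enrolled, arms=("A", "B"), seed=1):
    """Return the arm that minimizes covariate imbalance for a new subject.

    The score counts enrolled subjects in an arm sharing each stratum
    with the new subject; a lower score means less imbalance if the
    subject joins that arm.
    """
    rng = random.Random(seed)
    scores = {}
    for arm in arms:
        in_arm = [s for s in enrolled if s["arm"] == arm]
        scores[arm] = sum(
            sum(1 for s in in_arm if s[factor] == new_subject[factor])
            for factor in ("age_group", "region"))
    best = min(scores.values())
    # Break ties at random so assignment stays unpredictable.
    return rng.choice([arm for arm, score in scores.items() if score == best])

enrolled = [
    {"arm": "A", "age_group": ">=50", "region": "South America"},
    {"arm": "A", "age_group": ">=50", "region": "South America"},
    {"arm": "B", "age_group": "<50", "region": "North America"},
]
new = {"age_group": ">=50", "region": "South America"}
arm = minimization_assign(new, enrolled)
```

Here arm A already contains two older South American subjects, so the new 60-year-old South American subject is steered toward arm B.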
A study design where participants are exposed to multiple different treatments over time. This may be observational or interventional. Frequently this may take the form of exposing participants to an experimental treatment and a control in a randomized order. This design reduces individual variability, since each participant serves as their own control, which can reduce the influence of confounding factors.
Cross-link to confounders / confounding factors
A study design that matches each subject in one treatment arm with a similar subject in another treatment arm to compare their results. This study design serves a similar purpose to a crossover design, but can be used when a crossover design is not feasible. Must be designed with care; how subjects are matched with one another can have a significant impact on the study results. Matching may be based on specific factors or on a more complex technique like propensity scores.
Cross-link to propensity scores
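As a toy sketch, greedy nearest-neighbor matching on a single factor (age, with hypothetical subjects) might look like this; real matched designs often match on propensity scores combining many covariates:

```python
def match_pairs(treated, controls, key="age"):
    """Greedily pair each treated subject with the closest unused control."""
    available = list(controls)
    pairs = []
    for t in treated:
        # Find the remaining control closest on the matching factor.
        best = min(available, key=lambda c: abs(c[key] - t[key]))
        available.remove(best)
        pairs.append((t["id"], best["id"]))
    return pairs

treated = [{"id": "T1", "age": 62}, {"id": "T2", "age": 45}]
controls = [{"id": "C1", "age": 44}, {"id": "C2", "age": 60},
            {"id": "C3", "age": 30}]
pairs = match_pairs(treated, controls)
```

Greedy matching is order-dependent, which is one reason matching strategies should be pre-specified and evaluated carefully.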
A study design that targets a specific sub-population or group that is more likely to show the desired effect in response to treatment(s). Typically this is achieved using a stratified design based on the factor(s) that define the sub-population.
Example: A prior study shows that subjects who are age 60 or older appear to respond more strongly to Drug X. A new study uses an enriched design by enrolling mostly subjects who are age 60 or older.
A study design where the same variable is measured multiple times in each subject or matched subjects. A crossover design is one example of a repeated measures design. Frequently, when “repeated measures” is brought up in common usage, it refers to designs where subjects’ measurements on one or more variables are captured on a regular and frequent basis (e.g. monthly). While, technically, a design where a variable is only measured twice (before and after treatment) is a repeated measures design, the common use of the term implies more frequent assessment.
Repeated measures are useful because they allow the capture of more information from the same number of subjects; there are statistical models that can leverage the repeated measurements to provide more precise estimates of effects over time. Repeated measures are also important when studying time-to-event data, in order to ensure that information about when events occur is sufficiently precise.
Example: Subjects on Drug X are being monitored for signs of cancer remission. Lesion size and RECIST evaluation grade are captured for each subject every two months until two years post-treatment or remission occurs.
A study design used early in drug development that begins with a low dose of the treatment and slowly gives participants stronger doses in order to attempt to find the maximum tolerated dose.
A form of dose escalation design that is somewhat common in Phase I trials for oncology. Three participants are given a low dose of the experimental treatment and monitored for pre-specified toxicity events. If 0 participants experience toxicity, then the next group of 3 participants is enrolled at a higher dose. If 2 or more participants experience toxicity, then the next group of 3 participants is enrolled at a lower dose (or the study ends). If 1 participant experiences toxicity, another group of 3 participants is enrolled at the same dose (hence the name, 3+3): if 1 or more of those participants experience toxicity, then the dose is lowered for the next group or the study ends. Otherwise, if 0 of the additional participants experience toxicity, the next group is enrolled at a higher dose.
3+3 designs are often used because of their fairly easy-to-follow approach to dose escalation—investigators or clinicians can follow a simple flowchart for how each cohort of 3 participants should be enrolled, without need for random assignment or computer algorithms. Research has shown, however, that 3+3 designs do not accurately determine the maximum tolerated dose in many circumstances, and that other, modified designs (such as accelerated titration) are more effective.
Cross-link “Phase I” to the clinical trials phase I-IV entry.
Link research has shown to:
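The cohort-level decision rule described above can be sketched as a small function (a simplification; real protocols specify many additional details):

```python
def three_plus_three(toxicities, expanded=False):
    """Next step under a simplified 3+3 rule.

    toxicities: toxicity count in the most recent cohort of 3.
    expanded:   True if this cohort is the extra "+3" at the same dose.
    """
    if expanded:
        # In the expansion cohort, any toxicity means the dose is too high.
        return "escalate" if toxicities == 0 else "de-escalate or stop"
    if toxicities == 0:
        return "escalate"
    if toxicities == 1:
        return "expand"  # enroll 3 more participants at the same dose
    return "de-escalate or stop"
```

This captures why the design needs no computation at the clinic: each cohort's toxicity count maps directly to one of three actions.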
A form of dose escalation study design that is used in Phase I, often in oncology; these trial designs aim to fulfill a similar goal as 3+3 designs (finding maximum tolerated dose), but in a more efficient manner that results in fewer participants receiving sub-therapeutic doses of treatment. Accelerated titration designs vary significantly, but tend to emphasize a more aggressive approach to the early portion of the study when participants are on lower doses. These designs appear to better balance detection of the maximum tolerated dose with efficiency of study design and may result in fewer participants being under-treated.
Example: One kind of accelerated titration design enrolls only one participant at each dosage until the first toxicity event, at which point larger cohort sizes may be used to examine higher doses to determine safety. Accelerated titration designs may also allow for intrapatient dose escalation, where a participant on a lower dose may be escalated to a higher dose if no toxicity is observed.
Cross-link “Phase I” to the clinical trials phase I-IV entry.
A philosophy of designing studies from the ground up to allow for pre-specified flexibility. By building adaptiveness into the study design from the beginning, studies can maximize data-gathering and minimize resource costs. See our article on adaptive designs.
Adaptive design techniques may include interim analyses to confirm sample size estimates, opportunities to remove treatment arms that are performing suboptimally, seamless transitions from one clinical development phase to another, or dose-finding studies that use computer models to determine each subsequent dose based on the efficacy and toxicity of previous doses. Adaptive designs need to be carefully constructed to maximize their potential and adhere to regulatory guidance.
Cross-link clinical development phase with the Clinical Trial phases above
Link to FDA guidance on adaptive designs:
A type of study design most commonly used for cancer treatments. In a basket design, a single treatment is used for several different cohorts (“baskets”) of participants with different but related diseases or conditions (e.g. differing types of cancer with the same mutation). Useful in determining whether a treatment appears to function across variations in a disease, or whether there are specific subgroups that have greatest/least treatment effect.
A type of study design that can be viewed as a counterpart of basket designs. In an umbrella design, participants all have variations of the same disease (e.g. a single type of cancer with multiple mutations), and are given different treatments/interventions based on variant. May also be applied based on predictive risk factors in addition to/other than disease variant.
A type of study design where multiple interventions are evaluated against a common control group. Platform trials are carefully designed with pre-specified rules for adaptation to allow for the addition of new arms, removal of ineffective or undesirable arms, and multiple interim analyses or data looks. These trials are often designed to go for a long or indefinite period of time. May also be referred to as Multi-Arm Multi-Stage (MAMS) designs.
The ITT principle is a particular method of considering data and analyses from randomized studies with multiple arms. Under the ITT principle, subjects’ data should be analyzed under the arm that the subject was originally assigned to, regardless of what treatment(s) the subject may have actually received, treatment adherence/compliance, or later withdrawals or deviations.
Ideally, ITT consideration of subjects and data provides a less biased analysis of treatment effects, as it may reduce the possibility of bias introduced by treatment changes during the course of the study. ITT helps keep the “fairness” of the original randomization by ignoring treatment crossover or dropout by subjects who are not responding to treatment.
True ITT analysis requires continuing to follow and measure subjects who withdraw from treatment or drop out of the study, which may pose difficulties. By ignoring issues with treatment adherence or protocol deviations, it may also obscure some effects—if a subject assigned to active treatment is accidentally given placebo during the study, for example, ITT will underestimate the true effect of the treatment.
ITT analysis is often desired or required by regulatory authorities. Due to its drawbacks, however, it can also be important to define other study populations to allow for examination of the data from alternate lenses; for example, safety analyses often consider subjects who have had any exposure to the active treatment under consideration as “active treatment” subjects for the purpose of summarizing adverse events.
Real World Data are data relevant to research questions that are collected outside of clinical trials. RWD generally refers to data specifically related to subject health and healthcare. RWD can be appealing since it may require fewer resources to collect than bespoke data from a study; however, many facets, such as data collection and cleanliness, need to be carefully considered for the data to be useful.
Examples: Information from healthcare databases from insurance; patient-generated data such as biometrics captured by smart watches and other devices.
Link to the FDA page on RWD/RWE:
Real World Evidence is clinical evidence that is the result of accumulating and analyzing RWD. RWE may be collected as part of organized clinical trials—including randomized or observational trials—as part of planning clinical trials to identify possible patient populations, or for other purposes.
Examples: A pharmaceutical company may be required to collect and regularly analyze healthcare claim data to monitor a recently approved drug and ensure there are no concerning safety signals after it is on the public market.
Link to the FDA page on RWD/RWE:
A specific characteristic of an entire population. In most circumstances, this is impossible to measure due to the size of the population; one of the goals of statistical analyses is to estimate a parameter.
Examples: The average income of Chinese residents in a given year; the proportion of living people today who have been diagnosed with Stage 1 cancer.
Cross-link to Population
A numeric quantity computed from a sample. Often used to estimate or otherwise draw inferences about parameters for the population from which a sample was drawn.
Examples: The average income of a sample of 400 Chinese residents; the proportion of health records from one insurance database that were for patients diagnosed with Stage 1 cancer.
Cross-link sample and population
Statistics that are only meant to describe the sample from which they were calculated, with no generalization or inference.
Example: The average age of participants in a selected group; the percent of surveyed people who answered “Yes” to an opinion question.
Cross-link to sample
Statistics that are calculated based on a sample of participants to make inferences or generalizations about the larger population from which the sample was drawn.
Example: Calculating the percentage of surveyed people who answered “Yes” to a question is a descriptive statistic; using that percentage to estimate how many people in the entire surveyed region would answer “Yes” to the same question is an inferential statistic.
Cross-link to sample and population
A factor or variable that is heavily related to both the treatment/intervention and to the measured outcome.
Example: If all participants that were aged under 50 received one treatment and all the participants who were aged over 50 received a different treatment, it would be difficult or impossible to tell which differences between the groups were due to treatment instead of age. In this case, age would be a confounder.
Cross-link to outcome
A field of statistics based on viewing probability as a question of “How often would this happen if we could repeat the same thing a very large number of times?”
In frequentist statistics, probability is viewed as the same as “frequency”: if you could flip a fair coin an infinite number of times, 50% of them would be heads, so the probability of heads on a fair coin is 50%.
A field of statistics based on viewing probability as a way of describing uncertainty and belief based on prior knowledge and experience.
A Bayesian statistician may say that the probability of heads on a fair coin is 50% based on their prior experience of coin flips and the knowledge that coins are supposed to be equally-weighted; they may then flip the coin several times and start to reconsider their belief as they see how this specific coin behaves.
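The belief updating described above can be sketched with the Beta distribution, the standard conjugate prior for a probability. In this illustrative Python example (all numbers are made up), a prior belief that the coin is fair is combined with ten observed flips:

```python
# Minimal sketch of Bayesian updating for a coin's probability of heads.
# A Beta(a, b) prior updated with observed flips gives a Beta posterior.

def update_beta(prior_a, prior_b, heads, tails):
    """Conjugate update: Beta(a, b) prior plus observed flips -> Beta posterior."""
    return prior_a + heads, prior_b + tails

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Prior centered at 50%: Beta(10, 10) encodes moderate confidence the coin is fair.
# Then observe 9 heads and 1 tail from this specific coin.
a, b = update_beta(10, 10, heads=9, tails=1)
print(posterior_mean(a, b))  # ≈ 0.633: belief shifts above 0.5, toward the data
```

The prior "pulls" the estimate toward 50%; with a weaker prior (e.g. Beta(1, 1)), the same ten flips would move the posterior much closer to the observed 90% heads.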
A numerical variable that represents a measurement that can be (hypothetically, given the correct equipment) collected at infinite precision.
Example: A subject’s height: it could technically, with the correct equipment, be collected in meters, centimeters, millimeters, and so on to infinitely accurate levels.
A numerical variable or measurement that is not infinitely precise. Usually, a discrete variable is something that can be counted.
Example: A subject’s red blood count: while there are a large number of red blood cells in a given blood sample, they can be counted. Another example may be the number of times a subject performs a certain action (e.g. smokes a cigarette) during a certain amount of time.
A variable that takes the form of one of a limited and fixed set of categories. Can be ordinal or nominal (see below).
Example: A subject’s response to a Likert scale from Strongly Disagree to Strongly Agree; the result of a coin flip (heads or tails); a subject’s housing chosen from several discrete choices (own home, rent, live in other’s home, no housing, other).
A type of categorical variable that has an intrinsic, sensible ordering.
Example: A 5-point Likert scale (Strongly Disagree, Disagree, No Opinion, Agree, Strongly Agree) has an intrinsic ordering; arranging the responses in any other order would not make logical sense.
A type of categorical variable that does not have an intrinsic, sensible ordering. Orders can be applied to the categories (usually alphabetical), but these are outside systems imposed on the categories, not a sensible ordering built into them.
Example: A subject’s country of residence.
A variable that can be expressed numerically. Technically, any type of variable can be quantifiable, even a nominal variable; with nominal variables, however, the numeric expression doesn’t have any mathematical meaning (e.g. a randomly assigned subject ID for participants in a study). When “quantifiable” is brought up in common usage, it usually means not only a variable that can be expressed numerically but where the attached number has some specific meaning. When a study involves non-quantifiable variables, there may be attempts to find quantifiable substitutes for them instead (e.g. “depression” as a concept is not quantifiable, but a score on a validated depression rating scale is quantifiable and should be closely related—see surrogate endpoints).
Example: Age, household income, rating on a depression scale, and number of cigarettes smoked in a week are all quantifiable variables. Race, region of residence, job title, and life satisfaction are not quantifiable variables (though there may be a quantifiable scale closely related to life satisfaction that could be used).
Cross-link to surrogate endpoint
A measure of how far spread out a set of values are, as compared to their mean (average). A larger variance means that values are more spread out, whereas a smaller variance means values are more closely clustered around the mean. The standard deviation is the square root of the variance and is the more commonly reported statistic of variability for data.
Example: The average income of everyone in the world would have high variability (some have no income, a small number make millions or billions of dollars per year); the average temperatures in a small country for a given month would have less variability.
A relative measure of variability that is the ratio of the standard deviation to the mean. Useful as a way of expressing the scale of variance that matches the scope of the data.
Example: In a study, subjects’ heart rates have a standard deviation of 20. The mean heart rate of these subjects is 80. The coefficient of variation is 0.25, and one could express this as “The standard deviation is 25% of the mean.”
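The calculation is straightforward; here is a minimal Python sketch using the numbers from the example above (the sample values are hypothetical):

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Matching the heart-rate entry directly: SD 20 and mean 80 give CV 0.25.
sd, mean = 20, 80
print(sd / mean)  # 0.25, i.e. "the SD is 25% of the mean"

# The same idea applied to raw (hypothetical) data:
cv = coefficient_of_variation([72, 85, 60, 95, 88])
```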
The amount and direction two variables change in relationship to one another. In statistics, “correlation” specifically refers to the linear relationship between two variables; that is, the way that the two change together regardless of what the starting values are.
Correlation ranges from -1 to +1. If correlation is positive, it indicates that the two variables tend to increase together; if correlation is negative, it indicates that the two variables have an inverse relationship, where one tends to decrease as the other increases.
A correlation of -1 or 1 indicates a perfect linear relationship: the two variables always change in tandem and in a consistent relationship where it’s possible to know exactly how much one variable has changed from knowing the change in the other variable. A correlation of 0 indicates that there is no linear relationship. (This does not indicate that there is no relationship between the two variables; it may be that the relationship is more complex and cannot be approximated by a straight line.)
The adage “correlation does not imply causation” is always important: just because two things tend to vary together does not mean one causes the other. There may exist causation (e.g. the more pets someone owns, the more they may spend on pet food) but two things may be related without one causing another (e.g. ice cream sales and incidents of drowning may tend to increase together, but this is due to seasonal temperatures and not due to ice cream causing drowning or vice versa).
Examples: Height and weight will tend to have a positive correlation. Number of cigarettes smoked daily will tend to have a negative correlation with average lifespan.
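For illustration, the Pearson correlation coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). This Python sketch uses made-up height/weight pairs:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

heights = [150, 160, 170, 180, 190]  # hypothetical height (cm) / weight (kg) pairs
weights = [55, 62, 70, 80, 92]
r = pearson_r(heights, weights)      # close to +1: strong positive correlation
```

(Python 3.10+ also provides statistics.correlation for this calculation.)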
A collected measurement or factor that is being used to predict another outcome in the study. May also be called “predictor variable” or “independent variable,” though the latter is less common today.
Example: Whether a subject has been given Drug X or placebo is used to attempt to predict the subject’s quality of life; Subjects’ blood pressures are collected to attempt to predict their overall cardiovascular health.
A measurement or factor that the study or analysis is attempting to predict. May also be called the “outcome variable” or “dependent variable,” though the latter is less common today.
Example: For the examples from Explanatory Variable above, “quality of life” and “cardiovascular health” are the response variables.
A method of statistically estimating the relationship between a response variable and one or more explanatory variables. A common, simple method of regression is linear regression, where the relationship between the explanatory variable(s) and response variable is modeled by a single straight line. Many examples of more complicated models exist, including: Non-linear models, which do not follow a straight line; Multivariate models, which may relate the explanatory variable(s) to several response variables at once; and piecewise models, where the model itself may change depending on the values of the explanatory variable(s).
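As an illustration, simple linear regression has a closed-form least-squares solution for the slope and intercept. This Python sketch uses hypothetical dose/response data:

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable.

    Returns (slope, intercept) minimizing squared error for y ≈ slope*x + intercept.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

doses = [0, 1, 2, 3, 4]           # explanatory variable (hypothetical)
responses = [10, 12, 15, 16, 19]  # response variable (hypothetical)
slope, intercept = fit_line(doses, responses)
print(slope, intercept)           # the fitted line: response ≈ slope*dose + intercept
```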
A factor in a trial/model where the levels of the factor used in the trial are either all possible levels or non-randomly chosen levels of the factor in question. If something is considered a fixed effect, then the values of this variable represented in the trial are of interest on their own rather than as a representation of something more general.
Example: If a trial contains an experimental treatment, a standard of care comparator, and a placebo, treatment is a fixed effect, as these are deliberately chosen treatments and the goal of the trial is to compare them without extrapolating to other treatments. In a trial that enrolls subjects from several pre-specified regions to compare between those regions, region may be a fixed effect.
A factor in a trial/model where the levels of the factor used in the trial are random and/or are only meant to represent a small part of a larger whole. Generally, if something is considered a random effect, comparison between the specific levels in the trial is not of particular interest; rather, the levels are treated as a random sample meant to capture the variability contributed by the factor.
Example: In a multi-site trial, the actual sites chosen for recruiting subjects are not of particular interest, but the site-to-site randomness may be examined to see how differences in region/practitioner may affect results. In these cases, site is a random effect.
A form of statistical analysis that combines data or results from multiple independent studies. By combining the results of several studies, meta-analyses attempt to reduce bias and increase precision and power. Ideally, combining studies allows for more accurate results and reduces the chances of results that appear significant or insignificant due solely to randomness.
Meta-analyses are more complex to carry out than one may expect from the definition. A proper meta-analysis requires extremely thorough searching methods to systematically review studies in order to determine which studies: have the same research question and endpoints as the meta-analysis; have sufficient data or results available to be included; and appear to have been conducted in a rigorous and unbiased manner. This review needs to be performed in a pre-specified and controlled manner in order to avoid adding bias in the form of study selection.
Generally speaking, propensity scores are a way of modeling the probability that, under different conditions, something would have occurred. The most common use for propensity scores is in analyses of observational/non-randomized data to assist in estimating what effects subjects probably would have experienced if they had been exposed to a different treatment. Broadly, this is accomplished by using a chosen set of covariates to model the probability that a subject would be within one arm of the study; then this score is used to “match” a subject with a similar score that is in another arm of the study. These subjects are considered similar and so represent (with room for error) what the other matched subject would experience on the corresponding arm. (See Matched Pair Designs)
This analytical technique is useful for when an observational dataset already exists and randomization is not possible (e.g. health databases) or when random assignment of treatments is not possible (e.g. when the “treatment” is a behavioural activity such as drinking alcohol or smoking).
Cross-link to observational study and matched designs
A broad school of often complex statistical methods for analyzing groups of variables, particularly sets with multiple response variables and a number of covariates. The depth and many methods of performing SEM are beyond the scope of this glossary. Typical features of SEM models are tracing networks of relationships among the variables using their correlation and variance, and the ability to identify and estimate “latent” variables—ones which were not directly measured, such as intelligence or depression. (Note: Scales which correlate with intelligence and depression exist, but they do not directly measure these nebulous psychological concepts.) SEM can be used to analyze, among other things, causal relationships.
Cross-link with causal analysis
Suggested further reading:
Missing data are pieces of data that were supposed to be collected but were not. Distinct from data that do not exist due to intercurrent events, missing data are data for which the variable to be measured did exist at the time and could be interpreted as usual but were not collected. If a subject dies during a study, any data that should have been collected after that point are affected by an intercurrent event (the subject’s death), because the quantities to be measured from the subject no longer exist. However, if a subject misses a scheduled visit, the quantities that would have been measured at that visit still existed and could have been interpreted as usual; they simply were not collected.
Data are considered missing only if they were scheduled to be collected; if a subject is not supposed to have their blood pressure taken at a visit and misses that visit, their blood pressure is not considered missing for that visit.
Examples: Scheduled measurements for a visit that a subject did not attend; analytes from blood samples that were lost or damaged; a survey answer where the subject’s response is unreadable. Cross-reference with intercurrent events
A categorization of missing data. Data are considered missing completely at random when there is no relationship between whether a piece of data is missing and its value nor any other variable’s value.
With all categorizations of missing data, this is often driven by theory or hypothesis; since the data are missing, we can’t observe what the values would have been if we collected them. MCAR is therefore something that may be assumed but is very rarely known. It is fairly rare for missing data to be truly MCAR.
Example: A blood sample in a batch of blood samples was lost in transit to the analysis lab; it is very unlikely that this is related to the analyte values of that blood sample or any other information about the subject, so it might be assumed to be MCAR.
A categorization of missing data. Data are considered missing at random (but not missing completely at random) when there is no relationship between the missing data’s value and the fact that it is missing, but there may be a relationship between the fact that it is missing and a different variable. In other words, if missing data are MAR, it would hypothetically be possible to predict which data would be missing based on all the other observed data (e.g. age, income).
Like MCAR, this is a hypothetical relationship; we cannot observe the missing data to prove that any differences between the missing and non-missing data are totally explained by other observed variables.
Example: If subjects with lower income are more likely to miss appointments, and any differences in their (unrecorded) endpoint values from the non-missing subjects’ could be explained by differences in income, the missing data would be MAR.
A categorization of missing data. Missing data are missing not at random when there is a relationship between the fact that a variable is missing and the value of that variable. This relationship is beyond what can be explained by other, observed values.
Example: A subject misses an appointment to have their blood pressure observed due to dizziness from low blood pressure. The fact that their blood pressure is low is directly related to the fact that it did not get collected.
The act of substituting in non-missing values for missing data. A way of “replacing” missing data so it is no longer missing, so that information is not lost in analyses. This can be done with many different methods.
Note that any form of data imputation requires assumptions regarding the missing data and the imputation. Frequently there need to be assumptions about the pattern of the missing data (MCAR, MAR, MNAR) and what substitutions will not bias the analyses. It is generally important to perform sensitivity or supplementary analyses to ensure the imputation does not exert too much influence on the results.
Cross-link with sensitivity analyses and supplementary analyses
A simple form of imputation that is used for a variable that is collected frequently throughout the course of the study. With LOCF, the last (most recent) non-missing value for a measurement for a subject is substituted for the missing value(s).
Note that, while LOCF has been commonly used in some fields at times, it should be viewed with concern. Multiple papers and studies have shown that LOCF can be prone to bias in a variety of circumstances. While LOCF is appealing due to its simplicity, it should be generally avoided.
Example: Subjects’ pain scores on a scale of 1-10 are taken every week. One subject recorded scores of 4, 5, [missing], 8, 8 on Weeks 1, 2, 3, 4, and 5. Since their week 3 score is missing, LOCF would use the last non-missing score (Week 2) and substitute it in, giving the imputed Week 3 score of 5.
Cross-link to repeated measures design
Reference to https://onlinelibrary.wiley.com/doi/pdf/10.1002/pst.267 as an example of the issues with LOCF
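Mechanically, LOCF is very simple, which is part of both its appeal and its danger. A minimal Python illustration using the pain-score example above, with None standing in for a missing value:

```python
def locf(values):
    """Last Observation Carried Forward: replace None with the most
    recent non-missing value seen so far."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

weekly_pain = [4, 5, None, 8, 8]  # the entry's example: Week 3 is missing
print(locf(weekly_pain))          # [4, 5, 5, 8, 8]
```

Note how the imputed Week 3 value (5) ignores the later rise to 8, illustrating how LOCF can bias estimates when values trend over time.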
A simple form of imputation where missing value(s) are substituted with the mean of all the non-missing values of that variable.
Like LOCF, while mean imputation is attractive due to its simplicity, it can result in estimates that are quite biased, even when data are MCAR. So, similar to LOCF, mean imputation should be avoided in most circumstances.
Example: A study is examining change in subjects’ blood pressure from baseline to 2 months after receiving a randomized treatment. One subject on the active treatment arm did not have their blood pressure taken at 2 months; the average of all the blood pressures (regardless of arms) at 2 months is 125.2, so this subject’s missing systolic blood pressure value is imputed as 125.2.
Reference to this for an example of the issues with mean imputation
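A minimal Python sketch of mean imputation, using hypothetical blood-pressure values (None marks the missing measurement):

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

bps = [120, 130, None, 126]  # hypothetical systolic blood pressures at 2 months
print(mean_impute(bps))       # the missing entry becomes (120 + 130 + 126) / 3
```

Note that this shrinks the variability of the imputed dataset (every missing value becomes the same number), which is one source of the bias discussed above.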
A class of imputation methods that are more sophisticated than single imputation (e.g. LOCF, mean imputation). In multiple imputation, a model with some randomness is utilized to impute missing values, and the desired statistical result (e.g. mean, proportion) is calculated using the now-full dataset. This process is repeated multiple times, with the statistical result calculated each time. Finally, the results of all of these different imputations are then summarized by taking the mean, variance, and other summary statistics of the results. In this way, multiple imputation aims to reduce the possibility of random error or bias in imputation through averaging the results of many imputations.
While some, more complex methods of MI can obtain reasonable and unbiased estimates when missing data are MNAR, most common versions of MI require the missing data to be MAR or MCAR for the results to be valid.
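The multiple-imputation loop can be sketched in Python. This toy version draws each imputed value from the observed data with added noise purely for illustration; a real MI procedure would use a principled imputation model:

```python
import random
import statistics

random.seed(0)  # for reproducibility of this illustration

data = [4.1, 5.0, None, 3.8, None, 4.6]      # hypothetical measurements
observed = [v for v in data if v is not None]

results = []
for _ in range(100):  # create 100 imputed datasets
    # Impute with randomness: draw from observed values plus noise (toy model).
    filled = [v if v is not None
              else random.choice(observed) + random.gauss(0, 0.5)
              for v in data]
    # Compute the statistic of interest (here, the mean) on the filled dataset.
    results.append(statistics.mean(filled))

# Pool: summarize the statistic across all imputations.
pooled = statistics.mean(results)
spread = statistics.stdev(results)  # reflects imputation uncertainty
```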
In frequentist statistics, a hypothesis is a specific claim about a feature of a study population.
Example: The average pain score in people with an arthritis diagnosis is lower for participants on Drug X than participants on Drug Y.
A method of inferential statistics that is based on posing a hypothesis and collecting data in order to determine whether the study results support that hypothesis.
In inferential statistics, the null hypothesis is a necessary (often implicit) part of study design. Generally this takes the form of a “status quo” that the study will test.
Example: In a study designed to show if Drug X is more effective than placebo, the null hypothesis could be that Drug X is no more effective than placebo.
A Type I error occurs when a null hypothesis is rejected but, in reality, the null hypothesis is true. (May be considered a “false positive.”) Some dangers of a type I error include spending resources on further study of a treatment that is not effective and exposing more participants than necessary to a subpar treatment.
Example: in a study to prove that Drug X is more effective than Placebo, a Type I error would be determining that Drug X is more effective than Placebo when it is equally or less effective.
A Type II error occurs when the null hypothesis is not rejected when the null hypothesis is actually false. (May be considered a “false negative.”) The dangers of a type II error include having to devote resources to another study to confirm the results or, worse, ending study of a worthwhile treatment and losing the resources devoted to its development and testing.
Example: In a study to prove that Drug X is more effective than Placebo, a Type II error would be if the study determined that Drug X is no better than placebo when, in fact, Drug X is more effective than placebo.
A p-value is a measure of the probability of observing results at least as extreme as the results from an experiment if the null hypothesis were true.
Example: In a study to prove that Drug X is more effective than placebo, if the result for the primary endpoint has a p-value of 0.01, that means that if Drug X is only as effective as placebo (the null hypothesis), there is a 1% chance that this study design would find results as or more extreme than the ones observed.
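The definition can be illustrated by simulation. In this hypothetical Python sketch, a coin lands heads 16 times in 20 flips, and the simulation estimates how often a fair coin (the null hypothesis) would produce a result at least that extreme:

```python
import random

random.seed(1)  # for reproducibility of this illustration

observed_heads, n_flips, n_sims = 16, 20, 100_000

# Count simulated experiments (under the null: a fair coin) that are
# at least as extreme as the observed result.
extreme = sum(
    sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
    for _ in range(n_sims)
)
p_value = extreme / n_sims  # roughly 0.006: such a result is unlikely under the null
```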
A result that suggests that the results of the trial would be unlikely if the null hypothesis was true.
Alpha is the threshold for statistical significance. Conceptually, it is considered the maximum acceptable chance of making a Type I error. Under certain controlled conditions, if a study’s primary analysis returns a p-value less than the pre-determined alpha value, the analysis could be considered statistically significant.
While 0.05 or 0.025 are commonly used values for α, careful consideration should be given regarding the risks of making a Type I error and the level of certainty desired in choosing a value for α. Also note that having a p-value < α is not the sole determinant of a “good” or “successful” trial, and study design should take into account the considerable discussion about the dangers of this approach (such as in this special issue by the American Statistical Association).
ASA special issue on p-values: https://www.tandfonline.com/toc/utas20/73/sup1
The probability of not making a Type II error, assuming that the null hypothesis is false. Less technically, it is the chances, under a certain set of assumptions, of a true positive, where the null hypothesis is false and the study reaches that conclusion. This is generally used to help calculate the sample size for a study prior to the study beginning. Note that power is calculated using specific assumptions regarding what the “true” results are (e.g. the treatment effect). Power is also tied to notions of statistical significance and p-values, so critiques of these methods of designing and analyzing trials apply to power, as well.
Example: in a study to prove that Drug X is more effective than Placebo, power is the probability that if Drug X is truly more effective than Placebo, the study will demonstrate this with statistical significance.
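Power calculations are usually done with specialized software or closed-form formulas, but the concept can be illustrated by simulation. This Python sketch assumes a true 5-point difference between arms (SD 10, 50 subjects per arm) and uses a simple normal-approximation test; all numbers are illustrative:

```python
import random
import statistics

random.seed(2)  # for reproducibility of this illustration

n, true_diff, sd, crit_z = 50, 5.0, 10.0, 1.96
n_sims, successes = 2000, 0

for _ in range(n_sims):
    # Simulate one trial under the assumed "true" effect.
    drug = [random.gauss(0, sd) for _ in range(n)]          # Drug X arm
    plac = [random.gauss(true_diff, sd) for _ in range(n)]  # placebo arm, truly 5 points worse
    se = (statistics.variance(drug) / n + statistics.variance(plac) / n) ** 0.5
    z = (statistics.mean(plac) - statistics.mean(drug)) / se
    successes += z > crit_z  # one-sided normal-approximation test

power = successes / n_sims  # roughly 0.70 under these assumptions
```

Power is sensitive to the assumed effect size, variability, and sample size; doubling n per arm in this sketch would push power above 0.9.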
Multiplicity refers to the interaction between frequentist notions of statistical significance and what happens when multiple statistical tests/comparisons are performed. A single statistical test with an α = 0.05 is meant to have a maximum probability of type I error (“false positive”) of 0.05 (or 5%). Say, instead, a researcher was examining four treatments, and was performing the same test with each of them. While each separate test has a 5% chance of a false positive, there is an 18.5% chance that at least one false positive will be obtained, assuming all the results are independent of one another. In many situations, this may be considered an unacceptable level of risk of false positives. There are methods of controlling for multiplicity such that the family-wide error rate (the overall chance of at least one false positive) is controlled rather than the individual type I error rate of each test separately.
Cross-link to frequentist statistics
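The 18.5% figure from the example follows from the formula 1 − (1 − α)^k for k independent tests; a minimal Python sketch:

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

print(round(familywise_error_rate(0.05, 4), 3))  # 0.185, the 18.5% from the entry
```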
Generally, statistics are meant to estimate or approximate a population parameter. These estimates, however, naturally have some error (we don’t expect the average of our sample to exactly match the average of our population, for instance). A confidence interval is one way we attempt to describe and quantify that error.
A confidence interval is a range of numbers, based around a statistic, that attempts to capture a range where the estimated parameter is likely to be. A confidence interval always comes with a confidence value (commonly 95%, though it may vary). The confidence value is based on the alpha (α) being used for the statistical testing: α = 0.05 (5%) corresponds to a 95% confidence interval; α = 0.01 (1%) corresponds to a 99% confidence interval, and so on.
The confidence value indicates, “If we performed the same study with a different random sample over and over, and created the confidence intervals with the same method every time, what percent of the confidence intervals would truly capture the parameter?”
Example: There is a study to calculate the average systolic blood pressure of members of a population. 100 people’s blood pressures are randomly sampled and show an average systolic blood pressure of 108. The calculated 95% confidence interval for this sample is [92, 124]. If one did this study many times, approximately 95% of the confidence intervals calculated in this way would have the true average systolic blood pressure for this population in their range. Colloquially, it may be expressed as being “95% confident” that the true average systolic blood pressure for the population is between 92 and 124. (This is not the same as saying “There is a 95% chance the true average systolic blood pressure is between 92 and 124,” which is a common mistake: the true average blood pressure exists and is predetermined. You do not know whether the confidence interval contains it, but that is not the same as a frequentist expression of probability. Compare with a Bayesian credible interval.)
Cross link to: Statistics; Parameter; credible interval
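The repeated-sampling interpretation can be demonstrated by simulation. This hypothetical Python sketch draws many samples from a population with a known mean, builds a normal-approximation interval from each, and counts how often the interval captures the true value:

```python
import random
import statistics

random.seed(3)  # for reproducibility of this illustration

true_mean, sd, n, n_sims = 108, 40, 100, 2000
covered = 0

for _ in range(n_sims):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = statistics.mean(sample)
    # 95% normal-approximation interval: mean ± 1.96 * standard error
    half_width = 1.96 * statistics.stdev(sample) / n ** 0.5
    covered += (m - half_width) <= true_mean <= (m + half_width)

coverage = covered / n_sims  # close to 0.95, by construction of the method
```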
A study design and associated statistical analyses with the goal of demonstrating that one arm performs significantly better than another arm on one or more specific metrics.
Example: A randomized clinical trial of Drug X and placebo whose primary endpoint is showing subjects on Drug X have lower average pain after 4 weeks than those on placebo.
Cross-link to randomized clinical trial (RCT), endpoint
A study design and associated statistical analyses with the goal of demonstrating that one arm does not perform significantly worse than another arm on one or more specific metrics.
Note that non-inferiority studies often need to be particularly rigorous and well-designed. Regulatory guidance such as the FDA Guidance on “Non-Inferiority Trials to Establish Effectiveness” and statistical expertise are particularly necessary for these designs.
Example: A randomized clinical trial of Drug X against an established treatment Drug Y, with the primary endpoint of showing that subjects on Drug X do not have significantly worse pain after 4 weeks than those on Drug Y.
Link to the FDA guidance on non-inferiority trials: https://www.fda.gov/media/78504/download
Crosslink to Randomized Clinical trial/RCT and endpoint
In a non-inferiority design, there is a prespecified margin by which the arm being tested may be allowed to perform worse than the comparator arm while still being considered non-inferior. This non-inferiority margin is determined with both clinical and statistical guidance. In other words, the non-inferiority margin defines a small difference between arms within which the tested arm may be “acceptably less efficacious” than the comparator arm.
Example: In a clinical trial attempting to show Drug X is non-inferior to Drug Y, the non-inferiority margin is set so that if subjects on Drug X experience up to 5% more pain than subjects on Drug Y, Drug X may still be considered non-inferior if other conditions are met.
A study design and associated statistical analyses with the goal of demonstrating whether one treatment is similar to another on one or more specific metrics. Frequently this takes the form of attempting to show that a new treatment is functionally similar to an existing, accepted treatment or standard of care. Can be considered as testing two separate non-inferiority hypotheses in one design: testing that treatment A is non-inferior to treatment B, and then also testing that treatment B is non-inferior to treatment A.
Example: When a “generic” version of an approved pharmaceutical is developed, equivalence designs and testing are needed to show the new generic is functionally similar to the approved drug.
In frequentist statistics, the risk of an event is the probability the event will occur. It is calculated as (Outcomes where the event occurs) / (All outcomes).
The risk difference and relative risk are both ways of comparing the risk under two different conditions. The risk difference is calculated by subtracting the risk of the event in one arm from the risk of the same event in another arm. The relative risk is calculated by dividing those two risks instead.
Example: In a study of 100 subjects where 10 were diagnosed with an illness, the risk of that illness is 10/100 (= 0.1).
Say that 50 of these subjects were on Treatment A, and 8 of these subjects were diagnosed with the illness, and that the other 50 subjects were on Treatment B and only 2 of these were diagnosed with the illness. Then the risk of having the illness would be 8/50 (= 0.16) on Treatment A and 2/50 (= 0.04) on Treatment B. The risk difference would be 8/50 – 2/50 = 6/50 (= 0.12), and the relative risk would be (8/50)/(2/50) = 4. This might be expressed as saying that Treatment A has an absolute risk increase of 0.12 over Treatment B, and has 4 times the risk of the illness compared to Treatment B.
Cross-link to frequentist statistics
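Using the numbers from the example above, these quantities are simple to compute; a minimal Python sketch:

```python
def risk(events, n):
    """Risk = (outcomes where the event occurs) / (all outcomes)."""
    return events / n

# The entry's example: 8 of 50 subjects ill on Treatment A, 2 of 50 on Treatment B.
risk_a, risk_b = risk(8, 50), risk(2, 50)

risk_difference = risk_a - risk_b  # absolute risk increase of A over B
relative_risk = risk_a / risk_b    # how many times the risk A carries versus B
```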
In frequentist statistics, the odds of an event happening are an alternative way of expressing the probability of an event. The odds of an event are calculated as the ratio between the probability that the event happens and the probability it does not happen.
The odds ratio is calculated between two study arms or treatments by dividing the odds of an event in one treatment/arm by the odds of the event in another treatment/arm.
Examples: The odds of getting heads on a perfectly fair coin are 1, since the probability of heads (1/2) is the same as the probability of not-heads (1/2), so the odds are 0.5/0.5 = 1. Similarly, the odds of rolling a 6 on a fair six-sided die are 1/5, since the probability of a 6 is 1/6 and the probability of not-6 is 5/6.
If on Treatment X, the probability of a subject’s disease going into remission is 10%, then the odds of remission on Treatment X are 1/9. If the odds of remission on placebo are 1/19 (probability of remission = 5%), then the odds ratio between treatment and placebo is (1/9) / (1/19), or about 2.11: this means that the odds of remission are 2.11 times higher on Treatment X than on placebo. (Note that this is not the same as the ratio of probabilities: on Treatment X, the probability of remission (10%) is exactly 2 times that on placebo (5%).)
Cross-link frequentist statistics
A key characteristic of Bayesian inference is the incorporation of prior knowledge into an analysis. Rather than treating each study or dataset as a separate, independent entity, Bayesian analyses allow information from prior studies, from subject matter experts, and from existing beliefs to inform the analysis alongside the study data. The prior distribution is how this information is incorporated.
In more technical terms, in Bayesian analyses the parameter being estimated (e.g. the population mean) is not assumed to be a fixed quantity, but is instead treated as random. This assumption means that it has a distribution—a statistical shape describing how it behaves—and one can attempt to define the shape of that distribution. In practice, a Bayesian analyst defines this prior distribution to reflect information from outside the study and the level of (un)certainty around that information. A “non-informative” prior distribution is one that reflects a lack of information (no prior information is brought into the analysis) and generally results in an analysis that is analogous to a frequentist analysis.
Cross-link with parameter, population, frequentist statistics
The posterior distribution is what results from combining the information in the prior distribution with the new information from a study or other data source. It is a distribution rather than a single fixed estimate because, in Bayesian inference, the parameters being estimated are treated as random rather than as single fixed points. The posterior distribution captures the uncertainty around a result while still supporting estimation: from it, one can account for important covariates (e.g. treatment, subject demographics) and calculate an estimate for the parameter of interest.
Cross-link with parameter
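A minimal sketch of the prior-to-posterior update, using a grid approximation in plain Python. The numbers are hypothetical: a prior shaped like Beta(2, 8) (centred on a 20% remission rate), updated with 12 remissions observed among 40 subjects. The posterior is proportional to prior × likelihood, and the exact answer here is a Beta(14, 36) distribution with mean 0.28:

```python
from math import comb

# Grid of candidate values for the remission probability p
grid = [i / 1000 for i in range(1, 1000)]

# Prior: Beta(2, 8)-shaped weights (mean 0.2), un-normalised
prior = [p ** 1 * (1 - p) ** 7 for p in grid]

# Likelihood of the hypothetical data (12 remissions in 40 subjects) at each p
likelihood = [comb(40, 12) * p ** 12 * (1 - p) ** 28 for p in grid]

# Posterior is proportional to prior times likelihood; normalise to sum to 1
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(round(posterior_mean, 3))   # close to 0.28
```

With a flat (non-informative) Beta(1, 1) prior, the posterior mean would instead sit at the data alone, illustrating how the prior shifts the estimate.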
A range within which, given the observed data and the resulting posterior distribution, an unobserved parameter falls with a certain probability. Compare with the frequentist confidence interval: the credible interval carries the more intuitive interpretation, which is sometimes mistakenly applied to confidence intervals.
Example: Based on the data of a study of subjects’ fasting glucose levels, a posterior distribution for the mean glucose level of this population is obtained. From this, a 95% credible interval is calculated to be (90, 130): there is a 95% probability that the overall mean glucose level for this population is within 90-130 mg/dL.
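An equal-tailed credible interval can be read off the posterior directly: cut off 2.5% of the probability in each tail. The sketch below is self-contained and uses hypothetical numbers (a flat prior and 12 remissions in 40 subjects, rather than the glucose example above), with the posterior built on a grid:

```python
from math import comb

# Posterior for a remission probability under a flat Beta(1, 1) prior,
# after observing 12 remissions in 40 subjects (hypothetical data)
grid = [i / 1000 for i in range(1, 1000)]
unnorm = [comb(40, 12) * p ** 12 * (1 - p) ** 28 for p in grid]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

def credible_interval(grid, posterior, level=0.95):
    """Equal-tailed credible interval: cut off (1 - level)/2 in each tail."""
    tail = (1 - level) / 2
    cum, lower, upper = 0.0, grid[0], grid[-1]
    for p, w in zip(grid, posterior):
        if cum < tail <= cum + w:          # lower quantile crossed in this bin
            lower = p
        if cum < 1 - tail <= cum + w:      # upper quantile crossed in this bin
            upper = p
        cum += w
    return lower, upper

lo, hi = credible_interval(grid, posterior)
print(lo, hi)
```

The resulting interval has the direct reading shown in the glucose example: a 95% probability that the parameter lies between `lo` and `hi`, given the data and the prior.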
Cross-reference with parameter, frequentist statistics, confidence interval, population