# EPIDEMIOLOGY - GLOSSARY OF EPIDEMIOLOGICAL AND STATISTICAL TERMINOLOGY

**Agents** - are biological, physical, or chemical factors that contribute to the occurrence of a disease. Biological agents such as viruses and bacteria are often necessary causes for (infectious) diseases. Chemical agents such as poisons or allergens, or physical agents such as radiation, noise, or heat, are all non-biological agents that are frequently not necessary causes for a disease but contributing factors.

**Alpha error (Type I error)** – is one of the two types of potential errors when conducting a statistical hypothesis test. An alpha error can only occur when the statistical test result is ‘significant’ (i.e. p<0.05). In this case, there is still a small chance (below 5%) that the observed difference (or the observed amount of association) is attributable to chance alone and does not actually exist in the wider population.

**Analytical epidemiological study** – is a quantitative, comparative study investigating the relationship between a study factor and an outcome. Analytical studies often assess the effect of potential causes of disease, pathogenic mechanisms, risk factors, prognostic factors, or remedial therapies. The crucial features of analytical research are the inclusion of a control group (against which comparisons are carried out) and the a priori formulation of a quantified, comparative research hypothesis.

**Attributable risk** – is a measure of association in cohort studies and experimental studies. The attributable risk is a difference measure and calculated as the difference between the incidence of the outcome in the exposed group (or intervention) and the incidence of the outcome in the unexposed group (or control).
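
As a minimal sketch of this difference measure, the following Python function computes the attributable risk from cohort counts. The function name and the example numbers are illustrative assumptions and do not come from a real study.

```python
def attributable_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Incidence in the exposed group minus incidence in the unexposed group."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed - incidence_unexposed


# Hypothetical cohort: 30 of 200 exposed and 10 of 200 unexposed develop the outcome.
ar = attributable_risk(30, 200, 10, 200)
print(f"Attributable risk: {ar:.3f}")
```

In this made-up example the incidences are 15% and 5%, so the attributable risk is 10 percentage points.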

**Bar Chart (univariate)** – is usually a graphical display of a categorical variable. A bar chart consists of tabulated frequencies (absolute or relative frequencies), shown as separate disjointed rectangles representing the frequencies of observations of the different categories. In a bar chart, the heights of the bars directly reflect the frequencies.

**Bar Chart (bivariate)** – is also used as a name for a type of bivariate graphical display where one categorical and one numerical variable are involved. In a bivariate bar chart, usually the means (sometimes also other measures of central tendency) of the numerical variable within the different categories of the categorical variable are displayed as disjointed rectangles or bars. The differences in height of the bars or rectangles depict differences in the means of the numerical variable in the groups of the categorical variable. The rectangles are often accompanied by bar intervals indicating the size of the standard deviation of the numerical observations in each category.

**Basic reproductive rate** - is the average number of people directly infected by an infectious case during its infectious period, when the case enters a completely susceptible population. The basic reproductive rate is the theoretical potential of an infection to spread in an entirely susceptible population.

**Beta error (Type II error)** – is one of the two types of potential errors when conducting a statistical hypothesis test. A beta error can only be potentially committed when the statistical test result is ‘not significant’ (i.e. p≥0.05). In this case, there is still a chance that the observed difference (or the observed amount of association) actually exists in the wider population but the test failed to detect it. The beta error is controlled by conducting an appropriate sample size calculation during the design phase of the study.

**Bias** – If systematic error occurs during the conduct of an epidemiological study and leads to a misinterpretation of the effect measure (for example prevalence, relative risk, or attributable risk), this misinterpretation is called bias. If there is no bias, then the effect measure is called valid (or unbiased) and the results from the sample(s) are valid for the target population.

**Bivariate** – an adjective used in statistics to describe situations or procedures where two different variables are considered or involved.

**Blinding (masking)** – is a design feature of experimental studies. An experiment is called (single) blind, if either the participants or the investigator who administers the intervention or the investigator who assesses the outcome (if applicable) do not know whether the individual patient has been allocated to the intervention group or the control group. An experiment is called double blind, if neither the participants nor the investigator who administers the intervention or the investigator who assesses the outcome (if applicable) knows the group allocation of the participants. An experiment is called triple blind if the participants, the investigator who administers the intervention, and the investigator who assesses the outcome (if applicable) do not know the group allocation.

**Box-Whisker-Plot** – is a bivariate graphical display of the values of a numerical variable in the categories of a categorical variable. For each category, the numerical values are displayed using five values: the minimum, the 25% quantile, the median (50% quantile), the 75% quantile and the maximum. Often the interquartile range is depicted as a box with the median marked within the box, and the minimum and maximum as whiskers. A boxplot may also indicate “outliers” (i.e. observations out of the “normal” range). Boxplots are non-parametric displays i.e. they do not make any assumptions of the underlying statistical distribution of the numerical variable displayed. The absolute position of the interquartile box and the median are used to give a visual impression of the central tendency of the numerical variable displayed in the categories (“groups”) of the categorical variable. The width of the box and the distance of the box to the whiskers indicate the degree of spread (dispersion) of the data. The symmetry of the median within the box and the whiskers around the box reflect the skewness of the distribution of the numerical variable.

**Case reports** – describe the experience gained from one single patient, participant or client. Case reports may identify unusual features of a disease or an individual.

**Case series** - are usually based on a very small group of patients, participants, or clients who have similar signs, symptoms, diagnoses, histologies, experiences or behaviours. Case series identify unusual features of a disease or of individuals and may lead, for example, to the formulation of new aetiological hypotheses, the identification of a new disease entity, or the identification of adverse effects of a certain exposure.

**Case-control study** – is one of the three basic analytical observational study designs. A case-control study starts by defining groups according to the outcome (e.g. disease present or absent) and then looks back to establish the study factor (e.g. exposure present or absent). A case-control study has a backward directionality. A case-control study is unable to estimate relative risk. Results of a case-control study are often expressed as exposure odds-ratios.

**Categorical variable** - is a characteristic with defined categories, such as gender (categories: male and female) or blood group (categories: A, B, AB, 0). Categorical data have to be recorded in exhaustive and exclusive categories; that is, there must be enough categories so that each observation fits into a category (exhaustive), and into one category only (exclusive).

**Causality (Disease aetiology)** – is about relating causes to their effects. In the context of epidemiology, causality is about identifying the causes of disease. Sir Austin Bradford Hill has suggested nine criteria to be considered when aiming to identify causality in the epidemiological context.

**Chronic disease** - a long-lasting, persistent or recurrent disease. Chronic diseases often lead to a loss of function, impairment, and long-term disabilities. Typical chronic diseases include cardiovascular diseases, cancer, diabetes mellitus, asthma, and musculoskeletal diseases. These are diseases with complex aetiologies.

**Classical public health epidemiology** – generally aims to investigate distributions and causes of diseases in populations.

**Clinical epidemiology** - studies the diagnosis, prognosis, and therapies of patients. Clinical epidemiology is conducted in a clinical setting, usually by clinicians, with patients as the participants.

**Clinical equipoise** - Clinical trials comparing two different treatments can only be ethically justified if there is no convincing evidence that one treatment is better than the other. This prerequisite has been called clinical equipoise.

**Closed cohort** - is a cohort in which membership begins at a defined time or with a defining event and ends only with the observed study outcome, the end of eligibility for membership, or the end of the study period.

**Cluster sampling** - is a form of probability sampling that involves sampling in naturally occurring clusters such as schools, households, or suburbs. In single-stage cluster sampling, a random sample of clusters is drawn and within each selected cluster either all units of analysis or a random sample of units are observed. An example of two-stage cluster sampling is randomly sampling schools in Australia, then randomly sampling classes within each selected school, and finally inviting all students within each selected class to participate.
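
The two-stage school example can be sketched in Python as follows. This is only an illustration under stated assumptions: the nested dictionary layout, the function name `two_stage_cluster_sample`, and the school, class, and student labels are all hypothetical.

```python
import random


def two_stage_cluster_sample(schools, n_schools, n_classes, seed=0):
    """Stage 1: sample schools. Stage 2: sample classes within each selected
    school. All students of each selected class are invited."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    selected_schools = rng.sample(sorted(schools), n_schools)
    invited = []
    for school in selected_schools:
        classes = schools[school]
        k = min(n_classes, len(classes))
        for class_name in rng.sample(sorted(classes), k):
            invited.extend(classes[class_name])
    return invited


# Hypothetical sampling frame: school -> class -> list of students.
schools = {
    "School A": {"A1": ["a1_s1", "a1_s2"], "A2": ["a2_s1", "a2_s2"]},
    "School B": {"B1": ["b1_s1", "b1_s2"], "B2": ["b2_s1", "b2_s2"]},
    "School C": {"C1": ["c1_s1", "c1_s2"], "C2": ["c2_s1", "c2_s2"]},
}
invited = two_stage_cluster_sample(schools, n_schools=2, n_classes=1)
print(invited)
```

With two schools and one class per school selected, four students are invited in this toy frame; which four depends on the random seed.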

**Coding** – is the process of assigning numbers to the categories of a categorical variable for statistical analysis (e.g. for the categorical variable gender: 0=female; 1=male). Coding is not required for numerical variables, as they are already documented in numbers (e.g. age in years).

**Cohort study** – is one of the three basic analytical observational study designs. A cohort study starts by defining groups by the study factor (e.g. exposure present or absent) and then follows up these exposure groups to detect the outcome (e.g. disease present or absent). A cohort study has a forward directionality. A cohort study is able to estimate relative risk since incidences are observed.
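
Because incidences are observed in both exposure groups, the relative risk can be computed as the ratio of the two incidences. The sketch below assumes this standard ratio definition; the function name and the counts are hypothetical.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed / incidence_unexposed


# Hypothetical cohort: 30 of 200 exposed and 10 of 200 unexposed develop the disease.
rr = relative_risk(30, 200, 10, 200)
print(f"Relative risk: {rr:.2f}")
```

A relative risk above 1 suggests a higher incidence in the exposed group; a value of 1 corresponds to no difference between the groups.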

**Comparative epidemiological study** – is an epidemiological investigation that involves a comparison; it is either analytic or quality control research.

**Conceptual research hypothesis** – is another word for an initial research idea.

**Confidence interval** – is part of inferential statistics. A confidence interval allows the following statement about the unknown population parameter by taking exclusively information from a sample into account: The true but unknown population parameter lies within a (1-α)-confidence interval with a probability of 1-α. In most cases α is set to 5% and as a consequence 95% confidence intervals are calculated.
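
As one possible illustration, the sketch below computes a (1-α)-confidence interval for a population mean using a Normal approximation. This is an assumption made for brevity: for small samples a t-distribution quantile would usually be more appropriate. The function name and the blood pressure values are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(values, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean (Normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical sample of systolic blood pressure readings (mmHg).
readings = [120, 118, 125, 130, 122, 127, 119, 124, 121, 126]
lower, upper = mean_confidence_interval(readings)
print(f"95% CI for the mean: {lower:.1f} to {upper:.1f}")
```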

**Confounding bias** – is one of the three main types of bias. Confounding may occur if the effect of the study factor on the outcome is mixed in the data with the effect of another variable (= confounder). Whether confounding truly exists in a study can only be assessed during data analysis.

**Consolidated Standards of Reporting Trials (CONSORT) checklist** – includes a checklist of topics which should be considered when writing up a randomised controlled trial for publication. The checklist contains the most important elements of a study protocol and is therefore also a very helpful guide during the planning stage of a RCT.

**Convenience sampling** – is a form of non-probability sampling. A convenience sample is a sample of “conveniently” available participants such as “patients from a hospital” or “man on the street” sampling.

**Critical appraisal** – of a scientific publication involves reading, comprehending and critically assessing the methods, results and conclusions drawn in the publication in order to assess the evidence provided by the manuscript. Critical appraisal requires proficiency in epidemiological and to a degree also in statistical methods.

**Cross-sectional study (survey, prevalence study)** – is one of the three basic observational study designs. Cross-sectional studies can be purely descriptive or analytical. In cross-sectional studies the presence or absence of disease and study factor(s) (or their amount if they are quantitative) is determined for each individual participant at one particular point in time. A cross-sectional study is non-directional.

**Cumulative incidence** - is the number of new cases (i.e. people newly acquiring a disease or an attribute) occurring over a specified period of time divided by the total number of people in the population at risk of becoming a new case at the beginning of the time period.
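
The definition translates directly into a short calculation. The following is a minimal sketch with a hypothetical function name and made-up counts.

```python
def cumulative_incidence(new_cases, population_at_risk_at_start):
    """New cases over the period divided by the population at risk at its start."""
    if population_at_risk_at_start <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases / population_at_risk_at_start


# Hypothetical example: 25 new cases in one year among 1000 people at risk.
cum_inc = cumulative_incidence(25, 1000)
print(f"Cumulative incidence over one year: {cum_inc:.1%}")
```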

**Declaration of Helsinki** – was developed by the World Medical Association as a set of ethical principles when conducting research in human populations. The Declaration of Helsinki was originally adopted in 1964 and has since undergone several revisions. The Declaration is not legally binding under international law but is considered a cornerstone of human research ethics and has influenced numerous national guidelines.

**Descriptive statistics** – summarise the collected data in a meaningful way. Descriptive statistics is usually the first step of a statistical analysis. Categorical variables are described using frequencies and percentages. Numerical data are described using measures of central tendency accompanied by a measure of dispersion.

**Descriptive studies** - describe patterns of disease occurrence in relation to characteristics of persons, place, and time. Descriptive research provides data that are used only for descriptive purposes. The research hypothesis of a descriptive study does not involve a comparison.

**Determinant-centred epidemiology** - is epidemiological research that investigates the effect of a specific determinant or exposure on health outcomes. For example, nutritional epidemiology investigates the effect of diet on health.

**Diagnostic test** – is a test applied to a person in order to determine the health status of the person. In contrast to a screening test, a diagnostic test is usually applied to symptomatic persons. Diagnostic tests are often used to confirm diseases suggested by symptoms and other circumstantial evidence.

**Direction of bias** - refers to the deviation of the estimated effect measure in the sample from the effect measure in the target population caused by bias. The direction of bias can be towards the null-value (i.e. under-estimating the true effect in the target population), away from the null-value (i.e. over-estimating the true effect in the target population), or switch over (i.e. in the study an exposure is identified as a risk but in the target population it is actually protective or vice versa).

**Directionality** – is the inner logic of an analytical study design. A study follows a “forward” directionality when first the exposure groups are defined based on the study factor and then the participants are followed up to detect the outcome. Studies of “backward” directionality first define groups based on the outcome and then look backwards to the exposure status of participants. “Non-directional” means that both exposure and outcome are observed simultaneously in one group of participants.

**Disease-centred epidemiology** – is epidemiological research that focuses on only one disease or defined group of diseases and investigates distribution and determinants of this disease or group of diseases. For example, cancer epidemiology investigates distribution and determinants of cancers.

**Dynamic cohort** - is a cohort that gains and loses members throughout its existence. Most cohorts in epidemiology are dynamic.

**Ecological (correlational) studies** - measure and correlate characteristics of several populations. In ecological studies the units of analysis are not individual persons but entire populations such as countries, states, or cities. An ecological study links characteristics of the populations such as correlating average household income with the infant mortality rate, or level of air pollution with overall mortality rate. Typically, the analysis of ecological data involves plotting a suspected risk factor against an outcome to judge whether a correlation is evident. The aggregated data used to create the scatter-plot are often from routinely available data sources.

**Ecological fallacy (or aggregation bias, ecological bias)** - occurs because an association observed between variables at an aggregate level does not necessarily represent the association that exists at an individual level. An ecological fallacy might occur when the interpretation of the results of an ecological study is not limited to the aggregate level but is applied to the individual level.

**Endemic** – means the occurrence of a disease in a population or region at ‘normally’ expected levels. Endemic implies that the disease is able to maintain itself in a population or region without cases entering the population or region from outside.

**Environment** – in an epidemiological context refers to the habitat in which the biological agent and the host exist, survive or originate.

**Epidemic** – means the occurrence of a disease in a population or geographical region at clearly higher levels than ‘normally’ expected.

**Epidemic curve** – is a graphical display of the distribution of cases of an outbreak by time of onset.

**Epidemiological (demographic) transition** – is the transition from high mortality rates, usually caused by infectious diseases, to lower mortality rates mainly caused by chronic diseases in a country or region. It is usually accompanied by a transition from high to low fertility rates. The theory of demographic transition evolved to explain the rapid changes in population structure as observed during the industrialization of western countries in the 19th and 20th centuries.

**Epidemiological process** – is an idealised concept on how to conduct epidemiological research. It is a cyclic process governed by the scientific method. Current theory and knowledge inform a research idea. A study design is chosen to investigate the research idea. An operational, falsifiable, research hypothesis is formulated. Tools are developed to collect data in a standardised way. Data is collected, collated and statistically analysed. The results of this analysis confirm or reject the operational research hypothesis. The results of the study are published and thereby integrated into the current theory and knowledge.

**Epidemiology** - is the study of the distribution and determinants of health-related states or events in specified populations and the application of study results to control health problems.

**Ethics** - is the part of philosophy that deals with moral issues such as good and evil, right and wrong, what is just, etc.

**Evidence-based practice** - is an approach to health care where health professionals use the best currently available evidence possible. Evidence-based practice uses the most appropriate and most current information available to make optimal clinical decisions for individual patients.

**Experimental studies** – constitute one of the two major branches of analytical research (observational studies form the second branch). In experiments, the study factor - usually an intervention - is actively and deliberately managed by the investigator. The researcher intentionally alters one or more factors under controlled conditions in order to study the effects of doing so. For example, one group is given the new drug while the control group receives the standard treatment.

**Exposure odds-ratio** – is a measure of association used in case-control studies. The exposure odds-ratio compares the odds of exposure in the cases with the odds of exposure in the controls.
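
The comparison of the two odds can be sketched as follows. The function name and the 2x2 table counts are hypothetical and only serve to illustrate the arithmetic.

```python
def exposure_odds_ratio(exposed_cases, unexposed_cases,
                        exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls


# Hypothetical case-control study:
# 40 of 100 cases were exposed, 20 of 100 controls were exposed.
odds_ratio = exposure_odds_ratio(40, 60, 20, 80)
print(f"Exposure odds-ratio: {odds_ratio:.2f}")
```

Here the odds of exposure are 40/60 among cases and 20/80 among controls, giving an odds-ratio of about 2.67.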

**Falsifiable research hypothesis** – is a precise quantified statement of the expected outcome that can be falsified or confirmed by the study results.

**Fixed cohort** – is a cohort in which membership is fixed by being present at some defining event; often the start of the study period.

**Health** – is a state of complete physical, mental, and social well-being; not merely the absence of disease or infirmity.

**Hill’s criteria of causality** – are a series of criteria that should be considered when judging whether an observed association between a study factor and an outcome is likely to be a causal relationship. The criteria include: strength of the association, biological gradient, lack of temporal ambiguity, specificity, experiment, plausibility, consistency of findings, coherence of evidence, and analogy.

**Hippocratic Oath** – is taken by doctors as a pledge to practice medicine in an ethical way. The Oath was written some 2500 years ago and is attributed to Hippocrates (460 – 370 BC) or his students.

**Histogram** – is a graphical display of a numerical variable. A histogram consists of tabulated frequencies (absolute or relative frequencies), shown as adjoining rectangles representing the frequencies of observations in discrete intervals, called bins. In general, the bins are allowed to be of different widths so that the area of the rectangles (i.e. the height times the width; not the height alone!) reflects the respective frequency observed for a specific bin. In most cases, however, the intervals of the bins are chosen equidistant (of the same width) so that the heights of the rectangles directly reflect the frequencies of the observations within the bins.
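
To make the area rule concrete, the sketch below computes rectangle heights for bins of unequal width so that height times width equals the bin frequency. The function name, the convention that bins are closed on the left (with the last bin also including its right edge), and the age values are assumptions for illustration.

```python
def histogram_heights(values, bin_edges):
    """Return one height per bin so that height * width equals the bin frequency."""
    heights = []
    last_index = len(bin_edges) - 2
    for i, (left, right) in enumerate(zip(bin_edges, bin_edges[1:])):
        # Bins are [left, right); the last bin also includes its right edge.
        count = sum(
            1 for v in values
            if left <= v < right or (i == last_index and v == right)
        )
        heights.append(count / (right - left))
    return heights


# Hypothetical ages, with bins of widths 5, 15 and 40 years.
ages = [1, 2, 3, 4, 12, 15, 18, 25, 40, 55]
edges = [0, 5, 20, 60]
heights = histogram_heights(ages, edges)
print(heights)
```

The three bins contain 4, 3 and 3 observations, but the heights differ strongly because the widest bin spreads its 3 observations over 40 years.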

**Host** – is a person or other animal that harbours an infectious agent.

**Human Research Ethics Committees (HRECs)** - oversee the ethical conduct of research. In Australia, there are more than 200 established HRECs in institutions and organisations, including health districts and universities. Every research project that involves humans or human tissue requires ethical approval by an HREC.

**Incidence** - is a measure of disease frequency. Incidence quantifies the number of new cases (incident cases; i.e. people newly acquiring a disease or an attribute) in a population at risk of developing the disease over a given period of time.

**Incidence rate** - is the number of new cases (i.e. people newly acquiring a disease or an attribute) developing during a specific period of time divided by the total disease free person-time of observation seen in the population at risk.
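
A minimal sketch of the person-time calculation follows. It assumes each participant is recorded as a pair of disease-free observation time and a flag for becoming a new case; the function name and the follow-up data are hypothetical.

```python
def incidence_rate(follow_up):
    """follow_up: list of (disease_free_time, became_case) pairs, one per person."""
    new_cases = sum(1 for _, became_case in follow_up if became_case)
    person_time = sum(time for time, _ in follow_up)
    return new_cases / person_time


# Hypothetical cohort of five people, disease-free time measured in years.
cohort = [(2.0, True), (5.0, False), (3.5, True), (5.0, False), (4.5, False)]
rate = incidence_rate(cohort)
print(f"Incidence rate: {rate:.2f} cases per person-year")
```

Two new cases over 20 person-years of disease-free observation give 0.10 cases per person-year in this example.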

**Incubation period** - is the time interval between exposure to a sufficient cause of the disease and the onset of symptoms. The incubation period = Induction period + Latency period.

**Infectious disease (Communicable disease)** - is an illness caused by transmission of a specific infectious biological agent or its toxic products (= necessary cause) from an infected person, animal, or reservoir to a susceptible host. The transmission can occur directly or indirectly from a plant or animal host, vector, or the inanimate environment.

**Infectious disease epidemiology** – is the part of epidemiology that focuses on infectious diseases. Infectious disease epidemiology raises very specific questions about agents, transmission routes, and immunization. Infectious disease epidemiology provides models explaining occurrence and development of infectious disease outbreaks.

**Inferential statistics** – is used to generalize from the sample to the target population. Inferential statistics are applied in two stages of an epidemiological study: during the study design to estimate the appropriate sample size according to the specified research hypothesis, and during the analysis to confirm or reject the research hypothesis. There are two different techniques to confirm or reject a research hypothesis: the use of confidence intervals or the conduct of a statistical hypothesis test.

**Information bias** – is one of the three main types of bias. Information bias refers to a distortion in the effect measure, due to measurement error or misclassification of participants for one or more variables. Information bias occurs when the measurement of either the study factor or the outcome is systematically inaccurate.

**Informed consent** – is considered the standard expression of autonomy of a research participant. Current ethical guidelines insist that researchers who deal with human participants must seek informed consent from participants prior to any research being conducted. The process of obtaining informed consent means that the researcher is required to explain the research project verbally and/or in writing and, most importantly, outline exactly what participation in the study entails. In many research projects participants are asked to sign a statement of informed consent.

**Instructions for authors** – provide details of the exact format required for the submission of a manuscript to a particular scientific journal. Numerous journals in the health sciences have agreed to adopt a common format for manuscripts called the “Uniform Requirements for Manuscripts Submitted to Biomedical Journals”.

**Interquartile Range (IQR)** – is a measure of dispersion used to describe a sample of a numerical variable. The IQR is the range between the 0.25 and the 0.75 quantile (see also p-quantile). It encompasses the “middle” 50% of all observations. As opposed to the standard deviation, it can also be used in a meaningful way when the underlying distribution is not symmetrical.
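
The IQR can be sketched with Python's standard `statistics` module. Note that several conventions exist for computing sample quantiles; the choice of `method="inclusive"` here is an assumption, and other software may return slightly different quartiles. The function name and the data are hypothetical.

```python
import statistics


def interquartile_range(values):
    """Return (q1, q3, iqr) using the 0.25 and 0.75 quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1


# Hypothetical sample of nine observations.
sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]
q1, q3, iqr = interquartile_range(sample)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```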

**Interview** – is a form of data collection. An interview is a conversation between a researcher and a participant specifically designed to obtain certain information. Interviews can have many formats (e.g. standardised or not standardised, face-to-face or distant).

**Literature review** – is the process of collecting, classifying and evaluating what other researchers have previously written about a topic. A literature review is not only a summary or classification of published material but crucially also a critical appraisal of the research conducted in a specific area. A literature review should be the start of any intended research project. It should be systematic and accompanied by a sufficiently detailed search protocol.

**Matching** – is a design feature in analytical observational studies. Matching is a technique that tries to create comparable groups (i.e. cases and controls in case-control studies or the exposure groups in cohort studies) with respect to extraneous factors (e.g. matching for gender in a case control study means that for each case a gender-matched control is included in the study).

**Mean** – is a measure of central tendency used to describe a sample of a numerical variable. The mean is the “average” of all observations. It is calculated by adding up the values of all observations and dividing the result by the number of observations. The mean can only be used in a meaningful way if the distribution of the numerical variable is symmetrical.

**Median** - is a measure of central tendency used to describe a sample of a numerical variable. The median is the “middle” value when all observations are sorted in ascending order. In contrast to the mean, the median can also be used in a meaningful way when the underlying distribution is not symmetrical. The median is the 0.5-quantile (see p-quantile).

**Measure of central tendency** – is a summary word for descriptive statistics for numerical data that describe the centre of the distribution. The most frequently used measures of central tendency are the arithmetic mean and the median.

**Measure of dispersion** – is a summary word for descriptive statistics for numerical data that describe the spread of the distribution. The most frequently used measures of dispersion are the standard deviation, the inter-quartile range and the range.

**Multivariable** – an adjective used in statistics to describe situations or procedures where more than two variables are considered or involved.

**Multivariable analysis** - is used to statistically investigate the relationship between study factor(s) and outcome by simultaneously allowing adjustment for confounding. Multivariable models assess the effects of several variables together. The choice of an appropriate multivariable model (e.g. logistic regression, multiple linear regression, Cox proportional hazard analysis) is foremost dependent on the type of the outcome variable.

**Natural history of disease** - is a model or a framework for thinking about the natural development and course of diseases. The natural history of disease expresses the progress of a disease in an individual over time – in an idealised and standardised format. Many diseases have specific stages such as the stage of susceptibility, the sub-clinical stage, the stage of clinical disease, and the stage of recovery, disability or death. Together the stages form the natural history of a disease.

**Necessary cause** – must always precede an effect. In the epidemiological context a necessary cause might be an infectious agent causing an infectious disease. If a necessary cause exists then it forms part of the sufficient cause in every person with the disease.

**Negative predictive value** - of a diagnostic or screening test is the probability that a person with a negative test result will actually be disease free.
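
In terms of test results, this probability is commonly estimated as the proportion of true negatives among all negative test results. The sketch below assumes that estimate; the function name and the counts are hypothetical.

```python
def negative_predictive_value(true_negatives, false_negatives):
    """Proportion of people with a negative test result who are disease free."""
    negative_results = true_negatives + false_negatives
    if negative_results == 0:
        raise ValueError("no negative test results to evaluate")
    return true_negatives / negative_results


# Hypothetical screening results: 920 negative tests, 20 of them in diseased people.
npv = negative_predictive_value(true_negatives=900, false_negatives=20)
print(f"Negative predictive value: {npv:.3f}")
```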

**Non-Parametric statistics** – make no assumptions on the distribution(s) of the underlying numerical variable(s) other than that it is supposed to be convex. This implies that no assumptions referring to the Normal distribution of the numerical variable(s) under consideration are made either. Non-parametric statistics can still be used even if the numerical distribution(s) are non-Normal (i.e. skewed) and usually deal with medians and non-parametric measures of dispersion (such as the interquartile range).

**Normality assumption** – refers to the assumption that a numerical variable follows a Normal distribution i.e. the distribution function closely follows a Gauss-bell shape. As a rule of thumb, the examination of the histogram (should loosely display a Gauss bell shape), the calculation of the ratio between the mean and the median (should be in the boundaries of 0.9 and 1.1), and the relative size of the standard deviation to the mean (the ratio of SD over the mean should be less than 0.33) can be used to practically decide whether a Normal distribution can be assumed or not. Note that all three checks should be fulfilled before a Normal distribution can be assumed.
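
The two numerical checks of this rule of thumb can be sketched as below; the histogram check remains a visual judgement and is not covered. The function name and the example data are hypothetical, and the sketch assumes positive values so that both ratios are well defined.

```python
import statistics


def normality_rule_of_thumb(values):
    """Apply the two numerical rule-of-thumb checks (histogram check not included)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return {
        # Ratio of mean to median should lie between 0.9 and 1.1.
        "mean_median_ratio_ok": 0.9 <= mean / median <= 1.1,
        # Ratio of SD to mean should be less than 0.33.
        "sd_mean_ratio_ok": sd / mean < 0.33,
    }


# Hypothetical roughly symmetrical sample and a clearly skewed sample.
symmetrical = [48, 50, 52, 49, 51, 50, 53, 47]
skewed = [1, 1, 2, 2, 3, 4, 50]
print(normality_rule_of_thumb(symmetrical))
print(normality_rule_of_thumb(skewed))
```

Only if both numerical checks pass and the histogram loosely resembles a bell shape would a Normal distribution be assumed under this rule of thumb.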

**Number needed to treat (NNT)** – is the estimated number of patients who need to be treated with the new intervention rather than the standard treatment (control) in order to achieve one additional successful “treatment”. NNT is an additional useful way of reporting the results of a randomised controlled trial.
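
NNT is commonly calculated as the reciprocal of the difference in success proportions between the intervention and control groups. The sketch below assumes that formula and the convention of rounding up to the next whole patient; the function name and the trial counts are hypothetical. Exact fractions are used to avoid floating-point rounding artefacts before rounding up.

```python
import math
from fractions import Fraction


def number_needed_to_treat(success_intervention, n_intervention,
                           success_control, n_control):
    """Reciprocal of the difference in success proportions, rounded up."""
    difference = (Fraction(success_intervention, n_intervention)
                  - Fraction(success_control, n_control))
    if difference <= 0:
        raise ValueError("intervention shows no benefit over control")
    return math.ceil(1 / difference)


# Hypothetical trial: 60 of 100 succeed on the new intervention, 40 of 100 on control.
nnt = number_needed_to_treat(60, 100, 40, 100)
print(f"Number needed to treat: {nnt}")
```

In this example the success proportions differ by 20 percentage points, so about 5 patients need to receive the new intervention for one additional success.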

**Numerical variable** - is a characteristic that takes on numerical values within a certain range. Examples of numerical variables are blood pressure, weight, age, and number of children. Numerical data should be recorded as such i.e. not be categorised; if required categorisation can be done during the analytical process.

**Observational studies** - constitute one of the two major branches of analytical research (experimental studies form the second branch). In observational studies the study factor is only observed but not altered or managed in any way by the investigator.

**Observational uniformity** – is a pre-requisite for the internal validity of an epidemiological study. Observational uniformity means that the groups to be compared are observed in the same way, with the same intensity, making the same measurements and observations in the same way, and with equal documentation. The information collected must be valid. Observational uniformity implies that there is no information bias.

**Observations** – is one form of data collection. Observation involves being close to the person or unit of analysis who/that is being studied so that behaviours and characteristics can be directly observed and recorded. Observation has the advantage that the researchers directly see and hear how people act rather than having to rely on the participants’ or third persons’ interpretations.

**Odds-ratio** - is a measure of association used in case-control (exposure odds-ratio) and cross-sectional studies (prevalence odds-ratio).

**Operational research hypothesis** – is a clear and precise quantified statement of the question that the research is designed to answer. The operational research hypothesis must be plausible and falsifiable. In order for the hypothesis to be falsifiable, the expected result has to be quantified in measurable terms. An operational research hypothesis is the centre of every well-planned quantitative research project.

**Outbreak investigation** – is an epidemiological investigation of the cause and control of an epidemic in a specific location (e.g. suburb, closed institution) or defined population (e.g. intravenous drug users, dialysis patients).

**Paired statistical test** – is a version of a statistical test suitable for paired data i.e. situations where the same individuals were measured twice (or more often) resulting in multiple measurements of the same characteristic in the identical individuals.

**Parametric statistics** – make assumptions on the distribution(s) of the underlying numerical variable(s). These assumptions mostly refer to a Normal distribution of the numerical variable(s) under consideration. In univariate and bivariate statistics this means that parametric statistics usually deal with means and standard deviations.

**Peer-review** – for scientific manuscripts submitted to a peer-reviewed journal involves the (sometimes repeated) review of the manuscript by fellow researchers (peers). Manuscripts may be rejected by a journal as a result of the peer-review or accepted after a successful review process.

**Period prevalence** - is the total number of people with a disease or an attribute during a particular period of time divided by the size of the total population at the midpoint of the observation period.

**PICOT** – stands for Population, Intervention, Comparison, Outcome, and Timeframe. PICOT is a conceptual framework helpful on the way from an initial research idea to the final operational research hypothesis.

**Pie Chart** – is a graphical display of one categorical variable. This circular chart (pie) depicts the absolute or relative frequencies of the different categories of the underlying categorical variable as proportionally sized sectors (slices of the pie). All sectors (slices) together constitute a full disk (pie). Pie charts are widely used in the mass media but are often of very limited value, especially when the relative sizes of slices of one pie or of slices across different pies are to be compared. A tabled list of percentages is often more informative (and easier to obtain).

**Placebo** – is an inert substance or procedure with no known physiological or other effect. Placebos are used as mock treatments for the control group in experimental studies. Even if an intervention is theoretically totally irrelevant to the participant’s condition, the participant’s attitude towards the condition and indeed the condition itself may improve by the perception that something is being done. This effect is known as the placebo effect. Placebos are used to counter this form of information bias. In patient trials, placebos can only be ethically justified when no known effective treatment is available for the control group.

**Point prevalence** – is the total number of people with a disease or an attribute divided by the total number of people in the population at a given point in time.

**Population at risk** - are all people under observation who initially do not have the disease or the attribute but are “at risk” of acquiring the disease or the attribute.

**Positive predictive value** - of a diagnostic or screening test is the probability that a person with a positive test result will actually have the disease.
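
As a minimal sketch, the positive predictive value can be computed from the counts of true positive and false positive test results. The function name and the counts below are hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """Proportion of people with a positive test who actually have the disease."""
    return true_positives / (true_positives + false_positives)

# Hypothetical screening result: 120 positive tests, 90 of them in truly diseased people
ppv = positive_predictive_value(true_positives=90, false_positives=30)
print(ppv)
```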

**Prevalence** - is a measure of disease frequency. Prevalence quantifies the number of existing cases (prevalent cases; i.e. people with a disease or an attribute) in a population at a point in time or during a period of time.

**Prevalence odds-ratio** – is a measure of association used in cross-sectional studies. The prevalence odds-ratio compares the odds of the prevalence of the outcome in the exposed group with the odds of the prevalence of the outcome in the unexposed group.
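
The comparison of the two odds can be sketched as follows. The function name and the cross-sectional counts are hypothetical; the calculation is the usual ratio of odds from a 2x2 table.

```python
def prevalence_odds_ratio(exposed_with, exposed_without,
                          unexposed_with, unexposed_without):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    odds_exposed = exposed_with / exposed_without
    odds_unexposed = unexposed_with / unexposed_without
    return odds_exposed / odds_unexposed

# Hypothetical cross-sectional survey:
# exposed: 40 with the outcome, 60 without; unexposed: 20 with, 80 without
por = prevalence_odds_ratio(40, 60, 20, 80)
print(round(por, 2))
```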

**Primary prevention** – are public health efforts that are directed towards the stage of susceptibility of a disease. Primary prevention aims to prevent or reduce “exposure” and thus the possibility of the disease occurring. An example of primary prevention is the “Slip, Slop, Slap” campaign to reduce sun exposure and hence prevent skin cancer.

**Probability sampling** – is a sampling strategy where each individual in the target population has a known chance of being sampled. Random sampling is the most important example of probability sampling.

**Prolective data** – are specifically collected for the purpose of a particular study. Prolective data collection can be adapted to the specific needs of a study; the researcher is able to decide on the type and format of the required data. Prolective data collection allows control over the quality of the data and is the preferred option for research.

**Public health** - is a multidisciplinary set of activities concerned with the protection and promotion of the health of people and communities, and the delivery of health services to people and communities.

**p-quantile** – of the sample (sorted in ascending order) of a numerical variable is defined by the value for which a proportion p of the observations lie below and a proportion 1-p lie above. For instance, the 0.25-quantile is the value where 25% of observations are smaller and 75% are larger than the quantile. The median is the 0.5 or 50% quantile.
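
A short sketch using Python's standard `statistics` module illustrates the idea. The data are hypothetical, and software packages use slightly different interpolation conventions, so quantiles of small samples can differ a little between tools; the `"inclusive"` method is chosen here only for illustration.

```python
import statistics

# Hypothetical sample of a numerical variable (need not be pre-sorted)
data = [7, 2, 9, 4, 12, 4, 8, 5, 11]

# n=4 returns the three cut points: the 0.25-, 0.5- and 0.75-quantiles
quartiles = statistics.quantiles(data, n=4, method="inclusive")
median = statistics.median(data)

print(quartiles)  # the middle value is the 0.5-quantile
print(median)
```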

**p-value** – is the result of a statistical hypothesis test. The p-value gives the probability of obtaining in a sample a difference as large as the actually observed one (or an even larger one) if in reality (i.e. in the wider population) there is no such difference. Thus the p-value quantifies how readily chance alone could produce a difference at least as large as the observed one; it is not the probability that there is truly no difference. The smaller the p-value, the less compatible the observed difference is with chance alone. If the p-value is less than a set alpha level (usually less than 0.05), then the result of the statistical hypothesis test is called statistically significant.
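
The definition can be illustrated with an exact one-sided binomial calculation. The scenario is hypothetical: 16 of 20 patients prefer treatment A, and the assumption of "no difference" is taken to mean that each patient prefers A with probability 0.5. The function name is likewise an assumption made for this sketch.

```python
from math import comb

def one_sided_binomial_p_value(successes, n, p_null=0.5):
    """Probability of observing `successes` or more out of n if p_null is true."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null)**(n - k)
        for k in range(successes, n + 1)
    )

# Hypothetical data: 16 of 20 patients prefer treatment A
p_value = one_sided_binomial_p_value(16, 20)
significant = p_value < 0.05  # compared against the usual alpha level
print(p_value, significant)
```

A result this extreme (or more extreme) would arise in well under 1% of samples if there were truly no preference, so the result would be called statistically significant at the 0.05 level.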

**Qualitative epidemiology** – encompasses a broad range of research approaches that are descriptive in nature and do not aim to generalise to a wider population by using inferential statistics. Qualitative research does not rely on quantifying results. Qualitative research is concerned with the individual experience or process. Many researchers, quantitative or qualitative, would dispute qualitative research being part of epidemiology.

**Quality control research** – is a branch of comparative epidemiology. It includes methods and study designs that assess measurement quality and intra- and inter-observer agreement.

**Quantitative epidemiology** – aims to collect information on many people who are representative of a wider population. The intention of quantitative epidemiology is to draw inferences from the study sample and relate them to a wider population. In quantitative research information is quantitatively measured and results are assessed using statistical analysis.

**Questionnaire** – is a data collection instrument. A questionnaire is a document designed to seek specific information from respondents. The questions asked should be straightforward, unambiguous and in simple direct language which is in tune with the target population.

**Random error** – is a type of error that is governed by chance. The smaller the random error in a study the more reliable are the results of the study. Random error occurs in every epidemiological study due to natural or biological variation. Random error is assessed during the statistical analysis of the data collected in a study.

**Random sample** – is a form of probability sampling. In a random sample each individual or unit of analysis in the target population has an equal chance of being selected into the sample.

**Randomisation** – is a design feature of experimental studies. Randomisation means that participants are allocated to the intervention or the control group by chance alone; that is every participant has the same chance to be either in the control or in the intervention group.
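
One simple way to allocate participants by chance alone is to shuffle the list of participants and split it in half. This is only a sketch of one possible scheme; the function name, the participant identifiers, and the fixed seed (used here to make the example reproducible) are assumptions.

```python
import random

def randomise(participants, seed=None):
    """Shuffle participants and split them into two equally sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

# Hypothetical participant identifiers
participants = [f"P{i:02d}" for i in range(1, 11)]
groups = randomise(participants, seed=42)
print(groups)
```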

**Randomised controlled trial (RCT)** – is a specific type of experimental study. In an RCT participants are randomised to an intervention or a control group. The control group of a randomised controlled trial is concurrent.

**Range** – is a measure of dispersion used in the description of a sample of a numerical variable. The range is the distance between the smallest and the largest observation.

**Referencing** – In a scientific publication each factual statement, other than the ones that result directly from the present study, has to be accompanied by a supporting reference. Two main referencing styles are used in scientific journals: the Vancouver style asks for a numbered list of references according to the order of appearance in the text, and the Harvard style requests an alphabetical list of references with the full name of the first author and the year of publication given in the text.

**Relative risk** – is a ratio measure of association in cohort studies and experimental studies. The relative risk is the incidence of the outcome in the exposed (or intervention) group divided by the incidence of the outcome in the not-exposed (or control) group.
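
The division described above can be sketched as follows. The function name and the cohort counts are hypothetical.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed / incidence_unexposed

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed people develop the outcome
rr = relative_risk(30, 100, 10, 100)
print(round(rr, 2))
```

A relative risk of about 3 means that, in this hypothetical cohort, the outcome was three times as frequent among the exposed as among the unexposed.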

**Reliability (Consistency, repeatability, precision, or reproducibility)** – of ‘measurements’ means that if “measurements” were repeated with the same participants by the same or a different health professional, the results of the repeated “measurements” would be very similar or even identical to the first findings. The “measurements” might be responses to questions, results of diagnostic tests, or physical measurements such as height or weight. Also important is the reliability of the overall results of an epidemiological study (i.e. the amount of random error involved) which is assessed with statistical techniques such as confidence intervals and statistical hypothesis testing.

**Representative uniformity** – is a pre-requisite for the internal validity of an epidemiological study. Representative uniformity means that the sample(s) represent the target population. Representative uniformity implies that there is no selection bias.

**Research** – is about discovering and acquiring new knowledge in an organised way using methods that are reproducible and ethically acceptable.

**Research design** – includes the entirety of how a study is planned. Research design is the “architecture” of a study including all its details such as the population and the sampling frame, the methods used to assess and measure participants, planned analysis and time frame.

**Retrolective data** – are based on previously collected routine data such as medical records or registry data. Retrolective data were not specifically collected for a scientific purpose and their quality is often of concern.

**Routinely collected health data** – are data or statistics that are routinely collected by Governments, Health Departments, or other agencies. Examples of routinely available data are cancer registry data or birth and death registry data. In this context, “routinely” implies that the data are not collected for a specific research project but are collected anyway.

**Sample** - is a selected subset of a wider population. The sample is the group of individuals from whom data for a study are collected.

**Sample size** – In quantitative epidemiological research it is important for a study to have an adequate sample size to allow estimations and comparisons with some pre-defined statistical confidence. Sample size calculation ensures optimal sample size in the sense that a study has sufficient power to detect an existing difference with statistical confidence - without wasting resources to collect too large a sample.

**Sampling frame** – is the source entity from which the sample will be drawn. Examples of sampling frames include the telephone book or a list of hospital patients.

**Scatterplot (or scattergram)** - is a graphical display of the values of two numerical variables in a two-dimensional system of Cartesian coordinates. Each variable defines one axis. Each pair of observations is represented by a dot (the x-coordinate and y-coordinate referring to the observed values of the two variables plotted). The swarm of all dots gives a visual impression of the association between the two numerical variables. If it is possible to identify an independent variable and a dependent one (e.g. one variable can influence the other but not vice versa; e.g. blood pressure and age), then the independent variable is used to define the horizontal axis (x-axis) and the dependent variable defines the vertical axis (y-axis). (In the blood pressure / age example, age can influence blood pressure but not vice versa; thus age will define the x-axis).

**Scientific publication** - is the logical endpoint of any research project because the results of a research project should be made available to other researchers and to society at large. Scientific publications are usually published as journal articles. The structure and language of a scientific publication are rigid and controlled. A scientific publication usually includes sections called ‘introduction’, ‘methods’, ‘results’, and ‘discussion’.

**Screening** – is the testing of usually asymptomatic people to determine their likelihood of having a particular disease. Screening tests sort out asymptomatic persons who probably have a disease from those who probably do not. A screening test is not intended to be diagnostic. Persons with positive or suspicious findings are usually referred to additional diagnostic procedures.

**Secondary prevention** - are public health efforts that are directed towards the subclinical stage of a disease. Secondary prevention aims to prevent the clinical stage of a disease or to reduce the severity of the disease once it has emerged. An example of a secondary prevention effort is screening for a disease.

**Selection bias** – is one of the three main types of bias. Selection bias refers to a distortion in the effect measure resulting from the manner in which the people are selected for the sample. For example, selection bias may be introduced if sampling techniques are inappropriate. If selection bias occurs, the sample(s) do not represent the target population.

**Sensitivity** – is the ability of a diagnostic or screening test to correctly identify people who actually have the disease (‘correct positive’). Sensitivity is the number of truly diseased people who test positive divided by the total number of truly diseased people.

**Specificity** - is the ability of a diagnostic or screening test to exclude people who are free of disease (‘correct negative’). Specificity is the number of people who tested negative and who are truly disease-free divided by the total number of people who are truly disease free.
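
Sensitivity and specificity can both be read off a 2x2 table of test result against true disease status. The sketch below uses hypothetical counts and function names.

```python
def sensitivity(true_positives, false_negatives):
    """Truly diseased people who test positive / all truly diseased people."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Truly disease-free people who test negative / all truly disease-free people."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical evaluation: 100 diseased people (90 test positive),
# 900 disease-free people (870 test negative)
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=870, false_positives=30)
print(round(sens, 3), round(spec, 3))
```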

**Standard Deviation (SD)** – is a measure of dispersion used to describe a sample of a numerical variable. The SD is the square root of the average squared deviation of the observations from the mean and can be read as the typical distance of an observation from the mean (see also mean). Since it is based on the mean, it can only be used in a meaningful way when the underlying distribution is symmetrical.
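
As a brief illustration, the SD is commonly computed as the square root of the average squared deviation from the mean. The sketch below uses Python's standard `statistics` module on hypothetical data and shows both the sample SD (divisor n-1) and the population SD (divisor n), alongside the range for comparison.

```python
import statistics

# Hypothetical sample of a numerical variable
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
sample_sd = statistics.stdev(data)       # divisor n - 1
population_sd = statistics.pstdev(data)  # divisor n
value_range = max(data) - min(data)      # distance between smallest and largest value

print(mean, round(sample_sd, 3), population_sd, value_range)
```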

**Standardization of rates** – is used to allow comparisons of rates between populations with structural differences. Standardization is a numerical technique that calculates expected rates based on the rates observed in the population according to a given standard population. In epidemiology standardization of rates occurs most often in order to adjust for age differences, resulting in age-standardized rates.
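
One common variant, direct standardization, applies the age-specific rates observed in the study population to the age structure of a standard population. The sketch below assumes that variant; the function name, age groups, rates, and standard population are all hypothetical.

```python
def directly_standardised_rate(age_specific_rates, standard_population):
    """Weight observed age-specific rates by the standard population's age structure."""
    total_standard = sum(standard_population.values())
    expected_cases = sum(
        age_specific_rates[group] * standard_population[group]
        for group in standard_population
    )
    return expected_cases / total_standard

# Hypothetical age-specific rates (cases per person per year) in the study population
rates = {"0-39": 0.001, "40-64": 0.005, "65+": 0.020}
# Hypothetical standard population
standard = {"0-39": 60_000, "40-64": 30_000, "65+": 10_000}

asr = directly_standardised_rate(rates, standard)
print(round(asr * 100_000, 1), "per 100,000")
```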

**Statistical hypothesis test** – is part of inferential statistics and is a decision making tool used to confirm or reject a research hypothesis. A statistical test judges how likely it is that an observed difference between groups, or an association between characteristics, is due to random error (chance) alone. A statistical test makes inferences from findings of the sample to the wider population. The appropriate statistical test for a given research hypothesis is dependent on the type of the variables involved.

**Statistically significant** – is a phrase that implies that a statistical hypothesis test was calculated and the resulting p-value was below the set alpha level (i.e. usually below 0.05). Statistical significance however exclusively refers to random error and does not automatically imply clinical relevance.

**Structural uniformity** – is a pre-requisite for the internal validity of an analytical epidemiological study. Structural uniformity means that the structural characteristics and potentially influencing factors of the groups to be compared are as alike as possible. Structural uniformity implies that there is no confounding bias.

**Study protocol** - is a formal document specifying in every detail how a study is planned and conducted. The study protocol is a “living” document which grows and changes as a study progresses.

**Sufficient cause** – for a disease always leads to the disease. The factors forming the sufficient cause may vary from person to person. If a necessary cause exists, however, then any combination of factors that form a sufficient cause (i.e. initiate disease in a person) will include this necessary cause.

**Surveillance** - is the ongoing systematic collection, analysis and interpretation of health data, essential to the planning, implementation, and evaluation of public health practice, closely integrated with the timely dissemination of these data to those who need to know. A surveillance system usually includes a functional capacity for data collection, analysis, and dissemination linked to public health programs.

**Systematic error** – is a type of error which acts on the results of a study in a systematic way. The smaller the systematic error in a study the more valid are the results of the study. Systematic error occurs in every epidemiological study. Study design features such as randomisation, blinding and matching are used to minimise systematic error.

**Systematic sampling** - is a form of probability sampling that can be used when the sampling frame covers the entire target population and shows no periodicity. Systematic sampling means sampling evenly over the entire target population. An example of systematic sampling is sampling the first phone number on the top left corner of every page of the phone book.
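
Sampling evenly over a frame can be sketched as taking every k-th entry. The function name and the frame are hypothetical, and the random starting point is an assumption of this sketch (the phone book example above uses a fixed starting position instead).

```python
import random

def systematic_sample(frame, step, start=None, seed=None):
    """Take every `step`-th entry of the frame, beginning at `start`.

    If no start is given, one is drawn at random from the first `step` entries.
    """
    if start is None:
        start = random.Random(seed).randrange(step)
    return frame[start::step]

# Hypothetical sampling frame of 100 numbered entries
frame = list(range(100))
fixed_start_sample = systematic_sample(frame, step=10, start=0)
random_start_sample = systematic_sample(frame, step=10, seed=1)
print(fixed_start_sample)
```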

**Target population** - is the population about which one wants to draw conclusions. The actual population, and with that, the sample may or may not be representative of the target population. The target population is partly defined by the exclusion and inclusion criteria of a study. When conducting a study it is most important to define the target population first to ensure appropriate sampling.

**Tertiary prevention** - are public health efforts that are directed towards the clinical stage of a disease. Tertiary prevention aims to prevent or minimise the progression of a disease or its consequences. A randomised controlled trial that aims to identify the best treatment for a disease is an example of tertiary prevention.

**The scientific method** – is a framework for conducting research that has been developed over the last centuries and is largely dominated by Western perspectives. The scientific method states how knowledge should be acquired: by formulating and testing falsifiable hypotheses based on empirical and measurable data collections. Thus, a falsifiable research hypothesis forms the core of the scientific method. The hypothesis is informed by the current theory and is either confirmed or rejected by empirical observations. Confirming or rejecting a hypothesis will strengthen or weaken the theory. This is a cyclic process. Epidemiology is the scientific method of public health.

**Timing** - refers to the chronological relationship between the onset of the study in real calendar time and the observation of the study factor and outcome. In completely “prospective” studies both the study factor and the outcome are observed after the onset of the study and refer exclusively to the actual study period. In a completely “retrospective” study both the study factor and the outcome occurred before the onset of the study. The investigator obtains information about the study factor and the outcome from records and/or recall of previous events. In an “ambi-spective” study, either the study factor or the outcome is assessed prospectively and the other retrospectively, or at least one of the two key variables is assessed in both ways.

**Unit of analysis** – in epidemiology is in most cases the human individual. However, ecological studies regard entire populations such as countries, states, or cities as units of analysis. Some epidemiological studies may relate to units such as hospitals or contaminated sites.

**Univariate** – an adjective used in statistics to describe situations or procedures where one single variable only is considered or involved.

**Unpaired statistical test** – is a version of a statistical test suitable for unpaired data i.e. situations where different groups of individuals are measured once only resulting in a single measurement (per assessed characteristic) for each individual.

**Validity (synonyms are conformity, accuracy, or correctness)** – of ‘measurements’ means that the “measurements” taken are correct. The “measurements” might be responses to questions, results of diagnostic tests, or physical measurements such as height or weight. A questionnaire is valid if the questions assess what they are supposed to assess. A quality of life questionnaire, for example, is supposed to measure quality of life. A diagnostic test is valid if the results of the diagnostic test are correct, that is, if the test is able to differentiate correctly between diseased people and people free of the disease. Validity often also refers to the overall result of an epidemiological study. The results of an epidemiological study are called valid if no bias (i.e. no systematic error that distorts the results) is present.