Missing values complicate the analysis of large-scale observational datasets such as electronic health records. Our work has developed several foundational new models for missing value imputation, including low rank models and Gaussian copula models. We have also demonstrated improved methods to handle missing-not-at-random or informative missing data through the missing indicator method.
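As a minimal illustration of the missing indicator method mentioned above (a generic sketch, not the specific models from the work listed below; the function name is hypothetical), each feature gets a 0/1 flag recording whether it was missing, so a downstream model can learn from informative missingness:

```python
def add_missing_indicators(rows, fill_value=0.0):
    """For each feature, replace None with a constant fill value and
    append a 0/1 indicator of missingness.

    `rows` is a list of equal-length feature lists; None marks a missing
    entry. Returns new rows of length 2 * n_features.
    """
    out = []
    for row in rows:
        filled = [fill_value if v is None else v for v in row]
        indicators = [1.0 if v is None else 0.0 for v in row]
        out.append(filled + indicators)
    return out
```

The indicator columns let a model treat "this value was missing" as a signal in its own right, which is what makes the approach useful for missing-not-at-random data.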

, PhD student Mike Van Ness, KDD, 2023

, Cornell, 2021

, keynote at Women in Data Science, 2019

: imputation with the Gaussian copula

: low rank models for missing value imputation


M. Van Ness and M. Udell
Table representation learning workshop at NeurIPS, 2023


M. Van Ness, T. Bosschieter, R. Halpin-Gregorio, and M. Udell
29th SIGKDD Conference on Knowledge Discovery and Data Mining - Applied Data Science Track, 2023


Y. Zhao and M. Udell
Accepted at Journal of Statistical Software, 2023


Y. Zhao, A. Townsend, and M. Udell
NeurIPS, 2022


N. Sengupta, M. Udell, N. Srebro, and J. Evans
Sociological Methodology, 2022


C. Yang, L. Ding, Z. Wu, and M. Udell
International Conference on Artificial Intelligence and Statistics (AISTATS), 2021


Y. Zhao, E. Landgrebe, E. Shekhtman, and M. Udell
AAAI, 2021


E. Landgrebe, Y. Zhao, and M. Udell
ICML Workshop on the Art of Learning with Missing Values (Artemiss), 2020


C. Yang, L. Ding, Z. Wu, and M. Udell
OPT2020: 12th Annual Workshop on Optimization for Machine Learning, 2020


Y. Zhao and M. Udell
Advances in Neural Information Processing Systems (NeurIPS), 2020


J. Fan, Y. Zhang, and M. Udell
Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020


Y. Zhao and M. Udell
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020


J. Fan and M. Udell
Computer Vision and Pattern Recognition (CVPR), 2019
Oral Presentation


N. Kallus, X. Mao, and M. Udell
Advances in Neural Information Processing Systems, 2018


M. Paradkar and M. Udell
CVPR Workshop on Tensor Methods in Computer Vision, 2017


M. Udell, C. Horn, R. Zadeh, and S. Boyd
Foundations and Trends in Machine Learning, 2016


M. Udell
Stanford University Thesis, 2015


M. Udell and S. Boyd
2015


M. Udell and S. Boyd
Biomedical Computation Review, 2014


M. Udell, C. Horn, R. Zadeh, and S. Boyd
NeurIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014



Missing Data | Types, Explanation, & Imputation

Published on December 8, 2021 by Pritha Bhandari. Revised on June 21, 2023.

Missing data , or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.

Table of contents

  • Types of missing data
  • Are missing data problematic?
  • How to prevent missing data
  • How to deal with missing values
  • Other interesting articles
  • Frequently asked questions about missing data

Missing data are a source of error because your data no longer represent the true values of what you set out to measure.

The reason for the missing data is important to consider, because it helps you determine the type of missing data and what you need to do about it.

There are three main types of missing data.

Type                                 Definition
Missing completely at random (MCAR)  Missing data are randomly distributed across the variable and unrelated to other variables.
Missing at random (MAR)              Missing data are not randomly distributed, but they are accounted for by other observed variables.
Missing not at random (MNAR)         Missing data systematically differ from the observed values.
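The three mechanisms in the table can be illustrated with a small simulation. In this sketch (hypothetical helper, stdlib only), values are dropped by pure chance (MCAR), based on an observed age covariate (MAR), or based on the value itself (MNAR):

```python
import random

def make_missing(values, ages, mechanism, rate=0.3, seed=0):
    """Return a copy of `values` with entries set to None under one of
    three missingness mechanisms. `ages` is an observed covariate used
    for MAR; MNAR depends on the (possibly unobserved) value itself.
    """
    rng = random.Random(seed)
    out = []
    for v, age in zip(values, ages):
        if mechanism == "MCAR":      # missingness is pure chance
            drop = rng.random() < rate
        elif mechanism == "MAR":     # depends only on the observed age
            drop = age < 25 and rng.random() < 2 * rate
        elif mechanism == "MNAR":    # depends on the value itself
            drop = v > 50 and rng.random() < 2 * rate
        else:
            raise ValueError(mechanism)
        out.append(None if drop else v)
    return out
```

Under MNAR, only large values can go missing, so the observed data are systematically biased downward; under MAR, the missingness is fully explained by the observed age variable.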

Missing completely at random

When data are missing completely at random (MCAR), the probability of any particular value being missing from your dataset is unrelated to anything else.

The missing values are randomly distributed, so they can come from anywhere in the whole distribution of your values. These MCAR data are also unrelated to other unobserved variables.

For example, if the missing values are scattered across the whole distribution of your data, ranging from low to high values, that pattern is consistent with MCAR.

Data are often considered MCAR if they seem unrelated to specific values or other variables. In practice, it’s hard to meet this assumption because “true randomness” is rare.

When data are missing due to equipment malfunctions or lost samples, they are considered MCAR.

Missing at random

Data missing at random (MAR) are not actually missing at random; the term is a bit of a misnomer.

This type of missing data systematically differs from the data you’ve collected, but it can be fully accounted for by other observed variables.

The likelihood of a data point being missing is related to another observed variable but not to the specific value of that data point itself.

For example, suppose some values are missing mostly for adults aged 18–25. Looking at the observed data for that age group, you notice that the values are widely spread, so it's unlikely that the data are missing because of the specific values themselves; the missingness is accounted for by age, an observed variable.

Missing not at random

Data missing not at random (MNAR) are missing for reasons related to the values themselves.

This type of missing data is important to look for because you may lack data from key subgroups within your sample. Your sample may not end up being representative of your population .

Attrition bias

In longitudinal studies , attrition bias can be a form of MNAR data. Attrition bias means that some participants are more likely to drop out than others.

For example, in long-term medical studies, some participants may drop out because they become more and more unwell as the study continues. Their data are MNAR because their health outcomes are worse, so your final dataset may only include healthy individuals, and you miss out on important data.


Missing data are problematic because, depending on the type, they can sometimes cause sampling bias . This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample .

In practice, you can often consider two types of missing data, MCAR and MAR, ignorable, because the missing data don’t systematically differ from your observed values.

For these two data types, the likelihood of a data point being missing has nothing to do with the value itself. So it’s unlikely that your missing values are significantly different from your observed values.

On the flip side, you have a biased dataset if the missing data systematically differ from your observed data. Data that are MNAR are called non-ignorable  for this reason.

Missing data often come from attrition bias , nonresponse , or poorly designed research protocols. When designing your study , it’s good practice to make it easy for your participants to provide data.

Here are some tips to help you minimize missing data:

  • Limit the number of follow-ups
  • Minimize the amount of data collected
  • Make data collection forms user friendly
  • Use data validation techniques
  • Offer incentives

After you’ve collected data, it’s important to store them carefully, with multiple backups.

To tidy up your data, your options usually include accepting, removing, or recreating the missing data.

You should consider how to deal with each case of missing data based on your assessment of why the data are missing.

  • Are these data missing for random or non-random reasons?
  • Are the data missing because they represent zero or null values?
  • Was the question or measure poorly designed?

Your data can be accepted, or left as is, if it’s MCAR or MAR. However, MNAR data may need more complex treatment.


Acceptance

The most conservative option involves accepting your missing data: you simply leave these cells blank.

It’s best to do this when you believe you’re dealing with MCAR or MAR values. When you have a small sample, you’ll want to conserve as much data as possible because any data removal can affect your statistical power .

You might also recode all missing values with labels of “N/A” (short for “not applicable”) to make them consistent throughout your dataset.

These actions help you retain data from as many research subjects as possible with few or no changes.

Deletion

You can remove missing data from statistical analyses using listwise or pairwise deletion.

Listwise deletion

Listwise deletion means deleting data from all cases (participants) who have data missing for any variable in your dataset. You’ll have a dataset that’s complete for all participants included in it.

A downside of this technique is that you may end up with a much smaller and/or a biased sample to work with. If significant amounts of data are missing from some variables or measures in particular, the participants who provide those data might significantly differ from those who don’t.

Your sample could be biased because it doesn’t adequately represent the population .
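Listwise deletion is simple enough to sketch in a line of Python (hypothetical helper; `None` marks a missing value):

```python
def listwise_delete(rows):
    """Drop every row (case) that has any missing value (None),
    leaving a dataset that is complete for all remaining cases."""
    return [r for r in rows if all(v is not None for v in r)]
```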

Pairwise deletion

Pairwise deletion lets you keep more of your data by only removing the data points that are missing from any analyses. It conserves more of your data because all available data from cases are included.

It also means that you have an uneven sample size for each of your variables. But it’s helpful when you have a small sample or a large proportion of missing values for some variables.

When you perform analyses with multiple variables, such as a correlation , only cases (participants) with complete data for each variable are included.

  • 12 people didn’t answer a question about their gender, reducing the sample size from 114 to 102 participants for the variable “gender.”
  • 3 people didn’t answer a question about their age, reducing the sample size from 114 to 111 participants for the variable “age.”
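Under pairwise deletion, each variable (and each pair of variables) keeps its own sample size. A minimal sketch of that bookkeeping (hypothetical helpers; `None` marks a missing value):

```python
def available_n(rows, i):
    """Sample size for variable i alone: cases with i observed."""
    return sum(1 for r in rows if r[i] is not None)

def pairwise_n(rows, i, j):
    """Sample size a pairwise-deleted analysis of variables i and j
    (e.g. their correlation) would use: cases with BOTH observed."""
    return sum(1 for r in rows if r[i] is not None and r[j] is not None)
```

Because `available_n` can differ across variables while `pairwise_n` differs across pairs, the resulting analyses are based on uneven, overlapping subsamples, which is the main caveat of pairwise deletion.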

Imputation

Imputation means replacing a missing value with another value based on a reasonable estimate. You use other data to recreate the missing value for a more complete dataset.

You can choose from several imputation methods.

Mean or median imputation

The easiest method of imputation involves replacing missing values with the mean or median value for that variable.
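Mean imputation takes only a few lines (hypothetical helper; `None` marks a missing value):

```python
def mean_impute(values):
    """Replace each None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]
```

Note that this shrinks the variable's variance, which is one reason single imputation understates uncertainty.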

Hot-deck imputation

In hot-deck imputation , you replace each missing value with an existing value from a similar case or participant within your dataset. For each case with missing values, the missing value is replaced by a value from a so-called “donor” that’s similar to that case based on data for other variables.

You sort the data based on other variables and search for participants who responded similarly to other questions compared to your participants with missing values.
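A minimal nearest-donor version of hot-deck imputation can be sketched as follows (hypothetical function; similarity here is just closeness on a single auxiliary variable, whereas real hot-deck procedures typically match on several):

```python
def hot_deck_impute(rows, target, donor_key):
    """Fill missing `target` values from the most similar donor: the
    case with the closest observed `donor_key` value among cases that
    have `target` observed. `rows` is a list of dicts; None = missing."""
    donors = [r for r in rows if r[target] is not None]
    out = []
    for r in rows:
        if r[target] is None:
            best = min(donors, key=lambda d: abs(d[donor_key] - r[donor_key]))
            r = dict(r, **{target: best[target]})  # copy with imputed value
        out.append(r)
    return out
```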

Cold-deck imputation

Alternatively, in cold-deck imputation , you replace missing values with existing values from similar cases from other datasets. The new values come from an unrelated sample.

You search for participants who responded similarly to other questions compared to your participants with missing values.

Use imputation carefully

Imputation is a complicated task because you have to weigh the pros and cons.

Although you retain all of your data, this method can create research bias and lead to inaccurate results. You can never know for sure whether the replaced value accurately reflects what would have been observed or answered. That’s why it’s best to apply imputation with caution.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

Statistics

  • Statistical power
  • Pearson correlation
  • Degrees of freedom
  • Statistical significance

Methodology

  • Cluster sampling
  • Stratified sampling
  • Focus group
  • Systematic review
  • Ethnography
  • Double-Barreled Question

Research bias

  • Implicit bias
  • Publication bias
  • Cognitive bias
  • Placebo effect
  • Pygmalion effect
  • Hindsight bias
  • Overconfidence bias

Frequently asked questions about missing data

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants.

In any dataset, there’s usually some missing data. In quantitative research , missing values appear as blank cells in your spreadsheet.

Missing data are important because, depending on the type, they can sometimes bias your results. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample .

To tidy up your missing data , your options usually include accepting, removing, or recreating the missing data.

  • Acceptance: You leave your data as is
  • Listwise or pairwise deletion: You delete all cases (participants) with missing data from analyses
  • Imputation: You use other data to fill in the missing data

There are three main types of missing data .

Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables .

Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.

Missing not at random (MNAR) data systematically differ from the observed values.



Introduction to Dealing with Missing Data (Online)


This course looks at the problem of missing data in research studies in detail. Reasons and different types of missing data are discussed as well as bad and good methods of dealing with them.

The course is delivered in a self-paced format by UCL's Centre for Applied Statistics Courses (CASC), part of the UCL Great Ormond Street Institute of Child Health (ICH).

Missing data are very common in research studies, but ignoring these cases can lead to invalid and misleading conclusions being drawn. This course provides guidance on how to deal with missing values and the best ways of analysing a dataset that is incomplete.

The course covers the following topics:

  • Reasons for missing data
  • Types of missing data
  • Simple methods for analysing incomplete data
  • More sophisticated methods of dealing with missing data (simple and multiple stochastic imputation, weighting methods)

Course structure and teaching

This is an online, self-paced course that includes:

  • Full electronic notes
  • Short lecture videos that follow closely with the notes
  • Accessible materials, with alternative text for images and captions/transcripts for each video
  • Interactive quizzes for each chapter
  • A support forum where you can ask questions related to the course materials

Learning outcomes

At the end of the course, delegates should understand potential reasons for missing data in research and be able to deal with it if they encounter missing data in their own analysis. In particular, delegates will be able to:

  • Understand the reasons for missing data in research
  • Differentiate between the different types of missing data, including “missing completely at random”, “missing at random”, and “missing not at random”
  • With the help of additional software packages, report the extent of missing data in their analysis
  • Employ simple and advanced methods for filling in missing data, such as multiple imputation
  • Comprehend the advantages and disadvantages of each imputation method

Entry requirements

A basic level of statistical literacy is required as a prerequisite.

It is desirable for course participants to have basic knowledge of statistics, i.e. notions of statistical inference, p-values, and confidence intervals.

Those who have completed the five-day Introduction to Statistics and Research Methods course run frequently by the Centre for Applied Statistics Courses (CASC) team will be well prepared.

Cost and concessions

The standard price is £75.

A 50% discount is available for UCL staff, students, alumni. If you're eligible for a discount, email [email protected] before booking to be sent the discount code.

The course is available for free to those associated with the Institute of Child Health or Great Ormond Street Hospital, and SLMS doctoral students. Please also email [email protected] to receive a booking code.

Certificates

You can download a certificate of participation once you have completed all the session quizzes.

Find out about other statistics courses

CASC's stats courses are suitable for anyone requiring an understanding of research methodology and statistical analyses. The courses allow non-statisticians to interpret published research and/or undertake their own research studies.

Find out more about CASC's full range of statistics courses .

Course team

Dr Dean Langan

Dean works as a lecturer, jointly based within the School of Life and Medical Sciences (SLMS) and the Centre for Applied Statistics Courses (CASC) at UCL. He has a Bachelor’s degree in Mathematics from University of Liverpool, a Master's degree in Medical Statistics from University of Leicester, and a PhD from University of York for his research in statistical methods for random-effects meta-analysis. He's worked as a statistician on a number of clinical trials related to stroke and myeloma at the Clinical Trials Research Unit in Leeds. His specialist areas include statistical methods for meta-analysis, R programming, clinical trial methodology and research design.

Course information last modified: 23 Oct 2023, 16:02

Length and time commitment

  • Time commitment: 6 hours
  • Course length: Study at own pace


  • Methodology
  • Open access
  • Published: 14 May 2013

Principled missing data methods for researchers

Yiran Dong & Chao-Ying Joanne Peng

SpringerPlus, volume 2, Article number: 222 (2013)


The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discussed and demonstrated three principled missing data methods: multiple imputation, full information maximum likelihood, and expectation-maximization algorithm, applied to a real-world data set. Results were contrasted with those obtained from the complete data set and from the listwise deletion method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on the importance of statistical assumptions, and recommendations for researchers. Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.

Missing data are a rule rather than an exception in quantitative research. Enders ( 2003 ) stated that a missing rate of 15% to 20% was common in educational and psychological studies. Peng et al. ( 2006 ) surveyed quantitative studies published from 1998 to 2004 in 11 education and psychology journals. They found that 36% of studies had no missing data, 48% had missing data, and about 16% cannot be determined. Among studies that showed evidence of missing data, 97% used the listwise deletion (LD) or the pairwise deletion (PD) method to deal with missing data. These two methods are ad hoc and notorious for biased and/or inefficient estimates in most situations (Rubin 1987 ;Schafer 1997 ). The APA Task Force on Statistical Inference explicitly warned against their use (Wilkinson and the Task Force on Statistical Inference 1999 p. 598). Newer and principled methods, such as the multiple-imputation (MI) method, the full information maximum likelihood (FIML) method, and the expectation-maximization (EM) method, take into consideration the conditions under which missing data occurred and provide better estimates for parameters than either LD or PD. Principled missing data methods do not replace a missing value directly; they combine available information from the observed data with statistical assumptions in order to estimate the population parameters and/or the missing data mechanism statistically.

A review of the quantitative studies published in Journal of Educational Psychology (JEP) between 2009 and 2010 revealed that, out of 68 articles that met our criteria for quantitative research, 46 (or 67.6%) articles explicitly acknowledged missing data, or were suspected to have some due to discrepancies between sample sizes and degrees of freedom. Eleven (or 16.2%) did not have missing data and the remaining 11 did not provide sufficient information to help us determine if missing data occurred. Of the 46 articles with missing data, 17 (or 37%) did not apply any method to deal with the missing data, 13 (or 28.3%) used LD or PD, 12 (or 26.1%) used FIML, four (or 8.7%) used EM, three (or 6.5%) used MI, and one (or 2.2%) used both the EM and the LD methods. Of the 29 articles that dealt with missing data, only two explained their rationale for using FIML and LD, respectively. One article misinterpreted FIML as an imputation method. Another was suspected to have used either LD or an imputation method to deal with attrition in a PISA data set (OECD 2009 ;Williams and Williams 2010 ).

Compared with missing data treatments by articles published in JEP between 1998 and 2004 (Table 3.1 in Peng et al. 2006 ), there has been improvement in the decreased use of LD (from 80.7% down to 21.7%) and PD (from 17.3% down to 6.5%), and an increased use of FIML (from 0% up to 26.1%), EM (from 1.0% up to 8.7%), or MI (from 0% up to 6.5%). Yet several research practices still prevailed from a decade ago, namely, not explicitly acknowledging the presence of missing data, not describing the particular approach used in dealing with missing data, and not testing assumptions associated with missing data methods. These findings suggest that researchers in educational psychology have not fully embraced principled missing data methods in research.

Although treating missing data is usually not the focus of a substantive study, failing to do so properly causes serious problems. First, missing data can introduce potential bias in parameter estimation and weaken the generalizability of the results (Rubin 1987 ;Schafer 1997 ). Second, ignoring cases with missing data leads to the loss of information which in turn decreases statistical power and increases standard errors(Peng et al. 2006 ). Finally, most statistical procedures are designed for complete data (Schafer and Graham 2002 ). Before a data set with missing values can be analyzed by these statistical procedures, it needs to be edited in some way into a “complete” data set. Failing to edit the data properly can make the data unsuitable for a statistical procedure and the statistical analyses vulnerable to violations of assumptions.

Because of the prevalence of the missing data problem and the threats it poses to statistical inferences, this paper is interested in promoting three principled methods, namely, MI, FIML, and EM, by illustrating these methods with an empirical data set and discussing issues surrounding their applications. Each method is demonstrated using SAS 9.3. Results are contrasted with those obtained from the complete data set and the LD method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on assumptions associated with these principled methods and recommendations for researchers. The remainder of this paper is divided into the following sections: (1) Terminology, (2) Multiple Imputation (MI), (3) Full Information Maximum-Likelihood (FIML), (4) Expectation-Maximization (EM) Algorithm, (5) Demonstration, (6) Results, and (7) Discussion.

Terminology

Missing data occur at two levels: at the unit level or at the item level. A unit-level non-response occurs when no information is collected from a respondent. For example, a respondent may refuse to take a survey, or does not show up for the survey. While the unit non-response is an important and common problem to tackle, it is not the focus of this paper. This paper focuses on the problem of item non-response . An item non-response refers to the incomplete information collected from a respondent. For example, a respondent may miss one or two questions on a survey, but answered the rest. The missing data problem at the item level needs to be tackled from three aspects: the proportion of missing data, the missing data mechanisms, and patterns of missing data. A researcher must address all three before choosing an appropriate procedure to deal with missing data. Each is discussed below.

Proportion of missing data

The proportion of missing data is directly related to the quality of statistical inferences. Yet, there is no established cutoff from the literature regarding an acceptable percentage of missing data in a data set for valid statistical inferences. For example, Schafer ( 1999 ) asserted that a missing rate of 5% or less is inconsequential. Bennett ( 2001 ) maintained that statistical analysis is likely to be biased when more than 10% of data are missing. Furthermore, the amount of missing data is not the sole criterion by which a researcher assesses the missing data problem. Tabachnick and Fidell ( 2012 ) posited that the missing data mechanisms and the missing data patterns have greater impact on research results than does the proportion of missing data.

Missing data mechanisms

According to Rubin ( 1976 ), there are three mechanisms under which missing data can occur: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). To understand missing data mechanisms, we partition the data matrix Y into two parts: the observed part ( Y obs ) and the missing part ( Y mis ). Hence, Y  = ( Y obs ,  Y mis ). Rubin ( 1976 ) defined MAR to be a condition in which the probability that data are missing depends only on the observed Y obs , but not on the missing Y mis , after controlling for Y obs . For example, suppose a researcher measures college students’ understanding of calculus in the beginning (pre-test) and at the end (post-test) of a calculus course. Let’s suppose that students who scored low on the pre-test are more likely to drop out of the course, hence, their scores on the post-test are missing. If we assume that the probability of missing the post-test depends only on scores on the pre-test, then the missing mechanism on the post-test is MAR. In other words, for students who have the same pre-test score, the probability of their missing the post-test is random. To state the definition of MAR formally, let R be a matrix of missingness with the same dimension as Y . The element of R is either 1 or 0, corresponding to Y being observed (coded as 1) or missing (coded as 0). If the distribution of R , written as P ( R | Y ,  ξ ), where ξ = missingness parameter, can be modeled as Equation 1 , then the missing condition is said to be MAR (Schafer 1997 p. 11):

P(R | Y, ξ) = P(R | Y_obs, ξ)    (1)

In other words, the probability of missingness depends only on the observed data and ξ. Furthermore, if (a) the missing data mechanism is MAR and (b) the parameter of the data model (θ) and the missingness parameter ξ are independent, the missing data mechanism is said to be ignorable (Little and Rubin 2002). Since condition (b) is almost always true in real-world settings, ignorability and MAR (together with MCAR) are sometimes viewed as equivalent (Allison 2001).

Although many modern missing data methods (e.g., MI, FIML, EM) assume MAR, violation of this assumption should be expected in most cases (Schafer and Graham 2002 ). Fortunately, research has shown that violation of the MAR assumption does not seriously distort parameter estimates (Collins et al. 2001 ). Moreover, MAR is quite plausible when data are missing by design. Examples of missing by design include the use of multiple booklets in large scale assessment, longitudinal studies that measure a subsample at each time point, and latent variable analysis in which the latent variable is missing with a probability of 1, therefore, the missing probability is independent of all other variables.

MCAR is a special case of MAR. It is a missing data condition in which the likelihood of missingness depends neither on the observed data Y obs , nor on the missing data Y mis . Under this condition, the distribution of R is modeled as follows:

P(R | Y, ξ) = P(R | ξ)

If missing data meet the MCAR assumption, they can be viewed as a random sample of the complete data. Consequently, ignoring missing data under MCAR will not introduce bias, but will increase the SE of the sample estimates due to the reduced sample size. Thus, MCAR poses less threat to statistical inferences than MAR or MNAR.

The third missing data mechanism is MNAR. It occurs when the probability of missing depends on the missing value itself. For example, missing data on the income variable is likely to be MNAR, if high income earners are more inclined to withhold this information than average- or low-income earners. In case of MNAR, the missing mechanism must be specified by the researcher, and incorporated into data analysis in order to produce unbiased parameter estimates. This is a formidable task not required by MAR or MCAR.

The three missing data methods discussed in this paper are applicable under either the MCAR or the MAR condition, but not under MNAR. It is worth noting that including variables in the statistical inferential process that could explain missingness makes the MAR condition more plausible. Return to the college students’ achievement in a calculus course for example. If the researcher did not collect students’ achievement data on the pre-test, the missingness on the post-test is not MAR, because the missingness depends on the unobserved score on the post-test alone. Thus, the literature on missing data methods often suggests including additional variables into a statistical model in order to make the missing data mechanism ignorable (Collins et al. 2001 ;Graham 2003 ;Rubin 1996 ).

The tenability of MCAR can be examined using Little’s multivariate test (Little and Schenker 1995 ). However, it is impossible to test whether the MAR condition holds, given only the observed data (Carpenter and Goldstein 2004 ;Horton and Kleinman 2007 ;White et al. 2011 ). One can instead examine the plausibility of MAR by a simple t -test of mean differences between the group with complete data and that with missing data (Diggle et al. 1995 ;Tabachnick and Fidell 2012 ). Both approaches are illustrated with a data set at http://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Missing_Values.pdf . Yet, Schafer and Graham ( 2002 ) criticized the practice of dummy coding missing values, because such a practice redefines the parameters of the population. Readers should therefore be cautioned that the results of these tests should not be interpreted as providing definitive evidence of either MCAR or MAR.
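The simple t-test screen described above can be sketched in a few lines of stdlib Python (hypothetical function names; this computes only Welch's t statistic for the mean difference, not a p-value):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two groups."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def mar_screen(covariate, target):
    """Split an observed covariate by whether `target` is missing and
    return Welch's t; a large |t| suggests missingness is related to
    the covariate (consistent with MAR rather than MCAR)."""
    with_missing = [c for c, t in zip(covariate, target) if t is None]
    complete = [c for c, t in zip(covariate, target) if t is not None]
    return welch_t(with_missing, complete)
```

Consistent with the caution above, a "significant" result here is only suggestive: it can make MCAR implausible, but it cannot establish MAR over MNAR.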

Patterns of missing data

There are three patterns of missing data: univariate, monotone, and arbitrary; each is discussed below. Suppose there are p variables, denoted as Y_1, Y_2, …, Y_p. A data set is said to have a univariate pattern of missing data if the same participants have missing data on one or more of the p variables. A data set is said to have a monotone missing data pattern if the variables can be arranged in such a way that, when Y_j is missing, Y_{j+1}, Y_{j+2}, …, Y_p are missing as well. The monotone missing data pattern occurs frequently in longitudinal studies where, if a participant drops out at one point, his/her data are missing on subsequent measures. For the treatment of missing data, the monotone missing data pattern subsumes the univariate missing data pattern. If missing data occur in any variable for any participant in a random fashion, the data set is said to have an arbitrary missing data pattern. Computationally, the univariate or the monotone missing data pattern is easier to handle than an arbitrary pattern.
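Checking whether a data set has the monotone (dropout) pattern under a given variable ordering is straightforward (hypothetical helper; `None` marks a missing value):

```python
def is_monotone(rows):
    """True if, in the given variable ordering, once a value is missing
    in a row, every later value in that row is also missing -- the
    monotone missing data pattern typical of longitudinal dropout."""
    for r in rows:
        seen_missing = False
        for v in r:
            if v is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True
```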

Multiple Imputation (MI)

MI is a principled missing data method that provides valid statistical inferences under the MAR condition (Little and Rubin 2002 ). MI was proposed to impute missing data while acknowledging the uncertainty associated with the imputed values (Little and Rubin 2002 ). Specifically, MI acknowledges the uncertainty by generating a set of m plausible values for each unobserved data point, resulting in m complete data sets, each with one unique estimate of the missing values. The m complete data sets are then analyzed individually using standard statistical procedures, resulting in m slightly different estimates for each parameter. At the final stage of MI, m estimates are pooled together to yield a single estimate of the parameter and its corresponding SE . The pooled SE of the parameter estimate incorporates the uncertainty due to the missing data treatment (the between imputation uncertainty) into the uncertainty inherent in any estimation method (the within imputation uncertainty). Consequently, the pooled SE is larger than the SE derived from a single imputation method (e.g., mean substitution) that does not consider the between imputation uncertainty. Thus, MI minimizes the bias in the SE of a parameter estimate derived from a single imputation method.

In sum, MI handles missing data in three steps: (1) impute the missing data m times to produce m complete data sets; (2) analyze each data set using a standard statistical procedure; and (3) combine the m results into one using formulae from Rubin (1987) or Schafer (1997). Below we discuss each step in greater detail and demonstrate MI with a real data set in the section Demonstration.

Step 1: imputation

The imputation step is the most complicated of the three steps in MI. Its aim is to fill in the missing values multiple times using the information contained in the observed data. Many imputation methods are available for this purpose; the preferred method is one that matches the missing data pattern. Given a univariate or monotone missing data pattern, one can impute missing values using the regression method (Rubin 1987), or the predictive mean matching method if the missing variable is continuous (Heitjan and Little 1991; Schenker and Taylor 1996). When data are missing arbitrarily, one can use the Markov Chain Monte Carlo (MCMC) method (Schafer 1997), or fully conditional specification (also referred to as chained equations) if the missing variable is categorical or non-normal (Raghunathan et al. 2001; van Buuren 2007; van Buuren et al. 1999; van Buuren et al. 2006). The regression method and the MCMC method are described next.

The regression method for univariate or monotone missing data pattern

Suppose that there are p variables, Y1, Y2, …, Yp, in a data set, and that missing data are present, univariately or monotonically, from Yj to Yp, where 1 < j ≤ p. To impute the missing values for the jth variable, one first constructs a regression model using the observed data on Y1 through Yj−1 to predict the missing values on Yj:

The regression model in Equation 3 yields the estimated regression coefficients β̂ and their covariance matrix. Based on these results, one can draw one set of regression coefficients β̂* from the sampling distribution of β̂. Next, the missing values in Yj can be imputed by plugging β̂* into Equation 3 and adding a random error. After the missing data in Yj are imputed, the missing data in Yj+1, …, Yp are imputed in the same fashion, resulting in one complete data set. The above steps are repeated m times to derive m sets of missing values (Rubin 1987, pp. 166–167; SAS Institute Inc 2011).
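The regression method for a single variable can be sketched as follows. All names are illustrative, and for brevity we perturb only the coefficients; Rubin's full procedure also draws the residual variance from its posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_impute(X, y, m=5):
    """Sketch of the regression method: impute NaNs in y from a fully
    observed predictor X, m times, with coefficient uncertainty."""
    obs = ~np.isnan(y)
    Xo = np.column_stack([np.ones(obs.sum()), X[obs]])
    beta_hat, *_ = np.linalg.lstsq(Xo, y[obs], rcond=None)
    resid = y[obs] - Xo @ beta_hat
    sigma2 = resid @ resid / (Xo.shape[0] - Xo.shape[1])
    cov = sigma2 * np.linalg.inv(Xo.T @ Xo)   # sampling covariance of beta_hat
    Xm = np.column_stack([np.ones((~obs).sum()), X[~obs]])
    imputed = []
    for _ in range(m):
        # Draw beta* from the sampling distribution of beta_hat ...
        beta_star = rng.multivariate_normal(beta_hat, cov)
        # ... plug it in and add a random error to each missing value
        filled = y.copy()
        filled[~obs] = Xm @ beta_star + rng.normal(0, np.sqrt(sigma2), Xm.shape[0])
        imputed.append(filled)
    return imputed
```

Repeating the draw m times, with a fresh β* and fresh errors each time, yields the m completed data sets of the imputation step.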

The MCMC method for arbitrary missing pattern

When the missing data pattern is arbitrary, it is difficult to develop analytical formulae for the missing data, and one has to turn to numerical simulation methods such as MCMC (Schafer 1997). The MCMC technique used by the MI procedure of SAS is described below; interested readers should refer to the SAS/STAT 9.3 User’s Guide (SAS Institute Inc 2011) for a detailed explanation.

Recall that the goal of the imputation step is to draw random samples of the missing data based on information contained in the observed data. Since the parameter (θ) of the data is also unknown, the imputation step actually draws random samples of both the missing data and θ, given the observed data. Formally, the imputation step draws random samples from the distribution P(θ, Ymis|Yobs). Because it is much easier to draw estimates of Ymis from P(Ymis|Yobs, θ) and estimates of θ from P(θ|Yobs, Ymis) separately, the MCMC method draws samples in two steps. At step one, given the current estimate θ(t) at the tth iteration, a random sample Ymis(t+1) is drawn from the conditional predictive distribution P(Ymis|Yobs, θ(t)). At step two, given Ymis(t+1), a random sample θ(t+1) is drawn from the distribution P(θ|Yobs, Ymis(t+1)). Following Tanner and Wong (1987), the first step is called the I-step (not to be confused with the first imputation step in MI) and the second step is called the P-step (or the posterior step). Starting with an initial value θ(0) (usually an arbitrary guess), MCMC iterates between the I-step and the P-step, leading to a Markov Chain: Ymis(1), θ(1), Ymis(2), θ(2), …, Ymis(t), θ(t), and so on. It can be shown that this Markov Chain converges in distribution to P(θ, Ymis|Yobs). It follows that the sequence θ(1), θ(2), …, θ(t), … converges to P(θ|Yobs) and the sequence Ymis(1), Ymis(2), …, Ymis(t), … converges to P(Ymis|Yobs). Thus, after the Markov Chain converges, m draws of Ymis can form the m imputations for the missing data. In practice, the m draws are separated by several iterations to avoid correlations between successive draws. Computational formulae for P(Ymis|Yobs, θ) and P(θ|Yobs, Ymis) based on the multivariate normal distribution can be found in the SAS/STAT 9.3 User’s Guide (SAS Institute Inc 2011).
At the end of the first step in MI, m sets of complete data are generated.
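A toy version of this data augmentation scheme, for a univariate normal with known variance and a flat prior on the mean (our own simplification for illustration, not the SAS implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

y = np.array([1.2, 0.8, np.nan, 1.5, np.nan, 1.0])
obs = ~np.isnan(y)
sigma = 1.0                      # known standard deviation (assumption)
mu = np.nanmean(y)               # arbitrary starting value theta(0)
chain = []
for t in range(2000):
    # I-step: draw Y_mis from P(Y_mis | Y_obs, theta(t))
    y_fill = y.copy()
    y_fill[~obs] = rng.normal(mu, sigma, size=(~obs).sum())
    # P-step: draw theta(t+1) from P(theta | Y_obs, Y_mis(t+1));
    # with a flat prior this posterior is normal around the filled-in mean
    mu = rng.normal(y_fill.mean(), sigma / np.sqrt(len(y)))
    chain.append(mu)

# Keep widely separated draws so successive imputations are nearly independent
selected = chain[199::400]
```

The five retained values of the chain would seed five imputations; in practice one monitors convergence before trusting any draw.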

Step 2: statistical analysis

The second step of MI analyzes the m sets of data separately using a statistical procedure of a researcher’s choice. At the end of the second step, m sets of parameter estimates are obtained from separate analyses of m data sets.

Step 3: combining results

The third step of MI combines the m estimates into one. Rubin (1987) provided formulae for combining the m point estimates and SEs into a single parameter estimate and its SE. Suppose Q̂i denotes the estimate of a parameter Q (e.g., a regression coefficient) from the ith data set, and its corresponding estimated variance is denoted Ûi. Then the pooled point estimate of Q is given by:

The variance of Q ¯ is the weighted sum of two variances: the within imputation variance ( U ¯ ) and the between imputation variance ( B ). Specifically, these three variances are computed as follows:

In Equation 7, the 1/m factor is an adjustment for the randomness associated with a finite number of imputations. Theoretically, estimates derived from MI with small m yield larger sampling variances than ML estimates (e.g., those derived from FIML), because the latter do not involve randomness caused by simulation.

The statistic (Q − Q̄)/√T approximately follows a t distribution. The degrees of freedom (νm or νm*) for this t distribution are calculated by Equations 8–10 (Barnard and Rubin 1999):

In Equation 8, r is the relative increase in variance due to missing data, defined as the between-imputation variance (adjusted for the finite number of imputations) standardized by the within-imputation variance. In Equation 10, γ = (1 + 1/m)B/T, and ν0 is the degrees of freedom if the data were complete. νm* is a correction of νm for when ν0 is small and the missing rate is moderate (SAS Institute Inc 2011).

According to Rubin ( 1987 ), the severity of missing data is measured by the fraction of missing information ( λ ^ ), defined as:

As the number of imputations increases to infinity, λ̂ reduces to the ratio of the between-imputation variance to the total variance. In this limiting form, λ̂ can be interpreted as the proportion of total variance (or total uncertainty) attributable to the missing data (Schafer 1999).
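The pooling rules of Equations 4 through 11 amount to a few lines of arithmetic. A sketch using the standard formulas described above (function and variable names are ours):

```python
import numpy as np

def pool(estimates, variances):
    """Combine m point estimates and their squared SEs with Rubin's rules."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = len(q)
    q_bar = q.mean()                        # pooled point estimate
    u_bar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    t = u_bar + (1 + 1 / m) * b             # total variance
    r = (1 + 1 / m) * b / u_bar             # relative increase in variance
    nu = (m - 1) * (1 + 1 / r) ** 2         # Rubin's degrees of freedom
    lam = (r + 2 / (nu + 3)) / (r + 1)      # fraction of missing information
    return q_bar, np.sqrt(t), nu, lam

# Five hypothetical estimates of one coefficient and their variances
q_bar, se, nu, lam = pool([1.00, 1.10, 0.90, 1.00, 1.05], [0.04] * 5)
```

Note that the pooled SE exceeds the single-imputation SE (√0.04 = 0.2 here) exactly because the between-imputation term is folded into the total variance.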

For multivariate parameter estimation, Rubin (1987) provided a method to combine several estimates into a vector or matrix. The pooling procedure is a multivariate version of Equations 4 through 7 that incorporates estimates of the covariances among parameters. Rubin’s method assumes that the fraction of missing information (i.e., λ̂) is the same for all variables (SAS Institute Inc 2011). To our knowledge, no published studies have examined whether this assumption is realistic for real data sets, or whether Rubin’s method is robust to violations of this assumption.

MI related issues

When implementing MI, the researcher needs to be aware of several practical issues, such as the multivariate normality assumption, the imputation model, the number of imputations, and the convergence of MCMC. Each is discussed below.

The multivariate normality assumption

The regression and MCMC methods implemented in statistical packages (e.g., SAS) assume multivariate normality for the variables. It has been shown that MI based on the multivariate normal model can provide valid estimates even when this assumption is violated (Demirtas et al. 2008; Schafer 1997, 1999). Furthermore, this assumption is robust when the sample size is large and the missing rate is low, although the literature does not specify what counts as a large sample size or a low missing rate (Schafer 1997).

When an imputation model contains categorical variables, one cannot use the regression method or MCMC directly. Techniques such as logistic regression and discriminant function analysis can substitute for the regression method if the missing data pattern is monotone or univariate. If the missing data pattern is arbitrary, MCMC based on other probability models (such as a joint distribution of normal and binary variables) can be used for imputation. The free MI software NORM developed by Schafer (1997) has two add-on modules, CAT and MIX, that deal with categorical data. Specifically, CAT imputes missing data for categorical variables, and MIX imputes missing data for a combination of categorical and continuous variables. Other software packages are also available for imputing missing values in categorical variables, such as the ICE module in Stata (Royston 2004, 2005, 2007; Royston and White 2011), the mice package in R and S-Plus (van Buuren and Groothuis-Oudshoorn 2011), and IVEware (Raghunathan et al. 2001). Interested readers are referred to a special volume of the Journal of Statistical Software (Yucel 2011) for recent developments in MI software.

When researchers use statistical packages that impose a multivariate normal distribution on categorical variables, a common practice is to impute missing values based on the multivariate normal model, then round the imputed value to the nearest integer or nearest plausible value. However, studies have shown that this naïve rounding does not yield desirable results for binary missing values (Ake 2005; Allison 2005; Enders 2010). For example, Horton et al. (2003) showed analytically that rounding the imputed values led to biased estimates, whereas leaving the imputed values unrounded led to unbiased results. Bernaards et al. (2007) compared three approaches to rounding binary missing values: (1) rounding the imputed value to the nearest plausible value, (2) randomly drawing from a Bernoulli trial using the imputed value, between 0 and 1, as the probability of success, and (3) using an adaptive rounding rule based on the normal approximation to the binomial distribution. Their results showed that the second method was the worst at estimating the odds ratio, and the third method provided the best results. One merit of their study is that it is based on a real-world data set. However, other factors may influence the performance of the rounding strategies, such as the missing data mechanism, the size of the model, and the distributions of the categorical variables; these factors are not within a researcher’s control. Additional research is needed to identify one or more good strategies for dealing with categorical variables in MI when multivariate normal-based software is used to perform MI.
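The three rounding strategies can be sketched side by side. Function names are ours, and the adaptive cutoff is our reading of a normal-approximation rule, not a verified reproduction of Bernaards et al.'s formula:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

def naive_round(v):
    # (1) round to the nearest plausible value (0 or 1)
    return (np.asarray(v) >= 0.5).astype(int)

def bernoulli_round(v):
    # (2) draw from a Bernoulli trial with the imputed value
    #     (clipped to [0, 1]) as the success probability
    return rng.binomial(1, np.clip(v, 0, 1))

def adaptive_round(v):
    # (3) adaptive cutoff from the normal approximation to the binomial;
    #     assumes the mean imputed value lies strictly between 0 and 1
    v = np.asarray(v, float)
    w = v.mean()
    cut = w - NormalDist().inv_cdf(w) * np.sqrt(w * (1 - w))
    return (v >= cut).astype(int)

# Continuous normal-model imputations of a hypothetical 0/1 item
imputed = rng.normal(0.3, 0.2, size=200)
```

Each strategy maps the same continuous imputations to 0/1 values; they differ in how well the resulting proportions and odds ratios track the true binary distribution.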

Unfortunately, even less is known about the effect of rounding in MI when imputing ordinal variables with three or more levels. It is possible that as the number of levels of the categorical variable increases, the effect of rounding decreases. Again, studies are needed to further explore this issue.

The imputation model

MI requires two models: the imputation model used in step 1 and the analysis model used in step 2. Theoretically, MI assumes that the two models are the same. In practice, they can be different (Schafer 1997 ). An appropriate imputation model is the key to the effectiveness of MI; it should have the following two properties.

First, an imputation model should include useful variables. Rubin (1996) recommended a liberal approach when deciding whether a variable should be included in the imputation model. Schafer (1997) and van Buuren et al. (1999) recommended including three kinds of variables in an imputation model: (1) variables that are of theoretical interest, (2) variables that are associated with the missing mechanism, and (3) variables that are correlated with the variables containing missing data. The latter two kinds are sometimes referred to as auxiliary variables (Collins et al. 2001). The first kind is necessary because omitting such variables biases downward the relations between them and other variables in the imputation model. The second kind makes the MAR assumption more plausible, because these variables account for the missing mechanism. The third kind helps to estimate the missing values more precisely. Thus, each kind of variable makes a unique contribution to the MI procedure. However, including too many variables in an imputation model may inflate the variance of estimates or lead to non-convergence, so researchers should select the variables for an imputation model carefully. van Buuren et al. (1999) recommended against including auxiliary variables that themselves have many missing data. Enders (2010) suggested selecting auxiliary variables whose absolute correlations with the variables containing missing data exceed .4.

Second, an imputation model should be general enough to capture the assumed structure of the data. If an imputation model is more restrictive than the analysis model, that is, if it imposes additional restrictions, one of two consequences may follow. If the additional restrictions are true, the results are valid but the conclusions may be conservative (i.e., failing to reject a false null hypothesis) (Schafer 1999). If one or more of the restrictions is false, the results are invalid (Schafer 1999). For example, a restriction may constrain the relationships among the variables in the imputation model to be merely pairwise; any interaction effect involving three or more variables will then be biased toward zero. To handle interactions properly in MI, Enders (2010) suggested that the imputation model include the product of two variables if both are continuous. For categorical variables, Enders suggested performing MI separately for each subgroup defined by the combination of the levels of the categorical variables.

Number of imputations

The number of imputations needed in MI is a function of the rate of missing information in a data set: a data set with a large amount of missing information requires more imputations. Rubin (1987) provided a formula for the relative efficiency of imputing m times instead of an infinite number of times: RE = (1 + λ̂/m)^(-1), where λ̂ is the fraction of missing information defined in Equation 11.
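Rubin's formula makes it easy to see why a handful of imputations already recovers most of the efficiency:

```python
# Relative efficiency of m imputations versus infinitely many (Rubin 1987):
# RE = (1 + lambda/m)^(-1), where lambda is the fraction of missing information
def relative_efficiency(lam, m):
    return 1 / (1 + lam / m)

# With half the information missing, 5 imputations retain about 91% efficiency
print(round(relative_efficiency(0.5, 5), 3))  # 0.909
```

As discussed next, this is precisely why RE alone can be a misleading criterion: efficiency plateaus quickly even when power and Monte Carlo error would still benefit from a larger m.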

However, methodologists have not agreed on the optimal number of imputations. Schafer and Olsen (1998) suggested that “in many applications, just 3–5 imputations are sufficient to obtain excellent results” (p. 548). Schafer and Graham (2002) were more conservative, asserting that 20 imputations are enough in many practical applications to remove noise from estimates. Graham et al. (2007) commented that RE should not be an important criterion when specifying m, because RE has little practical meaning. Other factors, such as the SE, p-value, and statistical power, are more relevant to empirical research and should be considered in addition to RE. Graham et al. (2007) reported that statistical power decreased much faster than RE as λ increased and/or m decreased. In an extreme case in which λ = .9 and m = 3, the power for MI was only .39, while the power of an equivalent FIML analysis was .78. Based on these results, Graham et al. (2007) provided a table of the number of imputations needed, given λ and an acceptable power falloff, such as 1%. They defined the power falloff as the percentage decrease in power compared to an equivalent FIML analysis, or compared to m = 100. For example, to ensure a power falloff of less than 1%, they recommended m = 20, 40, 100, or > 100 for a true λ of .1, .5, .7, or .9, respectively. Their recommended m is much larger than what is derived from the Rubin rule based on RE (Rubin 1987). Unfortunately, Graham et al.’s study is limited to testing a small standardized regression coefficient (β = 0.0969) in a simple regression analysis. The power falloff of MI may be less severe when the true β is larger than 0.0969. At present, the literature does not shed light on the performance of MI when the regression model is more complex than a simple regression.

Recently, White et al. (2011) argued that, in addition to relative efficiency and power, researchers should also consider Monte Carlo error when specifying the number of imputations. Monte Carlo error is defined as the standard deviation of the estimates (e.g., regression coefficients, test statistics, p-values) “across repeated runs of the same imputation procedure with the same data” (White et al. 2011, p. 387). Monte Carlo error converges to zero as m increases. A small Monte Carlo error implies that the results of a particular run of MI could be reproduced in a subsequent repetition of the MI analysis. White et al. also suggested that the number of imputations be greater than or equal to the percentage of missing observations in order to ensure an adequate level of reproducibility. For studies that compare different statistical methods, the number of imputations should be even larger than the percentage of missing observations, usually between 100 and 1000, in order to control the Monte Carlo error (Royston and White 2011).

It is clear from the above discussions that a simple recommendation for the number of imputations (e.g., m = 5) is inadequate. For data sets with a large amount of missing information, more than five imputations are necessary in order to maintain the power level and control the Monte Carlo error. A larger imputation model may require more imputations, compared to a smaller or simpler model. This is so because a large imputation model results in increased SE s, compared to a smaller or simpler model. Therefore, for a large model, additional imputations are needed to offset the increased SE s. Specific guidelines for choosing m await empirical research. In general, it is a good practice to specify a sufficient m to ensure the convergence of MI within a reasonable computation time.

Convergence of MCMC

The convergence of the Markov Chain is one of the determinants of the validity of the results obtained from MI. If the Markov Chain does not converge, the imputed values cannot be considered random samples from the posterior distribution of the missing data given the observed data, i.e., P(Ymis|Yobs). Consequently, statistical results based on these imputed values are invalid. Unfortunately, the importance of assessing convergence is rarely mentioned in articles that review the theory and application of MCMC (Schafer 1999; Schafer and Graham 2002; Schlomer et al. 2010; Sinharay et al. 2001). Because convergence is defined in terms of both probability and procedures, determining the convergence of MCMC is complex and difficult (Enders 2010). One way to roughly assess convergence is to visually examine the trace plot and the autocorrelation function plot; both are provided by SAS PROC MI (SAS Institute Inc 2011). For a parameter θ, a trace plot displays the number of iterations (t) on the horizontal axis against the value of θ(t) on the vertical axis. If the MCMC converges, the trace plot shows no systematic trend. The autocorrelation plot displays the autocorrelations between the θ(t)s at lag k on the vertical axis against k on the horizontal axis. Ideally, the autocorrelation at any lag should not be statistically significantly different from zero. Since a Markov Chain may converge at different rates for different parameters, one needs to examine these two plots for each parameter. When there are many parameters, one can instead examine the worst linear function (WLF; Schafer 1997). The WLF is a constructed statistic that converges more slowly than all other parameters in the MCMC method; thus, if the WLF converges, all parameters should have converged (see pp. 2–3 of the Appendix for an illustration of both plots for the WLF, accessible from https://oncourse.iu.edu/access/content/user/peng/Appendix.Dong%2BPeng.Principled%20missing%20methods.current.pdf ). Another way to assess the convergence of MCMC is to start the chain multiple times, each with a different initial value. If all the chains yield similar results, one can be confident that the algorithm has converged.
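The autocorrelation check can be computed directly from a saved chain. A minimal sketch (simple sample estimator; the ±2/√n band is the usual rough guide, and the function name is ours):

```python
import numpy as np

def autocorrelations(chain, max_lag=20):
    # Sample autocorrelation of theta(t) at lags 0..max_lag
    x = np.asarray(chain, float) - np.mean(chain)
    denom = x @ x
    return np.array([1.0 if k == 0 else x[k:] @ x[:-k] / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
chain = rng.normal(size=5000)        # stand-in for a well-mixed chain
acf = autocorrelations(chain)
band = 2 / np.sqrt(len(chain))       # rough zero band for a converged chain
```

For a converged, well-mixed chain, every lag beyond zero should fall inside the band; persistent large autocorrelations suggest thinning the draws further or running the chain longer.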

Full information maximum-likelihood (FIML)

FIML is a model-based missing data method that is used frequently in structural equation modeling (SEM). In our review of the literature, 26.1% of the studies that had missing data used FIML to deal with them. Unlike MI, FIML does not impute any missing data. It estimates parameters directly, using all the information already contained in the incomplete data set. The FIML approach was outlined by Hartley and Hocking (1971). As the name suggests, FIML obtains parameter estimates by maximizing the likelihood function of the incomplete data. Under the assumption of multivariate normality, the log likelihood function of each observation i is:

where xi is the vector of observed values for case i, Ki is a constant determined by the number of variables observed for case i, and μ and Σ are, respectively, the mean vector and covariance matrix to be estimated (Enders 2001). For example, suppose there are three variables (X1, X2, and X3) in the model, and for case i, X1 = 10 and X2 = 5, while X3 is missing. Then the log likelihood function for case i is:

The total sample log likelihood is the sum of the individual log likelihood across n cases. The standard ML algorithm is used to obtain the estimates of μ and Σ, and the corresponding SE s by maximizing the total sample log likelihood function.
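The casewise construction can be sketched directly: each case contributes a normal log density over its observed variables only. The function name is ours, multivariate normality is assumed, and NaN marks a missing entry:

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Observed-data log likelihood summed casewise, FIML-style."""
    mu = np.asarray(mu, float)
    sigma = np.asarray(sigma, float)
    total = 0.0
    for row in np.asarray(data, float):
        o = ~np.isnan(row)                    # observed variables for this case
        d = row[o] - mu[o]
        s = sigma[np.ix_(o, o)]               # mean/covariance sub-blocks
        # log multivariate normal density over the observed part only
        total += -0.5 * (o.sum() * np.log(2 * np.pi)
                         + np.log(np.linalg.det(s))
                         + d @ np.linalg.solve(s, d))
    return total

# The example case above: X1 = 10, X2 = 5, X3 missing contributes a
# bivariate density; a complete case would contribute a trivariate one.
data = [[10.0, 5.0, np.nan]]
```

In a real FIML fit, an optimizer would maximize this total over μ and Σ; the sketch only shows how each incomplete case enters the likelihood.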

As with MI, FIML assumes MAR and multivariate normality for the joint distribution of all the variables. When the two assumptions are met, FIML has been shown to produce unbiased estimates (Enders and Bandalos 2001) and valid model fit information (Enders 2001). Furthermore, FIML is generally more efficient than ad hoc missing data methods such as LD (Enders 2001). When the normality assumption was violated, Enders (2001) reported that (1) FIML provided unbiased estimates across different missing rates, sample sizes, and distribution shapes, as long as the missing mechanism was MCAR or MAR, but (2) FIML produced negatively biased SE estimates and an inflated model rejection rate (i.e., fitted models were rejected too frequently). Thus, Enders recommended correction methods, such as rescaled statistics and the bootstrap, to correct the bias associated with nonnormality.

Because FIML assumes MAR, adding auxiliary variables to a fitted model is beneficial to data analysis in terms of bias and efficiency (Graham 2003 ; Section titled The Imputation Model). Collins et al. ( 2001 ) showed that auxiliary variables are especially helpful when (1) missing rate is high (i.e., > 50%), and/or (2) the auxiliary variable is at least moderately correlated (i.e., Pearson’s r > .4) with either the variable containing missing data or the variable causing missingness. However, incorporating auxiliary variables into FIML is not as straightforward as it is with MI. Graham ( 2003 ) proposed the saturated correlates model to incorporate auxiliary variables into a substantive SEM model, without affecting the parameter estimates of the SEM model or its model fit index. Specifically, Graham suggested that, after the substantive SEM model is constructed, the auxiliary variables be added into the model according to the following rules: (a) all auxiliary variables are specified to be correlated with all exogenous manifest variables in the model; (b) all auxiliary variables are specified to be correlated with the residuals for all the manifest variables that are predicted; and (c) all auxiliary variables are specified to be correlated to each other. Afterwards, the saturated correlates model can be fitted to data by FIML to increase efficiency and decrease bias.

Expectation-maximization (EM) algorithm

The EM algorithm is another maximum-likelihood based missing data method. As with FIML, the EM algorithm does not “fill in” missing data, but rather estimates the parameters directly, by iteratively maximizing the expected complete-data log likelihood. It does so by alternating between the E step and the M step (Dempster et al. 1977).

The E (expectation) step calculates the expectation of the log likelihood function of the parameters, given the data. Assume a data set (Y) is partitioned into two parts, the observed part and the missing part, namely Y = (Yobs, Ymis). The distribution of Y, which depends on the unknown parameter θ, can therefore be written as:

Equation 13 can be rewritten as the likelihood function in Equation 14:

where c is a constant relating to the missing data mechanism that can be ignored under the MAR assumption and under independence between the model parameters and the parameters of the missing mechanism (Schafer 1997, p. 12). Taking the log of both sides of Equation 14 yields the following:

where l(θ|Y) = log P(Y|θ) is the complete-data log likelihood, l(θ|Yobs) is the observed-data log likelihood, log c is a constant, and P(Ymis|Yobs, θ) is the predictive distribution of the missing data given θ (Schafer 1997). Since log c does not affect the estimation of θ, this term can be dropped in subsequent calculations.

Because Y mis is unknown, the complete-data log likelihood cannot be determined directly. However, if there is a temporary or initial guess of θ (denoted as θ ( t ) ), it is possible to compute the expectation of l ( θ | Y ) with respect to the assumed distribution of the missing data P ( Y mis | Y obs ,  θ ( t ) ) as Equation 16 :

It is at the E step of the EM algorithm that Q ( θ | θ ( t ) ) is calculated.

At the M (Maximization) step, the next guess of θ is obtained by maximizing the expectation of the complete data log likelihood from the previous E step:

The EM algorithm is initialized with an arbitrary guess θ(0), usually estimates based solely on the observed data. It proceeds by alternating between the E step and the M step and terminates when successive estimates of θ are nearly identical. The θ(t+1) that maximizes Q(θ|θ(t)) is guaranteed to yield an observed-data log likelihood that is greater than or equal to that of θ(t) (Dempster et al. 1977).
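The E and M steps can be made concrete for multivariate normal data with ignorable missing values. A sketch under stated assumptions: the function name is ours, NaN marks a missing entry, and each case is assumed to have at least one observed value:

```python
import numpy as np

def em_normal(data, n_iter=200):
    """EM for the mean vector and covariance matrix of multivariate
    normal data with ignorable (MAR) missing values. Illustrative only."""
    y = np.asarray(data, float)
    n, p = y.shape
    mu = np.nanmean(y, axis=0)                 # theta(0) from observed data
    sigma = np.diag(np.nanvar(y, axis=0))
    for _ in range(n_iter):
        s1 = np.zeros(p)
        s2 = np.zeros((p, p))
        for row in y:
            o, m = ~np.isnan(row), np.isnan(row)
            ey = row.copy()
            cov = np.zeros((p, p))
            if m.any():
                soo = sigma[np.ix_(o, o)]
                smo = sigma[np.ix_(m, o)]
                # E-step: conditional mean of the missing block ...
                ey[m] = mu[m] + smo @ np.linalg.solve(soo, row[o] - mu[o])
                # ... plus its conditional covariance (the correction term)
                cov[np.ix_(m, m)] = (sigma[np.ix_(m, m)]
                                     - smo @ np.linalg.solve(soo, smo.T))
            s1 += ey
            s2 += np.outer(ey, ey) + cov
        # M-step: maximize the expected complete-data log likelihood
        mu = s1 / n
        sigma = s2 / n - np.outer(mu, mu)
    return mu, sigma
```

Note that, consistent with the text, the algorithm returns only μ and Σ; it provides no SEs, which is one of the disadvantages discussed next.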

The EM algorithm has many attractive properties. First, an EM estimator is unbiased and efficient when the missing mechanism is ignorable (ignorability is discussed in the section Missing Data Mechanisms; Graham 2003). Second, the EM algorithm is simple, easy to implement (Dempster et al. 1977), and stable (Couvreur 1996). Third, because EM is based on the likelihood function, it is straightforward to compare different models using the likelihood ratio test. Assuming Model B is nested within Model A, the two models can be compared based on the difference in their log likelihoods, namely l(θ̂A|Yobs) − l(θ̂B|Yobs). Twice this difference follows a chi-square distribution under suitable regularity conditions (Schafer and Graham 2002; Wilks 1938), with degrees of freedom equal to the difference in the number of parameters estimated by the two models. Fourth, EM can be used in situations that are not missing data related. For example, the EM algorithm can be used in mixture models, random effects models, mixed models, hierarchical linear models, and unbalanced designs including repeated measures (Peng et al. 2006). Finally, the EM algorithm and other missing data methods based on the observed-data log likelihood, such as FIML, are more efficient than MI, because these methods do not require simulation whereas MI does.

However, the EM algorithm also has several disadvantages. First, the EM algorithm does not compute the derivatives of the log likelihood function; consequently, it does not provide estimates of SEs. Although extensions of EM have been proposed to allow for the estimation of SEs, these extensions are computationally complex. Thus, EM is not the method of choice when statistical tests or confidence intervals for estimated parameters are the primary goal of research. Second, the rate of convergence can be painfully slow when the percentage of missing information is large (Little and Rubin 2002). Third, many statistical programs assume the multivariate normal distribution when constructing l(θ|Y). Violation of this multivariate normality assumption may cause convergence problems for EM, as well as for other ML-based methods such as FIML. For example, if the likelihood function has more than one mode, the mode to which EM converges depends on the starting value of the iteration. Schafer (1997) cautions that multiple modes do occur in real data sets, especially when “the data are sparse and/or the missingness pattern is unusually pernicious” (p. 52). One way to check whether EM provides valid results is to initialize the EM algorithm with different starting values and check whether the results are similar. Finally, EM is model specific: each proposed data model requires a unique likelihood function. In sum, if used flexibly and with care, EM is powerful and can provide smaller SE estimates than MI. Schafer and Graham (2002) compiled a list of packages that offered the EM algorithm. To the best of our knowledge, the list has not been updated in the literature.

Demonstration

In this section, we demonstrate the three principled missing data methods by applying them to a real-world data set. The data set is complete and is described under Data Set. A research question posed to this data set and an appropriate analysis strategy are described next under Statistical Modeling. From the complete data set, two missing data conditions were created under the MAR assumption at three missing data rates; these conditions are described under Generating Missing Data Conditions. For each missing data condition, LD, MI, FIML, and EM were applied to answer the research question. The application of these four methods is described under Data Analysis. Results obtained from these methods were contrasted with those obtained from the complete data set and are discussed in the next section, titled Results.

Data set

Self-reported health data on 432 adolescents were collected in the fall of 1988 from two junior high schools (Grades 7 through 9) in the Chicago area. Of the 432 participants, 83.4% were White and the remainder Black or of other races, with a mean age of 13.9 years and nearly even numbers of girls (n = 208) and boys (n = 224). Parents were notified by mail that the survey was to be conducted. Both the parents and the students were assured of their right to optional participation and of the confidentiality of students' responses. Written parental consent was waived with the approval of the school administration and the university Institutional Review Board (Ingersoll et al. 1993). The adolescents reported their health behavior, using the Health Behavior Questionnaire (HBQ) (Ingersoll and Orr 1989; Peng et al. 2006; Resnick et al. 1993); self-esteem, using Rosenberg's inventory (Rosenberg 1989); gender; race; intention to drop out of school; and family structure. The HBQ asked adolescents to indicate whether they engaged in specific risky health behaviors (Behavioral Risk Scale) or had experienced selected emotions (Emotional Risk Scale). The response scale ranged from 1 (never) to 4 (about once a week) for both scales. Examples of behavioral risk items were "I use alcohol (beer, wine, booze)," "I use pot," and "I have had sexual intercourse/gone all the way." These items measured the frequency of adolescents' alcohol and drug use, sexual activity, and delinquent behavior. Examples of emotional risk items were "I have attempted suicide" and "I have felt depressed." Emotional risk items measured adolescents' quality of relationships with others and management of emotions. Cronbach's alpha reliability (Nunnally 1978) was .84 for the Behavioral Risk Scale and .81 for the Emotional Risk Scale (Peng and Nichols 2003). Adolescents' self-esteem was assessed using Rosenberg's self-esteem inventory (Rosenberg 1989).
Self-esteem scores ranged from 9.79 to 73.87 with a mean of 50.29 and SD of 10.04. Furthermore, among the 432 adolescents, 12.27% (n = 53) indicated an intention to drop out of school; 67.4% (n = 291) were from families with two parents, including those with one step-parent, and 32.63% (n = 141) were from families headed by a single parent. The data set is hereafter referred to as the Adolescent data and is available from https://oncourse.iu.edu/access/content/user/peng/logregdata_peng_.sav as an SPSS data file.

Statistical Modeling

For the Adolescent data, we were interested in predicting adolescents' behavioral risk from their gender, intention to drop out of school, family structure, and self-esteem scores. Given this objective, a linear regression model was fit to the data using adolescents' score on the Behavioral Risk Scale of the HBQ as the dependent variable (BEHRISK) and gender (GENDER), intention to drop out of school (DROPOUT), type of family structure (FAMSTR), and self-esteem score (ESTEEM) as predictors or covariates. Emotional risk (EMORISK) was used subsequently as an auxiliary variable to illustrate the missing data methods; hence, it was not included in the regression model. For the linear regression model, GENDER was coded as 1 for girls and 0 for boys, DROPOUT was coded as 1 for yes and 0 for no, and FAMSTR was coded as 1 for single-parent families and 0 for intact or step families. BEHRISK and ESTEEM were coded using participants' scores on these two scales. Because the distribution of BEHRISK was highly skewed, a natural log transformation was applied to BEHRISK to reduce its skewness from 2.248 to 1.563. The natural-log-transformed BEHRISK (or LBEHRISK) and ESTEEM were standardized before being included in the regression model to facilitate the discussion of the impact of different missing data methods. Thus, the regression model fitted to the Adolescent data (Equation 18) was:

LBEHRISK = β0 + β1(GENDER) + β2(DROPOUT) + β3(FAMSTR) + β4(ESTEEM) + e

The regression coefficients obtained from SAS 9.3 using the complete data were:

According to the results, with all other covariates held constant, boys, adolescents who intended to drop out of school, adolescents with low self-esteem scores, and adolescents from single-parent families were more likely to engage in risky behaviors.

Generating missing data conditions

The missing data on LBEHRISK and ESTEEM were created under the MAR mechanism. Specifically, the probability of missing data on LBEHRISK was made to depend on EMORISK, and the probability of missing data on ESTEEM depended on FAMSTR. Peugh and Enders (2004) reviewed missing data reported in 23 applied research journals, and found that "the proportion of missing cases per analysis ranged from less than 1% to approximately 67%" (p. 539). Peng et al. (2006) reported missing rates ranging from 26% to 72% based on 1,666 studies published in 11 education and psychology journals. We thus designed our study to correspond to the wide spread of missing rates encountered by applied researchers. Specifically, we manipulated the overall missing rate at three levels: 20%, 40%, or 60% (see Table 1). We did not include lower missing rates, such as 10% or 5%, because we expected missing data methods to perform similarly and better at low missing rates than at high missing rates. Altogether we generated three missing data conditions using SPSS 20 (see the Appendix for SPSS syntax for generating missing data). Due to the difficulty of manipulating missing data in both the outcome variable and the covariates, the actual overall missing rates could not be controlled at exactly 20%, 40%, or 60%. They did, however, closely approximate these pre-specified rates (see the description below).

According to Table 1, at the 20% overall missing rate, participants from a single-parent family had a probability of .20 of missing ESTEEM, while participants from a two-parent family (including intact families and families with one step- and one biological parent) had a probability of .02 of missing scores on ESTEEM. As the overall missing rate increased from 20% to 40% or 60%, the probability of missingness on ESTEEM likewise increased. Furthermore, the probability of missingness on LBEHRISK was conditioned on the value of EMORISK. Specifically, at the 20% overall missing rate, if EMORISK was at or below the first quartile, the probability of LBEHRISK being missing was .00 (Table 1); if EMORISK was between the first and third quartiles, the probability was .10; and if EMORISK was at or above the third quartile, the probability was .30. When the overall missing rate increased to 40% or 60%, the probabilities of missing LBEHRISK increased accordingly.
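The missingness scheme just described is straightforward to reproduce. The sketch below (in Python rather than the SPSS syntax given in the Appendix; the function name is our own) imposes the 20%-condition probabilities from Table 1 on complete vectors of the variables involved.

```python
import numpy as np

def impose_mar_20(lbehrisk, esteem, emorisk, famstr, rng=None):
    """Impose the 20%-overall-rate MAR pattern described in the text:
    P(ESTEEM missing) = .20 for single-parent families (famstr == 1)
    and .02 otherwise; P(LBEHRISK missing) = .00 / .10 / .30 for
    EMORISK at or below Q1 / between Q1 and Q3 / at or above Q3."""
    if rng is None:
        rng = np.random.default_rng()
    lbehrisk = np.asarray(lbehrisk, float).copy()
    esteem = np.asarray(esteem, float).copy()
    n = len(esteem)
    # ESTEEM missingness depends on family structure
    p_esteem = np.where(np.asarray(famstr) == 1, 0.20, 0.02)
    esteem[rng.random(n) < p_esteem] = np.nan
    # LBEHRISK missingness depends on the EMORISK quartile region
    q1, q3 = np.quantile(emorisk, [0.25, 0.75])
    p_beh = np.where(emorisk <= q1, 0.00,
                     np.where(emorisk >= q3, 0.30, 0.10))
    lbehrisk[rng.random(n) < p_beh] = np.nan
    return lbehrisk, esteem
```

Because the deletion probabilities depend only on the fully observed EMORISK and FAMSTR, not on the deleted values themselves, the resulting pattern is MAR by construction.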

After generating the three data sets with different overall missing rates, the regression model in Equation 18 was fitted to each data set using four methods (LD, MI, FIML, and EM) to deal with the missing data. Because missingness on LBEHRISK depended on EMORISK, EMORISK was used as an auxiliary variable in the MI, EM, and FIML methods. All analyses were performed using SAS 9.3. For simplicity, we describe the data analysis for one of the three data sets, namely, the condition with an overall missing rate of 20%. The other data sets were analyzed similarly. Results are presented in Tables 2 and 3.

Data analysis

The LD method

The LD method was implemented as the default in PROC REG. To implement LD, we ran PROC REG without specifying any options regarding missing data handling. The SAS System, by default, used only cases with complete data to estimate the regression coefficients.

The MI method

The MI method was implemented using a combination of PROC MI (for imputation), PROC REG (for OLS regression analysis), and PROC MIANALYZE (for pooling in MI). According to White et al. (2011), the number of imputations should be at least equal to the percentage of missing observations. The largest missing rate in the present study was 60%; thus, we decided to impute missing data 60 times before pooling estimates. The imputation model included all four covariates specified in Equation 18, the dependent variable (LBEHRISK), and EMORISK as an auxiliary variable. For PROC MI, MCMC was chosen as the imputation method because the missing data pattern was arbitrary. By default, PROC MI uses the EM estimates as starting values for the MCMC method. The iteration history of EM indicated that the algorithm converged rather quickly; it took four iterations to converge at the 20% overall missing rate. Convergence in MCMC was further inspected using the trace plot and the autocorrelation function plot for the worst linear function (SAS Institute Inc 2011). The inspection identified no systematic trend in the trace plot and no significant autocorrelation for lags greater than two in the autocorrelation function plot. We therefore concluded that the MCMC converged and that the choice of 1,000 burn-in iterations and 200 iterations between imputations was adequate. The number of burn-in iterations is the number of iterations before the first draw; it needs to be sufficiently large to ensure the convergence of MCMC. The fraction of missing information (λ) for each variable with missing data was estimated by PROC MI to be .11 for LBEHRISK and .10 for ESTEEM. These estimated λs would have resulted in a 3% power falloff, compared to FIML, if only five imputations had been used (Graham et al. 2007). Instead, we specified 60 imputations, based on White et al.'s (2011) recommendation. The resulting 60 imputed data sets were used in steps 2 and 3 of MI.
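The relationship between the estimated λ and the number of imputations can be quantified with Rubin's (1987) large-sample relative efficiency of an m-imputation estimate, RE = (1 + λ/m)⁻¹; note that the 3% power falloff quoted above comes from Graham et al.'s (2007) simulations, a stricter criterion than this relative efficiency. A minimal sketch:

```python
def mi_relative_efficiency(lam, m):
    """Rubin's (1987) large-sample relative efficiency of multiple
    imputation with m imputations, relative to infinitely many, for a
    fraction of missing information lam: RE = 1 / (1 + lam / m)."""
    return 1.0 / (1.0 + lam / m)
```

With λ = .11, five imputations already give RE of about .978, and m = 60 pushes RE above .998, so the extra imputations cost only computing time.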

The second step in MI was to fit the regression model in Equation 18 to each imputed data set using PROC REG (see the Appendix for the SAS syntax). At the end of PROC REG, 60 sets of estimates of regression coefficients and their variance-covariance matrices were output to the third and final step in MI, namely, pooling these 60 sets of estimates into one. PROC MIANALYZE was invoked to combine these estimates and their variances/covariances into one set using the pooling formulas in Equations 4 to 7 (Rubin 1987). By default, PROC MIANALYZE uses νm, defined in Equation 9, for hypothesis testing. In order to specify the corrected degrees of freedom νm* (defined in Equation 10) for testing, we specified the "EDF=427" option, because 427 was the degrees of freedom based on the complete data.
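The pooling that PROC MIANALYZE performs can be written out directly. The following sketch (Python, with our own function name) applies Rubin's (1987) rules for a scalar parameter and the classical degrees of freedom νm, and, when a complete-data df is supplied as with the EDF= option, an adjusted df, assuming Equation 10 is the Barnard and Rubin (1999) small-sample correction:

```python
import numpy as np

def pool_mi(estimates, variances, complete_df=None):
    """Pool m per-imputation estimates of a scalar parameter using
    Rubin's (1987) rules: point estimate = mean of the estimates, and
    total variance T = W + (1 + 1/m) B, where W is the within- and B
    the between-imputation variance. Returns the pooled estimate, T,
    and the df: the classical nu_m, or the Barnard-Rubin (1999)
    adjusted df when a complete-data df is given."""
    q = np.asarray(estimates, float)   # m point estimates
    u = np.asarray(variances, float)   # m squared standard errors
    m = len(q)
    qbar = q.mean()
    w = u.mean()                       # within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    t = w + (1 + 1 / m) * b            # total variance
    r = (1 + 1 / m) * b / w            # relative increase in variance
    nu_m = (m - 1) * (1 + 1 / r) ** 2  # classical df
    if complete_df is None:
        return qbar, t, nu_m
    # Barnard-Rubin adjustment, combining nu_m with the observed-data df
    gamma = (1 + 1 / m) * b / t        # fraction of missing information
    nu_obs = ((complete_df + 1) / (complete_df + 3)) * complete_df * (1 - gamma)
    nu_star = 1.0 / (1.0 / nu_m + 1.0 / nu_obs)
    return qbar, t, nu_star
```

The adjusted df is always smaller than the classical νm, which prevents the well-known anomaly of the classical formula exceeding the complete-data df.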

The FIML method

The FIML method was implemented using PROC CALIS which is designed for structural equation modeling. Beginning with SAS 9.22, the CALIS procedure has offered an option to analyze data using FIML in the presence of missing data. The FIML method in the CALIS procedure has a variety of applications in path analyses, regression models, factor analyses, and others, as these modeling techniques are considered special cases of structural equation modeling (Yung and Zhang 2011 ). For the current study, two models were specified using PROC CALIS: an ordinary least squares regression model without the auxiliary variable EMORISK, and a saturated correlates model that included EMORISK. For the saturated correlates model, EMORISK was specified to be correlated with the four covariates (GENDER, DROPOUT, ESTEEM, and FAMSTR) and the residual for LBEHRISK. Graham ( 2003 ) has shown that by constructing the saturated correlates model this way, one can include an auxiliary variable in the SEM model without affecting parameter estimate(s), or the model fit index for the model of substantive interest, which is Equation 18 in the current study.

The EM method

The EM method was implemented using both PROC MI and PROC REG. As stated previously, the versatile PROC MI can be used for EM if the EM statement is specified. To include auxiliary variables in EM, one lists them on the VAR statement of PROC MI (see the Appendix for the SAS syntax). The output data set of PROC MI with the EM specification contains the estimated variance-covariance matrix and the vector of means of all the variables listed on the VAR statement. The variance-covariance matrix and the means vector were subsequently input into PROC REG to fit the regression model in Equation 18. In order to compute the SEs for the estimated regression coefficients, we specified a nominal sample size equal to the average number of available cases among all the variables. We chose this strategy based on Truxillo (2005), who compared three strategies for specifying sample sizes for hypothesis testing in discriminant function analysis using EM results: (a) the minimum column-wise n (i.e., the smallest number of available cases among all variables), (b) the average column-wise n (i.e., the mean number of available cases among all the variables), and (c) the minimum pairwise n (i.e., the smallest number of available cases for any pair of variables in a data set). He found that the average column-wise n approach produced results closest to the complete-data results. It is worth noting that Truxillo's (2005) study was limited to discriminant function analysis and three sample size specifications; additional research is needed to determine the best strategy for specifying a nominal sample size for other statistical procedures.
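The second stage of this EM approach, fitting a regression to an estimated covariance matrix and mean vector rather than to raw data, rests on a standard identity: the OLS slopes are Sxx⁻¹Sxy and the intercept is the mean of y minus the slopes applied to the means of the predictors. The sketch below (Python; our own illustration of what PROC REG does with a moment-matrix input, omitting the SE step that requires the nominal n discussed above) makes this explicit:

```python
import numpy as np

def regression_from_moments(cov, means, names, y, xs):
    """Recover OLS coefficients from a covariance matrix and mean
    vector alone. `names` labels the rows/columns of `cov` and the
    entries of `means`; `y` is the outcome and `xs` the predictors.
    Slopes = Sxx^{-1} Sxy; intercept = ybar - slopes . xbar."""
    cov = np.asarray(cov, float)
    means = np.asarray(means, float)
    idx = {n: i for i, n in enumerate(names)}
    xi = [idx[x] for x in xs]
    yi = idx[y]
    sxx = cov[np.ix_(xi, xi)]   # predictor covariance block
    sxy = cov[xi, yi]           # predictor-outcome covariances
    slopes = np.linalg.solve(sxx, sxy)
    intercept = means[yi] - slopes @ means[xi]
    return intercept, dict(zip(xs, slopes))
```

On complete data the result is identical to an ordinary least-squares fit; with EM-estimated moments it yields the EM regression estimates described in the text.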

Results

Results from the 40% missing rate exhibited patterns intermediate between those obtained at the 20% and 60% missing rates; hence, they are presented in the Appendix. Table 2 presents estimates of regression coefficients and SEs derived from LD, MI, FIML, and EM for the 20% and 60% missing data conditions. Table 3 presents the percent bias in parameter estimates for the four missing data methods. Percent bias was defined as the difference between the incomplete-data estimate and the complete-data estimate, divided by the complete-data estimate. Any percent bias larger than 10% is considered substantial in subsequent discussions. The complete-data results are included in Table 2 as a benchmark against which the missing data results are contrasted. The regression model based on the complete data explained 28.4% of the variance (adjusted R²) in LBEHRISK, RMSE = 0.846, and all four predictors were statistically significant at p < .001.
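The bias criterion is simple enough to state as code; a one-line sketch (with a hypothetical function name) makes the sign convention explicit:

```python
def percent_bias(incomplete_est, complete_est):
    """Percent bias as defined in the text: the incomplete-data
    estimate minus the complete-data estimate, divided by the
    complete-data estimate, expressed here as a percentage."""
    return 100.0 * (incomplete_est - complete_est) / complete_est
```

An estimate of .45 against a complete-data benchmark of .50, for example, is biased by -10%, right at the threshold treated as substantial.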

According to Table 2, at the 20% overall missing rate, estimates derived from the four missing data methods were statistically significant at p < .001, the same significance level as in the complete-data results. LD consistently resulted in larger SEs than the three principled methods or the complete data set. The bias in estimates was mostly under 10%, except for the estimates of ESTEEM by all four missing data methods (Table 3). The three principled methods exhibited similar biases and estimated FAMSTR accurately.

When the overall missing rate was 60% (Table 2), estimates derived from the four missing data methods showed that all four covariates were statistically significant at least at p < .05. LD again consistently resulted in larger SEs than the three principled methods or the complete data set. All four methods resulted in substantial bias for three of the four covariates (Table 3). The three principled methods once again yielded similar biases, whereas the bias from LD was similar to theirs only for DROPOUT. Indeed, DROPOUT was the least accurately estimated covariate for all four methods. LD estimated ESTEEM most accurately, better than the three principled methods did. The three principled methods estimated GENDER most accurately, and their estimates for FAMSTR were better than LD's. Differences in absolute bias among the four methods for ESTEEM or GENDER were actually quite small.

Compared to the complete-data results, the three principled methods slightly overestimated SEs (Table 2), but not as badly as LD did. Among the three methods, SEs obtained from EM were closer to those based on the complete data than were those from MI or FIML. This finding is to be expected because MI incorporates into the SE the uncertainty associated with plausible missing data estimates, and the literature consistently documents the superior power of EM compared to MI (Collins et al. 2001; Graham et al. 2007; Schafer and Graham 2002).

In general, both the SEs and the bias increased as the overall missing rate increased from 20% to 60%. One exception to this trend was the bias in ESTEEM estimated by LD, which decreased instead, although the two estimates differed by a mere .02.

During the last decade, the missing data treatments reported in JEP have shown much improvement in terms of decreased use of ad hoc methods (e.g., LD and PD) and increased use of principled methods (e.g., FIML, EM, and MI). Yet several questionable research practices persist, including not explicitly acknowledging the presence of missing data, not describing the approach used to deal with missing data, and not testing the assumptions invoked. In this paper, we promote three principled missing data methods (MI, FIML, and EM) by discussing their theoretical framework, implementation, assumptions, and computing issues. All three methods were illustrated with an empirical Adolescent data set using SAS 9.3, and their performance was evaluated under three missing data conditions created from three missing rates (20%, 40%, and 60%). Each incomplete data set was subsequently analyzed with a regression model to predict adolescents' behavioral risk score using one of the three principled methods or LD. The performance of the four missing data methods was contrasted with that of the complete data set in terms of bias and SE.

Results showed that the three principled methods yielded similar estimates at both tabled missing data rates. In comparison, LD consistently resulted in larger SEs for regression coefficient estimates. These findings are consistent with those reported in the literature and thus support the recommendation of the three principled methods (Allison 2003; Horton and Lipsitz 2001; Kenward and Carpenter 2007; Peng et al. 2006; Peugh and Enders 2004; Schafer and Graham 2002). Under the three missing data conditions, MI, FIML, and EM yielded similar estimates and SEs. These results are consistent with missing data theory, which holds that MI and ML-based methods (e.g., FIML and EM) are asymptotically equivalent (Collins et al. 2001; Graham et al. 2007; Schafer and Graham 2002). In terms of SEs, the ML-based methods outperformed MI by providing slightly smaller SEs. This finding is to be expected because ML-based methods do not involve any randomness, whereas MI does. Below we elaborate on features shared by MI and ML-based methods, the choice between these two types of methods, and the extension of these methods to multilevel research contexts.

Features shared by MI and ML-based methods

First of all, these methods are based on the likelihood function P(Yobs, θ) = ∫ P(Ycomplete, θ) dYmis. Because this equation is valid under MAR (Rubin 1976), all three principled methods are valid under the MAR assumption. The two ML-based methods work directly with the likelihood function, whereas MI takes the Bayesian approach by imposing a prior distribution on the likelihood function. As the sample size increases, the impact of the specific prior distribution diminishes. It has been shown that,

If the user of the ML procedure and the imputer use the same set of input data (same set of variables and observational units), if their models apply equivalent distributional assumptions to the variables and the relationships among them, if the sample size is large, and if the number of imputations, M , is sufficiently large, then the results from the ML and MI procedures will be essentially identical. (Collins et al. 2001, p. 336)

In fact, the computational details of EM and MCMC (i.e., data augmentation) are very similar (Schafer 1997 ).

Second, both MI and the ML-based methods allow the estimation/imputation model to differ from the analysis model (the model of substantive interest). Although it is widely known that the imputation model can differ from the analysis model in MI, the fact that ML-based methods can also incorporate auxiliary variables (such as EMORISK) is rarely mentioned in the literature, except by Graham (2003). As previously discussed, Graham (2003) suggested using the saturated correlates model to incorporate auxiliary variables into SEM. However, this approach produces a rapidly expanding model with each additional auxiliary variable; consequently, the ML-based methods may not converge. In this case, MI is the preferred method, especially when one needs to incorporate a large number of auxiliary variables into the model of substantive interest.

Finally, most statistical packages that offer the EM, FIML, and/or MI methods assume multivariate normality. Theory and experiments suggest that MI is more robust to violation of this distributional assumption than ML-based methods (Schafer 1997). As discussed previously, violation of the multivariate normality assumption may cause convergence problems for ML-based methods, yet MI can still provide satisfactory results in the presence of non-normality (refer to the section titled MI Related Issues). This is so because the posterior distribution in MI is approximated by a finite mixture of normal distributions; MI is therefore able to capture non-normal features such as skewness or multiple modes (Schafer 1999). At present, the literature does not offer systematic comparisons of these two methods in terms of their sensitivity to violation of the multivariate normality assumption.

Choice between MI and ML-based methods

The choice between MI and ML-based methods is not easy. On the one hand, ML-based methods offer the advantage of likelihood ratio tests, so that nested models can be compared. Although Schafer (1997) provided a way to combine likelihood ratio test statistics in MI, no empirical studies have evaluated the performance of this pooled likelihood ratio test under various data conditions (e.g., missing mechanism, missing rate, number of imputations, model complexity), and the test has not been incorporated into popular statistical packages such as SAS or SPSS. ML-based methods, in general, produce slightly smaller SEs than MI (Collins et al. 2001; Schafer and Graham 2002). Finally, ML-based methods have greater power than MI (Graham et al. 2007), unless the number of imputations is sufficiently large, such as 100 or more.

On the other hand, MI has a clear advantage over ML-based methods when dealing with categorical variables (Peng and Zhu 2008). Another advantage of MI is its computational simplicity (Sinharay et al. 2001): once missing data have been imputed, fitting multiple models to the imputed data does not require repeated application of MI, whereas fitting different models to the same incomplete data requires a separate application of an ML-based method for each model. As stated earlier, it is also easier to include auxiliary variables in MI than in ML-based methods. In this sense, MI is the preferred method if one wants to employ an inclusive strategy for selecting auxiliary variables.

The choice also depends on the goal of the study. If the aim is exploratory, or if the data are prepared for a number of users who may analyze the data differently, MI is certainly better than an ML-based method. For these purposes, a data analyst needs to make sure that the imputation model is general enough to capture the meaningful relationships in the data set. If, however, a researcher is clear about the parameters to be estimated, FIML or EM is the better choice, because these methods do not introduce imputation randomness into the data and are more efficient than MI.

An even better way to deal with missing data is to apply MI and EM jointly. In fact, the application of MI can be facilitated by using EM estimates as starting values for the data augmentation algorithm (Enders 2010). Furthermore, the number of EM iterations needed for convergence is a conservative estimate of the number of burn-in iterations needed in MI's data augmentation, because EM converges more slowly than data augmentation.

Extension of MI and ML-based methods to multilevel research contexts

Many problems in education and psychology are multilevel in nature, with, for example, students nested within classrooms or teachers nested within school districts. To adequately address these problems, multilevel models have been recommended by methodologists. For an imputation method to yield valid results, the imputation model must reflect the same structure as the data. In other words, the imputation model should itself be multilevel in order to impute missing data in a multilevel context (Carpenter and Goldstein 2004). There are several ways to extend MI to deal with missing data when there are two levels. If missing data occur only at level 1 and the number of level 2 units is low, standard MI can be used with minor adjustments. For example, for a random-intercept model, one can dummy-code the cluster membership variable and include the dummy variables in the imputation model. In the case of a random-slope and random-intercept model, one needs to perform multiple imputation separately within each cluster (Graham 2009). When the number of level 2 units is high, the procedure just described is cumbersome. In this instance, one may turn to specialized MI programs, such as the PAN library in the S-Plus program (Schafer 2001), the REALCOM-IMPUTE software (Carpenter et al. 2011), and the R package mlmmm (Yucel 2007). Unfortunately, ML-based methods have been extended to multilevel models only when there are missing data on the dependent variable, but not on the covariates at any level, such as students' age at level 1 or schools' SES at level 2 (Enders 2010).

In this paper, we discuss and demonstrate three principled missing data methods that are applicable to a variety of research contexts in educational psychology. Before applying any of the principled methods, one should make every effort to prevent missing data from occurring. Toward this end, the missing data rate should be kept to a minimum by designing and implementing data collection carefully. When missing data are inevitable, one needs to closely examine the missing data mechanism, missing rate, missing pattern, and the data distribution before deciding on a suitable missing data method. When implementing a missing data method, a researcher should be mindful of issues related to its proper implementation, such as statistical assumptions, the specification of the imputation/estimation model, a suitable number of imputations, and criteria of convergence.

Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.

Ake CF: Rounding after multiple imputation with non-binary categorical covariates. In Proceedings of the Thirtieth Annual SAS® Users Group International Conference . Cary, NC: SAS Institute Inc; 2005:1-11.

Allison PD: Missing data . Thousand Oaks, CA: Sage Publications, Inc.; 2001.

Allison PD: Missing data techniques for structural equation modeling. J Abnorm Psychol 2003, 112(4):545-557. 10.1037/0021-843x.112.4.545

Allison PD: Imputation of categorical variables with PROC MI. In Proceedings of the Thirtieth Annual SAS® Users Group International Conference . Cary, NC: SAS Institute Inc; 2005:1-14.

Barnard J, Rubin DB: Small-sample degrees of freedom with multiple imputation. Biometrika 1999, 86(4):948-955. 10.1093/biomet/86.4.948

Bennett DA: How can I deal with missing data in my study? Aust N Z J Public Health 2001, 25(5):464-469. 10.1111/j.1467-842X.2001.tb00294.x

Bernaards CA, Belin TR, Schafer JL: Robustness of a multivariate normal approximation for imputation of incomplete binary data. Stat Med 2007, 26(6):1368-1382. 10.1002/sim.2619

Carpenter J, Goldstein H: Multiple imputation in MLwiN. Multilevel modelling newsletter 2004, 16: 9-18.

Carpenter JR, Goldstein H, Kenward MG: REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. J Stat Softw 2011, 45(5):1-14.

Collins LM, Schafer JL, Kam C-M: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Meth 2001, 6(4):330-351. 10.1037/1082-989X.6.4.330

Couvreur C: The EM algorithm: a guided tour. In Proc. 2nd IEEE European Workshop on Computationally Intensive Methods in Control and Signal Processing. Prague, Czech Republic; 1996:115-120.

Demirtas H, Freels SA, Yucel RM: Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. JSCS 2008, 78(1):69-84. 10.1080/10629360600903866

Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Series B 1977, 39(1):1-38. 10.2307/2984875

Diggle PJ, Liang KY, Zeger SL: Analysis of longitudinal data . New York: Oxford University Press; 1995.

Enders CK: A Primer on Maximum Likelihood Algorithms Available for Use With Missing Data. Struct Equ Modeling 2001, 8(1):128-141. 10.1207/S15328007SEM0801_7

Enders CK: Using the Expectation Maximization Algorithm to Estimate Coefficient Alpha for Scales With Item-Level Missing Data. Psychol Meth 2003, 8(3):322-337. 10.1037/1082-989X.8.3.322

Enders CK: Applied Missing Data Analysis . New York, NY: The Guilford Press; 2010.

Enders CK, Bandalos DL: The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models. Struct Equ Modeling 2001, 8(3):430-457. 10.1207/S15328007SEM0803_5

Graham JW: Adding Missing-Data-Relevant Variables to FIML-Based Structural Equation Models. Struct Equ Modeling 2003, 10(1):80-100. 10.1207/S15328007SEM1001_4



Dong, Y., Peng, CY.J. Principled missing data methods for researchers. SpringerPlus 2 , 222 (2013). https://doi.org/10.1186/2193-1801-2-222



Statistical primer: how to deal with missing data in scientific research†


Grigorios Papageorgiou, Stuart W Grant, Johanna J M Takkenberg, Mostafa M Mokhles, Statistical primer: how to deal with missing data in scientific research?, Interactive CardioVascular and Thoracic Surgery , Volume 27, Issue 2, August 2018, Pages 153–158, https://doi.org/10.1093/icvts/ivy102


Missing data are a common challenge encountered in research and can compromise the results of statistical inference when not handled appropriately. This paper aims to introduce basic concepts of missing data to a non-statistical audience, to list and compare some of the most popular approaches for handling missing data in practice and to provide guidelines and recommendations for dealing with and reporting missing data in scientific research. Complete case analysis and single imputation are simple approaches for handling missing data and are popular in practice; however, in most cases they are not guaranteed to provide valid inferences. Multiple imputation is a robust and general alternative that is appropriate for data missing at random and overcomes the disadvantages of the simpler approaches, but it should always be conducted with care. The aforementioned approaches are illustrated and compared in an example application using Cox regression.

Missing data are a common challenge encountered by researchers undertaking clinical research. They can occur in all types of studies, including randomized controlled trials, cohort studies, case–control studies and clinical registries. The optimum approach is to devise strategies that keep the amount of missing data in a study as small as possible. Such strategies are commonly utilized in prospectively designed clinical trials, because if statistical assumptions about the missing data are required, the protection of randomization breaks down and unbiased estimates of the treatment effect are lost. Strategies to minimize missing data may also be employed in large multicentre cohort or registry studies; however, data desired for research purposes may often be missing because of the retrospective nature of the study or because the data fall outside the primary purpose of the registry [1, 2].

Dealing with missing data may be low on a researcher's list of priorities when undertaking a study, but it is a vital step in data analysis, as inappropriate handling of missing data can lead to a variety of problems. These include a loss of statistical power, loss of representation of key subgroups of the cohort, biased or inaccurate estimates of treatment effects and increased complexity of the statistical analysis.

To ensure that missing data are handled appropriately, there are a number of steps to follow: first, taking any necessary steps to complete or reduce the amount of missing data wherever possible; second, understanding the mechanism behind the remaining missing data; third, handling the missing data using appropriate methodology and finally, performing sensitivity analyses where appropriate. Focusing primarily on the framework of missing covariate data in non-randomized studies, this article introduces the concept behind different types of missing data and compares some of the most popular approaches for handling missing data in practice. Guidelines and recommendations for dealing with and reporting missing data in scientific research are also presented along with a simulated exercise on handling missing data.

Missing data mechanisms

Before discussing methods for handling missing data, it is important to review the types of missingness. Commonly, these are classified as missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [ 3 ]. An analysis of missing data patterns across contributing participants or centres, over time, or between key treatment groups should be performed to establish the mechanisms behind the missing data [ 1 ].

Missing completely at random

Observations of all subjects are equally likely to be missing. That is, there are no systematic differences between subjects with observed and unobserved values meaning that the observed values can be treated as a random sample of the population. For example, echocardiographic measurements might be missing due to sporadic ultrasound malfunction.

Missing at random

The likelihood of a value being missing depends on other, observed variables. Hence, any systematic difference between missing and observed values can be attributed to the observed data; that is, the relationships observed in the data at hand can be utilized to ‘recover’ the missing data. For example, missing echocardiographic measurements might be more normal than the observed ones because younger patients are more likely to miss an appointment.

Missing not at random

The likelihood of a value being missing depends on the (unobserved) value itself, and thus systematic differences between the missing and the observed values remain even after accounting for all other available information. In other words, there is extra information associated with the missing data that cannot be recovered by utilizing the relationships observed in the data. For example, missing echocardiographic measurements might be worse than the observed ones because patients with severe valve disease are unable to visit the hospital and therefore miss clinic visits.
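These three mechanisms can be made concrete with a small simulation. The sketch below is illustrative Python (the article's own analysis uses R); the dataset, cut-offs and deletion probabilities are all invented for the example:

```python
import random

random.seed(1)

# Hypothetical data: patient ages and an echocardiographic measurement
# that worsens (increases) with age.
ages = [random.uniform(20, 80) for _ in range(1000)]
echo = [0.5 * a + random.gauss(0, 5) for a in ages]

def delete_mcar(values, p=0.4):
    # MCAR: every value has the same probability of being missing.
    return [None if random.random() < p else v for v in values]

def delete_mar(values, ages, mean_age):
    # MAR: missingness depends only on an *observed* variable (age here),
    # not on the echo value itself.
    return [None if (a < mean_age and random.random() < 0.8) else v
            for v, a in zip(values, ages)]

def delete_mnar(values, threshold):
    # MNAR: missingness depends on the unobserved value itself
    # (patients with worse measurements are more likely to be missing).
    return [None if (v > threshold and random.random() < 0.8) else v
            for v in values]

full_mean = sum(echo) / len(echo)
mnar_obs = [v for v in delete_mnar(echo, sorted(echo)[650]) if v is not None]
# Under MNAR the observed values systematically understate the truth.
print(full_mean, sum(mnar_obs) / len(mnar_obs))
```

Under MCAR the observed values remain a random sample of the population; under MAR the bias can be removed by conditioning on age; under MNAR no adjustment based on the observed data alone recovers the full-data mean.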

Summary of missing data mechanisms

| Missing data mechanism | Related to | Not related to | Probability to be missing | Valid analysis |
|---|---|---|---|---|
| MCAR | | Observed or missing data | Equal for every data point | Complete case analysis, single and multiple imputation |
| MAR | Observed data | Missing data | Equal for data points within groups | Multiple imputation |
| MNAR | Missing data | | Unequal and unknown | Sensitivity analysis |

MAR: missing at random; MCAR: missing completely at random; MNAR: missing not at random.

Methods for handling missing data

There are various approaches to analysing incomplete data. Two common approaches encountered in practice are complete case analysis and single imputation. Although these approaches are easily implemented, they may not be statistically valid and can result in bias when the data are not missing completely at random [5, 6]. On the other hand, multiple imputation is a more general approach that overcomes the main disadvantages of the simpler approaches when data are missing (completely) at random [7–9].

Complete case analysis

The easiest way to deal with missing data is to drop all cases that have one or more values missing in any of the variables required for analysis. Although under MCAR this does not lead to bias of the results, it may result in significant loss of data and associated loss of power (e.g. wider confidence intervals) because the sample size is reduced. The extent of this loss of power is associated with the amount of missing data. If the data are MAR, this approach will lead to biased results. Complete case analysis may be appropriate for missing data related to the primary outcome of the study.
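As a minimal sketch (illustrative Python with a made-up three-row dataset, not the study data), complete case analysis is a single filter, and the dropped rows are the cost:

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 63, "gradient": 12.0, "died": 0},
    {"age": 55, "gradient": None, "died": 1},  # dropped: gradient missing
    {"age": 71, "gradient": 18.5, "died": 0},
]

# Listwise deletion: keep only rows with no missing value in any
# variable required for the analysis.
complete_cases = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete_cases))  # 2 of the 3 rows remain
```

Every analysis variable must be observed for a row to survive, so the effective sample size (and hence power) shrinks as the number of analysis variables with missing values grows.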

Single imputation

Alternatively, the missing values in a variable could be replaced with a single value that is thought to best represent the mechanism of the missing data. This could be the mean of a normally distributed continuous variable, the median or mode of a categorical variable, the predicted value from a regression equation (that is, utilizing the complete observations to predict the values of the missing observations) or the best/worst observation carried forward. There may also be cases where data on a risk factor are believed to be missing precisely because the risk factor was absent; in this situation, it may be reasonable to impute the absence of the risk factor.

Although this approach allows the researcher to include all subjects in the analysis, it may lead to biased results. Moreover, the uncertainty of parameter estimates of the imputed variables will not necessarily improve when compared with the complete case analysis because the imputation is not conditional on the values of the outcome variable. How large the induced bias is depends on the variability of the imputed variable and on the proportion of missing values. Single imputation is also invalid under MAR since it does not account for the inter-relationships between the variables of interest. Single imputation may, however, be used to perform sensitivity analyses for missing covariate information or primary outcome data to ensure that the reported results are valid under the worst or best-case scenario.
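A minimal sketch of single mean imputation (illustrative Python; the gradient values are invented) makes the uncertainty problem visible: every gap receives the same fill value, so the imputed sample is artificially concentrated around the mean:

```python
from statistics import mean, variance

def mean_impute(values):
    # Replace each missing (None) entry with the mean of the observed entries.
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

gradients = [12.0, None, 18.5, None, 9.0, 14.0]
imputed = mean_impute(gradients)

observed_only = [v for v in gradients if v is not None]
# The imputed sample has the same mean as the observed values but a
# smaller variance, which understates the uncertainty of downstream estimates.
print(mean(imputed), variance(imputed), variance(observed_only))
```

The shrunken variance is exactly why standard errors computed from a singly imputed dataset are too small: the imputation pretends the filled-in values were known with certainty.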

Multiple imputation

Multiple imputation offers an alternative to overcome the disadvantages of the complete case analysis or single imputation approach. It allows the uncertainty, which is due to missing data, to be appropriately considered and can be thought of in three distinct steps: imputation, analysis and pooling of the results.

At the first step of imputation, multiple copies of the original incomplete dataset are generated. In each dataset, the missing values are replaced by values which are randomly sampled from the predictive distribution of the observed data, conditional on all other variables. The process of sampling induces variation in the imputed values which reflects the uncertainty of those imputed values.

In the analysis step, the model of interest is fitted to each imputed dataset. The results derived from each analysis will differ slightly due to the variability of the imputed values. In the third step, the results are pooled by taking the average of the estimates from the separate analyses to derive the pooled estimate and by applying Rubin’s rules, which incorporate the within and between imputation uncertainty, to derive the associated standard errors. More details on Rubin’s rules and the formulas that are used to obtain the pooled estimates can be found in Supplementary Material A.
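The pooling step is short enough to write out. The sketch below (illustrative Python; the per-imputation estimates are invented) applies the standard Rubin's rules formulas: the pooled estimate is the average of the m estimates, and the total variance is W + (1 + 1/m)B, where W is the average within-imputation variance and B the between-imputation variance of the estimates:

```python
from statistics import mean

def pool_rubin(estimates, variances):
    # estimates: one parameter estimate per imputed dataset
    # variances: the corresponding squared standard errors
    m = len(estimates)
    q_bar = mean(estimates)                       # pooled estimate
    w = mean(variances)                           # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total_var = w + (1 + 1 / m) * b
    return q_bar, total_var ** 0.5                # estimate and pooled SE

# e.g. log hazard ratios and squared SEs from m = 5 imputed datasets
est, se = pool_rubin([0.42, 0.45, 0.40, 0.44, 0.43],
                     [0.010, 0.011, 0.009, 0.010, 0.010])
print(est, se)
```

The pooled standard error exceeds the average within-imputation standard error, reflecting the added between-imputation uncertainty; full implementations (e.g. the pooling routines in mice) also return the degrees of freedom needed for confidence intervals.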

Comparison of incomplete data analysis methods

| Methods | Pros | Cons |
|---|---|---|
| Complete case analysis | Simple to implement | Loss of power and efficiency; invalid under MAR |
| Single mean imputation | Simple to implement; avoids loss of power | Does not appropriately account for uncertainty in the results; invalid under MAR |
| Multiple imputation | Avoids loss of power; retains efficiency; valid under MAR | Time consuming; requires more statistical knowledge |

MAR: missing at random.

It is important to note that generally none of the above methods will provide valid results under MNAR. Methods for dealing with MNAR data are very limited and usually complex. They are typically based on the idea of sensitivity analysis under various MNAR scenarios, for example, assuming the worst possible or best possible value for the missing data. Commonly, the goal of such sensitivity analyses is to help in assessing the robustness of the results under plausible MNAR scenarios. Multiple imputation can also potentially be used to perform sensitivity analyses if data are MNAR [ 10 ].

Multiple imputation: considerations and limitations

Multiple imputation is a general approach with numerous applications, and it is easily accessible through standard statistical software packages such as R [11], SPSS®, SAS® and STATA®. However, it should be highlighted that multiple imputation is not a panacea for every incomplete data setting [12, 13]. Although it is often regarded as an out-of-the-box method that can be applied to any missing data problem, this is not the case. Its application requires the user to carefully consider the plausibility of each possible cause of missingness, to thoroughly select an appropriate imputation model and to choose variables to include with respect to both clinical relevance and the missing at random assumption.

Missing outcome information: Up to this point, this article has focused primarily on missing covariate information. This is because when outcome data are missing, it has been argued that complete case analysis is more appropriate, as imputed outcome data can lead to misleading results [14, 15]. Single imputation of the worst or best-case scenario for missing outcome data may be used as a sensitivity analysis to ensure the validity of trial results. Multiple imputation of missing outcome data may also be performed if there are auxiliary variables that are highly correlated with both the outcome and the probability that the outcome is missing. However, this can only help to reduce the loss in accuracy of the estimates due to missing data, and only if the data are at most MAR. Nevertheless, complete case analysis should be regarded as the principal analysis in the case of missing outcome data.

The number of imputed datasets: Although 5 imputed datasets are often considered adequate, it is advisable to increase the number to improve the efficiency and the reproducibility of the results [13].

The number of iterations: Since multiple imputation is based on an iterative algorithm, the convergence criteria should always be assessed and if necessary, the number of iterations increased [ 7 , 10 ].

Inclusion of the outcome in the imputation model: The outcome should be included in the first step of the multiple imputation procedure to take into account the association between outcome and incomplete covariates [ 16 ].

Longitudinal studies: Common software packages usually require the transformation of long datasets (a row per measurement) to their wide (a row per subject) counterparts to perform multiple imputation. This implies that current implementation of multiple imputation in longitudinal settings works best in balanced studies (e.g. subjects are measured at the same time points).

Survival analysis: Because of the complex nature of the outcome variable in such cases (the pairing of a binary event indicator with a time-to-event variable), several approaches have been proposed for including it in the imputation model [17–19]. The most recent research findings, however, propose using the Nelson–Aalen estimator along with the event indicator in the imputation model, rather than the event indicator along with the time-to-event variable [20].

Acceptable amount of missingness: There is no standard rule of how much missing data is too much. Theoretically, multiple imputation can handle large amounts of missingness. Nevertheless, the quality of the results is related to the complexity of the imputation model used, whether there are few or many variables with a large amount of missingness, the total sample size and the variability of the variables which are subject to missingness. For example, 50% missingness may be acceptable if the remaining 50% of the data allow accurate estimation of the predictive distribution used to draw imputed values. In settings with a small sample size, large variability and/or a heterogeneous study population, this may not be the case.

Given the potential complexities, it is clear that multiple imputation should be conducted carefully with respect to the challenges of each analysis. Advice from statistical experts is, therefore, highly recommended when considering multiple imputation to address missing data.

Guidelines for reporting incomplete data analysis in scientific manuscripts

Data example

To illustrate the above points with a data example, we consider a simple scenario for survival analysis. The data come from a follow-up study of patients with congenital heart disease who received a human tissue allograft in the aortic position. The aim is to investigate the association between postoperative aortic gradient (mmHg) and risk of death while accounting for baseline factors such as age at operation (years), gender, donor age (years) and allograft diameter (mm). An overview of the data for the ‘all cases’ scenario (before excluding any case to artificially generate missing data scenarios) is provided in the Supplementary Material B, Table S4 .

To briefly illustrate a few of the points presented throughout this article, we artificially generated 40% missingness in the postoperative aortic gradient under the 3 missingness scenarios: MCAR, MAR and MNAR. Under MCAR, randomly chosen values were deleted. Under MAR, the aortic gradient measurements of younger patients (age less than the mean age in the dataset) were deleted. Finally, for MNAR, the deleted values were those of patients with a high postoperative aortic gradient (higher than the 65th percentile of the postoperative gradient in the dataset), assuming that they are more likely to be unable to go to the hospital. We then applied Cox regression using complete case analysis, single mean imputation and multiple imputation under each missing data scenario and compared the corresponding results with those obtained when all the cases were used (no missing data). The analysis was conducted in R [11] using the packages mice [10] and mitools [21]. A sample R code for conducting multiple imputation in R is given in Supplementary Material C.

The results are summarized in Fig.  1 , where the red dot and black lines represent the estimated hazard ratios and their corresponding 95% confidence intervals, respectively. As shown in this figure, under MCAR and MAR, multiple imputation provided results that were slightly closer to those of the complete data (before values were removed; ‘all cases’) than the results from the simpler approaches for this specific example. Nevertheless, the differences are small, and both the complete case analysis and single mean imputation are theoretically valid under MCAR. The loss in efficiency due to the reduced sample size when using only the complete cases is evident from the wider confidence intervals. Under MNAR, all approaches provided biased estimates. In this situation, further sensitivity analyses or explicit accounting of the missing data mechanism would be required [ 8 ].

Figure 1: Hazard ratios and 95% confidence intervals of ‘all cases’, complete case, single mean imputation and multiple imputation analyses under the 3 missing data mechanisms. MAR: missing at random; MCAR: missing completely at random; MNAR: missing not at random.

Missing data are common in clinical research and should be minimized wherever possible through good study design and data collection protocols. However, in most cases, it is not possible to reduce the amount of missing data to zero. As demonstrated in the example presented in this article, inappropriate handling of missing data can potentially lead to biased results or significant loss of power. Although simpler approaches in handling missing data such as the complete case analysis or single imputation may be appropriate if the amount of missing data is small and the mechanisms behind the missing data are clearly understood, in most cases multiple imputation is accepted as the preferred strategy for handling missing data. Although multiple imputation deals with a number of pitfalls related to complete case analysis or single imputation, it does significantly increase the complexity of the analysis and can potentially lead to bias if the data are not missing at random.

It is important to approach the handling of missing data in a systematic manner and clearly report the steps that have been undertaken in the handling of missing data as outlined in the guidelines in Table 3 . Although this article is intended to give an overview for clinicians on how to handle missing data, it is strongly recommended that complex approaches to handle missing data should be performed under the guidance of a statistician.

Supplementary material is available at ICVTS online.

We would like to thank Nicole S. Erler for the helpful discussions and valuable comments.

M.M.M. is funded by a NWO Veni grant of the Netherlands Organisation for Scientific Research (NWO 916.160.87).

Conflict of interest: none declared.

† Presented at the 31st Annual Meeting of the European Association for Cardio-Thoracic Surgery, Vienna, Austria, 7–11 October 2017.

Bell ML, Kenward MG, Fairclough DL, Horton NJ. Differential dropout and bias in randomised controlled trials: when it matters and when it may not. BMJ 2013;346:e8668.

Little RJ, D'Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT et al. The prevention and treatment of missing data in clinical trials. N Engl J Med 2012;367:1355–60.

Little RJA, Rubin DB. Statistical Analysis with Missing Data, 2nd edn, Chapter 1. Hoboken, NJ: Wiley, 2002.

Enders CK. Applied Missing Data Analysis, Chapter 1. New York: Guilford Press, 2010.

Carpenter J, Kenward MG. A critique of common approaches to missing data. In: Missing Data in Randomised Controlled Trials—A Practical Guide. Birmingham, AL: National Institute for Health Research, 2007.

Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol 1991;134:895–907.

van Buuren S. Flexible Imputation of Missing Data, Chapters 2, 5. Boca Raton, FL: Taylor & Francis, 2012.

Carpenter JR, Kenward MG. Multiple Imputation and Its Application, Chapters 1, 2, 10. Chichester, West Sussex, United Kingdom: John Wiley & Sons, Ltd, 2013.

Schafer JL. Analysis of Incomplete Multivariate Data, Chapter 4. London: Chapman and Hall, 1997.

van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw 2011;45:1–67.

R Core Team (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Erler NS, Rizopoulos D, van Rosmalen J, Jaddoe V, Franco O, Lesaffre E. Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full Bayesian approach. Stat Med 2016;35:2955–74.

White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance in practice. Stat Med 2011;30:377–99.

Von Hippel PT. Regression with missing Ys: an improved strategy for analyzing multiply imputed data. Sociol Methodol 2007;37:83–117.

Little RJA. Regression with missing X's: a review. J Am Stat Assoc 1992;87:1227–37.

Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol 2006;59:1092–101.

Barzi F, Woodward M. Imputations of missing values in practice: results from imputations of serum cholesterol in 28 cohort studies. Am J Epidemiol 2004;160:34–45.

van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999;18:681–94.

Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epidemiol 2003;56:28–37.

White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009;28:1982–98.

Lumley T. mitools: Tools for Multiple Imputation of Missing Data. R Package Version 2.3. 2014. https://CRAN.R-project.org/package=mitools (20 February 2018, date last accessed).

Keywords: multiple imputation; missing data; missing at random


Handling Missing Data

  • First Online: 23 March 2016

Thom Baguley and Mark Andrews

Part of the book series: Human–Computer Interaction Series (HCIS)

This chapter provides an overview of the topic of missing data. We introduce the main types of missing data that can occur in practice and discuss the practical consequences of each of these types for general data analysis. We then describe general and practical solutions to the problem of missing data, discussing common but flawed approaches as well as more powerful approaches such as multiple imputation, which is an approach to dealing with missing data that is suitable for many—although not all—situations. Finally, we consider the topic of missing data as part of statistical inference more generally, and how it can be handled in both maximum likelihood and Bayesian approaches to inference.


Editor note: In this section the authors discuss Bayesian methods. These have not yet been covered in the previous chapters. The basic ideas behind Bayesian inference are introduced in Chap. 8. We recommend those readers who are totally unaware of Bayesian methods to read Chap. 8 before proceeding.

Continuing with the notation introduced in Sect. 4.2.1, here we will denote the fully observed variables in our data by x , the partially observed variables by \(y = (y^{\text {obs}}, y^{\text {mis}})\) , and we will index the missing variables in y by I . We can also assume that any or all of x , y and I may be multivariate arrays.

Azur MJ, Stuart EA, Frangakis C, Leaf PJ (2011) Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 20(1):40–49

van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3)

Collins LM, Schafer JL, Kam C-M (2001) A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods 6(4):330

Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38

Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741

Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576

Graham JW, Olchowski AE, Gilreath TD (2007) How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci 8(3):206–213

Little RJ, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York

Little RJ, Smith PJ (1987) Editing and imputation for quantitative survey data. J Am Stat Assoc 82(397):58–68

Marlin BM, Zemel RS, Roweis ST, Slaney M (2011) Recommender systems, missing data and statistical model estimation. In: IJCAI proceedings—international joint conference on artificial intelligence, vol 22, pp 2686–2691

Mohan K, Pearl J, Tian J (2013) Graphical models for inference with missing data. In: Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K (eds) Advances in neural information processing systems, vol 26. Curran Associates, Inc., pp 1277–1285

Peugh JL, Enders CK (2004) Missing data in educational research: a review of reporting practices and suggestions for improvement. Rev Educ Res 74(4):525–556

Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G (2009) Bayesian t tests for accepting and rejecting the null hypothesis. Psychon Bull Rev 16(2):225–237

Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592

Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York

Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147

Su Y-S, Yajima M, Gelman AE, Hill J (2011) Multiple imputation with diagnostics (mi) in R: opening windows into the black box. J Stat Softw 45(2):1–31

Author information

Authors and affiliations

Division of Psychology, Nottingham Trent University, Nottingham, UK

Thom Baguley & Mark Andrews

Corresponding author: Thom Baguley

Editor information

Editors and affiliations

Moray School of Education, Edinburgh University, Edinburgh, United Kingdom

Judy Robertson

Donders Centre for Cognition, Radboud University Nijmegen, Nijmegen, The Netherlands

Maurits Kaptein

Copyright © 2016 Springer International Publishing Switzerland

About this chapter

Baguley, T., Andrews, M. (2016). Handling Missing Data. In: Robertson, J., Kaptein, M. (eds) Modern Statistical Methods for HCI. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-319-26633-6_4

Published: 23 March 2016. Print ISBN 978-3-319-26631-2; Online ISBN 978-3-319-26633-6.

Missing Data Problems in Machine Learning

Benjamin M. Marlin, 2008


Investigating statistical approaches to handling missing data in the context of the Gateshead Millennium Study

Gordon, Claire Ann (2010) Investigating statistical approaches to handling missing data in the context of the Gateshead Millennium Study. MSc(R) thesis, University of Glasgow.


A commonly occurring problem in all kinds of studies is that of missing data. These missing values can occur for a number of reasons, including equipment malfunctions and, more typically, subjects recruited to a study not participating fully. In particular, in a longitudinal study, one or more of the repeated measurements on a subject might be missing. The way in which missing values are dealt with depends on the data analyst's experience with statistical techniques. The most common way in which data analysts proceed is to use the complete case analysis method, i.e. removing cases with missing values for any of the variables and running the analysis on the remaining cases. Although this method is very straightforward to implement and is used by the vast majority of data analysts, it can lead to biased results unless data are missing completely at random. Complete Case analysis can dramatically reduce the sample size of the study, as only those cases for which all variables are measured are included in the analysis. Therefore the complete case analysis method is "not generally recommended" (Diggle et al., 2002). Alternative approaches to the complete case analysis method involve filling in (or imputing) values for the incomplete cases, making "more efficient use of the available data" (Schafer, 1997). The purpose of this thesis is to compare and contrast the results obtained from analysing the relationship between growth and feeding behaviour in the first year of life using the complete case analysis and three imputation methods: single hot-decking, multiple hot-decking and the EM algorithm. The data used in this research come from the Gateshead Millennium Study, a prospective study of a cohort of just over 1,000 babies. In practical terms, the purpose of the work is to confirm the conclusions from the published complete-case analysis. 
It is of more theoretical interest to determine which imputation method is the most appropriate for dealing with missing data in this study. Chapter 1 provides an introduction to the problem of missing data and how they may arise and a description of the Gateshead Millennium Study data, to which all the missing data methods will be applied. It concludes by giving the aims of this thesis. Chapter 2 provides an in depth review of various missing data approaches and indicates which characteristics of the missing data have to be considered in order to determine which of these approaches can be employed to deal with the missing values. Also in Chapter 2, various aspects of the Gateshead Millennium Study data are reviewed. Measures of growth and feeding behaviour in the first year of life are described as these are important variables in the published analysis. Chapter 3 assesses how complete the Gateshead Millennium Study data is by producing a detailed description of each of the questions in each of the questionnaires. This is achieved by examining the Wave Non-response, Section Non-response and Item Non-response for each of the six questionnaires. Chapter 4 recreates the results from the complete case analyses for the relationship between development of growth and feeding in the first year of life which have already been performed and published in the paper - How Does Maternal and Child Feeding Behaviour Relate to Weight Gain and Failure to Thrive? Data From a Prospective Birth Cohort (Wright et al., 2006a). This chapter also gives insight as to whether or not it is appropriate to assume that the missing data mechanism is MCAR and therefore whether or not it is reasonable to believe the results obtained from the complete case analysis. Chapter 5 focusses on the various methods used to impute the missing values in the Gateshead Millennium Study data. This chapter begins by considering the EM Algorithm. 
It gives details of how the EM Algorithm was performed and the results obtained. In addition to the EM Algorithm, this chapter also considers the procedures and results for Single Imputation and Multiple Imputation by hot-decking. This chapter concludes by comparing the results of these methods to one another and also to the complete case analysis results from Chapter 4. Finally, Chapter 6 provides a summary of the results from the various missing data methods applied and discusses various alternative methods which could also have been performed.
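Hot-deck imputation, one of the methods the thesis compares, replaces each missing value with a value drawn from an observed "donor" case. The following is a minimal illustrative Python sketch of random hot-decking, not the thesis's own code; the data and names are invented:

```python
# Minimal sketch of random hot-deck imputation (illustrative only).
import random

def hot_deck_impute(values, seed=0):
    """Replace each missing entry (None) with a value drawn at random
    from the observed entries (the 'donor pool')."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]

weights = [3.4, None, 4.1, 3.8, None, 4.0]   # toy data, not from the study
completed = hot_deck_impute(weights)
```

In practice, donors are usually drawn from within strata of cases similar to the recipient, and multiple hot-decking repeats the draw several times to reflect imputation uncertainty.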


Item Type: Thesis (MSc(R))
Qualification Level: Masters
Additional Information: The questionnaires in Appendix A of this thesis are the intellectual property of the Gateshead Millennium Study Team.
Keywords: Missing Data, Missing Data Mechanisms, Complete Case Analysis, EM Algorithm, Hot-deck Imputation, Multiple Imputation, Gateshead Millennium Study
Supervisor's Name: McColl, Prof. John
Date of Award: 2010
Unique ID: glathesis:2010-2312
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 05 Jan 2011

Principled missing data methods for researchers

Chao-Ying Joanne Peng, Indiana University-Bloomington, Bloomington, Indiana, USA

SpringerPlus

The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discussed and demonstrated three principled missing data methods: multiple imputation, full information maximum likelihood, and expectation-maximization algorithm, applied to a real-world data set. Results were contrasted with those obtained from the complete data set and from the listwise deletion method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on the importance of statistical assumptions, and recommendations for researchers. Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.

Missing data are a rule rather than an exception in quantitative research. Enders (2003) stated that a missing rate of 15% to 20% was common in educational and psychological studies. Peng et al. (2006) surveyed quantitative studies published from 1998 to 2004 in 11 education and psychology journals. They found that 36% of studies had no missing data, 48% had missing data, and about 16% could not be determined. Among studies that showed evidence of missing data, 97% used the listwise deletion (LD) or the pairwise deletion (PD) method to deal with missing data. These two methods are ad hoc and notorious for biased and/or inefficient estimates in most situations (Rubin 1987; Schafer 1997). The APA Task Force on Statistical Inference explicitly warned against their use (Wilkinson and the Task Force on Statistical Inference 1999, p. 598). Newer and principled methods, such as the multiple-imputation (MI) method, the full information maximum likelihood (FIML) method, and the expectation-maximization (EM) method, take into consideration the conditions under which missing data occurred and provide better estimates for parameters than either LD or PD. Principled missing data methods do not replace a missing value directly; they combine available information from the observed data with statistical assumptions in order to estimate the population parameters and/or the missing data mechanism statistically.

A review of the quantitative studies published in Journal of Educational Psychology (JEP) between 2009 and 2010 revealed that, out of 68 articles that met our criteria for quantitative research, 46 (or 67.6%) articles explicitly acknowledged missing data, or were suspected to have some due to discrepancies between sample sizes and degrees of freedom. Eleven (or 16.2%) did not have missing data and the remaining 11 did not provide sufficient information to help us determine if missing data occurred. Of the 46 articles with missing data, 17 (or 37%) did not apply any method to deal with the missing data, 13 (or 28.3%) used LD or PD, 12 (or 26.1%) used FIML, four (or 8.7%) used EM, three (or 6.5%) used MI, and one (or 2.2%) used both the EM and the LD methods. Of the 29 articles that dealt with missing data, only two explained their rationale for using FIML and LD, respectively. One article misinterpreted FIML as an imputation method. Another was suspected to have used either LD or an imputation method to deal with attrition in a PISA data set (OECD 2009; Williams and Williams 2010).

Compared with missing data treatments by articles published in JEP between 1998 and 2004 (Table 3.1 in Peng et al. 2006), there has been improvement in the decreased use of LD (from 80.7% down to 21.7%) and PD (from 17.3% down to 6.5%), and an increased use of FIML (from 0% up to 26.1%), EM (from 1.0% up to 8.7%), or MI (from 0% up to 6.5%). Yet several research practices still prevailed from a decade ago, namely, not explicitly acknowledging the presence of missing data, not describing the particular approach used in dealing with missing data, and not testing assumptions associated with missing data methods. These findings suggest that researchers in educational psychology have not fully embraced principled missing data methods in research.

Although treating missing data is usually not the focus of a substantive study, failing to do so properly causes serious problems. First, missing data can introduce potential bias in parameter estimation and weaken the generalizability of the results (Rubin 1987; Schafer 1997). Second, ignoring cases with missing data leads to the loss of information, which in turn decreases statistical power and increases standard errors (Peng et al. 2006). Finally, most statistical procedures are designed for complete data (Schafer and Graham 2002). Before a data set with missing values can be analyzed by these statistical procedures, it needs to be edited in some way into a “complete” data set. Failing to edit the data properly can make the data unsuitable for a statistical procedure and the statistical analyses vulnerable to violations of assumptions.

Because of the prevalence of the missing data problem and the threats it poses to statistical inferences, this paper is interested in promoting three principled methods, namely, MI, FIML, and EM, by illustrating these methods with an empirical data set and discussing issues surrounding their applications. Each method is demonstrated using SAS 9.3. Results are contrasted with those obtained from the complete data set and the LD method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on assumptions associated with these principled methods and recommendations for researchers. The remainder of this paper is divided into the following sections: (1) Terminology, (2) Multiple Imputation (MI), (3) Full Information Maximum-Likelihood (FIML), (4) Expectation-Maximization (EM) Algorithm, (5) Demonstration, (6) Results, and (7) Discussion.

Terminology

Missing data occur at two levels: at the unit level or at the item level. A unit-level non-response occurs when no information is collected from a respondent. For example, a respondent may refuse to take a survey, or may not show up for the survey. While unit non-response is an important and common problem to tackle, it is not the focus of this paper. This paper focuses on the problem of item non-response. An item non-response refers to incomplete information collected from a respondent. For example, a respondent may miss one or two questions on a survey, but answer the rest. The missing data problem at the item level needs to be tackled from three aspects: the proportion of missing data, the missing data mechanisms, and patterns of missing data. A researcher must address all three before choosing an appropriate procedure to deal with missing data. Each is discussed below.

Proportion of missing data

The proportion of missing data is directly related to the quality of statistical inferences. Yet there is no established cutoff in the literature regarding an acceptable percentage of missing data in a data set for valid statistical inferences. For example, Schafer (1999) asserted that a missing rate of 5% or less is inconsequential, whereas Bennett (2001) maintained that statistical analysis is likely to be biased when more than 10% of data are missing. Furthermore, the amount of missing data is not the sole criterion by which a researcher assesses the missing data problem. Tabachnick and Fidell (2012) posited that the missing data mechanisms and the missing data patterns have greater impact on research results than does the proportion of missing data.

Missing data mechanisms

According to Rubin (1976), there are three mechanisms under which missing data can occur: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). To understand missing data mechanisms, we partition the data matrix Y into two parts: the observed part (Y obs) and the missing part (Y mis). Hence, Y = (Y obs, Y mis). Rubin (1976) defined MAR to be a condition in which the probability that data are missing depends only on the observed Y obs, but not on the missing Y mis, after controlling for Y obs. For example, suppose a researcher measures college students’ understanding of calculus in the beginning (pre-test) and at the end (post-test) of a calculus course. Suppose further that students who scored low on the pre-test are more likely to drop out of the course; hence, their scores on the post-test are missing. If we assume that the probability of missing the post-test depends only on scores on the pre-test, then the missing mechanism on the post-test is MAR. In other words, for students who have the same pre-test score, the probability of their missing the post-test is random. To state the definition of MAR formally, let R be a matrix of missingness with the same dimension as Y. The elements of R are either 1 or 0, corresponding to Y being observed (coded as 1) or missing (coded as 0). If the distribution of R, written as P(R | Y, ξ), where ξ = missingness parameter, can be modeled as Equation 1, then the missing condition is said to be MAR (Schafer 1997, p. 11):

P(R | Y obs, Y mis, ξ) = P(R | Y obs, ξ)   (1)
In other words, the probability of missingness depends on only the observed data and ξ. Furthermore, if (a) the missing data mechanism is MAR and (b) the parameter of the data model (θ) and the missingness parameter ξ are independent, the missing data mechanism is said to be ignorable (Little and Rubin 2002). Since condition (b) is almost always true in real-world settings, ignorability and MAR (together with MCAR) are sometimes viewed as equivalent (Allison 2001).

Although many modern missing data methods (e.g., MI, FIML, EM) assume MAR, violation of this assumption should be expected in most cases (Schafer and Graham 2002). Fortunately, research has shown that violation of the MAR assumption does not seriously distort parameter estimates (Collins et al. 2001). Moreover, MAR is quite plausible when data are missing by design. Examples of missing by design include the use of multiple booklets in large-scale assessment, longitudinal studies that measure a subsample at each time point, and latent variable analysis in which the latent variable is missing with a probability of 1; therefore, the missing probability is independent of all other variables.
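The pre-test/post-test example can be made concrete with a small simulation (an illustrative Python sketch, not from the paper; all numbers are arbitrary): post-test values are deleted whenever the observed pre-test score is low, so missingness depends only on observed data (MAR), yet the complete-case mean of the post-test is biased upward.

```python
# Illustrative MAR simulation: dropout depends only on the observed pre-test.
import random

rng = random.Random(42)
pre = [rng.gauss(50, 10) for _ in range(5000)]        # observed pre-test scores
post = [p + rng.gauss(0, 5) for p in pre]             # true post-test scores

# Students with low pre-test scores drop out -> their post-test is missing.
# Missingness depends only on the observed pre-test, so the mechanism is MAR.
observed_post = [y for x, y in zip(pre, post) if x >= 45]

true_mean = sum(post) / len(post)
cc_mean = sum(observed_post) / len(observed_post)     # complete-case estimate
# cc_mean overestimates true_mean, because low scorers were dropped.
```

Under MAR, an analysis that conditions on the pre-test (e.g., imputing the post-test from the pre-test) can remove this bias, whereas simply discarding incomplete cases cannot.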

MCAR is a special case of MAR. It is a missing data condition in which the likelihood of missingness depends neither on the observed data Y obs, nor on the missing data Y mis. Under this condition, the distribution of R is modeled as follows:

P(R | Y obs, Y mis, ξ) = P(R | ξ)   (2)
If missing data meet the MCAR assumption, they can be viewed as a random sample of the complete data. Consequently, ignoring missing data under MCAR will not introduce bias, but will increase the SE of the sample estimates due to the reduced sample size. Thus, MCAR poses less threat to statistical inferences than MAR or MNAR.

The third missing data mechanism is MNAR. It occurs when the probability of missingness depends on the missing value itself. For example, missing data on an income variable are likely to be MNAR if high-income earners are more inclined to withhold this information than average- or low-income earners. In the case of MNAR, the missing mechanism must be specified by the researcher and incorporated into the data analysis in order to produce unbiased parameter estimates. This is a formidable task not required under MAR or MCAR.

The three missing data methods discussed in this paper are applicable under either the MCAR or the MAR condition, but not under MNAR. It is worth noting that including variables in the statistical inferential process that could explain missingness makes the MAR condition more plausible. Returning to the example of college students’ achievement in a calculus course: if the researcher did not collect students’ achievement data on the pre-test, the missingness on the post-test would not be MAR, because it would then depend on unobserved information alone. Thus, the literature on missing data methods often suggests including additional variables in a statistical model in order to make the missing data mechanism ignorable (Collins et al. 2001; Graham 2003; Rubin 1996).

The tenability of MCAR can be examined using Little’s multivariate test ( Little and Schenker 1995 ). However, it is impossible to test whether the MAR condition holds given only the observed data ( Carpenter and Goldstein 2004 ; Horton and Kleinman 2007 ; White et al. 2011 ). One can instead examine the plausibility of MAR with a simple t -test of mean differences between the group with complete data and the group with missing data ( Diggle et al. 1995 ; Tabachnick and Fidell 2012 ). Both approaches are illustrated with a data set at http://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Missing_Values.pdf . Yet Schafer and Graham ( 2002 ) criticized this practice of dummy coding missing values, because it redefines the parameters of the population. Readers should therefore be cautioned that the results of these tests do not provide definitive evidence of either MCAR or MAR.

Patterns of missing data

There are three patterns of missing data: univariate, monotone, and arbitrary; each is discussed below. Suppose there are p variables, denoted as Y 1 ,  Y 2 , …,  Y p . A data set is said to have a univariate missing data pattern if the same participants have missing data on one or more of the p variables. A data set is said to have a monotone missing data pattern if the variables can be arranged in such a way that, when Y j is missing, Y j  + 1 ,  Y j  + 2 , …,  Y p are missing as well. The monotone missing data pattern occurs frequently in longitudinal studies where, if a participant drops out at one point, his/her data are missing on all subsequent measures. For the treatment of missing data, the monotone missing data pattern subsumes the univariate missing data pattern. If missing data occur on any variable for any participant in a random fashion, the data set is said to have an arbitrary missing data pattern. Computationally, the univariate or the monotone missing data pattern is easier to handle than an arbitrary pattern.

Multiple Imputation (MI)

MI is a principled missing data method that provides valid statistical inferences under the MAR condition ( Little and Rubin 2002 ). MI was proposed to impute missing data while acknowledging the uncertainty associated with the imputed values ( Little and Rubin 2002 ). Specifically, MI acknowledges the uncertainty by generating a set of m plausible values for each unobserved data point, resulting in m complete data sets, each with one unique estimate of the missing values. The m complete data sets are then analyzed individually using standard statistical procedures, resulting in m slightly different estimates for each parameter. At the final stage of MI, m estimates are pooled together to yield a single estimate of the parameter and its corresponding SE . The pooled SE of the parameter estimate incorporates the uncertainty due to the missing data treatment (the between imputation uncertainty) into the uncertainty inherent in any estimation method (the within imputation uncertainty). Consequently, the pooled SE is larger than the SE derived from a single imputation method (e.g., mean substitution) that does not consider the between imputation uncertainty. Thus, MI minimizes the bias in the SE of a parameter estimate derived from a single imputation method.

In sum, MI handles missing data in three steps: (1) impute missing data m times to produce m complete data sets; (2) analyze each data set using a standard statistical procedure; and (3) combine the m results into one using formulae from Rubin ( 1987 ) or Schafer ( 1997 ). Below we discuss each step in greater detail and demonstrate MI with a real data set in the section Demonstration .

Step 1: imputation

The imputation step in MI is the most complicated step among the three steps. The aim of the imputation step is to fill in missing values multiple times using the information contained in the observed data. Many imputation methods are available to serve this purpose. The preferred method is the one that matches the missing data pattern. Given a univariate or monotone missing data pattern, one can impute missing values using the regression method ( Rubin 1987 ), or the predictive mean matching method if the missing variable is continuous ( Heitjan and Little 1991 ; Schenker and Taylor 1996 ). When data are missing arbitrarily, one can use the Markov Chain Monte Carlo (MCMC) method ( Schafer 1997 ), or the fully conditional specification (also referred to as chained equations) if the missing variable is categorical or non-normal ( Raghunathan et al. 2001 ; van Buuren 2007 ; van Buuren et al. 1999 ; van Buuren et al. 2006 ). The regression method and the MCMC method are described next.

The regression method for univariate or monotone missing data pattern

Suppose that there are p variables, Y 1 ,  Y 2 , …,  Y p , in a data set and missing data are uniformly or monotonically present from Y j to Y p , where 1 < j ≤ p . To impute the missing values for the j th variable, one first constructs a regression model using observed data on Y 1 through Y j  - 1 to predict the missing values on Y j : Y j = β 0 + β 1 Y 1 + … + β j - 1 Y j - 1 + ε.
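In code, this regression-imputation step can be sketched with numpy. The function name and the two-variable setup are illustrative; for brevity, the sketch adds only a random residual to each prediction, whereas Rubin's (1987) proper version would also draw the regression parameters from their posterior before imputing.

```python
import numpy as np

def regression_impute(y1, y2, rng):
    """Stochastic regression imputation for a monotone pattern:
    fill missing entries of y2 from a regression on the fully
    observed y1. Simplified sketch: proper MI (Rubin 1987) would
    also draw the regression parameters from their posterior."""
    obs = ~np.isnan(y2)
    X = np.column_stack([np.ones(obs.sum()), y1[obs]])
    beta, *_ = np.linalg.lstsq(X, y2[obs], rcond=None)
    resid = y2[obs] - X @ beta
    sigma = resid.std(ddof=2)                 # residual SD
    miss = ~obs
    pred = beta[0] + beta[1] * y1[miss]
    y2_filled = y2.copy()
    # add noise so imputed values are not all on the regression line
    y2_filled[miss] = pred + rng.normal(0, sigma, miss.sum())
    return y2_filled
```

The added residual noise is what distinguishes stochastic regression imputation from deterministic (conditional-mean) imputation, which would understate variability.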

The MCMC method for arbitrary missing pattern

When the missing data pattern is arbitrary, it is difficult to develop analytical formulae for the missing data. One has to turn to numerical simulation methods, such as MCMC ( Schafer 1997 ) in this case. The MCMC technique used by the MI procedure of SAS is described below [interested readers should refer to SAS/STAT 9.3 User’s Guide ( SAS Institute Inc 2011 ) for a detailed explanation].

Step 2: statistical analysis

The second step of MI analyzes the m sets of data separately using a statistical procedure of a researcher’s choice. At the end of the second step, m sets of parameter estimates are obtained from separate analyses of m data sets.

Step 3: combining results
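The combining formulae from Rubin ( 1987 ) average the m point estimates and combine the within- and between-imputation variances. A minimal numpy sketch (function name hypothetical), where each imputed data set contributes a point estimate and its squared SE:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m point estimates and their squared SEs (Rubin 1987):
    Qbar = mean estimate, W = within-imputation variance,
    B = between-imputation variance, T = W + (1 + 1/m) B.
    Assumes B > 0 (the df formula divides by B)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()
    w = u.mean()                          # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = w + (1 + 1 / m) * b               # total variance
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2  # Rubin's df
    return qbar, np.sqrt(t), df
```

Because the total variance T adds the between-imputation component B to W, the pooled SE exceeds the SE from any single imputation, which is exactly the property described above.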

MI-related issues

When implementing MI, the researcher needs to be aware of several practical issues: the multivariate normality assumption, the imputation model, the number of imputations, and the convergence of MCMC. Each is discussed below.

The multivariate normality assumption

The regression and MCMC methods implemented in statistical packages (e.g., SAS) assume multivariate normality for variables. It has been shown that MI based on the multivariate normal model can provide valid estimates even when this assumption is violated ( Demirtas et al. 2008 ; Schafer 1997 , 1999 ). Furthermore, this assumption is robust when the sample size is large and when the missing rate is low, although the definition for a large sample size or for a low rate of missing is not specified in the literature ( Schafer 1997 ).

When an imputation model contains categorical variables, one cannot use the regression method or MCMC directly. Techniques such as logistic regression and discriminant function analysis can substitute for the regression method if the missing data pattern is monotone or univariate. If the missing data pattern is arbitrary, MCMC based on other probability models (such as the joint distribution of normal and binary variables) can be used for imputation. The free MI software NORM developed by Schafer ( 1997 ) has two add-on modules (CAT and MIX) that deal with categorical data. Specifically, CAT imputes missing data for categorical variables, and MIX imputes missing data for a combination of categorical and continuous variables. Other software packages are also available for imputing missing values in categorical variables, such as the ICE module in Stata ( Royston 2004 , 2005 , 2007 ; Royston and White 2011 ), the mice package in R and S-Plus ( van Buuren and Groothuis-Oudshoorn 2011 ), and IVEware ( Raghunathan et al. 2001 ). Interested readers are referred to a special volume of the Journal of Statistical Software ( Yucel 2011 ) for recent developments in MI software.

When researchers use statistical packages that impose a multivariate normal distribution on categorical variables, a common practice is to impute missing values based on the multivariate normal model, then round each imputed value to the nearest integer or to the nearest plausible value. However, studies have shown that this naïve rounding does not provide desirable results for binary missing values ( Ake 2005 ; Allison 2005 ; Enders 2010 ). For example, Horton et al. ( 2003 ) showed analytically that rounding the imputed values led to biased estimates, whereas leaving them unrounded led to unbiased results. Bernaards et al. ( 2007 ) compared three approaches to rounding binary missing values: (1) rounding the imputed value to the nearest plausible value, (2) randomly drawing from a Bernoulli trial using the imputed value (truncated to lie between 0 and 1) as the probability of success, and (3) using an adaptive rounding rule based on the normal approximation to the binomial distribution. Their results showed that the second method was the worst at estimating the odds ratio, and the third method provided the best results. One merit of their study is that it is based on a real-world data set. However, other factors may influence the performance of the rounding strategies, such as the missing data mechanism, the size of the model, and the distributions of the categorical variables. These factors are not within a researcher’s control. Additional research is needed to identify one or more good strategies for dealing with categorical variables in MI when multivariate normal-based software is used to perform MI.
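The first two rounding strategies are easy to state in code (the adaptive rule is omitted here). A small numpy sketch with hypothetical function names:

```python
import numpy as np

def round_simple(imputed):
    """Strategy (1): round each normally imputed value to the nearest
    plausible binary value (0 or 1), after clipping to [0, 1]."""
    return (np.clip(imputed, 0, 1) >= 0.5).astype(int)

def round_bernoulli(imputed, rng):
    """Strategy (2): treat each imputed value, clipped to [0, 1], as a
    Bernoulli success probability and draw a 0/1 value at random."""
    p = np.clip(imputed, 0, 1)
    return (rng.random(len(p)) < p).astype(int)
```

Per Bernaards et al. (2007), the Bernoulli draw performed worst for odds ratios, so this sketch is illustrative of the contrast, not a recommendation.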

Unfortunately, even less is known about the effect of rounding in MI, when imputing ordinal variables with three or more levels. It is possible that as the level of the categorical variable increases, the effect of rounding decreases. Again, studies are needed to further explore this issue.

The imputation model

MI requires two models: the imputation model used in step 1 and the analysis model used in step 2. Theoretically, MI assumes that the two models are the same. In practice, they can be different ( Schafer 1997 ). An appropriate imputation model is the key to the effectiveness of MI; it should have the following two properties.

First, an imputation model should include useful variables. Rubin ( 1996 ) recommended a liberal approach when deciding whether a variable should be included in the imputation model. Schafer ( 1997 ) and van Buuren et al. ( 1999 ) recommended three kinds of variables for an imputation model: (1) variables that are of theoretical interest, (2) variables that are associated with the missing mechanism, and (3) variables that are correlated with the variables containing missing data. The latter two kinds are sometimes referred to as auxiliary variables ( Collins et al. 2001 ). The first kind is necessary because omitting these variables biases downward the relations between them and the other variables in the imputation model. The second kind makes the MAR assumption more plausible, because these variables account for the missing mechanism. The third kind helps to estimate missing values more precisely. Thus, each kind of variable makes a unique contribution to the MI procedure. However, including too many variables in an imputation model may inflate the variance of estimates or lead to non-convergence. Thus, researchers should carefully select the variables to include in an imputation model. van Buuren et al. ( 1999 ) recommended not including auxiliary variables that themselves have too many missing data. Enders ( 2010 ) suggested selecting auxiliary variables whose absolute correlations with the variables containing missing data exceed .4.
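Enders's (2010) rule of thumb can be sketched as a simple correlation filter; the function name and the pairwise-complete handling are illustrative choices, not a prescribed procedure.

```python
import numpy as np

def pick_auxiliaries(data, names, target, threshold=0.4):
    """Screen candidate auxiliary variables: keep those whose absolute
    correlation with the variable containing missing data exceeds the
    threshold (Enders 2010 suggests .4). Correlations are computed on
    pairwise-complete observations."""
    t = data[:, names.index(target)]
    picked = []
    for j, name in enumerate(names):
        if name == target:
            continue
        x = data[:, j]
        ok = ~np.isnan(t) & ~np.isnan(x)
        if ok.sum() > 2:
            r = np.corrcoef(x[ok], t[ok])[0, 1]
            if abs(r) > threshold:
                picked.append(name)
    return picked
```

In practice this screen would be applied per missing variable, and candidates with substantial missingness of their own would be dropped first, per van Buuren et al. (1999).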

Second, an imputation model should be general enough to capture the assumed structure of the data. If an imputation model is more restrictive than the analysis model, namely, imposing additional restrictions, one of two consequences may follow. One consequence is that the results are valid but the conclusions may be conservative (i.e., failing to reject a false null hypothesis), if the additional restrictions are true ( Schafer 1999 ). The other consequence is that the results are invalid, because one or more of the restrictions is false ( Schafer 1999 ). For example, a restriction may constrain the relationships between a variable and the other variables in the imputation model to be merely pairwise. In that case, any interaction effect that involves three or more variables will be biased toward zero. To handle interactions properly in MI, Enders ( 2010 ) suggested that the imputation model include the product of two variables if both are continuous. For categorical variables, Enders suggested performing MI separately for each subgroup defined by the combination of the levels of the categorical variables.

Number of imputations

However, methodologists have not agreed on the optimal number of imputations. Schafer and Olsen ( 1998 ) suggested that “in many applications, just 3–5 imputations are sufficient to obtain excellent results” (p. 548). Schafer and Graham ( 2002 ) were more conservative, asserting that 20 imputations are needed in many practical applications to remove noise from the estimates. Graham et al. ( 2007 ) commented that relative efficiency (RE) should not be an important criterion when specifying m , because RE has little practical meaning. Other quantities, such as the SE , the p -value, and statistical power, are more relevant to empirical research and should also be considered, in addition to RE. Graham et al. ( 2007 ) reported that statistical power decreased much faster than RE as λ (the fraction of missing information) increased and/or m decreased. In an extreme case in which λ = .9 and m = 3, the power for MI was only .39, while the power of an equivalent FIML analysis was .78. Based on these results, Graham et al. ( 2007 ) provided a table of the number of imputations needed, given λ and an acceptable power falloff, such as 1%. They defined the power falloff as the percentage decrease in power compared to an equivalent FIML analysis, or compared to m = 100. For example, to ensure a power falloff of less than 1%, they recommended m = 20, 40, 100, or > 100 for a true λ of .1, .5, .7, or .9, respectively. Their recommended m is much larger than what is derived from Rubin’s rule based on RE ( Rubin 1987 ). Unfortunately, Graham et al.’s study is limited to testing a small standardized regression coefficient (β = 0.0969) in a simple regression analysis. The power falloff of MI may be less severe when the true β is larger than 0.0969. At present, the literature does not shed light on the performance of MI when the regression model is more complex than a simple regression model.
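The relative efficiency discussed above follows Rubin's (1987) formula RE = (1 + λ/m)⁻¹, where λ is the fraction of missing information. A one-line sketch makes clear why RE alone favors small m:

```python
def relative_efficiency(lam, m):
    """Relative efficiency of m imputations versus infinitely many
    (Rubin 1987): RE = 1 / (1 + lambda / m), where lambda is the
    fraction of missing information."""
    return 1.0 / (1.0 + lam / m)

# Even with 50% missing information, m = 5 already gives RE ~ 0.91,
# which is why RE-based rules recommend very few imputations.
```

Because RE is already close to 1 for small m, it is insensitive to exactly the increases in m that Graham et al. (2007) found necessary to preserve statistical power.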

Recently, White et al. ( 2011 ) argued that, in addition to relative efficiency and power, researchers should also consider Monte Carlo error when specifying the number of imputations. Monte Carlo error is defined as the standard deviation of an estimate (e.g., a regression coefficient, test statistic, or p -value) “across repeated runs of the same imputation procedure with the same data” ( White et al. 2011 , p. 387). Monte Carlo error converges to zero as m increases. A small Monte Carlo error implies that the results from a particular run of MI could be reproduced in a subsequent repetition of the MI analysis. White et al. also suggested that the number of imputations should be greater than or equal to the percentage of missing observations in order to ensure an adequate level of reproducibility. For studies that compare different statistical methods, the number of imputations should be even larger than the percentage of missing observations, usually between 100 and 1000, in order to control the Monte Carlo error ( Royston and White 2011 ).
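The Monte Carlo error described by White et al. (2011) can be estimated by brute force. In this hypothetical sketch, a toy MI run (mean estimation with m = 5 imputations) is repeated with different seeds, and the standard deviation of the pooled estimates is reported:

```python
import numpy as np

def mc_error(run_mi, n_runs=20):
    """Monte Carlo error (White et al. 2011): the standard deviation of
    an estimate across repeated runs of the same imputation procedure
    on the same data, varied here only through the random seed."""
    ests = [run_mi(np.random.default_rng(seed)) for seed in range(n_runs)]
    return np.std(ests, ddof=1)

# A toy data vector with three missing values (np.nan).
y = np.array([1.2, 2.3, np.nan, 3.1, np.nan, 2.8, 1.9, np.nan, 2.5, 3.0])

def one_run(rng, m=5):
    """One hypothetical MI run: impute each missing value with a draw
    from the observed-data normal approximation, m times, and pool the
    m estimated means."""
    obs = y[~np.isnan(y)]
    means = []
    for _ in range(m):
        draws = rng.normal(obs.mean(), obs.std(ddof=1), np.isnan(y).sum())
        means.append(np.concatenate([obs, draws]).mean())
    return float(np.mean(means))
```

Raising m inside one_run shrinks the spread across seeds, which is the sense in which Monte Carlo error converges to zero as m increases.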

It is clear from the above discussions that a simple recommendation for the number of imputations (e.g., m = 5) is inadequate. For data sets with a large amount of missing information, more than five imputations are necessary in order to maintain the power level and control the Monte Carlo error. A larger imputation model may require more imputations, compared to a smaller or simpler model. This is so because a large imputation model results in increased SE s, compared to a smaller or simpler model. Therefore, for a large model, additional imputations are needed to offset the increased SE s. Specific guidelines for choosing m await empirical research. In general, it is a good practice to specify a sufficient m to ensure the convergence of MI within a reasonable computation time.

Convergence of MCMC

The convergence of the Markov chain is one of the determinants of the validity of the results obtained from MI. If the Markov chain does not converge, the imputed values cannot be considered random samples from the posterior distribution of the missing data given the observed data, i.e., P ( Y mis | Y obs ). Consequently, statistical results based on these imputed values are invalid. Unfortunately, the importance of assessing convergence is rarely mentioned in articles that review the theory and application of MCMC ( Schafer 1999 ; Schafer and Graham 2002 ; Schlomer et al. 2010 ; Sinharay et al. 2001 ). Because convergence is defined in terms of both probability and procedures, it is complex and difficult to determine whether MCMC has converged ( Enders 2010 ). One way to roughly assess convergence is to visually examine the trace plot and the autocorrelation plot; both are provided by SAS PROC MI ( SAS Institute Inc 2011 ). For a parameter θ , a trace plot displays the value of θ ( t ) on the vertical axis against the iteration number t on the horizontal axis. If the MCMC has converged, the trace plot shows no systematic trend. The autocorrelation plot displays the autocorrelations between the θ ( t ) s at lag k on the vertical axis against k on the horizontal axis. Ideally, the autocorrelation at any lag should not be statistically significantly different from zero. Since a Markov chain may converge at different rates for different parameters, one needs to examine these two plots for each parameter. When there are many parameters, one can instead examine the worst linear function (WLF; Schafer 1997 ). The WLF is a constructed statistic that converges more slowly than all other parameters in the MCMC method; thus, if the WLF converges, all parameters should have converged (see pp. 2–3 of the Appendix for an illustration of both plots for the WLF, accessible from https://oncourse.iu.edu/access/content/user/peng/Appendix.Dong%2BPeng.Principled%20missing%20methods.current.pdf ). Another way to assess the convergence of MCMC is to start the chain multiple times, each with a different initial value. If all the chains yield similar results, one can be confident that the algorithm has converged.
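The visual checks can be complemented by a crude numeric screen. In this sketch, the function names and the 0.1 cutoff are arbitrary illustrations, not a published criterion; trace plots should still be inspected.

```python
import numpy as np

def autocorr(chain, k):
    """Lag-k autocorrelation of a simulated parameter series."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    return float((x[:-k] @ x[k:]) / (x @ x))

def looks_converged(chain, max_lag=10, threshold=0.1):
    """Crude numeric analogue of the autocorrelation-plot check: flag a
    chain as converged only if no lag up to max_lag shows substantial
    autocorrelation (the 0.1 cutoff is an arbitrary choice)."""
    return all(abs(autocorr(chain, k)) < threshold
               for k in range(1, max_lag + 1))
```

A trending chain (as in a trace plot with a systematic drift) produces large low-lag autocorrelations and fails this screen, mirroring what the plots would reveal.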

Full information maximum-likelihood (FIML)

FIML is a model-based missing data method that is used frequently in structural equation modeling (SEM). In our review of the literature, 26.1% of the studies with missing data used FIML to deal with them. Unlike MI, FIML does not impute any missing data. It estimates parameters directly, using all the information already contained in the incomplete data set. The FIML approach was outlined by Hartley and Hocking ( 1971 ). As the name suggests, FIML obtains parameter estimates by maximizing the likelihood function of the incomplete data. Under the assumption of multivariate normality, the log likelihood function of each observation i is

log L i = K i − ½ log | Σ i | − ½ ( x i − μ i )′ Σ i ⁻¹ ( x i − μ i ),

where x i is the vector of observed values for case i , K i is a constant that is determined by the number of observed variables for case i , and μ i and Σ i are, respectively, the subvector of the mean vector μ and the submatrix of the covariance matrix Σ (the parameters to be estimated) corresponding to the variables observed for case i ( Enders 2001 ). For example, suppose there are three variables ( X 1 ,  X 2 , and  X 3 ) in the model, and for case i , X 1  = 10 and  X 2  = 5, while X 3 is missing. The log likelihood for case i is then computed from the means, variances, and covariance of X 1 and X 2 only, evaluated at x i = (10, 5).

The total sample log likelihood is the sum of the individual log likelihood across n cases. The standard ML algorithm is used to obtain the estimates of μ and Σ, and the corresponding SE s by maximizing the total sample log likelihood function.
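The casewise likelihood just described can be sketched directly. Here the constant K i corresponds to the −( k /2) log 2π term, and each case uses only the subvector of μ and submatrix of Σ for its observed variables; the function name is illustrative.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Observed-data log likelihood used by FIML: each case contributes
    a multivariate-normal term over its observed variables only, using
    the matching subvector of mu and submatrix of sigma. Missing
    entries are encoded as np.nan."""
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        if not obs.any():
            continue                      # a fully missing case adds nothing
        x = row[obs]
        m = mu[obs]
        s = sigma[np.ix_(obs, obs)]
        k = obs.sum()
        dev = x - m
        total += -0.5 * (k * np.log(2 * np.pi)
                         + np.log(np.linalg.det(s))
                         + dev @ np.linalg.solve(s, dev))
    return total
```

In an actual FIML fit, a numerical optimizer would maximize this function over μ and Σ (or over SEM parameters that imply them); the sketch shows only the likelihood evaluation.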

As with MI, FIML also assumes MAR and multivariate normality for the joint distribution of all the variables. When the two assumptions are met, FIML is demonstrated to produce unbiased estimates ( Enders and Bandalos 2001 ) and valid model fit information ( Enders 2001 ). Furthermore, FIML is generally more efficient than other ad hoc missing data methods, such as LD ( Enders 2001 ). When the normality assumption was violated, Enders ( 2001 ) reported that (1) FIML provided unbiased estimates across different missing rates, sample sizes, and distribution shapes, as long as the missing mechanism was MCAR or MAR, but (2) FIML resulted in negatively biased SE estimates and an inflated model rejection rate (namely, rejecting fitted models too frequently). Thus, Enders recommended using correction methods, such as rescaled statistics and bootstrap, to correct the bias associated with nonnormality.

Because FIML assumes MAR, adding auxiliary variables to a fitted model is beneficial to data analysis in terms of bias and efficiency ( Graham 2003 ; see the section The imputation model ). Collins et al. ( 2001 ) showed that auxiliary variables are especially helpful when (1) the missing rate is high (i.e., > 50%), and/or (2) the auxiliary variable is at least moderately correlated (i.e., Pearson’s r > .4) with either the variable containing missing data or the variable causing missingness. However, incorporating auxiliary variables into FIML is not as straightforward as it is with MI. Graham ( 2003 ) proposed the saturated correlates model to incorporate auxiliary variables into a substantive SEM model without affecting the parameter estimates of the SEM model or its model fit indices. Specifically, Graham suggested that, after the substantive SEM model is constructed, the auxiliary variables be added to the model according to the following rules: (a) all auxiliary variables are specified to be correlated with all exogenous manifest variables in the model; (b) all auxiliary variables are specified to be correlated with the residuals of all manifest variables that are predicted; and (c) all auxiliary variables are specified to be correlated with one another. Afterwards, the saturated correlates model can be fitted to the data by FIML to increase efficiency and decrease bias.

Expectation-maximization (EM) algorithm

The EM algorithm is another maximum-likelihood based missing data method. As with FIML, the EM algorithm does not “fill in” missing data, but rather estimates the parameters directly by maximizing the complete data log likelihood function. It does so by iterating between the E step and the M step ( Dempster et al. 1977 ).

The E (expectation) step calculates the expectation of the log likelihood function of the parameters, given the data. Assume a data set ( Y ) is partitioned into two parts, the observed part and the missing part, namely, Y  = ( Y obs ,  Y mis ). The distribution of Y , which depends on the unknown parameter θ , can therefore be written as

P ( Y | θ ) = P ( Y obs | θ ) P ( Y mis | Y obs , θ ). (13)

Equation 13 can be written as a likelihood function, as in Equation 14 :

L ( θ | Y ) = c L ( θ | Y obs ) P ( Y mis | Y obs , θ ), (14)

where c is a constant relating to the missing data mechanism; it can be ignored under the MAR assumption and the independence between the model parameters and the missing-mechanism parameters ( Schafer 1997 , p. 12). Taking the log of both sides of Equation 14 yields the following:

l ( θ | Y ) = l ( θ | Y obs ) + log P ( Y mis | Y obs , θ ) + log c , (15)

where l ( θ | Y ) = log  P ( Y | θ ) is the complete-data log likelihood, l ( θ | Y obs ) is the observed-data log likelihood, log  c is a constant, and P ( Y mis | Y obs ,  θ ) is the predictive distribution of the missing data, given θ ( Schafer 1997 ). Since log  c does not affect the estimation of θ , this term can be dropped in subsequent calculations.

Because Y mis is unknown, the complete-data log likelihood cannot be computed directly. However, given a temporary or initial guess of θ (denoted θ ( t ) ), one can compute the expectation of l ( θ | Y ) with respect to the assumed distribution of the missing data, P ( Y mis | Y obs ,  θ ( t ) ), as Equation 16 :

Q ( θ | θ ( t ) ) = ∫ l ( θ | Y ) P ( Y mis | Y obs , θ ( t ) ) d Y mis . (16)

It is at the E step of the EM algorithm that Q ( θ | θ ( t ) ) is calculated.

At the M (maximization) step, the next guess of θ is obtained by maximizing the expectation of the complete-data log likelihood from the previous E step:

θ ( t + 1) = argmax θ Q ( θ | θ ( t ) ).

The EM algorithm is initialized with an arbitrary guess of θ 0 , usually estimates based solely on the observed data. It proceeds by alternating between the E step and M step. It is terminated when successive estimates of θ are nearly identical. The θ ( t +1) that maximizes Q ( θ | θ ( t ) ) is guaranteed to yield an observed data log likelihood that is greater than or equal to that provided by θ ( t ) ( Dempster et al. 1977 ).
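A worked toy instance of the E and M steps, assuming a bivariate normal model in which y1 is fully observed and y2 is partly missing; the conditional-moment formulas in the E step follow from standard normal theory, and the function name is illustrative.

```python
import numpy as np

def em_bivariate(y1, y2, n_iter=200):
    """EM for the mean and covariance of a bivariate normal when y2 has
    missing values (np.nan) and y1 is complete. E step: replace each
    missing y2 (and y2^2) by its conditional expectation given y1 under
    the current parameters; M step: re-estimate mu and Sigma from the
    completed sufficient statistics."""
    miss = np.isnan(y2)
    # initialize from observed-data estimates (zero covariance)
    mu = np.array([y1.mean(), np.nanmean(y2)])
    cov = np.array([[y1.var(), 0.0], [0.0, np.nanvar(y2)]])
    for _ in range(n_iter):
        # E step: conditional moments of missing y2 given y1
        b = cov[0, 1] / cov[0, 0]                   # regression slope
        e2 = np.where(miss, mu[1] + b * (y1 - mu[0]), y2)
        resvar = cov[1, 1] - b * cov[0, 1]          # conditional variance
        e22 = np.where(miss, e2 ** 2 + resvar, y2 ** 2)
        # M step: update mu and Sigma from completed statistics
        mu = np.array([y1.mean(), e2.mean()])
        c12 = (y1 * e2).mean() - mu[0] * mu[1]
        cov = np.array([[y1.var(), c12],
                        [c12, e22.mean() - mu[1] ** 2]])
    return mu, cov
```

Note that the E step does not simply fill in predicted y2 values: the conditional variance term in e22 is what keeps the variance estimate from being biased downward, which is the key difference between EM and single (deterministic) imputation.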

However, the EM algorithm also has several disadvantages. First, the EM algorithm does not compute the derivatives of the log likelihood function. Consequently, it does not provide estimates of SE s. Although extensions of EM have been proposed to allow for the estimation of SE s, these extensions are computationally complex. Thus, EM is not the method of choice when statistical tests or confidence intervals for estimated parameters are the primary goals of research. Second, the rate of convergence can be painfully slow when the percentage of missing information is large ( Little and Rubin 2002 ). Third, many statistical programs assume the multivariate normal distribution when constructing l ( θ | Y ). Violation of this multivariate normality assumption may cause convergence problems for EM, and also for other ML-based methods, such as FIML. For example, if the likelihood function has more than one mode, the mode to which EM converges depends on the starting value of the iteration. Schafer ( 1997 ) cautions that multiple modes do occur in real data sets, especially when “the data are sparse and/or the missingness pattern is unusually pernicious” (p. 52). One way to check whether EM provides valid results is to initialize the algorithm with different starting values and check whether the results are similar. Finally, EM is model specific: each proposed data model requires a unique likelihood function. In sum, if used flexibly, EM is powerful and can provide smaller SE estimates than MI. Schafer and Graham ( 2002 ) compiled a list of packages that offer the EM algorithm; to the best of our knowledge, the list has not been updated in the literature.

Demonstration

In this section, we demonstrate the three principled missing data methods by applying them to a real-world data set. The data set is complete and is described under Data Set . A research question posed to this data set and an appropriate analysis strategy are described next under Statistical Modeling . From the complete data set, missing values were created on two variables under the MAR assumption at three missing rates. These missing data conditions are described under Generating Missing Data Conditions . For each missing data condition, LD, MI, FIML, and EM were applied to answer the research question. The application of these four methods is described under Data Analysis . Results obtained from these methods were contrasted with those obtained from the complete data set and are discussed in the section titled Results .

Self-reported health data from 432 adolescents were collected in the fall of 1988 from two junior high schools (Grades 7 through 9) in the Chicago area. Of the 432 participants, 83.4% were White and the rest Black or of other races, with a mean age of 13.9 years and nearly even numbers of girls ( n = 208) and boys ( n = 224). Parents were notified by mail that the survey was to be conducted. Both the parents and the students were assured of their rights to optional participation and of the confidentiality of students’ responses. Written parental consent was waived with the approval of the school administration and the university Institutional Review Board ( Ingersoll et al. 1993 ). The adolescents reported their health behavior, using the Health Behavior Questionnaire (HBQ) ( Ingersoll and Orr 1989 ; Peng et al. 2006 ; Resnick et al. 1993 ); self-esteem, using Rosenberg’s self-esteem inventory ( Rosenberg 1989 ); gender; race; intention to drop out of school; and family structure. The HBQ asked adolescents to indicate whether they engaged in specific risky health behaviors (Behavioral Risk Scale) or had experienced selected emotions (Emotional Risk Scale). The response scale ranged from 1 ( never ) to 4 ( about once a week ) for both scales. Examples of behavioral risk items were “I use alcohol (beer, wine, booze),” “I use pot,” and “I have had sexual intercourse/gone all the way.” These items measured the frequency of adolescents’ alcohol and drug use, sexual activity, and delinquent behavior. Examples of emotional risk items were “I have attempted suicide” and “I have felt depressed.” Emotional risk items measured adolescents’ quality of relationships with others and management of emotions. Cronbach’s alpha reliability ( Nunnally 1978 ) was .84 for the Behavioral Risk Scale and .81 for the Emotional Risk Scale ( Peng and Nichols 2003 ).
Self-esteem scores ranged from 9.79 to 73.87 with a mean of 50.29 and SD of 10.04. Furthermore, among the 432 adolescents, 12.27% ( n = 53) indicated an intention to drop out of school; 67.4% ( n = 291) were from families with two parents, including those with one step-parent, and 32.63% ( n = 141) were from families headed by a single parent. The data set is hereafter referred to as the Adolescent data and is available from https://oncourse.iu.edu/access/content/user/peng/logregdata_peng_.sav as an SPSS data file.

Statistical Modeling

For the Adolescent data, we were interested in predicting adolescents’ behavioral risk from their gender, intention to drop out of school, family structure, and self-esteem scores. Given this objective, a linear regression model was fit to the data using adolescents’ scores on the Behavioral Risk Scale of the HBQ as the dependent variable (BEHRISK) and gender (GENDER), intention to drop out of school (DROPOUT), type of family structure (FAMSTR), and self-esteem score (ESTEEM) as predictors or covariates. Emotional risk (EMORISK) was used subsequently as an auxiliary variable to illustrate the missing data methods; hence, it was not included in the regression model. For the linear regression model, GENDER was coded as 1 for girls and 0 for boys, DROPOUT was coded as 1 for yes and 0 for no, and FAMSTR was coded as 1 for single-parent families and 0 for intact or step families. BEHRISK and ESTEEM were coded using each participant’s scores on these two scales. Because the distribution of BEHRISK was highly skewed, a natural log transformation was applied to BEHRISK to reduce its skewness from 2.248 to 1.563. The natural-log transformed BEHRISK (or LBEHRISK) and ESTEEM were standardized before being included in the regression model, to facilitate the discussion of the impact of different missing data methods. Thus, the regression model fitted to the Adolescent data was

LBEHRISK = β 0 + β 1 GENDER + β 2 DROPOUT + β 3 FAMSTR + β 4 ESTEEM + ε. (18)

The regression coefficients obtained from SAS 9.3 using the complete data were:

According to the results, when all other covariates were held constant, boys, adolescents who intended to drop out of school, those with low self-esteem scores, and those from single-parent families were more likely to engage in risky behaviors.

Generating missing data conditions

The missing data on LBEHRISK and ESTEEM were created under the MAR mechanism. Specifically, the probability of missing data on LBEHRISK was made to depend on EMORISK, and the probability of missing data on ESTEEM depended on FAMSTR. Peugh and Enders ( 2004 ) reviewed missing data reported in 23 applied research journals and found that “the proportion of missing cases per analysis ranged from less than 1% to approximately 67%” (p. 539). Peng et al. ( 2006 ) reported missing rates ranging from 26% to 72% based on 1,666 studies published in 11 education and psychology journals. We thus designed our study to correspond to the wide spread of missing rates encountered by applied researchers. Specifically, we manipulated the overall missing rate at three levels: 20%, 40%, or 60% (see Table  1 ). We did not include lower missing rates, such as 10% or 5%, because we expected the missing data methods to perform similarly, and better, at low missing rates than at high missing rates. Altogether we generated three missing data conditions using SPSS 20 (see the Appendix for the SPSS syntax for generating missing data). Due to the difficulty of manipulating missing data in both the outcome variable and the covariates, the actual overall missing rates could not be controlled at exactly 20%, 40%, or 60%, but they closely approximated these pre-specified rates (see the description below).

Table 1. Probability of missing for LBEHRISK and ESTEEM at three missing rates

Overall missing rate   ESTEEM (by FAMSTR)                       LBEHRISK (by EMORISK)
                       Single family    Intact/step family      ≤ Q1    Between Q1 & Q3    ≥ Q3
20%                    .20              .02                     .00     .10                .30
40%                    .40              .05                     .10     .20                .60
60%                    .80              .10                     .20     .40                .80

Note . Q1 = first quartile, Q3 = third quartile.

According to Table 1, at the 20% overall missing rate, participants from a single-parent family had a probability of .20 of missing ESTEEM, while participants from a two-parent family (including intact families and families with one step-parent and one biological parent) had a probability of .02 of missing scores on ESTEEM. As the overall missing rate increased from 20% to 40% or 60%, the probability of missing on ESTEEM likewise increased. Furthermore, the probability of missing on LBEHRISK was conditioned on the value of EMORISK. Specifically, at the 20% overall missing rate, if EMORISK was at or below the first quartile, the probability of LBEHRISK being missing was .00 (Table 1). If EMORISK was between the first and the third quartiles, the probability was .10; an EMORISK at or above the third quartile resulted in LBEHRISK being missing with a probability of .30. When the overall missing rate increased to 40% or 60%, the probabilities of missing LBEHRISK increased accordingly.
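The conditional deletion scheme of Table 1 can be sketched in a few lines of code. This is a minimal Python illustration of the MAR mechanism only; the study itself generated missing data with SPSS syntax (given in the Appendix), and the function name, toy values, and cutpoints below are ours.

```python
import random

def mar_mask(values, driver, cutpoints, probs):
    """Delete entries of `values` with a probability that depends on the
    fully observed variable `driver` -- a MAR mechanism like Table 1's.
    `cutpoints` = (Q1, Q3) of `driver`; `probs` gives the missingness
    probability for driver <= Q1, Q1 < driver < Q3, and driver >= Q3."""
    q1, q3 = cutpoints
    out = []
    for v, d in zip(values, driver):
        p = probs[0] if d <= q1 else probs[1] if d < q3 else probs[2]
        out.append(None if random.random() < p else v)
    return out

# 20% condition for LBEHRISK: probabilities .00 / .10 / .30 by EMORISK quartile
lbehrisk_obs = mar_mask([0.4, 1.2, 0.9], [1.0, 2.5, 4.0],
                        cutpoints=(1.5, 3.5), probs=(.00, .10, .30))
```

Because the deletion probability depends only on a fully observed variable (EMORISK), not on the deleted values themselves, the resulting missingness is MAR rather than MNAR.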

After generating three data sets with different overall missing rates, the regression model in Equation 18 was fitted to each data set using four methods (i.e., LD, MI, FIML, and EM) to deal with missing data. Since missingness on LBEHRISK depended on EMORISK, EMORISK was used as an auxiliary variable in the MI, EM, and FIML methods. All analyses were performed using SAS 9.3. For simplicity, we describe the data analysis for one of the three data sets, namely, the condition with an overall missing rate of 20%. The other data sets were analyzed similarly. Results are presented in Tables 2 and 3.

Table 2. Regression Coefficients from Four Missing Data Methods

20% overall missing rate (a)

             Complete data   LD          MI          FIML        EM
GENDER       -0.434***       -0.412***   -0.414***   -0.421***   -0.421***
             (0.082)         (0.091)     (0.086)     (0.087)     (0.083)
DROPOUT      1.172***        1.237***    1.266***    1.263***    1.263***
             (0.125)         (0.142)     (0.132)     (0.132)     (0.126)
ESTEEM       -0.191***       -0.213***   -0.215***   -0.212***   -0.212***
             (0.041)         (0.046)     (0.044)     (0.044)     (0.041)
FAMSTR       0.367***        0.377***    0.365***    0.366***    0.366***
             (0.087)         (0.101)     (0.096)     (0.092)     (0.088)
Actual n     432             349         432         N/A         414

60% overall missing rate (b)

             Complete data   LD          MI          FIML        EM
GENDER       -0.434***       -0.39**     -0.414***   -0.413***   -0.413***
             (0.082)         (0.131)     (0.1)       (0.104)     (0.086)
DROPOUT      1.172***        1.557***    1.559***    1.532***    1.562***
             (0.125)         (0.209)     (0.17)      (0.158)     (0.131)
ESTEEM       -0.191***       -0.193**    -0.217***   -0.214**    -0.215***
             (0.041)         (0.065)     (0.063)     (0.06)      (0.043)
FAMSTR       0.367***        0.479*      0.302*      0.3**       0.3**
             (0.087)         (0.192)     (0.116)     (0.111)     (0.091)
Actual n     432             171         432         N/A         367

Note . Standard error estimates in parentheses. MI results were based on 60 imputations. FIML results were obtained with EMORISK as an auxiliary variable in the model.

a The actual overall missing rate was 19.21%. b The actual overall missing rate was 60.42%.

* p < .05. ** p < .01. *** p < .001.

Table 3. Percentage of Bias in Estimates

20% overall missing rate

             LD        MI        FIML      EM
GENDER       5.07      4.61      3.00      3.00
DROPOUT      5.55      8.02      7.76      7.76
ESTEEM       -11.52    -12.57    -10.99    -10.99
FAMSTR       2.72      -0.54     -0.27     -0.27

60% overall missing rate

             LD        MI        FIML      EM
GENDER       10.14     4.61      4.84      4.84
DROPOUT      32.85     33.02     30.72     33.28
ESTEEM       -1.05     -13.61    -12.04    -12.57
FAMSTR       30.52     -17.71    -18.26    -18.26

Note. Percentage of bias was calculated as the difference between the incomplete-data estimate and the complete-data estimate, divided by the complete-data estimate.
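The bias measure in the note above can be sketched as follows (Python for illustration; the function name is ours):

```python
def percent_bias(incomplete_est, complete_est):
    """Percentage of bias, as defined in the note to Table 3:
    (incomplete-data estimate - complete-data estimate) / complete-data estimate,
    expressed as a percentage."""
    return 100 * (incomplete_est - complete_est) / complete_est

# e.g., the LD estimate of DROPOUT at the 20% missing rate (Table 2):
round(percent_bias(1.237, 1.172), 2)  # → 5.55, matching Table 3
```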

Data analysis

The LD method

The LD method is the default in PROC REG. To implement LD, we ran PROC REG without specifying any missing-data options; the SAS system, by default, used only cases with complete data to estimate the regression coefficients.
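The complete-case filter that PROC REG applies by default amounts to the following (a minimal Python sketch of listwise deletion; the function name and toy cases are ours):

```python
def listwise_delete(rows):
    """Listwise deletion (LD): retain only cases observed on every variable,
    mirroring the default treatment of missing values in regression software."""
    return [row for row in rows if all(v is not None for v in row)]

# e.g., three cases, the second with a missing score on the third variable
cases = [(1, 0, -0.2), (0, 1, None), (1, 1, 0.7)]
complete_cases = listwise_delete(cases)  # keeps the first and third cases
```

Note how a single missing entry discards the whole case, which is why LD shrinks the analysis sample (e.g., from 432 to 349 cases at the 20% missing rate in Table 2) and inflates standard errors.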

The MI method

The first step in MI was to generate 60 imputed data sets using PROC MI. The second step was to fit the regression model in Equation 18 to each imputed data set using PROC REG (see the Appendix for the SAS syntax). At the end of PROC REG, 60 sets of estimated regression coefficients and their variance-covariance matrices were passed to the third and final step in MI, namely, pooling these 60 sets of estimates into one. PROC MIANALYZE was invoked to combine the estimates and their variances/covariances into one set using the pooling formulas in Equations 4 to 7 (Rubin 1987). By default, PROC MIANALYZE uses νm, defined in Equation 9, for hypothesis testing. In order to specify the corrected degrees of freedom νm* (defined in Equation 10) for testing, we specified the “EDF=427” option, because 427 was the degrees of freedom based on the complete data.
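The pooling step that PROC MIANALYZE performs can be sketched as follows. This is a minimal Python illustration of Rubin's (1987) rules for a single coefficient, assuming the standard forms of Equations 4 to 7 and the νm of Equation 9; the function and the example values are ours, not output from the study.

```python
import math
import statistics

def pool_mi(estimates, variances):
    """Pool M imputation-specific estimates and their squared SEs with
    Rubin's (1987) rules: returns the pooled estimate, pooled SE, and the
    degrees of freedom nu_m used for hypothesis testing."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    w = statistics.fmean(variances)          # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    t = w + (1 + 1 / m) * b                  # total variance
    nu_m = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, math.sqrt(t), nu_m

# toy example: three imputations of one regression coefficient
est, se, df = pool_mi([1.0, 1.2, 1.1], [0.04, 0.05, 0.045])
```

The (1 + 1/m) factor is what charges MI for the extra uncertainty of a finite number of imputations; as the between-imputation variance B shrinks relative to W, νm grows and the pooled inference approaches the complete-data t test.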

The FIML method

The FIML method was implemented using PROC CALIS, which is designed for structural equation modeling. Beginning with SAS 9.22, the CALIS procedure has offered an option to analyze data using FIML in the presence of missing data. The FIML method in the CALIS procedure has a variety of applications in path analysis, regression models, factor analysis, and others, as these modeling techniques are considered special cases of structural equation modeling (Yung and Zhang 2011). For the current study, two models were specified using PROC CALIS: an ordinary least squares regression model without the auxiliary variable EMORISK, and a saturated correlates model that included EMORISK. For the saturated correlates model, EMORISK was specified to be correlated with the four covariates (GENDER, DROPOUT, ESTEEM, and FAMSTR) and the residual for LBEHRISK. Graham (2003) has shown that by constructing the saturated correlates model this way, one can include an auxiliary variable in the SEM model without affecting the parameter estimates or the model fit index for the model of substantive interest (Equation 18 in the current study).

The EM method

The EM method was implemented using both PROC MI and PROC REG. As stated previously, the versatile PROC MI can be used for EM if the EM statement is specified. To include auxiliary variables in EM, one lists them on the VAR statement of PROC MI (see the Appendix for the SAS syntax). The output data set of PROC MI with the EM specification contains the estimated variance-covariance matrix and the vector of means of all the variables listed on the VAR statement. The variance-covariance matrix and the means vector were subsequently input into PROC REG to fit the regression model in Equation 18. In order to compute the SEs for the estimated regression coefficients, we specified a nominal sample size equal to the average number of available cases among all the variables. We chose this strategy based on Truxillo (2005), who compared three strategies for specifying sample sizes for hypothesis testing in discriminant function analysis using EM results: (a) the minimum column-wise n (i.e., the smallest number of available cases among all variables), (b) the average column-wise n (i.e., the mean number of available cases among all the variables), and (c) the minimum pairwise n (the smallest number of available cases for any pair of variables in a data set). He found that the average column-wise n approach produced results closest to the complete-data results. It is worth noting that Truxillo’s (2005) study was limited to discriminant function analysis and three sample-size specifications. Additional research is needed to determine the best strategy for specifying a nominal sample size for other statistical procedures.
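The average column-wise n used as the nominal sample size above is simple to compute (a Python sketch; the function name and the toy data, which are not the Adolescent data, are ours):

```python
def average_columnwise_n(columns):
    """Average column-wise n (Truxillo 2005): the mean number of available
    (non-missing) cases across all variables in the data set."""
    counts = [sum(v is not None for v in col) for col in columns.values()]
    return sum(counts) / len(counts)

# hypothetical mini data set with missing values coded as None
toy = {
    "LBEHRISK": [0.3, None, 1.1, 0.7, None],   # 3 observed
    "ESTEEM":   [-0.2, 0.5, None, 1.0, 0.8],   # 4 observed
    "GENDER":   [1, 0, 1, 1, 0],               # 5 observed
}
nominal_n = average_columnwise_n(toy)  # (3 + 4 + 5) / 3 = 4.0
```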

Results

Results derived from the 40% missing rate exhibited patterns intermediate between those obtained at the 20% and 60% missing rates; hence, they are presented in the Appendix. Table 2 presents estimates of regression coefficients and SEs derived from LD, MI, FIML, and EM for the 20% and 60% missing data conditions. Table 3 presents the percentage of bias in parameter estimates for the four missing data methods. The percentage of bias was defined and calculated as the difference between the incomplete-data estimate and the complete-data estimate, divided by the complete-data estimate. Any percentage of bias larger than 10% is considered substantial in subsequent discussions. The complete-data results are included in Table 2 as a benchmark against which the missing data results are contrasted. The regression model based on the complete data explained 28.4% of the variance (adjusted R²) in LBEHRISK, RMSE = 0.846, and all four predictors were statistically significant at p < .001.

According to Table 2, at the 20% overall missing rate, estimates derived from the four missing data methods were statistically significant at p < .001, the same significance level as the complete-data results. LD consistently resulted in larger SEs than the three principled methods or the complete data set. The bias in estimates was mostly under 10%, except for the estimates of ESTEEM by all four missing data methods (Table 3). The three principled methods exhibited similar biases and estimated FAMSTR accurately.

When the overall missing rate was 60% (Table 2), estimates derived from the four missing data methods showed that all four covariates were statistically significant at least at p < .05. LD again consistently resulted in larger SEs than the three principled methods or the complete data set. All four methods resulted in substantial bias for three of the four covariates (Table 3). The three principled methods once again yielded similar biases, whereas the bias from LD was similar to theirs only for DROPOUT. Indeed, DROPOUT was the least accurately estimated covariate for all four methods. LD estimated ESTEEM most accurately, better than the three principled methods. The three principled methods estimated GENDER most accurately, and their estimates of FAMSTR were better than LD’s. Differences in absolute bias among the four methods for ESTEEM or GENDER were actually quite small.

Compared to the complete-data results, the three principled methods slightly overestimated SEs (Table 2), but not as badly as LD. Among the three methods, SEs obtained from EM were closer to those based on the complete data than those from MI or FIML. This finding is to be expected, because MI incorporates into the SE the uncertainty associated with plausible missing data estimates, and the literature consistently documents the superior power of EM compared to MI (Collins et al. 2001; Graham et al. 2007; Schafer and Graham 2002).

In general, the SE and the bias increased as the overall missing rate increased from 20% to 60%. One exception to this trend was the bias in ESTEEM estimated by LD, which decreased instead, although the two estimates differed by a mere .02.

Discussion

During the last decade, the missing data treatments reported in JEP have shown much improvement, in terms of decreased use of ad hoc methods (e.g., LD and PD) and increased use of principled methods (e.g., FIML, EM, and MI). Yet several research practices still persist, including not explicitly acknowledging the presence of missing data, not describing the approach used in dealing with missing data, and not testing the assumptions the chosen method requires. In this paper, we promote three principled missing data methods (i.e., MI, FIML, and EM) by discussing their theoretical frameworks, implementation, assumptions, and computing issues. All three methods were illustrated with an empirical Adolescent data set using SAS 9.3, and their performance was evaluated under three conditions created from three missing rates (20%, 40%, and 60%). Each incomplete data set was subsequently analyzed with a regression model predicting adolescents’ behavioral risk scores, using one of the three principled methods or LD. The performance of the four missing data methods was contrasted with that of the complete data set in terms of bias and SE.

Results showed that the three principled methods yielded similar estimates at both the 20% and 60% missing rates. In comparison, LD consistently resulted in larger SEs for regression coefficient estimates. These findings are consistent with those reported in the literature and thus support the recommendation of the three principled methods (Allison 2003; Horton and Lipsitz 2001; Kenward and Carpenter 2007; Peng et al. 2006; Peugh and Enders 2004; Schafer and Graham 2002). Under the three missing data conditions, MI, FIML, and EM yielded similar estimates and SEs. These results are consistent with missing data theory, which argues that MI and ML-based methods (e.g., FIML and EM) are equivalent (Collins et al. 2001; Graham et al. 2007; Schafer and Graham 2002). In terms of SE, the ML-based methods outperformed MI by providing slightly smaller SEs. This finding is to be expected because ML-based methods do not involve any randomness, whereas MI does. Below we elaborate on features shared by MI and ML-based methods, the choice between these two types of methods, and the extension of these methods to multilevel research contexts.

Features shared by MI and ML-based methods

First of all, these methods are based on the likelihood function P(Yobs, θ) = ∫ P(Ycomplete, θ) dYmis. Because this equation is valid under MAR (Rubin 1976), all three principled methods are valid under the MAR assumption. The two ML-based methods work directly with the likelihood function, whereas MI takes the Bayesian approach by imposing a prior distribution on the likelihood function. As the sample size increases, the impact of the specific prior distribution diminishes. It has been shown that,

If the user of the ML procedure and the imputer use the same set of input data (same set of variables and observational units), if their models apply equivalent distributional assumptions to the variables and the relationships among them, if the sample size is large, and if the number of imputations, M , is sufficiently large, then the results from the ML and MI procedures will be essentially identical. ( Collins et al. 2001 p. 336)

In fact, the computational details of EM and MCMC (i.e., data augmentation) are very similar ( Schafer 1997 ).

Second, both MI and the ML-based methods allow the estimation/imputation model to be different from the analysis model, the model of substantive interest. Although it is widely known that the imputation model can differ from the analysis model in MI, the fact that ML-based methods can incorporate auxiliary variables (such as EMORISK) is rarely mentioned in the literature, except by Graham (2003). As previously discussed, Graham (2003) suggested using the saturated correlates model to incorporate auxiliary variables into SEM. However, this approach results in a rapidly expanding model with each additional auxiliary variable; consequently, the ML-based methods may not converge. In this case, MI is the preferred method, especially when one needs to incorporate a large number of auxiliary variables into the model of substantive interest.

Finally, most statistical packages that offer the EM, FIML, and/or MI methods assume multivariate normality. Theory and experiments suggest that MI is more robust to violation of this distributional assumption than ML-based methods (Schafer 1997). As discussed previously, violation of the multivariate normality assumption may cause convergence problems for ML-based methods, yet MI can still provide satisfactory results in the presence of non-normality (refer to the section titled MI Related Issues). This is so because the posterior distribution in MI is approximated by a finite mixture of normal distributions; MI is therefore able to capture non-normal features, such as skewness or multiple modes (Schafer 1999). At present, the literature does not offer systematic comparisons of these two types of methods in terms of their sensitivity to violation of the multivariate normality assumption.

Choice between MI and ML-based methods

The choice between MI and ML-based methods is not easy. On the one hand, ML-based methods offer the advantage of likelihood ratio tests, so that nested models can be compared. Even though Schafer (1997) provided a way to combine likelihood ratio test statistics in MI, no empirical studies have evaluated the performance of this pooled likelihood ratio test under various data conditions (e.g., missing mechanism, missing rate, number of imputations, model complexity), and the test has not been incorporated into popular statistical packages, such as SAS or SPSS. ML-based methods, in general, produce slightly smaller SEs than MI (Collins et al. 2001; Schafer and Graham 2002). Finally, ML-based methods have greater power than MI (Graham et al. 2007), unless the number of imputations is sufficiently large, such as 100 or more.

On the other hand, MI has a clear advantage over ML-based methods when dealing with categorical variables (Peng and Zhu 2008). Another advantage of MI over ML-based methods is its computational simplicity (Sinharay et al. 2001). Once missing data have been imputed, fitting multiple models to a single data set does not require the repeated application of MI, whereas fitting different models to the same data requires multiple applications of an ML-based method. As stated earlier, it is easier to include auxiliary variables in MI than in ML-based methods. In this sense, MI is the preferred method if one wants to employ an inclusive strategy for selecting auxiliary variables.

The choice also depends on the goal of the study. If the aim is exploratory, or if the data are prepared for a number of users who may analyze the data differently, MI is certainly better than an ML-based method. For these purposes, a data analyst needs to make sure that the imputation model is general enough to capture the meaningful relationships in the data set. If, however, a researcher is clear about the parameters to be estimated, FIML or EM is the better choice, because these methods do not introduce randomness due to imputation into the data and are more efficient than MI.

An even better way to deal with missing data is to apply MI and EM jointly. In fact, the application of MI can be facilitated by using EM estimates as starting values for the data augmentation algorithm (Enders 2010). Furthermore, the number of EM iterations needed for convergence is a conservative estimate of the number of burn-in iterations needed in MI’s data augmentation, because EM converges more slowly than data augmentation.

Extension of MI and ML-based methods to multilevel research contexts

Many problems in education and psychology are multilevel in nature, with, for example, students nested within classrooms or teachers nested within school districts. To adequately address these problems, multilevel models have been recommended by methodologists. For an imputation method to yield valid results, the imputation model must reflect the same structure as the data. In other words, the imputation model should itself be multilevel in order to impute missing data in a multilevel context (Carpenter and Goldstein 2004). There are several ways to extend MI to deal with missing data when there are two levels. If missing data occur only at level 1 and the number of level 2 units is small, standard MI can be used with minor adjustments. For example, for a random-intercept model, one can dummy-code the cluster membership variable and include the dummy variables in the imputation model. In the case of a random-slope and random-intercept model, one needs to perform multiple imputation separately within each cluster (Graham 2009). When the number of level 2 units is large, the procedure just described is cumbersome. In this instance, one may turn to specialized MI programs, such as the PAN library for S-Plus (Schafer 2001), the REALCOM-IMPUTE software (Carpenter et al. 2011), and the R package mlmmm (Yucel 2007). Unfortunately, ML-based methods have been extended to multilevel models only when there are missing data on the dependent variable, but not on the covariates at any level, such as a student’s age at level 1 or a school’s SES at level 2 (Enders 2010).
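The dummy-coding adjustment for a random-intercept model can be sketched as follows (a minimal Python illustration; the function name and toy cluster labels are ours):

```python
def dummy_code(cluster_ids):
    """Dummy-code cluster membership (k clusters -> k-1 indicator columns),
    so a random-intercept structure can be carried by a standard
    single-level imputation model; the first cluster (in sorted order)
    serves as the reference category."""
    levels = sorted(set(cluster_ids))
    return [[int(c == lvl) for lvl in levels[1:]] for c in cluster_ids]

# four students in schools A, B, A, C -> two indicator columns (for B and C)
indicators = dummy_code(["A", "B", "A", "C"])  # [[0, 0], [1, 0], [0, 0], [0, 1]]
```

With k clusters this adds k − 1 columns to the imputation model, which is exactly why the strategy is practical only when the number of level 2 units is small.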

In this paper, we discuss and demonstrate three principled missing data methods that are applicable to a variety of research contexts in educational psychology. Before applying any of the principled methods, one should make every effort to prevent missing data from occurring. Toward this end, the missing data rate should be kept to a minimum by designing and implementing data collection carefully. When missing data are inevitable, one needs to closely examine the missing data mechanism, missing rate, missing pattern, and the data distribution before deciding on a suitable missing data method. When implementing a missing data method, a researcher should be mindful of issues related to its proper implementation, such as statistical assumptions, the specification of the imputation/estimation model, a suitable number of imputations, and criteria for convergence.

Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.

Acknowledgment

This research was supported by the Maris M. Proffitt and Mary Higgins Proffitt Endowment Grant awarded to the second author. The opinions contained in this paper are those of the authors and do not necessarily reflect those of the grant administrator, Indiana University School of Education.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YD conducted the literature review on missing data methods, carried out the software demonstration, and drafted the manuscript. CYJP conceived the software demonstration design, provided the empirical data, and worked collaboratively with YD to finalize the manuscript. Both authors read and approved the final manuscript.

Contributor Information

Yiran Dong, Email: yidong@indiana.edu.

Chao-Ying Joanne Peng, Email: peng@indiana.edu.

  • Ake CF. Rounding after multiple imputation with non-binary categorical covariates. In: Proceedings of the Thirtieth Annual SAS Users Group International Conference. Cary, NC: SAS Institute Inc; 2005. pp. 1–11.
  • Allison PD. Missing data. Thousand Oaks, CA: Sage Publications, Inc.; 2001.
  • Allison PD. Missing data techniques for structural equation modeling. J Abnorm Psychol. 2003;112(4):545–557. doi:10.1037/0021-843X.112.4.545.
  • Allison PD. Imputation of categorical variables with PROC MI. In: Proceedings of the Thirtieth Annual SAS Users Group International Conference. Cary, NC: SAS Institute Inc; 2005. pp. 1–14.
  • Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–955. doi:10.1093/biomet/86.4.948.
  • Bennett DA. How can I deal with missing data in my study? Aust N Z J Public Health. 2001;25(5):464–469.
  • Bernaards CA, Belin TR, Schafer JL. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Stat Med. 2007;26(6):1368–1382. doi:10.1002/sim.2619.
  • Carpenter J, Goldstein H. Multiple imputation in MLwiN. Multilevel Modelling Newsletter. 2004;16:9–18.
  • Carpenter JR, Goldstein H, Kenward MG. REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. J Stat Softw. 2011;45(5):1–14.
  • Collins LM, Schafer JL, Kam C-M. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Meth. 2001;6(4):330–351. doi:10.1037/1082-989X.6.4.330.
  • Couvreur C. The EM algorithm: a guided tour. In: Proceedings of the 2nd IEEE European Workshop on Computationally Intensive Methods in Control and Signal Processing. Prague, Czech Republic; 1996. pp. 115–120.
  • Demirtas H, Freels SA, Yucel RM. Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. JSCS. 2008;78(1):69–84.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series. 1977;39(1):1–38.
  • Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. New York: Oxford University Press; 1995.
  • Enders CK. A primer on maximum likelihood algorithms available for use with missing data. Struct Equ Modeling. 2001;8(1):128–141. doi:10.1207/S15328007SEM0801_7.
  • Enders CK. Using the expectation maximization algorithm to estimate coefficient alpha for scales with item-level missing data. Psychol Meth. 2003;8(3):322–337. doi:10.1037/1082-989X.8.3.322.
  • Enders CK. Applied missing data analysis. New York, NY: The Guilford Press; 2010.
  • Enders CK, Bandalos DL. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Struct Equ Modeling. 2001;8(3):430–457. doi:10.1207/S15328007SEM0803_5.
  • Graham JW. Adding missing-data-relevant variables to FIML-based structural equation models. Struct Equ Modeling. 2003;10(1):80–100. doi:10.1207/S15328007SEM1001_4.
  • Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009;60:549–576. doi:10.1146/annurev.psych.58.110405.085530.
  • Graham JW, Olchowski A, Gilreath T. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci. 2007;8(3):206–213. doi:10.1007/s11121-007-0070-9.
  • Hartley HO, Hocking RR. The analysis of incomplete data. Biometrics. 1971;27(4):783–823. doi:10.2307/2528820.
  • Heitjan DF, Little RJ. Multiple imputation for the fatal accident reporting system. Appl Stat. 1991;40:13–29. doi:10.2307/2347902.
  • Horton NJ, Kleinman KP. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007;61(1):79–90. doi:10.1198/000313007X172556.
  • Horton NJ, Lipsitz SR. Multiple imputation in practice. Am Stat. 2001;55(3):244–254. doi:10.1198/000313001317098266.
  • Horton NJ, Lipsitz SR, Parzen M. A potential for bias when rounding in multiple imputation. Am Stat. 2003;57(4):229–232. doi:10.1198/0003130032314.
  • Ingersoll GM, Orr DP. Behavioral and emotional risk in early adolescents. J Early Adolesc. 1989;9(4):396–408. doi:10.1177/0272431689094002.
  • Ingersoll GM, Grizzle K, Beiter M, Orr DP. Frequent somatic complaints and psychosocial risk in adolescents. J Early Adolesc. 1993;13(1):67–78. doi:10.1177/0272431693013001004.
  • Kenward MG, Carpenter J. Multiple imputation: current perspectives. Stat Methods Med Res. 2007;16(3):199–218. doi:10.1177/0962280206075304.
  • Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. New York: Wiley; 2002.
  • Little RJA, Schenker N. Missing data. In: Arminger G, Clogg CC, Sobel ME, editors. Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum Press; 1995. pp. 39–75.
  • Nunnally J. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
  • OECD. PISA data analysis manual: SPSS. 2nd ed. Paris: OECD Publishing; 2009.
  • Peng CYJ, Nichols RN. Using multinomial logistic models to predict adolescent behavioral risk. J Mod App Stat. 2003;2(1):177–188.
  • Peng CYJ, Zhu J. Comparison of two approaches for handling missing covariates in logistic regression. Educ Psychol Meas. 2008;68(1):58–77.
  • Peng CYJ, Harwell M, Liou SM, Ehman LH. Advances in missing data methods and implications for educational research. In: Sawilowsky SS, editor. Real data analysis. Charlotte, NC: Information Age Pub; 2006. pp. 31–78.
  • Peugh JL, Enders CK. Missing data in educational research: a review of reporting practices and suggestions for improvement. Rev Educ Res. 2004;74(4):525–556. doi:10.3102/00346543074004525.
  • Raghunathan TE, Lepkowski JM, van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology. 2001;27(1):85–96.
  • Resnick MD, Harris LJ, Blum RW. The impact of caring and connectedness on adolescent health and well-being. J Paediatr Child Health. 1993;29(Suppl 1):3–9. doi:10.1111/j.1440-1754.1993.tb02257.x.
  • Rosenberg M. Society and the adolescent self-image. Rev. ed. Middletown, CT: Wesleyan University Press; 1989.
  • Royston P. Multiple imputation of missing values. SJ. 2004;4(3):227–241.
  • Royston P. Multiple imputation of missing values: update of ice. SJ. 2005;5(4):527–536.
  • Royston P. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. SJ. 2007;7(4):445–464.
  • Royston P, White IR. Multiple imputation by chained equations (MICE): implementation in Stata. J Stat Softw. 2011;45(4):1–20.
  • Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592. doi:10.1093/biomet/63.3.581.
  • Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, Inc.; 1987.
  • Rubin DB. Multiple imputation after 18+ years. JASA. 1996;91:473–489. doi:10.1080/01621459.1996.10476908.
  • SAS Institute Inc. SAS/STAT 9.3 user's guide. Cary, NC: SAS Institute Inc; 2011.
  • Schafer JL. Analysis of incomplete multivariate data. London: Chapman & Hall/CRC; 1997.
  • Schafer JL. Multiple imputation: a primer. Stat Methods Med. 1999;8(1):3–15. doi:10.1191/096228099671525676.
  • Schafer JL. Multiple imputation with PAN. In: Collins LM, Sayer AG, editors. New methods for the analysis of change. Washington, DC: American Psychological Association; 2001. pp. 353–377.
  • Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Meth. 2002;7(2):147–177. doi:10.1037/1082-989X.7.2.147.
  • Schafer JL, Olsen MK. Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivar Behav Res. 1998;33(4):545–571. doi:10.1207/s15327906mbr3304_5.
  • Schenker N, Taylor JMG. Partially parametric techniques for multiple imputation. Comput Stat Data Anal. 1996;22(4):425–446. doi:10.1016/0167-9473(95)00057-7.
  • Schlomer GL, Bauman S, Card NA. Best practices for missing data management in counseling psychology. J Couns Psychol. 2010;57(1):1–10. doi:10.1037/a0018082.
  • Sinharay S, Stern HS, Russell D. The use of multiple imputation for the analysis of missing data. Psychol Meth. 2001;6(4):317–329. doi:10.1037/1082-989X.6.4.317.
  • Tabachnick BG, Fidell LS. Using multivariate statistics. 6th ed. Needham Heights, MA: Allyn & Bacon; 2012.
  • Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. JASA. 1987;82(398):528–540. doi:10.1080/01621459.1987.10478458.
  • Truxillo C. Maximum likelihood parameter estimation with incomplete data. In: Proceedings of the Thirtieth Annual SAS Users Group International Conference. Cary, NC: SAS Institute Inc; 2005. pp. 1–19.
  • van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16(3):219–242. doi:10.1177/0962280206074463.
  • van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1–67.
  • van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18(6):681–694. doi:10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R.
  • van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specification in multivariate imputation. JSCS. 2006;76(12):1049–1064.
  • White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30(4):377–399. doi:10.1002/sim.4067.
  • Wilkinson L, the Task Force on Statistical Inference. Statistical methods in psychology journals: guidelines and explanations. Am Psychol. 1999;54(8):594–604. doi:10.1037/0003-066X.54.8.594.
  • Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Statist. 1938;9(1):60–62. doi:10.1214/aoms/1177732360.
  • Williams T, Williams K. Self-efficacy and performance in mathematics: reciprocal determinism in 33 nations. J Educ Psychol. 2010;102(2):453–466. doi:10.1037/a0017271.
  • Yucel R. R mlmmm package: fitting multivariate linear mixed-effects models with missing values. 2007. [ Google Scholar ]
  • Yucel R. Multiple imputation. J Stat Softw. 2011; 45 :1. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yung Y, Zhang W. Proceedings of the SAS® Global Forum 2011 Conference. Cary, NC: SAS Institute Inc; 2011. Making use of incomplete observations in the analysis of structural equation models: The CALIS procedure's full information maximum likelihood method in SAS/STAT® 9.3. [ Google Scholar ]
Research article · Open access · Published: 11 July 2012

A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures

Amalia Karahalios, Laura Baglietto, John B Carlin, Dallas R English & Julie A Simpson

BMC Medical Research Methodology, volume 12, Article number 96 (2012)


Background

Retaining participants in cohort studies with multiple follow-up waves is difficult. Commonly, researchers are faced with the problem of missing data, which may introduce bias as well as a loss of statistical power and precision. The STROBE guidelines (von Elm et al., Lancet, 370:1453-1457, 2007; Vandenbroucke et al., PLoS Med, 4:e297, 2007) and the guidelines proposed by Sterne et al. (BMJ, 338:b2393, 2009) recommend that cohort studies report on the amount of missing data, the reasons for non-participation and non-response, and the method used to handle missing data in the analyses. We have conducted a review of publications from cohort studies in order to document the reporting of missing data for exposure measures and to describe the statistical methods used to account for the missing data.

Methods

A systematic search of English language papers published from January 2000 to December 2009 was carried out in PubMed. Prospective cohort studies with a sample size greater than 1,000 that analysed data using repeated measures of exposure were included.

Results

Among the 82 papers meeting the inclusion criteria, only 35 (43%) reported the amount of missing data according to the suggested guidelines. Sixty-eight papers (83%) described how they dealt with missing data in the analysis. Most of the papers excluded participants with missing data and performed a complete-case analysis (n = 54, 66%). Other papers used more sophisticated methods including multiple imputation (n = 5) or fully Bayesian modeling (n = 1). Methods known to produce biased results were also used, for example, Last Observation Carried Forward (n = 7), the missing indicator method (n = 1), and mean value substitution (n = 3). For the remaining 14 papers, the method used to handle missing data in the analysis was not stated.

Conclusions

This review highlights the inconsistent reporting of missing data in cohort studies and the continuing use of inappropriate methods to handle missing data in the analysis. Epidemiological journals should invoke the STROBE guidelines as a framework for authors so that the amount of missing data and how this was accounted for in the analysis is transparent in the reporting of cohort studies.


Background

A growing number of cohort studies are establishing protocols to re-contact participants at various times during follow-up. These waves of data collection provide researchers with the opportunity to obtain information regarding changes in the participants’ exposure and outcome measures. Incorporating the repeated measures of the exposure in the epidemiological analysis is especially important if the current exposure (or change in exposure) is thought to be more predictive of the outcome than the participants’ baseline measurement [ 1 ] or the researcher is interested in assessing the effect of a cumulative exposure [ 2 ]. The time frames for these follow-up waves of data collection can vary from one to two years up to 20 to 30 years or even longer post-baseline. Repeated ascertainment of exposure and outcome measures over time can lead to missing data for reasons such as participants not being traceable, too sick to participate, withdrawing from the study, refusing to respond to certain questions or death [ 3 , 4 ]. In this paper we focus on missing data in exposure measures that are made repeatedly in a cohort study because studies of this type (in which the outcome is often a single episode of disease or death obtained from a registry and therefore, known for all participants) are common and increasingly important in chronic disease epidemiology. Further research is needed on the consequences of and best methods for handling missing data in such study designs, but simulation and case studies have shown that missing covariate data can lead to biased results and there may be gains in precision of estimation of effects if multiple imputation is used to handle missing covariate data [ 5 – 7 ].

If participants with missing data and complete data differ with respect to exposure and outcome, estimates of association based on fully observed cases (known as a complete-case analysis) might be biased. Further, the estimates from these analyses will have less precision than an analysis of all participants in the absence of missing data. As well as complete-case analysis, there are other methods available for dealing with missing data in the statistical analysis [ 8 , 9 ]. These include ad hoc methods such as Last Observation Carried Forward and the missing indicator method, and more advanced approaches such as multiple imputation and likelihood-based formulations.

The STROBE guidelines for reporting of observational studies, published in 2007, state that the method for handling missing data should be addressed and furthermore, that the number of individuals used for analysis at each stage of the study should be reported accompanied by reasons for non-participation or non-response [ 10 , 11 ]. The guidelines published by Sterne et al. [ 12 ], an extension to the STROBE guidelines, provide general recommendations for the reporting of missing data in any study affected by missing data and specific recommendations for reporting the details of multiple imputation.

In this paper we: 1) give a brief review of the statistical methods that have been proposed for handling missing data and when they may be appropriate; 2) review how missing exposure data have been reported in large cohort studies with one or more waves of follow-up, where the repeated waves of exposures were incorporated in the statistical analyses; and 3) report how the same studies dealt with missing data in the statistical analyses.

Statistical methods for handling missing data

Complete-case analysis only includes in the analysis participants with complete data on all waves of data collection, thereby potentially reducing the precision of the estimates of the exposure-outcome associations [ 2 ]. The advantage of using complete-case analysis is that it is easily implemented, with most software packages using this method as the default. The estimates of the associations of interest may be biased if the participants with missing data are not similar to those with complete data. To be valid, complete-case analyses must assume that participants with missing data can be thought of as a random sample of those that were intended to be observed (commonly referred to in the missing data nomenclature as missing completely at random (MCAR) [ 13 ]), or at least that the likelihood of exposure being missing is independent of the outcome given the exposures [ 5 ].
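
As a toy sketch (data invented for illustration), a complete-case analysis keeps only the participants observed at every wave:

```python
# Toy illustration of complete-case analysis (hypothetical data):
# drop every participant with a missing exposure value (None) at any wave.
records = [
    {"id": 1, "wave1": 24.1, "wave2": 25.0, "wave3": 25.5},
    {"id": 2, "wave1": 23.0, "wave2": None, "wave3": 24.2},
    {"id": 3, "wave1": 30.2, "wave2": 30.8, "wave3": None},
]

complete_cases = [r for r in records if all(v is not None for v in r.values())]
print([r["id"] for r in complete_cases])  # [1] - two of three participants are lost
```

Even with modest per-wave missingness, requiring completeness at every wave can discard most of the cohort, which is why the loss of precision compounds with the number of follow-up waves.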

There are three commonly used ad hoc approaches for handling missing data, all of which can lead to bias [ 3 , 12 , 14 ]. The Last Observation Carried Forward (LOCF) method replaces the missing value in a wave of data collection with the non-missing value from the previous completed wave for the same individual. The assumption behind this approach is that the exposure status of the individual has not changed over time. The mean value substitution method replaces the missing value with the average value calculated over all the values available from the other waves of data collection for the same individual. Both LOCF and mean value substitution falsely increase the stated precision of the estimates by failing to account for the uncertainty due to the missing data and generally give biased results, even when the data are MCAR [ 7 , 15 ]. The Missing Indicator Method is applied to categorical exposures and includes an extra category of the exposure variable for those individuals with missing data. Indicator variables are created for the analysis, including an indicator for the missing data category [ 16 ]. This method is simple to implement, but also produces biased results in many settings, even when the data are MCAR [ 6 , 12 ].
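
The three ad hoc fills can be sketched on a single hypothetical exposure series; the numeric values and the smoking categories below are invented for illustration:

```python
# One participant's exposure across four waves; None marks missing data.
series = [22.0, None, 24.0, None]

# Last Observation Carried Forward: each gap repeats the previous value.
locf, last = [], None
for v in series:
    if v is not None:
        last = v
    locf.append(last)
# locf == [22.0, 22.0, 24.0, 24.0]

# Mean value substitution: gaps get the mean of this participant's observed waves.
observed = [v for v in series if v is not None]
mean_fill = [v if v is not None else sum(observed) / len(observed) for v in series]
# mean_fill == [22.0, 23.0, 24.0, 23.0]

# Missing indicator method (for a categorical exposure): missing data
# becomes an extra category of its own.
smoking = ["never", None, "current"]
with_indicator = [c if c is not None else "missing" for c in smoking]
# with_indicator == ["never", "missing", "current"]
```

Note that all three fills produce a dataset that looks complete, which is exactly the problem: downstream analyses treat the imputed values as if they had been observed, understating the uncertainty.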

Multiple Imputation (MI) begins by imputing values for the missing data multiple times by sampling from an imputation model (using either chained equations [ 17 , 18 ] or a multivariate normal model [ 19 ]). The imputation model should contain the variables that are to be included in the statistical model used for the epidemiological analysis, as well as auxiliary variables that may contain information about the missing data, and a “proper” imputation procedure incorporates appropriate variability in the imputed values. The imputation process creates multiple ‘completed’ versions of the datasets. These ‘completed datasets’ are analysed using the appropriate statistical model for the epidemiological analysis and the estimates obtained from each dataset are averaged to produce one overall MI estimate. The standard error for this overall MI estimate is derived using Rubin’s rules, which account for variability between and within the estimates obtained from the separate analyses of the ‘completed datasets’ [ 3 , 13 ]. By accounting for the variability between the completed (imputed) datasets, MI produces a valid estimate of the precision of the final MI estimate. When the imputation is performed using standard methods that are now available in many packages, with appropriate model specifications to reflect the structure of the data, the resulting MI estimate will be valid (unbiased parameter estimates with nominal confidence interval coverage) if the missing data are ‘Missing At Random’ (MAR) [ 5 ]. MAR describes a situation where the probability of being missing for a particular variable (e.g. waist circumference) can be explained by other observed variables in the dataset, but is (conditionally) independent of the variable itself (that is, waist circumference) [ 13 ].
On the other hand, MI may produce biased estimates if the data are ‘Missing Not At Random’ (MNAR), which occurs when the study participants with missing data differ from the study participants with complete data in a manner that cannot be explained by the observed data in the study [ 13 ].
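
The pooling step is simple enough to sketch directly. The function below is an illustrative implementation of Rubin's rules, assuming each of the m completed-data analyses returns a point estimate and its squared standard error; the example numbers are invented:

```python
import statistics

def rubin_pool(estimates, variances):
    """Combine m completed-data analyses with Rubin's rules.

    Returns the overall MI estimate (the average of the per-dataset
    estimates) and its total variance: the average within-imputation
    variance plus the between-imputation variance inflated by (1 + 1/m).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m           # overall MI estimate
    w_bar = sum(variances) / m           # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation (sample) variance
    total_var = w_bar + (1 + 1 / m) * b
    return q_bar, total_var

# Hypothetical log hazard ratios from m = 5 imputed datasets:
estimate, total_var = rubin_pool([1.0, 1.2, 0.9, 1.1, 1.0], [0.04] * 5)
```

Because the total variance includes the between-imputation term, the pooled standard error is larger than that of any single completed dataset; this is how MI avoids the falsely inflated precision of the ad hoc methods.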

MI is now implemented in many major statistical packages (including Stata [ 20 ] and SAS [ 21 ]) making it an easily accessible method. However, it can be a time-intensive process to impute multiple datasets, analyse the ‘completed datasets’ and combine the results; and the imputation model can be complex since it must contain the exposure and outcome variables included in the analysis model, auxiliary variables and any interactions that will be included in the final analysis model [ 22 , 23 ]. Sterne et al. [ 12 ] have described a number of pitfalls that can be encountered in the imputation procedure that might lead to biased results for the epidemiological analysis of interest.

Missing data can also be handled with the following more sophisticated methods: maximum likelihood-based formulations, fully Bayesian models and weighting methods. Likelihood-based methods use all of the available information (i.e. information from participants with both complete and incomplete data) to simultaneously estimate both the missing data model and the data analysis model, eliminating the need to handle the missing data directly [ 3 , 8 , 24 , 25 ], although in many cases the MAR assumption is also invoked to enable the missing data model to be ignored. Bayesian models also rely on a fully specified model that incorporates both the missingness process and the associations of interest [ 12 , 15 , 26 ]. Weighting methods account for missing data by weighting each observed record by the inverse of its estimated probability of being observed [ 22 , 25 ]. These methods may improve the precision of the estimates compared with complete-case analysis. However, they are also dependent on assumptions about the missingness mechanism and in some cases on specifying the correct missingness model. In general, these methods require tailored programming, which can be time-consuming and requires specialist expertise [ 15 ].
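
A minimal sketch of the weighting idea, with invented observation probabilities (in practice these would come from a fitted model of missingness, such as a logistic regression on baseline covariates):

```python
# Each complete case is weighted by 1 / P(being observed), so it also
# "stands in" for similar participants whose data are missing.
# Pairs of (outcome value, estimated probability of being observed) - hypothetical.
cases = [(1.0, 0.8), (2.0, 0.5), (3.0, 0.4)]

weights = [1.0 / p for _, p in cases]
ipw_mean = sum(y / p for y, p in cases) / sum(weights)

naive_mean = sum(y for y, _ in cases) / len(cases)
# The weighted mean pulls toward participants who were unlikely to be
# observed (here, those with larger outcomes), correcting the naive
# complete-case mean.
```

In this toy example the inverse-probability-weighted mean exceeds the naive complete-case mean, because participants with large outcomes were the least likely to be observed.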

Criteria for considering studies in this review

For this review we selected prospective cohort studies that analysed exposure data collected after initial recruitment during the follow-up period (i.e. studies looking at a change in exposure or at a time varying covariate). We restricted our review to cohort studies with more than 1,000 participants, as we thought it was more likely for there to be more missing data in follow-up measurements of exposures in large cohort studies (typically population based studies) compared to small cohorts (often based on a specific clinical population). For cohort studies reported in multiple papers, we included only the most recent original research article. Studies that only used data collected at baseline or at one of the follow-up waves in the analysis, and studies that newly recruited participants at one of the waves after baseline were excluded. We did not place any restrictions on the types of exposures or outcomes studied or the type of statistical analysis performed.

Search strategy

PubMed was searched for English language papers published between January 2000 and December 2009. We chose January 2000 as a starting date because the first widely available statistical software package for implementing MI, the NORM package [ 27 ], was developed in 1997 and updated in 1999. Search terms included: “Cohort Studies”[MeSh] AND (“attrition” OR “drop out” OR “longitudinal change” OR “missing data” OR “multiple exposure” OR “multiple follow-up” OR “multiple waves” OR “repeated exposure” OR “repeated follow-up” OR “repeated waves” OR “repeated measures” OR “time dependent covariates” OR “time dependent” OR “time varying covariate” OR “cumulative average”).

We carried out a further search of cohort studies listed in the web appendix of the paper by Lewington et al. [ 28 ], to ensure that any known large cohort studies were not missed in the original PubMed search. These cohort studies were established in the 1970s and 1980s, allowing them time to measure repeated waves of exposure on their participants and to publish these results during our study period (i.e. between 2000 and 2009).

Methods of the review

AK reviewed all articles; any uncertainties regarding the statistical method used to handle the missing data were resolved by discussion with JAS, and AK extracted the data. Additional tables and methods sections from journal websites were checked if referred to in the article.

Our aim was to assess the reporting of missing data and the methods used to handle the missing data according to the recommendations given by the STROBE guidelines [ 10 , 11 ] and Sterne et al. [ 12 ]. The information extracted is summarised in Tables 1 and 2 and Additional file 1: Table S1.

Study selection

We identified 4,277 articles via the keyword search. A total of 3,684 articles were excluded based on their title and abstract, leaving 543 articles for further evaluation. Of these, 471 articles were excluded and 72 articles were found to be appropriate for the review. A further ten studies were identified from the reference list of Lewington et al. [ 28 ] (Figure 1), giving 82 studies included in this review. The reasons for excluding studies are outlined in Figure 1; the most common were a sample size of less than 1,000 participants (54%), a study design that was not a prospective cohort (19%), and not reporting original research findings (13%).

Figure 1. Search results.

Characteristics of included studies

The characteristics of the 82 studies included are summarised in Table 1 and further details can be found in the additional table (see Additional file 1: Table S1). The studies included ranged from smaller studies that recruited 1,000 to 2,000 participants at baseline to larger studies with more than 20,000 participants, and the number published annually increased steadily from two papers in 2000 to 16 papers in 2009. The majority of studies recruited their participants in the decades 1980 to 1989 (n = 25), and 1990 to 1999 (n = 30). Cox proportional hazards regression was the most common statistical method used for the epidemiological analysis (n = 37) to analyse the repeated measures of exposure, with 35 of these papers incorporating the repeated exposure(s) as a time varying covariate and the remaining two papers including a single measure of the covariate derived from repeated assessments. Generalised Estimating Equations with a logistic (n = 10) or linear regression (n = 3) and generalised linear mixed-effects models (logistic regression (n = 3) and linear regression (n = 13)) were the next most common epidemiological analyses used.

Missing covariate data at follow-up

The methods used by the selected papers for handling missing data are summarised in Table 2 . Sixty-six papers (80%) commented on the amount of missing data at follow-up. Of these, only 35 papers provided information about the proportion of participants lost to follow-up at each wave. The remaining 31 papers provided incomplete details about the amount of missing data at each wave: 22 papers made a general comment about the amount of missing data; six papers reported the amount of missing data for the final wave but gave no detail regarding the number of participants available at previous waves of data collection (including baseline); and three papers only reported the amount of missing data for a few of the variables.

Of the 29 papers published after 2007, nine papers did not state the proportion of missing data at each follow-up wave, three papers provided a comment as to why the data were missing, and eight papers compared the baseline covariates for those with and without missing covariate data at the repeated waves of follow-up.

Among those papers that provided information on missing data, the proportion of covariate data missing at any follow-up wave ranged from 2% to 65%. Twenty-six papers (32%) compared the key variables of interest for those who did and did not have data from post-baseline waves, but only six of these presented the results in detail while the rest commented briefly in the text on whether or not there was a difference.

Methods used to deal with missing data at follow-up

The most common methods used to deal with missing data were complete-case analysis (n = 54), LOCF (n = 7) and MI (n = 5). Of the 54 papers that used complete-case analysis: 38 excluded participants who were missing exposure data at any of the waves of data collection from the analysis; one paper also excluded participants with any missing exposure data but used a weighted analysis to deal with the missing data; and the remaining 15 papers, where both the exposure and outcome measures were assessed repeatedly at each wave of data collection, excluded participant data records for waves where the exposure data were missing. Fourteen papers did not state the method used to deal with the missing data, although nine of these papers performed a Cox regression model using SAS [ 21 ] or Stata [ 20 ] and we therefore assumed that they used a complete-case analysis (Table 2 ). Both papers published in 2000 used complete-case analysis. From 2001 to 2009, the proportion of papers using complete-case analysis ranged from 25% to 65%. Methods known to produce biased results (i.e. LOCF, the missing indicator method and mean value substitution) continue to be used, with four papers using these methods in 2009.

Of the five papers that used MI [ 29 – 33 ], two papers [ 29 , 30 ] compared the characteristics of the participants with and without missing data. For the MI, three of the five papers [ 30 , 31 , 33 ] provided details of the imputation process including the number of imputations performed and the variables included in the imputation model, and compared the results from the MI analysis to results from complete-case analysis. The other two papers [ 29 , 32 ] provided details about the number of imputations performed but did not describe the variables included in their imputation model and did not compare the MI results to the complete-case analysis.

Discussion

We identified 82 cohort studies of 1,000 or more participants that were published from 2000 to 2009 and which analysed exposure data collected from repeated follow-up waves. The reporting of missing data in these studies was found to be inconsistent and generally did not follow the recommendations set out by the STROBE guidelines [ 10 , 11 ] or the guidelines set out by Sterne et al. [ 12 ]. The STROBE guidelines recommend that authors report the number of participants who take part in each wave of the study and give reasons why participants did not attend a wave. Only three papers [ 30 , 34 , 35 ] followed the STROBE guidelines fully. The majority of papers did not provide a reason or comment for why study participants did not attend each wave of follow-up. Sterne et al. [ 12 ] recommend that the reasons for missing data be described with respect to other variables and that authors investigate potentially important differences between participants and non-participants.

The STROBE guidelines were published in 2007. Of the nine papers published after 2007, only one followed the STROBE guidelines fully. This suggests that either journal editors are not using these guidelines or authors are not considering the impact of missing covariate data in their research.

A review of missing data in cancer prognostic studies published in 2004 by Burton et al. [ 36 ] and a review of developmental psychology studies published in 2009 by Jelicic et al. [ 3 ] reported similar findings to ours. Burton et al. [ 36 ] found a deficiency in the reporting of missing covariate data in cancer prognostic studies. After reviewing 100 articles, they found that only 40% of articles provided information about the method used to handle missing covariate data and only 12 articles would have satisfied their proposed guidelines for the reporting of missing data. We observed in our review, of articles published from 2000 to 2009, that a larger proportion of articles reported the method used to handle the missing data in the analysis but that many articles were still not reporting the amount of missing data and the reasons for missingness.

The cohort studies we identified used numerous methods to handle missing data in the exposure-outcome analyses. Although some studies used advanced statistical modelling procedures (e.g. MI and Bayesian), the majority removed individuals with missing data and performed a complete-case analysis, a method that may produce biased results if the missing data are not MCAR. Jelicic et al. also found in their review that a large proportion of studies used complete-case analysis to handle their missing data [ 3 ]. For studies with a large proportion of missing data, excluding participants with missing data may also reduce the precision of the analysis substantially. Ad hoc methods (e.g. LOCF, the missing indicator method and mean value substitution), which are generally not recommended [ 16 , 25 ] because they fail to account for the uncertainty in the data and may produce biased estimates [ 12 ], continue to be used. Although MI is becoming more accessible, only five studies used this method. The reporting of the imputation procedure was inconsistent and often incomplete. This was also observed by two independent reviews of the reporting of MI in the medical journals: BMJ, JAMA, Lancet and the New England Journal of Medicine [ 12 , 37 ]. Future studies should follow the recommendations outlined by Sterne et al. [ 12 ] to ensure that enough details are provided about the MI procedure, especially the implementation and details of the imputation modelling process.

Strengths and limitations of the literature review

We aimed to complete a comprehensive review of all papers published that analysed exposure variables measured at multiple follow-up waves. Several keywords were used in order to obtain as many articles as possible. The keyword search was then supplemented with cohort studies identified from a pooled analysis of 61 cohort studies. Although a large number of abstracts and studies were identified, some cohort studies might have been missed. If multiple papers were identified from one study, the most recent article was included in the review, which might have led us to omit papers from the same study that used a more appropriate missing data method. Our search criteria only included papers written in English and only PubMed was searched. Our search strategy was limited to articles published between 2000 and 2009. On average three papers of the type we focussed on were published each year from 2000 to 2002 and the number has increased since then, so it seems unlikely that many papers were published before this time. Also, MI was not as accessible prior to 1997, so papers published before 2000 were more likely to have used complete case analysis or other ad hoc methods.

Conclusions

With the increase in the number of cohort studies analysing data with multiple follow-up waves it is essential that authors follow the STROBE guidelines [ 10 , 11 ] in conjunction with the guidelines proposed by Sterne et al. [ 12 ] to report on the amount of missing data in the study and the methods used to handle the missing data in the analyses. This will ensure that missing data are reported with enough detail to allow readers to assess the validity of the results. Incomplete data and the statistical methods used to deal with the missing data can lead to bias, or be inefficient, and so authors should be encouraged to use online supplements (if necessary) as a way of publishing both the details of the missing data in their study and the details of the methods used to deal with the missing data.

References

Cupples LA, D’Agostino RB, Anderson K, Kannel WB: Comparison of baseline and repeated measure covariate techniques in the Framingham Heart Study. Stat Med. 1988, 7: 205-222. 10.1002/sim.4780070122.

Shortreed SM, Forbes AB: Missing data in the exposure of interest and marginal structural models: a simulation study based on the Framingham Heart Study. Stat Med. 2010, 29: 431-443.

Jelicic H, Phelps E, Lerner RM: Use of missing data methods in longitudinal studies: the persistence of bad practices in developmental psychology. Dev Psychol. 2009, 45: 1195-1199.

Kurland BF, Johnson LL, Egleston BL, Diehr PH: Longitudinal data with follow-up truncated by death: match the analysis method to research aims. Stat Sci. 2009, 24: 211-10.1214/09-STS293.

White IR, Carlin JB: Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med. 2010

Knol MJ, Janssen KJ, Donders AR, Egberts AC, Heerdink ER, Grobbee DE, Moons KG, Geerlings MI: Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epidemiol. 2010, 63: 728-736. 10.1016/j.jclinepi.2009.08.028.

Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ: Analyzing incomplete longitudinal clinical trial data. Biostatistics. 2004, 5: 445-464. 10.1093/biostatistics/kxh001.

Schafer JL, Graham JW: Missing data: our view of the state of the art. Psychol Methods. 2002, 7: 147-177.

Carpenter J, Kenward MG: A critique of common approaches to missing data. 2007, National Institute for Health Research, Birmingham, AL

von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP: The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet. 2007, 370: 1453-1457. 10.1016/S0140-6736(07)61602-X.

Vandenbroucke JP, von Elm E, Altman DG, Gotzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M: Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. PLoS Med. 2007, 4: e297-10.1371/journal.pmed.0040297.

Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009, 338: b2393-10.1136/bmj.b2393.

Little RJA, Rubin DB: Statistical analysis with missing data. 2002, John Wiley & Sons, Inc, Hoboken, New Jersey

Rubin DB: Multiple imputation for nonresponse in surveys. 1987, John Wiley & Sons, New York

Buhi ER, Goodson P, Neilands TB: Out of sight, not out of mind: strategies for handling missing data. Am J Health Behav. 2008, 32: 83-92.

Greenland S, Finkle WD: A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995, 142: 1255-1264.

Azur MJ, Stuart EA, Frangakis C, Leaf PJ: Multiple imputation by chained equations: what is it and how does it work?. Int J Methods Psychiatr Res. 2011, 20: 40-49. 10.1002/mpr.329.

White IR, Royston P, Wood AM: Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011, 30: 377-399. 10.1002/sim.4067.

Lee KJ, Carlin JB: Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. Am J Epidemiol. 2010, 171: 624-632. 10.1093/aje/kwp425.

StataCorp: Stata Statistical Software: Release 11. 2009, StataCorp LP, College Station, TX

SAS Institute Inc: SAS OnlineDoc, Version 8. 2000, SAS Institute, Inc., Cary, NC

Carpenter JR, Kenward MG, Vansteelandt S: A comparison of multiple imputation and doubly robust estimation for analyses with missing data. J R Stat Soc: Series A (Stat Soc). 2006, 169: 571-584.

Graham JW: Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009, 60: 549-576. 10.1146/annurev.psych.58.110405.085530.

Enders CK: A primer on maximum likelihood algorithms available for use with missing data. Struct Equ Model. 2001, 8: 128-141. 10.1207/S15328007SEM0801_7.

Horton NJ, Kleinman KP: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007, 61: 79-90. 10.1198/000313007X172556.

Baraldi AN, Enders CK: An introduction to modern missing data analyses. J Sch Psychol. 2010, 48: 5-37. 10.1016/j.jsp.2009.10.001.

Schafer J, Yucel R: PAN: Multiple imputation for multivariate panel data (Software). 1999, 86: 949-955.

Lewington S, Clarke R, Qizilbash N, Peto R, Collins R: Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet. 2002, 360: 1903-1913.

Bond GE, Burr RL, McCurry SM, Rice MM, Borenstein AR, Larson EB: Alcohol and cognitive performance: a longitudinal study of older Japanese Americans. The Kame Project. Int Psychogeriatr. 2005, 17: 653-668. 10.1017/S1041610205001651.

Kivimaki M, Lawlor DA, Singh-Manoux A, Batty GD, Ferrie JE, Shipley MJ, Nabi H, Sabia S, Marmot MG, Jokela M: Common mental disorder and obesity: insight from four repeat measures over 19 years: prospective Whitehall II cohort study. BMJ. 2009, 339: b3765-10.1136/bmj.b3765.

McCormack VA, Dos Santos Silva I, De Stavola BL, Perry N, Vinnicombe S, Swerdlow AJ, Hardy R, Kuh D: Life-course body size and perimenopausal mammographic parenchymal patterns in the MRC 1946 British birth cohort. Br J Cancer. 2003, 89: 852-859. 10.1038/sj.bjc.6601207.

Sugihara Y, Sugisawa H, Shibata H, Harada K: Productive roles, gender, and depressive symptoms: evidence from a national longitudinal study of late-middle-aged Japanese. J Gerontol B Psychol Sci Soc Sci. 2008, 63: P227-P234. 10.1093/geronb/63.4.P227.

Wiles NJ, Haase AM, Gallacher J, Lawlor DA, Lewis G: Physical activity and common mental disorder: results from the Caerphilly study. Am J Epidemiol. 2007, 165: 946-954. 10.1093/aje/kwk070.

Fuhrer R, Dufouil C, Dartigues JF: Exploring sex differences in the relationship between depressive symptoms and dementia incidence: prospective results from the PAQUID Study. J Am Geriatr Soc. 2003, 51: 1055-1063. 10.1046/j.1532-5415.2003.51352.x.

Sogaard AJ, Meyer HE, Tonstad S, Haheim LL, Holme I: Weight cycling and risk of forearm fractures: a 28-year follow-up of men in the Oslo Study. Am J Epidemiol. 2008, 167: 1005-1013. 10.1093/aje/kwm384.

Burton A, Altman DG: Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer. 2004, 91: 4-8. 10.1038/sj.bjc.6601907.

Mackinnon A: The use and reporting of multiple imputation in medical research - a review. J Intern Med. 2010, 268: 586-593. 10.1111/j.1365-2796.2010.02274.x.

Agrawal A, Grant JD, Waldron M, Duncan AE, Scherrer JF, Lynskey MT, Madden PA, Bucholz KK, Heath AC: Risk for initiation of substance use as a function of age of onset of cigarette, alcohol and cannabis use: findings in a Midwestern female twin cohort. Prev Med. 2006, 43: 125-128. 10.1016/j.ypmed.2006.03.022.

Anstey KJ, Hofer SM, Luszcz MA: Cross-sectional and longitudinal patterns of dedifferentiation in late-life cognitive and sensory function: the effects of age, ability, attrition, and occasion of measurement. J Exp Psychol Gen. 2003, 132: 470-487.

Arifeen S, Black RE, Antelman G, Baqui A, Caulfield L, Becker S: Exclusive breastfeeding reduces acute respiratory infection and diarrhea deaths among infants in Dhaka slums. Pediatrics. 2001, 108: E67-10.1542/peds.108.4.e67.

Bada HS, Das A, Bauer CR, Shankaran S, Lester B, LaGasse L, Hammond J, Wright LL, Higgins R: Impact of prenatal cocaine exposure on child behavior problems through school age. Pediatrics. 2007, 119: e348-359. 10.1542/peds.2006-1404.

Beesdo K, Bittner A, Pine DS, Stein MB, Hofler M, Lieb R, Wittchen HU: Incidence of social anxiety disorder and the consistent risk for secondary depression in the first three decades of life. Arch Gen Psychiatry. 2007, 64: 903-912. 10.1001/archpsyc.64.8.903.

Berecki-Gisolf J, Begum N, Dobson AJ: Symptoms reported by women in midlife: menopausal transition or aging?. Menopause. 2009, 16: 1021-1029. 10.1097/gme.0b013e3181a8c49f.

Blazer DG, Sachs-Ericsson N, Hybels CF: Perception of unmet basic needs as a predictor of depressive symptoms among community-dwelling older adults. J Gerontol A Biol Sci Med Sci. 2007, 62: 191-195. 10.1093/gerona/62.2.191.

Bray JW, Zarkin GA, Ringwalt C, Qi J: The relationship between marijuana initiation and dropping out of high school. Health Econ. 2000, 9: 9-18. 10.1002/(SICI)1099-1050(200001)9:1<9::AID-HEC471>3.0.CO;2-Z.

Breslau N, Schultz LR, Johnson EO, Peterson EL, Davis GC: Smoking and the risk of suicidal behavior: a prospective study of a community sample. Arch Gen Psychiatry. 2005, 62: 328-334. 10.1001/archpsyc.62.3.328.

Brown JW, Liang J, Krause N, Akiyama H, Sugisawa H, Fukaya T: Transitions in living arrangements among elders in Japan: does health make a difference?. J Gerontol B Psychol Sci Soc Sci. 2002, 57: S209-220. 10.1093/geronb/57.4.S209.

Bruckl TM, Wittchen HU, Hofler M, Pfister H, Schneider S, Lieb R: Childhood separation anxiety and the risk of subsequent psychopathology: Results from a community study. Psychother Psychosom. 2007, 76: 47-56. 10.1159/000096364.

Cauley JA, Lui LY, Barnes D, Ensrud KE, Zmuda JM, Hillier TA, Hochberg MC, Schwartz AV, Yaffe K, Cummings SR, Newman AB: Successful skeletal aging: a marker of low fracture risk and longevity. The Study of Osteoporotic Fractures (SOF). J Bone Miner Res. 2009, 24: 134-143. 10.1359/jbmr.080813.

Celentano DD, Munoz A, Cohn S, Vlahov D: Dynamics of behavioral risk factors for HIV/AIDS: a 6-year prospective study of injection drug users. Drug Alcohol Depend. 2001, 61: 315-322. 10.1016/S0376-8716(00)00154-X.

Chao C, Jacobson LP, Tashkin D, Martinez-Maza O, Roth MD, Margolick JB, Chmiel JS, Holloway MN, Zhang ZF, Detels R: Recreational amphetamine use and risk of HIV-related non-Hodgkin lymphoma. Cancer Causes Control. 2009, 20: 509-516. 10.1007/s10552-008-9258-y.

Cheung YB, Khoo KS, Karlberg J, Machin D: Association between psychological symptoms in adults and growth in early life: longitudinal follow up study. BMJ. 2002, 325: 749-10.1136/bmj.325.7367.749.

Chien KL, Hsu HC, Sung FC, Su TC, Chen MF, Lee YT: Hyperuricemia as a risk factor on cardiovascular events in Taiwan: the Chin-Shan Community Cardiovascular Cohort Study. Atherosclerosis. 2005, 183: 147-155. 10.1016/j.atherosclerosis.2005.01.018.

Clays E, De Bacquer D, Leynen F, Kornitzer M, Kittel F, De Backer G: Job stress and depression symptoms in middle-aged workers–prospective results from the Belstress study. Scand J Work Environ Health. 2007, 33: 252-259. 10.5271/sjweh.1140.

Conron KJ, Beardslee W, Koenen KC, Buka SL, Gortmaker SL: A longitudinal study of maternal depression and child maltreatment in a national sample of families investigated by child protective services. Arch Pediatr Adolesc Med. 2009, 163: 922-930. 10.1001/archpediatrics.2009.176.

Cuddy TE, Tate RB: Sudden unexpected cardiac death as a function of time since the detection of electrocardiographic and clinical risk factors in apparently healthy men: the Manitoba Follow-Up Study, 1948 to 2004. Can J Cardiol. 2006, 22: 205-211. 10.1016/S0828-282X(06)70897-2.

Daniels MC, Adair LS: Growth in young Filipino children predicts schooling trajectories through high school. J Nutr. 2004, 134: 1439-1446.

de Mutsert R, Grootendorst DC, Boeschoten EW, Brandts H, van Manen JG, Krediet RT, Dekker FW: Subjective global assessment of nutritional status is strongly associated with mortality in chronic dialysis patients. Am J Clin Nutr. 2009, 89: 787-793. 10.3945/ajcn.2008.26970.

De Stavola BL, Meade TW: Long-term effects of hemostatic variables on fatal coronary heart disease: 30-year results from the first prospective Northwick Park Heart Study (NPHS-I). J Thromb Haemost. 2007, 5: 461-471. 10.1111/j.1538-7836.2007.02330.x.

Di Nisio M, Barbui T, Di Gennaro L, Borrelli G, Finazzi G, Landolfi R, Leone G, Marfisi R, Porreca E, Ruggeri M, et al: The haematocrit and platelet target in polycythemia vera. Br J Haematol. 2007, 136: 249-259. 10.1111/j.1365-2141.2006.06430.x.

Engberg J, Morral AR: Reducing substance use improves adolescents’ school attendance. Addiction. 2006, 101: 1741-1751. 10.1111/j.1360-0443.2006.01544.x.

Fergusson DM, Boden JM, Horwood LJ: The developmental antecedents of illicit drug use: evidence from a 25-year longitudinal study. Drug Alcohol Depend. 2008, 96: 165-177. 10.1016/j.drugalcdep.2008.03.003.

Fung TT, Malik V, Rexrode KM, Manson JE, Willett WC, Hu FB: Sweetened beverage consumption and risk of coronary heart disease in women. Am J Clin Nutr. 2009, 89: 1037-1042. 10.3945/ajcn.2008.27140.

Gallo WT, Bradley EH, Dubin JA, Jones RN, Falba TA, Teng HM, Kasl SV: The persistence of depressive symptoms in older workers who experience involuntary job loss: results from the health and retirement survey. J Gerontol B Psychol Sci Soc Sci. 2006, 61: S221-228. 10.1093/geronb/61.4.S221.

Gauderman WJ, Avol E, Gilliland F, Vora H, Thomas D, Berhane K, McConnell R, Kuenzli N, Lurmann F, Rappaport E, et al: The effect of air pollution on lung development from 10 to 18 years of age. N Engl J Med. 2004, 351: 1057-1067. 10.1056/NEJMoa040610.

Glotzer TV, Daoud EG, Wyse DG, Singer DE, Ezekowitz MD, Hilker C, Miller C, Qi D, Ziegler PD: The relationship between daily atrial tachyarrhythmia burden from implantable device diagnostics and stroke risk: the TRENDS study. Circ Arrhythm Electrophysiol. 2009, 2: 474-480. 10.1161/CIRCEP.109.849638.

Gunderson EP, Jacobs DR, Chiang V, Lewis CE, Tsai A, Quesenberry CP, Sidney S: Childbearing is associated with higher incidence of the metabolic syndrome among women of reproductive age controlling for measurements before pregnancy: the CARDIA study. Am J Obstet Gynecol. 2009, 201 (177): e171-179.

Haag MD, Bos MJ, Hofman A, Koudstaal PJ, Breteler MM, Stricker BH: Cyclooxygenase selectivity of nonsteroidal anti-inflammatory drugs and risk of stroke. Arch Intern Med. 2008, 168: 1219-1224. 10.1001/archinte.168.11.1219.

Hart CL, Hole DJ, Davey Smith G: Are two really better than one? Empirical examination of repeat blood pressure measurements and stroke risk in the Renfrew/Paisley and collaborative studies. Stroke. 2001, 32: 2697-2699. 10.1161/hs1101.098637.

Hogg RS, Bangsberg DR, Lima VD, Alexander C, Bonner S, Yip B, Wood E, Dong WW, Montaner JS, Harrigan PR: Emergence of drug resistance is associated with an increased risk of death among patients first starting HAART. PLoS Med. 2006, 3: e356-10.1371/journal.pmed.0030356.

Jacobs EJ, Thun MJ, Connell CJ, Rodriguez C, Henley SJ, Feigelson HS, Patel AV, Flanders WD, Calle EE: Aspirin and other nonsteroidal anti-inflammatory drugs and breast cancer incidence in a large U.S. cohort. Cancer Epidemiol Biomarkers Prev. 2005, 14: 261-264.

Jamrozik E, Knuiman MW, James A, Divitini M, Musk AW: Risk factors for adult-onset asthma: a 14-year longitudinal study. Respirology. 2009, 14: 814-821. 10.1111/j.1440-1843.2009.01562.x.

Jimenez M, Krall EA, Garcia RI, Vokonas PS, Dietrich T: Periodontitis and incidence of cerebrovascular disease in men. Ann Neurol. 2009, 66: 505-512. 10.1002/ana.21742.

Juhaeri, Stevens J, Chambless LE, Nieto FJ, Jones D, Schreiner P, Arnett D, Cai J: Associations of weight loss and changes in fat distribution with the remission of hypertension in a bi-ethnic cohort: the Atherosclerosis Risk in Communities Study. Prev Med. 2003, 36: 330-339. 10.1016/S0091-7435(02)00063-4.

Karlamangla A, Zhou K, Reuben D, Greendale G, Moore A: Longitudinal trajectories of heavy drinking in adults in the United States of America. Addiction. 2006, 101: 91-99. 10.1111/j.1360-0443.2005.01299.x.

Keller MC, Neale MC, Kendler KS: Association of different adverse life events with distinct patterns of depressive symptoms. Am J Psychiatry. 2007, 164: 1521-1529. 10.1176/appi.ajp.2007.06091564. quiz 1622

Kersting RC: Impact of social support, diversity, and poverty on nursing home utilization in a nationally representative sample of older Americans. Soc Work Health Care. 2001, 33: 67-87. 10.1300/J010v33n02_05.

Lacson E, Wang W, Lazarus JM, Hakim RM: Change in vascular access and mortality in maintenance hemodialysis patients. Am J Kidney Dis. 2009, 54: 912-921. 10.1053/j.ajkd.2009.07.008.

Lamarca R, Ferrer M, Andersen PK, Liestol K, Keiding N, Alonso J: A changing relationship between disability and survival in the elderly population: differences by age. J Clin Epidemiol. 2003, 56: 1192-1201. 10.1016/S0895-4356(03)00201-4.

Lawson DW, Mace R: Sibling configuration and childhood growth in contemporary British families. Int J Epidemiol. 2008, 37: 1408-1421. 10.1093/ije/dyn116.

Lee DH, Ha MH, Kam S, Chun B, Lee J, Song K, Boo Y, Steffen L, Jacobs DR: A strong secular trend in serum gamma-glutamyltransferase from 1996 to 2003 among South Korean men. Am J Epidemiol. 2006, 163: 57-65.

Lee DS, Evans JC, Robins SJ, Wilson PW, Albano I, Fox CS, Wang TJ, Benjamin EJ, D’Agostino RB, Vasan RS: Gamma glutamyl transferase and metabolic syndrome, cardiovascular disease, and mortality risk: the Framingham Heart Study. Arterioscler Thromb Vasc Biol. 2007, 27: 127-133. 10.1161/01.ATV.0000251993.20372.40.

Li G, Higdon R, Kukull WA, Peskind E, Van Valen Moore K, Tsuang D, van Belle G, McCormick W, Bowen JD, Teri L, et al: Statin therapy and risk of dementia in the elderly: a community-based prospective cohort study. Neurology. 2004, 63: 1624-1628. 10.1212/01.WNL.0000142963.90204.58.

Li LW, Conwell Y: Effects of changes in depressive symptoms and cognitive functioning on physical disability in home care elders. J Gerontol A Biol Sci Med Sci. 2009, 64: 230-236.

Limburg PJ, Anderson KE, Johnson TW, Jacobs DR, Lazovich D, Hong CP, Nicodemus KK, Folsom AR: Diabetes mellitus and subsite-specific colorectal cancer risks in the Iowa Women’s Health Study. Cancer Epidemiol Biomarkers Prev. 2005, 14: 133-137.

Luchenski S, Quesnel-Vallee A, Lynch J: Differences between women’s and men’s socioeconomic inequalities in health: longitudinal analysis of the Canadian population, 1994–2003. J Epidemiol Community Health. 2008, 62: 1036-1044. 10.1136/jech.2007.068908.

Melamed ML, Eustace JA, Plantinga L, Jaar BG, Fink NE, Coresh J, Klag MJ, Powe NR: Changes in serum calcium, phosphate, and PTH and the risk of death in incident dialysis patients: a longitudinal study. Kidney Int. 2006, 70: 351-357. 10.1038/sj.ki.5001542.

Menotti A, Lanti M, Kromhout D, Blackburn H, Jacobs D, Nissinen A, Dontas A, Kafatos A, Nedeljkovic S, Adachi H: Homogeneity in the relationship of serum cholesterol to coronary deaths across different cultures: 40-year follow-up of the Seven Countries Study. Eur J Cardiovasc Prev Rehabil. 2008, 15: 719-725. 10.1097/HJR.0b013e328315789c.

Michaelsson K, Olofsson H, Jensevik K, Larsson S, Mallmin H, Berglund L, Vessby B, Melhus H: Leisure physical activity and the risk of fracture in men. PLoS Med. 2007, 4: e199-10.1371/journal.pmed.0040199.

Michaud DS, Liu Y, Meyer M, Giovannucci E, Joshipura K: Periodontal disease, tooth loss, and cancer risk in male health professionals: a prospective cohort study. Lancet Oncol. 2008, 9: 550-558. 10.1016/S1470-2045(08)70106-2.

Mirzaei M, Taylor R, Morrell S, Leeder SR: Predictors of blood pressure in a cohort of school-aged children. Eur J Cardiovasc Prev Rehabil. 2007, 14: 624-629. 10.1097/HJR.0b013e32828621c6.

Mishra GD, McNaughton SA, Bramwell GD, Wadsworth ME: Longitudinal changes in dietary patterns during adult life. Br J Nutr. 2006, 96: 735-744.

Monda KL, Adair LS, Zhai F, Popkin BM: Longitudinal relationships between occupational and domestic physical activity patterns and body weight in China. Eur J Clin Nutr. 2008, 62: 1318-1325. 10.1038/sj.ejcn.1602849.

Moss SE, Klein R, Klein BE: Long-term incidence of dry eye in an older population. Optom Vis Sci. 2008, 85: 668-674. 10.1097/OPX.0b013e318181a947.

Nabi H, Consoli SM, Chastang JF, Chiron M, Lafont S, Lagarde E: Type A behavior pattern, risky driving behaviors, and serious road traffic accidents: a prospective study of the GAZEL cohort. Am J Epidemiol. 2005, 161: 864-870. 10.1093/aje/kwi110.

Nakano T, Tatemichi M, Miura Y, Sugita M, Kitahara K: Long-term physiologic changes of intraocular pressure: a 10-year longitudinal analysis in young and middle-aged Japanese men. Ophthalmology. 2005, 112: 609-616. 10.1016/j.ophtha.2004.10.046.

Nowicki MJ, Vigen C, Mack WJ, Seaberg E, Landay A, Anastos K, Young M, Minkoff H, Greenblatt R, Levine AM: Association of cells with natural killer (NK) and NKT immunophenotype with incident cancers in HIV-infected women. AIDS Res Hum Retrovir. 2008, 24: 163-168.

Ormel J, Oldehinkel AJ, Vollebergh W: Vulnerability before, during, and after a major depressive episode: a 3-wave population-based study. Arch Gen Psychiatry. 2004, 61: 990-996. 10.1001/archpsyc.61.10.990.

Rabbitt P, Lunn M, Wong D, Cobain M: Sudden declines in intelligence in old age predict death and dropout from longitudinal studies. J Gerontol B Psychol Sci Soc Sci. 2008, 63: P205-P211. 10.1093/geronb/63.4.P205.

Randolph JF, Sowers M, Bondarenko I, Gold EB, Greendale GA, Bromberger JT, Brockwell SE, Matthews KA: The relationship of longitudinal change in reproductive hormones and vasomotor symptoms during the menopausal transition. J Clin Endocrinol Metab. 2005, 90: 6106-6112. 10.1210/jc.2005-1374.

Rousseau MC, Abrahamowicz M, Villa LL, Costa MC, Rohan TE, Franco EL: Predictors of cervical coinfection with multiple human papillomavirus types. Cancer Epidemiol Biomarkers Prev. 2003, 12: 1029-1037.

Ryu S, Chang Y, Woo HY, Lee KB, Kim SG, Kim DI, Kim WS, Suh BS, Jeong C, Yoon K: Time-dependent association between metabolic syndrome and risk of CKD in Korean men without hypertension or diabetes. Am J Kidney Dis. 2009, 53: 59-69. 10.1053/j.ajkd.2008.07.027.

Seid M, Varni JW, Cummings L, Schonlau M: The impact of realized access to care on health-related quality of life: a two-year prospective cohort study of children in the California State Children’s Health Insurance Program. J Pediatr. 2006, 149: 354-361. 10.1016/j.jpeds.2006.04.024.

Silfverdal SA, Ehlin A, Montgomery SM: Protection against clinical pertussis induced by whole-cell pertussis vaccination is related to primo-immunisation intervals. Vaccine. 2007, 25: 7510-7515. 10.1016/j.vaccine.2007.08.046.

Spence SH, Najman JM, Bor W, O’Callaghan MJ, Williams GM: Maternal anxiety and depression, poverty and marital relationship factors during early childhood as predictors of anxiety and depressive symptoms in adolescence. J Child Psychol Psychiatry. 2002, 43: 457-469. 10.1111/1469-7610.00037.

Stewart R, Xue QL, Masaki K, Petrovitch H, Ross GW, White LR, Launer LJ: Change in blood pressure and incident dementia: a 32-year prospective study. Hypertension. 2009, 54: 233-240. 10.1161/HYPERTENSIONAHA.109.128744.

Strasak AM, Kelleher CC, Klenk J, Brant LJ, Ruttmann E, Rapp K, Concin H, Diem G, Pfeiffer KP, Ulmer H: Longitudinal change in serum gamma-glutamyltransferase and cardiovascular disease mortality: a prospective population-based study in 76,113 Austrian adults. Arterioscler Thromb Vasc Biol. 2008, 28: 1857-1865. 10.1161/ATVBAHA.108.170597.

Strawbridge WJ, Cohen RD, Shema SJ: Comparative strength of association between religious attendance and survival. Int J Psychiatry Med. 2000, 30: 299-308.

Sung M, Erkanli A, Angold A, Costello EJ: Effects of age at first substance use and psychiatric comorbidity on the development of substance use disorders. Drug Alcohol Depend. 2004, 75: 287-299. 10.1016/j.drugalcdep.2004.03.013.

Tehard B, Lahmann PH, Riboli E, Clavel-Chapelon F: Anthropometry, breast cancer and menopausal status: use of repeated measurements over 10 years of follow-up-results of the French E3N women’s cohort study. Int J Cancer. 2004, 111: 264-269. 10.1002/ijc.20213.

Vikan T, Johnsen SH, Schirmer H, Njolstad I, Svartberg J: Endogenous testosterone and the prospective association with carotid atherosclerosis in men: the Tromso study. Eur J Epidemiol. 2009, 24: 289-295. 10.1007/s10654-009-9322-2.

Wang NY, Young JH, Meoni LA, Ford DE, Erlinger TP, Klag MJ: Blood pressure change and risk of hypertension associated with parental hypertension: the Johns Hopkins Precursors Study. Arch Intern Med. 2008, 168: 643-648. 10.1001/archinte.168.6.643.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/96/prepub

Acknowledgements

This work was supported by the National Health & Medical Research Council Grants Number 60740 and Number 251533.

Author information

Authors and Affiliations

Cancer Epidemiology Centre, Cancer Council Victoria, Carlton, VIC, Australia

Amalia Karahalios, Laura Baglietto, Dallas R English & Julie A Simpson

Centre for Molecular, Environmental, Genetic, and Analytic Epidemiology, School of Population Health, The University of Melbourne, Parkville, VIC, Australia

Amalia Karahalios, Laura Baglietto, John B Carlin, Dallas R English & Julie A Simpson

Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, Parkville, VIC, Australia

John B Carlin

Corresponding author

Correspondence to Julie A Simpson.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AK drafted the protocol for the review, reviewed the articles and drafted the manuscript. JAS conceived of the review, resolved any discrepancies encountered by AK when reviewing the articles and helped with drafting the manuscript. LB, JBC and DRE provided feedback on the design of the protocol and drafts of the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Table S1. Detailed characteristics of the studies included in the systematic review. Details of studies included in the systematic review and the corresponding reference list [ 29 – 35 , 38 – 112 ]. (DOC 212 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Karahalios, A., Baglietto, L., Carlin, J.B. et al. A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures. BMC Med Res Methodol 12, 96 (2012). https://doi.org/10.1186/1471-2288-12-96

Received: 12 December 2011

Accepted: 11 July 2012

Published: 11 July 2012

DOI: https://doi.org/10.1186/1471-2288-12-96

Keywords

  • Longitudinal cohort studies
  • Missing exposure data
  • Repeated exposure measurement
  • Missing data methods

BMC Medical Research Methodology

ISSN: 1471-2288

thesis missing data

IMAGES

  1. Analysis of Missing Data

    thesis missing data

  2. Missing data techniques taxonomy.

    thesis missing data

  3. How to Handle Missing Data in Practice: Guide for Beginners

    thesis missing data

  4. PPT

    thesis missing data

  5. More Discussion and Reporting guideline of Missing Data Analysis

    thesis missing data

  6. (PDF) Missing Data

    thesis missing data

COMMENTS

  1. PDF METHODS OF HANDLING MISSING DATA IN A Thesis

    Missing Data in One Shot Response Based Power System Control. Major Professor: Steven Rovnyak. The thesis extends the work done in [1] [2] by Rovnyak, et al. where the authors ... in this thesis assuming di erent missing data scenarios. In addition to CC1, the chapter also describes another set of control combination (CC2) whose performance ...

  2. How handling missing data may impact conclusions: A comparison of six

    Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the choice of methodology for handling missing data might impact clinical conclusions drawn from a regression ...

  3. Missing data imputation

    Missing data imputation. Missing values complicate the analysis of large-scale observational datasets such as electronic health records. Our work has developed several foundational new models for missing value imputation, including low rank models and Gaussian copula models. We have also demonstrated improved methods to handle missing-not-at ...

  4. PDF Missing Data Problems in Machine Learning

    Missing Data Problems in Machine Learning Benjamin M. Marlin Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2008 Learning, inference, and prediction in the presence of missing data are pervasive problems in machine learning and statistical data analysis. This thesis focuses on the problems of collab-

  5. A review on missing values for main challenges and methods

    The missing rate is one of the crucial metrics to measure the number of missing values in a dataset. The pattern of missing data and the proportion of missing data, particularly when the percentage of missing data surpasses 40%, have a considerable negative impact on the accuracy of prediction (or imputation), according to Song et al. [26].

  6. The prevention and handling of the missing data

    Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data [].Accordingly, some studies have focused on handling the missing data, problems caused by missing data, and ...

  7. Missing Data

    Missing data, or missing values, occur when you don't have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons. In any dataset, there are usually some missing data. In quantitative research, missing values appear as blank cells in your ...

  8. Analysing and Interpreting Data in Your Dissertation: Making Sense of

    Missing data and outliers can significantly impact the results of your analysis. For missing data, several strategies can be employed, such as deletion (removing incomplete cases), mean imputation (replacing missing values with the mean), or more advanced techniques like multiple imputation. The choice of method depends on the proportion and ...

  9. Introduction to Dealing with Missing Data (Online)

    Missing data are very common in research studies, but ignoring these cases can lead to invalid and misleading conclusions being drawn. This course provides guidance on how to deal with missing values and the best ways of analysing a dataset that is incomplete. The course covers the following topics: Reasons for missing data. Types of missing data.

  10. Principled missing data methods for researchers

    The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discussed and demonstrated three principled missing data methods: multiple imputation, full information maximum likelihood, and expectation ...

  11. Statistical primer: how to deal with missing data in scientific

    Missing data are a common challenge encountered in research which can compromise the results of statistical inference when not handled appropriately. This paper aims to introduce basic concepts of missing data to a non-statistical audience, list and compare some of the most popular approaches for handling missing data in practice and provide ...

  12. Dealing with Missing Data: A Comparative Exploration of Approaches

    Missing survey data occur for three reasons: (1) noncoverage—the observation fell outside of the sample, (2) total nonresponse—the would-be respondent failed to respond to the survey, and (3) item nonresponse—the respondent skipped a particular survey item (Brick and Kalton 1996).Although data missing as a result of these different causes present distinct challenges for the researcher ...

  13. Handling Missing Data

    The starting point for dealing with the problem of missing data is to understand why it is a problem. One of the fundamental ideas in statistics is that the accuracy of an estimate is a function of two properties: precision and bias. Missing data is potentially disastrous for both of these properties.

  14. Full article: What is Missing in Missing Data Handling? An Evaluation

    1 Introduction. Missing data can significantly influence results and conclusions that can be drawn from data as incomplete data and inappropriate statistical methods for handling missing data can produce bias (Karahalios et al. Citation 2012).For each study, researchers should indicate the amount of missing data for each key variable and address how missing data was handled (Vandenbroucke et ...

  15. Missing Data Problems in Machine Learning

    This thesis develops models and methods for collaborative prediction with non-random missing data by combining standard models for complete data with models of the missing data process and describes several strategies for classification with missing features including the use of generative classifiers. Learning, inference, and prediction in the presence of missing data are pervasive problems ...

  16. Full article: Strategies for handling missing data in longitudinal

    Missing data methods, maximum likelihood estimation (MLE) and multiple imputation (MI), for longitudinal questionnaire data were investigated via simulation. Predictive mean matching (PMM) was applied at both item and scale levels, logistic regression at item level and multivariate normal imputation at scale level.

  17. PDF Multiple Imputation of Missing Data in Multilevel Models Nidhi ...

    research. Missing values can occur at any level in multilevel data, but guidance on multiple imputation in data with more than two levels is currently an open research question. This thesis implements and extends the Gelman and Hill approach for imputation of missing data at higher levels by including aggregate forms of individual-level mea-

  18. Investigating statistical approaches to handling missing data in the

    Chapter 1 introduces the problem of missing data and how it may arise, describes the Gateshead Millennium Study data to which all the missing data methods will be applied, and concludes by giving the aims of the thesis. Chapter 2 provides an in-depth review of various missing data approaches and indicates which ...

  19. Missing Data and the EM Algorithm

    Ignoring the missing data mechanism. The likelihood function that ignores the missing data mechanism is L_ign(θ | y_obs) ∝ f(y_obs | θ) = ∫ f(y_obs, y_mis | θ) dy_mis. (2) When is L ∝ L_ign, so that the missing data mechanism can be ignored in further analysis? This holds if: (1) the data are MAR, and (2) the parameters η governing the missingness are distinct from θ.
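When ignorability holds (MAR plus distinct parameters), EM maximizes L_ign directly by alternating expected fill-ins with parameter updates. A minimal sketch for a bivariate normal with missing values in the second coordinate (the function name and the masking scheme in the demo below are illustrative assumptions, not from the cited notes):

```python
import numpy as np

def em_bivariate_normal(X, n_iter=50):
    """EM estimates of (mu, Sigma) for a bivariate normal when some
    entries of column 1 are missing (assumed missing at random).

    X: (n, 2) array with np.nan marking missing entries in column 1.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X[:, 1])
    cc = X[~miss]                          # initialize from complete cases
    mu, Sigma = cc.mean(axis=0), np.cov(cc, rowvar=False)
    n = len(X)
    for _ in range(n_iter):
        # E-step: E[x1 | x0] under the current (mu, Sigma), plus the
        # conditional variance the fill-in alone would miss.
        beta = Sigma[0, 1] / Sigma[0, 0]
        cond_var = Sigma[1, 1] - beta * Sigma[0, 1]
        Xf = X.copy()
        Xf[miss, 1] = mu[1] + beta * (X[miss, 0] - mu[0])
        # M-step: update mu, Sigma from expected sufficient statistics.
        mu = Xf.mean(axis=0)
        S = (Xf - mu).T @ (Xf - mu)
        S[1, 1] += miss.sum() * cond_var   # restore the lost variance mass
        Sigma = S / n
    return mu, Sigma
```

The `S[1, 1]` correction is the step naive regression imputation skips: plugging in conditional means alone would understate the variance of the incomplete variable.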

  20. Thesis 1: Quantum Algorithm for Handling Missing Data, Thesis 2

    In this thesis I present such an application, applied to handling missing data. The motivation for creating a quantum algorithm for missing data has two parts: (1) the problem of missing data is large and extends across many disciplines, and if not handled correctly it can lead to flawed analysis of the data. It is an important problem to tackle.

  21. Principled missing data methods for researchers

    Missing data are the rule rather than the exception in quantitative research. Enders (2003) stated that a missing rate of 15% to 20% was common in educational and psychological studies. Peng et al. (2006) surveyed quantitative studies published from 1998 to 2004 in 11 education and psychology journals. They found that 36% of studies had no ...

  22. A review of the reporting and handling of missing data in cohort

    Retaining participants in cohort studies with multiple follow-up waves is difficult. Commonly, researchers are faced with the problem of missing data, which may introduce biased results as well as a loss of statistical power and precision. The STROBE guidelines (von Elm et al., Lancet 370:1453-1457, 2007; Vandenbroucke et al., PLoS Med 4:e297, 2007) and the guidelines proposed by Sterne et ...
