Missing Data and Imputation

Natalie Belford, Benjamin Brustad, Jasmine Hyler

2023-07-31

Introduction

  • Missing Data

    • Common occurrence among datasets
      • How to resolve it is not widely discussed
    • Not a problem on its own
      • Becomes a problem when the analysis shows bias or lacks power

Issues with Missing Data

  • Within experiments and analyses, missing data can lead to

    • Inaccurately distributed data and calculations
    • Skewed visuals
    • Inaccurate conclusions

Types of Missing Data

  • Missing data categorized into one of three types:
    • Missing Completely at Random (MCAR)
    • Missing at Random (MAR)
    • Missing Not at Random (MNAR)

Missing Completely at Random (MCAR)

  • Probability that data are missing is not related to either
    • The specific value that was supposed to be obtained
    • The set of observed values
  • Analyses of MCAR data remain unbiased

Missing Completely at Random (MCAR), continued

  • Examples:
    • Equipment failure
    • Samples lost in transit
    • Unsatisfactory samples

Missing at Random (MAR)

  • Probability that a response is missing
    • Depends on the set of observed responses
    • Is not related to the specific missing values themselves
  • Most realistic assumption for missing data
  • Missing data is knowable, missingness is predictable
    • Missingness is not completely random; it is random conditional on the observed values in the dataset
    • Estimates can be determined
    • Bias can be recovered

Missing at Random (MAR), continued

  • Example: Men are less likely to fill out a depression survey
    • Reason: societal expectations
      • Not a lack of depression symptoms

Missing Not at Random (MNAR)

  • Missing data not classified as either MCAR or MAR
    • Missingness depends on the unobserved values themselves
    • Bias is introduced and is hard to correct from the observed data
    • Power is affected
      • Larger Standard Error (SE) due to reduced sample size
    • Least desirable missing data scenario
  • When MNAR occurs, the subjects' own unrecorded values drive whether data are provided
  • Example: Subjects withhold accurate information out of fear or shame, or forgo providing data altogether
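
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, sample size, and missingness probabilities are assumptions, not values from the Medicare data.

```r
# Simulate one variable (y) under the three missingness mechanisms.
# All names and parameter values here are illustrative assumptions.
set.seed(42)
n <- 10000
x <- rnorm(n)            # fully observed covariate
y <- 2 + x + rnorm(n)    # variable that will receive missing values

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3

# MAR: the chance of being missing depends only on the observed covariate x
miss_mar <- runif(n) < plogis(-1 + 1.5 * x)

# MNAR: the chance of being missing depends on the unobserved value y itself
miss_mnar <- runif(n) < plogis(-1 + 1.5 * (y - 2))

# Complete-case mean of y under each mechanism, next to the full-data mean
means <- c(
  full = mean(y),
  mcar = mean(y[!miss_mcar]),
  mar  = mean(y[!miss_mar]),
  mnar = mean(y[!miss_mnar])
)
print(round(means, 2))
```

In this setup the MCAR complete-case mean should stay close to the full-data mean, while the MAR and MNAR means drift away from it. The MAR drift can in principle be corrected by conditioning on x; the MNAR drift cannot be corrected from the observed data alone.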

Types of Missing Data Solutions

  • Multiple Imputation (MI)
  • Multivariate Imputation by Chained Equations (MICE)
  • Single Center Imputation from Multiple Chained Equation (SICE)

Multiple Imputation (MI)

  • Most common method
  • Results similar to using complete datasets
    • Resolves issue of too small or too large standard errors
    • Large standard error (SE) - results acquired lack precision
    • Small standard error (SE) - results acquired with overestimation of precision

Multivariate Imputation by Chained Equations (MICE)

  • Assumes data is Missing at Random (MAR)

  • Purpose: Consider uncertainty of missing data

  • Uses multiple imputation methods to provide better results

Single Center Imputation from Multiple Chained Equation (SICE)

  • Alternative to MICE
  • Improves upon MICE by creating hybrid of single and multiple imputation techniques
  • Uses the respective SICE variant (categorical or numeric)
    • Missing data values corrected using thorough approach to find more accurate value to use
    • Uses predicted values imputed from MI approach

Single Center Imputation from Multiple Chained Equation (SICE), continued

  • Computes mean or mode for imputed values (depending on data type)
  • Replaces original imputed value with respective central measure
  • Has lowest computation time of all missing data solutions
  • Purpose: Replace predicted imputed values computed using MI with central measure computed using SICE (Khan and Hoque 2020)
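
The central-measure step can be sketched in base R. This is a minimal illustration, not the implementation from Khan and Hoque (2020): the imputed values are made up, and `mode_value` is a hypothetical helper name.

```r
# Minimal SICE-style sketch in base R (illustrative, not the authors' code).
# Rows = cells that were missing, columns = values from m = 5 imputations.
imputed_numeric <- rbind(
  c(10.2, 11.0,  9.8, 10.5, 10.0),
  c(20.1, 19.5, 21.0, 20.4, 19.0)
)
imputed_categorical <- rbind(
  c("a", "b", "a", "a", "c"),
  c("c", "c", "b", "c", "a")
)

# Hypothetical helper: most frequent value (ties go to the first level)
mode_value <- function(v) names(which.max(table(v)))

# Numeric variant: replace the m imputed values with their mean
sice_numeric <- rowMeans(imputed_numeric)

# Categorical variant: replace the m imputed values with their mode
sice_categorical <- apply(imputed_categorical, 1, mode_value)

print(sice_numeric)
print(sice_categorical)
```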

Methods

  • Dataset: 21 specific chronic illnesses of Medicare beneficiaries, from U.S. Department of Health & Human Services (Medicare & Medicaid Services 2018).

  • Goal: Fill in missing data (2244 missing entries, about 5%)

  • Approach: Imputation using four methods: Predictive Mean Matching, Classification and Regression Trees, Lasso Regression, and Random Forest

    • These are all forms of Multiple Imputation: perform single imputation several times to create multiple data sets, analyze each set and compute its error, then pool the results into a single, final set of estimates (Little et al. 2014).

Figure 1: Multiple Imputation Process using 5 sets


Rubin’s rules for combining

1) Select independent variables that may help impute variables with missing data

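2) Using the chosen statistical method, estimate the association of interest in each of the imputed datasets

3) Combine the association measures from each imputed dataset using Rubin’s rules. To combine, we use the following equations, where \(m\) is the number of imputed datasets, \(\hat{\theta_t}\) and \(SE_t\) are the estimate and its standard error in dataset \(t\), and \(\bar{\theta}\) is the mean of the \(m\) estimates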

\[ W=\frac{\sum({SE_t}^2)}{m} \tag{1} \]

\[ B=\frac{\sum(\hat{\theta_t}-\bar{\theta})^2}{m-1} \tag{2} \]

\[ SE=\sqrt{W+B+\frac{B}{m}} \tag{3} \]
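
Equations 1 to 3 translate directly into a few lines of base R. The estimates and standard errors below are made-up numbers for illustration, not results from this analysis.

```r
# Pool m estimates with Rubin's rules (Equations 1-3). Values are made up.
theta_hat <- c(1.10, 1.25, 0.95, 1.05, 1.15)  # estimate from each imputed set
se        <- c(0.20, 0.22, 0.19, 0.21, 0.20)  # its standard error in each set
m <- length(theta_hat)

theta_bar <- mean(theta_hat)                    # pooled estimate
W <- sum(se^2) / m                              # within-imputation variance (Eq. 1)
B <- sum((theta_hat - theta_bar)^2) / (m - 1)   # between-imputation variance (Eq. 2)
SE_pooled <- sqrt(W + B + B / m)                # pooled standard error (Eq. 3)

print(c(estimate = theta_bar, W = W, B = B, SE = SE_pooled))
```

The pooled standard error is always at least as large as the square root of W, because the between-imputation term adds the uncertainty that comes from not knowing the missing values.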

Predictive Mean Matching (PMM)

  • Hot deck method

  • Easy to use

  • Handles any data type

  • Realistic (imputes only values that were actually observed)

  • Hard to use for small data sets or ones with a large fraction of missing information (FMI)

  • Chooses donor values to minimize this distance

\[ \delta_{hj}=\left|\alpha^{mis}z_j-\alpha^{obs}z_h\right| \tag{4} \]
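
The donor-matching idea can be sketched in base R. This is a simplified illustration under stated assumptions: the data are simulated, the donor pool size `k = 5` is an arbitrary choice, and a single fitted regression stands in for both coefficient sets, whereas full PMM draws the coefficients for the missing cases from their posterior.

```r
# Simplified predictive mean matching on simulated data (illustrative only)
set.seed(1)
n <- 200
z <- rnorm(n)                        # fully observed predictor
y <- 1 + 2 * z + rnorm(n)            # outcome
y[sample(n, 40)] <- NA               # make 40 outcome values missing

obs <- !is.na(y)
fit <- lm(y ~ z)                     # rows with missing y are dropped by lm
pred <- predict(fit, newdata = data.frame(z = z))  # predicted mean for every row

k <- 5                               # number of candidate donors (assumed)
y_imp <- y
for (j in which(!obs)) {
  dist <- abs(pred[obs] - pred[j])   # distance between predicted means
  donors <- order(dist)[seq_len(k)]  # the k closest observed cases
  y_imp[j] <- sample(y[obs][donors], 1)  # borrow one donor's observed value
}

summary(y_imp)
```

Because each imputed value is copied from a donor, the completed variable never contains values outside the observed range.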

Classification and Regression Trees (CART)

  • Machine Learning

  • Robust

  • Flexible

  • Straightforward (in R)

    • Automatically handles variable selection, missing values, outliers, variable interaction, and nonlinear relationships
  • In practice works similarly to PMM, with a tree instead of a regression

Random Forest (Miss Forest)

  • Ensemble of CART models
  • Combines many CART models automatically to yield a single imputed dataset

  • Does not account for uncertainty and increases P values

  • Usually similar to average of all predictions from the CART models

  • Better predictions and accuracy than a single CART model

Lasso Regression

  • Shrinks regression coefficients (good for high multicollinearity)

  • Best for high dimension datasets

  • Preserves relationships between variables best

  • Removes some predictor variables (easier to use)

  • May introduce bias and implausible results

  • Equation (Musoro et al. 2014)

\[ (\hat{\beta_0}, \hat{\beta}^{lasso}) = \operatorname{argmin}\left[\sum_i \left(Y_i-(\beta_0+\beta X^T_i)\right)^2+\lambda \sum_j |\beta_j|\right] \tag{5} \]
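
To show how the penalty in Equation 5 removes predictors, the sketch below minimizes that objective by coordinate descent in base R. It is an illustration under assumptions, not the routine used by the mice package: `soft_threshold` and `lasso_cd` are hypothetical names, and the data are simulated.

```r
# Coordinate descent for the lasso objective in Equation 5 (illustrative)
soft_threshold <- function(a, t) sign(a) * pmax(abs(a) - t, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  beta0 <- mean(y)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual that leaves out predictor j
      r_j <- y - beta0 - as.vector(X[, -j, drop = FALSE] %*% beta[-j])
      # Closed-form minimizer for beta_j: soft-thresholded least squares
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
    beta0 <- mean(y - as.vector(X %*% beta))  # intercept is not penalized
  }
  list(beta0 = beta0, beta = beta)
}

set.seed(7)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- 1 + 3 * X[, 1] + rnorm(n)               # only the first predictor matters

fit_small <- lasso_cd(X, y, lambda = 0)      # no penalty: ordinary least squares
fit_large <- lasso_cd(X, y, lambda = 100)    # strong penalty

print(round(fit_small$beta, 3))
print(round(fit_large$beta, 3))
```

With the large penalty the coefficients of the two irrelevant predictors are set exactly to zero and the remaining coefficient is shrunk toward zero, which is the source of both the variable selection and the possible bias noted above.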

Analysis and Results

  • Prior to implementing the MICE method

    • Used a Missing Map to visualize the missing entries

    • Verified that three variables have missing entries: Provisional Income (408), Total Medicare Standardized Payment (918), and Total Medicare Payment (908)

    • Total of 2244 missing data points

Code

# loading packages

library(ggplot2)

library(dplyr)

library(mice)

library(missForest)

library(VIM)

library(ggmice)

library(xlsx)

library(readxl)

library(knitr)

Figure 2: Missing Map

## Read the data file (obtained from GitHub)

Chronic_Conditions <- read_excel("Chronic_Conditions.xlsx")

# Keep only the three variables that contain missing entries

Chronic_Conditions <- Chronic_Conditions %>%
  select(PrvInc, Stdzd_Pymt_PC, Pymt_PC)

## Plot the pattern of missing entries per variable (complete columns were omitted)

plot_pattern(Chronic_Conditions)


Figure 3: Histogram of Missing Data

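# Histogram of the same missingness data
# (labels must come from the data frame itself; names(data) would refer to the base function `data`)

aggr_plot <- aggr(Chronic_Conditions, col = c('navyblue', 'red'),
                  numbers = TRUE, sortVars = FALSE,
                  labels = names(Chronic_Conditions), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))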


MICE Package

  • Following the missing map, we ran the MICE package in R.

# Impute missing data with PMM, CART, and lasso, keeping the completed PrvInc column from each
mice_imputed <- data.frame(
  original = Chronic_Conditions$PrvInc,
  imputed_pmm = complete(mice(Chronic_Conditions, method = "pmm", printFlag = FALSE))$PrvInc,
  imputed_cart = complete(mice(Chronic_Conditions, method = "cart", printFlag = FALSE))$PrvInc,
  imputed_lasso = complete(mice(Chronic_Conditions, method = "lasso.norm", printFlag = FALSE))$PrvInc)
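
For comparison, a pooled analysis with the mice package usually follows an impute, analyze, pool sequence. The sketch below uses the small `nhanes` example dataset that ships with mice rather than the Medicare data; the model `bmi ~ age + chl` and the seed are arbitrary choices for illustration.

```r
library(mice)

# nhanes is a small example dataset shipped with mice (25 rows, 4 variables)
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same model in each of the 5 completed datasets
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the 5 sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

Calling `complete()` on the imputation object returns one completed dataset (the first by default), while `pool()` carries the between-imputation variability into the final standard errors.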



# Preparation for miss Forest: prodNA() introduces an additional 10% of missing values at random
Chronic_Conditions.mis <- prodNA(Chronic_Conditions, noNA = 0.1)

Figure 4: Missing data percentages per variable

# Plots missing data percentages per variable

aggr(Chronic_Conditions.mis, col=c('navyblue', 'yellow' ),

numbers=TRUE, sortVars=FALSE,

labels=names(Chronic_Conditions.mis), cex.axis=.7,

gap=3, ylab=c("Missing Data", "Pattern"))
     PrvInc Stdzd_Pymt_PC Pymt_PC     
2707      1             1       1    0
320       1             1       0    1
296       1             0       1    1
762       1             0       0    2
513       0             1       1    1
64        0             1       0    2
52        0             0       1    2
200       0             0       0    3
        829          1310    1346 3485


Miss Forest

#Missingness data for miss Forest

plot_pattern(Chronic_Conditions.mis)

Figure 5: Table of Missing Values - Miss Forest

# Summarizes the data passed to miss Forest (the NA counts show it has not been imputed yet)

summary(Chronic_Conditions.mis)
     PrvInc       Stdzd_Pymt_PC      Pymt_PC      
 Min.   :0.0000   Min.   :    0   Min.   :     0  
 1st Qu.:0.0394   1st Qu.:18725   1st Qu.: 18500  
 Median :0.1114   Median :23292   Median : 23141  
 Mean   :0.1733   Mean   :24099   Mean   : 24142  
 3rd Qu.:0.2767   3rd Qu.:28251   3rd Qu.: 28356  
 Max.   :0.7588   Max.   :89235   Max.   :100028  
 NA's   :829      NA's   :1310    NA's   :1346    
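
A call to the imputation step itself might look like the sketch below. It runs on a small simulated data frame rather than `Chronic_Conditions.mis`, and the column names are hypothetical. As an assumption worth checking, missForest is written for plain data frames, so a tibble returned by `read_excel` may need `as.data.frame()` first.

```r
library(missForest)

set.seed(2023)
# Hypothetical stand-in data: three related numeric columns
n <- 150
x1 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = 2 * x1 + rnorm(n), x3 = x1^2 + rnorm(n))

toy_mis <- prodNA(toy, noNA = 0.1)   # remove 10% of the cells at random
forest_imp <- missForest(toy_mis)    # iterative random forest imputation

toy_completed <- forest_imp$ximp     # the single completed data frame
forest_imp$OOBerror                  # out-of-bag estimate of the imputation error
summary(toy_completed)
```

The result is one completed dataset, so any analysis run on it treats the imputed values as if they had been observed.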

Figure 6: Density Plot

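# Density plot to compare imputed values to the original data
# (imputed_Data is assumed to be a mids object returned by mice(); PMM is shown as one possible choice)

imputed_Data <- mice(Chronic_Conditions, method = "pmm", printFlag = FALSE)

densityplot(imputed_Data)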

Conclusion

  • Missing data is common

  • Multiple Imputation (MI) is the act of completing data sets where there are missing data points

  • Less biased and more accurate outcome

  • Multivariate Imputation by Chained Equations (MICE) is an effective way of imputing data

Conclusion, continued

  • Benefits

    • Reduced uncertainty

    • Reduced bias

    • Multiple methods of imputation in MICE package (PMM, lasso, and CART)

References

Khan, S. I., and A. S. M. L. Hoque. 2020. “SICE: An Improved Missing Data Imputation Technique.” Journal of Big Data 7: 37. https://doi.org/10.1186/s40537-020-00313-w.
Little, T., T. Jorgensen, K. Lang, and E. W. G. Moore. 2014. “On the Joys of Missing Data.” Journal of Pediatric Psychology 39 (2): 151–62. https://doi.org/10.1093/jpepsy/jst048.
Medicare & Medicaid Services, Centers for. 2018. “Specific Chronic Conditions.” https://catalog.data.gov/dataset/specific-chronic-conditions-0f448.
Musoro, Jammbe Z, Aeilko H Zwinderman, Milo A Puhan, Gerben ter Riet, and Ronald B Geskus. 2014. “Validation of Prediction Models Based on Lasso Regression with Multiply Imputed Data.” BMC Medical Research Methodology 14: 116. https://doi.org/10.1186/1471-2288-14-116.