Missing Data and Imputation

Natalie Belford, Benjamin Brustad, Jasmine Hyler

2023-07-31

Introduction

Missing Data
- Common occurrence in real-world datasets
  - How to resolve it is not widely discussed
- Not a problem in itself
  - Becomes a problem when it biases the analysis or reduces statistical power

Issues with Missing Data

Within experiments and analyses, missing data can lead to
- Inaccurately distributed data and calculations
- Skewed visuals
- Inaccurate conclusions

Types of Missing Data

Missing data is categorized into one of three types:
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)

Missing Completely at Random (MCAR)

Probability that a value is missing is not related to either
- The specific value that was supposed to be obtained
- The set of observed values
Analyses of MCAR data are unbiased

Missing Completely at Random (MCAR), continued

Examples:
- Equipment failure
- Samples lost in transit
- Unsatisfactory samples

Missing at Random (MAR)

Probability that a response is missing is
- Dependent on the set of observed responses
- Not related to the specific value that is missing
Most realistic assumption for missing data
Missing data is knowable, missingness is predictable
- Missingness is not completely random; it is random conditional on the observed values in the dataset
- Estimates can be determined
- Bias can be recovered

Missing at Random (MAR), continued

Example: Men are less likely to fill out a depression survey
- Reason: societal expectations tied to gender (an observed variable)
  - Not because of a lack of depression symptoms

Missing Not at Random (MNAR)

Missing data not classified as either MCAR or MAR
- Missingness depends on the unobserved values themselves
- Bias is introduced
- Power is affected
  - Larger Standard Error (SE) due to reduced sample size
- Least desirable missing data scenario
When MNAR occurs, the research subjects themselves influence which values go missing
Example: Subjects do not disclose accurate information out of fear or shame, or forgo providing data altogether
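
The three mechanisms can be illustrated with a small simulation. The sketch below is a toy example and not part of the Medicare analysis: the variables `x` and `y`, the 30% missingness rate, and the logistic missingness models are all assumptions made for illustration. It compares the mean of the observed values under each mechanism with the mean of the full data.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)              # fully observed covariate (hypothetical)
y <- 2 + x + rnorm(n)      # variable that will receive missing values

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3
# MAR: the chance of being missing depends only on the observed covariate x
miss_mar <- runif(n) < plogis(x)
# MNAR: the chance of being missing depends on the unobserved value y itself
miss_mnar <- runif(n) < plogis(y - 2)

# Mean of y using only the observed cases under each mechanism
mean_observed <- c(
  full = mean(y),
  mcar = mean(y[!miss_mcar]),
  mar  = mean(y[!miss_mar]),
  mnar = mean(y[!miss_mnar])
)
print(round(mean_observed, 2))

# Under MAR, a model that uses the observed covariate can recover the mean:
# fit on the observed cases, then predict for every case
fit_mar <- lm(y ~ x, subset = !miss_mar)
mar_adjusted <- mean(predict(fit_mar, newdata = data.frame(x = x)))
print(round(mar_adjusted, 2))
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means drift away from it. Only the MAR case can be corrected from the observed data, because `x` carries the information about why values are missing.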

Types of Missing Data Solutions

Multiple Imputation (MI)
Multivariate Imputation by Chained Equations (MICE)
Single Center Imputation from Multiple Chained Equation (SICE)

Multiple Imputation (MI)

Most common method
Results similar to those obtained from complete datasets
- Resolves the issue of standard errors that are too small or too large
- Large standard error (SE): results lack precision
- Small standard error (SE): results overestimate precision

Multivariate Imputation by Chained Equations (MICE)

Assumes data is Missing at Random (MAR)
Purpose: Consider uncertainty of missing data
Uses multiple imputation methods to provide better results

Single Center Imputation from Multiple Chained Equation (SICE)

Alternative to MICE
Improves upon MICE by creating hybrid of single and multiple imputation techniques
Uses the respective SICE variant (categorical or numeric)
- Missing data values corrected using thorough approach to find more accurate value to use
- Uses predicted values imputed from MI approach

Single Center Imputation from Multiple Chained Equation (SICE), continued

Computes mean or mode for imputed values (depending on data type)
Replaces original imputed value with respective central measure
Has the lowest computation time of all the missing data solutions
- Purpose: Replace predicted imputed values computed using MI with central measure computed using SICE (Khan and Hoque 2020)
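
The central-measure step can be sketched in a few lines of base R. This is a minimal illustration of the idea described above, not the implementation from Khan and Hoque (2020): the function names `sice_numeric` and `sice_categorical` and the imputed values are hypothetical. Each row holds the values that m = 5 imputations produced for one missing cell.

```r
# Numeric data: replace the m imputed values for each missing cell with their mean
sice_numeric <- function(imputed) {
  rowMeans(imputed)
}

# Categorical data: replace the m imputed values with their mode
# (ties go to the category that sorts first)
sice_categorical <- function(imputed) {
  apply(imputed, 1, function(values) names(which.max(table(values))))
}

# Hypothetical imputed values: one row per missing cell, one column per imputation
numeric_draws <- matrix(c(10, 12, 11, 13, 14,
                           3,  4,  3,  5,  5),
                        nrow = 2, byrow = TRUE)
categorical_draws <- matrix(c("yes", "no", "yes", "yes", "no",
                              "no",  "no", "yes", "no",  "no"),
                            nrow = 2, byrow = TRUE)

numeric_result <- sice_numeric(numeric_draws)
categorical_result <- sice_categorical(categorical_draws)
print(numeric_result)      # 12 4
print(categorical_result)  # "yes" "no"
```

With the `mice` package, the per-cell imputed values are stored in the `imp` element of the object returned by `mice()`, with one column per imputation, so a matrix like the ones above could be taken from there.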

Methods

Dataset: 21 specific chronic illnesses of Medicare beneficiaries, from U.S. Department of Health & Human Services (Medicare & Medicaid Services 2018).
Goal: Fill in missing data (2244 missing entries, about 5%)
Approach: Imputation using four approaches: Predictive Mean Matching, Classification and Regression Trees, Lasso Regression, and Random Forest
- These are all forms of Multiple Imputation: perform single imputation several times to create multiple data sets, analyze and compute the error of each set, then combine the data into a single, final data set (Little et al. 2014).

Figure 1: Multiple Imputation Process using 5 sets

Rubin’s rules for combining

1) Select independent variables that may help impute variables with missing data

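2) Using the chosen statistical method, estimate the association of interest in each of the imputed datasets

3) Combine the association measures from the imputed datasets using Rubin’s rules. To combine, we use the following equations, where \(m\) is the number of imputed datasets, \(\hat{\theta}_t\) and \(SE_t\) are the estimate and standard error from dataset \(t\), and \(\bar{\theta}\) is the mean of the \(m\) estimates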

\[ W=\frac{\sum_{t=1}^{m} SE_t^2}{m} \tag{1} \]

\[ B=\frac{\sum_{t=1}^{m}(\hat{\theta}_t-\bar{\theta})^2}{m-1} \tag{2} \]

\[ SE=\sqrt{W+B+\frac{B}{m}} \tag{3} \]
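
Equations (1) to (3) translate directly into a short function. The sketch below is a hedged illustration: the function name `pool_rubin` and the five estimates and standard errors are made up, standing in for the results of fitting the same model to m = 5 imputed datasets.

```r
# Pool m estimates and their standard errors with Rubin's rules
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  theta_bar <- mean(estimates)                     # pooled point estimate
  W <- sum(std_errors^2) / m                       # within-imputation variance, eq. (1)
  B <- sum((estimates - theta_bar)^2) / (m - 1)    # between-imputation variance, eq. (2)
  SE <- sqrt(W + B + B / m)                        # pooled standard error, eq. (3)
  list(estimate = theta_bar, within = W, between = B, se = SE)
}

# Hypothetical results from five imputed datasets
estimates  <- c(1.0, 1.2, 0.9, 1.1, 0.8)
std_errors <- c(0.20, 0.22, 0.19, 0.21, 0.18)

pooled <- pool_rubin(estimates, std_errors)
print(pooled)
```

For these numbers, W = 0.0402 and B = 0.025, so the pooled standard error is the square root of 0.0702, roughly 0.265. That is larger than any single standard error because the disagreement between imputations is added in. In the `mice` package, `pool()` applies the same rules to a set of fitted models.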

Predictive Mean Matching (PMM)

Hot deck method
Easy to use
Handles any data type
Realistic: imputed values are drawn from observed values
Hard to use for small data sets or ones with a large fraction of missing information (FMI)
Chooses donor values that minimize this distance

\[ \delta_{hj}=\left|\alpha^{mis}z_j-\alpha^{obs}z_h\right| \tag{4} \]
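
The donor-matching idea can be sketched in base R. This is a simplified illustration, not the algorithm used inside `mice`: it uses a single set of regression coefficients for both the observed and the missing cases, whereas PMM in practice draws the coefficients for the missing cases from their posterior distribution. The function name `pmm_impute`, the simulated data, and the choice of 5 donors are assumptions.

```r
set.seed(42)
n <- 200
z <- rnorm(n)                  # fully observed predictor (hypothetical)
y <- 1 + 2 * z + rnorm(n)      # variable to impute
y[sample(n, 40)] <- NA         # remove 40 values

# Predicted values for every case from a regression fitted on the observed cases
observed <- !is.na(y)
fit <- lm(y ~ z, subset = observed)
pred <- predict(fit, newdata = data.frame(z = z))

pmm_impute <- function(y, pred, donors = 5) {
  obs <- which(!is.na(y))
  mis <- which(is.na(y))
  for (j in mis) {
    # Distance between the prediction for missing case j and each observed case h
    delta <- abs(pred[j] - pred[obs])
    # Keep the closest observed cases as the donor pool
    nearest <- obs[order(delta)[seq_len(donors)]]
    # Impute the observed value of one randomly chosen donor
    y[j] <- y[nearest[sample.int(length(nearest), 1)]]
  }
  y
}

y_imputed <- pmm_impute(y, pred)
print(summary(y_imputed))
```

Because every imputed value is copied from a donor, the imputations always fall within the range of values that were actually observed.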

Classification and Regression Trees (CART)

Machine Learning
Robust
Flexible
Straightforward (in R)
- Automatically handles variable selection, missing values, outliers, variable interactions, and nonlinear relationships
In practice, works similarly to PMM, with a tree in place of the regression model

Random Forest (missForest)

Extension of CART: an ensemble of many trees

Builds many CART models and combines their predictions automatically to yield a single dataset
Does not account for imputation uncertainty, which affects p-values
Usually similar to the average of all predictions from the CART models
Better predictions and accuracy than a single CART model

Lasso Regression

Shrinks regression coefficients (good for high multicollinearity)
Best for high-dimensional datasets
Preserves relationships between variables best
Removes some predictor variables (easier to use)
May add bias and produce nonsensical results
Equation (Musoro et al. 2014)

\[ (\hat{\beta_0}, \hat{\beta}^{lasso}) = \operatorname{argmin}\left[\sum_i(Y_i-(\beta_0+\beta X^T_i))^2+\lambda \sum_j |\beta_j|\right] \tag{5} \]
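
With a single predictor, the minimizer of equation (5) has a closed form, which makes the shrinkage easy to see. The sketch below is a toy illustration under that one-predictor assumption, not the `lasso.norm` method in `mice`: the function names and the simulated data are hypothetical. Setting the derivative of the objective to zero gives a slope equal to the soft-thresholded cross-product of the centered variables, with threshold \(\lambda/2\), divided by the sum of squares of the centered predictor.

```r
# Soft-thresholding: shrink a toward zero by t, and set it to zero if |a| <= t
soft_threshold <- function(a, t) {
  sign(a) * pmax(abs(a) - t, 0)
}

# Lasso solution of equation (5) for one predictor
lasso_one_predictor <- function(x, y, lambda) {
  xc <- x - mean(x)    # centering removes the intercept from the problem
  yc <- y - mean(y)
  beta <- soft_threshold(sum(xc * yc), lambda / 2) / sum(xc^2)
  beta0 <- mean(y) - beta * mean(x)
  c(beta0 = beta0, beta = beta)
}

set.seed(7)
x <- rnorm(50)
y <- 3 + 0.5 * x + rnorm(50)

fit_ols <- lasso_one_predictor(x, y, lambda = 0)     # no penalty: ordinary least squares
fit_mid <- lasso_one_predictor(x, y, lambda = 20)    # moderate penalty: slope shrinks
fit_big <- lasso_one_predictor(x, y, lambda = 1000)  # large penalty: slope is exactly zero
print(rbind(fit_ols, fit_mid, fit_big))
```

The last row shows how the lasso removes a predictor: once the penalty is large enough, the coefficient is set to exactly zero rather than merely made small.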

Analysis and Results

Prior to implementing MICE method
- Utilized a Missing Map to create a visualization
- Verified three variables missing entries: Provisional Income(408), Total Medicare Standardized Payment(918), and Total Medicare Payment(908)
- Total of 2244 missing data points
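
The per-variable counts can also be checked numerically alongside the missing map. The sketch below uses a small made-up data frame that borrows the three column names from the analysis. The values are hypothetical and are not taken from the Medicare data.

```r
# Toy data frame with the same column names as the analysis (values are made up)
toy <- data.frame(
  PrvInc        = c(0.10, NA, 0.25, 0.05, NA),
  Stdzd_Pymt_PC = c(18000, 21000, NA, 25000, 23000),
  Pymt_PC       = c(18500, NA, NA, 26000, 22500)
)

# Number of missing entries in each variable
missing_per_variable <- colSums(is.na(toy))
# Total number of missing entries and their share of all cells
total_missing <- sum(missing_per_variable)
percent_missing <- 100 * total_missing / (nrow(toy) * ncol(toy))

print(missing_per_variable)
print(total_missing)
print(round(percent_missing, 1))
```

Applied to the real data, the same three lines would give the per-variable counts and the overall total reported above.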

Code

# Load packages
library(ggplot2)
library(dplyr)
library(mice)
library(missForest)
library(VIM)
library(ggmice)
library(xlsx)
library(readxl)
library(knitr)

Figure 2: Missing Map

# Read the data file (downloaded from GitHub)
Chronic_Conditions <- read_excel("Chronic_Conditions.xlsx")

# Keep only the three variables that contain missing entries
Chronic_Conditions <- Chronic_Conditions %>%
  select(PrvInc, Stdzd_Pymt_PC, Pymt_PC)

# Plot the pattern of missing entries per variable (complete columns were omitted)
plot_pattern(Chronic_Conditions)


Figure 3: Histogram of Missing Data

# Histogram of the same missingness data
# (labels must come from the data frame; names(data) would return NULL)
aggr_plot <- aggr(Chronic_Conditions, col = c('navyblue', 'red'),
                  numbers = TRUE, sortVars = FALSE,
                  labels = names(Chronic_Conditions), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))


MICE Package

Following the missing map, we ran the MICE package in R.

# Imputes missing data using the three selected methods
# (each mice() call runs its own imputation; complete() returns the first completed dataset)
mice_imputed <- data.frame(
  original = Chronic_Conditions$PrvInc,
  imputed_pmm = complete(mice(Chronic_Conditions, method = "pmm", printFlag = FALSE))$PrvInc,
  imputed_cart = complete(mice(Chronic_Conditions, method = "cart", printFlag = FALSE))$PrvInc,
  imputed_lasso = complete(mice(Chronic_Conditions, method = "lasso.norm", printFlag = FALSE))$PrvInc)



# Preparation for missForest: prodNA() does not impute,
# it adds a further 10% of missing values at random
Chronic_Conditions.mis <- prodNA(Chronic_Conditions, noNA = 0.1)

Figure 4: Missing data percentages per variable

# Plots missing data percentages per variable
aggr(Chronic_Conditions.mis, col = c('navyblue', 'yellow'),
     numbers = TRUE, sortVars = FALSE,
     labels = names(Chronic_Conditions.mis), cex.axis = .7,
     gap = 3, ylab = c("Missing Data", "Pattern"))

     PrvInc Stdzd_Pymt_PC Pymt_PC     
2707      1             1       1    0
320       1             1       0    1
296       1             0       1    1
762       1             0       0    2
513       0             1       1    1
64        0             1       0    2
52        0             0       1    2
200       0             0       0    3
        829          1310    1346 3485


Miss Forest

#Missingness data for miss Forest

plot_pattern(Chronic_Conditions.mis)

Figure 5: Table of Missing Values - Miss Forest

# Summarizes the data prepared for missForest
# (the NA counts in the output show these values have not been imputed yet)

summary(Chronic_Conditions.mis)

     PrvInc       Stdzd_Pymt_PC      Pymt_PC      
 Min.   :0.0000   Min.   :    0   Min.   :     0  
 1st Qu.:0.0394   1st Qu.:18725   1st Qu.: 18500  
 Median :0.1114   Median :23292   Median : 23141  
 Mean   :0.1733   Mean   :24099   Mean   : 24142  
 3rd Qu.:0.2767   3rd Qu.:28251   3rd Qu.: 28356  
 Max.   :0.7588   Max.   :89235   Max.   :100028  
 NA's   :829      NA's   :1310    NA's   :1346

Figure 6: Density Plot

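# Density plot comparing imputed values to the original data
# imputed_Data is not defined in the code shown; assumed here to be a mice() result using PMM
imputed_Data <- mice(Chronic_Conditions, m = 5, method = "pmm", printFlag = FALSE)
densityplot(imputed_Data)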

Conclusion

Missing data is common
Multiple Imputation (MI) is the act of completing data sets where there are missing data points
Yields a less biased and more accurate outcome
Multivariate Imputation by Chained Equations (MICE) is an effective way of imputing data

Conclusion, continued

Benefits
- Reduced uncertainty
- Reduced bias
- Multiple methods of imputation in MICE package (PMM, lasso, and CART)

References

Khan, S. I., and A. S. M. L. Hoque. 2020. “SICE: An Improved Missing Data Imputation Technique.” Journal of Big Data 7: 37. https://doi.org/10.1186/s40537-020-00313-w.

Little, T., T. Jorgensen, K. Lang, E. Moore, and E. Whitney. 2014. “On the Joys of Missing Data.” Journal of Pediatric Psychology 39 (2): 151–62. https://doi.org/10.1093/jpepsy/jst048.

Medicare & Medicaid Services, Centers for. 2018. “Specific Chronic Conditions.” https://catalog.data.gov/dataset/specific-chronic-conditions-0f448.

Musoro, Jammbe Z, Aeilko H Zwinderman, Milo A Puhan, Gerben ter Riet, and Ronald B Geskus. 2014. “Validation of Prediction Models Based on Lasso Regression with Multiply Imputed Data.” BMC Medical Research Methodology 14: 116. https://doi.org/10.1186/1471-2288-14-116.