Natalie Belford, Benjamin Brustad, Jasmine Hyler
2023-07-31
Missing Data
Within experiments and analyses, missing data can lead to
Assumes data is Missing at Random (MAR)
Purpose: Consider uncertainty of missing data
Uses multiple imputation methods to provide better results
Dataset: 21 specific chronic illnesses of Medicare beneficiaries, from U.S. Department of Health & Human Services (Medicare & Medicaid Services 2018).
Goal: Fill in missing data (2244 missing entries, about 5%)
Approach: Imputation with four methods: Predictive Mean Matching (PMM), Classification and Regression Trees (CART), Lasso Regression, and Random Forest
1) Select independent variables that may help impute variables with missing data
2) Using the chosen statistical method, estimate the association of interest in each of the imputed datasets
3) Combine the association measures from each imputed dataset using Rubin’s rules, with the following equations:
\[ W=\frac{\sum_{t=1}^{m} SE_t^2}{m} \tag{1} \]
\[ B=\frac{\sum_{t=1}^{m}(\hat{\theta}_t-\bar{\theta})^2}{m-1} \tag{2} \]
\[ SE=\sqrt{W+B+\frac{B}{m}} \tag{3} \]
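As a minimal sketch of step 3, the pooling in equations (1) to (3) can be written in base R. The function name `pool_rubin` and the input numbers are hypothetical and chosen only for illustration; in practice the `mice` package handles this pooling for fitted models.

```r
# Pool m estimates with Rubin's rules (equations 1-3).
# `estimates` and `std_errors` are hypothetical inputs: one value per imputed dataset.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  theta_bar <- mean(estimates)                    # pooled point estimate
  W <- sum(std_errors^2) / m                      # within-imputation variance, eq. (1)
  B <- sum((estimates - theta_bar)^2) / (m - 1)   # between-imputation variance, eq. (2)
  SE <- sqrt(W + B + B / m)                       # pooled standard error, eq. (3)
  list(estimate = theta_bar, within = W, between = B, se = SE)
}

# Made-up numbers for illustration only (m = 5 imputed datasets)
pooled <- pool_rubin(
  estimates  = c(1.0, 1.2, 0.8, 1.1, 0.9),
  std_errors = c(0.2, 0.2, 0.2, 0.2, 0.2)
)
print(pooled)
```

The pooled standard error is larger than any single within-dataset standard error because the between-imputation term B carries the extra uncertainty due to the missing data.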
Hot deck method
Easy to use
Handles any data type
Realistic
Hard to use for small data sets or ones with a large fraction of missing information (FMI)
Chooses donor values that minimize the following distance
\[ \delta_{hj}=\alpha^{mis}z_j-\alpha^{obs}z_h \tag{4} \]
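The donor matching behind equation (4) can be sketched in base R. This is a simplified illustration, not the `mice` implementation: it assumes a single predictor `z`, uses the same fitted coefficients for observed and missing cases, takes the absolute value of the difference as the distance, and always picks the single closest donor. All data values are made up.

```r
# Toy data: y has missing values, z is fully observed (all values made up)
z <- c(1, 2, 3.4, 4, 5.2, 6)
y <- c(2.1, 3.9, NA, 8.2, NA, 12.1)

obs <- !is.na(y)
fit <- lm(y ~ z, subset = obs)                      # regression fitted on observed cases
pred <- predict(fit, newdata = data.frame(z = z))   # predicted value for every case

y_imputed <- y
for (j in which(!obs)) {
  delta <- abs(pred[j] - pred[obs])        # distance between missing case j and each donor h
  donor <- which(obs)[which.min(delta)]    # closest donor (k = 1 for simplicity)
  y_imputed[j] <- y[donor]                 # borrow the donor's observed value
}
print(y_imputed)
```

Because every imputed value is copied from an observed case, the imputations stay within the range of the real data, which is why the method is described as realistic.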
Machine Learning
Robust
Flexible
Straightforward (in R)
In practice, works similarly to PMM, with a tree in place of the regression
Uses CART and Rubin’s rules automatically to yield a single dataset
Does not account for uncertainty and increases P values
Usually similar to average of all predictions from the CART models
Better predictions and accuracy than a single CART model
Shrinks regression coefficients (good for high multicollinearity)
Best for high dimension datasets
Preserves relationships between variables best
Removes some predictor variables (easier to use)
May add bias and nonsense results
Equation (Musoro et al. 2014)
\[ (\hat{\beta}_0, \hat{\beta}^{lasso}) = \operatorname*{arg\,min}_{\beta_0,\,\beta}\left[\sum_{i}\left(Y_i-(\beta_0+\beta X^T_i)\right)^2+\lambda \sum_{j} |\beta_j|\right] \tag{5} \]
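To make the penalty in equation (5) concrete, the sketch below evaluates the lasso objective for two candidate coefficient vectors. The function name `lasso_objective` and the data are hypothetical; this only evaluates the objective and is not the solver used by the `lasso.norm` method.

```r
# Lasso objective from equation (5): residual sum of squares plus an L1 penalty.
lasso_objective <- function(beta0, beta, X, y, lambda) {
  residuals <- y - (beta0 + X %*% beta)          # Y_i - (beta_0 + beta X_i^T)
  sum(residuals^2) + lambda * sum(abs(beta))     # fit term + lambda * sum |beta_j|
}

# Made-up data: 4 observations, 2 predictors (matrix is filled column by column)
X <- matrix(c(1, 2, 3, 4,
              0, 1, 0, 1), ncol = 2)
y <- c(2, 4, 6, 8)   # y equals 2 * X[, 1] exactly

# A sparse candidate (second coefficient set to zero) versus a dense one
obj_sparse <- lasso_objective(0, c(2, 0),   X, y, lambda = 1)
obj_dense  <- lasso_objective(0, c(2, 0.5), X, y, lambda = 1)
print(c(sparse = obj_sparse, dense = obj_dense))
```

Here the sparse candidate has the lower objective, since the second coefficient adds penalty without improving the fit. This is the mechanism by which the lasso drops some predictor variables.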
Prior to implementing the MICE method
Used a missingness map to visualize the missing data
Verified that three variables have missing entries: Provisional Income (408), Total Medicare Standardized Payment (918), and Total Medicare Payment (908)
Total of 2244 missing data points
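A quick way to verify counts like these is to tally `NA` values per column. The sketch below uses a small made-up data frame (`toy`) that only borrows the column names of the real dataset; the values are not from the Medicare data.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  PrvInc        = c(0.10, NA, 0.25, 0.05),
  Stdzd_Pymt_PC = c(20000, 21000, NA, NA),
  Pymt_PC       = c(19500, NA, 23000, 24000)
)

missing_per_variable <- colSums(is.na(toy))   # NA count for each column
total_missing <- sum(is.na(toy))              # NA count across the whole table
share_missing <- mean(is.na(toy))             # proportion of all cells that are NA
print(missing_per_variable)
```

The same three calls work unchanged on a larger data frame, so they can serve as a numeric check alongside the missingness map.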
## Load packages: readxl (read_excel), dplyr (%>%, select), mice (mice, complete),
## ggmice (plot_pattern), and missForest (prodNA)
library(readxl)
library(dplyr)
library(mice)
library(ggmice)
library(missForest)
## Read the data file (downloaded from GitHub)
Chronic_Conditions <- read_excel("Chronic_Conditions.xlsx")
## Keep only the three variables that contain missing entries
Chronic_Conditions <- Chronic_Conditions %>%
  select(PrvInc, Stdzd_Pymt_PC, Pymt_PC)
## Plot the pattern of missing entries per variable (complete columns were omitted)
plot_pattern(Chronic_Conditions)
## Impute missing data using the three selected MICE methods;
## complete() returns the first of the imputed datasets by default
mice_imputed <- data.frame(
  original = Chronic_Conditions$PrvInc,
  imputed_pmm = complete(mice(Chronic_Conditions, method = "pmm", printFlag = FALSE))$PrvInc,
  imputed_cart = complete(mice(Chronic_Conditions, method = "cart", printFlag = FALSE))$PrvInc,
  imputed_lasso = complete(mice(Chronic_Conditions, method = "lasso.norm", printFlag = FALSE))$PrvInc
)
## Preparation for missForest imputation: prodNA() adds a further 10% of missing values at random
Chronic_Conditions.mis <- prodNA(Chronic_Conditions, noNA = 0.1)
     PrvInc Stdzd_Pymt_PC Pymt_PC
2707      1             1       1    0
320       1             1       0    1
296       1             0       1    1
762       1             0       0    2
513       0             1       1    1
64        0             1       0    2
52        0             0       1    2
200       0             0       0    3
        829          1310    1346 3485
     PrvInc       Stdzd_Pymt_PC      Pymt_PC
 Min.   :0.0000   Min.   :    0   Min.   :     0
 1st Qu.:0.0394   1st Qu.:18725   1st Qu.: 18500
 Median :0.1114   Median :23292   Median : 23141
 Mean   :0.1733   Mean   :24099   Mean   : 24142
 3rd Qu.:0.2767   3rd Qu.:28251   3rd Qu.: 28356
 Max.   :0.7588   Max.   :89235   Max.   :100028
 NA's   :829      NA's   :1310    NA's   :1346
Missing data is common
Multiple Imputation (MI) completes a data set with missing data points by filling them in several times, producing multiple imputed datasets
Less biased and more accurate outcome
Multivariate Imputation by Chained Equations (MICE) is an effective way of imputing data
Benefits
Reduced uncertainty
Reduced bias
Multiple methods of imputation in MICE package (PMM, lasso, and CART)