PURPOSE
In distributed research network (DRN) settings, multiple imputation cannot be applied directly because pooling individual-level data across sites is often not feasible. How well multiple imputation performs when combined with meta-analysis within DRNs is not well understood.
METHODS
We evaluated the performance of imputation for missing baseline covariate data, combined with meta-analysis, for time-to-event analysis within DRNs. Through simulation studies motivated by a real-world data set, we compared two parametric algorithms, an approximated linear imputation model (Approx) and a nonlinear substantive-model-compatible imputation model (SMC), with two non-parametric machine learning algorithms, random forest (RF) and classification and regression trees (CART).
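As a minimal sketch of how site-level imputation might be combined with meta-analysis, the example below pools logHR estimates across imputed data sets within each site using Rubin's rules and then combines the site-level results with a fixed-effect inverse-variance meta-analysis. The choice of a fixed-effect model is an assumption made here for illustration, and the function names and all numbers are hypothetical rather than taken from the study.

```python
from statistics import mean


def rubin_pool(estimates, variances):
    """Pool m imputed-data estimates at one site with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    within = mean(variances)                     # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between       # total variance
    return q_bar, total


def fixed_effect_meta(estimates, variances):
    """Combine site-level estimates by inverse-variance weighting.

    Assumption for illustration: a common (fixed) effect across sites.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * q for w, q in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


# Hypothetical logHR estimates and variances from m = 3 imputations at 2 sites.
site_inputs = [
    ([0.21, 0.25, 0.19], [0.010, 0.011, 0.009]),
    ([0.30, 0.27, 0.33], [0.020, 0.019, 0.021]),
]
site_results = [rubin_pool(q, v) for q, v in site_inputs]
log_hr, var = fixed_effect_meta(
    [q for q, _ in site_results], [v for _, v in site_results]
)
print(f"pooled logHR = {log_hr:.3f}, SE = {var ** 0.5:.3f}")
```

Only the site-level estimates and variances cross site boundaries in this sketch, which is what makes the approach usable when individual-level data cannot be shared.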
RESULTS
When effect sizes (log hazard ratios, logHRs) were small and missingness mechanisms were homogeneous across sites, all imputation methods produced unbiased estimates that were more efficient than those from the complete-case analysis, which could be biased and inefficient. Under heterogeneous missingness mechanisms, estimates from the RF method could have higher efficiency. Estimates from distributed imputation combined by meta-analysis were similar to those from imputation using pooled data. When logHRs were large, the SMC imputation algorithm generally performed better than the others.
CONCLUSIONS
These findings suggest that imputation within DRNs is valid and feasible in the presence of missing covariate data. The performance of the four imputation algorithms varies with the effect size and the level of missingness.