This article is part of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction.
This article contains examples that illustrate some of the issues involved in using multiple imputation. Articles in the Multiple Imputation in Stata series refer to these examples, and more discussion of the principles involved can be found in those articles. However, it is possible to read this article independently, or to just read about the particular example that interests you (see the list of examples below).
These examples are not intended to test the validity of the techniques used or rigorously compare their effectiveness. However, they should give you some intuition about how they work and their strengths and weaknesses. Some of the simulation parameters (and occasionally the seed for the random number generator) were chosen in order to highlight the issue at hand, but none of them are atypical of real-world situations. The data are generated randomly from standard distributions in such a way that the "right" answers are known and you can see how closely different techniques approach those answers.
A Stata do file is provided for each example, along with commentary and selected output (in this web page). The do file also generates the data set used, with a set seed for reproducibility. Our suggestion is that you open the do file in Stata's do file editor or your favorite text editor and read it in parallel with the discussion in the article. Please note that the example do files are not intended to demonstrate the entire process of multiple imputation—they don't always check the fit or convergence of the imputation models, for example, which are very important things to do in real world use of multiple imputation.
Each example concludes with a "Lessons Learned" section, but we'd like to highlight one overall lesson: Multiple imputation can be a useful tool, but there are many ways to get it wrong and invalidate your results. Be very careful, and don't expect it to be quick and easy.
The examples are:
- Power
- MCAR vs. MAR vs. MNAR
- Imputing the Dependent Variable
- Non-Normal Data
- Transformations
- Non-Linearity
- Interactions
Power
The most common motivation for using multiple imputation is to try to increase the power of statistical tests by increasing the number of observations used (i.e. by not losing observations due to missing values). In our experience it rarely makes a large difference in practice. This example uses ideal circumstances to illustrate what extremely successful multiple imputation would look like.
Data
Observations: 1,000
Variables:
- x1-x10 drawn from standard normal distribution (independently)
- y is the sum of all x's, plus a normal error term
Missingness: Each value of the x variables has a 10% chance of being missing (MCAR).
Right Answers: Regressing y on all the x's, each x should have a coefficient of 1.
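For concreteness, data like this could be generated along the following lines. This is a minimal sketch, not the do file itself; the do file sets its own seed and may differ in details.

clear
set seed 12345                      // illustrative seed; the do file uses its own
set obs 1000

// ten independent standard normal covariates
forvalues i = 1/10 {
    gen x`i' = rnormal()
}

// y is the sum of the x's plus a standard normal error
egen xsum = rowtotal(x1-x10)
gen y = xsum + rnormal()
drop xsum

// each x value is missing completely at random with probability 10%
forvalues i = 1/10 {
    replace x`i' = . if runiform() < 0.1
}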
Procedure
Begin with complete cases analysis:
      Source |       SS       df       MS              Number of obs =     369
-------------+------------------------------           F( 10,   358) =    6.71
       Model |  3882.86207    10  388.286207           Prob > F      =  0.0000
    Residual |  20722.7734   358   57.884842           R-squared     =  0.1578
-------------+------------------------------           Adj R-squared =  0.1343
       Total |  24605.6355   368    66.86314           Root MSE      =  7.6082

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .733993   .4216582     1.74   0.083    -.0952453    1.563231
          x2 |   1.664231   .3867292     4.30   0.000      .903684    2.424777
          x3 |    1.51406   .3907875     3.87   0.000     .7455327    2.282588
          x4 |   .2801067    .395164     0.71   0.479    -.4970278    1.057241
          x5 |   .8524305   .4076557     2.09   0.037     .0507297    1.654131
          x6 |   .7704437    .413519     1.86   0.063     -.042788    1.583675
          x7 |   .6512155   .3938107     1.65   0.099    -.1232575    1.425689
          x8 |   .9173208   .3969585     2.31   0.021     .1366572    1.697984
          x9 |   .8736406   .4115488     2.12   0.034     .0642835    1.682998
         x10 |   .9222064   .4123417     2.24   0.026       .11129    1.733123
       _cons |  -.1999121   .3975523    -0.50   0.615    -.9817434    .5819192
------------------------------------------------------------------------------
Note that although each x value has just a 10% chance of being missing, because we have ten x variables per observation almost 2/3 of the observations are missing at least one value and must be dropped. As a result, the standard errors are quite large, and the coefficients on four of the ten x's are not significantly different from zero.
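The imputation and pooled analysis can then be run roughly as follows (again a sketch under the assumptions above; a real do file should also check the imputation models):

mi set flong
mi register imputed x1-x10
mi register regular y

// impute each x by linear regression on y and the other x's, creating 10 imputations
mi impute chained (regress) x1-x10 = y, add(10)

// fit the analysis model on each imputation and combine the results
mi estimate: regress y x1-x10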
After multiple imputation, the same regression gives the following:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.1483
                                                Largest FMI       =     0.2356
                                                Complete DF       =        989
DF adjustment:   Small sample                   DF:     min       =     142.63
                                                        avg       =     425.17
                                                        max       =     928.47
Model F test:       Equal FMI                   F(  10,  802.0)   =      18.13
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .793598   .2679515     2.96   0.003     .2657743    1.321422
          x2 |   1.133405   .2482047     4.57   0.000     .6460259    1.620785
          x3 |    1.33182   .2521916     5.28   0.000     .8364684    1.827172
          x4 |   1.159887   .2714759     4.27   0.000      .624309    1.695465
          x5 |   1.181207   .2901044     4.07   0.000      .607747    1.754667
          x6 |   1.305636   .2835268     4.60   0.000     .7466279    1.864645
          x7 |   .6258813   .2568191     2.44   0.015      .121268    1.130494
          x8 |   1.143631    .253376     4.51   0.000     .6461328     1.64113
          x9 |   1.112347   .2838261     3.92   0.000     .5520503    1.672644
         x10 |   1.053309   .2612807     4.03   0.000     .5397648    1.566854
       _cons |   .0305628   .2499378     0.12   0.903    -.4599457    .5210712
------------------------------------------------------------------------------
The standard errors are much smaller than with complete cases analysis, and all of the coefficients are significantly different from zero. This illustrates the primary motivation for using multiple imputation.
Lessons Learned
This data set is ideal for multiple imputation because it has large numbers of observations with partial data. Complete cases analysis must discard all these observations. Multiple imputation can use the information they contain to improve the results.
Note that the imputation model could not do a very good job of predicting the missing values of the x's from the observed data. The x's are completely independent of each other and thus have no predictive power for one another. y has some predictive power, but not much: if you regress each x on y and all the other x's (which is what the imputation model does), the R-squared values are all less than 0.1 and most are less than 0.03. The success of multiple imputation does not depend on imputing the "right" values of the missing variables for individual observations, but rather on correctly modeling their distribution conditional on the observed data.
MCAR vs. MAR vs. MNAR
Whether your data are Missing Completely at Random (probability of being missing does not depend on either observed or unobserved data), Missing at Random (probability of being missing depends only on observed data), or Missing Not at Random (probability of being missing depends on unobserved data) is very important in deciding how to analyze it. This example shows how complete cases analysis and multiple imputation respond to different mechanisms for missingness.
Data
Observations: 1,000
Variables:
- x is drawn from standard normal distribution
- y is x plus a normal error term
Missingness:
- y is always observed
- First run: probability of x being missing is 10% for all observations (MCAR)
- Second run: probability of x being missing is proportional to y (MAR)
- Third run: probability of x being missing is proportional to x (MNAR)
Right Answers: Regressing y on x, x should have a coefficient of 1.
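One way such missingness could be generated is sketched below. The exact functions and constants are illustrative assumptions, not the ones in the do file; the point is only that the MAR probability depends on the always-observed y while the MNAR probability depends on x itself. (The do file carries out three separate runs; the three mechanisms are shown side by side here only for compactness.)

clear
set seed 12345
set obs 1000
gen x = rnormal()
gen y = x + rnormal()

// MCAR: every value of x has the same 10% chance of being missing
gen x_mcar = cond(runiform() < 0.10, ., x)

// MAR: the chance of x being missing increases with the observed y
gen x_mar  = cond(runiform() < 0.5*normal(y), ., x)

// MNAR: the chance of x being missing increases with x itself
gen x_mnar = cond(runiform() < 0.5*normal(x), ., x)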
Procedure
We'll analyze this data set three times, once when it is MCAR, once when it is MAR, and once when it is MNAR.
MCAR
In the first run, the data set is MCAR and both complete cases analysis and multiple imputation give unbiased results.
Complete cases analysis:
      Source |       SS       df       MS              Number of obs =     882
-------------+------------------------------           F(  1,   880) =  932.47
       Model |   918.60551     1   918.60551           Prob > F      =  0.0000
    Residual |  866.913167   880  .985128598           R-squared     =  0.5145
-------------+------------------------------           Adj R-squared =  0.5139
       Total |  1785.51868   881  2.02669543           Root MSE      =  .99254

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9842461   .0322319    30.54   0.000     .9209858    1.047506
       _cons |  -.0481664   .0334249    -1.44   0.150    -.1137683    .0174354
------------------------------------------------------------------------------
Multiple imputation:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.1112
                                                Largest FMI       =     0.1583
                                                Complete DF       =        998
DF adjustment:   Small sample                   DF:     min       =     262.51
                                                        avg       =     541.98
                                                        max       =     821.46
Model F test:       Equal FMI                   F(   1,  821.5)   =    1014.02
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9837383   .0308928    31.84   0.000     .9231003    1.044376
       _cons |  -.0377668   .0342445    -1.10   0.271    -.1051958    .0296621
------------------------------------------------------------------------------
MAR
Now, consider the case where the probability of x being missing is proportional to y (which is always observed), making the data MAR. With MAR data, complete cases analysis is biased:
      Source |       SS       df       MS              Number of obs =     652
-------------+------------------------------           F(  1,   650) =  375.85
       Model |  252.090968     1  252.090968           Prob > F      =  0.0000
    Residual |  435.966034   650  .670716976           R-squared     =  0.3664
-------------+------------------------------           Adj R-squared =  0.3654
       Total |  688.057002   651   1.0569232           Root MSE      =  .81897

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .6857927    .035374    19.39   0.000     .6163317    .7552538
       _cons |   -.558313   .0352209   -15.85   0.000    -.6274736   -.4891525
------------------------------------------------------------------------------
However, multiple imputation gives unbiased estimates:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.4234
                                                Largest FMI       =     0.3335
                                                Complete DF       =        998
DF adjustment:   Small sample                   DF:     min       =      78.63
                                                        avg       =      90.20
                                                        max       =     101.78
Model F test:       Equal FMI                   F(   1,   78.6)   =     750.78
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9909115   .0361642    27.40   0.000     .9189232      1.0629
       _cons |  -.0544014   .0366092    -1.49   0.140    -.1270175    .0182146
------------------------------------------------------------------------------
MNAR
Finally, consider the case where the probability of x being missing is proportional to x itself. This makes the data missing not at random (MNAR), and with MNAR data both complete cases analysis and multiple imputation are biased.
Complete cases analysis:
      Source |       SS       df       MS              Number of obs =     679
-------------+------------------------------           F(  1,   677) =  463.29
       Model |   464.94386     1   464.94386           Prob > F      =  0.0000
    Residual |  679.422756   677  1.00357866           R-squared     =  0.4063
-------------+------------------------------           Adj R-squared =  0.4054
       Total |  1144.36662   678  1.68785636           Root MSE      =  1.0018

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.093581   .0508074    21.52   0.000     .9938225     1.19334
       _cons |   .0420364   .0469578     0.90   0.371     -.050164    .1342368
------------------------------------------------------------------------------
Multiple imputation:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.3930
                                                Largest FMI       =     0.3425
                                                Complete DF       =        998
DF adjustment:   Small sample                   DF:     min       =      74.96
                                                        avg       =     144.00
                                                        max       =     213.04
Model F test:       Equal FMI                   F(   1,   75.0)   =     561.03
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.223772   .0516665    23.69   0.000     1.120846    1.326698
       _cons |   .3816984   .0402614     9.48   0.000     .3023366    .4610602
------------------------------------------------------------------------------
Complete cases analysis actually does better with this particular data set, but that's not true in general.
Lessons Learned
Always investigate whether your data set is plausibly MCAR or MAR. (For example, run logits on indicators of missingness and see whether anything observed predicts missingness; if it does, the data are not MCAR, though they may still be MAR.) If your data set is MAR, consider using multiple imputation rather than complete cases analysis.
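For example, such a check might look like the following sketch, using the variables from this example:

// create indicators for missingness (miss_x = 1 if x is missing)
misstable summarize, generate(miss_)

// does anything observed predict the missingness of x?
logit miss_x y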
MNAR, by definition, cannot be detected by looking at the observed data. You'll have to think carefully about how the data were collected and consider whether some values of a variable might make it more or less likely to be observed. Unfortunately, the options for working with MNAR data are limited.
Imputing the Dependent Variable
Many researchers believe it is inappropriate to use imputed values of the dependent variable in the analysis model, especially if the variables used in the imputation model are the same as the variables used in the analysis model. The thinking is that the imputed values add no information since they were generated using the model that is being analyzed.
Unfortunately, this sometimes is misunderstood as meaning that the dependent variable should not be included in the imputation model. That would be a major mistake.
Complete code for this example
Data
Observations: 10,000
Variables:
- x1-x3 drawn from standard normal distribution (independently)
- y is the sum of all x's, plus a normal error term
Missingness: y and x1-x3 have a 20% probability of being missing (MCAR).
Right Answers: Regressing y on x1-x3, the coefficient on each should be 1.
Procedure
The following are the results of complete cases analysis:
      Source |       SS       df       MS              Number of obs =    4079
-------------+------------------------------           F(  3,  4075) = 3944.41
       Model |  12264.8582     3  4088.28608           Prob > F      =  0.0000
    Residual |  4223.63457  4075  1.03647474           R-squared     =  0.7438
-------------+------------------------------           Adj R-squared =  0.7437
       Total |  16488.4928  4078  4.04327926           Root MSE      =  1.0181

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .9834974   .0160921    61.12   0.000     .9519481    1.015047
          x2 |   1.000172   .0160265    62.41   0.000     .9687511    1.031593
          x3 |   1.003089   .0159724    62.80   0.000     .9717744    1.034404
       _cons |  -.0003362   .0159445    -0.02   0.983    -.0315962    .0309238
------------------------------------------------------------------------------
If you leave y out of the imputation model, imputing only x1-x3, the coefficients on the x's in the final model are biased towards zero:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       7993
                                                Average RVI       =     0.4014
                                                Largest FMI       =     0.3666
                                                Complete DF       =       7989
DF adjustment:   Small sample                   DF:     min       =      72.62
                                                        avg       =     119.53
                                                        max       =     161.84
Model F test:       Equal FMI                   F(   3,  223.0)   =    1761.75
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .7946113   .0185175    42.91   0.000     .7580219    .8312007
          x2 |   .7983325   .0199962    39.92   0.000     .7584766    .8381885
          x3 |   .8001109   .0193738    41.30   0.000     .7616436    .8385782
       _cons |   .0041643   .0183891     0.23   0.821    -.0321492    .0404778
------------------------------------------------------------------------------
Why is that? An easy way to see the problem is to look at correlations. First, the correlation between y and x1 in the unimputed data:
             |        y       x1
-------------+------------------
           y |   1.0000
          x1 |   0.5032   1.0000
Now, the correlation between y and x1 for those observations where x1 is imputed (this is calculated for the first imputation, but the others are similar):
             |        y       x1
-------------+------------------
           y |   1.0000
          x1 |  -0.0073   1.0000
Because y is not included in the imputation model, the imputation model creates values of x1 (and x2 and x3) which are not correlated with y. This does not match the observed data. It also biases the results of the final model by adding observations in which y really is unrelated to x1, x2, and x3.
This problem goes away if y is included in the imputation model. Given the nature of chained equations, this means that values of y must be imputed. However, you're under no obligation to use those values in your analysis model. Simply create an indicator variable for "y is missing in the observed data," which can be done automatically with misstable sum, gen(miss_), and then add if !miss_y to the regression command. This restricts the regression to those observations where y is actually observed, so any imputed values of y are not used. Here are the results:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       7993
                                                Average RVI       =     0.6402
                                                Largest FMI       =     0.5076
                                                Complete DF       =       7989
DF adjustment:   Small sample                   DF:     min       =      38.27
                                                        avg       =      82.39
                                                        max       =     182.83
Model F test:       Equal FMI                   F(   3,  137.9)   =    4691.72
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .9852388   .0150348    65.53   0.000     .9550372     1.01544
          x2 |   .9923264   .0158077    62.77   0.000     .9603327     1.02432
          x3 |   .9936324   .0129186    76.91   0.000     .9681437    1.019121
       _cons |  -.0015309   .0145755    -0.11   0.917    -.0306997    .0276378
------------------------------------------------------------------------------
On the other hand, using the imputed values of y turns out to make almost no difference in this case (see the complete code).
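For reference, the workflow just described might look like this in outline (a sketch; see the complete code for the full do file):

// flag observations where y is missing before imputing
misstable summarize, generate(miss_)

mi set flong
mi register imputed y x1 x2 x3

// y appears in every imputation model and is itself imputed
mi impute chained (regress) y x1 x2 x3, add(10)

// but the analysis model uses only observations where y was actually observed
mi estimate: regress y x1 x2 x3 if !miss_y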
Lessons Learned
Always include the dependent variable in your imputation model. Whether you should use imputed values of the dependent variable in your analysis model is unclear, but always impute them.
Non-Normal Data
The obvious way to impute a continuous variable is linear regression (i.e. the regress method in mi impute chained). However, regression imputation draws imputed values by adding a normally distributed error term to the linear prediction. What happens if a variable is not at all normally distributed?
Complete code for this example
Data
Observations: 10,000
Variables:
- g is a binary (1/0) variable with a 50% probability of being 1
- x is drawn from the standard normal distribution, then 5 is added if g is 1. Thus x is bimodal.
- y is x plus a normal error term
Missingness: Both y and x have a 10% probability of being missing (MCAR).
Right Answers: Regressing y on x, x should have a coefficient of 1.
Procedure
Given the way x was constructed, it has a bimodal distribution and is definitely not normal:
The obvious imputation model (regress x on y) captures some of this bimodality because of the influence of y. However, the error term is normal and does not take into account the non-normal distribution of the data. Thus this model is too likely to create imputed values that are near 2.5 (the "valley" between the two "peaks") and the distribution of the imputed data is "too normal":
However, the regression results are quite good:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.2471
                                                Largest FMI       =     0.1977
                                                Complete DF       =       9998
DF adjustment:   Small sample                   DF:     min       =     238.92
                                                        avg       =     414.68
                                                        max       =     590.43
Model F test:       Equal FMI                   F(   1,  590.4)   =   63916.86
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.001225   .0039603   252.82   0.000     .9934473    1.009003
       _cons |  -.0068119   .0152489    -0.45   0.655    -.0368514    .0232276
------------------------------------------------------------------------------
Replacing regression with Predictive Mean Matching gives a much better fit. PMM begins with regression, but it then finds the observed value of x that is the nearest match. Because there are fewer observed values in the "valley" PMM is less likely to find a match there, resulting in a distribution that is closer to the original:
Regression results after imputing with PMM are also good:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.2885
                                                Largest FMI       =     0.2698
                                                Complete DF       =       9998
DF adjustment:   Small sample                   DF:     min       =     131.79
                                                        avg       =     148.15
                                                        max       =     164.50
Model F test:       Equal FMI                   F(   1,  131.8)   =   53733.02
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9976108   .0043037   231.80   0.000     .9890976    1.006124
       _cons |   .0105094   .0156034     0.67   0.502    -.0202993    .0413181
------------------------------------------------------------------------------
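The PMM imputation might be specified as follows. This is a sketch: knn(5) is an arbitrary choice for the number of nearest observed values to draw from, and the do file may use a different number of matches.

mi set flong
mi register imputed x y

// predictive mean matching for x: regress, then draw from the closest observed values
mi impute chained (pmm, knn(5)) x (regress) y, add(10)
mi estimate: regress y x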
On the other hand, x is distributed normally within each g group. If you impute the two groups separately then ordinary regression fits the data quite well:
The regression results are also good:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.6503
                                                Largest FMI       =     0.4656
                                                Complete DF       =       9998
DF adjustment:   Small sample                   DF:     min       =      45.53
                                                        avg       =      63.49
                                                        max       =      81.45
Model F test:       Equal FMI                   F(   1,   81.4)   =   48664.75
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9996345   .0045314   220.60   0.000     .9906191     1.00865
       _cons |   .0032657   .0183149     0.18   0.859    -.0336105    .0401419
------------------------------------------------------------------------------
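Imputing the two groups separately can be done with the by() option of mi impute chained, roughly as follows (a sketch under the same assumptions):

mi set flong
mi register imputed x y

// run the entire chained-equations procedure separately within each level of g
mi impute chained (regress) x y, add(10) by(g)
mi estimate: regress y x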
Lessons Learned
PMM can be a very effective tool for imputing non-normal data. On the other hand, if you can identify groups whose data may vary in systematically different ways, consider imputing them separately.
However, the regression results were uniformly good, even when the data were imputed using the original regression model and the distribution of the imputed values did not match the distribution of the observed values very well. In part this is because we had a large number of observations and a simple model, but in general the relationship between the face validity of the imputed values and the validity of the estimates based on them is not well understood.
Transformations
Given non-normal data, it's appealing to try to transform it in a way that makes it more normal. But what if the other variables in the model are related to the original values of the variable rather than the transformed values?
Complete code for this example
Data
Observations: 10,000
Variables:
- x1 is drawn from the standard normal distribution, then exponentiated. Thus it is log-normal
- x2 is drawn from the standard normal distribution
- y is the sum of x1, x2 and a normal error term
Missingness: x1, x2 and y have a 10% probability of being missing (MCAR)
Right Answers: Regressing y on x1 and x2, both should have a coefficient of 1.
Procedure
First, complete cases analysis for comparison. It does quite well, as we'd expect with MCAR data (and a simple model with lots of observations):
      Source |       SS       df       MS              Number of obs =    7299
-------------+------------------------------           F(  2,  7296) =19024.33
       Model |  38421.4045     2  19210.7023           Prob > F      =  0.0000
    Residual |  7367.47472  7296  1.00979643           R-squared     =  0.8391
-------------+------------------------------           Adj R-squared =  0.8391
       Total |  45788.8793  7298  6.27416816           Root MSE      =  1.0049

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.001594   .0057231   175.01   0.000     .9903751    1.012813
          x2 |   .9973623   .0117407    84.95   0.000     .9743471    1.020377
       _cons |   .0035444   .0149663     0.24   0.813     -.025794    .0328827
------------------------------------------------------------------------------
This data set presents a dilemma for imputation: x1 can be made normal simply by taking its log, but y is related to x1, not the log of x1. Regressing ln(x1) on x2 and y (as the obvious imputation model will do) results in the following plot of residuals vs. fitted values (rvfplot):
If the model were specified correctly, we'd expect the residuals to be scattered randomly around zero regardless of the fitted value. Clearly that's not the case.
The obvious way to impute this data set would be to (see the sketch after this list):
- Log transform x1 by creating lnx1 = ln(x1)
- Impute lnx1 using regression
- Create the passive variable ix1 = exp(lnx1) to reverse the transformation
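A sketch of that workflow, assuming the variable names above (ix1 then replaces x1 in the analysis model):

gen lnx1 = ln(x1)

mi set flong
mi register imputed lnx1 x2 y

// impute the (roughly normal) log of x1 by linear regression
mi impute chained (regress) lnx1 x2 y, add(10)

// reverse the transformation as a passive variable
mi passive: gen ix1 = exp(lnx1)

mi estimate: regress y ix1 x2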
Here are the regression results after doing so:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =   196.4421
                                                Largest FMI       =     0.9987
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =       5.89
                                                        avg       =     242.41
                                                        max       =     701.09
Model F test:       Equal FMI                   F(   2,   12.8)   =       3.84
Within VCE type:          OLS                   Prob > F          =     0.0493

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ix1 |    .014417   .0213705     0.67   0.525    -.0381112    .0669452
          x2 |   .9951566   .0235333    42.29   0.000     .9489525    1.041361
       _cons |   1.583044   .0384844    41.13   0.000     1.502835    1.663253
------------------------------------------------------------------------------
As you see, the coefficient on ix1 (imputed untransformed x1) is very strongly biased towards zero. Now try imputing lnx1 using Predictive Mean Matching rather than regression:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.7058
                                                Largest FMI       =     0.6292
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =      24.80
                                                        avg       =      53.56
                                                        max       =      74.84
Model F test:       Equal FMI                   F(   2,   52.8)   =    8286.63
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ix1 |   .9554301    .009311   102.61   0.000     .9362457    .9746144
          x2 |   .9537128   .0146582    65.06   0.000     .9245112    .9829144
       _cons |   .0843407   .0192948     4.37   0.000     .0457588    .1229226
------------------------------------------------------------------------------
Now both coefficients are biased towards zero, though not nearly as strongly.
What if we simply ignore the fact that x1 is not normally distributed and impute it directly? The results are actually better:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.3873
                                                Largest FMI       =     0.3401
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =      84.25
                                                        avg       =     179.94
                                                        max       =     337.16
Model F test:       Equal FMI                   F(   2,  133.4)   =   17538.58
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .9980159   .0060605   164.68   0.000     .9859646    1.010067
          x2 |   .9912113   .0117879    84.09   0.000      .967869    1.014554
       _cons |   .0108679   .0140238     0.77   0.439    -.0167172     .038453
------------------------------------------------------------------------------
This clearly works well, but raises the question of face validity. Here is a kernel density plot of the observed values of x1:
Here is a kernel density plot of the imputed values of x1:
The obvious problem is that the imputed values of x1 are frequently less than zero while the observed values of x1 are always positive. This would be especially problematic if you thought you might want to use a log transform later in the process.
Given the constraint that x1 cannot be less than zero, truncreg seems like a plausible alternative. Unfortunately, truncreg fails to converge for this data set (for reasons not yet clear to us). But PMM will also honor that constraint:
Here are the regression results with PMM:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.3702
                                                Largest FMI       =     0.3664
                                                Complete DF       =       9997
DF adjustment:   Small sample                   DF:     min       =      72.89
                                                        avg       =     147.11
                                                        max       =     193.33
Model F test:       Equal FMI                   F(   2,  145.1)   =   17890.15
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.000568   .0056325   177.64   0.000     .9894588    1.011677
          x2 |   .9977033    .012447    80.16   0.000     .9728958    1.022511
       _cons |    .009171    .014612     0.63   0.531    -.0196674    .0380094
------------------------------------------------------------------------------
Lessons Learned
The lesson of this example is not that you should never transform covariates. Keep in mind that what makes transforming x1 problematic in this example is that y is known to be linearly related to x1 itself, not to ln(x1).
The real lesson is that misspecification in your imputation model can cause problems in your analysis model. Be sure to run the regressions implied by your imputation models separately and check them for misspecification. A secondary lesson is that PMM can work well for non-normal data.
Non-Linearity
Non-linear terms in your analysis model present a major challenge in creating an imputation model, because the non-linear relationship between variables can't be easily inverted.
Note: in this example we'll frequently use Stata's syntax for interactions to create squared terms. A varlist of c.x##c.x is equivalent to the varlist x x2 where x2 is defined by gen x2=x^2. The squared term shows up in the output as c.x#c.x. Using c.x##c.x is convenient because you don't have to create a separate variable, and because in that form post-estimation commands can take into account the fact that x and x^2 are not independent variables.
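For example, the following two regressions fit the same model; the factor-variable version just avoids creating an extra variable (this is ordinary Stata syntax, not something specific to mi):

// explicit squared term
gen x2 = x^2
regress y x x2

// equivalent factor-variable version; the squared term appears as c.x#c.x
regress y c.x##c.x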
Complete code for this example
Data
Observations: 1,000
Variables:
- x is drawn from the standard normal distribution
- y is x plus x^2 plus a normal error term
Missingness: y and x have a 10% probability of being missing (MCAR).
Right Answers: Regressing y on x and x^2, both should have a coefficient of 1.
Procedure
First, complete cases analysis for comparison:
      Source |       SS       df       MS              Number of obs =     815
-------------+------------------------------           F(  2,   812) = 1184.66
       Model |   2350.4449     2  1175.22245           Prob > F      =  0.0000
    Residual |  805.533089   812  .992035824           R-squared     =  0.7448
-------------+------------------------------           Adj R-squared =  0.7441
       Total |  3155.97799   814  3.87712284           Root MSE      =  .99601

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9798841   .0344861    28.41   0.000     .9121917    1.047577
             |
     c.x#c.x |   .9948393   .0244134    40.75   0.000     .9469185     1.04276
             |
       _cons |  -.0047983   .0429475    -0.11   0.911    -.0890994    .0795029
------------------------------------------------------------------------------
The obvious thing to do is to impute x, then allow x^2 to be a passive function of x (it makes no difference whether you create a variable for x^2 using mi passive or allow Stata to do it for you by putting c.x##c.x in your regression). However, this means that the imputation model for x simply regresses x on y. This is misspecified, as you can see from an rvfplot:
Here are the regression results. Because the imputation model is misspecified, the coefficient on x^2 (c.x#c.x) is biased:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.9729
                                                Largest FMI       =     0.5328
                                                Complete DF       =        997
DF adjustment:   Small sample                   DF:     min       =      32.79
                                                        avg       =      37.89
                                                        max       =      46.68
Model F test:       Equal FMI                   F(   2,   50.2)   =     351.77
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9586506   .0570451    16.81   0.000     .8427484    1.074553
             |
     c.x#c.x |    .844537   .0409527    20.62   0.000     .7611974    .9278765
             |
       _cons |   .1986436   .0662195     3.00   0.004     .0654027    .3318845
------------------------------------------------------------------------------
Next impute x using PMM. It's an improvement but still biased:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.3974
                                                Largest FMI       =     0.3984
                                                Complete DF       =        997
DF adjustment:   Small sample                   DF:     min       =      56.92
                                                        avg       =     101.37
                                                        max       =     134.94
Model F test:       Equal FMI                   F(   2,  167.9)   =     763.23
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9511347    .041403    22.97   0.000     .8691019    1.033167
             |
     c.x#c.x |   .9170003    .027871    32.90   0.000     .8618798    .9721207
             |
       _cons |   .1160666   .0565846     2.05   0.045     .0027546    .2293787
------------------------------------------------------------------------------
The include() option of mi impute chained is usually used to add variables to the imputation model of an individual variable, but it can also accept expressions. Does adding x^2 to the imputation for y with (regress, include((x^2))) y fix the problem? Unfortunately not, though it helps:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     1.1210
                                                Largest FMI       =     0.7256
                                                Complete DF       =        997
DF adjustment:   Small sample                   DF:     min       =      17.48
                                                        avg       =      70.29
                                                        max       =     118.18
Model F test:       Equal FMI                   F(   2,   39.1)   =     377.13
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9629371   .0452655    21.27   0.000     .8727678    1.053106
             |
     c.x#c.x |   .8874316   .0475562    18.66   0.000     .7873051    .9875581
             |
       _cons |    .125851    .053619     2.35   0.021     .0196724    .2320296
------------------------------------------------------------------------------
The imputation model for y has been fixed, but the imputation model for x is still misspecified and this still biases the results of the analysis model. Unfortunately, the true dependence of x on y does not match any standard regression model.
Next we'll use what White, Royston and Wood (Multiple imputation using chained equations: Issues and guidance for practice, Statistics in Medicine, 2011) call the "Just Another Variable" approach. This creates a variable to contain x^2 (x2) and then imputes both x and x2 as if x2 were "just another variable" rather than one determined by x. The obvious disadvantage is that in the imputed data x2 is not equal to x^2, so it lacks face validity. But the regression results are a huge improvement:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.2375
                                                Largest FMI       =     0.3033
                                                Complete DF       =        997
DF adjustment:   Small sample                   DF:     min       =      92.98
                                                        avg       =     331.43
                                                        max       =     610.11
Model F test:       Equal FMI                   F(   2,  496.2)   =    1330.20
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9905117   .0324469    30.53   0.000     .9267905    1.054233
          x2 |   1.005361   .0240329    41.83   0.000     .9580613    1.052662
       _cons |  -.0018354    .046395    -0.04   0.969    -.0939669    .0902961
------------------------------------------------------------------------------
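For reference, the "Just Another Variable" setup used here might look like the following sketch:

// create the squared term before imputing and treat it as its own variable
// (x2 is missing wherever x is missing)
gen x2 = x^2

mi set flong
mi register imputed x x2 y

// x and x2 are imputed as if they were separate variables
mi impute chained (regress) x x2 y, add(10)

mi estimate: regress y x x2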
So would imputing x and x2 using PMM be even better on the theory that PMM is better in sticky situations of all sorts? Not this time:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =       1000
                                                Average RVI       =     0.3068
                                                Largest FMI       =     0.3151
                                                Complete DF       =        997
DF adjustment:   Small sample                   DF:     min       =      86.95
                                                        avg       =     201.95
                                                        max       =     312.22
Model F test:       Equal FMI                   F(   2,  163.9)   =    1138.59
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9761798   .0346402    28.18   0.000     .9078863    1.044473
          x2 |   .9987715   .0258422    38.65   0.000      .947407    1.050136
       _cons |  -.0032279   .0416431    -0.08   0.938    -.0851645    .0787087
------------------------------------------------------------------------------
Lessons Learned
Again, the principal lesson is that misspecification in your imputation model can lead to bias in your analysis model. Be very careful, and always check the fit of your imputation models.
This is a continuing research area, but there appears to be some agreement that imputing non-linear terms as passive variables should be avoided. "Just Another Variable" seems to be a good alternative.
Interactions
Interaction terms in your analysis model also create challenges for the imputation model, because the imputation models for the variables being interacted need to take the interaction into account as well.
Note: we'll again use Stata's syntax for interactions. Putting g##c.x in the covariate list of a regression regresses the dependent variable on g, continuous variable x, and the interaction between g and x. The interaction term shows up in the output as g#c.x.
Complete code for this example
Data
Observations: 10,000
Variables:
- g is a binary (1/0) variable with a 50% probability of being 1
- x is drawn from the standard normal distribution
- y is x plus x*g plus a normal error term. Put differently, y is x plus a normal error term for group 0 and 2*x plus a normal error term for group 1.
Missingness: y and x have a 20% probability of being missing (MCAR).
Right Answers: Regressing y on x, g, and the interaction between x and g, the coefficients for x and the interaction term should be 1, and the coefficient on g should be 0.
Procedure
The following are the results of complete cases analysis:
      Source |       SS       df       MS              Number of obs =    6337
-------------+------------------------------           F(  3,  6333) = 5435.30
       Model |  16167.6705     3   5389.2235           Prob > F      =  0.0000
    Residual |  6279.30945  6333  .991522099           R-squared     =  0.7203
-------------+------------------------------           Adj R-squared =  0.7201
       Total |    22446.98  6336   3.5427683           Root MSE      =  .99575

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.g |   -.012915   .0250189    -0.52   0.606    -.0619605    .0361305
           x |   .9842869    .017752    55.45   0.000     .9494869    1.019087
             |
       g#c.x |
          1  |   1.026204   .0249124    41.19   0.000     .9773671    1.075041
             |
       _cons |  -.0195648   .0177789    -1.10   0.271    -.0544175    .0152879
------------------------------------------------------------------------------
The obvious way to impute this data set is to impute x and create the interaction term passively. However, that means x is imputed by regressing it on y and g without any interaction. The result is biased estimates (in opposite directions) of the coefficients for both x and the interaction term:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.5760
                                                Largest FMI       =     0.3048
                                                Complete DF       =       9996
DF adjustment:   Small sample                   DF:     min       =     104.17
                                                        avg       =     130.43
                                                        max       =     182.89
Model F test:       Equal FMI                   F(   3,  205.4)   =    4870.03
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.g |  -.0077291   .0251103    -0.31   0.759    -.0575089    .0420506
           x |   1.119861   .0182704    61.29   0.000     1.083631    1.156091
             |
       g#c.x |
          1  |   .7300249   .0239816    30.44   0.000     .6827087     .777341
             |
       _cons |  -.0148091   .0174857    -0.85   0.399    -.0494078    .0197896
------------------------------------------------------------------------------
An alternative approach is to create a variable for the interaction term, xg, and impute it separately from x (White, Royston and Wood's "Just Another Variable" approach). As in the previous example with a squared term, the obvious disadvantage is that the imputed value of xg will not in fact equal g*x for any given observation. However, the results of the analysis model are much better:
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.4359
                                                Largest FMI       =     0.3753
                                                Complete DF       =       9996
DF adjustment:   Small sample                   DF:     min       =      69.58
                                                        avg       =     111.30
                                                        max       =     210.61
Model F test:       Equal FMI                   F(   3,  223.1)   =    5908.62
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9832953   .0159157    61.78   0.000     .9519209     1.01467
          xg |   1.018901   .0239421    42.56   0.000     .9713602    1.066443
           g |  -.0141319   .0247797    -0.57   0.570    -.0635346    .0352709
       _cons |  -.0087754   .0176281    -0.50   0.620    -.0439374    .0263866
------------------------------------------------------------------------------
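A sketch of the "Just Another Variable" setup for the interaction, using the variable names in the output above:

// interaction term as its own variable; it is missing wherever x is missing
gen xg = x*g

mi set flong
mi register imputed x xg y
mi register regular g

// g is fully observed, so it enters the imputation models only as a predictor
mi impute chained (regress) x xg y = g, add(10)

mi estimate: regress y x xg g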
Another alternative is to impute the two groups separately. This allows x to have the proper relationship with y in both groups. However, it also allows the relationships between the other variables to vary between groups. If you know that some relationships do not vary between groups, you'll lose some precision by not taking advantage of that knowledge, but in the real world it's rare to know such things.
Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =      10000
                                                Average RVI       =     0.3627
                                                Largest FMI       =     0.3482
                                                Complete DF       =       9996
DF adjustment:   Small sample                   DF:     min       =      80.48
                                                        avg       =     218.72
                                                        max       =     369.37
Model F test:       Equal FMI                   F(   3,  456.6)   =    6783.26
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.g |  -.0218404   .0243087    -0.90   0.372    -.0702117    .0265309
           x |   .9869436   .0156451    63.08   0.000     .9561207    1.017766
             |
       g#c.x |
          1  |   1.018534   .0214947    47.39   0.000      .976267    1.060802
             |
       _cons |  -.0076746   .0158825    -0.48   0.629    -.0390027    .0236534
------------------------------------------------------------------------------
Lessons Learned
Once again, the principal lesson is that misspecification in your imputation model can lead to bias in your analysis model. Be very careful, and always check the fit of your imputation models.
This is also an area of ongoing research, but there appears to be some agreement that imputing interaction effects passively is problematic. If your interactions are all a matter of variables having different effects for different groups, imputing each group separately is probably the obvious solution, though "Just Another Variable" also works well. If you have interactions between continuous variables then use "Just Another Variable."
Acknowledgements
Some of these examples follow the discussion in White, Royston, and Wood. ("Multiple imputation using chained equations: Issues and guidance for practice." Statistics in Medicine. 2011.) We highly recommend this article to anyone who is learning to use multiple imputation.
Next: Recommended Readings
Previous: Estimating
Last Revised: 6/21/2012