< Back to main website


 

 

This vignette was prepared as part of the Fellowship Freies Wissen sponsored by Wikimedia Deutschland, the Stifterverband, and the VolkswagenStiftung. The content is under a CC BY 4.0 license. The vignette was writen in R markdown and the original script is available on my GitHub page. Comments and pull requests are welcome.

R markdown documents are a form of literate programming. By weaving natural language and code together in a single text file, literate programming makes research more transparent and reproducible. It is also a nice tool for the development of Open educational resources (OER).


Missing data in repeated cross-national surveys

Missing data is a ubiquitous problem when analyzing cross-national social and political surveys. Data is said to be “sporadically missing” when respondents cannot or decline to answer certain questions (Resche-Rigon and White 2016). With repeated cross-national surveys, another concern is that questionnaires might change over time. In new survey waves, some questions might be introduced; other might be rephrased or discontinued. Data is then “systematically missing” in some country-waves (Resche-Rigon and White 2016). This makes it particularly challenging for scholars to track the evolution of beliefs, attitudes, and behavior within certain populations over time.

Since the 1980s, there has been a rapid increase in the number of international social and political surveys. Large projects such as the European Social Survey (ESS), the International Social Survey Programme (ISSP), the European Values Study (EVS), and the World Values Survey (WVS) have together collected responses from hundreds of thousands of respondents in the last thirty years. Some research teams have now taken on the task of harmonizing this data (see Neundorf and collaborators 2017; Słomczyński and collaborators 2017).

In order to perform longitudinal analyses with repeated cross-national surveys, scholars have typically coped with the problem of systematically missing data by relying on listwise deletion, therefore ignoring all observations with missing data, or by using informed guesses to fill in the missing observations. Both solutions have their disadvantages. Listwise deletion is inefficient as it forces the investigator to discard valuable information. It can also introduce biases if the observations are not “missing completely at random” (MCAR) (Rubin 1987). The “guessing” strategy, whether it is based on theoretical assumptions or empirical estimations (e.g. mean imputation), underestimates the uncertainty of the guesses. Analyzing this data will produce artificially low standard errors (for a good review of the topic, see King et al. 2001).

How can we improve the harmonization of repeated cross-national surveys?

In this vignette, I would like to demonstrate a technique I have been using in my dissertation to handle this type of problem: multiple imputation (MI). I build on a long tradition in survey research showing that MI is usually a more efficient and less biased way to handle missing data than listwise deletion and single imputation methods. MI has been considered an appropritate solution to account for mismatches across survey questionnaires. While MI uses the entire sample (and therefore does not discard valuable data), it also preserves uncertainties of the imputation process (Caren, Ghoshal, and Ribas 2011; King et al. 2001). I present a typical case of systematically missing data in repeated cross-national surveys and illustrate how I would go around it by performing the analysis with MI data in R. The mice package performs multiple imputation using fully conditional specification (FCS) (Buuren and Groothuis-Oudshoorn 2011).

A demonstration with the R mice package and data from the European Values Study

The European Values Study (EVS) is a large multinational survey project examining European citizens’ ideas and behavior with regards to “life, family, work, religion, politics and society” (EVS 2017). So far, four EVS waves have been published (the fifth wave is ongoing): 1) in 1981-1984, 2) in 1990-1993, 3) in 1999-2001, and 4) in 2008-2010.

One problem when analyzing EVS longitudinal data is that the way the education of respondents is measured changed over the course of the study. In the first two waves, the EVS only asked respondents at what age they completed their formal education. In the subsequent waves, it also asked respondents what was their highest level of education. Most social scientists recognize that the level of education is a better indicator of socio-economic status. If we want to use this variable in our analyses, we either have to discard the first two waves or impute the missing values.

Let’s first have a look at a subset of the EVS data. You can download the EVS Longitudinal datafile, 1981-2008 (ZA4804 Data file, Version 3.0.0) from the GESIS website and save it in your working directory. In this example, I use the .dta (Stata) file.

Data preparation

In R, we start by loading the packages necessary for this demonstration (if needed, install them with the function install.packages()).

library(dplyr) # used for data wrangling
library(ggplot2) # used for data visualization
library(haven) # used to import Stata and SPSS files
library(lme4) # performs linear mixed-effects models
library(magrittr) # allows pipe operator
library(mice) # performs MI

After, we load a subset the EVS dataset in the R working environment. The read_dta() function from the haven package allows to import .dta files in R.

#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Set path to EVS data
datapath <- "ZA4804_v3-0-0.dta"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Load a subset of variables
evs <- datapath %>%
  
  # Convert Stata datafile to R dataframe
  read_dta %>%
  
  # Keep only a subset of variables
  select(S002EVS, S003A, S024, A170, X001, X003, X023, X025R) %>%
  
  # Rename variables
  rename(wave = S002EVS, country = S003A, countrywave = S024, 
         stflife = A170, women = X001, age = X003, 
         eduage = X023, edulvl = X025R)

Suppose we would like to estimate the effect of the level of education on life satisfaction. We could select a few variables to work with:

  • wave: the EVS wave (1 to 4)
  • country: a country id
  • countrywave: a country-wave id
  • stflife: the question asks “how satisfied are you with your life as a whole these days?” and answers go from (1) “Dissatisfied” to (10) “Satisfied”
  • women: gender
  • age: age
  • eduage: age at which respondents completed their formal education
  • edulvl: highest level of education attained by respondents; three levels: 1) low (primary), 2) middle (secondary), and 3) high (tertiary).

Note that you could use other variables if you would like to. For a list of available variables in the EVS Longitudinal Data Files, you can have a look here.

I reduce the sample to respondents of countries which were part of all four EVS waves.

evs <- evs %>% 
  
  # Group by country and wave
  group_by(country, wave) %>%
  
  # Aggregate and tag each wave with value "1" in new variable 'tag_wave'
  summarize(tag_wave = 1) %>% 
  
  # Group again; now, only by country
  group_by(country) %>%
  
  # Aggregate, 
  # sum up the tags to count the number of waves per country, and
  # create var 'keep' taking value T if sum of waves per country is 4
  summarize(keep = sum(tag_wave) == 4) %>%
  
  # Merge the aggregate dataframe with the original individual-level data
  merge(evs) %>%
  
  # Filter countries with 4 waves
  filter(keep == T) %>%
  
  # Drop variable 'keep'
  select(-one_of("keep"))

Finally, We can also change the class of a few variables, from labelled to factor. Variables were automatically classified as labelled when we imported the .dta file, however this format is not supported well by some R functions.

evs <- evs %>% 
  mutate(country = droplevels(as_factor(country)),
         wave = droplevels(as_factor(wave)),
         countrywave = droplevels(as_factor(countrywave)),
         stflife = droplevels(as_factor(stflife)),
         women = droplevels(as_factor(women)),
         edulvl = droplevels(as_factor(edulvl)))

We now have 64792 respondents from 13 countries (including West Germany and Northern Ireland as “countries”).

table(evs$country, evs$wave)
##                   
##                    1981-1984 1990-1993 1999-2001 2008-2010
##   Belgium               1145      2792      1912      1509
##   Denmark               1182      1030      1023      1507
##   France                1200      1002      1615      1501
##   Iceland                927       702       968       808
##   Ireland               1217      1000      1012      1013
##   Italy                 1348      2018      2000      1519
##   Malta                  467       393      1002      1500
##   Netherlands           1221      1017      1003      1554
##   Spain                 2303      2637      1200      1500
##   Sweden                 954      1047      1015      1187
##   Great Britain         1167      1484      1000      1561
##   West Germany          1305      2101      1037      1071
##   Northern Ireland       312       304      1000       500

Missing patterns

We can start investigating patterns of missingness in the data using the function md.pattern() from the mice package.

misstable <- md.pattern(evs)
# exclude columns for country, wave, countrywave: zero missing.
misstable[, 4:ncol(misstable)]
##       women age stflife eduage edulvl      
## 30643     1   1       1      1      1     0
##   148     1   1       0      1      1     1
##     5     0   1       1      1      1     1
##   122     1   0       1      1      1     1
##  1058     1   1       1      0      1     1
## 31638     1   1       1      1      0     1
##     5     0   0       1      1      1     2
##    19     1   1       0      0      1     2
##    30     1   0       1      0      1     2
##   267     1   1       0      1      0     2
##    21     0   1       1      1      0     2
##   118     1   0       1      1      0     2
##   659     1   1       1      0      0     2
##     1     1   0       0      0      1     3
##     1     0   0       1      0      1     3
##     3     1   0       0      1      0     3
##     2     0   0       1      1      0     3
##    20     1   1       0      0      0     3
##    23     1   0       1      0      0     3
##     6     1   0       0      0      0     4
##     2     0   0       1      0      0     4
##     1     0   0       0      0      0     5
##          37 314     465   1820  32760 35396

The table above combines both sporadically and systematically missing data. We can see that 30643 observations have no missing values. Out of all 35396 missing values, 32760 values are missing for the variable edulvl. 31638 observations have a missing value only for this variable.

The code below produces a new aggregate dataframe identifying which questions are entirely missing in certain country-waves. I define a small function, allmiss(), which counts the number of non-missing values within a country-wave. If the sum of non-missing values is 0 (meaning, all values are missing), the function returns TRUE.

#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Function: determine if all observations are missing in cluster

allmiss <- function(var) {
  r <- (sum(!is.na(var)) == 0)
  return(r)
}

#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create summary dataset showing country-waves where variables are 
# entirely missing

mdsummary <- evs %>%
  group_by(countrywave) %>%
  summarize(
    miss_stflife = allmiss(stflife), 
    miss_women = allmiss(women), 
    miss_age = allmiss(age), 
    miss_edulvl = allmiss(edulvl), 
    miss_eduage = allmiss(eduage)
    )

summary(mdsummary[, 2:ncol(mdsummary)]) 
##  miss_stflife    miss_women       miss_age       miss_edulvl    
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:52        FALSE:52        FALSE:52        FALSE:26       
##                                                  TRUE :26       
##  miss_eduage    
##  Mode :logical  
##  FALSE:52       
## 

The variable edulvl is the only one with systematically missing values. In half of the 52 country-waves, respondents were not asked about their highest level of education. If we were to use listwise deletion, thousands of obervations would be lost.

Fortunately, in all the country-waves, respondents were asked about the age at which they completed their formal education. eduage probably gives us a good grasp of the educational attainment of respondents. Let’s examine this graphically.

g <- ggplot(data=subset(evs, !is.na(edulvl) & !is.na(eduage)), 
            aes(x = as.numeric(eduage), fill = edulvl))
g <- g + geom_histogram(position="identity", alpha = 0.5, binwidth = 2)
g <- g + labs(x = "Age, completed formal education", y = "Number of observations")
g <- g + scale_fill_discrete(name = "Level of education")
g

What do we see? Respondents who completed their formal education at an older age have a higher level of education in general. However, there is substantial overlap between the three categories. If we were to fill in the missing values of edulvl on the basis of eduage in a single imputation, we would underestimate the uncertainty of our estimates. With multiple imputation, we can fill in the missing data while preserving the uncertainty of the imputations.

Generating the MI datasets

Results from analyses with MI data are said to be unbiased if data is “Missing at random” (MAR), that is, if unobserved variables are uncorrelated with the missing mechanism. The example introduced here is a plausible MAR case.

Analyzing multiply imputed data involves three steps:

  1. Generate m new datasets with imputed data. Each dataset contains the same initial, non-missing values, but different imputed values that vary across the mi datasets to reflect the uncertainty of our predictions.
  2. Fit separate statistical models on each mi dataset.
  3. Pool the results following Rubin’s rules (Rubin 1987; King et al. 2001)

mice is a powerful and well-documented R package to perform these operations. It uses chained equations to impute separately each variable with missing values. It works well with continuous, binary, unordered categorical, and ordered categorical data. An excellent resource to learn how to use the package are the vignettes prepared by Gerko Vink and Stef van Buuren in mice’s official documentation.

Let’s explore how mice would impute the EVS data. To see the default settings without actually imputating anything, we simply have to run the mice() function with the number of iterations (maxit) set to zero. The meth argument indicates which univariate imputation method will be used for each variable with missing values.

ini <- mice(evs, maxit = 0)
meth <- ini$meth
meth 
##     country        wave countrywave     stflife       women         age 
##          ""          ""          ""   "polyreg"    "logreg"       "pmm" 
##      eduage      edulvl 
##       "pmm"   "polyreg"

By default, mice() performs predictive mean matching (pmm) for numeric variables, logistic regression imputation (logreg) for boolean variables and factors with two levels, and polytomous regression imputation (polyreg) for unordered factors with more than two levels. We can keep the default settings except for stflife and edulvl since these variables are ordered. We can estimate proportional odds models (polr) instead.

meth[c("stflife", "edulvl")] <- "polr"
meth
##     country        wave countrywave     stflife       women         age 
##          ""          ""          ""      "polr"    "logreg"       "pmm" 
##      eduage      edulvl 
##       "pmm"      "polr"

We can now have a look at the predictors for each equation.

pred <- ini$pred
pred
##             country wave countrywave stflife women age eduage edulvl
## country           0    0           0       0     0   0      0      0
## wave              0    0           0       0     0   0      0      0
## countrywave       0    0           0       0     0   0      0      0
## stflife           1    1           1       0     1   1      1      1
## women             1    1           1       1     0   1      1      1
## age               1    1           1       1     1   0      1      1
## eduage            1    1           1       1     1   1      0      1
## edulvl            1    1           1       1     1   1      1      0

To fill in the missing values in each incomplete variable, mice() will use all remaining variables in the model. This is usually a good strategy since it improves our leverage to predict the missing data. We can however exclude wave since we already control for country and country-wave dummies; it does not add substantive information to the predictions. Note that, in the case of systematically-missing data, country-wave dummies for which data is entirely missing will not contribute to the model.

pred[, c("wave")] <- 0
pred
##             country wave countrywave stflife women age eduage edulvl
## country           0    0           0       0     0   0      0      0
## wave              0    0           0       0     0   0      0      0
## countrywave       0    0           0       0     0   0      0      0
## stflife           1    0           1       0     1   1      1      1
## women             1    0           1       1     0   1      1      1
## age               1    0           1       1     1   0      1      1
## eduage            1    0           1       1     1   1      0      1
## edulvl            1    0           1       1     1   1      1      0

We are now ready to generate m MI datasets. A higher number of imputed datasets would make statistical estimates more stable. However, since our number of observations is high, let’s limit m to a reasonable number, 5 (the default). We can also set a seed to make the results reproducible.

Warning: this will take several minutes to compute given the number of observations.

imp <- mice(evs, m = 5, meth = meth, pred = pred, seed = 123, print = TRUE)
## 
##  iter imp variable
##   1   1  stflife  women  age  eduage  edulvl
##   1   2  stflife  women  age  eduage  edulvl
##   1   3  stflife  women  age  eduage  edulvl
##   1   4  stflife  women  age  eduage  edulvl
##   1   5  stflife  women  age  eduage  edulvl
##   2   1  stflife  women  age  eduage  edulvl
##   2   2  stflife  women  age  eduage  edulvl
##   2   3  stflife  women  age  eduage  edulvl
##   2   4  stflife  women  age  eduage  edulvl
##   2   5  stflife  women  age  eduage  edulvl
##   3   1  stflife  women  age  eduage  edulvl
##   3   2  stflife  women  age  eduage  edulvl
##   3   3  stflife  women  age  eduage  edulvl
##   3   4  stflife  women  age  eduage  edulvl
##   3   5  stflife  women  age  eduage  edulvl
##   4   1  stflife  women  age  eduage  edulvl
##   4   2  stflife  women  age  eduage  edulvl
##   4   3  stflife  women  age  eduage  edulvl
##   4   4  stflife  women  age  eduage  edulvl
##   4   5  stflife  women  age  eduage  edulvl
##   5   1  stflife  women  age  eduage  edulvl
##   5   2  stflife  women  age  eduage  edulvl
##   5   3  stflife  women  age  eduage  edulvl
##   5   4  stflife  women  age  eduage  edulvl
##   5   5  stflife  women  age  eduage  edulvl

Fitting separate models and pooling the results

I complete this vignette by performing a multilevel linear model with random effects to predict life satisfication as a function of gender, age, and the level of education. Our data has a hierchical structure: individual-level observations are nested within country-waves, which are themselves clustered within countries. We will therefore fit a three-level model. For simplicity, we assume that stflife is a continuous variable with a normal distribution (the 10 point scale is not too far from that). Note also that 13 countries is a rather low number of clusters to perform a multilevel model; keep in mind that this might introduce some bias.

To perform the multilevel analysis, we will use the lmer() function in the lme4 package (for more information, see Bates et al. 2015). We fit a three-level linear model with random intercepts for countries and country-waves within countries using the MI data stored in the object imp.

fit <- with(imp, lmer(as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)))
fit
## call :
## with.mids(data = imp, expr = lmer(as.numeric(stflife) ~ women + 
##     age + edulvl + (1 | country/countrywave)))
## 
## call1 :
## mice(data = evs, m = 5, method = meth, predictorMatrix = pred, 
##     printFlag = TRUE, seed = 123)
## 
## nmis :
##     country        wave countrywave     stflife       women         age 
##           0           0           0         465          37         314 
##      eduage      edulvl 
##        1820       32760 
## 
## analyses :
## [[1]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266489.1
## Random effects:
##  Groups              Name        Std.Dev.
##  countrywave:country (Intercept) 0.1744  
##  country             (Intercept) 0.4242  
##  Residual                        1.8890  
## Number of obs: 64792, groups:  countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept)       women2          age      edulvl2      edulvl3  
##    7.453507     0.005093     0.000610     0.176788     0.311105  
## 
## [[2]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266500.1
## Random effects:
##  Groups              Name        Std.Dev.
##  countrywave:country (Intercept) 0.1859  
##  country             (Intercept) 0.4396  
##  Residual                        1.8891  
## Number of obs: 64792, groups:  countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept)       women2          age      edulvl2      edulvl3  
##   7.4660265    0.0040199    0.0004246    0.1759926    0.3198862  
## 
## [[3]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266503.2
## Random effects:
##  Groups              Name        Std.Dev.
##  countrywave:country (Intercept) 0.1817  
##  country             (Intercept) 0.4362  
##  Residual                        1.8892  
## Number of obs: 64792, groups:  countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept)       women2          age      edulvl2      edulvl3  
##   7.4565711    0.0053359    0.0004732    0.1787183    0.3371287  
## 
## [[4]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266533.8
## Random effects:
##  Groups              Name        Std.Dev.
##  countrywave:country (Intercept) 0.1784  
##  country             (Intercept) 0.4130  
##  Residual                        1.8897  
## Number of obs: 64792, groups:  countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept)       women2          age      edulvl2      edulvl3  
##   7.4592063    0.0020684    0.0004676    0.1702738    0.3139257  
## 
## [[5]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266572.1
## Random effects:
##  Groups              Name        Std.Dev.
##  countrywave:country (Intercept) 0.1742  
##  country             (Intercept) 0.4205  
##  Residual                        1.8902  
## Number of obs: 64792, groups:  countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept)       women2          age      edulvl2      edulvl3  
##   7.4692942    0.0052037    0.0003558    0.1578850    0.3107344

Finally, we pool the results. Note that, for the moment, mice can only pool the fixed part of the results.

pool.fit <- pool(fit)
summary(pool.fit)[, 1:7] # print result and keep relevant columns
##                      est           se          t          df     Pr(>|t|)
## (Intercept) 7.4609209660 0.1243543208 59.9972797 54630.87222 0.000000e+00
## women2      0.0043441463 0.0149909446  0.2897847 24329.15687 7.719834e-01
## age         0.0004662373 0.0004617754  1.0096625  1638.55013 3.128061e-01
## edulvl2     0.1719316541 0.0212411212  8.0942834   110.22726 8.357759e-13
## edulvl3     0.3185559109 0.0238001066 13.3846422    60.58451 0.000000e+00
##                     lo 95      hi 95
## (Intercept)  7.2171855760 7.70465636
## women2      -0.0250390269 0.03372732
## age         -0.0004394949 0.00137197
## edulvl2      0.1298377028 0.21402561
## edulvl3      0.2709580546 0.36615377

This procedure has allowed us to analyze the entire dataset rather than just the last two EVS waves. The analysis now covers the period from 1981 to 2010 instead of 1999 to 2010. The results show that, all other things being equal, people with a high level of education are more satisfied with their lives by a factor of 0.32 on the 10 point scale in comparison with people with a low level of education.

Final remarks

Multiple imputation constitutes a great tool to harmonize repeated cross-national surveys. In comparison with listwise deletion, it preserves valuable information and does not introduce bias if data is missing at random. There are however a few points to keep in mind:

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.

Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R” 45 (3): 1–67. doi:10.18637/jss.v045.i03.

Caren, Neal, Raj Andrew Ghoshal, and Vanesa Ribas. 2011. “A Social Movement Generation: Cohort and Period Trends in Protest Attendance and Petition Signing.” American Sociological Review 76 (1): 125–51.

Grund, Simon, Oliver Lüdtke, and Alexander Robitzsch. 2018. “Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations.” Organizational Research Methods 21 (1): 111–49. doi:10.1177/1094428117703686.

King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95 (01): 49–69. http://journals.cambridge.org/article_S0003055401000235.

Neundorf, Anja, and collaborators. 2017. “Global Citizen Politics.” https://globalcitizenpolitics.net.

Resche-Rigon, Matthieu, and Ian R White. 2016. “Multiple Imputation by Chained Equations for Systematically and Sporadically Missing Multilevel Data.” Statistical Methods in Medical Research, no. Online first: 1–16. doi:10.1177/0962280216666564.

Rubin, Donald B. 1987. Multiple Imputation for Survey Nonresponse. New York: Wiley.

Słomczyński, Kazimierz, and collaborators. 2017. “Survey Data Recycling.” https://dataharmonization.org/.