This vignette was prepared as part of the Fellowship Freies Wissen sponsored by Wikimedia Deutschland, the Stifterverband, and the VolkswagenStiftung. The content is under a CC BY 4.0 license. The vignette was writen in R markdown and the original script is available on my GitHub page. Comments and pull requests are welcome.
R markdown documents are a form of literate programming. By weaving natural language and code together in a single text file, literate programming makes research more transparent and reproducible. It is also a nice tool for the development of Open educational resources (OER).
Missing data is a ubiquitous problem when analyzing cross-national social and political surveys. Data is said to be “sporadically missing” when respondents cannot or decline to answer certain questions (Resche-Rigon and White 2016). With repeated cross-national surveys, another concern is that questionnaires might change over time. In new survey waves, some questions might be introduced; other might be rephrased or discontinued. Data is then “systematically missing” in some country-waves (Resche-Rigon and White 2016). This makes it particularly challenging for scholars to track the evolution of beliefs, attitudes, and behavior within certain populations over time.
Since the 1980s, there has been a rapid increase in the number of international social and political surveys. Large projects such as the European Social Survey (ESS), the International Social Survey Programme (ISSP), the European Values Study (EVS), and the World Values Survey (WVS) have together collected responses from hundreds of thousands of respondents in the last thirty years. Some research teams have now taken on the task of harmonizing this data (see Neundorf and collaborators 2017; Słomczyński and collaborators 2017).
In order to perform longitudinal analyses with repeated cross-national surveys, scholars have typically coped with the problem of systematically missing data by relying on listwise deletion, therefore ignoring all observations with missing data, or by using informed guesses to fill in the missing observations. Both solutions have their disadvantages. Listwise deletion is inefficient as it forces the investigator to discard valuable information. It can also introduce biases if the observations are not “missing completely at random” (MCAR) (Rubin 1987). The “guessing” strategy, whether it is based on theoretical assumptions or empirical estimations (e.g. mean imputation), underestimates the uncertainty of the guesses. Analyzing this data will produce artificially low standard errors (for a good review of the topic, see King et al. 2001).
How can we improve the harmonization of repeated cross-national surveys?
In this vignette, I would like to demonstrate a technique I have been using in my dissertation to handle this type of problem: multiple imputation (MI). I build on a long tradition in survey research showing that MI is usually a more efficient and less biased way to handle missing data than listwise deletion and single imputation methods. MI has been considered an appropritate solution to account for mismatches across survey questionnaires. While MI uses the entire sample (and therefore does not discard valuable data), it also preserves uncertainties of the imputation process (Caren, Ghoshal, and Ribas 2011; King et al. 2001). I present a typical case of systematically missing data in repeated cross-national surveys and illustrate how I would go around it by performing the analysis with MI data in R. The mice
package performs multiple imputation using fully conditional specification (FCS) (Buuren and Groothuis-Oudshoorn 2011).
The European Values Study (EVS) is a large multinational survey project examining European citizens’ ideas and behavior with regards to “life, family, work, religion, politics and society” (EVS 2017). So far, four EVS waves have been published (the fifth wave is ongoing): 1) in 1981-1984, 2) in 1990-1993, 3) in 1999-2001, and 4) in 2008-2010.
One problem when analyzing EVS longitudinal data is that the way the education of respondents is measured changed over the course of the study. In the first two waves, the EVS only asked respondents at what age they completed their formal education. In the subsequent waves, it also asked respondents what was their highest level of education. Most social scientists recognize that the level of education is a better indicator of socio-economic status. If we want to use this variable in our analyses, we either have to discard the first two waves or impute the missing values.
Let’s first have a look at a subset of the EVS data. You can download the EVS Longitudinal datafile, 1981-2008 (ZA4804 Data file, Version 3.0.0) from the GESIS website and save it in your working directory. In this example, I use the .dta (Stata) file.
In R, we start by loading the packages necessary for this demonstration (if needed, install them with the function install.packages()
).
library(dplyr) # used for data wrangling
library(ggplot2) # used for data visualization
library(haven) # used to import Stata and SPSS files
library(lme4) # performs linear mixed-effects models
library(magrittr) # allows pipe operator
library(mice) # performs MI
After, we load a subset the EVS dataset in the R working environment. The read_dta()
function from the haven
package allows to import .dta files in R.
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Set path to EVS data
datapath <- "ZA4804_v3-0-0.dta"
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Load a subset of variables
evs <- datapath %>%
# Convert Stata datafile to R dataframe
read_dta %>%
# Keep only a subset of variables
select(S002EVS, S003A, S024, A170, X001, X003, X023, X025R) %>%
# Rename variables
rename(wave = S002EVS, country = S003A, countrywave = S024,
stflife = A170, women = X001, age = X003,
eduage = X023, edulvl = X025R)
Suppose we would like to estimate the effect of the level of education on life satisfaction. We could select a few variables to work with:
wave
: the EVS wave (1 to 4)country
: a country idcountrywave
: a country-wave idstflife
: the question asks “how satisfied are you with your life as a whole these days?” and answers go from (1) “Dissatisfied” to (10) “Satisfied”women
: genderage
: ageeduage
: age at which respondents completed their formal educationedulvl
: highest level of education attained by respondents; three levels: 1) low (primary), 2) middle (secondary), and 3) high (tertiary).Note that you could use other variables if you would like to. For a list of available variables in the EVS Longitudinal Data Files, you can have a look here.
I reduce the sample to respondents of countries which were part of all four EVS waves.
evs <- evs %>%
# Group by country and wave
group_by(country, wave) %>%
# Aggregate and tag each wave with value "1" in new variable 'tag_wave'
summarize(tag_wave = 1) %>%
# Group again; now, only by country
group_by(country) %>%
# Aggregate,
# sum up the tags to count the number of waves per country, and
# create var 'keep' taking value T if sum of waves per country is 4
summarize(keep = sum(tag_wave) == 4) %>%
# Merge the aggregate dataframe with the original individual-level data
merge(evs) %>%
# Filter countries with 4 waves
filter(keep == T) %>%
# Drop variable 'keep'
select(-one_of("keep"))
Finally, We can also change the class of a few variables, from labelled
to factor
. Variables were automatically classified as labelled
when we imported the .dta file, however this format is not supported well by some R functions.
evs <- evs %>%
mutate(country = droplevels(as_factor(country)),
wave = droplevels(as_factor(wave)),
countrywave = droplevels(as_factor(countrywave)),
stflife = droplevels(as_factor(stflife)),
women = droplevels(as_factor(women)),
edulvl = droplevels(as_factor(edulvl)))
We now have 64792 respondents from 13 countries (including West Germany and Northern Ireland as “countries”).
table(evs$country, evs$wave)
##
## 1981-1984 1990-1993 1999-2001 2008-2010
## Belgium 1145 2792 1912 1509
## Denmark 1182 1030 1023 1507
## France 1200 1002 1615 1501
## Iceland 927 702 968 808
## Ireland 1217 1000 1012 1013
## Italy 1348 2018 2000 1519
## Malta 467 393 1002 1500
## Netherlands 1221 1017 1003 1554
## Spain 2303 2637 1200 1500
## Sweden 954 1047 1015 1187
## Great Britain 1167 1484 1000 1561
## West Germany 1305 2101 1037 1071
## Northern Ireland 312 304 1000 500
We can start investigating patterns of missingness in the data using the function md.pattern()
from the mice
package.
misstable <- md.pattern(evs)
# exclude columns for country, wave, countrywave: zero missing.
misstable[, 4:ncol(misstable)]
## women age stflife eduage edulvl
## 30643 1 1 1 1 1 0
## 148 1 1 0 1 1 1
## 5 0 1 1 1 1 1
## 122 1 0 1 1 1 1
## 1058 1 1 1 0 1 1
## 31638 1 1 1 1 0 1
## 5 0 0 1 1 1 2
## 19 1 1 0 0 1 2
## 30 1 0 1 0 1 2
## 267 1 1 0 1 0 2
## 21 0 1 1 1 0 2
## 118 1 0 1 1 0 2
## 659 1 1 1 0 0 2
## 1 1 0 0 0 1 3
## 1 0 0 1 0 1 3
## 3 1 0 0 1 0 3
## 2 0 0 1 1 0 3
## 20 1 1 0 0 0 3
## 23 1 0 1 0 0 3
## 6 1 0 0 0 0 4
## 2 0 0 1 0 0 4
## 1 0 0 0 0 0 5
## 37 314 465 1820 32760 35396
The table above combines both sporadically and systematically missing data. We can see that 30643 observations have no missing values. Out of all 35396 missing values, 32760 values are missing for the variable edulvl
. 31638 observations have a missing value only for this variable.
The code below produces a new aggregate dataframe identifying which questions are entirely missing in certain country-waves. I define a small function, allmiss()
, which counts the number of non-missing values within a country-wave. If the sum of non-missing values is 0 (meaning, all values are missing), the function returns TRUE
.
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Function: determine if all observations are missing in cluster
allmiss <- function(var) {
r <- (sum(!is.na(var)) == 0)
return(r)
}
#_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# Create summary dataset showing country-waves where variables are
# entirely missing
mdsummary <- evs %>%
group_by(countrywave) %>%
summarize(
miss_stflife = allmiss(stflife),
miss_women = allmiss(women),
miss_age = allmiss(age),
miss_edulvl = allmiss(edulvl),
miss_eduage = allmiss(eduage)
)
summary(mdsummary[, 2:ncol(mdsummary)])
## miss_stflife miss_women miss_age miss_edulvl
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:52 FALSE:52 FALSE:52 FALSE:26
## TRUE :26
## miss_eduage
## Mode :logical
## FALSE:52
##
The variable edulvl
is the only one with systematically missing values. In half of the 52 country-waves, respondents were not asked about their highest level of education. If we were to use listwise deletion, thousands of obervations would be lost.
Fortunately, in all the country-waves, respondents were asked about the age at which they completed their formal education. eduage
probably gives us a good grasp of the educational attainment of respondents. Let’s examine this graphically.
g <- ggplot(data=subset(evs, !is.na(edulvl) & !is.na(eduage)),
aes(x = as.numeric(eduage), fill = edulvl))
g <- g + geom_histogram(position="identity", alpha = 0.5, binwidth = 2)
g <- g + labs(x = "Age, completed formal education", y = "Number of observations")
g <- g + scale_fill_discrete(name = "Level of education")
g
What do we see? Respondents who completed their formal education at an older age have a higher level of education in general. However, there is substantial overlap between the three categories. If we were to fill in the missing values of edulvl
on the basis of eduage
in a single imputation, we would underestimate the uncertainty of our estimates. With multiple imputation, we can fill in the missing data while preserving the uncertainty of the imputations.
Results from analyses with MI data are said to be unbiased if data is “Missing at random” (MAR), that is, if unobserved variables are uncorrelated with the missing mechanism. The example introduced here is a plausible MAR case.
Analyzing multiply imputed data involves three steps:
mice
is a powerful and well-documented R package to perform these operations. It uses chained equations to impute separately each variable with missing values. It works well with continuous, binary, unordered categorical, and ordered categorical data. An excellent resource to learn how to use the package are the vignettes prepared by Gerko Vink and Stef van Buuren in mice
’s official documentation.
Let’s explore how mice
would impute the EVS data. To see the default settings without actually imputating anything, we simply have to run the mice()
function with the number of iterations (maxit
) set to zero. The meth
argument indicates which univariate imputation method will be used for each variable with missing values.
ini <- mice(evs, maxit = 0)
meth <- ini$meth
meth
## country wave countrywave stflife women age
## "" "" "" "polyreg" "logreg" "pmm"
## eduage edulvl
## "pmm" "polyreg"
By default, mice()
performs predictive mean matching (pmm
) for numeric variables, logistic regression imputation (logreg
) for boolean variables and factors with two levels, and polytomous regression imputation (polyreg
) for unordered factors with more than two levels. We can keep the default settings except for stflife
and edulvl
since these variables are ordered. We can estimate proportional odds models (polr
) instead.
meth[c("stflife", "edulvl")] <- "polr"
meth
## country wave countrywave stflife women age
## "" "" "" "polr" "logreg" "pmm"
## eduage edulvl
## "pmm" "polr"
We can now have a look at the predictors for each equation.
pred <- ini$pred
pred
## country wave countrywave stflife women age eduage edulvl
## country 0 0 0 0 0 0 0 0
## wave 0 0 0 0 0 0 0 0
## countrywave 0 0 0 0 0 0 0 0
## stflife 1 1 1 0 1 1 1 1
## women 1 1 1 1 0 1 1 1
## age 1 1 1 1 1 0 1 1
## eduage 1 1 1 1 1 1 0 1
## edulvl 1 1 1 1 1 1 1 0
To fill in the missing values in each incomplete variable, mice()
will use all remaining variables in the model. This is usually a good strategy since it improves our leverage to predict the missing data. We can however exclude wave
since we already control for country and country-wave dummies; it does not add substantive information to the predictions. Note that, in the case of systematically-missing data, country-wave dummies for which data is entirely missing will not contribute to the model.
pred[, c("wave")] <- 0
pred
## country wave countrywave stflife women age eduage edulvl
## country 0 0 0 0 0 0 0 0
## wave 0 0 0 0 0 0 0 0
## countrywave 0 0 0 0 0 0 0 0
## stflife 1 0 1 0 1 1 1 1
## women 1 0 1 1 0 1 1 1
## age 1 0 1 1 1 0 1 1
## eduage 1 0 1 1 1 1 0 1
## edulvl 1 0 1 1 1 1 1 0
We are now ready to generate m MI datasets. A higher number of imputed datasets would make statistical estimates more stable. However, since our number of observations is high, let’s limit m to a reasonable number, 5 (the default). We can also set a seed
to make the results reproducible.
Warning: this will take several minutes to compute given the number of observations.
imp <- mice(evs, m = 5, meth = meth, pred = pred, seed = 123, print = TRUE)
##
## iter imp variable
## 1 1 stflife women age eduage edulvl
## 1 2 stflife women age eduage edulvl
## 1 3 stflife women age eduage edulvl
## 1 4 stflife women age eduage edulvl
## 1 5 stflife women age eduage edulvl
## 2 1 stflife women age eduage edulvl
## 2 2 stflife women age eduage edulvl
## 2 3 stflife women age eduage edulvl
## 2 4 stflife women age eduage edulvl
## 2 5 stflife women age eduage edulvl
## 3 1 stflife women age eduage edulvl
## 3 2 stflife women age eduage edulvl
## 3 3 stflife women age eduage edulvl
## 3 4 stflife women age eduage edulvl
## 3 5 stflife women age eduage edulvl
## 4 1 stflife women age eduage edulvl
## 4 2 stflife women age eduage edulvl
## 4 3 stflife women age eduage edulvl
## 4 4 stflife women age eduage edulvl
## 4 5 stflife women age eduage edulvl
## 5 1 stflife women age eduage edulvl
## 5 2 stflife women age eduage edulvl
## 5 3 stflife women age eduage edulvl
## 5 4 stflife women age eduage edulvl
## 5 5 stflife women age eduage edulvl
I complete this vignette by performing a multilevel linear model with random effects to predict life satisfication as a function of gender, age, and the level of education. Our data has a hierchical structure: individual-level observations are nested within country-waves, which are themselves clustered within countries. We will therefore fit a three-level model. For simplicity, we assume that stflife
is a continuous variable with a normal distribution (the 10 point scale is not too far from that). Note also that 13 countries is a rather low number of clusters to perform a multilevel model; keep in mind that this might introduce some bias.
To perform the multilevel analysis, we will use the lmer()
function in the lme4
package (for more information, see Bates et al. 2015). We fit a three-level linear model with random intercepts for countries and country-waves within countries using the MI data stored in the object imp
.
fit <- with(imp, lmer(as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)))
fit
## call :
## with.mids(data = imp, expr = lmer(as.numeric(stflife) ~ women +
## age + edulvl + (1 | country/countrywave)))
##
## call1 :
## mice(data = evs, m = 5, method = meth, predictorMatrix = pred,
## printFlag = TRUE, seed = 123)
##
## nmis :
## country wave countrywave stflife women age
## 0 0 0 465 37 314
## eduage edulvl
## 1820 32760
##
## analyses :
## [[1]]
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266489.1
## Random effects:
## Groups Name Std.Dev.
## countrywave:country (Intercept) 0.1744
## country (Intercept) 0.4242
## Residual 1.8890
## Number of obs: 64792, groups: countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept) women2 age edulvl2 edulvl3
## 7.453507 0.005093 0.000610 0.176788 0.311105
##
## [[2]]
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266500.1
## Random effects:
## Groups Name Std.Dev.
## countrywave:country (Intercept) 0.1859
## country (Intercept) 0.4396
## Residual 1.8891
## Number of obs: 64792, groups: countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept) women2 age edulvl2 edulvl3
## 7.4660265 0.0040199 0.0004246 0.1759926 0.3198862
##
## [[3]]
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266503.2
## Random effects:
## Groups Name Std.Dev.
## countrywave:country (Intercept) 0.1817
## country (Intercept) 0.4362
## Residual 1.8892
## Number of obs: 64792, groups: countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept) women2 age edulvl2 edulvl3
## 7.4565711 0.0053359 0.0004732 0.1787183 0.3371287
##
## [[4]]
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266533.8
## Random effects:
## Groups Name Std.Dev.
## countrywave:country (Intercept) 0.1784
## country (Intercept) 0.4130
## Residual 1.8897
## Number of obs: 64792, groups: countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept) women2 age edulvl2 edulvl3
## 7.4592063 0.0020684 0.0004676 0.1702738 0.3139257
##
## [[5]]
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## as.numeric(stflife) ~ women + age + edulvl + (1 | country/countrywave)
## REML criterion at convergence: 266572.1
## Random effects:
## Groups Name Std.Dev.
## countrywave:country (Intercept) 0.1742
## country (Intercept) 0.4205
## Residual 1.8902
## Number of obs: 64792, groups: countrywave:country, 52; country, 13
## Fixed Effects:
## (Intercept) women2 age edulvl2 edulvl3
## 7.4692942 0.0052037 0.0003558 0.1578850 0.3107344
Finally, we pool the results. Note that, for the moment, mice
can only pool the fixed part of the results.
pool.fit <- pool(fit)
summary(pool.fit)[, 1:7] # print result and keep relevant columns
## est se t df Pr(>|t|)
## (Intercept) 7.4609209660 0.1243543208 59.9972797 54630.87222 0.000000e+00
## women2 0.0043441463 0.0149909446 0.2897847 24329.15687 7.719834e-01
## age 0.0004662373 0.0004617754 1.0096625 1638.55013 3.128061e-01
## edulvl2 0.1719316541 0.0212411212 8.0942834 110.22726 8.357759e-13
## edulvl3 0.3185559109 0.0238001066 13.3846422 60.58451 0.000000e+00
## lo 95 hi 95
## (Intercept) 7.2171855760 7.70465636
## women2 -0.0250390269 0.03372732
## age -0.0004394949 0.00137197
## edulvl2 0.1298377028 0.21402561
## edulvl3 0.2709580546 0.36615377
This procedure has allowed us to analyze the entire dataset rather than just the last two EVS waves. The analysis now covers the period from 1981 to 2010 instead of 1999 to 2010. The results show that, all other things being equal, people with a high level of education are more satisfied with their lives by a factor of 0.32 on the 10 point scale in comparison with people with a low level of education.
Multiple imputation constitutes a great tool to harmonize repeated cross-national surveys. In comparison with listwise deletion, it preserves valuable information and does not introduce bias if data is missing at random. There are however a few points to keep in mind:
The technique I introduced, with country and country-wave dummies included in the imputation models, has been labelled stratified multiple imputation (or fixed-effect multiple imputation). It can be biased if the relationship among variables varies significantly across countries or country-waves (in most cases, it is still a clear improvement in comparison with listwise deletion). Multiple imputation of clustered data is a field in active development. The objective of this vignette was not to present the bleeding edge of research, but rather to introduce students and researchers to an efficient and relatively simple method to harmonize cross-national surveys.
If you are interested in more advanced techniques, such as multilevel imputation with random effect speficications, I list additional references (Grund, Lüdtke, and Robitzsch 2018; Resche-Rigon and White 2016). Note that these techniques are usually much more computationally demanding, are subject to convergence issues, and might not support categorical variables. miceadds
, an extension of the mice
package, and the mitml
package are good solutions if you plan to move forward with multilevel multiple imputation.
I want to conclude by pointing out that another approach to the problem of systematically-missing data in cross-national surveys could be to fit separate imputation models on each country. This way, the multiple imputation procedure would take into account different relationships between variables within the clusters. This can be done by specifying multiple interaction effects of the form countryA * variable1 + countryB * variable1 + ...
.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R” 45 (3): 1–67. doi:10.18637/jss.v045.i03.
EVS. 2017. “About EVS.” http://www.europeanvaluesstudy.eu/page/about-evs.html.
Grund, Simon, Oliver Lüdtke, and Alexander Robitzsch. 2018. “Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations.” Organizational Research Methods 21 (1): 111–49. doi:10.1177/1094428117703686.
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95 (01): 49–69. http://journals.cambridge.org/article_S0003055401000235.
Neundorf, Anja, and collaborators. 2017. “Global Citizen Politics.” https://globalcitizenpolitics.net.
Resche-Rigon, Matthieu, and Ian R White. 2016. “Multiple Imputation by Chained Equations for Systematically and Sporadically Missing Multilevel Data.” Statistical Methods in Medical Research, no. Online first: 1–16. doi:10.1177/0962280216666564.
Rubin, Donald B. 1987. Multiple Imputation for Survey Nonresponse. New York: Wiley.
Słomczyński, Kazimierz, and collaborators. 2017. “Survey Data Recycling.” https://dataharmonization.org/.