While theories and methods for studying protest have become ever more sophisticated, protest data still suffers from many of the same limitations as it did twenty years ago. Since the 1990s, rich new datasets on extra-electoral political participation have become available. Yet, despite the great contribution of the research teams behind this empirical material, concerns remain with regard to the availability, reliability, and replicability of the data.
Scholars interested in comparing protest participation quantitatively across countries essentially depend on two types of data: international social surveys and protest event data from newspapers or websites.[1] Both sources are suboptimal; each has its advantages and disadvantages.
International social surveys
Most major international social surveys, such as the European Values Study (EVS), the World Values Survey (WVS), the International Social Survey Programme, and the European Social Survey, now incorporate questions on protest participation.
These social surveys have three great advantages: they are usually nationally representative, their questionnaires are standardized, and, since their units of analysis are individuals, they include a range of variables (e.g. age, gender, and education) that can be incorporated as covariates into statistical analyses.
But this type of survey has one downside: it does not allow us to link respondents to specific protest events. We can never say that individual X took part in demonstration Y. Typically, respondents answer questions such as “during the last 12 months, have you… taken part in a lawful demonstration?”[2] The target, timing, and location of the political action remain unknown. We can identify who is active and who is not in a given country at a certain point in time, but we cannot, for example, compare mobilization for the environmental movement with that for the feminist movement.
These surveys suffer from another problem: their inconsistent coverage across countries and time. Typically, as with the EVS and WVS, survey rounds are conducted approximately every five years with different samples of countries, and OECD countries are systematically over-represented. This makes it difficult to carry out any form of time-series cross-sectional analysis.
Finally, survey data is not fully open. Survey organizations typically reserve the right to distribute their data. Although this is often done for very legitimate reasons (updates, correction of mistakes), it can complicate the diffusion of secondary data such as protest indexes.
Protest event data
The second main type of protest data is based on secondary sources (newspapers, websites, police records) which report events such as demonstrations, strikes, or rallies. Here, the units of analysis are protest events rather than individuals. Until recently, this data was coded entirely by hand – a very tedious process. More and more automated coding tools are now becoming available, but automation has often come at the cost of reliability.
In many respects, the balance sheet of protest event data is the exact mirror image of that of survey data. Protest event data can be very useful for a few reasons. First, this type of data typically incorporates information on an event’s form, size (number of participants), location, duration, target, and other characteristics such as the level of confrontation with the police. Second, since protest event data is usually based on daily reports, it can easily be reassembled into monthly or annual measures, making it straightforward to perform longitudinal analyses. Third, with access to archives, protest event data can in principle go back in time, even as far as the beginning of the written press (as did Charles Tilly and his collaborators).[3]
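To illustrate the second point, here is a minimal sketch of how daily event records could be rolled up into monthly series; the dataset and its columns are purely hypothetical, and pandas is just one convenient tool for the job.

```python
import pandas as pd

# Hypothetical event-level data: one row per protest event, with a
# date, a location, and a (reported) number of participants.
events = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-12", "2016-01-30", "2016-02-14", "2016-04-02"]),
    "city": ["Berlin", "Hamburg", "Berlin", "Munich"],
    "participants": [1200, 300, 5000, 800],
})

# Reassemble the daily reports into monthly measures: the number of
# events and the total reported participants per month.
monthly = events.groupby(pd.Grouper(key="date", freq="MS")).agg(
    n_events=("city", "size"),
    participants=("participants", "sum"),
)
print(monthly)
```

The same aggregation scales from monthly to annual measures simply by changing the grouping frequency.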
Nevertheless, collecting protest data from secondary sources comes with a warning label. Social movement scholars are well aware that protest event data is subject to two forms of bias.[4] Selection bias is present when media sources report on certain types of protest but not others, reflecting the ideological orientation or geographical focus of the source. In effect, what protest event data really measures is media attention, not absolute levels of protest. Description bias appears when protest events are reported incorrectly. In the age of “alternative facts,” the number of participants in a demonstration is notoriously difficult to estimate, and figures can be quite inconsistent across sources.
One final problem with protest event data is that the original sources behind the datasets are usually under copyright. If other scholars wish to revise some codings, they have to obtain authorization to retrieve the original sources, usually newspaper articles – and that is assuming the dataset includes clear references, which is not always the case.
Machine-coded protest event data
Machine-coded protest event data has all the limitations of human-coded data but, of course, is generated much more efficiently. Reliability is a potential issue, however, as automated coding mechanisms are prone to false negatives and false positives. And machines lack nuance when it comes to identifying the characteristics of a protest event. For example, the reliability and validity of the Global Database of Events, Language, and Tone (GDELT) project, one of the most ambitious attempts at automatically cataloging societal events, has been seriously questioned.[5]
Opening, benchmarking, and triangulating
Where should we go from here? There are no perfect solutions, but a few incremental changes would certainly improve the quality and transparency of protest data. The strategy I propose follows three lines: opening, benchmarking, and triangulating.
Opening
We need to make sure that all the data, and the sources on which the data is based, are open. For social surveys, this would mean facilitating access to the data through APIs. For protest event data, this would imply that the original sources are open and accessible. Some newspapers, like the New York Times and the Guardian, have already taken an important step by implementing APIs to easily retrieve their articles. We could expect public broadcasters to go a step further and place their articles (at least the older ones) under a Creative Commons license. Governments also own protest data, for example in the form of police records, that could be made public and machine-readable.
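As an illustration, here is a minimal sketch of article retrieval through the Guardian’s public search endpoint; the query terms and dates are arbitrary, and “test” is the rate-limited demo key the Guardian provides for experimentation (a free developer key lifts the limits).

```python
import requests

# Search the Guardian's Open Platform for articles mentioning protests.
resp = requests.get(
    "https://content.guardianapis.com/search",
    params={
        "q": "demonstration OR protest",  # arbitrary query terms
        "from-date": "2017-01-01",        # arbitrary start date
        "page-size": 10,
        "api-key": "test",                # public demo key
    },
)
resp.raise_for_status()

# Print the publication date and headline of each match.
for article in resp.json()["response"]["results"]:
    print(article["webPublicationDate"], article["webTitle"])
```

If more newspapers and broadcasters offered comparable endpoints, assembling and revising protest event datasets would become considerably easier.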
Benchmarking
By benchmarking, I mean systematically comparing the same type of data from different sources (e.g. one survey against another) and, ideally, developing measurement models to assess the uncertainty of the data. A good example of how this could work is Alex Hanna’s Machine-learning protest event data system (MPEDS), where human-coded datasets are used to “train” machine-learning algorithms which classify and code protest events on the basis of large databases of newspaper articles.
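The underlying idea can be conveyed with a toy sketch: not MPEDS itself, but a bare-bones supervised classifier in scikit-learn, trained on invented hand-coded snippets, that sorts newspaper text into protest and non-protest reports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-coded training material: snippets labeled 1 if they report a
# protest event, 0 otherwise (invented examples, not real codings).
texts = [
    "Thousands marched through the capital demanding electoral reform.",
    "The central bank left interest rates unchanged on Thursday.",
    "Striking workers rallied outside the ministry for a third day.",
    "The museum unveiled its new modern art wing to the public.",
]
labels = [1, 0, 1, 0]

# TF-IDF features fed into a logistic regression stand in for the
# human coder once the model is trained.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Classify an unseen article; the predicted probability doubles as a
# rough measure of the classifier's uncertainty.
new = ["Farmers blocked the highway to protest falling milk prices."]
print(clf.predict(new), clf.predict_proba(new))
```

Benchmarking would then consist in confronting the machine codings with the human ones, quantifying the false positives and false negatives rather than assuming them away.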
Triangulating
Finally, triangulating would mean combining different types of protest data (from surveys, newspapers, and web sources) and cross-validating the measures. For example, we could imagine a research design where protest event data is collected first and a nationally representative survey then complements the analysis. We could use the protest event data to identify the most prominent episodes of mobilization in a country, and the survey data to get a clearer profile of the protesters (age, gender, education) and to reassess the number of participants.
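To make the last step concrete, here is a minimal sketch of how a survey could be used to reassess a press figure for one prominent demonstration; every number below is invented for the sake of the example.

```python
# Hypothetical triangulation: reassess the press estimate of turnout at
# one demonstration with a nationally representative survey.
press_estimate = 80_000       # participants according to newspapers
adult_population = 6_500_000  # adult population of the country

# Share of respondents who say they attended that demonstration,
# with a simple 95% interval from the sampling error.
n_respondents = 2_000
n_attended = 28
share = n_attended / n_respondents
se = (share * (1 - share) / n_respondents) ** 0.5

low = (share - 1.96 * se) * adult_population
high = (share + 1.96 * se) * adult_population

print(f"Press estimate:  {press_estimate:,}")
print(f"Survey estimate: {low:,.0f} to {high:,.0f}")
```

If the press figure falls outside the survey-based interval, that discrepancy is itself a finding about description bias.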
The three strategies of opening, benchmarking, and triangulating would certainly improve the transparency and robustness of research on extra-electoral political participation and social movements.
Notes and references
- A third option would be expert surveys such as the V-Dem civil society index. Yet, these measures are usually better at capturing the conditions under which protest takes place (the opportunity structure) than the actual level or orientation of the mobilization. See: Michael Bernhard et al., “Making Embedded Knowledge Transparent: How the V-Dem Dataset Opens New Vistas in Civil Society Research,” Perspectives on Politics 15, no. 2 (June 2017).
- European Social Survey, ESS Round 8 Source Questionnaire (London: ESS ERIC Headquarters c/o City University London, 2016).
- Charles Tilly, Louise Tilly and Richard Tilly, The Rebellious Century, 1830-1930 (Cambridge: Harvard University Press, 1975).
- Jennifer Earl et al., “The Use of Newspaper Data in the Study of Collective Action,” Annual Review of Sociology 30 (2004).
- Wei Wang et al., “Growing Pains for Global Monitoring of Societal Events,” Science 353, no. 6307 (September 2016); Alex Hanna, “Assessing GDELT with Handcoded Protest Data,” Bad Hessian, http://badhessian.org/2014/02/assessing-gdelt-with-handcoded-protest-data/