Major changes in btmembers, an R package to import data on all members of the Bundestag since 1949

Ahead of the upcoming German federal elections, I decided to make important changes to btmembers, my R package to import data on all members of the Bundestag since 1949.

Current composition of the German Bundestag

You can find more information about btmembers here. The CSV data is available here and the codebook here.

Version 0.1.0 changes the default behavior of the function import_members().

  • By default, import_members() now returns a list containing four data frames (namen, bio, wp, and inst), which together preserve all the information contained in the XML file provided by the Bundestag.
  • If import_members() is called with the argument condensed_df = TRUE, the function returns a condensed data frame in which each row corresponds to a member-term. Most of the information contained in the original data is preserved, except that only the most recent name of each member is retained and institutions are dropped. A new column named fraktion is added to the data: this recoded variable refers to the faction in which the member spent the most time during a given parliamentary term (see the usage sketch after this list).
  • The performance of import_members() has been improved by integrating tidyr’s unnest functions.
  • The package no longer comes preloaded with the data; instead, the pre-processed data is stored on GitHub. This facilitates updates and will make it possible to integrate GitHub Actions in the future.
  • update_available() has been deprecated.
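
A minimal usage sketch, assuming the development version is installed from GitHub (the repository path in the comment is assumed, not confirmed):

    # install.packages("remotes")
    # remotes::install_github("jolyphil/btmembers")  # repository path assumed
    library(btmembers)

    # Default: a list of four data frames preserving all the information
    # contained in the XML file provided by the Bundestag
    members <- import_members()
    names(members)  # "namen" "bio" "wp" "inst"

    # Alternative: a single condensed data frame, one row per member-term
    members_df <- import_members(condensed_df = TRUE)
    table(members_df$fraktion)  # recoded faction variable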

These changes give users the possibility to reorganize the data as they wish and make the package faster and more robust.

Plot the mean and confidence interval of a variable across multiple groups using Stata

Stata offers many options for graphing certain statistics (e.g. dot charts). These options, however, do not always work well for comparing statistics across groups. To address this, I am sharing a program called plotmean, which allows users to graph the mean and confidence interval of a variable across multiple groups.

Running this do-file generates a graph showing the mean of a variable, with its confidence interval, for each group.

The program relies on the statsby command and can easily be modified to plot all sorts of statistics.
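
To give an idea of the general approach, here is a sketch of the underlying statsby logic, not plotmean’s actual syntax (note that ci means requires Stata 14 or newer):

    * Collapse the data to the mean and 95% CI of price by repair record
    sysuse auto, clear
    statsby mean=r(mean) lb=r(lb) ub=r(ub), by(rep78) clear: ci means price

    * Plot the group means with capped confidence intervals
    twoway (rcap ub lb rep78) (scatter mean rep78), ///
        legend(off) xtitle("Repair record") ytitle("Mean price")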

New preprint: “Transition Spillovers? The Protest Behaviour of the 1989 Generation in Europe”

My latest paper, entitled “Transition Spillovers? The Protest Behaviour of the 1989 Generation in Europe,” is now available on SocArXiv.

The paper is available here and the replication materials are available here.

This paper re-examines the well-documented gap in political participation between citizens of Western Europe and citizens of Central and Eastern Europe. Building on political socialization theory, it explores whether this participation deficit in post-communist countries is moderated by previous experiences of mobilization. The study focuses on the protest behaviour of the 1989 generation, composed of citizens who reached political maturity during the collapse of communism.

RQDA: How an open source alternative to ATLAS.ti, MAXQDA, and NVivo opens new possibilities for qualitative data analysis

Coding CVs of members of the Bundestag using RQDA

At the WZB, as part of a project on the AfD (the new radical right party in Germany), I recently had to analyze the CVs of members of the Bundestag. The idea was to automatically download the MPs’ profiles from the Bundestag website using web scraping techniques and then to describe the social structure of the AfD faction using quantitative and qualitative methods. I was particularly interested in the prior political experience of AfD representatives. Extracting this type of information automatically is difficult, so I opted to code some of the material manually.

I was looking for a tool to work with the data I had collected. In the social sciences, ATLAS.ti, MAXQDA, and NVivo are the most commonly used programs for analyzing qualitative data. Yet these programs are expensive, and not everyone can afford a license. Besides, I simply did not need all the bells and whistles these tools offer (and I suspect I am not the only one in this situation).

The essence of qualitative data analysis (QDA) is annotating text with relevant codes. Think of computer-assisted qualitative data analysis software (CAQDAS) as a super highlighter (your brain still does the hard part). The rest – extracting content from PDF files, combining codes, visualizing them, etc. – can be performed by other programs. These functions do not necessarily have to be bundled with QDA software.

I took a look at RQDA, a free, lightweight, open-source CAQDAS built on top of R. RQDA was designed by Ronggui Huang from Fudan University, Shanghai, and has been used in a number of publications. The package is still in development (currently at version 0.3-1) and some bugs are apparent. Yet RQDA gets the essentials right: it allows you to import text files (in many languages), code them using a graphical interface, and store your codes (and their meta-information) in a usable format.

What I find particularly exciting about RQDA is that it lets you use the powerful machinery of R. Compared to programs that lock you into closed software environments, RQDA is highly extensible. Think of all the R packages available for manipulating, analyzing, and visualizing textual data. Combining qualitative and quantitative data is also straightforward, which makes RQDA a very good tool for mixed methods.
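
As a minimal sketch of this workflow (the project file name below is hypothetical): coding is done in the graphical interface, but the stored codes come back into R as an ordinary data frame.

    # Launch RQDA and open an existing project
    library(RQDA)
    RQDA()                             # opens the graphical interface
    openProject("bundestag_cvs.rqda")  # hypothetical project file

    # Retrieve all codings as a data frame (one row per coded segment)
    codings <- getCodingTable()
    table(codings$codename)            # frequency of each code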

Most importantly, since RQDA is free and open source, anyone with Internet access can download R and RQDA and reanalyze coded texts. Sometimes qualitative data contains sensitive information and should not be shared. Often, however, scholars analyze data that is already public (as I do). In such cases, it might be interesting to put your coding schemes online.

Researchers usually agree that quantitative methods should be reproducible: it should be possible to reproduce the findings of a publication (its tables and figures) by re-running code on the raw data. I argue that qualitative research, when it does not use sensitive data, should be traceable, in the sense that others should be able to go back to the source texts, examine the context, and reinterpret the coding. Simply by being free and open source, RQDA facilitates the diffusion and reuse of qualitative material and thereby makes qualitative research more traceable.

There are good RQDA tutorials online, especially the YouTube series prepared by Metin Caliskan, in French and English (see also Chandra and Shang 2017; Estrada 2017). I learned a lot from these demonstrations and have made good progress coding the CVs of the members of the Bundestag. I am very satisfied with RQDA and, for the moment, feel no need to move to proprietary software.

New vignette: How to apply Oesch’s class schema on data from the European Social Survey (ESS) using R

Daniel Oesch (University of Lausanne) has developed a schema of social classes, which he discusses and applies in different publications. On his personal website, he offers Stata, SPSS, and SAS scripts to generate the class schema with data from different surveys.

Scholars working with other programs (especially R) might be interested in using Oesch’s class schema as well. In this vignette, I show how to apply Oesch’s class schema on data from the European Social Survey (ESS) using R. See:

How to apply Oesch’s class schema on data from the European Social Survey (ESS) using R

Workshop on open access

As part of the Berlin Summer School in Social Sciences, I organized a workshop entitled “Open Access: Background and Tools for Early Career Researchers in Social Sciences.” The goal of the workshop was to introduce participants to open access publishing and present useful tools to make their publications available to a wider audience.

We addressed questions such as:

  • What are the limitations of the closed publication system?
  • What is OA publishing?
  • What are the different types of OA publications?
  • What are the available licenses for OA publications?
  • What is the share of OA publications in the scientific literature and how is this changing over time?
  • What sort of funding is available for OA publishing?

The workshop was structured around a 45-minute presentation punctuated by group discussions and exercises; 90 minutes were planned for the whole session.

This workshop was prepared as part of the Freies Wissen Fellowship sponsored by Wikimedia Deutschland, the Stifterverband, and the VolkswagenStiftung.

The contents of the workshop are under a CC BY 4.0 license. All the material of this workshop (the outline, the slides, and the bibliography) can be cloned or downloaded from GitHub.

Feel free to share and remix the material to create your own workshop.

Vignette: Identifying East and West Germans in the European Social Survey using R

As a complement to my recent paper on “Generations and Protest in Eastern Germany,” I have prepared a vignette explaining how to identify East and West Germans in the European Social Survey (ESS), while accounting for east-west migration in Germany. The categorization follows a political socialization approach. You can find the vignette here:

Identifying East and West Germans in the European Social Survey: A demonstration in R

The vignette was written in R Markdown, and the original script is available on my GitHub page.

New preprint on “Generations and Protest in Eastern Germany”

My WZB Discussion Paper entitled “Generations and Protest in Eastern Germany: Between Revolution and Apathy” is now available on SocArXiv.

The paper is available here and the replication material here.

This paper compares the protest behavior of East and West Germans across generations and over time. It concludes that East Germans, especially those who grew up during the Cold War, participate less in protest activities than West Germans from the same generation, even after controlling for other individual characteristics.

Dear editor, what is your preprint policy?

Going through the peer-review process usually takes months, if not years. At the end, if a paper makes it to publication, access will often be limited by publishers, who impose a paywall on peer-reviewed articles.

Preprints allow authors to publish early research findings and to make them available to the entire world for free. The concept is simple: 1) you upload a paper to a public repository; 2) the paper goes through a moderation process that assesses the scientific character of the work; 3) the paper is made available online. These three steps are usually completed in a few hours. With preprints, authors can rapidly communicate valuable results and engage with a broader community of scholars.

Researchers are sometimes reluctant to publish their work as preprints for two reasons. First, they fear that their papers won’t be accepted by scholarly journals because preprints would violate the so-called Ingelfinger rule, i.e. the work would have been “published” before submission. Most journals, however, will review and publish papers that are available as preprints. The SHERPA/RoMEO database catalogues journal policies regarding pre- and postprints (accepted papers that incorporate reviewers’ comments). The vast majority of journals are listed as “yellow” or “green” in the database: they tolerate preprints (yellow) or pre- and postprints (green).

Second, authors worry that they might get scooped: that their work might be stolen by someone else who would then get credit for it. Yet the experience of arXiv, the oldest preprint repository, which hosts papers in mathematics, physics, and other fields, shows the exact opposite. Since its creation in 1991, the repository has helped prevent scooping by letting scholars put a publicly available timestamp on their work.[1]

My first preprint

I have decided to make a paper available as a preprint in the coming days. My idea is to submit the paper to a peer-reviewed journal and simultaneously upload it to a public repository. At first, I shared the concerns of many of my colleagues about scooping and the possible rejection of my work by editors. However, the more I learned about preprints, the more confident I felt that this was the right way to proceed.

Here is what I did. I first selected the journal to which I would like to submit my paper. I then checked how the journal was rated in SHERPA/RoMEO. It turned out to be a “yellow” journal: so far so good. Finally, to be absolutely sure that the preprint would not be a problem I contacted the editor of the journal and asked:

Dear Professor XX,

I’m interested in publishing in journal YY. I would like to ask: what is your preprint policy? Would you review a paper that has been uploaded to a public repository like SocArXiv?

Best wishes,

Philippe Joly

The response came a few minutes later:

Dear Mr Joly,

Yes, we would have no problem with that.

Best,

Professor XX

I now feel very comfortable uploading the preprint to a repository. I will try to store my paper on SocArXiv, one of the first online preprint servers for the social sciences. While economists have a long experience with publicly available working papers, sociologists and political scientists have been more reluctant to join the movement. SocArXiv has been active since 2017 and is modelled on arXiv. Interestingly, the team at SocArXiv has partnered with the Center for Open Science, and their preprint service is hosted on the Open Science Framework, which I have covered in another post.

  1. Bourne PE, Polka JK, Vale RD, Kiley R (2017) Ten simple rules to consider regarding preprint submission. PLoS Comput Biol 13(5): e1005473.

Learning Git the wrong way

Git is a version control system: it keeps track of changes within files and allows for complex collaborative work. While it is mostly used by programmers to store and share code, it can in principle work with any type of file (text, images, etc.). GitHub is the most widely used hosting service for Git. On GitHub, users store their Git repositories and make them available to others.

The Git workflow is highly valued in the open science community for a few reasons. It is fast, secure, and well suited to coordinating large collaborative projects. The characteristic feature of Git is its branching system, which allows users to work on different lines of development in parallel. Branches can easily be merged or deleted with minimal risk of losing valuable material.

Inspired by other fellows and mentors of the Fellowship Freies Wissen, I started to use Git and GitHub about two months ago. However, I quickly faced difficulties.

The problem was that I took the wrong approach with Git. When learning a new computer skill (like R), I usually start experimenting early on and learn by solving the inevitable problems that come up along the way. This “hands-on” approach proved more complicated with Git. While trying to keep track of the changes in my PhD project, I quickly got confused by the concept of the “staging area” and struggled to move from one branch to another.

I realized that to start using Git efficiently I would need a more solid theoretical understanding of the system.

I stopped using Git on my project for a while and went back to basics. I started reading Pro Git, 2nd edition, by Scott Chacon and Ben Straub (Apress, 2014), and followed the explanations using example repositories containing simple .txt files. I performed all the operations in Bash, the command-line shell, instead of using a GUI.
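
A toy session of this kind is enough to make the staging area and the branch mechanics concrete (a sketch; file and branch names are arbitrary):

    # Create a toy repository and make a first commit
    git init demo
    cd demo
    echo "first draft" > notes.txt
    git add notes.txt                # stage the file (the "staging area")
    git commit -m "Add first draft"

    # Work on a separate line of development, then merge it back
    git checkout -b revision         # create and switch to a new branch
    echo "second draft" > notes.txt
    git commit -am "Revise draft"    # stage tracked changes and commit
    git checkout master
    git merge revision               # fast-forward merge
    git branch -d revision           # delete the merged branch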

Pro Git is a great resource: it is distributed under a Creative Commons license and has been translated into many languages. You can find the book in HTML, PDF, and other formats here.

If you are also starting to learn Git, I would recommend going through the first three chapters (“Getting Started”, “Git Basics”, “Git Branching”) and the first sections of the chapter on GitHub. That’s about 120 pages.

Taking the time to learn Git properly proved very useful. After a few hours of reading, I was able to take advantage of most of Git’s basic functions. More and more, I am discovering the advantages of Git and wish I had learned it earlier.

You can now follow my work on GitHub here. I will continue to post vignettes in R and will distribute replication material for my papers.