New vignette: How to apply Oesch’s class schema on data from the European Social Survey (ESS) using R

Daniel Oesch (University of Lausanne) has developped a schema of social classes, which he discusses and applies in different publications. On his personnal website, he offers Stata, SPSS, and SAS scripts to generate the class schema with data from different surveys.

Scholars working with other programs (especially R) might be interested in using Oesch’s class schema as well. In this vignette, I show how to apply Oesch’s class schema on data from the European Social Survey (ESS) using R. See:

How to apply Oesch’s class schema on data from the European Social Survey (ESS) using R

Workshop on open access

As part of the Berlin Summer School in Social Sciences, I organized a workshop entitled “Open Access: Background and Tools for Early Career Researchers in Social Sciences.” The goal of the workshop was to introduce participants to open access publishing and present useful tools to make their publications available to a wider audience.

We addressed questions such as:

  • What are the limitations of the closed publication system?
  • What is OA publishing?
  • What are the different types of OA publications?
  • What are the available licenses for OA publications?
  • What is the share of OA publications in the scientific literature and how is this changing over time?
  • What sort of funding is available for OA publishing?

The workshop was structured around a 45 minute presentation punctuated by group discussions and exercises. 90 minutes were planned for the whole workshop.

This workshop was prepared as part of the Freies Wissen Fellowship sponsored by Wikimedia Deutschland, the Stifterverband, and the VolkswagenStiftung.

The contents of the workshop are under a CC BY 4.0 license. All the material of this workshop (the outline, the slides, and the bibliography) can be cloned or downloaded from GitHub.

Feel free to share and remix the material to create your own workshop.

Vignette: Identifying East and West Germans in the European Social Survey using R

As a complement to my recent paper on “Generations and Protest in Eastern Germany,” I have prepared a vignette explaining how to identify East and West Germans in the European Social Survey (ESS), while accounting for east-west migration in Germany. The categorization follows a political socialization approach. You can find the vignette here:

Identifying East and West Germans in the European Social Survey: A demonstration in R

The vignette was writen in R markdown and the original script is available on my GitHub page.

New preprint on “Generations and Protest in Eastern Germany”

My WZB Discussion Paper entitled “Generations and Protest in Eastern Germany: Between Revolution and Apathy” is now available on SocArXiv.

The paper is available here and the replication material here.

This paper compares the protest behavior of East and West Germans across generations and over time. It concludes that East Germans, especially those who grew up during the Cold War, participate less in protest activities than West Germans from the same generation after controlling for other individual characteristics.

Dear editor, what is your preprint policy?

Going through a peer-review process usually takes months if not years. At the end, if a paper makes it to publication, access will often be limited by publishers who impose a paywall on peer-reviewed articles.

Preprints allow authors to publish early research findings and to make them available to the entire world for free. The concept is simple: 1) you upload a paper to a public repository; 2) the paper goes through a moderation process that assesses the scientific character of the work; 3) the paper is made available online. These three steps are usually completed in a few hours. With preprints, authors can rapidly communicate valuable results and engage with a broader community of scholars.

Researchers are sometimes reluctant to publish their work as preprints for two reasons. First, they fear that their papers won’t be accepted by scholarly journals because preprints would violate the so-called Ingelfinger rule, i.e. their work would have been “published” before submission. Most journals however will accept to review and publish papers that are available as preprints. The SHERPA/RoMEO database catalogues journal policies regarding pre and postprints (accepted papers that incorporate reviewers’ comments). The vast majority of journals are listed as “yellow” or “green” in the database: they tolerate preprints (yellow) or pre and postprints (green).

Second, authors are worried that they might get scooped, that their work might be stolen by someone else who would get credits for their work. Yet, the experience of arXiv, the oldest preprint repository which publishes papers in mathematics, physics, and others, shows the exact opposite. Since its creation in 1991, the repository has helped prevent scooping by offering scholars the chance to put a publicly available timestamp on their work.[1]

My first preprint

I have decided to make a paper available as preprint in the coming days. My idea is to simultaneously submit the paper to a peer-reviewed journal and upload it on a public repository. I first shared the concerns of many of my colleagues regarding scooping and the possible rejection of my work by editors. However, the more I learned about preprints, the more I felt confident that this was the right way to proceed.

Here is what I did. I first selected the journal to which I would like to submit my paper. I then checked how the journal was rated in SHERPA/RoMEO. It turned out to be a “yellow” journal: so far so good. Finally, to be absolutely sure that the preprint would not be a problem I contacted the editor of the journal and asked:

Dear Professor XX,

I’m interested in publishing in journal YY. I would like to ask: what is your preprint policy? Would you review a paper that has been uploaded to a public repository like SocArXiv?

Best wishes,

Philippe Joly

And the response came a few minutes after:

Dear Mr Joly,

Yes, we would have no problem with that.

Best,

Professor XX

I now feel very comfortable uploading the preprint on a repository. I will try to store my paper on SocArxiv, which is one of the first online preprint servers for the social sciences. While economists have a long experience with publicly available working papers, sociologists and political scientists have been more reluctant to join the movement. SocArXiv has been active since 2017 and is modelled on arXiv. Interestingly, the team at SocArXiv has partnered with the Center for Open Science and their preprint service is hosted by the Open Science Framework, which I have covered in another post.

  1. Bourne PE, Polka JK, Vale RD, Kiley R (2017) Ten simple rules to consider regarding preprint submission. PLoS Comput Biol 13(5): e1005473.

Learning Git the wrong way

Git is a version control system. It keeps track of changes within files and allows for complex collaborative work. While it is mostly used by programmers for storing and sharing code, it can theoretically work with any type of file (text, images, etc.). GitHub is the most-used hosting service for Git. On Github, users store and make their Git repositories available to others.

The Git workflow is highly valued in the open science community for a few reasons. It is fast, secure, and well-suited for coordinating large collaborative projects. The characteristic feature of Git is its branch system which allows users to work on different lines of development in parallel. Branches can be easily merged or deleted with minimal risk of losing valuable material.

Inspired by other fellows and mentors of the Fellowship Freies Wissen, I started to use Git and GitHub about two months ago. However, I quickly faced difficulties.

The problem was that I took the wrong approach with Git. When learning a new computer skill (like R), I usually start experimenting early in the process and learn by solving the inevitable problems that come along the way. This “hands-on” approach proved to be more complicated with Git. While I was trying to keep track of the changes on my PhD project, I got rapidly confused by the concept of “staging area” and struggled moving from one branch to the other.

I realized that to start using Git efficiently I would need a more solid theoretical understanding of the system.

I stopped using Git with my project for a while and went back to the basics. I started reading Pro Git, 2nd edition, written by Scott Chacon and Ben Straub (Apress, 2018) and followed through the explanations with example repositories containing simple .txt files. I performed all the operations in Bash, the command-line system, instead of using a GUI.

Pro Git is a great resource: it is distributed under Creative Commons license and was translated in many languages. You can find the book in html, pdf, and other formats here.

If you are also starting to learn Git I would recommend going through the first three chapters (“Getting Started”, “Git Basics”, “Git Branching”) and the first sections of the chapter on GitHub. That’s about 120 pages.

Taking the time to learn Git properly proved very useful. After a few hours of reading, I was able to take advantage of most of the basic functions of Git. More and more, I am discovering the advantages of Git and wished I had learned it earlier.

You can now follow my work on GitHub here. I will continue to post vignettes in R and will distribute replication material for my papers.

Vignette: Improving the harmonization of repeated cross-national surveys with the R mice package

As part of the Freies Wissen Fellowship, I have prepared a vignette on:

Using multiple imputation to improve the harmonization of repeated cross-national surveys

The vignette introduces a technique for coping with the problem of systematically missing data in longitudinal social and political surveys.

The project combines many aspects of open science:

  • The multiple imputation procedure was implemented in R, a programming language and free software environment for statistical computing.
  • The text and the code of the vignette were prepared in R markdown, a form of literate programming. Literate programming makes research more transparent and reproducible. It is also a nice tool for the development of open educational resources (OER).
  • Finally, all the material of the vignette is on my GitHub page, where users can clone the repository, comment, and suggest modifications.

Some reflections on the availability, reliability, and replicability of protest data

While theories and methods to study protest are becoming ever more sophisticated, protest data still suffers from many of the same limitations as it did twenty years ago. Since the 1990s, new rich datasets on extra-electoral political participation have become available. Yet, despite the great contribution of the research teams behind the production of this empirical material, concerns remain with regards to the availability, reliability, and replicability of this data.

Scholars interested in comparing protest participation quantitatively in different countries are essentially dependent on two types of data: international social surveys and protest event data from newspapers or websites.[1] These sources are all suboptimal, each having its advantages and disadvantages.

International social surveys

Most major international social surveys, like the European Values Study, the World Values Survey, the International Social Survey Programme, and the European Social Survey now incorporate questions relative to protest participation.

These social surveys have three great advantages: they are usually nationally-representative, their questionnaires are standardized, and, since their units of analysis are individuals, they include a bunch of variables (e.g. age, gender, and education) that can be incorporated as covariates into statistical analyses.

But, this type of survey has one downsize: it does not allow to link respondents to specific protest events. We can never say, individual X took part in demonstration Y. Typically, respondents will have to answer to questions such as “during the last 12 months, have you… taken part in a lawful demonstration?[2] The target, timing, and location of the political action remains unknown. We can identify who is active and who is not in a given country at a certain point in time, but we cannot, for example, compare mobilization for the environmental movement with that for the feminist movement.

These surveys suffer from another problem: their inconsistent coverage across countries and time. Typically, as with the EVS and WVS, survey rounds are conducted approximately every five years with different samples of countries. OECD countries are systematically over-represented. This makes it difficult to carry any form of time series cross-sectional analysis.

Finally, survey data is not fully open. Survey organizations typically reserve the right to distribute their data. Although this is often done for very legitimate reasons (updates, correction of mistakes), it can complicate the diffusion of secondary data such as protest indexes.

Protest event data

The second main type of protest data is based on secondary sources (newspapers, websites, police records) which report events such as demonstrations, strikes, or rallies. Here, the units of analyses are protest events rather than individuals. Up to recently, this data was coded entirely manually – a very tedious process. Now, more and more automated coding tools are becoming available, but often automation has come at the cost of reliability.

In many aspects, the balance sheet of protest event data, in terms of advantages and disadvantages, is the exact mirror of the one for survey data. Protest event data can be very useful for a few reasons. First, this type of data typically incorporates information on events’ form, size (number of participants), location, duration, target, and other characteristics such as the level of confrontation with the police. Second, since protest event data is usually based on daily reports, it can be easily reassembled into monthly or annual measures, making it straightforward to perform longitudinal analyses. Third, with access to archives, protest event data can theoretically go back in time, even as far as the beginning of the written press (as did Charles Tilly and his collaborators).[3]

Nevertheless, collecting protest data from secondary sources comes with a warning notice. Social movement scholars are well aware that protest event data is subject to two forms of biases.[4] Selection bias is present when media sources report on certain types of protest, but not others. This can reflect the ideological orientation or geographical focus of the source. In fact, what protest event data is really measuring is media attention, not absolute levels of protest. Description bias appears when protest events are reported incorrectly. At the age of “alternative facts,” the number of participants in a demonstration is notoriously difficult to estimate and figures can be quite inconsistent across different sources.

One final problem with protest event data is that the original sources behind the datasets are usually under copyright. If other scholars wish to revise some codings, they have to obtain the authorization to retrieve the original sources, usually newspaper articles (and that is when the dataset includes clear references, which was not always the case).

Machine-coded protest event data

Machine-coded protest event data has all the limitations of human-coded data, but, of course, is generated much more efficiently. Reliability is potentially an issue however, as automated coding mechanism are prone to false negatives and false positives. And, of course, machines do not have the finest interpretation when it comes to identifying the characteristics of the protest event. For example, the reliability and validity of the Global Database of Events, Language, and Tone (GDELT) project, one of the most ambitious attempts at automatically cataloging societal events, has been seriously questioned.[5]

Opening, benchmarking, and triangulating

Where should we go from there? There are no perfect solutions, but certainly a few incremental changes would improve the quality and transparency of protest data. The strategy I propose would follow three lines: opening, benchmarking, and triangulating.

Opening

We need to make sure that all the data and the sources on which the data is based are open. For social surveys, this would mean facilitating the access to data with APIs. For protest event data, this would imply that the original sources are open and accessible. Some newspapers like the New York Times and the Guardian have already taken an important step by implementing APIs to easily retrieve their articles. We would expect public broadcasters to go a step further and place their articles (at least the older ones) under a Creative Commons license. Governments also own protest data, for example in the form of police records, that could be made public and machine-readable.

Benchmarking

By benchmarking, I mean comparing systematically the same type of data from different sources (e.g. some survey data compared to other survey data) and ideally developing measurement models to assess uncertainty of the data. A good example of how this could work is Alex Hanna’s Machine-learning protest event data system (MPEDS) where human-coded datasets are used to “train” machine-learning algorithms which classify and code protest events on the basis of large databases of newspaper articles.

Triangulating

Finally, triangulating would mean combining different types of protest data (from surveys, newspapers, and web sources) and cross-validate the measures. For example, we could think of a research design where protest event data is collected first and then, a nationally-representative survey, complements the analysis. We could use the protest event data to identify a list of the most prominent episodes of mobilization in a country and the survey data to get a clearer profile of the protesters (age, gender, education) and reassess the number of participants.

The three strategies of opening, benchmarking, and triangulating would certainly improve the transparency and robustness of research on extra-electoral political participation and social movements.

Notes and references

  1. A third option would be expert-surveys such as the V-Dem civil society index. Yet, these measures are usually better at capturing the conditions under which protest takes place (the opportunity structure), rather than the actual level or orientation of the mobilization. See: Michael Bernhard et al., “Making Embedded Knowledge Transparent: How the V-Dem Dataset Opens New Vistas in Civil Society Research,” Perspectives on Politics 15, no. 2 (June 2017).
  2. European Social Survey, ESS Round 8 Source Questionnaire (London: ESS ERIC Headquarters c/o City University London, 2016).
  3. Charles Tilly, Louise Tilly and Richard Tilly, The Rebellious Century, 1830-1930 (Cambridge: Harvard University Press, 1975).
  4. Jennifer Earl et al., “The Use of Newspaper Data in the Study of Collective Action,” Annual Review of Sociology 30 (2004).
  5. Wang et al., “Growing pains for global monitoring of societal events,” Science 353, no. 6307 (September 2016); Alex Hanna, “Assessing GDELT with handcoded protest data,” Bad Hessian, http://badhessian.org/2014/02/assessing-gdelt-with-handcoded-protest-data/

A look at the Open Science Framework (OSF)

Over the recent years, a number of tools have been proposed to help researchers make their research more accessible, from the first stages of project design to publication. While this development is welcome, the abundance of software solutions can sometimes feel overwhelming.

OSF: one tool across the entire research lifecycle

This week I have started experimenting with the Open Science Framework (OSF), a free and open source platform which aims at streamlining the research workflow in a single software environment. The OSF was developed by the Center for Open Science. It provides solutions for collaborative, open research throughout the research cycle.

After testing the platform for a few days, I am impressed by its flexibility and reliability together with its successful integration with other tools.

One of the great strengths of the OSF is that it gives users full control over what content is public and what content is private. By default, all projects are created as private (only the administrators and collaborators can see them), but as project evolves users can decide to make certain elements public. This transition from private to public is facilitated by the way the OSF organizes research material. Users develop “projects” which are divided into “components” (e.g. literature, data, analysis, communication), which can be further subdivided into “components” themselves, etc. This structure is highly flexible and allows researchers to use the OSF for almost any type of project (an article, an experiment, a class). Each project or component receives a globally unique identifier (GUID) and public projects can be attributed a DOI. Users define which license applies to which content.

This “OSF 101” video gives an overview of the many functions of the platform.

Video: “OSF 101”

My first OSF project

I decided to test the OSF by creating a new page which will group projects I am developing as part of the Freies Wissen Fellowship Program.

Figure: Project in public mode

In the public part of the project (see picture above), two components are currently available: “Literature” and “Data management plan.” The Literature component is a first attempt at making my bibliography open. There you will find a first bibTeX file with a list of references on “Political protest and political opportunity structures.” The Data management plan is a copy of the plan I posted on wikiversity. This version is written in markdown and can be modified and commented directly on the OSF platform.

Figure: Project in private mode

Of course, these two components are just a start. I plan on gradually adding more public content on my OSF project. In the private view of my project (picture above), you can see that I have linked three other projects: the two papers I am working on and the seminar on open social sciences.

Figure: Paper project

The paper projects are still in private mode, but, as shown in the picture above, they already contain a title, an abstract, tags, a manuscript, and a replication set. This already offers a good backup solution. There is currently no limit on the OSF storage. Only, individual files cannot have more than 5 GB. Files can be viewed, downloaded, deleted, and renamed. Plain text files can be edited collaboratively online. The OSF also has a multitude of well-integrated add-ons such as Dropbox and GitHub.

Overall, I am really satisfied with the OSF and will continue to explore it as I develop my projects further. If you want to learn more about the possibilities offered by the OSF, I recommend the John Hopkins’ Electronic Lab Notebook Template.

Creating a data management plan

As part of the Freies Wissen Fellowship Program, I was recently encouraged to prepare a data management plan (DMP) for my project. This was something new to me. In my study program (but also, I suspect, in most social science programs), issues related to data collection or generation, data analysis, data storing, and data sharing had usually been discussed on an ad hoc basis with supervisors or in small research teams.

Preparing an exhaustive DMP can be a tedious process at an early stage of a project, but it has obvious benefits down the road.

What is a data management plan?

A data management plan is a “written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyze, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.” [1]

DMPs can vary in terms of format, but they typically aim at answering questions such as:

  • What type of data will be analyzed or created?
  • What file format will be used?
  • How will the data be stored and backed-up during the project?
  • How will sensitive data be handled?
  • How will the data be shared and archived after the project?
  • Who will be responsible for managing the data?

A good DMP should be a “living document” in the sense that it needs to evolve with the research project.

Why do you need a data management plan?

A DMP forces you to anticipate and address issues of data management that will arise in the course of the research project. There are a few reasons why such an approach can be beneficial, as highlighted by the University of Lausanne:

  • Taking a systematic and rigorous approach to data management saves time and money.
  • Developing a good backup strategy can avoid tragedies such as the loss of irrecoverable data.
  • More and more funders are asking for DMPs.
  • A DMP is a first logical step when planning to open your data to the public.
  • Generally, DMPs are an integral part of honest, responsible, and transparent research. [2]

A good example of the new direction public funders are taking with regards to DMPs is the EU Horizon 2020 Program. As part of the Open Research Data Pilot, the EU is now encouraging applicants to present a DMP where they detail how their research data will be findable, accessible, interoperable, and reusable (FAIR). [3]

Creating my own data management plan

For my project, I followed the DMP model proposed by the Bielefeld University. This concise model summarizes in 8 sections the essential elements of a DMP. You can find my plan here.

Challenges

I thought the exercise would be rather straightforward as my dissertation relies mostly on existing data. My contribution lies mostly in the merging and harmonization of repeated cross-national surveys together with national-level indicators.

I nonetheless faced some challenges. The first was that not all data providers state precisely what their terms of use are. Very few use an explicit license. Most of them will have conditions similar to the World values survey: “The data are available without restrictions, provided that 1) they are used for non-profit purposes, 2) data files are not redistributed, 3) correct citations are provided wherever results based on the data are published.”[4]

The second challenge I faced was to elaborate a backup strategy. One advantage of DMPs is that they quickly highlight your weaknesses. In my case, it was clear that I had not reflected enough on how to safely save and archive the thousands of lines of code I had produced over years. (Code is also data in a broad sense!) That’s why I am exploring new avenues to protect my files and soon distribute my replication material.

I am currently exploring solutions such as the Open Science Framework platform. OSF seems a good option as it is free, open source, and well-integrated with other platforms such as Github and Dropbox. I will report on my progress in upcoming posts.

References

  1. Stanford Libraries. “Data management plans.” https://library.stanford.edu/research/data-management-services/data-management-plans (accessed November 13, 2017).
  2. Université de Lausanne. “Réaliser un Data Management Plan.” https://uniris.unil.ch/researchdata/sujet/realiser-un-data-management-plan/ (accessed November 13, 2017).
  3. European Commission. Directorate-General for Research & Innovation. “H2020 Programme: Guidelines on FAIR Data Management in Horizon 2020.” http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf (accessed November 13, 2017).
  4. World Values Survey. “Integrated EVS/WVS 1981-2008 Conditions of Use.” http://www.worldvaluessurvey.org/WVSContents.jsp (accessed November 13, 2017).