In constructing and analyzing new data on Black protests, I’ve come to realize the importance of improving standard protocols for collecting protest event data. Although protest event data contain one record per event, these events are recorded in news accounts: one article may describe multiple events, and the same event is often described in several different articles. The usual data collection protocols do not retain this information. The important Dynamics of Collective Action data set records only one article per event and the total number of articles, discarding the identifiers of additional articles about the event. All research teams building protest event databases must reconcile “duplications,” events that are described in more than one report. Past researchers have done this reconciliation “off stage,” yielding only one composite record per event that points to only one news article (if they document the publication source at all).
Replicable protest event data construction requires a structured process that creates a chain of files documenting the data collection and coding process. The sources of event data can be documented with at least two shareable files: an event file and an event-article crosswalk file. A modification of past research procedures makes it possible to collect and process protest event data in a way that preserves this multiple-articles-per-event information and makes the coding process better documented and more replicable, with little if any additional work. The result will be higher quality data.
In the first step, one or more distinct events are identified within each article, and a record is created for each event described in the article. The derived record contains identifying information about the news article and machine- or human-coded information about the event, including minimally a brief description of the event, its date (which may be exact or estimated), and its location. Researchers determine what other information to code as well. Our evolving MPEDS research protocol (in collaboration with Alex Hanna and Chaeyoon Lim) uses our web automation application to mark text in the article describing actors, targets, issues, actions, police actions, politicians, size, and violence. Coders also have the option of flagging events as large events containing subevents, as subevents of larger events, as counter-protests, or as likely to be determined out of scope (e.g., non-US events, historical events, non-protest events). Coders also provide short open-ended text descriptions of each event and of the content of each article.
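The shape of a first-step record can be sketched as a simple data structure. This is only an illustration with hypothetical field names, not the actual MPEDS schema; the example values are likewise illustrative.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a step-one record: one row per event per article.
# Field names are invented for illustration, not the MPEDS schema.
@dataclass
class ArticleEventRecord:
    article_id: str          # identifier of the source news article
    event_description: str   # coder's short open-ended description
    event_date: str          # exact or estimated date
    date_estimated: bool     # True if the date was inferred, not stated
    location: str            # place string, e.g. "City, State"
    actors: list = field(default_factory=list)  # marked text spans
    issues: list = field(default_factory=list)
    flags: list = field(default_factory=list)   # e.g. "subevent", "counter-protest"

rec = ArticleEventRecord(
    article_id="NYT-1965-03-08-0412",  # invented identifier
    event_description="March from Selma halted at Edmund Pettus Bridge",
    event_date="1965-03-07",
    date_estimated=False,
    location="Selma, AL",
    actors=["civil rights marchers"],
    issues=["voting rights"],
)
```

Because each record carries its own article identifier, an article describing three events yields three such records, and the article-to-event linkage is never lost.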
The second step involves identifying and reconciling multiple reports on the same event, assigning a unique event identification code to each event, and creating derived codes for event characteristics that merge information across reports. The date, location, and event characteristics coded in the first step allow researchers to filter potential duplicates and so facilitate identification. The output of this step is two files: an event file and an article/event file. The article/event file is an edited version of the original data collection file containing one record for each event identified in each article, with the addition of a unique event identifier so that the same event in multiple articles carries the same identifier. The event file has one record per event that combines or consolidates information from multiple articles about the same event. Fields are added to the event file to indicate events that are linked to multiple articles and to flag inconsistencies among the reports on an event from multiple sources. For purposes of research integrity, researchers retain the intermediate versions of these files containing both the original coding and the cleaned or derived codes, along with coding notes and documentation, but these intermediate files are not necessarily made public.
The third step involves creating the clean files for analysis and, later, public release. This will be a minimum of two files; these linked files may be stored as one relational database or as separate files with ID numbers that permit them to be related and merged for analysis. Files 1 and 2 are the minimum necessary for documenting an event database for public release. Files 3 and 4 are optional but desirable.
1. The clean event file, with the unique event ID and the derived codes for the event. This file indicates how many articles described the event and includes flags identifying events with inconsistent descriptions across articles. It may optionally contain a field or fields listing the identifiers of all articles that described the event. Required.
2. An article-event crosswalk file that links each event ID with a unique article ID and the publication information for each article that described the event. This file contains enough information about the article that another research team can retrieve it from a news article repository, including the name of the repository and the article’s identification code within it. This file can also be used to examine the co-occurrence of events within an article. Required.
3. An article file that contains the article ID number; full publication information to permit retrieval by others; the name of the news repository from which the article was retrieved and its identification code within that repository; and additional fields about each article, including the coder’s description of the article’s content, the number of events it described, and metadata such as the article’s length, its headline, or whether it included illustrations. Research teams will usually keep the full text of each article in this file for their own use, but copyright restrictions usually prevent public release of a full-text version. Optional.
4. A cleaned version of the full event-article file that preserves which details about the event appeared in each article and allows retrieval of the coding decisions made to reconcile inconsistencies. Additional text documentation describes the coding protocols and decision rules for the process of getting from news articles to the released datasets. Optional.
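The relational structure of the two required files can be sketched as a simple join. Field names, identifiers, and the repository name below are hypothetical illustrations, not a prescribed schema.

```python
from collections import defaultdict

# File 1 (sketch): one consolidated record per event.
event_file = [
    {"event_id": "E00001", "date": "1965-03-07", "n_articles": 2},
    {"event_id": "E00002", "date": "1965-03-09", "n_articles": 1},
]

# File 2 (sketch): article-event crosswalk, one row per (event, article)
# pair, with enough information to retrieve the article from a repository.
crosswalk = [
    {"event_id": "E00001", "article_id": "A1", "repository": "ProQuest", "repo_code": "doc-111"},
    {"event_id": "E00001", "article_id": "A2", "repository": "ProQuest", "repo_code": "doc-222"},
    {"event_id": "E00002", "article_id": "A2", "repository": "ProQuest", "repo_code": "doc-222"},
]

# Merge for analysis: attach event fields to every event-article pair.
events_by_id = {e["event_id"]: e for e in event_file}
merged = [{**row, **events_by_id[row["event_id"]]} for row in crosswalk]

# Co-occurrence: which articles describe more than one event?
events_in_article = defaultdict(set)
for row in crosswalk:
    events_in_article[row["article_id"]].add(row["event_id"])
co_occurring = {a: ids for a, ids in events_in_article.items() if len(ids) > 1}
```

Whether the files live in a relational database or as flat files, the shared event and article IDs are what make this merge, and the co-occurrence analysis, possible for any later research team.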