Contributors: Mariano Maisonnave, Fernando Delbianco, Fernando Tohmé, Ana Maguitman and Evangelos Milios
DOI: 10.17632/7d54rvzxkr.1 (this dataset is also available in the Mendeley Data Repository at http://dx.doi.org/10.17632/7d54rvzxkr.1)
The present dataset was manually labeled for the task of Ongoing Event Detection (OED). It was used to develop a neural network that detects ongoing event mentions in news articles; a full description of this work can be found in [Maisonnave et al., 2020]. The OED task consists of identifying event triggers, i.e., the words that most clearly indicate the occurrence of an ongoing event. For this work, several text extracts were analyzed, and each word was manually labeled as an event trigger or a non-event trigger.
The dataset consists of 2,200 news extracts from The New York Times Annotated Corpus, separated into training (2,000 extracts) and testing (200 extracts) sets. Each news extract contains the plain text with its labels (event mentions), along with two metadata fields: the publication date and an identifier.
Label description
We consider as an event any ongoing real-world event or situation reported in a news article. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are merely recalled, future events, hypothetical events, and events that will not take place. In our dataset we labeled only the first type as events. Furthermore, our definition of ongoing event also covers states of affairs and state changes. Based on this criterion, some words that are typically considered events are labeled as non-event triggers when they do not refer to events that are ongoing at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. The word "devaluation" is labeled as a non-event trigger because the devaluation may not take place. Similarly, the word "weakening" is a non-event trigger because it denotes a hypothetical event. Finally, the word "crisis" is considered a non-event trigger because the news refers to a crisis from the past. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Data collection
We tokenized the full New York Times (NYT) archive using the spaCy NLP library and divided it into sentences using the spaCy sentence tokenizer. Afterward, we selected a subset of those sentences for labeling. We chose three episodes of real-world crises: the Mexican peso crisis of 1994, the Russian financial crisis of 1998, and the Asian financial crisis of 1997. We set up the Lucene search engine (with its default configuration) to retrieve sentences related to these three episodes and performed searches using keywords manually selected by experts, such as "Mexico", "Crisis", "Debt", "capital flight" and "devaluation". From the obtained results, we randomly selected two thousand sentences. We also randomly selected from these results a separate set of two hundred sentences for testing purposes.
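For illustration, the following Python sketch shows how the sentence-splitting and keyword-filtering steps could be reproduced. It assumes spaCy's "en_core_web_sm" model and replaces the Lucene index with a plain substring filter, so the keyword list, model name and sampling seed are illustrative assumptions rather than the exact configuration used to build the dataset.

import random
import spacy

# Illustrative keyword list; the full expert-selected list is not reproduced here.
KEYWORDS = {"mexico", "crisis", "debt", "capital flight", "devaluation"}

# Assumes the small English model is installed (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def candidate_sentences(article_text):
    """Split one article into sentences and keep those that match a keyword."""
    doc = nlp(article_text)
    for sent in doc.sents:
        lowered = sent.text.lower()
        if any(keyword in lowered for keyword in KEYWORDS):
            yield sent.text

# Hypothetical usage: sample candidate sentences from a list of plain-text articles.
articles = ["The devaluation of the peso deepened the crisis in Mexico in 1994."]
pool = [s for text in articles for s in candidate_sentences(text)]
random.seed(0)
sample = random.sample(pool, k=min(10, len(pool)))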
Data labeling
We developed a simple active learning tool to assist the labeling process of the training and validation data. This tool used an early prototype of the RNN model used for event prediction in [Maisonnave et al., 2020] to suggest labels. The tool marked each possible event trigger candidate in the text, while the users were in charge of correcting the suggestions by removing candidate event triggers proposed by the tool or adding event triggers that the system did not suggest. We employed a consensus-based approach to minimize errors during the labeling process: each sentence was presented to four users for labeling, along with the corresponding suggestions generated by the model, and the four users had to agree on keeping a suggested label, removing it, or adding a new one. The whole process took a total of fifteen sessions of approximately two hours each.
To avoid biasing the users' decisions when tagging events in the held-out test set, we adopted a different approach from the one used for labeling the training and validation sets. For the testing set, the RNN system was not used to generate suggestions; instead, the tool provided the four users with the raw sentences and no additional information, and the users had to reach a consensus on which words were event triggers.
Further details on the labeling process can be found in the annotation guidelines for the OED task.
Dataset structure
The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files and the second folder contains 200 XML files. Each XML file has the following format.
The first three tags (pubdate, file-id and sent-idx) contain metadata. The first one is the publication date of the news article from which the text extract was taken. The next two tags together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index that identifies the text extract inside the full article.
The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the <event></event> tags. Each of these tags surrounds one word that was manually labeled as an event trigger.
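As an illustration, the following Python sketch reads one of the XML files using only the tag names described above (pubdate, file-id, sent-idx, sentence and event). The example file name and the assumption that these tags are children of a single root element are hypothetical; the actual files in the training and testing folders should be consulted for the exact layout.

import xml.etree.ElementTree as ET

def read_extract(path):
    """Parse one text extract and return its metadata, plain text and event triggers."""
    root = ET.parse(path).getroot()
    pubdate = root.findtext("pubdate")    # publication date of the source article
    file_id = root.findtext("file-id")    # identifier of the news article
    sent_idx = root.findtext("sent-idx")  # index of the extract inside that article
    sentence = root.find("sentence")
    text = "".join(sentence.itertext())   # extract text with the <event> markup stripped
    triggers = [e.text for e in sentence.iter("event")]  # manually labeled event triggers
    return pubdate, file_id, sent_idx, text, triggers

# Hypothetical usage over the training folder:
# from pathlib import Path
# for xml_path in Path("training").glob("*.xml"):
#     pubdate, file_id, sent_idx, text, triggers = read_extract(xml_path)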
This research work was supported in part by CONICET (Argentina), a LARA Google Research grant, the Emerging Leaders in the Americas Program (ELAP-Canada) and Universidad Nacional del Sur (PGI-UNS 24/N051 and 24/E145).
CC BY 4.0 You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.