Debating Technologies Datasets

On this page you can download copies of the IBM Debating Technologies Datasets.

The datasets are released under the following licensing and copyright terms:

To download, please fill in the request form below.

Other datasets are expected to be released over time.

IBM Unraveling Language Patterns

(GReedy Augmented Sequential Patterns)

IBM Unraveling Language Patterns is an algorithm for automatically extracting patterns that characterize subtle linguistic phenomena.
To that end, IBM Unraveling Language Patterns augments each term of input text with multiple layers of linguistic information. These different facets of the text terms are systematically combined to reveal rich patterns.
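The idea of augmenting each term with several layers of information and then combining those layers into patterns can be illustrated with a toy sketch. This is not the released algorithm: the attribute layers (`text=`, `shape=`, `lex=`), the small negation list, and the restriction to adjacent token pairs are all assumptions made only for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical lexicon layer; a real system would use richer resources.
NEGATIONS = {"not", "no", "never"}

def augment(token):
    """Return the set of attribute labels for one token (toy layers only)."""
    attrs = {f"text={token.lower()}"}
    if token[0].isupper():
        attrs.add("shape=Capitalized")
    if token.isdigit():
        attrs.add("shape=Digits")
    if token.lower() in NEGATIONS:
        attrs.add("lex=negation")
    return attrs

def bigram_patterns(tokens):
    """Yield every attribute pair drawn from two adjacent tokens."""
    layers = [sorted(augment(t)) for t in tokens]
    for left, right in zip(layers, layers[1:]):
        yield from product(left, right)

def count_patterns(sentences):
    """Count, for each attribute pair, the number of sentences containing it."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(bigram_patterns(sentence.split())))
    return counts

counts = count_patterns(["IBM does not agree", "Critics do not agree"])
# The mixed-layer pattern "negation followed by the word 'agree'" matches both sentences.
print(counts[("lex=negation", "text=agree")])  # 2
```

Because each pattern element may come from a different layer, a pattern such as `("lex=negation", "text=agree")` generalizes over surface forms that a purely word-based n-gram would treat as unrelated.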

Manager

Ranit Aharonov, Manager, Debating Technologies

The following datasets are available:

Name Description Link to paper
10 speeches recorded by professional debaters about controversial topics, and their manual and automatic transcripts, in both raw and cleaned (processed) versions.

The dataset includes:
  1. Audio files of 10 debating speeches
  2. Manual and automatic transcripts of the speeches, raw and cleaned versions

4,603 Wikipedia categories and lists annotated for stance (Pro/Con) towards a concept, for a set of 132 concepts. These data were published by Toledo-Ronen et al. at the ACL-2016 Workshop on Argument Mining.

The dataset includes:
  1. ReleaseNotes.txt - release notes file describing the data
  2. WikipediaCategoriesResults.csv - the dataset
  3. WikipediaCategoriesLabeling.docx - the guidelines used for labeling the data

2,294 labeled claims and 4,690 labeled evidence for 58 different topics. These data were published by Rinott et al. at EMNLP-2015 and are an extension of the CE-ACL-2014 data.

The dataset includes:
  1. Two CSV files containing, for each topic, the claims and evidence that were identified for it in relevant Wikipedia articles.
  2. The original Wikipedia articles, taken from the April 2012 Wikipedia dump, as text files with all Wikisyntax and HTML markup removed.

Term-relatedness values for 9,856 pairs of terms. These data were published by Levy et al. at ACL-2015.

The dataset includes:
  1. Release Notes.txt - release notes describing the data
  2. TermRelatednessResults.csv - the dataset
  3. TermRelatednessLabeling.doc - the guidelines used for labeling the data

1,392 labeled claims for 33 different topics, and 1,291 labeled evidence for 350 distinct claims in 12 different topics. These data were published by Aharoni et al. at the First Workshop on Argumentation Mining at ACL-2014.

The dataset includes:
  1. Two CSV files containing, for each topic, the claims and evidence that were identified for it in relevant Wikipedia articles.
  2. The original Wikipedia articles, taken from the April 2012 Wikipedia dump, as text files with all Wikisyntax and HTML markup removed.
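
Several of the datasets above ship as CSV files. Since this page does not document their columns, a cautious first step is to inspect the header and a few rows before relying on any schema. The following is a minimal sketch using only the Python standard library; the helper name `summarize_csv` and the sample columns in the demonstration are placeholders, not the actual layout of any released file.

```python
import csv
import io

def summarize_csv(handle, preview_rows=3):
    """Return (column names, first few rows, total row count) for a CSV stream."""
    reader = csv.DictReader(handle)
    preview, total = [], 0
    for row in reader:
        if total < preview_rows:
            preview.append(row)
        total += 1
    return reader.fieldnames, preview, total

# Demonstration on an in-memory stand-in; the column names are invented placeholders.
sample = io.StringIO("col_a,col_b\nx,1\ny,2\n")
names, preview, total = summarize_csv(sample, preview_rows=1)
print(names, total)
```

For a downloaded file, the same helper can be given an open file handle, for example `open("TermRelatednessResults.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the release notes.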

Debating Technologies Datasets - Licensing Notice

Each copy or modified version that you distribute must include a licensing notice stating that the work is released under CC-BY-SA and either a) a hyperlink or URL to the text of the license or b) a copy of the license. For this purpose, a suitable URL is: http://creativecommons.org/licenses/by-sa/3.0/.