IBM Project Debater

Debater Datasets

This page allows you to download copies of the Project Debater Datasets.

The datasets are released under the following licensing and copyright terms, unless specified otherwise in their release notes:

To download, please fill in the request forms below.

Other datasets are expected to be released over time.


Ranit Aharonov, Manager, Project Debater team, IBM Research - Haifa

Noam Slonim, Principal Investigator, Project Debater team, IBM Research - Haifa

Project Debater Datasets

The development of an automatic debating system naturally involves advancing research in a range of artificial intelligence fields. This page presents several annotated datasets developed as part of Project Debater to facilitate this research. It is organized by the research sub-fields explained below.

Argument Mining is a prominent research frontier. Within this field, we distinguish between Argument Detection, the detection and segmentation of argument components such as claims and evidence, and Argument Stance Classification, which determines the polarity of an argument component with respect to a given topic.

Beyond argument mining, a debating system must handle interactivity, i.e., the ability to understand and rebut the opponent’s speech. Debate Speech Analysis is a new research field that focuses on this challenge.

Another important aspect of a debating system is the ability to interact with its surroundings in a human-like manner. Namely, it should be able to articulate arguments and listen to arguments made by others. Regarding the former, the Text-to-Speech system must demonstrate human-like expressiveness to keep human listeners engaged. The latter may call for Speech-to-Text systems designed specifically for a debating scenario.

Finally, a debating system naturally relies on more fundamental NLP capabilities. One example is the ability to assess the semantic relatedness of various pieces of text and glue them into a coherent narrative. The system should also be able to identify the basic concepts mentioned in the text. The benchmark data released so far in this context are described in the section on Basic NLP Tasks.

  1. Argument Detection

    The various argument detection datasets differ in size (e.g., number of topics), type of element detected (claims, claim sentences, or evidence), and method used for detection (pre-selected articles vs. automatic retrieval). The table below lists the different datasets and provides information on their characteristics:

    | Dataset | Reference | Topics | Element | Method |
    | | | 150 (70 train, 30 held-out, 50 test) | Claim Sentence | Automatically retrieved Wikipedia sentences |
    | | | 118 (83 train, 35 test) | | Automatically retrieved Wikipedia sentences |
    | | | 58 (leave one topic out) | | Pre-selected Wikipedia articles |
    | | | 33 (leave one topic out) | | Pre-selected Wikipedia articles |
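
The "leave one topic out" protocol listed for the smaller datasets can be sketched as follows. This is a minimal illustration, assuming each example is a (topic, text, label) tuple; that record layout and the toy records are hypothetical and do not reflect the released file format.

```python
from collections import defaultdict


def leave_one_topic_out(examples):
    """Yield (held_out_topic, train, test), holding out one topic per fold.

    `examples` is a list of (topic, text, label) tuples. This layout is an
    assumption made for illustration, not the format of the released files.
    """
    by_topic = defaultdict(list)
    for example in examples:
        by_topic[example[0]].append(example)
    for held_out in sorted(by_topic):
        test = by_topic[held_out]
        # Train on every example whose topic differs from the held-out one.
        train = [ex for topic, items in by_topic.items()
                 if topic != held_out for ex in items]
        yield held_out, train, test


# Toy records (hypothetical, not taken from the datasets).
examples = [
    ("topic A", "sentence 1", 1),
    ("topic A", "sentence 2", 0),
    ("topic B", "sentence 3", 1),
    ("topic C", "sentence 4", 0),
]

folds = list(leave_one_topic_out(examples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Splitting by topic rather than by sentence keeps every sentence of the evaluated topic out of training, so the score reflects performance on an unseen topic.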

  2. Argument Stance Classification and Sentiment Analysis

    A debating system must distinguish between arguments that support its side in the debate and those supporting the opponent’s side. The following datasets were developed as part of the work on Project Debater’s stance classification engine.

    1. Claim Stance

      The claim stance dataset includes stance annotations for claims, as well as auxiliary annotations for intermediate stance classification subtasks.

      Dataset Reference Topics Number of Claims Method



      Manually identified and annotated claims from Wikipedia

    2. Sentiment Analysis

      Sentiment analysis is an important sub-component of our stance classification engine. The following two resources address sentiment analysis of complex expressions, which goes beyond simple aggregation of word-level sentiments. The first resource is a sentiment lexicon of idiomatic expressions, like “on cloud nine” and “under fire”. The second resource addresses sentiment composition – predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury”, both “reduced” and “fresh” are followed by a negative word. However, “reduced” flips the negative polarity, resulting in a positive phrase, while “fresh” propagates the negative polarity to the phrase level, resulting in a negative phrase. Accordingly, “reduced” is part of our “reversers” lexicon, and “fresh” is part of the “propagators” lexicon.

      | Dataset | Reference | Content | Source |
      | | | 5,000 frequently occurring idioms with sentiment annotation | Manually annotated idioms from Wiktionary |
      | | | Sentiment composition lexicons containing 2,783 words, and sentiment lexicons containing 66K unigrams and 262K bigrams | Automatically learned from a large proprietary English corpus |
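
The reverser/propagator distinction described above can be sketched in a few lines. This is a toy illustration only: the three small lexicons below are hypothetical stand-ins built from the examples in the text, and the fallback for unknown modifiers is an assumption, not the behavior of the released resources.

```python
# Hypothetical toy lexicons built from the examples in the text;
# the released lexicons are far larger.
WORD_SENTIMENT = {"bureaucracy": -1, "injury": -1}
REVERSERS = {"reduced"}
PROPAGATORS = {"fresh"}


def compose_sentiment(modifier, head):
    """Return +1, -1, or 0 for the two-word phrase `modifier head`."""
    head_polarity = WORD_SENTIMENT.get(head, 0)
    if modifier in REVERSERS:
        # A reverser flips the polarity of the word it modifies.
        return -head_polarity
    if modifier in PROPAGATORS:
        # A propagator passes the polarity up to the phrase level.
        return head_polarity
    # Assumption: an unknown modifier leaves the head polarity unchanged.
    return head_polarity


print(compose_sentiment("reduced", "bureaucracy"))
print(compose_sentiment("fresh", "injury"))
```

With these toy lexicons, "reduced bureaucracy" comes out positive and "fresh injury" negative, matching the examples in the text.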

    3. Expert Stance

      Expert evidence (premise) is a commonly used type of argumentation scheme. Prior knowledge about the expert’s stance towards the debate topic can help predict the polarity of such arguments. For example, an argument made by Richard Dawkins about atheism is likely to have a PRO stance, since Dawkins is a well-known atheist. Such information can be extracted from Wikipedia categories: Dawkins, for instance, is listed under “Antitheists”, “Atheism activists”, “Atheist feminists” and “Critics of religions”. The Wikipedia Category Stance dataset contains stance annotations of Wikipedia categories towards Wikipedia concepts representing controversial topics.

      Dataset Reference Topics Number of Categories Method



      Manually annotated Wikipedia Categories
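
One way category-level stance annotations could yield a prior for an expert is sketched below. The dictionaries, the stance values, and the simple vote-summing rule are all assumptions made for illustration; they are not taken from the dataset or from Project Debater's actual engine.

```python
# Hypothetical annotations: (category, topic) -> +1 (PRO) or -1 (CON).
# The values are illustrative and not copied from the dataset.
CATEGORY_STANCE = {
    ("Atheism activists", "Atheism"): 1,
    ("Antitheists", "Atheism"): 1,
}

# Categories an expert is listed under (taken from the example in the text).
EXPERT_CATEGORIES = {
    "Richard Dawkins": [
        "Antitheists",
        "Atheism activists",
        "Atheist feminists",
        "Critics of religions",
    ],
}


def expert_prior(expert, topic):
    """Sum category stances toward `topic` and map the total to a label."""
    votes = sum(
        CATEGORY_STANCE.get((category, topic), 0)
        for category in EXPERT_CATEGORIES.get(expert, [])
    )
    if votes > 0:
        return "PRO"
    if votes < 0:
        return "CON"
    return "UNKNOWN"


print(expert_prior("Richard Dawkins", "Atheism"))
```

Categories without an annotation for the topic contribute nothing, so an expert with no annotated categories falls back to "UNKNOWN".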

  3. Debate Speech Analysis

    In order to respond to an opponent’s speech, the system must process the opponent’s voice and ‘understand’ its content. The provided dataset focuses on the Automatic Speech Recognition (ASR) component.

    Dataset Reference Speeches Topics Source



    Recordings of 10 expert debaters

  4. Expressive Text to Speech

    The emphasized words dataset was created to train and evaluate a system that receives a written argumentative speech and predicts which words should be emphasized by the Text-to-Speech component.

    Dataset Reference Number of Paragraphs Number of Sentences Source



    The speeches were created based on claims/evidence automatically detected from Wikipedia

  5. Basic NLP Tasks

    The following datasets relate to basic NLP tasks, addressed as part of Project Debater.

    1. Semantic Relatedness

      Predicting semantic relatedness between texts is a basic NLP problem with a wide variety of applications. Relatedness can be measured between several types of texts, ranging from words to documents. The relatedness datasets listed below differ in the type of elements considered (words, multi-word-terms, and concepts), number of topics from which the pairs were extracted, and number of annotated pairs.

      | Dataset | Reference | Number of Topics | Type of Elements | Number of Pairs |
      | | | 143 (82 train, 41 test) | Wikipedia Entities | 19,276 (12,969 train, 6,307 test) |
      | | | | Words and Multi-word Terms | |

    2. Mention Detection

      The goal of Mention Detection is to map entities/concepts mentioned in text to the correct concept in a knowledge base. This process involves segmenting the text (as some concepts span multiple words) and the disambiguation of terms with more than one meaning.

      | Dataset | Reference | Number of Sentences | Number of Topics | Source |
      | | | 3,000 (500 train and 500 test for each of the three text sources) | | Mix of Wikipedia articles and ASR/manual transcripts of speeches by expert debaters |
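
The two steps named above, segmentation and disambiguation, can be illustrated with a toy sketch. It assumes a tiny hypothetical knowledge base, greedy longest-match segmentation, and disambiguation by counting overlapping context words; none of this is claimed to be the method used in Project Debater.

```python
# Hypothetical mini knowledge base: surface form -> {concept: context cue words}.
KB = {
    "death penalty": {"Capital punishment": set()},
    "penalty": {
        "Penalty (sports)": {"goal", "kick"},
        "Sanction (law)": {"court", "law"},
    },
}
MAX_SPAN = 2  # longest surface form in the toy KB, in tokens


def detect_mentions(text):
    """Return (surface form, concept) pairs found by greedy longest match."""
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        # Segmentation: try the longest span first, since concepts may span words.
        for n in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface in KB:
                candidates = KB[surface]
                context = set(tokens[:i]) | set(tokens[i + n:])
                # Disambiguation: pick the concept whose cue words overlap
                # the surrounding words the most.
                concept = max(candidates, key=lambda c: len(candidates[c] & context))
                mentions.append((surface, concept))
                i += n
                break
        else:
            i += 1
    return mentions


print(detect_mentions("the court upheld the death penalty"))
print(detect_mentions("the court imposed a penalty"))
```

Trying longer spans first keeps "death penalty" from being split into a separate match on "penalty".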

    Debater Datasets - Licensing Notice

    Each copy or modified version that you distribute must include a licensing notice stating that the work is released under CC-BY-SA and either a) a hyperlink or URL to the text of the license or b) a copy of the license. For this purpose, a suitable URL is:


IBM Unraveling Language Patterns

(GReedy Augmented Sequential Patterns)

IBM Unraveling Language Patterns is an algorithm for automatically extracting patterns that characterize subtle linguistic phenomena.
To that end, IBM Unraveling Language Patterns augments each term of input text with multiple layers of linguistic information. These different facets of the text terms are systematically combined to reveal rich patterns.
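
The idea of augmenting each term with several layers of information and combining them into patterns can be illustrated as follows. This sketch covers only augmentation and pattern matching; how patterns are selected is not shown. The layers, attribute names, and lookup tables are hypothetical and are not the tool's actual API.

```python
# Hypothetical hand-written layers; a real system would rely on NLP taggers
# and lexicons rather than these lookup tables.
POS = {"reduced": "VERB", "fresh": "ADJ", "bureaucracy": "NOUN", "injury": "NOUN"}
SENTIMENT = {"bureaucracy": "NEG", "injury": "NEG"}


def augment(tokens):
    """Attach a set of attributes, one per available layer, to each term."""
    augmented = []
    for token in tokens:
        attributes = {"text:" + token}
        if token in POS:
            attributes.add("pos:" + POS[token])
        if token in SENTIMENT:
            attributes.add("sent:" + SENTIMENT[token])
        augmented.append(attributes)
    return augmented


def matches(pattern, augmented):
    """True if each pattern attribute matches a run of consecutive terms."""
    n = len(pattern)
    return any(
        all(attr in augmented[i + j] for j, attr in enumerate(pattern))
        for i in range(len(augmented) - n + 1)
    )


# A pattern mixing two layers: an adjective followed by a negative word.
pattern = ["pos:ADJ", "sent:NEG"]
print(matches(pattern, augment("a fresh injury".split())))
print(matches(pattern, augment("reduced bureaucracy".split())))
```

Because a pattern element can come from any layer, one pattern can mix surface text, part of speech, and sentiment.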