Add Documentation for FORA, NPR-2p, and FOMC Corpus and Update Related Datasets and Config Files (#238)

Co-authored-by: seanzhangkx8 <[email protected]>
Co-authored-by: Yash Chatha <[email protected]>
Co-authored-by: Laerdon Kim <[email protected]>
4 people authored Nov 14, 2024
1 parent 7f2bced commit d1dcc52
Showing 11 changed files with 3,699 additions and 1 deletion.
20 changes: 19 additions & 1 deletion README.md
@@ -10,7 +10,7 @@
[![Discord Community](https://img.shields.io/static/v1?logo=discord&style=flat&color=red&label=discord&message=community)](https://discord.gg/WMFqMWgz6P)


This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a [single unified interface](https://convokit.cornell.edu/documentation/architecture.html) inspired by (and compatible with) scikit-learn. Several large [conversational datasets](https://github.com/CornellNLP/ConvoKit#datasets) are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is [3.0.1](https://github.com/CornellNLP/ConvoKit/releases/tag/v3.0.1) (released Nov. 8, 2024); follow the [project on GitHub](https://github.com/CornellNLP/ConvoKit) to keep track of updates.
This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a [single unified interface](https://convokit.cornell.edu/documentation/architecture.html) inspired by (and compatible with) scikit-learn. Several large [conversational datasets](https://github.com/CornellNLP/ConvoKit#datasets) are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is [3.0.1](https://github.com/CornellNLP/ConvoKit/releases/tag/v3.0.1) (released November 8, 2024); follow the [project on GitHub](https://github.com/CornellNLP/ConvoKit) to keep track of updates.

Read our [documentation](https://convokit.cornell.edu/documentation) or try ConvoKit in our [interactive tutorial](https://colab.research.google.com/github/CornellNLP/ConvoKit/blob/master/examples/Introduction_to_ConvoKit.ipynb).

@@ -137,6 +137,24 @@ A collection of all the conversations that occurred over 10 seasons of Friends,

Name for download: `friends-corpus`

### [Federal Open Market Committee (FOMC) Corpus](https://convokit.cornell.edu/documentation/fomc.html)

Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008.

Name for download: `fomc-corpus`

### [NPR Interview 2P Dataset Corpus](https://convokit.cornell.edu/documentation/npr-2p.html)

This corpus contains conversations between NPR show hosts and their guests.

Name for download: `npr-2p-corpus`

### [DeliData Dataset Corpus](https://convokit.cornell.edu/documentation/deli.html)

This corpus contains conversations in multi-party problem-solving contexts, along with information about group discussions and team performance.

Name for download: `deli-corpus`

### [Switchboard Dialog Act Corpus](https://convokit.cornell.edu/documentation/switchboard.html)

A collection of 1,155 five-minute telephone conversations between two participants, annotated with speech act tags.
4 changes: 4 additions & 0 deletions docs/source/datasets.rst
@@ -27,3 +27,7 @@ Datasets
Supreme Court Oral Arguments Dataset <supreme.rst>
Wikipedia Articles for Deletion Dataset <wiki-articles-for-deletion-corpus.rst>
CaSiNo Corpus <casino-corpus.rst>
NPR Interviews 2P Corpus <npr-2p.rst>
Federal Open Market Committee Corpus <fomc.rst>
FORA Corpus <fora.rst>
DeliData Corpus <deli.rst>
90 changes: 90 additions & 0 deletions docs/source/deli.rst
@@ -0,0 +1,90 @@
DeliData Corpus
===============

DeliData is a dataset designed for analyzing deliberation in multi-party problem-solving contexts. It contains information about group discussions, capturing various aspects of participant interactions, message annotations, and team performance.

The corpus is available upon request from the authors, and a ConvoKit-compatible version can be derived using ConvoKit’s conversion tools. ConvoKit also hosts a ConvoKit-format version of the corpus, which can be downloaded directly by following the instructions in the Usage section.

For a full description of the dataset collection and potential applications, please refer to the original publication: `Karadzhov, G., Stafford, T., & Vlachos, A. (2023). DeliData: A dataset for deliberation in multi-party problem solving. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW2), 1-25.`

Dataset details
---------------

All ConvoKit metadata attributes retain the original names used in the dataset.

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

Metadata for each speaker includes the following fields:

* speaker: Identifier or pseudonym of the speaker.

Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each utterance includes:

* id: Unique identifier for an utterance.
* conversation_id: Identifier for the conversation that the utterance belongs to.
* reply_to: Identifier for the previous utterance in the conversation, if any (null if not a reply).
* speaker: Name or pseudonym of the utterance speaker.
* text: Normalized textual content of the utterance with applied tokenization and masked special tokens.
* timestamp: Null for the entirety of this corpus.

Metadata for each utterance includes:

* annotation_type: Type of utterance deliberation, if annotated (e.g., "Probing" or "Non-probing deliberation"). If unannotated, may be null.
* annotation_target: Target annotation, indicating the intended focus of the message, such as "Moderation" or "Solution." May be null if not annotated.
* annotation_additional: Any additional annotations indicating specific deliberative actions (e.g., "complete_solution"). May be null if not annotated.
* message_type: Type of message, categorized as INITIAL, SUBMIT, or MESSAGE, indicating its function in the dialogue.
* original_text: Original text as said in the collected conversation. For the INITIAL type, contains the list of participants and the cards presented. For the SUBMIT type, contains the cards submitted.
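
As a rough illustration of how these fields might be used, the sketch below tallies the values of one metadata field across utterances. It assumes only that each utterance's metadata behaves like a dictionary; the helper name ``tally_meta_field`` and the stand-in values are hypothetical and not part of ConvoKit or the corpus.

```python
from collections import Counter


def tally_meta_field(metas, field):
    """Count the values of one metadata field across utterances.

    `metas` is any iterable of dict-like metadata objects, one per
    utterance. Missing or null values are counted under the key None.
    """
    return Counter(meta.get(field) for meta in metas)


# Stand-in metadata for three utterances (illustrative values only,
# not taken from the corpus).
example_metas = [
    {"message_type": "INITIAL", "annotation_type": None},
    {"message_type": "MESSAGE", "annotation_type": "Probing"},
    {"message_type": "MESSAGE", "annotation_type": None},
]
message_type_counts = tally_meta_field(example_metas, "message_type")
```

With a loaded corpus, the same helper could be called as ``tally_meta_field((utt.meta for utt in corpus.iter_utterances()), "message_type")``.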

Conversation-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each conversation we provide:

* id: Unique identifier of the conversation.

Metadata for each conversation includes:

* team_performance: Approximate performance of the team based on user submissions and solution mentions, ranging from 0 to 1, where 1 indicates all participants selected the correct solution.
* sol_tracker_message: Extracted solution from the current message content.
* sol_tracker_all: Up-to-date "state-of-mind" for each of the participants, i.e., an approximation of what each participant thinks the correct solution is at a given timestep. This is based on initial solutions, submitted solutions, and solution mentions. The team_performance value is calculated from this field.
* performance_change: Change in team performance relative to the previous utterance.
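
A minimal sketch of summarizing ``team_performance`` across conversations follows. It assumes the field holds a single number between 0 and 1 per conversation, as listed above, and that null values should be skipped; ``summarize_team_performance`` is a hypothetical helper, and the scores are stand-in values.

```python
from statistics import mean


def summarize_team_performance(scores):
    """Return (mean score, share of conversations scoring exactly 1).

    `scores` is an iterable of `team_performance` values in [0, 1];
    null (None) values are skipped. Returns (None, None) if nothing
    usable remains.
    """
    valid = [score for score in scores if score is not None]
    if not valid:
        return None, None
    perfect_share = sum(1 for score in valid if score == 1) / len(valid)
    return mean(valid), perfect_share


# Stand-in scores for four conversations (illustrative values only).
average, perfect_share = summarize_team_performance([0.5, 1.0, None, 0.0])
```

With a loaded corpus, the scores could be gathered with ``(convo.meta.get("team_performance") for convo in corpus.iter_conversations())``.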

Usage
-----

Convert the DeliData Corpus into ConvoKit format using the following notebook: `Converting DeliData to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/DELI/ConvoKit_DeliData_Conversion.ipynb>`_

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("deli-corpus"))


For some quick stats:

>>> corpus.print_summary_stats()

* Number of Speakers: 30
* Number of Utterances: 17111
* Number of Conversations: 500

Additional note
---------------
Data License
^^^^^^^^^^^^

ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.

Contact
^^^^^^^

Questions regarding the DeliData corpus should be directed to Georgi Karadzhov ([email protected]).

Files
^^^^^^^

Request the officially released DeliData Corpus (without ConvoKit formatting): https://delibot.xyz/delidata
67 changes: 67 additions & 0 deletions docs/source/fomc.rst
@@ -0,0 +1,67 @@
Federal Open Market Committee (FOMC) Corpus
===========================================

Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008. The corpus contains 108,504 conversational exchanges among 364 speakers (FOMC members) in 268 meetings.

Distributed together with:
`Talk it up or play it down? (Un)expected correlations between (de-)emphasis and recurrence of discussion points in consequential U.S. economic policy meetings <https://chenhaot.com/papers/de-emphasis-fomc.html>`_. Chenhao Tan and Lillian Lee. Presented in Text As Data 2016.

Dataset details
---------------

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

Speakers in this dataset are FOMC members, indexed by their name as recorded in the transcripts.

* id: name of the speaker
* chair: (boolean) whether the speaker is the FOMC Chair
* vice_chair: (boolean) whether the speaker is the FOMC Vice-Chair

Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each utterance, we provide:

* id: index of the utterance (concatenating the meeting date with the utterance’s sequence position)
* speaker: the speaker who authored the utterance
* conversation_id: ID of the meeting
* reply_to: id of the sequentially prior utterance (None for the first utterance of a meeting)
* text: textual content of the utterance
* timestamp: value calculated from the date of the meeting and the speech index

Metadata for each utterance includes:

* speech_index: index of the utterance within the conversation
* parsed: parsed version of the utterance text, represented as a spaCy Doc
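
Because ``reply_to`` always points at the sequentially prior utterance, each meeting forms a single linear chain. The sketch below shows one way such a chain could be walked to recover the utterance order. ``order_by_reply_chain`` is a hypothetical helper, and the ids ``"a"``, ``"b"``, ``"c"`` are placeholders rather than real FOMC utterance ids.

```python
def order_by_reply_chain(utterances):
    """Order one meeting's utterances by following `reply_to` links.

    `utterances` is an iterable of (utterance_id, reply_to) pairs from a
    single conversation, where `reply_to` is None for the first utterance
    and otherwise the id of the sequentially prior utterance.
    """
    # Map each utterance id to the id of the utterance that replies to it.
    next_of = {reply_to: utt_id for utt_id, reply_to in utterances}
    ordered = []
    current = next_of.get(None)  # the first utterance replies to nothing
    # The length check guards against malformed (cyclic) reply links.
    while current is not None and len(ordered) < len(next_of):
        ordered.append(current)
        current = next_of.get(current)
    return ordered


# Placeholder ids, deliberately given out of order.
ordered_ids = order_by_reply_chain([("c", "b"), ("a", None), ("b", "a")])
```

With a loaded corpus, the pairs for one meeting could be produced with ``((utt.id, utt.reply_to) for utt in convo.iter_utterances())``.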

Conversational-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Conversations are indexed by a string representing the meeting date.

Usage
-----------

Convert the FOMC Corpus into ConvoKit format using this notebook: `Converting FOMC Corpus to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/FOMC/fomc_to_convokit.ipynb>`_

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("fomc-corpus"))


For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 364
Number of Utterances: 108504
Number of Conversations: 268


Additional note
---------------

The original dataset can be downloaded `here <https://chenhaot.com/pages/de-emphasis-fomc.html>`_. Refer to the original README for more details on how the dataset was constructed.

Contact
^^^^^^^

Please email any questions to: [email protected] (Cristian Danescu-Niculescu-Mizil).
110 changes: 110 additions & 0 deletions docs/source/fora.rst
@@ -0,0 +1,110 @@
Fora Corpus
=============
The Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitated dialogues in which participants discuss issues such as education, elections, and public health, primarily by sharing personal experiences. The corpus is available by request from the authors (`https://github.com/schropes/fora-corpus <https://github.com/schropes/fora-corpus>`_), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.

A full description of the dataset can be found here: `Schroeder, H., Roy, D., & Kabbara, J. (2024). Fora: A corpus and framework for the study of facilitated dialogue. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13985–14001). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.754. <https://doi.org/10.18653/v1/2024.acl-long.754>`_
Please cite this paper when using Fora in your research.

Usage
-----

Request the Fora Corpus (transcripts only) from: `https://github.com/schropes/fora-corpus <https://github.com/schropes/fora-corpus>`_

Convert the Fora Corpus into ConvoKit format using this notebook: `Converting Fora Corpus to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/FORA/ConvoKit_Fora_Conversion.ipynb>`_

Dataset details
---------------

All ConvoKit metadata attributes preserve the names used in the original corpus, as detailed `here <https://github.com/schropes/fora-corpus>`_.

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

There are 1776 unique participants. Metadata for each speaker includes:

* speaker_name: Usually, first name or pseudonym of the speaker (str).
* is_fac: Whether the current speaker is a facilitator (boolean).
* location: Location of the conversation (str).
* source_type: Information about the type of audio input (str).

Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each utterance we provide:

* id: Unique identifier for an utterance.
* conversation_id: Utterance id corresponding to the first utterance of the conversation.
* reply_to: Utterance id of the previous utterance in the conversation.
* speaker: Speaker object corresponding to the author of this utterance.
* text: Textual content of the utterance.

Metadata for each utterance includes:

* original_index: Index of the original entry in the dataset.
* collection_title: String title of the collection.
* collection_id: Numeric identifier of the conversation collection.
* SpeakerTurn: Index of speaker turn within the conversation (1-indexed).
* audio_start_offset: In number of seconds, offset within the recording at which point the speaker turn begins.
* audio_end_offset: In number of seconds, offset within the recording at which point the speaker turn ends.
* duration: In number of seconds, duration of the speaker turn.
* conversation_id: Unique identifier for the conversation.
* speaker_id: Unique integer identifier of the speaker within the conversation. Speakers who participated in multiple conversations do not have a persistent speaker_id; these identifiers are unique to each conversation.
* speaker_name: Usually, first name or pseudonym of the speaker. This field may have been anonymized in cases where the last name was provided. Overall, it is reliable but may have occasional diarization errors.
* words: String of all words in the speaker turn.
* is_fac: Boolean representing whether the current speaker is a facilitator.
* cofacilitated: Boolean representing whether the current conversation has more than one facilitator.
* annotated: Boolean representing whether the conversation was annotated by human experts for facilitation strategies and personal sharing.
* start_time: Date of the conversation start time. Likely reliable as the date the conversation happened, but may be approximate due to potential delay in uploading.
* source_type: String providing information about the type of audio input (e.g., Zoom, Hearth, iPhone).
* location: Represents the location of the conversation, typically a town or neighborhood. About 1/3 of conversations do not have a value for this field and are marked "Unknown."
* Personal story: Binary label representing the presence of a "Personal story" as annotated by a human.
* Personal experience: Binary label representing the presence of a "Personal experience" as annotated by a human.
* Express affirmation: Binary label representing the presence of "Express affirmation" as annotated by a human.
* Specific invitation: Binary label representing the presence of "Specific invitation" as annotated by a human.
* Provide example: Binary label representing the presence of "Provide example" as annotated by a human.
* Open invitation: Binary label representing the presence of "Open invitation" as annotated by a human.
* Make connections: Binary label representing the presence of "Make connections" as annotated by a human.
* Express appreciation: Binary label representing the presence of "Express appreciation" as annotated by a human.
* Follow up question: Binary label representing the presence of "Follow up question" as annotated by a human.
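
As one possible use of the ``is_fac`` and ``duration`` fields, the sketch below totals speaking time separately for facilitators and other participants. It assumes the metadata is dict-like and that ``duration`` is a number of seconds; ``speaking_time_by_role`` is a hypothetical helper, and the values are stand-ins rather than corpus data.

```python
def speaking_time_by_role(metas):
    """Sum turn `duration` (seconds) for facilitators and for others.

    `metas` is an iterable of dict-like utterance metadata carrying the
    `is_fac` and `duration` fields; turns without a duration are skipped.
    """
    totals = {"facilitator": 0.0, "participant": 0.0}
    for meta in metas:
        duration = meta.get("duration")
        if duration is None:
            continue
        role = "facilitator" if meta.get("is_fac") else "participant"
        totals[role] += float(duration)
    return totals


# Stand-in metadata for three speaker turns (illustrative values only).
example_totals = speaking_time_by_role(
    [
        {"is_fac": True, "duration": 10.0},
        {"is_fac": False, "duration": 30.5},
        {"is_fac": False, "duration": 4.5},
    ]
)
```

With a converted corpus, the input could be ``(utt.meta for utt in corpus.iter_utterances())``.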

Conversation-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each Conversation, we provide the following metadata:

* collection_id: Numeric identifier of the conversation collection.

* conversation_id: Unique identifier for the conversation.

* cofacilitated: Boolean representing whether the current conversation has more than one facilitator.

* annotated: Boolean representing whether the conversation was annotated by human experts for facilitation strategies and personal sharing.

* start_time: Date of the conversation start time. Likely reliable as the date the conversation happened, but may be approximate due to potential delay in uploading.

* source_type: String providing information about the type of audio input (e.g., Zoom, Hearth, iPhone).

* location: Represents the location of the conversation, typically a town or neighborhood. About 1/3 of conversations do not have a value for this field and are marked "Unknown."
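
Since only some conversations carry human annotations, analyses of the facilitation labels will usually start by selecting those conversations. A minimal sketch, assuming dict-like conversation metadata; ``annotated_conversation_ids`` is a hypothetical helper and the ids are placeholders.

```python
def annotated_conversation_ids(conversations):
    """Return ids of conversations whose `annotated` flag is true.

    `conversations` is an iterable of (conversation_id, meta) pairs,
    where `meta` is the dict-like conversation-level metadata.
    """
    return [conv_id for conv_id, meta in conversations if meta.get("annotated")]


# Placeholder ids and flags (illustrative values only).
example_ids = annotated_conversation_ids(
    [
        (101, {"annotated": True}),
        (102, {"annotated": False}),
        (103, {"annotated": True}),
    ]
)
```

With a converted corpus, the pairs could be produced with ``((convo.id, convo.meta) for convo in corpus.iter_conversations())``.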


Statistics about the dataset
----------------------------

* Number of Speakers: 1776
* Number of Utterances: 39911
* Number of Conversations: 262

Additional note
---------------
Data License
^^^^^^^^^^^^

ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.

Contact
^^^^^^^

Questions about the conversion into ConvoKit format should be directed to Sean Zhang <[email protected]>

Questions about the Fora corpus should be directed to the corresponding authors Hope Schroeder <[email protected]>, Deb Roy <[email protected]>, and Jad Kabbara <[email protected]> of the original paper.