There are currently few datasets appropriate for training and evaluating models for Conversational Information Seeking (CIS). The main aim of TREC CAsT is to advance research on conversational search systems. The goal of the track is to create a reusable benchmark for open-domain, information-centric conversational dialogues.
The track will run in 2019 and establish a concrete and standard collection of data with information needs to make systems directly comparable.
This is the first year of TREC CAsT, which will run as a track in TREC. This year we aim to focus on candidate information ranking in context:
- Read the dialogue context: Track the evolution of the information need in the conversation, identifying salient information needed for the current turn in the conversation
- Retrieve Candidate Response Information: Perform retrieval over a large collection of paragraphs (or knowledge base content) to identify relevant information
- Year 1 task guidelines
- Comments and feedback are welcome.
- Training topics year 1 V1.0 - 30 example training topics
- Coming soon: Partial judgment data for a subset of training topics
- Additional resources: MS MARCO Conversational Search Sessions. The conversational search data and training data have been released.
- The corpus is a combination of three standard TREC collections: MARCO Ranking passages, Wikipedia (TREC CAR), and News (Washington Post)
- The MS MARCO Passage Ranking collection
- The TREC CAR paragraph collection v2.0
- The TREC Washington Post Corpus. Note: this requires an organizational agreement.
- The combined document id is [collection_id_paragraph_id], with the collection id and paragraph id separated by an underscore.
- The collection ids are in the set: {MARCO, CAR, WAPO}.
- The paragraph ids are the standard ids provided by MARCO and CAR. For WAPO the paragraph id is [article_id-paragraph_index], where article_id and paragraph_index are separated by a single dash and paragraph_index is the 0-based position index of the paragraph using the provided paragraph markup.
- Example WAPO combined document id: [WAPO_903cc1eab726b829294d1abdd755d5ab-1]. Example CAR combined document id: [CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a].
- TREC-CAsT Tools repository with code and scripts for processing data.
- Note: This will evolve over time; it currently contains topic definition files.
- Year 1 planning information
- Comments and feedback are welcome.
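The combined document id format described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not part of the official TREC-CAsT tools: the names `make_doc_id`, `parse_doc_id`, `split_wapo_paragraph_id`, and `DocId` are hypothetical, and the sketch assumes that collection ids never contain an underscore and that the WAPO paragraph index is the part after the last dash.

```python
from typing import NamedTuple, Tuple

# Collection ids listed in the guidelines.
COLLECTION_IDS = {"MARCO", "CAR", "WAPO"}


class DocId(NamedTuple):
    """Hypothetical container for the two parts of a combined document id."""
    collection: str
    paragraph_id: str


def make_doc_id(collection: str, paragraph_id: str) -> str:
    """Join a collection id and a paragraph id with a single underscore."""
    if collection not in COLLECTION_IDS:
        raise ValueError(f"unknown collection id: {collection!r}")
    return f"{collection}_{paragraph_id}"


def parse_doc_id(doc_id: str) -> DocId:
    """Split a combined id at the first underscore (assumption: the
    collection id itself contains no underscore)."""
    collection, sep, paragraph_id = doc_id.partition("_")
    if not sep or not paragraph_id or collection not in COLLECTION_IDS:
        raise ValueError(f"malformed document id: {doc_id!r}")
    return DocId(collection, paragraph_id)


def split_wapo_paragraph_id(paragraph_id: str) -> Tuple[str, int]:
    """Split a WAPO paragraph id into (article_id, paragraph_index).

    The split is made at the last dash, so an article id that itself
    contains dashes is kept whole (an assumption of this sketch).
    """
    article_id, sep, index = paragraph_id.rpartition("-")
    if not sep or not article_id or not index.isdigit():
        raise ValueError(f"malformed WAPO paragraph id: {paragraph_id!r}")
    return article_id, int(index)


if __name__ == "__main__":
    parsed = parse_doc_id("WAPO_903cc1eab726b829294d1abdd755d5ab-1")
    print(parsed.collection, split_wapo_paragraph_id(parsed.paragraph_id))
```

Splitting at the first underscore and the last dash keeps the helpers independent of the internal structure of the MARCO, CAR, and WAPO identifiers.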
Information Needs
- ~50-100 topics with manually defined trajectories
- Start from initial general topic
- Conversation evolves across ‘realistic’ facets for ~10 turns
- Manually created topics from crowdsourcing
- May 23: Training data released
- April 18th: Guidelines released
- November 13: Announcement that the track will run next year.
- March 19: Sample topic data for conversational and MARCO sessions available
- May 1st: Track guidelines are released
- Twitter: @treccast
- Slack: treccast.slack.com
- Google groups [email protected]
- Training data release: May 23rd
- Test topic release: June 12th
- Run submission: August 16th
Forthcoming
- Jeff Dalton, University of Glasgow
- Chenyan Xiong, Microsoft Research
- Jamie Callan, Carnegie Mellon University
- Laura Dietz, University of New Hampshire
- Jimmy Lin, University of Waterloo
- Julia Kiseleva, Microsoft Research
- Vanessa Murdock, Amazon Research
- Paul Bennett, Microsoft Research
- Zhiting Hu, CMU
- Anton Leuski, USC