Include datasets in README.md.
shaoyijia committed Oct 3, 2024
1 parent 32f9b99 commit 3a3dc21
Showing 1 changed file with 13 additions and 1 deletion.
14 changes: 13 additions & 1 deletion README.md
@@ -285,8 +285,20 @@ If you have installed the source code, you can customize Co-STORM based on your
1. Co-STORM introduces multiple LLM agent types (i.e., Co-STORM experts and the Moderator). The LLM agent interface is defined in `knowledge_storm/interface.py`, while its implementation is instantiated in `knowledge_storm/collaborative_storm/modules/co_storm_agents.py`. Different LLM agent policies can be customized.
2. Co-STORM introduces a collaborative discourse protocol, with its core function centered on turn policy management. We provide an example implementation of turn policy management through `DiscourseManager` in `knowledge_storm/collaborative_storm/engine.py`. It can be customized and further improved.
## Datasets
To facilitate the study of automatic knowledge curation and complex information seeking, our project releases the following datasets:
### FreshWiki
The FreshWiki Dataset is a collection of 100 high-quality Wikipedia articles focusing on the most-edited pages from February 2022 to September 2023. See Section 2.1 of the [STORM paper](https://arxiv.org/abs/2402.14207) for more details.
You can download the dataset directly from [Hugging Face](https://huggingface.co/datasets/EchoShao8899/FreshWiki). To mitigate data contamination, we archive the [source code](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup/FreshWiki) of the data construction pipeline so that it can be rerun at future dates.
### WildSeek
To study users’ interests in complex information seeking tasks in the wild, we used data collected from the web research preview to create the WildSeek dataset. We downsampled the data to ensure the diversity of the topics and the quality of the data. Each data point is a pair comprising a topic and the user’s goal for conducting a deep search on the topic. For more details, please refer to Section 2.2 and Appendix A of the [Co-STORM paper](https://www.arxiv.org/abs/2408.15232).
The WildSeek dataset is available [here](https://huggingface.co/datasets/YuchengJiang/WildSeek).
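
Both datasets above are hosted as Hugging Face dataset repositories, so one way to fetch them programmatically is sketched below. This is a minimal sketch, not part of the project: the helper names `dataset_url` and `download_dataset` and the `data` target directory are hypothetical, and it assumes `huggingface_hub` is installed (`pip install huggingface_hub`). The file layout inside each repository is not described here, so the sketch only downloads the raw files and does not parse them.

```python
from pathlib import Path

# Repository IDs copied from the Hugging Face links above.
DATASET_REPOS = {
    "FreshWiki": "EchoShao8899/FreshWiki",
    "WildSeek": "YuchengJiang/WildSeek",
}


def dataset_url(name: str) -> str:
    """Return the Hugging Face page URL for one of the datasets (hypothetical helper)."""
    return f"https://huggingface.co/datasets/{DATASET_REPOS[name]}"


def download_dataset(name: str, target_dir: str = "data") -> Path:
    """Download all files of a dataset repository into target_dir/name.

    Hypothetical helper; needs network access and the huggingface_hub package.
    """
    # Imported lazily so dataset_url works without the dependency installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(target_dir) / name
    snapshot_download(
        repo_id=DATASET_REPOS[name],
        repo_type="dataset",  # these are dataset repositories, not model repositories
        local_dir=local_dir,
    )
    return local_dir


if __name__ == "__main__":
    import sys

    for name in DATASET_REPOS:
        print(name, dataset_url(name))
        # Only touch the network when explicitly asked to.
        if "--download" in sys.argv[1:]:
            print("saved to", download_dataset(name))
```

Running the script with `--download` saves each repository under `data/<name>`; without the flag it only prints the dataset URLs.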
## Replicate STORM & Co-STORM paper results
For STORM paper experiments, please switch to the branch `NAACL-2024-code-backup` [here](https://github.com/stanford-oval/storm/tree/NAACL-2024-code-backup).