The dataset for the paper "AI vs. Human -- Differentiation Analysis of Scientific Content Generation"

Abstract:

Recent neural language models have taken a significant step forward in producing remarkably controllable, fluent, and grammatical text. Although studies have found that AI-generated text is not distinguishable from human-written text for crowd-sourcing workers, there still exist errors in AI-generated text which are even subtler and harder to spot. We primarily focus on the scenario in which a scientific AI writing assistant is deeply involved. First, we construct a feature description framework to distinguish between AI-generated text and human-written text from syntax, semantics, and pragmatics based on the human evaluation. Then we utilize the features, i.e., writing style, coherence, consistency, and argument logistics, from the proposed framework to analyze two types of content. Finally, we adopt several publicly available methods to investigate the gap between AI-generated scientific text and human-written scientific text by AI-generated scientific text detection models. The results suggest that while AI has the potential to generate scientific content that is as accurate as human-written content, there is still a gap in terms of depth and overall quality. The AI-generated scientific content is more likely to contain factual errors. We find that there exists a "writing style" gap between AI-generated scientific text and human-written scientific text. Based on the analysis result, we summarize a series of model-agnostic and distribution-agnostic features for detection tasks in other domains. Findings in this paper contribute to guiding the optimization of AI models to produce high-quality content and addressing related ethical and security concerns.

Note:

This is version 2 of the paper "Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text".

Scientific Abstract Dataset

There are 4 "text_type" values in the file AI-vs-Human-2500-v1.xlsx:

"abstract" refers to the original abstracts of the papers
"gpt3_abs" refers to the abstracts generated by GPT-3
"polish_abs" refers to the abstracts polished by GPT-3
"chatgpt_abs" refers to the abstracts generated by ChatGPT

In our dataset, we label the AI-generated abstracts as "Fake" and the original abstracts as "Real".
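
The labeling rule above can be sketched in Python. This is a minimal sketch, not the authors' code: it assumes the spreadsheet has a column named "text_type" holding the four values listed above, and it assumes that GPT-3-polished abstracts ("polish_abs") count as AI-generated, which this README does not state explicitly. The names `TEXT_TYPE_TO_LABEL`, `label_for_text_type`, `load_abstracts`, and the `derived_label` column are hypothetical. The file may already contain its own label column, in which case that column should take precedence.

```python
# Labels follow the README: original abstracts are "Real",
# AI-generated abstracts are "Fake".
TEXT_TYPE_TO_LABEL = {
    "abstract": "Real",
    "gpt3_abs": "Fake",
    "chatgpt_abs": "Fake",
    # Assumption: abstracts polished by GPT-3 are treated as AI-generated.
    "polish_abs": "Fake",
}


def label_for_text_type(text_type: str) -> str:
    """Map a text_type value to its Real/Fake label."""
    try:
        return TEXT_TYPE_TO_LABEL[text_type]
    except KeyError:
        raise ValueError(f"unknown text_type: {text_type!r}") from None


def load_abstracts(path: str = "AI-vs-Human-2500-v1.xlsx"):
    """Read the spreadsheet and attach a derived label column.

    Assumes a "text_type" column exists; adjust the column name if the
    actual file differs.
    """
    # Imported here so the label helper works without pandas installed.
    # Reading .xlsx files with pandas also requires the openpyxl package.
    import pandas as pd

    df = pd.read_excel(path)
    df["derived_label"] = df["text_type"].map(label_for_text_type)
    return df
```

Calling `load_abstracts()` from the directory that contains the spreadsheet returns a DataFrame with one extra column, which can then be filtered by label or by text type.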

Wiki Item Dataset

There are 25 wiki item descriptions in the file wiki-0210-openai.xlsx.

Each item has an original description and a description generated by ChatGPT.

Cite our paper

@misc{ma2023ai,
  title={AI vs. Human--Differentiation Analysis of Scientific Content Generation},
  author={Ma, Yongqiang and Liu, Jiawei and Yi, Fan and Cheng, Qikai and Huang, Yong and Lu, Wei and Liu, Xiaozhong},
  journal={arXiv preprint arXiv:2301.10416},
  year={2023}
}
