centre-for-humanities-computing/chicago_corpus


The Chicago Corpus

As part of the Fabula-NET project at the Center for Humanities Computing, Aarhus University, we present a dataset of quality judgments on 9,000 19th- and 20th-century English-language literary novels by 3,166 predominantly Anglophone authors.

The data includes annotations drawn from expert opinions and crowd-based resources, allowing comparative analyses between different evaluations of literary quality, as well as several textual metrics chosen for their connection with literary reception. A large part of the corpus is subject to copyright (see the available pre-1924 works here). We release quality and reception measures together with stylometric and sentiment data for each of the 9,000 novels to promote future research and comparison. Read the paper presenting this resource.


⚡ Data included

  • 9,000 titles
  • Author, title & year
  • Various textual metrics
  • Various reception metrics

For an overview of all included data, see the corpus documentation.

Available formats: .xlsx, .json


🔍 Example

| BOOK_ID | TITLE | AUTH_FIRST | AUTH_LAST | PUBL_DATE | ... | AVG_RATING | SCIFI_AWARDS | PULITZER | TRANSLATIONS | ... | PERPLEXITY | MEAN_SENT | READABILITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6913 | A Clash of Kings | George R. R. | Martin | 1999 | ... | 4.41 | 1 | 0 | 38 | ... | 79.97 | -0.002 | 92.73 |
| 20636 | Dune | Frank | Herbert | 1965 | ... | 4.25 | 1 | 0 | 398 | ... | 72.74 | -0.007 | 85.18 |
| 22741 | Beloved | Toni | Morrison | 1987 | ... | 3.92 | 0 | 1 | 68 | ... | 68.78 | 0.030 | 91.71 |
| 5778 | Misery | Stephen | King | 1987 | ... | 4.20 | 0 | 0 | 74 | ... | 68.09 | -0.032 | 82.54 |
| 86 | The Portrait of a Lady | Henry | James | 1881 | ... | 3.78 | 0 | 0 | 53 | ... | 80.35 | 0.150 | 71.65 |

Above: Example of titles and corresponding values for selected metrics
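
As a minimal sketch of working with records like the ones above, the snippet below parses a JSON string and runs two simple queries. The list-of-objects layout, the inline sample, and the helper names (`load_records`, `pulitzer_winners`, `top_rated`) are assumptions made for illustration; the released .json file may be organised differently, so check the corpus documentation before adapting it.

```python
import json

# Hypothetical sample mirroring the five example rows (a subset of columns).
# The real .json layout is not specified here and may differ.
SAMPLE_JSON = """
[
  {"BOOK_ID": 6913, "TITLE": "A Clash of Kings", "AUTH_LAST": "Martin",
   "PUBL_DATE": 1999, "AVG_RATING": 4.41, "PULITZER": 0},
  {"BOOK_ID": 20636, "TITLE": "Dune", "AUTH_LAST": "Herbert",
   "PUBL_DATE": 1965, "AVG_RATING": 4.25, "PULITZER": 0},
  {"BOOK_ID": 22741, "TITLE": "Beloved", "AUTH_LAST": "Morrison",
   "PUBL_DATE": 1987, "AVG_RATING": 3.92, "PULITZER": 1},
  {"BOOK_ID": 5778, "TITLE": "Misery", "AUTH_LAST": "King",
   "PUBL_DATE": 1987, "AVG_RATING": 4.20, "PULITZER": 0},
  {"BOOK_ID": 86, "TITLE": "The Portrait of a Lady", "AUTH_LAST": "James",
   "PUBL_DATE": 1881, "AVG_RATING": 3.78, "PULITZER": 0}
]
"""


def load_records(text):
    """Parse a JSON array of per-title records into a list of dicts."""
    return json.loads(text)


def pulitzer_winners(records):
    """Return the titles flagged as Pulitzer winners (PULITZER == 1)."""
    return [r["TITLE"] for r in records if r["PULITZER"] == 1]


def top_rated(records, n=3):
    """Return the n titles with the highest AVG_RATING, best first."""
    ranked = sorted(records, key=lambda r: r["AVG_RATING"], reverse=True)
    return [r["TITLE"] for r in ranked[:n]]


records = load_records(SAMPLE_JSON)
print(pulitzer_winners(records))
print(top_rated(records, n=2))
```

The same functions should work on the full dataset once it is loaded into a list of dicts with these column names.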


📈 Corpus statistics

The corpus of texts from which we constructed our dataset was assembled by Hoyt Long and Richard Jean So at the Textual Optics Lab. It encompasses 9088 novels published in the United States between 1880 and 2000 and was compiled according to the number of libraries holding each title (per the WorldCat catalogue), favoring works with more library holdings.


| Titles | Authors | Titles per author |
|---|---|---|
| 9088 | 3166 | 2.88 |

Above: Number of titles/authors in the corpus


Below: Mean & SD of some of the included features

| Metric | Wordcount | Sentence Length | Wordlength | Type/Token Ratio | Compressibility | Bigram Entropy | Word Entropy | Flesch Ease | Dale Chall New | Mean Sentiment | Std Sentiment | End Sentiment | Beginning Sentiment | Hurst Exponent | Approximate Entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean (µ) | 118584.71 | 86.56 | 3.67 | 0.69 | 2.92 | 14.63 | 9.69 | 82.70 | 5.10 | 0.03 | 0.35 | 0.03 | 0.04 | 0.61 | 1.75 |
| St. dev. (±) | 64746.05 | 29.44 | 0.18 | 0.02 | 0.14 | 0.55 | 0.30 | 6.48 | 0.33 | 0.04 | 0.04 | 0.07 | 0.05 | 0.04 | 0.15 |
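
Summary statistics of this kind can be recomputed per column. The sketch below does so for two columns using only the five example rows shown earlier in this README, so its results will not match the corpus-wide figures. It assumes the sample standard deviation (`statistics.stdev`); whether the reported figures use the sample or population formula is not stated here. The helper name `summarize` is hypothetical.

```python
import statistics

# Values copied from the five example rows in this README.
# The full dataset has one such row per title.
rows = [
    {"TITLE": "A Clash of Kings", "MEAN_SENT": -0.002, "READABILITY": 92.73},
    {"TITLE": "Dune", "MEAN_SENT": -0.007, "READABILITY": 85.18},
    {"TITLE": "Beloved", "MEAN_SENT": 0.030, "READABILITY": 91.71},
    {"TITLE": "Misery", "MEAN_SENT": -0.032, "READABILITY": 82.54},
    {"TITLE": "The Portrait of a Lady", "MEAN_SENT": 0.150, "READABILITY": 71.65},
]


def summarize(rows, column):
    """Return (mean, sample standard deviation) for one numeric column."""
    values = [row[column] for row in rows]
    # Assumption: sample SD (n - 1 denominator); use statistics.pstdev
    # instead if the population formula is wanted.
    return statistics.mean(values), statistics.stdev(values)


for column in ("READABILITY", "MEAN_SENT"):
    mean, sd = summarize(rows, column)
    print(f"{column}: mean={mean:.3f}, sd={sd:.3f}")
```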

🏆 "Quality", "reader appreciation" or "popularity" metrics

Beyond textual features, we present various "quality proxies", that is, ways of estimating how a title is valued in literary culture, such as whether or not it is included in bestseller or canon lists. We also include what we call "continuous" proxies, that is, scores per title, such as Goodreads ratings or translation counts (see the corpus documentation).

Because of the library-holdings selection criterion, the corpus comprises much high-quality fiction by authors who have received prestigious distinctions, such as the Nobel Prize (e.g., Toni Morrison) or the National Book Award (e.g., Don DeLillo). Yet library holdings appear to indicate both high distinction and mass popularity, reflecting library users' demand and preferences. The corpus therefore also comprises widely popular novels from mainstream literature (e.g., Agatha Christie) and notable works across the broad spectrum of so-called "genre literature", from Mystery to Science Fiction (e.g., Tolkien, Philip K. Dick). An examination of the relation between various proxies in this corpus is forthcoming.
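
One simple way to compare two continuous proxies is a rank correlation. The sketch below implements Spearman's rho in plain Python and applies it to AVG_RATING and TRANSLATIONS for the five example rows in this README. This is an illustration only, not the analysis used by the authors, and five rows say nothing about the corpus as a whole. The function names are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranked values."""
    return pearson(ranks(x), ranks(y))


# AVG_RATING and TRANSLATIONS for the five example rows, in table order.
avg_rating = [4.41, 4.25, 3.92, 4.20, 3.78]
translations = [38, 398, 68, 74, 53]

print(f"Spearman rho: {spearman(avg_rating, translations):.3f}")
```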


📖 Documentation

  • 📄 Paper: The Chicago resource paper.
  • ✏️ Documentation: Detailed description of measures and proxies included in the dataset.
  • 🗂️ Previous works: Publications that have previously used the Chicago Corpus.
  • 🔬 Textual Optics Lab: The Chicago Corpus at the Textual Optics Lab, University of Chicago.
  • 📚 Citation: BibTeX citation.
  • 🔥 EmotionArcs: Emotion Arcs of the Chicago Corpus (a linked dataset).
  • 🔬 CHC: Center for Humanities Computing, hosting the Fabula-NET project.
