Text Analysis of The Collected Works of William Shakespeare

The Collected Works of William Shakespeare was used as a dataset to test trees during development. At under 10 MB of text files, this is not a large dataset by today's standards, but it is nonetheless useful for testing. It's unclear how the Bard would feel about his work being used for this purpose; no offense intended!

Standalone programs that build trees from, and analyze, The Collected Works of William Shakespeare can be found here.

Some output from these programs:

  • Radix Tree of the individual words contained in all of Shakespeare's works

    • Tree viewable here
    • Values are the works containing those words
    • 29,008 nodes
  • Suffix Tree of the individual words contained in all of Shakespeare's works

    • Tree viewable here
    • Values in this output are the complete words associated with each suffix
    • 75,780 nodes
  • Suffix Tree of Shakespeare's Tragedy, Antony and Cleopatra

    • 280 MB in RAM using DefaultCharSequenceNodeFactory
    • Insufficient RAM to build tree using DefaultCharArrayNodeFactory
    • 29.438 GB if written to disk (highlights the memory that would be required by DefaultCharArrayNodeFactory)
    • 217,697 nodes
  • Suffix Tree of the entirety of Shakespeare's Tragedies (10 plays)

    • 2.0 GB in RAM using DefaultCharSequenceNodeFactory
    • Insufficient RAM to build tree using DefaultCharArrayNodeFactory
    • 248.997 GB if written to disk (highlights the memory that would be required by DefaultCharArrayNodeFactory)
    • 1,965,884 nodes
  • The Longest Common Substring of Shakespeare's Tragedies (10 plays)

    • Output viewable here
    • Longest common substring is: "dramatis personae"
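To make the last result concrete, here is a minimal sketch of what "longest common substring" means across several documents. It is not the standalone program referred to above and it does not use a suffix tree: it swaps in a brute-force search that tries every substring of the shortest document, longest first. This is only practical for tiny inputs. The class name `LongestCommonSubstringSketch`, its methods, and the three sample strings are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the programs described above.
final class LongestCommonSubstringSketch {

    // Returns the longest substring that occurs in every document,
    // or "" if the documents share no characters (or the list is empty).
    // Brute force: candidates are taken from the shortest document,
    // longest candidates first, so the first match found is a longest one.
    static String longestCommonSubstring(List<String> documents) {
        if (documents.isEmpty()) {
            return "";
        }
        // A common substring can never be longer than the shortest document.
        String shortest = documents.get(0);
        for (String doc : documents) {
            if (doc.length() < shortest.length()) {
                shortest = doc;
            }
        }
        for (int len = shortest.length(); len > 0; len--) {
            for (int start = 0; start + len <= shortest.length(); start++) {
                String candidate = shortest.substring(start, start + len);
                if (containedInAll(candidate, documents)) {
                    return candidate;
                }
            }
        }
        return "";
    }

    // True if every document contains the candidate as a substring.
    static boolean containedInAll(String candidate, List<String> documents) {
        for (String doc : documents) {
            if (!doc.contains(candidate)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the plays; the real inputs are full texts.
        List<String> docs = Arrays.asList(
                "dramatis personae: antony, cleopatra",
                "hamlet. dramatis personae: claudius",
                "macbeth; dramatis personae, duncan");
        System.out.println(longestCommonSubstring(docs)); // prints "dramatis personae"
    }
}
```

The brute-force version repeatedly rescans every document, which is why it only suits small strings like these. A suffix tree built over all the documents avoids that rescanning, at the cost of the memory figures listed above.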