Books are a profound resource. Novels especially are permanently inscribed with a vast archive of information and narratives. They highlight unseen elements of the human condition that would never be revealed in objective texts. When filtered through computational analysis, these books can inform intelligent systems about the most intimate parts of the human experience, and in turn, improve their capacity for human-centered reasoning.
Researchers at Stony Brook University have developed a new way for readers and researchers to engage closely with literature: a research project called “StonyBook: A Resource for the Computational Analysis of Novels.” Developed by PhD candidates Allen Kim and Charuta Pethe together with Steven Skiena, Director of the Institute for AI-Driven Discovery and Innovation, “StonyBook” offers an easily navigable annotation system, in the form of an online interface, for analyzing over 50,000 books from digital libraries such as Project Gutenberg and HathiTrust.
Traditionally, research in fields such as digital literary studies and computational linguistics has been hindered by a severe lack of annotation systems for large bodies of text, like novels. As a result, until recently, Natural Language Processing (NLP) has been employed almost exclusively on short texts.
“New techniques are necessary to make sense of long documents such as novels,” according to Skiena.
“We built a system capable of understanding narrative elements such as character interactions and flow of events. That led us to build a unified system to process and annotate novels,” say Kim and Pethe.
NLP-driven computational analyses are not well suited to extracting valuable information from longer texts, in part because they work with crude resources from digital libraries. Many of these books are 100 to 200 years old and were digitized by scanning library copies, which leaves errors in the text. To mitigate this disarray, the design of “StonyBook” includes a primary, integral mechanism: a processing and cleaning phase. This phase fixes the text’s errors without viewing the image of each individual page. Rather, it takes the entire body of the reconstructed, scanned version of the text and cleans it.
A variety of text features obscure the narrative itself. Multi-stage structural parsing, part of the cleaning phase, removes this content, inspecting the text down to the level of individual letters and numerals. With each stage the novel becomes more refined until the narrative stands alone, stripped of decorative textual elements. Once the cleaning phase is complete, NLP analyses can take over on these self-standing narratives, from front cover to back.
After cleanup, the text is systematically broken down from chapters to paragraphs, to sentences, and finally to words. First, a machine learning model recognizes the patterns associated with chapter headings; this identifies the chapters and splits them out of the text en masse. Each word is then annotated consistently according to its grammatical and syntactic role.
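The chapter-to-word breakdown can be illustrated with a minimal Python sketch. It is not the project's code: a simple regular expression stands in for the machine learning model that recognizes chapter headings, and the function names, the heading pattern, and the sample text are all hypothetical.

```python
import re

# Assumption: headings look like "CHAPTER I" or "Chapter 12" on their own line.
# StonyBook uses a trained model here; this regex is only a stand-in.
HEADING_RE = re.compile(
    r"^[ \t]*chapter\s+(?:[ivxlcdm]+|\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_chapters(text):
    """Return (heading, body) pairs, one per detected chapter heading."""
    matches = list(HEADING_RE.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group(0).strip(), text[m.end():end].strip()))
    return chapters

def split_paragraphs(body):
    """Paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]

def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def split_words(sentence):
    """Alphabetic words, keeping simple contractions such as "don't"."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)

def segment(text):
    """Nest the levels: chapter -> paragraphs -> sentences -> words."""
    return [
        {
            "heading": heading,
            "paragraphs": [
                [split_words(s) for s in split_sentences(p)]
                for p in split_paragraphs(body)
            ],
        }
        for heading, body in split_chapters(text)
    ]

sample = (
    "CHAPTER I\n\nIt was a dark night. The wind howled!\n\nShe waited.\n\n"
    "CHAPTER II\n\nMorning came."
)
chapters = segment(sample)
for chapter in chapters:
    print(chapter["heading"], len(chapter["paragraphs"]), "paragraph(s)")
```

A rule like this would miss unusual headings and abbreviations, which is presumably why a learned model is the better fit for a corpus of tens of thousands of books.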
First, each word is tagged with its part of speech (noun, verb, adjective, etc.). Then, the word is annotated with its dependency tag, which identifies its grammatical relationship to the other words in the clause. Named Entity Recognition (NER) then labels the words that refer to people, places, and so on. Each word is also marked with its lemma, the base (dictionary) form of the word. Next, coreference resolution associates pronouns with the entities they reference. Finally, quotes are attributed to the entity that spoke them.
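To make these layers concrete, here is a small Python sketch of how the annotations for one sentence might be stored together. The `Token` and `Quote` classes and their field names are hypothetical, not StonyBook's actual schema, and the example sentence is annotated by hand with Universal Dependencies-style labels rather than by a real NLP model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    pos: str                         # part of speech
    dep: str                         # dependency relation to the head word
    head: int                        # index of the head word (root points to itself)
    lemma: str                       # base form of the word
    ner: Optional[str] = None        # entity type, if the word names an entity
    entity_id: Optional[int] = None  # coreference cluster the word belongs to

@dataclass
class Quote:
    text: str
    speaker_id: Optional[int]        # entity the quote is attributed to

# Hand-annotated example: "Alice smiled because she won."
sentence = [
    Token("Alice", "PROPN", "nsubj", 1, "Alice", ner="PERSON", entity_id=0),
    Token("smiled", "VERB", "ROOT", 1, "smile"),
    Token("because", "SCONJ", "mark", 4, "because"),
    Token("she", "PRON", "nsubj", 4, "she", entity_id=0),
    Token("won", "VERB", "advcl", 1, "win"),
]
quotes = [Quote("I knew it!", speaker_id=0)]

def entity_names(tokens):
    """Map each coreference cluster to its first named mention."""
    names = {}
    for tok in tokens:
        if tok.entity_id is not None and tok.ner is not None:
            names.setdefault(tok.entity_id, tok.text)
    return names

def resolve_pronouns(tokens):
    """Pair every pronoun with the name of the entity it refers to."""
    names = entity_names(tokens)
    return [(t.text, names.get(t.entity_id)) for t in tokens if t.pos == "PRON"]

def mentions_per_entity(tokens):
    """Count how often each entity is mentioned, by name or by pronoun."""
    return Counter(t.entity_id for t in tokens if t.entity_id is not None)

names = entity_names(sentence)
pronouns = resolve_pronouns(sentence)
mentions = mentions_per_entity(sentence)
speakers = [names.get(q.speaker_id) for q in quotes]
print(pronouns, dict(mentions), speakers)
```

Once pronouns and quotes point at the same entity identifier, counting how often a character appears or speaks becomes a simple tally, which is the kind of statistic that corpus-level analysis builds on.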
These processes go beyond compartmentalizing textual elements and pave the way for statistical analysis. In aggregate, “StonyBook” evaluates the evolution of literature. These analyses unveil trends and shifts in literature which have gone unnoticed as decades have passed.
The program helps researchers trace pivotal societal changes through literary analysis. For example, in a patriarchal society, much literature has men at its center. Female protagonists existed, but they were few and far between. At the beginning of the Women’s Suffrage movement, though, the tide started to turn. From the mid-19th century to its end, there was a stark and rapid decline in the share of male protagonists. Women took center stage. After this sweeping social movement, though, men reassumed literary supremacy. It was the analyses performed by “StonyBook” that uncovered this brief, pivotal change, which otherwise may have gone unnoticed.
The analyses performed by “StonyBook” reveal that literature reflects the historical circumstances in which it was produced. As literacy has become more accessible and widespread, authors have responded. Instead of catering to the educated and elite, authors have adapted to suit the literacy needs of general audiences. Over the centuries, authors have increasingly simplified their language to make it more comprehensible. The readability scores calculated by “StonyBook” indicate that books have become easier to read over time.
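The article does not say which readability formula the project uses. As an illustration only, the sketch below computes the well-known Flesch Reading Ease score, where higher values mean easier text. The syllable counter is a crude heuristic, and both sample sentences are invented for the example.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a final silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

# Invented sample passages, used only to compare the two scores.
plain = "The cat sat on the mat. The dog ran to the door."
ornate = (
    "Notwithstanding considerable deliberation, the magistrate "
    "unhesitatingly pronounced an extraordinarily unfavourable determination."
)

plain_score = flesch_reading_ease(plain)
ornate_score = flesch_reading_ease(ornate)
print(round(plain_score, 1), round(ornate_score, 1))
```

Short sentences built from short words score higher, so averaging a score like this over books grouped by publication year is one plausible way to observe a trend toward easier reading.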
These analyses are valuable to the research community. “StonyBook is useful to researchers in computer science for large-scale NLP tasks, digital humanities for corpus-level aggregate analyses, and linguistics for studying word patterns over time and across genres,” say Kim and Pethe.
“StonyBook” is already opening several new avenues of research for literary scholars and computer scientists alike.
“We are currently using ‘StonyBook’ as the foundation for multiple research projects: novel summary alignment, audiobook generation, and creating neural representations of narratives. In the future, we imagine this will be useful for further research related to character relationship dynamics, conversation modeling, question answering, and automatic summarization, particularly for novels,” say Kim and Pethe.
-Alyssa Dey, Communications Assistant