A common task in literature study is to find examples of a theme. Until now, literary scholars searching for examples have had to rely on searching for sets of words they think are associated with the theme.
Theme-finding by searching for words poses a problem. Synonymy and the infinite variance of language mean that the same theme might surface in many different forms using many different words. Even for scholars with intimate knowledge of the text, a single set of words is not enough. Depending on their mental context, the words that come to mind might not always be complete and representative.
For example, take the Shakespearean theme of “seeing is believing” — that seeing an event with one’s own eyes is more credible than hearing about it second-hand. A scholar might search for the words “believe”, “speak”, “eyes”, and “see”. That search might be able to capture this example (from The Winter’s Tale 5.2):
Then have you lost a sight, which was to be seen, can not be spoken of.
but not this one (from King Lear 4.6):
I would not take this from report; it is, And my heart breaks at it.
As a solution, we at WordSeer propose search-by-example. This technique dates back to the 1980s in the field of information retrieval, and so far it has been successful in helping find relevant documents. We think it could work for theme-finding too.
With search-by-example, instead of inferring which words represent a theme, and then searching for those words, a scholar can search for sentences that match a set of examples. A scholar marks a set of examples of a theme, and the system returns a list of sentences it thinks are relevant.
This process is a cycle. When the system returns results, the scholar gives it feedback by labeling sentences “relevant” if they match the theme, and “not-relevant” if they don’t. The system gradually builds a model of what the scholar is interested in, and eventually returns results that are mostly relevant.
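The feedback cycle described here resembles classic relevance feedback from information retrieval. As a minimal sketch, and not WordSeer's actual implementation, the following applies a Rocchio-style update to bag-of-words vectors. The tokenizer, the function names, and the weights `alpha`, `beta`, and `gamma` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(sentence):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-weight mappings."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a list of vectors (empty dict if the list is empty)."""
    if not vectors:
        return {}
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {word: count / len(vectors) for word, count in total.items()}


def rocchio(query, relevant, not_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward 'relevant' examples and away from 'not-relevant' ones."""
    pos, neg = centroid(relevant), centroid(not_relevant)
    updated = {}
    for word in set(query) | set(pos) | set(neg):
        weight = (alpha * query.get(word, 0.0)
                  + beta * pos.get(word, 0.0)
                  - gamma * neg.get(word, 0.0))
        if weight > 0:  # drop words pushed to zero or below
            updated[word] = weight
    return updated


def rank(model, sentences):
    """Order sentences from most to least similar to the current model."""
    return sorted(sentences, key=lambda s: cosine(model, vectorize(s)), reverse=True)


# One round of the cycle: seed example -> scholar's labels -> updated ranking.
seed = vectorize("I would not take this from report; it is, And my heart breaks at it.")
labeled_relevant = ["If in Naples I should report this now, would they believe me?"]
labeled_not_relevant = ["Not a mouse stirring."]
model = rocchio(seed,
                [vectorize(s) for s in labeled_relevant],
                [vectorize(s) for s in labeled_not_relevant])
for sentence in rank(model, labeled_relevant + labeled_not_relevant):
    print(sentence)
```

Each round of labeling would call `rocchio` again with the growing sets of labeled sentences, so the model drifts toward whatever vocabulary the scholar's examples share.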
For example, in under five minutes, I was able to use the examples above to come up with seven more candidates:
Gracious my lord, I should report that which I say I saw, But know not how to do’t. (Macbeth 5.5)
Most noble sir, That which I shall report will bear no credit, Were not the proof so nigh. (Winter’s Tale 5.1)
I would not hear your enemy say so, Nor shall you do mine ear that violence, To make it truster of your own report Against yourself: I know you are no truant. (Hamlet 1.2)
If in Naples I should report this now, would they believe me? (The Tempest 3.3)
They call him Doricles; and boasts himself To have a worthy feeding: but I have it Upon his own report and I believe it; He looks like sooth. (Winter’s Tale 4.4)
It is not so; thou hast misspoke, misheard; Be well advised, tell o’er thy tale again: It can not be thou dost but say ’tis so: I trust I may not trust thee; for thy word Is but the vain breath of a common man: Believe me, I do not believe thee, man; I have a king’s oath to the contrary. (King John 3.1)
I do beseech you, either not believe The envious slanders of her false accusers; Or, if she be accused on true report, Bear with her weakness, which, I think, proceeds From wayward sickness, and no grounded malice. (Richard III 1.3)
Of course, this is all theory until it’s been proven to work. And while I’m not a Shakespeare scholar, I did build this particular system, so it might not be surprising that I can get a few results out of it.
So to find out whether search-by-example works, we’ve designed a five-minute study around three Shakespearean themes. There are three systems: one search, and two different example-based ones. Participants are shown an example of a theme, and asked to use a system to find as many relevant results as they can in five minutes. The systems and theme are randomly assigned.
We’ll find our answer by comparing the quality and quantity of the sentences the participants find on the three systems. Expert scholars will help us judge quality: they will rate the relevance of sentences the different systems produce (without knowing which system produced which sentence). For quantity, there is a time limit — which system produces more high-quality results in five minutes?
So, does example-based exploration work better than search for theme finding?
If you have five minutes, you can help us find out by participating in the study:
http://wordseer.berkeley.edu/themes/
WordSeer 2: Test users wanted
A new version of WordSeer is in the works.
It’s been guided by the advice of our long-suffering literature-scholar collaborators. And by the tales of frustration and trial-and-error of the students of the Hamlet class who tried to use WordSeer to analyze parts of the play. We also thought hard about the text analysis process as a series of steps. “What might Tanya Clement have been thinking and doing at each stage of her computational analysis of repetition in Gertrude Stein’s The Making of Americans?” “What about when we analyzed language use differences in the descriptions of men and women in Shakespeare?” Out of this has come a better (we hope) understanding of the needs of scholars of text in the humanities.
We’ve completely rebuilt WordSeer. Instead of a traditional web application with a different visualization on each page, WordSeer now works more like an environment. Almost like a desktop — with windows and menu bars and persistent, useful, objects.
However, as researchers in Human-Computer Interaction, we know that we need to do user studies. First, we need to check whether we’re on the right track. Do our improvements make for a better experience than the old version? More importantly, we need more observations. To understand the humanities text analysis process, we want to observe more humanities text analysis.
Until now, the closest we’ve come to “user studies” is an iterative bouncing-around of ideas with just three scholars. They have been more like guides and expert consultants than “users” and they helped us sketch the first lines, and refine our first ideas into something that was actually useful.
We’ve acted upon the knowledge they helped us accumulate, the result of which is the completely redesigned WordSeer. We’re looking for a bigger set of users now, for a formal study. We’re hoping to find a set of around 15 professional literature scholars who will allow us to observe them as they use WordSeer to explore a problem of genuine professional interest to them.
So what text collection could possibly interest 15 different scholars in the digital humanities community enough to want to do a computationally-assisted analysis of it? And allow us to observe them at it?
In a rare moment of epiphany, we realized we could just ask you. So here’s a poll. It’s populated with some examples, but we encourage you to respond in the “other” field. Tell us: what collection, if set up with text analysis and visualization tools, would make you interested?
Men and Women in Shakespeare
In previous posts, I’ve shown how WordSeer can be used to explore small, well-defined questions: what word did Shakespeare use for ‘beautiful’? Is the occurrence of the word ‘love’ the same in the comedies and tragedies? This post is different. WordSeer has now developed enough to support a simple, but complete, exploratory analysis.
The question we’ll think about is this:
“How does the portrayal of men and women in Shakespeare’s plays change under different circumstances?”
As one answer, we’ll see how WordSeer suggests that when love is a major plot point, the language referring to women changes to become more physical, and the language referring to men becomes more sentimental.
Search
We began our analysis with the question, “What are some things that are portrayed as ‘his’ and some things that are ‘hers’?” A typical keyword search returns an unstructured list of results, and a standard approach in literature study is to view them in a concordance. This is a list of all the sentences in which a word occurs, with the target word aligned in the center of the view, exposing the contexts to its left and right, sorted in some way. WordSeer uses the word tree concordance visualization, which makes common contexts easier to see by grouping them in a tree-like structure.
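To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch. It is not WordSeer's word tree; the function names, the fixed context width, and the choice to sort by right-hand context are assumptions for illustration.

```python
import re


def concordance(sentences, target, width=30):
    """Collect (left context, matched word, right context) for every hit."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            left = sentence[:match.start()][-width:]
            right = sentence[match.end():][:width]
            hits.append((left, match.group(), right))
    # Sorting by what follows the word puts shared continuations next to each other.
    hits.sort(key=lambda hit: hit[2].strip().lower())
    return hits


def format_hits(hits, width=30):
    """Right-justify the left context so the target word lines up in one column."""
    return [f"{left:>{width}}{word}{right}" for left, word, right in hits]


# Toy sentences, invented for this sketch (not quotations from the plays).
toy = ["She hid her beauty.", "I told her that.", "He praised her beauty."]
for line in format_hits(concordance(toy, "her")):
    print(line)
```

A word tree goes one step further by merging identical continuations (here, the two occurrences of "beauty") into a single branch.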
The word tree for her is shown in Figure 1 above. Some words like beauty stand out, but constructions like her own muddy the picture. The problem lies in the different ways in which his and her are used. The word his is always a possessive pronoun, and word sequences containing his would nearly always be relevant. However, her can also be an object pronoun, and will yield constructions like “I told her that X” and “I gave her the Y”.
With WordSeer, we can get around this problem with grammatical search. The system uses natural language processing (NLP) to extract relationships between words, and allows users to specify both keywords and relationships between them. In the tool’s search interface, pairs of words are specified using input boxes, and the relationship between them is selected from a drop-down menu (Figure 2). Leaving a word-input box blank returns all matches.
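The query model described here (two word slots, a relationship, and a blank slot acting as a wildcard) can be sketched over pre-extracted relations. This is an illustration under stated assumptions, not WordSeer's code: the `Relation` type, the relation labels, and the hand-written triples are hypothetical, and a real system would get its triples from a parser.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    governor: str      # e.g. the thing possessed
    relation: str      # e.g. "possessed-by"
    dependent: str     # e.g. the possessor
    sentence_id: int


# Hypothetical triples, written by hand for this sketch.
RELATIONS = [
    Relation("beauty", "possessed-by", "her", 1),
    Relation("father", "possessed-by", "her", 2),
    Relation("sword", "possessed-by", "his", 3),
    Relation("told", "object", "her", 4),  # "I told her that X": not a possession
]


def grammatical_search(relations, governor=None, relation=None, dependent=None):
    """Return relations matching every filled-in slot; None means 'match anything'."""
    def matches(value, wanted):
        return wanted is None or value == wanted

    return [r for r in relations
            if matches(r.governor, governor)
            and matches(r.relation, relation)
            and matches(r.dependent, dependent)]


# "What are all the words possessed by her?" with the governor slot left blank.
for hit in grammatical_search(RELATIONS, relation="possessed-by", dependent="her"):
    print(hit.governor, "(sentence", hit.sentence_id, ")")
```

Because the object-pronoun use of her is stored under a different relation, it never shows up in the possessive query, which is the point of searching by relationship rather than by keyword.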
With this feature, we can take advantage of the fact that possessive relationships between words can be automatically detected, to express our question precisely: “what are all the words with which his has a possessive relationship?”. The results are shown in Figure 3 below.
Comparing these words with those for her (Figure 4 below) reveals immediate differences. The word father is most common for her, with husband and son close behind. Several body parts enter the picture: eyes, hand, face, tongue, lips, cheek. A picture emerges: women’s most commonly-mentioned possessions are their male relatives and their bodies.
Visualization, Reading, and Hypothesis-Generation
Our next question was whether this physical, male-dominated picture of women was consistent, or whether it changed in different types of plays. We used the tool’s collections feature to divide the plays into comedies, tragedies, and histories – the three most commonly-accepted categorizations of Shakespeare’s plays. We also created pre-1600 and post-1600 categories to check whether there were temporal differences.
Collections were created using the “collections” bay, a collapsible window at the bottom of the screen. We added the appropriate plays through the document listing (sortable and filterable by date, title, full-text search, grammatical search, and length).
We used the tool’s newspaper-strip visualization (Figure 6) to compare the prevalence of the two categories of words in different types of plays. Each play is represented as a long column. Within each column, small, colored horizontal blocks (corresponding to 10 sentences each) highlight the presence of a match.

Figure 6. Comparing the prevalence of body parts possessed-by her (eyes, lips, cheeks, and face) (blue) and relatives possessed-by her (husband, father, sons, daughters, children) (orange) in the comedies. Each column is a comedy, represented in alternating shades of grey. Hovering over a column (e.g. “Much Ado About Nothing” above) darkens it and displays the title. Hovering over a highlighted block displays the matching sentence.
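The data behind a newspaper-strip column is simple: cut the play into blocks of 10 sentences and mark each block that contains a match. Here is a rough sketch under simplifying assumptions. It matches plain words rather than the possessed-by relations used in the figures, and the function names and text rendering are invented for illustration.

```python
import re


def strip_blocks(sentences, words, block_size=10):
    """One boolean per block of `block_size` sentences: True if any sentence matches."""
    wanted = {word.lower() for word in words}
    blocks = []
    for start in range(0, len(sentences), block_size):
        block = sentences[start:start + block_size]
        hit = any(wanted & set(re.findall(r"[a-z']+", s.lower())) for s in block)
        blocks.append(hit)
    return blocks


def render_column(blocks):
    """Text stand-in for one column: '#' for a highlighted block, '.' otherwise."""
    return "\n".join("#" if hit else "." for hit in blocks)


# A toy 25-sentence "play" with a single match in its 13th sentence.
play = ["nothing to see here"] * 25
play[12] = "Her eyes are bright"
print(render_column(strip_blocks(play, ["eyes", "lips", "cheeks", "face"])))
```

Running one such column per play, and one color per word category, gives the side-by-side comparison the visualization is designed for.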
The results for the tragedies collection were similar to the results for comedies (Figure 6) but in histories (Figure 7), an interesting pattern emerged. It seemed that body parts (blue) were somewhat less prevalent in these plays, but family (orange) remained unchanged.

Figure 7. Comparing the prevalence of body parts possessed-by her (eyes, lips, cheeks, and face) (blue) and relatives possessed-by her (husband, father, sons, daughters, children) (orange) in the histories. Each column is a play, represented in alternating shades of grey.
Hypothesis-building: close reading, annotation, and exploration
WordSeer supports quick, large-scale analysis through search and visualization, but in all cases maintains links back to the source text. Hovering over a blue or orange highlighted block in Figures 6 or 7 brings up a popup displaying the matching sentence. Clicking opens the reading interface to that point (Figure 8). The full text of the document is loaded, and the system automatically scrolls to the relevant sentence, and highlights it.

Figure 8. WordSeer’s reading interface. If the document is subdivided into sections, these appear on the right as a table of contents.
Hovering over a few body-part results quickly led to a new hypothesis. In our rough sample, many of the mentions sounded romantic. We used the reading and annotating interface to follow up on this by clicking on the highlighted blocks in the newspaper-column visualization.
We selected the speeches referring to body parts and tagged them by the topics they seemed to contain. It soon became apparent that many of the mentions were speeches by a lover.
Our hypothesis was strengthened when we viewed related words. For exploration of style and language, WordSeer uses computational linguistics to calculate words commonly used in similar contexts, or commonly used within a 10-sentence window of each other. Clicking on any word while reading brings up a small window showing related words.
In our example, the related words for body parts (e.g. Figure 10 for face) strengthened our growing suspicion that female body part mentions were associated with romance. The popup shows that other body parts are frequently mentioned, along with love, fair, and sweet.
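One simple way to approximate "words commonly used within a 10-sentence window of each other" is to count co-occurrences around each sentence that contains the target word. The sketch below does only that. It uses raw counts rather than a proper association measure, reads the window as 10 sentences on either side, and is not the computation WordSeer performs. All names are hypothetical.

```python
import re
from collections import Counter


def related_words(sentences, target, window=10, stopwords=frozenset(), top_n=5):
    """Count words appearing within `window` sentences of each occurrence of `target`."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter()
    for i, tokens in enumerate(tokenized):
        if target not in tokens:
            continue
        lo, hi = max(0, i - window), min(len(tokenized), i + window + 1)
        for neighbor in tokenized[lo:hi]:
            # Count each word once per neighboring sentence; a sentence close to
            # several occurrences of the target is counted once per occurrence.
            counts.update(w for w in set(neighbor)
                          if w != target and w not in stopwords)
    return counts.most_common(top_n)


# Toy sentences, invented for this sketch.
toy = ["her face is fair", "sweet love", "the ship sailed"]
print(related_words(toy, "face", window=2, stopwords={"her", "is", "the"}))
```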
Assembling Evidence
We created a final pair of categories focusing on love: not-love-stories for plays in which love is not a major plot point, and love-stories for plays in which it is. When we reorganized the plays along these lines, the results were immediate.

Figure 11. Visualization of the love-stories collection comparing the prevalence of body parts possessed-by her (blue) and relatives possessed-by her (orange).
In the love-stories (Figure 11), we see both body parts and male relatives. By contrast, the not-love-stories visualization (Figure 12) shows predominantly male relatives, and hovering over the occurrences of body parts reveals a gloomy picture of her tear-stained cheeks and her sorrowful eyes.

Figure 12. Visualization of the not-love-stories collection comparing the prevalence of body parts possessed-by her (blue) and relatives possessed-by her (orange).
The grammatical search results (below) agree with the newspaper-strip visualizations and related words. We see more physical attributes possessed-by her in the love-stories collection (Figure 13a) than in the not-love-stories collection (Figure 13b).
The grammatical search results show that the language around men changes as well (Figures 14a and 14b below). In the not-love case, the only woman to appear is mother, at number 20, but in the love case, wife takes first place, followed by favor. Compared to the physical language for women, these words have a more sentimental quality.
Thus, we see that, while a male-dominated picture of both men and women is always present, physical aspects are more prominent for women in plays about love. For men, the more sentimental aspects come to the fore.
Conclusion
WordSeer is being developed through case studies. This means we observe scholars working with texts, figure out what they need, and then try to translate it into interactions, text mining algorithms, and visualizations. Therefore, when the time comes to demonstrate it, I always think examples work better than anything else.
So what do the literature scholars among you think of this simple example? How might it be improved, and made more convincing? What are its flaws? What would you have done? Please comment, even if it is to criticize. It would be great to hear your thoughts.
WordSeer: “love” in Shakespeare’s tragedies and comedies
When scholars try to make sense out of large collections of text, they frequently do two things: compare, and collect. They collect samples of “interesting” things, and compare them with each other along various relevant dimensions.
In this post, I demonstrate the collection and comparison features of WordSeer by using it to compare the usage of the word “love” in Shakespeare’s comedies and tragedies. You can watch the screencast, or simply read on.
The first thing to do is collect the comedies and tragedies into separate lists. To do this, I created a new collection called “tragedies” using the new “collections” feature.
Next, I had to collect all of Shakespeare’s tragedies into that collection. Figure 2 shows WordSeer’s list of plays. I walked down this list and clicked the checkboxes next to the tragedies, using Wikipedia as an authoritative source of tragedies.
Once I’d selected all the tragedies, I clicked the “Add Items” button to add them to a collection. I selected the “tragedies” collection and added the plays.
This populated the collection with the plays. I did the same for the comedies, ending up with two collections.
I was now ready to compare my collections. I opened up two windows to the heat map view. One was going to visualize the tragedies, and one the comedies.

Figure 6. Setting up the heat maps. One window visualized the "tragedies" collection, and the other window visualized "comedies".
Finally, I was ready to compare the two. I was interested in the word “love”, and whether there would be any differences in how frequently it was used in the comedies and the tragedies. To that end, I typed “love” into the comedies window and got the heat map in Figure 7.

Figure 7. The occurrences of "love" in Shakespeare's comedies. Each column is a play, and each highlighted block marks a place where the word "love" occurs.
Not surprisingly, “love” is everywhere. But what about the tragedies? In the other window, typing in “love” yielded the results in Figure 8.
To my surprise, the tragedies were equally full of “love”. Which, among other things, reveals my poor knowledge of Shakespeare.
Still, the hope is that our Shakespeare scholar, Michael Ullyot (@ullyot), will use collections and heat maps to discover something truly interesting.
“Beautiful” in Shakespeare
On Tuesday, Feb. 1, I’ll be presenting my latest project, WordSeer, at the Farsight 2011 conference on the future of search. This event will be streamed live from TechCrunch, the tech world’s favorite blog about new technology and startup news, and will be attended by high-profile techies from Bing, Google, Blekko, and the like. Please tune in at 10am PST Tuesday, follow along with #futuresearch on Twitter, and let’s get the digital humanities some high-tech exposure that day!
WordSeer is a new way of searching through text inspired by the way literary scholars work. Literature scholars ask detailed, analytical questions of text, for which it’s important for them to get a sense of how different words are used and in what contexts. For our project, we teamed up with scholars who are exploring language use in a collection of North American slave narratives.
When analyzing text, traditional keyword-based search can only take you so far. Instead of having to read every document hoping to come across relevant passages, you can immediately zoom in on them with a search. But can we do better? When trying to form a hypothesis or get a sense of contents, a long list of search results is still unwieldy because it’s not really the matching sentences we’re interested in, it’s what they have to say about our topic.
Luckily, we don’t have to stop at matching keywords. Sentences aren’t mysterious bags of words, they follow rules and have structures, which computers have been capable of deciphering with speed and precision for some years now. From these structures, computers can automatically infer relationships between words. For example, in the sentence,
“The good God has given every man intellect”
computers can automatically infer that “God” is described as “good”, and that he is the agent doing the giving.
With WordSeer, we’re going beyond keyword search by using language processing to automatically extract and aggregate the parts of matching sentences relevant to a query. In the first place, we make it easy to express an analytical query in terms of a grammatical relationship. For example, if a scholar wanted to know what the slave narratives collection indicated about the relationship between slaves and God, they could simply ask (live demo link) how God “is described” (for which WordSeer finds and displays all the adjectives that are applied to the word God) and what “is done by” God (for which WordSeer finds and categorizes all instances of verbs in which God is an agent).
Of course, this is only a rough, high-level picture of what the slave narratives say about how God is described and what God does, but a rough idea can often serve to guide intuition and help generate or discredit hypotheses. By making the process of “getting a rough idea” quick and inexpensive, we can speed up the entire research pipeline.
More and more source text in the humanities gets digitized every day, making it accessible to large-scale computational analysis. Nevertheless, traditional methods of humanistic analysis are based on detailed arguments built upon close readings of individual texts. How will the field adapt? How do we use statistics and text mining to answer humanistic questions?
Zoom in to the field of American literature, and further into the realm of studying the (digitized) narratives of escaped former slaves, published by white abolitionists. There are widespread stylistic and thematic similarities among these narratives. How can text mining help literature scholars here? That’s where WordSeer, my latest project, comes in.
The MONK project at CMU, and the Voyeur project at McMaster University share the same cause as WordSeer. But, when it comes to text analysis, they are essentially search interfaces that show simple statistics about word order, type and frequency. The grammatical relationships within text are neglected.
WordSeer
WordSeer is an evolving project, as all digital humanities projects inevitably are. As my friends in the English department and I learn what we can do for each other, it will get steadily more well-defined, but right now, it’s simple: a search interface and a reading interface. The search interface allows queries based on grammatical structure, and the reading interface is for reading narratives, comparing them, and coming up with new queries.
Search
The search screen is shown below. It supports standard keyword-based search, so scholars can look for words or exact matches in the text. More interestingly, there’s grammatical search. Using grammatical relationships extracted through natural language processing, users can ask how things were described, what actions were performed upon them and by them, who possessed certain things, or what was possessed by them.
For example, the figure above shows the query, ‘give all adjectives that are applied to the words “slave, bondman, negro”’. The system returns not only a list of occurrences in the narratives, but also automatically-generated graphs, showing the frequencies of the different words. As you can see, “poor” is the most frequent adjective. The results are sortable and filterable: clicking on bars filters the list to show just results containing those words. Above, I’ve filtered to show just the instances where “valuable” is applied to “slave”.
Reading
Interviews with our literary scholar friends suggested that a search interface alone would not be enough, so WordSeer supports reading narratives individually.
The reading view is shown below. Scholars can select one (or, indeed many) sentences from the search results and be taken to a reading screen, where the narratives are opened up to the correct place. Grammatical search doesn’t end there, however, because the entire text is interactive.
Highlighting a portion of a sentence and clicking the “examine” button (bottom right corner) shows the text pattern, as well as all the grammatical relationships in the highlighted portion. For example, I clicked on a passage about hospitals, and was presented with the pattern-examiner screen (below).
I can select some patterns, either the original passage or some grammatical patterns, and examine them further. I can use them as search queries and be taken back to the original search screen, I can save them for later, or I can view their distributions in the text I’m reading.
Being able to compare the distribution of phrases or patterns across texts can give an idea of how similar the texts are, or of how much their subject matter overlaps. For example, if I wanted to know where plantations were mentioned in these texts, I would highlight the word, “plantation” and click “See in Text”, giving the result below.
The white column represents the length of the entire text, and green bars indicate where the pattern of interest occurred. If I had selected multiple patterns, I would see different colored bars. Clicking on any of the little green bars takes me to an occurrence of the pattern, highlighted in the text.
Language Processing
All of this works because I applied language processing to the text beforehand, and stored the information in a database for quick access. I applied part-of-speech tagging, syntactic parsing, and dependency parsing to decompose sentences into their grammatical constituents. For example, the sentence, “The cruel man beat us severely” contains the word “cruel”, which is an adjective modifier of the word “man”, which is a noun. There is a verb-object relation between “beat” and “us”, and a verb-subject relation between “beat” and “man”.
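As a rough illustration of "parse once, store, then query", the sketch below puts the relations from the example sentence into an in-memory SQLite database and asks how a noun is described. The table layout, relation labels, and function names are assumptions for this sketch rather than WordSeer's actual schema, and the relations are written by hand instead of coming from a parser.

```python
import sqlite3

SENTENCE = "The cruel man beat us severely"
# (relation type, governor, dependent), hand-written for the example sentence.
RELATIONS = [
    ("adjective-modifier", "man", "cruel"),
    ("verb-subject", "beat", "man"),
    ("verb-object", "beat", "us"),
]


def build_index(sentence, relations):
    """Store one sentence and its relations in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT)")
    conn.execute("CREATE TABLE relation "
                 "(sentence_id INTEGER, type TEXT, governor TEXT, dependent TEXT)")
    sentence_id = conn.execute(
        "INSERT INTO sentence (text) VALUES (?)", (sentence,)).lastrowid
    conn.executemany(
        "INSERT INTO relation VALUES (?, ?, ?, ?)",
        [(sentence_id, rel, gov, dep) for rel, gov, dep in relations])
    # An index on (type, governor) keeps "how is X described?" lookups fast.
    conn.execute("CREATE INDEX relation_lookup ON relation (type, governor)")
    conn.commit()
    return conn


def described_as(conn, noun):
    """All adjectives stored as modifiers of `noun`."""
    rows = conn.execute(
        "SELECT dependent FROM relation "
        "WHERE type = 'adjective-modifier' AND governor = ?", (noun,))
    return [row[0] for row in rows]


conn = build_index(SENTENCE, RELATIONS)
print(described_as(conn, "man"))
```

The expensive parsing step happens once, ahead of time; each search then becomes an indexed lookup.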
If you want to know more about natural language processing, I gave a BootCamp about text mining at THATCamp SF recently, here are the slides [pdf]. I also wrote a blog post introducing the subject for a digital humanities audience.
What next?
Syntactic analysis is just a small part of what natural language processing can do. Right now, I’m working on being able to track named entities through a narrative and see descriptions applied to them, and actions in which they participate.
This year’s conference of the Association for Computational Linguistics, the most prestigious event in computational linguistics, had a paper that got me very excited. It’s called Extracting Social Networks from Literary Fiction [pdf], and here’s the abstract (emphasis added):
We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. Our approach involves character name chunking, quoted speech attribution and conversation detection given the set of quotes. We extract features from the social networks and examine their correlation with one another, as well as with metadata such as the novel’s setting. Our results provide evidence that the majority of novels in this time period do not fit two characterizations provided by literacy scholars. Instead, our results suggest an alternative explanation for differences in social networks.
The paper advances a new technique for extracting social networks from text, and uses it on 19th century novels to argue that certain aspects of literary theory about novels might be false. In this post, I’ll explain the analysis to the digital humanities audience and discuss some strengths and weaknesses in the argument.
Written at Columbia University by two computer scientists and one English scholar, this paper contains exciting things for both computational linguists and literature researchers. For computational linguists, it proposes the first ever algorithm for extracting speaker-to-speaker networks from free text. This opens up fascinating new areas of study because it is now possible to computationally analyze interactions between people in a text and not just what they say to each other.
For literary scholars, it suggests two hypotheses from literary theory about community and society in 19th century novels might be false, namely:
Literary studies about the nineteenth-century British novel are often concerned with the nature of the community that surrounds the protagonist. Some theorists have suggested a relationship between the size of a community and the amount of dialogue that occurs, positing that “face to face time” diminishes as the number of characters in the novel grows. Others suggest that as the social setting becomes more urbanized, the quality of dialogue also changes, with more interactions occurring in rural communities than urban communities. Such claims have typically been made, however, on the basis of a few novels that are studied in depth. In this paper, we aim to determine whether an automated study of a much larger sample of nineteenth century novels supports these claims.
To make their arguments, the authors frame the statements above in terms of social networks:
- If face-to-face time diminishes as the number of characters grows, then the more characters the novel has, the less dense its extracted social network will be.
- Second, if more interactions occur in rural settings than urban settings, networks from rural novels will be densely connected but contain fewer characters, whereas networks from urban novels will be large and loosely connected.
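To make “dense” concrete, here is a minimal sketch of one standard measure, graph density: the fraction of possible character pairs that are actually connected. This post does not say exactly which network features the paper computes, so treat the choice of measure and the function name `network_density` as my assumptions rather than the authors’ method.

```python
def network_density(num_characters, num_edges):
    """Density of an undirected network: actual edges / possible edges.

    Assumption: this is the textbook definition, not necessarily the
    exact feature used in the paper.
    """
    if num_characters < 2:
        return 0.0  # no pairs are possible with fewer than two characters
    possible_edges = num_characters * (num_characters - 1) / 2
    return num_edges / possible_edges


# Toy numbers, not taken from any novel:
small_village = network_density(num_characters=4, num_edges=6)
big_city = network_density(num_characters=10, num_edges=9)
```

Under the first hypothesis, a novel with many characters should look like the second case: more nodes, but a smaller share of pairs that ever talk to each other.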
Then, they extract social networks from novels using the following steps. First, the Stanford named-entity tagger automatically locates all the names in each novel. Then, a classifier automatically assigns a speaker to every instance of direct speech in the novel using features of the surrounding text. A “conversation” occurs if two characters speak within 300 words of each other. Finally, a social network is constructed from the conversations. Nodes are named speakers that appear 3 times or more (the threshold is there because the named-entity tagger is somewhat error prone). An edge appears if there was a conversation between two characters, and a heavier edge means more conversations. The end result is a social network like the one shown above, which was extracted from Mansfield Park by Jane Austen.
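The last two steps can be sketched in a few lines. This is a simplified illustration, not the authors’ code: it assumes the name tagging and speaker attribution have already been done, it only pairs each quote with the next one (my simplification of “within 300 words of each other”), and the function name `build_network` and the toy data are made up for the example.

```python
from collections import Counter


def build_network(quotes, mention_counts, window=300, min_mentions=3):
    """Build a weighted conversation network.

    quotes: list of (speaker, word_offset) pairs, one per attributed quote.
    mention_counts: how many times each name appears in the text.
    Returns (nodes, edges), where edges maps a pair of names to the
    number of conversations detected between them.
    """
    # Nodes: names that appear at least `min_mentions` times.
    nodes = {name for name, n in mention_counts.items() if n >= min_mentions}

    # Keep only quotes by retained characters, in reading order.
    kept = [(s, pos) for s, pos in sorted(quotes, key=lambda q: q[1]) if s in nodes]

    # Assumption: a "conversation" is two consecutive quotes by different
    # speakers that fall within `window` words of each other.
    edges = Counter()
    for (a, pos_a), (b, pos_b) in zip(kept, kept[1:]):
        if a != b and pos_b - pos_a <= window:
            edges[frozenset((a, b))] += 1
    return nodes, edges


# Toy data: the offsets and counts are invented for illustration.
quotes = [("Fanny", 10), ("Edmund", 120), ("Fanny", 250),
          ("Mary", 900), ("Edmund", 1000), ("Servant", 1050)]
mention_counts = {"Fanny": 40, "Edmund": 35, "Mary": 12, "Servant": 1}
nodes, edges = build_network(quotes, mention_counts)
```

In this toy run, “Servant” is dropped because the name appears only once, which is exactly the kind of omission discussed further down.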
Using the social networks they extract, the authors show that there is no significant difference in these network properties between urban and rural novels. Instead, they show that the biggest differences seem to be between novels in the third person and novels in the first person: the third-person novels have “dense, talkative” networks, whereas the first-person novels all center around the character “I”:
Our data suggests … that the “urban novel” is not as strongly distinctive a form as has been asserted, and that in fact it can look much like the village fictions of the century, as long as the same method of narration is used
Their claim seems too strong to me. In order to make the problem tractable, they have reduced the concepts of “characters” and “conversation” to simple metrics, but important information is left out:
- Characters are equated with names that appear three or more times, but this leaves out:
  - Named characters that are mentioned fewer than 3 times
  - Nameless characters that don’t speak
- They assume all conversations are direct speech, and so ignore indirect and reported speech
From an NLP perspective, it’s easy to see why they’ve made these simplifications. In the kinds of text that we are used to dealing with (expository things like news articles, or explanatory things like journal papers), infrequent, nameless entities are rare and don’t matter much. We’re used to looking for significant entities and popular topics of discussion, so the “drop the infrequent” approach goes a long way, and eliminates noise.
Nevertheless, when it comes to characterizing the aesthetics of a novel’s depiction of a social network, it’s a different matter, no longer about how “important” a character is or how “significant” some topic is. To me, it seems plausible that the number of infrequently-appearing named characters, and the number of nameless characters who are seen but never heard, can change the quality of a social network one experiences in a novel. Without further investigation into how frequent these infrequent-character or nameless-character cases are in this particular corpus, I really don’t think they have enough data to claim to have refuted scholarly intuition. We have no idea whether the cases they leave out are frequent enough in urban novels to sway the analysis, or in which direction the result would go.
It’s very easy to point out problems, but I can’t think of any ways to fix them: the tools to separate infrequent named entities from junk entities just don’t exist. And the tools to identify nameless, speechless characters haven’t made much headway either – what is the difference, quantitatively, between the words “the woman standing at the station” and “the hansom cab standing at the station”? To computers that rely on statistical, automatically extracted information about language, the difference is currently very difficult to detect.
What do the humanists among you think of this work? Compared to the other literary analysis of the same novels done at Stanford, this approach is more linguistically sophisticated, but do all of these computational attempts seem heavy-handed? Or do they spark ideas in your head, inspiring you to apply and improve them on your own problems?

























