It’s been a while since I posted a blog, so I thought I’d share with you a DH project that I was recently asked to collaborate on. It’s called Tesserae, and can be found here: http://tesserae.caset.buffalo.edu/
“Literature is an artificial universe,” author Kathryn Schulz recently declared in the New York Times Book Review, “and the written word, unlike the natural world, can’t be counted on to obey a set of laws” (Schulz). Schulz was criticizing the value of Franco Moretti’s “distant reading,” although her critique seemed more like a broadside against “culturomics,” the aggressively quantitative approach to studying culture (Michel et al.). Culturomics was coined with a nod to the data-intensive field of genomics, which studies complex biological systems using computational models rather than the more analog, descriptive models of a prior era. Schulz is far from alone in worrying about the reductionism that digital methods entail, and her negative view of the attempt to find meaningful patterns in the combined, processed text of millions of books likely predominates in the humanities.
Historians largely share this skepticism toward what many of them view as superficial approaches that focus on word units in the same way that bioinformatics focuses on DNA sequences. Many of our colleagues question the validity of text mining because they have generally found meaning in a much wider variety of cultural artifacts than just text, and, like most literary scholars, consider words themselves to be context-dependent and frequently ambiguous. Although occasionally intrigued by it, most historians have taken issue with Google’s Ngram Viewer, the search company’s tool for scanning literature by n-grams, or word units. Michael O’Malley, for example, laments that “Google ignores morphology: it ignores the meanings of words themselves when it searches…[The] Ngram Viewer reflects this disinterest in meaning. It disambiguates words, takes them entirely out of context and completely ignores their meaning…something that’s offensive to the practice of history, which depends on the meaning of words in historical context.” (O’Malley)
Such heated rhetoric—probably inflamed in the humanities by the overwhelming and largely positive attention that culturomics has received in the scientific and popular press—unfortunately has forged in many scholars’ minds a cleft between our beloved, traditional close reading and untested, computer-enhanced distant reading. But what if we could move seamlessly between traditional and computational methods as demanded by our research interests and the evidence available to us?
In the course of several research projects exploring the use of text mining in history we have come to the conclusion that it is both possible and profitable to move between these supposed methodological poles. Indeed, we have found that the most productive and thorough way to do research, given the recent availability of large archival corpora, is to have a conversation with the data in the same way that we have traditionally conversed with literature—by asking it questions, questioning what the data reflects back, and combining digital results with other evidence acquired through less-technical means.
We provide here several brief examples of this combinatorial approach that uses both textual work and technical tools. Each example shows how the technology can help flesh out prior historiography as well as provide new perspectives that advance historical interpretation. In each experiment we have tried to move beyond the more simplistic methods made available by Google’s Ngram Viewer, which traces the frequency of words in print over time with little context, transparency, or opportunity for interaction.
The Victorian Crisis of Faith Publications
One of our projects, funded by Google, gave us a higher level of access to their millions of scanned books, which we used to revisit Walter E. Houghton’s classic The Victorian Frame of Mind, 1830-1870 (1957). We wanted to know if the themes Houghton identified as emblematic of Victorian thought and culture—based on his close reading of some of the most famous works of literature and thought—held up against Google’s nearly comprehensive collection of over a million Victorian books. We selected keywords from each chapter of Houghton’s study—loaded words like “hope,” “faith,” and “heroism” that he called central to the Victorian mindset and character—and queried them (and their Victorian synonyms, to avoid literalism) against a special data set of titles of nineteenth-century British printed works.
The distinction between the words within the covers of a book and those on the cover is an important and overlooked one. Focusing on titles is one way to pull back from a complete lack of context for words (as is common in the Google Ngram Viewer, which searches full texts and makes no distinction about where words occur), because word choice in a book’s title is far more meaningful than word choice in a common sentence. Books obviously contain thousands of words which, by themselves, are not indicative of a book’s overall theme—or even, as O’Malley rightly points out, indicative of what a researcher is looking for. A title, on the other hand, contains the author’s and publisher’s attempt to summarize and market a book, and is thus of much greater significance (even with the occasional flowery title that defies a literal description of a book’s contents). Our title data set covered the 1,681,161 books that were published in English in the UK in the long nineteenth century, 1789-1914, normalized so that multiple printings in a year did not distort the data. (The public Google Ngram Viewer uses only about half of the printed books Google has scanned, tossing—algorithmically and often improperly—many Victorian works that appear not to be books.)
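The arithmetic behind these yearly percentages is simple. As a minimal sketch, assuming a hypothetical list of (year, title) records with one entry per book per year (our actual pipeline over the Google data set was more involved):

from collections import Counter

def title_frequency(records, keyword):
    '''Return {year: percentage of that year's titles containing keyword}.'''
    totals = Counter(year for year, _ in records)
    hits = Counter(year for year, title in records
                   if keyword.lower() in title.lower())
    return {year: 100.0 * hits[year] / totals[year] for year in totals}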
Our queries produced a large set of graphs portraying the changing frequency of thematic words in titles, which were arranged in grids for an initial, human assessment (fig. 1). Rather than accept the graphs as the final word (so to speak), we used this first, prospecting phase to think through issues of validity and significance.
Fig. 1. A grid of search results showing the frequency of a hundred words in the titles of books and their change between 1789 and 1914. Each yearly total is normalized against the total number of books produced that year, and expressed as a percentage of all publications.
Upon closer inspection, many of the graphs represented too few titles to be statistically meaningful (just a handful of books had “skepticism” in the title, for instance), showed no discernible pattern (“doubt” fluctuates wildly and randomly), or, despite an apparently significant trend, were unhelpful because of the shifting meaning of words over time.
However, in this first pass at the data we were especially surprised by the sharp rise and fall of religious words in book titles, and our thoughts naturally turned to the Victorian crisis of faith, a topic Houghton also dwelled on. How did the religiosity and then secularization of nineteenth-century literature parallel that crisis, contribute to it, or reflect it? We looked more closely at book titles involving faith. For instance, books that have the words “God” or “Christian” in the title rise as a percentage of all works between the beginning of the nineteenth century and the middle of the century, and then fall precipitously thereafter. After appearing in a remarkable 1.2% of all book titles in the mid-1850s, “God” is present in just one-third of one percent of all British titles by the First World War (fig. 2). “Christian” titles peak at nearly one out of fifty books in 1841, before dropping to one out of 250 by 1913 (fig. 3). The drop is particularly steep between 1850 and 1880.
Fig. 2. The percentage of books published in each year in English in the UK from 1789-1914 that contain the word “God” in their title.
Fig. 3. The percentage of books published in each year in English in the UK from 1789-1914 that contain the word “Christian” in their title.
These charts are as striking as any portrayal of the crisis of faith that took place in the Victorian era, an important subject for literary scholars and historians alike. Moreover, they complicate the standard account of that crisis. Although there were celebrated cases of intellectuals experiencing religious doubt early in the Victorian age, most scholars believe that a more widespread challenge to religion did not occur until much later in the nineteenth century (Chadwick). Most scientists, for instance, held onto their faith even in the wake of Darwin’s Origin of Species (1859), and the supposed conflict of science and religion has proven largely illusory (Turner). However, our work shows that there was a clear collapse in religious publishing that began around the time of the 1851 Religious Census, a steep drop in divine works as a portion of the entire printed record in Britain that could use further explication. Here, publishing appears to be a leading, rather than a lagging, indicator of Victorian culture. At the very least, rather than looking at the usual canon of books, greater attention by scholars to the overall landscape of publishing is necessary to help guide further inquiries.
More in line with the common view of the crisis of faith is the comparative use of “Jesus” and “Christ.” Whereas the more secular “Jesus” appears at a relatively constant rate in book titles (fig. 4, albeit with some reduction between 1870 and 1890), the frequency of titles with the more religiously charged “Christ” drops by a remarkable three-quarters beginning at mid-century (fig. 5).
Fig. 4. The percentage of books published in each year in English in the UK from 1789-1914 that contain the word “Jesus” in their title.
Fig. 5. The percentage of books published in each year in English in the UK from 1789-1914 that contain the word “Christ” in their title.
Prospecting a large textual corpus in this way assumes that one already knows the context of one’s queries, at least in part. But text mining can also inform research on more open-ended questions, where the results of queries should be seen as signposts toward further exploration rather than conclusive evidence. As before, we must retain a skeptical eye while taking seriously what is reflected in a broader range of printed matter than we have normally examined, and how it might challenge conventional wisdom.
The power of text mining allows us to synthesize and compare sources that are typically studied in isolation, such as literature and court cases. For example, another text-mining project focused on the archive of Old Bailey trials brought to our attention a sharp increase in the rate of female bigamy in the late nineteenth century, and less harsh penalties for women who strayed. (For more on this project, see http://criminalintent.org.) We naturally became curious about possible parallels with how “marriage” was described in the Victorian age—that is, how, when, and why women felt at liberty to abandon troubled unions. Because one cannot ask Google’s Ngram Viewer for adjectives that describe “marriage” (scholars have to know what they are looking for in advance with this public interface), we directly queried the Google n-gram corpus for statistically significant descriptors in the Victorian age. Reading the result set of bigrams (two-word couplets) with “marriage” as the second word helped us derive a narrower list of telling phrases. For instance, bigrams that rise significantly over the nineteenth century include “clandestine marriage,” “forbidden marriage,” “foreign marriage,” “fruitless marriage,” “hasty marriage,” “irregular marriage,” “loveless marriage,” and “mixed marriage.” Each bigram represents a good opportunity for further research on the characterization of marriage through close reading, since from our narrowed list we can easily generate a list of books the terms appear in, and many of those works are not commonly cited by scholars because they are rare or were written by less famous authors. Comparing literature and court cases in this way, we have found that descriptions of failed marriages in literature rose in parallel with male bigamy trials, and approximately two decades in advance of the increase in female bigamy trials, a phenomenon that could use further analysis through close reading.
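To give a flavour of that bigram filtering, here is a minimal sketch. It assumes a hypothetical tab-separated file of (bigram, year, count) rows rather than the corpus’s actual format, and substitutes a crude rise-over-the-century test for proper significance measures:

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # descriptor -> year -> count
with open('bigrams.tsv') as f:
    for line in f:
        bigram, year, count = line.rstrip('\n').split('\t')
        first, _, second = bigram.partition(' ')
        if second == 'marriage':
            counts[first][int(year)] += int(count)

# Crude stand-in for a significance test: keep descriptors more common
# at the end of the century than at the beginning.
rising = sorted(w for w, by_year in counts.items()
                if sum(c for y, c in by_year.items() if y >= 1880) >
                sum(c for y, c in by_year.items() if y <= 1820))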
To be sure, these open-ended investigations can sometimes fall flat because of the shifting meaning of words. For instance, although we are both historians of science and are interested in which disciplines are characterized as “sciences” in the Victorian era (and when), the word “science” retained its traditional sense of “organized knowledge” so late into the nineteenth century as to make our extraction of fields described as a “science”—ranging from political economy (368 occurrences) and human [mind and nature] (272) to medicine (105), astronomy (86), comparative mythology (66), and chemistry (65)—not particularly enlightening. Nevertheless, this prospecting arose naturally from the agnostic searching of a huge number of texts themselves, and thus, under more carefully constructed conditions, could yield some insight into how Victorians conceptualized, or at least expressed, what qualified as scientific.
Word collocation is not the only possibility, either. Another experiment looked at what Victorians thought was sinful, and how those views changed over time. With special data from Google, we were able to isolate and condense the specific contexts around the phrase “sinful to” (50 characters on either side of the phrase and including book titles in which it appears) from tens of thousands of books. This massive query of Victorian books led to a result set of nearly a hundred pages of detailed descriptions of acts and behavior Victorian writers classified as sinful. The process allowed us to scan through many more books than we could through traditional techniques, and without having to rely solely on opaque algorithms to indicate what the contexts are, since we could then look at entire sentences and even refer back to the full text when necessary.
In other words, we can remain close to the primary sources and actively engage them following computational activity. In our initial read of these thousands of “snippets” of sin (as Google calls them), we were able to trace a shift from biblically freighted terms to more secular language. It seems that the expanding realm of fiction especially provided more space for new formulations of sin than did the devotional tracts that dominated the early Victorian age.
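That kind of context extraction is easy to approximate on any full-text collection one holds locally. A minimal keyword-in-context sketch, assuming a hypothetical list of (title, full_text) pairs in place of Google’s snippet service:

def snippets(texts, phrase='sinful to', window=50):
    '''Yield (title, snippet) for each occurrence of phrase in each text.'''
    for title, text in texts:
        start = text.find(phrase)
        while start != -1:
            yield title, text[max(0, start - window):start + len(phrase) + window]
            start = text.find(phrase, start + 1)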
Experiments such as these, inchoate as they may be, suggest how basic text mining procedures can complement existing research processes in fields such as literature and history. Although detailed exegeses of single works undoubtedly produce breakthroughs in understanding, combining evidence from multiple sources and multiple methodologies has often yielded the most robust analyses. Far from replacing existing intellectual foundations and research tactics, we see text mining as yet another tool for understanding the history of culture—without pretending to measure it quantitatively—a means complementary to how we already sift historical evidence. The best humanities work will come from synthesizing “data” from different domains; creative scholars will find ways to use text mining in concert with other cultural analytics.
In this context, isolated textual elements such as n-grams aren’t universally unhelpful; examining them can be quite informative if they are used appropriately and with their limitations in mind, especially as preliminary explorations combined with other forms of historical knowledge. It is not the Ngram Viewer or Google searches that are offensive to history, but rather making overblown historical claims from them alone. The most insightful humanities research will likely come not from charting individual words, but from the creative use of longer spans of text, because of the obvious additional context those spans provide. For instance, if you want to look at the history of marriage, charting the word “marriage” itself is far less interesting than seeing if it co-occurs with words like “loving” or “loveless,” or better yet extracting entire sentences around the term and consulting entire, heretofore unexplored works one finds with this method. This allows for serendipity of discovery that might not happen otherwise.
Any robust digital research methodology must allow the scholar to move easily between distant and close reading, between the bird’s eye view and the ground level of the texts themselves. Historical trends—or anomalies—might be revealed by data, but they need to be investigated in detail in order to avoid conclusions that rest on superficial evidence. This is also true for more traditional research processes that rely too heavily on just a few anecdotal examples. The hybrid approach we have briefly described here can help scholars discover exactly which books, chapters, or pages to focus on, without relying solely on sophisticated algorithms that might filter out too much. Flexibility is crucial, as there is no monolithic digital methodology that can be applied to all research questions. Rather than disparage the “digital” in historical research as opposed to the spirit of humanistic inquiry, and continue to uphold a false dichotomy between close and distant reading, we prefer the best of both worlds for broader and richer inquiries than are possible using traditional methodologies alone.
Chadwick, Owen. The Victorian Church. New York: Oxford University Press, 1966.
Houghton, Walter Edwards. The Victorian Frame of Mind, 1830-1870. New Haven: Published for Wellesley College by Yale University Press, 1957.
Michel, Jean-Baptiste, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331.6014 (2011): 176–182.
O’Malley, Michael. “Ngrammatic.” The Aporetic, 21 Dec. 2010, http://theaporetic.com/?p=1369.
Schulz, Kathryn. “The Mechanic Muse: What Is Distant Reading?” The New York Times 24 Jun. 2011, BR14.
Turner, Frank M. Between Science and Religion: The Reaction to Scientific Naturalism in Late Victorian England. New Haven: Yale University Press, 1974.
By my own criteria I’ve already failed… I started this series of posts with the intention of documenting the process of finding and extracting editorials as I was actually doing the work. But here I am about to describe some work I finished a few weeks back. Oh well…
In my previous instalments (here and here), I focused on the Sydney Morning Herald. Having continued the hunt for missing editorials I started in the last post, I’ve now got a CSV file with the urls of the first editorial published in every edition of the SMH from 1913. Good-o, I thought, I can now start harvesting and analysing some content.
But then ensued a crisis of faith. The whole point of this exercise was to be able to build up some comparisons – between newspapers, between states, between the city and the bush. But the process of actually finding the editorials seemed beset with difficulties. Could the rules I developed for the SMH be applied elsewhere? Could I ever assemble a useful set of editorials without large amounts of human intervention? I decided to try a few quick experiments to see whether the whole project was worth pursuing.
I started with a few assumptions:

- editorials are headed with the name of the newspaper in which they appear;
- editorials appear on even-numbered pages;
- editorials fall between a lower and an upper word limit;
- there should be one, and only one, first editorial for each day on which the paper was published.
These assumptions were based on my own experience as a long-time newspaper researcher and on some preliminary poking around. For example, when I looked at The Argus I noticed that editorials were typically followed by news summaries. Unfortunately, these are treated as a single article in Trove, resulting in large blocks of text that are only part editorial. By specifying an upper word limit I hoped to filter these sorts of articles out. Similarly, there are sometimes brief announcements or publication details headed with the name of the newspaper. The lower word limit was intended to exclude these.
The next step was to harvest every article from 1913 that was headed with the name of its publication. I created a script to generate a list of all the newspapers that published issues in 1913. Then I called my existing harvester to download all the matching articles and save the details to a series of CSV files — one CSV file per newspaper.
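One way to build that list of newspapers, sketched here using Trove’s titlesOverDates JSON feed described elsewhere in this series (my actual script may have differed in detail):

import json
import urllib2

def titles_in_year(year):
    '''Collect the ids of all newspaper titles with at least one issue in a year.'''
    titles = set()
    for month in range(1, 13):
        url = 'http://trove.nla.gov.au/ndp/del/titlesOverDates/%s/%02d' % (year, month)
        for issue in json.load(urllib2.urlopen(url)):
            titles.add(issue['t'])
    return titles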
In the previous instalment of this series I created a script to check the CSV output of my harvester for missing or duplicate dates. I extended this to perform a series of tests on each article based on the assumptions above. First, I filtered out articles on odd-numbered pages, then articles that were too short or too long. Finally I checked the remainder for missing or duplicate issue dates.
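In condensed form, the tests looked something like the sketch below; the word limits here are placeholders rather than my real values, and I’m assuming each article row carries ‘page’ and ‘words’ fields:

MIN_WORDS, MAX_WORDS = 200, 2000  # hypothetical limits, not the real ones

def classify(article):
    '''Sort an article (a dict built from the CSV) into a test category.'''
    if int(article['page']) % 2 != 0:
        return 'odd_page'
    words = int(article['words'])
    if words < MIN_WORDS:
        return 'too_short'
    if words > MAX_WORDS:
        return 'too_long'
    return 'possible_editorial'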
The details of the articles in each category were written out to JSON files. Using these files and a bit of jQuery magic I could quickly build a simple web interface that allowed me to explore the results.
You can browse the summary results for the full list of newspapers, or you can drill down to view the actual articles assigned to each category.
I’ll save the full analysis for the next post, but if you play around with the results you quickly notice a few things. First, letters to the editor often include the name of the newspaper! If you look at The Mercury, for example, you’ll notice I’ve identified 1057 potential editorials — most of which are letters. Fortunately they should be fairly easy to filter out. In most cases the ‘even numbers only’ assumption worked pretty well, and the word length filters did remove quite a lot of false positives. There are still plenty of problems, but I’m encouraged enough to continue. Yes, there will be a Part #4!
In previous posts, I’ve shown how WordSeer can be used to explore small, well-defined questions: what word did Shakespeare use for ‘beautiful’? Is the occurrence of the word ‘love’ the same in the comedies and tragedies? This post is different. WordSeer has now developed enough to support a simple, but complete, exploratory analysis.
The question we’ll think about is this:
“How does the portrayal of men and women in Shakespeare’s plays change under different circumstances?”
As one answer, we’ll see how WordSeer suggests that when love is a major plot point, the language referring to women changes to become more physical, and the language referring to men becomes more sentimental.
We began our analysis with the question, “what are some things that are portrayed as ‘his’ and some things that are ‘hers’?” A typical keyword search returns an unstructured list of results, and a standard approach in literature study is to view them in a concordance. This is a list of all the sentences in which a word occurs, with the target word aligned in the center of the view, exposing the contexts to its left and right, sorted in some way. WordSeer uses the word tree concordance visualization, which makes common contexts easier to see by grouping them in a tree-like structure.
The word tree for her is shown in Figure 1 above. Some words like beauty stand out, but constructions like her own muddy the picture. The problem lies in the different ways in which his and her are used. The word his is always a possessive pronoun, and word sequences containing his would nearly always be relevant. However, her can also be a 3rd-person pronoun, and will yield constructions like “I told her that X” and “I gave her the Y”.
With WordSeer, we can get around this problem with grammatical search. The system uses natural language processing (NLP) to extract relationships between words, and allows users to specify both keywords and relationships between them. In the tool’s search interface, pairs of words are specified using input boxes, and the relationship between them is selected from a drop-down menu (Figure 2). Leaving a word-input box blank returns all matches.
With this feature, we can take advantage of the fact that possessive relationships between words can be automatically detected, to express our question precisely: “what are all the words with which his has a possessive relationship?” The results are shown in Figure 3 below.
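WordSeer has its own NLP pipeline for this; purely as an illustration of the underlying idea (and not of WordSeer’s implementation), here is how possessive relationships can be extracted with an off-the-shelf dependency parser:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

def possessions(text, pronoun):
    '''Count the nouns possessed by a given pronoun in the text.'''
    doc = nlp(text)
    return Counter(tok.head.lemma_ for tok in doc
                   if tok.dep_ == 'poss' and tok.lower_ == pronoun)

print(possessions("Her father praised her beauty and her cheek.", 'her'))
# e.g. Counter({'father': 1, 'beauty': 1, 'cheek': 1})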
Comparing these words with those for her (Figure 4 below) reveals immediate differences. The word father is most common, with son close behind. Several body parts enter the picture, cheek among them. A picture emerges: women’s most commonly-mentioned possessions are their male relatives and their bodies.
Our next question was whether this physical, male-dominated picture of women was consistent, or whether it changed in different types of plays. We used the tool’s collections feature to divide the plays into comedies, tragedies, and histories – the three most commonly-accepted categorizations of Shakespeare’s plays. We also created pre-1600 and post-1600 categories to check whether there were temporal differences.
Collections were created using the “collections” bay, a collapsible window at the bottom of the screen. We added the appropriate plays through the document listing (sortable and filterable by date, title, full-text search, grammatical search, and length).
We used the tool’s newspaper-strip visualization (Figure 6) to compare the prevalence of the two categories of words in different types of plays. Each play is represented as a long column. Within each column, small, colored horizontal blocks (corresponding to 10 sentences each) highlight the presence of a match.
The results for the tragedies collection were similar to the results for comedies (Figure 6) but in histories (Figure 7), an interesting pattern emerged. It seemed that body parts (blue) were somewhat less prevalent in these plays, but family (orange) remained unchanged.
WordSeer supports quick, large-scale analysis through search and visualization, but in all cases maintains links back to the source text. Hovering over a blue or orange highlighted block in Figures 6 or 7 brings up a popup displaying the matching sentence. Clicking opens the reading interface to that point (Figure 8). The full text of the document is loaded, and the system automatically scrolls to the relevant sentence, and highlights it.
Hovering over a few body-part results quickly led to a new hypothesis. In our rough sample, many of the mentions sounded romantic. We used the reading and annotating interface to follow up on this by clicking on the highlighted blocks in the newspaper-column visualization.
We selected the speeches referring to body parts and tagged them by the topics they seemed to contain. It soon became apparent that many of the mentions were speeches by a lover.
Our hypothesis was strengthened when we viewed related words. For exploration of style and language, WordSeer uses computational linguistics to calculate words commonly used in similar contexts, or commonly used within a 10-sentence window of each other. Clicking on any word while reading brings up a small window showing related words.
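WordSeer computes related words with its own pipeline; a crude stand-in for the windowed co-occurrence half of the idea might look like this:

from collections import Counter

def related_words(sentences, target, window=10):
    '''Count words appearing within `window` sentences of each use of target.'''
    tokenized = [s.lower().split() for s in sentences]
    related = Counter()
    for i, words in enumerate(tokenized):
        if target in words:
            for context in tokenized[max(0, i - window):i + window + 1]:
                related.update(w for w in context if w != target)
    return related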
In our example, the related words for body-parts (e.g. Figure 10 for face) strengthened our growing suspicion that female body part mentions were associated with romance. The popup shows that other body parts are frequently mentioned as well.
We created a final pair of categories focusing on love: not-love-stories for plays in which love is not a major plot point, and love-stories for plays in which it is. When we reorganized the plays along these lines, the results were immediate.
In the love-stories (Figure 11), we see both body parts and male relatives. By contrast, the not-love-stories visualization (Figure 12) shows predominantly male relatives, and hovering over the occurrences of body parts reveals a gloomier picture.
The grammatical search results (below) agree with the newspaper-strip visualizations and related words. We see more physical attributes possessed by her in the love-stories collection (Figure 13a) than in the not-love collection (13b).
The grammatical search results show that the language around men changes as well (Figures 14a and 14b below). In the not-love case, the only woman to appear is mother, at number 20, but in the love case, wife takes first place, followed by favor. Compared to the physical language for women, these words have a more sentimental quality.
Thus, we see that, while a male-dominated picture of both men and women is always present, physical aspects are more prominent for women in plays about love. For men, the more sentimental aspects come to the fore.
WordSeer is being developed through case studies. This means we observe scholars working with texts, figure out what they need, and then try to translate it into interactions, text mining algorithms, and visualizations. Therefore, when the time comes to demonstrate it, I always think examples work better than anything else.
So what do the literature scholars among you think of this simple example? How might it be improved, and made more convincing? What are its flaws? What would you have done? Please comment, even if it is to criticize. It would be great to hear your thoughts.
Back when I was looking at ‘When did the Great War become the First World War?’ I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so…
Anyway, the result is a rather neat little gizmo henceforth named QueryPic (I got a bit sick of ‘search summariser’ and ‘graph-maker thing’). The first version just harvested data and left all the graph-making to you. But QueryPic does it all! It harvests the data and makes the graph. Woohoo.
Here’s an example showing ‘drought’ versus ‘flood’:
Yes, it’s a Python script and yes it runs on the command line. Let’s get that out of the way now. I don’t think I have the time and energy to develop cross-platform gui versions of all my tools. I’d rather spend the time adding new features or exploring new possibilities. Sorry, but until I have a wealthy benefactor or a technical support team, I think that’s the way it has to be. In any case, the code is all there – so build your own gui!
Actually, if I did have the time and energy I don’t think I’d build a standalone gui anyway. What would be much cooler would be a web service, where people could run, share and combine their queries. Social graph-making! A celebration of serendipity! A historical playground! Hmmm…
But for now there’s this python script. It’s dead easy to use. Starting from the beginning…
There are a number of optional arguments that you add to the command line to customise your results:
-n (or --name) [a query name]
Give a name to your query. The name is used to create filenames for the html and data files, and it also appears in the legend of the graph. The default is to use the search keywords as the name.
-d (or --directory) [a directory path]
The full pathname of the directory/folder for your results. The default is a ‘graphs’ sub-directory in the current directory.
-g (or --graph) [a graph name]
Specify the name of the html file that’s created. This is useful for displaying multiple queries on a single graph. Just run QueryPic for each query, using the same graph name each time. The default is either the value specified by the -n parameter or a name derived from the search keywords.
-m (or --monthly)
Plot the query at monthly intervals. The default interval is a year.
QueryPic builds a simple visualisation of your search query in the Trove newspaper database. A list of search results is difficult to interpret and offers little context. QueryPic shows you the number of articles matching your query over time, enabling you to reframe your questions, pursue hunches, or simply play around.
QueryPic takes your Trove newspaper query and looks for a date range. If it doesn’t find one, it assumes you want your graph to go from 1803 to 1954 (the complete contents of the newspaper database — except for the Women’s Weekly). QueryPic then strips out any date parameters from the query, so it can fire off the query within the start and end dates, at the specified date interval.
Date interval? In the previous version of this script you could only plot points at yearly intervals, so it was impossible to zoom in and see what might be happening over the span of a single year or two. But amazing advances in QueryPic technology mean you can now plot changes by month. Here for example is a new version of my Great War/First World War graph, focused on 1938–1946 and plotted at monthly intervals.
So for each interval within the date range QueryPic fires off a request to Trove. From the response it scrapes out the total number of results for that date. If the total is greater than zero, it then fires off a second request to find the total number of newspaper articles for that year. Your query results divided by the total number of articles gives the proportion of articles for that date matching your search query.
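In sketch form the calculation for each interval amounts to the following, with fetch_total() and all_articles_url() as hypothetical helpers standing in for the script’s actual request-and-scrape code:

# fetch_total() and all_articles_url() are hypothetical helpers standing
# in for the script's actual request-and-scrape code.
def proportion(query_url, year):
    '''Matching articles as a fraction of all articles for a given year.'''
    matches = fetch_total('%s&fromyyyy=%s&toyyyy=%s' % (query_url, year, year))
    if matches == 0:
        return 0.0
    # Second request: the total number of articles published that year.
    return float(matches) / fetch_total(all_articles_url(year))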
Plot ‘cat’ against ‘dog’ in a graph called ‘animals’:
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -g "animals"
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=dog" -g "animals"
Specify a directory for your results:
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -d "/User/bill/Documents/graphs"
Plot results at monthly intervals:
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat&fromyyyy=1920&toyyyy=1921" -m
Specify a name:
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -n "Felines"
As I explained in the first of this series, I’m documenting my efforts to extract every editorial published in the Sydney Morning Herald in 1913 from the Trove newspaper database. It’s an experiment both in text mining and historical writing — an attempt to put the method up front.
While I didn’t think there was anything very thrilling in the first instalment, recording my thoughts and assumptions in this way has already proved useful. In a comment, Owen Stephens noted that his attempt to reproduce my search query produced fewer results. After a little bit of poking around I realised that the fulltext modifier, which I often use to switch off fuzzy matching, counteracts the ‘search headings only’ flag. So my query was returning results that had the string ‘The Sydney Morning Herald’ anywhere in the article.
Try it for yourself.
Here’s my original query — searching for fulltext:”The Sydney Morning Herald” in headings only (supposedly). You’ll notice that it returns 335 results and it’s clear from a quick scan that a number are false positives (they don’t follow the pattern for editorials).
Here’s Owen’s query — searching for “The Sydney Morning Herald” in headings only. It returns 294 results, without any obvious false positives.
So my attempt to disable fuzzy matching actually produced a less accurate result! Weird.
Actually, I think one important benefit of this sort of text mining is that it helps you understand how the search engines you’re using actually work. Once you start poking and prodding, the idiosyncrasies start to emerge.
Anyway, I harvested Owen’s cleaner result set and opened up the resulting csv file. As it had seemed in Trove, there were very few false positives. Indeed there were only two articles that didn’t seem to follow the standard editorial format, and these were notes added to the editorial page. On the other hand, there were obviously about 20 editorials missing. I could have manually worked through the csv file to identify the missing dates, but I thought I’d try to create some tools that would do the work for me.
What I wanted was the details of the first editorial in every edition of the newspaper in 1913 — so there should be one, and only one, article for each day on which the newspaper was published. I needed a tool that would analyse the csv file and do two things:

- find any dates that had more than one article (duplicates);
- find any dates on which the paper was published but no article was harvested (missing).
The resulting code is all on GitHub if you want to follow along. I wrote a Python script that opens up the csv file, extracts all the date strings, converts them to datetime objects and then saves them to a list. Once that’s done it’s pretty easy to loop through and find duplicates:
def find_duplicates(values):
    '''Check a list for duplicate values. Returns a list of the duplicates.'''
    seen = set()
    duplicates = []
    for item in values:
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates
Finding missing dates was a little more complicated, but Google came to the rescue with some handy code samples. All I had to do was set a start and end date (in this case 1 January 1913 and 31 December 1913) and create a timedelta object equal to a day. Then it’s just a matter of adding the timedelta to the start date, comparing the new date to the dates extracted from the csv file, and continuing on until you hit the end. If the new date isn’t in the csv file, then it gets added to the missing list.
if year:
    start_date = datetime.date(year, 1, 1)
    end_date = datetime.date(year, 12, 31)
else:
    start_date = article_dates[0]
    end_date = article_dates[-1]
one_day = datetime.timedelta(days=1)
this_day = start_date
# Loop through each day in the specified period to see if there's an article.
# If not, add the date to the missing_dates list.
while this_day <= end_date:
    if this_day.weekday() not in exclude:  # exclude Sunday
        if this_day not in article_dates:
            missing_dates.append(this_day)
    this_day += one_day
I’ve tried to make the code as reusable as possible, so you can either supply a year, or the script will read start and end dates from the csv file itself.
All that left me with two more lists of dates: ‘duplicates’ and ‘missing’. At first I just wrote these out to a text file, but then I decided it would be useful to write the results to an html page. That way I could add links that would take me to the actual issue within Trove, helping me to quickly find the missing editorial.
Unfortunately there’s no direct way to go from a date to an issue — you first need to find the issue identifier. How do you do this? If you dig around in the code beneath the page for each newspaper title, you’ll find that the ajax interface pulls in a json file with issue information. You can access this through a url like: http://trove.nla.gov.au/ndp/del/titlesOverDates/[year]/[month]. Here’s an example for January 1913.
The json includes all issues for all titles in the specified month. So you then have to loop through to find a specific title and day. Once you have the issue identifier you can just attach it to a url:
import json
import urllib2

def get_issue_url(date, title_id):
    '''Gets the issue url given a title and date.'''
    year, month, day = date.timetuple()[:3]
    url = 'http://trove.nla.gov.au/ndp/del/titlesOverDates/%s/%02d' % (year, month)
    issues = json.load(urllib2.urlopen(url))
    for issue in issues:
        if issue['t'] == title_id and int(issue['p']) == day:
            issue_id = issue['iss']
            return 'http://trove.nla.gov.au/ndp/del/issue/%s' % issue_id
Finally, to save myself having to cut and paste the missing dates back into the csv file, I added a few lines to write them in automatically.
So now I have a handy little html page, complete with dates and links, that I’m working through to find all the missing editorials. All I need for the next stage are the urls for the editorial and the page on which it’s published. I’m just cutting and pasting these from the citation box in Trove into the csv file. Once this is done I can start trying to find all the editorials.
PS: I noted in my first post that one benefit in finding the editorials was that the main news articles usually appeared on the page after the editorials. I’ve been thinking some more about ways to identify ‘major’ news stories. Word length perhaps? But not always. Hmmm, but major stories do seem to be published at the top of the page. After a bit more poking around in the code I found that there’s a ‘y value’ assigned to each article that indicates its position on the page. So if I harvest all the articles on the page after the editorials and then rank them by their y values? Interesting…
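If that pans out, the ranking itself would be trivial; a sketch, assuming each harvested article is a dict carrying that ‘y’ value:

# Assuming smaller y means closer to the top of the page:
ranked = sorted(articles_on_page, key=lambda a: int(a['y']))
top_story = ranked[0]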
When scholars try to make sense out of large collections of text, they frequently do two things: compare, and collect. They collect samples of “interesting” things, and compare them with each other along various relevant dimensions.
In this post, I demonstrate the collection and comparison features of WordSeer by using it to compare the usage of the word “love” in Shakespeare’s comedies and tragedies. You can watch the screencast, or simply read on.
The first thing to do is collect the comedies and tragedies into separate lists. To do this, I created a new collection called “tragedies” using the new “collections” feature.
Next, I had to collect all of Shakespeare’s tragedies into that collection. Figure 2 shows WordSeer’s list of plays. I walked down this list and clicked the checkboxes next to the tragedies, using Wikipedia as an authoritative source of tragedies.
Once I’d selected all the tragedies, I clicked the “Add Items” button to add them to a collection. I selected the “tragedies” collection and added the plays.
This populated the collection with the plays. I did the same for the comedies, ending up with two collections.
I was now ready to compare my collections. I opened up two windows to the heat map view. One was going to visualize the tragedies, and one the comedies.
Finally, I was ready to compare the two. I was interested in the word “love”, and whether there would be any differences in how frequently it was used in the comedies and the tragedies. To that end, I typed in “love” into the comedies window and got the heat map in Figure 7.
Not surprisingly, “love” is everywhere. But what about the tragedies? In the other window, typing in “love” yielded the results in Figure 8.
To my surprise, the tragedies were equally full of “love”. Which, among other things, reveals my poor knowledge of Shakespeare.
In their chapter in Writing History in the Digital Age, Trevor Owens and Fred Gibbs encourage historians to write about the ways they work with data — to document their methods, their working assumptions, their dead ends and their discoveries. It’s an important argument and one that makes me wonder again about forms of publication that might integrate narrative, methods and sources.
In the meantime though we have blogs. My problem is that I’m easily bored so by the time I get to the end of a project or experiment I’m already thinking about the next one. Going back and trying to write things up seems a bit of a chore (which is why I’m always way behind in my blog writing). Also leaving the writing to the end means that I tend to take shortcuts — leaving out some of the ‘boring’ procedural stuff or the ‘stupid’ ideas that just didn’t work.
But Trevor and Fred’s chapter has made me think I should be a bit more diligent, so as I start a new series of text-mining experiments I’ve decided to write things up as I’m doing them. So be warned, this could get messy…
So what do I want to do? You might not be surprised to learn that it’s another Trove newspaper database experiment. I want to see if I can harvest newspaper editorials over a certain period and then analyse these to build up a picture of what issues, events or ideas were perceived as important. As I’m currently looking at ways of harvesting digital sources relating to 1913 for an exhibition being developed by the National Museum of Australia, I’m going to start by focusing on 1913.
But editorials are opinion pieces, wouldn’t it be better to harvest ‘news’ articles?
First of all, I’m thinking that editorials will be fairly easy to identify and extract — there’s no real way in Trove to separate out current news from other sorts of articles. Secondly, I’m assuming that the issues that make it into editorials have some importance attached to them. Attached by whom, you may well ask — whose voice is being represented in the editorial? This is an important question and I’m thinking that it could be explored in interesting ways by harvesting editorials from a range of papers and regions. Thirdly, finding the editorials might actually help me find the major news articles, simply because in this period the main news stories were often on the page after the editorials.
So how do I find them? Looking at the Sydney Morning Herald for 1913, you can see that the editorials follow a regular pattern:

- the first editorial each day is headed ‘The Sydney Morning Herald’;
- editorials appear six days a week (there was no Sunday edition);
- they appear within a fairly consistent range of pages;
- the remaining editorials follow the first on the same page, without subtitles.
To check this I conducted a search for articles including ‘The Sydney Morning Herald’ in their title. The search returns 335 results. Of course we’d expect there to be 312 (6 x 52), but it looks like there’s quite a few false positives and some days missing altogether (presumably due to OCR errors). You can see there’s a fair bit of consistency in the pages that editorials appear on, but it doesn’t quite seem consistent enough to rely on. So I’ve decided that as a first step I’ll harvest all the articles from this query. I’ll then do some manual cleaning to remove the articles that aren’t editorials and try and identify and retrieve the missing days.
Remember, this won’t give me all the editorials, only the first editorial from each day. To get all the editorials, I’ll have to write a new script that will take this first result set, retrieve all the articles from the editorial page and then try to work out which of the articles are editorials — they should be the ones that come after the first editorial and have no subtitle. Or that’s the theory.
I’ve harvested the query. You can view the spreadsheet on Google Docs if you feel so moved.
[After I wrote the sentence above I checked the CSV file properly and realised I'd stuffed up. There's a bit of a bug in my harvester that means if the query string you use includes a start value, the harvester will retrieve the same page of results over and over again... I really need to fix that. I'm now running it again. You wanted warts and all, right?]
[After I wrote the paragraph above I checked my new harvest and realised I'd stuffed up again. There were only half as many results as there should have been! So I poked around and realised some recent changes I'd made to the harvest script meant I was only getting odd numbered results (I was incrementing the row value twice). A lesson in what happens when you do this stuff late at night... Trying again. ]
Briefly, Zapaday is a new (in alpha) online tool which mines the web (with a bit of editorial thrown in) for future events to create a global public calendar. This puts Zapaday in some overlap with Recorded Future, but with a clearer message and narrower scope.
Here's the Recorded Future video for convenience: