Jan 24 2013
 

Last week I took my daughter to Sydney so she could attend a girls-only Minecraft workshop at the Powerhouse Museum (they created some wonderful things). It was a 3½ hour bus journey each way, so to keep myself occupied I set myself the challenge of trying to build something en route. I made a fair bit of progress, but ultimately failed. I had to steal a few extra hours this week to get it to the point where people might find it useful.

The Australian WWI Records Finder

So here it is — a (sort of) aggregated search interface to records about Australian First World War service personnel. Give it a name and it will search:

It’s ‘sort-of’ aggregated because it’s really just a series of separate searches presented on the one page. But even this should make it easier for people to match up records across the different data sets.

Using

Type in a family name and, optionally, a given name or a service number. Hit search. Wait. Wait a bit more. The National Archives’ RecordSearch database can often be pretty slow. Eventually though, each of the databases will be queried in turn and the results added to the page.

Once the results have loaded, click on a title and the little spinny thing will start up again as more details are retrieved from the database. In this ‘detail’ view, all the other results from the database are hidden. This makes it a bit easier to compare records across databases. Just click on the title again to go back to the ‘list’ view.

If your search returns lots of results, you can use the ‘next’ and ‘previous’ links to explore the complete set. They’ll all load in the current page via the magic of AJAX.

It’s not obvious from the interface, but you can feed query parameters directly via the url. For example try http://wraggelabs.com/ww1-records/?family_name=wragge. Why is this useful? Perhaps you’ve got your own database of names on the web. Using this you could easily create links from each name that looked for relevant records in the Finder.
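
As a rough illustration of linking from your own list of names, the sketch below builds Finder links with Python's standard library. Only the `family_name` parameter appears in the example url above; the `given_name` parameter name and the `finder_url` helper are assumptions made for illustration, so check them against the live form before relying on them.

```python
from urllib.parse import urlencode

FINDER_BASE = "http://wraggelabs.com/ww1-records/"


def finder_url(family_name, given_name=None):
    """Build a Finder search link for one person."""
    params = {"family_name": family_name}
    if given_name:
        # 'given_name' is a guessed parameter name, not confirmed by the post.
        params["given_name"] = given_name
    # urlencode escapes spaces, apostrophes and other unsafe characters.
    return FINDER_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for family, given in [("wragge", None), ("smith", "bruce")]:
        print(finder_url(family, given))
```

Each name in your own database could then be wrapped in a link pointing at the url this returns.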

That’s about it. It’s just a quick, bus-trip-inspired experiment, so there are many limitations and future possibilities.

Limitations

<!–INSERT USUAL WARNING ABOUT THE FRUSTRATIONS OF SCREEN SCRAPING–>

I’m just using the standard search interfaces of the various databases and screen-scraping the results. Unfortunately they all work slightly differently. For example, the AWM databases don’t distinguish between family names and given names, so if you search for the family name ‘Smith’ you’ll also get results like ‘Jones, Bruce Smith’. The CWGC database, on the other hand, will only match an other name if it comes first, while RecordSearch (or more strictly NameSearch) will also match the names of next-of-kin. Fun fun fun.

I figure anything is better than nothing, but if you’re not getting the results you expect head off to the original interfaces and try your luck there. I’m making no promises.

You’ll also notice that the maximum number of results for each data source varies. The CWGC returns 15 results, while the AWM hands over a whopping 50. These are just the default settings for the original search engines. I could’ve fiddled with the settings, but it didn’t really seem worth it.

And oh yeah… screen scraping… inherently fragile… might fall over and die at any minute.

Possibilities

As you may have guessed from previous posts, I rather like making connections. This experiment grew out of the work I’m doing on the ‘Doing Our Bit’ project with the Mosman Library. I’ve been building a series of forms that will make it easy for contributors to link people in the Mosman project to any of these databases. Just paste in a url from RecordSearch and the system will automagically retrieve all the file metadata and also check for an entry in Mapping our Anzacs. It’s pretty nifty. But of course it made me think about having a way to search across all these different databases.

And then what?

Having found a series of records for an individual it would be good if they could then be permanently linked. If I had the time and money to do more work on this, I’d want to allow people to save the connections they find. And of course then expose these connections as Linked Open Data. It wouldn’t be difficult.

There’s probably also a lot more that could be done with machine matching of records. Perhaps someone’s already working on this for the centenary — it seems like an obvious point of attack. It would be good if the forthcoming centenary commemorations resulted in something that brought all these datasets together and exposed identifiers that could be easily used by community projects like ‘Doing Our Bit’.

Details

Yes, I cheated. I had already done a lot of work on the screen-scrapery bits of this pre bus trip. I’ve been working on a RecordSearch client on and off for a while to use with projects like Invisible Australians. The AWM and CWGC scrapers I wrote for ‘Doing Our Bit’. Feel free to grab the code and play.

The actual application was built using the Python micro-framework Flask. I’m a big fan of Django, but there’s a lot of overhead involved if you just want to throw together a simple app. I’ve been wanting to try Flask for a while and was pleased to find just how quick and fun it was to get something up and running.

To make the whole thing as responsive as possible, the search results are retrieved using AJAX calls to simple APIs I built in Flask on top of my screen scraper code. There’s actually very little code in the Flask app itself. The downside of this is that the Javascript is a bit of a mess. Ah well.
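
To give a feel for how thin such an API layer can be, here is a minimal sketch of the idea, not the code behind the Finder. It leaves out the Flask routing altogether and just shows a function that picks a scraper by name and returns JSON for the browser to consume. The `fake_awm_search` scraper, the `SCRAPERS` table and `api_search` are all hypothetical names.

```python
import json


def fake_awm_search(family_name, given_name=""):
    """Stand-in for a real scraper; returns canned results."""
    title = f"{family_name.title()}, {given_name.title()}".strip(", ")
    return [{"title": title}]


# Map each data source name to the function that queries it.
SCRAPERS = {"awm": fake_awm_search}


def api_search(source, **query):
    """Return a JSON string for one source, as an AJAX endpoint might."""
    scraper = SCRAPERS.get(source)
    if scraper is None:
        return json.dumps({"source": source, "error": "unknown source"})
    try:
        results = scraper(**query)
    except Exception as exc:
        # Scrapers are fragile, so report the failure rather than crash.
        return json.dumps({"source": source, "error": str(exc)})
    return json.dumps({"source": source, "results": results})


if __name__ == "__main__":
    print(api_search("awm", family_name="smith", given_name="bruce"))
```

Because each source is queried separately, a slow or broken database only holds up its own panel on the page.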

Next

I don’t know whether I can put any more time into this at the moment — too many other projects competing for my time and no more bus trips coming up. But if you think it’s useful or worthwhile please let me know and I’ll see what I can do.

At the very least it shows how with just a little impatience and ingenuity we can find fairly simple ways to integrate records from a variety of sources. We don’t have to wait for some centralised solution.

Dec 30 2012
 

I obviously did a lot of talking in 2012, but I also made a few things…

The evolution of QueryPic

Try QueryPic

At the start of 2012 QueryPic was a fairly messy Python script that scraped data from the Trove newspaper database and generated a local html file. It worked well enough and was generously reviewed in the Journal of Digital Humanities. But QueryPic’s ability to generate a quick visualisation of a newspaper search was undermined by the work necessary to get the script running in the first place. I wanted it to be easy and accessible for everyone.

Fortunately the folks at the National Library of Australia had already started work on an API. Once it became available for beta testing, I started rebuilding QueryPic — replacing the Python and screen-scraping with Javascript and JSON.

In the meantime, I headed over to New Zealand for a digital history workshop and began to wonder about building a NZ version of QueryPic based on the content of Papers Past, available through the DigitalNZ API. The work I’d already done with the Trove API made this remarkably easy and QueryPic NZ was born.

Once the Trove API was publicly released I finished off the new version of QueryPic. Instead of a Python script that had to be downloaded and run from the command line, QueryPic was now a simple web form that generated visualisations on demand.

The new version also included a ‘shareable’ link, but all this really did was regenerate the query. There was no way of citing a visualisation as it existed at a certain point in time. If QueryPic was going to be of scholarly use, it needed to be properly citable. I also wanted to make it possible to visualise more complex queries.

And so the next step in QueryPic’s evolution was to hook the web form to a backend database that would store queries and make them available through persistent urls. With the addition of various other bells and whistles, QueryPic became a fully-fledged web application — a place for people to play, to share and to explore.

Headlines and history

Explore The Front Page

Back in 2011 I started examining ways of finding and extracting editorials from digitised newspapers.  Because the location of editorials is often tied up with the main news stories, this started me thinking about when the news moved to the front page. And of course this meant that I ended up downloading the metadata for four million newspaper articles and building a public web application — The Front Page — to explore the results. ;-)

The Front Page was also the first resource published on my new dhistory site (since joined by the Archives Viewer and QueryPic). dhistory — ‘your digital history workbench’ — is where I hope to collect tools and resources that have graduated from WraggeLabs.

Viewing archives

Try Archives Viewer

In 2012 I also revisited some older projects. After much hair-pulling and head-scratching, I finally managed to get the Zotero translator for the National Archives of Australia’s RecordSearch database working nicely again. I also updated it to work with the latest versions of Zotero, including the new bookmarklet.

My various userscripts for RecordSearch also needed some maintenance. This prompted me to reconsider my hacked together alternative interface for viewing digitised files in RecordSearch. While the userscript worked pretty well, there were limits to what I could do. The alternative was to build a separate web interface… and so the Archives Viewer was born.

Stories and data

Expect bugs ye who enter here...

In the ‘work-in-progress’ category is the demo I put together for my NDF2012 talk, Small stories in a big data world. Expect to see more of this…

My favourite things

Two things I made in 2012 are rather special (to me at least). Instead of responding to particular needs or frustrations, these projects emerged from late night flashes of inspiration — ‘what if…?’ moments. They’re not particularly useful, but both have encouraged me to think about what I do in different ways.

Play!

The Future of the Past is a way of exploring a set of newspaper articles from Trove. I’ve told the story of its creation elsewhere — I simply fell in love with the evocative combinations of words that were being generated by text analysis and wanted to share them. It’s playful, surprising and frustrating. And you can make your own tweetable fridge poetry!

The People Inside

One night I was thinking about The Real Face of White Australia and the work I’d done extracting photos of people from the records of the National Archives of Australia’s database. I wondered what would happen if we went the other way — if we put the people back into RecordSearch. The result was The People Inside – an experiment in rethinking archival interfaces.

 

Aug 29 2012
 

I’m deeply in love with the collections of the National Archives of Australia. They move me, they inspire me, they make me want to do something. How do I express my love? I’ve written stories about things like atomic bombs, progress, astronomy and weather forecasting — pursuing lives and events documented in the Archives’ rich holdings. I work on projects like Invisible Australians, hoping to bring the compelling remnants of the White Australia Policy to broader public attention. And I build things. I make tools that help other people explore, understand and use the Archives. I do this because these riches need to be used. They need to be shared. They need to be part of the fabric of our lives.

A few years ago I created a little script for Firefox that put a fresh face on the display of digitised records in the National Archives’ RecordSearch database. It’s publicly available and has been installed more than 500 times. Demonstrating this script at the ‘Doing our bit’ Build-a-thon a few weeks ago made me realise again both how useful it was and how much work it still needed.

One of the most exciting features when I first created the script was the ability to display the records on a ‘3D wall’, courtesy of a Firefox plugin called CoolIris. But CoolIris uses Flash and is no longer being supported. Time for a new approach.

Say hello to the Archives Viewer (naming things isn’t really one of my strengths). Instead of rewriting my existing script I decided to create a completely new web application. Why? Mainly because it gave me a lot more flexibility. I could also make use of a variety of existing tools and frameworks like Django, Bootstrap, Isotope and FancyBox. Standing upon the code of giants, I had the whole thing up and running in a single weekend.

What does it do? Simply put, just feed the Archives Viewer the barcode of a digitised file in RecordSearch and it grabs the metadata and images and displays them in a variety of useful ways. It’s really pretty simple, both in execution and design.

Yep, there’s a wall. It’s not quite as spacey and zoom-y as the CoolIris version, but perhaps that’s a good thing. It’s just a flat wall of page image thumbnails with a bit of lightbox-style magic thrown in. But when I say just, well… look for yourself. There’s something a bit magical about seeing all the pages of a file at once, taking in their shapes and colours as well as their content. This digital wall provides a strangely powerful reminder of the physical object.

National Archives of Australia: ST84/1, 1908/471-480

Of course you can also view the file page by page if you want. Printing is a snap — just type in any combination of pages or page ranges and hit the button. The images and metadata are assembled ready to print. No more wondering ‘which file did this print out come from?’.
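
The post doesn't say exactly what syntax the print box accepts, so the following is only a sketch of how a string of pages and page ranges might be expanded. The comma-and-hyphen format and the `parse_pages` function are assumptions, not the Archives Viewer's own code.

```python
def parse_pages(spec, last_page):
    """Expand a string like '1-3, 7' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        if start > end:
            # Accept ranges typed backwards, such as '5-2'.
            start, end = end, start
        # Ignore anything outside the file's actual page count.
        pages.update(p for p in range(start, end + 1) if 1 <= p <= last_page)
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-3, 7", last_page=10))
```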

But perhaps the most important feature is that each page has its own unique, persistent url. Basic stuff, but oh, so important. With a good url you can share and cite. Find something exciting? Tell the world about it! I’ve included your typical social media share buttons to help you along.

One disadvantage over the original userscript is that the viewer isn’t directly linked to RecordSearch. You probably don’t want to have to cut and paste the barcode every time you view a file. So I’ve also created a couple of connectors that ummm… connect things up.

The first connector is just a bookmarklet. A bookmarklet is just a little piece of javascript code disguised as a browser bookmark. Just drag this link — Archives Viewer — to your browser’s bookmark toolbar. Then when you’re on the item page of a digitised file in RecordSearch, just click the bookmarklet and you’ll be instantly transported to the wall.

The second connector is a bit smarter. It’s an enhanced version of another userscript I wrote to display the number of pages in a digitised file. It still does that, but now it also rewrites the links to the digitised files so that they automatically open in the Archives Viewer. It’s a bit harder to install. You need Chrome or Firefox and the add-ons Greasemonkey (for Firefox) or Tampermonkey (for Chrome). Then just go to the userscript page and hit the big ‘Install’ button.

You might be wondering about Zotero (at least I hope you are). My Zotero-RecordSearch translator lets you capture page images and metadata direct to your own research database, so what happens when you’re transported across to the Archives Viewer? Never fear, I’ve written a new translator that lets you save pages as you could in RecordSearch. Even better, you get a persistent, context-enriched url, and the ability to capture multiple pages at once. Yippee!

But that’s not quite all. Buried within the pages is some lovely Linked Open Data. To be truthful, it’s not really very ‘linked’ yet, but it does expose the basic metadata in a machine-readable form, borrowing from the vocabularies of projects like Locah and the Archival Ontology. It’s an experiment, as is the Archives Viewer itself. We can learn by doing.

I’ve given quite a few talks over recent times encouraging people to take up their tools and start hacking away at the digital collections of our cultural institutions. Yes, I admit it, I’m an impatient historian (and a grumpy one at that). But it’s also because I think it’s important that we recognise that access is never just something you’re given. It’s something that we make through our stories, our projects, and our tools. It’s something that’s grounded in respect and powered by love.

Posted on August 29, 2012
Jun 29 2012
 

On 15 April 1944 the Sydney Morning Herald turned inside out. For more than a hundred years, the front page had been dominated by advertisements, but this changed suddenly in 1944 as the newspaper took on a completely new look. In place of the ads were the day’s top stories, headlines and photographs — a ‘front page’ design familiar to modern readers.

The change was, the newspaper explained, partly a response to the demands of war. Advertising had been cut due to the rationing of newsprint and ‘an urgent public demand in these critical days for more papers and more news’. But they were also looking forward to the problems of peace:

It is essential… that we should not only provide the space, but also adopt the manner and methods of presentation which will spread knowledge of these problems yet more widely, and bring them home yet more deeply, among the people of this country.

But the Sydney Morning Herald wasn’t breaking new ground. The design of front pages had been changing across the first half of the twentieth century as advertisements gradually gave way to news. This graph shows the average number of words per issue on the front pages of Australian newspapers devoted to advertising.

You can see a clear decline from about the turn of the century. News articles, on the other hand, were on the way up.

Not all the changes were as sudden as the Sydney Morning Herald’s. The Barrier Miner entered the First World War with the ads on top, but by war’s end the position was reversed. In between was a period of transition as you can see from this graph which plots advertising against news.

If you dig a bit deeper, you find that the amount of advertising follows a regular pattern.

These peaks and troughs in June 1916 are a week apart — Saturday’s front page was all advertising, but the next day brought a ‘Special Sunday Issue’ focused on the ‘Latest War News’.

It’s clear just from these two examples that there are stories behind these changes. There are subtleties and contingencies to be explored along with dramatic shifts.

And now you can explore them…

The Front Page

The Front Page is a database containing details of more than 4 million front page newspaper articles harvested from the National Library of Australia’s Trove service.

Trove divides articles into a series of categories:

  • articles (news)
  • advertising
  • detailed lists, results, guides
  • family notices
  • literature

I’ve simply gone through and added up the numbers of articles and the numbers of words in each category for each issue, and aggregated this across months, years and the full run of each newspaper.
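
The adding up can be pictured with a short sketch like the one below. It isn't the code from the project: the field names (`newspaper`, `date`, `category`, `words`) and the `add_up` function are assumptions, and dates are assumed to be ISO strings such as `1944-04-15` so that slicing the string gives the issue, month or year.

```python
from collections import defaultdict

# How much of an ISO date identifies each level of aggregation.
DATE_SLICE = {"issue": 10, "month": 7, "year": 4}


def add_up(articles, level="issue"):
    """Count articles and words per newspaper, period and category."""
    totals = defaultdict(lambda: {"articles": 0, "words": 0})
    for art in articles:
        period = art["date"][:DATE_SLICE[level]]
        key = (art["newspaper"], period, art["category"])
        totals[key]["articles"] += 1
        totals[key]["words"] += art["words"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample records, purely for illustration.
    sample = [
        {"newspaper": "SMH", "date": "1944-04-15", "category": "news", "words": 300},
        {"newspaper": "SMH", "date": "1944-04-15", "category": "news", "words": 200},
        {"newspaper": "SMH", "date": "1944-04-16", "category": "advertising", "words": 50},
    ]
    print(add_up(sample, level="month"))
```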

These totals are presented as a series of linked tables and graphs. Just click on a point to zoom in, or use the navigation controls to go directly to the issue of your choice. It’s pretty straightforward.

Why?

We’re lucky to have rich resources like Trove, but if we’re going to make best use of them we have to move beyond the search box to find new ways of exploring and contextualising their content. That’s why I’ve developed tools like QueryPic, Headline Roulette and even The future of the past. Each lets you engage with the newspaper database in a different way.

But not all newspaper articles are created equal. I’d like to be able to aggregate and analyse the ‘top’ stories for each day, but to do this I need to know more about the structure of the newspapers themselves. I’ve already made a few attempts to find and extract editorials. This is useful because before the main news moved to the front page it was often directly after the editorials. But when did the news shift to the front page?

Now I can find out.

But why create a public web resource? Well, it’s just what I do. I build and I share. It’s what motivates me. It’s how I understand things. It’s where I find both my questions and my answers. Hey, I’m a digital humanist ok?

How?

Everything’s up on GitHub, so you can follow along with my ugly coding. It was all a bit of an experiment, because I simply didn’t know whether I could harvest and use 4 million articles. How long would it take? Would MySQL grind to a halt? Would my laptop blow up?

In my Harold White lecture I wondered whether what I was trying to do was really beyond the reach of ‘an ordinary bloke and his laptop’. I suspect the day is rapidly coming where my work will be superseded by well-funded academic projects with access to supercomputers and a pool of bright young graduate students. But for now I’ll just keep pushing the boundaries of what’s possible over a dodgy home broadband connection.

Of course, this project was only possible because of the Trove API. My screen-scrapers of yore would have been impossibly slow and wasteful of bandwidth. With the API I could simply construct a query and then loop through the 4 million articles in batches of a hundred. These were then fed into MySQL via Django. I quickly worked out that I needed to keep my Django models simple. My clever relational model linking newspapers, issues, pages and articles was just too complex for this sort of operation. I flattened everything out to store all the metadata in a single ‘article’ model.
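
The batch loop is simple enough to sketch in a few lines. This is a generic illustration rather than the harvester on GitHub: it doesn't use the Trove API's real parameter names, and `harvest` and `fake_fetch` are hypothetical. The real thing would replace `fake_fetch` with a function that requests one page of results and returns the article records.

```python
def harvest(fetch_batch, batch_size=100):
    """Yield every record by requesting batches until one comes back empty."""
    start = 0
    while True:
        batch = fetch_batch(start, batch_size)
        if not batch:
            break
        yield from batch
        # Advance by what was actually returned, in case a batch is short.
        start += len(batch)


def fake_fetch(start, size, total=250):
    """Stand-in for an API call; pretends there are 250 records."""
    return [{"id": i} for i in range(start, min(start + size, total))]


if __name__ == "__main__":
    records = list(harvest(fake_fetch))
    print(len(records))
```

Writing the loop as a generator means each record can be saved as it arrives, so a dropped connection on day four doesn't lose the first three days' work.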

The harvesting operation took about 5 days. Once I had all the metadata I ran a couple of processes to do all the adding up and saved the results to a separate ‘totals’ table.

Then it was just a matter of building a front end. Using Django, Twitter Bootstrap and HighCharts made this amazingly easy. Really. Really truly.

What now?

I built this because I wanted to track changes in the design of front pages, but now I’m wondering what else I can find. The role of war in the examples above is intriguing. Are there other changes in our relationship to ‘news’ that these graphs might reveal?

I hope other people will wonder about this as well.

I have some ideas for future developments. For example, I’d like to add tagging to make it easy to construct timelines of significant changes. But first I just want to see if anybody’s actually interested. If you have any ideas, suggestions or comments please let me know.

Ok, off you go — explore.

Posted on June 29, 2012
May 22 2012
 

[view on Storify]

This is a story about a thing I made. I’m still not sure what to call it. Or what it’s really for.

But I like it.

And I hope other people will too…

Posted on May 22, 2012
May 17 2012
 

There seems to be a lot of topic modelling going on at the moment. And why not? Projects like Mining the Dispatch are demonstrating the possibilities. Tools like Mallet are making it easy. And generous DHers like Ted Underwood and Scott Weingart are doing a great job explaining what it is and how it works.

I’ve talked briefly about using topic modelling to explore digitised newspapers, something that the Mapping Texts project has also been investigating. But I’ve also been following with interest Chad Black’s use of algorithmic techniques, including topic modelling, to look for local variations amidst the legal system of the early modern Spanish empire.

As part of the Invisible Australians project, Kate and I are exploring the bureaucracy of the White Australia Policy. In particular, we’re interested in the interaction between policy and practice, between the highly-centralised bureaucracy and the activities of individual port officials. Like Chad, we’re interested in mapping local variations — to try and understand the bureaucracy from the point of view of an individual forced to live within its restrictions.

I recently gave a presentation about the project at Digital Humanities Australasia (post coming soon!), and in preparation I decided to try a few topic modelling experiments. They were very simple, but I was impressed by the possibilities for exploring archival systems.

The problem I started with was this. The workings of the White Australia Policy are well documented by records held by the National Archives of Australia. Some series within the archives are specifically related to the operations of the policy, such as those containing many thousands of CEDTs. But there are also general correspondence series created by the customs offices in each state, as well as the Commonwealth Department of External Affairs which administered the Immigration Restriction Act (responsibility was later taken by the Department of Home and Territories and its successors). These general correspondence series are important, because they often include details of difficult or controversial cases: those that required a policy judgment, or prompted a change in existing practices. But how do you find relevant files within series that can contain large numbers of items?

Series A1, for example, is a correspondence series created by the Department of External Affairs. It contains more than 60,000 items. Past research tells us that amongst these 60,000 files are records of important policy discussions relating to White Australia. But these files tend to be labelled with the names of the people involved, so unless you know the names in advance they can be difficult to find.

Mitchell Whitelaw’s A1 Explorer, part of the Visible Archive project, lets you explore the contents of Series A1 in an easy and engaging way. But while the A1 Explorer provides new opportunities for discovery, it doesn’t offer the fine-grained analysis we need to sift out the files we’re after. And so… topic modelling.

The process was pretty simple. While I can dip into my bag of screen-scrapers to harvest series directly from the NAA’s RecordSearch database, there was already an XML dump of A1 available from data.gov.au. So I extracted the basic file metadata from the XML and wrote the identifiers and titles out to a text file, one item per line. Following the instructions on the website I then loaded this file into Mallet:

/Applications/Mallet/bin/mallet import-file --input ./A1.txt --output A1.mallet --keep-sequence --remove-stopwords
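
The text file loaded by the command above could be produced with something like the following sketch. The structure of the data.gov.au dump isn't described here, so the element names `item`, `identifier` and `title` are placeholders, as is the `items_to_lines` function; adjust them to whatever the XML actually uses.

```python
import xml.etree.ElementTree as ET


def items_to_lines(xml_text):
    """Return one 'identifier title' line per item in the XML."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter("item"):
        identifier = (item.findtext("identifier") or "").strip()
        # Collapse line breaks and runs of spaces so each item stays on one line.
        title = " ".join((item.findtext("title") or "").split())
        if identifier and title:
            lines.append(f"{identifier} {title}")
    return lines


if __name__ == "__main__":
    # Invented sample, not taken from Series A1.
    sample = (
        "<series><item><identifier>123</identifier>"
        "<title>Example   file\n title</title></item></series>"
    )
    print("\n".join(items_to_lines(sample)))
```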

Then it was just a matter of firing up the topic modeller:

/Applications/Mallet/bin/mallet train-topics --input ./A1.mallet --output-state ./A1.gz --output-doc-topics ./A1-topics.txt --output-topic-keys ./A1-keys.txt --num-topics 40

Again, I just followed the examples on the Mallet site.

Once it was finished I opened up A1-keys.txt to browse the ‘topics’ Mallet had found. The results were intriguing. There are a large number of applications for naturalisation in A1, so it’s no surprise that ‘naturalisation’ figures prominently in a number of the topics. What was more interesting was the way Mallet had grouped the naturalisation files. For example:

naturalization christian hans hansen jensen petersen andersen nielsen larsen christensen johannes jens niels pedersen andreas johansen martin jorgensen

and

naturalisation certificate giuseppe salvatore frank la leo samios spina sorbello leonardo fisher natale patane torrisi barbagallo luka rossi ross

Based on the co-occurrence of names within the file titles, Mallet had created groupings that roughly reflected the ethnic origins of applicants. It makes sense when you think about what Mallet is doing, but I still found it pretty amazing.

Mallet also found clusters around the major activities of the department, such as the administration of the territories. But of most interest to us was:

1 0.55539 passport ah student exemption students lee wong chinese young deserter education sing wing chong readmission son hing chin wife

The Chinese names alongside words such as ‘readmission’ and ‘wife’ suggested that this topic revolved around the administration of the White Australia Policy. This was easy to test. In A1-topics.txt was a list of every file in the series and their weightings in relation to each of the topics. I wasn’t sure what was a reasonable cut-off value to use in assessing the weightings, but after a bit of trial and error I fixed on a value of 0.7. I then just extracted the identifiers of every file that had a weighting greater than 0.7 for this topic. I used the identifiers to build a simple web page that Kate and I could browse. I also included links back to RecordSearch so we could explore further.
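
The filtering step might look something like this sketch. It assumes the layout written by the Mallet version used at the time, where each line of the doc-topics file holds a document number, the document's name, and then alternating topic numbers and weightings. Later Mallet releases write one weighting per topic in order instead, so check your own file first. The `files_for_topic` function is hypothetical.

```python
def files_for_topic(lines, topic, threshold=0.7):
    """Return names of documents whose weighting for `topic` exceeds `threshold`."""
    matches = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split()
        name, rest = fields[1], fields[2:]
        # rest alternates: topic number, weighting, topic number, weighting...
        for topic_id, weight in zip(rest[0::2], rest[1::2]):
            if int(topic_id) == topic and float(weight) > threshold:
                matches.append(name)
                break
    return matches


if __name__ == "__main__":
    # Invented sample lines in the assumed layout.
    sample = [
        "#doc name topic proportion ...",
        "0 A1-1 1 0.82 3 0.10",
        "1 A1-2 3 0.75 1 0.20",
    ]
    print(files_for_topic(sample, topic=1))
```

Reading the real file is then just a matter of passing `open("A1-topics.txt")` as `lines`.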

Browse the full list

It’s a pretty impressive result. Instead of fumbling with the uncertainties of keyword searches, we now have a list of more than 1,300 files that are clearly of relevance to Invisible Australians. There are a few false positives and there are likely to be other files that we’ll have missed altogether, but now we have a much clearer picture of the types of files that are included and how they are described.

And that was at my first attempt, simply using the default settings. I’m now starting to play around with some of Mallet’s configuration options to see what sort of difference they make. I’m also keen to try out GenSim, a topic modelling package for Python.

I’m really excited about the possibilities of these sorts of tools for analysing the contents of archival descriptive systems, something I mentioned in my Digital Humanities Australasia paper. Much more to come on this I suspect…

Posted on May 17, 2012
Apr 18 2012
 

It seems a bit late to be introducing the newest version of QueryPic. Folks are already using it to explore the contents of digitised newspapers made available through Trove and Papers Past. Some, like the National Library of New Zealand, Andrew S. Bowman and the Carnamah Historical Society are already blogging about it. But I suppose I’d better document a few things…

As I noted in my post about QueryPicNZ (yes I now have a rather confusing proliferation of QueryPics), I was waiting for the Trove API to become public. Last week I noticed a little ‘API’ link pop up in the Trove footer and so I set to work…

"The past" versus "the future" in the new QueryPic

My original version of QueryPic (recently reviewed in the Journal of Digital Humanities) used a series of Python scripts to harvest and scrape content from the Trove web pages. This meant that you had to download the scripts and be code-confident enough to run them in a terminal. It’s still a useful tool and I’ll be updating it as well, but I wanted to create something quicker and simpler that encouraged people to explore and play.

The latest version of QueryPic (QueryPic+, QueryPic Web, QueryPic 2.0?) simply runs in your browser. It uses JQuery to grab data on the fly from the Trove and DigitalNZ APIs. Like previous versions, it uses the HighCharts library to turn the data into pretty graphs.

What does it do? It’s really pretty basic. QueryPic just displays the number of articles matching your search query over time. By default, these are displayed as a proportion of the total articles available for that year, but a dropdown field lets you switch to view the raw numbers. It’s simple, but it’s also remarkably evocative, suggestive and fun. Just try it!
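
The proportional view is just a division per year. QueryPic itself does this in the browser with Javascript; the Python sketch below only illustrates the arithmetic, and the `proportions` function and the sample figures are invented.

```python
def proportions(matches_by_year, totals_by_year):
    """Return {year: share of that year's articles matching the query}."""
    result = {}
    for year, matches in sorted(matches_by_year.items()):
        total = totals_by_year.get(year, 0)
        if total:
            # Years with no digitised articles are skipped to avoid dividing by zero.
            result[year] = matches / total
    return result


if __name__ == "__main__":
    # Invented figures, purely for illustration.
    matches = {1914: 50, 1915: 30, 1916: 5}
    totals = {1914: 1000, 1915: 600, 1916: 0}
    print(proportions(matches, totals))
```

Dividing by the yearly total matters because the number of digitised articles varies so much from year to year; a rise in raw matches may only mean that more newspapers were scanned for that period.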

Why stop at just one query? To compare frequency patterns you can add as many as you like. Just keep entering new words or phrases.

If you notice an interesting peak or trough you can just click on it and another API request will be fired off to retrieve the first 20 matching articles. So it’s also a new way of exploring the newspaper databases themselves.

There are plenty of limitations — not all newspapers are digitised, for example, and the quality of the OCR is patchy. The National Library of New Zealand’s post does a great job summing up a number of issues relating to Papers Past. It’s not magic, it’s not perfect, but is it useful? I think so.

Tasks for the future:

  • Create some sort of backend that makes it easy to save, share and cite your query data. The ‘share’ link just regenerates the graph which, of course, might change as new articles are added to the databases.
  • Make it possible to add more complex queries — I want to keep the interface simple, so I’ll probably create a bookmarklet to take any Trove or Papers Past query and display it using QueryPic.
  • As I mentioned over at the WraggeLabs Emporium, I intend to rewrite my various Trove tools to work with the new API. This will include the classic Python version of QueryPic. I still think it’s useful for harvesting your own data.

The code is on my GitHub site and you can also follow updates at the QueryPic page in the WraggeLabs Emporium.

 

 Posted by on April 18, 2012
Apr 01 2012
 

You may have noticed I have a bit of an interest in exploring ways of using digitised historical newspapers. In the last year or so I’ve spent a lot of time scraping, mining, processing and visualising content from the Trove collection of digitised Australian newspapers. But what about other countries?

Recently I was invited to a digital history workshop organised by Sydney Shep (@nzsydney) at the Victoria University of Wellington. In between sessions I started to play with the DigitalNZ API guided by Chris McDowall (@fogonwater). In anticipation of the forthcoming Trove API I’d already done a bit of work converting QueryPic to run in the browser. It didn’t take long to adapt this to work with New Zealand newspapers available through Papers Past.

So presenting for your enjoyment and education… QueryPicNZ.

Wind, rain and snow in QueryPicNZ

Like QueryPic, the New Zealand version graphs newspaper search results over time. But thanks to the DigitalNZ API it has a number of advantages:

  • it runs in your browser — no need to download or run any scripts
  • results appear almost instantly
  • easy to combine queries — just search on a new word or phrase
  • easy to remove queries — just use the ‘Clear last’ button
  • easy to share — just copy the provided link or use the Tweet button

It’s limited to simple word or phrase searches at the moment, but eventually I’ll add the ability to process more sophisticated queries. I also want to add a way of saving, sharing and citing graphs. For now the ‘share’ link simply regenerates the graph, so if the content has changed the result could well be different.

The code is available on GitHub.

Ultimately, I want to combine Trove and Papers Past so that you can query and combine content from either Australia or New Zealand… perhaps even other countries?

 Posted by on April 1, 2012
Feb 20 2012
 

By my own criteria I’ve already failed… I started this series of posts with the intention of documenting the process of finding and extracting editorials as I was actually doing the work. But here I am about to describe some work I finished a few weeks back. Oh well…

In my previous instalments (here and here), I focused on the Sydney Morning Herald. Having continued the hunt for missing editorials I started in the last post, I’ve now got a CSV file with the urls of the first editorial published in every edition of the SMH from 1913. Good-o, I thought, I can now start harvesting and analysing some content.

But then ensued a crisis of faith. The whole point of this exercise was to be able to build up some comparisons: between newspapers, between states, between the city and the bush. But the process of actually finding the editorials seemed beset with difficulties. Could the rules I developed for the SMH be applied elsewhere? Could I ever assemble a useful set of editorials without large amounts of human intervention? I decided to try a few quick experiments to see whether the whole project was worth pursuing.

I started with a few assumptions:

  1. The first (and only the first) editorial in any issue is headed with the name of the newspaper.
  2. Editorials are published on even numbered pages.
  3. Editorials vary in length between about 100 and 1500 words.

These assumptions were based on my own experience as a long-time newspaper researcher and on some preliminary poking around. For example, when I looked at The Argus I noticed that editorials were typically followed by news summaries. Unfortunately, these are treated as a single article in Trove, resulting in large blocks of text that are only part editorial. By specifying an upper word limit I hoped to filter these sorts of articles out. Similarly, there are sometimes brief announcements or publication details headed with the name of the newspaper. The lower word limit was intended to exclude these.

The next step was to harvest every article from 1913 that was headed with the name of its publication. I created a script to generate a list of all the newspapers that published issues in 1913. Then I called my existing harvester to download all the matching articles and save the details to a series of CSV files — one CSV file per newspaper.

In the previous instalment of this series I created a script to check the CSV output of my harvester for missing or duplicate dates. I extended this to perform a series of tests on each article based on the assumptions above. First, I filtered out articles on odd-numbered pages, then articles that were too short or too long. Finally I checked the remainder for missing or duplicate issue dates.
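The sequence of tests can be sketched roughly like this. It’s an illustration rather than the script I actually ran: the function names, the category labels and the `page` and `words` keys are all assumptions, and the word limits are simply the ones listed above.

```python
MIN_WORDS = 100
MAX_WORDS = 1500


def classify_article(page, word_count):
    """Return the first test an article fails, or 'candidate' if it passes."""
    if page % 2 != 0:
        return 'odd_page'
    if word_count < MIN_WORDS:
        return 'too_short'
    if word_count > MAX_WORDS:
        return 'too_long'
    return 'candidate'


def sort_articles(articles):
    """Group articles (dicts with 'page' and 'words' keys) by category."""
    categories = {}
    for article in articles:
        category = classify_article(article['page'], article['words'])
        categories.setdefault(category, []).append(article)
    return categories
```

Only the articles that end up in the ‘candidate’ group then need to be checked for missing or duplicate issue dates.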

The details of the articles in each category were written out to JSON files. Using these files and a bit of jQuery magic I could quickly build a simple web interface that allowed me to explore the results.

Summary details of each newspaper

You can browse the summary results for the full list of newspapers, or you can drill down to view the actual articles assigned to each category.

Full details

I’ll save the full analysis for the next post, but if you play around with the results you quickly notice a few things. First, letters to the editor often include the name of the newspaper! If you look at The Mercury, for example, you’ll notice I’ve identified 1057 potential editorials — most of which are letters. Fortunately they should be fairly easy to filter out. In most cases the ‘even numbers only’ assumption worked pretty well, and the word length filters did remove quite a lot of false positives. There are still plenty of problems, but I’m encouraged enough to continue. Yes, there will be a Part #4!

 

 Posted by on February 20, 2012
Dec 20 2011
 

As I explained in the first of this series, I’m documenting my efforts to extract every editorial published in the Sydney Morning Herald in 1913 from the Trove newspaper database. It’s an experiment both in text mining and historical writing — an attempt to put the method up front.

While I didn’t think there was anything very thrilling in the first instalment, recording my thoughts and assumptions in this way has already proved useful. In a comment, Owen Stephens noted that his attempt to reproduce my search query produced fewer results. After a little bit of poking around I realised that the fulltext modifier, which I often use to switch off fuzzy matching, counteracts the ‘search headings only’ flag. So my query was returning results that had the string ‘The Sydney Morning Herald’ anywhere in the article.

Try it for yourself.

Here’s my original query, searching for fulltext:"The Sydney Morning Herald" in headings only (supposedly). You’ll notice that it returns 335 results and it’s clear from a quick scan that a number are false positives (they don’t follow the pattern for editorials).

Here’s Owen’s query, searching for "The Sydney Morning Herald" in headings only. It returns 294 results, without any obvious false positives.

So my attempt to disable fuzzy matching actually produced a less accurate result! Weird.

Actually, I think one important benefit of this sort of text mining is that it helps you understand how the search engines you’re using actually work. Once you start poking and prodding, the idiosyncrasies start to emerge.

Anyway, I harvested Owen’s cleaner result set and opened up the resulting csv file. As it seemed in Trove, there were very few false positives. Indeed there were only two articles that didn’t seem to follow the standard editorial format, and these were notes added to the editorial page. On the other hand, there were obviously about 20 editorials missing. I could have manually worked through the csv file to identify the missing dates, but I thought I’d try to create some tools that would do the work for me.

What I wanted was the details of the first editorial in every edition of the newspaper in 1913 — so there should be one, and only one, article for each day on which the newspaper was published. I needed a tool that would analyse the csv file and do two things:

  • identify dates that occur multiple times (false positive alert!)
  • identify dates that are absent from the result set (missing in action!)

The resulting code is all on GitHub if you want to follow along. I wrote a Python script that opens up the csv file, extracts all the date strings, converts them to datetime objects and then saves them to a list. Once that’s done it’s pretty easy to loop through and find duplicates:

def find_duplicates(items):
    '''
    Check a list for duplicate values.
    Returns a list of the duplicates.
    '''
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            # Already encountered, so this one is a repeat.
            duplicates.append(item)
        seen.add(item)
    return duplicates
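The step before this, turning the date strings in the csv file into date objects, might look something like the sketch below. The column name `date` and the `YYYY-MM-DD` format are assumptions for illustration, so check them against your own harvester output. It’s written for Python 3.

```python
import csv
import datetime
import io


def read_article_dates(csv_file, column='date', date_format='%Y-%m-%d'):
    """Read one column of date strings and return a list of date objects."""
    reader = csv.DictReader(csv_file)
    return [
        datetime.datetime.strptime(row[column], date_format).date()
        for row in reader
    ]


# A tiny in-memory stand-in for the harvester's csv output.
sample = io.StringIO('date,title\n1913-01-01,A\n1913-01-02,B\n1913-01-02,C\n')
article_dates = read_article_dates(sample)
```

The resulting list can be passed straight to find_duplicates(), which in this made-up sample would flag 2 January 1913.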

Finding missing dates was a little more complicated, but Google came to the rescue with some handy code samples. All I had to do was set a start and end date (in this case 1 January 1913 and 31 December 1913) and create a timedelta object equal to a day. Then it’s just a matter of adding the timedelta to the start date, comparing the new date to the dates extracted from the csv file, and continuing on until you hit the end. If the new date isn’t in the csv file, then it gets added to the missing list.

if year:
    start_date = datetime.date(year, 1, 1)
    end_date = datetime.date(year, 12, 31)
else:
    start_date = article_dates[0]
    end_date = article_dates[-1]
one_day = datetime.timedelta(days=1)
this_day = start_date
# Loop through each day in specified period to see if there's an article
# If not, add to the missing_dates list.
while this_day <= end_date:
    if this_day.weekday() not in exclude:  # exclude Sunday
        if this_day not in article_dates:
            missing_dates.append(this_day)
    this_day += one_day

I’ve tried to make the code as reusable as possible, so you can either supply a year, or the script will read start and end dates from the csv file itself.

All that left me with two more lists of dates: ‘duplicates’ and ‘missing’. At first I just wrote these out to a text file, but then I decided it would be useful to write the results to an html page. That way I could add links that would take me to the actual issue within Trove, helping me to quickly find the missing editorial.
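Writing the lists out as html needs very little code. Here’s a minimal Python 3 sketch, not the script itself: `dates_to_html` and its `get_url` argument are hypothetical names, with `get_url` standing in for any function that returns a link for a date (or None if there isn’t one).

```python
import datetime
from html import escape


def dates_to_html(heading, dates, get_url=None):
    """Build an html fragment listing dates, linked where a url is available."""
    items = []
    for date in dates:
        label = date.isoformat()
        url = get_url(date) if get_url else None
        if url:
            items.append('<li><a href="%s">%s</a></li>' % (escape(url), label))
        else:
            items.append('<li>%s</li>' % label)
    return '<h2>%s</h2>\n<ul>\n%s\n</ul>' % (escape(heading), '\n'.join(items))


# Example: an unlinked list containing one missing date.
fragment = dates_to_html('Missing', [datetime.date(1913, 1, 5)])
```

Calling it once for the duplicates and once for the missing dates, then joining the two fragments, gives a simple page to work through.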

Unfortunately there’s no direct way to go from a date to an issue — you first need to find the issue identifier. How do you do this? If you dig around in the code beneath the page for each newspaper title, you’ll find that the ajax interface pulls in a json file with issue information. You can access this through a url like: http://trove.nla.gov.au/ndp/del/titlesOverDates/[year]/[month]. Here’s an example for January 1913.

The json includes all issues for all titles in the specified month. So you then have to loop through to find a specific title and day. Once you have the issue identifier you can just attach it to a url:

import json
import urllib2  # Python 2

def get_issue_url(date, title_id):
    '''
    Gets the issue url given a title and date.
    Returns None if no matching issue is found.
    '''
    year, month, day = date.timetuple()[:3]
    url = 'http://trove.nla.gov.au/ndp/del/titlesOverDates/%s/%02d' % (year, month)
    issues = json.load(urllib2.urlopen(url))
    # The json lists every issue of every title for the month,
    # so look for the one matching this title and day.
    for issue in issues:
        if issue['t'] == title_id and int(issue['p']) == day:
            return 'http://trove.nla.gov.au/ndp/del/issue/%s' % issue['iss']
    return None

My results file with links to Trove

Finally, to save myself having to cut and paste the missing dates back into the csv file, I added a few lines to write them in automatically.

So now I have a handy little html page, complete with dates and links, that I’m working through to find all the missing editorials. All I need for the next stage are the urls for the editorial and the page on which it’s published. I’m just cutting and pasting these from the citation box in Trove into the csv file. Once this is done I can start trying to find all the editorials.

PS: I noted in my first post that one benefit in finding the editorials was that the main news articles usually appeared on the page after the editorials. I’ve been thinking some more about ways to identify ‘major’ news stories. Word length perhaps? But not always. Hmmm, but major stories do seem to be published at the top of the page. After a bit more poking around in the code I found that there’s a ‘y value’ assigned to each article that indicates its position on the page. So if I harvest all the articles on the page after the editorials and then rank them by their y values? Interesting…
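As a rough sketch of that ranking idea: the `y` key is my own label for the position value, and I’m assuming that a smaller value means the article sits nearer the top of the page, which I haven’t confirmed.

```python
def rank_by_position(articles):
    """Sort articles so those assumed to be nearest the top come first."""
    # Assumption: a smaller 'y' value means higher up the page.
    return sorted(articles, key=lambda article: article['y'])


# Made-up articles, purely for illustration.
page_articles = [{'id': 'a', 'y': 900}, {'id': 'b', 'y': 120}]
ranked = rank_by_position(page_articles)
```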

 Posted by on December 20, 2011