Research questions, abstract problems – a round table on Citizen Science

Feb 27, 2017

I recently participated in a round-table discussion entitled “Impossible Partnerships”, organized by The Cultural Capital Exchange at the Royal Institution, on the theme of Citizen Science; the Impossible Partnerships of the title being those between the academy and the wider public. It is always interesting to attend citizen science events – I get so caught up in the humanities crowdsourcing world (such as it is) that it’s good to revisit the intellectual field that it came from in the first place. This is one of those blog posts whose main aim is to organize my own notes and straighten my own thinking after the event, so don’t read on if you are expecting deep or profound insights.


Crucible of knowledge: the Royal Institution’s famous lecture theatre

Galaxy Zoo of course featured heavily. This remains one of the poster-child citizen science projects, because it gets the basics right. It looks good, it works, it reaches out to build relationships with new communities (including the humanities), and it is particularly good at taking what works and configuring it to function in those new communities. We figured that one of the common factors that keeps it working across different areas is its success in tapping in to intrinsic motivations of people who are interested in the content – citizen scientists are interested in science. There is also an element of altruism involved, giving one’s time and effort for the greater good – but one point I think we agreed on is that it is far, far easier to classify the kinds of task involved, rather than the people undertaking them. This was our rationale in that 2012 scoping study of humanities crowdsourcing.

A key distinction was made between projects which aggregate or process data, and those which generate new data. Galaxy Zoo is mainly about taking empirical content and aggregating it, in contrast, say, to a project that seeks to gather public observations of butterfly or bird populations. This could be a really interesting distinction for humanities crowdsourcing too, but one which becomes problematic where one type of question leads to the other. What if content is processed/digitized through transcription (for example), and this seeds ideas which lead to amateur scholars generating blog posts, articles, discussions, ideas, books etc.? Does this sort of thing happen in citizen science? (Genuine question – maybe it does.) So this is one of those key distinctions between citizen science and citizen humanities. The raw material of the former is often natural phenomena – bird populations, raw imagery of galaxies, protein sequences – but in the latter it can be digital material that “citizen humanists” have created from whatever source.

Another key question which came up several times during the afternoon was the nature of science itself, and how citizen science relates to it. A professional scientist will begin an experiment with several possible hypotheses, then test them against the data. Citizen scientists do not necessarily organize their thinking in this way. This raises the question: can the frameworks and research questions of a project be co-produced with public audiences? Or do they have to be determined by a central team of professionals, and farmed out to wider audiences? This is certainly the implication of Jeff Howe’s original framing of crowdsourcing:

“All these companies grew up in the Internet age and were designed to take advantage of the networked world. … [I]t doesn’t matter where the laborers are – they might be down the block, they might be in Indonesia – as long as they are connected to the network.

Technological advances in everything from product design software to digital video cameras are breaking down the cost barriers that once separated amateurs from professionals. … The labor isn’t always free, but it costs a lot less than paying traditional employees. It’s not outsourcing; it’s crowdsourcing.”

So is it the case that citizen science is about abstract research problems – “are golden finches as common in area X now as they were five years ago?” rather than concrete research questions – “why has the population of golden finches declined over the last five years?”

For me, the main takeaway was our recognition that citizen science and “conventional” science are not, and should not try to be, the same thing, and should not have the same goals. The important thing in citizen science is not to focus on the “conventional” scientific outcomes of good, methodologically sound and peer-reviewable research – that is, at most, an incidental benefit – but on the relationships between professional academic scientists and non-scientists it creates; and how these can help build a more scientifically literate population. The same should go for the citizen humanities. We can all count bird populations, we can all classify galaxies, we can all transcribe handwritten text, but the most profitable goal for citizen science/humanities is a more collaborative social understanding of why doing so matters.

Feb 25, 2017

Keynote presentation at the Deakin University Faculty of Arts and Education HDR Summer School, Geelong, 24 February 2017.

I’m a historian. But in the past decade the nature of my research has changed quite profoundly. Instead of heading off to the archives, taking lots of notes, and writing up a book or an article, I now make things. Generally these things are online, and open to the public. I make things for people to use, to explore, to play, and to ponder.

I started down this track before I realised there was a name for what I do – practice-led research. The things that I make even have their own acronym – they’re NTROs, or Non Traditional Research Outputs.

But practice-led research is not just about making things. New knowledge is generated through cycles of creation and reflection. My aim in making is not to follow a blueprint, or check off a list of requirements, but to end up asking ‘What is this thing?’, ‘What does it do?’, ‘How does it do it?’.

In the past, I’ve tended to talk about my research practice as playing with data. I think there’s an important argument to be made for the role of play in research, particularly when confronted with large cultural datasets. But ‘play’ doesn’t quite capture what I do, nor does it look very convincing in a research proposal. So what do I really do?

Let’s play a game.

Headline Roulette is a very simple game. Presented with the title of a digitised newspaper article drawn at random from Trove’s collection of more than 200 million, you are challenged to guess the year in which the article was published. Sounds easy, but you only get ten guesses. It’s sort of like a cross between hangman and The Price is Right.
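For the curious, the rules of a game like this fit in a few lines of Python. This is only a rough sketch, not the code behind Headline Roulette: the function names are made up for illustration, and the 'earlier'/'later' hint is an assumption about how such a game might respond to a wrong guess.

```python
MAX_GUESSES = 10  # the only rule stated above: you get ten guesses


def check_guess(guess, actual_year):
    """Return 'correct', 'earlier' or 'later' for a single guess.

    The 'earlier'/'later' hint is an assumption made for this sketch.
    """
    if guess == actual_year:
        return "correct"
    return "earlier" if actual_year < guess else "later"


def play(actual_year, guesses):
    """Play one round with a prepared list of guesses.

    Returns (won, guesses_used); guesses beyond the tenth are ignored.
    """
    allowed = guesses[:MAX_GUESSES]
    for used, guess in enumerate(allowed, start=1):
        if check_guess(guess, actual_year) == "correct":
            return True, used
    return False, len(allowed)
```

Calling play(1915, [1950, 1900, 1915]) would report a win on the third guess.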

Despite its simplicity, I’ve known it to unleash the competitive instincts of a workshop full of historians. But for me, Headline Roulette is important because it provides an example of what becomes possible once we make cultural heritage collections available online. Our interactions are no longer limited to conventional modes of viewing or reading – we can play, and we can build.

I made the first version of Headline Roulette back in 2010. It was a game, but it was also an argument about access and possibilities.

Perhaps we should first take a step back. Who’s used Trove?

Trove is a fundamental part of Australia’s research infrastructure – and not just for those of us in the humanities or social sciences. Trove is a lot more than digitised newspapers, but access to more than 150 years’ worth of digitised newspapers has profoundly changed historical practice.

I say this not just because you have been spared the pain and suffering wrought by microfilm readers upon a generation of historians, but because the meaning of access itself has changed. Headline Roulette is just one simple and silly example of how once cultural heritage resources are in digital form we can use them differently. We can see them differently.

Imagine your search in Trove’s newspapers zone returns 10,000 or 100,000 results. How do you make sense of that? How do you get an understanding of the whole, when all you see is page after page of search results?

QueryPic extracts data from Trove to visualise your search as a single chart – showing you the number of articles per year that match your query. You can even compare the occurrence of particular words or phrases.
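The counting behind a chart like this is simple once the search results are in hand. Here is a minimal sketch in Python, assuming the results have already been fetched into a list of records. The record shape (a dict with an ISO-style 'date' string) and the function names are hypothetical, not the real Trove format or QueryPic's own code.

```python
from collections import Counter


def articles_per_year(articles):
    """Count articles by year of publication.

    `articles` is a hypothetical list of dicts, each with a 'date'
    string such as '1914-08-05'; the first four characters are the year.
    """
    counts = Counter(int(article["date"][:4]) for article in articles)
    return dict(sorted(counts.items()))


def compare_queries(results_by_query):
    """Build one per-year series for each query so they can share a chart."""
    return {
        query: articles_per_year(articles)
        for query, articles in results_by_query.items()
    }
```

Each series maps a year to a count, which is all a line chart needs.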

But that’s only the beginning, because once you think about web resources as data rather than just another type of publication you can aggregate and analyse – you can look for big, dramatic pictures as well as tiny, fragile fragments.

Trove Harvester is a tool that delivers historical newspaper articles in bulk – thousands, even millions of articles saved to your computer for offline exploration.

What might you do with a million newspaper articles?

Research using digital resources like Trove is not constrained to the window of your web browser. You can ask new types of questions.

But back in 2010–11 when I created the first versions of Headline Roulette, QueryPic and the Trove Harvester there was no easy way of getting data out of Trove. The thing is, web pages are good for delivering data to human beings, but not so good for computers. Computers are actually pretty dumb, and you need to be quite explicit in packaging up data for them. Nowadays Trove has a thing called an API (an Application Programming Interface) which delivers data in a carefully structured format that even computers can understand. You can use APIs to harvest data, or to build new tools or interfaces. APIs are cool.

Without an API, the first versions of my tools had to turn human-readable web pages into computer-readable data – a process known as screen scraping. They were, therefore, not only useful or interesting applications in their own right, they were arguments about why things like APIs matter. Why web pages aren’t enough. Why researchers need access to data.
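To give a sense of what screen scraping involves, here is a minimal sketch using only Python's standard library. It assumes the page presents its results as a simple HTML table with a header row; real pages are usually messier, and the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn an HTML results table into a list of dicts keyed by header text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

The fragility is the point: the scraper depends entirely on how the page happens to be laid out, and breaks when the layout changes.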

These are arguments we’re still making. Next week I’m heading to a workshop in California where we’ll be discussing how libraries and other cultural institutions can deliver their data in ways that support new forms of research.

But we don’t have to wait. By screen scraping web pages, by reverse engineering online databases, we can continue to develop the argument for access by extracting, sharing, and using data.

What could you do with 70GB of digitised surveillance files from the Australian Security Intelligence Organisation (ASIO)? If you’d like a copy I have them here on a USB drive.

Don’t worry – we’re not about to be raided by the security services. These are all files that have been carefully examined and released to the public through the National Archives of Australia. You can find them by searching the Archives’ online database – RecordSearch.

Who’s used RecordSearch? It’s not the most friendly system, but the collection it documents, and the metadata it provides, is rich and wondrous. I’ve spent a lot of time trying to get useful data out of RecordSearch – not just ASIO files, also records documenting the administration of the White Australia Policy, as well as higher-level data aimed at building my understanding of how the Archives, and its descriptive systems, actually work.

It is painful and frustrating work. But, I would argue, it is research. Terms like ‘data mining’ and ‘text mining’ fly around all the time, making it seem as if the accumulation of data is a mechanical process – as if we’re just digging it up. But the practice of screen scraping, or of liberating data from any cultural heritage source, is not simply extractive – it’s iterative and interpretative. It’s a process through which you begin to understand how the data is organised, what its limits and assumptions are, what its history is. What it means. We’re not just taking things out, we’re putting them back.

Frederick Gibbs and Trevor Owens argue that historical data need not be deployed solely as statistical evidence. ‘It can also help’, they suggest, ‘with discovering and framing research questions’ – questions, not answers; interpretation not calculation. Gibbs and Owens describe an ‘iterative interaction with data as part of the hermeneutic process’.

For me, RecordSearch is like an archaeological site. Excavating data from it involves digging through layers of technology, institutional history, and descriptive practice to try and understand why we have what we have.

Those of you undertaking projects using the collections of the National Archives will almost certainly come across the process of ‘access examination’. Under the Archives Act, government records more than twenty years old are expected to be opened to the public. However, the act also defines a number of exceptions to this rule – for example, records that endanger national security or infringe an individual’s privacy can be completely, or partially, withheld from scrutiny. The process of assessing records against this set of exemptions is called ‘access examination’.

The vast majority of records are opened without problem – they are, after all, more than 20 years old. But a significant number are not. While you can’t use these records, RecordSearch does provide some information about them. So I decided to see what we couldn’t see.

In January 2016 I fired up my screen scraper and harvested details of all the files in RecordSearch that have the access status of ‘closed’ – there were 14,370 of these files that had been through the process of access examination and withheld from public view. I then created my own interface that lets you explore this data from a variety of angles – such as the reasons why files were closed, when decisions were made about them, how old they are, and which government agencies created them.

It is perhaps the most frustrating search interface ever devised, given that you’re not allowed to see any of the files you find.

Those of you currently planning research projects might be interested to know where most of these files come from. It’s not defence or the intelligence agencies, but what is now the Department of Foreign Affairs and Trade (DFAT) – in January 2016, there were 1,747 closed files from just one DFAT series. But if you dig deeper you see that most of these files aren’t withheld for one of the reasons defined by the Archives Act, they are described as ‘closed pending advice’. The National Archives is still waiting to hear back from DFAT about them. Using my interface you can see that there were 54 files in this series where the Archives has been waiting for more than five years. So if you’re embarking on a project using the National Archives, make sure you get your access examination requests in early. Just in case.
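Questions like these reduce to counting and filtering once the file details have been harvested. The sketch below assumes each closed file is a dict with a 'reason' label and a 'decision_date'. Both field names, and the exact wording of the reason labels, are hypothetical and are not RecordSearch's own.

```python
from collections import Counter
from datetime import date


def count_by_reason(files):
    """Tally closed files by the reason recorded for withholding them."""
    return Counter(f["reason"] for f in files)


def years_before(day, years):
    """Return the date `years` years before `day`."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # 29 February, and the target year is not a leap year
        return day.replace(year=day.year - years, day=28)


def waiting_longer_than(files, years, today, reason="closed pending advice"):
    """Return files with the given reason whose decision is older than `years`."""
    cutoff = years_before(today, years)
    return [
        f for f in files
        if f["reason"] == reason and f["decision_date"] < cutoff
    ]
```

Passing a fixed `today`, such as the date of the harvest, keeps the result reproducible.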

My aim in extracting and sharing this data is to better understand access examination itself as a historical process. It’s work that enables us to ask different types of questions, but it also makes a change in the process itself. My interface is public, offering a critical commentary on the ‘official’ system. As a result of my research, the Archives has made changes to the way it describes closed files. It’s both research and intervention, history and hack.

‘Hack’ has a number of definitions, both positive and negative. Mark Olson describes the ‘hacker ethos’ as:

‘a way of feeling your way forward through trial and error, up to and perhaps beyond the limits of your expertise, in order to make something, perhaps even something new. It is provisional, sometimes ludic, and involves a willingness to transgress boundaries, to practice where you don’t belong… Whether eloquent or a kludge, a hack gets things done.’1

Olson explores what hacking means in the context of the humanities, arguing not only that hacking has a legitimate place in humanities practice, but that the humanities itself needs to be hacked to foster the development of new skills and literacies.

At this point you’re probably thinking, ‘But I don’t do any of this wacky digital stuff, what has this got to do with me?’

Who’s heard of filter bubbles, or search personalisation? Who’s read one of the many reports recently about the way computer algorithms are shaping our online experience? Olson argues for a humanities practice that equips us to wrestle with complex techno-social systems.

And we’re not just talking about Google.

Last year Matthew Reidsma published an analysis of algorithmic bias in library discovery systems. He hacked a common commercial library product to show some of the biases underlying its recommendations system. The interfaces we use to access information are never neutral. The databases we search are products of selection and exclusion. Hacking enables us to interact with these systems as critics, and not just consumers.

Using the Trove API you can create a chart showing the number of digitised newspaper articles available per year from 1803 onwards. If you do this, you’ll notice two significant features. First, there is a dramatic drop-off in the number of articles after 1954. This is the ‘copyright cliff of death’. Few things are certain in our overly-complex copyright system, but 1954 provides a practical cut-off point. History stops in 1954.

You’ll also notice a substantial peak in the number of articles around 1914. Why might this be? Did something significant happen in 1914?

In fact, it’s all about money. In the lead up to the centenary of WWI it was decided to focus limited digitisation resources on newspapers from the WWI period. It was a perfectly reasonable decision, but the consequences are effectively invisible to any user of the web interface. You don’t know what you’re searching.

The power of Google encourages us to put a lot of faith in search interfaces. We trust that they will just work. And if we can’t find what we’re looking for, we often assume that it doesn’t exist.

Hansard, the recorded proceedings of the Australian parliament from 1901, can be searched using the ParlInfo database on the Australian Parliament House website. Perhaps you’ve used it – it’s a wonderfully rich resource. Powering the search results are a series of well-structured XML files, one for each sitting day, that identify individual debates and speeches.

Last year I reverse-engineered ParlInfo and harvested all those XML files. I thought they’d provide a great dataset for exploring changes in political speech, and so I created a repository containing all the files for the House of Representatives and the Senate from 1901 to 1980. Feel free to download and play.

But in the process of harvesting the files I noticed that some of the XML files were empty. After a bit more analysis I realised that about 100 sitting days were missing – they didn’t show up in search results on ParlInfo.
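A check like this is easy to automate. Here is a minimal sketch, assuming the harvested files sit in a single directory and that speeches are marked up with an element named 'speech'. That element name is a guess made for illustration, not the confirmed ParlInfo schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_empty_sitting_day(xml_text, speech_tag="speech"):
    """True if the text is blank, cannot be parsed, or has no speech elements.

    `speech_tag` is a hypothetical element name.
    """
    if not xml_text.strip():
        return True
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return True
    return root.find(f".//{speech_tag}") is None


def find_empty_files(directory):
    """List the harvested XML files that look empty, sorted by file name."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.xml")
        if is_empty_sitting_day(path.read_text(encoding="utf-8"))
    )
```

Comparing the resulting list against the known sitting days is what would reveal a gap.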

The ‘missing’ days were concentrated in the Senate between 1910 and 1920. So anyone relying on ParlInfo to research the WWI period would have missed significant amounts of content. This ‘black hole’ was effectively invisible to any user of the web interface. It was only through hacking that its shape and extent was revealed.

Fortunately staff at the Parliamentary Library have investigated and fixed the problem. But it’s a good example of why we should, as researchers, start from the assumption that search interfaces lie. Processes of selection and description shape the ‘reality’ of online collections. We then explore them through complex technological systems that appear comprehensive, even when they are not. You can’t find what’s not there. Online collections hide as much as they reveal.

Of course this is true of all historical sources. We are trained to analyse both context and content, to make judgements about authenticity and accuracy. These same skills need to be applied to digital resources, to data. Indeed, Gibbs and Owens argue that ‘historians must treat data as text, which needs to be approached from multiple points of view and as openly as possible’. But how do we find multiple points of view when interfaces construct our experiences and limit our perspectives? How do we open data to new possibilities? How do we see data differently?

No doubt you’ve been encouraged to find a way of expressing your research questions succinctly, in a way that communicates with a non-specialist audience – yes, I mean the dreaded elevator pitch. You’re not the only one.

I’ve landed back in academia after a number of years working in cultural heritage institutions, and pursuing my own research interests with the support of the international digital humanities community.

Believe me when I say, Twitter changed my life. There I was, hacking away on cultural heritage data without any real assistance or encouragement, when I discovered, via Twitter, that there were people out there like me. Many of these people are now my friends, and I’ve been lucky to travel around the world to meet and work with them.

But coming back to academia I’ve found that my collection of projects, tools, experiments, and obsessions was not quite enough – my research needs a ‘narrative’.

So, like you, I’ve had to think about why I do what I do. What motivates my research? What matters?

For me it comes back to the nature of this thing we call ‘access’. Cultural heritage organisations talk about ‘access’ all the time, particularly in relation to online collections. But what does it actually mean? I want to overturn our assumptions about access – exploring it not as a process of opening things up, but as a system of controls and limits. It’s not a state of being, it’s a struggle for meaning and power.

My methodology, and I think I can call it that, is the multiplication of contexts. Context is, of course, critical to cultural heritage collections – it enables us to locate them within history and culture, to analyse their authenticity, to mobilise their value as evidence. But the descriptive systems we use to manage and explore collections represent only a privileged subset of possible contexts.

Now I’m still figuring this out, but I think what my work does is that it removes collections from these highly-controlled systems and lets them loose in a variety of new contexts. This allows unexpected features, or new uses, to emerge – we see them differently, and in that moment, the nature of access shifts, however slightly. It’s those moments I’m trying to catch and observe.

If you’ve ever tried to use Hansard through the ParlInfo database you’ll realise that it’s just really difficult to read. You’re presented with a series of nested fragments, so it’s hard to get a sense of the context and flow of the day’s proceedings. Having downloaded all those XML files, I thought I’d have a go at presenting Hansard in a form that privileged reading over search.

So I created Historic Hansard – dedicated to lovers of political speech. It does nothing very fancy, but I think it does it pretty well.

In the end, however, Hansard is still just text. What’s lost in the documentation process is the performance – the theatre of parliament. But not completely. As well as formal speeches, many interjections have been recorded and preserved.

A few weeks ago I extracted all those interjections from 1901 to 1980, about a million of them, and saved them to a new database. As I fiddled with different presentation methods, I started to see them as something akin to tweets – quick, pithy, and pointed. What would happen, I wondered, if we reimagined interjections from a century ago in an age of social media.

Like many of my projects, this, whatever it is, took me a couple of days to build. No research grants were harmed in its creation, no committees were needlessly formed. This is not because I’m a whizz-bang coder – I’m certainly not. It has to do with the nature of this work – it’s rapid, experimental, and sometimes even ephemeral. I don’t design websites, I make interventions – things that are not only of the world, but in the world. They do something.

Stephen Ramsay explores the hermeneutical possibilities of screwing around with technology and texts. The ‘screwmeneutical imperative’ he suggests is based on the fact that:

‘a writerly, anarchic text… is more useful than the readerly, institutional text. Useful and practical not in spite of its anarchic nature, but as a natural consequence of the speed and scale that inhere in all anarchic systems.’

Digital technologies give us the opportunity to play with scale and speed. We can manipulate millions of newspaper articles, and we can build a new version of Hansard in a weekend. But this shift also applies to the way we communicate. Instead of waiting months or years for an article to appear in print, we can post it on a blog, or in a digital repository. It is fundamental to the work that I do that it is shared, it is public by default – not just the results, but the code, the data, the process, and yes the licensing. Access is not just what we take, it’s what we do.

The multiplication of contexts has some interesting precedents as a research methodology. In the literary world the Oulipo movement sought to play with the constraints of composition. Lisa Samuels and Jerome McGann suggested that the deliberate misreading of a text, what they termed ‘deformance’, could yield critical insights. More recently, Mark Sample has argued for a ‘deformed humanities’ where we learn about things by breaking them.

In history we have the counterfactual – a creative reimagining of a past that never was, aimed at revealing perspectives and possibilities too quickly closed and forgotten. As Sean Scalmer argues, ‘counterfactuals are fun’:

‘Conventions can be disregarded, or even mocked. Worlds might be remade, the tyrannical overthrown, and the humble elevated. New orders can be imagined.’2

But counterfactuals are not fiction. They work best when they sail close to an accepted version of the past; when they play with the constraints of documentary evidence rather than just ignore them. Just because an approach is playful, it doesn’t mean that there are no rules. As Ian Bogost has recently argued, the fun of play is ‘not doing what we want, but doing what we can with what is given’.3 Play is an investigation of limits.

While some of the ASIO files held by the National Archives are closed to the public, most are ‘open with exception’. This means that sensitive parts of the files have been removed. Whole pages can be withheld, or sections of text blacked out – a process known as redaction.

A redaction is, by definition, an absence of information, and yet the frequency, density, and placement of redactions across a large collection of documents could conceivably tell us something interesting. So last year I wrote a kludgy computer vision script that found and extracted redactions from digitised ASIO files. I now have a collection of 250,000 redactions which I’ve shared on Figshare – grab a copy now!
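I haven't described how that script works, so the following is not it. It is a much simpler stand-in that illustrates the general idea: treat a scanned page as a grid of greyscale values and look for long horizontal runs of very dark pixels. The threshold, minimum width, and function name are all assumptions chosen for the sketch.

```python
def find_dark_runs(pixels, threshold=40, min_width=4):
    """Find horizontal runs of dark pixels at least `min_width` long.

    `pixels` is a list of rows of greyscale values (0 = black, 255 = white).
    Returns (row, start_col, end_col) tuples, with end_col exclusive.
    """
    runs = []
    for y, row in enumerate(pixels):
        start = None
        # A white sentinel after the last column closes any run still open.
        for x, value in enumerate(list(row) + [255]):
            if value <= threshold and start is None:
                start = x
            elif value > threshold and start is not None:
                if x - start >= min_width:
                    runs.append((y, start, x))
                start = None
    return runs
```

A real page would need more care, such as merging runs on adjacent rows into boxes and ignoring dark borders or photographs, but counting and locating runs like these is enough to start treating redactions as data points.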

I’m continuing to explore the possibilities of these redactions as data points. But there was also something visually interesting about the redactions, particularly when they were assembled en masse.

Here you can browse all 250,000 redactions. But that’s not all, you can also use them as entry points to the documents they were intended to obscure.

Contexts here have been reversed, the files have been turned inside out – the limits remain, indeed the scale of redaction is emphasised, and yet within these limits, perhaps even because of these limits, we can experience the files quite differently. We are no longer simply the subjects of state surveillance, we can reverse the gaze, inspect the process, and ask new questions.

The manipulation of contexts is not mere invention. The limits of access offer both meaning and rules. We have skin in this game, its outcomes matter, what is at stake is our ability to see, and be seen, within the cultural record. Access changes who we can imagine ourselves to be.

In the first half of the twentieth century, if you were deemed not to be ‘white’ and wanted to travel overseas from your home here in Australia, you had to carry special documents. Without them, you’d probably be stopped from returning – from coming home. This was ‘extreme vetting’ White Australia style.

Many thousands of these documents are now held by the National Archives of Australia. In 2011, I used my screen scraper to harvest about 12,000 images like this from RecordSearch. I then ran them through a facial detection script and created The Real Face of White Australia.

There are about 7,000 faces in this seemingly endless scrolling wall. And that’s from just a small sample of the White Australia records. It’s powerful, compelling and discomfiting. But the power comes not from any technical magic, but from the faces themselves – from what we feel when we meet their gaze. Once again the records have been turned inside out – instead of seeing files, metadata, or a list of search results, we see the people inside.

Play can be serious. It can make you feel things you don’t expect. It can challenge your certainties and take you to the limits of what you know.

That sounds a lot like research to me.

  1. M. J. Olson, ‘Hacking the humanities: Twenty-first-Century literacies and the “becoming other” of the humanities’, in Eleonora Belfiore and Anna Upchurch (eds), Humanities in the Twenty-first Century: Beyond utility and markets, Palgrave Macmillan, 2013, pp. 237–250.
  2. Sean Scalmer, ‘Introduction’, in Stuart Macintyre and Sean Scalmer (eds), What if? Australian history as it might have been, Melbourne University Press, Melbourne, 2006, pp. 1–11.
  3. Ian Bogost, Play Anything: The Pleasure of Limits, the Uses of Boredom, and the Secret of Games, Basic Books, New York, 2016, p. 236.

The Art and History of Calligraphy

Feb 25, 2017
On Thursday, 2 March (19.00–20.30), professional calligrapher Patricia Lovett will be giving a talk at the British Library, entitled 'The Art and History of Calligraphy'. Patricia will be drawing from the Library’s rich collections of manuscripts to tell us about the art and history of calligraphy from her own practitioner’s...

Based on a *True* True Story? Oscars edition

Feb 25, 2017

Oscar-nominated Hacksaw Ridge, Hidden Figures & Lion are all “based on true events”. But which is truly the truest of them all?

We carried out a scene-by-scene, beat-by-beat visual fact-check of several Oscar contenders – plus other true-story films like The Big Short, Spotlight and Wolf of Wall Street. Check out the newly-updated interactive visualization.

» See the truth
» Or just read the nicely tabulated data

Notes on running the DH-USB

Feb 25, 2017

Our digital archaeology textbook will be intertwined with an instance of the DHBox. One of the participants in that project is Jonathan Reeve, who has been building a version that runs off a bootable USB. So naturally, I had to give it a spin. I ran out, got a new usb stick and….

…had to figure out Bittorrent. Every time I went to install the client, every browser I had on every machine kept blocking it as malicious. Normally I can work around this sort of thing, but it was really pernicious. Turned out, my stable of computers were all quite happy with uTorrent instead. With that installed, I grabbed the torrent files from the DH-USB repository, and let them do their magic. It took 3 hrs to get the full .img file.

…had to figure out how to put that .img onto a usb stick such that it would be bootable. Unetbootin should’ve worked, but didn’t. In the end, I had to do it from the command line, per the ‘alternative instructions’:

MacOS: Identify the label of your USB drive with the command diskutil list. Then unmount the disk with diskutil unmountDisk /dev/diskX, replacing diskX with your drive name. Finally, run sudo dd if=/path/to/dh-usb.img of=/dev/rdiskX bs=1m again replacing /path/to/dh-usb.img with the path to the .img file, and diskX with the name of your disk.

Then I had to figure out how to get the damned machines to boot from the stick rather than their own hard drive. On the Mac, this was easy – just hold the alt key down while the machine powers up, and you can then select the usb stick. NB: you can also, it seems, select whatever wifi network happens to be in the air at this stage, but if you do this (I did) everything will go sproing shortly thereafter and the stick won’t boot. So don’t do this. On the Windows 10 machine I had access to, booting up from a disk or stick is no longer the straightforward ‘hold down f11’ or whatever. No, you have to search for the ‘advanced startup’ options, and then find the boot from disk option, where you specify the usb stick. THEN the machine powers down and up again… and will tell you that the security settings won’t let you proceed any further. Apparently, there’s a setting somewhere in the BIOS that you have to switch, but as it wasn’t my machine and I’d had enough, I abandoned it. Windows folks, godspeed. Incidentally, for various reasons, computers much older than about five years are out of luck, as some key pieces of ur-code have changed in recent years:

[you need] a modern system that supports UEFI. Legacy BIOS boot may be possible, but it hasn’t been extensively tested

I subsequently had some other issues as I tried to install R and R Studio, but I’ve sorted those out with Jonathan and by the time you read this, they probably won’t be issues any more (but you can click on the ‘closed issues’ on the repo to see what my issues were). One thing that drove me nuts was trying to persuade Arch Linux to find the damned wifi.

I eventually stumbled across this re ubuntu:

so tried this:

$ lspci -vvnn | grep -A 9 Network

and saw that I had kernel modules: brcmfmac, wl, but none in use. So I tried this:

$ sudo modprobe brcmfmac

and ran the first command again; kernel module now in use!

$ sudo wifi-menu

…and connected. Kept getting connection errors; went to settings > network and connected through there, ta da!


There you have it. A portable DH computer on a stick, ready to go. For use in classes, it’s easy enough to imagine just buying a bunch of usb sticks and filling them up with not only the computing parts but also the data sets, supporting documentation, articles etc and distributing them in class; for my online class this summer maybe the installation-onto-the-stick steps can be made more streamlined… of course, that’s what DH-Box prime is for, so I’ve asked the kind folks over in the school of computer science if they wouldn’t mind installing it on their open stack. We shall see.


Easily download large-ish survey datasets

Feb 25, 2017

Many government organizations release microdata for surveys every year. It comes as anonymized responses from each survey participant for each question in said survey. However, those who want to use this data often run into the challenge of downloading and parsing. It’s rarely straightforward.

So, Anthony Damico provides a big helping of R scripts to easily download data from a bunch of surveys. He calls the site Analyze Survey Data for Free.

Governments spend billions of dollars each year surveying their populations. If you have a computer and some energy, you should be able to unlock it for free, with transparent, open-source software, using reproducible techniques. We’re in a golden era of public government data, but almost nobody knows how to mine it with technology designed for this millennium. I can change that, so I’m gonna. Help. Use it.

The site has been around for a few years but I just discovered it. I’m not sure we’re in a “golden era of public government data” right now (although I’d be happy if you prove me wrong).

I recently used a modified script to download data from the CDC, and it saved me a bunch of time.


