I hope to blog more extensively on semantic web technologies, but decided to start with a simple overview of the subject for those just getting started. The following was actually something I decided to write earlier this week after a conversation a colleague and I had with a client.
More on museum datasets, un-comprehensive-ness, data mining
(Another short response post)
Thus far we’ve not had much luck with museum datasets.
Sure, some of us have made our own internal lives easier by developing APIs for our collection datasets, or generated some good PR by releasing them without restrictions. In a few cases enthusiasts have made mobile apps for us, or made some quirky web mashups. These are fine and good.
But the truth is that our data sucks. And by ‘our’ I mean the whole sector.
Earlier in the year when Cooper-Hewitt released their collection data on Github under a Creative Commons Zero license, we were the first in the Smithsonian family to do so. But as PhD researcher Mia Ridge found after spending a week in our offices trying to wrangle it, the data itself was not very good.
As I said at the time of release,
Philosophically, too, the public release of collection metadata asserts, clearly, that such metadata is the raw material on which interpretation through exhibitions, catalogues, public programmes, and experiences are built. On its own, unrefined, it is of minimal ‘value’ except as a tool for discovery. It also helps remind us that collection metadata is not the collection itself.
One of the reasons for releasing the metadata was simply to get past the idea that it was somehow magically ‘valuable’ in its own right. Curators and researchers know this already – they’d never ‘just rely on metadata’, they always insist on ‘seeing the real thing’.
Last week Jasper Visser pointed to one of the recent SIGGRAPH 2012 presentations which had developed an algorithm to look at similarities in millions of Google Street View images to determine ‘what architectural elements of a city made it unique’. I and many others (see Suse Cairns) loved the idea and immediately started to think about how this might work with museum collections – surely something must be hidden amongst those enormous collections that might be revealed with mass digitisation and documentation?
I was interested a little more than most because one of our curators at Cooper-Hewitt had just blogged about a piece of balcony grille in the collection from Paris. In the blogpost the curator wrote about the grille but, as one commenter quickly pointed out, didn’t provide a photo of the piece in its original location. Funnily enough, a quick Google search for the street address in Paris from which the grille had been obtained quickly revealed not only Google Street View of the building but also a number of photos on Flickr of the building specifically discussing the same architectural features that our curator had written about. Whilst Cooper-Hewitt had the ‘object’ and the ‘metadata’, the ‘amateur web’ held all the most interesting context (and discussion).
So then I began thinking about the possibilities for matching all the architectural features from our collections to those in the Google Street View corpus . . .
But the problem with museum collections is that they aren’t comprehensive – even if their data quality was better and everything was digitised.
As far as ‘memory institutions’ go, they are certainly no match for library holdings or archival collections. Museums don’t try to be comprehensive, and at least historically they haven’t been able to even consider being so. Or, as I’ve remarked before, it is telling that the memory institution that ‘acquired’ the Twitter archive was the Library of Congress and not a social history museum.
Accepting the challenges of representing scholarly knowledge
I just read, and thought some of you would also enjoy, this piece on "The inevitable messiness of digital metadata" by David Weinberger of Harvard's Berkman Center and Library Innovation Lab (and a Cluetrain co-author, for fellow geeks who remember the pre-Web 2.0 days). He writes in response to an op-ed by Neil Jeffries in Wikipedia's Signpost.
Sprucing up the TikaFileIdentifier
Following the SPRUCE mashup I attended in April, we are very pleased to be one of the organizations granted a SPRUCE Project funding award, which will allow us to 'spruce' up the TikaFileIdentifier tool. (Paul has written more about these funding awards on the OPF site.)
TikaFileIdentifier is the tool which was developed at the mashup to address a problem several of us were having extracting metadata from batches of files, in our case within ISO images. Due to the nature of the mashup event the tool is still a bit rough around the edges, and this funding will allow us to improve on it. We aim to create a user interface and a simpler install process, and carry out performance improvements. Plus, if resources allow, we hope to scope some further functionality improvements.
This is really great news: with the improvements this funding allows us to make, the TikaFileIdentifier will give us better metadata for our digital files, and more efficiently than our current system of manually checking each file in a disk image. Hopefully the simpler user interface and other improvements mean that other repositories will want to make use of it as well; I certainly think it will be very useful!
Project idea/request for comment: OpenDOI
With Martin Eve's kind permission I am copying and pasting from his post.
[N.B: I annotated his original text by adding some hyperlinks just in case, to make discussion more accessible to all and provide further info if needed].
“Emerging Bibliographic Tools and Technologies”
From INFOdocket:
From UK’s discovery Blog: “Emerging Bibliographic Tools and Technologies”
From a discovery Blog Post:
Last week I attended a workshop on ‘emerging bibliographic tools’ organised by JISC. The idea of the workshop was to bring together a small group of people with experience of a wide variety of tools used to transform, publish, and otherwise manipulate bibliographic data.
Here Are Some of the Topics and Tools Mentioned:
- Linked Data and RDF
- Identifiers – the challenges of finding and exploiting appropriate ones such as DOI, ISBN, AuthorClaim and ORCID
- Automatic metadata creation from full text resources
- Ontologies and representations – from MARC to BibJSON to RIS to BibTeX to Bibliographic Ontology to Schema.org
- Spidering/Web crawling technology: CrystalEye, PubCrawler, nutch
"Family Names Service"
From the CrossTech blog:
"...announcing a small web API that wraps a family name database here at CrossRef R&D. The database, built from CrossRef's metadata, lists all unique family names that appear as contributors to articles, books, datasets and so on that are known to CrossRef. As such the database likely accounts for the majority of family names represented in the scholarly record.
The web API comes with two services: a family name detector that will pick out potential family names from chunks of text and a family name autocompletion system.
Very brief documentation can be found here along with a jQuery example of autocompletion.
...We're not proposing this database as an authority but rather something that backs a practical service for family name detection and autocompletion."
Things clever people do with your data #65535: Introducing ‘Free Your Metadata’
Last year Seth van Hooland at the Free University Brussels (ULB) approached us to look at how people used and navigated our online collection.
A few days ago Seth and his colleague Ruben Verborgh from Ghent University launched Free Your Metadata – a demonstrator site showing how even irregular metadata can have value to others and how, if it is released rather than clutched tightly (until that mythical day when it is ‘perfect’), it can be cleaned up and improved using new software tools.
What’s awesome is that Seth & Ruben used the Powerhouse’s downloadable collection datafile as the test data for the project.
Here’s Seth and his team talking about the project.
F&N: What made the Powerhouse collection attractive for use as a data source?
Number one, it’s available for everyone and therefore our experiment can be repeated by others. Otherwise, the records are very representative for the sector.
F&N: Was the data dump more useful than the Collection API we have available?
This was purely due to the way Google Refine works: on large amounts of data at once. But also, it enables other views on the data, e.g., to work in a column-based way (to make clusters). We’re currently also working on a second paper which will explain the disadvantages of APIs.
F&N: What sort of problems did you find with our collection?
Sometimes really broad categories. Other inconveniences could be solved in the cleaning step (small textual variations, different units of measurement). All issues are explained in detail in the paper (which will be published shortly). But on the whole, the quality is really good.
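The ‘small textual variations’ mentioned above are the kind of thing a key-collision clustering pass can catch. Here is a minimal Python sketch of the idea, assuming a simplified fingerprint function (strip accents and punctuation, lowercase, sort the unique words). It illustrates the general technique only and is not Google Refine's actual code; the function names and the sample values are made up for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key so that near-identical strings collide."""
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, trim, and remove punctuation.
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.strip().lower())
    # Sort the unique tokens so word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only real clusters."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical category values, invented for illustration.
    sample = [
        "Glass plate negative",
        "glass plate negative.",
        "Negative, glass plate",
        "Teapot",
    ]
    for group in cluster(sample):
        print(group)
```

Each cluster is a set of candidate variants of one term; a person still decides which spelling to keep.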
F&N: Why do you think museums (and other organisations) have such difficulties doing simple things like making their metadata available? Is there a confusion between metadata and ‘images’ maybe?
There is a lot of confusion about what the best way is to make metadata available. One of the goals of the Free Your Metadata initiative is to put forward best practices for doing this. Institutions such as libraries and museums have a tradition of publishing only information that is 100% complete and correct, which is more or less impossible in the case of metadata.
F&N: What sorts of things can now be done with this cleaned up metadata?
We plan to clean up, reconcile, and link several other collections to the Linked Data Cloud. That way, collections are no longer islands, but become part of the interlinked Web. This enables applications that cross the boundaries of a single collection. For example: browse the collection of one museum and find related objects in others.
F&N: How do we get the cleaned up metadata back into our collection management system?
We can export the result back as TSV (like the original result) and e-mail it. Then, you can match the records with your collection management system using record IDs.
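Matching a cleaned TSV export back to the original records by ID could look roughly like the Python sketch below. The column name `record_id`, the `category` field and the sample rows are assumptions made for illustration; a real collection management system will have its own import route and field names.

```python
import csv
import io


def load_tsv(text, id_column="record_id"):
    """Read TSV text into a dict keyed by record ID (column name is assumed)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_column]: row for row in reader}


def merge_cleaned(original, cleaned):
    """Overlay cleaned fields onto the original records, matched by ID."""
    merged = {}
    for record_id, row in original.items():
        updated = dict(row)
        # Records missing from the cleaned export are left untouched.
        updated.update(cleaned.get(record_id, {}))
        merged[record_id] = updated
    return merged


if __name__ == "__main__":
    # Hypothetical data standing in for the original and cleaned exports.
    original_tsv = "record_id\tcategory\n1\tglass plate negative.\n2\tTeapot\n"
    cleaned_tsv = "record_id\tcategory\n1\tGlass plate negative\n"
    merged = merge_cleaned(load_tsv(original_tsv), load_tsv(cleaned_tsv))
    for record_id, row in merged.items():
        print(record_id, row["category"])
```

Keying everything on the record ID means the cleaned file can be a partial export: only the rows that came back are changed.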
–
Go and explore Free Your Metadata and play with Google Refine on your own ‘messy data’.
If you’re more nerdy you probably want to watch their ‘cleanup’ screencast where they process the Powerhouse dataset with Google Refine.
Interviews w/5 metadata experts
From INFOdocket:
DCMI (Dublin Core Metadata Initiative) 2011 Conference gets underway next week in The Hague and the conference web site is home to interviews with DCMI leaders and conference speakers.
+ Emmanuelle Bermès, Modern Art Museum Centre Pompidou
Metadata and semantic web expert Emmanuelle Bermès... will reflect on Linked Data issues for libraries, archives and museums, based on her experience at the National Library of France and the modern art museum Centre Pompidou. In this interview, Bermès talks about her experience implementing Dublin Core for the French digital library Gallica.
+ Makx Dekkers, CEO, Dublin Core Metadata Initiative (2001-2011)
...In this interview, Dekkers reflects on the major accomplishments of the last decade, and his vision for the future of Dublin Core...
+ Stuart Sutton, CEO, Dublin Core Metadata Initiative (2011- )
Stuart Sutton... tells us about part of his future vision for Dublin Core, and what he hopes to achieve at the helm of the organisation.
+ Tom Baker, Chief Information Officer (Communications, Research and Development) & Diane Hillmann, Vocabulary Maintenance Officer
Tom Baker & Diane Hillmann are two of the longest standing members of the Dublin Core community. Both have been involved in the metadata schema for over 15 years.... In this interview, they look at what makes Dublin Core stand out from other metadata schemas...