Episode 30 – Live From Egypt!

On this episode we were lucky to have a live link to Alexandria, Egypt, for Wikimania 2008, the international meeting of those who work on Wikipedia and related open collaborative projects. In the feature segment we talk with Liam Wyatt of Wikipedia Weekly, who gives an insider’s scoop on the issues, debates, and future of Wikipedia. In the news roundup we discuss Yahoo’s new open search service, BOSS, and Google’s new virtual world, Lively, among other things. Picks of the week include some advice from Google’s blogs, some rich web-based applications, and Gmail power user tweaks.

Links mentioned on the podcast:
Wikimania 2008
Wikipedia Weekly
Yahoo BOSS
Google Lively
Google Labs Gmail tweaks
Requesting reconsideration using Google Webmaster Tools
Technologies Behind Google Ranking

Running time: 48:03
Download the .mp3

Episode 29 – Making It Count

As forms of scholarship move from the analog world of paper to the digital realm of the web, a debate has begun about how to give credit—if at all—to these new forms for the purposes of promotion and tenure. What will happen to peer review? What kinds of digital work should “count,” and how? That’s the featured discussion on this episode. We also cover the launch of Firefox 3, university presses putting their books on Amazon’s Kindle device, and the release of better copyright records.

Links mentioned on the podcast:
Google publishes copyright status of books from 1923-1963
U.S. Copyright Office Record Search
Mills on “Making Digital Scholarship Count”
Journal on Computing and Cultural Heritage
Creative Commons Case Studies
MozillaZine on “about:config”

Running time: 44:02
Download the .mp3

Shingles and Near Duplicate Detection

Sergei Vassilvitskii of Yahoo! has a useful ppt describing work to identify duplicate and near-duplicate pages on the Web using shingles. He claims that 25%-40% of all WWW documents are duplicates or near duplicates. Hashing whole documents cannot identify near duplicates, while edit distance will not scale. The approach instead hashes a small number of shingles (n-grams) per document and calculates similarity as the rate at which the mini-hashes agree. The talk also has a useful discussion of Jaccard similarities. It is based on Andrei Broder's (AltaVista and Yahoo!) work, described in Identifying and filtering near-duplicate documents and previous papers cited there. There are other commercial applications of this approach, such as Equivio's near-duplicate identification service, which uses a related similarity measure.
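
To make the shingle idea concrete, here is a minimal Python sketch of the general technique. It is not the scheme from the talk or from Broder's papers: the word-level shingles, the shingle size `k=3`, the use of seeded BLAKE2b hashes as the family of hash functions, and all function names are assumptions chosen for illustration. It computes the exact Jaccard similarity of two shingle sets and then estimates it as the fraction of positions where the two min-hash signatures agree.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; each seed acts as a
    different hash function (an illustrative choice, not a standard one)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each of num_hashes hash functions, keep the minimum hash value
    over the shingle set. The list of minima is the signature."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions
    on which the two documents agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For example, "the quick brown fox jumps over the lazy dog" and the same sentence ending in "cat" share 6 of their 8 distinct 3-word shingles, so `jaccard` returns 0.75, and the estimate from two 128-value signatures should land close to that. The point of the signatures is that they are small and fixed in size, so they can be compared without holding the full shingle sets for every page.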

While I am at it, have a look at Detecting Near Duplicates in Big Data for pointers to recent work at Google on the same problem. See also the recent International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN).

Datawocky: More data and human evaluation

Anand Rajaraman in Datawocky makes the case, by reference to the Netflix challenge, that more data usually beats better algorithms, and provides a little more detail in part two of the same post. He also notes in Are Machine-Learned Models Prone to Catastrophic Errors? that Google continues to use human evaluation as part of its search algorithm tuning, suggesting that machine learning, based on seen instances, can suffer from the "Black Swan" problem. Finally, he makes the case, based on another blog entry, that one should Change the algorithm, not the dataset if your approach can't handle the scale of data you are throwing at it. Interesting comments all. A blog to watch.

Episode 28 – Raising the BarCamp

Might there be an alternative to the conventional meetings and conferences academics, librarians, and museum professionals go to every year, where papers and panels—and often bored or distracted attendees—are the norm? This episode’s feature story tackles that question by looking back at the experience of THATCamp: The Humanities and Technology Camp, a less structured “unconference” or “barcamp” that turned everyone into active participants. The roundtable discussion of the news includes a discussion of what the iPhone 3G and iPhone apps mean for educational and cultural institutions. Picks of the week include a new site on the Soviet Gulag, a way to avoid distractions on the Mac, and an open source mapping site.

Links mentioned on the podcast:
OS X Spaces
Gulag: Many Days, Many Lives
Open Street Map

Running time: 45:19
Download the .mp3

Episode 27 – All Atwitter

As Dan finally buckles under and joins in the most hyped Web 2.0 site of the moment, Twitter, Tom and Mills join him to debate the merits—and demerits—of the “microblogging” craze. Do services like Twitter merely increase the distractions and noise from the web, or might they be helpful for communication and community building in academia? In the news roundup, we cover Microsoft’s exit from book digitization and the significance of the tech layoffs at the University of Washington. Picks of the week include a podcast series from Harvard, a blog post explaining the semantic web, and a wiki for digital research tools.

Links mentioned on the podcast:
Mills on Twitter
Media Berkman
Semantic Web Patterns
Digital Research Tools (DiRT) wiki

Running time: 47:21
Download the .mp3

Similarity as a Scholarly Primitive

I gave this 4/6 talk at the Chicago Bamboo Project Workshop last week. I used Google's Presentation system in place of PowerPoint, which allows you to present with only a browser and to embed the talk in posts. Very handy, particularly since one can collaborate with others and provide links to the full-screen presentation [Click here].

Episode 26 – Free for All

At a time when everything seems to be trending toward being freely available online, how can education and digital resources and tools for academia, libraries, and museums sustain themselves? Tom, Dan, and Mills discuss models for sustainability in the age of the free in the feature segment of this week’s podcast. In the news roundup, we cover the RIAA’s newfound love of the lawsuit and the University of Chicago Law School’s newfound hate of the laptop. Picks of the week include a proportional mapping tool, a thesis repository, and a site that helps non-techies understand and use RSS.

Links mentioned on the podcast:
Mills on free education
Laura Dewis, “Money makes the world go… open?”
Harvard Thesis Repository
World Mapper

Running time: 43:07
Download the .mp3

Episode 25 – Get With the Program

Tom and Dan are joined this week by Bill Turkel and Steve Ramsay, who provide fascinating insights into the nature of computer programming and how those in the humanities, museums, and libraries can get started with this foreign language. Bill and Steve were also kind enough to add their comments to our news roundup discussion of the launch of Google App Engine, which raises questions about outsourcing, and myLOC.gov, which raises questions about whether digital collections should have their own personalization tools. Picks for the week include two books on programming, an organizational tool for Thunderbird, and a map for browsing American history.

Links mentioned on the podcast:
The Programming Historian
Google App Engine
Network in Canadian History & Environment
Social Explorer
MIT Simile’s Seek
Beautiful Code
The Mythical Man-Month

Running time: 48:17
Download the .mp3