by Mitch Fraas
Today I’m teaching a workshop on using “screen scraping” in the digital humanities. No workshop is really useful without practical examples, so last week I decided to try out my screen scraping chops on an exciting new database of book history data. The Kislak Center at Penn (where I’m Scholar in Residence) is quickly becoming one of the most important sites for book and manuscript provenance research, and I wanted to see what I could do to highlight the potential for making extant provenance data more useful through new visualizations.
Several years ago, a few of the scholars behind the monumental Corpus of British Medieval Library Catalogues project (now at fifteen volumes), led by Richard Sharpe, began working on an online database to update and provide access to the wealth of information on medieval manuscripts contained in Neil Ker’s Medieval Libraries of Great Britain (1941, 1964, and 1987). These volumes include accounts of books and manuscripts known to survive today that were owned within Great Britain before the mid-16th century. Recently, through grants from the Mellon Foundation and others, the team has taken much of this information and made it available online in the MLGB3 searchable database. The site appears to be in beta mode at the moment and is only intermittently accessible, but when it launches fully it will be an amazing resource and the culmination of a good deal of work by Sharpe and others. Looking through the database, I was especially intrigued by the wealth of data on the current location of many of these medieval books and manuscripts. Given how comprehensive and detailed the project data is, even at this stage, I wanted to get a sense of what kind of picture would develop if we looked at the points of origin and current location of all these manuscripts in aggregate.
As of last week, the MLGB3’s online database included over 6,000 records for books and manuscripts owned by medieval libraries. In order to look at them in aggregate, I used the ever-helpful wget utility to pull down each record in order. I was left with a gigantic mess of HTML with the useful data hidden within it. After extensive cleanup and parsing of the data, I was able to throw the location names of the original medieval libraries, as well as those of the current owners, against David Zwiefelhofer’s geocoding service (which I believe uses the Yahoo API) to get longitudes and latitudes. This didn’t go entirely smoothly, as the names of ruined monasteries tend not to register very well in geo databases. Fortunately, there is a wealth of Wikipedia entries providing detailed longitude and latitude information on a wide range of English historical sites, and I was able to fill in the blanks.
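For workshop purposes, the cleanup and geocoding steps described above might be sketched roughly as follows. This is not the code used for the project, only a minimal illustration in Python's standard library. It assumes the record pages have already been saved locally, and the class names `medieval-library` and `current-location` are hypothetical stand-ins for whatever markup the real pages use. The `geocoder` argument is likewise a placeholder for any lookup service, and the hand-built `gazetteer` plays the role of the coordinates filled in from Wikipedia. The coordinates in the usage example are approximate and for illustration only.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose class is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # class name of the element being read, if any
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name
                self._depth = 1
                self.fields.setdefault(name, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def extract_places(html_text,
                   wanted=("medieval-library", "current-location")):
    """Return {class name: cleaned text} for one saved record page.

    The default class names are assumptions, not the real MLGB3 markup.
    """
    parser = FieldExtractor(wanted)
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left over from the page layout.
    return {key: " ".join(text.split()) for key, text in parser.fields.items()}


def locate(name, geocoder, gazetteer):
    """Try the geocoder first, then fall back to a hand-built gazetteer."""
    coords = geocoder(name)
    if coords is None:
        coords = gazetteer.get(name)
    return coords


if __name__ == "__main__":
    page = ('<div><span class="medieval-library">Fountains Abbey</span>'
            '<span class="current-location">Oxford</span></div>')
    # Stand-in geocoder that only knows modern place names.
    known = {"Oxford": (51.75, -1.26)}
    # Coordinates looked up by hand for sites the geocoder misses.
    by_hand = {"Fountains Abbey": (54.11, -1.58)}
    for field, place in extract_places(page).items():
        print(field, place, locate(place, known.get, by_hand))
```

Keeping the fallback as a plain dictionary makes it easy to see which places had to be filled in by hand, which is worth recording when the resulting map is later interpreted.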
Read full post here. (Originally posted 12 November 2013)