Anatomy of a Web Archive | The Signal: Digital Preservation

By Nicholas Taylor | November 8, 2013

I’m inclined to blame the semantic flexibility of the word “archive” for the fact that someone with no previous exposure to web archives might variously suppose that they are: the result of saving web pages from the browser, institutions acting as repositories for web resources, a navigational feature of some websites allowing for browsing of past content, online storage platforms imagined to be more durable than the web itself, or, simply, “the Wayback Machine.” For all the different policies and practices that guide cultural heritage institutions’ approaches to web archiving, however, the “web archives” that they create and preserve are remarkably consistent. What are web archives, exactly?

WARC, West African Research Center, by Robin, on Flickr

At the most basic level, web archives are one of two closely-related container file formats for web content: the Web ARchive Container format or its precursor, the ARchive Container format. A quick perusal of the data formats used by the international web archiving community shows a strong predominance of WARC and/or ARC. The ratification of WARC as an ISO standard in 2009 made it an even more attractive preservation format, though both WARC and ARC had been de facto standards since well before then. First used in 1996, the ARC format is more specifically described by the Sustainability of Digital Formats website as the “Internet Archive ARC file format”, a testament both to the outsized contribution of the Internet Archive to the web archiving field and to the recentness of the community’s broadening membership.

Anatomically, a WARC or ARC file can be thought of as a single document made up of a series of concatenated records. For the WARC format, these records can be one of eight different types, the most predictable of which represents an archived resource (e.g., HTML, JavaScript, image, video, Flash, etc.) retrieved from the web. Examples of other record types include crawler characteristics, HTTP responses, HTTP requests, resource capture details, pointers to previously-captured content (i.e., when crawler-based content de-duplication is enabled), alternate formats for previously-captured content (e.g., format obsolescence use case), and resources spanning multiple WARC files. Aside from the field designating the record type, there are three other mandatory fields found in the header of every WARC record: a record identifier, the record body size, and a timestamp.
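
To make the "series of concatenated records" idea concrete, here is a minimal Python sketch that walks an uncompressed WARC stream and checks each record for the four mandatory header fields. It assumes the header names used in the WARC specification (WARC-Type, WARC-Record-ID, Content-Length, WARC-Date); the helper names read_records and make_record, the sample record IDs, and the sample payloads are hypothetical and exist only for illustration. It also simplifies in ways a real reader would not: it treats header names as case-sensitive, ignores folded header lines, and does not handle the per-record gzip compression that WARC files in the wild commonly use.

```python
import io

# The four mandatory header fields: record type, record identifier,
# record body size, and timestamp (names as used in the WARC specification).
MANDATORY_FIELDS = ("WARC-Type", "WARC-Record-ID", "Content-Length", "WARC-Date")


def read_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        # Skip the blank lines that separate one record from the next.
        while version in (b"\r\n", b"\n"):
            version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected version line: {version!r}")

        headers = {}
        for line in iter(stream.readline, b""):
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        missing = [f for f in MANDATORY_FIELDS if f not in headers]
        if missing:
            raise ValueError(f"missing mandatory fields: {missing}")

        # The body size tells us exactly how many bytes belong to this record.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def make_record(record_type, record_id, date, body):
    """Build one illustrative record (hypothetical helper, not a full writer)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Record-ID: <urn:uuid:{record_id}>\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two concatenated records with made-up identifiers and payloads.
SAMPLE = make_record(
    "warcinfo",
    "11111111-1111-1111-1111-111111111111",
    "2013-11-05T00:00:00Z",
    b"software: example-crawler\r\n",
) + make_record(
    "resource",
    "22222222-2222-2222-2222-222222222222",
    "2013-11-05T00:00:01Z",
    b"<html>hello</html>",
)

for headers, body in read_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Record-ID"], len(body))
```

Because every record declares its own body size, a reader can hop from one record to the next without interpreting the payload, which is part of what makes a simple concatenation of records workable as a container format.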

Read full post here. (Originally posted on 5 November 2013)