I’d like to propose an alternative digital preservation view for the BF partners to consider.
The preservation problem is undoubtedly going to look complicated if we concentrate on the live blogosphere. It’s an environment that is full of complex behaviours and mixed content. Capturing it and replaying it presents many challenges.
But what type of content is going into the BF repository? Not the live blogosphere. What’s going in is material generated by the spider: it’s no longer the live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BF system. If you like, the spider creates a “rendition” of the live web, recast into the form of a structured XML file.
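To make the idea of a “rendition” concrete, here is a minimal sketch of what a spidered blog recast as structured XML might look like, using Python’s standard `xml.etree.ElementTree`. The element and attribute names are purely illustrative assumptions, not the actual BF data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch only: element names are illustrative,
# not the real BF schema.
blog = ET.Element("blog", url="http://example.org/blog")
post = ET.SubElement(blog, "post", id="1")
ET.SubElement(post, "title").text = "First post"
ET.SubElement(post, "published").text = "2011-05-01T10:00:00Z"
ET.SubElement(post, "content").text = "Hello, world."

# The spider's output: no live-web behaviour left, just parsed structure.
rendition = ET.tostring(blog, encoding="unicode")
print(rendition)
```

The point of the sketch is simply that, once the content exists in this form, the preservation target is a well-formed XML file rather than a living web page.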
What I propose is that these renditions of blogs should become the target of preservation. This way, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce.
If these blog renditions are preservable, then the preservation performance we would like to replicate is the behaviour of the Invenio database, and not live web behaviour. All the preservation strategy needs to do is to guarantee that our normalised objects, and the database itself, conform to the performance model.
When I say “normalised”, I mean the crawled blogs that will be recast in XML. As I’ve suggested previously, XML is already known to be a robust preservation format. We anticipate that all the non-XML content will be images, stylesheets, multimedia, and attachments. Preservation strategies for this type of content are already well understood in the digital preservation world, and we can adapt them.
There is already a strand of the project that is concerned with migration of the database, to ensure future access and replay on applications and platforms of the future. This in itself could feasibly form the basis of the long-term preservation strategy.
The preservation promise in our case should not guarantee to recreate the live web, but rather to recreate the contents of the BF repository and to replicate the behaviour of the BF database. After all, that is the real value of what the project is offering: searchability, retrievability, and the creation of structure (parsed XML files) where there is little or none (the live blogosphere).
Likewise it’s important that the original order and arrangement of the blogs be supported. I would anticipate that this will be one of the possible views of the harvested content. If it’s possible for an Invenio database query to “rebuild” a blog in its original order, that would be a test of whether preservation has succeeded.
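The proposed test could be sketched roughly as follows, using `sqlite3` purely as a stand-in for the Invenio database; the table and column names are assumptions for illustration. If the stored sequence metadata survives preservation, a simple ordered query reproduces the original arrangement.

```python
import sqlite3

# Stand-in database: rows arrive out of order, as a harvest might.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (blog_id TEXT, seq INTEGER, title TEXT)")
db.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [("b1", 2, "Second post"), ("b1", 1, "First post"), ("b1", 3, "Third post")],
)

# "Rebuild" the blog in its original order from preserved metadata.
rebuilt = [row[0] for row in db.execute(
    "SELECT title FROM posts WHERE blog_id = ? ORDER BY seq", ("b1",))]
print(rebuilt)
```

If the rebuilt list matches the blog’s original order, that is one measurable sign the preservation has succeeded.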
As to PREMIS metadata: in this alternative scenario the live data in the database and the preserved data are one and the same thing. In theory, we should be able to manipulate the database to devise a PREMIS “view” of the data, with any additional fields needed to record our preservation actions on the files.
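As a rough illustration of such a “view”, the sketch below derives a PREMIS-style event record from a database row. The field names echo PREMIS terminology (`eventType`, `eventDateTime`), but the row layout and values are hypothetical.

```python
# Hypothetical database row recording a preservation action on a file.
row = {"file_id": "blog-42.xml", "migrated_at": "2012-01-15T09:30:00Z"}

# A PREMIS-flavoured event "view" derived from that row; field names
# follow PREMIS conventions, but this is not a full PREMIS record.
premis_event = {
    "eventType": "migration",
    "eventDateTime": row["migrated_at"],
    "linkingObjectIdentifier": row["file_id"],
    "eventOutcome": "success",
}
print(premis_event["eventType"])
```

The attraction is that nothing is duplicated: the preservation metadata is projected out of the same database that serves the live system.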
In short, is the project really doing “web archiving” at all? And does it matter if we aren’t?
In summary I would suggest:
- We consider the target of preservation to be crawled blogs which have been transformed into parsed XML (I anticipate that this would not invalidate the data model).
- We regard the spidering action as a form of “normalisation” which is an important step to transforming unmanaged blog content into a preservable package.
- Following the performance model proposed by the National Archives of Australia, we declare that the performance we wish to replicate is that of the normalised files in the Invenio database, rather than the behaviours of individual blogs. This approach potentially makes it simpler to define “significant properties”: instead of trying to define the significant properties of millions of blogs and their objects, we could concentrate on the significant properties of our normalised files and of Invenio.