BlogForever and migration

April 2, 2012 in Blog

Recently I have been putting together my report on the extent to which the BlogForever platform operates within the framework of the OAIS model. Inevitably, I have thought a bit about migration as one of the potential approaches we could use to preserve blog content.

Migration is the process whereby we preserve data by shifting it from one file format to another. We usually do this when the “old” format is in danger of obsolescence for a variety of reasons, while the “target” format is one we think we can depend on for a longer period of time. This strategy works well for relatively static, document-like content, such as format-shifting a text file to PDF.
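To make the idea concrete, here is a minimal sketch of a migration step in Python. I have used re-encoding legacy text as UTF-8 as a stand-in for a heavier conversion such as text to PDF; the function name is my own invention, not any particular tool's API:

```python
def migrate_text(src_bytes, src_encoding="latin-1"):
    """Minimal stand-in for a format migration step.

    A real migration (say, plain text to PDF/A) would call a dedicated
    converter, but the shape is the same: decode the content out of the
    at-risk form and re-encode it in the target form.
    """
    return src_bytes.decode(src_encoding).encode("utf-8")

# A Latin-1 byte string becomes its UTF-8 equivalent.
migrated = migrate_text(b"caf\xe9")
```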

The problem with blogs, and indeed with all web content, arises when we start thinking of the content exclusively in terms of file formats. The content of a blog could be said to reside in multiple formats, not just one; and even if we format-shift all the files we gather, does that really constitute preservation?

With BlogForever, we’re taking an approach to capture and ingest that has two discrete strands to it.

(1) We will be gathering and keeping the content in its “original” native formats, such as HTML, image files, CSS etc. At the time of writing, the plan is to have a repository record for each ingested blog post, with all its associated files (original images, CSS, PDF, etc.) connected to that record. These separate files will be preserved and presumably migrated over time, if any of these native formats acquire “at risk” status.

(2) We are also going to create an XML file (complete with all detected Blog Data Model elements) from each blog post we aggregate. What interests me here is that in this strand, an archived blog is captured and submitted as a stream of data rather than as a file format. It so happens that the format for storing that data stream is going to be XML. The CyberWatcher spider can harvest blog content by harnessing a blog’s RSS feed and by using blog-specific monitoring technologies such as blog pings; it also performs a complex parsing of the data it finds. The end result is a large chunk of “live” blog content, stored in an XML file.
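As a rough illustration of what this strand amounts to, here is a Python sketch that turns an RSS item into a per-post XML record. The post/title/uri/body element names are placeholders of mine; the real Blog Data Model, and what the CyberWatcher spider actually emits, are far richer:

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 fragment standing in for a real feed fetched by the spider.
RSS = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>First post</title>
    <link>http://example.org/first-post</link>
    <description>Hello, world.</description>
  </item>
</channel></rss>"""

def harvest_to_xml(rss_text):
    """Transform each RSS item into a simple per-post XML record.

    The <post>/<title>/<uri>/<body> element names are hypothetical --
    the real Blog Data Model schema is much richer than this.
    """
    channel = ET.fromstring(rss_text).find("channel")
    records = []
    for item in channel.findall("item"):
        post = ET.Element("post")
        ET.SubElement(post, "title").text = item.findtext("title")
        ET.SubElement(post, "uri").text = item.findtext("link")
        ET.SubElement(post, "body").text = item.findtext("description")
        records.append(ET.tostring(post, encoding="unicode"))
    return records

records = harvest_to_xml(RSS)
```

Even this toy version shows the point made below: the harvest itself is already a transformation into XML.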

Two things are of interest here. The first is that the spider is already performing a form of migration, or transformation, simply by the act of harvesting the blog. The second is that it is migrating to XML, which we already know to be a very robust and versatile preservation format, more so even than a non-proprietary tabular format such as CSV. The added value of XML is that it can easily store more complex data structures and multiple values.
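A quick sketch shows the kind of structure that sits naturally in XML but not in a flat CSV row: a post with a variable number of comments, a one-to-many relationship that CSV can only express through duplication or ad-hoc delimiter tricks. The element names here are illustrative, not the Blog Data Model:

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: one post carrying several nested comments.
post = ET.Element("post")
ET.SubElement(post, "title").text = "On migration"
comments = ET.SubElement(post, "comments")
for author, text in [("alice", "Agreed."), ("bob", "Not sure...")]:
    c = ET.SubElement(comments, "comment")
    ET.SubElement(c, "author").text = author
    ET.SubElement(c, "text").text = text

xml_out = ET.tostring(post, encoding="unicode")
```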

If that assumption about the spider is correct, perhaps we need to start thinking of it as a transformation / validation tool. The more familiar digital preservation workflow assumes that migration will probably happen some time after the content has been ingested; what if migration is happening before ingest? We’re already actively considering the use of the preservation metadata standard PREMIS to document our preservation actions. Maybe the first place to use PREMIS is on the spider itself, capturing technical metadata and logs about how the spider is performing. Indeed, some of the D4.1 user requirements refer to this: DR6 ‘Metadata for captured Contents’ and DR17 ‘Metadata for Blogs’.
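By way of illustration, here is a sketch of the sort of event record PREMIS would have the spider emit for each harvest-time transformation. The element names (event, eventType, eventDateTime, eventDetail, eventOutcome) are drawn from the PREMIS data dictionary, but this is a simplified, namespace-free rendering, not a valid PREMIS document, which would also need identifiers and linked agents and objects:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def premis_event(event_type, detail, outcome="success"):
    """Build a minimal PREMIS-style event record (illustrative only)."""
    ev = ET.Element("event")
    ET.SubElement(ev, "eventType").text = event_type
    ET.SubElement(ev, "eventDateTime").text = datetime.now(timezone.utc).isoformat()
    ET.SubElement(ev, "eventDetail").text = detail
    ET.SubElement(ev, "eventOutcome").text = outcome
    return ET.tostring(ev, encoding="unicode")

# e.g. logging the harvest-time transformation performed by the spider
record = premis_event("migration", "RSS item parsed and serialised to Blog Data Model XML")
```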

We anticipate that the submitted XML will be further transformed in the Invenio repository via its databases, and that various metadata additions and modifications will transform it from a Submission Information Package into an Archival Information Package and a Dissemination Information Package. As far as I can see, though, the XML format remains in use throughout these processes. It feels as though the BlogForever workflow could have a credible preservation process hard-wired into it, and that (apart from making Archival Information Packages, backing up, and keeping the databases free from corruption) very little is needed from us in the way of migration interventions.
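As a sketch of the kind of metadata addition involved in moving from a submitted record towards an archival one, the snippet below attaches a checksum to a post record. The fixity element and its layout are my own illustration, not the Invenio or BlogForever schema:

```python
import hashlib
import xml.etree.ElementTree as ET

def sip_to_aip(sip_xml):
    """Sketch of enriching a submitted post record (SIP) towards an
    archival record (AIP) by attaching fixity metadata.

    The <fixity> element is illustrative only; a real AIP would carry
    much more (provenance, identifiers, rights, context).
    """
    record = ET.fromstring(sip_xml)
    fixity = ET.SubElement(record, "fixity")
    ET.SubElement(fixity, "algorithm").text = "SHA-256"
    ET.SubElement(fixity, "digest").text = hashlib.sha256(sip_xml.encode()).hexdigest()
    return ET.tostring(record, encoding="unicode")

aip = sip_to_aip("<post><title>First post</title></post>")
```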

It also feels as though it would be much easier to test this methodology: the focus of the testing becomes the spider > XML > repository > database workflow, rather than a question of juggling multiple strategies and testing them against file formats and/or significant properties. Of course, migration would still need to apply to the original native file formats we have captured, and this would probably need to be part of our preservation strategy. But it’s the XML renditions that most users of BlogForever will be experiencing.