You are browsing the archive for blog preservation.

Preservation in BlogForever: an alternative view

July 23, 2012 in Blog

I’d like to propose an alternative digital preservation view for the BF partners to consider.

The preservation problem is undoubtedly going to look complicated if we concentrate on the live blogosphere. It’s an environment that is full of complex behaviours and mixed content. Capturing it and replaying it presents many challenges.

But what type of content is going into the BF repository? Not the live blogosphere. What’s going in is material generated by the spider: it’s no longer the live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BF system. If you like, the spider creates a “rendition” of the live web, recast into the form of a structured XML file.

What I propose is that these renditions of blogs should become the target of preservation. This way, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce.

If these blog renditions are preservable, then the preservation performance we would like to replicate is the behaviour of the Invenio database, and not live web behaviour. All the preservation strategy needs to do is to guarantee that our normalised objects, and the database itself, conform to the performance model.

When I say “normalised”, I mean the crawled blogs that will be recast in XML. As I’ve suggested previously, XML is already known to be a robust preservation format. We anticipate that all the non-XML content is going to be images, stylesheets, multi-media, stylesheets, and attachments. Preservation strategies for this type of content are already well understood in the digital preservation world, and we can adapt them.

There is already a strand of the project that is concerned with migration of the database, to ensure future access and replay on applications and platforms of the future. This in itself could feasibly form the basis of the long-term preservation strategy.

The preservation promise in our case should not guarantee to recreate the live web, rather to recreate the contents of the BF repository, and to replicate the behaviour of the BF database. After all that is the real value of what the project is offering: searchability, retrievability, and creating structure (parsed XML files) where there is little or no structure (the live blogosphere).

Likewise it’s important that the original order and arrangement of the blogs be supported. I would anticipate that this will be one of the possible views of the harvested content. If it’s possible for an Invenio database query to “rebuild” a blog in its original order, that would be a test of whether preservation has succeeded.

As to PREMIS metadata: in this alternative scenario the live data in the database and the preserved data are one and the same thing. In theory, we should be able to manipulate the database to devise a PREMIS “view” of the data, with any additional fields needed to record our preservation actions on the files.

In short, I wonder whether the project is really doing “web archiving” at all? And does it matter if we aren’t?

In summary I would suggest:

  • We consider the target of preservation to be crawled blogs which have been transformed into parsed XML (I anticipate that this would not invalidate the data model).
  • We regard the spidering action as a form of “normalisation” which is an important step to transforming unmanaged blog content into a preservable package.
  • Following the performance model proposed by National Archives of Australia, we declare the performance we wish to replicate is that of normalised files in the Invenio database, rather than the behaviours of individual blogs. This approach potentially makes it simpler to define “significant properties”; instead of trying to define the significant properties of millions of blogs and their objects, we could concentrate on the significant properties of our normalised files, and of Invenio.

“Trends in Blog Preservation” keynote speech at ICEIS2012

May 9, 2012 in Blog

The paper: “Trends in Blog Preservation” will be presented at a keynote speech at the the 14th International Conference on Enterprise Information Systems, to be held on 28 June-1 July 2012 in Craiova, Romania.

Authors: Vangelis Banos, Nikos Baltas, Yannis Manolopoulos

Abstract: Blogging is yet another popular and prominent application in the era of Web 2.0. According to recent measurements often considered as conservative, as of now worldwide there are more than 152 million blogs with content spanning over every aspect of life and science, necessitating long term blog preservation and knowledge management. In this talk, we will present a range of issues that arise when facing the task of blog preservation. We argue that current web archiving solutions are not able to capture the dynamic and continuously evolving nature of blogs, their network and social structure as well as the exchange of concepts and ideas that they foster. Furthermore, we provide directions and objectives that could be reached to realize robust digital preservation, management and dissemination facilities for blogs. Finally, we will introduce the BlogForever EC funded project, its main motivation and findings towards widening the scope of blog preservation.