BlogForever Project Statement

What: BlogForever is building a new system to harvest, preserve and manage blog content, developing new insights through its restructuring and reuse. To this end, it has explored largely uncharted theoretical and practical aspects of blog preservation: it first researched blog structure and semantics; it is now defining blog preservation policies and developing a robust blog preservation software platform; it will then validate the platform through case studies using real-world data.

Why: The content of blogs, their interconnections and their influence constitute a unique socio-technical artifact of our times. Nevertheless, blogs are ephemeral: past studies suggest that the average lifetime of a webpage is below 100 days. As a result, there is a pressing need to preserve blog content as an essential part of our heritage that can prove valuable for current and future generations.

Current situation: Most existing web preservation methods copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity and follows the object-based method of digital preservation, by which all digital objects in a website (images, attachments, media, stylesheets) are copied and stored in wrapper formats such as WARC. This workflow is not appropriate for the dynamic content of blogs.

BlogForever approach: In contrast, the BlogForever platform performs an intelligent harvesting operation that retrieves and parses hypertext as well as all other associated content (images, linked files, etc.) from blogs. The result is a set of semantic entities carved out of blog content at an unprecedented micro-level: author names, comments, subjects, tags, categories, dates, links and many other elements are expressed within a hierarchical structure. This content is imported into the BlogForever repository (based on CERN’s Invenio platform), a public-facing web archiving mechanism that provides facilities to preserve, view, interrogate and reuse the content in fine detail.
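To make the idea of carving entities out of a blog page more concrete, here is a minimal sketch in Python. It is not BlogForever's harvester or the Invenio data model: the class names in the markup (post-title, post-author, post-date, tag, comment-author, comment-text), the Post and Comment structures and the extract_post helper are all hypothetical, and the parser assumes flat, non-nested markup.

```python
from dataclasses import dataclass, field, asdict
from html.parser import HTMLParser


@dataclass
class Comment:
    author: str
    text: str


@dataclass
class Post:
    # Hypothetical hierarchical record for one blog post.
    title: str = ""
    author: str = ""
    date: str = ""
    tags: list = field(default_factory=list)
    links: list = field(default_factory=list)
    comments: list = field(default_factory=list)


class PostExtractor(HTMLParser):
    """Collects entities from elements marked with hypothetical CSS classes."""

    FIELDS = {"post-title": "title", "post-author": "author", "post-date": "date"}

    def __init__(self):
        super().__init__()
        self.post = Post()
        self._current = None       # class of the element whose text is being read
        self._comment_author = ""  # author seen just before a comment text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.post.links.append(attrs["href"])
        self._current = attrs.get("class")

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current in self.FIELDS:
            setattr(self.post, self.FIELDS[self._current], text)
        elif self._current == "tag":
            self.post.tags.append(text)
        elif self._current == "comment-author":
            self._comment_author = text
        elif self._current == "comment-text":
            self.post.comments.append(Comment(self._comment_author, text))


def extract_post(html):
    """Parse one page of (assumed) blog markup into a Post record."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.post


# Invented sample markup, used only to exercise the sketch.
SAMPLE = """
<article>
  <h1 class="post-title">Preserving blogs</h1>
  <span class="post-author">Alice</span>
  <time class="post-date">2012-05-01</time>
  <span class="tag">archiving</span>
  <span class="tag">web</span>
  <p>See <a href="http://example.org/warc">this note</a>.</p>
  <div class="comment">
    <span class="comment-author">Bob</span>
    <p class="comment-text">Useful summary.</p>
  </div>
</article>
"""

if __name__ == "__main__":
    print(asdict(extract_post(SAMPLE)))
```

Unlike a page-level copy, the resulting record can be queried by author, tag or date. A real harvester would need to cope with the varied and nested markup that different blog platforms produce.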

For whom: This new system can be used by memory institutions (libraries, archives, museums, clearinghouses, electronic databases and data archives), researchers and universities, as well as communities of bloggers. For example, a repository of pharmaceutical research blogs would preserve knowledge and secure access to insights that would otherwise be lost to future generations.

Usage can be approached in two different ways:

  • BlogForever as a SaaS/Cloud service: A flexible online service for creating focused repositories/collections of blog archives, and an online content curation tool with an automated preservation workflow and multiple value-added options.
  • BlogForever as software: A downloadable solution that can be installed by organizations that need to create their own blog preservation workflow by deploying their own blog archives. The project is investigating both open source and proprietary offering models.