BlogForever, a collaborative project funded by the European Commission, developed a new system to harvest, preserve, manage and reuse blog content. The system performs an intelligent harvesting operation that retrieves and parses hypertext together with all associated content (images, linked files, etc.) from blogs. It captures content not only by interrogating a blog’s RSS feed but also by copying data from the original HTML. The parser then renders the captured content into structured data, expressed in XML, in accordance with the project’s data model.
As a result, semantic entities are carved out of blog content at an unprecedented micro-level: author names, comments, subjects, tags, categories, dates, links, and many other elements are expressed within a hierarchical structure. This content is imported into the BlogForever repository (based on CERN’s Invenio platform), a public-facing web archiving mechanism that provides facilities to preserve, view, interrogate and reuse the content in fine detail.
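The harvesting and parsing steps described above can be illustrated with a minimal sketch. It reads a small inline RSS sample and re-expresses each item as a hierarchical XML record. This is an illustration only, not the BlogForever implementation: the function name `rss_to_records`, the sample feed, and the output element names (`blog`, `post`, `tags`, `tag`) are assumptions and do not reflect the project's actual data model. The real system also draws on the original HTML, which this sketch omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real harvester would fetch this over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/first-post</link>
      <author>Jane Doe</author>
      <pubDate>Tue, 01 Mar 2011 10:00:00 GMT</pubDate>
      <category>preservation</category>
      <category>blogs</category>
    </item>
  </channel>
</rss>"""


def rss_to_records(rss_text):
    """Convert RSS 2.0 text into a hierarchical <blog> element.

    The output element names are illustrative assumptions,
    not the BlogForever data model.
    """
    channel = ET.fromstring(rss_text).find("channel")
    blog = ET.Element("blog", {"title": channel.findtext("title", default="")})
    for item in channel.findall("item"):
        post = ET.SubElement(blog, "post")
        # Copy simple fields one-to-one from the feed item.
        for field in ("title", "link", "author", "pubDate"):
            ET.SubElement(post, field).text = item.findtext(field, default="")
        # Group the repeated <category> elements under a single <tags> node.
        tags = ET.SubElement(post, "tags")
        for category in item.findall("category"):
            ET.SubElement(tags, "tag").text = category.text
    return blog


blog = rss_to_records(SAMPLE_RSS)
print(ET.tostring(blog, encoding="unicode"))
```

Grouping repeated feed elements under a parent node, as done for the tags here, is one simple way to obtain the kind of hierarchical structure the text describes.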
BlogForever has delivered theoretical and applied research results that advance the state of the art in the study of weblogs and in the methods and policies necessary to preserve them.
Twelve partners from six European countries participated in BlogForever, bringing to the table multidisciplinary skills and ongoing work in weblogs, digital preservation, web archiving, semantics, analytics and software engineering. The project was coordinated by the Department of Informatics of the Aristotle University of Thessaloniki; it began on the 1st of March 2011 and concluded on the 31st of August 2013.