Prototype app to extract microformats from blog posts

4:47 pm in Blog by Vangelis Banos

BlogForever aims to aggregate, preserve, manage and disseminate blogs. One way to achieve this goal is to extract pieces of information that current web archiving methods and solutions omit. Microformats are one of the emerging standards used to enrich web content with extra information, and BlogForever aims to harness the information encoded in them to improve all aspects of blog archives.

About Microformats

Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages, so that the information in them can be extracted by software and indexed, searched for, saved, cross-referenced or combined.

More technically, they are items of semantic markup, using just standard “plain old semantic (X)HTML” (i.e. “POSH”) with a set of common class-names and “rel” values. They are open and available, freely, for anyone to use.

Source: http://microformats.org/wiki/introduction
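
To make the idea of “rel” values concrete, here is a minimal Java sketch of how software might read them from a page. It is only an illustration and not part of the prototype: the class name RelValueScanner, the method relLinks and the sample markup are hypothetical, and it assumes the input is well-formed XHTML so that the standard XML parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of the prototype).
class RelValueScanner {

    // Returns one "href -> token" entry for every rel token found on an <a> element.
    // Assumes the input is well-formed XHTML; real-world HTML often is not.
    static List<String> relLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                String rel = a.getAttribute("rel").trim(); // empty string when absent
                if (rel.isEmpty()) {
                    continue;
                }
                // A rel attribute may carry several space-separated values.
                for (String token : rel.split("\\s+")) {
                    result.add(a.getAttribute("href") + " -> " + token);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XHTML input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup: one link with XFN-style rel values, one without.
        String sample = "<div>"
                + "<a href=\"http://example.org/alice\" rel=\"friend met\">Alice</a>"
                + "<a href=\"http://example.org/about\">About</a>"
                + "</div>";
        for (String link : relLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Only links that carry a rel attribute are reported, so ordinary navigation links are skipped.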

Microformats extractor prototype application

During our work on the blog data model, AUTH developed a prototype Java application to extract Microformats from blog posts. The application uses the Any23 library. Usage:

  1. You must have Sun Java installed on your computer
  2. Download source code + binary
  3. Unzip
  4. Execute: ./Extractor URL -o outputfile.xml

Sample output for a random wordpress.com post: http://rachaeljames.wordpress.com/2011/06/27/the-tree-of-life/

  • RDF XML output file: rachaeljames.xml.zip
  • Examining the output XML file, one can find interesting elements such as the following, which uses the XFN microformat:

<rdf:Description rdf:nodeID="node163o96nfgx12">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  <mePage xmlns="http://vocab.sindice.com/xfn#" rdf:resource="http://rikkicheri.blogspot.com/"/>
</rdf:Description>
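
An output file like this can be inspected with a few lines of standard Java. The sketch below is an illustration and not part of the prototype: the class XfnLinkFinder and the method xfnLinks are hypothetical names, and it assumes the element above sits inside the usual rdf:RDF root element that declares the rdf prefix (the excerpt does not show the root). It lists every element in the XFN namespace together with the resource it points to.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of the prototype).
class XfnLinkFinder {

    // Namespace URIs: XFN as it appears in the sample output, plus the standard RDF namespace.
    static final String XFN_NS = "http://vocab.sindice.com/xfn#";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns "property target" for each XFN element that has an rdf:resource attribute.
    static List<String> xfnLinks(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            // "*" matches any local name within the XFN namespace.
            NodeList nodes = doc.getElementsByTagNameNS(XFN_NS, "*");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String target = e.getAttributeNS(RDF_NS, "resource");
                if (!target.isEmpty()) {
                    result.add(e.getLocalName() + " " + target);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the RDF/XML input", e);
        }
    }

    public static void main(String[] args) {
        // The element from the sample output, wrapped in an assumed rdf:RDF root.
        String sample = "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:nodeID=\"node163o96nfgx12\">"
                + "<rdf:type rdf:resource=\"http://xmlns.com/foaf/0.1/Person\"/>"
                + "<mePage xmlns=\"" + XFN_NS + "\" rdf:resource=\"http://rikkicheri.blogspot.com/\"/>"
                + "</rdf:Description>"
                + "</rdf:RDF>";
        for (String link : xfnLinks(sample)) {
            System.out.println(link); // prints: mePage http://rikkicheri.blogspot.com/
        }
    }
}
```

Matching on the namespace URI rather than on the element name means other XFN properties, if present, would be listed too, without depending on which prefix the file uses.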

Results from this prototype will be useful for defining the Blog data model as well as for developing optimal blog content aggregation techniques.

The Microformats extractor was developed by Manuel Schinas, a student at AUTH.

Future Work

Microformats are not the only way to enrich web page content. Recently, Microsoft, Google and Yahoo agreed to support a common format called Microdata and created schema.org as a shared vocabulary for it. Since blogs generally use these kinds of technologies, BlogForever will monitor the exciting new developments in this area in order to create the best possible solution for blog preservation, management and exploitation.