Thoughts about blog data and metadata
April 25, 2011 in Blog
During the ArchivePress project at ULCC, we briefly considered the data and metadata generally made available with blogs and blog posts. As ArchivePress focused on the representations of blogs in newsfeeds, we examined the metadata that three of the most common blog platforms (WordPress, Blogger and Typepad) generate in common and expose in their newsfeeds. Blogger and Typepad prefer the Atom newsfeed format; WordPress (particularly WordPress.com) prefers RSS (though it can be made to publish Atom feeds too). This analysis was done about a year ago, so things may have changed, but here is a summary of what we found.
For each Blog, the following core information is available in the feeds:
Field | WordPress (RSS) | Blogger (Atom) | Typepad (Atom) |
---|---|---|---|
Feed Unique ID | NA | feed/id | feed/id |
Blog URL | rss/channel/link | feed/link@rel="alternate" | feed/link@rel="alternate" |
Blog Title | rss/channel/title | feed/title | feed/title |
Blog Description | rss/channel/description | feed/subtitle | feed/subtitle |
Date of last update | rss/channel/lastBuildDate | feed/updated | feed/updated |
Generating software | rss/channel/generator | feed/generator | feed/generator |
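To make the mapping concrete, here is a minimal sketch in Python (standard library only) that reads the blog-level fields from either kind of feed. The function name `blog_metadata`, the dictionary keys and the sample feed are illustrative assumptions, not part of ArchivePress; real feeds may need more defensive handling.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 namespace, written in ElementTree's {namespace}tag notation
ATOM = "{http://www.w3.org/2005/Atom}"


def blog_metadata(feed_xml):
    """Return the blog-level fields from the table above as a dict (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    if root.tag == "rss":
        channel = root.find("channel")
        return {
            "id": None,  # the table lists no feed-level unique ID for WordPress RSS
            "url": channel.findtext("link"),
            "title": channel.findtext("title"),
            "description": channel.findtext("description"),
            "updated": channel.findtext("lastBuildDate"),
            "generator": channel.findtext("generator"),
        }
    if root.tag == ATOM + "feed":
        # The blog URL is the href of the link whose rel is "alternate"
        alternate = root.find(ATOM + "link[@rel='alternate']")
        return {
            "id": root.findtext(ATOM + "id"),
            "url": alternate.get("href") if alternate is not None else None,
            "title": root.findtext(ATOM + "title"),
            "description": root.findtext(ATOM + "subtitle"),
            "updated": root.findtext(ATOM + "updated"),
            "generator": root.findtext(ATOM + "generator"),
        }
    raise ValueError("not an RSS 2.0 or Atom feed: " + root.tag)


# Made-up sample data, for illustration only
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <link>http://example.org/</link>
  <description>Just another blog</description>
  <lastBuildDate>Mon, 25 Apr 2011 10:00:00 +0000</lastBuildDate>
  <generator>http://wordpress.org/?v=3.1</generator>
</channel></rss>"""

print(blog_metadata(SAMPLE_RSS)["title"])  # Example blog
```

Note that the dates come back as strings in each format's own convention (RFC 822 style for RSS, ISO 8601 style for Atom), so a harvester would still need to normalise them.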
For each Post, we established that the following core information is available in the newsfeeds:
Field | WordPress (RSS) | Blogger (Atom) | Typepad (Atom) |
---|---|---|---|
Post Unique ID | rss/channel/item/guid@isPermaLink | feed/entry/id | feed/entry/id |
Post Title | rss/channel/item/title | feed/entry/title | feed/entry/title |
Post Summary | rss/channel/item/description | NA | feed/entry/summary |
Post URL | rss/channel/item/link | feed/entry/link@rel="alternate" | feed/entry/link@rel="alternate" |
Date of publication | rss/channel/item/pubDate | feed/entry/published | feed/entry/published |
Date of last update | NA | feed/entry/updated | feed/entry/updated |
Post Author | rss/channel/item/dc:creator (rss/xmlns:dc) | feed/entry/author/name | feed/entry/author/name |
Post Category | rss/channel/item/category | feed/entry/category@term | feed/entry/category@term |
Post Content | rss/channel/item/content:encoded (rss/xmlns:content) | feed/entry/content | feed/entry/content |
Post Comments | rss/channel/item/comments | feed/entry/link@rel="replies" | feed/entry/link@rel="replies" |
Post Comments Feed | rss/channel/item/wfw:commentRss | NA | NA |
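The post-level mapping can be sketched in the same way. The Python below (standard library only) turns each RSS item or Atom entry into one common record; the function name `posts` and the record keys are hypothetical, chosen for illustration. The `dc` and `content` prefixes must be bound to their namespace URIs, which is what the `rss/xmlns:...` entries in the table refer to.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used to resolve the prefixed paths below
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def posts(feed_xml):
    """Yield one dict per post, using the paths from the table above (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    if root.tag == "rss":
        for item in root.iterfind("channel/item"):
            yield {
                "id": item.findtext("guid"),
                "title": item.findtext("title"),
                "summary": item.findtext("description"),
                "url": item.findtext("link"),
                "published": item.findtext("pubDate"),
                "updated": None,  # NA for WordPress RSS in the table
                "author": item.findtext("dc:creator", namespaces=NS),
                "categories": [c.text for c in item.findall("category")],
                "content": item.findtext("content:encoded", namespaces=NS),
            }
    else:  # assumed to be an Atom feed
        for entry in root.iterfind("atom:entry", NS):
            alternate = entry.find("atom:link[@rel='alternate']", NS)
            yield {
                "id": entry.findtext("atom:id", namespaces=NS),
                "title": entry.findtext("atom:title", namespaces=NS),
                # None when the feed carries no summary (NA for Blogger in the table)
                "summary": entry.findtext("atom:summary", namespaces=NS),
                "url": alternate.get("href") if alternate is not None else None,
                "published": entry.findtext("atom:published", namespaces=NS),
                "updated": entry.findtext("atom:updated", namespaces=NS),
                "author": entry.findtext("atom:author/atom:name", namespaces=NS),
                "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
                # Only the text is read; content of type "xhtml" would need serialising instead
                "content": entry.findtext("atom:content", namespaces=NS),
            }
```

Calling `list(posts(feed_xml))` on a feed document gives records with the same keys whichever platform produced the feed, with `None` standing in for the NA cells.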
One interesting point we noted was that neither Blogger nor Typepad published a link to a comments feed for each post. This made our work on ArchivePress more difficult, since it was predicated on being able to easily identify the comments feed for each post and harvest new comments as they were published. For blogs not generated by WordPress, this was clearly not going to be so easy. (Our ace developer Emanuele found some workarounds, but that’s another story.)
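As a rough illustration of the problem, the sketch below lists each post's URL next to the comments feed it advertises, if any. It is not the ArchivePress code, and it does not attempt the workarounds mentioned above; the function name `comment_feeds` is hypothetical. It assumes, as in our tables, that only WordPress RSS carries a per-post `wfw:commentRss` element.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
WFW = "{http://wellformedweb.org/CommentAPI/}"  # namespace of wfw:commentRss


def comment_feeds(feed_xml):
    """Map each post URL to its advertised comments feed URL, or None (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    found = {}
    if root.tag == "rss":
        for item in root.iterfind("channel/item"):
            # findtext returns None when the element is missing
            found[item.findtext("link")] = item.findtext(WFW + "commentRss")
    else:  # assumed to be an Atom feed
        for entry in root.iterfind(ATOM + "entry"):
            alternate = entry.find(ATOM + "link[@rel='alternate']")
            url = alternate.get("href") if alternate is not None else None
            # Assumption from the tables: no per-post comments feed is advertised here
            found[url] = None
    return found
```

Posts that map to `None` are the ones a harvester cannot follow directly and would have to handle some other way.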
I think this offers us an interesting overview of the core of standard, structured blog data and metadata in three of the leading blog platforms. This is the data structure and metadata profile that is maintained in blog databases, in one of its native forms, and I’d expect it to be present in all blog platforms, since it arguably represents the essence of blogs. I hope this will be useful background when considering the core models for data and metadata handling that will be developed for BlogForever.