Rendering and storing web pages using wkhtmltopdf
8:53 am in Blog by Nikos Kasioumis
Wkhtmltopdf is a simple command-line utility that converts any given web page (HTML) to an image (JPG, PNG, etc.) or a document (PDF). It renders pages with the WebKit layout engine through its Qt port, QtWebKit.
WebKit is an open source, state-of-the-art layout engine that powers Google Chrome and Apple Safari, two of the major web browsers. It is a very active project and one of the most consistent engines for web page rendering, closely following the latest standards.
Wkhtmltopdf is open source, written in C++ and distributed under the GNU Lesser General Public License. PHP and Python bindings are already available, making it easy to call the utility from other platforms as well. On an up-to-date system, wkhtmltopdf is a trustworthy solution for storing snapshots of web pages on demand.
Wkhtmltopdf’s output formats include image formats such as JPEG and PNG, as well as the PDF document format. PDF is an open standard for document exchange, independent of software, hardware and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it.
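As a rough illustration of the conversion described above, here is a minimal Python sketch that shells out to the command-line tools rather than using the bindings. It assumes the binaries are installed and on the PATH, and that images are produced by the companion wkhtmltoimage binary that ships with the project. The function names build_command and render are hypothetical helpers, not part of wkhtmltopdf.

```python
import shutil
import subprocess
from pathlib import Path

# Output suffixes that this sketch routes to the image renderer.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_command(url, output):
    """Return the argument list for rendering `url` into `output`.

    The binary is picked from the output suffix: PDF goes to
    wkhtmltopdf, JPEG/PNG go to wkhtmltoimage (assumed installed).
    """
    suffix = Path(output).suffix.lower()
    if suffix == ".pdf":
        binary = "wkhtmltopdf"
    elif suffix in IMAGE_SUFFIXES:
        binary = "wkhtmltoimage"
    else:
        raise ValueError(f"unsupported output format: {suffix!r}")
    # Basic invocation form: <binary> <input URL> <output file>
    return [binary, url, str(output)]


def render(url, output, timeout=120):
    """Run the renderer and return the output path.

    Raises FileNotFoundError if the binary is missing and
    subprocess.CalledProcessError if rendering fails.
    """
    command = build_command(url, output)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    subprocess.run(command, check=True, timeout=timeout)
    return Path(output)
```

Calling render("http://example.com/", "example.pdf") would then store a PDF snapshot of that page in the current directory. The URL and file name are placeholders.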


Hi Nikos. I suppose the question is: why would we want to do this? We must obviously start with storing and managing the original Web files. Is it your thinking that this would render thumbnails (not necessarily small) as a visual reference for the look-and-feel of a site?
It could be an answer to “How do I correctly render the web page (HTML) of a blog I captured 10 years ago with current browsers and rendering technologies?”. It produces a high-quality snapshot of the layout of a blog post and could work alongside the actual captured web files.
Hi Nikos. Just a question related to your reply to Richard. Converting HTML to PDF may enable us to retain some information about the webpage without relying on “current browsers and rendering technologies”, but how does it circumvent our reliance on changes in PDF viewers? Further, note that a blog is not a webpage but a collection of web pages. Does wkhtmltopdf retain the structural relation between the pages in a website as PDF pages?
Dear Yunhyong,
I copy this from Wikipedia:
This file format created by Adobe Systems in 1993 is used for representing documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it.
It seems that using PDF to preserve the layout of the page is so much better than using HTML + web browser, as the latter solution is heavily dependent on software, hardware and operating system.
I know that there are many different versions of PDF (1.7 is the latest, I think, and 2.0 is under development), but I also know of many software tools to migrate files from one version to another.
What is more, there are countless libraries and tools to view and manage PDF files.
Also, regarding the structural relation between blog pages: this is easily done, as blogs have a fixed structure (a list of posts in reverse chronological order). We will do this anyway, as we will aggregate and analyse blog content. I think that this has nothing to do with preserving page rendering.
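On the point about PDF versions: a PDF file declares its version in a header of the form %PDF-1.7 at the start of the file, so a preservation workflow could record it before deciding whether migration is needed. The sketch below is a minimal illustration under that assumption. The function name pdf_version is hypothetical, and the sketch ignores the fact that a later Version entry inside the document can override the header.

```python
import re

# The header looks like "%PDF-1.7"; search only the first 1024 bytes.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data):
    """Return (major, minor) from the PDF header bytes, or None if absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

In practice the bytes would come from reading the beginning of a stored snapshot in binary mode.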
Hi Yunhyong,
I agree with Vangelis and I believe his answer covers your question, so just adding a couple of points here:
A PDF file is as likely to be viewed correctly in the future as any other file format out there. By that I mean that there is no guarantee that any solution we come up with now will last forever. That’s where the PDF format comes in as a really strong candidate. It’s open, it’s widely accepted and used, and it is after all a file format that (quoting Wikipedia) “encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it”. Which is pretty much the function I recommended this tool for anyway.
Now, as far as a structural relation between the various web pages of a blog is concerned, the answer is no, wkhtmltopdf does not do that. Nor is it meant to do that or take care of it in any way. Wkhtmltopdf only renders web pages and produces the output as PDF, JPG, PNG, etc. This information can easily be collected by other means, though (Vangelis’ comment covers me completely on this one), and accompany the rendered and stored PDFs.
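One way the structural information could accompany the rendered PDFs is a small sidecar manifest that lists the posts in reverse chronological order and links each one to its neighbours. This is only a sketch: the field names, the build_manifest helper and the example URLs are all assumptions made for illustration, not an existing format.

```python
import json
from datetime import date


def build_manifest(blog_url, posts):
    """Describe how rendered PDFs relate to each other within one blog.

    `posts` is an iterable of (published_date, post_url, pdf_file_name).
    Entries are ordered newest first, mirroring a blog's front page.
    """
    ordered = sorted(posts, key=lambda post: post[0], reverse=True)
    entries = []
    for index, (published, url, pdf_name) in enumerate(ordered):
        entries.append({
            "url": url,
            "published": published.isoformat(),
            "pdf": pdf_name,
            # Neighbouring posts, so navigation can be rebuilt later.
            "newer": ordered[index - 1][1] if index > 0 else None,
            "older": ordered[index + 1][1] if index + 1 < len(ordered) else None,
        })
    return {"blog": blog_url, "posts": entries}


def manifest_json(blog_url, posts):
    """Serialise the manifest so it can be stored next to the PDFs."""
    return json.dumps(build_manifest(blog_url, posts), indent=2)
```

The rendering itself stays untouched. The manifest is just a separate file stored next to the snapshots.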
I have tested this solution myself on a dozen websites and I am very satisfied with the resulting PDF files. I agree that the web page layout is preserved using this technology.
Also, the fact that there are Python bindings enables us to integrate it into Invenio.
Please also remember that the topic of archived web page rendering has not been addressed by other digital preservation projects so far. I made contact with all currently active projects at the last EC workshop and they all rely on the current browser to render archived web pages.
I don’t know what our digital preservation experts would think of this technology, it seems promising to me.
I think what Nikos suggested would be very useful. Adding such a feature to our final BlogForever platform would help us to preserve the “full” characteristics of a blog (visual design and layout, along with the blog metadata). I haven’t tested the utility yet, but judging by the example document that Nikos sent, it may do a great job.
Downloaded the binary and gave it a try. It works at least as well as the one we are using at work (http://www.princexml.com/) for our related needs.
I also think Nikos’s suggestion is a great one, since there is great potential in what can be done by implementing such functionality within the project.
For example, we could set up a repeating task capturing a snapshot of each blog participating in the project every month (or week).
This would be an interesting sub-project, especially when capturing PDF as opposed to images, since we could treat those snapshots as “raw content” and thus produce “fancy” reports based on comparison (per blog or globally).
Some of those could be:
+ Design trends (tricky but doable).
+ Content heat-maps.
+ Color heat-maps.
+ Accessibility / Readability reports.
Again, a fine idea/tool.
The challenge is deciding what we can, or rather should, do with it and what NOT to.
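The repeating capture task suggested above could be organised along these lines. This is a minimal sketch that only decides when a blog is due for a new snapshot and where the file would go. The directory layout, the 30-day default and the helper names are assumptions, and the actual rendering call is left out.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def snapshot_path(root, blog_url, when):
    """Return <root>/<blog host>/<ISO date>.pdf for one capture."""
    host = urlsplit(blog_url).hostname or "unknown-host"
    return PurePosixPath(root) / host / f"{when.isoformat()}.pdf"


def is_due(last_capture, today, interval_days=30):
    """True if the blog was never captured or the interval has elapsed."""
    if last_capture is None:
        return True
    return today - last_capture >= timedelta(days=interval_days)
```

A scheduler such as cron could run a script daily that checks is_due for each participating blog and renders only those whose interval has elapsed. Setting interval_days to 7 gives the weekly variant.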
Hi George.
Just some quick questions: is it not better to keep the actual “raw content” than to keep a proxy of the raw content? Can we not do everything you have mentioned (assuming it can be done with PDFs) without converting to PDF if we have the raw website data?
We will keep the actual “raw content” anyway. The problem is how to render it correctly in the future and this is where PDF may be the right solution.