PDF Creation from HTML service

Thursday, June 12th, 02014 at 12:21 UTC

HTML is one of those beautifully simple technologies that has now permeated every aspect of our daily life. The web is made up of HTML pages, and with HTML emails, browsers on TVs and app stores, HTML is just about everywhere.

When people say HTML they tend to mean a special trio of technologies. The HTML itself is really just the marked-up content. There are two more independent pieces: JavaScript, which provides the behavior, and CSS, which is the presentation layer. Each is a technology in its own right, but they get lumped together as HTML or “The Web”.

If we love the web because it is made from HTML, then why are we trying to make PDFs from it?

There are loads of complex layout programs which can export to PDF, but the problem is that they are not scriptable (or not easily scriptable), are not available without a license, or can’t be run from the command line. There are some older Unix tools that create PDF files on the command line, but they either don’t work easily with HTML, or they accept HTML but are unaware of newer versions of CSS.

That still doesn’t explain why we’d want to take a perfectly good webpage, which works fantastically well in a browser across a variety of devices, and solidify it into a single, (usually) inaccessible PDF file. It would seem a step backwards.

This is true. Making PDFs as a final product when you have HTML does seem very strange, but our goal isn’t a final product. The PDF is an intermediate step between the digital world of the web and the analog world of paper.

We needed a way to pull together multiple web-based APIs to build up an HTML page with fancy CSS print styles and render that into a flat PDF ready for printing. Printing straight from a browser introduces more issues and gives you less control. PDFs, on the other hand, already have a massive professional tool chain waiting to accept your files and print them. The PDF is simply the intermediate step.

This is something that Aaron Straup Cope refers to as The PaperNet. We’ve written about the PaperNet before and use it quite often in our daily life. From small booklets to on-demand calendars, we take a lot of HTML (or sometimes SVG) and generate PDFs for printing.

This is a quick tutorial about how you can do this too. It is much easier than you think. The newest cutting-edge tools allow you to use just about every CSS rule, including controlling page breaks, rotating text and embedding fonts, images and SVG. To do this, we use something called PhantomJS. This is a ‘headless WebKit’ instance. It is the same rendering engine found in browsers like Safari or Chrome, but without the browser part. You give it a URL or HTML file, it loads it in a browser that you can’t see and then saves the result as a PNG or PDF. Anything you can do with HTML and CSS in the browser, this tool can replicate and save.
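To make the "give it a URL or HTML file, get a PDF back" idea concrete, here is a minimal Python sketch of how a program might call PhantomJS. It assumes the phantomjs binary is on your PATH and that you have a copy of rasterize.js, the example script that ships with PhantomJS, in the current directory. The function names are made up for illustration, and this is not necessarily how the html2pdf code does it.

```python
import subprocess


def build_phantomjs_command(source, output, paper="A4",
                            phantomjs="phantomjs", script="rasterize.js"):
    """Return the argument list for rendering `source` to `output`.

    `source` may be a URL or a path to a local HTML file. rasterize.js
    picks the output format from the extension of `output` (.pdf or .png);
    the `paper` argument is a paper format such as "A4" or "Letter".
    """
    return [phantomjs, script, source, output, paper]


def render(source, output, paper="A4"):
    """Run PhantomJS; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_phantomjs_command(source, output, paper), check=True)


if __name__ == "__main__":
    # Only print the command, so this runs even without PhantomJS installed.
    print(" ".join(build_phantomjs_command("page.html", "page.pdf")))
```

Keeping the command builder separate from the function that runs it means you can check what would be executed before PhantomJS is installed anywhere.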

We’ve open sourced all of the code that you will need to download to run your own “PDF as a Service” instance to convert HTML into a PDF. Just downloading it won’t do you much good, so what follows is a simple tutorial, hopefully simple enough for anyone to follow the steps and get things running in just a few minutes.

Getting Started

What you will need for this tutorial:

If you are techie enough to want to run this locally, you’ll need a few more things. If you’re not, you can skip these steps completely:

The first thing we’ll need to do is get the code from https://github.com/optional-is/html2pdf so we can begin to install our html2pdf tool. Using GIT is the easiest way to download the files. GIT calls this ‘cloning’: you clone a copy of all the code so that you have it locally.

Since the GIT repository, where all the code is saved together, is open to anyone, you should be able to paste this text into your terminal.app or Command Prompt and it will begin saving the files locally.

git clone https://github.com/optional-is/html2pdf.git

Great, now that we have the code locally, we need to create a new server on Heroku to upload our files to. Since this is probably the first time you’ve used Heroku, we need to log in, so type the following on your command line:

heroku login

If that worked, you can now create a new server by typing heroku apps:create followed by a unique name for your server. I decided to use “html2pdf-optional”. You can change the -optional to your company or name. Remember, we want something unique.

heroku apps:create html2pdf-optional

This will create a new server for you. How easy was that? It also creates something called a GIT remote. This is a named destination for GIT to upload to, much like an FTP server, but using GIT instead.

Now that we have our code locally and a new server, we need to add three settings on Heroku, since we are doing something very special by installing PhantomJS on their server. Copy and paste the following lines one by one. They tell Heroku to install PhantomJS and where to find it so you can call the program.

heroku config:set BUILDPACK_URL=https://github.com/rsussland/heroku-buildpack-python-phantomjs

heroku config:set LD_LIBRARY_PATH="/usr/local/lib:/usr/lib:/lib:/app/vendor/phantomjs/lib"

heroku config:set PATH="/usr/local/bin:/usr/bin:/bin:/app/vendor/phantomjs/bin"
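The PATH setting matters because the service has to be able to find the phantomjs program by name. As a rough illustration of what that lookup involves, the Python sketch below searches a PATH-style string for the executable using the standard library. The function name is hypothetical and this check is not part of the html2pdf code.

```python
import shutil


def find_phantomjs(path=None):
    """Return the full path of the phantomjs executable, or None.

    `path` is a search string in the same format as the PATH variable;
    when omitted, the PATH of the current process is searched.
    """
    return shutil.which("phantomjs", path=path)


if __name__ == "__main__":
    # The same value that is given to `heroku config:set PATH=...`.
    heroku_path = "/usr/local/bin:/usr/bin:/bin:/app/vendor/phantomjs/bin"
    print(find_phantomjs(heroku_path) or "phantomjs not found on that PATH")
```

Run on your own machine this will most likely report that nothing was found, because /app/vendor/phantomjs/bin only exists on the Heroku server once the buildpack has run.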

Finally, we need to upload the html2pdf code to that new server. In GIT terms this is called “push”. Paste the following code into the command line to push our code to Heroku:

git push heroku master

This should be everything you need to run an HTML to PDF converter. To view the site, you can type

heroku open

which will launch the browser, or you can visit the domain yourself by typing your-appname.herokuapp.com into your browser.

That’s it. You should now see a text area with some sample HTML in it. If you press the button, it will think for a second and return a PDF.
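If you would rather paste in your own HTML than the sample, the page needs print styles to control where pages break. Here is a small Python sketch that builds such a page from a list of titles and paragraphs. The function name, the markup and the CSS values are illustrative assumptions rather than anything the service requires.

```python
from html import escape

# Print styles: start a new page after every section except the last one.
PRINT_CSS = """
section { page-break-after: always; }
section:last-child { page-break-after: auto; }
h1 { font-family: Georgia, serif; }
"""


def build_document(pages):
    """Return an HTML document with one <section> per (title, body) pair."""
    sections = "\n".join(
        "<section><h1>{}</h1><p>{}</p></section>".format(escape(t), escape(b))
        for t, b in pages
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<style>{}</style></head>\n<body>\n{}\n</body></html>"
    ).format(PRINT_CSS, sections)


if __name__ == "__main__":
    print(build_document([("Cover", "A small booklet."), ("Page two", "More text.")]))
```

The printed HTML can be pasted straight into the text area. Each section should land on its own page of the PDF, though it is worth checking the result, since support for individual print rules varies between rendering engines.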

Gotchas

There are a few things to be aware of when using Heroku. Since you aren’t paying, the server will go to sleep after a few minutes of inactivity. This just means that if you come back tomorrow, the first time you visit the site it will take a few more seconds to “wake up” before showing you the webpage. If this is really annoying, you can pay Heroku for a second dyno to keep things awake, or move the code to a different service.

Heroku also tries to be a good web citizen to your customers: any page load that takes too long is dropped automatically so that others aren’t stuck waiting. This is in case something gets itself into an infinite loop and would never stop. What this does mean is that VERY large PDFs of 100+ pages will take longer to create than Heroku will wait for, and you’ll never get your results. The only way around this is to move the code to a server that you have more control over. For small, practical uses, these shouldn’t be a problem, but if you are trying to convert that HTML version of War and Peace into a PDF, it probably won’t work!
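If you do move the code to your own server, one way to avoid waiting forever on a runaway render is to give the rendering command a deadline of your own. The Python sketch below shows the idea with the standard library. The function name is hypothetical, and the 25-second figure in the usage note is an assumption chosen to sit under Heroku's request timeout, which its documentation gives as 30 seconds; check the current documentation before relying on either number.

```python
import subprocess
import sys


def run_with_deadline(command, seconds):
    """Run `command` and report whether it succeeded within `seconds`.

    Returns True if the command exited with status 0 before the deadline.
    Returns False if it failed, or if it ran too long (in which case
    subprocess.run kills it before raising TimeoutExpired).
    """
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in for a slow render: a child process that sleeps for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(slow, 1))
```

In a real service the command would be the PhantomJS call, with a deadline of something like 25 seconds, so that the user gets a clear "too big" message rather than a dropped connection.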