March 2009: Fullscreen version added.
July 2009: Turns out I missed a step, sometimes.

How to convert any website to a single PDF

 

Here is a simple three step process to convert any site to PDF. Good for traveling or for when you lack internet access. In my case, the original inspiration for this came from the Pyglet API docs. Pyglet offers an introductory PDF but limits the full API to HTML. Their server used to be really spotty and would disappear for days at a time. Trying to use their server was less fun than learning a new CLI trick.

Kyle (2009-03-20-17-09-00)

The Pyglet homepage makes mention of a new fast and reliable host. Hooray!



 

I presume you're running some Linux or BSD and have already installed wget and htmldoc (plus enca, if you need step one dot five).

Step one: wget the site

wget -nd -mk http://example.com

-nd flattens any directory structure
-m mirrors site
-k converts all internet links to local filesystem links
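If the short flags are hard to remember, the same call can be wrapped in a small function that uses wget's long option names. This is only a sketch: `mirror_site` is a made-up helper name, and the extra `--directory-prefix` option (not part of the command above) is my assumption that you want the flattened files collected in one folder rather than in the current directory.

```shell
# Hypothetical helper, not part of wget: mirror_site URL DIR
# Downloads URL into DIR using the long forms of -nd, -m and -k.
mirror_site() {
    url=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # --no-directories   = -nd (flatten the directory structure)
    # --mirror           = -m  (mirror the site)
    # --convert-links    = -k  (rewrite links to point at local files)
    # --directory-prefix = -P  (assumption: save everything under DIR)
    wget --no-directories --mirror --convert-links \
         --directory-prefix="$dir" "$url"
}
```

Calling `mirror_site http://example.com example_path` would leave the pages in `example_path`, ready for the later steps.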

### Step one dot five: convert to non-UTF-8

enca -L none -x ISO-8859-1 *.html

> Turns out htmldoc 1.8 does not support UTF-8. Version 1.9 does, but otherwise use this command to clean things up.
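Since the conversion only matters for htmldoc older than 1.9, a script could decide whether to run it. The sketch below is a guess at how that check might look: `needs_latin1` is a hypothetical helper, and it assumes the version string has the plain form MAJOR.MINOR or MAJOR.MINOR.PATCH (for example `1.8.27`).

```shell
# Hypothetical helper: needs_latin1 VERSION
# Succeeds (exit status 0) when VERSION is older than 1.9, i.e. when the
# HTML should be converted away from UTF-8 first.
# Assumes VERSION looks like MAJOR.MINOR or MAJOR.MINOR.PATCH.
needs_latin1() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text up to the next dot, if any
    [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 9 ]; }
}
```

One way to use it, assuming your htmldoc prints a bare version number for `htmldoc --version`, is `needs_latin1 "$(htmldoc --version)" && enca -L none -x ISO-8859-1 *.html`. Keep in mind that the conversion rewrites the HTML files in place, so work on a copy if you want to keep the originals.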

Step two: convert to pdf

htmldoc --webpage -f example.pdf example_path/toc.html example_path/*.html

--webpage glues many html documents together, inserting page breaks between each
-f names the output file

Optionally, to format the document for the computer screen, use --size 12x9in.

The default glob expansion puts the pages in alphabetical order. Explicitly mentioning the table of contents places it on the first page. There will be a second copy of the TOC in the glob, but it won't get in the way.
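The duplicate TOC is harmless, but if you would rather leave it out, the file list can be built by hand. This is a rough sketch, not the only way to do it: `build_pdf` is a hypothetical helper, and the file name `toc.html` in the usage line is just the example name from the command above.

```shell
# Hypothetical helper: build_pdf OUTPUT DIR TOC_NAME
# Puts DIR/TOC_NAME first, then every other *.html file in DIR in the
# shell's default (alphabetical) glob order, skipping the second TOC copy.
build_pdf() {
    out=$1
    dir=$2
    toc=$3
    set -- "$dir/$toc"                # start the page list with the TOC
    for page in "$dir"/*.html; do
        [ -e "$page" ] || continue    # glob matched nothing
        if [ "$page" = "$dir/$toc" ]; then
            continue                  # TOC is already on the list
        fi
        set -- "$@" "$page"
    done
    htmldoc --webpage -f "$out" "$@"
}
```

For example, `build_pdf example.pdf example_path toc.html` runs the same htmldoc command as above, just without the repeated TOC page.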

You'll have little control over the exact formatting and ordering of the sections. Htmldoc usually makes the formatting look pretty good. The page order does not matter too much because all the html links are preserved in the pdf. More importantly, it is electronically searchable.

And finally, here is a complete, fully linked, offline, searchable PDF version of the Pyglet API, in 8.5x11 paper format and formatted for computer screens.

I found out about the UTF-8 thing while building a copy of Pro Git.