July 28, 2009: Turns out I missed several steps.
Here is a simple three-step process to convert any site to pdf. Good for traveling or lack of internet. In my case, the original inspiration for this came from the Pyglet API docs. Pyglet offers an introductory pdf but limits the full API to html. Their server used to be really spotty and would disappear for days at a time. Trying to use their server was less fun than learning a new CLI trick.
### Step one: wget the site
wget -nd -mk http://example.com
-nd flattens any directory structure
-m mirrors the site (recursive download)
-k converts all internet links to local filesystem links
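Here is a slightly expanded sketch of the same download, assuming you want the files in their own directory instead of the current one. The function name `mirror_site` and the `example_path` directory are placeholders, not part of the original command. The long options are the spelled-out forms of -nd, -m and -k, plus --directory-prefix (the long form of -P).

```sh
# Sketch: mirror a site into one flat directory.
# mirror_site is a hypothetical helper name.
mirror_site() {
    # $1 = site URL, $2 = directory to download into
    wget --no-directories --mirror --convert-links \
         --directory-prefix="$2" "$1"
}

# Usage:
#   mirror_site http://example.com example_path
```

Keeping everything under one directory makes the later globs a bit safer, since they only pick up files from this site.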
### Step two: make htmldoc happy
enca -L none -x ISO-8859-1 *.html
mogrify -background white -flatten *.png
Turns out htmldoc 1.8 does not support UTF-8. Version 1.9 does, but otherwise use enca to get around it. Furthermore, htmldoc does not play well with transparent PNGs. The mogrify command will remove the alpha and replace it with a white background.
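The two cleanup commands can be wrapped up roughly like this. It is only a sketch: the function name `prepare_for_htmldoc` is made up, and it assumes all the downloaded files sit in a single directory. Both tools rewrite the files in place, so keep a copy if you want the originals.

```sh
# Sketch: prepare downloaded files for htmldoc.
# prepare_for_htmldoc is a hypothetical helper name.
prepare_for_htmldoc() {
    prep_dir="$1"

    # Re-encode each HTML file to ISO-8859-1, in place.
    # Only needed when htmldoc lacks UTF-8 support (1.8).
    for prep_file in "$prep_dir"/*.html; do
        [ -e "$prep_file" ] || continue   # glob matched nothing
        enca -L none -x ISO-8859-1 "$prep_file"
    done

    # Flatten transparent PNGs onto a white background, in place.
    for prep_file in "$prep_dir"/*.png; do
        [ -e "$prep_file" ] || continue   # glob matched nothing
        mogrify -background white -flatten "$prep_file"
    done
}

# Usage:
#   prepare_for_htmldoc example_path
```

The existence checks are there so a directory with no PNGs (or no HTML) does not pass a literal `*.png` to the tools.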
### Step three: convert to pdf
htmldoc --webpage -f example.pdf example_path/toc.html example_path/*.html
--webpage glues many html documents together, inserting page breaks between each
-f names the output file
Optionally, to format the document for the computer screen, use
The default glob expansion puts the pages in alphabetical order. Explicitly mentioning the table of contents places it on the first page. There will be a second copy of the TOC in the glob, but it won't get in the way.
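If the duplicate TOC bothers you, one way around it is to build the file list by hand. This is a minimal sketch, assuming the table of contents is called toc.html as in the command above; `list_pages` is a made-up helper name.

```sh
# Sketch: print toc.html first, then every other .html file
# in glob (alphabetical) order, skipping the second TOC copy.
# list_pages is a hypothetical helper name.
list_pages() {
    pages_dir="$1"
    printf '%s\n' "$pages_dir/toc.html"
    for pages_file in "$pages_dir"/*.html; do
        [ -e "$pages_file" ] || continue                        # glob matched nothing
        [ "$pages_file" = "$pages_dir/toc.html" ] && continue   # already listed
        printf '%s\n' "$pages_file"
    done
}

# Usage (relies on word splitting, so it assumes no spaces in file names):
#   htmldoc --webpage -f example.pdf $(list_pages example_path)
```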
You'll have little control over the exact formatting and ordering of the sections. Htmldoc usually makes the formatting look pretty good. The page order does not matter too much because all the html links are preserved in the pdf. More importantly, it is electronically searchable.
I found out about the UTF-8 and PNG thing while building a copy of Pro Git.