Pandoc for Haskell Hackers

John MacFarlane

BayHac 2014

Why am I here?

I created a virus…

…that spreads GHC

          Debian popcon   Hackage
pandoc    1997            26220
darcs     1908            3420
xmonad    1733            6432

How did it happen?

[Images: TRS-80, KIM-1]

I grew up programming video games in BASIC on a TRS-80 and 6502 assembly on a KIM-1 that my grandfather bought (he must have been among the first microcomputer hobbyists). In college I worked summers as a programmer, writing scientific software in Pascal on VAX workstations with whopping 20 MB hard drives.

I got out of programming when GUIs took over, but started to get interested again in 2004 after installing Linux. I messed around with Python and Lisp and Ruby, and wrote a website for my department. Then in 2005, an Australian philosopher/logician colleague of mine, Greg Restall, mentioned that his favorite programming language was Haskell. So I checked it out, and found that I liked it too.

I started messing around with Parsec, the parser combinator library. I asked myself, “what would be fun to try to parse?” I’d gotten interested in lightweight markup languages and was using reStructuredText for lecture notes and handouts. And I’d just seen this “markdown” thing that John Gruber introduced. So I said to myself, “Self, let’s try writing a markdown parser.” That seemed like a good challenge, because markdown is about as parser-unfriendly a language as you can get. An asterisk, for example, can be either an open or close tag for emphasis, or part of an open or close tag for strong emphasis, or just a literal asterisk, and which it is often depends on what comes after. At the time, the only markdown parsers around were just big regex transformations. Parsec worked well for this. Before long I had a working parser. It was considerably faster and more accurate than John Gruber’s Perl script. And it was much easier to maintain and extend.

One thing led to another. I wasn’t entirely happy with the docutils (reStructuredText) tool chain, and I saw some advantages to “making my own tools” for writing. I needed to convert my existing documents in reStructuredText to markdown, so I added a reStructuredText parser and a markdown writer. I needed output in LaTeX as well as HTML, so I added a LaTeX writer. I needed footnotes, inline LaTeX math, and other features, so I extended pandoc’s markdown dialect. The project provided hours of pleasant procrastination.

First release

Libraries begat libraries

highlighting-kate
zip-archive
texmath
pandoc-citeproc (citeproc-hs)
gitit

The command-line tool

A quick demonstration.

git clone https://github.com/jgm/BayHac2014
cd BayHac2014/demo
vim script.txt

Our goal is to learn how to use pandoc as a library, but let’s first have a quick demo of the command-line tool.

Run through script.txt in demo/.

By default pandoc works as a pipe, reading from stdin and writing to stdout. Try it:

pandoc

Hit Ctrl-D (Ctrl-Z on Windows) when you’re finished typing text.

You can use options. This one triggers “smart typography” (quotes, dashes, ellipses).

pandoc --smart

Let’s convert to LaTeX instead of HTML.

pandoc --to latex

Or to MediaWiki:

pandoc --to mediawiki

Let’s convert a LaTeX file to markdown:

pandoc -f latex -t markdown example.tex

For help and information on which options pandoc supports:

pandoc --help

More detail can be found in the pandoc README, or in the man pages:

man pandoc
man pandoc_markdown

The --standalone or -s option creates a standalone document with header, footer, and metadata:

pandoc --standalone --smart -o r.html -t html5 README

Let’s add a table of contents and use some custom CSS:

pandoc --standalone --smart --toc -o r.html -t html5 \
  --css my.css README

Standalone documents are constructed from templates. To see the default template for a format, use -D:

pandoc -D html5 > my.html5

The template language is documented in the Templates section of the pandoc README.

sample1.txt contains some nice structured metadata. This is YAML but with strings interpreted as markdown.

---
title: My demonstration
author:
 - Kurt Gödel
 - Haskell Curry
version:
 - number: 1.0
   date: July 13, 1945
   log:  Initial commit
 - number: 1.1
   date: August 14, 1946
   log:  Added some math
---

The metadata has a nice version history. Let’s edit my.html5 to include this before the </header> tag:

<ul class="versions">
$for(version)$
<li>Version $version.number$ ($version.date$): $version.log$</li>
$endfor$
</ul>

Let’s try our custom template:

pandoc -s -S --template my.html5 -t html5 sample1.txt \
  -o sample1.html --mathjax

We can create a PDF. Pandoc shells out to pdflatex for this.

pandoc sample1.txt -o sample1.pdf

If you want to use xelatex instead, use --latex-engine=xelatex.

We can create a Word document without opening Word:

pandoc sample1.txt -o sample1.docx

Note that the TeX math in the markdown file gets converted to native Word equations.

Or an EPUB:

pandoc sample1.txt -t epub3 -o sample1.epub

Pandoc can process citations using BibTeX bibliographies (or several other formats). Take a look at sample2.txt and sample2.bib.

We tell it to use pandoc-citeproc as a filter:

pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.docx

Try changing the bibliography style. Edit sample2.txt to uncomment this line:

csl: chicago-fullnote-bibliography.csl

Then:

pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.docx

Citations work in all formats supported by pandoc:

pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.org
emacs sample2.org

Source code highlighting is automatic for marked code blocks. It works in HTML, PDF, and docx:

pandoc -s sample3.txt -o sample3.html
pandoc -s sample3.txt -o sample3.pdf
pandoc -s sample3.txt -o sample3.docx

You can change the highlighting style:

pandoc -s sample3.txt -o sample3.html --highlight-style=monochrome

Pandoc has native support for literate Haskell.

paste.lhs is a literate Haskell file with markdown text:

ghci paste.lhs
pandoc paste.lhs -f markdown+lhs -t html -s -o paste.html
pandoc paste.lhs -f markdown+lhs -t latex+lhs -s -o paste.lhs.tex

Pandoc can also convert to and from Haddock markup, though this needs updating in light of recent changes in Haddock’s markup.

pandoc -f markdown -t haddock
pandoc -f haddock -t markdown

Pandoc also supports beamer and several HTML slide show formats. This slide show was written with pandoc:

pandoc slides.txt -o slides.html -t revealjs --css slides.css \
  -S --highlight-style=espresso

A tour of pandoc’s API

Readers and writers

Text.Pandoc

Prelude> :m + Text.Pandoc
Text.Pandoc> let doc = readMarkdown def "*hi*"
Text.Pandoc> doc
Pandoc (Meta {unMeta = fromList []}) [Para [Emph [Str "hi"]]]
Text.Pandoc> writeLaTeX def doc
"\\emph{hi}"
Text.Pandoc> readMarkdown def{readerSmart = True} "dog's"
Pandoc (Meta {unMeta = fromList []}) [Para [Str "dog\8217s"]]

The Pandoc types

Text.Pandoc.Definition

You can use pandoc -t native and pandoc -f native to explore:

% echo "[*link*](/foo)" | pandoc -t native
[Para [Link [Emph [Str "link"]] ("/foo","")]]

Builder

Text.Pandoc.Builder

Repeatedly concatenating lists is slow, since xs ++ ys takes time proportional to the length of xs. So we use special types Inlines and Blocks that wrap Sequences (Data.Sequence) of Inline and Block elements.
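To make the Builder idea concrete, here is a minimal sketch that assumes only the pandoc-types package. The station names are hypothetical placeholders, not data from the JSON file. OverloadedStrings is switched on so that string literals become Inlines regardless of whether the installed pandoc-types stores text as String or Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: assembling a small document with Text.Pandoc.Builder.
import Text.Pandoc.Builder

-- Hypothetical station names, used only for illustration.
stations :: [Inlines]
stations = ["Station A", "Station B"]

-- (<>) appends Inlines to Inlines and Blocks to Blocks;
-- the underlying sequences make appending at the end cheap.
body :: Blocks
body =
  para ("Stations that accept the " <> strong "Voyager" <> " card:")
    <> bulletList (map plain stations)

-- doc wraps the blocks in a Pandoc value; setTitle fills in metadata.
letter :: Pandoc
letter = setTitle "Fueling stations" (doc body)

main :: IO ()
main = do
  print (length (toList body))  -- prints 2: one paragraph, one bullet list
  print letter                  -- the native representation of the document
```

The resulting Pandoc value can be handed to any writer, which is the approach the letter example relies on.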

A simple example

Here’s a JSON data source about CNG fueling stations in the Chicago area: cng_fuel_chicago.json. Boss says: write me a letter in Word listing all the stations that take the Voyager card.

No need to open Word for this job! See fuel.hs.

Transforming a Pandoc document

Text.Pandoc.Generic

Text.Pandoc.Walk

Example: `walk`

module AllCaps (allCaps) where
import Text.Pandoc.Definition
import Data.Char (toUpper)

allCaps :: Inline -> Inline
allCaps (Str xs) = Str $ map toUpper xs
allCaps x = x

% ghci AllCaps.hs
*AllCaps> Text.Pandoc.Walk.walk allCaps $ Para [Emph [Str "hi"]]
Para [Emph [Str "HI"]]
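walk is not limited to Inline elements, and Text.Pandoc.Walk also exports query for collecting information rather than transforming. The sketch below assumes only pandoc-types; demote, headerLevels, and sample are illustrative names that do not appear in the talk.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: walk over Block elements, and query to collect values.
import Data.Monoid ((<>))
import Text.Pandoc.Builder (doc, emph, header, para)
import Text.Pandoc.Definition
import Text.Pandoc.Walk (query, walk)

-- Push every header one level down; leave other blocks alone.
demote :: Block -> Block
demote (Header n attr xs) = Header (n + 1) attr xs
demote b = b

-- Return the level of a header, or nothing for any other block.
headerLevels :: Block -> [Int]
headerLevels (Header n _ _) = [n]
headerLevels _ = []

-- A small test document: two headers around one paragraph.
sample :: Pandoc
sample = doc (header 1 "Intro" <> para (emph "hi") <> header 2 "Details")

main :: IO ()
main = do
  print (query headerLevels sample)                -- [1,2]
  print (query headerLevels (walk demote sample))  -- [2,3]
```

query combines the results with the Monoid instance of the result type, so lists come back in document order.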

Filters

Suppose we have a program that defines a transformation

f :: Pandoc -> Pandoc

Since Pandoc has Read and Show instances, we can write a pipe:

-- f.hs
main = interact (show . f . read)

And use it thus:

pandoc -t native -s | runghc f.hs | pandoc -f native -s -t latex
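For reference, a complete f.hs might look like the following sketch. The particular transformation, turning emphasis into strong emphasis, is an arbitrary stand-in for f and is not taken from the talk; only pandoc-types is assumed.

```haskell
-- f.hs: a complete native-format pipe (sketch).
import Text.Pandoc.Definition (Inline (..), Pandoc)
import Text.Pandoc.Walk (walk)

-- Stand-in transformation: emphasis becomes strong emphasis.
emphToStrong :: Inline -> Inline
emphToStrong (Emph xs) = Strong xs
emphToStrong x = x

-- Apply the Inline-level change everywhere in the document.
f :: Pandoc -> Pandoc
f = walk emphToStrong

-- Parse the native representation from stdin, transform it,
-- and print the native representation to stdout.
main :: IO ()
main = interact (show . f . read)
```

The -s flag matters on both sides of the pipe: without it, pandoc -t native emits only the list of blocks, which does not parse as a Pandoc value.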

JSON filters

Read and Show are really slow. Better to use JSON serialization:

pandoc -t json -s | runghc fjson.hs | pandoc -f json -s -t latex

To simplify this pattern, we added --filter:

pandoc -s -t latex --filter ./fjson.hs

toJSONFilter

Text.Pandoc.JSON

toJSONFilter takes any function a -> a or a -> [a] or a -> IO a, where a is a Pandoc type, and turns it into a JSON filter.

import Text.Pandoc.JSON
import AllCaps (allCaps)

main = toJSONFilter allCaps
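The a -> IO a form is handy when a filter needs side effects. Here is a minimal sketch, assuming only pandoc-types; the file name logHeaders.hs and the function logHeader are hypothetical.

```haskell
-- logHeaders.hs (hypothetical name): an IO filter sketch.
import System.IO (hPutStrLn, stderr)
import Text.Pandoc.JSON

-- Report each header's level on stderr and return the block unchanged.
-- stdout is left alone because pandoc reads the filtered JSON from it.
logHeader :: Block -> IO Block
logHeader b@(Header n _ _) = do
  hPutStrLn stderr ("header, level " ++ show n)
  return b
logHeader b = return b

main :: IO ()
main = toJSONFilter logHeader
```

Diagnostics go to stderr so that they cannot corrupt the JSON stream on stdout.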

Example: `emphToCaps.hs`

-- pandoc --filter ./emphToCaps.hs
import Text.Pandoc.JSON
import Text.Pandoc.Walk
import AllCaps (allCaps)

emphToCaps :: Inline -> [Inline]
emphToCaps (Emph xs) = walk allCaps xs
emphToCaps x = [x]

main :: IO ()
main = toJSONFilter emphToCaps

Output format conditionalization

pandoc --filter passes the name of the output format as the first argument to the filter. So the filter’s behavior can depend on the output format.

toJSONFilter makes this easy: just use a function whose first argument is Maybe Format.

Example: `emphToCaps2.hs`

Emph as Small Caps in LaTeX and HTML, ALL CAPS otherwise:

-- pandoc --filter ./emphToCaps2.hs
import Text.Pandoc.JSON
import Text.Pandoc.Walk
import AllCaps (allCaps)

emphToCaps :: Maybe Format -> Inline -> [Inline]
emphToCaps (Just f) (Emph xs)
  | f == Format "html" || f == Format "latex" = [SmallCaps xs]
emphToCaps _ (Emph xs) = walk allCaps xs
emphToCaps _ x = [x]

main :: IO ()
main = toJSONFilter emphToCaps

Exercises

http://johnmacfarlane.net/BayHac2014/exercises.pdf