% Pandoc for Haskell Hackers % John MacFarlane % BayHac 2014 # Why am I here?
I'm a philosophy professor. And this is just the sort of question you expect a philosophy professor to ask---but I mean it in the most mundane sense. I have no formal background whatsoever in computer science. I think I'm a decent Haskell programmer, but nothing close to an expert. So why am I giving a talk at a Haskell hackathon?
# I created a virus... --- # ...that spreads GHC
Because I created the most effective virus for spreading GHC installations: pandoc.
| ![](images/virus1.png) | ![](images/virus2.png) | ![](images/virus3.png) | ![](images/virus4.png) ---- debian popcon Hackage -------- --------------- ------------------ pandoc 1997 26220 darcs 1908 3420 xmonad 1733 6432 -------- --------------- ------------------ ![](images/downloads-google.png) # How did it happen? ![trs80](images/TRS80.jpg) ![kim1](images/KIM1.jpg)
I grew up programming video games in BASIC on a TRS-80 and 6502 assembly on a KIM-1 that my grandfather bought (he must have been among the first microcomputer hobbyists). In college I worked summers as a programmer, writing scientific software in Pascal on VAX workstations with whopping 20 MB hard drives. I got out of programming when GUIs took over, but started to get interested again in 2004 after installing linux. I messed around with python and lisp and ruby, and wrote a website for my department. Then in 2005, an Australian philosopher/logician colleague of mine, Greg Restall, mentioned that his favorite programming language was Haskell. So I checked it out, and found that I liked it too. I started messing around with parsec, the parser combinator library. I asked myself, "what would be fun to try to parse?" I'd gotten interested in lightweight markup languages and was using reStructuredText for lecture notes and handouts. And I'd just seen this "markdown" thing that John Gruber introduced. So I said to myself, "Self, let's try writing a markdown parser." That seemed like a good challenge, because markdown is about as parser-unfriendly a language as you can get. An asterisk, for example, can be either an open or close tag for emphasis, or *part of* an open or close tag for strong emphasis, or just a literal asterisk---and which it is often depends on what comes after. At the time, the only markdown parsers around were just big regex transformations. Parsec worked well for this. Before long I had a working parser. It was considerably faster and more accurate than John Gruber's perl script. And it was much easier to maintain and extend. One thing led to another. I wasn't entirely happy with the docutils (reStructuredText) tool chain, and I saw some advantages to "making my own tools" for writing. I needed to convert my existing documents in reStructuredText to markdown, so I added a reStructuredText parser and a markdown writer. I needed output in LaTeX as well as HTML, so I added a LaTeX writer. I needed footnotes, inline LaTeX math, and other features, so I extended pandoc's markdown dialect. The project provided hours of pleasant procrastination.
# First release ![](images/first-pandoc-web-page.png)
Then something possessed me to release it. In August 2006 I released the first version. Not long after that, a Debian developer, Recai Oktas, contacted me offering to add it to debian. So I became an open-source developer, and I've worked on pandoc in my spare time ever since.
# Libraries begat libraries > - highlighting-kate > - zip-archive > - texmath > - pandoc-citeproc (citeproc-hs) > - gitit # The command-line tool A quick demonstration. git clone https://github.com/jgm/BayHac2014 cd BayHac2014/demo vim script.txt
Our goal is to learn how to use pandoc as a library, but let's first have a quick demo of the command-line tool. Run through `script.txt` in `demo/`.
---- By default pandoc works as a pipe, reading from stdin and writing to stdout. Try it: pandoc Hit Ctrl-D (Ctrl-Z on Windows) when you're finished typing text. --- You can use options. This one triggers "smart typography" (quotes, dashes, ellipses). pandoc --smart --- Let's convert to latex instead of HTML. pandoc --to latex . . . Or to mediawiki: pandoc --to mediawiki --- Let's convert a latex file to markdown: pandoc -f latex -t markdown example.tex --- For help and information on which options pandoc supports: pandoc --help More detail can be found in the pandoc [README](http://johnmacfarlane.net/pandoc/README.html). Or: man pandoc man pandoc_markdown --- The `--standalone` or `-s` option creates a standalone document with header, footer, and metadata: pandoc --standalone --smart -o r.html -t html5 README . . . Let's add a table of contents and use some custom CSS: pandoc --standalone --smart --toc -o r.html -t html5 \ --css my.css README --- Standalone documents are constructed from templates. To see the default template for a format, use `-D`: pandoc -D html5 > my.html5 The template language is documented [here](http://johnmacfarlane.net/pandoc/README.html#templates). --- [`sample1.txt`](demo/sample1.txt) contains some nice structured metadata. This is YAML but with strings interpreted as markdown. --- title: My demonstration author: - Kurt Gödel - Haskell Curry version: - number: 1.0 date: July 13, 1945 log: Initial commit - number: 1.1 date: August 14, 1946 log: Added some math --- --- The metadata has a nice version history. Let's edit `my.html5` to include this before the `` tag: --- Let's try our custom template: pandoc -s -S --template my.html5 -t html5 sample1.txt \ -o sample1.html --mathjax --- We can create a PDF. Pandoc shells out to pdflatex for this. pandoc sample1.txt -o sample1.pdf If you want to use xelatex instead, use `--latex-engine=xelatex`. --- We can create a word document without opening Word: pandoc sample1.txt -o sample1.docx Note that the TeX math in the markdown file gets converted to native Word equations. . . . or an epub: pandoc sample1.txt -t epub3 -o sample1.epub --- Pandoc can process citations using bibtex bibliographies (or several other formats). Take a look at [`sample2.txt`](demo/sample2.txt) and [`sample2.bib`](demo/sample2.bib). We tell it to use `pandoc-citeproc` as a filter: pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.docx --- Try changing the bibliography style. Edit [`sample2.txt`](demo/sample2.txt) to uncomment csl: chicago-fullnote-bibliography.csl. Then: pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.docx --- Citations work in all formats supported by pandoc: pandoc -s --filter pandoc-citeproc sample2.txt -o sample2.org emacs sample2.org --- Source code highlighting is automatic for marked code blocks. It works in HTML, PDF, and docx: pandoc -s sample3.txt -o sample3.html pandoc -s sample3.txt -o sample3.pdf pandoc -s sample3.txt -o sample3.docx . . . You can change the highlighting style: pandoc -s sample3.txt -o sample3.html --highlight-style=monochrome --- Pandoc has native support for literate haskell. [`paste.lhs`](demo/paste.lhs) is a literate Haskell file with markdown text: ghci paste.lhs pandoc paste.lhs -f markdown+lhs -t html -s -o paste.html pandoc paste.lhs -f markdown+lhs -t latex+lhs -s -o paste.lhs.tex --- Pandoc can also convert to and from haddock, though this needs updating in light of recent changes in haddock's markup. pandoc -f markdown -t haddock pandoc -f haddock -t markdown --- Pandoc also supports beamer and several HTML slide show formats. [This slide show](slides.txt) was written with pandoc: pandoc slides.txt -o slides.html -t revealjs --css slides.css \ -S --highlight-style=espresso # A tour of pandoc's API # Readers and writers [Text.Pandoc](doc/pandoc/Text-Pandoc.html) . . . ``` haskell Prelude> :m + Text.Pandoc Text.Pandoc> let doc = readMarkdown def "*hi*" Text.Pandoc> doc Pandoc (Meta {unMeta = fromList []}) [Para [Emph [Str "hi"]]] Text.Pandoc> writeLaTeX def doc "\\emph{hi}" Text.Pandoc> readMarkdown def{readerSmart = True} "dog's" Pandoc (Meta {unMeta = fromList []}) [Para [Str "dog\8217s"]] ``` # The Pandoc types [Text.Pandoc.Definition](doc/pandoc-types/Text-Pandoc-Definition.html) . . . You can use `pandoc -t native` and `pandoc -f native` to explore: ``` % echo "[*link*](/foo)" | pandoc -t native [Para [Link [Emph [Str "link"]] ("/foo","")]] ``` # Builder [Text.Pandoc.Builder](doc/pandoc-types/Text-Pandoc-Builder.html) Concatenating lists is slow. So we use special types `Inlines` and `Blocks` that wrap `Sequence`s of `Inline` and `Block` elements. # A simple example Here's a JSON data source about CNG fueling stations in the Chicago area: [cng_fuel_chicago.json](cng_fuel_chicago.json.html). Boss says: write me a letter in Word listing all the stations that take the Voyager card. . . . No need to open Word for this job! [fuel.hs](./fuel.hs.html) # Transforming a Pandoc document [Text.Pandoc.Generic](doc/pandoc-types/Text-Pandoc-Generic.html) [Text.Pandoc.Walk](doc/pandoc-types/Text-Pandoc-Walk.html) # Example: `walk` ```haskell module AllCaps (allCaps) where import Text.Pandoc.Definition import Data.Char (toUpper) allCaps :: Inline -> Inline allCaps (Str xs) = Str $ map toUpper xs allCaps x = x ``` . . . ``` % ghci AllCaps.hs *AllCaps > Text.Pandoc.Walk.walk allCaps $ Para [Emph [Str "hi"]] Para [Emph [Str "HI"]] ``` # Filters Suppose we have a program that defines a transformation ```haskell f :: Pandoc -> Pandoc ``` Since `Pandoc` has `Read` and `Show` instances, we can write a pipe: ```haskell -- f.hs main = interact (show . f . read) ``` And use it thus: pandoc -t native -s | runghc f.hs | pandoc -f native -s -t latex # JSON filters `Read` and `Show` are really slow. Better to use JSON serialization: pandoc -t json -s | runghc fjson.hs | pandoc -f json -s -t latex . . . To simplify this pattern, we added `--filter`: pandoc -s -t latex --filter fjson.hs # toJSONFilter [Text.Pandoc.JSON](doc/pandoc-types/Text-Pandoc-JSON.html) `toJSONFilter` takes any function `a -> a` or `a -> [a]` or `a -> IO a`, where `a` is a Pandoc type, and turns it into a JSON filter. ```haskell import Text.Pandoc.JSON import AllCaps (allCaps) main = toJSONFilter allCaps ``` # Example: `emphToCaps.hs` ```haskell -- pandoc --filter ./emphToCaps.hs import Text.Pandoc.JSON import Text.Pandoc.Walk import AllCaps (allCaps) emphToCaps :: Inline -> [Inline] emphToCaps (Emph xs) = walk allCaps xs emphToCaps x = [x] main :: IO () main = toJSONFilter emphToCaps ``` # Output format conditionalization `pandoc --filter` passes the name of the output format as first argument to the filter. So the filter's behavior can depend on the output format. `toJSONFilter` makes this easy: just use a function whose first argument is `Maybe Format`. # Example: `emphToCaps2.hs` Emph as Small Caps in LaTeX and HTML, ALL CAPS otherwise: ```haskell -- pandoc --filter ./emphToCaps2.hs import Text.Pandoc.JSON import Text.Pandoc.Walk import AllCaps (allCaps) emphToCaps :: Maybe Format -> Inline -> [Inline] emphToCaps (Just f) (Emph xs) | f == Format "html" || f == Format "latex" = [SmallCaps xs] emphToCaps _ (Emph xs) = walk allCaps xs emphToCaps _ x = [x] main :: IO () main = toJSONFilter emphToCaps ``` # Exercises