Converting math-heavy LaTeX to HTML with Pandoc
Posted on 2024-12-20 by ubikium
Recently I needed to convert a math-heavy LaTeX document to HTML. Of all the tools I tried, Pandoc won out, as it was the only one that didn’t choke on macros. It generated surprisingly decent output and offered great customizability. In this post I’ll talk about the general setup and give some tips on extending the process to add various features.
For a complete example, see this directory.
It features a conversion from a document with various kinds of math packages into a website with interlinked pages.
Everything should follow from the run-pandoc.sh script.
General setup and limitations
The overall pipeline is:
LaTeX source
↓ pandoc reader
pandoc AST
↓ filters (0 or more)
pandoc AST
↓ pandoc writer
HTML files
↓ MathJax rendering
rendered HTML files
To achieve the desired end goal for a specific component, you’ll need to decide at which step the transformation should happen and how it affects the rest of the pipeline. For example, to rewrite links to a certain format, you can use a custom reader, a Lua filter, or JavaScript in the final HTML file.
What kind of LaTeX sources are suitable for this pipeline? Pandoc can recognize common packages and their commands. There are limits, of course, which roughly correspond to what’s expressible in Pandoc’s Markdown.
As for your own macros, Pandoc can parse and perform macro expansion inside the reader. So basically if you are using macros for simple string substitution, rather than general programming, there’s a very good chance that it just works. You can also use this as a way to redefine macros not supported by Pandoc (or MathJax).
Preserve math blocks
You’ll probably want to use MathJax to render your math.
To do that, run pandoc --standalone --mathjax.
--standalone instructs Pandoc to generate the headers and footers, etc.
Otherwise, Pandoc assumes it’s transforming a document fragment and thus will not add those components.
The --mathjax option prevents Pandoc from rendering the math blocks itself. Instead, it preserves the LaTeX commands and adds a segment to the header that loads MathJax, which renders the LaTeX commands into something the browser can display.
To customize MathJax options, add a custom header with the --include-in-header header.html option.
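As a rough sketch of how these options fit together, the command line can be assembled in Python. The function name build_pandoc_command and the file names main.tex, main.html, and header.html are placeholders chosen for illustration; only the Pandoc flags themselves come from the text above.

```python
from pathlib import Path
from typing import List, Optional


def build_pandoc_command(
    source: Path, output: Path, header: Optional[Path] = None
) -> List[str]:
    """Assemble a pandoc invocation that keeps math as LaTeX for MathJax."""
    command = [
        "pandoc",
        "--from", "latex",
        "--to", "html",
        "--standalone",  # emit a complete page rather than a fragment
        "--mathjax",     # keep math as TeX and load MathJax in the page
    ]
    if header is not None:
        # Extra HTML (for example a MathJax configuration) placed in <head>.
        command += ["--include-in-header", str(header)]
    command += ["--output", str(output), str(source)]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where pandoc is installed. Building a list rather than a shell string avoids quoting problems with unusual file names.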
Translate math packages
The LaTeX source might not be written in a way that’s aware of MathJax’s LaTeX support. So it might use a package that is not supported by MathJax. To solve this problem, we have to rewrite the math blocks. There are several places in the pipeline which can be changed to do this and you can combine different transformations together.
Starting from the Pandoc reader, the easiest approach is to redefine the unsupported commands (e.g. with \renewcommand) as something supported by MathJax.
Remember it’s only string substitution, so it’s okay to do horrible things like \renewcommand\and[0]{\end{mathpar}\begin{mathpar}}.
I should mention that Pandoc doesn’t handle the star variant of a command well (e.g. \inferrule and \inferrule*).
In LaTeX, * is just an argument to the command, but I didn’t find an easy way to do the if-else branching with Pandoc’s default LaTeX reader. Maybe I’m missing something obvious.
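One possible workaround, which is not part of the original setup, is to rename starred commands in the source text before Pandoc reads it, so that each variant becomes an ordinary macro that can be redefined separately. The sketch below assumes a hypothetical replacement name such as \inferruleStar, which you would then define yourself in the preamble.

```python
import re
from typing import Iterable

# A control word immediately followed by '*', e.g. \inferrule*
STARRED = re.compile(r"\\([A-Za-z]+)\*")


def rename_starred(source: str, names: Iterable[str], suffix: str = "Star") -> str:
    """Rewrite \\name* to \\name<suffix> for the listed command names only."""
    wanted = set(names)

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in wanted:
            return "\\" + name + suffix
        return match.group(0)  # leave other starred commands alone

    return STARRED.sub(replace, source)
```

Restricting the rewrite to an explicit list of names keeps commands such as \section* untouched, and environment names like align* are not affected because they are not preceded by a backslash.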
If the above rewriting is not enough to solve your problem, you can use a Pandoc filter to perform whatever transformation you need.
Because of --mathjax, the math arrives at the filter stage as a raw LaTeX string, which you can rewrite into a different string.
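To illustrate what such a filter looks like, here is a minimal sketch of a JSON filter using only the Python standard library. It relies on the JSON form of the Pandoc AST, where a math element is an object with "t" set to "Math" and "c" holding the math type followed by the TeX string. The macro \myvec and the function names are made up for the example.

```python
import json
import re
import sys
from typing import Any, Callable


def rewrite_math(node: Any, rewrite: Callable[[str], str]) -> Any:
    """Return a copy of the AST with `rewrite` applied to every Math string."""
    if isinstance(node, list):
        return [rewrite_math(child, rewrite) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Math":
            math_type, tex = node["c"]
            return {"t": "Math", "c": [math_type, rewrite(tex)]}
        return {key: rewrite_math(value, rewrite) for key, value in node.items()}
    return node  # strings, numbers, None


def example_rewrite(tex: str) -> str:
    # Hypothetical substitution: replace \myvec, but not \myvector.
    return re.sub(r"\\myvec(?![A-Za-z])", lambda m: r"\mathbf", tex)


def main() -> None:
    # Pandoc sends the AST as JSON on stdin and reads the result from stdout.
    document = json.load(sys.stdin)
    json.dump(rewrite_math(document, example_rewrite), sys.stdout)

# When saving this as a filter script, call main() at the end of the file.
```

A script like this would be passed to Pandoc with --filter. A Lua filter does the same job without the JSON round trip, so this version is mainly useful if you prefer to do the string processing in Python.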
I wrote some functions to do macro search and replace, which can be found here (or see the hosted version).
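For readers who want a feel for what macro search and replace involves, the following is an independent sketch, not the linked functions. It replaces \name{a1}...{an} with a template containing #1 to #n, matching braces so that nested groups survive. It assumes the arguments follow the macro name directly, with no whitespace or optional arguments in between.

```python
import re
from typing import Optional, Tuple


def read_group(tex: str, start: int) -> Optional[Tuple[str, int]]:
    """Read a balanced {...} group at tex[start].

    Returns (content, index just past the closing brace), or None.
    """
    if start >= len(tex) or tex[start] != "{":
        return None
    depth = 0
    i = start
    while i < len(tex):
        ch = tex[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return tex[start + 1:i], i + 1
        i += 1
    return None


def replace_macro(tex: str, name: str, n_args: int, template: str) -> str:
    """Replace \\name{a1}...{an} with template, filling #1..#n."""
    macro = "\\" + name
    out = []
    i = 0
    while i < len(tex):
        if tex[i] != "\\":
            out.append(tex[i])
            i += 1
            continue
        j = i + len(macro)
        if not tex.startswith(macro, i) or tex[j:j + 1].isalpha():
            out.append(tex[i:i + 2])  # some other control sequence
            i += 2
            continue
        args = []
        for _ in range(n_args):
            group = read_group(tex, j)
            if group is None:
                break
            content, j = group
            # Expand nested uses of the same macro inside the argument.
            args.append(replace_macro(content, name, n_args, template))
        if len(args) != n_args:
            out.append(macro)  # malformed call: leave it untouched
            i += len(macro)
            continue

        def fill(match: "re.Match[str]") -> str:
            index = int(match.group(1)) - 1
            return args[index] if 0 <= index < len(args) else match.group(0)

        out.append(re.sub(r"#(\d)", fill, template))
        i = j
    return "".join(out)
```

For instance, replace_macro(tex, "pair", 2, r"\langle #1, #2 \rangle") would rewrite a hypothetical two-argument \pair macro. A plain regular expression is not enough here because arguments may themselves contain braces.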
MathJax performance issue
Pandoc prefers building everything as one page, but if there are too many equations, MathJax rendering might take too long and even crash the page with an out-of-memory error. There are generally two ways to fix this:
- Use the target of -t chunkedhtml.
- Pre-render math blocks into SVG or CHTML files.
If performance is critical, you can actually do both.
For pre-rendering, examples in this repo are very useful.
One note on CHTML pre-rendering: the CSS is necessary, and its content is determined by what has already been rendered.
So you need to render everything, then generate the CSS, and compose it with the Pandoc output header.
The tex2chtml-page script won’t change the header if there’s already one.
Links in general and links in math blocks
In LaTeX, links can be added with \label-\ref, or \hypertarget-\hyperlink.
Pandoc generally translates them into HTML element identifiers and anchors respectively.
When this fails (e.g. \label commands inside figure captions are not translated), you can use a filter to manually parse out the label and add it as an identifier to the element.
In math blocks, the raw LaTeX commands are preserved, so we need to rewrite the string to a format such that after the MathJax rendering, an identifier or an anchor is produced.
For identifiers, \label or \hypertarget can be substituted by MathJax’s \cssId.
For anchors, \ref or \hyperlink can be substituted by \href.
More can be found in MathJax’s documentation.
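As a small sketch of this substitution, the two-argument commands map over directly, since \hypertarget{id}{content} and \cssId{id}{content} have the same shape, as do \hyperlink{id}{content} and \href{url}{content}. The function name is made up, and the sketch assumes identifiers contain no braces. \label and \ref take no content argument, so you would additionally have to decide what to wrap; that case is left out here.

```python
import re

# The identifier is the first brace group; the content group that follows
# is left exactly where it is.
HYPERTARGET = re.compile(r"\\hypertarget\{([^{}]*)\}")
HYPERLINK = re.compile(r"\\hyperlink\{([^{}]*)\}")


def links_to_mathjax(tex: str) -> str:
    """Rewrite \\hypertarget and \\hyperlink into MathJax's \\cssId and \\href."""
    tex = HYPERTARGET.sub(lambda m: r"\cssId{" + m.group(1) + "}", tex)
    tex = HYPERLINK.sub(lambda m: r"\href{#" + m.group(1) + "}", tex)
    return tex
```

The replacements are written as functions rather than replacement strings so that the backslashes are inserted literally and not interpreted as regex escapes.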
There’s a complex interaction between the chunkedhtml target and links.
Pandoc will try to rewrite the links so that they’ll point to the correct chunked page.
However, this doesn’t include identifiers and anchors in math blocks (remember they are only produced after MathJax rendering).
So you’ll need to correct them yourself.
I achieved this by collecting all identifiers and then patching each one to the correct link format of page#id.
See this example.
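The collect-then-patch idea can be sketched roughly as follows. This is an illustration rather than the linked implementation: it assumes the math in each chunked page still contains raw \cssId{...} and \href{#...} commands (that is, it runs before MathJax renders anything), and the page file names in the usage note are invented.

```python
import re
from typing import Dict

CSS_ID = re.compile(r"\\cssId\{([^{}]*)\}")
LOCAL_HREF = re.compile(r"\\href\{#([^{}]*)\}")


def collect_ids(pages: Dict[str, str]) -> Dict[str, str]:
    """Map each \\cssId identifier to the page file that defines it."""
    owner: Dict[str, str] = {}
    for page_name, html in pages.items():
        for identifier in CSS_ID.findall(html):
            owner.setdefault(identifier, page_name)  # first definition wins
    return owner


def patch_links(pages: Dict[str, str], owner: Dict[str, str]) -> Dict[str, str]:
    """Rewrite \\href{#id} into \\href{page#id} wherever the id is known."""

    def fix(match: "re.Match[str]") -> str:
        identifier = match.group(1)
        target = owner.get(identifier)
        if target is None:
            return match.group(0)  # unknown id: leave the link as it is
        return r"\href{" + target + "#" + identifier + "}"

    return {name: LOCAL_HREF.sub(fix, html) for name, html in pages.items()}
```

Here pages maps a file name such as 1-intro.html to that file’s HTML text. The two passes are needed because a link may point to an identifier defined on a page that comes later.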
Anchors for section titles, navigation sidebar, and other goodies
Pandoc’s own documentation site has many of these additional features. Their implementation can be found in the source code of Pandoc’s website.