UTF-8 characters in lstlisting breaks pdf conversion #131

@rossbar

Bug Report

Describe the bug

This is not necessarily an IPyPublish bug, but a limitation in the LaTeX listings package causes PDF conversion to fail if Unicode characters appear inside an lstlisting environment. I stumbled upon this when using the %timeit IPython magic in a code cell, as the output of %timeit includes Unicode characters (the plus-minus sign, Greek letters used as unit prefixes for seconds, etc.).

To Reproduce

Steps to reproduce the behavior:

  1. Create a file called example.Rmd with the following contents
\```{python}
%timeit a = 2 + 2
\```
  2. nbpublish -f latex_ipypublish_all.exec -pdf example.Rmd

Minimal Notebook Example

timeit_nb.ipynb.txt

Same build instructions as above (with the different filename of course). Note that this issue is downstream in the build process (at the latex -> pdf step) so is insensitive to whether the input file is .Rmd, .ipynb, etc.

Expected Behaviour

Currently, the conversion fails with errors from pdflatex. The desired behavior is a successful build with unicode characters properly represented in lstlisting environments.

Runtime Information

(please complete the following information)

  • IPyPublish: 0.10.10

  • Python: 3.8.1

  • OS: Arch Linux (5.5.2-arch1-1)

  • Pandoc: 2.8

  • (optional for pdf issues) texlive: 3.14159265

  • (optional for pdf issues) latexmk: 4.65

Additional context

The .log file produced by pdflatex is not particularly helpful: it suggests that the problem lies with the utf8x or ucs packages/options. After some digging, I traced the problem back to a limitation of lstlisting. A simple procedure for confirming this:

  1. Open the converted/timeit.tex file generated by the nbpublish process
  2. Navigate to the lstlisting environment around the output from the code cell
  3. Comment out the lstlisting environment
  4. Build with pdflatex: pdflatex timeit.tex

The build will complete without errors and the output from the code cell will be properly rendered, albeit in plain LaTeX.
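The manual check above can also be scripted. The following is a minimal sketch, assuming the generated .tex file uses plain \begin{lstlisting} ... \end{lstlisting} pairs. The function name listings_with_non_ascii is hypothetical and not part of IPyPublish.

```python
import re

# Matches one lstlisting environment, including its body (non-greedy).
LISTING = re.compile(r"\\begin\{lstlisting\}.*?\\end\{lstlisting\}", re.DOTALL)


def listings_with_non_ascii(tex_source):
    """Return (line_number, offending_chars) for each lstlisting
    environment that contains at least one non-ASCII character.

    Hypothetical diagnostic helper, not an IPyPublish API.
    """
    hits = []
    for match in LISTING.finditer(tex_source):
        bad = sorted({ch for ch in match.group(0) if ord(ch) > 127})
        if bad:
            # 1-based line number of the \begin{lstlisting} line.
            line_no = tex_source.count("\n", 0, match.start()) + 1
            hits.append((line_no, "".join(bad)))
    return hits


sample = (
    "\\documentclass{article}\n"
    "\\begin{lstlisting}\n"
    "11.1 ns ± 2.64 ns per loop\n"
    "\\end{lstlisting}\n"
    "\\begin{lstlisting}\n"
    "plain ascii output\n"
    "\\end{lstlisting}\n"
)
print(listings_with_non_ascii(sample))  # [(2, '±')]
```

To run it against a real build, read the file first, for example with `open("converted/timeit.tex", encoding="utf-8").read()`, and pass the resulting string to the function.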

Proposed solution

The limitations of lstlisting with respect to Unicode input are documented, and section 2.5 of the documentation proposes a workaround: add an escapeinside= parameter to the lstlisting environment, which hands the text between the chosen delimiters back to LaTeX for processing. For example, here is the original lstlisting in timeit.tex as generated by the build process:

\begin{lstlisting}[language={},postbreak={},numbers=none,xrightmargin=7pt,belowskip=5pt,aboveskip=5pt,breakindent=0pt]
11.1 ns ± 2.64 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)   
                                                                                
\end{lstlisting}

Here is a modified version that adds escapeinside, which fixes the issue:

\begin{lstlisting}[language={},postbreak={},numbers=none,xrightmargin=7pt,belowskip=5pt,aboveskip=5pt,breakindent=0pt,escapeinside={*(}{)*}]
*(11.1 ns ± 2.64 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)   )*

\end{lstlisting}

Note that the characters that define the escaped section (*( and )* in my example) are configurable and could be specified for the entire document with \lstset.
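As a starting point for discussion, here is a rough sketch of how the escaping might be applied on the Python side before the text is written into the lstlisting environment. It is a variation on the example above: instead of wrapping the whole line, it wraps only the runs of non-ASCII characters, so that ASCII characters that are special to LaTeX (%, _, & and so on) stay inside the verbatim part. The function name escape_non_ascii and its parameters are hypothetical, and the sketch assumes the delimiters do not otherwise occur in the cell output.

```python
import re

# One or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def escape_non_ascii(text, left="*(", right=")*"):
    """Wrap each run of non-ASCII characters in listings escape delimiters.

    Hypothetical helper, not an IPyPublish API. `left` and `right` must
    match the pair given to escapeinside={...}{...} in the LaTeX source.
    """
    # A function replacement avoids backslash processing in the delimiters.
    return NON_ASCII_RUN.sub(lambda m: f"{left}{m.group(0)}{right}", text)


line = "11.1 ns ± 2.64 ns per loop"
print(escape_non_ascii(line))  # 11.1 ns *(±)* 2.64 ns per loop
```

Keeping the delimiters as parameters means they can follow whatever pair is eventually chosen for \lstset, rather than being hard-coded.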

If the proposed solution sounds workable to you, I'm happy to attempt to implement it. Some discussion would be required to hammer out details (e.g. appropriate escape characters). I wanted to create an issue first to see if there were any additional insights/ideas.
