Standard YAML frontmatter syntax causes lists to not be detected #5

@nathanlesage

Affected packages and versions

micromark-extension-frontmatter v1.0.0

Link to runnable example

https://codesandbox.io/s/hopeful-matan-nybskz?file=%2Fsrc%2Findex.js

Steps to reproduce

My use case requires transforming Markdown documents, which may include frontmatter, into a syntax tree by calling parse() on the remark processor.

Parsing breaks when the YAML frontmatter uses Jekyll-style delimiters (three dashes both above and below the frontmatter). It works, however, when I use Pandoc-style delimiters (three dashes above, three dots below). With Jekyll-style delimiters, using the plugin breaks the processor's ability to detect lists properly.

To reproduce, open the sandbox linked above, which is already set up. In it, switch the frontmatter's closing fence between dashes and dots and compare the resulting syntax trees.
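For anyone who wants to script the fence swap instead of editing the document by hand, here is a minimal sketch. `withClosingFence` is a hypothetical helper written for this report, not part of remark or the plugin, and it assumes the frontmatter starts on the very first line with `---`.

```typescript
// Hypothetical helper (not part of remark or micromark-extension-frontmatter):
// rewrites the closing frontmatter fence of a Markdown string.
type CloseFence = '---' | '...'

function withClosingFence(markdown: string, close: CloseFence): string {
  const lines = markdown.split('\n')
  // Assumption: frontmatter must open with `---` on the first line.
  if (lines[0] !== '---') return markdown
  // The first fence-like line after the opening fence closes the frontmatter.
  const closeIndex = lines.findIndex(
    (line, index) => index > 0 && (line === '---' || line === '...')
  )
  if (closeIndex === -1) return markdown
  lines[closeIndex] = close
  return lines.join('\n')
}

// A shortened stand-in for the test document.
const jekyll = ['---', 'title: Test', '---', '', '* one', '* two'].join('\n')
const pandoc = withClosingFence(jekyll, '...')
```

Both `jekyll` and `pandoc` can then be passed to the same processor, so that the closing fence is the only difference between the two runs.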

Processor setup

import { remark } from 'remark'
import remarkFrontmatter from 'remark-frontmatter'
import remarkMath from 'remark-math'
import type { Root } from 'mdast'

export function md2ast (markdown: string): Root {
  return remark()
    .use(remarkFrontmatter, [
      // Either Pandoc-style frontmatters ...
      { type: 'yaml', fence: { open: '---', close: '...' } },
      // ... or Jekyll/Static site generators-style frontmatters.
      { type: 'yaml', fence: { open: '---', close: '---' } }
    ])
    .use(remarkMath)
    .parse(markdown)
}

Note that although I am being verbose by stating the delimiters explicitly, replacing the second definition with simply 'yaml' does not change the result.

Test document

Run this document through the pipeline above. When the frontmatter ends with three dashes, the list is not detected correctly; when you replace them with three dots (Pandoc-style frontmatter), it is.

---
title: "The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques"
date: 2022-12-05
id: 20221205151411
author: Hendrik Erz
---

# Export Link Removal

Export this file into any format to test out the corresponding LUA filter.

* History of NLP in Sociology (from Mosteller and Wallace to today)
* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)
* What are they being used for?
    * This is a worngly written second-indended word
* Where do these methods come from?
* Are there already ethical notes around? Or don’t they care?

Expected behavior

Correct behavior with Pandoc-style frontmatter

If you replace the closing three dashes with three dots in the test document, it works as expected.

This is the correct syntax tree:

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "list",
      "ordered": false,
      "start": null,
      "spread": false,
      "children": [
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "History of NLP in Sociology (from Mosteller and Wallace to today)",
                  "position": {
                    "start": { "line": 12, "column": 3, "offset": 263 },
                    "end": { "line": 12, "column": 68, "offset": 328 }
                  }
                }
              ],
              "position": {
                "start": { "line": 12, "column": 3, "offset": 263 },
                "end": { "line": 12, "column": 68, "offset": 328 }
              }
            }
          ],
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 12, "column": 68, "offset": 328 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)",
                  "position": {
                    "start": { "line": 13, "column": 3, "offset": 331 },
                    "end": { "line": 13, "column": 161, "offset": 489 }
                  }
                }
              ],
              "position": {
                "start": { "line": 13, "column": 3, "offset": 331 },
                "end": { "line": 13, "column": 161, "offset": 489 }
              }
            }
          ],
          "position": {
            "start": { "line": 13, "column": 1, "offset": 329 },
            "end": { "line": 13, "column": 161, "offset": 489 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What are they being used for?",
                  "position": {
                    "start": { "line": 14, "column": 3, "offset": 492 },
                    "end": { "line": 14, "column": 32, "offset": 521 }
                  }
                }
              ],
              "position": {
                "start": { "line": 14, "column": 3, "offset": 492 },
                "end": { "line": 14, "column": 32, "offset": 521 }
              }
            },
            {
              "type": "list",
              "ordered": false,
              "start": null,
              "spread": false,
              "children": [
                {
                  "type": "listItem",
                  "spread": false,
                  "checked": null,
                  "children": [
                    {
                      "type": "paragraph",
                      "children": [
                        {
                          "type": "text",
                          "value": "This is a worngly written second-indended word",
                          "position": {
                            "start": { "line": 15, "column": 7, "offset": 528 },
                            "end": { "line": 15, "column": 53, "offset": 574 }
                          }
                        }
                      ],
                      "position": {
                        "start": { "line": 15, "column": 7, "offset": 528 },
                        "end": { "line": 15, "column": 53, "offset": 574 }
                      }
                    }
                  ],
                  "position": {
                    "start": { "line": 15, "column": 5, "offset": 526 },
                    "end": { "line": 15, "column": 53, "offset": 574 }
                  }
                }
              ],
              "position": {
                "start": { "line": 15, "column": 5, "offset": 526 },
                "end": { "line": 15, "column": 53, "offset": 574 }
              }
            }
          ],
          "position": {
            "start": { "line": 14, "column": 1, "offset": 490 },
            "end": { "line": 15, "column": 53, "offset": 574 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Where do these methods come from?",
                  "position": {
                    "start": { "line": 16, "column": 3, "offset": 577 },
                    "end": { "line": 16, "column": 36, "offset": 610 }
                  }
                }
              ],
              "position": {
                "start": { "line": 16, "column": 3, "offset": 577 },
                "end": { "line": 16, "column": 36, "offset": 610 }
              }
            }
          ],
          "position": {
            "start": { "line": 16, "column": 1, "offset": 575 },
            "end": { "line": 16, "column": 36, "offset": 610 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Are there already ethical notes around? Or don’t they care?",
                  "position": {
                    "start": { "line": 17, "column": 3, "offset": 613 },
                    "end": { "line": 17, "column": 62, "offset": 672 }
                  }
                }
              ],
              "position": {
                "start": { "line": 17, "column": 3, "offset": 613 },
                "end": { "line": 17, "column": 62, "offset": 672 }
              }
            }
          ],
          "position": {
            "start": { "line": 17, "column": 1, "offset": 611 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}

Actual behavior

Incorrect behavior with Jekyll-style frontmatter

If you run the test document above unchanged (with three dashes), it produces an incorrect syntax tree. Note that the list is not detected; its items end up as the text of a single paragraph.

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "* History of NLP in Sociology (from Mosteller and Wallace to today)\n* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)\n* What are they being used for?\n* This is a worngly written second-indended word\n* Where do these methods come from?\n* Are there already ethical notes around? Or don’t they care?",
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}
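The difference between the two trees can also be checked programmatically rather than by eye. The following is a minimal sketch: `TreeNode`, `collectTypes`, and `containsList` are hypothetical names introduced here (the real node types live in the mdast typings), and the two sample trees are abbreviated stand-ins for the full trees in this report.

```typescript
// Minimal structural type for the nodes shown in this report.
// Assumption: only `type` and `children` matter for this check.
interface TreeNode {
  type: string
  children?: TreeNode[]
}

// Collect node types depth-first, in document order.
function collectTypes(node: TreeNode): string[] {
  const types = [node.type]
  for (const child of node.children ?? []) {
    types.push(...collectTypes(child))
  }
  return types
}

// True if any node in the tree is a list.
function containsList(tree: TreeNode): boolean {
  return collectTypes(tree).some((type) => type === 'list')
}

// Abbreviated version of the tree produced with the Pandoc-style fence.
const expectedTree: TreeNode = {
  type: 'root',
  children: [
    { type: 'yaml' },
    { type: 'heading', children: [{ type: 'text' }] },
    { type: 'paragraph', children: [{ type: 'text' }] },
    {
      type: 'list',
      children: [
        {
          type: 'listItem',
          children: [{ type: 'paragraph', children: [{ type: 'text' }] }]
        }
      ]
    }
  ]
}

// Abbreviated version of the tree produced with the Jekyll-style fence.
const actualTree: TreeNode = {
  type: 'root',
  children: [
    { type: 'yaml' },
    { type: 'heading', children: [{ type: 'text' }] },
    { type: 'paragraph', children: [{ type: 'text' }] },
    { type: 'paragraph', children: [{ type: 'text' }] }
  ]
}
```

Calling `containsList` on the real output of `md2ast` for both variants of the test document should show the same split as these stand-ins: a list node with the dots, none with the dashes.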

Runtime

Node v16, Other (please specify in steps to reproduce)

Package manager

yarn v1

OS

macOS

Build and bundle tools

Webpack

Labels

🐛 type/bug (This is a problem), 👍 phase/yes (Post is accepted and can be worked on)
