Description
Problem Statement
There are numerous incompatible Markdown flavors. For this discussion, let's assume "Markdown" means strict CommonMark unless otherwise specified.
Many people expect to use Markdown notations inside their JSDoc. Writing a Markdown parser is already somewhat tricky, since the grammar is highly sensitive to context (compared to other rich-text formats such as HTML). Extending it with JSDoc tags causes some interesting collisions and ambiguities. Some motivating examples:
1. Code fences that span tags
/**
* I can use backticks to create a `code fence` that gets highlighted in a typewriter font.
*
* This `@tag` is not a TSDoc tag, since it's inside a code fence.
*
* This {@link MyClass | hyperlink to the `MyClass` base class} should get highlighting
* in its target text.
*
* This {@link MyClass | example of backtick (`)} has an unbalanced backtick (`)
* inside a tag.
*/
Intuitively we'd expect it to be rendered like this:
I can use backticks to create a code fence that gets highlighted in a typewriter font.
This @tag is not a TSDoc tag, since it's inside a code fence.
This hyperlink to the MyClass base class should get highlighting in its target text.
This example of backtick (`) has an unbalanced backtick (`) inside a tag.
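To make the `@tag` case concrete, here's a minimal sketch of a scanner that skips over backtick spans before it looks for tags. The name findTagsOutsideCode is hypothetical (it isn't part of any TSDoc API), and the sketch assumes a CommonMark-like rule where an opening run of backticks only closes on a run of the same length, with an unmatched run treated as literal text.

```typescript
// Hypothetical helpers for illustration; not part of any TSDoc API.

// Returns the index of the next backtick run with exactly `length` backticks,
// or -1 if there is none.
function findClosingRun(text: string, from: number, length: number): number {
  let j = from;
  while (j < text.length) {
    if (text.charAt(j) !== "`") {
      j++;
      continue;
    }
    let k = j;
    while (k < text.length && text.charAt(k) === "`") {
      k++;
    }
    if (k - j === length) {
      return j;
    }
    j = k;
  }
  return -1;
}

// Collects "@word" tags that appear outside of backtick spans.
function findTagsOutsideCode(text: string): string[] {
  const tags: string[] = [];
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    if (ch === "`") {
      let runEnd = i;
      while (runEnd < text.length && text.charAt(runEnd) === "`") {
        runEnd++;
      }
      const runLength = runEnd - i;
      const close = findClosingRun(text, runEnd, runLength);
      // Unmatched run: treat the backticks as literal text and keep scanning.
      i = close === -1 ? runEnd : close + runLength;
      continue;
    }
    if (ch === "@") {
      let end = i + 1;
      while (end < text.length && /[A-Za-z]/.test(text.charAt(end))) {
        end++;
      }
      if (end > i + 1) {
        tags.push(text.slice(i, end));
        i = end;
        continue;
      }
    }
    i++;
  }
  return tags;
}
```

Note that this sketch lets a backtick span swallow the closing } of an inline tag, so it doesn't resolve the unbalanced-backtick case. It only shows why tag detection has to know about backticks at all.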
2. Stars
Stars have the same problems as backticks, but with even more special cases:
/**
* Markdown would treat these as
* * bullet
* * items.
*
* Inside code comments, the left margin is sometimes ambiguous:
** bullet
** items?
*
* Markdown confusingly *allows a * inside an emphasis*.
* Does a *{@link MyClass | * tag}* participate in this?
*/
Intuitively we'd expect it to be rendered like this:
Markdown would treat these as
- bullet
- items.
Inside code comments, the left margin is sometimes ambiguous:
- bullet
- items?
Markdown confusingly allows a * inside an emphasis.
Does a * tag participate in this?
3. Whitespace
Markdown assigns special meanings to whitespace indentation. For example, indenting 4 spaces is equivalent to a ``` block. Newlines also have lots of special meanings.
This could be fairly confusing inside a code comment, particularly with weird cases like this:
/** Is this indented? */
/** some junk
Is this indented? */
/**
Is this okay at all? */
/**
Is this star part of the comment?
* mystery
Or is it a Markdown bullet?
*/
Perhaps TSDoc should issue warnings about malformed comment framing.
Perhaps we should try to disable some of Markdown's indentation rules. For example, the TSDoc parser could trim whitespace from the start of each line.
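As a rough illustration of both ideas (warning about framing, and trimming each line), here's a minimal sketch. The function stripCommentFraming and the FramingResult shape are hypothetical names I made up for this example, and the rule "strip leading whitespace, then at most one star" is just one possible policy, not a settled design.

```typescript
// Hypothetical helper for illustration; not part of any TSDoc API.
interface FramingResult {
  lines: string[];    // comment text with the framing removed, one entry per line
  warnings: string[]; // human-readable notes about malformed framing
}

function stripCommentFraming(comment: string): FramingResult {
  const warnings: string[] = [];
  const trimmed = comment.trim();
  if (trimmed.length < 5 || !trimmed.startsWith("/**") || !trimmed.endsWith("*/")) {
    warnings.push("Expected a comment delimited by /** and */");
    return { lines: [], warnings };
  }
  const body = trimmed.slice(3, trimmed.length - 2);
  const lines = body.split(/\r?\n/).map((raw, index) => {
    let text = raw.replace(/^\s+/, "");
    if (text.startsWith("*")) {
      text = text.slice(1); // drop the left-margin star
    } else if (index > 0 && text.length > 0) {
      warnings.push("Line " + (index + 1) + " has no leading star");
    }
    return text.trim();
  });
  // Drop blank lines left over from the opening and closing delimiters.
  while (lines.length > 0 && lines[0] === "") {
    lines.shift();
  }
  while (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return { lines, warnings };
}
```

With this policy a line such as `** bullet` comes out as `* bullet`, which a later Markdown stage would read as a list item. The trade-off is that trimming throws away the indentation that Markdown relies on for indented code blocks and nested lists.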
4. Markdown Links
Markdown supports these constructs:
[Regular Link](http://example.com)
[Cross-reference Link][1]
. . .
[1]: http://b.org

Autolinks are handy: http://example.com
However, if you want an accurate URL detector, it turns out to be a fairly big library dependency.
The Markdown link functionality partially overlaps with JSDoc's {@link} tag, but it's missing support for API item references.
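To show what "API item reference" support might mean in practice, here's a small sketch that splits a `{@link target | text}` tag and classifies the target. Everything here is hypothetical: parseInlineLink, the LinkTarget shape, and especially the heuristic that anything starting with `scheme://` is a URL and everything else is an API item path.

```typescript
// Hypothetical types and helper for illustration; not part of any TSDoc API.
type LinkTarget =
  | { kind: "url"; url: string }
  | { kind: "apiItem"; itemPath: string };

interface ParsedLink {
  target: LinkTarget;
  linkText: string | undefined;
}

function parseInlineLink(tag: string): ParsedLink | undefined {
  const prefix = "{@link";
  if (!tag.startsWith(prefix) || !tag.endsWith("}")) {
    return undefined;
  }
  const inner = tag.slice(prefix.length, tag.length - 1);
  // Require whitespace after "@link" so that e.g. "{@linkcode ...}" is rejected.
  if (!/^\s/.test(inner)) {
    return undefined;
  }
  const pipe = inner.indexOf("|");
  const rawTarget = (pipe === -1 ? inner : inner.slice(0, pipe)).trim();
  const linkText = pipe === -1 ? undefined : inner.slice(pipe + 1).trim();
  if (rawTarget.length === 0) {
    return undefined;
  }
  // Assumption: anything that starts with "scheme://" is a URL.
  const isUrl = /^[A-Za-z][A-Za-z0-9+.-]*:\/\//.test(rawTarget);
  const target: LinkTarget = isUrl
    ? { kind: "url", url: rawTarget }
    : { kind: "apiItem", itemPath: rawTarget };
  return { target, linkText };
}
```

A plain Markdown link can only ever produce the "url" branch; the "apiItem" branch is the part that needs a documentation-aware tag.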
5. Markdown Tables
Markdown tables have a ton of limitations. Many constructs aren't supported inside table cells. You can't even put a newline inside a table cell. CommonMark had a long discussion about this, but so far does not support the pipes-and-dashes table syntax at all. Instead it uses HTML tables. This seems pretty wise.
6. HTML elements
Most Markdown flavors allow HTML mixed into your content. The CommonMark spec has an entire section about this. This is convenient, although HTML is an entire separate grammar with its own complexities. For example, HTML has a completely distinct escaping mechanism from Markdown.
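To illustrate the escaping point, here's a tiny sketch of the two mechanisms side by side. The helper names are hypothetical, and the Markdown version only escapes a handful of characters rather than the full set of ASCII punctuation that CommonMark allows to be backslash-escaped.

```typescript
// Hypothetical helpers for illustration only.

// Markdown: a backslash before an ASCII punctuation character makes it literal.
function escapeMarkdown(text: string): string {
  return text.replace(/[\\`*_\[\]<>]/g, "\\$&");
}

// HTML: reserved characters are replaced by character references.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}
```

The same text needs different treatment depending on which grammar reads it next: `a*b<c` becomes `a\*b\<c` for Markdown but `a*b&lt;c` for HTML.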
Here are a few interesting cases to show some interactions:
/**
* Here's a <!-- @remarks --> tag inside an HTML comment.
*
* Here's a TSDoc tag that {@link MyClass | <!-- } seemingly starts --> an HTML comment.
*
* The `@remarks` tag normally separates two major TSDoc blocks. Is it okay for that
* to appear inside a table?
*
* <table><tr><td>
* @remarks
* </td></tr></table>
*/
Two Possible Solutions
Option 1: Extend an existing CommonMark library
The most natural approach would be for the TSDoc parser to include an integrated CommonMark parser. The two grammars would be mixed together. We definitely don't want to write a CommonMark parser from scratch, so instead the TSDoc library would need to extend an existing library. Markdown-it and Flavormark are possible choices that are both oriented towards custom extensions.
Possible downsides:
- Incorporating full Markdown into the TSDoc AST nodes implies that our doc comment emitter would need to be a full Markdown emitter. (In my experience, correctly emitting Markdown is every bit as tricky as parsing Markdown.)
- To support an entrenched backend with its own opinionated Markdown flavor, this approach wouldn't pass Markdown content through from doc comments; instead the backend would have to parse AST nodes that were emitted back to Markdown. This can be good (if you're rigorous and writing a proper translator) or bad (if you're taking the naive route).
- This approach couples our API contract (e.g. the AST structure) to an external project
- Possibly increases friction for tools that are contemplating taking a dependency on @microsoft/tsdoc
Option 2: Treat full Markdown as a postprocess
A possible shortcut would be to say that TSDoc operates as a first pass that snips out the structures we care about, and returns everything else as plain text. We don't want to get tripped up by backticks, so we make a small list of core constructs that can easily screw up parsing:
- code fences (backticks)
- links
- CommonMark escapes
- HTML elements (but only as tokens, ignoring nesting)
- HTML comments (?)
Anything else is treated as plain text for TSDoc, and gets passed through (to be possibly reinterpreted by another layer of the documentation pipeline).
/**
* This is *bold*. Here's a {@link MyClass | link to `MyClass`}. <div>
* @remarks
* Here's some more stuff. </bad>
*/
Here's some pseudocode for a corresponding AST:
[
  {
    "nodeKind": "textNode",
    "content": "This is *bold*. Here's a " // <-- we ignore the Markdown stars
  },
  {
    "nodeKind": "linkNode",
    "apiItemReference": {
      "itemPath": "MyClass"
    },
    "linkText": [
      {
        "nodeKind": "textNode",
        "content": "link to "
      },
      {
        "nodeKind": "codeFenceNode", // <-- we parse the backticks though
        "children": [
          {
            "nodeKind": "textNode",
            "content": "MyClass"
          }
        ]
      }
    ]
  },
  {
    "nodeKind": "textNode",
    "content": ". "
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "div"
  },
  {
    "nodeKind": "customTagNode",
    "tag": "@remarks"
  },
  {
    "nodeKind": "textNode",
    "content": "Here's some more stuff."
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "bad", // <-- we care about HTML delimiters, but not HTML structure
    "isEndTag": true
  }
]
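For a feel of how small such a first pass could be, here's a minimal sketch that emits nodes in roughly the shape of the pseudocode above. It's deliberately incomplete: the DocNode type and the firstPass function are hypothetical, code spans carry a plain content string instead of children, only single-backtick spans and attribute-less HTML tags are recognized, and inline tags such as {@link} are left in the plain text.

```typescript
// Hypothetical node shapes, loosely following the pseudocode AST above.
type DocNode =
  | { nodeKind: "textNode"; content: string }
  | { nodeKind: "codeFenceNode"; content: string }
  | { nodeKind: "htmlElementNode"; elementName: string; isEndTag: boolean }
  | { nodeKind: "customTagNode"; tag: string };

function firstPass(input: string): DocNode[] {
  const nodes: DocNode[] = [];
  let pending = ""; // plain text accumulated since the last structural node
  const flushText = (): void => {
    if (pending.length > 0) {
      nodes.push({ nodeKind: "textNode", content: pending });
      pending = "";
    }
  };
  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "`") {
      // Code span: everything up to the next backtick is opaque.
      const close = input.indexOf("`", i + 1);
      if (close !== -1) {
        flushText();
        nodes.push({ nodeKind: "codeFenceNode", content: input.slice(i + 1, close) });
        i = close + 1;
        continue;
      }
    } else if (ch === "<") {
      // HTML element as a token only; nesting is not checked.
      const m = /^<(\/?)([A-Za-z][A-Za-z0-9-]*)\s*>/.exec(input.slice(i));
      if (m) {
        flushText();
        nodes.push({ nodeKind: "htmlElementNode", elementName: m[2], isEndTag: m[1] === "/" });
        i += m[0].length;
        continue;
      }
    } else if (ch === "@" && (i === 0 || /\s/.test(input.charAt(i - 1)))) {
      // Block tag: "@word" at the start of the input or after whitespace.
      const m = /^@[A-Za-z]+/.exec(input.slice(i));
      if (m) {
        flushText();
        nodes.push({ nodeKind: "customTagNode", tag: m[0] });
        i += m[0].length;
        continue;
      }
    }
    // Anything unrecognized (including Markdown stars) passes through as text.
    pending += ch;
    i++;
  }
  flushText();
  return nodes;
}
```

Everything the sketch doesn't recognize, including the stars in `*bold*`, ends up untouched inside a textNode for a later stage to interpret.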
Possible downsides:
- The resulting syntax would be fairly counterintuitive for people who assume they're writing real Markdown. All the weird little Markdown edge cases would be handled oddly.
- This model invites a documentation pipeline to do nontrivial syntactic postprocessing. For content authors, the language wouldn't have a unified specification. (This isn't like a templating library that supports proprietary HTML tags. Instead, it's more like if one tool defined HTML without attributes, and then another tried to retrofit attributes on top of it.)
- We might end up having to code a small CommonMark parser (although it would be a subset of the work involved for a parser that handles the full grammar)
- How will the second stage Markdown parser accurately report line numbers for errors?
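On the line number question, one possible answer is for each passed-through text node to remember where it came from, so the second stage can translate its local offsets back into the original file. Here's a minimal sketch under that assumption; the PassthroughText shape and both function names are hypothetical, and it only works if a node's content is a verbatim slice of the source (no comment framing stripped out of the middle).

```typescript
// Hypothetical shapes for illustration; not part of any TSDoc API.
interface SourceRange {
  start: number; // offset of the first character in the original source
  end: number;   // offset just past the last character
}

interface PassthroughText {
  content: string;
  range: SourceRange;
}

interface Position {
  line: number;   // 1-based
  column: number; // 1-based
}

// Converts an absolute offset in `source` to a 1-based line and column.
function offsetToPosition(source: string, offset: number): Position {
  const clamped = Math.min(Math.max(offset, 0), source.length);
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < clamped; i++) {
    if (source.charAt(i) === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: clamped - lineStart + 1 };
}

// Maps an offset inside a node's content back to a position in the source.
function locateInSource(source: string, node: PassthroughText, localOffset: number): Position {
  return offsetToPosition(source, node.range.start + localOffset);
}
```

If the first pass rewrites the text (for example by trimming the left margin), a single start offset is no longer enough and the node would need a per-line or per-character mapping instead.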
What do you think? Originally I was leaning towards Option 1 above, but now I'm wondering if Option 2 might be the better choice.