RFC: Two possibilities for integrating Markdown into TSDoc

# Problem Statement

There are numerous incompatible Markdown flavors.  For this discussion, let's assume "Markdown" means strict [CommonMark](https://github.com/commonmark/CommonMark) unless otherwise specified.

Many people expect to use Markdown notations inside their JSDoc. Writing a Markdown parser is already somewhat tricky, since the grammar is highly sensitive to context (compared to other rich-text formats such as HTML).  Extending it with JSDoc tags causes some interesting collisions and ambiguities.  Some motivating examples:

### 1. Code fences that span tags

```typescript
/**
 * I can use backticks to create a `code fence` that gets highlighted in a typewriter font.
 * 
 * This `@tag` is not a TSDoc tag, since it's inside a code fence.
 * 
 * This {@link MyClass | hyperlink to the `MyClass` base class} should get highlighting
 * in its target text.
 * 
 * This {@link MyClass | example of backtick (`)} has an unbalanced backtick (`)
 * inside a tag.
 */
```

Intuitively we'd expect it to be rendered like this:
___

I can use backticks to create a `code fence` that gets highlighted in a typewriter font.

This `@tag` is not a TSDoc tag, since it's inside a code fence.

This [hyperlink to the `MyClass` base class](#) should get highlighting in its target text.

This [example of backtick (\`)](#) has an unbalanced backtick (`) inside a tag.

___
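
To make the collision concrete, here is a minimal sketch of a scanner that looks for `@tag` tokens while skipping anything between backticks. The function name `findTagsOutsideBackticks` is hypothetical (it is not part of any TSDoc API), and the backtick handling is a deliberately naive on/off toggle rather than real CommonMark code-span matching.

```typescript
// Hypothetical helper for illustration only; not a TSDoc API.
// Returns the @tags that appear outside of backtick-delimited spans.
function findTagsOutsideBackticks(text: string): string[] {
  const tags: string[] = [];
  let insideCode = false;
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === '\\' && !insideCode && i + 1 < text.length) {
      i += 2; // a backslash escape outside code: skip the escaped character
      continue;
    }
    if (ch === '`') {
      insideCode = !insideCode; // naive toggle: every backtick opens or closes a span
      i++;
      continue;
    }
    if (ch === '@' && !insideCode) {
      const match = /^@[A-Za-z]+/.exec(text.slice(i));
      if (match) {
        tags.push(match[0]);
        i += match[0].length;
        continue;
      }
    }
    i++;
  }
  return tags;
}
```

With this toggle, a single unbalanced backtick hides every tag after it, which is exactly the kind of surprise the last example above is meant to illustrate.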


### 2. Stars

Stars have the same problems as backticks, but with even more special cases:

```typescript
/**
 * Markdown would treat these as
 * * bullet
 * * items.
 *
 * Inside code comments, the left margin is sometimes ambiguous:
 ** bullet
 ** items?
 *
 * Markdown confusingly *allows a * inside an emphasis*.
 * Does a *{@link MyClass | * tag}* participate in this?
 */
```

Intuitively we'd expect it to be rendered like this:
___

Markdown would treat these as
* bullet
* items.

Inside code comments, the left margin is sometimes ambiguous:
* bullet
* items?

Markdown confusingly *allows a * inside an emphasis*.
Does a *[\* tag](#)* participate in this?

___

### 3. Whitespace

Markdown assigns special meanings to whitespace indentation.  For example, indenting 4 spaces is equivalent to a ``` block.  Newlines also have lots of special meanings.

This could be fairly confusing inside a code comment, particularly with weird cases like this:

```typescript
/**     Is this indented? */

/** some junk
    Is this indented? */

/**
Is this okay at all? */

/**
Is this star part of the comment?
 * mystery
Or is it a Markdown bullet? 
 */
```

Perhaps TSDoc should issue warnings about malformed comment framing.

Perhaps we should try to disable some of Markdown's indentation rules.  For example, the TSDoc parser could trim whitespace from the start of each line.
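
As a rough sketch of both ideas (warning about framing and trimming each line), the function below strips the `/** ... */` delimiters and the leading `*` margin, trims whitespace from each line, and records a warning for any non-empty line that lacks the `*` margin. The names `stripCommentFraming` and `FramingResult` are hypothetical, and the policy of treating only the first `*` as margin is an assumption, not a settled rule.

```typescript
// Hypothetical sketch; the names and the warning policy are assumptions.
interface FramingResult {
  lines: string[];    // content lines with framing and leading whitespace removed
  warnings: string[]; // messages about suspicious comment framing
}

function stripCommentFraming(comment: string): FramingResult {
  const warnings: string[] = [];
  const trimmed = comment.trim();
  if (!trimmed.startsWith('/**') || !trimmed.endsWith('*/')) {
    return { lines: [], warnings: ['Not a /** ... */ comment'] };
  }
  // Drop the opening "/**" and the closing "*/"
  const body = trimmed.slice(3, trimmed.length - 2);
  const lines: string[] = [];
  body.split(/\r?\n/).forEach((rawLine, index) => {
    let line = rawLine.trim();
    if (index > 0) {
      if (line.startsWith('*')) {
        // Only the first star counts as the margin, so " ** bullet" becomes "* bullet"
        line = line.slice(1).trim();
      } else if (line.length > 0) {
        warnings.push(`Line ${index + 1} does not start with "*"`);
      }
    }
    lines.push(line);
  });
  return { lines, warnings };
}
```

Note that trimming every line this way also throws away the 4-space indentation that Markdown would otherwise treat as a code block, which is the trade-off suggested above.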

### 4. Markdown Links

Markdown supports these constructs:
```markdown
[Regular Link](http://example.com)

[Cross-reference Link][1]
. . .
[1]: http://b.org

![Image Link](http://example.com/a.png)

Autolinks are handy:  http://example.com
However if you want an accurate URL-detector, it turns out to be a fairly big library dependency.
```

The Markdown link functionality partially overlaps with JSDoc's [`{@link}`](http://usejsdoc.org/tags-inline-link.html) tag.  But it's missing support for [API item references](https://github.com/Microsoft/tsdoc/issues/9).
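
To show the overlap, here is a small sketch that normalizes a Markdown inline link and a `{@link}` tag into one record. The `LinkInfo` shape and the `parseLink` function are hypothetical, only the pipe-delimited `{@link target | text}` form is handled, and the rule "anything with a URL scheme is a URL, everything else is an API item" is an assumption made for illustration.

```typescript
// Hypothetical shape for illustration; exactly one of url/apiItemPath is set.
interface LinkInfo {
  linkText: string;
  url?: string;
  apiItemPath?: string;
}

function parseLink(source: string): LinkInfo | undefined {
  // Markdown inline link: [text](url)
  const markdownLink = /^\[([^\]]*)\]\(([^)\s]+)\)$/.exec(source);
  if (markdownLink) {
    return { linkText: markdownLink[1], url: markdownLink[2] };
  }
  // JSDoc-style inline tag: {@link target} or {@link target | text}
  const tagLink = /^\{@link\s+([^|}\s]+)\s*(?:\|\s*([^}]*))?\}$/.exec(source);
  if (tagLink) {
    const target = tagLink[1];
    const linkText = (tagLink[2] !== undefined ? tagLink[2] : target).trim();
    // Assumption: a target with a URL scheme is a URL, otherwise an API item reference
    return /^[a-z][a-z0-9+.-]*:\/\//i.test(target)
      ? { linkText, url: target }
      : { linkText, apiItemPath: target };
  }
  return undefined;
}
```

Only the `{@link}` branch can produce an `apiItemPath`, which is the gap in the Markdown notation described above.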

### 5. Markdown Tables

Markdown tables have a ton of limitations. Many constructs aren't supported inside table cells.  You can't even put a newline inside a table cell.  CommonMark had a [long discussion](https://talk.commonmark.org/t/tables-in-pure-markdown/81) about this, but so far does not support the [pipes-and-dashes](https://help.github.com/articles/organizing-information-with-tables/) table syntax at all.  Instead it uses HTML tables.  This seems pretty wise.

### 6. HTML elements

Most Markdown flavors allow HTML mixed into your content.  The CommonMark spec has an [entire section about this](http://spec.commonmark.org/0.27/#html-blocks).  This is convenient, although HTML is an entire separate grammar with its own complexities. For example, HTML has a completely distinct escaping mechanism from Markdown.

Here are a few interesting cases that show some interactions:

```typescript
/**
 * Here's a <!-- @tag --> tag inside an HTML comment.
 *
 * Here's a TSDoc tag that {@link MyClass |  an HTML comment.
 *
 * The `@remarks` tag normally separates two major TSDoc blocks.  Is it okay for that
 * to appear inside a table?
 *
 * <table><tr><td>
 * @remarks
 * </td></tr></table>
 */
```

# Two Possible Solutions

## Option 1: Extend an existing CommonMark library

The most natural approach would be for the TSDoc parser to include an integrated CommonMark parser.  The two grammars would be mixed together.  We definitely don't want to write a CommonMark parser from scratch, so instead the TSDoc library would need to extend an existing library. [Markdown-it](https://github.com/markdown-it/markdown-it) and [Flavormark](https://github.com/AnyhowStep/flavormark) are possible choices that are both oriented towards custom extensions.

### Possible downsides:

- Incorporating full Markdown into the TSDoc AST nodes implies that our doc comment emitter would need to be a full Markdown emitter.  (In my experience, correctly emitting Markdown is every bit as tricky as parsing Markdown.)
- To support an entrenched backend with its own opinionated Markdown flavor, this approach wouldn't pass Markdown content through from doc comments; instead the backend would have to parse AST nodes that were emitted back to Markdown.  This can be good (if you're rigorous and writing a proper translator) or bad (if you're taking the naive route).
- This approach couples our API contract (e.g. the AST structure) to an external project
- Possibly increases friction for tools that are contemplating taking a dependency on **@microsoft/tsdoc**

## Option 2: Treat full Markdown as a postprocess

A possible shortcut would be to say that TSDoc operates as a first pass that snips out the structures we care about, and returns everything else as plain text.  We don't want to get tripped up by backticks, so we make a small list of core constructs that can easily screw up parsing:

- code fences (backticks)
- links
- CommonMark escapes
- HTML elements (but only as tokens, ignoring nesting)
- HTML comments (?)

Anything else is treated as plain text for TSDoc, and gets passed through (to be possibly reinterpreted by another layer of the documentation pipeline).

```typescript
/**
 * This is *bold*. Here's a {@link MyClass | link to `MyClass`}. <div>
 * @remarks
 * Here's some more stuff. </bad>
 */
```

Here's some pseudocode for a corresponding AST:

```javascript
[
  {
    "nodeKind": "textNode",
    "content": "This is *bold*. Here's a "  // <-- we ignore the Markdown stars
  },
  {
    "nodeKind": "linkNode",
    "apiItemReference": {
      "itemPath": "MyClass"
    },
    "linkText": [
      {
        "nodeKind": "textNode",
        "content": "link to "
      },
      {
        "nodeKind": "codeFenceNode",  // <-- we parse the backticks though
        "children": [
          {
            "nodeKind": "textNode",
            "content": "MyClass"
          }
        ]
      }
    ]
  },
  {
    "nodeKind": "textNode",
    "content": ". "  // <-- the link tag ended at "}", so this is a sibling of the link
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "div"
  },
  {
    "nodeKind": "customTagNode",
    "tag": "@remarks"
  },
  {
    "nodeKind": "textNode",
    "content": "Here's some more stuff."
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "bad", // <-- we care about HTML delimiters, but not HTML structure
    "isEndTag": true
  }
]
```
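
For a sense of how small such a first pass could be, here is a regex-based sketch that produces a flat token list rather than the nested AST above. Everything in it is an assumption made for illustration: the `Token` shape, the `firstPassTokenize` name, and the simplifications (the content of an inline tag is kept as a raw string, and backticks are matched in pairs with no CommonMark subtleties).

```typescript
// Hypothetical token shapes; a real design would likely nest link text as child nodes.
type Token =
  | { kind: 'text'; content: string }
  | { kind: 'codeSpan'; content: string }
  | { kind: 'inlineTag'; tagName: string; content: string }
  | { kind: 'blockTag'; tagName: string }
  | { kind: 'htmlElement'; elementName: string; isEndTag: boolean };

function firstPassTokenize(input: string): Token[] {
  const tokens: Token[] = [];
  // Alternatives, in order: backslash escape, `code span`, {@inline tag}, @blockTag, <html> token
  const pattern =
    /\\([\s\S])|`([^`]*)`|\{(@[A-Za-z]+)\s*([^}]*)\}|(@[A-Za-z]+)|<(\/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>/g;
  let pendingText = '';
  let position = 0;
  const flushText = (): void => {
    if (pendingText.length > 0) {
      tokens.push({ kind: 'text', content: pendingText });
      pendingText = '';
    }
  };
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    // Everything between recognized constructs passes through as plain text
    pendingText += input.slice(position, match.index);
    position = match.index + match[0].length;
    if (match[1] !== undefined) {
      pendingText += match[1]; // an escaped character becomes literal text
    } else if (match[2] !== undefined) {
      flushText();
      tokens.push({ kind: 'codeSpan', content: match[2] });
    } else if (match[3] !== undefined) {
      flushText();
      tokens.push({ kind: 'inlineTag', tagName: match[3], content: match[4].trim() });
    } else if (match[5] !== undefined) {
      flushText();
      tokens.push({ kind: 'blockTag', tagName: match[5] });
    } else {
      flushText();
      // Only the delimiter is recorded; nesting and attributes are ignored
      tokens.push({ kind: 'htmlElement', elementName: match[7], isEndTag: match[6] === '/' });
    }
  }
  pendingText += input.slice(position);
  flushText();
  return tokens;
}
```

Run on the text of the example comment (with the comment framing already removed), this sketch leaves `*bold*` untouched inside a text token and reports `</bad>` as an end-tag token without checking that it matches `<div>`.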

### Possible downsides:
- The resulting syntax would be fairly counterintuitive for people who assume they're writing real Markdown.  All the weird little Markdown edge cases would be handled oddly.
- This model invites a documentation pipeline to do nontrivial syntactic postprocessing.  For content authors, the language wouldn't have a unified specification.  (This isn't like a templating library that supports proprietary HTML tags.  Instead, it's more like if one tool defined HTML without attributes, and then another tried to retrofit attributes on top of it.)
- We might end up having to code a small CommonMark parser (although it would be a subset of the work involved for a parser that handles the full grammar)
- How will the second stage Markdown parser accurately report line numbers for errors?
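
On the line-number question, one possible answer is for the first pass to keep a per-line mapping back to the source file, so that a position reported by the second stage can be translated. The sketch below assumes the second stage reports 1-based line and column numbers relative to the extracted text; `MappedLine`, `extractMappedLines`, and `toSourcePosition` are hypothetical names.

```typescript
// Hypothetical sketch of a per-line source mapping; not an existing TSDoc API.
interface MappedLine {
  text: string;         // line content with the comment framing removed
  sourceLine: number;   // 1-based line number in the original file
  sourceColumn: number; // 1-based column where `text` begins in that line
}

function extractMappedLines(comment: string, startLine: number): MappedLine[] {
  return comment.split(/\r?\n/).map((raw, index) => {
    // Leading framing: optional indentation, then "/**" or "*", then one optional space
    const prefix = /^\s*(?:\/\*\*|\*(?!\/))?[ \t]?/.exec(raw);
    const prefixLength = prefix ? prefix[0].length : 0;
    // Also drop a trailing "*/" if the comment closes on this line
    const text = raw.slice(prefixLength).replace(/\s*\*\/\s*$/, '');
    return { text, sourceLine: startLine + index, sourceColumn: prefixLength + 1 };
  });
}

// Translates a position in the extracted text back to a position in the source file.
function toSourcePosition(
  lines: MappedLine[],
  line: number,
  column: number
): { line: number; column: number } {
  const mapped = lines[line - 1];
  if (mapped === undefined) {
    throw new RangeError(`No extracted line ${line}`);
  }
  return { line: mapped.sourceLine, column: mapped.sourceColumn + column - 1 };
}
```

This only works if the second stage receives the text line by line as extracted; any first-pass rewriting that merges or reorders lines would need a richer mapping.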

What do you think?  Originally I was leaning towards #1 above, but now I'm wondering if #2 might be a better option.
