Description
Based on the problems encountered in the issue #12 thread, we are concluding that TSDoc cannot reasonably be based directly on the CommonMark spec. The goals conflict:
- CommonMark goal: ("common" = union) Provide a standardized algorithm for parsing every familiar markup notation. It's okay if the resulting syntax rules are impossible for humans to memorize, because mistakes can be easily corrected using the editor's interactive preview. If a syntax is occasionally misinterpreted, the consequence is incorrect formatting on the web site, which is a relatively minor issue.
- TSFM goal: ("common" = intersection) Provide a familiar syntax that is very easy for humans to memorize, so that a layperson can predict exactly how their markup will be rendered (by every possible downstream doc pipeline). Computer source code is handled by many different viewers which may not support interactive preview. If a syntax is occasionally misinterpreted, the consequence is that a tag such as `@beta` or `@internal` may be ignored by the parser, which could potentially cause a serious issue (e.g. an escalation from an enterprise customer whose service was interrupted because of a broken API contract).
Hypothesis: For every TSFM construct, there exists a normalized form that will be parsed identically by CommonMark and TSDoc. In "strict mode" the TSDoc library can issue warnings for expressions that are not in normalized form. Assuming the author eliminates all such warnings, a documentation pipeline can pass unmodified TSDoc content through to a backend CommonMark engine and have confidence that the rendered output will be correct.
Below are some proposed TSFM restrictions:
Whitespace generally doesn't matter
This principle is very easy for people to remember, and eliminates a ton of edge cases.
Example 1:
/**
* TSFM considers this to be an HTML element, whereas CommonMark does not:
* <element attribute="@tag"
*
* />
*/
Example 1 converted to normalized form (so CommonMark interprets it the same as TSDoc):
/**
* TSFM considers this to be an HTML element, whereas CommonMark does not:
* <element attribute="@tag"
* />
*/
Example 2:
/**
* CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
*
*     **bold** @tag
*/
Example 2 converted to normalized form (so CommonMark interprets it the same as TSDoc):
/**
* CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
*
* **bold** @tag
*/
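As a rough illustration of how "strict mode" might flag the Example 2 case, here is a minimal sketch. The function name `findIndentedCodeLines` is hypothetical and not part of the TSDoc library. It assumes the `/**`, `*`, and `*/` framing (including the single space after each star) has already been stripped, and it only approximates the CommonMark rule that a line indented by four or more spaces can begin an indented code block after a blank line but cannot interrupt a paragraph.

```typescript
// Hypothetical helper, not part of the TSDoc library.
// Returns the zero-based indexes of lines that CommonMark would probably
// treat as an indented code block, so that a strict mode could warn on them.
// Assumption: indentation uses spaces only (tabs are not handled).
function findIndentedCodeLines(lines: string[]): number[] {
  const flagged: number[] = [];
  let previousWasBlank = true; // the start of the comment acts like a blank line
  let previousWasFlagged = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const isBlank = line.trim() === "";
    // Count the leading spaces on this line.
    const indent = line.length - line.replace(/^ +/, "").length;
    // An indented line only counts if it starts a new block after a blank
    // line, or directly continues a block that was already flagged.
    const startsOrContinuesBlock = previousWasBlank || previousWasFlagged;
    const isFlagged = !isBlank && indent >= 4 && startsOrContinuesBlock;

    if (isFlagged) {
      flagged.push(i);
    }
    previousWasFlagged = isFlagged;
    previousWasBlank = isBlank;
  }
  return flagged;
}
```

Under these assumptions, the indented `**bold** @tag` line of Example 2 would be reported, and the normalized form would produce no warnings.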
Stars cannot be nested arbitrarily
TSDoc will support stars for bold/italics, based on 6 types of tokens that can be recognized by the lexical analyzer with minimal lookahead:
- Opening italics single-star, e.g. `*text` is interpreted as `<i>text`
- Closing italics single-star, e.g. `text*` is interpreted as `text</i>`
- Opening bold double-star, e.g. `**text` is interpreted as `<b>text`
- Closing bold double-star, e.g. `text**` is interpreted as `text</b>`
- Opening bold+italics triple-star, e.g. `***text` is interpreted as if `<b+i>text`
- Closing bold+italics triple-star, e.g. `text***` is interpreted as if `text</b+i>`
Other patterns are NOT interpreted as star tokens, e.g. `text * text *` contains literal asterisks, as does `****a****`. A letter in the middle of a word can never be styled using stars, e.g. `Toys*R*Us` contains literal asterisk characters. A single-star followed by a double-star can be closed by a triple-star (e.g. `*italics **bold+italics***` is seen as `<i>italics<b>bold+italics</b+i>`). Star markup is prohibited from spanning multiple lines.
Other characters (e.g. underscore) are NOT supported by TSDoc as synonyms for bold/italics.
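The six token types could be recognized with one character of lookahead and lookbehind. The following is only a sketch under stated assumptions, not the TSDoc implementation: the names `StarToken` and `classifyStars` are hypothetical, a "word character" is assumed to mean an ASCII letter or digit, a backslash is assumed to escape the next character, and the input is a single line (since star markup cannot span lines).

```typescript
// Hypothetical types and function, for illustration only.
type StarKind = "open" | "close" | "literal";

interface StarToken {
  kind: StarKind;
  length: number; // number of consecutive stars in the run
  index: number;  // position of the first star in the line
}

function classifyStars(line: string): StarToken[] {
  const tokens: StarToken[] = [];
  // Assumption: only ASCII letters and digits count as word characters.
  const isWordChar = (c: string | undefined): boolean =>
    c !== undefined && /[A-Za-z0-9]/.test(c);
  // The start and end of the line behave like whitespace.
  const isSpace = (c: string | undefined): boolean =>
    c === undefined || /\s/.test(c);

  let i = 0;
  while (i < line.length) {
    if (line[i] === "\\") {
      i += 2; // assumption: a backslash escapes the next character
      continue;
    }
    if (line[i] !== "*") {
      i += 1;
      continue;
    }
    // Measure the run of consecutive stars.
    let end = i;
    while (end < line.length && line[end] === "*") {
      end += 1;
    }
    const length = end - i;
    const prev: string | undefined = i > 0 ? line[i - 1] : undefined;
    const next: string | undefined = end < line.length ? line[end] : undefined;

    let kind: StarKind = "literal";
    if (length <= 3) {
      if (!isWordChar(prev) && !isSpace(next)) {
        kind = "open"; // e.g. "*text", "**text", "***text"
      } else if (!isWordChar(next) && !isSpace(prev)) {
        kind = "close"; // e.g. "text*", "text**", "text***"
      }
    }
    tokens.push({ kind, length, index: i });
    i = end;
  }
  return tokens;
}
```

With these rules, `classifyStars("*italics **bold+italics***")` yields an opening single-star, an opening double-star, and a closing triple-star, while every star run in `text * text *`, `****a****`, and `Toys*R*Us` is classified as literal. Flattening nested or unbalanced tokens would be a separate parser step that this sketch does not attempt.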
Example 3:
/**
* *CommonMark sees italics, but TSDoc does not because
* its stars cannot span lines.*
*
* CommonMark sees italics here: __proto__
*
* Common**M**ark sees a boldfaced M, but TSDoc sees literal stars.
*/
Example 3 normalized form:
/**
* \*CommonMark sees italics, but TSDoc does not because
* its stars cannot span lines.\*
*
* CommonMark sees italics here: \_\_proto\_\_ (or better to use `__proto__`)
*
* Common\*\*M\*\*ark sees a boldfaced M, but TSDoc sees literal stars.
*
* If you really need to boldface a letter, use HTML elements: Common<b>M</b>ark.
*/
Example 4:
/**
* For **A **B** C** the B is double-boldfaced according to CommonMark.
* The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
* to `<b>A **B</b> C**` because it doesn't allow nesting.
*
* Improper balancing also gets ignored, e.g. for **A *B** C* the TSDoc tokenizer
* will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
* Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
*/
Example 4 normalized form:
/**
* For **A \*\*B** C\*\* the B is double-boldfaced according to CommonMark.
* The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
* to `<b>A **B</b> C**` because it doesn't allow nesting.
*
* Improper balancing also gets ignored, e.g. for **A \*B** C\* the TSDoc tokenizer
* will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
* Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
*/
Code spans are simplified
For TSFM, an unescaped backtick always starts a code span, which ends at the next backtick. Whitespace doesn't matter.
Example 5:
/**
* `Both TSDoc and CommonMark
* agree this is code.`
*
* before `CommonMark disagrees
*
* if a line is skipped, though.` after
*
* `But this is not code because the backtick is unterminated
*/
Example 5 normalized form:
/**
* `Both TSDoc and CommonMark
* agree this is code.`
*
* before `CommonMark disagrees
* if a line is skipped, though.` after
*
* \`But this is not code because the backtick is unterminated
*/
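The code span rule is simple enough to sketch in a few lines. This is an illustration under assumptions, not the TSDoc implementation: `scanCodeSpans` and `CodeSpanScan` are hypothetical names, the comment framing is assumed to be stripped already, and backslashes inside a code span are assumed to be ordinary characters rather than escapes.

```typescript
// Hypothetical result type, for illustration only.
interface CodeSpanScan {
  spans: string[];               // contents of each terminated code span
  unterminatedAt: number | null; // index of a backtick with no closing partner
}

function scanCodeSpans(text: string): CodeSpanScan {
  const spans: string[] = [];
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === "\\") {
      i += 2; // an escaped character (such as \`) never starts a code span
      continue;
    }
    if (ch !== "`") {
      i += 1;
      continue;
    }
    // An unescaped backtick: the span runs to the next backtick,
    // regardless of any whitespace or line breaks in between.
    const close = text.indexOf("`", i + 1);
    if (close === -1) {
      // No closing backtick: report it so strict mode could warn.
      return { spans, unterminatedAt: i };
    }
    spans.push(text.slice(i + 1, close));
    i = close + 1;
  }
  return { spans, unterminatedAt: null };
}
```

A strict mode built on this could suggest escaping the backtick whenever `unterminatedAt` is not `null`, which matches the last case in Example 5.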
Blocks don't nest
I want to say that ">" blockquotes should not be supported at all, since the whitespace handling for these constructs is highly counterintuitive. Instead we would recommend `<blockquote>` HTML tags for this scenario.
Lists are a very useful and common scenario. However, CommonMark lists also have a lot of counterintuitive rules regarding handling of whitespace.
A simplification would be to say that TSFM interprets any line that starts with "-" as being a list item, and the list ends with the first blank line. No other character (e.g. "*" or "+") can be used to create lists. If complicated nesting is required, then HTML tags such as `<ul>` and `<li>` should be used to avoid any confusion.
Example 6:
/**
* A list with 3 things
* - item 1
* - item 2
* spans several
* lines
* - item 3
*
* Two lists separated by a newline
* - list 1 with one item
*
* - list 2 with one item
*
* + not a list item
* + not a list item
*
* CommonMark surprisingly considers this to be a list whose first item is another list,
* whereas TSDoc sees a minus character as the first item:
* - - foo
*/
Example 6 normalized form:
/**
* A list with 3 things
* - item 1
* - item 2
* spans several
* lines
* - item 3
*
* Two lists separated by a newline
* - list 1 with one item
* <!-- CommonMark requires an HTML comment to separate two lists -->
* - list 2 with one item
*
* \+ not a list item
* \+ not a list item
*
* CommonMark surprisingly considers this to be a list whose first item is another list,
* whereas TSDoc sees a minus character as the first item:
* - \- foo
*/
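The proposed list rule can be sketched as follows. The function name `parseLists` is hypothetical, the comment framing is assumed to be stripped already, and treating a non-blank, non-dash line inside a list as a continuation of the previous item is an assumption drawn from the "spans several lines" case in Example 6.

```typescript
// Hypothetical helper, for illustration only.
// Returns one array of item strings per list found in the input lines.
function parseLists(lines: string[]): string[][] {
  const lists: string[][] = [];
  let current: string[] | null = null;

  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") {
      // The first blank line ends the current list.
      if (current !== null) {
        lists.push(current);
        current = null;
      }
    } else if (trimmed.startsWith("-")) {
      // Any line starting with "-" is a list item; "*" and "+" are not.
      if (current === null) {
        current = [];
      }
      current.push(trimmed.slice(1).trim());
    } else if (current !== null) {
      // Assumption: other lines inside a list continue the previous item.
      current[current.length - 1] += " " + trimmed;
    }
    // Lines outside a list are ordinary paragraph text and are ignored here.
  }
  if (current !== null) {
    lists.push(current);
  }
  return lists;
}
```

Applied to the content of Example 6, this sketch produces four lists: the three-item list (with "item 2 spans several lines" joined into one item), two single-item lists split by the blank line, and a final list whose only item is the text `- foo`.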