markdown-it design principles

Data flow

Input data is parsed via nested chains of rules. There are 3 nested chains: core, block and inline:

core
    core.rule1 (normalize)
    ...
    core.ruleX

    block
        block.rule1 (blockquote)
        ...
        block.ruleX

    core.ruleX1 (intermediate rule that applies on block tokens, nothing yet)
    ...
    core.ruleXX

    inline (applied to each block token with "inline" type)
        inline.rule1 (text)
        ...
        inline.ruleX

    core.ruleYY (applies to all tokens)
    ... (abbreviation, footnote, typographer, linkifier)

The result of parsing is a list of tokens, which is passed to the renderer to generate the HTML content.
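As a rough illustration of this flow, the sketch below parses a short document and lists the resulting token types. It assumes the markdown-it-py package with its "commonmark" preset; the sample text is made up.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")

# parse() runs the rule chains and returns the flat token list, without rendering
tokens = md.parse("# Title\n\nSome text")

token_types = [token.type for token in tokens]
print(token_types)
```

Each block here shows up as an open token, an "inline" token that carries the text, and a close token.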

These tokens can themselves be parsed again to generate more tokens (for example, a list token can be divided into multiple inline tokens).

An env sandbox can be used alongside the tokens to inject external variables into your parsers and renderers.
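A minimal sketch of passing a value through env to a renderer rule. The "shout" key is a hypothetical name chosen for this example and is not defined by the library; escapeHtml is assumed to be importable from markdown_it.common.utils.

```python
from markdown_it import MarkdownIt
from markdown_it.common.utils import escapeHtml

def render_text(self, tokens, idx, options, env):
    content = tokens[idx].content
    # "shout" is our own (hypothetical) env key, injected by the caller
    if env.get("shout"):
        content = content.upper()
    return escapeHtml(content)

md = MarkdownIt("commonmark")
md.add_render_rule("text", render_text)

plain = md.render("hello")
loud = md.render("hello", {"shout": True})
print(plain, loud)
```

The same dictionary is handed to the parser and to every renderer rule, so it can carry per-call settings without global state.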

Each chain (core / block / inline) uses an independent state object when parsing data, so each parsing operation is independent and can be disabled on the fly.

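Token stream

Instead of a traditional AST we use a lower-level data representation: tokens. The difference is simple:

  • Tokens are a simple flat sequence (a list).

  • Opening and closing tags are separate tokens.

  • There are special token objects, “inline containers”, which hold nested tokens: sequences with inline markup (bold, italic, text, ...).

See the token class for details about each token's content.

In total, a token stream is:

  • On the top level: a list of paired or single “block” tokens:

    • open/close for headings, lists, blockquotes, paragraphs, ...

    • code blocks, fenced blocks, horizontal rules, HTML blocks, inline containers

  • Each inline container token has a .children property with a nested token stream for the inline content:

    • open/close for strong, em, link, code, ...

    • text, line breaks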
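The structure described above can be inspected directly. This is a small sketch assuming markdown-it-py's "commonmark" preset; it prints the top-level tokens with their nesting value (1 for open, 0 for single, -1 for close) and the children of the inline container.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
tokens = md.parse("Some **bold** text")

# top level: paired block tokens around one inline container
top_level = [(token.type, token.nesting) for token in tokens]
print(top_level)

# the inline container keeps its own nested token stream in .children
inline_types = [child.type for child in tokens[1].children]
print(inline_types)
```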

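Why not an AST? Because our tasks do not need one. We follow the KISS principle. If you wish, you can call the parser without a renderer and convert the token stream to an AST.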
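One possible way to do that conversion is to fold the flat stream into nested nodes using each token's nesting value. The to_tree helper and the dict-based node shape below are hypothetical, chosen only for illustration; they are not part of the library.

```python
from markdown_it import MarkdownIt

def to_tree(tokens):
    """Fold a flat token stream into nested dicts (a hypothetical AST shape)."""
    root = {"type": "root", "children": []}
    stack = [root]
    for token in tokens:
        if token.nesting == 1:
            # opening token: start a new node, e.g. "strong_open" -> "strong"
            name = token.type[:-5] if token.type.endswith("_open") else token.type
            node = {"type": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif token.nesting == -1:
            # closing token: the current node is complete
            stack.pop()
        else:
            # single token; inline containers bring their own nested stream
            node = {
                "type": token.type,
                "content": token.content,
                "children": to_tree(token.children or [])["children"],
            }
            stack[-1]["children"].append(node)
    return root

md = MarkdownIt("commonmark")
tree = to_tree(md.parse("Some **bold** text"))
paragraph = tree["children"][0]
inline = paragraph["children"][0]
print([child["type"] for child in inline["children"]])
```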

规则

Rules are functions that do “magic” with parser state objects. A rule is associated with one or more chains and is unique. For instance, a blockquote rule is associated with the blockquote, paragraph, heading and list chains.

Rules are managed by name via Ruler instances and can be enabled or disabled through MarkdownIt methods.
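A short sketch of toggling a rule by name, assuming the markdown-it-py MarkdownIt methods enable, disable and get_active_rules, and the rule name "emphasis" from the "commonmark" preset.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
with_emphasis = md.render("*hi*")

# turn the rule off by name; the asterisks are then kept as plain text
md.disable("emphasis")
without_emphasis = md.render("*hi*")
active_inline = md.get_active_rules()["inline"]

# and back on again
md.enable("emphasis")
restored = md.render("*hi*")

print(with_emphasis, without_emphasis, restored)
```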

Note that some rules have a validation mode: in this mode a rule does not modify the token stream and only looks ahead for the end of a token. This reflects an important design principle: the token stream is “write only” during the block and inline parse stages.

Parsers are designed to keep rules independent of each other. You can safely enable or disable them, or add new ones. There are no universal recipes for creating new rules: designing distributed state machines with good data isolation is a tricky business. But you can study existing rules and plugins to see possible approaches.
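As one possible approach, here is a minimal sketch of a custom inline rule with a validation mode. The "mention" rule, its regular expression and the /users/ URL scheme are all hypothetical; the sketch assumes markdown-it-py's inline rule signature (state, silent) and that "@" stops the built-in text rule.

```python
import re

from markdown_it import MarkdownIt
from markdown_it.common.utils import escapeHtml

MENTION_RE = re.compile(r"@(\w+)")

def mention_rule(state, silent):
    # cheap check first: this rule only starts at "@"
    if state.src[state.pos] != "@":
        return False
    match = MENTION_RE.match(state.src, state.pos, state.posMax)
    if not match:
        return False
    # validation mode: report the match and advance, but write no tokens
    if not silent:
        token = state.push("mention", "", 0)
        token.content = match.group(1)
    state.pos = match.end()
    return True

def render_mention(self, tokens, idx, options, env):
    name = escapeHtml(tokens[idx].content)
    return f'<a href="/users/{name}">@{name}</a>'

md = MarkdownIt("commonmark")
md.inline.ruler.before("emphasis", "mention", mention_rule)
md.add_render_rule("mention", render_mention)

html = md.render("hi @bob!")
print(html)
```

The rule only appends to the token stream and never rewrites earlier tokens, which keeps it independent of the other rules.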

In complex cases you can also ask for help in the issue tracker. The condition is simple: it should be clear from your ticket that you have studied the docs and sources and tried to solve the problem yourself. We never refuse to help real developers.

Renderer

After the token stream is generated, it’s passed to a renderer. The renderer then plays all the tokens, passing each one to the rule with the same name as the token type.

Renderer rules are located in md.renderer.rules[name] and are simple functions with the same signature:

# Signature shared by all renderer rules; each returns an HTML string.
def function(renderer, tokens, idx, options, env):
    return htmlResult
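To see which token types have a dedicated rule, the rules mapping can be inspected. This sketch assumes the default HTML renderer of markdown-it-py, where token types without their own entry (assumed here for paragraph_open) are handled by the generic renderToken method.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")

# names of the token types that have a dedicated renderer rule
rule_names = sorted(md.renderer.rules)
print(rule_names)

has_image_rule = "image" in md.renderer.rules
has_paragraph_rule = "paragraph_open" in md.renderer.rules
print(has_image_rule, has_paragraph_rule)
```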

In many cases, renderer rules make it easy to change the output without touching the parser. For example, let’s replace images that point to Vimeo with the player’s iframe:

import re

from markdown_it import MarkdownIt

# Matches Vimeo video URLs; group 2 captures the numeric video id.
vimeoRE = re.compile(r'^https?://(www\.)?vimeo\.com/(\d+)($|/)')

def render_vimeo(self, tokens, idx, options, env):
    token = tokens[idx]

    # Match once and reuse the result.
    match = vimeoRE.match(token.attrs["src"])
    if match:
        ident = match[2]

        return ('<div class="embed-responsive embed-responsive-16by9">\n' +
                '  <iframe class="embed-responsive-item" src="//player.vimeo.com/video/' +
                ident + '"></iframe>\n' +
                '</div>\n')
    # Not a Vimeo URL: fall back to the default image renderer.
    return self.image(tokens, idx, options, env)

md = MarkdownIt("commonmark")
md.add_render_rule("image", render_vimeo)
print(md.render("![](https://www.vimeo.com/123)"))

Here is another example, showing how to add target="_blank" to all links:

from markdown_it import MarkdownIt

def render_blank_link(self, tokens, idx, options, env):
    tokens[idx].attrSet("target", "_blank")

    # pass token to default renderer.
    return self.renderToken(tokens, idx, options, env)

md = MarkdownIt("commonmark")
md.add_render_rule("link_open", render_blank_link)
print(md.render("[a]\n\n[a]: b"))

Note that if you only need to add attributes, you can do so without overriding the renderer. For example, you can update tokens in the core chain. That is slower than a direct renderer override, but can be simpler.
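A minimal sketch of that alternative: a core rule that walks the finished token stream and sets the attribute on every link_open token. The rule name "target_blank" is a hypothetical choice; the sketch assumes core rules receive a state object exposing state.tokens.

```python
from markdown_it import MarkdownIt

def add_target_blank(state):
    # core rules see the whole stream; links live inside inline containers
    for token in state.tokens:
        if token.type != "inline":
            continue
        for child in token.children or []:
            if child.type == "link_open":
                child.attrSet("target", "_blank")

md = MarkdownIt("commonmark")
# append to the end of the core chain, after the inline chain has run
md.core.ruler.push("target_blank", add_target_blank)

html = md.render("[a](b)")
print(html)
```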

You can also write your own renderer to generate formats other than HTML, such as JSON or XML. You can even use it to generate an AST.
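A sketch of a non-HTML renderer, assuming MarkdownIt accepts a renderer_cls argument and calls render(tokens, options, env) on the instance. The JSONRenderer class and its output shape are hypothetical.

```python
import json

from markdown_it import MarkdownIt

class JSONRenderer:
    """Hypothetical renderer that dumps the token stream as JSON."""

    __output__ = "json"

    def __init__(self, parser=None):
        pass

    def render(self, tokens, options, env):
        return json.dumps([self._to_dict(token) for token in tokens])

    def _to_dict(self, token):
        return {
            "type": token.type,
            "content": token.content,
            "children": [self._to_dict(child) for child in token.children or []],
        }

md = MarkdownIt("commonmark", renderer_cls=JSONRenderer)
output = md.render("hi")
data = json.loads(output)
print(data[1]["children"][0])
```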

Summary

This was mentioned in Data flow, but let’s repeat the sequence:

  1. Blocks are parsed, and the top level of the token stream is filled with block tokens.

  2. The content of inline containers is parsed, filling their .children properties.

  3. Rendering happens.
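The steps above can be sketched by running the parser and the renderer separately, with a transformation in between. The upper-casing step is an arbitrary illustration; the sketch assumes markdown-it-py's md.parse and md.renderer.render.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
env = {}

# steps 1 and 2: block parsing, then inline parsing into .children
tokens = md.parse("Hello *world*", env)

# in between: an extra transformation on the token stream
for token in tokens:
    for child in token.children or []:
        if child.type == "text":
            child.content = child.content.upper()

# step 3: rendering
html = md.renderer.render(tokens, md.options, env)
print(html)
```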

Somewhere in between you can apply additional transformations :) . The full content of each chain can be seen at the top of the parser_core.py, parser_block.py and parser_inline.py files.
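The chains can also be listed at runtime. This sketch assumes the get_all_rules method of markdown-it-py's MarkdownIt; the exact rule names vary between versions.

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")

# maps each chain name to the ordered list of its rule names
all_rules = md.get_all_rules()
for chain, names in all_rules.items():
    print(chain, names)

core_rules = all_rules["core"]
```

In the core chain, the block rule is expected before the inline rule, matching the nesting shown in Data flow.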

You can also change the output directly in the renderer for many simple cases.