myst_parser.parsers.parse_html
A simple but complete HTML to Abstract Syntax Tree (AST) parser.
The AST can also reproduce the HTML text.
Example:
>>> text = '<div class="note"><p>text</p></div>'
>>> ast = tokenize_html(text)
>>> list(ast.walk(include_self=True))
[Root(''), Tag('div', {'class': 'note'}), Tag('p'), Data('text')]
>>> str(ast)
'<div class="note"><p>text</p></div>'
>>> str(ast[0][0])
'<p>text</p>'
Note: optional tags are not accounted for (see https://html.spec.whatwg.org/multipage/syntax.html#optional-tags)
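The note about optional tags can be illustrated with the standard-library html.parser.HTMLParser, which HtmlToAst subclasses. The sketch below is not myst_parser code: EventLogger is a hypothetical helper that only records the tag events the base parser reports. It shows that an omitted </li> never produces an end-tag event, so a tree builder that relies on those events has nothing that tells it to close the first <li>.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record start/end tag events as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# The closing </li> tags are optional in HTML and are omitted here.
logger.feed("<ul><li>one<li>two</ul>")
logger.close()

print(logger.events)
# [('start', 'ul'), ('start', 'li'), ('start', 'li'), ('end', 'ul')]
```

How the resulting AST nests such input is not stated on this page, so treat documents that omit end tags with care.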
1. Module Contents
1.1. Classes
Attribute | This class holds the tag's attributes.
Element | An Element of the xml/html document.
Root | The root of the AST tree.
Tag | Represent xml/html tags of the form: <name key="value" ...> ... </name>.
XTag | Represent XHTML style tags with no children, like <img src="t.gif" />
VoidTag | Represent tags with no children, only start tag, like <img src="t.gif" >
Data | Represent data inside xml/html documents, like raw text.
Declaration | Represent declarations, like <!DOCTYPE html>
Comment | Represent HTML comments
Pi | Represent processing instructions like <?xml-stylesheet ?>
Char | Represent character codes like: &#0;
Entity | Represent entities like &amp;
Tree | The engine class to generate the AST tree.
HtmlToAst | The tokenizer class.
1.2. Functions
1.3. API
- class myst_parser.parsers.parse_html.Attribute
Bases:
dict
This class holds the tag's attributes.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- class myst_parser.parsers.parse_html.Element(name: str = '', attr: dict | None = None)
Bases:
collections.abc.MutableSequence
An Element of the xml/html document.
All xml/html entities inherit from this class.
Initialization
Initialise the element.
- property parent: myst_parser.parsers.parse_html.Element | None
Return parent.
- property children: list[myst_parser.parsers.parse_html.Element]
Return copy of children.
- reset_children(children: list[myst_parser.parsers.parse_html.Element], deepcopy: bool = False)
- insert(index: int, item: myst_parser.parsers.parse_html.Element)
- deepcopy() -> myst_parser.parsers.parse_html.Element
Recursively copy and remove parent.
- abstract render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) -> str
Return an HTML string representation of the element.
- Parameters:
tag_overrides -- A dictionary mapping tag names to render functions, which override the normal render format for those tags.
- walk(include_self: bool = False) -> collections.abc.Iterator[myst_parser.parsers.parse_html.Element]
Walk through the xml/html AST.
- strip(inplace: bool = False, recurse: bool = False) -> myst_parser.parsers.parse_html.Element
Return a copy with all Data tokens that contain only whitespace or newlines removed.
- find(identifier: str | type[myst_parser.parsers.parse_html.Element], attrs: dict | None = None, classes: collections.abc.Iterable[str] | None = None, include_self: bool = False, recurse: bool = True) -> collections.abc.Iterator[myst_parser.parsers.parse_html.Element]
Find all elements that match name and specific attributes.
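As a rough illustration of what walk and find do, here is a minimal sketch of depth-first traversal and name/class matching. Node is a hypothetical stand-in written for this example, not the myst_parser Element class, and its find accepts only a name and a list of classes, which is narrower than the signature documented above.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


class Node:
    """Hypothetical stand-in for an AST element (not the myst_parser class)."""

    def __init__(self, name: str, attrs: dict | None = None):
        self.name = name
        self.attrs = attrs or {}
        self.children: list[Node] = []

    def append(self, child: Node) -> Node:
        self.children.append(child)
        return child

    def walk(self, include_self: bool = False) -> Iterator[Node]:
        """Yield nodes depth-first, in document order."""
        if include_self:
            yield self
        for child in self.children:
            yield from child.walk(include_self=True)

    def find(self, name: str, classes: Iterable[str] | None = None) -> Iterator[Node]:
        """Yield descendants with this name that carry every requested class."""
        wanted = set(classes or ())
        for node in self.walk():
            present = set(node.attrs.get("class", "").split())
            if node.name == name and wanted <= present:
                yield node


# Build a small tree by hand: root > div.note.warning > p > data, plus div.other
root = Node("#root")
note = root.append(Node("div", {"class": "note warning"}))
para = note.append(Node("p"))
para.append(Node("#data"))
root.append(Node("div", {"class": "other"}))

walked = [node.name for node in root.walk(include_self=True)]
matches = list(root.find("div", classes=["note"]))
print(walked)   # ['#root', 'div', 'p', '#data', 'div']
print(matches == [note])  # True
```

Yielding lazily keeps the traversal cheap when a caller only needs the first match.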
- class myst_parser.parsers.parse_html.Root(name: str = '', attr: dict | None = None)
Bases:
myst_parser.parsers.parse_html.Element
The root of the AST tree.
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Tag(name: str = '', attr: dict | None = None)
Bases:
myst_parser.parsers.parse_html.Element
Represent xml/html tags of the form: <name key="value" ...> ... </name>.
Initialization
Initialise the element.
- render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) -> str
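The tag_overrides parameter of render takes callables of the form (element, dict) -> str, keyed by tag name. The sketch below shows one way such a dispatch could work. Node and render_b_as_strong are hypothetical names, and the contents of the dict passed to the override (here the overrides plus any extra keyword arguments) are an assumption, since the signature only says it is a dict.

```python
from __future__ import annotations

from collections.abc import Callable
from html import escape


class Node:
    """Hypothetical element: a name, attributes, and child nodes or strings."""

    def __init__(self, name: str, attrs: dict | None = None, children: list | None = None):
        self.name = name
        self.attrs = attrs or {}
        self.children = children or []

    def render(
        self,
        tag_overrides: dict[str, Callable[[Node, dict], str]] | None = None,
        **kwargs,
    ) -> str:
        # An override registered for this tag name replaces the default format.
        if tag_overrides and self.name in tag_overrides:
            # Assumption: the dict argument carries the render options onward.
            return tag_overrides[self.name](self, {"tag_overrides": tag_overrides, **kwargs})
        attrs = "".join(f' {key}="{escape(str(value))}"' for key, value in self.attrs.items())
        inner = "".join(
            child.render(tag_overrides, **kwargs) if isinstance(child, Node) else escape(child)
            for child in self.children
        )
        return f"<{self.name}{attrs}>{inner}</{self.name}>"


def render_b_as_strong(node: Node, options: dict) -> str:
    """Hypothetical override: emit <strong> wherever the tree contains <b>."""
    inner = "".join(
        child.render(**options) if isinstance(child, Node) else escape(child)
        for child in node.children
    )
    return f"<strong>{inner}</strong>"


tree = Node("p", {"class": "x"}, ["a ", Node("b", children=["bold"])])
default_html = tree.render()
custom_html = tree.render({"b": render_b_as_strong})
print(default_html)  # <p class="x">a <b>bold</b></p>
print(custom_html)   # <p class="x">a <strong>bold</strong></p>
```

Passing the overrides down through the recursive calls means one dictionary affects matching tags at any depth.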
- class myst_parser.parsers.parse_html.XTag(name: str = '', attr: dict | None = None)
Bases:
myst_parser.parsers.parse_html.Element
Represent XHTML style tags with no children, like <img src="t.gif" />
Initialization
Initialise the element.
- render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) -> str
- class myst_parser.parsers.parse_html.VoidTag(name: str = '', attr: dict | None = None)
Bases:
myst_parser.parsers.parse_html.Element
Represent tags with no children, only start tag, like <img src="t.gif" >
Initialization
Initialise the element.
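The difference between the XTag and VoidTag forms is visible at the level of the standard-library html.parser.HTMLParser events. The sketch below is not myst_parser code: TagKinds is a hypothetical logger showing that the XHTML-style <img ... /> arrives through handle_startendtag, while the bare <img ... > arrives through handle_starttag with no matching end tag.

```python
from html.parser import HTMLParser


class TagKinds(HTMLParser):
    """Hypothetical helper: log which handler each tag reaches."""

    def __init__(self):
        super().__init__()
        self.kinds = []

    def handle_starttag(self, tag, attrs):
        self.kinds.append(("start", tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags such as <img ... />
        self.kinds.append(("startend", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.kinds.append(("end", tag))


kinds_parser = TagKinds()
kinds_parser.feed('<img src="t.gif" /><img src="t.gif" >')
kinds_parser.close()

for kind in kinds_parser.kinds:
    print(kind)
# ('startend', 'img', {'src': 't.gif'})
# ('start', 'img', {'src': 't.gif'})
```

Because the second form never produces an end event, a tokenizer needs some list of void element names to know the tag should not stay open; the void_elements attribute of HtmlToAst presumably serves that purpose, though its contents are not shown on this page.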
- class myst_parser.parsers.parse_html.TerminalElement(data: str)
Bases:
myst_parser.parsers.parse_html.Element
- deepcopy() -> myst_parser.parsers.parse_html.TerminalElement
Copy and remove parent.
- class myst_parser.parsers.parse_html.Data(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent data inside xml/html documents, like raw text.
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Declaration(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent declarations, like <!DOCTYPE html>
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Comment(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent HTML comments
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Pi(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent processing instructions like <?xml-stylesheet ?>
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Char(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent character codes like: &#0;
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Entity(data: str)
Bases:
myst_parser.parsers.parse_html.TerminalElement
Represent entities like &amp;
Initialization
Initialise the element.
- class myst_parser.parsers.parse_html.Tree(name: str = '')
The engine class to generate the AST tree.
Initialization
Initialise Tree
- last() -> myst_parser.parsers.parse_html.Element
Return the last pointer, which points to the current tag scope.
- nest_tag(name: str, attrs: dict)
Nest a given tag at the bottom of the tree, using the last pointer on the stack.
- nest_terminal(klass: type[myst_parser.parsers.parse_html.TerminalElement], data: str)
Nest the data onto the tree.
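The descriptions of last, nest_tag, and nest_terminal suggest a stack of open scopes. The following is a minimal sketch of that idea under stated assumptions: Node and TreeBuilder are hypothetical, the method bodies are guesses rather than the Tree implementation, nest_terminal here takes only the data (not a class), and close_tag is an invented helper for leaving a scope.

```python
from __future__ import annotations


class Node:
    """Hypothetical AST node: a name, attributes, optional text, and children."""

    def __init__(self, name: str, attrs: dict | None = None, data: str = ""):
        self.name = name
        self.attrs = attrs or {}
        self.data = data
        self.children: list[Node] = []


class TreeBuilder:
    """Hypothetical stack-based builder; names mirror Tree, bodies are guesses."""

    def __init__(self, name: str = ""):
        self.root = Node(name)
        self.stack = [self.root]  # the last entry is the currently open scope

    def last(self) -> Node:
        return self.stack[-1]

    def nest_tag(self, name: str, attrs: dict) -> None:
        node = Node(name, attrs)
        self.last().children.append(node)
        self.stack.append(node)  # later items nest inside this tag

    def nest_terminal(self, data: str) -> None:
        # Terminal items never open a scope, so the stack is unchanged.
        self.last().children.append(Node("#data", data=data))

    def close_tag(self, name: str) -> None:
        """Hypothetical helper: pop back to just above the nearest open `name`."""
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].name == name:
                del self.stack[index:]
                return


# Replay the events for <div class="note"><p>text</p></div>
builder = TreeBuilder()
builder.nest_tag("div", {"class": "note"})
builder.nest_tag("p", {})
builder.nest_terminal("text")
builder.close_tag("p")
builder.close_tag("div")

div = builder.root.children[0]
print(div.name, div.children[0].name, div.children[0].children[0].data)
# div p text
```

Keeping the root at the bottom of the stack means last() always has something to return, even after every tag has been closed.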
- class myst_parser.parsers.parse_html.HtmlToAst(name: str = '', convert_charrefs: bool = False)
Bases:
html.parser.HTMLParser
The tokenizer class.
Initialization
Initialize and reset this instance.
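If convert_charrefs is True, all character references are automatically converted to the corresponding Unicode characters. The phrase "the default" in the inherited html.parser.HTMLParser docstring refers to the base class, where the default is True; in the HtmlToAst signature above, convert_charrefs defaults to False.
- void_elements = None
- feed(source: str) -> myst_parser.parsers.parse_html.Root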
Parse the source string.
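The effect of convert_charrefs can be checked directly on the standard-library base class. The sketch below uses a hypothetical RefLogger subclass of html.parser.HTMLParser, not HtmlToAst itself. With convert_charrefs=False, entity and character references are reported through their own handlers, which is what would allow a tokenizer to keep them as separate items such as Entity and Char; with True, they are folded into the surrounding text.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper: log text, entity, and character-reference events."""

    def __init__(self, convert_charrefs: bool):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entity", name))

    def handle_charref(self, name):
        self.events.append(("char", name))


def events_for(source: str, convert_charrefs: bool) -> list:
    parser = RefLogger(convert_charrefs)
    parser.feed(source)
    parser.close()
    return parser.events


kept = events_for("a &amp; b &#38; c", convert_charrefs=False)
converted = events_for("a &amp; b &#38; c", convert_charrefs=True)

print(kept)
# [('data', 'a '), ('entity', 'amp'), ('data', ' b '), ('char', '38'), ('data', ' c')]
print("".join(text for kind, text in converted if kind == "data"))
# a & b & c
```

Keeping the references unconverted is what makes it possible to reproduce the original HTML text from the parsed events.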