weblib.etree

Functions to process content of lxml nodes.

weblib.etree.clean_html(html, safe_attrs=('src', 'href'), input_encoding=None, output_encoding=None, **kwargs)

Fix HTML structure and remove non-allowed attributes from all tags.

weblib.etree.clone_node(elem)

Create clone of Element node.

The resulting clone is not connected to the original DOM tree.
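The detached-clone behaviour can be illustrated with a minimal sketch. It is not weblib's implementation: it assumes a deep copy is an acceptable model of "clone", and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra dependencies.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><p id='a'>hello</p></div>")
original = root.find("p")

# A deep copy duplicates the element and its descendants
# without attaching the copy to the source tree.
clone = copy.deepcopy(original)
clone.set("id", "b")
clone.text = "changed"

# The original element is unaffected by edits made to the clone.
original_id = original.get("id")
original_text = original.text
```

Because the clone lives outside the tree, it can be modified or rendered freely without side effects on the parsed document.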

Replace all links with span tags and drop href attributes.
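The link replacement described above might look roughly like the following sketch. The helper name links_to_spans is hypothetical, and xml.etree.ElementTree stands in for lxml; this is an illustration of the idea, not weblib's code.

```python
import xml.etree.ElementTree as ET

def links_to_spans(root):
    """Hypothetical helper: turn every <a> into <span> and drop its href."""
    # Materialise the list first so renaming tags cannot disturb iteration.
    for node in list(root.iter("a")):
        node.attrib.pop("href", None)
        node.tag = "span"
    return root

root = ET.fromstring('<p><a href="/1">x</a> and <a class="c" href="/2">y</a></p>')
links_to_spans(root)
result = ET.tostring(root, encoding="unicode")
# result == '<p><span>x</span> and <span class="c">y</span></p>'
```

Other attributes, such as class, are left in place; only href is dropped.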

weblib.etree.drop_node(tree, xpath, keep_content=False)

Find sub-node by its xpath and remove it.
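A rough sketch of the two removal modes follows. It assumes that keep_content=False removes the element with everything inside it while the text that follows it stays, and that keep_content=True removes only the tag and leaves its text and children in place. Those semantics are an assumption, the helper name drop is hypothetical, and xml.etree.ElementTree (with its limited XPath subset) stands in for lxml.

```python
import xml.etree.ElementTree as ET

def drop(root, path, keep_content=False):
    """Hypothetical helper: remove the first element matching path."""
    target = root.find(path)
    if target is None:
        return root
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[target]
    index = list(parent).index(target)
    parent.remove(target)

    kept = list(target) if keep_content else []
    lead = (target.text or "") if keep_content else ""
    trail = target.tail or ""
    if kept:
        # Splice the children into the parent where the element used to be.
        for offset, child in enumerate(kept):
            parent.insert(index + offset, child)
        kept[-1].tail = (kept[-1].tail or "") + trail
    else:
        lead += trail
    if lead:
        # Surviving text attaches to the previous sibling or to the parent.
        if index > 0:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + lead
        else:
            parent.text = (parent.text or "") + lead
    return root

SAMPLE = "<div><p>a</p><span>b<i>c</i>d</span>e</div>"
removed = ET.tostring(drop(ET.fromstring(SAMPLE), ".//span"), encoding="unicode")
unwrapped = ET.tostring(
    drop(ET.fromstring(SAMPLE), ".//span", keep_content=True), encoding="unicode"
)
# removed   == "<div><p>a</p>e</div>"
# unwrapped == "<div><p>a</p>b<i>c</i>de</div>"
```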

weblib.etree.find_node_number(node, ignore_spaces=False, make_int=True)

Find number in text content of the node.
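How the two flags could interact is sketched below. The helper first_number is hypothetical, and the interpretation is an assumption: ignore_spaces strips whitespace before searching (so "1 250" is read as one number) and make_int converts the match to int. xml.etree.ElementTree stands in for lxml.

```python
import re
import xml.etree.ElementTree as ET

def first_number(elem, ignore_spaces=False, make_int=True):
    """Hypothetical helper: return the first run of digits in the element's text."""
    text = "".join(elem.itertext())
    if ignore_spaces:
        # Remove all whitespace so digit groups split by spaces are joined.
        text = re.sub(r"\s+", "", text)
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError("no number found in node text")
    return int(match.group()) if make_int else match.group()

node = ET.fromstring("<span>Price: <b>1 250</b> USD</span>")
plain = first_number(node)                        # 1
joined = first_number(node, ignore_spaces=True)   # 1250
as_text = first_number(node, make_int=False)      # "1"
```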

weblib.etree.get_node_text(node, smart=False, normalize_space=True)

Extract text content of the node and all its descendants.

In smart mode, get_node_text inserts spaces between adjacent tags (<tag><another tag>) and ignores the content of script and style tags.

In non-smart mode, this function just returns the text_content() of the node with normalized spaces.
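The difference between the two modes can be sketched as follows. The helper node_text is hypothetical and only approximates the behaviour described above; xml.etree.ElementTree stands in for lxml, with "".join(elem.itertext()) playing the role of text_content().

```python
import xml.etree.ElementTree as ET

def node_text(elem, smart=False, normalize_space=True):
    """Hypothetical helper approximating the smart and non-smart modes."""
    if smart:
        parts = []

        def walk(node):
            # Skip the content of script and style elements entirely.
            if node.tag in ("script", "style"):
                return
            if node.text:
                parts.append(node.text)
            for child in node:
                walk(child)
                # The tail belongs to the parent's flow, so it is always kept.
                if child.tail:
                    parts.append(child.tail)

        walk(elem)
        # Joining with a space separates text of adjacent tags.
        text = " ".join(parts)
    else:
        text = "".join(elem.itertext())
    if normalize_space:
        text = " ".join(text.split())
    return text

root = ET.fromstring(
    "<div><b>foo</b><i>bar</i><script>var x = 1;</script> baz</div>"
)
plain_text = node_text(root)              # "foobarvar x = 1; baz"
smart_text = node_text(root, smart=True)  # "foo bar baz"
```

In the non-smart result the words from adjacent tags run together and the script body leaks into the text, which is what smart mode is meant to avoid.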

weblib.etree.parse_html(html, encoding='utf-8')

Parse html into ElementTree node.

weblib.etree.render_html(node, encoding=None, make_unicode=None)

Render Element node.

weblib.etree.truncate_html(html, limit, encoding='utf-8')

Truncate html data to specified length and then fix broken tags.
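The truncate-then-repair idea can be sketched with the standard library alone. This is a different mechanism from whatever weblib uses internally: the hypothetical helper truncate_and_close cuts the string, discards a dangling partial tag, and appends closing tags for the elements still open. The VOID_TAGS set is a deliberately short assumption, not a complete list of void elements.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (short, non-exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class OpenTagTracker(HTMLParser):
    """Record which elements are still open after the fed markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching tag.
            while self.open_tags.pop() != tag:
                pass

def truncate_and_close(html, limit):
    """Hypothetical helper: cut html to limit characters, then close open tags."""
    cut = html[:limit]
    # If the cut landed inside a tag, drop the incomplete tag.
    last_open = cut.rfind("<")
    if last_open > cut.rfind(">"):
        cut = cut[:last_open]
    tracker = OpenTagTracker()
    tracker.feed(cut)
    tracker.close()
    closing = "".join("</%s>" % tag for tag in reversed(tracker.open_tags))
    return cut + closing

SOURCE = "<div><p>Hello <b>world</b></p></div>"
mid_text = truncate_and_close(SOURCE, 18)  # "<div><p>Hello <b>w</b></p></div>"
mid_tag = truncate_and_close(SOURCE, 16)   # "<div><p>Hello </p></div>"
```

Note that the limit here counts characters of markup, not of visible text, so the repaired output can be slightly longer than the limit once the closing tags are appended.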

weblib.etree.truncate_tail(node, xpath)

Find a sub-node by its xpath, then remove it together with all adjacent nodes that follow it.
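A minimal sketch of this kind of tail truncation is shown below. It assumes "adjacent nodes following" means the following siblings of the found node, along with the text after them; the helper name truncate_after is hypothetical, and xml.etree.ElementTree (with its limited XPath subset) stands in for lxml.

```python
import xml.etree.ElementTree as ET

def truncate_after(root, path):
    """Hypothetical helper: remove the matched element and its following siblings."""
    target = root.find(path)
    if target is None:
        return root
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[target]
    index = list(parent).index(target)
    # Removing an element also discards its tail text, so everything
    # from the target onwards disappears from the parent.
    for child in list(parent)[index:]:
        parent.remove(child)
    return root

root = ET.fromstring(
    "<div><p>one</p><hr class='cut'/>tail<p>two</p></div>"
)
truncate_after(root, ".//hr")
truncated = ET.tostring(root, encoding="unicode")
# truncated == "<div><p>one</p></div>"
```

Content that precedes the matched node, including text, is left untouched.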