Functions to process content of lxml nodes.

weblib.etree.clean_html(html, safe_attrs=('src', 'href'), input_encoding=None, output_encoding=None, **kwargs)

Fix HTML structure and remove non-allowed attributes from all tags.
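
The attribute-filtering half of this description can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs anywhere, it does not repair HTML structure, and it is not weblib's actual implementation. The helper name strip_unsafe_attrs is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_unsafe_attrs(root, safe_attrs=("src", "href")):
    """Remove, in place, every attribute whose name is not in safe_attrs.

    Hypothetical helper: a stand-in for the attribute filtering described
    above, not the weblib implementation.
    """
    for element in root.iter():
        # Copy the attribute names first because the mapping is modified.
        for name in list(element.attrib):
            if name not in safe_attrs:
                del element.attrib[name]
    return root


root = ET.fromstring('<div class="x"><a href="/home" onclick="evil()">Home</a></div>')
strip_unsafe_attrs(root)
cleaned = ET.tostring(root, encoding="unicode")
# cleaned == '<div><a href="/home">Home</a></div>'
```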


Create clone of Element node.

The resulting clone is not connected to the original DOM tree.
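
What "not connected" means can be shown with a small sketch. It assumes that a deep copy is an acceptable stand-in for the cloning described here, and it uses the standard library's ElementTree instead of lxml; it is not weblib's code.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring("<ul><li>one</li><li>two</li></ul>")
original = root[0]

# A deep copy duplicates the element and its descendants. The copy has no
# place in the original tree, so editing it leaves the original untouched.
clone = copy.deepcopy(original)
clone.text = "changed"

original_text = original.text     # still "one"
children_count = len(root)        # still 2: the clone was not attached
```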

Replace all links with span tags and drop href attributes.
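
A minimal sketch of this transformation, assuming that "links" means a elements and that all other attributes are left alone. It uses the standard library's ElementTree, and the helper name links_to_spans is hypothetical.

```python
import xml.etree.ElementTree as ET


def links_to_spans(root):
    """Rename every <a> element to <span> and drop its href attribute.

    Hypothetical helper; attributes other than href are kept as they are.
    """
    # Materialise the matches first so renaming does not affect iteration.
    for element in list(root.iter("a")):
        element.tag = "span"
        element.attrib.pop("href", None)
    return root


root = ET.fromstring('<p>See <a href="/docs" title="Docs">the docs</a> now.</p>')
links_to_spans(root)
without_links = ET.tostring(root, encoding="unicode")
# without_links == '<p>See <span title="Docs">the docs</span> now.</p>'
```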

weblib.etree.drop_node(tree, xpath, keep_content=False)

Find sub-node by its xpath and remove it.
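
The sketch below shows one way to remove a sub-node located by a path expression. It makes several assumptions: it uses the standard library's ElementTree (which supports only a subset of XPath and has no parent pointers), it removes only the first match, it keeps the text that follows the removed node, and it does not cover keep_content. The helper name drop_first_match is hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_first_match(root, path):
    """Remove the first element matching path, keeping the text after it."""
    target = root.find(path)
    if target is None:
        return root
    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(target)
    if parent is None:  # the match is the root itself; nothing to detach from
        return root
    index = list(parent).index(target)
    # Move the trailing text onto the previous sibling (or the parent) so
    # it survives the removal. This is a choice made by this sketch.
    if target.tail:
        if index == 0:
            parent.text = (parent.text or "") + target.tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + target.tail
    parent.remove(target)
    return root


root = ET.fromstring("<div><p>keep</p><script>x()</script> tail<p>also</p></div>")
drop_first_match(root, ".//script")
dropped = ET.tostring(root, encoding="unicode")
# dropped == '<div><p>keep</p> tail<p>also</p></div>'
```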

weblib.etree.find_node_number(node, ignore_spaces=False, make_int=True)

Find number in text content of the node.
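
A rough sketch of the idea, under stated assumptions: "number" is taken to mean the first run of digits, ignore_spaces is taken to mean that whitespace is removed before searching (so "1 250" reads as 1250), and make_int is taken to mean that the match is converted to int. These readings come from the parameter names only, and the helper first_number is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def first_number(node, ignore_spaces=False, make_int=True):
    """Return the first run of digits in the node's text content.

    Hypothetical helper; the meaning given to both flags is an assumption.
    """
    text = "".join(node.itertext())
    if ignore_spaces:
        text = re.sub(r"\s+", "", text)
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError("no number found in node text")
    return int(match.group()) if make_int else match.group()


node = ET.fromstring("<span>Price: <b>1 250</b> USD</span>")
plain = first_number(node)                         # 1
joined = first_number(node, ignore_spaces=True)    # 1250
as_text = first_number(node, make_int=False)       # "1"
```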

weblib.etree.get_node_text(node, smart=False, normalize_space=True)

Extract text content of the node and all its descendants.

In smart mode, get_node_text inserts spaces between adjacent tags (<tag><another tag>) and ignores the content of script and style tags.

In non-smart mode, this function just returns the node's text_content() with normalized spaces.
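
The difference between the two modes can be sketched as follows. The sketch uses the standard library's ElementTree, with itertext() standing in for lxml's text_content(), and it is not weblib's implementation. The names node_text and _smart_parts are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def _smart_parts(element):
    """Yield text chunks, skipping the content of script and style tags."""
    if element.tag not in ("script", "style"):
        if element.text:
            yield element.text
        for child in element:
            yield from _smart_parts(child)
            # Tail text belongs to the parent, so it is kept even when the
            # child itself is a script or style element.
            if child.tail:
                yield child.tail


def node_text(node, smart=False, normalize_space=True):
    if smart:
        # Joining with a space separates text coming from adjacent tags.
        text = " ".join(_smart_parts(node))
    else:
        text = "".join(node.itertext())
    if normalize_space:
        text = re.sub(r"\s+", " ", text).strip()
    return text


node = ET.fromstring(
    "<div><b>Hello</b><i>world</i><script>var x = 1;</script></div>"
)
plain_text = node_text(node)              # "Helloworldvar x = 1;"
smart_text = node_text(node, smart=True)  # "Hello world"
```

Note that this sketch puts a space between every pair of chunks in smart mode, including between plain text and an inline tag.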

weblib.etree.parse_html(html, encoding='utf-8')

Parse html into ElementTree node.

weblib.etree.render_html(node, encoding=None, make_unicode=None)

Render Element node.

weblib.etree.truncate_html(html, limit, encoding='utf-8')

Truncate html data to specified length and then fix broken tags.
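
The "truncate, then fix broken tags" idea can be sketched with the standard library alone. The assumptions are significant: the limit is counted in characters of a str, a tag cut in half is discarded, and "fixing" means appending closing tags for elements still open. The names truncate_and_close and _OpenTagTracker are hypothetical, and weblib's function may behave differently.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _OpenTagTracker(HTMLParser):
    """Keep a stack of the elements that are still open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching start tag.
            while self.open_tags.pop() != tag:
                pass


def truncate_and_close(html, limit):
    cut = html[:limit]
    # If the cut landed inside a tag, drop the incomplete tag.
    last_open = cut.rfind("<")
    if last_open > cut.rfind(">"):
        cut = cut[:last_open]
    tracker = _OpenTagTracker()
    tracker.feed(cut)
    tracker.close()
    closers = "".join("</%s>" % tag for tag in reversed(tracker.open_tags))
    return cut + closers


page = "<div><p>Hello <b>world</b></p></div>"
mid_text = truncate_and_close(page, 20)  # "<div><p>Hello <b>wor</b></p></div>"
mid_tag = truncate_and_close(page, 16)   # "<div><p>Hello </p></div>"
```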

weblib.etree.truncate_tail(node, xpath)

Find a sub-node by its xpath and remove it, together with all adjacent nodes that follow it.
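
A minimal sketch of this behaviour, assuming that "adjacent nodes following" means the following siblings under the same parent, together with any text after them. It uses the standard library's ElementTree with its XPath subset, and the helper name cut_from is hypothetical.

```python
import xml.etree.ElementTree as ET


def cut_from(root, path):
    """Remove the first element matching path and every sibling after it."""
    target = root.find(path)
    if target is None:
        return root
    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(target)
    if parent is None:  # the match is the root itself
        return root
    index = list(parent).index(target)
    # Removing an element also discards its tail text, so everything from
    # the match to the end of the parent disappears.
    for sibling in list(parent)[index:]:
        parent.remove(sibling)
    return root


root = ET.fromstring("<div><p>intro</p><hr/><p>footer</p>trailing</div>")
cut_from(root, ".//hr")
truncated = ET.tostring(root, encoding="unicode")
# truncated == '<div><p>intro</p></div>'
```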