weblib.etree
Functions to process content of lxml nodes.
-
weblib.etree.clean_html(html, safe_attrs=('src', 'href'), input_encoding=None, output_encoding=None, **kwargs)
Fix the HTML structure and remove non-allowed attributes from all tags.
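The attribute-filtering half of this behaviour can be illustrated with a short sketch. It is not weblib's implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, the helper name strip_attributes is hypothetical, and it does not attempt to repair broken markup.

```python
import xml.etree.ElementTree as ET


def strip_attributes(root, safe_attrs=('src', 'href')):
    """Hypothetical helper: drop every attribute not listed in safe_attrs."""
    for node in root.iter():
        # Copy the attribute names first, because we delete while looping.
        for name in list(node.attrib):
            if name not in safe_attrs:
                del node.attrib[name]
    return root


root = ET.fromstring(
    '<div id="main" style="color:red">'
    '<a href="/x" onclick="track()">x</a>'
    '<img src="a.png" width="10" />'
    '</div>'
)
strip_attributes(root)
print(ET.tostring(root, encoding='unicode'))
```

After the call only the src and href attributes remain; the div element ends up with no attributes at all.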
-
weblib.etree.clone_node(elem)
Create a clone of an Element node.
The resulting clone is not connected to the original DOM tree.
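What "not connected" means in practice can be shown with a deep copy. The sketch below assumes copy.deepcopy as the cloning mechanism and uses the standard library's ElementTree rather than lxml, so it illustrates the idea and is not necessarily how clone_node is written.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><p class="note">Hello <b>world</b></p></div>')
original = root.find('p')

# A deep copy duplicates the element together with all of its descendants.
clone = copy.deepcopy(original)

# Changes to the clone stay local to the clone.
clone.set('class', 'changed')
clone.find('b').text = 'clone'

print(original.get('class'))    # note
print(original.find('b').text)  # world
```

The clone is a separate object that is not a child of root, so it can be modified or serialized without touching the source tree.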
-
weblib.etree.disable_links(elem)
Replace all links with span tags and drop their href attributes.
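A minimal sketch of the same transformation, assuming "links" means a elements and using the standard library's ElementTree in place of lxml. It illustrates the effect described above and is not taken from weblib's source.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<p>See <a href="/docs" class="ref">the docs</a> '
    'and <a href="/faq">FAQ</a>.</p>'
)

# Collect the links first, then rename each one and drop its href.
for link in list(root.iter('a')):
    link.tag = 'span'
    link.attrib.pop('href', None)

print(ET.tostring(root, encoding='unicode'))
```

Other attributes, the link text and the text that follows each link are left as they were.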
-
weblib.etree.drop_node(tree, xpath, keep_content=False)
Find a sub-node by its XPath and remove it.
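The difference between removing a node outright and removing only its tag can be sketched as follows. This is an illustration under assumptions, not weblib's code: drop_matching is a hypothetical name, the standard library's ElementTree (with its limited path syntax) stands in for lxml, and keep_content=True is assumed to mean "keep the node's text and children in place". With lxml.html elements, the drop_tree() and drop_tag() methods provide these two behaviours directly.

```python
import xml.etree.ElementTree as ET


def drop_matching(tree, path, keep_content=False):
    """Hypothetical stand-in: remove every element matched by path."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parent_of = {child: parent for parent in tree.iter() for child in parent}
    for node in tree.findall(path):
        parent = parent_of[node]
        index = list(parent).index(node)
        parent.remove(node)
        # The tail text always survives; with keep_content the node's own
        # text and children survive as well.
        children = list(node) if keep_content else []
        leading = (node.text or '') if keep_content else ''
        trailing = node.tail or ''
        if children:
            children[-1].tail = (children[-1].tail or '') + trailing
            for offset, child in enumerate(children):
                parent.insert(index + offset, child)
        else:
            leading += trailing
        # Attach the leading text to whatever now precedes the gap.
        if index == 0:
            parent.text = (parent.text or '') + leading
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or '') + leading


markup = '<div>a<span>b<i>c</i>d</span>e<em>f</em>g</div>'

dropped = ET.fromstring(markup)
drop_matching(dropped, './/span')
print(ET.tostring(dropped, encoding='unicode'))

unwrapped = ET.fromstring(markup)
drop_matching(unwrapped, './/span', keep_content=True)
print(ET.tostring(unwrapped, encoding='unicode'))
```

The first call discards the span and everything inside it; the second removes only the span tag and leaves "b", the i element and "d" in the parent.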
-
weblib.etree.find_node_number(node, ignore_spaces=False, make_int=True)
Find a number in the text content of the node.
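The sketch below shows one plausible reading of the two flags, inferred from their names only: ignore_spaces removes whitespace before searching, and make_int converts the match to int. The helper first_number is hypothetical, it takes "number" to mean the first run of digits, it raises ValueError where weblib may raise something else, and it relies on the standard library instead of lxml.

```python
import re
import xml.etree.ElementTree as ET


def first_number(node, ignore_spaces=False, make_int=True):
    """Hypothetical helper: return the first run of digits in the node text."""
    text = ''.join(node.itertext())
    if ignore_spaces:
        # Assumed meaning: "1 250" should be read as the single number 1250.
        text = re.sub(r'\s+', '', text)
    match = re.search(r'\d+', text)
    if match is None:
        raise ValueError('no number found in node text')
    value = match.group(0)
    return int(value) if make_int else value


node = ET.fromstring('<p>Price: <b>1 250</b> USD</p>')
print(first_number(node))                                     # 1
print(first_number(node, ignore_spaces=True))                 # 1250
print(first_number(node, ignore_spaces=True, make_int=False)) # '1250' as str
```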
-
weblib.etree.get_node_text(node, smart=False, normalize_space=True)
Extract the text content of the node and all its descendants.
In smart mode, get_node_text inserts spaces between adjacent tags (<tag><another tag>) and ignores the content of script and style tags.
In non-smart mode, this function just returns the node's text_content() with normalized spaces.
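The two modes can be compared with a small sketch. It mirrors the behaviour described above using the standard library's ElementTree, with itertext() standing in for lxml's text_content(); the names node_text and collect_chunks are hypothetical and the code is not weblib's implementation.

```python
import xml.etree.ElementTree as ET

SKIPPED_TAGS = {'script', 'style'}


def collect_chunks(node, chunks):
    """Gather text pieces in document order, skipping script/style bodies."""
    if node.tag in SKIPPED_TAGS:
        return
    if node.text:
        chunks.append(node.text)
    for child in node:
        collect_chunks(child, chunks)
        # Tail text belongs to the parent, so keep it even for skipped tags.
        if child.tail:
            chunks.append(child.tail)


def node_text(node, smart=False, normalize_space=True):
    """Hypothetical sketch of the smart and non-smart modes."""
    if smart:
        chunks = []
        collect_chunks(node, chunks)
        text = ' '.join(chunks)  # a space between every pair of text pieces
    else:
        text = ''.join(node.itertext())
    if normalize_space:
        text = ' '.join(text.split())
    return text


root = ET.fromstring(
    '<div><h1>Title</h1><p>First</p>'
    '<script>var x = 1;</script><p>Second  line</p></div>'
)
print(node_text(root))              # TitleFirstvar x = 1;Second line
print(node_text(root, smart=True))  # Title First Second line
```

Non-smart mode glues neighbouring elements together and includes the script body, which is why smart mode is usually the better choice for text meant to be read by people.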