# HG changeset patch
# User paulb
# Date 1123097701 0
# Node ID 0ee67e24c3832b44f32b253b47718361b25ce279
# Parent d527a2402d64dab4f8348095fddd4725783f9861
[project @ 2005-08-03 19:34:59 by paulb]
Added tentative HTML parsing support.
Introduced various libxml2 options to explicitly prevent things like network
access and output of errors/warnings.

diff -r d527a2402d64 -r 0ee67e24c383 libxml2dom/__init__.py
--- a/libxml2dom/__init__.py	Fri Jul 22 22:52:40 2005 +0000
+++ b/libxml2dom/__init__.py	Wed Aug 03 19:35:01 2005 +0000
@@ -360,21 +360,21 @@
 def createDocument(namespaceURI, localName, doctype):
     return Document(Node_createDocument(namespaceURI, localName, doctype))
 
-def parse(stream_or_string):
+def parse(stream_or_string, html=0):
     if hasattr(stream_or_string, "read"):
         stream = stream_or_string
-        return parseString(stream.read())
+        return parseString(stream.read(), html)
     else:
-        return parseFile(stream_or_string)
+        return parseFile(stream_or_string, html)
 
-def parseFile(s):
-    return Document(Node_parseFile(s))
+def parseFile(s, html=0):
+    return Document(Node_parseFile(s, html))
 
-def parseString(s):
-    return Document(Node_parseString(s))
+def parseString(s, html=0):
+    return Document(Node_parseString(s, html))
 
-def parseURI(uri):
-    return Document(Node_parseURI(uri))
+def parseURI(uri, html=0):
+    return Document(Node_parseURI(uri, html))
 
 def toString(node, encoding=None):
     return Node_toString(node.as_native_node(), encoding)
diff -r d527a2402d64 -r 0ee67e24c383 libxml2dom/macrolib/macrolib.py
--- a/libxml2dom/macrolib/macrolib.py	Fri Jul 22 22:52:40 2005 +0000
+++ b/libxml2dom/macrolib/macrolib.py	Wed Aug 03 19:35:01 2005 +0000
@@ -334,35 +334,48 @@
     Node_appendChild(d, root)
     return d
 
-def parse(stream_or_string):
+def parse(stream_or_string, html=0):
     if hasattr(stream_or_string, "read"):
         stream = stream_or_string
-        return parseString(stream.read())
+        return parseString(stream.read(), html)
     else:
-        return parseFile(stream_or_string)
+        return parseFile(stream_or_string, html)
 
-def parseFile(s):
+def parseFile(s, html=0):
     # NOTE: Switching off validation and remote DTD resolution.
-    context = libxml2mod.xmlCreateFileParserCtxt(s)
-    libxml2mod.xmlParserSetValidate(context, 0)
-    libxml2mod.xmlCtxtUseOptions(context, 0)
-    libxml2mod.xmlParseDocument(context)
-    return libxml2mod.xmlParserGetDoc(context)
+    if not html:
+        context = libxml2mod.xmlCreateFileParserCtxt(s)
+        libxml2mod.xmlParserSetValidate(context, 0)
+        libxml2mod.xmlCtxtUseOptions(context, XML_PARSE_NOERROR | XML_PARSE_NOWARNING | XML_PARSE_NONET)
+        libxml2mod.xmlParseDocument(context)
+        return libxml2mod.xmlParserGetDoc(context)
+    else:
+        return libxml2mod.htmlReadFile(s, None, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET)
 
-def parseString(s):
+def parseString(s, html=0):
     # NOTE: Switching off validation and remote DTD resolution.
-    context = libxml2mod.xmlCreateMemoryParserCtxt(s, len(s))
-    libxml2mod.xmlParserSetValidate(context, 0)
-    libxml2mod.xmlCtxtUseOptions(context, 0)
-    libxml2mod.xmlParseDocument(context)
-    return libxml2mod.xmlParserGetDoc(context)
+    if not html:
+        context = libxml2mod.xmlCreateMemoryParserCtxt(s, len(s))
+        libxml2mod.xmlParserSetValidate(context, 0)
+        libxml2mod.xmlCtxtUseOptions(context, XML_PARSE_NOERROR | XML_PARSE_NOWARNING | XML_PARSE_NONET)
+        libxml2mod.xmlParseDocument(context)
+        return libxml2mod.xmlParserGetDoc(context)
+    else:
+        # NOTE: URL given as None.
+        html_url = None
+        return libxml2mod.htmlReadMemory(s, len(s), html_url, None,
+            HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET)
 
-def parseURI(uri):
-    context = libxml2mod.xmlCreateURLParserCtxt(url)
-    libxml2mod.xmlParserSetValidate(context, 0)
-    libxml2mod.xmlCtxtUseOptions(context, 0)
-    libxml2mod.xmlParseDocument(context)
-    return libxml2mod.xmlParserGetDoc(context)
+def parseURI(uri, html=0):
+    # NOTE: Switching off validation and remote DTD resolution.
+    if not html:
+        context = libxml2mod.xmlCreateURLParserCtxt(url)
+        libxml2mod.xmlParserSetValidate(context, 0)
+        libxml2mod.xmlCtxtUseOptions(context, XML_PARSE_NOERROR | XML_PARSE_NOWARNING | XML_PARSE_NONET)
+        libxml2mod.xmlParseDocument(context)
+        return libxml2mod.xmlParserGetDoc(context)
+    else:
+        raise NotSupportedError, "parseURI does not yet support HTML"
 
 def toString(node, encoding=None):
     return libxml2mod.serializeNode(node, encoding, 0)
@@ -373,4 +386,13 @@
 def toFile(node, f, encoding=None):
     libxml2mod.saveNodeTo(node, f, encoding, 0)
 
+# libxml2mod constants.
+
+HTML_PARSE_NOERROR = 32
+HTML_PARSE_NOWARNING = 64
+HTML_PARSE_NONET = 2048
+XML_PARSE_NOERROR = 32
+XML_PARSE_NOWARNING = 64
+XML_PARSE_NONET = 2048
+
 # vim: tabstop=4 expandtab shiftwidth=4
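
The parser options introduced in this changeset are bit flags, which is why the patch joins them with `|` before passing them to libxml2mod. The following is a minimal sketch of that combination using the three XML_PARSE_* values listed in the patch. The helpers `combine_options` and `has_option` and the name `QUIET_OFFLINE` are hypothetical and exist only for illustration; they are not part of libxml2dom or libxml2mod, and the sketch does not call libxml2 at all.

```python
# Option values copied from the "libxml2mod constants" section of the patch.
XML_PARSE_NOERROR = 32      # suppress error reports
XML_PARSE_NOWARNING = 64    # suppress warning reports
XML_PARSE_NONET = 2048      # forbid network access


def combine_options(*flags):
    """Bitwise-OR the given flags into a single option mask (hypothetical helper)."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


def has_option(mask, flag):
    """Return True if every bit of `flag` is set in `mask` (hypothetical helper)."""
    return (mask & flag) == flag


# The same combination the patch passes to xmlCtxtUseOptions.
QUIET_OFFLINE = combine_options(
    XML_PARSE_NOERROR, XML_PARSE_NOWARNING, XML_PARSE_NONET
)

if __name__ == "__main__":
    print(QUIET_OFFLINE)
    print(has_option(QUIET_OFFLINE, XML_PARSE_NONET))
```

Because each flag occupies a distinct bit, the combined mask can be tested for any single option without affecting the others, which is presumably why the HTML_PARSE_* and XML_PARSE_* sets are kept as separate names even though their numeric values coincide.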