# HG changeset patch
# User paulb
# Date 1174160812 0
# Node ID b2a2555d645c6a58ca137d510124134dbdf08791
# Parent 4cc51eb695eaf01f1c1f7a24c64edf23f032fa48
[project @ 2007-03-17 19:46:48 by paulb]
Changed the parse functions to expose an HTML encoding parameter for overriding libxml2's encoding detection mechanisms.

diff -r 4cc51eb695ea -r b2a2555d645c libxml2dom/__init__.py
--- a/libxml2dom/__init__.py	Sat Mar 17 19:45:57 2007 +0000
+++ b/libxml2dom/__init__.py	Sat Mar 17 19:46:52 2007 +0000
@@ -502,14 +502,15 @@
 def createDocument(namespaceURI, localName, doctype):
     return default_impl.createDocument(namespaceURI, localName, doctype)
 
-def parse(stream_or_string, html=0, impl=None):
+def parse(stream_or_string, html=0, htmlencoding=None, impl=None):
 
     """
     Parse the given 'stream_or_string', where the supplied object can either be
     a stream (such as a file or stream object), or a string (containing the
     filename of a document). If the optional 'html' parameter is set to a true
     value, the content to be parsed will be treated as being HTML rather than
-    XML.
+    XML. If the optional 'htmlencoding' is specified, HTML parsing will be
+    performed with the document encoding assumed to that specified.
 
     A document object is returned by this function.
     """
@@ -518,42 +519,48 @@
 
     if hasattr(stream_or_string, "read"):
         stream = stream_or_string
-        return parseString(stream.read(), html, impl)
+        return parseString(stream.read(), html, htmlencoding, impl)
     else:
-        return parseFile(stream_or_string, html, impl)
+        return parseFile(stream_or_string, html, htmlencoding, impl)
 
-def parseFile(filename, html=0, impl=None):
+def parseFile(filename, html=0, htmlencoding=None, impl=None):
 
     """
     Parse the file having the given 'filename'. If the optional 'html' parameter
     is set to a true value, the content to be parsed will be treated as being
-    HTML rather than XML.
+    HTML rather than XML. If the optional 'htmlencoding' is specified, HTML
+    parsing will be performed with the document encoding assumed to be that
+    specified.
 
     A document object is returned by this function.
     """
 
     impl = impl or default_impl
-    return impl.adoptDocument(Node_parseFile(filename, html))
+    return impl.adoptDocument(Node_parseFile(filename, html, htmlencoding))
 
-def parseString(s, html=0, impl=None):
+def parseString(s, html=0, htmlencoding=None, impl=None):
 
     """
     Parse the content of the given string 's'. If the optional 'html' parameter
     is set to a true value, the content to be parsed will be treated as being
-    HTML rather than XML.
+    HTML rather than XML. If the optional 'htmlencoding' is specified, HTML
+    parsing will be performed with the document encoding assumed to be that
+    specified.
 
     A document object is returned by this function.
     """
 
     impl = impl or default_impl
-    return impl.adoptDocument(Node_parseString(s, html))
+    return impl.adoptDocument(Node_parseString(s, html, htmlencoding))
 
-def parseURI(uri, html=0, impl=None):
+def parseURI(uri, html=0, htmlencoding=None, impl=None):
 
     """
     Parse the content found at the given 'uri'. If the optional 'html' parameter
     is set to a true value, the content to be parsed will be treated as being
-    HTML rather than XML.
+    HTML rather than XML. If the optional 'htmlencoding' is specified, HTML
+    parsing will be performed with the document encoding assumed to be that
+    specified.
 
     XML documents are retrieved using libxml2's own network capabilities; HTML
     documents are retrieved using the urllib module provided by Python. To
@@ -572,12 +579,12 @@
     if html:
         f = urllib.urlopen(uri)
         try:
-            return parse(f, html, impl)
+            return parse(f, html, htmlencoding, impl)
         finally:
             f.close()
     else:
         impl = impl or default_impl
-        return impl.adoptDocument(Node_parseURI(uri, html))
+        return impl.adoptDocument(Node_parseURI(uri, html, htmlencoding))
 
 def toString(node, encoding=None, prettyprint=0):
 
diff -r 4cc51eb695ea -r b2a2555d645c libxml2dom/macrolib/macrolib.py
--- a/libxml2dom/macrolib/macrolib.py	Sat Mar 17 19:45:57 2007 +0000
+++ b/libxml2dom/macrolib/macrolib.py	Sat Mar 17 19:46:52 2007 +0000
@@ -501,14 +501,14 @@
         libxml2mod.xmlCreateIntSubset(d, doctype.localName, doctype.publicId, doctype.systemId)
     return d
 
-def parse(stream_or_string, html=0):
+def parse(stream_or_string, html=0, htmlencoding=None):
     if hasattr(stream_or_string, "read"):
         stream = stream_or_string
-        return parseString(stream.read(), html)
+        return parseString(stream.read(), html, htmlencoding)
     else:
-        return parseFile(stream_or_string, html)
+        return parseFile(stream_or_string, html, htmlencoding)
 
-def parseFile(s, html=0):
+def parseFile(s, html=0, htmlencoding=None):
     # NOTE: Switching off validation and remote DTD resolution.
     if not html:
         context = libxml2mod.xmlCreateFileParserCtxt(s)
@@ -518,9 +518,9 @@
         libxml2mod.xmlParseDocument(context)
         return libxml2mod.xmlParserGetDoc(context)
     else:
-        return libxml2mod.htmlReadFile(s, None, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET)
+        return libxml2mod.htmlReadFile(s, htmlencoding, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET)
 
-def parseString(s, html=0):
+def parseString(s, html=0, htmlencoding=None):
     # NOTE: Switching off validation and remote DTD resolution.
     if not html:
         context = libxml2mod.xmlCreateMemoryParserCtxt(s, len(s))
@@ -532,10 +532,10 @@
     else:
         # NOTE: URL given as None.
         html_url = None
-        return libxml2mod.htmlReadMemory(s, len(s), html_url, None,
+        return libxml2mod.htmlReadMemory(s, len(s), html_url, htmlencoding,
             HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET)
 
-def parseURI(uri, html=0):
+def parseURI(uri, html=0, htmlencoding=None):
     # NOTE: Switching off validation and remote DTD resolution.
     if not html:
         context = libxml2mod.xmlCreateURLParserCtxt(uri, 0)
diff -r 4cc51eb695ea -r b2a2555d645c libxml2dom/svg.py
--- a/libxml2dom/svg.py	Sat Mar 17 19:45:57 2007 +0000
+++ b/libxml2dom/svg.py	Sat Mar 17 19:46:52 2007 +0000
@@ -98,17 +98,17 @@
 
 # Convenience functions.
 
-def parse(stream_or_string, html=0):
-    return libxml2dom.parse(stream_or_string, html, default_impl)
+def parse(stream_or_string, html=0, htmlencoding=None):
+    return libxml2dom.parse(stream_or_string, html, htmlencoding, default_impl)
 
-def parseFile(filename, html=0):
-    return libxml2dom.parseFile(filename, html, default_impl)
+def parseFile(filename, html=0, htmlencoding=None):
+    return libxml2dom.parseFile(filename, html, htmlencoding, default_impl)
 
-def parseString(s, html=0):
-    return libxml2dom.parseString(s, html, default_impl)
+def parseString(s, html=0, htmlencoding=None):
+    return libxml2dom.parseString(s, html, htmlencoding, default_impl)
 
-def parseURI(uri, html=0):
-    return libxml2dom.parseURI(uri, html, default_impl)
+def parseURI(uri, html=0, htmlencoding=None):
+    return libxml2dom.parseURI(uri, html, htmlencoding, default_impl)
 
 # Single instance of the implementation.
 
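
Reviewer note: the pattern this changeset applies, threading an optional 'htmlencoding' argument from the public parse function down to the low-level HTML reader, can be sketched without libxml2 installed. In the sketch below, fake_html_read is a hypothetical stand-in for libxml2mod.htmlReadMemory, and parse_string is a simplified stand-in for parseString; the ISO-8859-1 fallback is an assumption made purely for illustration and is not libxml2's actual detection logic.

```python
import io


def fake_html_read(data, encoding=None):
    # Hypothetical stand-in for the low-level reader; NOT libxml2's behaviour.
    # With no override, this sketch simply assumes ISO-8859-1.
    return data.decode(encoding or "iso-8859-1")


def parse_string(s, html=0, htmlencoding=None):
    # Mirrors the patched signature: the override only matters for HTML input.
    if not html:
        return s.decode("utf-8")
    return fake_html_read(s, htmlencoding)


def parse(stream_or_string, html=0, htmlencoding=None):
    # Streams are read in full and handed on with the override passed through.
    if hasattr(stream_or_string, "read"):
        return parse_string(stream_or_string.read(), html, htmlencoding)
    return parse_string(stream_or_string, html, htmlencoding)


# UTF-8 bytes with no encoding declaration in the markup.
page = u"<p>caf\u00e9</p>".encode("utf-8")

detected = parse(io.BytesIO(page), html=1)
overridden = parse(io.BytesIO(page), html=1, htmlencoding="utf-8")
```

Without the override the stand-in misreads the two UTF-8 bytes of the accented character as two separate characters; with htmlencoding="utf-8" the text comes back intact. Note also that in libxml2dom/__init__.py the new parameter is inserted before 'impl', so a caller that passed 'impl' positionally as the third argument would now be supplying it as 'htmlencoding'; passing both by keyword avoids that.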