WebStack

docs/encodings.html

732:7f1f02b485f8
2007-11-12 paulb [project @ 2007-11-12 00:50:03 by paulb] Introduced base classes for common authentication activities. Made cookie usage "safe" for usernames containing ":" characters. Added support for OpenID signatures.
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <title>Character Encodings</title>
  <link href="styles.css" rel="stylesheet" type="text/css" /></head>
<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should try to use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
  <li><a href="parameters-headers.html">Inspecting query strings</a></li>
  <li><a href="responses.html">Sending output in a response</a></li>
  <li><a href="parameters.html">Receiving uploaded content</a></li>
  <li><a href="state.html">Accessing cookie information</a></li>
  <li><a href="sessions.html">Accessing session information</a> (see the <a href="sessions-usage.html#Limitations">"Session Limitations and Guidelines"</a>)</li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text will be in some kind of character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
Often, UTF-8 is used to encode text because it covers most languages
simultaneously and is therefore flexible enough for many applications.</p>
<p>When URLs are received by applications, the situation is a bit more
awkward, because some of the request parameters must be decoded before
they can be interpreted.
The URL itself is encoded in US-ASCII but will contain
special numeric codes that indicate character values in the
original text encoding - see the <a href="parameters.html">description
of query strings</a> for more information.</p>
<h2>Recommendations</h2>
<dl>
  <dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python has support
for Unicode objects which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
  <li>Convert textual content to Unicode as soon as possible.</li>
  <li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
  <li>Remember that the standard library <code>codecs</code>
module contains useful functions to access streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs<br /><br />class MyResource:<br /><br />    encoding = "utf-8"<br /><br />    def respond(self, trans):<br />        stream = trans.get_request_stream()                         # only reads strings<br />        unicode_stream = codecs.getreader(self.encoding)(stream)    # reads Unicode objects<br /><br />        # [Some activity...]<br /><br />        out = trans.get_response_stream()                           # writes strings and Unicode objects<br /></pre>
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing
binary content, Unicode objects are
inappropriate. Make sure to open files in binary mode, where necessary.</p>
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise control
over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout.</p>
<p>One approach which works acceptably for smaller applications is to define
an attribute (or a global) which is conveniently accessible and which
can be used directly with various transaction methods. Here is an
outline of code which does this:</p>
<pre>from WebStack.Generic import ContentType<br /><br />class MyResource:<br /><br />    encoding = "utf-8"                                                     # We decide on "utf-8" as our chosen<br />                                                                           # encoding.<br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body(encoding=self.encoding)        # Explicitly use the encoding.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html", self.encoding))    # The output Web page uses the encoding.<br /><br />        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
<h3>Use EncodingSelector to Set the Default Encoding</h3>
<p>An arguably better approach is to use selectors (as described in <a href="selectors.html">"Selectors - Components for Dispatching to
Resources"</a>), typically in a "site map" arrangement (as described in <a href="deploying.html">"Deploying a WebStack Application"</a>), specifically using the <code>EncodingSelector</code>:</p>
<pre>from WebStack.Generic import ContentType<br /># EncodingSelector must also be imported; see selectors.html.<br /><br />class MyResource:<br /><br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body()                       # Encoding set by EncodingSelector.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html"))            # The output Web page uses the default encoding.<br /><br />        # [Produce the response, making sure that the default encoding is used to convert Unicode to raw strings.]<br /><br />def get_site_map():<br /><br />    return EncodingSelector(MyResource(), "utf-8")</pre>
<h3>Tell Other Components Which Encoding to Use</h3>
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>), such
components may simply write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, it is likely that for textual content
such as XML or related formats (XHTML, SVG, HTML) you will need to
instruct the component to use your chosen encoding; for example:</p>
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...<br />        xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
in the XML declaration of the document.</p>
</body></html>