<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>

<title>Character Encodings</title><meta name="generator" content="amaya 8.1a, see http://www.w3.org/Amaya/" />
<link href="styles.css" rel="stylesheet" type="text/css" /></head>

<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should try to use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
<li><a href="parameters-headers.html">Inspecting query strings</a></li>
<li><a href="responses.html">Sending output in a response</a></li>
<li><a href="parameters.html">Receiving uploaded content</a></li>
<li><a href="state.html">Accessing cookie information</a></li>
<li><a href="sessions.html">Accessing session information</a></li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text is always in some character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
UTF-8 is often used to encode text because it covers most languages
simultaneously and is therefore flexible enough for many applications.</p>
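<p>As a small illustration of how the choice of encoding changes the
data that is actually transmitted, the following sketch encodes the same
Unicode text using the encodings mentioned above. The sample text and the
variable names are chosen purely for illustration, and the code is written
to run under both Python 2.6+ and Python 3.3+, where the <code>u"..."</code>
and <code>b"..."</code> literals are both accepted:</p>

```python
# Illustrative only: the sample text is an assumption, not part of WebStack.
# It reads "cafe" with an acute accent, a space, a euro sign and "5".
text = u"caf\u00e9 \u20ac5"

as_utf8 = text.encode("utf-8")            # every character can be represented
as_latin9 = text.encode("iso-8859-15")    # ISO-8859-15 includes the euro sign

# ISO-8859-1 has no euro sign, so a strict conversion raises an error.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# US-ASCII has neither the accented letter nor the euro sign; with the
# "replace" error policy, each unsupported character becomes "?".
as_ascii = text.encode("us-ascii", "replace")
```

<p>The accented letter occupies two bytes in UTF-8 but only one in
ISO-8859-15, which is why the sender and the receiver must agree on the
encoding in use.</p>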
<p>When an application receives a URL and needs to interpret the
request parameters it contains, the situation is more awkward. The URL
itself is US-ASCII text, but it may contain special numeric codes that
indicate character values in the original text encoding - see the
<a href="parameters.html">description of query strings</a> for more
information.</p>
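<p>A minimal sketch of this two-stage interpretation: the US-ASCII text is
first turned back into raw byte values, and those bytes are then decoded
using the encoding that the original text is assumed to have used. The
helper name <code>decode_query_value</code> and the sample values are
hypothetical and not part of WebStack; the import is written so that the
same code runs under Python 2 and Python 3:</p>

```python
try:
    from urllib.parse import unquote_to_bytes       # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes  # Python 2: returns a plain string

def decode_query_value(value, encoding):
    """Turn a percent-encoded query value into a Unicode object.

    Hypothetical helper for illustration; the caller must know, or guess,
    the encoding that the sender used.
    """
    raw = unquote_to_bytes(value)   # "%E9" becomes the single byte 0xE9
    return raw.decode(encoding)

# The same character arrives as different numeric codes in different encodings.
from_latin1 = decode_query_value("caf%E9", "iso-8859-1")
from_utf8 = decode_query_value("caf%C3%A9", "utf-8")
```

<p>Both calls produce the same Unicode text, but only because the correct
encoding was supplied each time. Decoding the first value as UTF-8 raises an
error, and decoding the second as ISO-8859-1 silently yields the wrong
characters.</p>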
<h2>Recommendations</h2>
<dl>
<dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python supports
Unicode objects, which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
<li>Convert textual content to Unicode as soon as possible (see below
for choosing encodings).</li>
<li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
<li>Remember that the standard library <code>codecs</code>
module contains useful functions for accessing streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs

class MyResource:

    encoding = "utf-8"

    def respond(self, trans):
        stream = trans.get_request_stream()                       # only reads strings
        unicode_stream = codecs.getreader(self.encoding)(stream)  # reads Unicode objects

        # [Some activity...]

        out = trans.get_response_stream()                         # only writes strings
        unicode_out = codecs.getwriter(self.encoding)(out)        # writes Unicode objects</pre>
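<p>The same wrapping can be tried without a running application. In the
following sketch, in-memory <code>io.BytesIO</code> objects stand in for the
request and response streams; they are assumptions made for illustration,
since the real streams come from the transaction object:</p>

```python
import codecs
import io

encoding = "utf-8"

# Stand-in for the request stream: it only yields encoded (plain string) data.
stream = io.BytesIO(u"caf\u00e9\n".encode(encoding))
unicode_stream = codecs.getreader(encoding)(stream)
line = unicode_stream.readline()          # a Unicode object, newline included

# Stand-in for the response stream: it only accepts encoded data.
out = io.BytesIO()
unicode_out = codecs.getwriter(encoding)(out)
unicode_out.write(line.upper())           # converted to the encoding on the way out
written = out.getvalue()
```

<p>Between the two wrappers, the code only ever deals with Unicode objects;
the conversion to and from the chosen encoding happens at the edges.</p>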
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing binary content, Unicode objects are
inappropriate: use plain strings instead, and make sure to open files in
binary mode where necessary.</p>
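<p>A brief sketch of binary-mode file handling follows. The payload, the
temporary directory and the file name are all hypothetical and serve only to
make the example self-contained:</p>

```python
import os
import tempfile

# Hypothetical binary payload: arbitrary bytes that do not represent text.
payload = b"\x89PNG\r\n\x1a\n\x00\xff"

path = os.path.join(tempfile.mkdtemp(), "upload.bin")   # illustrative file name

# The "b" in the mode stops the data from being treated as text, so no
# newline translation or character decoding is applied.
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    data = f.read()         # a plain string (bytes), not a Unicode object
```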
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise
control over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout. Here is an outline of code that does this:</p>
<pre>from WebStack.Generic import ContentType

class MyResource:

    encoding = "utf-8"  # We decide on "utf-8" as our chosen encoding.

    def respond(self, trans):
        # [Do various things.]

        # Explicitly use the encoding.
        fields = trans.get_fields_from_body(encoding=self.encoding)

        # [Do other things with the Unicode values from the fields.]

        # The output Web page uses the encoding.
        trans.set_content_type(ContentType("text/html", self.encoding))

        # [Produce the response, making sure that self.encoding is used
        # to convert Unicode to raw strings.]</pre>
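<p>To see why consistency matters, the following sketch shows what happens
when text is encoded with one encoding and interpreted with another. The
sample text is illustrative only:</p>

```python
text = u"caf\u00e9"                       # "cafe" with an acute accent

encoded = text.encode("utf-8")            # what a UTF-8 response would contain

# Consistent: decoding with the same encoding recovers the original text.
consistent = encoded.decode("utf-8")

# Inconsistent: the two UTF-8 bytes for the accented letter are each treated
# as a separate ISO-8859-1 character, so the reader sees two wrong characters.
inconsistent = encoded.decode("iso-8859-1")
```

<p>No error is raised in the inconsistent case, so this kind of mistake
tends to show up only as garbled characters in the browser.</p>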
<h3>Tell Other Components Which Encoding to Use</h3>
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>), it may
be the case that such components will just write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, it is likely that for textual content
such as XML or related formats (XHTML, SVG, HTML) you will need to
instruct the component to use your chosen encoding; for example:</p>
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...
        xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
for the XML document.</p>
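<p>A self-contained sketch of this behaviour follows, using a small
document built in place of real generated content. The element name, the
text and the choice of ISO-8859-1 are illustrative assumptions:</p>

```python
from xml.dom import minidom

encoding = "iso-8859-1"

# Build a tiny document containing a non-ASCII character.
xml_document = minidom.Document()
element = xml_document.createElement("p")
element.appendChild(xml_document.createTextNode(u"caf\u00e9"))
xml_document.appendChild(element)

# With an explicit encoding, the result is encoded data, and the XML
# declaration at the start of the output names the same encoding.
output = xml_document.toxml(encoding)
```

<p>The resulting data can be written directly to a normal, unwrapped
stream, since it has already been converted to the chosen encoding.</p>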
</body></html>