WebStack

Annotated docs/encodings.html

525:00639614f763
2005-11-20 paulb [project @ 2005-11-20 23:37:51 by paulb] Additional cross-referencing.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <title>Character Encodings</title><meta name="generator" content="amaya 8.1a, see http://www.w3.org/Amaya/" />
  <link href="styles.css" rel="stylesheet" type="text/css" /></head>
<body>
<h1>Character Encodings</h1>
<p>When writing applications with WebStack, you should use
Python's Unicode objects as much as possible. However, there are a
number of places where plain Python strings can be involved:</p>
<ul>
  <li><a href="parameters-headers.html">Inspecting query strings</a></li>
  <li><a href="responses.html">Sending output in a response</a></li>
  <li><a href="parameters.html">Receiving uploaded content</a></li>
  <li><a href="state.html">Accessing cookie information</a></li>
  <li><a href="sessions.html">Accessing session information</a> (see the <a href="sessions-usage.html#Limitations">"Session Limitations and Guidelines"</a>)</li>
</ul>
<p>When Web pages (and other types of content) are sent to and from
users of your application, the text will be in some kind of character
encoding. For example, in English-speaking environments, the US-ASCII
encoding is common and contains the basic letters, numbers and symbols
used in English, whereas in Western Europe encodings like
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
additional letters and symbols in order to support other languages.
UTF-8 is often used to encode text because it can represent the
characters of virtually all languages simultaneously and is therefore
flexible enough for many applications.</p>
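<p>The differences between these encodings can be illustrated with a
minimal sketch. It is written for Python 3, which is an assumption (the
surrounding text predates it): there, <code>str</code> plays the role of the
Unicode object and <code>bytes</code> the role of the plain string. The
<code>try_encode</code> helper and the sample text are hypothetical and not
part of WebStack.</p>

```python
def try_encode(text, encoding_name):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding_name)
    except UnicodeEncodeError:
        return None

# "cafe" with an acute accent, followed by a euro sign and the digit 5.
sample = "caf\u00e9 \u20ac5"

results = {
    name: try_encode(sample, name)
    for name in ("us-ascii", "iso-8859-1", "iso-8859-15", "utf-8")
}

for name, encoded in results.items():
    print(name, ":", encoded)
```

<p>US-ASCII cannot represent the accented letter, ISO-8859-1 lacks the
euro sign, ISO-8859-15 encodes both as single bytes, and UTF-8 encodes both
using multi-byte sequences.</p>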
<p>When URLs are received in applications, the situation is a bit more
awkward, because some of the request parameters must be decoded before
they can be interpreted. The URL itself is written in US-ASCII, but it
may contain special numeric codes that indicate character values in the
original text encoding; see the <a href="parameters.html">description
of query strings</a> for more information.</p>
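<p>As a rough illustration of those numeric codes, the following sketch
uses the Python 3 standard library (an assumption, since WebStack itself
does the decoding for you) to show that the same text produces different
codes depending on the original encoding, and that decoding with the wrong
encoding damages the text. The variable names are illustrative only.</p>

```python
from urllib.parse import quote, unquote

original = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# The same character becomes different numeric codes in different encodings.
as_utf8 = quote(original, encoding="utf-8")
as_latin1 = quote(original, encoding="iso-8859-1")

# Decoding only recovers the text when the matching encoding is used.
decoded = unquote(as_latin1, encoding="iso-8859-1")
mismatched = unquote(as_latin1, encoding="utf-8")

print(as_utf8, as_latin1, decoded == original, mismatched == original)
```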
<h2>Recommendations</h2>
<dl>
  <dt>The following recommendations should help you avoid issues with
incorrect characters in the Web pages (and other content) that you
produce:</dt>
</dl>
<h3>Use Unicode Objects for Textual Content</h3>
<p>Handling text in specific encodings using normal Python strings can
be difficult, and handling text in multiple encodings in the same
application can be highly error-prone. Fortunately, Python supports
Unicode objects, which let you think of letters, numbers, symbols
and all other characters in an abstract way.</p>
<ul>
  <li>Convert textual content to Unicode as soon as possible (see below
for choosing encodings).</li>
  <li>If you must include hard-coded messages in your application code,
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
at the top of your source file.</li>
  <li>Remember that the standard library <code>codecs</code>
module contains useful functions to access streams as if Unicode
objects were being transmitted; for example:</li>
</ul>
<pre>import codecs

class MyResource:

    encoding = "utf-8"

    def respond(self, trans):
        stream = trans.get_request_stream()                         # only reads strings
        unicode_stream = codecs.getreader(self.encoding)(stream)    # reads Unicode objects

        # [Some activity...]

        out = trans.get_response_stream()                           # writes strings and Unicode objects
</pre>
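<p>The wrapping technique can be tried outside WebStack. The following
is a minimal sketch for Python 3 (an assumption), in which in-memory
<code>io.BytesIO</code> objects stand in for the request and response
streams; those stand-ins and the sample text are hypothetical.</p>

```python
import codecs
import io

encoding = "utf-8"

# A byte stream standing in for the request stream.
raw_in = io.BytesIO("na\u00efve caf\u00e9\n".encode(encoding))
reader = codecs.getreader(encoding)(raw_in)
text = reader.read()  # decoded text, not raw bytes

# A byte stream standing in for the response stream.
raw_out = io.BytesIO()
writer = codecs.getwriter(encoding)(raw_out)
writer.write(text.upper())  # text is encoded on its way to the stream

print(repr(text), raw_out.getvalue())
```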
<h3>Use Strings for Binary Content</h3>
<p>If you are reading and writing binary content, Unicode objects are
inappropriate. Use plain strings instead, and make sure to open files in
binary mode where necessary.</p>
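<p>A minimal sketch of binary-mode file handling follows, assuming
Python 3, where binary content is held in <code>bytes</code> objects. The
file name and payload are hypothetical; the payload contains carriage
return and line feed bytes, which a text-mode file could translate.</p>

```python
import os
import tempfile

# Eight bytes that include CR and LF values (the PNG file signature).
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "upload.bin")

    # "wb" and "rb" request binary mode: no decoding, no newline translation.
    with open(path, "wb") as target:
        target.write(payload)
    with open(path, "rb") as source:
        round_trip = source.read()

print(round_trip == payload)
```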
<h3>Use Explicit Encodings and Be Consistent</h3>
<p>Although WebStack has some support for detecting character encodings
used in requests, it is often best for your application to exercise control
over which encoding is used when <a href="parameters.html">inspecting
request parameters</a> and when <a href="responses.html">producing responses</a>.
The best way to do this is to decide which encoding is most suitable for
the data presented and received in your application and then to use it
throughout.
Here is an outline of code which does this:</p>
<pre>from WebStack.Generic import ContentType

class MyResource:

    encoding = "utf-8"    # We decide on "utf-8" as our chosen encoding.

    def respond(self, trans):
        # [Do various things.]

        # Explicitly use the encoding.
        fields = trans.get_fields_from_body(encoding=self.encoding)

        # [Do other things with the Unicode values from the fields.]

        # The output Web page uses the encoding.
        trans.set_content_type(ContentType("text/html", self.encoding))

        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
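<p>The same principle can be sketched without WebStack. The following
Python 3 example (an assumption) uses only the standard library; the
<code>handle_form</code> function, the <code>name</code> field and the
plain-text response are hypothetical and merely show one chosen encoding
being applied to both the incoming fields and the outgoing content.</p>

```python
from urllib.parse import parse_qs

ENCODING = "utf-8"  # the single encoding chosen for the whole application

def handle_form(raw_body):
    """Decode form fields and build a response using the same encoding."""
    # The raw body is US-ASCII with numeric codes; decode those explicitly.
    fields = parse_qs(raw_body.decode("ascii"), encoding=ENCODING)
    name = fields.get("name", ["stranger"])[0]

    # Work with text internally, then encode once when producing output.
    message = "Hello, " + name + "!"
    content_type = "text/plain; charset=" + ENCODING
    return content_type, message.encode(ENCODING)

content_type, body = handle_form(b"name=Ren%C3%A9e")
print(content_type, body)
```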
<h3>Communicate Encodings to Other Components</h3>
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>), such
components may just write the generated content
straight to a normal stream (rather than one wrapped by a <code>codecs</code>
module function). In such cases, it is likely that for textual content
such as XML or related formats (XHTML, SVG, HTML) you will need to
instruct the component to use your chosen encoding; for example:</p>
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...
        xml_document.toxml(self.encoding)</pre>
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
in the XML declaration of the document.</p>
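<p>A self-contained sketch of this behaviour follows, assuming Python 3,
where <code>toxml</code> returns a byte string when given an encoding. The
<code>greeting</code> element, its text and the choice of ISO-8859-1 are
hypothetical.</p>

```python
from xml.dom.minidom import getDOMImplementation

encoding = "iso-8859-1"

# Build a tiny document with one element containing an accented letter.
document = getDOMImplementation().createDocument(None, "greeting", None)
document.documentElement.appendChild(document.createTextNode("caf\u00e9"))

# Passing the encoding affects both the bytes and the XML declaration.
serialized = document.toxml(encoding)

print(serialized)
```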
</body></html>