WebStack

Annotated docs/encodings.html

436:077f20278617
2005-08-24 paulb [project @ 2005-08-24 21:33:04 by paulb] Introduced a topic index under "Developing WebStack Applications". Added details of the recent changes to methods which are now aware of character encodings.
paulb@358 1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
paulb@436 2
<html xmlns="http://www.w3.org/1999/xhtml"><head>
paulb@436 3
  
paulb@436 4
  <title>Character Encodings</title><meta name="generator" content="amaya 8.1a, see http://www.w3.org/Amaya/" />
paulb@436 5
  <link href="styles.css" rel="stylesheet" type="text/css" /></head>
paulb@436 6
paulb@335 7
<body>
paulb@335 8
<h1>Character Encodings</h1>
paulb@358 9
<p>When writing applications with WebStack, you should try to use
paulb@358 10
Python's Unicode objects as much as possible. However, there are a
paulb@358 11
number of places where plain Python strings can be involved:</p>
paulb@335 12
<ul>
paulb@436 13
  <li><a href="parameters-headers.html">Inspecting query strings</a></li>
paulb@360 14
  <li><a href="responses.html">Sending output in a response</a></li>
paulb@360 15
  <li><a href="parameters.html">Receiving uploaded content</a></li>
paulb@360 16
  <li><a href="state.html">Accessing cookie information</a></li>
paulb@360 17
  <li><a href="sessions.html">Accessing session information</a></li>
paulb@335 18
</ul>
paulb@358 19
<p>When Web pages (and other types of content) are sent to and from
paulb@358 20
users of your application, the text will be in some kind of character
paulb@358 21
encoding. For example, in English-speaking environments, the US-ASCII
paulb@358 22
encoding is common and contains the basic letters, numbers and symbols
paulb@358 23
used in English, whereas in Western Europe, encodings like
paulb@358 24
ISO-8859-1 and ISO-8859-15 are typically used, since they contain
paulb@358 25
additional letters and symbols in order to support other languages.
paulb@358 26
Often, UTF-8 is used to encode text because it can represent the characters of virtually all languages
paulb@358 27
simultaneously and is therefore flexible enough for many applications.</p>
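<p>As a rough illustration of how these encodings differ, the sketch below encodes the same text in several of them. It assumes a current Python 3 interpreter, where <code>str</code> plays the role of the Unicode object and <code>bytes</code> that of the plain string; the sample text and the <code>can_encode</code> helper are invented for the example.</p>

```python
# Assumption: Python 3, where str is Unicode text and bytes is encoded data.
word = "caf\u00e9"      # "cafe" with an acute accent; the accented letter is not in US-ASCII
euro = "\u20ac"         # the euro sign

latin1_bytes = word.encode("iso-8859-1")    # one byte per character
utf8_bytes = word.encode("utf-8")           # the accented letter takes two bytes
latin9_euro = euro.encode("iso-8859-15")    # ISO-8859-15 includes the euro sign


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_encode(word, "us-ascii")            # accented letter is missing
latin1_euro_ok = can_encode(euro, "iso-8859-1")    # euro sign is missing
utf8_euro_ok = can_encode(euro, "utf-8")           # UTF-8 covers both
```

<p>The same abstract text therefore yields different byte sequences, or fails to encode at all, depending on the encoding chosen.</p>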
paulb@358 28
<p>When URLs are received in applications, interpreting some of the
paulb@358 29
request parameters they contain is a bit more
paulb@358 30
awkward. The URL itself is encoded in US-ASCII but will contain
paulb@358 31
special numeric codes that indicate&nbsp;character values in the
paulb@358 32
original text encoding -&nbsp;see the <a href="parameters.html">description
paulb@358 33
of query strings</a> for more information.</p>
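<p>The numeric codes mentioned above can be sketched with the Python 3 standard library alone. This is only an illustration of percent-encoding in general, not of how WebStack decodes parameters; the sample values are invented.</p>

```python
# Assumption: Python 3 standard library; WebStack itself is not used here.
from urllib.parse import parse_qs, quote, unquote

# The same word percent-encoded under two different original encodings.
utf8_form = quote("caf\u00e9", encoding="utf-8")
latin1_form = quote("caf\u00e9", encoding="iso-8859-1")

# Decoding only gives the right text when the matching encoding is used.
right = unquote(latin1_form, encoding="iso-8859-1")
wrong = unquote(latin1_form, encoding="utf-8")   # undecodable byte is replaced

# parse_qs applies the same decoding to a whole query string.
fields = parse_qs("name=caf%E9", encoding="iso-8859-1")
```

<p>Since the URL does not state which encoding produced the numeric codes, the application has to know or decide which one to apply.</p>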
paulb@335 34
<h2>Recommendations</h2>
paulb@358 35
<dl>
paulb@358 36
  <dt>The following recommendations should help you avoid issues with
paulb@358 37
incorrect characters in the Web pages (and other content) that you
paulb@358 38
produce:</dt>
paulb@358 39
</dl>
paulb@358 40
<h3>Use Unicode Objects for Textual Content</h3>
paulb@358 41
<p>Handling text in specific encodings using normal Python strings can
paulb@358 42
be difficult, and handling text in multiple encodings in the same
paulb@358 43
application can be highly error-prone. Fortunately, Python has support
paulb@358 44
for Unicode objects which let you think of letters, numbers, symbols
paulb@358 45
and all other characters in an abstract way.</p>
paulb@358 46
<ul>
paulb@358 47
  <li>Convert textual content to Unicode as soon as possible (see below
paulb@358 48
for choosing encodings).</li>
paulb@358 49
  <li>If you must include hard-coded messages in your application code,
paulb@436 50
make sure to specify the encoding using the <a href="http://www.python.org/peps/pep-0263.html">standard declaration</a>
paulb@358 51
at the top of your source file.</li>
paulb@358 52
  <li>Remember that the standard library&nbsp;<code>codecs</code>
paulb@358 53
module contains useful functions to access streams as if Unicode
paulb@358 54
objects were being transmitted; for example:</li>
paulb@358 55
</ul>
paulb@358 56
<pre>import codecs<br /><br />class MyResource:<br /><br />    encoding = "utf-8"<br /><br />    def respond(self, trans):<br />        stream = trans.get_request_stream()                         # only reads strings<br />        unicode_stream = codecs.getreader(self.encoding)(stream)    # reads Unicode objects<br /><br />        # [Some activity...]<br /><br />        out = trans.get_response_stream()                           # only writes strings<br />        unicode_out = codecs.getwriter(self.encoding)(out)          # writes Unicode objects</pre>
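<p>The stream wrapping shown above can be tried without a running application. The following is a minimal sketch assuming Python 3, with <code>io.BytesIO</code> objects standing in for the request and response streams; the variable names are invented for the example.</p>

```python
# Assumption: Python 3, with io.BytesIO standing in for the request and
# response streams that a transaction object would provide.
import codecs
import io

encoding = "utf-8"

request_stream = io.BytesIO("caf\u00e9".encode(encoding))   # yields encoded bytes
unicode_in = codecs.getreader(encoding)(request_stream)     # yields Unicode text
text = unicode_in.read()

response_stream = io.BytesIO()                              # accepts encoded bytes
unicode_out = codecs.getwriter(encoding)(response_stream)   # accepts Unicode text
unicode_out.write(text.upper())
written = response_stream.getvalue()
```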
paulb@358 57
<h3>Use Strings for Binary Content</h3>
paulb@358 58
<p>If you are reading and writing binary content, Unicode objects are
paulb@358 59
inappropriate. Make sure to open files in binary mode, where necessary.</p>
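<p>As a small sketch of the point about binary mode, assuming Python 3 and an invented temporary file, the following writes and reads raw bytes without any character decoding taking place:</p>

```python
# Assumption: Python 3; the file name and contents are invented for the example.
import os
import tempfile

payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])   # PNG signature bytes

path = os.path.join(tempfile.mkdtemp(), "upload.bin")

with open(path, "wb") as f:      # "b" avoids newline translation and text encoding
    f.write(payload)

with open(path, "rb") as f:
    data = f.read()              # bytes, identical to what was written
```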
paulb@358 60
<h3>Use Explicit Encodings and Be Consistent</h3>
paulb@358 61
<p>Although WebStack has some support for detecting character encodings
paulb@358 62
used
paulb@358 63
in requests, it is often best for your application to exercise control
paulb@358 64
over
paulb@358 65
which encoding is used when <a href="parameters.html">inspecting
paulb@358 66
request
paulb@358 67
parameters</a> and when <a href="responses.html">producing responses</a>.
paulb@358 68
The
paulb@358 69
best way to do this is to decide which encoding is most suitable for
paulb@358 70
the data
paulb@358 71
presented and received in your application and then to use it
paulb@358 72
throughout.
paulb@335 73
Here is an outline of code which does this:</p>
paulb@358 74
<pre>from WebStack.Generic import ContentType<br /><br />class MyResource:<br /><br />    encoding = "utf-8"                                                     # We decide on "utf-8" as our chosen<br />                                                                           # encoding.<br />    def respond(self, trans):<br />        # [Do various things.]<br /><br />        fields = trans.get_fields_from_body(encoding=self.encoding)        # Explicitly use the encoding.<br /><br />        # [Do other things with the Unicode values from the fields.]<br /><br />        trans.set_content_type(ContentType("text/html", self.encoding))    # The output Web page uses the encoding.<br /><br />        # [Produce the response, making sure that self.encoding is used to convert Unicode to raw strings.]</pre>
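<p>The idea of using one encoding throughout can also be sketched independently of WebStack. The class below is hypothetical (its name and methods are not part of WebStack) and assumes Python 3; it decodes form fields and encodes the response using the same declared encoding.</p>

```python
# Assumption: Python 3 standard library only. The class and method names are
# hypothetical and do not correspond to WebStack's API.
from urllib.parse import parse_qs


class EncodingPolicy:
    """Keeps one encoding for reading form fields and writing responses."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding

    def decode_fields(self, body):
        # A URL-encoded form body is ASCII text carrying percent-escaped bytes.
        return parse_qs(body.decode("ascii"), encoding=self.encoding)

    def content_type(self, media_type="text/html"):
        # Declare the same encoding that encode_body uses.
        return "%s; charset=%s" % (media_type, self.encoding)

    def encode_body(self, text):
        return text.encode(self.encoding)


policy = EncodingPolicy("utf-8")
fields = policy.decode_fields(b"name=caf%C3%A9")
header = policy.content_type()
body = policy.encode_body("Hello, " + fields["name"][0])
```

<p>Keeping the encoding in a single place means that the declared content type and the bytes actually sent cannot drift apart.</p>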
paulb@358 75
<h3>Communicate Encodings to Other Components</h3>
paulb@436 76
<p>When using other components to generate content (see <a href="integrating.html">"Integrating with Other Systems"</a>), it may
paulb@358 77
be the case that such components will just write the generated content
paulb@358 78
straight to a normal stream (rather than one wrapped by a&nbsp;<code>codecs</code>
paulb@358 79
module function). In such cases, it is likely that for textual content
paulb@358 80
such as XML or related formats (XHTML, SVG, HTML) you will need to
paulb@358 81
instruct the component to use your chosen encoding; for example:</p>
paulb@358 82
<pre>        # In the respond method, xml_document is an xml.dom.minidom.Document object...<br />        xml_text = xml_document.toxml(self.encoding)    # encoded output, ready to be written to the response stream</pre>
paulb@436 83
<p>This will then generate the appropriate characters in the output <span style="font-style: italic;">and</span> specify the correct encoding
paulb@358 84
for the XML document.</p>
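<p>A self-contained sketch of this behaviour, assuming Python 3 (where <code>toxml</code> returns bytes when given an encoding) and an invented document, might look like this:</p>

```python
# Assumption: Python 3, where toxml(encoding) returns encoded bytes.
from xml.dom.minidom import getDOMImplementation

# Build a tiny document whose text contains a non-ASCII character.
document = getDOMImplementation().createDocument(None, "greeting", None)
document.documentElement.appendChild(document.createTextNode("caf\u00e9"))

utf8_xml = document.toxml("utf-8")            # declaration names utf-8
latin1_xml = document.toxml("iso-8859-1")     # declaration names iso-8859-1
```

<p>In each case the XML declaration names the requested encoding and the accented character is written using that encoding's bytes.</p>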
paulb@436 85
</body></html>