Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings are written directly to the response stream without
encoding.
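
The distinction can be sketched using only the standard library. The
write_text helper below is hypothetical and is not part of WebStack; it
assumes Python 3, where str plays the role of Unicode text and bytes the role
of a classic string.

```python
import io


def write_text(stream, data, encoding="utf-8"):
    """Write data to a binary stream, encoding only Unicode text."""
    if isinstance(data, str):
        # Unicode text: convert to the chosen character set first.
        stream.write(data.encode(encoding))
    else:
        # Byte string: assumed to be encoded already, so pass it through.
        stream.write(data)


stream = io.BytesIO()
write_text(stream, "caf\u00e9", encoding="iso-8859-1")  # encoded on the way out
write_text(stream, b" \xc3\xa9")                        # written unchanged
result = stream.getvalue()
```

Here the byte string reaches the stream exactly as supplied, so the caller
remains responsible for its encoding.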

Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

* The encoding of output data.
* The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know...

* The character set used to encode the values sent by the browser. This is
  typically determined by...

* The character set used to encode the HTML page from which the field values
  originated.
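
The effect of a mismatch between these two character sets can be illustrated
with the standard library's urllib.parse module. This is a sketch under the
assumption that the browser encodes field values using the character set of
the page; it does not use any WebStack API.

```python
from urllib.parse import parse_qs, urlencode

# A browser submitting a form from a UTF-8 page typically encodes the
# field values as UTF-8 before percent-encoding them.
body = urlencode({"name": "caf\u00e9"}, encoding="utf-8")

# Decoding with the character set of the page recovers the original text.
matching = parse_qs(body, encoding="utf-8")["name"][0]

# Decoding with a different character set raises no error, but each byte
# of the UTF-8 sequence is misread as a separate ISO-8859-1 character.
mismatched = parse_qs(body, encoding="iso-8859-1")["name"][0]
```

Since the mismatched case fails silently, the only reliable protection is to
know which character set the originating page used.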

It is therefore also necessary to remain consistent in the usage of character
sets when specifying content types. WebStack enforces the following rules:

* Where the request content type specifies a character set, this is used to
  decode the request body parameters unless explicitly overridden.

* Where the request content type does not specify a character set, a default
  character set is used to decode the request body parameters unless
  overridden.

* No conversion is done at the request stream level, since information about
  the character set may be missing and the application may wish to override
  any default explicitly at a higher level (such as when it gets request body
  parameters).

* Where the response content type specifies a character set, this is used to
  encode Unicode response data (e.g. HTML pages).

* Where the response content type does not specify a character set, a default
  character set is used to encode Unicode response data (e.g. HTML pages).
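
The rules above can be sketched roughly as follows. The function names and
the ISO-8859-1 default are assumptions made for illustration; they are not
WebStack's actual API or its documented default.

```python
from email.message import Message
from urllib.parse import parse_qs

DEFAULT_CHARSET = "iso-8859-1"  # assumed default, for illustration only


def charset_of(content_type, default=DEFAULT_CHARSET):
    """Return the charset parameter of a content type, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body_fields(raw_body, content_type, override=None):
    """Decode urlencoded body bytes with the override, declared or default charset."""
    charset = override or charset_of(content_type)
    # The raw bytes stay untouched until this point: nothing is converted
    # at the stream level, so the caller can still override the charset.
    return parse_qs(raw_body.decode("ascii"), encoding=charset)


def encode_response(text, content_type):
    """Encode Unicode response text with the declared or default charset."""
    return text.encode(charset_of(content_type))
```

For example, decode_body_fields(b"name=caf%C3%A9", "application/x-www-form-urlencoded", override="utf-8")
decodes the field as UTF-8 even though the content type declares no character
set.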

Restrictions in and Omissions from Standards
--------------------------------------------

The use of character sets such as UTF-16 in HTTP POST request bodies of
content/media type application/x-www-form-urlencoded is not properly
standardised. It is therefore highly recommended that UTF-8 be used whenever
the single-byte encodings (e.g. ISO-8859-1) do not cover the range of
characters to be displayed and received.
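
One way to apply this recommendation is to test whether the text fits a
single-byte encoding and to fall back to UTF-8 otherwise. The choose_encoding
helper below is a hypothetical sketch, not something WebStack provides.

```python
def choose_encoding(text, preferred="iso-8859-1", fallback="utf-8"):
    """Return preferred if it can represent text, otherwise fallback."""
    try:
        text.encode(preferred)
    except UnicodeEncodeError:
        # At least one character lies outside the single-byte repertoire.
        return fallback
    return preferred


latin_choice = choose_encoding("caf\u00e9")   # fits ISO-8859-1
euro_choice = choose_encoding("\u20ac 10")    # the euro sign does not
```

In practice it is often simpler to use UTF-8 throughout, since the characters
that users will submit cannot be known in advance.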

Framework Behaviour
-------------------

The Java Servlet API restricts the decoding of request body parameters: the
character encoding must be set (using ServletRequest.setCharacterEncoding)
before any attempt is made to read the request body.
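
The ordering constraint can be modelled in Python as a toy request object
whose encoding is fixed once the body has been decoded. The BodyParameters
class is entirely hypothetical and belongs neither to the Servlet API nor to
WebStack; raising an error on a late call is a choice made for this sketch
only.

```python
from urllib.parse import parse_qs


class BodyParameters:
    """Toy request object whose encoding is fixed once the body is read."""

    def __init__(self, raw_body, default_encoding="iso-8859-1"):
        self._raw_body = raw_body
        self._encoding = default_encoding
        self._fields = None  # decoded lazily, on first access

    def set_character_encoding(self, encoding):
        # The encoding can only be chosen before the body has been decoded.
        if self._fields is not None:
            raise RuntimeError("request body has already been read")
        self._encoding = encoding

    def get_fields(self):
        if self._fields is None:
            text = self._raw_body.decode("ascii")
            self._fields = parse_qs(text, encoding=self._encoding)
        return self._fields


request = BodyParameters(b"name=caf%C3%A9")
request.set_character_encoding("utf-8")  # must come before get_fields
fields = request.get_fields()
```

An application that reads the body first and sets the encoding afterwards is
left with parameters decoded using the default.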