WebStack

Annotated docs/CHARSET.txt

Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings are written directly to the response stream without
encoding.
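The distinction above can be sketched as follows. This is only an
illustration under stated assumptions: ResponseStream is a hypothetical
stand-in and not WebStack's own class, and the sketch uses Python 3, where
"str" is Unicode text and "bytes" plays the role of the classic string.

```python
import io


class ResponseStream:
    """Hypothetical stand-in for a response stream (not a WebStack class)."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # underlying byte-oriented stream
        self.encoding = encoding  # the chosen character set

    def write(self, data):
        if isinstance(data, str):
            # Unicode text: convert to the chosen character set first.
            self.raw.write(data.encode(self.encoding))
        else:
            # Byte string: written as-is, with no encoding step.
            self.raw.write(data)


raw = io.BytesIO()
out = ResponseStream(raw, encoding="iso-8859-1")
out.write("caf\u00e9 ")      # Unicode text, encoded as ISO-8859-1
out.write(b"caf\xc3\xa9")    # bytes already in UTF-8, passed through untouched
result = raw.getvalue()
```

Here the two writes end up in different encodings within the same response,
which is why an application writing byte strings needs to make sure they
already match the character set declared for the response.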

Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

 * The encoding of output data.
 * The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know...

 * The character set used to encode the values sent by the browser. This is
   typically determined by...

 * The character set used to encode the HTML page from which the field values
   originated.

It is therefore also necessary to remain consistent in the usage of character
sets when specifying content types. WebStack enforces the following rules:

 * Where the request content type specifies a character set, this is used to
   decode the request body parameters unless explicitly overridden.

 * Where the request content type does not specify a character set, a default
   character set is used to decode the request body parameters unless
   overridden.

 * No conversion is done at the request stream level, since information about
   the character set may be missing and the application may wish to override
   any default explicitly at a higher level (such as when it gets request body
   parameters).

 * Where the response content type specifies a character set, this is used to
   encode Unicode response data (e.g. HTML pages).

 * Where the response content type does not specify a character set, a default
   character set is used to encode Unicode response data (e.g. HTML pages).
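The rules above can be sketched in a few lines. This is a minimal
illustration, not WebStack's implementation: the function names, the
"override" parameter and the DEFAULT_CHARSET value are all assumptions made
for the example, and only Python 3 standard library calls are used.

```python
from email.message import Message
from urllib.parse import parse_qs

# Assumed default for this sketch; the actual default is not stated above.
DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(content_type):
    """Return the charset parameter of a content type, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body_parameters(body, content_type, override=None):
    """Decode a urlencoded request body (bytes) into Unicode parameters.

    Precedence: explicit override, then the request's charset, then the
    default. The raw body stays as bytes until this point.
    """
    charset = (override
               or charset_from_content_type(content_type)
               or DEFAULT_CHARSET)
    return parse_qs(body.decode("ascii"), encoding=charset)


def encode_response(text, content_type):
    """Encode Unicode response data using the response's charset or the default."""
    charset = charset_from_content_type(content_type) or DEFAULT_CHARSET
    return text.encode(charset)


form_type = "application/x-www-form-urlencoded"
declared = decode_body_parameters(b"name=caf%C3%A9", form_type + "; charset=UTF-8")
defaulted = decode_body_parameters(b"name=caf%E9", form_type)
overridden = decode_body_parameters(b"name=caf%C3%A9", form_type, override="utf-8")
page = encode_response("caf\u00e9", "text/html; charset=utf-8")
```

All three decoding calls yield the same Unicode value for "name"; what differs
is where the character set came from. Keeping the body as bytes until the
parameters are requested is what leaves room for the override.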

Restrictions in and Omissions from Standards
--------------------------------------------

The use of encodings such as UTF-16 in HTTP POST request bodies of
content/media type application/x-www-form-urlencoded is not properly
standardised. Therefore, it is highly recommended that UTF-8 be used as an
encoding should the various single byte encodings (e.g. ISO-8859-1) not cover
the range of characters to be displayed and received.
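As a rough illustration of this recommendation, the following sketch uses
Python 3's urllib.parse to urlencode a value containing the euro sign, which
ISO-8859-1 cannot represent. The sample value is arbitrary, and the UTF-16
line only shows what the escaped bytes look like; it is not a claim about how
any particular browser behaves.

```python
from urllib.parse import quote, unquote

text = "\u20ac5"  # euro sign followed by the digit 5

# UTF-8 covers the character, and the escaped form round-trips cleanly.
utf8_form = quote(text, encoding="utf-8")
round_trip = unquote(utf8_form, encoding="utf-8")

# ISO-8859-1 has no euro sign, so strict encoding fails.
try:
    quote(text, encoding="iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# In UTF-16, even a plain ASCII letter occupies two bytes, one of them NUL.
utf16_form = quote("A".encode("utf-16-be"))
```

With these inputs, utf8_form is "%E2%82%AC5", latin1_ok is False, and
utf16_form is "%00A".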

Framework Behaviour
-------------------

The Java Servlet API imposes restrictions on decoding request body parameters
by stating that the character encoding (ServletRequest.setCharacterEncoding)
must be set before any reading of the request body is attempted.