Unicode and Character Sets in WebStack
--------------------------------------

Unicode text should be converted to the chosen character set (encoding) when
written to the response stream.

Classic Python strings are written directly to the response stream without
encoding.

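As a rough illustration of the two statements above, the following sketch uses
Python 3 terms, where str stands in for Unicode text and bytes for classic
strings. The write_to_response function and its parameters are hypothetical
and are not part of the WebStack API.

```python
import io


def write_to_response(stream, data, encoding="utf-8"):
    """Write data to a binary stream, encoding only Unicode text."""
    if isinstance(data, str):
        # Unicode text: convert to the chosen character set first.
        data = data.encode(encoding)
    # Byte strings are assumed to be encoded already and pass through as-is.
    stream.write(data)


stream = io.BytesIO()
write_to_response(stream, "caf\u00e9", "iso-8859-1")  # encoded on the way out
write_to_response(stream, b" \xe9")                   # written unchanged
result = stream.getvalue()
```

The point of the sketch is that only Unicode text is affected by the chosen
encoding; byte strings reach the stream exactly as supplied.
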
Character Set Semantics in WebStack
-----------------------------------

Character sets (or encodings) are relevant in two areas:

* The encoding of output data.
* The processing of input data.

When producing HTML pages containing form fields and interpreting the values of
such fields from a request body, it is necessary to know...

* The character set used to encode the values sent by the browser. This is
  typically determined by...

* The character set used to encode the HTML page from which the field values
  originated.

It is therefore also necessary to remain consistent in the usage of character
sets when specifying content types. WebStack enforces the following rules:

* Where the request content type specifies a character set, this is used to
  decode the request body parameters unless explicitly overridden.

* Where the request content type does not specify a character set, a default
  character set is used to decode the request body parameters unless
  overridden.

* No conversion is done at the request stream level, since information about
  the character set may be missing and the application may wish to override
  any default explicitly at a higher level (such as when it gets request body
  parameters).

* Where the response content type specifies a character set, this is used to
  encode Unicode response data (e.g. HTML pages).

* Where the response content type does not specify a character set, a default
  character set is used to encode Unicode response data (e.g. HTML pages).

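The rules above might be sketched as follows. This is an illustration in
Python 3, not WebStack's implementation: the names charset_of, decode_fields
and encode_response are hypothetical, and ISO-8859-1 is only an assumed
default, since the actual default is not stated here.

```python
from urllib.parse import parse_qs

DEFAULT_CHARSET = "iso-8859-1"  # assumption made for this sketch only


def charset_of(content_type, default=DEFAULT_CHARSET):
    """Return the charset parameter of a content type, or the default."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"') or default
    return default


def decode_fields(raw_body, content_type, override=None):
    """Decode URL-encoded body fields; an explicit override wins."""
    encoding = override or charset_of(content_type)
    # The raw body stays as bytes until this point: nothing is decoded at
    # the stream level, so the application can still pick the encoding.
    return parse_qs(raw_body.decode("ascii"), encoding=encoding)


def encode_response(text, content_type):
    """Encode Unicode response data using the response content type."""
    return text.encode(charset_of(content_type))
```

For example, decode_fields(b"name=caf%E9", "application/x-www-form-urlencoded")
falls back to the assumed default, whereas passing override="utf-8" would be
appropriate for a body known to contain "caf%C3%A9".
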
Restrictions in and Omissions from Standards
--------------------------------------------

The encoding of character sets such as UTF-16 in HTTP POST request body
messages of content/media type application/x-www-form-urlencoded is not
properly standardised. Therefore, it is highly recommended that UTF-8 be used
as an encoding should the various single-byte encodings (e.g. ISO-8859-1) not
cover the range of characters to be displayed and received.

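To make the recommendation concrete, the sketch below checks whether a piece
of text fits a given encoding and shows how it would appear in a URL-encoded
body when UTF-8 is used. The can_encode helper and the sample text are
hypothetical and chosen purely for illustration.

```python
from urllib.parse import quote


def can_encode(text, encoding):
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "\u20ac100"  # euro sign followed by digits

fits_latin1 = can_encode(sample, "iso-8859-1")  # euro sign is not covered
fits_utf8 = can_encode(sample, "utf-8")         # UTF-8 covers all of Unicode

# Percent-encoded form of the sample, as it would be sent using UTF-8.
encoded = quote(sample, encoding="utf-8")
```

Here the single-byte encoding cannot represent the sample at all, which is the
situation in which switching to UTF-8 is suggested.
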
Framework Behaviour
-------------------

The Java Servlet API imposes restrictions on decoding request body parameters
by stating that the character encoding (ServletRequest.setCharacterEncoding)
must be set before any reading of the request body is attempted.