
Encoding

Revision as of 19:01, 5 March 2012 by Andrewbogott (talk)

Summary

The OpenStack Python HACKING guidelines should adopt the following rules:

1) All text within Python code should be of type 'unicode'.

2) All external text that is not explicitly encoded (database storage, command-line arguments, etc.) should be presumed to be encoded as UTF-8.

3) Transitions between internal unicode and external strings should always be explicitly encoded or decoded.
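
As a rough sketch of rules 2 and 3, the helpers below decode external bytes at the boundary and encode again on the way out. The names `to_text` and `to_bytes` are hypothetical and not part of any OpenStack module. The code is written to run unchanged on Python 2 and on Python 3, where the text type is called `str` and the byte type `bytes`.

```python
# -*- coding: utf-8 -*-
# Hypothetical boundary helpers; the names are illustrative only.

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def to_text(value, encoding="utf-8"):
    """Decode external bytes into internal text; return text unchanged."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def to_bytes(value, encoding="utf-8"):
    """Encode internal text for external use; return bytes unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# Bytes as they might arrive from a database or the command line (rule 2).
external = b"Mot\xc3\xb6rhead"
internal = to_text(external)        # explicit decode on the way in
round_tripped = to_bytes(internal)  # explicit encode on the way out
```

Defaulting to UTF-8 mirrors rule 2; a caller that knows the real encoding can pass it explicitly.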

Background

Python has two ways of storing text: type 'str' and type 'unicode'. The distinction between these two types is often invisible.


>>> u = u'Mot\xf6rhead'
>>> s = u.encode('utf-8')
>>> print s
Motörhead
>>> print u
Motörhead
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>


A 'unicode' object is a sequence in which each item represents a Unicode character. The actual underlying size, type, and binary content of these items is a black box. A str, by contrast, is a binary data type where each item is eight bits of storage. A str might contain ASCII data, UTF-8 data, or data in any one of a hundred or so other text storage formats, called 'encodings'.
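
To make the idea of an encoding concrete, the sketch below encodes one piece of text three ways. The variable names are illustrative; `utf-8`, `latin-1`, and `cp500` are standard Python codec names.

```python
# -*- coding: utf-8 -*-
text = u"Mot\u00f6rhead"             # nine Unicode characters

as_utf8 = text.encode("utf-8")       # o-umlaut takes two bytes
as_latin1 = text.encode("latin-1")   # o-umlaut takes one byte
as_ebcdic = text.encode("cp500")     # an EBCDIC code page, one byte per character here

# Same text, three different byte sequences.
for name, raw in [("utf-8", as_utf8), ("latin-1", as_latin1), ("cp500", as_ebcdic)]:
    print("%-8s %2d bytes  %r" % (name, len(raw), raw))
```

Each byte sequence decodes back to the same text only when it is read with the codec that produced it.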

What's bad about strs

List positions within a str don't correspond to character positions in the text. Similarly, str lengths don't necessarily correspond to text lengths.


>>> print s
Motörhead
>>> print u
Motörhead
>>> u[5]
u'h'
>>> s[5]
'r'
>>> len(s)
10
>>> len(u)
9


Strs have no idea what their encoding is. If your function receives an argument of type 'str', it could be anything! Python will do its best to translate it into text when asked, but will only be making a guess.


>>> print s
Ԗ?̙????
>>> # What the heck?
>>> print s.decode('cp500')
Motörhead
>>> # Good thing I knew it was EBCDIC!
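
The same point can be sketched from the other direction: a single byte sequence decodes to different text, or fails outright, depending on which codec the reader assumes. Only standard codecs are used here, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Pretend these bytes arrived with no record of their encoding.
raw = u"Mot\u00f6rhead".encode("utf-8")

right = raw.decode("utf-8")     # correct guess: the original text
wrong = raw.decode("latin-1")   # wrong guess: succeeds, but yields garbage

try:
    raw.decode("ascii")         # stricter wrong guess: fails loudly
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The silent failure is the more dangerous one: `wrong` is a perfectly valid text object, just not the text that was stored.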


Examples