
Encoding


Summary

The OpenStack Python HACKING guidelines should adopt the following rules:

1) All text within python code should be of type 'unicode'.

2) All external text whose encoding is not explicitly specified (database storage, command-line arguments, etc.) should be presumed to be encoded as utf-8.

3) Transitions between internal unicode and external strings should always be explicitly encoded or decoded.
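
Taken together, the three rules look something like this (a minimal sketch of my own; the utf-8 bytes for 'Motörhead' stand in for external input):


>>> raw = 'Mot\xc3\xb6rhead'        # external bytes, presumed to be utf-8 (rule 2)
>>> band = raw.decode('utf-8')      # decode explicitly at the boundary (rule 3)
>>> line = u'Listening to ' + band  # all internal text is unicode (rule 1)
>>> out = line.encode('utf-8')      # encode explicitly on the way back out (rule 3)
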

Terminology

I'm going to try to keep these distinctions:

'Unicode text' is text that complies with the Unicode standard. Unicode text may be represented in many different ways that still comply with the standard.

A 'Unicode object' is an instance of the Python Unicode class. It's a particular data type that stores Unicode text.

A 'String' is a string of text (of any type.)

A 'str object' or 'string object' is an instance of the Python 'str' class. It is a binary data type that can store any kind of text.

Here's a pathological use of this terminology: "Unicode text can be stored as either a Python Unicode object, or as a properly encoded Python str. The most common way to encode Unicode text in a Python string is as 'utf-8'."

Hopefully by the end of this document those two sentences will seem less insane.

Background

Python has two ways of storing text: type 'str' and type 'unicode'. The distinction between these two types is often invisible.


>>> # (assuming a utf-8 terminal) s is a utf-8-encoded str, u is a unicode object
>>> s = 'Mot\xc3\xb6rhead'
>>> u = u'Mot\xf6rhead'
>>> print s
Motörhead
>>> print u
Motörhead
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>


A 'Unicode' object is a list in which each item represents a Unicode character. The actual underlying size, type, and binary content of these items is a black box. A str, by contrast, is a binary data type where each item is eight bits of storage. A str might contain ASCII data, unicode data, or any one of a hundred or so other text storage types, called 'encodings'.
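
For example (a small sketch of my own), the same Unicode text produces different str contents, and different lengths, depending on which encoding you pick:


>>> u = u'Mot\xf6rhead'
>>> u.encode('utf-8')
'Mot\xc3\xb6rhead'
>>> u.encode('latin-1')
'Mot\xf6rhead'
>>> len(u.encode('utf-8')), len(u.encode('latin-1'))
(10, 9)
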

What's bad about strs

List positions within a str don't correspond to character positions in the text. Similarly, str lengths don't necessarily correspond to text lengths.


>>> print s
Motörhead
>>> print u
Motörhead
>>> u[5]
u'h'
>>> s[5]
'r'
>>> len(s)
10
>>> len(u)
9


Strs have no idea what their encoding is. If your function receives an argument of type 'str', it could be anything! Python will do its best to translate it into text when asked, but will only be making a guess.


>>> print s
Ԗ?̙????
>>> # What the heck?
>>> print s.decode('cp500')
Motörhead
>>> # Good thing I knew it was EBCDIC!


Encoding output

Unicode is strictly an in-memory data structure. In order to serialize Unicode data it must be encoded into a str (which, remember, is a binary format and therefore trivial to serialize.) There are many available standards for making a complete transformation from a Unicode object to a str that contains Unicode text. Alas, Python generally defaults to an encoding that does not provide a complete transformation.


>>> f = open('file.txt','w')
>>> f.write(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
>>> # That's right, it defaults to ascii


What's worse: the default encoding cannot be set on a per-program basis, but only on a per-system basis. You don't know what the default encoding will be when your users run your program. So, the only way to be safe is to always explicitly request a unicode-compliant encoding.


>>> f.write(u.encode('utf8'))
>>>
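

If a file handle is going to see a lot of unicode traffic, the standard-library codecs module can do the encoding for you on every write (a sketch; the file name and the choice of utf-8 are just examples):


>>> import codecs
>>> f = codecs.open('file.txt', 'w', encoding='utf-8')
>>> f.write(u)   # accepts a unicode object and encodes it to utf-8 on the way out
>>> f.close()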


Decoding input

When we read in text from a file or a shell, we get raw data: a str.


>>> f = open('file.txt', 'r')
>>> s = f.readline()
>>> type(s)
<type 'str'>


Since we've established that passing around strs is bad, we need to immediately decode this text into a safe-to-share Unicode object. With luck, you're reading in data that already has an explicit encoding. If it does, use it!


>>> u = s.decode('<thisistherightencoding>')


A lot of the time, though, you're reading from the shell, or from a flat text file. That gets us a string of unknown encoding. So, at this point, we're going to rely on convention. Your I/O string is probably UTF-8. Many shells use a default encoding of UTF-8 (e.g. OSX and Ubuntu). Many text editors also use it as the default (e.g. vim and emacs.)


>>> u = s.decode('utf-8')


The grand thing about utf-8 is that it is a superset of ASCII. That means that if your input is plain ASCII, decoding it as utf-8 still performs a lossless transformation into a Unicode object.
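
To illustrate (a tiny sketch of my own), pure-ASCII bytes decode to exactly the same Unicode object whether you call them ascii or utf-8:


>>> 'lemmy'.decode('ascii')
u'lemmy'
>>> 'lemmy'.decode('utf-8')
u'lemmy'
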

And, yeah, it feels wrong to guess, but guessing is our only option. If your users are using an encoding that isn't UTF-8 or ASCII and they aren't explicitly tagging their data with that encoding, they are probably used to things not working.
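
Here's what a wrong guess looks like (a sketch of my own; the tail of the error message varies a bit between interpreter versions):


>>> s = 'Mot\xf6rhead'   # latin-1 bytes, not utf-8
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 3: invalid start byte
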

Examples