
Encoding

Summary

The OpenStack Python HACKING guidelines should adopt the following rules:

1) All text within Python code should be of type 'unicode'.

2) All external text that is not explicitly encoded (database storage, command-line arguments, etc.) should be presumed to be encoded as UTF-8.

3) Transitions between internal unicode and external strings should always be explicitly encoded or decoded.

Terminology

In this document, 'Python' refers to the 2.x versions of Python. My understanding is that Python 3 resolves many of these issues, and also that OpenStack is not going to be switching to 3 anytime soon.

Both the word 'string' and the word 'unicode' have multiple overlapping definitions. I'm going to try to keep these distinctions in terminology:

'Unicode text' is text that complies with the Unicode standard. Unicode text may be represented in many different ways that all comply with the standard.

A 'Unicode object' is an instance of the Python 'unicode' class. It's a particular data type that stores Unicode text.

A 'String' is a string of text (of any type.)

A 'str object' or 'string object' is an instance of the Python 'str' class. It is a binary data type that can store any kind of text.

Here's a pathological use of this terminology: "Unicode text can be stored as either a Python Unicode object, or as a properly encoded Python str. The most common way to encode Unicode text in a Python string is as 'utf-8'."

Hopefully by the end of this document those two sentences will seem less insane.

Background

Python 2 has two ways of storing text: type 'str' and type 'unicode'. The distinction between these two types is often invisible.


>>> s = 'Mot\xc3\xb6rhead'   # UTF-8 encoded bytes
>>> u = u'Mot\xf6rhead'      # Unicode object
>>> print s
Motörhead
>>> print u
Motörhead
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>


A 'Unicode' object is a list in which each item represents a Unicode character. The actual underlying size, type, and binary content of these items is a black box. A str, by contrast, is a binary data type where each item is eight bits of storage. A str might contain ASCII data, Unicode data, or any one of a hundred or so other text storage types, called 'encodings'.
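To make the distinction concrete, here is a minimal sketch that encodes one Unicode object three different ways. It is written to run unchanged on Python 2.7 and Python 3, which is why it uses explicit u'' literals and escapes; the variable names are only for illustration.

```python
# One piece of text, held as a unicode object (9 characters).
text = u'Mot\xf6rhead'

# The same text as three different byte strings.
utf8_bytes = text.encode('utf-8')      # 'ö' becomes two bytes: \xc3\xb6
latin1_bytes = text.encode('latin-1')  # 'ö' becomes one byte: \xf6
ebcdic_bytes = text.encode('cp500')    # every byte differs from ASCII

# The byte strings differ from each other...
assert utf8_bytes != latin1_bytes != ebcdic_bytes

# ...but each decodes back to the same text when the right codec is named.
assert utf8_bytes.decode('utf-8') == text
assert latin1_bytes.decode('latin-1') == text
assert ebcdic_bytes.decode('cp500') == text
```

Nothing in any of the three byte strings records which encoding produced it; that knowledge has to travel separately.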

What's bad about strs

List positions within a str don't correspond to character positions in the text. Similarly, str lengths don't necessarily correspond to text lengths.


>>> print s
Motörhead
>>> print u
Motörhead
>>> u[5]
u'h'
>>> s[5]
'r'
>>> len(s)
10
>>> len(u)
9
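The index mismatch has a practical consequence: truncating a str can cut a character in half. The following sketch (again written for both Python 2.7 and 3, with illustrative names) truncates the same text both ways.

```python
text = u'Mot\xf6rhead'
data = text.encode('utf-8')   # 10 bytes; 'ö' occupies bytes 3 and 4

# Slicing the unicode object keeps whole characters.
safe = text[:4]               # u'Mot\xf6'

# Slicing the byte string at the same index keeps only the first
# half of the two-byte sequence for 'ö'.
broken = data[:4]             # b'Mot\xc3'

try:
    broken.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    # The dangling lead byte is not valid UTF-8 on its own.
    decodable = False
```

This is one reason to do slicing, padding, and length checks on Unicode objects and encode only afterwards.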


Strs have no idea what their encoding is. If your function receives an argument of type 'str', it could be anything! Python will do its best to translate it into text when asked, but will only be making a guess.


>>> print s
Ԗ?̙????
>>> # What the heck?
>>> print s.decode('cp500')
Motörhead
>>> # Good thing I knew it was EBCDIC!


Pass an argument of type 'str' and you're asking for trouble.
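A short sketch of why this is dangerous: decoding with the wrong codec does not necessarily raise an error, it can quietly return garbage. The example assumes the bytes were produced as EBCDIC ('cp500'), as in the session above, and is written for both Python 2.7 and 3.

```python
text = u'Mot\xf6rhead'
mystery = text.encode('cp500')    # EBCDIC bytes; the value itself does not say so

# latin-1 maps every byte to some character, so this never fails...
wrong = mystery.decode('latin-1')

# ...but only the codec that produced the bytes recovers the text.
right = mystery.decode('cp500')
```

Here `wrong` is a perfectly valid Unicode object of the same length as the original, which is exactly why a bad guess can go unnoticed.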

Decoding input

When we read in text from a file or a shell, we get raw data: a str.


>>> f = open('file.txt', 'r')
>>> s = f.readline()
>>> type(s)
<type 'str'>


Since we've established that passing around strs is bad, we need to immediately decode this text into a safe-to-share Unicode object. With luck, you're reading in data that already has an explicit encoding. If it does, use it!


>>> u = s.decode('<thisistherightencoding>')


A lot of the time, though, you're reading from the shell, or from a flat text file. That gets us a string of unknown encoding. So, at this point, we're going to rely on convention. Your I/O string is probably UTF-8. Many shells use a default encoding of UTF-8 (e.g. on OS X and Ubuntu). Many text editors also use it as the default (e.g. vim and emacs).


>>> u = s.decode('utf-8')


The grand thing about UTF-8 is that it is a superset of ASCII. That means that if your input is ASCII, decoding as UTF-8 still performs a lossless transformation into a Unicode object.
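The superset property is easy to check. This sketch (for Python 2.7 and 3, with illustrative names) decodes ASCII input both ways, and also shows that UTF-8 rejects bytes written in a different 8-bit encoding instead of silently accepting them.

```python
ascii_bytes = b'Motorhead'                # plain 7-bit ASCII input

as_ascii = ascii_bytes.decode('ascii')
as_utf8 = ascii_bytes.decode('utf-8')     # same result as the ASCII decode

# latin-1 bytes for u'Mot\xf6rhead'; \xf6 is not a valid UTF-8 start byte.
latin1_bytes = b'Mot\xf6rhead'
try:
    latin1_bytes.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

A loud failure on non-UTF-8 input is arguably preferable to the silent garbage that a permissive codec would produce.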

And, yeah, it feels wrong to guess, but guessing is our only option. If your users are using an encoding that isn't UTF-8 or ASCII and they aren't explicitly tagging their data with that encoding, they are probably used to things not working.

Encoding output

Unicode is strictly an in-memory data structure. In order to serialize Unicode data it must be encoded into a str (which, remember, is a binary format and therefore trivial to serialize). There are many available standards for making a complete transformation from a Unicode object to a str that holds encoded Unicode text. Alas, Python generally defaults to an encoding that does not provide a complete transformation.


>>> f = open('file.txt','w')
>>> f.write(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
>>> # That's right, it defaults to ascii


What's worse: the default encoding cannot be set on a per-program basis, but only on a per-system basis. You don't know what the default encoding will be when your users run your program. So, the only way to be safe is to always explicitly request a Unicode-compliant encoding.


>>> f.write(u.encode('utf-8'))
>>>
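An alternative to calling encode() by hand is to let the file object do it. The sketch below uses io.open, which is available in Python 2.6+ and is the built-in open in Python 3, with an explicit encoding argument. The temporary file is only a stand-in for 'file.txt'.

```python
import io
import os
import tempfile

text = u'Mot\xf6rhead'

# Assumption: a throwaway temp file stands in for 'file.txt'.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text mode with an explicit encoding: write() takes unicode and
    # the file object encodes it as UTF-8.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode shows the bytes that actually reached the disk.
    with io.open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the same encoding decodes on the way back in.
    with io.open(path, 'r', encoding='utf-8') as f:
        round_tripped = f.read()
finally:
    os.remove(path)
```

Either approach satisfies the rule, as long as the encoding is named explicitly rather than left to the default.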


Conclusion

OpenStack is full of code like this:


mystring = infile.readline()
myreturnstring = do_some_magic_with(mystring)
outfile.write(myreturnstring)


That pattern will work remarkably well, and your American QA staff will declare it flawless. Later, you'll make a presentation to the head of IT at a German multinational named Müller, and all of a sudden your world will become a nightmare of 500s, UnicodeEncodeErrors and pink slips.

Instead, plan ahead for the Germans (and the Chinese!) and write code like this:


mystring = infile.readline()
mytext = mystring.decode('utf-8')
returntext = do_some_magic_with(mytext)
returnstring = returntext.encode('utf-8')
outfile.write(returnstring)
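For a version of that pattern that can be run end to end, here is a minimal sketch using in-memory byte streams in place of real files. The body of do_some_magic_with and the process_line wrapper are hypothetical stand-ins, not existing OpenStack code; the sketch runs on Python 2.7 and 3.

```python
import io


def do_some_magic_with(text):
    # Hypothetical stand-in for real processing. It only ever sees
    # unicode objects, never encoded bytes.
    return text.upper()


def process_line(infile, outfile, encoding='utf-8'):
    # Hypothetical wrapper: decode at the input edge, encode at the
    # output edge, and work in unicode in between.
    raw = infile.readline()
    text = raw.decode(encoding)
    result = do_some_magic_with(text)
    outfile.write(result.encode(encoding))


# Assumption: BytesIO objects stand in for files opened in binary mode.
infile = io.BytesIO(u'M\xfcller\n'.encode('utf-8'))
outfile = io.BytesIO()
process_line(infile, outfile)
output = outfile.getvalue()
```

Because the uppercase conversion runs on a Unicode object, 'ü' is treated as one character and becomes 'Ü' before being encoded again.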


Many thanks to Kumar McMillan, whose slides I have shamelessly cribbed from for much of this content.