
How To Work with Unicode in Python

Published on November 30, 2022
By Vivek Kumar Singh
Developer and author at DigitalOcean.

The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

Introduction

Unicode is the standard character encoding for the majority of the world’s computers. It ensures that text—including letters, symbols, emoji, and even control characters—appears the same across different devices, platforms, and digital documents, regardless of the operating system or software being used. It is an important part of the internet and computing industry, and without it, the internet would be much more chaotic and difficult to use.

Unicode itself is not an encoding but is more like a database of almost all possible characters on earth. Unicode assigns each character in its database a code point, an identifier whose value ranges from 0 to 0x10FFFF (just over 1.1 million), which means it is highly unlikely to run out of unique code points anytime soon. Every code point in Unicode is written as U+n, where U+ signifies that it is a Unicode code point and n is a set of four to six hexadecimal digits identifying the character. Unicode is a much more robust system than ASCII, which only represents 128 characters. Exchanging digital text around the world with ASCII was difficult because it is based on American English, with no support for accented characters. Unicode, on the other hand, defines almost 150,000 characters and covers characters for every modern language on earth.
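
To see code points concretely, here is a minimal sketch using Python's built-in ord() and chr() functions, which convert between a character and its integer code point:

```python
# ord() returns a character's Unicode code point as an integer;
# chr() is its inverse, building the character from the integer.
for char in ["A", "©", "€"]:
    print(f"{char} -> U+{ord(char):04X}")  # format in the U+n notation

print(chr(0x20AC))  # the euro sign, code point U+20AC
```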

With this comes a requirement for programming languages, such as Python, to properly handle text and make it possible for software to achieve internationalization. Python can be used for a wide range of things—from email to servers to the web—and has an elegant way of handling Unicode, which it does by adopting the Unicode Standard for its strings.

Working with Unicode in Python, however, can be confusing and lead to errors. This tutorial will provide the fundamentals of how to use Unicode in Python to help you avoid those issues. You will use Python to interpret Unicode, normalize data with Python’s normalizing functions, and handle Python Unicode errors.

Prerequisites

To follow this tutorial you will need:

Step 1 — Converting Unicode Code Points in Python

Encoding is the process of representing data in a computer-readable form. There are many ways to encode data—ASCII, Latin-1, and more—and each encoding has its own strengths and weaknesses, but perhaps the most common is UTF-8. This is a type of encoding that allows characters from all over the world to be represented in a single character set. As such, UTF-8 is an essential tool for anyone working with internationalized data. In general, UTF-8 is a good choice for most purposes. It is relatively efficient and can be used with a variety of software. UTF-8 takes a Unicode code point and converts it to a sequence of one to four bytes that a computer can store and transmit. In other words, Unicode is the mapping, and UTF-8 is one way for a computer to represent that mapping.

In Python 3, strings are Unicode by default, which means that any Unicode code point written in a Python string is automatically interpreted as the corresponding character.

In this step you will create the copyright symbol (©) using its Unicode code point in Python. First, start the Python interactive console in your terminal and type the following:

>>> s = '\u00A9'
>>> s

In the preceding code you created a string s from the Unicode code point \u00A9. Since Python 3 strings are Unicode, echoing the value of s automatically displays the corresponding symbol. Note that the \u at the beginning of a code point is required; without it, Python treats the digits as ordinary text rather than a code point escape. The output of the preceding code returns the corresponding Unicode symbol:

Output
'©'
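
The \u escape is one of several equivalent ways Python offers to write a code point; a minimal sketch of three of them:

```python
# Three equivalent ways to produce the copyright sign in Python.
s1 = '\u00A9'              # four-digit \u escape for the code point
s2 = chr(0xA9)             # chr() builds the character from an integer
s3 = '\N{COPYRIGHT SIGN}'  # \N{} escape uses the official Unicode name
assert s1 == s2 == s3 == '©'
print(s1)
```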

The Python programming language provides built-in functions for encoding and decoding strings. The encode() function converts a string into a byte string.

To demonstrate this, open the Python interactive console and type the following code:

>>> '🅥'.encode('utf-8')

This produces the byte string of the character as output:

Output
b'\xf0\x9f\x85\xa5'

Note that each byte has a \x preceding it, which indicates that it is a hexadecimal number.
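
If you want to inspect those bytes yourself, a short sketch:

```python
# Inspect the raw bytes of a UTF-8 encoding.
encoded = '🅥'.encode('utf-8')
print(len(encoded))   # this character needs 4 bytes in UTF-8
print(encoded.hex())  # the same bytes as one hex string: 'f09f85a5'
for byte in encoded:
    print(hex(byte))  # iterating bytes yields integers; show each in hex
```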

Note: Typing special Unicode characters differs between Windows and macOS. In the preceding code, and in all the code in this tutorial that uses symbols, you can insert the symbols with the Character Map utility on Windows or the Character Viewer on macOS (press Ctrl+Cmd+Space). You can also copy the characters directly from the code examples.

Next, you will use the decode() function to convert the byte string back into a string. The decode() function accepts the encoding type as an argument. It is also worth mentioning that decode() is only available on byte strings, which are marked with the letter b at the beginning of the string. Removing the b and calling decode() on a regular string would result in an AttributeError.

In your console type:

>>> b'\xf0\x9f\x85\xa5'.decode('utf-8')

The code will return output like this:

Output
'🅥'
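
Encoding and decoding are inverse operations, so a round trip through UTF-8 returns the original string unchanged; a minimal sketch:

```python
# A round trip: str -> bytes -> str leaves the text unchanged.
original = 'café 🅥'
encoded = original.encode('utf-8')   # str -> bytes
decoded = encoded.decode('utf-8')    # bytes -> str
assert decoded == original
print(decoded)
```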

You now have a fundamental understanding of Unicode interpretation in Python. Next, you will dive into Python’s built-in unicodedata module to perform advanced Unicode techniques on your strings.

Step 2 — Normalizing Unicode in Python

In this step, you will normalize Unicode in Python. Normalization helps determine whether two characters that are written differently are the same, which is useful when two characters with different code points render the same result. For example, the Unicode characters R and ℜ look the same to the human eye, as they both represent the letter R, but a computer considers them to be different characters.

The following code example further demonstrates this. Open your Python console and type the following:

>>> styled_R = 'ℜ'
>>> normal_R = 'R'
>>> styled_R == normal_R

You will get the following output:

Output
False

The code prints False as the output because Python strings do not consider the two characters to be identical. This ability to differentiate is why normalization is important when working with Unicode.

In Unicode, some characters are made by combining two or more characters into one. Normalization is important in this case because it keeps your strings consistent with each other. To better understand this, open your Python console and type the following code:

>>> s1 = 'hôtel'
>>> s2 = 'ho\u0302tel'
>>> len(s1), len(s2)

In the preceding code, you created a string s1 containing the composed ô character, while string s2 spells the same letter as an o followed by the code point for the combining circumflex accent ( ̂ ). After execution, the code returns the following output:

Output
(5, 6)

The preceding output shows that the two strings render the same characters but have different lengths, which means an equality comparison will fail. Type the following in the same console to test it:

>>> s1 == s2

The code returns the following output:

Output
False

Although string variables s1 and s2 render the same text, they differ in length and therefore are not equal.
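
To see exactly where the two strings differ, you can look up each character's official name with the unicodedata.name() function; a short sketch:

```python
import unicodedata

# unicodedata.name() reveals why the strings differ: s2 spells the
# accent as a separate combining character.
s1 = 'hôtel'
s2 = 'ho\u0302tel'
print([unicodedata.name(ch) for ch in s1])
print([unicodedata.name(ch) for ch in s2])
```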

You can solve this issue with the normalize() function, which is what you will do in the next step.

Step 3 — Normalizing Unicode with NFD, NFC, NFKD, and NFKC

In this step you will normalize Unicode strings with the normalize() function from Python’s unicodedata module, which provides character lookup and normalization capabilities. The normalize() function takes a normalization form as its first argument and the string to normalize as its second. Unicode defines four normalization forms you can use for this: NFD, NFC, NFKD, and NFKC.

The NFD normalization form decomposes a character into multiple combining characters. It makes your text accent-insensitive, which can be useful for searching and sorting. You can observe the decomposition by comparing string lengths or by encoding the strings into bytes.

Open your console and type in the following:

>>> from unicodedata import normalize
>>> s1 = 'hôtel'
>>> s2 = 'ho\u0302tel'
>>> s1_nfd = normalize('NFD', s1)
>>> len(s1), len(s1_nfd)

The code produces the following output:

Output
(5, 6)

As the example demonstrates, normalizing string s1 increases its length by one character. This is because the ô symbol gets split into two characters, o and ˆ, which you can confirm by using the following code:

>>> s1.encode(), s1_nfd.encode()

The resulting output reveals that after encoding the normalized string, the o character got separated from the ˆ character in string s1_nfd:

Output
(b'h\xc3\xb4tel', b'ho\xcc\x82tel')

The NFC normalization form first decomposes a character, then recomposes it with any available combining character. Since NFC composes a string to produce the shortest possible output, the W3C recommends using NFC on the web. Keyboard input returns composed strings by default, so it’s a good idea to use NFC in that case.

As an example, type the following into your interactive console:

>>> from unicodedata import normalize
>>> s2_nfc = normalize('NFC', s2)
>>> len(s2), len(s2_nfc)

The code produces the following output:

Output
(6, 5)

In the example, normalizing string s2 decreases its length by one. You can confirm this by running the following code in your interactive console:

>>> s2.encode(), s2_nfc.encode()

The output of the code is:

Output
(b'ho\xcc\x82tel', b'h\xc3\xb4tel')

The output shows that the o and ˆ characters merged into a single ô character.
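
Building on the NFC example, a small helper sketch (the name nfc_equal is our own) shows how normalization makes equality checks robust against composed and decomposed spellings of the same text:

```python
from unicodedata import normalize

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return normalize('NFC', a) == normalize('NFC', b)

s1 = 'hôtel'
s2 = 'ho\u0302tel'
print(s1 == s2)           # False: raw comparison sees different code points
print(nfc_equal(s1, s2))  # True: both normalize to the same NFC form
```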

The NFKD and NFKC normalization forms are used for “strict” normalization and can be used for a variety of problems relating to searching and pattern matching in Unicode strings. The “K” in NFKD and NFKC stands for compatibility.

NFD and NFC normalization forms decompose characters, but NFKD and NFKC perform a compatibility decomposition for characters that are not similar but are equivalent, removing any formatting distinctions. For example, the string ②① is not similar to 21, but they both represent the same value. The NFKC and NFKD normalization forms remove this formatting (in this case the circle around the digits) from the characters in order to provide the most stripped-down form of them.
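
A minimal sketch of the circled-digit example: NFC leaves the circled digits alone, while NFKC strips the compatibility formatting and leaves plain digits:

```python
from unicodedata import normalize

circled = '\u2461\u2460'  # the string '②①'
print(normalize('NFC', circled))   # unchanged: the circles are preserved
print(normalize('NFKC', circled))  # '21': compatibility formatting removed
```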

The following example demonstrates the difference between the NFD and NFKD normalization forms. Open your Python interactive console and type the following:

>>> s1 = '2⁵ô'
>>> from unicodedata import normalize
>>> normalize('NFD', s1), normalize('NFKD', s1)

You will get the following output:

Output
('2⁵ô', '25ô')

The output reveals that the NFD form could not decompose the exponent character in string s1, but NFKD stripped the exponent formatting and replaced the compatibility character (the exponent ⁵) with its equivalent (the digit 5). Remember that the NFD and NFKD normalization forms still decompose characters, so the ô character should add one to the string’s length, as you saw in the earlier NFD example. You can confirm this by running the following code:

>>> len(normalize('NFD', s1)), len(normalize('NFKD', s1))

The code will return the following:

Output
(4, 4)

The NFKC normalization form works in a similar way, but composes characters rather than decomposing them. In the same Python console type the following:

>>> normalize('NFC', s1), normalize('NFKC', s1)

The code returns the following:

Output
('2⁵ô', '25ô')

Since NFKC follows the composition approach, the ô character stays composed as a single code point, so the normalized string is one character shorter than the NFKD result rather than one longer. You can confirm this by running the following line of code:

>>> len(normalize('NFC', s1)), len(normalize('NFKC', s1))

This will return the following output:

Output
(3, 3)

By performing the preceding steps, you will have working knowledge of the types of normalization forms and the differences between them. In the next step you will solve Unicode errors in Python.

Step 4 — Solving Unicode Errors in Python

Two types of Unicode errors can arise when handling Unicode in Python: UnicodeEncodeError and UnicodeDecodeError. While these errors can be confusing, they can be managed, and you will fix both of them in this step.

Solving a UnicodeEncodeError

Encoding in Unicode is the process of converting the Unicode string into bytes using a particular encoding. A UnicodeEncodeError occurs when trying to encode a string that contains characters that cannot be represented in the specified encoding.

To create this error you will encode a string that contains characters that are not part of the ASCII character set.

Open your console and type the following:

>>> ascii_supported = '\u0041'
>>> ascii_supported.encode('ascii')

Following is your output:

Output
b'A'

Then type the following:

>>> ascii_unsupported = '\ufb06'
>>> ascii_unsupported.encode('utf-8')

You’ll get the following output:

Output
b'\xef\xac\x86'

Finally, type the following:

>>> ascii_unsupported.encode('ascii')

When you run this code, however, you will get the following error:

Output
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ufb06' in position 0: ordinal not in range(128)

ASCII has a limited number of characters, and Python raises an error when it finds a character that is not in the ASCII charset. Since the ASCII charset does not recognize the code point \ufb06, Python returns an error message explaining that the decimal equivalent of this code point falls outside ASCII’s range of 128.
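
In a real program you often want to catch this error rather than let it crash; a minimal sketch, using a helper name (to_ascii) of our own:

```python
def to_ascii(text):
    """Return the ASCII bytes of text, or None if it is not pure ASCII."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        return None  # signal that the text cannot be represented in ASCII

print(to_ascii('A'))       # b'A'
print(to_ascii('\ufb06'))  # None: the st ligature is outside ASCII
```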

You can handle a UnicodeEncodeError by passing an errors argument to the encode() function. The errors argument accepts several values; this tutorial covers three of them: ignore, replace, and xmlcharrefreplace.

Open your console and type the following:

>>> ascii_unsupported = '\ufb06'
>>> ascii_unsupported.encode('ascii', errors='ignore')

You’ll get the following output:

Output
b''

Next type the following:

>>> ascii_unsupported.encode('ascii', errors='replace')

The output will be:

Output
b'?'

Finally, type the following:

>>> ascii_unsupported.encode('ascii', errors='xmlcharrefreplace')

The output is:

Output
b'&#64262;'

In each case, Python does not throw an error.

As demonstrated in the preceding example, ignore skips the character that cannot be encoded, replace replaces the character with a ?, and xmlcharrefreplace replaces unencodable characters with an XML entity.
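
The three handlers can be compared side by side in a short loop:

```python
# Compare the three error handlers on the same unencodable character.
text = '\ufb06'  # the 'st' ligature, not representable in ASCII
for handler in ('ignore', 'replace', 'xmlcharrefreplace'):
    print(handler, text.encode('ascii', errors=handler))
```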

Solving a UnicodeDecodeError

A UnicodeDecodeError occurs when trying to decode a byte string that is not valid in the specified encoding.

To create this error, you will decode a byte string with an encoding different from the one it was encoded with.

Open your console and type the following:

>>> iso_supported = '§'
>>> b = iso_supported.encode('iso8859_1')
>>> b.decode('utf-8')

You will get the following error:

Output
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte

If you encounter this error, you can use the errors argument in the decode() function, which can help you decode the string. The errors argument accepts several values; two common ones are ignore and replace.

To demonstrate this, open your Python console and type the following code:

>>> iso_supported = '§A'
>>> b = iso_supported.encode('iso8859_1')
>>> b.decode('utf-8', errors='replace')

Your output will be:

Output
'�A'

Then type the following:

>>> b.decode('utf-8', errors='ignore')

Your output will be:

Output
'A'

In the preceding example, using the replace value in the decode() function substituted a � replacement character for the bytes the utf-8 decoder couldn’t decode, while ignore dropped those bytes entirely.

While decoding any string, note that you cannot safely assume its encoding. To decode a byte string correctly, you must know how it was encoded.
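
A minimal sketch demonstrating why the source encoding matters:

```python
# Decoding with the codec the bytes were actually encoded with succeeds;
# guessing the wrong codec raises UnicodeDecodeError.
raw = '§'.encode('iso8859_1')   # b'\xa7'
print(raw.decode('iso8859_1'))  # the correct codec recovers '§'
try:
    raw.decode('utf-8')         # the wrong codec cannot decode the byte
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc)
```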

Conclusion

This article covered the fundamentals of how to use Unicode in Python. You encoded and decoded strings, normalized data using NFD, NFC, NFKD, and NFKC, and resolved Unicode errors. You also learned how normalization forms apply to sorting and searching. These techniques will help you handle Unicode problems with Python. As a next step, you can read the unicodedata module documentation to learn about other features this module offers. To continue exploring how to program with Python, read our tutorial series How To Code in Python 3.
