Universal Character Set


10.8 Universal Character Set

In earlier versions of ponysay, only the output truncation supported the Universal Character Set, through hand-coded UTF-8 character counting. Now ponysay lets Python decode the data. Python stores all 31 bits of a character as a single character, not in UTF-16 as some other languages do, which means that the code is agnostic to the character encoding. However, in Unicode 6.1 there are four ranges of combining characters, and these do not take up any width in a proper terminal. We therefore have a class in the code named UCS that helps us take them into consideration when determining the length of a string.
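As an illustration of the idea, here is a minimal sketch of how a string's display length could be computed while skipping combining characters. It is not the actual UCS class: the function names are hypothetical, and the four ranges listed are an assumption (the combining diacritical marks blocks), since this section does not enumerate them.

```python
# Assumed ranges: the four combining mark blocks. This list is an
# assumption for illustration; the manual does not spell the ranges out.
COMBINING_RANGES = (
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
)


def is_combining(char):
    """Return True if the character falls in one of the assumed ranges."""
    code = ord(char)
    return any(low <= code <= high for low, high in COMBINING_RANGES)


def display_length(string):
    """Count the characters of a string that are not combining characters."""
    return sum(1 for char in string if not is_combining(char))
```

With this sketch, the string "e" followed by U+0301 (combining acute accent) counts as one column, whereas Python's len reports two. Characters that occupy two columns in a terminal are not handled here.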

Some ponies have names that contain non-ASCII characters; read about it in Environment variables. The UCS names are stored in the file share/ucsmap. In that file, lines that are not empty and do not start with a hash (#) are parsed, and each such line contains a UCS name and an ASCII:ised name. The UCS name comes first, followed by the ASCII:ised name that the UCS name should replace or link towards. The two names are separated by a simple left-to-right arrow character [U+2192], optionally with surrounding white space.
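The file format described above could be parsed roughly as follows. This is a sketch, not ponysay's own parser: the function name parse_ucsmap is hypothetical, and two behaviours are assumptions (leading white space before a hash is ignored, and lines without an arrow are skipped).

```python
ARROW = "\u2192"  # the left-to-right arrow that separates the two names


def parse_ucsmap(lines):
    """Map each UCS name to its ASCII:ised name from ucsmap-style lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        # Skip empty lines and comment lines starting with a hash.
        if not line or line.startswith("#"):
            continue
        # Assumption: a line without an arrow is malformed and is ignored.
        if ARROW not in line:
            continue
        ucs_name, ascii_name = line.split(ARROW, 1)
        # White space around the arrow is optional, so strip both names.
        mapping[ucs_name.strip()] = ascii_name.strip()
    return mapping
```

A file could be read with, for example, `parse_ucsmap(open("share/ucsmap", encoding="utf-8"))`. The encoding is also an assumption.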