Like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode. Unicode isn't hard to understand, but it does cover some low-level CS concepts, like byte order. Reading about Unicode is a nice lesson in design tradeoffs and backwards compatibility. Read these notes alone, or as a follow-up to Joel's Unicode article above. If you're like me, you'll get an itch to read about the details in the Unicode specs or in Wikipedia.

The concept of "A" is something different than marks on paper, the sound "aaay" or the number 65 stored inside a computer. An encoding is just a method to transform an idea (like the letter "A") into raw data (bits and bytes). The idea of "A" can be encoded many different ways. Encodings differ in efficiency and compatibility.

When reading data, you must know the encoding used in order to interpret it properly. If you see the number 65 in binary, what does it really mean? "A" in ASCII? Your age? Your IQ? Unless there is some context, you'd never know. Imagine if someone came up to you and said "65". You'd have no idea what they were talking about. Now imagine they came up and said "The following number is an ASCII character: 65". Weird, yes, but see how much clearer it is?

Embrace the philosophy that a concept and the data that stores it are different.

You've probably heard of the ASCII/ANSI character sets. They map the numeric values 0-127 to various Western characters and control codes (newline, tab, etc.). Note that values 0-127 fit in the lower 7 bits of an 8-bit byte. ASCII does not explicitly define what values 128-255 map to.

Now, ASCII encoding works great for English text (using Western characters), but the world is a big place. To solve this, computer makers defined "code pages" that used the undefined space from 128-255, mapping it to the various characters they needed. Unfortunately, 128 additional characters aren't enough for the entire world, so code pages varied by country (Russian code page, Hebrew code page, etc.).

If people with the same code page exchanged data, all was good. Character #200 on my machine was the same as Character #200 on yours. But if code pages mixed (Russian sender, Hebrew receiver), things got strange. The character mapped to #200 was different in Russian and Hebrew, and you can imagine the confusion that caused for things like email and birthday invitations. It's a big IF whether someone will read your message using the same code page you used to write it. If you visit an international website, for example, your browser could try to guess the code page if it was not specified ("Hrm… this text has a lot of character #213 and #218… probably Hebrew"). But clearly this method was error-prone: code pages needed to be rescued.

The world had a conundrum: nobody could agree on what numbers mapped to what letters in the range above ASCII. The Unicode group went back to the basics: letters are abstract concepts. Unicode labeled each abstract character with a "code point". For example, "A" mapped to code point U+0041 (the code point is written in hex; it is 65 in decimal). The Unicode group did the hard work of mapping each character in every language to some code point (not without fierce debate, I am sure). When all was done, the Unicode standard left room for over 1 million code points, enough for all known languages with room to spare for undiscovered civilizations. For fun, you can browse the code points with the charmap utility (Start Menu > Run > Charmap) or online.
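To make the code page problem concrete, here is a small Python sketch. It reads byte value 200 under three Windows code pages and then looks at the single code point for "A" under a few encodings. The helper `decode_byte` and the choice of `cp1251`, `cp1255` and `cp1252` are my own picks for illustration (Python's codec names for the Cyrillic, Hebrew and Western European Windows code pages), not something the text above prescribes.

```python
import unicodedata


def decode_byte(value: int, codepage: str) -> str:
    """Interpret a single byte value using the given code page (hypothetical helper)."""
    return bytes([value]).decode(codepage)


# The same stored number, read under three different code pages.
# Assumption: these Python codec names stand in for "Russian", "Hebrew"
# and "Western" code pages; other code pages would work the same way.
for codepage in ("cp1251", "cp1255", "cp1252"):
    char = decode_byte(200, codepage)
    print(f"{codepage}: U+{ord(char):04X} {unicodedata.name(char)}")

# A Unicode code point is just a number attached to an abstract character.
assert ord("A") == 0x41 == 65
assert chr(0x41) == "A"

# One code point, several encodings: the stored bytes depend on the encoding.
encodings_of_a = {
    name: "A".encode(name)
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
for name, data in encodings_of_a.items():
    print(f"{name}: {data.hex()}")
```

The first loop shows why mixed code pages caused confusion: the byte never changes, only the table used to read it. The second loop shows the opposite direction: the character never changes, but the bytes that store it do.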