The mystery of character encoding: a programmer's eternal pain

Character encoding is every programmer's nightmare. Wherever Chinese text is involved, you keep running into one encoding problem after another, and these problems are hard to solve, especially on Linux, because much of the software was developed for English-speaking countries and never took other languages' encodings into account. After falling into countless encoding pits, I decided to take a closer look at the problem. It is like a pit in the road right in front of you: every time you pass, you fall in. Someone who climbs out each time and carries on gets called a warrior, but that is a warrior of brute force. As an intellectual warrior of the new era, you should not keep falling into the same pit again and again.
How files are stored:
Files come in many types, such as the common .txt, .cpp, .h, .c, .xml, .png, and .rmvb, as well as custom types. Whatever the type, every file is stored on the hard disk as binary data, and each file type is interpreted by different software. This article does not cover how files are stored, only how they are parsed.
Parsing text files:
A text file is one that humans can read. How does binary data become text? Because the computer was invented in America, the first question considered was how to represent English. English has 26 letters; together with upper and lower case, digits, punctuation, and control characters there are 128 characters in total, which 7 bits can represent, so each one fits in a single byte. This is known as the ASCII code. The mapping is simple: one character corresponds to one byte.
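The one-character-to-one-byte mapping can be checked with a few lines of Python. This is only a small sketch; the helper name is_ascii is made up for this illustration.

```python
# ASCII maps every character to exactly one byte with a value below 128.
text = "Hello!"
data = text.encode("ascii")

print(list(data))            # one small integer per character
print(len(text), len(data))  # same count: one byte per character


def is_ascii(raw: bytes) -> bool:
    """Return True if every byte fits in 7 bits (high bit is 0).

    Hypothetical helper for this example, not a standard function.
    """
    return all(b < 0x80 for b in raw)


print(is_ascii(data))                 # plain English text passes
print(is_ascii("中".encode("utf-8")))  # a Chinese character does not
```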
But it soon became clear that the scripts of non-English-speaking countries need far more characters than ASCII provides. Each country devised its own encoding in its own way; GB2312, for example, is the encoding China created for Chinese. With every country using its own encoding, switching back and forth became too much trouble. A new encoding, Unicode, then emerged to unify them all by assigning every character its own code. But storing Unicode codes directly has two problems:
1, Many documents contain only ASCII characters, so storing them as wide Unicode codes wastes space.
2, Nothing marks how many bytes should be parsed together as one symbol.
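Both problems can be seen in a short Python sketch. It uses UTF-32 as a stand-in for "store every Unicode code at a fixed width", which is an assumption made for illustration rather than something the text above specifies.

```python
# Problem 1: fixed-width storage wastes space on ASCII-only text.
ascii_text = "plain English text"
as_ascii = ascii_text.encode("ascii")
as_utf32 = ascii_text.encode("utf-32-be")  # 4 bytes per character, no marker

print(len(as_ascii), len(as_utf32))        # the second is four times the first

# Problem 2: raw bytes do not say how many of them form one symbol.
raw = "中".encode("utf-32-be")              # bytes 00 00 4E 2D
as_one_symbol = raw.decode("utf-32-be")     # read 4 bytes at a time
as_two_symbols = raw.decode("utf-16-be")    # read 2 bytes at a time

print(len(as_one_symbol), len(as_two_symbols))
```

The same four bytes yield one character or two depending on what the reader assumes, which is exactly the ambiguity described in point 2.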
UTF then came to the rescue. UTF is a way of storing Unicode, only smarter. UTF-16 uses two or four bytes per character, and UTF-32 always uses four. UTF-8 is a particularly clever representation that uses one to four bytes:
1, For a single-byte symbol, the first bit is 0 and the remaining 7 bits hold the code.
2, For an N-byte symbol (N > 1), the first N bits of the first byte are set to 1 and bit N + 1 is 0; each following byte begins with 10; the remaining bits hold the code.
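The two rules above can be written out as a small hand-rolled encoder. This is a sketch for illustration only: the function name utf8_encode is made up here, it does not reject surrogate code points the way a real encoder does, and in practice you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # Rule 1: single byte, first bit 0, remaining 7 bits hold the code.
        return bytes([code_point])
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    elif code_point <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    # The n - 1 trailing bytes each start with bits 10 and carry 6 code bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # Rule 2: the first byte starts with n one-bits followed by a zero bit;
    # whatever code bits are left fill the rest of that byte.
    prefix = (0xFF << (8 - n)) & 0xFF
    lead = prefix | code_point
    return bytes([lead] + tail[::-1])


for ch in ("A", "é", "中", "😀"):
    encoded = utf8_encode(ord(ch))
    print(ch, [format(b, "08b") for b in encoded])
```

Printing the bytes in binary makes the prefixes visible: 0 for one byte, 110 for two, 1110 for three, 11110 for four, and 10 on every trailing byte.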
To tell the encodings apart, a marker (the byte order mark, or BOM) can be placed at the very start of the text. For "Unicode" files (UTF-16) it is two bytes: FE FF indicates big-endian and FF FE indicates little-endian. For UTF-8 it is the three bytes EF BB BF. As you can see, UTF-8 is self-describing, so a file does not need to carry this marker, and most programs can identify the encoding without it. But some programs do not recognize the marker; PHP, for example, treats it as part of the text instead of ignoring it. I believe many people who have seen garbled PHP output or parse errors ran into exactly this problem.
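Checking for these markers takes only a few lines of Python. The sketch below uses the BOM constants from the standard codecs module; the function name detect_bom is made up for this example, and UTF-32 markers are deliberately left out to keep it short.

```python
import codecs

# Marker bytes and the encoding each one announces.
# UTF-32 is omitted in this sketch; its little-endian marker also begins
# with FF FE, so a complete detector would have to test for it first.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(raw: bytes):
    """Return (encoding, marker_length), or (None, 0) if no marker is found."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(detect_bom(sample))

# A decoder that does not know about the marker keeps it as a character,
# which is the kind of stray output described above.
kept = sample.decode("utf-8")          # starts with the U+FEFF character
dropped = sample.decode("utf-8-sig")   # this codec strips the marker
print(repr(kept), repr(dropped))
```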
How to solve the problem:
If you have Vim, this is easy. To remove the marker, run these commands and then save the file:
set encoding=utf-8
set nobomb
To add the marker, run these commands and then save the file:
set encoding=utf-8
set bomb
Or use the built-in encoding conversion in Notepad++.
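If neither editor is at hand, the same fix can be scripted. The following Python sketch is not from the original post; the function names strip_utf8_bom and add_utf8_bom and the demo file name are made up for illustration, and it reads the whole file into memory, so it is meant for ordinary source files rather than huge ones.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading EF BB BF from the file; return True if one was removed."""
    with open(path, "rb") as f:
        raw = f.read()
    if not raw.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(raw[len(codecs.BOM_UTF8):])
    return True


def add_utf8_bom(path: str) -> bool:
    """Prepend EF BB BF if the file lacks it; return True if one was added."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + raw)
    return True


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.php")
    with open(demo_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php echo 'hi';")

    removed = strip_utf8_bom(demo_path)
    with open(demo_path, "rb") as f:
        after_strip = f.read()

    added = add_utf8_bom(demo_path)
    with open(demo_path, "rb") as f:
        after_add = f.read()

print(removed, after_strip)
print(added, after_add[:3])
```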

Posted by Darren at December 04, 2013 - 12:41 PM