Anyone who has written a web page has probably had to work out what character encoding to use for their web site, which is communicated from the web server to the browser using either the charset field in the HTML content meta tag, or the Content-Type header. Most new web sites are developed with the UTF-8 character set encoding as it can be used to represent any character possible (including Kanji, Chinese, Arabic etc), but many legacy web sites deal with an extended ASCII character set commonly called latin1. Our web site is one of those web sites that has always used the latin1 character set, and more importantly, all our database tables are encoded in that same character set. Up until this past week, I had always worked under the assumption that the ISO-8859-1 character set was in fact the same as latin1, but in the case of MySQL we found out this NOT to be the case, even though all browsers will happily display a page encoded in ISO-8859-1 with characters that are not technically legal for that encoding.
The first part of the problem we had is that most content that is encoded in latin1 that most people probably think is ISO-8859-1 is not actually using that encoding, but is rather using the Windows-1252 character set encoding, which is a superset of ISO-8859-1. For this reason, most browsers will happily go ahead and accept pages that present themselves as ISO-8859-1 but in reality they treat the page as the Windows-1252 encoding so that all the extended characters like smart quotes (“ and ”) and the Euro symbol (€) work correctly.
Now the second part of the problem we had is that all our MySQL databases are encoded in latin1, and in fact MySQL does not have any character set called ISO-8859-1 or Windows-1252 at all. So although ISO-8859-1 is commonly known as latin1 according to Wikipedia, in the case of MySQL it is really Windows-1252, or what they call the cp1252 encoding. Specifically they state:
"latin1is the default character set. MySQL’s
latin1is the same as the Windows
cp1252character set. This means it is the same as the official
ISO 8859-1or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA
latin1treats the code points between
0x9fas “undefined,” whereas
cp1252, and therefore MySQL’s
latin1, assign characters for those positions. For example,
0x80is the Euro sign.”
So why exactly is this a problem, if the browsers will happily accept a page presenting itself as ISO-8859-1 and let characters from the Windows-1252 character set work properly? Normally it would not, because as long as we can take the string from the MySQL database and send them directly to the browser, or take string from the browser and stick them directly into the database, everything works as expected (even though the encoding is not technically legal). And in fact until recently, this never caused a problem because the pages on our web site were written in PHP and internally PHP stores the strings in the native Windows-1252 encoding. Hence reading a value from the database and sending it to the browser works just fine.
Enter Phalanger (again). Phalanger is a compiler for PHP that compiles the PHP code into native .NET code so it can run natively under ASP.NET on the web server or as native .NET executable if run from the command line. Apart from being a lot faster than running it under the normal PHP interpreter, it also allows for interoperability between PHP code and native .NET code such as C# etc. Now the problem is that when running under .NET, strings are represented in Unicode format (UTF-16 to be exact). So when we read a value from the database under Phalanger, it happily converts our strings from the database latin1/Windows-1252 encoding into a native .NET System.String type, which is encoded as a Unicode UTF-16 format string. Since the MySQL database is really using the Windows-1252 code page, the strings all get converted correctly to the appropriate characters in Unicode. Hence when it takes the Unicode strings and sends them to the browser, it has to convert them to the correct format to put into the HTML page, which we had previously set up as ISO-8859-1 since we always used that with our existing PHP pages. Now the problem is that .NET is very strict about character encoding and since we told the ASP.NET server to use the ISO-8859-1 encoding, when it came across invalid characters like smart quotes mentioned above it simply replaced them with a ? character to indicate that it could not convert them!
So after a lot of Googling, head scratching and testing, it finally became clear the ISO-8859-1 != Windows-1252. And also ISO-8859-1 != MySQL latin1! The end result is that once we changed the character encoding for our Phalanger pages to Windows-1252, everything magically worked .