A free template from Joomlashack

DIY Tip No. 6: fix those funny looking characters
Written by Christopher Bumgarner   
Friday, 27 November 2009

 

If you have been following my series about building your own website with free and open source software, you should be testing your site in different browsers to make sure they render your pages consistently. If so, you may have noticed some funny characters appear when you view your site in a different browser.  This is a common problem and is easily fixed, but fixing it takes a bit of understanding about character encodings.

Have you ever visited a website that has funny characters in the text?  Sometimes you see €,  ¿, or the Unicode replacement character . This is called mojibake and occurs when a document is tagged with an incorrect character encoding.  In English, most text will render correctly except for certain punctuation marks, namely curly quotes (aka smart quotes), curly apostrophes, the em-dash, and the en-dash.  Characters used in European languages can cause problems as well.  If you use Firefox, here an example (look in the 'download' section, Firefox renders a few characters with the Unicode replacement character).
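If you want to see mojibake happen on purpose, here is a minimal Python sketch. The sample word is my own, and I am assuming the two most common mix-ups for English text: UTF-8 bytes read as Windows-1252, and Windows-1252 bytes read as UTF-8. It is only an illustration, not what any particular browser does internally.

```python
# A sketch of how mojibake happens: the bytes are fine, the label is wrong.
original = "It\u2019s"  # "It's" written with a curly apostrophe (U+2019)

# Case 1: text saved as UTF-8 but read as Windows-1252.
utf8_bytes = original.encode("utf-8")      # b'It\xe2\x80\x99s'
garbled = utf8_bytes.decode("cp1252")      # each byte becomes its own character

# Case 2: text saved as Windows-1252 but read as UTF-8.
cp1252_bytes = original.encode("cp1252")   # b'It\x92s'
replaced = cp1252_bytes.decode("utf-8", errors="replace")

print(garbled)         # three junk characters where the apostrophe was
print(repr(replaced))  # the lone 0x92 byte becomes U+FFFD
```

In the first case the single apostrophe turns into three junk characters. In the second, the decoder gives up on the byte and substitutes the Unicode replacement character.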

"How did you manage to incorrectly tag the encoding?", you ask.  There are two main causes: 1) copying text in one encoding and pasting it into a document with another encoding (e.g. copying from a web browser into MS Word); 2) not being aware of the encoding your web server uses.

To fix this problem, you need to make sure your content is properly 'tagged' with the correct encoding. But first, let's do a brief overview of character encodings.

Character Encodings

Computers understand only binary 1s and 0s, so there needs to be a way to translate binary into human-readable text.  Developers came up with different schemes in which a specific sequence of 1s and 0s represents each letter.  Think of it like Morse code, except Morse code uses dashes and dots instead of 1s and 0s.  One early encoding standard was ASCII, a scheme in which each character is represented by a specific 7-bit sequence.
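To make the 7-bit idea concrete, here is a small Python sketch. The helper name to_ascii_bits is my own invention for illustration; it simply looks up each character's ASCII number and writes it out as seven binary digits.

```python
def to_ascii_bits(text):
    """Return the 7-bit ASCII pattern for each character in text."""
    bits = []
    for ch in text:
        code = ord(ch)  # the character's number, e.g. 72 for 'H'
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        bits.append(format(code, "07b"))  # zero-padded to 7 binary digits
    return bits

for ch, pattern in zip("Hi!", to_ascii_bits("Hi!")):
    print(ch, ord(ch), pattern)
# H 72 1001000
# i 105 1101001
# ! 33 0100001
```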

The problem with ASCII was that it had only 95 printable characters.  This was fine for US English, but it lacked the accented letters and other symbols required by other European languages. So developers extended ASCII to include more characters, which required an eighth bit.
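You can check both claims with a few lines of Python. This is just a sketch: I picked the letter é and the Latin-1 (ISO 8859-1) codec as examples of an "extended" character and an 8-bit encoding.

```python
# Count the printable characters among the 128 ASCII codes.
printable = [chr(code) for code in range(128) if chr(code).isprintable()]
print(len(printable))  # 95

# An accented letter does not fit in ASCII...
try:
    "\u00e9".encode("ascii")  # the letter e with an acute accent
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# ...but it does fit in one byte of an 8-bit encoding such as Latin-1.
latin1_byte = "\u00e9".encode("latin-1")
print(fits_in_ascii, latin1_byte[0], format(latin1_byte[0], "08b"))
# False 233 11101001
```

The value 233 is above 127, the largest number that fits in 7 bits, so the eighth bit is needed.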

If you write in English, you only need to be familiar with a few encodings.  The gold standard is UTF-8, a Unicode encoding that is the preferred character set for email and web pages.  If everyone used UTF-8, then we wouldn't have any problems.  Unfortunately, two other sets are still common for English: ISO 8859-1 and Windows-1252 (sometimes erroneously referred to as 'ANSI').  Both are 8-bit encodings and are backward compatible with ASCII, but they are not entirely compatible with each other or with UTF-8. The two encodings are often confused, and text that is in fact Windows-1252 encoded is often mislabeled ISO 8859-1. This happens so often that the draft HTML 5 specification requires text with the 8859-1 label to be parsed as Windows-1252. Confusing, isn't it?  If you have a choice, use UTF-8 so you don't have to worry about it.

Why does it matter?

Character encoding matters to your browser.  When viewing a page, you can see which encoding is in use in the View menu. If the encoding is ISO 8859-1, try switching it to UTF-8 and watch some of the text get weird.  It is usually the quotes and apostrophes that get funky. Your browser picks the encoding based on one of the page's labels, such as the HTTP Content-Type header or a meta tag in the HTML, and falls back to guessing when no label is present.  So if one of the labels is wrong, you may have problems.

Many problems arise because different tools default to different encodings. The standard encoding for XML is UTF-8, but the default encoding for the web is ISO 8859-1.  Many Windows programs default to Windows-1252, especially if you are saving a file as plain text or HTML.

Windows-1252 and ISO 8859-1 are only compatible one way.  Windows-1252 puts printable characters in two rows (byte values 0x80 through 0x9F) that 8859-1 reserves for rarely used control codes.  The two rows include the curly quotes and apostrophe (aka smart quotes), the em dash and en dash, and the ellipsis, among others.  Thus, if your 8859-1 document is mislabeled as Windows-1252, no errors will occur.  But if your Windows-1252 page is mislabeled as 8859-1, then you may see funny characters. Fortunately, most browsers are aware of that common mistake and will treat all 8859-1 pages as Windows-1252 pages.
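Here is a short Python sketch of that one-way compatibility. The sample bytes are my own; 0x93, 0x94, and 0x85 fall in the range where the two encodings disagree (0x80 through 0x9F).

```python
# Bytes a Windows program might write for: opening curly quote, Hi,
# closing curly quote, ellipsis.
raw = bytes([0x93, 0x48, 0x69, 0x94, 0x85])

as_1252 = raw.decode("cp1252")     # curly quotes and an ellipsis
as_latin1 = raw.decode("latin-1")  # invisible control codes instead

print(as_1252)
print(repr(as_latin1))  # '\x93Hi\x94\x85'

# Outside 0x80-0x9F the two encodings agree byte for byte, which is why
# reading real ISO 8859-1 text as Windows-1252 does no harm.
shared = bytes(range(0x00, 0x80)) + bytes(range(0xA0, 0x100))
print(shared.decode("cp1252") == shared.decode("latin-1"))  # True
```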

Of course, if every program used UTF-8, we wouldn't have these problems.  UTF-8 can encode every Unicode character, using between one and four bytes for each, and Unicode currently includes more than 100,000 characters.
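As a rough illustration, this Python sketch shows how many bytes UTF-8 needs for a few characters. The four sample characters and the helper name utf8_length are my own choices.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs to store one character."""
    return len(ch.encode("utf-8"))

samples = [
    "A",           # plain ASCII letter
    "\u00e9",      # e with an acute accent
    "\u20ac",      # euro sign
    "\U0001F600",  # an emoji outside the first 65,536 code points
]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
# U+0041 takes 1 byte(s)
# U+00E9 takes 2 byte(s)
# U+20AC takes 3 byte(s)
# U+1F600 takes 4 byte(s)
```

Plain ASCII text is stored exactly as before, one byte per character, which is why UTF-8 is backward compatible with ASCII.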

 



Last Updated ( Sunday, 03 January 2010 )
 
© 2008 Christopher Bumgarner. All rights reserved.