Although English is a widely-used language for communication, it is by no means the only language in the world. Support for the other languages is a concern which the standards committees are still addressing. In the meantime, some non-English languages have some character support in the form of character entities, and that's what this chapter will cover.
Some fonts may not display character entities correctly. Rather than use graphics, which would guarantee correct display but take longer to load, I have chosen to stick with the defined character entities. If the text describes something different from what you see, try switching display fonts and see if that helps.
Although a great deal of information can be conveyed using standard keyboard characters, such as letters and numbers, this doesn't work as well in countries where the inhabitants speak a language other than English. (Note to American readers: such countries do exist! Honest!) Languages such as French, German and Icelandic make use of characters that aren't found on the typical American-made computer keyboard.
Furthermore, there are words in English which require such characters. If you've ever put up a page listing your employment history, then you've likely put up your resume-- which, if you were to look that word up in a dictionary, would mean that you'd put up a continuation of some sort. What you really meant to put on-line was your "résumé."
Many of these "foreign-language" characters are found in the ISO Latin Alphabet No. 1, otherwise known as Latin-1 (ISO 8859-1), which extends the ASCII character set. (Incidentally, ISO refers to the International Organization for Standardization.) Letters with acute and grave accents, umlauts, eths, and other such characters can be found in this table, as well as some currency and scientific symbols. Latin-1 is the base character set for HTML.
Only-- how to put these characters in a Web page? If you have a Macintosh or Windows machine, there may be keyboard shortcuts to produce some of these symbols. For example, on a Macintosh, hitting Option-e and then typing e will produce "é". However, if you simply type é in a Mac word processor, save the file as text, and then load it into a Web browser, you are likely to get something different, if you get anything at all.
Why does this happen? The character code used to represent "é" on a Macintosh doesn't directly translate to Latin-1. Your Web browser sees the character and tries to do something with it. The fact that it displays another character is not your browser's fault. A Web browser isn't supposed to recognize the Macintosh character set, nor should it be expected to do so.
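To see the mismatch concretely, here is a minimal sketch in Python. It is not part of the original tutorial, and it assumes that Python's standard `mac_roman` codec is a fair stand-in for the Macintosh character set mentioned above. It encodes the same letter in both character sets and then reads the Macintosh byte as if it were Latin-1.

```python
# Compare how the same character is stored in two character sets.
mac_byte = "é".encode("mac_roman")    # byte the Macintosh character set uses for é
latin1_byte = "é".encode("latin-1")   # byte Latin-1 uses for é

# A reader that assumes Latin-1 interprets the Macintosh byte differently.
misread = mac_byte.decode("latin-1")

print(mac_byte.hex(), latin1_byte.hex())  # the two byte values differ
print(misread == "é")                     # False: the character did not survive
```

The two encodings disagree about which number means "é", which is why the browser shows something else.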
Browsers are supposed to recognize the Latin-1 character set, however. This character set is a table of 256 positions (numbered 0 through 255); many of its characters appear on standard keyboards, but many do not. Its lower half is identical to the basic ASCII table that computer geeks are familiar with, although its upper half differs significantly from proprietary character sets such as the IBM and Macintosh sets.
Note that Greek, Cyrillic, Arabic, East Asian, and most other non-Latin scripts are not supported in Latin-1. That's part of what the internationalization standards committees are working on.
Okay, so how do you get these characters to appear if you can't just type them in? How did I get "é" to appear on this page, for that matter? To do this, you use what are referred to as character entities. These are special codes which let the Web browser know that it needs to use one of the Latin-1 characters.
Character entities do not look like ordinary HTML tags. Actually, they aren't tags at all, which is why we refer to them as entities. The general form of a character entity is:
&code; (where code is the character's code)
Every entity begins with an ampersand (&) and ends with a semicolon (;). Between those symbols is a code which is unique to the character in question. There are two kinds of entities: text entities and numeric entities. Each text entity uses a unique code to identify the character, and each numeric entity uses a number to represent a given character.
Text entities are fairly easy to deal with, because they are designed to be easily remembered. For example, the symbol I've been using is a lower-case e with an acute accent. The code for this character is &eacute;. With a minimum of study, you can see that the code is the words "e" and "acute" mashed together. In order to get an upper-case e with an acute accent (É), you would use the entity &Eacute;.
Notice the difference is that in the second entity, the "e" is capitalized. This brings up the important point that the capitalization of character entities is important. If you type an entity incorrectly, your browser will most likely display the whole thing verbatim. For example:
HTML Text      Result
-------------------------
&eacute;       é
&Eacute;       É
&EACUTE;       &EACUTE;
&eACUTE;       &eACUTE;
The last two didn't work because they are not valid entities-- the capitalization is all wrong.
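The case-sensitivity rule can be checked with a short Python sketch. This is an illustration rather than anything from the original tutorial: `lookup_entity` is a hypothetical helper, and it relies on the `name2codepoint` table in Python's standard `html.entities` module, which lists the HTML 4 entity names.

```python
from html.entities import name2codepoint  # HTML 4 entity names -> code points


def lookup_entity(name):
    """Return the character for an entity name, or None if the name is unknown."""
    code = name2codepoint.get(name)  # plain dict lookup, so case matters
    return chr(code) if code is not None else None


for name in ["eacute", "Eacute", "EACUTE", "eACUTE"]:
    print(f"&{name};", "->", lookup_entity(name))
```

Only the first two names are found; the other two have no entry, which mirrors the browser displaying them verbatim.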
The four most important text entities, the ones that every browser known to humanity should deal with, are as follows:
&amp;     &     (ampersand)
&lt;      <     (less-than symbol)
&gt;      >     (greater-than symbol)
&quot;    "     (double-quotation mark)
&amp; (&): Since the ampersand (&) is used as a begin-symbol for character entities, similar to the use of < to begin HTML tags, simply typing the ampersand character from the keyboard can be dangerous and confusing to a Web browser. If you want to be sure that an ampersand shows up in the browser display window as an ampersand, then the entity &amp; should be used. This is useful if you want to show a text entity, as I have been doing throughout this chapter. In order to display &eacute;, I need to type &amp;eacute;. To produce that last example, I needed to type &amp;amp;eacute; -- and so on.
&lt; and &gt; (< and >): Since these symbols are used to delimit HTML tags, there is a high probability that using them in a page will confuse a Web browser, and the odds of this happening are much higher than with an ampersand. Use of the less-than and greater-than entities ensures proper display.
&quot; ("): The double-quotation mark, since it is often used in HTML tags to enclose labels or URLs, should be represented by its text entity when it is found in normal text. This rule is probably one of the most-often ignored, because it's a lot easier to just type in the symbol on the keyboard, and nearly every browser will not have a problem displaying it as typed. Still, it's something to keep in mind as a possible source of strange display errors.
A complete list of official text entities is available; this document is valid as of 20 February 1996.
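If you generate pages with a program rather than by hand, replacing the four special characters can be automated. The following is a rough sketch in Python, which the original tutorial does not use. It relies on the standard library's `html.escape`, and the sample text is invented for illustration.

```python
import html

text = 'AT&T says 1 < 2 and "quotes" matter'
escaped = html.escape(text)  # replaces &, <, > and (by default) quote characters
print(escaped)

# Showing an entity literally takes one extra level of escaping each time.
once = html.escape("&eacute;")
twice = html.escape(once)
print(once)
print(twice)
```

The second half mirrors the ampersand discussion above: each time you want to show the entity itself rather than its character, the leading ampersand has to be escaped once more.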
Although there are 256 positions in the Latin-1 table, there are not nearly so many text entities-- well, not yet, anyway. There are a great many symbols, such as the pound sterling, which do not have associated text entities. Therefore, using numeric references is not only useful, but "more correct" from the standpoint of making the entity more universally recognizable to browsers.
These symbols are represented using numeric entities instead. Numeric entities have the form:
&#xxx; (where xxx is a number 0 - 255)
Note the use of the symbol # (referred to as a pound-sign, hash mark, or number-sign). It must precede the actual number, or else the entity will not be recognized. The number used for a specific symbol corresponds to its position in the Latin-1 table. The character at position 201 would use the code &#201; and produce the character É. Look familiar? That's right, it's a capital E with an acute accent. In other words, &#201; and &Eacute; are equivalent. Using either one will yield an É.
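The link between a character's table position and its numeric entity can be sketched in a few lines of Python. This is only an illustration: `numeric_entity` is a hypothetical helper, and the sketch assumes that the character numbers Python reports for the first 256 characters match the Latin-1 positions described above.

```python
import html


def numeric_entity(ch):
    """Build the numeric entity for a character in the Latin-1 range."""
    code = ord(ch)  # the character's position in the table
    if code > 255:
        raise ValueError("character is outside the Latin-1 table")
    return f"&#{code};"


print(numeric_entity("É"))
# The numeric and text forms decode to the same character.
print(html.unescape("&#201;") == html.unescape("&Eacute;"))
print(numeric_entity("£"))
```

The last line produces a numeric entity for the pound sterling, one of the symbols mentioned earlier as lacking a text entity.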
A complete list of numeric entities is available; this document is valid as of 20 February 1996.