
Characters, Words, and Paragraphs

An HTML user agent should present the body of an HTML document as a collection of typeset paragraphs and preformatted text. Except for preformatted elements (PRE, XMP, LISTING, TEXTAREA), each block structuring element is regarded as a paragraph by taking the data characters in its content and the content of its descendant elements, concatenating them, and splitting the result into words, separated by space, tab, or record end characters (and perhaps hyphen characters). The sequence of words is typeset as a paragraph by breaking it into lines.
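
The word-splitting rule above can be sketched in Python as follows. This is only an illustration: the name `paragraph_words` is hypothetical, treating both Carriage Return and Line Feed as record end characters is an assumption, and the optional hyphen handling and the line-breaking step are left out.

```python
import re

# Separators named in the text: space, tab, and record end.
# Treating both CR and LF as record-end characters is an assumption of this sketch.
_WORD_SEPARATORS = re.compile(r"[ \t\r\n]+")


def paragraph_words(chunks):
    """Concatenate data-character chunks and split the result into words."""
    text = "".join(chunks)
    # Drop the empty strings that leading or trailing separators produce.
    return [word for word in _WORD_SEPARATORS.split(text) if word]


# The chunks stand in for the data characters of an element and its descendants.
words = paragraph_words(["An HTML user\tagent ", "should\r\npresent", " text."])
print(words)
```

Because the chunks are concatenated before splitting, a word that spans two descendant elements (for example `["foo", "bar"]`) comes out as the single word `foobar`.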

The HTML Document Character Set

The document character set specified in section SGML Declaration for HTML must be supported by HTML user agents. It includes the graphic characters of Latin Alphabet No. 1, or simply Latin-1. Latin-1 comprises 191 graphic characters, including the alphabets of most Western European languages. (24) (25)
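
The figure of 191 can be checked with a little arithmetic. The sketch below assumes the usual layout of ISO 8859-1, in which graphic characters occupy code positions 32 to 126 and 160 to 255; the variable names are illustrative only.

```python
# Code positions that ISO 8859-1 assigns to graphic characters (assumed layout):
# 32-126, shared with ASCII, and 160-255, the upper half.
graphic_positions = list(range(32, 127)) + list(range(160, 256))
print(len(graphic_positions))  # 95 + 96 = 191

# Each position maps to one character under Python's latin-1 codec.
sample = bytes([71, 246, 100, 101, 108]).decode("latin-1")
print(sample)
```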

In SGML applications, the use of control characters is limited in order to maximize the chance of successful interchange over heterogeneous networks and operating systems. In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).
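
A minimal Python sketch of a check for this restriction is shown below. The function name `disallowed_controls` is hypothetical, and treating code positions 0 to 31 and 127 to 159 as the control characters is an assumption based on the layout of ISO 8859-1, not something stated in this section.

```python
# Code positions of the three permitted control characters:
# Horizontal Tab (9), Line Feed (10), Carriage Return (13).
ALLOWED_CONTROLS = {9, 10, 13}


def disallowed_controls(text):
    """Return (index, code position) pairs for control characters outside the allowed set."""
    found = []
    for index, char in enumerate(text):
        code = ord(char)
        # Assumption: controls are positions 0-31 plus 127-159.
        is_control = code < 32 or 127 <= code <= 159
        if is_control and code not in ALLOWED_CONTROLS:
            found.append((index, code))
    return found


# Form Feed (12) is a control character that is not in the allowed set.
print(disallowed_controls("a\tb\x0cc\r\n"))
```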

The HTML DTD references the Added Latin 1 entity set, to allow mnemonic representation of selected Latin 1 characters using only the widely supported ASCII character repertoire. For example:

Kurt G&ouml;del was a famous logician and mathematician.
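
As a rough illustration, the Python sketch below resolves such an entity reference with the standard library. Note the assumption: `html.unescape` and `html.entities.name2codepoint` use Python's own entity tables, which cover more than the "Added Latin 1" set, so this shows the idea and is not a conforming HTML parser.

```python
import html
from html.entities import name2codepoint

source = "Kurt G&ouml;del was a famous logician and mathematician."

# The marked-up source uses only characters from the ASCII repertoire.
source_is_ascii = all(ord(char) < 128 for char in source)

# The mnemonic name "ouml" maps to Latin-1 code position 246.
ouml_position = name2codepoint["ouml"]

# Replace the entity reference with the character it names.
rendered = html.unescape(source)
print(source_is_ascii, ouml_position, rendered)
```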

See section ISO Latin 1 Character Entity Set for a table of the "Added Latin 1" entities, and section The HTML Coded Character Set for a table of the code positions of [ISO 8859-1] and the control characters in the HTML document character set.

