The ins and outs of displaying Greek text on the web… Part II

Monday, September 4th, 2006

Well, it is time to re-visit this topic as I have the promised server-side program ready for public consumption. You can freely use it in whatever manner you please for non-commercial purposes. Before we talk about the program we will talk some more about unicode and fonts. You didn’t think we covered everything last time, did you?

Please note that I am supplying the links to the places and topics mentioned in this article at the end rather than interspersing them in the text.

In the old days all we had were ascii codes: basically 7 bits (although most systems supported 8 bits) that were used to encode the various latin characters and a few special ones. This posed a problem once we wanted to start displaying a much larger range of characters. A number of encoding schemes came about, most of which are still around today, but the best one, and the one that is clearly the future, is unicode. Unicode gives every character its own number (a codepoint), and encodings such as UTF-8 store each codepoint in a variable number of bytes, which leaves room for over a million characters. When you use unicode you need to set up your HTML properly. We addressed this in our last article on this subject.

However, lots of pages use font-specific encoding schemes to display Greek characters instead of real unicode. What does this mean? Well, let’s look at an example. Here is one from Liddell, Scott and Jones using the GraecaII font:

fh`/"

Hmmm… Not overly readable. What it actually says is φῇς, which is hardly obvious. This looks weird because the GraecaII font uses particular ascii (8 bit) sequences to identify, or encode, the final character. Most fonts have these schemes. One well-known encoding scheme is the TLG beta code system, which I am sure most of us are familiar with. In the example above, f = φ and h = η and ` = ͅ (iota subscript, or ypogegrammeni) and / = ͂ (circumflex, or perispomeni) and finally " (the double quote, written &quot; in HTML source) is the terminal sigma (as opposed to medial). Quite a mouthful. Only 8 bit characters were used to encode this word but it only displays properly in the GraecaII font (and any other that uses the same encoding sequences). So, we need a way to translate from the particulars of a font’s encoding scheme to standard unicode. For us to accomplish this, we need to understand two things.

First, Combining Diacritical Marks. These are all the little marks that are added onto letters in many languages. We know the Greek ones but other languages have the same issues, such as the umlaut in German. All these marks are defined in unicode using the codepoints from 0x300 to 0x36F. You can simply print these after a basic character and they will combine to some extent. It doesn’t always look particularly good, so we will want to translate to the actual precomposed Greek character, usually in the extended set at 0x1F00 to 0x1FFF.
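
To see the difference between a base letter followed by combining marks and a single precomposed character, here is a minimal sketch. It assumes Perl's core Unicode::Normalize module, which is not what greek.pm uses internally; it simply shows the same idea with a standard tool.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# eta, then combining perispomeni (U+0342), then combining ypogegrammeni (U+0345)
my $combining = "\x{03B7}\x{0342}\x{0345}";

# NFC composes the sequence into one precomposed character where one exists.
my $composed = NFC($combining);
printf "composed:   U+%04X (%d character)\n", ord($composed), length($composed);

# NFD goes the other way and splits the precomposed character up again.
my $decomposed = NFD($composed);
print "decomposed: ",
    join( ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $decomposed ), "\n";
```

The composed result is U+1FC7, which lives in the extended Greek block. Normalization also puts the marks into a canonical order first, so typing the iota subscript before the circumflex leads to the same character.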

Now, luckily for us, the kind and worthy folks over at SIL International have done a lot of work in this area. This is our second piece of information. They have produced some nice utilities that will translate these mappings to normal unicode. Unfortunately, their utilities work only on files and only files of particular types. This is highly inconvenient so we will need to roll our own utilities. This doesn’t prevent us from using their map files, however. They have maps based on many commonly used fonts that show exactly what ascii codes combine to make a particular end character.
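
The idea behind such a map can be sketched with a tiny hand-written table. The hash below is an assumption made up for illustration from the fh`/" example earlier; it is not the SIL map file or its format, and font_to_unicode is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a GraecaII-style mapping, built by hand from the
# correspondences in the example above. It is NOT the real SIL map.
my %graeca_to_unicode = (
    'f' => "\x{03C6}",    # phi
    'h' => "\x{03B7}",    # eta
    '`' => "\x{0345}",    # combining iota subscript
    '/' => "\x{0342}",    # combining circumflex
    '"' => "\x{03C2}",    # final sigma
);

# Translate one character at a time; unknown characters pass through unchanged.
sub font_to_unicode {
    my ($text) = @_;
    return join '',
        map { exists $graeca_to_unicode{$_} ? $graeca_to_unicode{$_} : $_ }
        split //, $text;
}

my $unicode = font_to_unicode('fh`/"');
print join( ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $unicode ), "\n";
# prints: U+03C6 U+03B7 U+0345 U+0342 U+03C2
```

A real map has to handle multi-character sequences and many more codes, which is why reusing the SIL files is so much nicer than typing tables by hand.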

So now we can translate from the ascii encoding of some font to some combining unicode diacritical marks. Now all we have to do is figure out which combination of diacritical marks maps to which single unicode character. For this information we turn to the official unicode website. They have a file that lists every unicode character and the various characters that go into making that final character. It is a bit of a messy file, and the utility I wrote to get the information we need had to check each character in a recursive fashion, because each character can consist of up to two other characters, which can in turn consist of two other characters, and so on. Luckily I did all this work for you. :)
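
For the curious, the recursive lookup can be sketched roughly as follows. This is not the utility I used, just a minimal illustration: the three sample lines follow the UnicodeData.txt layout but are cut off after the sixth field (the decomposition), and the names expand, %decomp and %compose are made up for this sketch.

```perl
use strict;
use warnings;

# Sample lines in UnicodeData.txt layout, truncated after the sixth field
# (the decomposition mapping) to keep the sketch short.
my @lines = (
    '03B7;GREEK SMALL LETTER ETA;Ll;0;L;',
    '1FC6;GREEK SMALL LETTER ETA WITH PERISPOMENI;Ll;0;L;03B7 0342',
    '1FC7;GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI;Ll;0;L;1FC6 0345',
);

# Build codepoint => [component codepoints] for canonical decompositions.
my %decomp;
for my $line (@lines) {
    my @field   = split /;/, $line, -1;
    my $mapping = $field[5];
    next if !defined $mapping || $mapping eq '';
    next if $mapping =~ /^</;    # skip compatibility mappings such as <compat>
    $decomp{ hex( $field[0] ) } = [ map { hex($_) } split ' ', $mapping ];
}

# Recursively expand a codepoint until only undecomposable characters remain.
sub expand {
    my ($cp) = @_;
    return ($cp) unless exists $decomp{$cp};
    return map { expand($_) } @{ $decomp{$cp} };
}

# Invert the table: fully expanded sequence => single precomposed codepoint.
my %compose;
for my $cp ( keys %decomp ) {
    $compose{ join( ' ', expand($cp) ) } = $cp;
}

print join( ' ', map { sprintf( 'U+%04X', $_ ) } expand(0x1FC7) ), "\n";
printf "U+%04X\n", $compose{ join( ' ', 0x03B7, 0x0342, 0x0345 ) };
```

The first line printed is the fully expanded sequence for ῇ (eta, circumflex, iota subscript) and the second is the single character that sequence maps back to.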

So, now we will talk a little bit about the program. It is in Perl, which is ugly and cumbersome but widely available and could be easily converted to PHP or whatever, and is implemented as a standard Perl package. You will need to download the .ZIP file and extract the program (greek.pm) to the /site/lib/Unicode directory. Here is an example of how to use it:

   use Unicode::greek;

   my $greek = Unicode::greek->new;

   # These two just convert between beta code and unicode

   $my_real_unicode_string = $greek->beta2unicode ( "fh=|s" );
   $my_beta_code_string = $greek->unicode2beta ( $my_real_unicode_string );

   # Stripping diacriticals is sometimes useful when comparing strings,
   # for example, in lexical lookups
   # since no one seems to be able to agree on breathing marks and, more
   # importantly, many people are unsure or forget.
   # By stripping the diacriticals, you are much more likely to get a match.

   $plain_beta_code = $greek->StripDiacriticals ( $my_beta_code_string );

   # Upper case and lower case conversion routines are also included.

   $my_upper_case_unicode = $greek->ucase ( $my_real_unicode_string );
   $my_lower_case_unicode = $greek->lcase ( $my_real_unicode_string );

   # Now for the font mapping routines. They are quite simple. They use the file
   # format supplied by SIL. They do not do any error checking.

   $my_GraecaII_font_map_reference = $greek->ReadFontMap ( "fonts/GraecaII.map" );
   $my_SD_font_map_reference = $greek->ReadFontMap ( "fonts/SemiticaDict.map" );

   $greek->SetFontMap ( $my_GraecaII_font_map_reference );
   $my_unicode_from_GraecaII = $greek->map2unicode ( "fh`/\"" );
   $greek->SetFontMap ( $my_SD_font_map_reference );
   $my_unicode_from_SD = $greek->map2unicode ( $some_text_encoded_in_SD_format );
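
The comments in the example mention stripping diacriticals before comparing strings. As a rough illustration of the same idea applied directly to unicode text, here is a sketch using the core Unicode::Normalize module. It is not how StripDiacriticals in greek.pm is implemented, and strip_marks is a name made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Decompose the text, then delete every nonspacing mark (general category Mn).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

my $word  = "\x{03C6}\x{1FC7}\x{03C2}";    # phi, eta with circumflex and iota subscript, final sigma
my $plain = strip_marks($word);            # phi, eta, final sigma

print join( ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $plain ), "\n";
# prints: U+03C6 U+03B7 U+03C2
```

Running both the lookup key and the lexicon entries through the same routine means that a missing or wrong breathing mark no longer prevents a match.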

Well, there you have it. It has only been lightly tested, so if you find any problems, please let me know right away so I can fix them and upload a new version.

See you all at the SBL Annual meeting in Washington, DC, I hope. :)

Julian

My program described above: greek.zip

SIL International (SIL) have lots of font utilities and maps. Check out their many maps here: Conversion Maps

Unicode (unicode) have a text file that explains all their characters here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and a page, which you should read first, explaining what it all means here: http://www.unicode.org/Public/UNIDATA/UCD.html.