The ins and outs of displaying Greek text on the web… Part II
Well, it is time to revisit this topic as I have the promised server-side program ready for public consumption. You can freely use it in whatever manner you please for non-commercial purposes. Before we talk about the program we will talk some more about Unicode and fonts. You didn’t think we covered everything last time, did you?
Please note that I am supplying the links to the places and topics in this article at the end rather than interspersing them in the text.
In the old days all we had were ASCII codes: basically 7 bits (although most systems supported 8 bits) that were used to encode the various Latin characters and a few special ones. This posed a problem once we wanted to start displaying a much larger range of characters. A number of encoding schemes came about, most of which are still around today, but the best one, and the one that is clearly the future, is Unicode. Unicode gives every character its own number (a code point), and encodings such as UTF-8 store those code points using a variable number of bytes, so there is room for far more characters than any single-byte scheme could hold. When you use Unicode you need to set up your HTML properly. We addressed this in our last article on this subject.
However, lots of pages still use font-specific encoding schemes, rather than Unicode, to display Greek characters. What does this mean? Well, let’s look at an example. Here is an example from Liddell, Scott and Jones using the GraecaII font:
fh`/"
Hmmm… Not overly readable. What it actually says is φῇς, which is hardly obvious. This looks weird because the GraecaII font uses particular sequences of single-byte (8-bit) codes to identify, or encode, the final character. Most fonts have such schemes. One well-known encoding scheme is the TLG Beta Code system, which I am sure most of us are familiar with. In the example above, it follows that f = φ and h = η and ` = ͅ (iota subscript, or ypogegrammeni) and / = ͂ (circumflex, or perispomeni) and finally &quot;, which is the HTML method of writing " (double quote), is the terminal sigma (as opposed to medial). Quite a mouthful. Only 8-bit characters were used to encode this word, but it only displays properly in the GraecaII font (and any other that uses the same encoding sequences). So, we need a way to translate from the particulars of a font’s encoding scheme to standard Unicode. For us to accomplish this, we need to understand two things.
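To make the character-by-character reading above concrete, here is a minimal sketch in Perl. The table is hand-written and holds only the five correspondences listed in the paragraph above; it is not the real GraecaII map, and the names %font_to_unicode and font_text_to_combining are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical table: only the five correspondences from the example above.
# A real font map has many more entries.
my %font_to_unicode = (
    'f' => "\x{03C6}",   # GREEK SMALL LETTER PHI
    'h' => "\x{03B7}",   # GREEK SMALL LETTER ETA
    '`' => "\x{0345}",   # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
    '/' => "\x{0342}",   # COMBINING GREEK PERISPOMENI (circumflex)
    '"' => "\x{03C2}",   # GREEK SMALL LETTER FINAL SIGMA
);

# Translate one character at a time; characters not in the table pass through.
sub font_text_to_combining {
    my ($text) = @_;
    return join '', map { $font_to_unicode{$_} // $_ } split //, $text;
}

my $word = font_text_to_combining('fh`/"');

# Show the resulting code points rather than the glyphs.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $word), "\n";
# prints: U+03C6 U+03B7 U+0345 U+0342 U+03C2
```

The result is still a base letter followed by separate combining marks, which is why a second step is needed to arrive at a single precomposed character.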
First, Combining Diacritical Marks. These are all the little marks that are added onto letters in many languages. We know the Greek ones, but other languages have the same issue, such as the umlaut in German. All these marks are defined in Unicode at the code points from 0x300 to 0x36F. You can simply print these after a basic character and they will combine to some extent. It doesn’t always look particularly good, so we will want to translate to the actual precomposed Greek character, usually in the extended set at 0x1F00 to 0x1FFF.
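The relationship between a base letter plus combining marks and the precomposed character can be seen with the core Unicode::Normalize module. This is only a sketch of the idea using standard Unicode normalization, not the method used inside the program described later.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Eta followed by two combining marks, in the order a font map might emit them.
my $combining = "\x{03B7}\x{0345}\x{0342}";   # eta, ypogegrammeni, perispomeni

# NFC sorts the marks into canonical order and composes them where it can.
my $composed = NFC($combining);

printf "%d code points before, %d after\n", length($combining), length($composed);
# prints: 3 code points before, 1 after

printf "U+%04X\n", ord($composed);
# prints: U+1FC7 (GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI)

# NFD goes the other way: back to the base letter plus combining marks.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, NFD($composed)), "\n";
# prints: U+03B7 U+0342 U+0345
```

Note that the decomposed form comes back with the two marks in canonical order, which is not necessarily the order in which they were typed.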
Now, luckily for us, the kind and worthy folks over at SIL International have done a lot of work in this area. This is our second piece of information. They have produced some nice utilities that will translate these mappings to normal Unicode. Unfortunately, their utilities work only on files, and only on files of particular types. This is highly inconvenient, so we will need to roll our own utilities. This doesn’t prevent us from using their map files, however. They have maps based on many commonly used fonts that show exactly which ASCII codes combine to make a particular end character.
So now we can translate from the ASCII encoding of some font to a base character plus some combining Unicode diacritical marks. Now all we have to do is figure out which combination of diacritical marks maps to which single Unicode character. For this information we turn to the official Unicode website. They have a file (UnicodeData.txt, linked at the end) that lists every Unicode character and the characters that go into making that final character. It is a bit of a messy file, and the utility I wrote to get the information we need had to check each character in a recursive fashion, because each character can consist of up to two other characters, which can in turn be two other characters, and so on. Luckily I did all this work for you.
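The recursive lookup described above can be sketched with the core Unicode::UCD module, which exposes the decomposition field from the same Unicode data. This is not the utility I wrote; the function name full_decomposition is made up for the illustration, and compatibility mappings (the ones tagged with angle brackets) are deliberately left alone.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Return the list of code points a character ultimately breaks down into.
sub full_decomposition {
    my ($cp) = @_;
    my $info   = charinfo($cp);
    my $decomp = $info ? $info->{decomposition} : '';

    # No mapping, or a compatibility mapping such as "<compat> ...": stop here.
    return ($cp) if !defined $decomp || $decomp eq '' || $decomp =~ /^</;

    # A canonical mapping lists one or two hex code points; recurse into each.
    return map { full_decomposition(hex $_) } split ' ', $decomp;
}

# U+1FC7 decomposes to U+1FC6 U+0345, and U+1FC6 in turn to U+03B7 U+0342.
print join(' ', map { sprintf 'U+%04X', $_ } full_decomposition(0x1FC7)), "\n";
# prints: U+03B7 U+0342 U+0345
```

Running this over the whole Greek range and inverting the result gives a table from mark combinations back to single characters, which is the direction the conversion needs.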
So, now we will talk a little bit about the program. It is written in Perl, which is ugly and cumbersome but widely available, and it could easily be converted to PHP or whatever. It is implemented as a standard Perl package. You will need to download the .ZIP file and extract the program (greek.pm) to the
use Unicode::greek;
my $greek = Unicode::greek->new;
# These two just convert between beta code and Unicode
$my_real_unicode_string = $greek->beta2unicode ( "fh=|s" );
$my_beta_code_string = $greek->unicode2beta ( $my_real_unicode_string );
# Stripping diacriticals is sometimes useful when comparing strings,
# for example, in lexical lookups
# since no one seems to be able to agree on breathing marks and, more
# importantly, many people are unsure or forget.
# By stripping the diacriticals, you are much more likely to get a match.
$plain_beta_code = $greek->StripDiacriticals ( $my_beta_code_string );
# Upper case and lower case conversion routines are also included.
$my_upper_case_unicode = $greek->ucase ( $my_real_unicode_string );
$my_lower_case_unicode = $greek->lcase ( $my_real_unicode_string );
# Now for the font mapping routines. They are quite simple. They use the file
# format supplied by SIL. They do not do any error checking.
$my_GraecaII_font_map_reference = $greek->ReadFontMap ( "fonts/GraecaII.map" );
$my_SD_font_map_reference = $greek->ReadFontMap ( "fonts/SemiticaDict.map" );
$greek->SetFontMap ( $my_GraecaII_font_map_reference );
$my_unicode_from_GraecaII = $greek->map2unicode ( "fh`/\"" );
$greek->SetFontMap ( $my_SD_font_map_reference );
$my_unicode_from_SD = $greek->map2unicode ( $some_text_encoded_in_SD_format );
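The point made in the comments above about stripping diacriticals before a lexical lookup can also be tried directly on Unicode strings. The following is a minimal sketch using only core Perl modules. It is not how StripDiacriticals in greek.pm works (that routine takes beta code); strip_marks is a made-up name for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Decompose, drop every nonspacing mark, then recompose what is left.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return NFC($decomposed);
}

# The same two letters typed with a smooth and with a rough breathing.
my $smooth = "\x{1F10}\x{03BD}";   # epsilon with psili, nu
my $rough  = "\x{1F11}\x{03BD}";   # epsilon with dasia, nu

print "match after stripping\n" if strip_marks($smooth) eq strip_marks($rough);
# prints: match after stripping
```

Both spellings reduce to plain epsilon and nu, so a lookup keyed on the stripped form finds the entry whichever breathing the user typed.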
Well, there you have it. It has only been sparsely tested so if you find any problems, please let me know right away so I can fix them and upload a new version.
See you all at the SBL Annual meeting in Washington, DC, I hope.
Julian
My program described above: greek.zip
SIL International (SIL) have lots of font utilities and maps. Check out their many maps here: Conversion Maps
Unicode (unicode) have a text file that explains all their characters here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and a page, which you should read first, explaining what it all means here: http://www.unicode.org/Public/UNIDATA/UCD.html.
July 2nd, 2007 at 11:19 am
Unicode Greek is a bit of a scam. It’s easy really.
All you do is to use the <font> tag as we always did (or CSS, nowadays) and specify a font which is unicode and has characters for Greek (most unicode fonts only implement a subset of the full spec). Palatino Linotype on Windows does, for instance.
Then between the font tags, simply enter the HTML codes for the Greek characters. Find these by looking in the Unicode specs, or using the Windows charmap utility. This tells us that alpha with a smooth breathing is Unicode 1F00. Just specify ampersand, hash, x, 1F00, semicolon, and it will appear just fine.
Of course then one can write loads of scripts to convert things. I have one which allows me to type in the text in the SpIonic form that I am used to, and then convert it to unicode/html entities. But the principle is easy.
It works for Syriac too. You do just the same. But you have to specify that the paragraph is formatted right-to-left, so it displays OK.
July 2nd, 2007 at 11:22 am
I.e. &H1F00&59; is &H1F00; (I hope this works, because I get no preview!).
July 2nd, 2007 at 9:28 pm
Well, what you are trying to show wouldn’t work in this case because of the tags added by WordPress.
You are correct, but only up to a point. You call unicode a scam and then you proceed to use unicode.
When you type 1F00 in the manner you did above, you are using the Unicode character codes. What you are showing is how to display Unicode characters using HTML notation, which works in 7-bit ASCII as opposed to a Unicode-encoded file. And it doesn’t solve the problem of copying a section of text from, say, BibleWorks 7, pasting it into a browser and having it display badly. The Perl script I provided will do the translation to proper Unicode, whether your editor of choice supports it natively or you prefer to type it in using HTML notation. The real problem is simply the choice between text mapped to a font file (like TLG, BibleWorks, and, gee, everybody else), which is very silly, and pure Unicode (regardless of notation :p ), which works with every Unicode font.
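For anyone who does prefer the HTML notation, going from a Unicode string to 7-bit ASCII with numeric character references takes only a few lines. This is a rough sketch, not part of greek.pm; to_html_refs is a made-up name, and the routine does not escape characters such as & or < that have their own meaning in HTML.

```perl
use strict;
use warnings;

# Replace every non-ASCII character with a hexadecimal character reference.
sub to_html_refs {
    my ($text) = @_;
    (my $html = $text) =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;
    return $html;
}

# phi, eta with perispomeni and ypogegrammeni, final sigma
print to_html_refs("\x{03C6}\x{1FC7}\x{03C2}"), "\n";
# prints: &#x03C6;&#x1FC7;&#x03C2;
```

The output is plain ASCII, so it survives being saved in an ordinary text file and pasted into a page regardless of the encoding that page declares.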
BTW, for the record, I prefer Tahoma, then Arial or Verdana. The advantage of Tahoma is that the Greek characters look good even in regular size. The Palatino Linotype looks awful in any size smaller than Fisher-Price standard.
I am thinking that maybe it would be worthwhile making a web page that would accept a paste job from your Bible program and convert it to Unicode and HTML notation so that you could paste it from there into your confrontational forum of choice.
Julian
July 3rd, 2007 at 1:26 pm
You’re quite right.
By “a scam” I meant that people start fretting about all sorts of things which are simply irrelevant, fiddling with browser settings and the like.
But all we have to do is stick with ordinary Windows text files in 7-bit ASCII — none of this messing around with 8- or 16-bit text — and just write the HTML entities in. It doesn’t matter that they are unicode entities; so far as an HTML scribe is concerned, it’s just another code.
I admit that pasting Greek text in from bible-works programs has never been a concern of mine, so I defer to your point here. (As a rule all I want to do is type in some Greek polytonic text, and put it online).
Interesting to hear your thoughts on fonts. Quite a lot will support some Greek, and I wasn’t aware of that particular wrinkle on Palatino Linotype.