The ins and outs of displaying Greek text on the web…

For my first real blog entry, I will talk about some important programming aspects involved in displaying Greek text on the web. There are several issues involved that are frequently handled very poorly by various web pages. This blog entry will go step by step through the process, including the actual code later on. The four parts to be considered are:

  1. Character encoding
  2. Character sets, font files and Unicode
  3. Preferences and customization
  4. Conversions

The first one involves declaring the correct character encoding (the charset parameter that accompanies the document’s MIME type) so that the browser decodes the page properly. Without this the Greek text is likely to appear as garbage. The user can override the encoding, but how many users know that this is what they are supposed to do? How many even know how to do it?

The user can change the encoding on the View menu, selecting Encoding (or Character Encoding in Firefox) and then selecting Unicode (UTF-8). For an example of what erroneous encoding can do to your Greek display, check out Justin’s First Apology here: http://khazarzar.skeptik.net/books/justinus/apolog1g.htm

They selected a Cyrillic encoding and the result is obvious. If you change the encoding to UTF-8 the text suddenly becomes legible (provided you can read Greek to begin with, of course.) Notice also how the German on that first page becomes legible as well.
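
To make the failure concrete, here is a minimal JavaScript sketch (not taken from the page above) of what happens when UTF-8 bytes are read with the wrong encoding. It assumes a runtime that provides TextEncoder and TextDecoder, and it uses Latin-1 rather than Cyrillic as the wrong guess purely to keep the code short. The function names are made up for illustration.

```javascript
// Encode a string to its UTF-8 bytes.
function utf8Bytes(text) {
    return new TextEncoder().encode(text);
}

// Misread the bytes as a single-byte encoding (Latin-1):
// every byte becomes one character, which is what produces the garbage.
function decodeAsLatin1(bytes) {
    var out = '';
    for (var i = 0; i < bytes.length; i++)
        out += String.fromCharCode(bytes[i]);
    return out;
}

// Read the same bytes with the encoding they were written in.
function decodeAsUtf8(bytes) {
    return new TextDecoder('utf-8').decode(bytes);
}

var greek = '\u03BB\u03CC\u03B3\u03BF\u03C2';   // "logos" in Greek letters, 5 characters
var bytes = utf8Bytes(greek);                    // 10 bytes, two per Greek letter
var garbled = decodeAsLatin1(bytes);             // 10 characters, starting with "Î»"
var restored = decodeAsUtf8(bytes);              // identical to the original string
```

The bytes never change. Only the interpretation does, which is why switching the browser to UTF-8 repairs the page instantly.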

They should have set the encoding in their web document and freed the user, who is quite possibly clueless, from this rather cryptic task. It is good practice to start every web page with the following unless you have a very good reason not to.

For HTML, put this line as early as possible in the head of every document:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

This ensures that the text is displayed correctly, no matter what language you are using, as long as the file itself is actually saved as UTF-8. For server-side scripting it is much the same. Here is what it looks like in Perl:

	print "Content-Type: text/html; charset=utf-8\n\n";

Make sure that you include both newline characters, since the blank line they produce marks the end of the HTTP headers; without it the script will fail. For other scripting languages, check here: http://www.w3.org/International/O-HTTP-charset

Next, we will discuss character sets, font files and Unicode. The official Greek Unicode charts can be found here: http://www.unicode.org/charts/ Note that there are two charts, both within the 16-bit range. The first runs from 0x370 to 0x3ff and covers the regular upper case and lower case letters as well as the characters with tonoi. The second runs from 0x1f00 to 0x1fff and covers the lower and upper case letters combined with their diacritical marks. The layout is pretty decent and helps when converting characters. Most of the well-known Greek fonts available for download cover these characters. At least one standard Windows font (Tahoma) also covers the entire range.
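
As a small illustration of the two ranges, the following JavaScript sketch reports which chart a character belongs to. The block names are the official Unicode ones; the function and variable names are invented for this example and are not part of any library.

```javascript
// The two Unicode blocks discussed above, with their official block names.
var GREEK_BLOCKS = [
    { name: 'Greek and Coptic', first: 0x0370, last: 0x03FF },
    { name: 'Greek Extended',   first: 0x1F00, last: 0x1FFF }
];

// Return the name of the Greek block containing the first character of ch,
// or null if that character lies outside both blocks.
function greekBlockOf(ch) {
    var codePoint = ch.codePointAt(0);
    for (var i = 0; i < GREEK_BLOCKS.length; i++) {
        var block = GREEK_BLOCKS[i];
        if (codePoint >= block.first && codePoint <= block.last)
            return block.name;
    }
    return null;
}

// True if the text contains at least one character from the second chart,
// i.e. a precomposed letter carrying polytonic diacritical marks.
function usesPolytonicGreek(text) {
    for (var i = 0; i < text.length; i++)
        if (greekBlockOf(text.charAt(i)) === 'Greek Extended')
            return true;
    return false;
}
```

A check like usesPolytonicGreek can be useful when deciding whether a font that only covers the first chart would be good enough for a given text.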

This ties into our third point neatly. Everybody has different tastes in Greek fonts. I, personally, like the Tahoma font because it is clean, crisp, widely available and looks good when displayed in a normal size. I find it fairly essential that users are allowed to customize the font choice if the site is heavily dependent on Greek characters. There is really no excuse not to do this since it is rather uncomplicated. I won’t talk much about server-side font selection since there are about a thousand ways of doing this and if you know how to do server-side programming then you don’t need me to explain how to work the font selection. Much can be done client-side, however, using JavaScript and the Document Object Model.

The method I have chosen is to modify the global stylesheet, although there are some cross-browser issues. It is also rather fuzzy since the entries look like JSON but they really aren’t. This is a problem with most objects that didn’t originate from the JavaScript core: it looks like a duck, quacks like a duck, but try and treat it like a duck and you’ll be sad. I will probably write more on this issue at a future date, especially the problems with the Array object as returned from the DOM and other places. Anyway, this is all solvable, as will be seen below.
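
The duck problem can be shown without a browser. In the sketch below, fakeRules is a hypothetical stand-in for the list of rules a stylesheet returns: it has numbered entries and a length, but it is not a real Array. The helper name toRealArray is likewise made up for the example.

```javascript
// A stand-in for a stylesheet's rule list: indexed entries and a length,
// but none of the Array methods. (Hypothetical data for illustration.)
var fakeRules = {
    0: { selectorText: 'body',   style: { fontFamily: 'Arial' } },
    1: { selectorText: '.greek', style: { fontFamily: 'Tahoma' } },
    length: 2
};

// Copy any array-like object into a genuine Array.
function toRealArray(arrayLike) {
    return Array.prototype.slice.call(arrayLike);
}

// The array-like object has no filter method of its own.
var hasFilterBefore = typeof fakeRules.filter === 'function';   // false

// Once converted, the usual Array methods work.
var greekRules = toRealArray(fakeRules).filter(function (rule) {
    return rule.selectorText === '.greek';
});
```

A plain for loop over the length, as used in the example page, sidesteps the issue entirely, which is one reason to prefer it when old browsers matter.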

When allowing the user to select the font you should also be kind enough to remember his selection and set it upon his next return. Let’s first present the complete but simple example of how to do all this.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<style type="text/css">
/* Mark all elements that  display Greek with class=greek
* You can add whatever elements you want in addition to the font-family
*/
.greek
{
    font-family: Tahoma;
}

</style>
<script type="text/javascript">
function setFont( fontName )
{
    var theRules = [];
    if ( document.styleSheets[ 0 ].cssRules )        // Standards-based browsers.
        theRules = document.styleSheets[ 0 ].cssRules;
    else if ( document.styleSheets[ 0 ].rules )      // Older Internet Explorer.
        theRules = document.styleSheets[ 0 ].rules;

    for ( var n = 0; n < theRules.length; n++ )
        if ( theRules[ n ].selectorText == '.greek' )
            theRules[ n ].style.fontFamily = fontName;

    setCookie ( 'userFontName', fontName );
}

function setCookie ( cName, cValue )
{
    var exdate = new Date();

    exdate.setDate ( exdate.getDate() + 365 ); // Set for one year.
    // The expires attribute expects a UTC date string.
    document.cookie = cName + '=' + encodeURIComponent ( cValue ) + ';expires=' + exdate.toUTCString();
}

function getCookie ( cName )
{
    if ( document.cookie.length > 0 )    // Are there any cookies to read?
    {
        // Declared with var so that start and end do not leak into the global scope.
        var start = document.cookie.indexOf ( cName + '=' );
        if ( start != -1 )
        {
            start = start + cName.length + 1;
            var end = document.cookie.indexOf ( ";", start );
            if ( end == -1 ) end = document.cookie.length;
            return ( decodeURIComponent ( document.cookie.substring ( start, end ) ) );
        }
    }

    return ( null );
}

function winLoad ()
{
    var ck = getCookie ( 'userFontName' );    // Declared locally rather than as an implicit global.
    if ( ck )
    {
        setFont ( ck );

        var fontSelect = document.getElementById ( 'fontSelect' );
        for ( var n = 0; n < fontSelect.options.length; n++ )
            if ( fontSelect.options[ n ].value == ck )
                fontSelect.selectedIndex = n;
    }
}

</script>
</head>
<body onload="winLoad();">
<div class=greek>
ἐπειδήπερ πολλοὶ ἐπεχείρησαν ἀνατάξασθαι διήγησιν περὶ τῶν πεπληροφορημένων…
</div>
<div>
Regular text here…
<select id="fontSelect" onchange="setFont ( this.options[ this.selectedIndex ].value );">
<option value=Tahoma>Tahoma
<option value=SPIonic>SPIonic
</select>
</div>
<div class=greek>
καθὼς παρέδοσαν ἡμῖν οἱ ἀπ’ ἀρχῆς αὐτόπται καὶ ὑπηρέται γενόμενοι τοῦ λόγου
</div>
</body>
</html>

If you want to try out this program, make sure that you save the document in a format that supports Unicode, ideally UTF-8 so that it matches the meta tag. Word or Wordpad will both do this: just pick Save As… and change the Save as type… If you see garbage on your screen, you saved it in a format that doesn’t support wide characters.

The body of the program is pretty simple. There is a DIV tag marked as containing Greek text (you could mix the Greek in with other languages as long as the font selected has those characters), then a SELECT which allows you to select a font, and then another section of Greek. You can add as many fonts to the SELECT as you like.

We have an onload event for this document. It gets the cookie (if it exists), changes the stylesheet and makes the SELECT start with the current font selection. The cookie is set for a year; simply change the 365 to some other number of days if you wish. The setFont function goes to the first stylesheet, finds the ‘greek’ class and sets the font. It also updates the cookie.

That’s it. Nothing to it. Anyone is free to copy the above and use it as they see fit.
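
One caveat for anyone who does copy it: the substring search in getCookie will also match a cookie whose name merely ends in the name you asked for. If the site sets other cookies, a stricter lookup might look like the sketch below. It is only a sketch: readCookie is a hypothetical helper that takes the cookie string as a parameter (so it can be tried outside a browser), and it assumes the values were written with encodeURIComponent.

```javascript
// Look up one cookie by exact name in a string shaped like document.cookie,
// e.g. "a=1; userFontName=Tahoma". Returns null if the name is absent.
function readCookie(cookieString, cName) {
    var pairs = cookieString.split(';');
    for (var i = 0; i < pairs.length; i++) {
        var pair = pairs[i].replace(/^\s+/, '');   // Drop the space after each ';'.
        var eq = pair.indexOf('=');
        if (eq === -1) continue;                   // Skip entries without a value.
        if (pair.substring(0, eq) === cName)
            return decodeURIComponent(pair.substring(eq + 1));
    }
    return null;
}

// A made-up cookie string in which a plain substring search for
// "userFontName=" would hit the first cookie by mistake.
var sample = 'xuserFontName=Arial; userFontName=Palatino%20Linotype';
var found = readCookie(sample, 'userFontName');   // "Palatino Linotype"
```

In the page itself you would call it as readCookie ( document.cookie, 'userFontName' ).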

So now we can display the font properly, we know the character set layout, we can let the user select a font and remember it for future use. What’s left? The hardest part, as a matter of fact.

Conversion is an interesting topic. When I say conversion, I mean conversion between upper case, lower case, betacode, stripping diacriticals, HTML character entities and so on… It is entirely lame that I have to transliterate Greek into betacode on some sites in order to do a search when I have the Tavultesoft Keyman (which I highly recommend to everyone, it is excellent and free) installed.
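
To give a flavour of the simpler conversions, here is a JavaScript sketch. It is not the Perl program mentioned in this entry, it does not attempt betacode, and the function names are invented for the example. It assumes a runtime that supports String.prototype.normalize and Array.from.

```javascript
// Strip accents, breathings and iota subscripts: decompose each letter (NFD),
// drop the combining marks (U+0300 to U+036F), then recompose (NFC).
function stripDiacritics(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').normalize('NFC');
}

// Upper case without diacritics, which is handy as a search key.
function toSearchKey(text) {
    return stripDiacritics(text).toUpperCase();
}

// Replace every non-ASCII character with a numeric HTML character reference.
// Array.from walks the string by code point rather than by UTF-16 unit.
function toHtmlEntities(text) {
    return Array.from(text, function (ch) {
        var codePoint = ch.codePointAt(0);
        return codePoint > 127
            ? '&#x' + codePoint.toString(16).toUpperCase() + ';'
            : ch;
    }).join('');
}
```

Comparing search keys instead of raw strings lets a search for unaccented text find the fully accented form, which is most of what a site needs before it can accept real Greek input in its search box.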

I don’t know of any conversion programs out there, I searched, so I ended up writing my own in Perl. I was going to post it as part of this entry but I am realizing that it is not yet quite ready for public consumption. If you need it in a hurry let me know, otherwise I will simply post a link to it here once it is finished, which won’t be long. Really. It essentially does all the conversions I mentioned above. It came in handy when trying to marry up the MorphGNT and XML version of Strong’s, which is a story in itself. I will relay that in one of my next entries. The MorphGNT is actually surprisingly accurate, more so than many other GNT sites and tools that I have seen. The Strong’s…? Not so much. That blog entry will also give me a chance to rant about the poor use of computers, the NA27, ridiculous pricing and some fairly pathetic approaches to the whole technology issue with regards to biblical studies.

For now, this was my first entry. I doubt anyone will read this far. If you did, then I hope I have been of some assistance. I have worked with this for a while now and have gathered some knowledge in this area, so any questions are welcome, since I realize that this was a rather short entry that left out a large number of details.

Julian

10 Comments on “The ins and outs of displaying Greek text on the web…”

  1. Ikkarim Says:

    Very nice post. Perhaps we could persuade you to also talk about Hebrew text?

  2. Julian Says:

    Thank you kindly. :)

    The problem is that I don’t know any Hebrew, at all. I know just enough about the OT to understand what the NT is talking about. ;)

    However, the title for this blog entry is kind of a trick, because in reality the above comments will work for any language in Unicode format, including Hebrew. So above, anywhere you see the word ‘greek,’ simply replace it with ‘hebrew.’ :) The few differences would be the fonts; I don’t know what Hebrew fonts are out there. The Hebrew Unicode chart can be found here: http://www.unicode.org/charts/ right alongside the Greek, check out the middle of the second column. The Hebrew range is 0x590 through 0x5ff, and 0xfb00 through 0xfb4f for the presentation forms. I have no idea what that means but the principles remain the same.

    I have to go to Europe for a few weeks which will slow down my next post which is regarding the Pericope de Adultera, or more specifically how to analyze texts using statistics. Again, it will focus on the Greek NT but in reality you could replace the words ‘greek’ and ‘NT’ with ‘Hebrew’ and ‘OT’ and get the same result.

    Julian

  3. Peter G. Says:

    Julian, this is a helpful post. Thanks much!

  4. helena Says:

    I’m designing a site in greek at the mo and have exactly the same problem as in yor example file. The thing is that I do have the

    in my file. I also configured my apache (i think) to show greek characters but the problem is still the same. Is there anything else that can go wrong??

  5. helena Says:

    hmm - i do have the meta tag…is what i was trying to say

  6. Julian Says:

    Hi Helena,

    sorry for the response delay but I was on a 3-day business trip on the west coast with no time to do any personal stuff.

    I am not sure, based on the information you give, what your problem might be. If your meta tag is done correctly and your font is also correct (i.e. is a unicode font) then it would be your server or browser. Find your error by a process of elimination, i.e. make/find an example that works and then turn it into one that is identical to the one that fails. This transformation should happen step by step so that once it goes from working to non-working you know which piece is tripping you up.

    Alternatively, you could just contact me via email so that we can discuss your problem in greater detail. :) Just click on contact on the menu above.

    Julian

  7. helena Says:

    Thought I would post here again just in case anyone has the same problem as me. I fixed the above problem by changing the default char set in my apache config to unicode 8. It seems that there was some setting in my apache config that was overriding my meta tag and displaying the characters in the apache default char set regardless. It works now.

    Thanks very much Julian for the very useful site and also for the speedy reply. :)

  8. cpanon Says:

    Hi Julian
    I am just getting into multibyte characters at the client-side html markup and on the server-side struts/jsp dynamic side. Is there some editor (MSWord?) and character map that will allow me to create the text? It seems a very cumbersome process of looking up the character you want in one of those two code charts(how do you specify which you are using in the html) and entering the three digit ascii code(is there an easier way of doing that), it just seems widely inefficient. Sorry for the naivete of the questions, I was just expecting a MSWord type of “pampered” visual editing environment.

  9. Julian Says:

    Hi cpanon,

    you have a number of options. The problem can be partitioned into two parts: entry of unicode and display of unicode. Many programs will display unicode, including MSWord and most decent programming editors, just make sure that the options are set to display it, if necessary.

    Now, as for your question about entering unicode, there are several methods to enter characters. One way is to use the ‘Insert -> Symbol’ in MSWord. Obviously, this is somewhat inefficient even though it is a visual process. It is okay for small entries, though. I use that option when writing things in Danish/Swedish/Norwegian/German/Anglo-Saxon/Norse/Icelandic/Whatever, for example. Another method is to remap your keyboard to make Windows think that it is keyed to a different language. Under Vista, for example, you can do this under ‘Control Panel -> Regional and Language Options -> Keyboards and Languages.’ Finally, the method I tend to use is a program like Keyman to enter the characters. For example, I have Keyman configured for Greek and I can simply type the betacode on my keyboard and perfect Greek appears complete with diacriticals. :) (Not quite true, actually. I used to have it set up under XP but haven’t gotten around to setting it up under Vista, but I am assuming a similar functionality). The nice thing about betacode input is that it is very easy to figure out and in under two minutes you can be entering Greek about as fast as you normally type. Good stuff.

    Hope this helps.

    Julian

  10. SaulaTolamulp Says:

    Hi

    As newly registered user i just want to say hello to everyone else who uses this forum ;-)
