The State of the Koine Greek Parser, Part I

As promised I will give a brief overview of where I am. It has been a few months since I looked at it last so I am a bit rusty. It would seem that I was far more clever back then than I appear to be at the present. Actually, I did have to re-discover some of the knowledge I possessed back when I was deeply immerse in the undertaking, knowledge which had been relegated to that dusty cabinet in the attic of some forgotten lobe to make room for more pertinent items. I will briefly go over the project addressing various parts in relative isolation.

How does it work?
It is basically a combination of Dependency Grammar (DG) and Phrase Structure Grammar (PSG), the latter implemented without recursion, the former applying only a single rule in a single pass. This, obviously, must seem weird to the more experienced computational linguist, who traditionally like to a apply a single grammar algorithm to a given dataset. However, more and more we are seeing grammar combine prior paradigms into more advanced algorithms. I have high hopes that this will eventually be the case with my approach. My first goal, however, is to achieve worthwhile recall, then consider how to construct a single algorithm from what is currently an arbitrarily applied sequence of grammar rules. The reason for the unorthodox implementation is that no single algorithm seemed to solve the problems presented by Koine Greek, a free word order language. Free word order languages cause tremendous problems for the traditional, probably ‘early’ would be a better word here, approaches since they depend on the word order to infer much of the meaning. English is highly dependent on word order, programming languages even more so. Free word order languages, like Greek and Latin, need very little order, relying instead on morphology to determine the role of any given word. That being said, a Greek sentence do evidence islands of local word order predictability. For some easy examples, check out Wallace’ Greek Grammar Beyond the Basics on his treatment of adjective/noun combinations. This is why PSG comes in handy. It is also why it wasn’t implemented in a recursive manner, and by that I mean that it doesn’t utilize a formal grammar in the traditional way in the form of terminals and non-terminals, trying to reduce or expand upon them. The PSG finds the ordered groupings and the DG finds the ones that are generally scattered, although DG will, of course, find the traditional groupings as well. The problem with DG in particular, but also with PSG, is their greedy nature. They will try to solve too much of the puzzle, making backtracking far too complicated when the wrong solution occurs first. I could write a small novel on this topic so I will desist at this point. If I manage to achieve a decent recall rate, say well over 90%, I will probably write a paper or a book on the topic. We’ll see. In science it is best to keep enthusiasm high and expectations low.

Added: Reading through this, it occurs to me that PSG is probably not a good way to describe it. It is more of a hybrid Constraint Grammar (without any feature structures like LFG, for example, although I have considered such an implementation) and traditional Context-free Grammar, still without the terminals/non-terminals.

How is it implemented?
It is written in perl and uses Tk for display purposes. It is in perl because I know it well, and it runs the same on many platforms. The same holds true for Tk. I did have it running in a web browser for a while but it became too tedious to manage and the rendering is not as easy to deal with as it is in a traditional windowing system. It will eventually run in some sort of client/server environment once I decide what the overall strategy will be. I have been toying with the idea of rewriting it in C++ but that is merely nostalgia talking since I don’t get to do much complied code programming these days. Here is a picture of the output from a single sentence from GMark:

and here is another:

Most of the verbiage is for programming purposes. The NNODE field has the (more or less) standard POS abbreviation and the index of the word. The parse should be recognizable as the parse field from Tauber’s GNT. I/G/S is an internal Index, Group (derived from a number of clues, including punctuation), and Strong’s word number. A newer version also has Case and whether or not the noun phrase (NP) is arthrous or not. HRT is Head Relation Type, or how the head word and the present word relates to each other. This one is sometimes hard (impossible?) to determine, sometimes trivial. Once the recall is good, one obvious avenue of research is to introduce semantics to disambiguate the relations currently elusive.

What’s the next step?
The next step is to take the www.opentext.org data, convert them to a format more compatible with the data that I generate, and compare my results to theirs. Once I get results somewhere in the 90% (which I may already be getting, there is no way to know just yet) I will have the luxury to relax a bit and refine the algorithm, increase the grammar precision, write a morphology parser so the syntax parser can be applied to all Greek works, especially the patristics, add statistical techniques to structural analyses, and so on. Many, many things.

Another possible step is also very exciting although not much has happened so far, which is largely my fault. Matthew Brook O’Donnell (and Catherine Smith, our silent, non-interactive conversation participant) and I have been discussing closer collaboration with the opentext.org group, which I would find of great utility. But between Matt being extremely busy (he has been very busy, check out his new and exciting endeavor: http://www.liabg.org/) and me being extremely busy, we have not really gotten anything done along those lines. As a matter of fact, I have yet to talk to any of the other opentext.org members about this so if any of them happen to read this they are probably quite surprised at this point. :)

Eventually I will make the code available for all to use for further study. But for now it is not ready. I have defined my own language designed for DG and PSG as described above. This language gets parsed using a traditional recursive descent parser and produces a perl module based on that. That perl module is then applied to the text. So it is a parser that writes a parser that parses the Greek. :) It could be simplified quite a bit but one thing at a time.

Anyways, that’s it in a brief overview. I will attempt to update this page more often than I do my blog, which is not saying much, but I really will try.

I was reading Marcus Aurelius’ Meditations, with the English text handy since the Greek is too advanced for my feeble skills to read without some help. I loved this bit and will close the present entry with this excellent Stoic thought.
κατὰ φύσιν γάρ· οὐδὲν δὲ κακὸν κατὰ φύσιν.
For it is according to nature, and nothing is evil that is according to nature.