The State of the Koine Parser as of October 2008
Saturday, October 25th, 2008
Imagine my surprise when I learned that there are actually people out there who read this blog. It was quite a shocker, I can assure you. Granted, my readership probably doesn’t extend into double digits, but still, I never thought that anyone would pay attention. This whole thing was more just for me to babble into the void. As it turns out, I have gotten a number of emails inquiring into the state and methodology of the syntactic parser. So, in this unexpected revival of my blog, I will describe what’s going on and how the thing works. I will also touch on the Analytical Bible project.
The parser currently works, and rather well, although that is merely my opinion, since I have not yet computed precision and recall, mostly for lack of an accessible corpus. To the best of my knowledge, there is no Koine Greek corpus. The helpful folks over at www.opentext.org were kind enough to let me have some of their data. Once I get that reformatted into something more compatible with my output, I will be able to establish the appropriate metrics. I do need to point out that I have not worked on this project for the better part of a year, since around that time I got a new job that put some heavy demands on my time and sanity. Things are calmer now and, seeing that there is interest in my work, I may just go back and finish this.
Although I looked at many different parser schemes, I couldn’t find one that would do the job to the level of completeness that I required. I looked at LFG (Lexical Functional Grammar), for example, and, while its notational scheme is well suited to Koine, it has rather steep requirements in terms of lexical and morphological information. While I could potentially satisfy those requirements, it doesn’t seem to offer a parsing methodology that lends itself well to free word order languages, but that could be due to a lack of knowledge on my part. SFG (Systemic Functional Grammar) has also been suggested, but it suffers from similar roadblocks.
In the end I constructed a hybrid DG (Dependency Grammar) and pattern matcher scheme that looks mostly like a Frankenstein’s Monster of algorithms. I describe the grammatical rules using a custom language I made up. That description is then translated into a Perl script, which is executed against some text and produces a complete annotated parse tree. The process executes each grammar rule in turn against the text, thus iteratively creating the parse tree. The first pass may only attach the definite article to an accompanying noun, for example. The second pass may deal with compound nouns, and so on. Some passes are just pattern matches, meaning certain word associations are recognized and turned into a partial hierarchy. It is basically a CFG consisting entirely of terminals. Other passes are full DG passes. These are very powerful but didn’t work by themselves, because a single pass doesn’t always make good decisions in a free word order language.
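To make the multi-pass idea a little more concrete, here is a minimal sketch in Python rather than the Perl that the real parser generates. Everything in it is an assumption made for illustration: the Token fields, the ART/NOUN/VERB tags, the two passes, and the sample clause are all hypothetical, and none of it reflects the actual rule language or its output. It only shows the general shape: a cheap pattern pass attaches articles first, and a dependency-style pass then attaches the remaining nouns to the verb regardless of where the verb stands.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    form: str
    pos: str                      # hypothetical tagset: ART, NOUN, VERB
    case: Optional[str] = None    # "nom", "acc", ... (None for verbs)
    head: Optional[int] = None    # index of the governing token
    label: Optional[str] = None   # relation to the head


Pass = Callable[[List[Token]], None]


def attach_articles(tokens: List[Token]) -> None:
    """Pattern pass: link each article to the next noun that agrees in case."""
    for i, tok in enumerate(tokens):
        if tok.pos != "ART" or tok.head is not None:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j].pos == "NOUN" and tokens[j].case == tok.case:
                tok.head, tok.label = j, "det"
                break


def attach_arguments(tokens: List[Token]) -> None:
    """Dependency pass: link unattached nouns to the verb, wherever it stands."""
    verbs = [i for i, t in enumerate(tokens) if t.pos == "VERB"]
    if len(verbs) != 1:
        return  # this sketch only handles single-verb clauses
    roles = {"nom": "subj", "acc": "obj"}  # assumed, highly simplified mapping
    for tok in tokens:
        if tok.pos == "NOUN" and tok.head is None and tok.case in roles:
            tok.head, tok.label = verbs[0], roles[tok.case]


def run_passes(tokens: List[Token], passes: List[Pass]) -> List[Token]:
    """Apply each pass in order; later passes see earlier attachments."""
    for grammar_pass in passes:
        grammar_pass(tokens)
    return tokens


def example_clause() -> List[Token]:
    # Verb-first clause, hand-tagged for this example only.
    return [
        Token("ἠγάπησεν", "VERB"),
        Token("ὁ", "ART", "nom"),
        Token("θεὸς", "NOUN", "nom"),
        Token("τὸν", "ART", "acc"),
        Token("κόσμον", "NOUN", "acc"),
    ]


if __name__ == "__main__":
    parsed = run_passes(example_clause(), [attach_articles, attach_arguments])
    for i, t in enumerate(parsed):
        print(i, t.form, t.head, t.label)
```

In this toy clause each article ends up under its noun and both nouns end up under the verb, which stays unattached as the root. The order of the passes matters: a later pass only has to consider what the earlier ones left unattached, which is one plausible reason a sequence of small passes can do better than a single pass over free word order text.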
The program is not yet complete and I am loath to hand out any source until it is at a satisfactory level. Once done, however, I will gladly share with all and sundry. The idea is to also develop a POS tagger so that other works can be subjected to syntactic analysis. With these data in hand it would be possible to do serious stylometric analysis, which is why I started this whole thing in the first place.
The statistical analysis of biblical Greek literature is the point of the Analytical Bible. While most of that is easy and already implemented, it is not very powerful without the syntactic tree structure, which is far and away the best indicator of style.
I would be happy to discuss any of these topics with any interested party. I am planning to attend SBL in Boston next month if anyone wants to meet up.
Julian