The State of the Koine Parser as of October 2008

Imagine my surprise when I learned that there are actually people out there who read this blog. It was quite a shocker, I can assure you. Granted, my readership probably doesn’t extend into double digits, but still, I never thought that anyone would pay attention. This whole thing was more just for me to babble into the void. As it turns out, I have gotten a number of emails inquiring into the state and methodology of the syntactic parser. So, in this unexpected revival of my blog, I will describe what’s going on and how the thing works. I will also touch on the Analytical Bible project.

The parser currently works, and rather well, although that is merely my opinion since I have not yet computed precision and recall, mostly due to the lack of an accessible corpus. To the best of my knowledge, there is no syntactically annotated Koine Greek corpus freely available. The helpful folks over at www.opentext.org were kind enough to let me have some of their data. Once I get that reformatted into something more compatible with my output, I will be able to establish the appropriate metrics. I do need to point out that I have not worked on this project for the better part of a year, since around that time I got a new job that put some heavy demands on my time and sanity. Things are calmer now and, seeing that there is interest in my work, I may just go back and finish this.
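As a rough illustration of what those metrics would measure, here is a minimal Python sketch. It assumes each parse can be reduced to a set of (dependent, head) word-index pairs; the function name and the toy data are invented for the example and are not taken from the parser or from the OpenText data.

```python
def precision_recall(predicted, gold):
    """Compare predicted attachments against gold-standard attachments.

    Both arguments are collections of (dependent_index, head_index) pairs.
    Precision: share of predicted attachments that are correct.
    Recall: share of gold attachments that the parser found.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical four-word clause: the gold analysis has three attachments,
# while the parser output is only a partial tree with two of them.
gold = {(0, 1), (1, 2), (3, 2)}
predicted = {(0, 1), (1, 2)}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every attachment the parser made is correct, but one gold attachment is missing, so the two numbers differ. That gap is exactly what cannot be judged by eye without an annotated corpus.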

Although I looked at many different parser schemes, I couldn’t find one that would do the job to the level of completeness that I required. I looked at LFG (Lexical Functional Grammar), for example, and, while its notational scheme is well suited to Koine, it has rather steep requirements in terms of lexical and morphological information. While I could potentially satisfy those requirements, it doesn’t seem to offer a parsing methodology that lends itself well to free word order languages, but that could be due to a lack of knowledge on my part. SFG (Systemic Functional Grammar) has also been suggested, but it suffers from similar roadblocks.

In the end I constructed a hybrid DG (Dependency Grammar) and pattern matcher scheme that looks mostly like a Frankenstein’s Monster of algorithms. I describe the grammatical rules using a custom language I made up. This rule file is then translated into a Perl script, which is executed against some text and produces a complete annotated parse tree. The process executes each grammar rule in turn against the text, thus iteratively building up the parse tree. The first pass may only attach the definite article to an accompanying noun, for example. The second pass may deal with compound nouns, and so on. Some passes are just pattern matches, meaning certain word associations are recognized and turned into a partial hierarchy; such a pass is basically a CFG consisting entirely of terminals. Other passes are full DG passes. These are very powerful but didn’t work by themselves, because a single pass doesn’t always make good decisions in a free word order language.
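To make the multi-pass idea concrete, here is a minimal sketch in Python. It is not the actual parser (which is generated Perl driven by a custom rule language), and everything in it is an assumption made for illustration: the `Token` fields, the tag names, the two rules, and the crude subject/object labelling. It only shows the shape of the process, in which each pass adds a few attachments and leaves the rest for later passes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    form: str
    pos: str                      # part-of-speech tag, assumed to be given
    case: str = ""                # grammatical case, where applicable
    head: Optional[int] = None    # index of the governing word; None = unattached
    relation: str = ""


Rule = Callable[[List[Token]], int]   # a pass returns how many attachments it made


def attach_articles(tokens: List[Token]) -> int:
    """Pattern pass: attach each article to the next noun in the same case."""
    made = 0
    for i, tok in enumerate(tokens):
        if tok.pos == "ART" and tok.head is None:
            for j in range(i + 1, len(tokens)):
                if tokens[j].pos == "NOUN" and tokens[j].case == tok.case:
                    tok.head, tok.relation = j, "det"
                    made += 1
                    break
    return made


def attach_nouns_to_verb(tokens: List[Token]) -> int:
    """Dependency pass: attach unattached nouns to the clause's single verb."""
    verbs = [i for i, t in enumerate(tokens) if t.pos == "VERB"]
    if len(verbs) != 1:           # too ambiguous for this simple rule; skip
        return 0
    made = 0
    for tok in tokens:
        if tok.pos == "NOUN" and tok.head is None:
            tok.head = verbs[0]
            # Deliberately crude labelling, for illustration only.
            tok.relation = "subj" if tok.case == "nom" else "obj"
            made += 1
    return made


def parse(tokens: List[Token], rules: List[Rule]) -> List[Token]:
    """Run every rule in order; each pass builds on earlier attachments."""
    for rule in rules:
        rule(tokens)
    return tokens


tokens = [
    Token("ὁ", "ART", "nom"),
    Token("θεὸς", "NOUN", "nom"),
    Token("ἀγαπᾷ", "VERB"),
    Token("τὸν", "ART", "acc"),
    Token("κόσμον", "NOUN", "acc"),
]
parsed = parse(tokens, [attach_articles, attach_nouns_to_verb])
for index, tok in enumerate(parsed):
    print(index, tok.form, tok.head, tok.relation)
```

The word that is still unattached at the end (here the verb) is the root of the tree. Because the article pass matches on case rather than on adjacency alone, it tolerates some intervening words, which hints at why a cheap pattern pass run first can simplify the decisions left to a later dependency pass.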

The program is not yet complete and I am loath to hand out any source until it is at a satisfactory level. Once done, however, I will gladly share with all and sundry. The idea is to also develop a POS tagger so that other works can be subjected to syntactic analysis. With these data in hand it would be possible to do serious stylometric analysis, which is why I started this whole thing in the first place.

The statistical analysis of biblical Greek literature is the point of the Analytical Bible. While most of that is easy and already implemented, it is not very powerful without the syntactic tree structure, far and away the best indicator of style.

I would be happy to discuss any of these topics with any interested party. I am planning to attend SBL in Boston next month if anyone wants to meet up.

Julian


5 Comments on “The State of the Koine Parser as of October 2008”

  1. Mike Says:

    I don’t think you looked closely enough at LFG. Much of the framework’s foundational work is based on non-configurational languages, such as the Australian language Warlpiri. It is actually highly suited for free word order languages (a good amount of LFG work has been done on Modern Greek), though I would argue that Greek is not as free as some would think. I’m presently writing an LFG grammar for Koine morphology & phrase structure using SIL software.

  2. Julian Says:

    I remember reading a very long paper on LFG by Bresnan (?) and thinking that its notational scheme was very useful for Greek but that it didn’t offer any significant advantage over a traditional tree structure. The main problem I have encountered is determining the local phrase head as I iterate over the sentence. Maybe you could recommend some reading on that topic. I am familiar with how LFG ensures grammatical matches through its nested feature structures, but how does it resolve ambiguity? Is there a probabilistic approach? And if so, how is this established without a corpus?

    I also agree that Greek is not all that free; there are certainly many constructs that are formulaic. I relied heavily on Wallace’s Greek Grammar Beyond the Basics for comprehensive overviews of constructs. If I had a corpus, I could have compiled a better overview.

    Julian

  3. theswain Says:

    Just a note to say it’s good to have you back blogging. Not nearly enough blogs on text crit about.

  4. Mike Says:

    The fact that LFG requires a lot of morpho-syntactic & semantic information might be a burden. From what I’ve read about computational linguistics and LFG, the majority has used a probabilistic approach, though I am not a computational linguist, so I can’t say more. You could search through the presentation titles and abstracts at their conference website. A large number of papers are freely available there.

    http://www.essex.ac.uk/linguistics/lfg/FAQ/conferences.html

  5. Stephen C. Carlson Says:

    Great to hear that you’ll be at SBL in Boston. Since we tend to inhabit the same sessions, I’m sure we’ll get a chance to catch up.
