Archive for the 'Technology and the NT' Category

The State of the Koine Parser as of October 2008

Saturday, October 25th, 2008

Imagine my surprise when I learned that there are actually people out there who read this blog. It was quite a shocker, I can assure you. Granted, my readership probably doesn’t extend into double digits, but still, I never thought that anyone would pay attention. This whole thing was more just for me to babble into the void. As it turns out, I have gotten a number of emails inquiring into the state and methodology of the syntactic parser. So, in this unexpected revival of my blog, I will describe what’s going on and how the thing works. I will also touch on the Analytical Bible project.

The parser currently works, and rather well, although that is merely my opinion since I have not yet computed precision and recall, mostly due to the lack of an accessible corpus. To the best of my knowledge, there is no Koine Greek corpus. The helpful folks over at www.opentext.org were kind enough to let me have some of their data. Once I get that reformatted into something more compatible with my output, I will be able to establish the appropriate metrics. I do need to point out that I have not worked on this project for the better part of a year, since at that time I got a new job that put some heavy demands on my time and sanity. Things are calmer now and, seeing that there is interest in my work, I may just go back and finish this.
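
For anyone unfamiliar with the two metrics, here is a minimal sketch of how precision and recall could be computed once a reference corpus is available. It assumes that both the parser output and the reference annotation can be reduced to sets of (dependent, head) word-index pairs; that representation, the function name and the sample edges are all made up for illustration and are not the parser's actual output format.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (dependent_index, head_index) edges."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)          # edges found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical edges for a four-word clause, written as (dependent, head).
gold_edges = {(0, 1), (1, 2), (3, 2)}
# A partial parse that got two attachments right and left one word unattached.
parser_edges = {(0, 1), (1, 2)}

precision, recall = precision_recall(parser_edges, gold_edges)
print(precision, recall)
```

In this toy case every edge the parser proposed is correct, so precision is 1.0, but only two of the three reference edges were found, so recall is two thirds.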

Although I looked at many different parser schemes, I couldn’t find one that would do the job to the level of completeness that I required. I looked at LFG, for example, and, while its notational scheme is well suited to Koine, it has rather steep requirements in terms of lexical and morphological information. While I could potentially satisfy those requirements, it doesn’t seem to offer a parsing methodology that lends itself well to free word order languages, but that could be due to a lack of knowledge on my part. SFG has also been suggested but it suffers from similar roadblocks. In the end I constructed a hybrid DG and pattern matcher scheme that looks mostly like a Frankenstein’s Monster of algorithms. I describe the grammatical rules using a custom language I made up. This, in turn, gets converted into a Perl script that is executed against some text to produce a complete annotated parse tree. The process executes each grammar rule in turn against the text, thus iteratively creating the parse tree. The first pass may only attach the definite article to an accompanying noun, for example. The second pass may deal with compound nouns and so on. Some passes are just pattern matches, meaning certain word associations are recognized and turned into a partial hierarchy. It is basically a CFG consisting entirely of terminals. Other passes are full DG (Dependency Grammar) passes. These are very powerful but didn’t work by themselves because a single pass doesn’t always make good decisions in a free word order language.
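
To make the multi-pass idea concrete, here is a rough sketch in Python rather than Perl. It is not the actual parser or its rule language: the `Node` class, the three pass functions, the tag names and the transliterated sample clause are all invented for illustration, and it assumes that part-of-speech and case information is already attached to every word. Each pass takes the current list of top-level nodes and returns a shorter list with some words attached as dependents.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    form: str                 # surface form (transliterated here)
    pos: str                  # coarse part-of-speech tag, assumed given
    case: str = ""            # grammatical case, if any
    children: list = field(default_factory=list)


def attach_articles(nodes):
    """Pattern pass: an article followed by a noun in the same case hangs off that noun."""
    out, i = [], 0
    while i < len(nodes):
        cur = nodes[i]
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        if cur.pos == "ART" and nxt is not None and nxt.pos == "NOUN" and cur.case == nxt.case:
            nxt.children.append(cur)   # the noun itself is handled on the next iteration
        else:
            out.append(cur)
        i += 1
    return out


def attach_genitives(nodes):
    """Pattern pass: a genitive noun directly after another noun modifies it."""
    out = []
    for node in nodes:
        if out and node.pos == "NOUN" and node.case == "gen" and out[-1].pos == "NOUN":
            out[-1].children.append(node)
        else:
            out.append(node)
    return out


def attach_to_verb(nodes):
    """Dependency pass: with exactly one verb left, everything else depends on it."""
    verbs = [n for n in nodes if n.pos == "VERB"]
    if len(verbs) != 1:
        return nodes               # ambiguous, so leave the decision to a later pass
    verb = verbs[0]
    verb.children.extend(n for n in nodes if n is not verb)
    return [verb]


PASSES = [attach_articles, attach_genitives, attach_to_verb]


def parse(nodes):
    for rule in PASSES:            # each rule runs once, in order, over the whole clause
        nodes = rule(nodes)
    return nodes


def show(node):
    """Render a node as a bracketed tree, dependents in attachment order."""
    if not node.children:
        return node.form
    return f"{node.form}({' '.join(show(c) for c in node.children)})"


clause = [
    Node("ho", "ART", "nom"), Node("logos", "NOUN", "nom"),
    Node("tou", "ART", "gen"), Node("theou", "NOUN", "gen"),
    Node("menei", "VERB"),
]
tree = parse(clause)
print(show(tree[0]))  # menei(logos(ho theou(tou)))
```

The point of the ordering is that the cheap, reliable pattern passes shrink the clause first, so that the dependency pass only has to choose among a handful of already assembled phrases rather than among every word.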

The program is not yet complete and I am loath to hand out any source until it is at a satisfactory level. Once done, however, I will gladly share with all and sundry. The idea is to also develop a POS tagger so that other works can be subjected to syntactic analysis. With these data in hand it would be possible to do serious stylometric analysis, which is why I started this whole thing in the first place.

The statistical analysis of biblical Greek literature is the point of the Analytical Bible. While most of that is easy and already implemented, it is not very powerful without the syntactic tree structure, far and away the best indicator of style.

I would be happy to discuss any of these topics with any interested party. I am planning to attend SBL in Boston next month if anyone wants to meet up.

Julian

New Page on the Koine Greek Parser

Thursday, July 19th, 2007

I have made a new page on the parser topic but until I figure out how to add the ability to add comments over there, you can add them here should you feel such an urge. Here is a link in case you didn’t see the big one to the right of your screen: State of the Parser

Julian

How different is different?

Wednesday, October 25th, 2006

Looking at some studies of biblical manuscripts using statistical techniques (for example Wieland Willker and Stephen Carlson, whose paper I can never find but luckily I have a print-out. Where is that paper, Stephen?), I was struck by the fact that quantification of variance seems to be ultimately boolean in nature. In other words, there is either a difference or there isn’t. What would allow for greater statistical significance would be a method to measure the amount of difference. After all, there is a large difference between a simple misspelling and the omission or inclusion of an entire verse. As luck would have it, there is a very simple algorithm that can measure the amount of difference between two sequences, in this case words in a manuscript. It is called the Edit Distance in general. In practice there are a few implementations that differ slightly. We shall cover these as well. Most of the examples I link to compare one word against another, letter by letter, but all you have to do is supply a sentence instead, with each ‘letter’ being a word in that sentence.

The basic algorithm is referred to as the Levenshtein Distance. It measures how many operations are necessary to transform one sequence into another. The more operations, the more the two sequences vary. In other words, we now have a metric for how much the two sentences vary. The operations allowed by Levenshtein are insertion, deletion, and substitution. There is an excellent and easy-to-follow explanation here. There is an especially impressive Perl implementation of Levenshtein here.
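
As a minimal sketch of the idea (in Python, and not the Perl implementation linked above), the function below computes the Levenshtein distance between any two sequences. Passing it strings compares letters; passing it lists of words compares words. The two sample readings are invented for illustration and are not taken from actual manuscripts.

```python
def levenshtein(a, b):
    """Number of insertions, deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]                            # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                 # delete x
                cur[j - 1] + 1,              # insert y
                prev[j - 1] + (x != y),      # substitute, free if the items match
            ))
        prev = cur
    return prev[-1]


# Letter-level comparison of two words.
print(levenshtein("kitten", "sitting"))      # 3

# Word-level comparison of two made-up readings: one word has been dropped.
reading_a = "en arche en ho logos".split()
reading_b = "en arche en logos".split()
print(levenshtein(reading_a, reading_b))     # 1
```

Keeping only two rows of the table at a time is just a memory saving; the result is the same as with the full matrix shown in most tutorials.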

One common phenomenon in NT MSS is the simple transposition of two words. This phenomenon receives too much weight in the preceding algorithm, which has to count a swap of two adjacent words as two operations. Fortunately, someone thought of this and so we have the Damerau-Levenshtein Distance. It detects the three operations of the previous algorithm but also takes transposition into account. This algorithm has problems in some applications but works well for simple comparison like the comparison of two sentences.
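
Here is a small sketch of the transposition-aware version. Note that it implements the simpler, restricted variant usually called optimal string alignment, in which a swapped pair cannot be edited again afterwards; the unrestricted Damerau-Levenshtein distance needs a little more bookkeeping. The function name and the sample readings are my own choices for illustration.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                          # delete everything in a[:i]
    for j in range(cols):
        d[0][j] = j                          # insert everything in b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,             # deletion
                d[i][j - 1] + 1,             # insertion
                d[i - 1][j - 1] + cost,      # substitution or match
            )
            # The last two items of each prefix are the same pair in swapped order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Two illustrative readings that differ only in word order.
reading_a = "en Christo Iesou".split()
reading_b = "en Iesou Christo".split()
print(osa_distance(reading_a, reading_b))    # 1
```

Plain Levenshtein would score this pair as 2, the same as two unrelated substitutions, whereas here the swap counts as a single operation.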

Another algorithm is the Needleman-Wunsch algorithm, which is usually applied to genetic sequences using a similarity matrix. In the case of comparing words that either match or don’t, we can discard the similarity matrix and it generally becomes identical to the Levenshtein algorithm mentioned above. However, if one is so inclined (and I am pretty sure this hasn’t been done by anyone), it is a simple matter to generate a similarity matrix based on the edit distance between individual words and then apply the algorithm normally. This would give a slight penalty to words that were incorrectly spelled, but not one as severe as for more obvious mismatches. This has some potential and I might play with this idea once time allows, which currently seems like never.
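
A rough sketch of that idea follows. Instead of precomputing a full similarity matrix, it scores each pair of words on the fly from their letter-level edit distance. The scoring formula (1.0 for identical words, sliding toward -1.0 as the spellings diverge), the gap penalty of -1.0 and the sample readings are all arbitrary assumptions made for illustration, not tuned or tested values.

```python
def levenshtein(a, b):
    """Letter-level edit distance, used here only to score pairs of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def word_similarity(w1, w2):
    """1.0 for identical words, dropping toward -1.0 as the spellings diverge."""
    if w1 == w2:
        return 1.0
    return 1.0 - 2.0 * levenshtein(w1, w2) / max(len(w1), len(w2))


def needleman_wunsch(a, b, gap=-1.0):
    """Best global alignment score of two word sequences (higher means more alike)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + word_similarity(a[i - 1], b[j - 1]),  # align two words
                score[i - 1][j] + gap,                                      # word missing in b
                score[i][j - 1] + gap,                                      # word missing in a
            )
    return score[-1][-1]


base = "en arche en ho logos".split()
misspelled = "en arche en ho lugos".split()   # made-up spelling slip
different = "en arche en ho theos".split()    # made-up substitution of another word

same_score = needleman_wunsch(base, base)
slip_score = needleman_wunsch(base, misspelled)
swap_score = needleman_wunsch(base, different)
print(same_score, slip_score, swap_score)
```

With these assumed weights the spelling slip scores only slightly below a perfect match, while the substituted word is penalised more heavily, which is the graded behaviour described above.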

Other potential algorithms include the Smith-Waterman algorithm, BLAST and probably others.

The point of this overview is simply to illustrate the availability of simple and easy to understand algorithms that allow us to quantify the magnitude of MS variations. The link I gave above explains the algorithm in such easy terms, along with illustrations, that anyone can understand it.

All these algorithms are also described on Wikipedia, as usual an excellent source for such matters. A few minutes of reading and a quick copy-and-paste from some website, and you will have the perfect tool to add some meaningful weight to your analyses.

A long time since I last posted an entry here, but I am still hoping to step it up. Don’t forget to check out the main www.textcrit.com website for ways you can help the project.

See you all at the SBL conference in Washington, DC.
Julian