How different is different?

Looking at some studies of biblical manuscripts that use statistical techniques (for example those of Wieland Wilker and Stephen Carlson, whose paper I can never find, though luckily I have a print-out. Where is that paper, Stephen?), I was struck by the fact that the quantification of variance seems to be ultimately boolean in nature. In other words, there either is a difference or there isn’t. What would allow for more meaningful statistics would be a method to measure the amount of difference. After all, there is a large difference between a simple misspelling and the omission or inclusion of an entire verse. As luck would have it, there is a very simple algorithm that can measure the amount of difference between two sequences, in this case the words in a manuscript. In general it is called the Edit Distance. In practice there are a few variants that differ slightly, and we shall cover these as well. Most of the examples I link to compare two words letter by letter, but all you have to do is supply two sentences instead, with each ‘letter’ being a word in the sentence.

The basic algorithm is referred to as the Levenshtein Distance. It measures the minimum number of operations necessary to transform one sequence into another. The more operations, the more the two sequences vary. In other words, we now have a metric for how much two sentences vary. The operations allowed by Levenshtein are insertion, deletion, and substitution. There is an excellent and easy-to-follow explanation here. There is an especially impressive Perl implementation of Levenshtein here.
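As a rough illustration of the idea (a minimal sketch in Python, not the Perl implementation mentioned above), the function below computes the Levenshtein distance between any two sequences. The function name and the sample readings are invented for the example. Passing lists of words instead of strings gives the word-level comparison described earlier.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    # previous[j] holds the distance between the items of a processed
    # so far and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning i items into nothing takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Letter-by-letter comparison of two words.
print(levenshtein("kitten", "sitting"))

# Word-by-word comparison of two invented readings: each "letter" is a word.
reading_a = "and the word was with god".split()
reading_b = "and the word was god".split()
print(levenshtein(reading_a, reading_b))
```

Only two rows of the table are kept in memory at a time, which is enough when all that is wanted is the distance and not the list of individual operations.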

One common phenomenon in NT MSS is the simple transposition of two words. This phenomenon receives too much weight in the preceding algorithm, which has to count a swap of two adjacent words as two operations. Fortunately, someone thought of this and so we have the Damerau-Levenshtein Distance. It detects the three operations of the previous algorithm but also takes transposition into account. This algorithm has problems in some applications but works well for simple comparisons like the comparison of two sentences.
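A minimal sketch of how transposition can be added, again in Python. Note that this is the restricted form usually called optimal string alignment, which is simpler than the full Damerau-Levenshtein algorithm: it counts a swap of two adjacent items as one operation, but it never edits the same items twice. The function name and the sample readings are invented for the example.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent items (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete the first i items of a
    for j in range(cols):
        d[0][j] = j  # insert the first j items of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # The last two items of each prefix are the same pair, swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Two invented readings that differ only by a swap of adjacent words.
reading_a = "the word was god".split()
reading_b = "the word god was".split()
print(osa_distance(reading_a, reading_b))
```

For the two readings above, plain Levenshtein needs two substitutions, while this version needs a single transposition.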

Another algorithm is the Needleman-Wunsch algorithm, which is usually applied to genetic sequences using a similarity matrix. In the case of comparing words that either match or don’t, we can discard the similarity matrix, and the algorithm becomes essentially equivalent to the Levenshtein algorithm mentioned above. However, if one is so inclined (and I am pretty sure this hasn’t been done by anyone), it is a simple matter to generate a similarity matrix based on the edit distance between individual words and then apply the algorithm normally. This would give a slight penalty to words that were incorrectly spelled, but not one as severe as the penalty for more obvious mismatches. This has some potential and I might play with this idea once time allows, which currently seems like never.
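One way this might look, as a sketch under several assumptions: the alignment is written as a cost to minimise and not as a Needleman-Wunsch similarity score to maximise, the cost of substituting one word for another is the letter-level edit distance divided by the length of the longer word, and dropping or adding a whole word costs 1. All names, the scaling and the gap cost are choices made for this example, and the sample readings are invented.

```python
def levenshtein(a, b):
    """Plain edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def word_cost(w1, w2):
    """Graded substitution cost: 0.0 for identical words, 1.0 for
    words that share nothing, something in between for near misses."""
    if not w1 and not w2:
        return 0.0
    return levenshtein(w1, w2) / max(len(w1), len(w2))


def weighted_distance(a, b, gap=1.0):
    """Global alignment cost between two lists of words, using
    word_cost in place of a fixed similarity matrix."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, word_a in enumerate(a, start=1):
        current = [i * gap]
        for j, word_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + gap,                         # word missing in b
                current[j - 1] + gap,                      # word added in b
                previous[j - 1] + word_cost(word_a, word_b),
            ))
        previous = current
    return previous[-1]


base = "the word was god".split()
misspelled = "the wrod was god".split()
replaced = "the lamb was god".split()
print(weighted_distance(base, misspelled))
print(weighted_distance(base, replaced))
```

With these choices a misspelled word is penalised less than a completely different word, which in turn costs the same as an unrelated substitution in the plain word-level Levenshtein distance.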

Other potential algorithms include the Smith-Waterman algorithm, BLAST and probably others.

The point of this overview is simply to illustrate the availability of simple, easy-to-understand algorithms that allow us to quantify the magnitude of MS variations. The link I gave above explains the algorithm in such easy terms, along with illustrations, that anyone can understand it.

All these algorithms are also described on Wikipedia, as usual an excellent source for such matters. A few minutes of reading and a quick copy-and-paste from some website, and you will have the perfect tool to add some meaningful weight to your analyses.

It has been a long time since I last posted an entry here, but I am still hoping to step it up. Don’t forget to check out the main www.textcrit.com website for ways you can help the project.

See you all at the SBL conference in Washington, DC.
Julian

One Comment on “How different is different?”

  1. Melissa Says:

    Thank you very much! I was quite confused about the difference between the Levenshtein algorithm and the Needleman-Wunsch algorithm. There apparently is nothing different, even though Needleman and Wunsch are much more credited in academia than Levenshtein, for no reason at all…
