Archive for the 'Textual Criticism' Category

Codex Sinaiticus Conference

Saturday, October 25th, 2008

Dr Juan Garcés, the Curator of the Codex Sinaiticus Project, asked me to post the following:

Codex Sinaiticus Conference

British Library, London, 6-7 July 2009

The Codex Sinaiticus Project, an international initiative to reunite the entire manuscript in digital form and make it accessible to a global audience for the first time (see http://www.codexsinaiticus.org/en/), will host a conference devoted to this seminal fourth-century Bible.

Leading experts have been invited to present papers on the history, codicology, and text of Codex Sinaiticus, among other topics. A call for papers, registration information, and programme will be made available soon.

It’s a good effort and I wish that more such undertakings would be pursued.

Julian

How different is different?

Wednesday, October 25th, 2006

Looking at some studies of biblical manuscripts that use statistical techniques (for example those by Wieland Willker and Stephen Carlson, whose paper I can never find, though luckily I have a print-out. Where is that paper, Stephen?), I was struck by the fact that the quantification of variance seems to be ultimately boolean in nature. In other words, there is either a difference or there isn’t. What would allow for greater statistical significance is a method to measure the amount of difference. After all, there is a large difference between a simple misspelling and the omission or inclusion of an entire verse. As luck would have it, there is a very simple algorithm that can measure the amount of difference between two sequences, in this case words in a manuscript. In general it is called the Edit Distance. In practice there are a few implementations that differ slightly, and we shall cover these as well. Most of the examples I link to compare one word to another, letter by letter, but all you have to do is supply a sentence, with each ‘letter’ being a word in that sentence.

The basic algorithm is referred to as the Levenshtein Distance. It measures how many operations are necessary to transform one sequence into another. The more operations, the more the two sequences vary. In other words, we now have a metric for how much the two sentences vary. The operations allowed by Levenshtein are insertion, deletion, and substitution. There is an excellent and easy-to-follow explanation here. There is an especially impressive Perl implementation of Levenshtein here.
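As a minimal sketch of the idea, assuming Python and a made-up function name `levenshtein` (neither is taken from the pages linked above), the distance can be computed over any pair of sequences. A sentence split into words works just as well as a word split into letters:

```python
def levenshtein(a, b):
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the items of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Letter by letter, as in most tutorials.
print(levenshtein("kitten", "sitting"))  # 3

# Word by word: one word dropped and one word replaced.
reading_1 = "in the beginning was the word".split()
reading_2 = "in beginning was the logos".split()
print(levenshtein(reading_1, reading_2))  # 2
```

The two sample readings are invented for illustration and are not taken from any manuscript.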

One common phenomenon in NT MSS is the simple transposition of two words. This phenomenon receives too much weight in the preceding algorithm, which counts a swap of two adjacent words as two operations. Fortunately, someone thought of this and so we have the Damerau-Levenshtein Distance. It detects the three operations of the previous algorithm but also takes transposition into account. This algorithm has problems in some applications but works well for simple comparison like the comparison of two sentences.
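Here is a hedged sketch of the transposition-aware distance, again in Python. It implements the restricted variant usually called optimal string alignment, which is the simplest form of Damerau-Levenshtein and an assumption on my part about which form is meant. The name `osa_distance` is made up for this example.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition.

    This is the restricted (optimal string alignment) variant: an item that
    has been transposed is not edited again afterwards.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Two adjacent items swapped: count the swap as a single operation.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Invented example: the first two words change places.
print(osa_distance("jesus said to them".split(),
                   "said jesus to them".split()))  # 1
```

Plain Levenshtein would report 2 for the same pair, since it has to treat the swap as two substitutions.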

Another algorithm is the Needleman-Wunsch algorithm which is usually applied to genetic sequences using a similarity matrix. In the case of comparing words that either match or don’t we can discard the similarity matrix and it generally becomes identical to the Levenshtein algorithm mentioned above. However, if one is so inclined (and I am pretty sure this hasn’t been done by anyone) it is a simple matter to generate a similarity matrix based on the edit distance between individual words and then apply the algorithm normally. This would give a slight penalty to words that were incorrectly spelled but not as severe as more obvious mismatches. This has some potential and I might play with this idea once time allows, which currently seems like never.
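The idea of feeding a word-level similarity into Needleman-Wunsch could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the function names, the gap penalty of -1, and the choice to map a normalized letter-level edit distance onto a score between -1 (nothing in common) and +1 (identical). The similarity is computed on the fly rather than stored in a matrix, which amounts to the same thing.

```python
def char_distance(a, b):
    """Plain Levenshtein distance between two words, letter by letter."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def word_similarity(w1, w2):
    """1.0 for identical words, falling toward 0.0 as the spellings diverge."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0
    return 1.0 - char_distance(w1, w2) / longest


def needleman_wunsch_score(seq_a, seq_b, gap=-1.0):
    """Best global alignment score of two word sequences (higher means more alike)."""
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, word_a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, word_b in enumerate(seq_b, start=1):
            # Rescale similarity from [0, 1] to a score in [-1, +1].
            pair_score = 2.0 * word_similarity(word_a, word_b) - 1.0
            current.append(max(previous[j - 1] + pair_score,  # align the two words
                               previous[j] + gap,             # word_a has no partner
                               current[j - 1] + gap))         # word_b has no partner
        previous = current
    return previous[-1]


base = "in the beginning".split()
same = needleman_wunsch_score(base, "in the beginning".split())
typo = needleman_wunsch_score(base, "in the begining".split())
other = needleman_wunsch_score(base, "in the end".split())
print(same, typo, other)
```

With these invented readings the misspelling scores only slightly below the identical text, while the genuinely different word scores much lower, which is the graded penalty described above.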

Other potential algorithms include the Smith-Waterman algorithm, BLAST and probably others.

The point of this overview is simply to illustrate the availability of simple, easy-to-understand algorithms that allow us to quantify the magnitude of MS variations. The link I gave above explains the algorithm in such easy terms, along with illustrations, that anyone can understand it.

All these algorithms are also described on Wikipedia, as usual an excellent source for such matters. A few minutes of reading and a quick copy-and-paste from some website, and you will have the perfect tool to add some meaningful weight to your analyses.

A long time since I last posted an entry here, but I am still hoping to step it up. Don’t forget to check out the main www.textcrit.com website for ways you can help the project.

See you all at the SBL conference in Washington, DC.
Julian

Basic statistics and the text of the NT

Monday, August 7th, 2006

Well, as I am delving deeper and deeper into the world of text analysis things are getting more and more complicated. So I thought that I would write a series of blog entries describing the basics of statistics and how they apply to text. Eventually, we would have to look at some very advanced techniques used in Computational Linguistics after we have worked our way through the basics. There will be math involved but I will explain exactly how it works as we go along in terms that should be clear to non-mathematicians, like myself. I will start by looking at various means, variance and standard deviation calculations and what they mean. Eventually we will work our way up to latent semantic analysis, singular value decomposition, principal component analysis, N-gram taggers, partial parsers and so on but that’s for much later.

Basic statistical functions tell us some basic facts about the text we are working with, but these techniques are insufficient to provide a good separation of textual styles. That is why we will need to rely on more advanced techniques from computational linguistics, which is a very fast-moving field.

Well, let’s get to it.

First we need to distinguish between a population and a sample. In the case of NT texts we have the luxury of dealing with both. In real life, statisticians frequently only have samples and must estimate the population from those samples. We will need to do so as well when we are trying to determine authorship for fragments of text. However, we have the entire gospel of xxx, the entire epistle of yyy and so on which allows us to establish descriptive parameters for the entire text, i.e. the population.

The first function we will look at is how to calculate the mean of a set, set here meaning a collection of numbers, either a sample or the entire text (the population). Mean simply means the average, nothing more. There are several types of means, with the arithmetic mean being the one that most people understand to be implied when we simply say mean. We also have the geometric mean and the harmonic mean. The latter two are not particularly useful in our case so we will not address them here. The mean, when described by a mathematician, looks like this:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

What that says is simply this: Add up all the numbers of the set and divide by the number of entries. So if you had the numbers 10, 20, 30 you would add them all, getting 60, and then divide by 3, yielding a result of 20. So the mean of 10, 20, 30 = 20.

Math note: Multiplying by 1 over n is the same as dividing by n. The value 1/n is the reciprocal of n in math talk or sometimes the multiplicative inverse for extra nerd points.
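In code the same recipe is a one-liner. This is just a small Python sketch, and the function name `arithmetic_mean` is made up for the example:

```python
def arithmetic_mean(values):
    """Add up all the numbers of the set and divide by the number of entries."""
    return sum(values) / len(values)


print(arithmetic_mean([10, 20, 30]))  # 20.0
```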

Okay, for our next trick we will look at variance and standard deviation. First, variance. Variance is a measure of the amount of variation in a dataset. It is used in a number of calculations and is simply the square of the standard deviation, which is why they are explained here in the same section. In mathematical terms, variance looks like this

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

for a population variance and like this

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

when working with a sample. It can be shown mathematically that the second one tends to represent the population better than the first one when dealing with samples. A discussion of that would be far too lengthy (and can be found on Wikipedia under variance and standard deviation). Before I explain how they work and why we care, I will show the final formula for this, which is the Standard Deviation, a term which I am sure most have heard and probably understand to some extent. Here it is:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

There. Now we have some building blocks. As can be seen above, the standard deviation is simply the square root of the variance. You calculate it by first calculating your mean value as described earlier. Then you go through each datum in your set, subtract the mean (this centers the set around the mean) and square the result. You add all these numbers together (which is what is meant by the upper case sigma in the formula) and divide by the number of values in your set, subtracting one from that number before dividing. This is the variance. Take the square root of that and you have the standard deviation. So, what does all this really mean?
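The recipe just described translates almost word for word into Python. This is a minimal sketch with made-up function names. The `sample` flag is my own addition so that the same code covers both the sample form (divide by n - 1) and the population form (divide by n).

```python
import math


def variance(values, sample=True):
    """Average squared distance from the mean.

    sample=True divides by n - 1 (sample variance),
    sample=False divides by n (population variance).
    """
    mean = sum(values) / len(values)
    squared_distances = sum((x - mean) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return squared_distances / divisor


def standard_deviation(values, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [10, 20, 30]
print(variance(data))            # 100.0
print(standard_deviation(data))  # 10.0
```

For the numbers 10, 20, 30 the squared distances from the mean are 100, 0 and 100, so the sample variance is 200 / 2 = 100 and the standard deviation is 10. Python's built-in `statistics` module offers `variance`, `pvariance`, `stdev` and `pstdev` if you would rather not write these yourself.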

The standard deviation tells us how spread out the data are from our mean. Remember the well-known Pythagoras formula that we all had to learn for calculating the hypotenuse? All that formula does is calculate the distance. If you have ever worked construction, or do a lot of home improvement, you probably know the 3-4-5 rule used to make sure that something is square. Say you are building a fence that needs to have 90 degree corners. Measure 3 feet along one side, measure 4 feet along the other and measure between those two points and the length should be 5 feet, because the square root of (3*3 + 4*4) is 5. So, if you add up the square of a bunch of distance measurements (remember, when we subtract our mean from each value we essentially get the ‘distance’ of that point from the average) and then take the square root you get the distance in however many dimensions you added up. Don’t worry about that last ‘dimension’ bit just yet. In our case, we added up our squares and then divided by the number of values in our set (minus one but that is not so important here) getting the average squared value of our set. The subsequent square root gives us the actual (non-squared) distance. So, what we end up with is the standard deviation which is, loosely speaking, another way of saying the average distance of our data from the average.

So now we have a way to get our average and determine how spread out our data is. Now, how do we apply that to text? Well, first you quantify your text in some manner (which will be our next blog entry, namely, how to quantify text) and then you calculate the mean and standard deviation. If you are doing a word count, for example, the mean will tell you the average number of times a word is being used and the standard deviation tells you how much these word counts vary. Obviously, numbers like these are far too feeble to tell us much about the text. They are, however, important building blocks that we will need for further investigation.

For example, for 15 random 1000 word blocks each from Luke and John, we can see that Luke uses, on average (the mean of 15 different 1000 word blocks in this case), 306 different words, with 289 being the minimum and 330 the maximum. John, on the other hand, uses 216 different words on average, with min and max being 182 and 255, respectively. So clearly, Luke uses a far greater vocabulary. We can compare these figures because we are using the same sample size. When dealing with the entire texts, however, we run into problems because a longer text will naturally use a larger vocabulary. The caveat here is that the word usage does not increase linearly, meaning that if you double the length of the text it does not follow that the number of distinct words doubles as well. In fact, it never does. In some cases we will need to look at a text fragment of a particular size, and if we are to check the distinct word usage against the biblical writers we will need to estimate what number of words would be typical for a fragment of that size for any given author. I have pre-calculated the word usage for the biblical books at regular sampling size intervals, at a tight enough spacing that we will be able to linearly interpolate the correct word usage and its standard deviation to establish probability.
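To make the two steps concrete, here is a rough Python sketch of counting distinct words per block and of the linear interpolation between pre-calculated sample sizes. The function names are invented, the blocks are taken consecutively rather than at random for simplicity, and the numbers in the little table are placeholders, not my actual pre-calculated values.

```python
def distinct_words_per_block(words, block_size):
    """Number of different words in each consecutive block of block_size words."""
    starts = range(0, len(words) - block_size + 1, block_size)
    return [len(set(words[s:s + block_size])) for s in starts]


def interpolate(size, table):
    """Linearly interpolate the expected distinct word count for a fragment size.

    table is a list of (sample_size, distinct_words) pairs sorted by sample_size.
    """
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= size <= x1:
            return y0 + (y1 - y0) * (size - x0) / (x1 - x0)
    raise ValueError("size lies outside the pre-calculated range")


# Toy text: two blocks of four words each.
print(distinct_words_per_block("a b a c b b d d".split(), 4))  # [3, 2]

# Placeholder table: (fragment size, distinct words) for some imaginary author.
placeholder_table = [(500, 200), (1000, 300)]
print(interpolate(750, placeholder_table))  # 250.0
```

The same interpolation can be applied to the standard deviation column of such a table.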

That’s probably enough for now. I shall continue soon with a look at text representation and quantification. The interval between blog entries should decrease now that I am back from Europe, although I do have to go for a week and wear medieval clothes at Pennsic. :)

Julian

Pericope de Adultera: Statistical analysis of the NT, preliminary

Wednesday, July 12th, 2006

Well, due to some recent discussions regarding the Pericope de Adultera (specifically John 8:1-11) I decided to use the power of computers to put this issue to bed. Well, we’ll see. Dr. J. Gibson was kind enough to furnish me with some information regarding earlier analyses of the issue and, at first glance, the conclusions seemed sound. The passage seemed remarkably Lukan and this coincided with my own hasty statistical observations. But, I thought, let’s take a closer look. Let’s do a real statistical analysis. Right. Let’s do it. Right now. Tum-tum-tum… Hmmm… What exactly does that mean? Isn’t that always the case: you start something and realize that you need to establish an effective methodology before you can even tackle the problem.

Let me warn you in advance: I don’t have an answer yet and I am going to Europe for two weeks which will severely slow me down on this issue. But I did want to record my observations as they stand at this moment.

At first I read some research on PA (Pericope de Adultera) stylistic issues which is a fancy way of saying word and, to a limited extent, phrase usage. The best I read was A Possible Case of Lukan Authorship by Henry J. Cadbury which appeared in HTR 10.3 (Jul., 1917) pp. 237-244, I don’t have my SBL style book handy for the citation, so deal. ;) He looked at particular words, compound word usage and some idiomatic phrases and eventually concluded that, although the language seems decidedly Lukan, we just don’t know, especially in light of the hard MSS contra-evidence. A good article but sort of like talking a nice looking woman into coming home with you only to find out she is a 79-year old eunuch in drag with a large collection of whips. Surely, this can be solved with some decent statistics.

The first thing you come to terms with, after you get over the shock of realizing that the solution isn’t trivial, is that you need to find a good way to quantify your NT text. Then you need a good way to correlate it to other texts. Then you need to find ways to consistently determine textual variance which is a bit tricky because this isn’t a stochastic issue. Humans, don’t you just hate them? Writers, in particular? I mean, write consistently already!

Here comes the technical math jargon. Bear in mind that I am not a mathematician so I may screw up some terminology.

The text is broken down into a frequency analysis table and treated as bivariate data with regards to its comparison target.

The data is then analyzed for general statistical style parameters for future use. This involves word and grammar frequencies, mean, standard deviation and variance. Then you compare the text to itself and establish correlation coefficients, similarity coefficients, regression slopes and so forth. This is the point when you realize that all writers should be shot. Or, in other words, a document is not overly textually consistent. Anyways, moving right along… By comparing the document to itself you can establish parameters regarding confidence intervals and expected values. You don’t, of course, compare the entire gospel to itself in one big chunk or any values you got would be meaningless since GJohn == GJohn. I pick a section of a size that has been found to provide good parameters and then compare that section to the rest of the gospel, divided into similar sized sections, excluding itself. You then move to the next section and repeat. This ensures that every section of the gospel has been compared to every other section without comparing any portion to itself. You now have a complete statistical snapshot of the gospel.

Then you do the same thing except this time you compare each section to a section in another gospel to see if there is enough variance to determine style differential. And let me tell you, it’s subtle. Real subtle. It turns out that word usage is a crappy parameter. You can throw out words that are shorter than, say, 4 letters and that helps a lot, but even so… Anyone who argues authorship based on word usage alone should have his head examined or learn some statistics. A much better parameter is grammar. What was fuzzy before becomes much more crisp. Not to the point of obviousness but the values start to separate. An even better approach is a two word section segmentation, or two grammar sections, done in the same way. And even more powerful is a three entity analysis.

The problem is that your hit count drops to the point where your sample size becomes statistically suspect. I plan to use Student’s t-test to help this along but that is too feeble to help much. The nice thing about two word combinations is that they tend to pick up adjective/noun order in addition to pleonasms. Three word combinations will pick up many idiomatic phrases. These examples probably identify a writer far more surely than his choice of words.
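The sectioning scheme and the one, two and three entity counts could be sketched along these lines. This is only an illustration under stated assumptions, not the program itself. The function names are invented, the tokens can be words or grammar tags, and the `overlap` measure used for the comparison is a simple stand-in for whichever statistic you actually want to test.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Frequency table of every run of n consecutive tokens (words or grammar tags)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sections(tokens, size):
    """Cut a text into consecutive sections of equal size, dropping any remainder."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def overlap(table_1, table_2):
    """Stand-in comparison: shared entries divided by all entries seen in either table."""
    shared = set(table_1) & set(table_2)
    combined = set(table_1) | set(table_2)
    return len(shared) / len(combined)


def compare_within(tokens, size, n, compare=overlap):
    """Compare every section of a text to every other section, never to itself."""
    tables = [ngram_counts(s, n) for s in sections(tokens, size)]
    return [compare(tables[i], tables[j])
            for i in range(len(tables))
            for j in range(len(tables)) if i != j]


def compare_across(tokens_1, tokens_2, size, n, compare=overlap):
    """Compare every section of one text to every section of another text."""
    tables_1 = [ngram_counts(s, n) for s in sections(tokens_1, size)]
    tables_2 = [ngram_counts(s, n) for s in sections(tokens_2, size)]
    return [compare(t1, t2) for t1 in tables_1 for t2 in tables_2]


# Toy token stream with three sections of four tokens, compared on two-token runs.
toy = "a b c d a b c d a b x y".split()
print(compare_within(toy, size=4, n=2))
```

The spread of the values from `compare_within` gives the expected range for one author, against which the values from `compare_across` can then be judged.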

The solution is to combine the various techniques briefly described above. Run a statistical analysis on the results of the statistical analysis. While it will not be conclusive, it will be significant. Well, that’s my prediction, at least.

In the meantime, I have found that certain techniques work better than others. The values below are my eyeball estimates as I haven’t done any programming to list any min max output. Here is a geeky summary, so far:

  • Word usage overlap: Runs anywhere from 30% to 60% but is quite erratic and unreliable. And I see this used as an argument all the time. *eyeroll*

  • a,b,c,d: Similarity parameters for many formulae. a is in both texts, b is only in the first, c is only in the second and d is in neither. This last obviously never happens and thus simplifies many calculations. All of these run pretty close but there is a subtle difference which might become significant if some general observation on this parameter can be established. It generally involves properly thresholding the variance of the Jaccard similarity coefficient. Probably converting it to a z-score and adding some fuzzy logic for thresholding.

  • Source and target mean, variance and standard deviation are all pretty stable which gives me some hope that authorship is detectable, at least in a sense where one can determine a probability that is moderately convincing. Not that they are useful for comparison, they only show consistency in style, but if the style is tight enough one would think that divergence could be established.

  • Correlation coefficient. Well, this isn’t working at all. No real surprise there since the bivariate variables are not really dependent on each other. The Pearson product moment correlation coefficient yields a big fat zero every time. This has the unfortunate effect described in the next point.

  • Regression analysis is entirely impossible. Scatter charts are a joke with this data and least-square analysis comes up with nonsense. The slope and intercept are almost random. :)

  • The Jaccard similarity coefficient has some potential but is too simple in this context. Half the time John agrees with Luke more than John. I hold no hopes there. Luckily we have lots of similarity coefficient calculations available and some seem to yield some consistent results. Read on…

  • Jaccard distance is as useless as its opposite number above.

  • The Dice similarity coefficient could have been designed by Andrew Dice Clay for all its efficiency. No dice, as they say…

  • The Rogers-Tanimoto similarity coefficient is not blowing my skirt up, either. Not that I am wearing a skirt. At the moment. ;)

  • The Kulczynski similarity coefficient was hard to spell, easy to implement and yields mediocre results, certainly nothing pursuable.

  • The Hamann similarity coefficient was a surprise. Despite the deceptively simple formula it is yielding very consistent results. It pretty much always detects proper authorship where the previous methods failed.

  • Pearson’s chi-square is another good one. Not completely consistent but fairly reliable, nonetheless.
  • The Normalized Pearson’s chi-square is, obviously, much like the above except better. :)

  • z-score correlation has been kind to me. Converting to z-scores started yielding some consistent results when shopping for a decent correlation coefficient. This is another good one.
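For reference, the a, b, c, d counts and the set-based coefficients from the list above can be sketched in a few lines of Python. The formulas are the standard textbook definitions. The function names are invented, and I am assuming the second Kulczynski coefficient, since there is more than one formula by that name.

```python
def abcd(first, second):
    """Counts of words in both texts (a), only the first (b), only the second (c).

    d, the count of words in neither text, is fixed at 0 because the
    vocabulary is built from the two texts themselves.
    """
    first, second = set(first), set(second)
    return len(first & second), len(first - second), len(second - first), 0


def jaccard(a, b, c, d):
    return a / (a + b + c)


def jaccard_distance(a, b, c, d):
    return 1.0 - jaccard(a, b, c, d)


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def rogers_tanimoto(a, b, c, d):
    return (a + d) / (a + d + 2 * (b + c))


def kulczynski(a, b, c, d):
    # Second Kulczynski coefficient; fails if either text is empty.
    return 0.5 * (a / (a + b) + a / (a + c))


def hamann(a, b, c, d):
    return ((a + d) - (b + c)) / (a + b + c + d)


# Invented sample: five distinct words against four.
counts = abcd("the word was with god".split(), "the word was god".split())
print(counts)            # (4, 1, 0, 0)
print(jaccard(*counts))  # 0.8
print(hamann(*counts))   # 0.6
```

With d fixed at 0 the Hamann coefficient reduces to (a - b - c) / (a + b + c). These functions work on plain presence or absence of a word, so frequencies would need a different treatment.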

So, the trick is to combine 1, 2 and 3 word analyses as well as 1, 2 and 3 word grammar combinations. On top of this we add the statistical techniques that yield consistent results when operating on texts where we know the authorship, at least in a statistically significant sense. Combining the analysis results using more statistics should eventually yield a convincing percentage.

But much of this must wait until I return. I am off to see my kids for the first time in three years. Priorities and all that… Type to you soon.

Julian

P.S. All these formulae can be found on the web. I have found Wikipedia to be especially helpful on this topic.
P.P.S. Once again this message is far too short to get into any detail. I could go on and on about this but, as usual, ask questions here and you shall receive answers. I will check this blog for comments while in Europe.
P.P.P.S. All this statistical analysis stuff is part of the search/compare portion of my analytical bible program which will soon (yeah, right) be free and online at www.textcrit.com so all this will be available to you without having to actually learn how the math works. :)