Well, due to some recent discussions regarding the Pericope de Adultera (specifically John 8:1-11) I decided to use the power of computers to put this issue to bed. Well, we’ll see. Dr. J. Gibson was kind enough to furnish me with some information regarding earlier analyses of the issue and, at first glance, the conclusions seemed sound. The passage seemed remarkably Lukan and this coincided with my own hasty statistical observations. But, I thought, let’s take a closer look. Let’s do a real statistical analysis. Right. Let’s do it. Right now. Tum-tum-tum… Hmmm… What exactly does that mean? Isn’t that always the case: you start something and realize that you need to establish an effective methodology before you can even tackle the problem.
Let me warn you in advance: I don’t have an answer yet, and I am going to Europe for two weeks, which will severely slow me down on this issue. But I did want to record my observations as they stand at this moment.
At first I read some research on PA (Pericope de Adultera) stylistic issues, which is a fancy way of saying word and, to a limited extent, phrase usage. The best I read was “A Possible Case of Lukan Authorship” by Henry J. Cadbury, which appeared in HTR 10.3 (Jul. 1917), pp. 237-244; I don’t have my SBL style book handy for the citation, so deal. He looked at particular words, compound word usage and some idiomatic phrases, and eventually concluded that, although the language seems decidedly Lukan, we just don’t know, especially in light of the hard MSS contra-evidence. A good article, but sort of like talking a nice-looking woman into coming home with you only to find out she is a 79-year-old eunuch in drag with a large collection of whips. Surely this can be solved with some decent statistics.
The first thing you come to terms with, after you get over the shock of realizing that the solution isn’t trivial, is that you need a good way to quantify your NT text. Then you need a good way to correlate it to other texts. Then you need to find ways to consistently measure textual variance, which is a bit tricky because this isn’t a stochastic issue. Humans, don’t you just hate them? Writers, in particular? I mean, write consistently already!
Here comes the technical math jargon. Bear in mind that I am not a mathematician so I may screw up some terminology.
The text is broken down into a frequency table and treated as bivariate data with regard to its comparison target.
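To make that concrete, here is a minimal sketch of such a frequency table. The tokenization, the punctuation stripping and the 4-letter cutoff (which I get to below) are my own illustrative choices, not necessarily what my program does:

```python
from collections import Counter

def frequency_table(text, min_len=4):
    """Count how often each word occurs, dropping short words (< min_len letters)."""
    words = [w.lower().strip(".,;:!?") for w in text.split()]
    return Counter(w for w in words if len(w) >= min_len)

freqs = frequency_table("In the beginning was the Word, and the Word was with God")
print(freqs["word"])  # 2
```

Two such tables, one for the source section and one for the target, are what get treated as the bivariate data.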
The data is then analyzed for general statistical style parameters for future use. This involves word and grammar frequencies, mean, standard deviation and variance. Then you compare the text to itself and establish correlation coefficients, similarity coefficients, regression slopes and so forth. This is the point when you realize that all writers should be shot. Or, in other words, a document is not overly textually consistent. Anyways, moving right along…

By comparing the document to itself you can establish parameters regarding confidence intervals and expected values. You don’t, of course, compare the entire gospel to itself in one big chunk, or any values you got would be meaningless since GJohn == GJohn. I pick a section of a size that has been found to provide good parameters and then compare that section to the rest of the gospel, divided into similarly sized sections, excluding itself. You then move to the next section and repeat. This ensures that every section of the gospel has been compared to every other section without comparing any portion to itself. You now have a complete statistical snapshot of the gospel. Then you do the same thing, except this time you compare each section to a section in another gospel to see if there is enough variance to determine a style differential.

And let me tell you, it’s subtle. Real subtle. It turns out that word usage is a crappy parameter. You can throw out words that are shorter than, say, 4 letters and that helps a lot, but even so… Anyone who argues authorship based on word usage alone should have his head examined or learn some statistics. A much better parameter is grammar. What was fuzzy before becomes much more crisp. Not to the point of obviousness, but the values start to separate. An even better approach is a two-word section segmentation, or two grammar sections, done in the same way. And even more powerful is a three-entity analysis.
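The section-against-section procedure can be sketched in a few lines. The section size and the Jaccard stand-in metric here are assumptions on my part; the point is just the pairing logic, which guarantees every section meets every other section exactly once and never meets itself:

```python
from itertools import combinations

def sections(words, size):
    """Chop a word list into consecutive chunks of (roughly) equal size."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def jaccard(a, b):
    """Shared vocabulary over total vocabulary; a stand-in for any coefficient."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def self_comparison(words, size=200, metric=jaccard):
    """Score every pair of distinct sections; no section is compared to itself."""
    return [metric(s, t) for s, t in combinations(sections(words, size), 2)]
```

Cross-gospel comparison is the same loop, except each section of one text is paired with the sections of the other instead.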
The problem is that your hit count drops to the point where your sample size becomes statistically suspect. I plan to use Student’s t-test to help this along, but that is too feeble to help much. The nice thing about two-word combinations is that they tend to pick up adjective/noun order in addition to pleonasms. Three-word combinations will pick up many idiomatic phrases. These features probably identify a writer far more surely than his choice of words.
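The two- and three-word combinations are just consecutive n-grams, which a quick sketch makes plain (the sample phrase is mine):

```python
def ngrams(words, n):
    """All consecutive n-word runs: bigrams catch word order, trigrams catch idioms."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "he answered and said unto them".split()
print(ngrams(words, 2)[0])    # ('he', 'answered')
print(len(ngrams(words, 3)))  # 4
```

Note how the counts shrink as n grows: a six-word verse yields five bigrams but only four trigrams, which is exactly the sample-size problem described above.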
The solution is to combine the various techniques briefly described above. Run a statistical analysis on the results of the statistical analysis. While it will not be conclusive, it will be significant. Well, that’s my prediction, at least.
In the meantime, I have found that certain techniques work better than others. The values below are my eyeball estimates, as I haven’t done any programming to list min/max output. Here is a geeky summary, so far:
- Word usage overlap: Runs anywhere from 30% to 60% but is quite erratic and unreliable. And I see this used as an argument all the time. *eyeroll*
- a, b, c, d: Similarity parameters for many formulae. a counts features present in both texts, b those only in the first, c those only in the second and d those in neither. This last obviously never happens (the universe here is just the union of the two vocabularies) and thus simplifies many calculations. All of these run pretty close, but there is a subtle difference which might become significant if some general observation on this parameter can be established. It generally involves properly thresholding the variance of the Jaccard similarity coefficient, probably converting it to a z-score and adding some fuzzy logic for thresholding.
- Source and target mean, variance and standard deviation are all pretty stable, which gives me some hope that authorship is detectable, at least in the sense that one can determine a moderately convincing probability. Not that these are useful for comparison on their own, they only show consistency in style, but if the style is tight enough one would think that divergence could be established.
- Correlation coefficient. Well, this isn’t working at all. No real surprise there since the bivariate variables are not really dependent on each other. The Pearson product moment correlation coefficient yields a big fat zero every time. This has the unfortunate effect described in the next point.
- Regression analysis is entirely impossible. Scatter charts are a joke with this data and least-square analysis comes up with nonsense. The slope and intercept are almost random.
- The Jaccard similarity coefficient has some potential but is too simple in this context. Half the time John agrees with Luke more than he agrees with himself. I hold out no hope there. Luckily, we have lots of similarity coefficient calculations available and some seem to yield consistent results. Read on…
- Jaccard distance is as useless as its opposite number above.
- The Dice similarity coefficient could have been designed by Andrew Dice Clay for all its efficiency. No dice, as they say…
- The Rogers-Tanimoto similarity coefficient is not blowing my skirt up, either. Not that I am wearing a skirt. At the moment.
- The Kulczynski similarity coefficient was hard to spell, easy to implement and yields mediocre results, certainly nothing pursuable.
- The Hamann similarity coefficient was a surprise. Despite the deceptively simple formula it is yielding very consistent results. It pretty much always detects proper authorship where the previous methods failed.
- Pearson’s chi-square is another good one. Not completely consistent but fairly reliable, nonetheless.
- The Normalized Pearson’s chi-square is, obviously, much like the above except better.
- z-score correlation has been kind to me. Converting to z-scores started yielding some consistent results when I was shopping for a decent correlation coefficient. This is another good one.
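For the curious, here is how I understand the two winners above on presence/absence data, with d = 0 as noted earlier. The formulas are the standard textbook ones; the implementations are my sketches, not what my program literally runs:

```python
def abcd(vocab1, vocab2):
    """a = features in both texts, b = only in the first, c = only in the second.
    d (in neither) is zero when the universe is the union of the two vocabularies."""
    a = len(vocab1 & vocab2)
    b = len(vocab1 - vocab2)
    c = len(vocab2 - vocab1)
    return a, b, c

def hamann(vocab1, vocab2):
    """Hamann coefficient: (matches - mismatches) / total, ranging from -1 to 1.
    With d = 0 this reduces to (a - b - c) / (a + b + c)."""
    a, b, c = abcd(vocab1, vocab2)
    return (a - b - c) / (a + b + c)

def zscores(values):
    """Convert raw frequencies to standard scores: (x - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```

Once two frequency profiles are z-scored, an ordinary correlation over the standardized values is what I am calling z-score correlation above.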
So, the trick is to combine 1, 2 and 3 word analyses as well as 1, 2 and 3 word grammar combinations. On top of this we add the statistical techniques that yield consistent results when operating on texts where we know the authorship, at least in a statistically significant sense. Combining the analysis results using more statistics should eventually yield a convincing percentage.
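The combination step I have in mind is nothing fancier than a weighted average over whichever techniques prove consistent. The weights here are placeholders; finding good ones is exactly the open problem:

```python
def combined_score(scores, weights=None):
    """Weighted average of per-technique similarity scores."""
    if weights is None:
        weights = [1.0] * len(scores)  # equal weighting by default
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(round(combined_score([0.8, 0.6, 0.7]), 3))  # 0.7
```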
But much of this must wait until I return. I am off to see my kids for the first time in three years. Priorities and all that… Type to you soon.
P.S. All these formulae can be found on the web. I have found Wikipedia to be especially helpful on this topic.
P.P.S. Once again this message is far too short to get into any detail. I could go on and on about this but, as usual, ask questions here and you shall receive answers. I will check this blog for comments while in Europe.
P.P.P.S. All this statistical analysis stuff is part of the search/compare portion of my analytical Bible program, which will soon (yeah, right) be free and online at www.textcrit.com, so all this will be available to you without having to actually learn how the math works.