The Danger of Translation

Posted October 8th, 2007 by Julian Jensen
Categories: General Religion

I am hoping that a blog I just ran into can be explained away by the use of bad translation software. The result is rather humorous, however. My favorite is where John 1:1 is rendered as Toilet 1:1. Check it out: Cafe Apocalypsis

New Page on the Koine Greek Parser

Posted July 19th, 2007 by Julian Jensen
Categories: Programming, Technology and the NT

I have made a new page on the parser topic but until I figure out how to add the ability to add comments over there, you can add them here should you feel such an urge. Here is a link in case you didn’t see the big one to the right of your screen: State of the Parser

Julian

SBL and methodology…not!

Posted January 8th, 2007 by Julian Jensen
Categories: General Religion

After a long delay, I am now back to writing some more entries. For starters I have some observations on the SBL conference.

First off, I was fairly appalled at the near-total lack of knowledge of even the most basic steps of scientific methodology. While there is no question that most of these PhDs are very knowledgeable and well-read in their fields, it became apparent very quickly that their many years of training and study included not even five minutes about the methods involved in formulating a scientifically sound theory. While I am probably unfairly generalizing here, it seemed true in many cases in the lectures I attended. Let’s take some examples:

Waiting for the Kingdom by Giovanni Bazzana. Now, this is obviously a smart and dedicated man. Unfortunately, his theory was far less smart. Actually, to be accurate, there is no way to know if his theory was good, bad, or indifferent. Why? Because he completely neglected one of the most significant pieces of a theory, falsification. I tried pressing him on the issue, asking for some hard numbers. Basically, he has two examples of papyri (PKoeln 7,313, 13-22; and 1 Mac 13, 36-39) where he finds some similarity in the wording between the release of debts and the Q version of the Lord’s prayer. A few natural questions should have occurred to him at this point; they certainly occurred to me right away. Two examples out of how many? He wasn’t quite sure, tens of thousands… Are there any other decrees on papyrus where similar word patterns can be found that match formulaic expressions in the NT? He didn’t know, hadn’t checked. At this point I stopped asking as his eyes were glazing over and he didn’t seem to have the faintest clue as to what I was asking him. I was asking for quantification. He said near the end of his presentation, “I think this is significant,” referring here to his findings. Really? You think it’s significant? Why do I care what you find significant? I don’t! What you think is irrelevant; it is your evidence that needs to convince me. Unfortunately, you showed up without any theory and without any evidence other than two highly dubious textual parallels. If you cannot show some hard numbers, if you cannot provide grounds for falsification, then you have nothing. Well, I guess you do have some interesting musings that might serve as inspiration to others, which is nice but hardly scientific.

Let’s look at another example, this one by Larry Hurtado, which was very disappointing since Hurtado is obviously very smart and has done some truly excellent work over the years. What is even more disappointing is that his work on Mark and Codex W was a solid piece of good scientific work, so he clearly knows how to do these things. I won’t say too much on this topic because Bart Ehrman really laid into Hurtado so effectively that very little remains to be said. Ehrman’s attack was effective and accurate but even he did not truly understand the exact nature of the problem with Hurtado’s work; all he knew was that there was something wrong with it. I asked Hurtado about his conclusions regarding scrolls and codices and their respective roles, trying to find out how strong his case was. Again, I was trying to get hard numbers. How many scrolls and codices were we talking about, how many had markings to facilitate reading and to what extent? Again, basic information without which it is impossible to arrive at any kind of conclusion. Hurtado seemed to not understand what I was asking. I was beginning to think that I had forgotten how to speak English but subsequent tests proved that I was indeed capable of forming coherent sentences. Ah, well…

Final example, Holger Strutwolf, who was talking about text types and the need for a new classification scheme. Now, he may be entirely accurate, again there is no way to know, at least not given the data he presented. He tried to show that the percentage of variation did not group the exemplars into reasonably clear and delineated text type groups. I am curious to know how he measures variations. How does he weigh them? Does he even weigh them? What words are ignored? Why? You can’t just go in and take two texts and count the differences, word by word. At least, not if you expect to get any kind of useful answer. I think he asks an important question but I am far from convinced that he has the knowledge and skills necessary to make this case, one way or another. Check out my previous blog entry on how to weigh differences, How different is different?

Enough complaining for now. There were a lot of bright spots as well. It was great to meet Stephen Carlson (again), Jeffrey Gibson, Ken Olson, Catherine Smith and many others. There were some great lectures. I especially enjoyed Thomas J. Kraus speaking about Miniature Books, Codices, or Formats? Categories, Contexts, and Conclusions, Peter M. Head on Named Letter Carriers in the Documentary Papyri, Gilbert van Belle on The Use of the Pronomen Abundans in the Fourth Gospel and Kevin Wilkinson on “Hermeneiai” in Manuscripts of John’s Gospel and the Art of Bibliomancy, just to mention a few.

Anyways, that is it for now. I have been working on a syntactical parser for Koine Greek. It is coming along really, really well and I will be writing about that next time. The problem is that every time I have an opportunity to sit down in front of the computer, I want to work on my parser rather than writing a blog entry.

TTFN,
Julian

How different is different?

Posted October 25th, 2006 by Julian Jensen
Categories: Programming, Technology and the NT, Textual Criticism

Looking at some studies of biblical manuscripts using statistical techniques (for example Wieland Willker and Stephen Carlson, whose paper I can never find but luckily I have a print-out. Where is that paper, Stephen?), I was struck by the fact that quantification of variance seems to be ultimately boolean in nature. In other words, there is either a difference or there isn’t. What would allow for greater statistical significance would be a method to measure the amount of difference. After all, there is a large difference between a simple misspelling and the omission or inclusion of an entire verse. As luck would have it, there is a very simple algorithm that can measure the amount of difference between two sequences, in this case words in a manuscript. It is called the Edit Distance in general. In practice there are a few implementations that differ slightly. We shall cover these as well. Most of the examples I link to compare two words letter by letter, but all you would have to do is supply a sentence with each ‘letter’ being a word in that sentence.

The basic algorithm is referred to as the Levenshtein Distance. It measures how many operations are necessary to transform one sequence into another. The more operations, the more the two sequences vary. In other words, we now have a metric for how much the two sentences vary. The operations allowed by Levenshtein are insertion, deletion, and substitution. There is an excellent and easy-to-follow explanation here. There is an especially impressive Perl implementation of Levenshtein here.
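As a rough illustration of the word-for-letter idea, here is a minimal Perl sketch of the Levenshtein distance that works on lists of words instead of strings of letters. It is not the implementation linked above; the function name `levenshtein` and the two English sample readings are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance between two sequences passed as array references.
# Each element counts as one "letter"; here the elements are whole words.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. scalar(@$t));        # cost of building a prefix of $t from nothing
    for my $i (1 .. scalar(@$s)) {
        my @curr = ($i);                  # cost of deleting the first $i elements of $s
        for my $j (1 .. scalar(@$t)) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution (free when the words match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Two invented readings: one substituted word and one added word.
my @reading_a = split ' ', 'and he said to them';
my @reading_b = split ' ', 'and he said unto them all';
print 'word-level distance: ', levenshtein(\@reading_a, \@reading_b), "\n";
```

Passing `[split //, $word]` instead of a word list gives the usual letter-by-letter distance, so the same routine covers both cases.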

One common phenomenon in NT MSS is the simple transposition of two words. This phenomenon receives too much weight in the preceding algorithm, which counts it as two separate operations. Fortunately, someone thought of this and so we have the Damerau-Levenshtein Distance. It detects the three operations of the previous algorithm but also takes transposition into account. This algorithm has problems in some applications but works well for simple comparison like the comparison of two sentences.
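A sketch of how the transposition step can be added, again over word lists. This is the restricted variant usually called "optimal string alignment", which is the simple form of Damerau-Levenshtein: it counts a swap of two adjacent words as one operation but never edits the same stretch twice. The name `osa_distance` is an assumption for the example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Optimal string alignment distance over array references:
# insertion, deletion, substitution, plus swapping two adjacent elements.
sub osa_distance {
    my ($s, $t) = @_;
    my ($m, $n) = (scalar(@$s), scalar(@$t));
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,            # deletion
                $d[$i][$j - 1] + 1,            # insertion
                $d[$i - 1][$j - 1] + $cost,    # substitution or match
            );
            # Adjacent transposition: the last two elements are swapped.
            if (   $i > 1 && $j > 1
                && $s->[$i - 1] eq $t->[$j - 2]
                && $s->[$i - 2] eq $t->[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

# A swapped word pair counts once here, where plain Levenshtein counts two.
print osa_distance([qw(jesus christ)], [qw(christ jesus)]), "\n";
```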

Another algorithm is the Needleman-Wunsch algorithm which is usually applied to genetic sequences using a similarity matrix. In the case of comparing words that either match or don’t we can discard the similarity matrix and it generally becomes identical to the Levenshtein algorithm mentioned above. However, if one is so inclined (and I am pretty sure this hasn’t been done by anyone) it is a simple matter to generate a similarity matrix based on the edit distance between individual words and then apply the algorithm normally. This would give a slight penalty to words that were incorrectly spelled but not as severe as more obvious mismatches. This has some potential and I might play with this idea once time allows, which currently seems like never.
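One way the similarity-matrix idea might look in practice, sketched as a weighted edit distance rather than a full Needleman-Wunsch scorer: the cost of substituting one word for another is the letter-level edit distance between them, scaled to the range 0 to 1, while insertions and deletions still cost 1. The scaling rule and the names `word_cost` and `weighted_distance` are assumptions chosen for illustration, not an established scheme.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Plain unit-cost Levenshtein distance over array references.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. scalar(@$t));
    for my $i (1 .. scalar(@$s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@$t)) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            push @curr, min($prev[$j] + 1, $curr[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Substitution cost for two words: letter-level distance divided by the
# longer word's length, so a near-miss spelling costs less than 1.
sub word_cost {
    my ($x, $y) = @_;
    return 0 if $x eq $y;
    my $longest = max(length($x), length($y));
    return levenshtein([split //, $x], [split //, $y]) / $longest;
}

# Word-level distance in which substitutions are weighted by word_cost().
sub weighted_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. scalar(@$t));
    for my $i (1 .. scalar(@$s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@$t)) {
            push @curr, min(
                $prev[$j] + 1,
                $curr[$j - 1] + 1,
                $prev[$j - 1] + word_cost($s->[$i - 1], $t->[$j - 1]),
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print weighted_distance([qw(the word was)], [qw(the wrod was)]), "\n";   # misspelling
print weighted_distance([qw(the word was)], [qw(the king was)]), "\n";   # different word
```

With these costs a misspelled word is penalised less than an unrelated one, which is the behaviour described above.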

Other potential algorithms include the Smith-Waterman algorithm, BLAST and probably others.

The point of this overview is simply to illustrate the availability of simple and easy to understand algorithms that allow us to quantify the magnitude of MS variations. The link I gave above explains the algorithm in such easy terms, along with illustrations, that anyone can understand it.

All these algorithms are also available on Wikipedia, as usual an excellent source for such matters, and a few minutes of reading and a quick copy-and-paste from some website and you will have the perfect tool to add some meaningful weight to your analyses.

A long time since I last posted an entry here, but I am still hoping to step it up. Don’t forget to check out the main www.textcrit.com website for ways you can help the project.

See you all at the SBL conference in Washington, DC.
Julian

The ins and outs of displaying Greek text on the web… Part II

Posted September 4th, 2006 by Julian Jensen
Categories: Programming

Well, it is time to re-visit this topic as I have the promised server-side program ready for public consumption. You can freely use it in whatever manner you please for non-commercial purposes. Before we talk about the program we will talk some more about unicode and fonts. You didn’t think we covered everything last time, did you?

Please note that I am supplying many links to the places and topics in this article at the end rather than interspersing them into the text.

In the old days all we had were ascii codes. Basically 7 bits (although most systems supported 8 bits) that were used to encode the various latin characters and a few special ones. This posed a problem once we wanted to start displaying a much larger range of characters. A number of encoding schemes came about, most of which are still around today, but the best one, and the one that is clearly the future, is unicode. Unicode assigns every character its own number (a code point), and encodings such as UTF-8 store those code points in a variable number of bytes, which leaves room for over a million distinct characters. When you use unicode you need to set up your HTML properly. We addressed this in our last article on this subject. However, lots of pages still use font-specific encoding schemes, rather than unicode, to display Greek. What does this mean? Well, let’s look at an example. Here is an example from Liddell, Scott and Jones using the GraecaII font:

fh`/&quot;

Hmmm… Not overly readable. What it actually says is φῇς which is hardly obvious. This looks weird because the GraecaII font uses particular ascii (8 bit) sequences to identify, or encode, the final character. Most fonts have these schemes. One well-known encoding scheme is the TLG beta code system, which I am sure most of us are familiar with. In the example above, it follows that f = φ and h = η and ` = ͅ (iota subscript or ypogegrammeni) and / = ͂ (circumflex or perispomeni) and finally the &quot; is the HTML method of writing " (double quote) which is the terminal sigma (as opposed to medial). Quite a mouthful. Only 8 bit characters were used to encode this word but it only displays properly in the GraecaII font (and any other that uses the same encoding sequences). So, we need a way to translate from the particulars of a font’s encoding scheme to standard unicode. For us to accomplish this, we need to understand two things.

First, Combining Diacritical Marks. These are all the little marks that are added onto letters in many languages. We know the Greek ones but other languages have the same issues, such as the umlaut in German, for example. All these marks are defined in unicode using the codepoints from 0x300 to 0x36F. You can simply print these after a basic character and they will combine to some extent. It doesn’t always look particularly good so we will want to translate to the actual Greek character, usually in the extended set at 0x1F00 to 0x1FFF.
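To make the two steps concrete, here is a small sketch that maps the five GraecaII codes from the example word onto a base letter or a combining mark and then lets Perl's core Unicode::Normalize module compose them into precomposed characters. The table `%toy_map` and the function `toy_map2unicode` are hypothetical and cover only this one word; this is neither the SIL map format nor the map2unicode routine of the program described in this article.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Toy table for the example word only: font code => Unicode character.
my %toy_map = (
    'f' => "\x{03C6}",   # phi
    'h' => "\x{03B7}",   # eta
    '`' => "\x{0345}",   # combining ypogegrammeni (iota subscript)
    '/' => "\x{0342}",   # combining perispomeni (circumflex)
    '"' => "\x{03C2}",   # final sigma
);

# Translate each code, then compose base letters and marks (NFC).
sub toy_map2unicode {
    my ($encoded) = @_;
    my $decomposed = '';
    for my $code (split //, $encoded) {
        $decomposed .= exists $toy_map{$code} ? $toy_map{$code} : $code;
    }
    return NFC($decomposed);
}

binmode STDOUT, ':encoding(UTF-8)';
my $word = toy_map2unicode('fh`/"');
print $word, "\n";
printf "U+%04X ", ord($_) for split //, $word;
print "\n";
```

NFC also puts the marks into canonical order before composing, so it does not matter that the font lists the iota subscript before the circumflex.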

Now, luckily for us, the kind and worthy folks over at SIL International have done a lot of work in this area. This is our second piece of information. They have produced some nice utilities that will translate these mappings to normal unicode. Unfortunately, their utilities work only on files and only files of particular types. This is highly inconvenient so we will need to roll our own utilities. This doesn’t prevent us from using their map files, however. They have maps based on many commonly used fonts that show exactly what ascii codes combine to make a particular end character.

So now we can translate from the ascii encoding of some font to some combining unicode diacritical marks. Now all we have to do is to figure out which combination of diacritical marks maps to which single unicode character. For this information we turn to the official unicode website. They have a file that lists every unicode character and the various characters that go into making that final character. It is a bit of a messy file and the utility I wrote to get the information we need had to check each character in a recursive fashion because each character can consist of up to two other characters which can in turn be two other characters and so on. Luckily I did all this work for you. :)
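The recursive lookup could be sketched along these lines. This is not the utility mentioned above, just an illustration of the idea. The three records follow the semicolon-separated layout of UnicodeData.txt but are trimmed to the first six fields, which is all the sketch reads (field 0 is the code point, field 5 the decomposition). The names `fully_decompose` and `%precomposed` are assumptions.

```perl
use strict;
use warnings;

# Sample records in the UnicodeData.txt layout, trimmed to six fields.
my @records = (
    '03B7;GREEK SMALL LETTER ETA;Ll;0;L;',
    '1FC6;GREEK SMALL LETTER ETA WITH PERISPOMENI;Ll;0;L;03B7 0342',
    '1FC7;GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI;Ll;0;L;1FC6 0345',
);

# code point => list of the code points it decomposes into (one level).
my %decomposition;
for my $line (@records) {
    my @field = split /;/, $line, -1;
    my ($code, $mapping) = @field[0, 5];
    next if $mapping eq '' || $mapping =~ /^</;   # none, or compatibility-only
    $decomposition{ hex($code) } = [ map { hex($_) } split ' ', $mapping ];
}

# Follow the decomposition down until only undecomposable characters remain.
sub fully_decompose {
    my ($code_point) = @_;
    my $parts = $decomposition{$code_point}
        or return ($code_point);
    return map { fully_decompose($_) } @$parts;
}

# Invert the result: sequence of base letter and marks => single character.
my %precomposed;
for my $code (keys %decomposition) {
    my $key = join ' ', map { sprintf('%04X', $_) } fully_decompose($code);
    $precomposed{$key} = $code;
}

printf "%s => U+%04X\n", $_, $precomposed{$_} for sort keys %precomposed;
```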

So, now we will talk a little bit about the program. It is in Perl, which is ugly and cumbersome but widely available and could be easily converted to PHP or whatever, and is implemented as a standard Perl package. You will need to download the .ZIP file and extract the program (greek.pm) to the /site/lib/Unicode directory. Here is an example of how to use it:

   use Unicode::greek;

   my $greek = Unicode::greek->new;

   # These two just convert between beta code and unicode

   my $my_real_unicode_string = $greek->beta2unicode ( "fh=|s" );
   my $my_beta_code_string = $greek->unicode2beta ( $my_real_unicode_string );

   # Stripping diacriticals is sometimes useful when comparing strings,
   # for example, in lexical lookups
   # since no one seems to be able to agree on breathing marks and, more
   # importantly, many people are unsure or forget.
   # By stripping the diacriticals, you are much more likely to get a match.

   my $plain_beta_code = $greek->StripDiacriticals ( $my_beta_code_string );

   # Upper case and lower case conversion routines are also included.

   my $my_upper_case_unicode = $greek->ucase ( $my_real_unicode_string );
   my $my_lower_case_unicode = $greek->lcase ( $my_real_unicode_string );

   # Now for the font mapping routines. They are quite simple. They use the file
   # format supplied by SIL. They do not do any error checking.

   my $my_GraecaII_font_map_reference = $greek->ReadFontMap ( "fonts/GraecaII.map" );
   my $my_SD_font_map_reference = $greek->ReadFontMap ( "fonts/SemiticaDict.map" );

   # Select the map to use before each conversion.

   $greek->SetFontMap ( $my_GraecaII_font_map_reference );
   my $my_unicode_from_GraecaII = $greek->map2unicode ( "fh`/\"" );
   $greek->SetFontMap ( $my_SD_font_map_reference );

   # $some_text_encoded_in_SD_format stands for your own SemiticaDict text.
   my $my_unicode_from_SD = $greek->map2unicode ( $some_text_encoded_in_SD_format );
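For comparison, the same stripping idea can be applied directly to unicode text with Perl's core Unicode::Normalize module. This is a separate sketch, not the StripDiacriticals method of the package above (which, going by the usage example, is handed a beta code string); the name `strip_marks` is made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Remove accents, breathings and iota subscripts from decoded unicode text.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);     # split letters from their marks
    $decomposed =~ s/\p{Mn}//g;      # drop every nonspacing combining mark
    return NFC($decomposed);
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("\x{03C6}\x{1FC7}\x{03C2}"), "\n";   # the example word without its marks
```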

Well, there you have it. It has only been sparsely tested so if you find any problems, please let me know right away so I can fix them and upload a new version.

See you all at the SBL Annual meeting in Washington, DC, I hope. :)

Julian

My program described above: greek.zip

SIL International (SIL) have lots of font utilities and maps, check out their many maps here: Conversion Maps

Unicode (unicode) have a text file that explains all their characters here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and a page, which you should read first, explaining what it all means here: http://www.unicode.org/Public/UNIDATA/UCD.html.

Basic statistics and the text of the NT, Part II

Posted August 10th, 2006 by Julian Jensen
Categories: Programming

I have time for one more entry before I leave next week. This time we will talk about textual representation and, if my patience with writing holds out, similarity coefficients.

But first a few quick words about linear algebra or, more specifically, vectors and matrices. A vector can be thought of as a line that starts at 0,0. If you drew a line on a piece of paper you would need two coordinates, an X and a Y coordinate. Those would indicate where your line ended. In other words, we need two numbers to define a line in two dimensions, like a flat piece of paper. If you were looking at a stick planted in the ground at some arbitrary angle you would not be able to describe the orientation of the stick (line) with just two numbers. The X and Y components would tell you how far to the right and how far up the stick reached, but you would need to add one more, call it Z, to indicate how far away the end was. That is a 3 dimensional vector. You can keep adding dimensions by simply adding yet another number, one for each dimension. After three dimensions it gets a bit hard to visualize, but that’s okay because the vectors we will be dealing with aren’t technically spatial, although we will kinda pretend that they are. A vector in two dimensions is defined like so:

$$\mathbf{a} = (a_1, a_2)$$

and a vector in three dimensions looks like this:

$$\mathbf{a} = (a_1, a_2, a_3)$$

and a generic representation of a vector of n dimensions is here:

$$\mathbf{a} = (a_1, a_2, \ldots, a_n)$$

That’s pretty much all for vectors for now except for one thing that is very useful, especially for us. Actually, I will wait with that until I have shown why it’s cool. :)

A matrix is a whole bunch of vectors stuck together. I can hear the mathematicians who might be reading this groan and slap their foreheads. Yeah, it is not a very good definition. A matrix is a rectangular table of numbers. Here is an example:

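Or, for a far more correct generic representation with the proper labelling of the rows and columns, here is one I stole from Wikipedia:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$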

So, there.

Now, why do we care about all this? We care because of the way that we will quantify our text. Say you want to do counts of distinct words. Here is an example sentence, a bad sentence, actually.

Here  is  an  example  sentence  a  bad  actually
1     1   1   1        2         1  1    1

So, now we have, for our document of one bad sentence, a series of counts or occurrences. We call this our term vector, because it is a vector where each dimension (our words) has a coordinate (the associated count). See, it is not really a vector as such but we can treat it just as if it was. In this case, referring back to our definition above, a1 = 1, a2 = 1, a3 = 1, …, a5 = 2, …, a8 = 1. If we were dealing with two documents, we would simply add another set of numbers, giving us a matrix, called a Document-term matrix. Here is another bad, bad sentence.

            Here  is  an  example  sentence  a  bad  actually  another
Document 1  1     1   1   1        2         1  1    1         0
Document 2  1     1   0   0        1         0  2    0         1
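Counting like this is easy to automate. The following sketch builds the same document-term matrix from the two bad sentences; the function names are made up, words are lower-cased before counting, and the columns come out in alphabetical order rather than in the order shown in the table above.

```perl
use strict;
use warnings;

# word => count for one document (case-insensitive, punctuation ignored).
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { length($_) } split /[^[:alpha:]]+/, lc($text);
    return \%count;
}

# Returns the sorted vocabulary and one row of counts per document.
sub document_term_matrix {
    my @counts = map { term_counts($_) } @_;
    my %seen;
    for my $count (@counts) {
        $seen{$_} = 1 for keys %$count;
    }
    my @vocabulary = sort keys %seen;
    my @rows = map { my $c = $_; [ map { $c->{$_} || 0 } @vocabulary ] } @counts;
    return (\@vocabulary, \@rows);
}

my ($vocabulary, $rows) = document_term_matrix(
    'Here is an example sentence, a bad sentence, actually.',
    'Here is another bad, bad sentence.',
);
print join(' ', @$vocabulary), "\n";
print join(' ', @$_), "\n" for @$rows;
```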

In this case we have been counting single words (called unigrams), but there is no reason why we couldn’t count two-word combinations (called bigrams), three-word combinations (called trigrams) or, indeed, any number we want (called N-grams). When word combinations are counted we can detect the author’s tendency to use certain words together (και ευθυς would get a high count in Mark, for example :) ) which is called collocation, by the way. We could also count grammatical tags in the same combinations as words. Actually, we can count anything we want if we think it might be revealing.
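A minimal sketch of N-gram counting, assuming the text has already been split into a list of tokens. The function name `ngram_counts` and the transliterated token list are invented for the example; the tokens are a toy sequence, not a quotation.

```perl
use strict;
use warnings;

# Count every run of $n consecutive tokens.
sub ngram_counts {
    my ($n, @tokens) = @_;
    my %count;
    for my $start (0 .. @tokens - $n) {
        $count{ join ' ', @tokens[$start .. $start + $n - 1] }++;
    }
    return \%count;
}

my @tokens  = qw(kai euthus exelthen kai euthus eiselthen);
my $bigrams = ngram_counts(2, @tokens);
printf "%-20s %d\n", $_, $bigrams->{$_} for sort keys %$bigrams;
```

The same routine counts grammatical tags if the token list holds tags instead of words.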

Now that we have two vectors, one for each document, I can talk about that useful vector thing I skipped earlier. It is called a dot product (or sometimes an inner product or scalar product) and it is very simple. Multiply the values in each column and add them together. However, before we do that we need to normalize our vectors, make them unit vectors, which is not that big of a deal. Simply divide each element by the length of the vector. The length is calculated using the Pythagorean Theorem we discussed earlier. So the length of the Document 1 term vector goes like this: Take the square root of 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 2 * 2 + 1 * 1 + 1 * 1 + 1 * 1 + 0 * 0 and we get about 3.32 which we then divide into each element so our first element now becomes 0.301 instead of 1, same for the next and so on. Our vector now has length = 1. We do the same to the other one, whose length is the square root of 8, or about 2.83, so its 1s become 0.354 and its 2 becomes 0.707. Now multiply each column and add them up, 0.301 * 0.354 + 0.301 * 0.354 + … + 0.0 * 0.354, and the result is a single number, about 0.64 in this case. This number is very useful because it is the cosine of the angle between the two vectors. This means that if the vectors were similar (the texts had the same relative distribution) the result would be 1 and if they were very far apart (no words in common) we would get 0.

As we can see above in our table that may not be as useful as it sounds. For example, we are counting a and an as two separate words. On top of that, there are clearly words that should not be included in a test like this, namely any word that doesn’t contribute to the meaning or complexity of the sentence. Also, some words are more important than others, although defining that gets more nebulous, so some weights should be applied. This is very important if you want believable results. Once all that is done then the Vector Space Model (the fancy term for what we just did) works reasonably well.
There are several methods for condensing the data automatically, such as Principal Component Analysis and Singular Value Decomposition. I now regularly use those techniques on bible texts and achieve far greater success. (Thanks to westcott who suggested I look into it).
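The normalize-then-multiply recipe is short enough to sketch in a few lines. The function names are made up; the two vectors are the Document 1 and Document 2 rows from the table above.

```perl
use strict;
use warnings;

# Sum of the column-by-column products of two equally long vectors.
sub dot {
    my ($u, $v) = @_;
    my $sum = 0;
    $sum += $u->[$_] * $v->[$_] for 0 .. $#$u;
    return $sum;
}

# Divide each element by the vector's length so the result has length 1.
sub unit_vector {
    my ($v) = @_;
    my $length = sqrt(dot($v, $v));
    die "cannot normalize a vector of length 0\n" if $length == 0;
    return [ map { $_ / $length } @$v ];
}

# Cosine of the angle between two term vectors.
sub cosine {
    my ($u, $v) = @_;
    return dot(unit_vector($u), unit_vector($v));
}

my @document_1 = (1, 1, 1, 1, 2, 1, 1, 1, 0);
my @document_2 = (1, 1, 0, 0, 1, 0, 2, 0, 1);
printf "length of document 1: %.2f\n", sqrt(dot(\@document_1, \@document_1));
printf "cosine similarity:    %.2f\n", cosine(\@document_1, \@document_2);
```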

I guess I will get to similarity coefficients some other time as I have gotten tired. :) FYI, I have started work on a parser in order to extract lexical and syntactic structural information from the sentences. I expect stellar results from that approach because I suspect that the complexity, length and structural order have a lot to say about an author, especially when subjected to the same battery of statistical analysis as is already being applied. I will talk about that as I get it going. And to think that this all started with the Pericope de Adultera. :D

TTFN,
Julian

Basic statistics and the text of the NT

Posted August 7th, 2006 by Julian Jensen
Categories: Programming, Textual Criticism

Well, as I am delving deeper and deeper into the world of text analysis things are getting more and more complicated. So I thought that I would write a series of blog entries describing the basics of statistics and how they apply to text. Eventually, we would have to look at some very advanced techniques used in Computational Linguistics after we have worked our way through the basics. There will be math involved but I will explain exactly how it works as we go along in terms that should be clear to non-mathematicians, like myself. I will start by looking at various means, variance and standard deviation calculations and what they mean. Eventually we will work our way up to latent semantic analysis, singular value decomposition, principal component analysis, N-gram taggers, partial parsers and so on but that’s for much later.

Basic statistical functions tell us some basic facts about the text we are working with but the techniques are insufficient to provide a good separation of textual styles which is why we will need to rely on more advanced techniques from the field of computational linguistics which is a very fast moving field.

Well, let’s get to it.

First we need to distinguish between a population and a sample. In the case of NT texts we have the luxury of dealing with both. In real life, statisticians frequently only have samples and must estimate the population from those samples. We will need to do so as well when we are trying to determine authorship for fragments of text. However, we have the entire gospel of xxx, the entire epistle of yyy and so on which allows us to establish descriptive parameters for the entire text, i.e. the population.

The first function we will look at is how to calculate the mean of a set, set here meaning a collection of numbers, either a sample or the entire text (population.) Mean simply means the average, nothing more. There are several types of means with the arithmetic mean being the one that most people understand to be implied when we simply say mean. We also have the geometric mean and the harmonic mean. The latter two are not particularly useful in our case so we will not address them here. The mean, when described by a mathematician looks like this:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

What that says is simply this: Add up all the numbers of the set and divide by the number of entries. So if you had the numbers 10, 20, 30 you would add them all, getting 60, and then divide by 3, yielding a result of 20. So the mean of 10, 20, 30 = 20.
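In code the same calculation is a one-liner. A minimal sketch, using `sum` from Perl's core List::Util module; the function name `mean` is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean: add everything up and divide by the number of entries.
sub mean {
    my @values = @_;
    die "the mean of an empty set is undefined\n" unless @values;
    return sum(@values) / @values;
}

print mean(10, 20, 30), "\n";
```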

Math note: Multiplying by 1 over n is the same as dividing by n. The value 1/n is the reciprocal of n in math talk or sometimes the multiplicative inverse for extra nerd points.

Okay, for our next trick we will look at variance and standard deviation. First, variance. Variance is a measure of the amount of variation in a dataset. It is used in a number of calculations and is simply the square of the standard deviation, which is why they are explained here in the same section. In mathematical terms, variance looks like this

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

for a population variance and like this

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

when working with a sample. It can be shown mathematically that the second one tends to represent the population better than the first one when dealing with samples. A discussion of that would be far too lengthy (and can be found on wikipedia under variance and standard deviation.) Before I explain how they work and why we care, I will show the final formula for this which is the Standard Deviation, a term which I am sure most have heard and probably understand to some extent. Here it is:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

There. Now we have some building blocks. As can be seen above, the standard deviation is simply the square root of the variance. You calculate it by first calculating your mean value as described earlier. Then you go through each datum in your set and subtract the mean (this makes the set centered around the mean) and square it. You add all these numbers together (which is what is meant by the upper case sigma in the formula) and divide by the number of values in your set, subtracting one from that number before dividing. This is the variance. Take the square root of that and you have the standard deviation. So, what does all this really mean?

The standard deviation tells us how spread out the data are from our mean. Remember the well-known Pythagoras formula that we all had to learn for calculating the hypotenuse? All that formula does is calculate the distance. If you have ever worked construction, or do a lot of home improvement, you probably know the 3-4-5 rule used to make sure that something is square. Say you are building a fence that needs to have 90 degree corners. Measure 3 feet along one side, measure 4 feet along the other and measure between those two points and the length should be 5 feet, because the square root of (3 * 3 + 4 * 4) is 5. So, if you add up the square of a bunch of distance measurements (remember, when we subtract our mean from each value we essentially get the ‘distance’ of that point from the average) and then take the square root you get the distance in however many dimensions you added up. Don’t worry about that last ‘dimension’ bit just yet. In our case, we added up our squares and then divided by the number of values in our set (minus one but that is not so important here) getting the average squared distance of our data from the mean. The subsequent square root gives us the actual (non-squared) distance. So, what we end up with is the standard deviation which is, roughly speaking, another way of saying the typical distance of our data from the average.
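The steps for variance and standard deviation can be sketched as follows, reusing the 10, 20, 30 example from the section on the mean. The function names and the sample flag are assumptions; a true second argument divides by n - 1 (sample), a false one by n (population).

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Variance of the values in an array reference.
sub variance {
    my ($values, $is_sample) = @_;
    my $n       = @$values;
    my $divisor = $is_sample ? $n - 1 : $n;
    die "not enough values\n" if $divisor < 1;
    my $mean    = sum(@$values) / $n;
    my $squares = sum(map { ($_ - $mean) ** 2 } @$values);   # squared distances from the mean
    return $squares / $divisor;
}

# The standard deviation is the square root of the variance.
sub standard_deviation {
    my ($values, $is_sample) = @_;
    return sqrt(variance($values, $is_sample));
}

my @data = (10, 20, 30);
printf "population variance: %.2f\n", variance(\@data, 0);
printf "sample variance:     %.2f\n", variance(\@data, 1);
printf "sample std. dev.:    %.2f\n", standard_deviation(\@data, 1);
```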

So now we have a way to get our average and determine how spread out our data is. Now, how do we apply that to text? Well, first you quantify your text in some manner (which will be our next blog entry, namely, how to quantify text) and then you calculate the mean and standard deviation. If you are doing a word count, for example, the mean will tell you the average number of times a word is being used and the standard deviation tells you how much these word counts vary. Obviously, numbers like these are far too feeble to tell us much about the text. They are, however, important building blocks that we will need for further investigation.

For example, for 15 random 1000 word blocks from Luke and John each, we can see that Luke uses, on average (the mean of 15 different 1000 word blocks in this case) 306 distinct words with 289 being the minimum and 330 the maximum. John, on the other hand, uses 216 distinct words with min and max being 182 and 255, respectively. So clearly, Luke uses a far greater vocabulary. We can compare these figures because we are using the same sample size. However, when dealing with the entire texts we run into problems because a longer text will naturally use a larger vocabulary. The caveat here is that the word usage does not increase linearly, meaning that if you double the length of the text it does not follow that the number of distinct words doubles as well. In fact, it never does. In some cases we will need to look at a text fragment of a particular size and if we are to check the distinct word usage against the biblical writers we will need to estimate what number of words would be typical for a fragment of that size for any given author. I have pre-calculated the word usage for the biblical books at regular sampling size intervals at a tight enough spacing that we will be able to linearly interpolate the correct word usage and its standard deviation to establish probability.
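The two pieces described here, counting distinct words in a block and interpolating between tabulated sample sizes, might be sketched like this. The function names are assumptions, and the table is HYPOTHETICAL: only the 1000-word figure of 306 for Luke comes from the paragraph above, while the 2000-word entry is invented purely to have a second point to interpolate towards.

```perl
use strict;
use warnings;

# Number of different word forms in a list of tokens (case-insensitive).
sub distinct_words {
    my %seen;
    $seen{ lc($_) } = 1 for @_;
    return scalar keys %seen;
}

# Straight-line estimate of y at $x between the points ($x0,$y0) and ($x1,$y1).
sub interpolate {
    my ($x, $x0, $y0, $x1, $y1) = @_;
    return $y0 + ($y1 - $y0) * ($x - $x0) / ($x1 - $x0);
}

# HYPOTHETICAL table: sample size => mean number of distinct words.
# 1000 => 306 is the Luke figure from the text; 2000 => 500 is made up.
my %luke_vocabulary = (1000 => 306, 2000 => 500);

printf "distinct words in a toy sample: %d\n",
    distinct_words(qw(Kai kai euthus legei));
printf "estimate for a 1500-word fragment: %.0f\n",
    interpolate(1500, 1000, $luke_vocabulary{1000}, 2000, $luke_vocabulary{2000});
```

Because vocabulary growth is not linear, a straight-line estimate like this is only reasonable between closely spaced sample sizes, which is why the spacing of the pre-calculated table matters.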

That’s probably enough for now. I shall continue soon with a look at text representation and quantification. The interval between blog entries should decrease now that I am back from Europe, although I do have to go for a week and wear medieval clothes at Pennsic. :)

Julian

Pericope de Adultera: Statistical analysis of the NT, preliminary

Posted July 12th, 2006 by Julian Jensen
Categories: Programming, Textual Criticism

Well, due to some recent discussions regarding the Pericope de Adultera (specifically John 8:1-11) I decided to use the power of computers to put this issue to bed. Well, we’ll see. Dr. J. Gibson was kind enough to furnish me with some information regarding earlier analyses of the issue and, at first glance, the conclusions seemed sound. The passage seemed remarkably Lukan and this coincided with my own hasty statistical observations. But, I thought, let’s take a closer look. Let’s do a real statistical analysis. Right. Let’s do it. Right now. Tum-tum-tum… Hmmm… What exactly does that mean? Isn’t that always the case: you start something and realize that you need to establish an effective methodology before you can even tackle the problem.

Let me warn you in advance: I don’t have an answer yet and I am going to Europe for two weeks, which will severely slow me down on this issue. But I did want to record my observations as they stand at this moment.

At first I read some research on PA (Pericope de Adultera) stylistic issues which is a fancy way of saying word and, to a limited extent, phrase usage. The best I read was A Possible Case of Lukan Authorship by Henry J. Cadbury which appeared in HTR 10.3 (Jul., 1917) pp. 237-244, I don’t have my SBL style book handy for the citation, so deal. ;) He looked at particular words, compound word usage and some idiomatic phrases and eventually concluded that, although the language seems decidedly Lukan, we just don’t know, especially in light of the hard MSS contra-evidence. A good article but sort of like talking a nice looking woman into coming home with you only to find out she is a 79-year old eunuch in drag with a large collection of whips. Surely, this can be solved with some decent statistics.

The first thing you come to terms with, after you get over the shock of realizing that the solution isn’t trivial, is that you need to find a good way to quantify your NT text. Then you need a good way to correlate it to other texts. Then you need to find ways to consistently determine textual variance which is a bit tricky because this isn’t a stochastic issue. Humans, don’t you just hate them? Writers, in particular? I mean, write consistently already!

Here comes the technical math jargon. Bear in mind that I am not a mathematician so I may screw up some terminology.

The text is broken down into a frequency analysis table and treated as bivariate data with regards to its comparison target.

The data is then analyzed for general statistical style parameters for future use. This involves word and grammar frequencies, mean, standard deviation and variance. Then you compare the text to itself and establish correlation coefficients, similarity coefficients, regression slopes and so forth. This is the point when you realize that all writers should be shot. Or, in other words, a document is not overly textually consistent. Anyways, moving right along… By comparing the document to itself you can establish parameters regarding confidence intervals and expected values. You don’t, of course, compare the entire gospel to itself in one big chunk, or the values you got would be meaningless since GJohn == GJohn. I pick a section of a size that has been found to provide good parameters and then compare that section to the rest of the gospel, divided into similar sized sections, excluding itself. You then move to the next section and repeat. This ensures that every section of the gospel has been compared to every other section without comparing any portion to itself. You now have a complete statistical snapshot of the gospel. Then you do the same thing except this time you compare each section to a section in another gospel to see if there is enough variance to determine a style differential. And let me tell you, it’s subtle. Real subtle. It turns out that word usage is a crappy parameter. You can throw out words that are shorter than, say, 4 letters and that helps a lot, but even so… Anyone who argues authorship based on word usage alone should have his head examined or learn some statistics. A much better parameter is grammar. What was fuzzy before becomes much more crisp. Not to the point of obviousness but the values start to separate. An even better approach is a two-word segmentation, or two grammar tags in sequence, done in the same way. And even more powerful is a three-entity analysis. The problem is that your hit count drops to the point where your sample size becomes statistically suspect. I plan to use Student’s t-test to help this along but that is too feeble to help much. The nice thing about two-word combinations is that they tend to pick up adjective/noun order in addition to pleonasms. Three-word combinations will pick up many idiomatic phrases. These features probably identify a writer far more surely than his choice of words.
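The sectioning and the two- and three-entity counting described above can be sketched roughly like this. It is only an illustration under my own assumptions: the function names are made up, and the "tokens" can be either words or grammar tags, since the counting is the same for both.

```javascript
// Count every run of n consecutive tokens (n = 1, 2 or 3 in the text above).
// Returns a Map from the joined n-gram to the number of times it occurs.
function ngramCounts(tokens, n) {
    var counts = new Map();
    for (var i = 0; i + n <= tokens.length; i++) {
        var key = tokens.slice(i, i + n).join(' ');
        counts.set(key, (counts.get(key) || 0) + 1);
    }
    return counts;
}

// Cut a token list into consecutive sections of the given size.
// The last section may be shorter than the others.
function splitIntoSections(tokens, size) {
    var sections = [];
    for (var i = 0; i < tokens.length; i += size)
        sections.push(tokens.slice(i, i + size));
    return sections;
}

// Every pair of distinct section indexes, so that each section is
// compared to every other section but never to itself.
function sectionPairs(sectionCount) {
    var pairs = [];
    for (var i = 0; i < sectionCount; i++)
        for (var j = i + 1; j < sectionCount; j++)
            pairs.push([i, j]);
    return pairs;
}
```

For each pair you would build the n-gram tables of both sections and hand them to whichever similarity measure you are testing. To compare against another gospel, you pair sections from the two different lists instead of from the same one.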

The solution is to combine the various techniques briefly described above. Run a statistical analysis on the results of the statistical analysis. While it will not be conclusive, it will be significant. Well, that’s my prediction, at least.

In the meantime, I have found that certain techniques work better than others. The values below are my eyeball estimates as I haven’t done any programming to list min/max output. Here is a geeky summary, so far:

  • Word usage overlap: Runs anywhere from 30% to 60% but is quite erratic and unreliable. And I see this used as an argument all the time. *eyeroll*

  • a,b,c,d: Similarity parameters used by many of the formulae. a is the count of items found in both texts, b those found only in the first, c those found only in the second and d those found in neither. This last obviously never happens, which simplifies many calculations. All of these run pretty close but there is a subtle difference which might become significant if some general observation on this parameter can be established. It generally involves properly thresholding the variance of the Jaccard similarity coefficient. Probably converting it to a z-score and adding some fuzzy logic for thresholding.

  • Source and target mean, variance and standard deviation are all pretty stable which gives me some hope that authorship is detectable, at least in a sense where one can determine a probability that is moderately convincing. Not that they are useful for comparison, they only show consistency in style, but if the style is tight enough one would think that divergence could be established.

  • Correlation coefficient. Well, this isn’t working at all. No real surprise there since the bivariate variables are not really dependent on each other. The Pearson product moment correlation coefficient yields a big fat zero every time. This has the unfortunate effect described in the next point.

  • Regression analysis is entirely impossible. Scatter charts are a joke with this data and least-squares analysis comes up with nonsense. The slope and intercept are almost random. :)

  • The Jaccard similarity coefficient has some potential but is too simple in this context. Half the time John agrees with Luke more than with John. I hold no hopes there. Luckily we have lots of similarity coefficient calculations available and some seem to yield some consistent results. Read on…

  • Jaccard distance is as useless as its opposite number above.

  • The Dice similarity coefficient could have been designed by Andrew Dice Clay for all its efficiency. No dice, as they say…

  • The Rogers-Tanimoto similarity coefficient is not blowing my skirt up, either. Not that I am wearing a skirt. At the moment. ;)

  • The Kulczynski similarity coefficient was hard to spell, easy to implement and yields mediocre results, certainly nothing pursuable.

  • The Hamann similarity coefficient was a surprise. Despite the deceptively simple formula it is yielding very consistent results. It pretty much always detects proper authorship where the previous methods failed.

  • Pearson’s chi-square is another good one. Not completely consistent but fairly reliable, nonetheless.

  • The Normalized Pearson’s chi-square is, obviously, much like the above except better. :)

  • z-score correlation has been kind to me. Converting to z-scores started yielding some consistent results when shopping for a decent correlation coefficient. This is another good one.
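For those who want to play along at home, here is a small sketch of the a, b, c, d counts and four of the coefficients from the list, using their standard textbook forms for presence/absence data. The function names are mine, the inputs are assumed to be sets of distinct words (or word combinations), and d is fixed at zero as described above. This is not the code from my program, just the bare formulae.

```javascript
// Build the a, b, c, d counts from two sets of distinct items.
// d (in neither text) is always 0 here, as noted in the list above.
function contingency(first, second) {
    var a = 0, b = 0, c = 0;
    first.forEach(function (item) {
        if (second.has(item)) a++; else b++;
    });
    second.forEach(function (item) {
        if (!first.has(item)) c++;
    });
    return { a: a, b: b, c: c, d: 0 };
}

// Standard definitions; each assumes the two sets are not both empty.
function jaccard(t) {
    return t.a / (t.a + t.b + t.c);
}

function dice(t) {
    return (2 * t.a) / (2 * t.a + t.b + t.c);
}

function rogersTanimoto(t) {
    return (t.a + t.d) / (t.a + t.d + 2 * (t.b + t.c));
}

function hamann(t) {
    return ((t.a + t.d) - (t.b + t.c)) / (t.a + t.b + t.c + t.d);
}
```

Note that Hamann runs from -1 to 1 while the other three run from 0 to 1, so the raw values are not directly comparable across coefficients.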

So, the trick is to combine 1, 2 and 3 word analyses as well as 1, 2 and 3 word grammar combinations. On top of this we add the statistical techniques that yield consistent results when operating on texts where we know the authorship, at least in a statistically significant sense. Combining the analysis results using more statistics should eventually yield a convincing percentage.

But much of this must wait until I return. I am off to see my kids for the first time in three years. Priorities and all that… Type to you soon.

Julian

P.S. All these formulae can be found on the web. I have found Wikipedia to be especially helpful on this topic.
P.P.S. Once again this message is far too short to get into any detail. I could go on and on about this but, as usual, ask questions here and you shall receive answers. I will check this blog for comments while in Europe.
P.P.P.S. All this statistical analysis stuff is part of the search/compare portion of my analytical bible program which will soon (yeah, right) be free and online at www.textcrit.com so all this will be available to you without having to actually learn how the math works. :)

The ins and outs of displaying Greek text on the web…

Posted June 30th, 2006 by Julian Jensen
Categories: Programming

For my first real blog entry, I will talk about some important programming aspects involved in displaying Greek text on the web. There are several issues involved that are frequently handled very poorly by various web pages. This blog entry will go step by step through the process, including the actual code later on. The four parts to be considered are:

  1. Character encoding
  2. Character sets, font files and Unicode
  3. Preferences and customization
  4. Conversions

The first one involves declaring the correct character encoding (the charset parameter of the document’s MIME type), ensuring proper display. Without this the Greek text is likely to appear as garbage. The user can override the encoding but how many users know that this is what they are supposed to do? How many even know how to do it?

The user can change the encoding on the View menu, selecting Encoding (or Character Encoding in Firefox) and then selecting Unicode (UTF-8). For an example of what erroneous encoding can do to your Greek display, check out Justin’s First Apology here: http://khazarzar.skeptik.net/books/justinus/apolog1g.htm

They selected a Cyrillic encoding and the result is obvious. If you change the encoding to UTF-8 the text suddenly becomes legible (provided you can read Greek to begin with, of course.) Notice also how the German on that first page becomes legible as well.

They should have set the encoding in their web document and freed the, quite possibly clueless, user from the rather cryptic task. It is a good practice to start every web page with the following unless you have a very good reason not to.

For HTML, put this line at the top of the head section of every document:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

This ensures that the text is displayed correctly, no matter what language you are using. For server-side scripting it is much the same, here is what it looks like in Perl:

	print "Content-type: text/html;charset=utf-8\n\n";

Make sure that you include the two newline characters or it will fail. For other scripting languages, check here: http://www.w3.org/International/O-HTTP-charset
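If your server side happens to be JavaScript, the same header looks roughly like this. This is a sketch that assumes Node.js and its built-in http module; the handler name and the port number are made up for the example.

```javascript
// Request handler that declares UTF-8 before sending any Greek text.
// Works with Node's built-in http module: res.setHeader() and res.end().
function greekPage(req, res) {
    res.setHeader('Content-Type', 'text/html; charset=utf-8');
    res.end('<div class="greek">καθὼς παρέδοσαν ἡμῖν</div>');
}

// To try it out (port 8080 is an arbitrary choice):
// require('http').createServer(greekPage).listen(8080);
```

Here the server writes the blank line after the headers for you, so there are no newline characters to forget.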

Next, we will discuss character sets, font files and Unicode. The official Greek Unicode charts can be found here: http://www.unicode.org/charts/ Note that there are two charts, both within the 16-bit range. The first ranges from 0x370 to 0x3ff and covers the regular upper case and lower case letters as well as the characters with tonoi. The second ranges from 0x1f00 to 0x1fff and covers all the lower and upper case letters with their diacritical marks. The way they are laid out is pretty decent and helps when converting characters. Most of the well-known Greek fonts available for download cover these characters. At least one standard Windows font also covers the entire range (Tahoma).
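The two ranges are easy to check for in code. Here is a minimal sketch with function names of my own choosing; it only tests the two blocks mentioned above and ignores combining marks outside of them.

```javascript
// True if the code point falls in either of the two Greek charts:
// 0x370-0x3ff (basic letters and tonos forms) or
// 0x1f00-0x1fff (letters with the full set of diacritical marks).
function isGreekCodePoint(cp) {
    return (cp >= 0x0370 && cp <= 0x03ff) ||
           (cp >= 0x1f00 && cp <= 0x1fff);
}

// Count how many characters in a string come from the Greek charts.
function countGreek(text) {
    var count = 0;
    for (var ch of text)
        if (isGreekCodePoint(ch.codePointAt(0)))
            count++;
    return count;
}
```

Something like `countGreek` is handy for deciding whether a piece of user input should be treated as Greek at all before you attempt any conversion on it.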

This ties into our third point neatly. Everybody has different tastes in Greek fonts. I, personally, like the Tahoma font because it is clean, crisp, widely available and looks good when displayed in a normal size. I find it fairly essential that users are allowed to customize the font choice if the site is heavily dependent on Greek characters. There is really no excuse not to do this since it is rather uncomplicated. I won’t talk much about server-side font selection since there are about a thousand ways of doing this and if you know how to do server-side programming then you don’t need me to explain to you how to work the font selection. Much can be done client-side, however, using Javascript and the Document Object Model.

The method I have chosen is to modify the global stylesheet, although there are some cross-browser issues. It is also rather fuzzy since the entries look like JSON but they really aren’t. This is a problem with most objects that didn’t originate from the Javascript core: it looks like a duck, quacks like a duck, but try and treat it like a duck and you’ll be sad. I will probably write more on this issue at a future date, especially the problems with the Array object as returned from the DOM and other places. Anyways, this is all solvable as will be seen below.

When allowing the user to select the font you should also be kind enough to remember his selection and set it upon his next return. Let’s first present the complete but simple example of how to do all this.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<style type="text/css">
/* Mark all elements that  display Greek with class=greek
* You can add whatever elements you want in addition to the font-family
*/
.greek
{
    font-family: Tahoma;
}

</style>
<script type="text/javascript">
function setFont( fontName )
{
    var theRules = [];
    if ( document.styleSheets[ 0 ].cssRules )    // Standards-based browsers
        theRules = document.styleSheets[ 0 ].cssRules;
    else if ( document.styleSheets[ 0 ].rules )  // Older Internet Explorer
        theRules = document.styleSheets[ 0 ].rules;

    for ( var n = 0; n < theRules.length; n++ )
        if ( theRules[ n ].selectorText == '.greek' )
            theRules[ n ].style.fontFamily = fontName;

    setCookie ( 'userFontName', fontName );
}

function setCookie ( cName, cValue )
{
    var exdate = new Date();

    exdate.setDate ( exdate.getDate() + 365 ); // Set for one year.
    document.cookie = cName + '=' + escape ( cValue ) + ';expires=' + exdate.toUTCString();
}

function getCookie ( cName )
{
    if ( document.cookie.length > 0 )    // Are cookies turned on?
    {
        var start = document.cookie.indexOf ( cName + '=' );
        if ( start != -1 )
        {
            start = start + cName.length + 1;
            var end = document.cookie.indexOf ( ";", start );
            if ( end == -1 ) end = document.cookie.length;
            return ( unescape ( document.cookie.substring ( start, end ) ) );
        }
    }

    return ( null );
}

function winLoad ()
{
    var ck = getCookie ( 'userFontName' );
    if ( ck )
    {
        setFont ( ck );

        var fontSelect = document.getElementById ( 'fontSelect' );
        for ( var n = 0; n < fontSelect.options.length; n++ )
            if ( fontSelect.options[ n ].value == ck )
                fontSelect.selectedIndex = n;
    }
}

</script>
</head>
<body onload="winLoad();">
<div class=greek>
ἐπειδήπερ πολλοὶ ἐπεχείρησαν ἀνατάξασθαι διήγησιν περὶ τῶν πεπληροφορημένων…
</div>
<div>
Regular text here…
<select id=fontSelect onChange="setFont ( this.options[ this.selectedIndex ].value );">
<option value=Tahoma>Tahoma
<option value=SPIonic>SPIonic
</select>
</div>
<div class=greek>
καθὼς παρέδοσαν ἡμῖν οἱ ἀπ’ ἀρχῆς αὐτόπται καὶ ὑπηρέται γενόμενοι τοῦ λόγου
</div>
</body>
</html>

If you want to try out this program, make sure that you save the document in a format that supports Unicode. Word or Wordpad will both do this, just pick Save As… and change the Save as type… If you see garbage on your screen, you saved it in a format that doesn’t support wide characters.

The body of the program is pretty simple. There is a DIV tag marked as containing Greek text (you could mix the Greek in with other languages as long as the font selected has those characters), then a SELECT which allows you to choose a font and then another section of Greek. You can add as many fonts in the SELECT as you like.

We have an onload event for this document. It gets the cookie (if it exists), changes the stylesheet and makes the SELECT start with the current font selection. The cookie is set for a year; simply change the 365 to some other number of days if you wish. The setFont function goes to the first stylesheet, finds the ‘greek’ class and sets the font. It also updates the cookie.

That’s it. Nothing to it. Anyone is free to copy the above and use it as they see fit.

So now we can display the font properly, we know the character set layout, we can let the user select a font and remember it for future use. What’s left? The hardest part, as a matter of fact.

Conversion is an interesting topic. When I say conversion, I mean conversion between upper case, lower case, betacode, stripping diacriticals, HTML character entities and so on… It is entirely lame that I have to transliterate Greek into betacode on some sites in order to do a search when I have the Tavultesoft Keyman (which I highly recommend to everyone, it is excellent and free) installed.
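Two of these conversions, stripping diacriticals and changing case, can at least be sketched in a few lines of JavaScript by leaning on Unicode normalization. To be clear, this is not my Perl converter, the function names are made up for the example, and betacode and HTML entities are not handled here at all.

```javascript
// Strip accents, breathings, iota subscripts and other combining marks.
// NFD splits each precomposed letter into a base letter plus combining
// marks (range 0x300-0x36f), which are removed; NFC recomposes the rest.
function stripDiacritics(text) {
    return text
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '')
        .normalize('NFC');
}

// Upper case without diacritical marks, useful as a search key so that
// differently accented forms of the same word compare as equal.
function toSearchKey(text) {
    return stripDiacritics(text).toUpperCase();
}
```

Running both the search term and the text being searched through `toSearchKey` lets a user find a word regardless of how it is accented, which is most of what a search box needs.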

I don’t know of any conversion programs out there, I searched, so I ended up writing my own in Perl. I was going to post it as part of this entry but I am realizing that it is not yet quite ready for public consumption. If you need it in a hurry let me know, otherwise I will simply post a link to it here once it is finished, which won’t be long. Really. It essentially does all the conversions I mentioned above. It came in handy when trying to marry up the MorphGNT and XML version of Strong’s, which is a story in itself. I will relay that in one of my next entries. The MorphGNT is actually surprisingly accurate, more so than many other GNT sites and tools that I have seen. The Strong’s…? Not so much. That blog entry will also give me a chance to rant about the poor use of computers, the NA27, ridiculous pricing and some fairly pathetic approaches to the whole technology issue with regards to biblical studies.

For now, this was my first entry. I doubt anyone will read this far. If you did, then I hope I have been of some assistance. I have worked with this for a while now and have gathered some knowledge in this area, so any questions are welcome, since I realize that this was a rather short entry that left out a large number of details.

Julian

First entry…

Posted June 29th, 2006 by Julian Jensen
Categories: Uncategorized

This is my first blog entry. This is really more of a test so there will be nothing profound here. There may never be. This is mostly a blog dedicated to the interaction and application of computers to the field of biblical studies, especially textual criticism. There are several reasons for starting this blog. First and foremost, this software was free. It came with my web host. All the cool people are doing it so, hey, I want to be one of the cool people. Secondly, I am working on a free-for-all, online program to aid in biblical studies that will appear eventually on www.textcrit.com, the main domain for this blog. I am not sure when the first version will be available. Soon, I hope, but it is a lot of work. In a few days I will write a longer blog entry about what I am working on but for now this is just a test of the blog software.

Julian