Basic statistics and the text of the NT, Part II
Thursday, August 10th, 2006
I have time for one more entry before I leave next week. This time we will talk about textual representation and, if my patience with writing holds out, similarity coefficients.
But first a few quick words about linear algebra or, more specifically, vectors and matrices. A vector can be thought of as a line that starts at 0,0. If you drew such a line on a piece of paper you would need two coordinates, an X and a Y, to indicate where the line ended. In other words, we need two numbers to define a line in two dimensions, like a flat piece of paper. If you were looking at a stick planted in the ground at some arbitrary angle, you would not be able to describe the orientation of the stick (line) with just two numbers. The X and Y components would tell you how far to the right and how far up the stick reached, but you would need to add one more, call it Z, to indicate how far away the end of the stick was. That is a 3 dimensional vector. You can keep adding dimensions by simply adding yet another number, one for each dimension. After three dimensions it gets a bit hard to visualize, but that’s okay because the vectors we will be dealing with aren’t technically spatial, although we will kinda pretend that they are. A vector in two dimensions is defined like so:

a = (a1, a2)
and a vector in three dimensions looks like this:
a = (a1, a2, a3)
and a generic representation of a vector of n dimensions is here:
a = (a1, a2, …, an)
That’s pretty much all for vectors for now except for one thing that is very useful, especially for us. Actually, I will wait with that until I have shown why it’s cool.
A matrix is a whole bunch of vectors stuck together. I can hear the mathematicians who might be reading this groan and slap their foreheads. Yeah, it is not a very good definition. A matrix is a rectangular table of numbers. Here is an example:

Or, for a far more correct generic representation with the proper labelling of the rows and columns, here is one I stole from Wikipedia:
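For reference, the conventional way of writing a generic matrix with m rows and n columns is sketched below. The symbols A, m, n and the entry names are the usual textbook choices and are assumed here rather than taken from this post; the first index is the row and the second is the column.

```latex
A =
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
```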

So, there.
Now, why do we care about all this? We care because of the way that we will quantify our text. Say you want to do counts of distinct words. Here is an example sentence, a bad sentence, actually.
| Here | is | an | example | sentence | a | bad | actually |
| 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
So, now we have, for our document of one bad sentence, a series of counts or occurrences. We call this our term vector, because it is a vector where each dimension (one of our words) has a coordinate (the associated count). See, it is not really a vector as such, but we can treat it just as if it were. In this case, referring back to our definition above, a1 = 1, a2 = 1, a3 = 1, …, a5 = 2, …, a8 = 1. If we were dealing with two documents, we would simply add another set of numbers, giving us a matrix, called a document-term matrix. Here is another bad, bad sentence.
| | Here | is | an | example | sentence | a | bad | actually | another |
| Document 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 |
| Document 2 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 1 |
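The counting in the table above can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling actually used for the NT texts: the function names `tokenize` and `document_term_matrix` are made up for this example, words are kept exactly as written (so "Here" and "here" would count as different words), and the vocabulary is simply listed in order of first appearance.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, dropping punctuation (case is kept as-is)."""
    return re.findall(r"\w+", text)


def document_term_matrix(documents):
    """Return (vocabulary, rows) with one row of word counts per document."""
    token_lists = [tokenize(doc) for doc in documents]

    # Vocabulary in order of first appearance across all documents.
    vocabulary = []
    for tokens in token_lists:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)

    # A Counter returns 0 for words a document does not contain.
    counts = [Counter(tokens) for tokens in token_lists]
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


documents = [
    "Here is an example sentence, a bad sentence, actually.",
    "Here is another bad, bad sentence.",
]
vocabulary, rows = document_term_matrix(documents)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is the term vector of one document, and the rows stacked together form the document-term matrix.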
In this case we have been counting single words (called unigrams), but there is no reason why we couldn’t count two-word combinations (called bigrams), three-word combinations (called trigrams) or, indeed, combinations of any length we want (called N-grams). When word combinations are counted we can detect an author’s tendency to use certain words together, which is called collocation, by the way (και ευθυς would get a high count in Mark, for example). We could also count grammatical tags in the same combinations as words. Actually, we can count anything we want if we think it might be revealing.
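Counting N-grams works the same way as counting single words, only with a sliding window over the tokens. Here is a rough sketch under some loud assumptions: the helper names `ngrams` and `ngram_counts` are invented for this example, the sample text is a made-up English string and not a real verse, and the window happily runs across punctuation and sentence boundaries, which a serious analysis would probably want to handle.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count how often each n-gram occurs in the text."""
    tokens = re.findall(r"\w+", text)
    return Counter(ngrams(tokens, n))


# Made-up sample text, used only to show a repeated bigram.
sample = "and immediately he rose and immediately they went"
bigram_counts = ngram_counts(sample, 2)
for bigram, count in bigram_counts.most_common():
    print(" ".join(bigram), count)
```

A pair that keeps turning up, like "and immediately" in the toy string, floats to the top of the counts, which is the kind of signal collocation analysis looks for.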
Now that we have two vectors, one for each document, I can talk about that useful vector thing I skipped earlier. It is called a dot product (or sometimes an inner product or scalar product) and it is very simple: multiply the two values in each column and add the products together. However, before we do that we need to normalize our vectors, that is, make them unit vectors, which is not that big of a deal. Simply divide each element by the length of the vector. The length is calculated using the Pythagorean Theorem we discussed earlier. So the length of the Document 1 term vector goes like this: take the square root of 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 2 * 2 + 1 * 1 + 1 * 1 + 1 * 1 + 0 * 0 and we get about 3.32, which we then divide into each element, so our first element now becomes 0.301 instead of 1, same for the next and so on. Our vector now has length = 1. We do the same to the other one, whose length is the square root of 8, about 2.83, so its 1s become 0.354 and its 2 becomes 0.707. Now multiply each column and add them up, 0.301 * 0.354 + 0.301 * 0.354 + … + 0.0 * 0.354, and the result is a single number, about 0.64 for our two sentences. This number is very useful because it is the cosine of the angle between the two vectors. This means that if the vectors were similar (the texts had the same relative distribution of words) the result would be 1, and if they were very far apart (no words in common) we would get 0.
As we can see in our table above, that may not be as useful as it sounds. For example, we are counting a and an as two separate words. On top of that, there are clearly words that should not be included in a test like this, namely any word that doesn’t contribute to the meaning or complexity of the sentence. Also, some words are more important than others, although defining that gets more nebulous, so some weights should be applied. This is very important if you want believable results. Once all that is done, the Vector Space Model (the fancy term for what we just did) works reasonably well.
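The normalize-then-multiply recipe above can be sketched in plain Python. Treat this as a minimal illustration under stated assumptions: the function names `normalize`, `dot` and `cosine_similarity` are chosen for this example, the two vectors are the rows of the document-term table from earlier, and no stop-word removal or weighting is applied.

```python
import math


def normalize(vector):
    """Scale a vector to length 1 by dividing each element by the vector's length."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a vector of all zeros")
    return [x / length for x in vector]


def dot(a, b):
    """Multiply the values column by column and add the products together."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two unit vectors, i.e. the cosine of the angle between them."""
    return dot(normalize(a), normalize(b))


# Term vectors for Document 1 and Document 2, in the column order of the table.
document_1 = [1, 1, 1, 1, 2, 1, 1, 1, 0]
document_2 = [1, 1, 0, 0, 1, 0, 2, 0, 1]

print(round(cosine_similarity(document_1, document_2), 2))
```

Comparing a vector with itself gives 1, and two documents with no words in common give 0, so everything in between reads as a rough measure of how alike the word distributions are.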
There are several methods for condensing the data automatically; examples include Principal Component Analysis and Singular Value Decomposition. I now regularly use those techniques on Bible texts and achieve far greater success. (Thanks to westcott, who suggested I look into it.)
I guess I will get to similarity coefficients some other time as I have gotten tired.
FYI, I have started work on a parser in order to extract lexical and syntactic structural information from the sentences. I expect stellar results from that approach because I suspect that the complexity, length and structural order of sentences have a lot to say about an author, especially when subject to the same battery of statistical analysis as is already being applied. I will talk about that as I get it going. And to think that this all started with the Pericope de Adultera.
TTFN,
Julian
