Basic statistics and the text of the NT, Part II

I have time for one more entry before I leave next week. This time we will talk about textual representation and, if my patience with writing holds out, similarity coefficients.

But first a few quick words about linear algebra or, more specifically, vectors and matrices. A vector can be thought of as a line that starts at 0,0. If you drew such a line on a piece of paper you would need two coordinates, an X and a Y coordinate, to indicate where your line ended. In other words, we need two numbers to define a line in two dimensions, like a flat piece of paper. If you were looking at a stick planted in the ground at some arbitrary angle you would not be able to describe the orientation of the stick (line) with just two numbers. The X and Y components would tell you how far to the right and how far up the stick reached, but you would need to add one more, call it Z, to indicate how far away the tip was. That is a 3 dimensional vector.

You can keep adding dimensions by simply adding yet another number, one for each dimension. After three dimensions it gets a bit hard to visualize, but that’s okay because the vectors we will be dealing with aren’t technically spatial, although we will kinda pretend that they are. A vector in two dimensions is defined like so:

and a vector in three dimensions looks like this:

and a generic representation of a vector of n dimensions is here:
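
In the usual component notation those three cases can be written as follows. This is only a sketch: the names x, y and z are assumed to match the X, Y and Z above, and a_1 … a_n are assumed to match the a1, a2, … used further down.

```latex
\[
\mathbf{v}_{2} = (x,\; y), \qquad
\mathbf{v}_{3} = (x,\; y,\; z), \qquad
\mathbf{a} = (a_1,\; a_2,\; \ldots,\; a_n)
\]
```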

That’s pretty much all for vectors for now except for one thing that is very useful, especially for us. Actually, I will wait with that until I have shown why it’s cool. :)

A matrix is a whole bunch of vectors stuck together. I can hear the mathematicians who might be reading this groan and slap their foreheads. Yeah, it is not a very good definition. A matrix is a rectangular table of numbers. Here is an example:

Or, for a far more correct generic representation with the proper labelling of the rows and columns, here is one I stole from Wikipedia:
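
A common way of writing that generic form is sketched here, assuming m rows, n columns and an entry a_{ij} sitting in row i, column j. The letters are the conventional ones and not necessarily the ones used in the borrowed figure.

```latex
\[
A =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
\]
```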

So, there.

Now, why do we care about all this? We care because of the way that we will quantify our text. Say you want to do counts of distinct words. Here is an example sentence, a bad sentence, actually.

Here  is  an  example  sentence  a  bad  actually
1     1   1   1        2         1  1    1
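
Counts like these do not have to be made by hand. Here is a minimal Python sketch, assuming that a "word" is simply any run of letters or digits and that spelling and capitalization are left untouched; the function name term_counts is made up for this illustration.

```python
import re
from collections import Counter


def term_counts(text):
    """Count how often each distinct word occurs in the text."""
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
    words = re.findall(r"\w+", text)
    # Counter remembers the order in which words were first seen.
    return Counter(words)


sentence = "Here is an example sentence, a bad sentence, actually."
counts = term_counts(sentence)

for word, n in counts.items():
    print(word, n)
```

The loop prints one word per line with its count, in the same left-to-right order as the table.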

So, now we have, for our document of one bad sentence, a series of counts or occurrences. We call this our term vector, because it is a vector where each dimension (our words) has a coordinate (the associated count). See, it is not really a vector in the spatial sense but we can treat it just as if it were. In this case, referring back to our definition above, a1 = 1, a2 = 1, a3 = 1, …, a5 = 2, …, a8 = 1. If we were dealing with two documents, we would simply add another set of numbers, giving us a matrix, called a document-term matrix. Here is another bad, bad sentence.

            Here  is  an  example  sentence  a  bad  actually  another
Document 1  1     1   1   1        2         1  1    1         0
Document 2  1     1   0   0        1         0  2    0         1
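
The same table can be built in code. This is a rough sketch, again assuming words are runs of letters or digits and that the columns are ordered by first appearance; tokenize and document_term_matrix are names invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into words, dropping punctuation."""
    return re.findall(r"\w+", text)


def document_term_matrix(documents):
    """Return (vocabulary, rows) with one row of counts per document."""
    counted = [Counter(tokenize(doc)) for doc in documents]

    # Collect every distinct word, in order of first appearance.
    vocabulary = []
    for counts in counted:
        for word in counts:
            if word not in vocabulary:
                vocabulary.append(word)

    # A Counter returns 0 for a word the document never uses.
    rows = [[counts[word] for word in vocabulary] for counts in counted]
    return vocabulary, rows


docs = [
    "Here is an example sentence, a bad sentence, actually.",
    "Here is another bad, bad sentence.",
]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of matrix is the term vector for one document, so adding a third document just adds a third row (and possibly a few new columns).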

In this case we have been counting single words (called unigrams), but there is no reason why we couldn’t count two-word combinations (called bigrams), three-word combinations (called trigrams) or, indeed, combinations of any length we want (called N-grams). When word combinations are counted we can detect the author’s tendency to use certain words together, which is called collocation, by the way (και ευθυς would get a high count in Mark, for example :) ). We could also count grammatical tags in the same combinations as words. Actually, we can count anything we want if we think it might be revealing.
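
Counting N-grams is a small step from counting words. A minimal sketch, assuming the text has already been split into a list of tokens and that N-grams are allowed to run across punctuation; the helper name ngrams is made up here.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # When n is larger than the number of tokens the range is empty.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = re.findall(r"\w+", "Here is another bad, bad sentence.")

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

for pair, count in bigram_counts.items():
    print(" ".join(pair), count)
```

The same function works on a list of grammatical tags instead of words, since it never looks at what the tokens actually are.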

Now that we have two vectors, one for each document, I can talk about that useful vector thing I skipped earlier. It is called a dot product (or sometimes an inner product or scalar product) and it is very simple: multiply the two values in each column and add the products together. However, before we do that we need to normalize our vectors, that is, make them unit vectors, which is not that big of a deal. Simply divide each element by the length of the vector. The length is calculated using the Pythagorean Theorem we discussed earlier. So the length of the Document 1 term vector goes like this: take the square root of 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 2 * 2 + 1 * 1 + 1 * 1 + 1 * 1 + 0 * 0 = 11 and we get about 3.32, which we then divide into each element, so our first element now becomes 0.302 instead of 1, same for the next and so on. Our vector now has length = 1. We do the same to the other one: its length is the square root of 8, about 2.83, so each 1 becomes 0.354 and the 2 becomes 0.707. Now multiply each column and add them up, 0.302 * 0.354 + 0.302 * 0.354 + … + 0.0 * 0.354, and the result is a single number, here about 0.64.

This number is very useful because it is the cosine of the angle between the two vectors. This means that if the vectors pointed the same way (the texts had the same relative distribution of words) the result would be 1, and if they were very far apart (no words in common at all) we would get 0.

As we can see above in our table that may not be as useful as it sounds. For example, we are counting a and an as two separate words. On top of that, there are clearly words that should not be included in a test like this, namely any word that doesn’t contribute to the meaning or complexity of the sentence. Also, some words are more important than others, although defining that gets more nebulous, so some weights should be applied; this is very important if you want believable results. Once all that is done, the Vector Space Model (the fancy term for what we just did) works reasonably well.
There are several methods for condensing the data automatically; Principal Component Analysis and Singular Value Decomposition are two examples. I now regularly use those techniques on Bible texts and have far greater success with them. (Thanks to westcott, who suggested I look into it.)
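
The normalize-then-multiply recipe for the two example documents can be sketched in a few lines of Python. This is only an illustration on the raw counts from the table, with no stop-word removal or weighting, and the function names are made up for the example.

```python
import math


def normalize(vector):
    """Scale a vector to length 1 by dividing each element by the vector's length."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a vector of all zeros")
    return [x / length for x in vector]


def dot(a, b):
    """Multiply the values column by column and add up the products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    return dot(normalize(a), normalize(b))


# Columns: Here is an example sentence a bad actually another
doc1 = [1, 1, 1, 1, 2, 1, 1, 1, 0]
doc2 = [1, 1, 0, 0, 1, 0, 2, 0, 1]

similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 2))  # prints 0.64
```

Comparing a document with itself gives 1 (up to rounding), and two documents with no words in common give 0, which is a handy sanity check before trusting the numbers on real texts.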

I guess I will get to similarity coefficients some other time as I have gotten tired. :) FYI, I have started work on a parser in order to extract lexical and syntactic structural information from the sentences. I expect stellar results from that approach because I suspect that the complexity, length and structural order of sentences have a lot to say about an author, especially when subjected to the same battery of statistical analysis as is already being applied. I will talk about that as I get it going. And to think that this all started with the Pericope de Adultera. :D

TTFN,
Julian


4 Comments on “Basic statistics and the text of the NT, Part II”

  1. Noah Says:

    Wow, this is fascinating. I have never seen my two interests of Math and Biblical Studies combined before, please keep these coming!

  2. Julian Says:

    Thanks for the comment. My problem is whether I should continue to present the basic steps so that everyone can follow along or if I should just jump to the finish, assuming that people understand singular value decomposition, semantic lexical analysis, vector space models, phrase structure grammar, etc… Writing these entries gives me time to work ahead, of course, a decided advantage since I have yet to arrive at a grand finale. :) I suspect I shall eventually have to skip much of the preliminaries since parsing and matrix decomposition (and so forth) are really not topics one can easily present to the uninitiated in a blog.

    Julian

  3. sandra407 Says:

    Hi! I was surfing and found your blog post… nice! I love your blog. :) Cheers! Sandra. R.

  4. angelina jolie Says:

    I love your site. :) Love design!!! I just came across your blog and wanted to say that I’ve really enjoyed browsing your blog posts. Sign: ndsam
