Archive for August, 2006

Basic statistics and the text of the NT, Part II

Thursday, August 10th, 2006

I have time for one more entry before I leave next week. This time we will talk about textual representation and, if my patience with writing holds out, similarity coefficients.

But first a few quick words about linear algebra or, more specifically, vectors and matrices. A vector can be thought of as a line that starts at the origin (0,0). If you drew such a line on a piece of paper you would need two coordinates, an X and a Y coordinate, to indicate where your line ended. In other words, we need two numbers to define a line in two dimensions, like a flat piece of paper. If you were looking at a stick planted in the ground at some arbitrary angle, you would not be able to describe the orientation of the stick (line) with just two numbers. The X and Y components would tell you how far to the right and how far up the stick reached, but you would need to add one more, call it Z, to indicate how far away the tip was. That is a 3-dimensional vector. You can keep adding dimensions by simply adding yet another number, one for each dimension. After three dimensions it gets a bit hard to visualize, but that’s okay because the vectors we will be dealing with aren’t technically spatial, although we will kinda pretend that they are. A vector in two dimensions is defined like so:

and a vector in three dimensions looks like this:

and a generic representation of a vector of n dimensions is here:
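
Written out as plain notation, the three cases could look like the block below. This is only a sketch: it assumes the a1, a2, …, an component naming that the term-vector discussion further down relies on, and the bold letter for the vector itself is my own choice.

```latex
% two dimensions, three dimensions, and the generic n-dimensional case
\mathbf{a} = (a_1, a_2)
\qquad
\mathbf{a} = (a_1, a_2, a_3)
\qquad
\mathbf{a} = (a_1, a_2, \ldots, a_n)
```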

That’s pretty much all for vectors for now except for one thing that is very useful, especially for us. Actually, I will wait with that until I have shown why it’s cool. :)

A matrix is a whole bunch of vectors stuck together. I can hear the mathematicians who might be reading this groan and slap their foreheads. Yeah, it is not a very good definition. A matrix is a rectangular table of numbers. Here is an example:

Or, for a far more correct generic representation with the proper labelling of the rows and columns, here is one I stole from Wikipedia:

So, there.

Now, why do we care about all this? We care because of the way that we will quantify our text. Say you want to do counts of distinct words. Here is an example sentence, a bad sentence, actually.

Here  is  an  example  sentence  a  bad  actually
1     1   1   1        2         1  1    1

So, now we have, for our document of one bad sentence, a series of counts or occurrences. We call this our term vector, because it is a vector where each dimension (one of our words) has a coordinate (the associated count). See, it is not really a vector as such, but we can treat it just as if it were. In this case, referring back to our definition above, a1 = 1, a2 = 1, a3 = 1, …, a5 = 2, …, a8 = 1. If we were dealing with two documents, we would simply add another set of numbers, giving us a matrix, called a document-term matrix. Here is another bad, bad sentence.

            Here  is  an  example  sentence  a  bad  actually  another
Document 1  1     1   1   1        2         1  1    1         0
Document 2  1     1   0   0        1         0  2    0         1
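
For anyone who wants to reproduce the table, here is a minimal Python sketch. The `tokenize` helper is my own invention for this illustration: it lowercases everything and strips basic punctuation, which a real analysis of Greek text would need to handle far more carefully.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text, split on whitespace, strip basic punctuation."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]

docs = [
    "Here is an example sentence, a bad sentence, actually.",
    "Here is another bad, bad sentence.",
]

# One Counter (word -> count) per document
counts = [Counter(tokenize(d)) for d in docs]

# Vocabulary in order of first appearance across all documents
vocab = []
for d in docs:
    for w in tokenize(d):
        if w not in vocab:
            vocab.append(w)

# Document-term matrix: one row per document, one column per word.
# A Counter returns 0 for words the document does not contain.
matrix = [[c[w] for w in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` is the term vector for one document, with the columns in the same order as the table.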

In this case we have been counting single words (called unigrams), but there is no reason why we couldn’t count two-word combinations (called bigrams), three-word combinations (called trigrams) or, indeed, combinations of any length we want (called N-grams). When word combinations are counted we can detect the author’s tendency to use certain words together (και ευθυς would get a high count in Mark, for example :) ), which is called collocation, by the way. We could also count grammatical tags in the same combinations as words. Actually, we can count anything we want if we think it might be revealing.
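
Counting N-grams takes only a few lines. The sketch below is one simple way to do it, assuming the text has already been split into a list of tokens; the `ngrams` function name is just something I made up for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "here is another bad bad sentence".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Show the most frequent bigrams first
for gram, count in bigram_counts.most_common():
    print(" ".join(gram), count)
```

The same function works on a list of grammatical tags instead of words, since it only cares about the order of the items.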

Now that we have two vectors, one for each document, I can talk about that useful vector thing I skipped earlier. It is called a dot product (or sometimes an inner product or scalar product) and it is very simple: multiply the values in each column and add the results together. However, before we do that we need to normalize our vectors, that is, make them unit vectors, which is not that big of a deal. Simply divide each element by the length of the vector. The length is calculated using the Pythagorean Theorem we discussed earlier. So the length of the Document 1 term vector goes like this: take the square root of 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 2 * 2 + 1 * 1 + 1 * 1 + 1 * 1 + 0 * 0, which is the square root of 11, and we get about 3.32, which we then divide into each element, so our first element now becomes 0.301 instead of 1, same for the next and so on. Our vector now has length = 1. We do the same to the other one, whose length is the square root of 8, or about 2.83, so each 1 becomes 0.354 and the 2 becomes 0.707. Now multiply each column and add them up, 0.301 * 0.354 + 0.301 * 0.354 + … + 0.0 * 0.354, and the result is a single number, about 0.64 in this case. This number is very useful because it is the cosine of the angle between the two vectors. This means that if the vectors were similar (the texts had the same relative distribution) the result would be 1 and if they were very far apart (no words in common) we would get 0.

As we can see above in our table, that may not be as useful as it sounds. For example, we are counting a and an as two separate words. On top of that, there are clearly words that should not be included in a test like this, namely any word that doesn’t contribute to the meaning or complexity of the sentence. Also, some words are more important than others, although defining that gets more nebulous, so some weights should be applied. This is very important if you want believable results. Once all that is done, the Vector Space Model (the fancy term for what we just did) works reasonably well.

There are several methods for condensing the data automatically; examples include Principal Component Analysis and Singular Value Decomposition. I now regularly use those techniques on Bible texts and achieve far greater success. (Thanks to westcott, who suggested I look into it.)
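
The normalize-then-dot-product recipe can be sketched in a few lines of Python. This is a bare-bones illustration using the two term vectors from the table, with no stop-word removal or weighting; the function names are my own.

```python
import math

def unit(v):
    """Scale a vector to length 1 (assumes it has at least one non-zero entry)."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def dot(u, v):
    """Multiply the values column by column and add them up."""
    return sum(a * b for a, b in zip(u, v))

doc1 = [1, 1, 1, 1, 2, 1, 1, 1, 0]
doc2 = [1, 1, 0, 0, 1, 0, 2, 0, 1]

# Dot product of the two unit vectors = cosine of the angle between them
cosine = dot(unit(doc1), unit(doc2))
print(round(cosine, 2))  # about 0.64
```

Comparing a document with itself gives 1, and two documents with no words in common give 0, which makes a handy sanity check.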

I guess I will get to similarity coefficients some other time as I have gotten tired. :) FYI, I have started work on a parser in order to extract lexical and syntactic structural information from the sentences. I expect stellar results from that approach because I suspect that the complexity, length and structural order have a lot to say about an author, especially when subjected to the same battery of statistical analysis as is already being applied. I will talk about that as I get it going. And to think that this all started with the Pericope de Adultera. :D

TTFN,
Julian

Basic statistics and the text of the NT

Monday, August 7th, 2006

Well, as I am delving deeper and deeper into the world of text analysis, things are getting more and more complicated. So I thought that I would write a series of blog entries describing the basics of statistics and how they apply to text. Eventually, we will have to look at some very advanced techniques used in Computational Linguistics after we have worked our way through the basics. There will be math involved but I will explain exactly how it works as we go along, in terms that should be clear to non-mathematicians like myself. I will start by looking at various means, variance and standard deviation calculations and what they mean. Eventually we will work our way up to latent semantic analysis, singular value decomposition, principal component analysis, N-gram taggers, partial parsers and so on, but that’s for much later.

Basic statistical functions tell us some basic facts about the text we are working with, but on their own they are insufficient to provide a good separation of textual styles. That is why we will need to rely on more advanced techniques from computational linguistics, which is a very fast-moving field.

Well, let’s get to it.

First we need to distinguish between a population and a sample. In the case of NT texts we have the luxury of dealing with both. In real life, statisticians frequently only have samples and must estimate the population from those samples. We will need to do so as well when we are trying to determine authorship for fragments of text. However, we have the entire gospel of xxx, the entire epistle of yyy and so on which allows us to establish descriptive parameters for the entire text, i.e. the population.

The first function we will look at is how to calculate the mean of a set, set here meaning a collection of numbers, either a sample or the entire text (population). Mean simply means the average, nothing more. There are several types of means, with the arithmetic mean being the one that most people understand to be implied when we simply say mean. We also have the geometric mean and the harmonic mean. The latter two are not particularly useful in our case so we will not address them here. The mean, when described by a mathematician, looks like this:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

What that says is simply this: Add up all the numbers of the set and divide by the number of entries. So if you had the numbers 10, 20, 30 you would add them all, getting 60, and then divide by 3, yielding a result of 20. So the mean of 10, 20, 30 = 20.

Math note: Multiplying by 1 over n is the same as dividing by n. The value 1/n is the reciprocal of n in math talk or sometimes the multiplicative inverse for extra nerd points.

Okay, for our next trick we will look at variance and standard deviation. First, variance. Variance is a measure of the amount of variation in a dataset. It is used in a number of calculations and is simply the square of the standard deviation, which is why they are explained here in the same section. In mathematical terms, variance looks like this

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

for a population variance and like this

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

when working with a sample. It can be shown mathematically that the second one tends to represent the population better than the first one when dealing with samples. A discussion of that would be far too lengthy (and can be found on Wikipedia under variance and standard deviation). Before I explain how they work and why we care, I will show the final formula for this, which is the standard deviation, a term which I am sure most have heard and probably understand to some extent. Here it is:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

There. Now we have some building blocks. As can be seen above, the standard deviation is simply the square root of the variance. You calculate it by first calculating your mean value as described earlier. Then you go through each datum in your set, subtract the mean (this centers the set around the mean) and square the result. You add all these numbers together (which is what is meant by the upper case sigma in the formula) and divide by the number of values in your set minus one. This is the (sample) variance. Take the square root of that and you have the standard deviation. So, what does all this really mean?
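
The steps above translate almost word for word into Python. This is a plain sketch with function names of my own choosing; the `sample` flag switches between dividing by n - 1 (sample) and by n (population).

```python
import math

def mean(xs):
    """Add up all the numbers and divide by how many there are."""
    return sum(xs) / len(xs)

def variance(xs, sample=True):
    """Average squared distance from the mean (n - 1 divisor for a sample)."""
    m = mean(xs)
    squared = sum((x - m) ** 2 for x in xs)
    divisor = len(xs) - 1 if sample else len(xs)
    return squared / divisor

def std_dev(xs, sample=True):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance(xs, sample))

data = [10, 20, 30]
print(mean(data))      # 20.0
print(variance(data))  # 100.0, i.e. (100 + 0 + 100) / 2
print(std_dev(data))   # 10.0
```

Python's standard `statistics` module already provides `mean`, `variance`, `pvariance`, `stdev` and `pstdev`, so in practice there is no need to write these by hand.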

The standard deviation tells us how spread out the data are from our mean. Remember the well-known Pythagoras formula that we all had to learn for calculating the hypotenuse? All that formula does is calculate the distance. If you have ever worked construction, or do a lot of home improvement, you probably know the 3-4-5 rule used to make sure that something is square. Say you are building a fence that needs to have 90 degree corners. Measure 3 feet along one side, measure 4 feet along the other and measure between those two points and the length should be 5 feet, because the square root of (3 * 3 + 4 * 4) is 5. So, if you add up the squares of a bunch of distance measurements (remember, when we subtract our mean from each value we essentially get the ‘distance’ of that point from the average) and then take the square root, you get the distance in however many dimensions you added up. Don’t worry about that last ‘dimension’ bit just yet. In our case, we added up our squares and then divided by the number of values in our set (minus one, but that is not so important here), getting the average squared distance for our set. The subsequent square root brings us back to an actual (non-squared) distance. So, what we end up with is the standard deviation, which is, loosely speaking, the typical distance of our data from the average (strictly, it is the square root of the average squared distance, which is not quite the same as the plain average distance).

So now we have a way to get our average and determine how spread out our data is. Now, how do we apply that to text? Well, first you quantify your text in some manner (which will be our next blog entry, namely, how to quantify text) and then you calculate the mean and standard deviation. If you are doing a word count, for example, the mean will tell you the average number of times a word is being used and the standard deviation tells you how much these word counts vary. Obviously, numbers like these are far too feeble to tell us much about the text. They are, however, important building blocks that we will need for further investigation.

For example, for 15 random 1000-word blocks each from Luke and John, we can see that Luke uses, on average (the mean of 15 different 1000-word blocks in this case), 306 distinct words, with 289 being the minimum and 330 the maximum. John, on the other hand, uses 216 distinct words on average, with min and max being 182 and 255, respectively. So clearly, Luke uses a far greater vocabulary. We can compare these figures because we are using the same sample size. However, when dealing with the entire texts we run into problems because a longer text will naturally use a larger vocabulary. The caveat here is that word usage does not increase linearly, meaning that if you double the length of the text it does not follow that the number of distinct words used doubles as well. In fact, it never does. In some cases we will need to look at a text fragment of a particular size, and if we are to check the distinct word usage against the biblical writers we will need to estimate what number of distinct words would be typical for a fragment of that size for any given author. I have pre-calculated the word usage for the biblical books at regular sampling size intervals, at a tight enough spacing that we will be able to linearly interpolate the correct word usage and its standard deviation to establish probability.
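
To make the idea concrete, here is a rough Python sketch of both steps: counting distinct words per block and linearly interpolating between pre-calculated sizes. It is not the code behind the Luke and John figures. It uses consecutive blocks rather than random ones, and both the toy token list and the numbers in `table` are invented purely for illustration.

```python
def distinct_per_block(tokens, block_size):
    """Count distinct words in each consecutive full block of block_size tokens."""
    starts = range(0, len(tokens) - block_size + 1, block_size)
    return [len(set(tokens[i:i + block_size])) for i in starts]

def interpolate(x, points):
    """Linear interpolation between (size, value) pairs sorted by size."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the tabulated range")

# Toy token list: two blocks of five tokens each
tokens = "a b a c b d a a e f".split()
print(distinct_per_block(tokens, 5))  # [3, 4]

# Made-up table of (fragment size, mean distinct words); NOT real measurements
table = [(500, 100.0), (1000, 160.0)]
print(interpolate(750, table))  # 130.0
```

The same `interpolate` call can be run on a table of standard deviations to get the spread expected at the fragment size in question.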

That’s probably enough for now. I shall continue soon with a look at text representation and quantification. The interval between blog entries should decrease now that I am back from Europe, although I do have to go for a week and wear medieval clothes at Pennsic. :)

Julian