Basic statistics and the text of the NT

As I delve deeper into the world of text analysis, things are getting more and more complicated. So I thought I would write a series of blog entries describing the basics of statistics and how they apply to text. Once we have worked our way through the basics, we will look at some very advanced techniques used in computational linguistics. There will be math involved, but I will explain exactly how it works as we go along, in terms that should be clear to non-mathematicians like myself. I will start by looking at the various means, variance and standard deviation calculations and what they mean. Eventually we will work our way up to latent semantic analysis, singular value decomposition, principal component analysis, N-gram taggers, partial parsers and so on, but that’s for much later.

Basic statistical functions tell us some basic facts about the text we are working with, but on their own they are insufficient to provide a good separation of textual styles. That is why we will eventually need to rely on more advanced techniques from computational linguistics, which is a very fast-moving field.

Well, let’s get to it.

First we need to distinguish between a population and a sample. In the case of NT texts we have the luxury of dealing with both. In real life, statisticians frequently have only samples and must estimate the properties of the population from those samples. We will need to do so as well when we are trying to determine authorship for fragments of text. However, we have the entire gospel of xxx, the entire epistle of yyy and so on, which allows us to establish descriptive parameters for the entire text, i.e. the population.

The first function we will look at is the mean of a set, set here meaning a collection of numbers, either a sample or the entire text (the population). Mean simply means the average, nothing more. There are several types of mean, and the arithmetic mean is the one most people understand to be implied when we simply say mean. We also have the geometric mean and the harmonic mean, but those two are not particularly useful in our case, so we will not address them here. The mean, when described by a mathematician, looks like this:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

What that says is simply this: Add up all the numbers of the set and divide by the number of entries. So if you had the numbers 10, 20, 30 you would add them all, getting 60, and then divide by 3, yielding a result of 20. So the mean of 10, 20, 30 = 20.
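For those who prefer code to formulas, here is a minimal sketch of the same calculation in Python. The function name `arithmetic_mean` is just my choice for illustration.

```python
def arithmetic_mean(values):
    """Add up all the numbers and divide by how many there are."""
    if not values:
        raise ValueError("cannot take the mean of an empty set")
    return sum(values) / len(values)


print(arithmetic_mean([10, 20, 30]))  # prints 20.0
```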

Math note: Multiplying by 1 over n is the same as dividing by n. The value 1/n is the reciprocal of n in math talk or sometimes the multiplicative inverse for extra nerd points.

Okay, for our next trick we will look at variance and standard deviation. First, variance. Variance is a measure of the amount of variation in a dataset. It is used in a number of calculations and is simply the square of the standard deviation, which is why they are explained here in the same section. In mathematical terms, variance looks like this

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

for a population variance and like this

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

when working with a sample. It can be shown mathematically that the second one gives a better estimate of the population variance than the first when all we have is a sample. A discussion of that would be far too lengthy here (it can be found on Wikipedia under variance and standard deviation). Before I explain how they work and why we care, I will show the final formula, which is the standard deviation, a term which I am sure most have heard and probably understand to some extent. Here it is:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

There. Now we have some building blocks. As can be seen above, the standard deviation is simply the square root of the variance. You calculate it by first calculating your mean value as described earlier. Then you go through each datum in your set, subtract the mean (this centers the set around zero) and square the result. You add all these squares together (which is what is meant by the upper case sigma in the formula) and divide by the number of values in your set, subtracting one from that number before dividing. This is the variance. Take the square root of that and you have the standard deviation. So, what does all this really mean?
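The recipe just described can be sketched in Python as follows. The function names and the `sample` switch are my own, chosen for illustration: `sample=True` divides by n - 1 (the sample version) and `sample=False` divides by n (the population version). The last two lines cross-check the result against Python's built-in `statistics` module.

```python
import math
import statistics


def variance(values, sample=True):
    """Average squared distance from the mean.

    sample=True divides by n - 1 (sample variance);
    sample=False divides by n (population variance).
    """
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough values")
    mean = sum(values) / n
    # Subtract the mean from each value and square the result.
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_distances) / divisor


def standard_deviation(values, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [10, 20, 30]
print(variance(data))                # prints 100.0
print(variance(data, sample=False))  # about 66.67
print(standard_deviation(data))      # prints 10.0

# Cross-check against the standard library.
assert math.isclose(variance(data), statistics.variance(data))
assert math.isclose(variance(data, sample=False), statistics.pvariance(data))
```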

The standard deviation tells us how spread out the data are from our mean. Remember the well-known Pythagoras formula that we all had to learn for calculating the hypotenuse? All that formula does is calculate a distance. If you have ever worked construction, or do a lot of home improvement, you probably know the 3-4-5 rule used to make sure that something is square. Say you are building a fence that needs to have 90 degree corners. Measure 3 feet along one side, measure 4 feet along the other, then measure between those two points, and the length should be 5 feet, because the square root of (3 * 3 + 4 * 4) is 5. So, if you add up the squares of a bunch of distance measurements (remember, when we subtract our mean from each value we essentially get the ‘distance’ of that point from the average) and then take the square root, you get the distance in however many dimensions you added up. Don’t worry about that last ‘dimension’ bit just yet. In our case, we added up our squares and then divided by the number of values in our set (minus one, but that is not so important here), getting the average squared distance for our set. The subsequent square root gives us the actual (non-squared) distance. So what we end up with is the standard deviation, which is roughly another way of saying the typical distance of our data from the average. (Strictly speaking it is the square root of the average squared distance, which is not quite the same as the plain average of the distances.)
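To make the connection with Pythagoras concrete, here is a small Python sketch. The helper `euclidean_length` is a name I made up for illustration. It first checks the 3-4-5 rule, then treats the deviations from the mean as one long list of distances and shows that scaling their combined length by the square root of n - 1 gives the same standard deviation as Python's `statistics.stdev`.

```python
import math
import statistics


def euclidean_length(distances):
    """Pythagoras in any number of dimensions: root of the sum of squares."""
    return math.sqrt(sum(d * d for d in distances))


print(euclidean_length([3, 4]))  # prints 5.0, the 3-4-5 rule

data = [10, 20, 30]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]  # 'distance' of each value from the mean

# Dividing the squared length by n - 1 before taking the root
# is the same as dividing the length by sqrt(n - 1).
sd = euclidean_length(deviations) / math.sqrt(len(data) - 1)
print(math.isclose(sd, statistics.stdev(data)))  # prints True
```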

So now we have a way to get our average and determine how spread out our data is. Now, how do we apply that to text? Well, first you quantify your text in some manner (which will be our next blog entry, namely, how to quantify text) and then you calculate the mean and standard deviation. If you are doing a word count, for example, the mean will tell you the average number of times a word is being used and the standard deviation tells you how much these word counts vary. Obviously, numbers like these are far too feeble to tell us much about the text. They are, however, important building blocks that we will need for further investigation.
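As a rough sketch of the word count example, assuming the crudest possible quantification (lower-casing and splitting on whitespace, which is my assumption here and not necessarily how the real analysis tokenizes Greek text), the calculation might look like this in Python. The function name and the toy sentence are made up for illustration.

```python
from collections import Counter
import statistics


def word_count_stats(text):
    """Return (mean, standard deviation) of how often each distinct word occurs."""
    words = text.lower().split()      # naive tokenization, an assumption
    counts = Counter(words)           # word -> number of occurrences
    per_word = list(counts.values())
    if len(per_word) < 2:
        raise ValueError("need at least two distinct words")
    return statistics.mean(per_word), statistics.stdev(per_word)


toy_text = "the word and the light and the word"  # made-up sample, not NT text
mean, sd = word_count_stats(toy_text)
print(mean, sd)
```

Here `statistics.stdev` uses the n - 1 form described earlier. For a complete text treated as a population, `statistics.pstdev` would be the matching choice.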

For example, for 15 random 1000 word blocks from Luke and John each, we can see that Luke uses, on average (the mean of 15 different 1000 word blocks in this case), 306 different words, with 289 being the minimum and 330 the maximum. John, on the other hand, uses 216 different words on average, with min and max being 182 and 255, respectively. So clearly, Luke uses a far greater vocabulary. We can compare these figures because we are using the same sample size. When dealing with the entire texts, however, we run into problems because a longer text will naturally use a larger vocabulary. The caveat here is that the word usage does not increase linearly, meaning that if you double the length of the text it does not follow that the number of distinct words used doubles as well. In fact, it never does. In some cases we will need to look at a text fragment of a particular size, and if we are to check the distinct word usage against the biblical writers we will need to estimate what number of distinct words would be typical for a fragment of that size for any given author. I have pre-calculated the word usage for the biblical books at regular sampling size intervals, at a tight enough spacing that we will be able to linearly interpolate the correct word usage and its standard deviation to establish probability.
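The two steps described here can be sketched in Python. This is only an illustration under my own assumptions, not the actual code behind the numbers above: I assume the blocks are contiguous runs of words starting at random positions, and the function names, the `seed` parameter and the values in `toy_table` are all invented for the example (they are not measurements from any biblical book).

```python
import random


def distinct_words_in_random_blocks(words, block_size=1000, n_blocks=15, seed=0):
    """Count distinct words in randomly placed contiguous blocks of a text."""
    if len(words) < block_size:
        raise ValueError("text is shorter than one block")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = []
    for _ in range(n_blocks):
        start = rng.randint(0, len(words) - block_size)
        block = words[start:start + block_size]
        counts.append(len(set(block)))  # a set keeps each word once
    return counts


def interpolate(size, table):
    """Linearly interpolate a value for `size` from (size, value) pairs.

    The sizes in `table` are assumed to be distinct.
    """
    points = sorted(table)
    if not points[0][0] <= size <= points[-1][0]:
        raise ValueError("size is outside the pre-calculated range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= size <= x1:
            return y0 + (y1 - y0) * (size - x0) / (x1 - x0)
    return points[0][1]  # table with a single point


# Toy text of 2000 words with only four distinct words.
toy_words = ["a", "b", "c", "d"] * 500
print(distinct_words_in_random_blocks(toy_words))  # fifteen 4s

# Invented (fragment size, distinct words) pairs, purely for illustration.
toy_table = [(500, 100.0), (1000, 160.0)]
print(interpolate(750, toy_table))  # prints 130.0
```

The same `interpolate` helper could be applied to a second table of standard deviations, which is what we would need in order to judge how unusual a given fragment is for a given author.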

That’s probably enough for now. I shall continue soon with a look at text representation and quantification. The interval between blog entries should decrease now that I am back from Europe, although I do have to go for a week and wear medieval clothes at Pennsic. :)

Julian
