Assignment 1: Estimation and Hypothesis Testing

In this assignment we will apply basic techniques of estimation and hypothesis testing to textual data. The data set we will use is the Ling-Spam corpus, a collection of email messages sent to the moderated Linguist List together with spam email from the same period. The corpus is described in Androutsopoulos et al. (2000) and can be downloaded from the ACL Wiki. It will also be used in Assignment 2.

The email data is available in four different versions, with and without stop word filtering and with and without stemming, but for this assignment we are only going to use the "bare" version. The corpus is divided into ten different subsets, which may be used for ten-fold cross-validation, but for now we are only going to use Part 1.
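
As a starting point, the messages in Part 1 could be read into memory roughly as follows. This is only a sketch under stated assumptions: the function name load_part is made up for illustration, and the code assumes that each message is a separate .txt file, that spam files can be recognised by a filename beginning with "spmsg", and that the "bare" files are already tokenized so that splitting on whitespace is enough. Check these assumptions against your downloaded copy of the corpus.

```python
from pathlib import Path


def load_part(part_dir):
    """Return a list of (is_spam, tokens) pairs, one per message file."""
    messages = []
    for path in sorted(Path(part_dir).glob("*.txt")):
        # Assumption: spam files have a filename starting with "spmsg".
        is_spam = path.name.startswith("spmsg")
        # Assumption: the text is pre-tokenized, so whitespace splitting suffices.
        tokens = path.read_text(encoding="latin-1").split()
        messages.append((is_spam, tokens))
    return messages
```

Whether header lines such as the subject line should count as part of the message is a decision you need to make and state in your report.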

Estimation

Using all the email messages in Part 1 of the Ling-Spam corpus, give a maximum likelihood estimate of the probability of the following events:
  1. An email message contains the word "noun".
  2. An email message contains the word "verb".
  3. An email message contains the word "noun" or the word "verb".
  4. An email message contains the word "noun" and the word "verb".
Given your estimates, are the following claims true or false? Justify your answers.
  5. Events 1 and 2 above are incompatible (disjoint).
  6. Events 1 and 2 above are independent.
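
The estimation of the four events and the two checks above can be sketched as follows. The toy messages are invented for illustration and are not taken from the corpus; the helper name mle_probability is likewise hypothetical. The maximum likelihood estimate of an event's probability is simply its relative frequency among the messages.

```python
def mle_probability(messages, event):
    """Relative frequency of messages for which event(tokens) is true."""
    return sum(1 for tokens in messages if event(set(tokens))) / len(messages)


# Toy data (hypothetical): each message is a list of tokens.
messages = [
    ["the", "noun", "phrase"],
    ["a", "verb", "and", "a", "noun"],
    ["call", "for", "papers"],
    ["free", "money"],
]

p_noun = mle_probability(messages, lambda t: "noun" in t)
p_verb = mle_probability(messages, lambda t: "verb" in t)
p_or = mle_probability(messages, lambda t: "noun" in t or "verb" in t)
p_and = mle_probability(messages, lambda t: "noun" in t and "verb" in t)

# Disjoint events never co-occur, so the joint estimate would be zero.
disjoint = p_and == 0
# Independent events satisfy P(A and B) = P(A) * P(B).
independent = abs(p_and - p_noun * p_verb) < 1e-12
```

With estimates from real data, exact equality in the independence check is unlikely, so your answer should discuss how close the two quantities are rather than rely on a strict comparison.
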
Estimate the expected average token length (number of characters) in an email message in the two subcorpora of Part 1:
  7. Messages to the Linguist List (non-spam).
  8. Spam messages.
Note that each observation here is the average token length in a complete message and that the expectations we are interested in are averages over these messages. Give 95% confidence intervals for your estimates, assuming that the (population) variance is known and equal to the sample variance.
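
A minimal sketch of this interval estimate, using only the Python standard library, might look as follows. The function names are chosen for illustration. Each observation is the mean token length of one message, and the interval uses the normal quantile together with the sample standard deviation, as the assumption of known variance permits.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_token_length(tokens):
    """Average number of characters per token in one message (tokens must be non-empty)."""
    return sum(len(t) for t in tokens) / len(tokens)


def z_interval(observations, confidence=0.95):
    """Confidence interval for the mean, treating the sample variance as known."""
    n = len(observations)
    m = mean(observations)
    # stdev() is the sample standard deviation (divides by n - 1).
    standard_error = stdev(observations) / sqrt(n)
    # Two-sided quantile, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error
```

For example, z_interval([mean_token_length(tokens) for tokens in non_spam]) would give the interval for one subcorpus, where non_spam is assumed to be a list of token lists.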

Hypothesis Testing

Use statistical tests to assess the following hypotheses, based on the data in Part 1 of the Ling-Spam corpus:
  9. The average token length is greater in non-spam messages than in spam messages.
  10. The word "job" occurs more often in spam messages than in non-spam messages.
You may again assume that the (population) variance is known and equal to the sample variance, which means that a Z test can be used in both cases.
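
One way to set up such a test is a one-sided two-sample Z test, sketched below with a hypothetical function name. It is not the only valid formulation. For the hypothesis about a word, one possible reading is to let each observation be 1 if the message contains the word and 0 otherwise, so that the sample means are proportions; the same function then applies.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def z_test_greater(xs, ys):
    """One-sided test of H1: mean(xs) > mean(ys). Returns (z, p_value)."""
    # Standard error of the difference between two independent sample means,
    # with the sample variances standing in for the population variances.
    standard_error = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    z = (mean(xs) - mean(ys)) / standard_error
    # Probability of a value at least this large under the null hypothesis.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value
```

The null hypothesis is rejected at the 5% level if the returned p-value is below 0.05. Remember to state the null and alternative hypotheses explicitly in your report.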

For Distinction (VG)

Redo the interval estimation (7 and 8) and hypothesis testing (9 and 10) without assuming that the variance is known. Hint: You need to consider the t distribution instead of the normal distribution.
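
For orientation, the standard t-based counterparts can be written as below. The notation is introduced here for illustration: n is the number of messages, s the sample standard deviation, and subscripts 1 and 2 denote the two subcorpora. The two-sample statistic is given in Welch's form, which does not assume equal variances; this is one common choice, and a pooled-variance test is an alternative if you can justify that assumption.

```latex
% One-sample 95% confidence interval for the mean
\bar{x} \pm t_{n-1,\,0.975}\,\frac{s}{\sqrt{n}}

% Two-sample test statistic with approximate degrees of freedom (Welch)
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}},
\qquad
\nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}
                 {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```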

Requirements

Write a short report giving answers to the 10 questions and explaining the procedure used to derive the answers. The report should be submitted by email no later than February 10 to Evelina Andersson at evelina.andersson@lingfil.uu.se. Questions regarding the assignment should also be sent to Evelina.

References

Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G. and Spyropoulos, C.D. (2000) An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17.