Assignment 1: Estimation and Hypothesis Testing
In this assignment we will apply basic techniques of estimation and hypothesis testing to textual data. The data set we will
use is the Ling-Spam corpus, a collection of email messages sent to the moderated Linguist List together with spam email from
the same period. The corpus is described in Androutsopoulos et al. (2000) and can be downloaded from the
ACL Wiki. It will be used also in
The email data is available in four different versions, with and without stop word filtering and with and without stemming,
but for this assignment we are only going to use the "bare" version. The corpus is divided into ten different subsets, which
may be used for ten-fold cross-validation, but for now we are only going to use Part 1.
Using all the email messages in Part 1 of the Ling-Spam corpus, give a maximum likelihood estimate of the probability of each
of the following events:
1. An email message contains the word noun.
2. An email message contains the word verb.
3. An email message contains the word noun or the word verb.
4. An email message contains the word noun and the word verb.
Given your estimates, are the following claims true or false? Motivate your answers.
5. Events 1 and 2 above are incompatible (disjoint).
6. Events 1 and 2 above are independent.
Estimate the expected average token length (number of characters) in an email message in the two subcorpora of Part 1:
7. Messages to the Linguist List (non-spam).
8. Spam messages.
Note that each observation here is the average token length in a complete message and that the expectations we are interested in
are averages over these messages. Give 95% confidence intervals for your estimates, assuming that the (population) variance is
known and equal to the sample variance.
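As an illustration of the interval estimation for the average token length, here is a minimal Python sketch. It assumes that tokens are separated by whitespace and that each message is available as a string; the function names and the toy messages are hypothetical and are not part of the corpus or the assignment.

```python
import math
from statistics import NormalDist, mean, variance


def avg_token_length(message):
    """Average number of characters per whitespace-separated token (assumed tokenization)."""
    tokens = message.split()
    return sum(len(t) for t in tokens) / len(tokens)


def z_confidence_interval(observations, level=0.95):
    """Normal-based interval for the mean, treating the sample variance as the known variance."""
    n = len(observations)
    m = mean(observations)
    se = math.sqrt(variance(observations) / n)   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for a 95% interval
    return m - z * se, m + z * se


# Toy messages for illustration only; messages without tokens are skipped.
messages = ["call for papers on syntax", "free money now", "the workshop deadline is extended"]
observations = [avg_token_length(m) for m in messages if m.split()]
low, high = z_confidence_interval(observations)
print(low, high)
```

Each observation is one message's average token length, so the interval is computed over messages and not over individual tokens.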
Use statistical tests to assess the following hypotheses, based on the data in Part 1 of the Ling-Spam corpus:
9. The average token length is greater in non-spam messages than in spam messages.
10. The word job occurs more often in spam messages than in non-spam messages.
You may again assume that the (population) variance is known and equal to the sample variance, which means that a Z test
can be used in both cases.
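A one-sided, two-sample Z test of the kind that fits these hypotheses could be sketched in Python as follows. The sketch assumes one numeric observation per message (for example its average token length, or 1/0 for whether it contains a given word); this choice of observation, the function name and the toy numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_z_test(xs, ys):
    """Test H1: mean(xs) > mean(ys), using sample variances as if they were known.

    Returns the Z statistic and the one-sided p-value.
    Raises ZeroDivisionError if both samples have zero variance.
    """
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    z = (mean(xs) - mean(ys)) / se
    p_value = 1.0 - NormalDist().cdf(z)   # upper-tail probability
    return z, p_value


# Toy per-message observations, not taken from the corpus.
group_a = [5.0, 5.5, 6.0, 5.2]
group_b = [4.0, 4.5, 4.2, 4.8]
z, p = one_sided_z_test(group_a, group_b)
print(z, p)
```

The hypothesis that is claimed to have the greater mean goes in the first argument; the null hypothesis is rejected at the 5% level when the p-value is below 0.05.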
For Distinction (VG)
Redo the interval estimation (7 and 8) and hypothesis testing (9 and 10) without assuming that the variance is known.
Hint: You need to consider the t distribution instead of the normal distribution.
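A minimal Python sketch of the t-based versions is given below. It uses only the standard library, so the critical value t_crit must be supplied by the caller (for example from a t table). The use of Welch's unequal-variance form for the two-sample statistic is an assumption; the assignment does not say which variant is intended. All names are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def t_confidence_interval(observations, t_crit):
    """Interval for the mean with unknown variance.

    t_crit is the critical value of the t distribution with len(observations) - 1
    degrees of freedom (e.g. 2.262 for a 95% interval with 10 observations).
    """
    n = len(observations)
    m = mean(observations)
    half_width = t_crit * stdev(observations) / math.sqrt(n)
    return m - half_width, m + half_width


def welch_t(xs, ys):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    vx = variance(xs) / len(xs)
    vy = variance(ys) / len(ys)
    t = (mean(xs) - mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df
```

The statistic returned by welch_t is compared with the critical value of the t distribution for the returned degrees of freedom, in the same one-sided manner as in the Z test.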
Write a short report giving answers to the 10 questions and explaining the procedure used to derive the answers.
The report should be submitted by email no later than February 10 to Evelina Andersson at
firstname.lastname@example.org. Questions regarding the
assignment should also be sent to Evelina.
Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G. and Spyropoulos, C.D. (2000)
An Evaluation of Naive Bayesian Anti-Spam Filtering.
In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine
Learning (ECML 2000), Barcelona, Spain, pp. 9-17.