MINOR 1
12th February 2002
45 minutes
* Solve the following four problems. Each problem is worth 4 marks.
- Devise a scheme to identify the language in which a given document is
written.
-
| x | 0 | 0 | 1 | 1 |
| y | 0 | 1 | 0 | 1 |
| p(bank=x, credit=y) | 0.9998977 | 0.00006 | 0.000042 | 0.0000003 |
- Are bank and credit as defined in the table independently distributed?
- The above probabilities are obtained using MLE on a corpus with
20,000,000 tokens. Is "bank credit" a collocation?
- Does the first result imply the second or is there a contradiction? Explain.
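As a worked check on the table above, the sketch below compares the joint probability p(bank=1, credit=1) with the product of the marginals and converts both to counts. It assumes bank=1 means the first word of a bigram is "bank" and credit=1 means the second word is "credit". The t statistic uses the usual rare-event approximation of the variance and is only one of several possible collocation tests.

```python
import math

# Joint distribution p(bank=x, credit=y), copied from the table.
joint = {
    (0, 0): 0.9998977,
    (0, 1): 0.00006,
    (1, 0): 0.000042,
    (1, 1): 0.0000003,
}
N = 20_000_000  # corpus size in tokens, as stated in the problem

# Marginals obtained by summing out the other variable.
p_bank = joint[(1, 0)] + joint[(1, 1)]
p_credit = joint[(0, 1)] + joint[(1, 1)]

observed = joint[(1, 1)]
expected = p_bank * p_credit  # joint probability if the two were independent

# Approximate t statistic; variance p(1 - p) is taken as roughly p
# because the bigram is rare (an assumption of this sketch).
t_stat = (observed - expected) / math.sqrt(observed / N)

print(f"p(bank=1) = {p_bank:.7f}, p(credit=1) = {p_credit:.7f}")
print(f"observed joint = {observed:.3e}, product of marginals = {expected:.3e}")
print(f"observed count = {observed * N:.1f}, expected count = {expected * N:.3f}")
print(f"t = {t_stat:.3f}")
```

How the statistic is read depends on the significance level chosen for the test.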
- You are required to build a digit recognizer (0-9).
- Estimate a language model for this
task. Does using this language model help in digit recognition? Why?
- What is the perplexity of the digit recognition (0-9) task?
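To make the perplexity question concrete, here is a minimal sketch that estimates a unigram model by MLE and computes perplexity as 2 raised to the entropy in bits. The sample `training_digits` is made up for illustration, and the uniform case assumes every digit is equally likely and independent of context.

```python
import math
from collections import Counter

def unigram_mle(tokens):
    """Estimate unigram probabilities by relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def perplexity(dist):
    """Perplexity of a distribution: 2 ** entropy, with entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return 2 ** entropy

# Uniform model: each of the ten digits has probability 1/10.
uniform = {str(d): 0.1 for d in range(10)}
print(perplexity(uniform))

# Hypothetical skewed sample in which "0" dominates.
training_digits = list("0012300045000678900")
skewed = unigram_mle(training_digits)
print(perplexity(skewed))
```

A uniform model over ten equiprobable digits has perplexity 10; any skew in the digit distribution lowers it.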
- Let c(cantankerous person)=0 and c(cantankerous autodidact)=0 be the counts
of the two bigrams in our training corpus. Since the count is zero
for both bigrams, Laplace's Law, Lidstone's Law, and Good-Turing will all assign the
same probability, i.e., p(cantankerous person) = p(cantankerous autodidact). However, intuitively
we feel p(cantankerous person) > p(cantankerous autodidact). Explain how simple linear
interpolation takes care of this problem.
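The idea behind simple linear interpolation can be sketched as mixing the bigram MLE with the unigram MLE of the second word. All counts and the weight `lam` below are invented for illustration; in practice the weights would be tuned on held-out data.

```python
def interpolated_bigram(w1, w2, bigram_counts, unigram_counts, total_tokens, lam=0.7):
    """p_li(w2 | w1) = lam * p_mle(w2 | w1) + (1 - lam) * p_mle(w2)."""
    c_w1 = unigram_counts.get(w1, 0)
    p_bigram = bigram_counts.get((w1, w2), 0) / c_w1 if c_w1 else 0.0
    p_unigram = unigram_counts.get(w2, 0) / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

# Hypothetical counts: both bigrams are unseen, but "person" is far more
# frequent than "autodidact" as a single word.
unigram_counts = {"cantankerous": 20, "person": 5000, "autodidact": 3}
bigram_counts = {}
total_tokens = 1_000_000

p_person = interpolated_bigram(
    "cantankerous", "person", bigram_counts, unigram_counts, total_tokens)
p_autodidact = interpolated_bigram(
    "cantankerous", "autodidact", bigram_counts, unigram_counts, total_tokens)
print(p_person, p_autodidact)
```

With both bigram counts at zero, the estimates differ only through the unigram term, so the more frequent second word receives the higher probability.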