HOMEWORK ASSIGNMENT #2
10th January 2002
Due Thursday, 17th January 2002
* Check the readings for this week.
* Mail your code to me.
* Written assignment to be submitted in class.
* Solve the following problems:
- Use the small text corpora under the resources section. For this assignment you can also
use any of the NLP tools listed under the resources section. In your solution, clearly
mention the tools you used and how you used them. State all the steps you took in solving
this problem.
- Pick any two or more sets, for example, national news and sports news.
Compute word frequencies for the sets.
- Based on the word frequencies, can you formulate rules for distinguishing the different sets?
- You may need to filter out stop words (a, an, the, at, etc.), names, etc.
- Are single-word (unigram) frequencies enough to differentiate?
- You may try using two-word (bigram) or three-word (trigram) frequencies.
- Do the article headings help you in the classification? Why?
- Is it helpful to use stemming? Why?
- What other improvements can you think of? Do they work better?
- Discuss all your attempts at differentiating the sets clearly - those suggested here and those you thought of.
Implementing your ideas, even those that ultimately don't work very well, will earn extra
credit.
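As a possible starting point for the word-frequency problem above, here is a minimal Python sketch. It is only one way to begin, not a required approach. The stop-word list, the function names `tokenize` and `ngram_counts`, and the two short sample texts are all assumptions made for illustration; substitute the actual corpora and any tool from the resources section.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "at", "of", "in", "on", "and", "to", "is"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(text, n=1, stop_words=STOP_WORDS):
    """Count n-word sequences after dropping stop words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    # zip over n shifted copies of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


# Hypothetical stand-ins for two article sets; replace with the real corpora.
national = "The senate passed the budget. The senate debated the tax bill."
sports = "The team won the game. The coach praised the team."

national_uni = ngram_counts(national)
sports_uni = ngram_counts(sports)
sports_bi = ngram_counts(sports, n=2)

print(national_uni.most_common(3))
print(sports_uni.most_common(3))
print(sports_bi.most_common(3))
```

Comparing the most frequent entries of each counter is one simple way to look for words that separate the sets. Note that this sketch forms n-grams after stop-word removal, so a bigram may join words that were not adjacent in the original text; whether that helps is one of the things worth discussing in your write-up.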
- Let X take the values (a, b, c, d) with probabilities (1/2, 1/4, 1/8, 1/8). Find the entropy of X.
- What is the minimum value of H(p_1, ..., p_n) = H(p) as
p ranges over the set of n-dimensional probability vectors? Find all p's that
achieve this minimum.
- Give examples of joint random variables X, Y and Z such that (a) I(X; Y | Z) < I(X; Y),
(b) I(X; Y | Z) > I(X; Y).
- Exercise 2.9 (page 78) in the text.
- Exercise 2.10 (page 78) in the text.
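For checking hand calculations in the entropy problems above, a small helper along the following lines may be useful. It is a sketch that assumes entropy is measured in bits (base-2 logarithm) and uses the usual convention that 0 log 0 = 0; the function name `entropy` and the two example distributions are illustrative and are not taken from the assignment.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: H(p) = -sum(p_i * log2(p_i)), with 0 * log2(0) = 0."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative distributions (not the ones in the problems above).
fair_coin = entropy([0.5, 0.5])
uniform4 = entropy([0.25, 0.25, 0.25, 0.25])

print(fair_coin)  # 1.0
print(uniform4)   # 2.0
```

The written solution should still show the derivation; the helper is only meant to confirm the arithmetic.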