HOMEWORK ASSIGNMENT #1
4th January 2002
Due Thursday, 10th January 2002
* READ CHAPTERS 1 and 2
in Foundations of Statistical Natural Language Processing
by Manning and Schütze.
* Mail your code to me.
* Your code must contain clear and comprehensive documentation.
* State which site you downloaded your pages from.
* Written assignment to be submitted in class.
* Solve the following problems:
- Pick a web site of your choice, download it, and write code to
perform the following operations:
- Remove all HTML tags.
- Obtain a word frequency graph of the twenty most frequent words on
the page.
- Obtain a frequency graph of the twenty most frequent
bigram collocations on the page.
- Let X Y be the most frequently occurring collocation in the text. Obtain
P(Y|X) (the probability that Y occurs immediately after X, given that X has
occurred).
- What is the appropriate distribution for modeling the word frequencies
and the collocation frequencies?
- Describe a scheme for classifying news articles into one of the following
categories:
- Politics
- Sports
- Business
- Entertainment
- Exercise 2.3 in the text.
- Exercise 2.4 in the text.
- Exercise 2.8 in the text.
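The programming operations listed above (stripping tags, counting words and bigrams, and estimating P(Y|X)) can be sketched roughly as follows. This is only a minimal starting point in Python using the standard library, not a reference solution. The function names (strip_tags, tokenize, top_counts, bigrams, conditional_probability, print_bar_chart) are made up for illustration, the tokenizer assumes a word is a lowercased run of letters or apostrophes, and the "graph" is a plain text bar chart rather than a plotted figure.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    # Assumption: a word is a run of letters or apostrophes, lowercased.
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Pair each token with the token that immediately follows it."""
    return list(zip(tokens, tokens[1:]))


def top_counts(items, n=20):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(items).most_common(n)


def conditional_probability(tokens, x, y):
    """Estimate P(y | x) as count(x y) / count(x followed by any word)."""
    pair_counts = Counter(bigrams(tokens))
    x_as_first = sum(c for (first, _), c in pair_counts.items() if first == x)
    return pair_counts[(x, y)] / x_as_first if x_as_first else 0.0


def print_bar_chart(counts):
    """Print a crude text bar chart, one row per (item, count) pair."""
    for item, count in counts:
        label = " ".join(item) if isinstance(item, tuple) else item
        print(f"{label:<20} {'#' * count} ({count})")
```

A possible way to use it, assuming the downloaded page has been read into a string named html: call tokens = tokenize(strip_tags(html)), then print_bar_chart(top_counts(tokens)) for words and print_bar_chart(top_counts(bigrams(tokens))) for bigrams. Raw bigram counts are only a rough stand-in for collocations, so you may want to filter out pairs made up of function words.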