HOMEWORK ASSIGNMENT #6

21st February 2002

Due Tuesday, 5th March 2002

* Check the readings for this week.

* Mail your code to me.

* Your code must contain clear and comprehensive documentation.

* Written assignment to be submitted in class.

* Solve the following problems:

  1. Consider two corpora C1 and C2 representing two domains of knowledge. C1 and C2 contain N1 and N2 documents respectively, and they span the same vocabulary. Let ei1 and ti1 be the normalized entropy and frequency of the ith word in C1. Let the corresponding values in C2 be ei2 and ti2. Suppose the two corpora are combined to yield a large corpus C. Derive an expression for ei in C in terms of the above-mentioned parameters.

    Suppose you are given the scaled and normalized matrices W1 and W2 corresponding to C1 and C2 respectively. Using the above parameter values and the derived ei, what is the expression for the scaled and normalized matrix W of the joint corpus C?

    Read the e's as epsilon.
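
    For reference, the quantities named in problem 1 can be written out as below. This is a sketch under an assumption: the handout does not define them, so the block uses the common LSA weighting in which c_ij is the count of word i in document j, n_j is the total number of words in document j, and N is the number of documents in the corpus (with 0 log 0 taken as 0). Check these against the course readings before relying on them.

```latex
t_i = \sum_{j=1}^{N} c_{ij}, \qquad
\varepsilon_i = -\frac{1}{\log N} \sum_{j=1}^{N} \frac{c_{ij}}{t_i} \log \frac{c_{ij}}{t_i}, \qquad
w_{ij} = (1 - \varepsilon_i)\, \frac{c_{ij}}{n_j}
```

    Under this assumed definition, a word spread evenly over all N documents has entropy 1 and weight 0, while a word confined to a single document has entropy 0.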


  2. For the text data consisting of the 9 sentences on page 10 of the "Introduction to LSA" paper by Landauer et al. (on the other reading resources page), write a program to

    (i) Calculate the word-document co-occurrence matrix and the normalized entropy of each word in the vocabulary (consider all words in the corpus of sentences).

    (ii) Then scale and normalize the matrix to obtain the matrix W.

    (iii) Perform the SVD of W for each number of dimensions from 1 to rank(W). For each dimension, calculate COR1 = the average inter-document correlation (cosine measure) among the first five documents (HCI group), and similarly COR2 for the last four documents (Graph group). Also calculate the average HCI-Graph inter-document correlation (CROSSCOR). Define the measure COR = (COR1 + COR2 - 2*CROSSCOR + 4)/8. Plot COR vs. SVD dimension. Comment on your results.

    NOTE: Use the MATLAB 'svds' routine to perform the SVD of W. To find document similarity, use W'W.
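
    The bookkeeping in steps (i) to (iii) can be sketched as follows. This is an illustration, not the required MATLAB solution: it is written in Python with the standard library only, it leaves out the truncated SVD itself, and all function names are hypothetical. It also assumes whitespace tokenization, the entropy definition eps_i = -(1/log N) * sum_j p_ij log p_ij with p_ij = c_ij / t_i, and the weighting w_ij = (1 - eps_i) * c_ij / n_j, none of which are spelled out in this handout. It needs at least two non-empty documents.

```python
import math


def count_matrix(docs):
    """Return (vocab, C), where C[i][j] is the count of word i in document j."""
    tokenized = [d.lower().split() for d in docs]  # assumed: whitespace tokens
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    C = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(tokenized):
        for w in doc:
            C[index[w]][j] += 1
    return vocab, C


def normalized_entropy(row):
    """Assumed definition: entropy of c_ij / t_i over documents, divided by log N."""
    t = sum(row)
    h = -sum((c / t) * math.log(c / t) for c in row if c > 0)
    return h / math.log(len(row))


def scale_and_normalize(C):
    """Assumed weighting: W[i][j] = (1 - eps_i) * c_ij / n_j."""
    n_docs = len(C[0])
    doc_len = [sum(row[j] for row in C) for j in range(n_docs)]
    eps = [normalized_entropy(row) for row in C]
    return [[(1 - e) * c / doc_len[j] for j, c in enumerate(row)]
            for e, row in zip(eps, C)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cor_measure(doc_vectors, n_first=5):
    """COR = (COR1 + COR2 - 2*CROSSCOR + 4) / 8 for two groups of documents."""
    first = range(n_first)
    second = range(n_first, len(doc_vectors))

    def mean_sim(pairs):
        sims = [cosine(doc_vectors[a], doc_vectors[b]) for a, b in pairs]
        return sum(sims) / len(sims)

    cor1 = mean_sim([(a, b) for a in first for b in first if a < b])
    cor2 = mean_sim([(a, b) for a in second for b in second if a < b])
    cross = mean_sim([(a, b) for a in first for b in second])
    return (cor1 + cor2 - 2 * cross + 4) / 8
```

    Here doc_vectors stands for one vector per document in whatever reduced space the SVD step produces, and the default n_first=5 matches the five HCI documents followed by the four Graph documents. Since each average cosine lies between -1 and 1, COR lies between 0 and 1, reaching 1 only when both groups are internally identical and the two groups point in opposite directions.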

  3. Sentence pairs are given below. Please read them carefully and rate the similarity of each pair on a five-point scale from 1 to 5: 1 if you find the sentences completely dissimilar and 5 if they match perfectly. Your ratings will be used in a cognitive modeling experiment, so do not discuss your scores with your friends. You may use syntactic aspects, semantic aspects, or both to define "match".

    1a. ROM is a type of memory which cannot be used for primary storage.

    1b. RAM is a type of primary storage that can be used to store data temporarily.


    2a. The CPU processes all of the data the user inputs and all the data the computer puts out.

    2b. The CPU Controls everything - arithmetic and logic.


    3a. The motherboard does not mean the entire computer.

    3b. The motherboard does mean the entire computer.


    4a. There are no input or output devices.

    4b. The input and output devices are present.


    5a. The CPU and memory can't provide input or output.

    5b. The CPU can provide input and memory can't provide output.


    6a. RISC processors have reduced instruction sets.

    6b. RISC sets instruction for reduced processors.


    7a. The hard drive is used to store the operating system.

    7b. The floppy disc can copy a software.


    8a. A program boots the operating system into RAM.

    8b. The operating system is read into RAM through a process called booting.


    9a. The e-mail provides communication.

    9b. The Internet provides e-mail.


    10a. The internet enables you to use e-mail, web and telnet without sitting in one place.

    10b. Telnet enables you to log on in one place and gain access to the entire Internet.


