MAJOR

18th April 2002

75 minutes

I. Latent Semantic Analysis (LSA)

(7 Marks)

Consider a vocabulary of four words: { and, language, processing, speech }. Suppose the phrase "speech and language processing" is being decoded by a speech recognizer. Using Bellegarda's 'N-gram + LSA language modeling' method, calculate the bi-lsa probability:

P_bi-lsa( processing | speech, and, language )

This requires first calculating the LSA-based probability

P_lsa( w_q = processing | H_{q-1} = {speech, and, language} )

and then integrating it with the bi-gram probability. The LSA-based probability is given by

where,

The LSA space was constructed using 2 latent dimensions with corresponding singular values of 2 and 1 respectively. The following information is available:

word w        normalized      unigram        bi-gram              U matrix of SVD
              entropy e(w)    probability    probability          u(w) (1x2)
                              P(w)           P(w | language)

and           0.9             0.1            0.2                  [ 0.318  -0.424 ]
language      0.3             0.2            0.0                  [ 0.000   0.848 ]
processing    0.4             0.4            0.8                  [ 0.424   0.318 ]
speech        0.2             0.3            0.0                  [ 0.848   0.000 ]
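The formulas for the LSA-based probability are missing from the text above, so the following is only a sketch under stated assumptions, not the intended solution. It assumes Bellegarda's published formulation: the history is represented by an entropy-weighted pseudo-document vector, closeness is the cosine between vectors scaled by S^(1/2), and the n-gram and LSA probabilities are combined as P(w | language) * P_lsa(w) / P(w), renormalized over the vocabulary. The pairing of U entries with words assumes the eight U values are listed column by column (this reading gives orthonormal columns). The mapping from closeness to P_lsa used here, (1 + K) / 2 normalized over the vocabulary, is a purely illustrative assumption; substitute the mapping specified in the question if it differs.

```python
import math

WORDS = ["and", "language", "processing", "speech"]
ENTROPY = {"and": 0.9, "language": 0.3, "processing": 0.4, "speech": 0.2}
UNIGRAM = {"and": 0.1, "language": 0.2, "processing": 0.4, "speech": 0.3}
BIGRAM = {"and": 0.2, "language": 0.0, "processing": 0.8, "speech": 0.0}  # P(w | language)
# Rows of U, assuming the listed values are read column by column.
U = {
    "and": (0.318, -0.424),
    "language": (0.000, 0.848),
    "processing": (0.424, 0.318),
    "speech": (0.848, 0.000),
}
S = (2.0, 1.0)  # singular values


def pseudo_document(history):
    """Entropy-weighted average of the history word vectors, scaled by S^-1."""
    v = [0.0, 0.0]
    for w in history:
        for k in range(2):
            v[k] += (1.0 - ENTROPY[w]) * U[w][k] / S[k]
    return [x / len(history) for x in v]


def closeness(word, v):
    """Cosine between u(word) S^(1/2) and v S^(1/2)."""
    a = [U[word][k] * math.sqrt(S[k]) for k in range(2)]
    b = [v[k] * math.sqrt(S[k]) for k in range(2)]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def integrate(p_ngram, p_lsa, p_unigram):
    """Combine n-gram and LSA probabilities and renormalize over the vocabulary."""
    scores = {w: p_ngram[w] * p_lsa[w] / p_unigram[w] for w in WORDS}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


v = pseudo_document(["speech", "and", "language"])
K = {w: closeness(w, v) for w in WORDS}

# ASSUMPTION: illustrative closeness-to-probability mapping, not from the source.
shifted = {w: (1.0 + K[w]) / 2.0 for w in WORDS}
P_LSA = {w: s / sum(shifted.values()) for w, s in shifted.items()}

P_BI_LSA = integrate(BIGRAM, P_LSA, UNIGRAM)
for w in WORDS:
    print(f"{w:12s} K={K[w]: .3f}  P_lsa={P_LSA[w]:.3f}  P_bi-lsa={P_BI_LSA[w]:.3f}")
```

Because the bi-gram probability is zero for "language" and "speech", only "and" and "processing" can receive non-zero integrated probability, whatever closeness mapping is chosen.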

II. K-means Clustering

(7 Marks)

  1. Construct an example data set for which the K-means algorithm takes more than one iteration to converge.
  2. A data set consists of 4 points arranged at the vertices of a square, and the K-means clustering procedure is run with k = 2 clusters.
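Both items can be explored with a small implementation. The following is a minimal sketch of the standard K-means (Lloyd) iteration in plain Python; the function name `kmeans`, the tie-breaking rule (lowest cluster index wins), the unit-square coordinates, and the initial centers are all assumptions chosen for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Run K-means from the given initial centers.

    Returns (assignment, centers, updates), where `updates` counts the
    passes in which the assignment changed and the centers were recomputed.
    """
    centers = [tuple(c) for c in centers]
    assignment = None
    updates = 0
    for _ in range(max_iter):
        # Assignment step: nearest center, ties go to the lowest index.
        new_assignment = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        updates += 1
    return assignment, centers, updates


# A data set that needs more than one update before converging.
line = [(0, 0), (1, 0), (9, 0), (10, 0)]
line_result = kmeans(line, [(0, 0), (1, 0)])
print(line_result)

# Four vertices of a unit square, k = 2, two different initializations.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
left_right = kmeans(square, [(0, 0), (1, 0)])
bottom_top = kmeans(square, [(0, 0), (0, 1)])
print(left_right)
print(bottom_top)
```

With these choices the square is split differently depending on the initial centers, which illustrates that K-means only finds a local optimum of its objective.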

III. Tri-gram Tagging

(5 Marks)

  1. Derive the tagging sequence arg max_{t_{1,n}} P(t_{1,n} | w_{1,n}) under the assumptions:
  2. Draw an HMM for this tri-gram tagger.
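The assumptions themselves are not listed in the text above. As a sketch only, the derivation below assumes the two independence assumptions commonly used for tri-gram taggers: each word depends only on its own tag, and each tag depends only on the two preceding tags. If the intended assumptions differ, the final product changes accordingly.

```latex
\begin{align*}
\hat{t}_{1,n}
  &= \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) \\
  &= \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})}
     && \text{(Bayes' rule)} \\
  &= \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})
     && \text{($P(w_{1,n})$ does not depend on $t_{1,n}$)} \\
  &= \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-2}, t_{i-1})
     && \text{(assumed independence assumptions)}
\end{align*}
```

Under these assumptions, the corresponding HMM has one state per tag pair (t_{i-1}, t_i), so that a transition between states carries the tri-gram probability P(t_i | t_{i-2}, t_{i-1}).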

IV. Similarity Measures

(5 Marks)

  1. Consider the following sentences:

        An electron is round.

        A ball is spherical.

    In this case, ball and electron are semantically similar in the context of shape.

    However, in the following sentences:

        An electron is tiny.

        That ball is huge.

    ball and electron are semantically dissimilar in the context of size.

  2. The standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words.
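The effect described in the second item can be reproduced with a few lines of code. The following is a minimal sketch that assumes raw term counts and cosine similarity as the matching function (no tf-idf weighting or stop-word removal); the query and both documents are made-up examples.

```python
import math
from collections import Counter


def cosine(query, document):
    """Cosine similarity between raw term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(
        sum(c * c for c in d.values())
    )
    return dot / norm if norm else 0.0


query = "latent semantic analysis"
# Hypothetical short page: the query plus one extra word.
short_doc = "latent semantic analysis explained"
# Hypothetical longer page that actually discusses the topic.
long_doc = (
    "latent semantic analysis maps words and documents into a low "
    "dimensional space using a singular value decomposition of the "
    "word by document matrix"
)

print("short:", round(cosine(query, short_doc), 3))
print("long: ", round(cosine(query, long_doc), 3))
```

The short document scores higher because almost all of its vector lies along the query terms, while the additional vocabulary of the longer document lowers its cosine even though it is more informative.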
