18th April 2002
75 minutes
I. Latent Semantic Analysis (LSA)
(7 Marks)
Consider a vocabulary of four words: { and, language, processing, speech }. Suppose the phrase "speech and language processing" is being decoded by a speech recognizer. Using Bellegarda's 'N-gram + LSA language modeling' method, calculate the bi-lsa probability:
Pbi-lsa( processing | speech, and, language )
This requires first calculating the LSA-based probability
Plsa ( wq = processing | Hq-1 = {speech, and, language} )
and then to integrate it with bi-gram probability. The LSA-based probability is given by
where,
The LSA space was constructed using 2 latent dimensions, with singular values of 2 and 1 respectively. The following information is available:
| word w | normalized entropy e(w) | unigram probability P(w) | bi-gram probability P(w \| language) | U matrix of SVD u(w) (1x2) |
|---|---|---|---|---|
| and | 0.9 | 0.1 | 0.2 | (0.318, -0.424) |
| language | 0.3 | 0.2 | 0.0 | (0.000, 0.848) |
| processing | 0.4 | 0.4 | 0.8 | (0.424, 0.318) |
| speech | 0.2 | 0.3 | 0.0 | (0.848, 0.000) |
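The calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the expected model answer, because the question's own formula for the LSA-based probability is not reproduced in this text. The sketch assumes a Bellegarda-style pseudo-document vector (the history words' u(w) vectors weighted by 1 - e(w), averaged, and scaled by the inverse singular values) and a cosine closeness measured in the space scaled by the square roots of the singular values. The mapping from closeness to a probability in `lsa_probabilities` is purely hypothetical and would be replaced by the formula given in the question. The integration step multiplies the bi-gram probability by the ratio of LSA probability to unigram probability and renormalizes over the vocabulary. All function and variable names are made up for this example.

```python
import math

# Values taken from the table in the question.
ENTROPY = {"and": 0.9, "language": 0.3, "processing": 0.4, "speech": 0.2}
UNIGRAM = {"and": 0.1, "language": 0.2, "processing": 0.4, "speech": 0.3}
BIGRAM_GIVEN_LANGUAGE = {"and": 0.2, "language": 0.0,
                         "processing": 0.8, "speech": 0.0}
U = {
    "and": (0.318, -0.424),
    "language": (0.000, 0.848),
    "processing": (0.424, 0.318),
    "speech": (0.848, 0.000),
}
SINGULAR_VALUES = (2.0, 1.0)


def pseudo_document(history):
    """Assumed form: v = (1/n) * sum_w (1 - e(w)) * u(w) * S^-1."""
    n = len(history)
    return tuple(
        sum((1.0 - ENTROPY[w]) * U[w][k] for w in history) / (n * s)
        for k, s in enumerate(SINGULAR_VALUES)
    )


def closeness(word, doc):
    """Cosine between u(w) S^(1/2) and v S^(1/2)."""
    a = [U[word][k] * math.sqrt(s) for k, s in enumerate(SINGULAR_VALUES)]
    b = [doc[k] * math.sqrt(s) for k, s in enumerate(SINGULAR_VALUES)]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def lsa_probabilities(doc):
    """HYPOTHETICAL mapping: shift cosines into [0, 1], then normalize."""
    scores = {w: (1.0 + closeness(w, doc)) / 2.0 for w in U}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


def integrate(bigram, lsa, unigram):
    """P(w | history) proportional to P_bigram(w) * P_lsa(w) / P(w)."""
    raw = {w: bigram[w] * lsa[w] / unigram[w] for w in bigram}
    total = sum(raw.values())
    return {w: r / total for w, r in raw.items()}


history = ["speech", "and", "language"]
doc = pseudo_document(history)
k = {w: closeness(w, doc) for w in U}
p_lsa = lsa_probabilities(doc)
p_bi_lsa = integrate(BIGRAM_GIVEN_LANGUAGE, p_lsa, UNIGRAM)
print(round(k["processing"], 3), p_bi_lsa["processing"])
```

Because P(language | language) and P(speech | language) are both zero, only "and" and "processing" receive probability mass after integration, whatever closeness-to-probability mapping is used.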
II. K-means Clustering
(7 Marks)
III. Tri-gram Tagging
(5 Marks)
IV. Similarity Measures
(5 Marks)
An electron is round.
A ball is spherical.
In this case, ball and electron are semantically similar in the context of shape.
However, in the following sentences:
An electron is tiny.
That ball is huge.
ball and electron are semantically dissimilar in the context of size.