This perceptual test was prepared by Jiri
for students participating in the introductory
Lab Day of the 2002 Johns Hopkins University Summer Workshop. We make it available
here for the purpose of general interest. The answer key (link) can be found at the
bottom of this page.
Human auditory perception and its capability to recognize familiar voices
has played an important motivational role in the development of algorithms
for automatic speaker detection and identification. The speech signal provides
the ear with a variety of information on a multitude of levels that tells
it who is speaking. These include speaker-specific acoustics (pronunciation
of sounds and words), prosody (use of melodic patterns), use of lexicon,
idiosyncrasies, and more. The following quiz, consisting of
three sections, is designed to illustrate the difficulties in comparing
voices when different types of information are emphasized in the signal.
Please follow the instructions below, complete the form and print out this
page when you are finished. Your comments on the experiment are welcome!
Test and adjust your wavefile player here
START (Headphones recommended)
TEST SECTION 1 - CLEAN SPEECH
By clicking on the link items you will hear voice samples (telephone-quality
recording). In each row of the table, choose one of the five Speaker recordings
that matches the speaker contained in the Test sample. You may listen
to the samples repeatedly and in any order.
Do you have some comments on this part?
TEST SECTION 2 - SHUFFLED SOUNDS
By clicking on the test items you will hear sequences of short speech segments
played in random order. The segment shuffling removes the word structure, the
prosody, and thus the meaning from the sentences, leaving only some short-time
acoustic information for the listener. This type of (short-time acoustic)
information is used in today's most popular text-independent speaker recognition
systems. Again, try to match one of the five speakers to the test sample
in each row.
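
For readers curious about how such samples can be produced, here is a minimal sketch of segment shuffling. It is not the script used to create the test items: the fixed segment length, the function name `shuffle_segments` and its `seed` parameter are assumptions made for illustration, and the original samples may have been cut differently (for example, at sound boundaries).

```python
import random


def shuffle_segments(samples, segment_len, seed=None):
    """Cut `samples` into consecutive segments of `segment_len` samples
    and return one list with the segments in random order.

    Assumption: fixed-length segments. The sample order inside each
    segment is kept, so short-time acoustics survive while word
    structure and prosody are destroyed.
    """
    rng = random.Random(seed)
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    rng.shuffle(segments)
    # Flatten the shuffled segments back into one sample sequence.
    return [x for segment in segments for x in segment]
```

At an assumed telephone sampling rate of 8 kHz, `shuffle_segments(samples, 800)` would shuffle 100 ms pieces. Shorter segments leave less of the word structure and melody intact.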
What clues did you use to make your decisions? Any comments on this
part?
TEST SECTION 3 - SPEECH MELODY
Get ready for the tough stuff! This speech was filtered using an adaptive
inverse Linear Prediction Coding (LPC) filter, which removes nearly all
short-time spectral structure (i.e. it removes the articulation information),
so that mainly the fundamental tone and the loudness signal dominate
the residual. These two are basic components of prosody, a source known
to carry relatively complex information, including the speaker style, emotion,
sentence mode, language type and more. Prosody will be one of the subjects
studied by the Super-SID team this summer. Try to identify the
two samples in each row that belong to the same speaker, but be aware that
your ear will be robbed of its usual acoustic cues, so focus on alternatives!
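
The idea of inverse LPC filtering can be sketched as follows. This is an illustration only, not the filter used to create the test samples: the frame length of 240 samples (30 ms at an assumed 8 kHz rate), the prediction order of 10, the use of non-overlapping unwindowed frames, and all function names are assumptions. For each frame, the predictor is estimated with the autocorrelation method (Levinson-Durbin recursion), and the prediction error is kept as the residual.

```python
def autocorr(frame, max_lag):
    """Autocorrelation of `frame` for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns [1, a1, ..., ap], the coefficients of the inverse filter
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    if err <= 0.0:          # silent frame: leave the signal unchanged
        return a
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
        if err <= 0.0:      # frame is perfectly predictable
            break
    return a


def lpc_residual(signal, order=10, frame_len=240):
    """Frame-adaptive inverse filtering: e[n] = s[n] + sum_k a_k * s[n-k].

    The coefficients are re-estimated for every frame (assumed
    non-overlapping), which is what makes the filter adaptive.
    """
    residual = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        a = levinson_durbin(autocorr(frame, order), order)
        for n in range(start, start + len(frame)):
            e = signal[n]
            for k in range(1, order + 1):
                if n - k >= 0:
                    e += a[k] * signal[n - k]
            residual.append(e)
    return residual
```

In voiced speech the residual looks roughly like a train of pulses at the pitch period, scaled by loudness, which is why the melody remains audible after the articulation has been removed.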
Comments on this part:
END
Please PRINT out this page. Use the answer key distributed by the
lab supervisor to check your answers.
ANSWER KEY, CLICK HERE
For more information about the experiment, the algorithms and scripts
used to create the audio samples, please contact the author.