Speaker Recognition

Voice recognition, or speaker recognition, is the automated identification or verification of an individual's identity based on their voice. It should not be confused with speech recognition: speaker recognition determines who is speaking, whereas speech recognition determines what is being said.

The voice is considered both a physiological and a behavioral biometric factor:

  • the physiological component of speaker recognition is the physical shape of the subject’s vocal tract;
  • the behavioral component is the physical movement of the jaw, tongue and larynx.

How speaker recognition works

There exist two types of speaker recognition:

  • Text-dependent (constrained): the subject says a fixed phrase (password) that is the same for enrollment and verification, or the system prompts the subject to repeat a randomly generated phrase.
  • Text-independent (unconstrained): recognition is based on whatever words the subject says.

Text-dependent recognition performs better for cooperative subjects, but text-independent recognition is more flexible because it can also be used with non-cooperating individuals.

Identification or authentication using speaker recognition consists of four steps:
  1. voice recording
  2. feature extraction
  3. pattern matching
  4. decision (accept / reject)

Visualization of the acoustic pattern of the voice: loudness of the input vs. time.

Depending on the application, the voice is recorded either with a local, dedicated system or remotely (e.g. by telephone). The acoustic patterns of speech can be visualized as loudness or frequency vs. time. Speaker recognition systems analyze the frequency as well as attributes such as dynamics, pitch, duration and loudness of the signal.

During feature extraction, the voice recording is cut into windows of equal length. These cut-out samples are called frames and are often 10 to 30 ms long.
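The framing step can be illustrated with a minimal sketch. It assumes the recording is already available as a list of numeric samples, uses non-overlapping frames for simplicity (practical systems often overlap and window their frames), and uses the root-mean-square amplitude as a stand-in for loudness. The function names and the synthetic test tone are illustrative assumptions, not part of any particular product.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a list of samples into consecutive, non-overlapping frames of equal length."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Trailing samples that do not fill a complete frame are dropped.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def rms_loudness(frame):
    """Root-mean-square amplitude of one frame, used here as a simple loudness proxy."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# Assumption: a synthetic one-second 440 Hz tone sampled at 8 kHz stands in for a recording.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(signal, sample_rate, frame_ms=20)
loudness = [rms_loudness(f) for f in frames]  # one loudness value per frame (loudness vs. time)
```

With a 20 ms frame at 8 kHz each frame holds 160 samples, so the one-second signal yields 50 frames. Plotting `loudness` against the frame index would give a loudness vs. time curve of the kind described above.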

Pattern matching is the actual comparison of the extracted frames with known speaker models (or templates). It results in a matching score that quantifies the similarity between the voice recording and a known speaker model. Pattern matching is often based on Hidden Markov Models (HMMs), statistical models that take into account the underlying variations and temporal changes of the acoustic pattern.
Alternatively, Dynamic Time Warping is used. This algorithm measures the similarity between two sequences that vary in speed or time, even if the variation is non-linear, such as when the speaking speed changes during the sequence.
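To make Dynamic Time Warping more concrete, here is a minimal sketch of the classic dynamic-programming formulation. It assumes each sequence element is a single number and uses the absolute difference as the local cost; in a real speaker recognition system the elements would typically be feature vectors extracted per frame. The example sequences are made up for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [0, 1, 2, 3, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]  # same shape, spoken at half speed
other = [3, 3, 3, 0, 0, 0, 3]                      # a different shape

same_shape_distance = dtw_distance(template, slow)
different_shape_distance = dtw_distance(template, other)
```

Here `same_shape_distance` is zero because the slower sequence can be warped exactly onto the template, while `different_shape_distance` is greater than zero. A smaller distance means a better match, so a system built this way would accept when the distance falls below some chosen threshold.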

Some systems use “anti-speaker” techniques such as cohort models.
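One common way to use cohort models is to normalize the matching score: the recording is scored against the claimed speaker's model and against models of other speakers, and the decision is based on the difference. The sketch below assumes that higher scores mean a better match (for example log-likelihoods) and that the scores have already been computed elsewhere. The function names, the example scores and the threshold are hypothetical.

```python
from statistics import mean

def cohort_normalized_score(claimed_score, cohort_scores):
    """Subtract the average cohort score from the claimed speaker's score."""
    if not cohort_scores:
        raise ValueError("at least one cohort score is required")
    return claimed_score - mean(cohort_scores)

def decide(claimed_score, cohort_scores, threshold=0.0):
    """Accept only if the claimed speaker scores clearly better than the cohort."""
    return cohort_normalized_score(claimed_score, cohort_scores) > threshold

# Hypothetical log-likelihood scores of one recording against three cohort models.
cohort = [-55.0, -60.0, -58.0]
accepted = decide(-40.0, cohort)  # claimed model fits far better than the cohort
rejected = decide(-59.0, cohort)  # claimed model fits no better than the cohort
```

The point of the normalization is that a noisy recording lowers all scores at once, so comparing against the cohort is less sensitive to recording quality than comparing the raw score against a fixed threshold.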

Application of speaker recognition

Voice recognition is mostly used for telephone-based applications such as telephone banking and hotel or flight bookings.

  • Nuance is a US-based company and a major player in speech recognition. Through its acquisition of PerSay, an Israeli start-up, Nuance obtained two important products for speaker recognition.
  • Voice Trust is a German company specializing in speaker recognition solutions.

Suitability of speaker recognition

How suitable is speaker recognition as a biometric solution? We use the following seven criteria to evaluate it:

Universality: This biometric solution is not usable for people who are mute or who have problems with their voice due to severe illness.
Uniqueness: Because of the combination of physiological and behavioral factors, the voice is a unique feature of an individual. The voice has more unique features than a fingerprint.
Permanence: An issue with speaker recognition is that the voice changes with age and is also influenced by factors such as sickness, tiredness and stress.
Collectability: Voice recordings are easy to obtain and do not require expensive hardware. The real advantage of voice recognition is that it can be done over telephone lines or with computer microphones, despite variable recording and transmission quality. Pattern matching algorithms must be able to handle ambient noise and differing recording quality.
Acceptability: Speaker recognition is unobtrusive. Speaking is a natural process, so no unusual actions are required. When speaker recognition is used for surveillance, or more generally when the subject is not aware of it, the common privacy concerns about identifying unaware subjects apply.
Circumvention: A major issue with speaker recognition is spoofing with voice recordings. This risk can be mitigated if the system asks the subject to repeat a randomly generated phrase: an impostor cannot anticipate the phrase that will be required and therefore cannot prepare a playback attack.
Performance: Robustness depends heavily on the setup. When telephone lines or computer microphones are used, the algorithms have to compensate for noise and for issues with room acoustics. Furthermore, because the voice is a behavioral biometric, speaker recognition is affected by errors of the individual such as misreadings and mispronunciations.
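
The random-phrase mitigation mentioned under Circumvention can be sketched as follows. This is only an illustration under stated assumptions: the challenge is a sequence of spoken digits, the word list and function names are hypothetical, and the transcript of what the subject actually said is assumed to come from a separate speech recognition step that is not shown.

```python
import secrets

# Assumed vocabulary for the challenge; a real system might use longer phrases.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def random_challenge(n_words=4):
    """Build an unpredictable phrase that the subject is asked to repeat."""
    return " ".join(secrets.choice(DIGIT_WORDS) for _ in range(n_words))

def transcript_matches(challenge, transcript):
    """Check that the recognized words equal the requested phrase, ignoring case."""
    return challenge.lower().split() == transcript.lower().split()

challenge = random_challenge(5)
```

Because the phrase is drawn fresh for every attempt, a recording made during an earlier session is unlikely to contain the requested words in the requested order. The speaker's identity still has to be verified on the same utterance with the pattern matching step.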