Processing power of voice recognition technologies requires enhancement for continuous speech recognition

5 October 2005 Electronics Technology

Information from Technical Insights

Considered unviable until recently, the realtime speech recognition technology currently used in voice portals consumes immense processing power. The computation-intensive Hidden Markov Model (HMM) technology of the mid-1980s improved the ability of voice recognition devices to identify word relationships and ultimately led to the development of powerful speech-recognition applications.

For systems to understand and respond to continuous speech, manufacturers have to make a large amount of processing power available, and this is not possible at reasonable cost. When users speak at natural speed, it becomes difficult to associate specific sounds with particular words. Because users usually do not pause between words, word boundaries are not marked in the signal, which makes processing naturally spoken phrases in realtime difficult.

"Predominantly software-only engines demand more processing power than can be provided by traditional digital signal processing boards," notes VR Yoges, of Frost & Sullivan. "These boards are used in interactive voice recognition (IVR) systems and they need additional processors to supplement the IVR processing power as well as support and manage the system."

Nortel's speech-processing platform integrates technologies into a range of media processing server (MPS) platforms. MPS systems configured with additional speech servers reduce the response time of a voice recognition solution. The speech server is a speech-processing platform within an IVR/media processing platform that offers choice, investment protection and scalability. The advanced system software developed on this platform integrates with industry-standard components to offer the advantages of open architecture systems.

"The design employs high-performance processors that plug into a separate resource subsystem integrated into the core operating architecture of the IVR/media server platform," says Yoges. "This approach provides a cost-effective and scalable resource for running advanced speech recognition and analysis."

Voice recognition systems also need to allow for the different ways in which people enunciate and intone the same word. The resulting problem of interpreting speech variability has led to the development of complex pattern analysis. Apart from accents, voice recognition systems have trouble filtering out background noise, especially in calls made by mobile phone users. Although better microphones have remedied this issue to a small extent, wind, murmurs and music still have to be properly isolated from the voice.

To address these concerns, ScanSoft introduced the OpenSpeech Recogniser (OSR), a speech recognition solution for telephony applications. A prominent feature of this solution is that it enables applications to understand a range of words and phrases without requiring highly complex grammar rules.

Innovations in automatic speech recognition (ASR), along with new solutions for missing or unreliable data, seek to reduce the impact of noisy backgrounds while relying on clean-speech models. It is possible to obtain highly improved speech solutions using such models. This missing-data approach to robust ASR works on the premise that when speech is one of several sound sources, recognition is still possible using the spectro-temporal regions that remain uncorrupted.
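
The premise can be illustrated with a minimal sketch in which a clean-speech model scores only the feature components flagged as reliable and ignores the rest. The function name `masked_log_likelihood`, the 0/1 reliability flags and the single diagonal Gaussian are assumptions chosen for illustration; the article does not describe how any particular recogniser implements this.

```python
import math

def masked_log_likelihood(features, mask, means, variances):
    """Log-likelihood of one frame under a diagonal Gaussian,
    summed only over components flagged as reliable (mask == 1).
    Unreliable components are skipped, i.e. marginalised out."""
    total = 0.0
    for x, reliable, mu, var in zip(features, mask, means, variances):
        if reliable:
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total

# Hypothetical three-channel frame whose middle channel is swamped by noise.
means = [0.0, 0.0, 0.0]
variances = [1.0, 1.0, 1.0]
mask = [1, 0, 1]
clean_score = masked_log_likelihood([0.0, 0.0, 0.0], mask, means, variances)
noisy_score = masked_log_likelihood([0.0, 100.0, 0.0], mask, means, variances)
```

Because the corrupted channel is excluded, both frames receive the same score, which is the behaviour the missing-data premise relies on.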

Since spectral features are sensitive to gender differences, it will be easy to analyse the differences in what the models have learnt about male and female speech patterns. Grammar constrains the recognition hypotheses and decides on a sequence of male or female models.

Researchers at the University of Sheffield discussed four system variants. They derived discrete signal-to-noise ratio (SNR) masks from estimates of local SNR. The first 10 frames in the spectral amplitude domain are averaged to form a stationary noise estimate. Subtracting this estimate from the noisy signal yields an estimate of the clean signal.
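
A rough sketch of this procedure follows, assuming each frame is a list of spectral amplitudes per frequency channel. The function names, the 7 dB threshold and the clipping of negative values to zero are illustrative assumptions, not details reported by the researchers.

```python
import math

def estimate_noise(frames, n_noise_frames=10):
    """Average the first n_noise_frames frames, channel by channel,
    to form a stationary noise estimate."""
    head = frames[:n_noise_frames]
    n_channels = len(head[0])
    return [sum(f[c] for f in head) / len(head) for c in range(n_channels)]

def discrete_snr_mask(frames, noise, threshold_db=7.0, floor=1e-10):
    """Return (clean_estimates, mask). A mask entry is 1 where the
    local SNR estimate exceeds threshold_db (a hypothetical value)."""
    clean, mask = [], []
    for frame in frames:
        clean_row, mask_row = [], []
        for amp, n in zip(frame, noise):
            s = max(amp - n, 0.0)  # subtract the noise estimate, clip at zero
            # Amplitude ratio expressed in decibels.
            snr_db = 20.0 * math.log10(max(s, floor) / max(n, floor))
            clean_row.append(s)
            mask_row.append(1 if snr_db > threshold_db else 0)
        clean.append(clean_row)
        mask.append(mask_row)
    return clean, mask

# Ten noise-only frames followed by one frame with energy in channel 0.
frames = [[1.0, 1.0]] * 10 + [[5.0, 1.2]]
noise = estimate_noise(frames)
clean, mask = discrete_snr_mask(frames, noise)
```

In this toy input only channel 0 of the final frame clears the threshold, so it is the only point marked as reliable.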

"The high threshold here offers a safety margin reducing the impact of the errors introduced by a poor fitting," observes Yoges. "Softmarks SNR, in contrast, has fuzzy interpretation, allowing more points to be let through without the damage caused by admitting noise outweighing."

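The contrast between a hard threshold and a "fuzzy" interpretation can be sketched by replacing the 0/1 decision with a weight between 0 and 1. The logistic function used here, along with the `threshold_db` and `slope` values, is one common choice assumed purely for illustration; the article does not say which mapping the researchers used.

```python
import math

def soft_snr_mask(snr_db, threshold_db=0.0, slope=0.5):
    """Map a local SNR estimate (dB) to a reliability weight in [0, 1].
    Points near the threshold get intermediate weights instead of
    being accepted or rejected outright."""
    z = slope * (snr_db - threshold_db)
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Weights for a few hypothetical local SNR values.
weights = [soft_snr_mask(snr) for snr in (-20.0, -2.0, 0.0, 2.0, 20.0)]
```

A point slightly below the threshold still contributes with a reduced weight, so more points are let through while clearly noisy ones stay close to zero.
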
If readers are interested in further information about the analysis of advances in voice recognition technology, they may contact Magdalena Oberland, [email protected], with their details.
