Microsoft says its system can automatically recognize conversations between Americans with nearly perfect accuracy. The error rate was 5.9 percent, which means the system correctly recognizes more than 94 percent of the words. The details of the research were published in a scientific paper.
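To make the 5.9 percent figure concrete, here is a minimal sketch of how such a number can be computed. It assumes the figure is a word error rate, the usual metric for speech recognizers, which the article does not spell out; the function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"error rate: {word_error_rate(reference, hypothesis):.1%}")
```

Under this definition, an error rate of 5.9 percent corresponds to roughly 59 word errors for every 1,000 words actually spoken.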
Speech recognition can be used in entertainment devices such as Xbox game consoles, personal digital assistants, and in call centers.
Microsoft said the breakthrough was made possible by artificial intelligence technology developed by STC Innovations, a company founded by Russian acoustic scientists more than 25 years ago. The company's products, ranging from voice recorders to national security systems, are sold in 75 countries.
STC's technology has noticeably decreased the number of recognition errors. "When we speak in front of a large audience, or with a robot, we do this evenly and clearly," said Alexander Zatvornitsky, director of the department of speech recognition at STC. "The recognition of telephone conversations, in which thoughts are born spontaneously, is a completely different challenge."
A neural network is a mathematical model inspired by the human brain. Each neuron is represented by a small computer program. Neural network technology has been developed over the past few decades, first to identify images and, more recently, sound.
To recognize speech, the audio recording must be broken down into 100 fragments per second and fed into the neural network. The input is a mathematical description of the sound wave, while the output is one of several thousand types of sound, or rather, phonemes.
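The splitting step described above can be sketched as follows. This is a simplified illustration, not STC's or Microsoft's code: the 8,000 Hz sample rate, the synthetic test tone, and the per-fragment energy value (standing in for the "mathematical description") are all assumptions. Real systems typically use overlapping windows and richer spectral features.

```python
import math


def split_into_frames(samples, sample_rate, frames_per_second=100):
    """Cut a list of audio samples into consecutive, non-overlapping fragments."""
    hop = sample_rate // frames_per_second  # samples per 10 ms fragment
    return [samples[start:start + hop]
            for start in range(0, len(samples) - hop + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude: a very crude numeric description of a fragment."""
    return sum(x * x for x in frame) / len(frame)


# Assumed input: one second of a 440 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
one_second = [math.sin(2 * math.pi * 440 * t / sample_rate)
              for t in range(sample_rate)]

frames = split_into_frames(one_second, sample_rate)
features = [frame_energy(f) for f in frames]
print(len(frames), "fragments of", len(frames[0]), "samples each")
```

Each entry in `features` is what would be handed to the network for one fragment; the network would then output a score for each candidate phoneme.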
"Each phoneme sounds slightly different because the vocal apparatus cannot fully readjust after pronouncing the previous sound while it is already preparing to pronounce the next," explained Zatvornitsky. "The sound 'a' in the words 'mama' and 'bar' will be different because the surrounding sounds are different. It is these subtle differences between sounds that the neural network can perceive."
The first generation of neural networks had no memory. They identified each sound only from its own fragment. Modern neural networks can remember in the middle of a sentence what was said at the beginning. Memory helps them catch longer patterns and linguistic and verbal components, which improves the quality of recognition.
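The difference between a network without memory and one with memory can be illustrated with a toy example. This is only a sketch of the general idea of a recurrent unit, with made-up weights and inputs; it is not the architecture used by STC or Microsoft, and production systems use far more elaborate units.

```python
import math


def memoryless_outputs(inputs, w_x=1.0):
    """Each output depends only on the current fragment."""
    return [math.tanh(w_x * x) for x in inputs]


def recurrent_outputs(inputs, w_x=1.0, w_h=0.5):
    """Each output also depends on a state carried over from earlier fragments."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = math.tanh(w_x * x + w_h * state)  # mix new input with memory
        outputs.append(state)
    return outputs


# Two hypothetical sequences that end with the same fragment (0.2)
# but begin differently.
seq_a = [1.0, 0.2]
seq_b = [-1.0, 0.2]

# Without memory, the final fragment is treated identically in both cases.
print(memoryless_outputs(seq_a)[-1] == memoryless_outputs(seq_b)[-1])

# With memory, the final output reflects what came before.
print(recurrent_outputs(seq_a)[-1], recurrent_outputs(seq_b)[-1])
```

The carried-over `state` is what lets the second version treat the same sound differently depending on its context.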
STC's method has helped improve the neural networks' short-term memory, while keeping the memory from retaining what is redundant.
"If the network memorizes too well what it receives during training, it will cope with what it knows but will poorly understand what it has never heard, and so it needs to learn," stressed Zatvornitsky.
Today STC researchers are working on speech recognition in real background noise. For example, they try to recognize speech in recordings made at a social event, during a drive along the highway, or at a meeting where many people are present.
The neural network is already capable of perceiving the speaker's emotions, which is important in the service industry. The researchers must also determine whether the program works equally well regardless of the speaker's age, accent and speech abilities.
The recognition of spontaneous telephone speech in languages with complex word formation, such as Russian and Arabic, is still far from perfect. But in the long run researchers want artificial intelligence not only to identify speech but also to reply to questions and to act on what it is told.