Naveen KT

Speaking with inanimate objects and getting work done through them has transitioned from a figment of our imagination to reality. Case in point: personal assistant devices like Alexa can recognize our words, interpret their meaning, and carry out commands.

The journey of speech recognition technology has been nothing short of a rollercoaster ride. Let us look at the developments that enabled the commercialization of ASR, and at what these systems could accomplish long before any of us had heard of Siri or Google Assistant.

The speech recognition field was propelled both by the application of different approaches and by the advancement of technology. Over the decades, researchers conceived of myriad ways to dissect language: by sounds, by structure, and with statistics.

Early Days

Even though human interest in recognizing and synthesizing speech goes back centuries, it was only in the last century that something recognizable as ASR was built. Bell Laboratories' digit recognizer, Audrey, was among the first such projects. It could identify spoken numbers by looking for audio fingerprints called formants, the distilled essences of sounds.
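As a rough illustration of the idea (not Audrey's actual circuitry, which was analog), the dominant frequencies of a vowel-like signal can be estimated by scanning its spectrum for peaks. The signal, frequencies, and candidate grid below are all made up for the example:

```python
import math

SAMPLE_RATE = 8000  # samples per second

def make_vowel(formants, n=2048):
    """Synthesize a crude vowel-like waveform as a sum of sinusoids
    at the given formant frequencies (Hz)."""
    return [sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in formants)
            for t in range(n)]

def spectrum_magnitude(signal, freq):
    """Naive DFT magnitude of `signal` at a single frequency (Hz)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    return math.hypot(re, im) / n

def find_formants(signal, candidates, top=2):
    """Return the `top` candidate frequencies with the strongest energy."""
    ranked = sorted(candidates, key=lambda f: spectrum_magnitude(signal, f),
                    reverse=True)
    return sorted(ranked[:top])

# An /a/-like vowel has its first two formants near 700 Hz and 1200 Hz.
signal = make_vowel([700, 1200])
print(find_formants(signal, candidates=range(100, 2000, 100)))  # → [700, 1200]
```

A real recognizer would work on short overlapping frames of microphone audio rather than a clean synthetic tone, but the matching step is the same in spirit: compare the detected peaks against stored templates for each digit.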

Next came the Shoebox in the 1960s. Developed by IBM, the Shoebox could recognize numbers and arithmetic commands (like ‘plus’ and ‘total’). Shoebox could also pass on the math problem to an adding machine, to calculate and print the answer.

Half way across the world, in Japan, hardware was being built that could recognize the constituent parts of speech like vowels. Systems were also being built to evaluate the structure of speech to figure out where a word might end.

A team at University College London had devised a system that could recognize 4 vowels and 9 consonants by analysing phonemes, the discrete sounds of a language.

However, these were all disjointed efforts and were lacking direction.

In a surprising turn of events, funding for ASR programs at Bell Laboratories was stopped in 1969. The reasons cited were a “lack of scientific rigor” in the field and “too much wild experimentation”. Funding was reinstated in 1971.

In the early 1970s, the U.S. Department of Defence’s ARPA (the agency now known as DARPA) funded a five-year program called Speech Understanding Research. Several ASR systems were created, and the most successful one, Harpy (by Carnegie Mellon University), could recognize over 1,000 words. Efforts to commercialize the technology also picked up speed: IBM worked on speech transcription in the context of office correspondence, and Bell Laboratories on ‘command and control’ scenarios.

The key turning point was the popularization of Hidden Markov Models (HMMs). These models used a statistical approach that translated into a leap forward in accuracy. Soon, the ASR field began coalescing around a set of tests that provided benchmarks for comparison. This was further encouraged by the release of shared data sets that researchers could use to train and test their models.
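The core of the HMM approach can be sketched with a toy Viterbi decoder: given a sequence of acoustic observations, find the most likely hidden sequence of phones under the model's transition and emission probabilities. The phone labels and all probabilities below are invented for illustration:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda x: x[0])[1]

# Toy model: two hidden phones emitting coarse acoustic classes.
states = ["s", "iy"]                       # hidden phones
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {"s": {"s": 0.3, "iy": 0.7}, "iy": {"s": 0.2, "iy": 0.8}}
emit_p = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

print(viterbi(["hiss", "tone", "tone"], states, start_p, trans_p, emit_p))
# → ['s', 'iy', 'iy']
```

What made this statistical framing powerful is that the probabilities could be estimated automatically from recorded speech, rather than hand-crafted, which is exactly why the shared data sets mentioned above mattered so much.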

ASR as we know it today was introduced in the 1990s. Dragon Dictate launched in 1990 for a staggering $9,000, with a dictionary of 80,000 words and features like natural language processing.

These tools were time-consuming and required users to speak in a stilted manner: Dragon could initially recognize only 30–40 words a minute, while people typically talk around four times faster than that. By 1997, Dragon NaturallySpeaking was introduced, which could capture words at a more fluid pace and at a much lower price tag of $150.

Current Landscape

Voice has been touted as the future. Tech giants are investing in it and placing voice-enabled devices at the core of their business strategy.

Machine learning has been behind major breakthroughs in speech recognition. Google’s efforts in this field culminated in the introduction of the Google Voice Search app in 2008. The company further refined this technology with the help of huge volumes of training data, and eventually launched the Google Assistant.

Digital assistants like Google Assistant, Siri, Alexa and others are changing the way people interact with their devices. Digital assistants are intended to help people perform or complete basic tasks and to respond to queries.

With the ability to retrieve data from a wide variety of sources, these assistants help solve problems in real time, improving the user experience and human productivity.

Popular voice assistants include:

  • Amazon’s Alexa
  • Apple’s Siri
  • Google’s Google Assistant
  • Microsoft’s Cortana

Application of Speech Recognition Technology

Speech recognition technology and the use of digital assistants have moved rapidly from our phones to our homes, and their application in industries such as business, banking, marketing, and healthcare is rapidly becoming obvious.

In the Workplace: Speech recognition technology in the work environment has been a boost to productivity and efficiency. Examples of office tasks digital assistants are, or will be, able to perform:

  • Search for documents or reports on a computer
  • Create tables or graphs from data
  • Answer queries
  • Print documents on request
  • Record minutes
  • Perform other routine tasks, like scheduling meetings and making travel arrangements
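Behind tasks like these sits a routing step: once speech has been transcribed to text, the assistant must map the utterance to an action. A minimal keyword-based sketch is below; the task names and keywords are hypothetical, and real assistants use statistical intent classifiers rather than keyword lists:

```python
# Hypothetical task registry mapping trigger keywords to an office task.
TASKS = {
    ("search", "find"): "search_documents",
    ("schedule", "meeting"): "schedule_meeting",
    ("print",): "print_document",
    ("minutes", "record"): "record_minutes",
}

def route_command(transcript):
    """Map a transcribed utterance to the first task whose keywords match."""
    words = set(transcript.lower().split())
    for keywords, task in TASKS.items():
        if words & set(keywords):
            return task
    return "fallback_ask_clarification"  # no match: ask the user to rephrase

print(route_command("Please schedule a meeting for Tuesday"))  # → schedule_meeting
print(route_command("What's the weather"))  # → fallback_ask_clarification
```

The fallback branch matters in practice: an assistant that guesses wrongly on an unmatched command is worse than one that asks for clarification.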

In Banking: The aim of speech recognition in banking and the financial industries is to reduce friction for the customer. Voice-activated banking could reduce the need for human customer service and lower employee costs, while a personalized banking assistant could in turn boost customer loyalty and satisfaction.

How speech recognition can improve banking:

  • Request financial information
  • Make payments
  • Receive information about your transaction history

In Marketing: Voice-search can and will cause shifts in consumer behaviour. It is essential to understand such shifts and tweak the marketing activities to keep up with the times.

  • With speech recognition, there will be a new type of data available for advertisers to analyse. People’s accents, speech patterns, and vocabulary can be used to infer a buyer’s location, age, and other demographic details, such as their cultural affiliation.
  • Speaking allows for longer, more conversational searches. Advertisers and optimisers may need to concentrate on long-tail keywords and on creating conversational content to stay ahead of these trends.

In Healthcare: In situations where seconds are critical and sterile working conditions are essential, hands-free, immediate access to data can have a positive effect on medical efficiency.

Benefits include:

  • Quick retrieval of information from medical records
  • Less paperwork
  • Reduced time on inputting data
  • Improved workflow

This is just scratching the surface of the applications of this technology. The future of speech recognition technology holds a lot of promise across various industries.



About the Author:

Naveen is a software developer at GAVS. He teaches underprivileged children and is interested in giving back to society in as many ways as he can. He also enjoys dancing, painting, and playing the keyboard, and is a district-level handball player.