Text To Speech: Impersonating the Human Voice

By Brian Smith | January 8, 2016

Back in the mid-1980s, I got my first real computer – a Texas Instruments TI-99/4A.  While playing a game called “Alpiner”, I was precariously guiding my climber toward the summit when a rock knocked the little guy off a ledge.  As I started back toward the summit, the game blurted out “Onward and upward” in a slightly creepy yet human-sounding voice.  “What?!?  My computer is talking!”  I had to find out what was behind this incredible technology.  After hours and hours of reading manuals and poking around in the terminal emulator, I had put together a very rudimentary application that made the computer speak whatever I typed.  My friends and I would stay up until the wee hours of the morning making the TI-99/4A speak our names, jokes and poems, and every cuss word we could think of.  Thus began my fascination with text to speech.

Text to speech (TTS) plays a vital role in the telephony sector.  When designing an IVR application, the presentation tier is built by stitching together a series of pre-recorded voice segments called “prompts”.  These prompts are typically recorded by the application developer during the development phase, and then re-recorded by professional voice talent just before production rollout.  The recorded “application prompts” are played in a specific sequence which is decided by the application logic.  For the presentation of dynamic data such as a date or dollar amount, the “application prompts” are paired with “system prompts”.  System prompts are a large collection of pre-recorded voice segments such as numbers, months, letters, and other commonly used items.  Consider this dialog that a typical IVR might speak to a caller:

“Your current balance is $425.33, and is due on 12/13/2008.”

It is assembled and spoken to the caller using 13 separate application and system prompts:

“Your current balance is” “four” “hundred” “twenty-five” “dollars” “and” “thirty-three” “cents” “and is due on” “December” “thirteenth” “two-thousand” “eight”.
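Under the hood, the application logic maps each dynamic value to an ordered list of prompt identifiers before any audio is played.  A minimal sketch of that stitching step is below; the function and prompt names are hypothetical and stand in for recorded audio files on a real IVR platform:

```python
# Hypothetical sketch of how an IVR might assemble the prompts above.
# Each string stands in for a pre-recorded system or application prompt.

def number_word(n):
    # Tiny lookup for this example only; a real system has prompts for 0-99.
    words = {4: "four", 25: "twenty-five", 33: "thirty-three"}
    return words[n]

def dollar_prompts(dollars, cents):
    """Break a dollar amount into its system-prompt segments."""
    prompts = []
    hundreds, remainder = divmod(dollars, 100)
    if hundreds:
        prompts += [number_word(hundreds), "hundred"]
    if remainder:
        prompts.append(number_word(remainder))
    prompts.append("dollars")
    if cents:
        prompts += ["and", number_word(cents), "cents"]
    return prompts

# Application prompt followed by stitched system prompts:
balance = ["Your current balance is"] + dollar_prompts(425, 33)
```

The same pattern repeats for the due date: the application prompt “and is due on” followed by system prompts for the month, day, and year.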

Often the IVR application is required to read data back to a caller that is too dynamic and variable to be handled with system prompts.  For example, an IVR may need to read back a street address or a person’s name.  Can you imagine having a professional voice talent record every single street, city, and state in the U.S.?  The task would be daunting, and incredibly expensive.  Text to speech makes reading back dynamic information easy.  But how does it work?

Let’s imagine that an IVR application has retrieved a street address from a database and needs to speak it to the caller.  The text string retrieved from the database is “125 S. St. Louis St., Tulsa, OK, 74135.”  The IVR application sends this text string to the TTS engine for conversion to speech.  The TTS engine first normalizes the text and breaks the entire phrase into a series of spoken words.  It parses the text string looking for punctuation, capitalization, and abbreviations, and even determines the intent of the phrase.  Without understanding the intent of the phrase, the TTS engine cannot deal with homographs: words with the same spelling but different pronunciations.  Think about our example above:  “125 S. St. Louis St.”  It is clear to us that the first “S.” is an abbreviation for “South”, the “St.” before “Louis” should be read as “Saint”, and the “St.” after “Louis” should be read as “Street”.  We know this because we understand that this is an address, and we can logically make assumptions about abbreviations and homographs based on what we know about the scope of street addresses.  The TTS engine must understand these same rules in order to pronounce the phrase correctly.  Inevitably, the TTS engine will sometimes misunderstand the intent of a phrase.  When this happens, a developer can help the TTS engine along by labeling the phrase with Speech Application Programming Interface (SAPI) commands or the newer Speech Synthesis Markup Language (SSML) tags.
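For instance, a developer could wrap the address in SSML to spell out each ambiguous abbreviation explicitly.  The sketch below uses the standard SSML `<sub>` element, which tells the engine to speak an alias in place of the written text, along with `<say-as>` for the ZIP code; exact element and `interpret-as` value support varies from one TTS engine to another:

```xml
<speak>
  125 <sub alias="South">S.</sub>
  <sub alias="Saint">St.</sub> Louis
  <sub alias="Street">St.</sub>,
  Tulsa, <sub alias="Oklahoma">OK</sub>,
  <say-as interpret-as="digits">74135</say-as>
</speak>
```

With these tags in place, the engine no longer has to guess: every “St.” has been disambiguated before synthesis begins.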

Once the intent of the phrase has been determined and the text normalization and homographic disambiguation are complete, the word segments are broken down into individual phonemes and sub-phonemes for proper pronunciation.  In fact, a text to speech voice model is actually made up of thousands of phonemes recorded by a real person.  The engine assembles the resulting prompt from these phonemes, giving the voice the proper tonal characteristics for gender, pitch, prosody, and inflection using voice models provided by the TTS manufacturer.  In a fraction of a second, a TTS engine takes a text string and returns an audio segment which is then played back to the caller.
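Conceptually, the concatenative step looks up each word’s phoneme sequence and splices the corresponding recorded units together.  Here is a toy sketch of that idea; the dictionary uses CMU/ARPAbet-style phoneme symbols and is, of course, vastly smaller than a real voice model:

```python
# Toy concatenative-synthesis sketch: map words to ARPAbet-style phonemes,
# then flatten the sequences into a single unit list for playback.
# The dictionary entries are illustrative, not a real voice model.
PHONEMES = {
    "onward": ["AA1", "N", "W", "ER0", "D"],
    "and":    ["AE1", "N", "D"],
    "upward": ["AH1", "P", "W", "ER0", "D"],
}

def to_units(words):
    """Return the concatenated phoneme-unit sequence for a phrase."""
    units = []
    for word in words:
        # A real engine falls back to letter-to-sound rules for unknown words.
        units += PHONEMES[word]
    return units

units = to_units(["onward", "and", "upward"])
```

In a production engine, each unit would select an actual audio fragment from the recorded voice inventory, chosen to join smoothly with its neighbors.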

The sound quality and realism of text to speech has improved exponentially over the years.  The latest TTS offerings from Nuance are available in multiple languages, both male and female, and have the ability to emote and punctuate particular words and phrases by using SAPI or SSML markers.  Custom dictionaries can be added to handle difficult words or abbreviations that are unique to the application.  Text to speech is quickly becoming a formidable opponent for human voice talent.  In just a few short years, the two may become indistinguishable to IVR users.  As my little TI-99/4A would surely agree, text to speech technology has truly gone “onward and upward”.

Brian Smith, Director of Strategic Initiatives