Speech recognition service for mobile applications
 
Abstract
Speech recognition interfaces can make mobile devices easier to operate while on the road. This application description explains how to provide a robust and scalable speech recognition service for mobile devices. 


Speech recognition application
A speech interface will ease the use of mobile devices. The current Graffiti and keyboard input techniques are cumbersome, slow, and require both hands. With Graffiti, a person uses a pen to write letters in a cryptic alphabet to form words. The keyboards are so small that a person typing with a pen makes motions like a pecking hen. With a speech interface the person can speak a phrase, which the computer will interpret as a command. Speaking a thought is fast, easy, and can be done while the hands are otherwise occupied. Taking notes at the speed of speech is especially useful for recording inspirational thoughts that might otherwise be quickly forgotten. For example, a person walking through a mall is unexpectedly inspired by a store display. Speaking the idea through the speech interface means the person can continue on his/her way without the inconvenience of stopping to write or type the idea. The fleeting idea might otherwise be forgotten if not recorded.

The computer interprets the speech and executes the desired command, such as recording the thought in the notepad under the category of creative ideas. The computer must first translate the speech audio to text in order to interpret the spoken command. The speech-to-text operation is very computationally intensive, and this application description explains how to provide it as a service. When the text is available, the computer can execute the command. The typical syntax for a spoken command is an operation followed by a detail, for example, “Computer, Notepad entry, creative idea: Decorate kitchen table with butterflies.” (Surely a thought forgotten if not recorded.) The “Computer, Notepad entry, creative idea” part is the command for the computer to store the message about the kitchen under the category of creative ideas.
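Once the text is available, interpreting a command of this shape is simple. The following is a minimal sketch, assuming the recognized text follows the pattern of the example above (wake word, operation, category, then a colon and the detail). The names SpokenCommand and parse_command are illustrative and not part of any existing system.

```python
from typing import NamedTuple, Optional


class SpokenCommand(NamedTuple):
    operation: str  # e.g. "Notepad entry"
    category: str   # e.g. "creative idea"
    detail: str     # the free-form message to store


def parse_command(text: str, wake_word: str = "Computer") -> Optional[SpokenCommand]:
    """Split text of the assumed form
    '<wake word>, <operation>, <category>: <detail>' into its parts.
    Returns None when the text does not follow that pattern."""
    head, sep, detail = text.partition(":")
    if not sep:
        return None  # no colon, so there is no detail part
    parts = [p.strip() for p in head.split(",")]
    if len(parts) != 3 or parts[0].lower() != wake_word.lower():
        return None
    return SpokenCommand(parts[1], parts[2], detail.strip())


cmd = parse_command(
    "Computer, Notepad entry, creative idea: Decorate kitchen table with butterflies."
)
```

Here cmd.operation selects the notepad, cmd.category selects the list of creative ideas, and cmd.detail is the note to store. Text that does not match the pattern yields None, so the device can ignore it or ask the user to repeat the command.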

The types of spoken commands and interaction with the mobile device are constrained by the speech recognition process. With speech recognition there is a delay of several seconds between speaking and the output of the computed text because the process is computationally and memory intensive. Delay is inherent even on desktop systems with ample computational and memory resources. Therefore the speech interface is limited to interaction that allows for some delay, such as in the example of recording the creative idea. The person will refer to the notepad list of creative ideas later, when he/she has time for it, so it is sufficient for the speech-to-text translation to be completed at a later time. Another instance that permits delay is a background process. A person driving to a meeting can command the computer to bring up all the records relevant for the meeting. The computer has until the end of the drive, when the person goes into the meeting, to perform the speech recognition and retrieve the meeting records.

Having a speech interface introduces social issues because other people can hear what the mobile computer user is saying. Cell phone users are faced with similar concerns. Sometimes a person talking on a cell phone appears to be talking to himself/herself, which is considered peculiar behavior. More natural communication involves addressing the conversation partner. The spoken commands to the mobile application can be made more natural by addressing the mobile device, such as, “Computer, please record creative idea.”
 

Providing mobile device with speech recognition
The technical challenge is to provide the mobile user with the speech recognition service. The mobile device has limited resources to perform the speech recognition, but a remote server with ample resources can perform the speech recognition for the mobile device. An infrastructure is required to communicate between the mobile device and the remote service. This speech recognition application is envisioned to use PDAs integrated with cell-phones, such as those from T-Mobile and Qualcomm. The PDA is the person’s mobile computer used for schedules, notes, office applications, games, etc. The cell-phone’s capability is used to communicate audio and text data to the remote speech recognition server. The speech audio is communicated in a computer-to-computer “phone call” from the mobile device to the remote server. Data is transmitted between the devices through SMS messages or through the Wireless Application Protocol (WAP), the protocol used to access web pages on a cell-phone.

Speech recognition process

Examining the process of speech recognition introduces three issues that must be addressed in the mobile application. Firstly, the remote server that performs the speech recognition requires access to an individual’s customized speech recognition profile. Secondly, the mobile application can improve the recognition accuracy by providing context information such as the current location. Thirdly, noise in the mobile environment complicates the speech recognition process.

Typically there are four stages to speech recognition, as described by Dan Ellis. The first stage is to digitally record the speech audio. In the second stage the audio is compressed with a cepstral transform, which provides the information used to identify the phonemes of speech in the third stage. In the fourth stage words are identified by processing the phonemes with stochastic modeling, such as a hidden Markov model.
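As a rough illustration of the second stage, the sketch below computes a plain real cepstrum of one audio frame, that is, the inverse Fourier transform of the log magnitude spectrum. This is an assumption about what “cepstral transform” refers to here; practical recognizers often use more elaborate variants such as mel-frequency cepstral coefficients. The function names dft and real_cepstrum are illustrative, and the naive transform is only suitable for short frames.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2), fine for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of one audio frame.
    `floor` avoids taking the logarithm of zero."""
    n = len(frame)
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    # The log magnitude spectrum of a real signal is real and symmetric,
    # so the inverse transform is real; only the real part is kept.
    return [
        (sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


quiet = real_cepstrum([1.0, 0, 0, 0, 0, 0, 0, 0])
loud = real_cepstrum([2.0, 0, 0, 0, 0, 0, 0, 0])
```

In this toy input the louder frame differs from the quiet one only in the first coefficient, which shows one reason the representation is convenient: overall gain is separated from the shape of the spectrum.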

Noise in the mobile environment, such as traffic noise, makes conditions less than ideal for the speech recognition process. The noise complicates the recognition of phonemes in the third stage of the recognition. Additional processing is required in the first stage to filter the noise from the audio.

Using an individual’s recognition profile can enhance the accuracy of phoneme recognition in the third stage. The profile provides information about the individual’s pronunciation of phonemes and is developed further during every recognition process. With the Dragon NaturallySpeaking speech recognition software, the profile can be developed within 30 minutes to give accurate speech recognition. As the profile is unique to every person, the remote server performing the speech recognition requires access to the recognition profile.

The accuracy of recognizing words from phonemes in the fourth stage can be enhanced by context information, such as sentence structure or grammar. The mobile application can also provide context information, such as the person’s location. The person’s location might determine likely commands; for example, at an airport likely commands would include “flight number”, “connecting flight number”, “departure time”, or “gate”.
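One way location context could be applied is to re-rank the recognizer’s candidate phrases. The sketch below assumes the recognizer returns scored candidates and that a table of likely phrases per location exists. The table LOCATION_PHRASES, the function rerank, the boost factor, and the example scores are all hypothetical; only the airport phrases come from the text above.

```python
# Hypothetical table of phrases that are likely at a given kind of location.
LOCATION_PHRASES = {
    "airport": ["flight number", "connecting flight number", "departure time", "gate"],
}


def rerank(candidates, location, boost=2.0):
    """candidates: list of (phrase, score) pairs from the recognizer.
    Multiply the score of phrases expected at `location` by `boost`
    and return all candidates sorted best first."""
    expected = set(LOCATION_PHRASES.get(location, []))
    rescored = [
        (phrase, score * boost if phrase in expected else score)
        for phrase, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Two similar-sounding candidates with made-up acoustic scores.
candidates = [("great", 0.5), ("gate", 0.4)]
best_at_airport = rerank(candidates, "airport")[0][0]
best_elsewhere = rerank(candidates, "office")[0][0]
```

With these made-up scores the acoustically weaker candidate “gate” wins at the airport, while the ranking is unchanged at a location with no phrase list.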
 


Infrastructure design

The objective of the infrastructure is to provide the mobile device with the speech recognition service. A remote server with ample system resources performs the speech recognition. The cell-phone on the mobile device provides the network connectivity between the mobile device and the remote server. An advantage of the cellular network is that it already provides connectivity over large parts of the world. The design goal of the infrastructure is to provide scalable and robust service to potentially hundreds of mobile devices.

The protocol for the communication between the mobile device and the speech recognition service has four stages. First, the mobile device discovers the phone number of a remote server that is available and minimizes the cost to the user. Second, the mobile device connects to the remote server and communicates the speech audio. Third, the remote server accesses the mobile user’s recognition profile and performs the speech-to-text computation. Fourth, the text equivalent of the speech is returned to the mobile device, where the command is interpreted.

A cell-phone uses the short message service (SMS) and the wireless application protocol (WAP) to communicate data. WAP is used by cellular phones to access web pages. Web pages are appropriate for distributing information of interest to many clients. SMS messages are intended for communication between two parties.

Cell phones of the future will be equipped with tracking technology. The E-911 legislation requires the tracking for emergency purposes. The tracking, however, can also be used for location-dependent services.

Detailed communication protocol 

The four stages of the protocol are illustrated in figure 1. The first stage of the protocol is for the mobile device to discover the remote server, i.e., the phone number to call to transmit the speech audio. The mobile device accesses a WAP server at a well-known URL. With the WAP request, the mobile device also includes its location as determined by the cell phone. The WAP server returns a list of the remote servers best suited for the mobile device. The WAP server load balances the speech recognition requests among the remote servers and returns only remote servers that have available capacity. Furthermore, the WAP server prefers the remote servers that will be least expensive for the mobile device to call.
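The selection made by the discovery server can be sketched as a filter followed by a sort. This is only one possible policy, assuming each recognition server reports its current load and that the cost of a call from the caller’s region is known. The class RecognitionServer, its fields, the function discover, the region codes, and the phone numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecognitionServer:
    phone_number: str
    active_calls: int
    max_calls: int
    cost_per_minute: Dict[str, float]  # hypothetical: caller region -> call cost


def discover(servers: List[RecognitionServer], caller_region: str, limit: int = 3) -> List[str]:
    """Return up to `limit` phone numbers of servers with spare capacity,
    cheapest call first; ties are broken in favor of the least loaded server."""
    available = [
        s for s in servers
        if s.active_calls < s.max_calls and caller_region in s.cost_per_minute
    ]
    available.sort(
        key=lambda s: (s.cost_per_minute[caller_region], s.active_calls / s.max_calls)
    )
    return [s.phone_number for s in available[:limit]]


servers = [
    RecognitionServer("555-0101", 10, 10, {"NC": 0.05}),  # full, must be skipped
    RecognitionServer("555-0102", 2, 10, {"NC": 0.10}),
    RecognitionServer("555-0103", 5, 10, {"NC": 0.05}),
    RecognitionServer("555-0104", 1, 10, {"NC": 0.05}),
]
numbers = discover(servers, "NC")
```

Returning a short list rather than a single number lets the mobile device fall back to the next server if the first call fails.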

The second stage is for the mobile device to call the remote server and transmit the speech audio for speech recognition. The mobile device includes the current location to provide context information for the speech recognition. The mobile device must also include information for the remote server to access the mobile user’s personal recognition profile.

The third stage includes access to the recognition profile, which is protected because it is personal information. Here is one suggestion for securely accessing the recognition profile. The mobile device provides the remote server with an electronic check. The remote server uses the electronic check to contact the mobile user’s data server and access the recognition profile only one time. The recognition profile file is scheduled to expire after one use by the remote server. The third stage ends with the remote server performing the speech recognition.
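The one-time access suggested above could be sketched as follows, assuming the electronic check is a random token that the user’s data server remembers until it is redeemed. The class ProfileStore and its methods are hypothetical, and a real system would also need to authenticate both parties and encrypt the transfer.

```python
import secrets


class ProfileStore:
    """Stands in for the mobile user's data server."""

    def __init__(self):
        self._profiles = {}  # user id -> recognition profile bytes
        self._checks = {}    # outstanding electronic check -> user id

    def add_profile(self, user_id, profile):
        self._profiles[user_id] = profile

    def issue_check(self, user_id):
        """Create a one-time check that the mobile device hands to the remote server."""
        check = secrets.token_urlsafe(16)
        self._checks[check] = user_id
        return check

    def redeem(self, check):
        """Called by the recognition server. The check is removed on first use,
        so a second attempt with the same check is rejected."""
        user_id = self._checks.pop(check, None)
        if user_id is None:
            raise PermissionError("unknown or already used check")
        return self._profiles[user_id]


store = ProfileStore()
store.add_profile("alice", b"profile-data")
check = store.issue_check("alice")
profile = store.redeem(check)  # first use succeeds
```

A second call to store.redeem(check) raises PermissionError, which models the check being good for a single access only.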

The fourth stage is for the remote server to return the text to the mobile device. An SMS message is an appropriate method to return the text because it is intended for private communication between two parties.


Figure 1: Four stages of speech recognition protocol for mobile devices. 
 
 
 


Discussion

This speech recognition application demonstrates how to design networked mobile applications. The system is scalable to support hundreds of mobile users with reasonable latency to convert speech to text. The system securely manages personal data, as long as the components of the system are not compromised. A billing system can be implemented to make the speech recognition service commercially viable. The system can also be made adaptive to manage some speech recognition when the network connectivity is interrupted.

Scalability

A scalable system is one that can support an increased number of users without becoming overloaded. In the speech recognition application, support means completing the users’ requests. The discovery server, which the mobile hosts use to find the remote speech recognition servers, has a small load per user and can manage many requests simultaneously. However, the load per user for the speech recognition server is much higher. Therefore several speech recognition servers are available to manage the users’ requests. The discovery server will load balance the users’ requests among the speech recognition servers.

Latency

The latency of this application is the time from when the speech is spoken to when the mobile device interprets the text translation of the speech. The majority of the latency will come from the speech-to-text computation. Additional latency comes from the communication between the mobile device, discovery server and speech recognition server. These latencies should be only a fraction of the time to perform the speech-to-text conversion.
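The latency budget can be written out as a simple sum. The sketch below only illustrates the bookkeeping; the function names and all of the timing values are made up and are not measurements of any real system.

```python
def total_latency(discovery_s, upload_s, recognition_s, reply_s):
    """End-to-end latency in seconds: discovery request, audio upload,
    speech-to-text computation, and the SMS reply."""
    return discovery_s + upload_s + recognition_s + reply_s


def communication_share(discovery_s, upload_s, recognition_s, reply_s):
    """Fraction of the total latency spent on communication rather than computation."""
    total = total_latency(discovery_s, upload_s, recognition_s, reply_s)
    return (discovery_s + upload_s + reply_s) / total


# Hypothetical values in seconds, chosen only to make the arithmetic easy to follow.
total = total_latency(0.5, 1.5, 12.0, 2.0)
share = communication_share(0.5, 1.5, 12.0, 2.0)
```

With these assumed values the total is 16 seconds and communication accounts for one quarter of it, which matches the expectation that the speech-to-text computation dominates.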

Security

Security is important for this application because personal data is exchanged. The speech audio and text equivalent are personal information. The audio is protected with the same security that a cell-phone call has. The text equivalent of the speech is as safe as the SMS message used to send it. The Register has an article demonstrating the insecurity of SMS because human operators have access to the messages. Insecurity from human operators can be mitigated by encrypting the SMS message. The recognition profile is also private information accessed by the speech recognition server. This information should be secured with the electronic check provided by the mobile device, presumably operated by the authorized owner.

Denial of service (DoS) attacks can be launched at different parts of the system. The discovery and speech recognition servers are protected by the redundancy of the servers. If one server is attacked, alternate servers are available. In the case of a DoS attack that jams the cellular airwaves, however, there is no defense.

Adaptation to available resources

A characteristic of mobile computing is that resources are unreliable and have limited availability, such as cellular coverage. When the cellular connectivity is unavailable, the mobile device has to rely on its own limited resources. With limited resources the system can recognize a limited vocabulary. Current location information can increase recognition accuracy by providing the context of the speech. Hence, as shown in figure 2, the amount of required connectivity is reduced when more spatial information is available. 

Figure 2: Knowledge of location improves the speech recognition and slightly reduces the connectivity dependency. The location information provides some context information that helps to perform speech recognition.

Further information about adaptive speech recognition applications is available from Odyssey research at CMU.

In the worst case, the speech recognition is performed when the cellular connection is established again or when the mobile device connects to a PC that has speech recognition capability. Current voice recorders provide such disconnected service; with some products, the audio stored on the recorder is transcribed to text when uploaded to a PC.
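The adaptation described in this section amounts to a three-step fallback: use the remote server when connected, otherwise try the limited on-device recognizer, and otherwise queue the audio for later. The sketch below assumes both recognizers are supplied as functions; the class AdaptiveRecognizer, its methods, and the stand-in recognizers (which treat audio as plain strings) are hypothetical.

```python
from collections import deque


class AdaptiveRecognizer:
    def __init__(self, remote_recognize, local_recognize):
        self.remote_recognize = remote_recognize  # callable(audio) -> text
        self.local_recognize = local_recognize    # callable(audio) -> text or None
        self.pending = deque()                    # audio waiting for connectivity

    def recognize(self, audio, connected):
        """Return the text, or None if recognition had to be deferred."""
        if connected:
            return self.remote_recognize(audio)
        text = self.local_recognize(audio)  # limited vocabulary on the device
        if text is not None:
            return text
        self.pending.append(audio)  # worst case: keep the audio for later
        return None

    def flush(self):
        """Call when connectivity returns: recognize everything that was deferred."""
        results = []
        while self.pending:
            results.append(self.remote_recognize(self.pending.popleft()))
        return results


# Stand-ins for real recognizers; the "audio" is just a string in this demo.
SMALL_VOCABULARY = {"notepad entry"}


def fake_remote(audio):
    return "remote:" + audio


def fake_local(audio):
    return "local:" + audio if audio in SMALL_VOCABULARY else None


rec = AdaptiveRecognizer(fake_remote, fake_local)
online = rec.recognize("decorate kitchen table", connected=True)
offline_known = rec.recognize("notepad entry", connected=False)
offline_unknown = rec.recognize("decorate kitchen table", connected=False)
later = rec.flush()
```

The queue keeps deferred audio in the order it was spoken, so the notes are transcribed in the same order once the connection is restored.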

Billing

Billing for a service makes it commercially viable. There could be an annual subscription cost. Alternatively, the billing could be for every speech-to-text transaction. The charge for the call from the mobile device to the speech recognition server can include both the cost of the call and the cost of the speech recognition service.
 


Conclusion

Speech recognition interfaces can make mobile devices easier to operate while on the road. This application description explains how to provide a robust and scalable speech recognition service for mobile devices. 
dorian miller, 2/3/2003