Speech recognition interfaces can make mobile devices easier
to operate while on the road. This application description explains how
to provide a robust and scalable speech recognition service for mobile
devices.
Speech recognition application
A speech interface will ease the use of mobile devices. The
current Graffiti and keyboard input techniques are cumbersome, slow, and
require using both hands. With Graffiti a person uses a pen to write letters
in a cryptic alphabet to form words. The keyboards are so small that a
typist pecks at the keys with a pen like a pecking hen. With a speech interface
the person can speak a phrase, which the computer will interpret as a command.
Speaking a thought is fast, easy, and can be done while the hands are otherwise
occupied. Taking notes at the speed of speech is especially useful for
recording inspirational thoughts that might soon be forgotten. For example,
a person walking through a mall is inspired by a store display and has a
creative idea. Speaking the idea through the speech interface
means the person can continue on his/her way without inconveniently stopping
to write or type the idea. The fleeting idea might otherwise be forgotten
if not recorded.
The computer interprets the speech and executes the desired command,
such as recording the thought in the notepad under the category of creative
ideas. The computer must first translate the speech audio to text in order
to interpret the spoken command. The speech-to-text operation is very computationally
intensive and this application description describes how to provide this
service. When the text is available, the computer can execute the command.
Typical syntax for the spoken command is an operation followed by a detail;
such as, “Computer, Notepad entry, creative idea: Decorate kitchen table
with butterflies.” (Surely a thought forgotten if not recorded). The “Computer,
Notepad entry, creative idea” is the command for the computer to store
the message about the kitchen under the category of creative ideas.
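The command syntax above (an addressing word, an operation, a category, then the detail after a colon) can be sketched as a small parser that runs on the text returned by the speech-to-text step. This is only an illustration: the `SpokenCommand` type, the `parse_command` function, and the assumption that commas and a colon survive transcription are hypothetical and not part of the described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpokenCommand:
    operation: str  # e.g. "Notepad entry"
    category: str   # e.g. "creative idea"
    detail: str     # free text after the colon


def parse_command(text: str, wake_word: str = "Computer") -> Optional[SpokenCommand]:
    """Split '<wake word>, <operation>, <category>: <detail>' into its parts.

    Returns None when the recognized text does not follow that pattern.
    """
    head, sep, detail = text.partition(":")
    if not sep:
        return None  # no colon, so there is no detail part
    parts = [p.strip() for p in head.split(",")]
    if len(parts) != 3 or parts[0].lower() != wake_word.lower():
        return None
    return SpokenCommand(operation=parts[1], category=parts[2], detail=detail.strip())


cmd = parse_command(
    "Computer, Notepad entry, creative idea: Decorate kitchen table with butterflies."
)
```

Here `cmd.operation` selects the notepad, `cmd.category` selects the list of creative ideas, and `cmd.detail` is the note to store.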

The type of spoken commands and interaction with the mobile device is
constrained by the speech recognition process. With speech recognition
there is a delay of several seconds between speaking and the output of the
computed text because the process is computationally and memory intensive.
The delay is inherent even on desktop systems with ample computational and
memory resources. Therefore the speech interface is limited to interaction
that tolerates some delay, as in the example of recording the creative
idea. The person will refer to the notepad list of creative ideas at a
later time when he/she has time for it, so it is sufficient for
the speech-to-text translation to be completed later. Another
instance that permits delay is a background process. A person driving to
a meeting can command the computer to bring up all the records relevant to
the meeting. The computer has until the end of the drive, when the
person goes into the meeting, to perform the speech recognition and retrieve
the meeting records.
Having a speech interface introduces social issues because other people
can hear what the mobile computer user is saying. Cell phone users face
similar concerns. Sometimes a person talking on a cell phone appears
to be talking to himself/herself, which is considered peculiar behavior.
More natural communication involves addressing the conversation partner.
The spoken commands to the mobile application can be made more natural
by addressing the mobile device, such as, “Computer, please record creative
idea.”
Providing mobile device with speech recognition
The technical challenge is to provide the mobile user with
the speech recognition service.
The mobile device has limited resources to perform the speech recognition.
A remote server with ample resources can perform the speech recognition
for the mobile device. An infrastructure is required to communicate between
the mobile device and the remote server. This speech recognition application
is envisioned to use PDAs integrated with cell-phones, such as those from
T-Mobile and Qualcomm. The PDA is the person’s mobile computer used for schedules,
notes, office applications, games, etc. The cell-phone’s
capability is used to communicate audio and text data to the remote speech
recognition server. The speech audio is communicated in a computer-to-computer
“phone call” from the mobile device to the remote server. Data is transmitted
between the devices through SMS messages or through the wireless application
protocol (WAP), the protocol used to access web pages on a cell-phone.
Speech recognition process
Examining the process of speech recognition raises three issues
that must be addressed in the mobile application. Firstly, the remote server
that performs the speech recognition requires access to an individual’s
customized speech recognition profile. Secondly, the mobile application
can improve the recognition accuracy by providing context information such
as the current location. Thirdly, noise in the mobile environment complicates
the speech recognition process.
Typically there are four stages to speech recognition, as described by
Dan Ellis. The first stage is to digitally record the speech audio. In
the second stage the audio is compressed with a cepstral transform, which
provides the information to identify phonemes of speech in the third stage.
In the fourth stage words are identified by processing the phonemes with
stochastic modeling, such as a hidden Markov model (HMM).
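The stochastic modeling of the fourth stage can be illustrated with a toy hidden Markov model decoded by the Viterbi algorithm. The sketch below is not taken from the source: the three phoneme states, the noisy observation sequence, and every probability are invented for illustration. It only shows how a likely phoneme sequence can be recovered when the third stage mishears a sound.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical model for the word "gate" (phonemes g, ey, t); all numbers are made up.
states = ["g", "ey", "t"]
start_p = {"g": 0.8, "ey": 0.1, "t": 0.1}
trans_p = {
    "g": {"g": 0.1, "ey": 0.8, "t": 0.1},
    "ey": {"g": 0.1, "ey": 0.1, "t": 0.8},
    "t": {"g": 0.1, "ey": 0.1, "t": 0.8},
}
emit_p = {
    "g": {"g": 0.6, "k": 0.3, "ey": 0.05, "t": 0.05},
    "ey": {"g": 0.03, "k": 0.03, "ey": 0.9, "t": 0.04},
    "t": {"g": 0.05, "k": 0.1, "ey": 0.05, "t": 0.8},
}

# The front end heard "k" instead of "g", for example because of traffic noise.
decoded = viterbi(["k", "ey", "t"], states, start_p, trans_p, emit_p)
```

With these invented numbers the decoder recovers the phonemes of "gate" even though the first sound was misheard, because the model considers "g" a much more likely start than the alternatives.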
Noise in the mobile environment, such as traffic noise, is not ideal
for the speech recognition process. The noise complicates the recognition of
phonemes in the third stage. Additional processing is required
in the first stage to filter the noise from the audio.
Using an individual’s recognition profile can enhance the accuracy of
phoneme recognition in the third stage. The profile provides information
about the individual’s pronunciation of phonemes and is developed further during
every recognition process. With the Dragon NaturallySpeaking speech
recognition software, the profile can be
developed within 30 minutes to give accurate speech recognition. As the
profile is unique to every person, the remote server performing the speech
recognition requires access to the recognition profile.
The accuracy of recognizing words from phonemes in the fourth stage
can be enhanced by context information, such as sentence structure or grammar.
The mobile application can also provide context information, such as the
person’s location. The person’s location might determine likely commands;
for example, at an airport the commands would include “flight number”, “connecting
flight number”, “departure time”, or “gate”.
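One simple way to use the location as context is to favor candidate phrases that are expected at the current location. The sketch below is an assumption about how that could be done, not a description of any particular recognizer: the `LOCATION_PHRASES` table, the `rescore` function, the boost factor, and the candidate scores are all hypothetical.

```python
# Phrases that are likely at a given location (airport entries from the text above).
LOCATION_PHRASES = {
    "airport": {"flight number", "connecting flight number", "departure time", "gate"},
}


def rescore(candidates, location, boost=2.0):
    """Pick the best phrase from (phrase, acoustic_score) pairs.

    Phrases expected at the location have their score multiplied by `boost`.
    """
    expected = LOCATION_PHRASES.get(location, set())

    def score(item):
        phrase, acoustic_score = item
        return acoustic_score * boost if phrase in expected else acoustic_score

    return max(candidates, key=score)[0]


# Two acoustically similar candidates with made-up scores.
candidates = [("late", 0.55), ("gate", 0.45)]
at_airport = rescore(candidates, "airport")
at_office = rescore(candidates, "office")
```

At the airport the boosted "gate" wins over "late"; at a location with no known phrases the acoustic score alone decides.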
The objective of the infrastructure is to provide the mobile
device with the speech recognition service. A remote server with ample
system resources performs the speech recognition. The cell-phone on the
mobile device provides the network connectivity between the mobile device
and the remote server. An advantage of the cellular network is that it already
provides connectivity over large parts of the world. The design goal of
the infrastructure is to provide scalable and robust service to potentially
hundreds of mobile devices.
The protocol for the communication between the mobile device and the speech
recognition service has four stages. First, the mobile
device discovers the phone number of a remote server that is available
and minimizes the cost to the user. Second, the mobile device connects
to the remote server and communicates the speech audio. Third, the remote
server accesses the mobile user’s recognition profile and performs the
speech-to-text computation. Fourth, the text equivalent of the speech is
returned to the mobile device, where the command is interpreted.
A cell-phone uses the short message service (SMS)
and the wireless application protocol (WAP)
to communicate data. WAP is used by cellular phones to access web pages.
Web pages are appropriate for distributing information of interest to many
clients. SMS messages are intended for communication between two
parties.
Cell phones of the future will be equipped with tracking technology.
The E-911 legislation requires the tracking for emergency purposes. The
tracking, however, can also be used for location dependent services.
Detailed communication protocol
The four stages of the protocol are illustrated in figure 1. The first
stage of the protocol is for the mobile device to discover the remote
server, i.e., the phone number to call to transmit the speech audio. The
mobile device accesses a WAP server at a well-known URL. With the WAP request,
the mobile device includes its location as determined
by the cell phone. The WAP server returns a list of the remote servers
best suited for the mobile device. The WAP server load balances the
speech recognition requests among the remote servers and returns the remote
servers that have available capacity. Furthermore, the WAP server selects
the remote server that will be least expensive for the mobile device to
call.
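The selection performed by the WAP discovery server can be sketched as a filter and a sort: drop servers without free capacity, then prefer the cheapest call for the caller's region and, among equally cheap servers, the least loaded one. Everything concrete here is an assumption for illustration: the `RecognitionServer` fields, the region names, the per-minute costs, and the phone numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class RecognitionServer:
    phone_number: str
    active_calls: int
    capacity: int
    cost_per_minute: dict  # caller region -> cost of calling this server


def discover(servers, caller_region, limit=3):
    """Return phone numbers of suitable servers, best candidate first."""
    available = [
        s for s in servers
        if s.active_calls < s.capacity and caller_region in s.cost_per_minute
    ]
    # Cheapest call first; ties are broken by the lower load fraction.
    available.sort(
        key=lambda s: (s.cost_per_minute[caller_region], s.active_calls / s.capacity)
    )
    return [s.phone_number for s in available[:limit]]


servers = [
    RecognitionServer("555-0100", 10, 10, {"US": 0.05}),              # full
    RecognitionServer("555-0101", 3, 10, {"US": 0.05, "EU": 0.40}),
    RecognitionServer("555-0102", 1, 10, {"US": 0.40, "EU": 0.05}),
    RecognitionServer("555-0103", 8, 10, {"US": 0.05}),
]
numbers_for_us = discover(servers, "US")
```

In this sketch the region stands in for the location sent with the WAP request; a real deployment would map the reported location to a tariff zone in whatever way the operator's pricing requires.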
The second stage is for the mobile device to call the remote
server and transmit the speech audio for speech recognition. The mobile
device includes the current location to provide context information for
the speech recognition. The mobile device must also include information
for the remote server to access the mobile user’s personal recognition
profile.
In the third stage the remote server accesses the recognition profile, which
is protected because it is personal information. Here is one suggestion
for securely accessing the recognition profile. The mobile device provides
the remote server with an electronic check. The remote server uses the
electronic check to contact the mobile user’s data server and access the
recognition profile only one time. The recognition profile is scheduled to
expire after one use by the remote server. The third stage ends with the
remote server performing the speech recognition.
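The text does not specify how the electronic check is built. One possible realization, shown here purely as an assumption, is a single-use token: the mobile device and the user's data server share a secret key, the device signs a random nonce with an HMAC, and the data server releases the profile only if the signature is valid and the nonce has not been redeemed before. The names `issue_check` and `ProfileDataServer` are hypothetical.

```python
import hashlib
import hmac
import secrets


def issue_check(user_id: str, shared_key: bytes) -> dict:
    """Run on the mobile device: create a single-use electronic check."""
    nonce = secrets.token_hex(16)
    tag = hmac.new(shared_key, f"{user_id}:{nonce}".encode(), hashlib.sha256).hexdigest()
    return {"user_id": user_id, "nonce": nonce, "tag": tag}


class ProfileDataServer:
    """Holds recognition profiles and honors each check at most once."""

    def __init__(self):
        self._keys = {}      # user_id -> shared key
        self._profiles = {}  # user_id -> recognition profile
        self._spent = set()  # nonces that were already redeemed

    def register(self, user_id: str, shared_key: bytes, profile: bytes) -> None:
        self._keys[user_id] = shared_key
        self._profiles[user_id] = profile

    def redeem(self, check: dict) -> bytes:
        """Called by the speech recognition server presenting the check."""
        key = self._keys.get(check["user_id"])
        if key is None:
            raise PermissionError("unknown user")
        message = f"{check['user_id']}:{check['nonce']}".encode()
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, check["tag"]):
            raise PermissionError("invalid check")
        if check["nonce"] in self._spent:
            raise PermissionError("check already used")
        self._spent.add(check["nonce"])
        return self._profiles[check["user_id"]]


key = secrets.token_bytes(32)
data_server = ProfileDataServer()
data_server.register("alice", key, b"alice-profile")
check = issue_check("alice", key)
profile = data_server.redeem(check)  # first use succeeds
```

A second call to `redeem` with the same check raises `PermissionError`, which matches the requirement that the profile be accessed only one time per check. The sketch does not cover expiry of the copy held by the recognition server or protection of the check in transit.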
The fourth stage is for the remote server to return the text
to the mobile device. An SMS message is an appropriate method to return the
text because it is intended for private communication between two parties.

Figure 1: Four stages of speech recognition protocol for mobile
devices.
This speech recognition application demonstrates how to design
networked mobile applications. The system is scalable to support hundreds
of mobile users with reasonable latency to convert speech to text. The
system securely manages personal data, as long as the components of the system
are not compromised. To be commercially viable, the system can implement
a billing system for the speech recognition service. The system can also
be made adaptive to manage some speech recognition when the network connectivity
is interrupted.
Scalability
A scalable system is one that can support an increased number of users
without becoming overloaded. Support in the speech recognition application
means completing the users’ requests. The server used to discover the
remote speech recognition servers for the mobile hosts has a small load
per user and can manage many requests simultaneously. However, the load
per user for the speech recognition server is much higher. Therefore several
speech recognition servers are available to manage the users’ requests.
The discovery server will load balance the users’ requests among the speech
recognition servers.
Latency
The latency of this application is the time from when the speech is spoken to
when the mobile device interprets the text translation of the speech. The
majority of the latency will be from the speech-to-text computation.
Additional latency comes from the communication between the mobile device,
discovery server and speech recognition server. These latencies should
be only a fraction of the time to perform the speech-to-text conversion.
Security
Security is important for this application because personal data is
exchanged. The speech audio and text equivalent are personal information.
The audio is protected with the same security that a cell-phone call has.
The text equivalent of the speech is as safe as the SMS message used to
send it. The Register has an article demonstrating the insecurity of SMS
because human operators
have access to the messages. Insecurity from human operators can be mitigated
by encrypting the SMS message. The recognition profile is also private
information accessed by the speech recognition server. This information
should be secured with the electronic check provided by the mobile device,
presumably operated by the authorized owner.
Denial of service attacks can be launched at different parts of the
system. The discovery and speech recognition servers are protected by the
redundancy of the servers. If one server is attacked, alternate servers
are available. In the case of a DoS attack that jams
the cellular airwaves, however, there is no defense.
Adaptation to available resources
A characteristic of mobile computing is that resources are unreliable
and have limited availability, such as cellular coverage. When the cellular
connectivity is unavailable, the mobile device has to rely on its own limited
resources. With limited resources the system can recognize a limited vocabulary.
Current location information can increase recognition accuracy by providing
the context of the speech. Hence, as shown in figure 2, the amount of required
connectivity is reduced when more spatial information is available.

Figure 2: Knowledge of location improves the speech recognition
and slightly reduces the connectivity dependency. The location information
provides some context information that helps to perform speech recognition.
Further information about adaptive speech recognition applications is
available from Odyssey
research at CMU.
In the worst case, the speech recognition is performed when the cellular
connection is established again or the mobile device connects to a PC that
has speech recognition capability. Current voice recorders provide such disconnected
service; the audio stored on a recorder is transcribed to text when uploaded
to a PC (in some products).
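The worst-case behavior described above amounts to a store-and-forward queue: audio is recognized immediately while connected and is held on the device otherwise, then transcribed in order once a connection returns. The following sketch assumes a hypothetical `DeferredRecognizer` class and a `recognize_remote` callable standing in for the call to the speech recognition server; neither is defined by the source.

```python
from collections import deque


class DeferredRecognizer:
    """Recognize speech now when connected, otherwise queue it for later."""

    def __init__(self, recognize_remote):
        self._recognize_remote = recognize_remote  # callable: audio bytes -> text
        self._pending = deque()                    # audio waiting for a connection
        self.connected = False

    def submit(self, audio: bytes):
        """Return the text if connected, else store the audio and return None."""
        if self.connected:
            return self._recognize_remote(audio)
        self._pending.append(audio)
        return None

    def on_reconnect(self):
        """Transcribe everything recorded while disconnected, oldest first."""
        self.connected = True
        results = []
        while self._pending:
            results.append(self._recognize_remote(self._pending.popleft()))
        return results


# Stand-in for the remote service: pretends the audio bytes are the spoken words.
recognizer = DeferredRecognizer(lambda audio: audio.decode().upper())
first = recognizer.submit(b"notepad entry")    # no coverage, so it is queued
second = recognizer.submit(b"meeting records")
flushed = recognizer.on_reconnect()            # coverage is back
third = recognizer.submit(b"gate")             # now recognized immediately
```

The same queue could be drained by a PC with speech recognition capability instead of the cellular service, which is the voice recorder case mentioned above.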
Billing
Billing for a service makes it commercially viable. There could be an
annual subscription cost. However, the billing could also be for every
speech-to-text transaction. The charge for the call from the mobile device
to the speech recognition server can include both the cost of the call and the
cost of the speech recognition service.