The PDF to MP3 project provides alternative access to academic
documents for the blind, weak readers, and others. Using an MP3 player, the
person listens to a synthesized reading of the paper. The audio is divided
into tracks and read in multiple voices to support the listener’s navigation
and comprehension. The software developed performs the PDF to MP3 conversion
using the Python programming language and third-party components.
Table of contents
- Introduction
- Related work
- Audio layout to support comprehension and navigation
- Annotations convey document structure
- Practical to use
- Design and prototype
- Evaluation
- Future Work
- References
Introduction
Some people must be provided with alternative access to the
power of the written word. The blind are unable to read conventional books.
People with dyslexia or other reading disabilities have trouble efficiently
decoding the written word. Even people without a disability could benefit
from an alternative to reading material; for example, when riding the bus,
instead of reading they could listen to the text as an audiotext. A computer
system enables a person to access audiotexts, which convey much the same
information as conventional books.
Audio versions of text must provide a literal reading and also convey
the text’s structure. Text structure conveys meaning. Comprehension of
writing comes from understanding a series of independent consecutive points.
Sections or chapters represent different parts of a document’s main idea.
Furthermore, paragraphs are sub-points of the idea expressed in a section.
In printed material, the format of section headings and whitespace between
paragraphs implicitly indicates the transition between individual ideas.
Similar format information must also be conveyed in audiotexts.
The organization of a document’s sections is also important for how
a person reads the document. Newspapers, manuals, and academic papers are
examples of documents that can be accessed in random order; that is, instead
of reading the document front to back, the reader chooses to read an individual
article or section. The focus of this project is to provide audiotext versions
of academic papers to convey the text and document structure. An academic
paper is usually 6 to 15 pages and broken into sections like this paper.
The typical reader might skim the paper front to back but then randomly
access sections to learn more about the details.
This paper describes the design, prototype, and evaluation of software and
hardware that provide the audiotext. Software automatically
converts the electronic text file in Portable Document Format (PDF) into
MPEG-1 audio layer 3 (MP3) audio files, which can be listened to on an
appropriate player. Images in the academic paper, however, are ignored.
PDF is a standard format to distribute academic papers. The MP3 audio file
format is chosen because users can easily afford a simple-to-use and versatile
MP3 player. The concepts discussed, however, apply more generally to any
kind of electronic document and audio format.
Related work
This section is an overview of different computer solutions
that provide audiotexts. Each solution has its own techniques for navigating
between pages, sections, and paragraphs, as well as its own special features.
Books on tape, the oldest form of audio book, started in the 1970s and
continue to be provided by the Recordings for Blind and Dyslexic (RFBD)
[1]. Readings of a book are stored on tapes. A special tape recorder is
used to listen to the four sides of the tape; besides the two sides of
a normal tape, two more sides are available by reversing the play direction.
The tape recorder also provides a dial to vary the playback speed; listeners
familiar with the text or just skimming it can increase the tape speed.
The inconvenience of books on tape is that one book requires a large
number of tapes, and it is tedious to flip, rewind, and fast-forward the tapes
to find the desired section. To assist with navigation, however, beeps
heard while fast-forwarding or rewinding indicate page transitions.
The listener locates information with an index card that accompanies
the tapes and lists the tape, side, and play direction for each page.
The digital talking book (DTB) is a standard for audio books developed by
the Digital Accessible Information System (DAISY) [2] consortium and standardized
in 2002 by the National Information Standards Organization (NISO). The
DTB standard describes how multimedia information, such as audio files,
text files and images, are composed to create an audio book. The standard
is flexible and combines variable amounts of audio and corresponding text,
which enables text searches. Complete audio and text are not always needed;
a dictionary, for example, might have the complete text but audio
only for pronunciations. A DTB viewer program can use the text to send
word definitions to a Braille display. A DTB viewer also lets the reader
navigate efficiently between or within sections, because of the hierarchical
document structure defined by the DTB standard.
Blind people rely on screen reader software, such as JAWS, to use a
PC and read electronic documents. The screen reader reads all text that
appears on a screen. Navigating the PC desktop and applications is possible
with a series of keystrokes. Electronic documents, such as web pages,
are also navigated with keystrokes, which enable moving between pages,
lines, and words. The navigation, however, is limited because there are
no direct keystrokes to find the beginning of sections, paragraphs, or
sentences.
The original motivation for this project was to provide access to PDF
files, which until recently were not accessible with screen readers. In
the meantime, however, Adobe has released Acrobat Reader 5.1, the standard
PDF viewer, with screen reader accessibility. Regardless of Adobe’s
recent development, listening to audiotexts on an MP3 player
provides unique and convenient access to text material.
Audio layout to support comprehension and navigation
To improve the listener’s comprehension, the layout of the audiotext
supports the active listening strategy [3]. The active listening strategy
is advocated by RFBD to improve people's comprehension when listening to
books on tape. The strategy has two stages. The first stage is for the
reader to gain an overview of the document by skimming the material; for
example, reading the abstract, section headings, and thinking about how
the sections relate to the overall idea. In the second stage, the reader
reads the complete text. In the audiotext the overview material is placed
before the complete text so that the listener does not have to search for
the overview material. So the title, author information, abstract and section
headings are heard first like the table of contents in a book. Following
the overview are the sections in their entirety. With this technique the
listener can gain an overview before listening to the rest of the paper.
The audiotext is divided into individual tracks to enable a listener
to easily traverse the document. Each part of the overview information
(title, author information, and each section heading) is in a separate
track. Each section in its entirety is also on an individual track.
Individual tracks are easy to access by moving back and forth between tracks.
The typical paper with 10 sections will have approximately 25 tracks, which
is a reasonable amount for a listener to flip through. Dividing the paper
further into smaller portions, such as paragraphs, would drastically increase
the track-count and thereby make it more cumbersome for the user to flip
between sections.
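The track layout described above can be sketched in Python as follows. This is a minimal illustration rather than the prototype's code: the `Track` class, the `build_tracks` function, and the placeholder section texts are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Track:
    kind: str  # "title", "authors", "abstract", "heading", or "section"
    text: str


def build_tracks(title, authors, abstract, sections):
    """Order the tracks: overview first, then each section in full.

    `sections` is a list of (heading, body) pairs.
    """
    tracks = [
        Track("title", title),
        Track("authors", authors),
        Track("abstract", abstract),
    ]
    # Overview: one short track per numbered section heading.
    for number, (heading, _body) in enumerate(sections, start=1):
        tracks.append(Track("heading", f"{number}. {heading}"))
    # Full text: one track per section, starting with its section number.
    for number, (heading, body) in enumerate(sections, start=1):
        tracks.append(Track("section", f"{number}. {heading}. {body}"))
    return tracks


# Hypothetical paper with 10 sections.
paper_sections = [(f"Heading {i}", f"Body {i}") for i in range(1, 11)]
tracks = build_tracks("A title", "An author", "An abstract", paper_sections)
print(len(tracks))  # 3 overview tracks + 10 headings + 10 sections = 23
```

Under these assumptions a paper with 10 sections yields 23 tracks, in line with the estimate of approximately 25.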
With this technique the listener can efficiently access a desired section.
First the listener can flip between the section headings (corresponding
tracks) in the overview to find the section he/she wants to read. Then
by means of the section number included with the section headings, the
listener can browse to the desired section; the section number announced
at the start of each section indicates if the desired section is before
or after the current track.
The drawback of this design is that navigating within a section is complicated.
Each section is assigned to a single track, which means that navigation within it
is by rewinding or fast-forwarding. Finding the beginning of a paragraph
or sentence is hit or miss. Future work will hopefully improve
the navigation within a section.
Annotations convey document structure
Annotating the original text of the document conveys the document
structure in the audiotext, which is important for the listener’s comprehension.
In this work annotations take the form of spoken keywords or beeps.
Some of the tracks are annotated by spoken words with additional information
to help the listeners orient themselves in the audiotext. The tracks for
the title, author information, abstract, and sections start with the corresponding
keywords. When the listener hears the keyword, he/she will know the approximate
position in the paper and whether it is necessary to move forward or backward
to reach the desired part of the document. The tracks representing the
section headings are also identifiable because they do not have keywords.
Listeners are most likely to listen to these short tracks consecutively
so keywords would be distracting here.
Emphasizing the keyword distinguishes it from the same words appearing
in the text and alerts the listener to the newly started track. The keywords
are emphasized by changing the characteristics of the voice that speaks them;
for example, the majority of the text may be read in a female
voice while the keywords are spoken in a male voice. Hearing the emphasized
“section” keyword between sections alerts the listener to a section transition
and transition between major ideas.
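The keyword annotation can be sketched as a function that turns one track into a list of (voice, text) segments for the speech engine. The keyword wording, the voice labels, and the `annotate` function are assumptions made for illustration; the paper does not specify them.

```python
# Hypothetical keyword for each kind of track; heading tracks have none.
KEYWORDS = {
    "title": "Title",
    "authors": "Authors",
    "abstract": "Abstract",
    "section": "Section",
}


def annotate(kind, text, keyword_voice="male", text_voice="female"):
    """Return the (voice, text) segments to be spoken for one track."""
    segments = []
    keyword = KEYWORDS.get(kind)
    if keyword is not None:
        # The keyword is spoken in a different voice to set it apart.
        segments.append((keyword_voice, keyword))
    segments.append((text_voice, text))
    return segments


print(annotate("section", "2. Related work. This section is an overview."))
print(annotate("heading", "2. Related work"))
```

A section track starts with the emphasized keyword, while a heading track in the overview is read without one.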
Emphasizing printed structure by changing voice characteristics has
been explored in other research and could also be applied to this application.
T. V. Raman uses the technique to emphasize the structure of mathematical
formulas and tables [4]; for example, when reading mathematical
equations, subscripts are read with a deeper voice. The voice characteristics
are unique enough to be used in conjunction with the keywords and beeps
and not confuse the listener.
Structural information can also be conveyed by beeps, which is an appropriate
technique for experienced listeners. A short beep between paragraphs signifies
the transition between them. Listeners familiar with audiotexts will
understand the meaning of the beep, whereas a novice user may not. Experienced
listeners might also appreciate a shorter audiotext, and
a beep is likely to be shorter than a keyword. Ideally, the listener can
choose whether the annotations are keywords or beeps.
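A paragraph beep could be produced with Python's standard library alone. The sketch below is one possible way to do it, not the prototype's method; the frequency, duration, sample rate, and the `make_beep` function are assumptions.

```python
import io
import math
import struct
import wave


def make_beep(frequency_hz=880.0, duration_s=0.15, sample_rate=22050, volume=0.5):
    """Return the bytes of a mono 16-bit WAV file containing a sine-wave beep."""
    n_samples = int(duration_s * sample_rate)
    frames = bytearray()
    for n in range(n_samples):
        sample = volume * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        # Pack each sample as a little-endian signed 16-bit integer.
        frames += struct.pack("<h", int(sample * 32767))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


beep = make_beep()
```

The resulting bytes can be written to a file and placed between the audio of two paragraphs before the track is encoded.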
Practical to use
The PDF to MP3 conversion tool will only be useful if it is
convenient and practical, which is the case with MP3 technology. The tool
is used in two stages. The first stage is the automatic conversion from
PDF to MP3. The second stage is for the user to listen to the MP3 tracks
on an MP3 player.
The conversion process is convenient because it is automated. The user
provides the parameters for a customized reading voice and a PDF, which
is automatically converted into MP3 files. Given the proper settings, the
files are downloaded directly onto the MP3 player. Although the process
is automated, it still takes several minutes for the computationally intensive
conversion process and for copying files to the MP3 player.
The MP3 player has a handy form factor, like that of a Walkman, and is convenient
to use. The user can choose the most comfortable location to listen to
a document, such as on a couch at home or outside on a grassy field. Although
the user could listen to MP3s on a laptop, moving a laptop is less convenient
than moving an MP3 player.
An MP3 player is also financially practical as it is a mainstream consumer
product. The high sales volume and innovations in computer hardware
are likely to further decrease the price. Besides listening to audio documents,
the user can use the same hardware to listen to music files, as originally
intended. Using an MP3 player is more practical than the BookCourier [5]
alternative. The BookCourier is a specialized text-to-speech device; the
text is downloaded to the device and read aloud. Although it has the same form
factor, its price is less likely to decrease because there is a limited
market for it.
One function not readily available on MP3 players is changing the playback
speed. The user can set the reading speed before the PDF
conversion process but cannot change it while listening. However,
certain MP3 players are programmable, such as the Archos Jukebox [6] and the Neuros
[7], and can be modified to provide this functionality. Furthermore, using
a phase vocoder, the speed can be changed without changing the pitch [8].
In the frequency domain the frequencies are multiplied by the inverse of
the rate change. To preserve the audio signal, the phase corresponding to
the original frequency is matched to the new frequency. The time-domain
equivalent of the new audio signal is played back at the new rate.
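For reference, one common textbook formulation of the phase vocoder works on short-time Fourier transform frames. It is given here as a sketch and may differ in detail from the description above and from [8]. The symbols are assumptions introduced for illustration: X_m[k] is frequency bin k of analysis frame m and has phase \varphi_m[k], N is the frame length, H_a and H_s are the analysis and synthesis hop sizes in samples, and r is the playback rate.

```latex
\begin{align*}
\Omega_k &= \frac{2\pi k}{N}, \qquad H_s = \frac{H_a}{r} \\
\omega_m[k] &= \Omega_k + \frac{1}{H_a}\,
  \operatorname{princarg}\bigl(\varphi_m[k] - \varphi_{m-1}[k] - \Omega_k H_a\bigr) \\
\psi_m[k] &= \psi_{m-1}[k] + H_s\,\omega_m[k] \\
Y_m[k] &= \lvert X_m[k] \rvert \, e^{j\psi_m[k]}
\end{align*}
```

Here princarg wraps its argument into the interval (-\pi, \pi]. The frames Y_m are transformed back to the time domain and overlap-added H_s samples apart, so a rate r greater than 1 shortens the audio while each bin keeps its estimated frequency, which leaves the pitch unchanged.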
Design and prototype
The PDF to MP3 conversion can be implemented in three
stages by combining existing techniques.
The first stage, the most complex, extracts the text and structure from
the electronic documents. Extracting text from electronic documents is
straightforward; extracting structure, however, is not. Most document formats
only preserve typesetting and layout information about the text. The structure,
such as title, section headings, or footnotes, is not preserved. The challenge
is to infer the structure from the text style; for example, headings are
likely to be bold and with a slightly larger font size than the main text.
However, there is no standard so it is difficult to determine the criteria
that apply in all cases. The exception is LaTeX documents, which explicitly
encode structure. The LaTeX source, however, is not widely distributed
because it is converted into other formats, such as PDF.
The second stage is to perform a text-to-speech (TTS) conversion and
modify the synthesized voice to reflect the paper structure. TTS is thoroughly
studied and several practical solutions exist. In TTS, the phonetics of
each syllable are determined and the corresponding audio is produced. Altering
the pitch and speed of spoken words can change the characteristics of the
voice. Ongoing efforts in the field aim to make the robotic-sounding synthetic
voices more humanlike.
The final stage is to store the audio in a file, which is a matter of
choosing a file format. The WAV file format holds raw audio data, which
a sound card uses to reproduce the sound. The WAV data can be considerably
compressed when converted into the MP3 format [9]. Depending on the audio
quality and the number of channels, the WAV to MP3 compression ratio ranges
from 12:1 to 24:1. The audio quality can be reduced as long as the recorded
synthesized speech is still comfortable to listen to.
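To give a feel for these ratios, the file sizes can be estimated with a little arithmetic. The recording parameters below (30 minutes of 22,050 Hz, 16-bit, mono audio) and both function names are assumptions chosen for illustration, not measurements from the prototype.

```python
def wav_bytes(duration_s, sample_rate=22050, bytes_per_sample=2, channels=1):
    """Size of the raw PCM data in a WAV file, excluding the header."""
    return duration_s * sample_rate * bytes_per_sample * channels


def mp3_bytes(wav_size, compression_ratio):
    """Approximate MP3 size for a given WAV to MP3 compression ratio."""
    return wav_size / compression_ratio


wav_size = wav_bytes(30 * 60)  # assumed 30-minute audiotext
print(f"WAV: {wav_size / 1e6:.1f} MB")  # WAV: 79.4 MB
for ratio in (12, 24):
    # 12:1 gives about 6.6 MB and 24:1 about 3.3 MB
    print(f"{ratio}:1 -> {mp3_bytes(wav_size, ratio) / 1e6:.1f} MB")
```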
The prototype of the described system has been implemented from existing
software components. The Python [10] programming language is used to combine
the different components of the system. The “pdftohtml” [11] software extracts
text and font settings from a PDF and converts them into HTML. The Python
HTML parser reads in the HTML file, divides the paper’s components based
on format, and stores the text in a data structure. The text in the data
structure is processed by Microsoft's TTS [12] engine using the male voice
to emphasize keywords and the female voice to read text. Microsoft Speech
Application Programming Interface (SAPI) also provides the functionality
to save the spoken word into WAV files. The WAV files are converted to
MP3 files using the LAME [13] software. The final audiotext is about 30 to 45
minutes long, depending on paper length and reading speed.
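The two command-line tools in this pipeline could be driven from Python roughly as sketched below. The function only builds the argument lists and runs nothing; the file layout and the `conversion_commands` name are assumptions, only the basic invocations with no options are shown, and the Windows-only speech synthesis step between the two commands is left out.

```python
from pathlib import Path


def conversion_commands(pdf_path, work_dir):
    """Return argument lists for the two external tools (nothing is executed)."""
    pdf = Path(pdf_path)
    work = Path(work_dir)
    html_base = work / pdf.stem          # hypothetical output location
    wav = work / f"{pdf.stem}.wav"       # assumed to be written by the TTS step
    mp3 = work / f"{pdf.stem}.mp3"
    return [
        # pdftohtml <PDF file> <output name>: extract text and font settings
        ["pdftohtml", str(pdf), str(html_base)],
        # lame <input WAV> <output MP3>: encode the synthesized speech
        ["lame", str(wav), str(mp3)],
    ]


commands = conversion_commands("paper.pdf", "out")
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are installed and the WAV file exists.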
The prototype is limited to converting papers of a fixed format, in
which headings are the only bold text. The parts of the paper, such as
title, abstract, and sections, are identified as being between headings.
Documents with other formats could be converted but would be arbitrarily divided
into audio tracks. The listener could still listen to the paper but would
not have convenient access to sections.
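The fixed-format assumption, that headings are the only bold text, can be illustrated with Python's standard HTML parser. This is a minimal sketch, not the prototype's parser: the class and function names are hypothetical, and it assumes that bold text appears inside `<b>` tags in the HTML.

```python
from html.parser import HTMLParser


class BoldHeadingSplitter(HTMLParser):
    """Split HTML text into [heading, body] parts, treating bold text as headings."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._bold_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._bold_depth > 0:
            self._bold_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        if self._bold_depth > 0:
            if self.parts and not self.parts[-1][1]:
                # Consecutive bold runs extend the same heading.
                self.parts[-1][0] += " " + text
            else:
                self.parts.append([text, ""])
        elif self.parts:
            body = self.parts[-1][1]
            self.parts[-1][1] = f"{body} {text}".strip()
        # Text before the first bold heading is ignored in this sketch.


def split_on_bold(html):
    parser = BoldHeadingSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return [tuple(part) for part in parser.parts]


sample = "<b>Abstract</b><br>A short summary.<br><b>1. Introduction</b><br>Some text."
print(split_on_bold(sample))
# [('Abstract', 'A short summary.'), ('1. Introduction', 'Some text.')]
```

A document that uses bold for emphasis inside its paragraphs would be split in the wrong places, which is the limitation described above.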
Evaluation
The purpose of the evaluation is to measure the listener's
comprehension of an audiotext. The evaluation should be performed
with blind readers, weak readers, and others interested in
alternative access to text.
Although a formal study was not performed, one could be
carried out as follows. All participants are given an audiotext on
a general subject and a chance to practice using the MP3 player. The effectiveness
of the audiotext is measured by the time and accuracy with which the participants
complete a questionnaire about the audiotext. At the end of the experiment
the participants can share their experiences of using the audiotext and
suggest improvements.
It will be interesting to compare the performance of those who
have used a form of audiotext with that of those who have not. My hypothesis is
that those experienced with audiotexts will outperform those with no audiotext
experience.
A possible enhancement to the experiment might improve the listener's
comprehension. In addition to listening to the audiotext, the participant
has another form of access to the text; for example, the sighted follow along
on a paper copy and the blind use a Braille display. I hypothesize
that comprehension will improve because the listener's attention is
more focused on the content. Also, having two forms of the text might
make the facts more memorable and therefore improve comprehension.
Future Work
The audiotext navigation techniques explored in this project
can be enhanced and possibly applied to electronic books, which display
one page at a time.
With programmable MP3 players it will be possible to expand the interface;
the functions of the play, stop, forward, and backward buttons can be replaced.
The paper can be divided down to the sentence level and organized in a hierarchy
of folders supported by the MP3 player; for example, at the first level are
sections, at the second paragraphs, and at the third sentences. Then the
buttons on the MP3 player could be used to navigate the folders.
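The proposed folder hierarchy could be generated along the following lines. This is a sketch only: the folder and file naming scheme, the `sentence_paths` function, and the naive sentence splitter are assumptions, and real text would need a more careful splitter for abbreviations and numbers.

```python
import re
from pathlib import PurePosixPath


def sentence_paths(sections):
    """Map each sentence to a path such as section01/paragraph02/sentence003.mp3.

    `sections` is a list of sections, each a list of paragraph strings.
    """
    paths = []
    for s, paragraphs in enumerate(sections, start=1):
        for p, paragraph in enumerate(paragraphs, start=1):
            # Naive split: a sentence ends at ., ! or ? followed by whitespace.
            pieces = re.split(r"(?<=[.!?])\s+", paragraph.strip())
            sentences = [piece for piece in pieces if piece]
            for n, sentence in enumerate(sentences, start=1):
                path = PurePosixPath(
                    f"section{s:02d}", f"paragraph{p:02d}", f"sentence{n:03d}.mp3"
                )
                paths.append((str(path), sentence))
    return paths


example = [["First point. Second point.", "Third point."], ["Fourth point."]]
for path, sentence in sentence_paths(example):
    print(path, "<-", sentence)
```

Each sentence would then be synthesized into the file at its path, so that the player's folder navigation moves between sections, paragraphs, and sentences.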
The navigation techniques used for an audiotext also apply to electronic
books. Unlike a conventional book, an electronic book does not provide
the same easy method of flipping pages to find the desired information.
Some simple navigation techniques include flipping through consecutive pages or
jumping directly to a page number. Enabling access to sections can enhance
the navigation.
References
1. Recordings for Blind and Dyslexic. http://www.rfbd.org/. April 15, 2003.
2. Digital Accessible Information SYstem. http://www.daisy.org. April 15, 2003.
3. Video on active listening. RFBD, 1998.
4. Raman, T. V. Emacspeak -- A Speech Interface. In CHI, 1996.
5. BookCourier. http://www.ostrichsoftware.com/. April 15, 2003.
6. Archos Jukebox. http://www.archos.com/. April 15, 2003.
7. Neuros. http://www.neurosaudio.com/store/prod_neuros.asp. April 15, 2003.
8. Robinson, A. Changing the Speed of Music Without Changing the Pitch (technical discussion). http://www.seventhstring.demon.co.uk/xscribe/slowdown.html. April 15, 2003.
9. Audio & Multimedia MPEG Audio Layer-3. http://www.iis.fraunhofer.de/amm/techinf/layer3/. April 15, 2003.
10. Python programming language home-page. http://www.python.org/. April 15, 2003.
11. Kruk, M. PDF to HTML conversion tool. http://pdftohtml.sourceforge.net/. April 15, 2003.
12. Microsoft Text-to-Speech research. http://research.microsoft.com/srg/. April 15, 2003.
13. LAME Ain't an Mp3 Encoder (LAME). http://lame.sourceforge.net/. April 15, 2003.