The PDF to MP3 project provides alternative access to
academic documents for the blind, weak readers and others. Using an MP3
player the person listens to a synthesized reading of the paper. The
audio is divided into tracks and read in multiple voices to accommodate
the listener’s navigation and comprehension. The developed software
performs the PDF to MP3 conversion using the Python programming language
and third-party components.
Table of contents
- Introduction
- Related work
- Audio layout to support comprehension and navigation
- Annotations convey document structure
- Practical to use
- Design and prototype
- Evaluation
- Future Work
- References
Introduction
Some people must be provided with alternative access to the
power of the written word. The blind are unable to read conventional
books. People with dyslexia or other reading disabilities have
trouble efficiently decoding the written word. Even people without a
disability could benefit from an alternative to reading; for
example, when riding the bus they could listen to the text as an
audiotext instead of reading it. A computer system enables a person to
access audiotexts, which convey information similar to that gained from
reading conventional books.
Audio versions of text must provide a literal reading and also convey
the text’s structure. Text structure conveys meaning. Comprehension of
writing comes from understanding a series of independent consecutive
points. Sections or chapters represent different parts of a document’s
main idea. Furthermore, paragraphs are sub-points of the idea expressed
in a section. In printed material, the format of section headings and
whitespace between paragraphs implicitly indicates the transition
between individual ideas. Similar format information must also be
conveyed in audiotexts.
The organization of a document’s sections is also important for how a
person reads the document. Newspapers, manuals, and academic papers are
examples of documents that can be accessed in random order; that is,
instead of reading the document front to back, the reader chooses to read
an individual article or section. The focus of this project is to
provide audiotext versions of academic papers to convey the text and
document structure. An academic paper is usually 6 to 15 pages and
broken into sections like this paper. The typical reader might skim the
paper front to back but then randomly access sections to learn more
about the details.
The design, prototype, and evaluation of software and hardware that
provide the audiotext are described in this paper. Software
automatically converts an electronic text file in Portable Document
Format (PDF) into MPEG-1 Audio Layer 3 (MP3) audio files, which can be
listened to on an appropriate player. Images in the academic paper,
however, are ignored. PDF is a standard format for distributing academic
papers. The MP3 audio file format is chosen because users can easily
afford a simple-to-use and versatile MP3 player. The concepts discussed,
however, apply more generally to any kind of electronic document and
audio format.
Related work
This section gives an overview of different computer solutions
that provide audiotexts. Each solution has its own techniques for
navigating between pages, sections, and paragraphs, as well as its own
special features.
Books on tape, the oldest form of audio book, started in the 1970s and
continue to be provided by the Recordings for Blind and Dyslexic (RFBD)
[1]. Readings of a book are stored on tapes. A special tape recorder is
used to listen to the four sides of the tape; besides the two sides of a
normal tape, two more sides are available by reversing the play
direction. The tape recorder also provides a dial to vary the playback
speed; listeners familiar with the text or just skimming it can increase
the tape speed. The inconvenience of books on tape is that one
book requires a large number of tapes and it is tedious to flip and
rewind or fast-forward the tape to find the desired section. To
assist with navigation, however, beeps sound while fast-forwarding or
rewinding to indicate page transitions. The listener finds relevant
information with an index card accompanying the tapes, which lists
the tape, side, and direction on which each page is found.
The digital talking book (DTB) is a standard for audio books developed by
the Digital Accessible Information System (DAISY) [2] consortium and
standardized in 2002 by the National Information Standards Organization
(NISO). The DTB standard describes how multimedia information, such as
audio files, text files, and images, is composed to create an audio
book. The standard is flexible and combines variable amounts of audio
and corresponding text, which enables text searches. Complete
audio and text are not always needed; a dictionary, for example, might
have the complete text but audio only for pronunciations. A DTB viewer
program can use the text to send word definitions to a Braille
display. With a DTB viewer the reader can also navigate efficiently
between or within sections, because of the hierarchical document
structure defined by the DTB standard.
Blind people rely on screen reader software, such as JAWS, to use a
PC and read electronic documents. The screen reader reads all text that
appears on a screen. Navigating the PC desktop and applications is
possible with a series of keystrokes. Electronic documents, such
as web pages, are also navigated with keystrokes, which enable moving
between pages, lines, and words. The navigation, however, is limited
because there are no direct keystrokes to find the beginning of
sections, paragraphs, or sentences.
The original motivation for this project was to provide access to PDF
files, which until recently were not accessible with screen readers. In
the meantime, however, Adobe has released Acrobat Reader 5.1, the
standard PDF viewer, with screen reader accessibility. Regardless
of Adobe’s recent development, listening to audiotexts
on an MP3 player provides unique and convenient access to text
material.
Audio layout to support comprehension and navigation
To improve the listener’s comprehension the layout of the
audiotext supports the active listening [3] strategy. The active
listening strategy is advocated by RFBD to improve people's
comprehension when listening to books on tape. The strategy has two
stages. The first stage is for the reader to gain an overview of the
document by skimming the material; for example, reading the abstract,
section headings, and thinking about how the sections relate to the
overall idea. In the second stage, the reader reads the complete text.
In the audiotext the overview material is placed before the complete
text so that the listener does not have to search for the overview
material. So the title, author information, abstract and section
headings are heard first like the table of contents in a book. Following
the overview are the sections in their entirety. With this technique the
listener can gain an overview before listening to the rest of the
paper.
The audiotext is divided into individual tracks to enable a listener
to easily traverse the document. Each part of the overview information
(title, author information, and each section heading) is in a separate
track. Each section in its entirety is also on an individual
track. Individual tracks are easy to access by moving back and forth
between tracks. The typical paper with 10 sections will have
approximately 25 tracks, which is a reasonable number for a listener to
flip through. Dividing the paper further into smaller portions, such as
paragraphs, would drastically increase the track-count and thereby make
it more cumbersome for the user to flip between sections.
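The track layout described above can be sketched in a few lines of
Python. This is only an illustration, not the prototype's code: the
`Track` class, the `build_tracks` function, and the exact keyword
strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Track:
    number: int   # position on the MP3 player, starting at 1
    keyword: str  # spoken annotation; empty for the bare heading tracks
    text: str     # text to be synthesized


def build_tracks(title, authors, abstract, sections):
    """Order the overview first, then every section in its entirety.

    `sections` is a list of (heading, body) pairs.
    """
    parts = [("Title", title), ("Author", authors), ("Abstract", abstract)]
    # Overview: one short track per section heading, with its section number.
    parts += [("", f"{i}. {heading}")
              for i, (heading, _) in enumerate(sections, 1)]
    # Full text: one track per complete section, announced by its number.
    parts += [(f"Section {i}", f"{heading}. {body}")
              for i, (heading, body) in enumerate(sections, 1)]
    return [Track(n, kw, text) for n, (kw, text) in enumerate(parts, 1)]
```

Under these assumptions a paper with 10 sections yields 3 + 10 + 10 = 23
tracks, in line with the estimate of approximately 25.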
With this technique the listener can efficiently access a desired
section. First the listener can flip between the section headings
(corresponding tracks) in the overview to find the section he/she wants
to read. Then by means of the section number included with the section
headings, the listener can browse to the desired section; the section
number announced at the start of each section indicates if the desired
section is before or after the current track.
The drawback to this design is that navigating within a section is
complicated. Each section is assigned to a single track, which means
navigation within it is by rewinding or fast-forwarding. Finding the
beginning of a paragraph or sentence is a matter of hit and miss. Future
work will hopefully improve the navigation within a section.
Annotations convey document structure
Annotating the original text of the document conveys the
document structure in the audiotext, which is important for the
listener’s comprehension. In this work annotations take the form of
spoken keywords or beeps.
Some of the tracks are annotated by spoken words with additional
information to help the listeners orient themselves in the audiotext.
The tracks for the title, author information, abstract, and sections
start with the corresponding keywords. When the listener hears the
keyword, he/she will know the approximate position in the paper and
whether it is necessary to move forward or backward to reach the desired
part of the document. The tracks representing the section headings are
also identifiable because they do not have keywords. Listeners are most
likely to listen to these short tracks consecutively so keywords would
be distracting here.
Emphasizing the keyword distinguishes it from the same words
appearing in the text and alerts the listener to the newly started
track. Changing the characteristics of the voice that speaks them
emphasizes the keywords; for example, the majority of the
text may be read in a female voice while the keywords are spoken in a
male voice. Hearing the emphasized “section” keyword between sections
alerts the listener to a section transition and thus a transition
between major ideas.
Emphasizing printed structure by changing voice characteristics has
been explored by other research and could also be applied to this
application. T. V. Raman uses the technique to emphasize the structure of
mathematical formulas and table structures [4]; for example, when
reading mathematical equations subscripts are read with a deeper voice.
The voice characteristics are unique enough to be used in conjunction
with the keywords and beeps and not confuse the listener.
Structural information can also be conveyed by beeps, which is an
appropriate technique for experienced listeners. A short beep between
paragraphs signifies the transition between them. Unlike novice users,
listeners familiar with audiotexts will understand the meaning of the
beep. Experienced listeners might also appreciate a
shorter audiotext, and a beep is likely to be shorter than a
keyword. Ideally, the listener can choose whether annotations are
keywords or beeps.
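As an illustration of the beep annotation, the following sketch
generates a short sine-wave beep as WAV data using only the Python
standard library. The frequency, duration, sample rate, and the
`make_beep` name are assumptions made for the example; the beep would
still have to be spliced between the synthesized paragraphs before MP3
encoding.

```python
import io
import math
import struct
import wave


def make_beep(freq_hz=880.0, duration_s=0.15, rate=22050, volume=0.5):
    """Return a short sine-wave beep as the bytes of a mono 16-bit WAV file."""
    n_frames = int(rate * duration_s)
    samples = (
        int(volume * 32767 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buffer.getvalue()
```

A beep of this length is far shorter than a spoken keyword, which is
the saving that experienced listeners might appreciate.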
Practical to use
The PDF to MP3 conversion tool will only be useful if it is
convenient and practical, which is the case with MP3 technology. The
tool is used in two stages. The first stage is the automatic conversion
from PDF to MP3. In the second stage the user listens to the MP3
tracks on an MP3 player.
The conversion process is convenient because it is automated. The
user provides the parameters for a customized reading voice and a PDF,
which is automatically converted into MP3 files. Given the proper
settings, the files are downloaded directly onto the MP3 player.
Although the process is automated, it still takes several minutes for
the computationally intensive conversion process and for copying files
to the MP3 player.
The MP3 player's handy form factor is like that of a Walkman and is
convenient to use. The user can choose the most comfortable location to
listen to a document, such as on a couch at home or outside on a grassy
field. Although the user could listen to MP3s on a laptop, moving a
laptop is less convenient than moving an MP3 player.
An MP3 player is also financially practical as it is a mainstream
consumer product. The high sales volume and innovations in
computer hardware are bound to further decrease the price. Besides
listening to audio documents the user can use the same hardware to
listen to music files, as originally intended. Using an MP3 player is
more practical than the BookCourier [5] alternative. The BookCourier is
a specialized text-to-speech device; the text is downloaded to the
device and read aloud. Although it has the same form factor, its price
is less likely to decrease because there is a limited market for it.
One function not readily available on MP3 players is changing the
playback speed. The user can set the reading speed before
the PDF conversion process but cannot change it while listening.
However, certain MP3 players are programmable, such as the Archos Jukebox
[6] and Neuros [7], and can be modified to provide this functionality.
Furthermore, using the phase vocoder, the speed can be changed without
changing the pitch [8]. In the frequency domain the frequencies are
multiplied by the inverse of the rate change. To preserve the audio
signal the phase corresponding to the original frequency is matched to
the new frequency. The new audio signal's time-domain equivalent is
played back at the new rate.
Design and prototype
The PDF to MP3 conversion is feasible to implement in three
stages by combining existing techniques.
The first stage, the most complex, extracts the text and structure
from the electronic documents. Extracting text from electronic documents
is straightforward; extracting structure, however, is not. Most document
formats only preserve typesetting and layout information about the text.
The structure, such as title, section headings, or footnotes, is not
preserved. The challenge is to infer the structure from the text style;
for example, headings are likely to be bold and set in a slightly larger
font size than the main text. However, there is no standard, so it is
difficult to determine criteria that apply in all cases. The
exception is LaTeX documents, which strictly enforce structure. The
LaTeX source, however, is not widely distributed because it is converted
into other formats, such as PDF.
The second stage is to perform a text-to-speech (TTS) conversion and
modify the synthesized voice to reflect the paper structure. TTS is
thoroughly studied and several practical solutions exist. In TTS, the
phonetics of each syllable is determined and the corresponding audio
produced. Altering the pitch and speed of spoken words can change the
characteristics of the voice. Ongoing efforts in the field aim to make
the robotic-sounding synthetic voices more humanlike.
(Reference TTS research)
The final stage is to store the audio to file, which is a matter of
choosing a file format. The WAV file format holds the raw audio data,
which a sound card uses to reproduce the sound. The WAV data can be
considerably compressed when converted into the MP3 format [9].
Depending on the audio quality and the number of channels the WAV to MP3
compression ratio ranges from 12:1 to 24:1. The audio quality can be
reduced as long as the recorded synthesized speech is still comfortable
to listen to.
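To give a feel for these ratios, the following sketch estimates file
sizes for a 30-minute audiotext. The recording parameters (22.05 kHz,
16-bit, mono) are assumptions chosen for the example, not settings
reported for the prototype, and the function names are hypothetical.

```python
def wav_bytes(minutes, sample_rate=22050, bits=16, channels=1):
    """Size of uncompressed PCM audio, ignoring the small WAV header."""
    return int(minutes * 60 * sample_rate * (bits // 8) * channels)


def mp3_bytes(minutes, ratio, **pcm):
    """Estimated MP3 size for a given WAV-to-MP3 compression ratio."""
    return wav_bytes(minutes, **pcm) / ratio


for ratio in (12, 24):
    size_mb = mp3_bytes(30, ratio) / 1e6
    print(f"30 min at {ratio}:1 -> {size_mb:.1f} MB")
```

Under these assumptions the uncompressed WAV data is about 79 MB, while
the MP3 version is roughly 6.6 MB at 12:1 and 3.3 MB at 24:1.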
The prototype of the described system has been implemented from
existing software components. The Python [10] programming language is
used to combine the different components of the system. The “pdftohtml”
[11] software extracts text and font settings from a PDF and converts
them into HTML. The Python HTML parser reads in the HTML file, divides
the paper’s components based on format, and stores the text in a data
structure. The text in the data structure is processed by Microsoft's
TTS [12] engine using the male voice to emphasize keywords and the
female voice to read text. Microsoft Speech Application Programming
Interface (SAPI) also provides the functionality to save the spoken word
into WAV files. The WAV files are converted to MP3 files using LAME [13]
software. The final audiotext is about 30-45 minutes long depending on
paper length and reading speed.
The prototype is limited to converting papers of a fixed format, in
which headings are the only bold text. The parts of the paper, such as
title, abstract, and sections, are identified as being between headings.
Documents with other formats could be converted but would be randomly
divided into audio tracks. The listener can still listen to the paper
but does not have convenient access to sections.
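The bold-heading rule can be sketched with the `html.parser` module
from the Python standard library. This is a minimal illustration, not
the prototype's parser: it assumes that bold text arrives wrapped in
`<b>` or `<strong>` tags, and the `BoldHeadingSplitter` class and
`split_on_bold` function are hypothetical names.

```python
from html.parser import HTMLParser


class BoldHeadingSplitter(HTMLParser):
    """Split HTML into [heading, body] parts, treating bold text as headings."""

    def __init__(self):
        super().__init__()
        self.parts = []             # list of [heading, body] pairs
        self._bold_depth = 0        # > 0 while inside <b> or <strong>
        self._heading_open = False  # True while bold runs extend one heading

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._bold_depth:
            self._bold_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text:
            return
        if self._bold_depth:
            if self._heading_open:
                # Adjacent bold runs belong to the same heading.
                self.parts[-1][0] += " " + text
            else:
                self.parts.append([text, ""])
                self._heading_open = True
        else:
            self._heading_open = False
            if not self.parts:
                # Text that appears before the first heading.
                self.parts.append(["", ""])
            body = self.parts[-1][1]
            self.parts[-1][1] = f"{body} {text}".strip()


def split_on_bold(html):
    """Return a list of (heading, body) tuples found in `html`."""
    parser = BoldHeadingSplitter()
    parser.feed(html)
    parser.close()
    return [tuple(part) for part in parser.parts]
```

A document in which ordinary words are also set in bold would be split
at each of them, which illustrates why papers in other formats end up
arbitrarily divided into tracks.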
Evaluation
The purpose of the evaluation is to measure the listener's
comprehension of an audiotext. The evaluation should be performed
with blind readers, weak readers, and others interested in
alternative access to text.
Although a formal study was not performed, one possibility would be
to perform it as follows. All participants are given an audiotext with a
general subject and a chance to practice using the MP3 player. The
effectiveness of the audiotext is measured by the time and accuracy with
which the participants complete a questionnaire about the audiotext. At
the end of the experiment the participants can share their experiences
of using the audiotext and suggest improvements.
It will be interesting to compare the performance of those who
have used a form of audiotext with that of those who have not. My
hypothesis is that those experienced with audiotexts will outperform
those with no audiotext experience.
A possible enhancement to the experiment might improve the listener's
comprehension. In addition to listening to the audiotext, the
participant has alternative access to the text; for example, the sighted
follow along on a paper copy and the blind use a Braille display.
I hypothesize that comprehension will improve because the listener's
attention is more focused on the content. Also, having two forms of
the text might make the facts more memorable and therefore improve
comprehension.
Future Work
The audiotext navigation techniques explored in this project
can be enhanced and possibly applied to electronic books, which display
one page at a time.
With programmable MP3 players it will be possible to expand the
interface; the buttons for play, stop, forward, and backward can be
reassigned. The paper can be divided down to the sentence level and
organized in a hierarchy of folders supported by the MP3 player; for
example, at the first level are sections, at the second paragraphs, and
at the third sentences. Then the buttons on the MP3 player could be used
to navigate the folders.
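One way such a folder hierarchy might be laid out is sketched below.
The naming scheme, the zero-padded numbering (so that a player sorts
the entries in reading order), the naive sentence splitter, and the
`sentence_paths` function are all assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath


def sentence_paths(sections):
    """Map each sentence to a section/paragraph/sentence file path.

    `sections` is a list of sections, each a list of paragraph strings.
    """
    paths = []
    for s, paragraphs in enumerate(sections, 1):
        for p, paragraph in enumerate(paragraphs, 1):
            # Naive split: a sentence ends at ., ! or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
            for n, sentence in enumerate(sentences, 1):
                path = PurePosixPath(
                    f"section_{s:02d}",
                    f"paragraph_{p:02d}",
                    f"sentence_{n:02d}.mp3",
                )
                paths.append((str(path), sentence))
    return paths
```

Each returned pair names the MP3 file that would hold one synthesized
sentence, so moving between folders corresponds to moving between
sections or paragraphs.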
The navigation techniques used for an audiotext also apply to
electronic books. Unlike a conventional book, an electronic book does
not provide the same easy method of flipping pages to find the desired
information. Some simple navigation techniques include flipping through
consecutive pages and jumping directly to a page number. Enabling access
to sections can enhance the navigation.
References
1. web. Recordings for Blind and Dyslexic. in http://www.rfbd.org/. April 15, 2003.
2. web. Digital Accessible Information SYstem. in http://www.daisy.org/. April 15, 2003.
3. video. Video on active listening. in RFBD. 1998.
4. Raman, T. V. Emacspeak -- A Speech Interface. in CHI. 1996.
5. web. BookCourier. in http://www.ostrichsoftware.com/. April 15, 2003.
6. web. Archos Jukebox. in http://www.archos.com/. April 15, 2003.
7. web. Neuros. in http://www.neurosaudio.com/store/prod_neuros.asp. April 15, 2003.
8. Robinson, A. Changing the Speed of Music Without Changing the Pitch (technical discussion). in http://www.seventhstring.demon.co.uk/xscribe/slowdown.html. April 15, 2003.
9. web. Audio & Multimedia MPEG Audio Layer-3. in http://www.iis.fraunhofer.de/amm/techinf/layer3/. April 15, 2003.
10. web. Python programming language home-page. in http://www.python.org/. April 15, 2003.
11. Kruk, M. PDF to HTML conversion tool. in http://pdftohtml.sourceforge.net/. April 15, 2003.
12. web. Microsoft Text-to-Speech research. in http://research.microsoft.com/srg/. April 15, 2003.
13. web. LAME Ain't an Mp3 Encoder (LAME). in http://lame.sourceforge.net/. April 15, 2003.