A Document Skimmer

Overcoming the Soda-Straw Effect

Hearing text read aloud is slower than reading it visually. The listener is passive: he/she cannot skim for specific information, scan for an overview, or skip ahead as desired. Conventional text readers allow no scanning or skimming. They always skip by exactly one line or one page and do not let the user specify a different amount. Furthermore, if the user chooses to listen passively and not skip ahead, the only way to finish sooner is to speed up the speech, which makes it less comprehensible beyond a certain rate. Nearly any method of speeding up reading can reduce comprehension, but some methods may be better than others, and different users may have different preferences.

Our Document Skimmer speeds up reading and lets the user decide how to trade speed against comprehension. The user can listen actively, skipping by specified distances or scanning at specified levels of detail. He/she can skip words, lines, or segments of a specified size, both forward and backward. He/she can scan by listening to text with less detail, simultaneous speech streams, words with dropped phonemes, or words with blended phonemes.

Rate and Voice

Our program allows users to choose among multiple voices to read the text. MSMary seemed to be a favorite of people who used this software. The program also allows users to vary the rate at which text is read.

Hierarchy

The hierarchy allows the user to hear and traverse the document at various levels of detail (LODs). The document is split into segments of size N (measured in words), and for each segment only the first subsegment of size M is read (M <= N). Before hearing the text, the user specifies the "segment size" N and the "subsegment size" M, and the hierarchy is then built. While hearing the text, the user can press keys to increase or decrease the segment size in increments of N, or to jump forward or backward in increments of M. These actions correspond to traversing from parent to child (toward more or less detail) and skipping from sibling to sibling (forward or backward in the text).

For example, the user may use segments of size 6 and subsegments of size 3. He/she may then traverse up the hierarchy to LODs characterized by segments of size 12, 18, 24, ..., 6*floor(len(doc)/6). Within each LOD, he/she hears the first 3 words of every segment. He/she may jump forward or backward by segments of the size corresponding to the LOD.

Hierarchical levels of detail. Subsegments are size M=3. Segments are size N=6 on the level with the highest level of detail, and they increase in increments of 6 on each subsequent level.
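
The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from docSkimmer.py: the function names build_level and build_hierarchy are made up for this sketch, and we assume that sizes are counted in words and that the coarsest level uses the largest multiple of N that fits in the document.

```python
def build_level(words, segment_size, subsegment_size):
    """Return the subsegments that would be read aloud at one level of detail."""
    if not 0 < subsegment_size <= segment_size:
        raise ValueError("need 0 < M <= N")
    # One subsegment (the first M words) per segment of N words.
    return [
        words[start:start + subsegment_size]
        for start in range(0, len(words), segment_size)
    ]


def build_hierarchy(words, base_segment_size, subsegment_size):
    """Map each segment size (N, 2N, 3N, ...) to its list of subsegments."""
    levels = {}
    size = base_segment_size
    # Assumption: stop at the largest multiple of N not exceeding len(words).
    while size <= len(words):
        levels[size] = build_level(words, size, subsegment_size)
        size += base_segment_size
    return levels


# A 24-word stand-in document, with N = 6 and M = 3 as in the example.
document = [f"w{i}" for i in range(24)]
hierarchy = build_hierarchy(document, base_segment_size=6, subsegment_size=3)
for size, subsegments in hierarchy.items():
    print(size, [" ".join(s) for s in subsegments])
```

Moving to a larger segment size corresponds to going up one level (less detail), and stepping through the list at a given level corresponds to skipping from sibling to sibling.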

The hierarchy simulates how people naturally skim visually, by alternately reading and skipping portions of text at will. SpeechSkimmer (Arons, 1993) presents a similar hierarchy, except that it processes speech instead of text, and its segment sizes are based on semantics rather than being fixed. A speech stream is split into segments at estimated locations of topic change. Subsegments are a fixed number of seconds long.

Words and Phonemes

People who research speech, literacy, and reading are trying to understand the processes involved when people read silently. When some people read, they subvocalize, which means they silently repeat each word in their head as they read it. Some researchers believe this is a common cause (or symptom) of reading disabilities, partly because it decreases reading speed. In addition, some researchers believe subvocalization decreases comprehension because the reader's attention is on individual words instead of the idea the words convey. Because of this, we thought that hearing each word read aloud may not be a necessary prerequisite to understanding text. Our software experiments with this idea.

First, our software has a "Remove Common Words" menu option. When it is chosen, common articles and prepositions are removed from the text. Our list of common words contains 56 words. That list can be viewed here.
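
A minimal sketch of this kind of filtering is shown below. The word list here is a short, made-up subset used only for illustration (the actual 56-word list is not reproduced), and the function name remove_common_words is an assumption rather than a name from our code.

```python
# Hypothetical subset of articles and prepositions, for illustration only.
# The project's real list has 56 words and is not reproduced here.
COMMON_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by", "from",
}


def remove_common_words(text, common_words=COMMON_WORDS):
    """Drop every word whose lowercase, punctuation-stripped form is in the list."""
    kept = [
        token
        for token in text.split()
        if token.strip(".,;:!?\"'()").lower() not in common_words
    ]
    return " ".join(kept)


print(remove_common_words("The cat sat on the mat in a box."))
```

Punctuation attached to a kept word is left in place, so sentence boundaries remain audible to the speech engine.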

Our software also has a "Drop Phonemes" menu option. When chosen, phonemes with no lexical stress are removed from the text. For example, in the word "computing", essentially the first syllable would be dropped.

Finally, our software has a "Blend Phonemes" menu option. When it is chosen, the spaces between words that share phonemes are removed. Specifically, if either of the last two phonemes of a word is also one of the first two phonemes of the next word, the space between the two words is removed. For example, in the phrase "What up", the "uh" sound is heard in both words, so the space between these words would be removed.
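
The blending rule can be sketched as follows. The pronunciations are a tiny hand-written table in an ARPAbet-like notation, which is an assumption made for this sketch. A real implementation would obtain phonemes from the speech engine or a pronunciation dictionary, and the names should_blend and blend_phonemes are hypothetical.

```python
# Hypothetical pronunciations (ARPAbet-like symbols), for illustration only.
PHONEMES = {
    "what": ["W", "AH", "T"],
    "up": ["AH", "P"],
    "is": ["IH", "Z"],
    "new": ["N", "UW"],
}


def should_blend(left, right, phonemes=PHONEMES):
    """True if the last two phonemes of `left` share a phoneme with the first two of `right`."""
    tail = set(phonemes.get(left.lower(), [])[-2:])
    head = set(phonemes.get(right.lower(), [])[:2])
    return bool(tail & head)


def blend_phonemes(text, phonemes=PHONEMES):
    """Remove the space between adjacent words that satisfy the blending rule."""
    words = text.split()
    if not words:
        return ""
    out = [words[0]]
    for previous, word in zip(words, words[1:]):
        if should_blend(previous, word, phonemes):
            out[-1] += word  # join onto the previous output token
        else:
            out.append(word)
    return " ".join(out)


print(blend_phonemes("What up"))
print(blend_phonemes("What is new"))
```

Words that are missing from the table are never blended in this sketch, so unknown words are simply passed through unchanged.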

Spatial Sound

As part of our investigation into how text could be understood more quickly when it is heard, we wanted to learn how text can be understood more quickly as it is seen. We read about a variety of techniques used when people learn how to speed read. In particular, many speed reading exercises teach people to read text as chunks, instead of word-by-word. One exercise asks readers to make a circle out of their index fingers and thumbs and to try to read all of the words in that circle at once. This led us to experiment with the idea of such a circle as text is heard. Specifically, we use spatial sound to allow multiple words to be heard at once.

In our Document Skimmer, spatial sound was implemented as follows. On the menu bar there is a Spatial Sound menu that consists of two choices, Spatial Sound and Spatial Orientation. If you choose Spatial Sound in this menu, a box will pop up asking for the number of sources you would like to have. We decided to support only 1, 2, 3, or 4 sources.

If you choose to have only 1 source, you will hear only one voice speaking the text. This option plays the text as normal.
If you choose to have 2 sources, you will hear 2 voices speaking their segments of the text at the same time. The text you have opened or typed into the window is divided so that the first voice speaks words 1, 3, 5, 7, etc., while the second voice speaks words 2, 4, 6, etc. Both voices speak at the same time, but the second source is closer to you, so it dominates the other source and sounds louder.
If you choose to have 3 sources, you will hear 3 voices speaking their segments of the text at the same time. As above, the text in the window is divided into 3 parts: the first consists of words 1, 4, 7, etc., the second of words 2, 5, 8, etc., and the third of words 3, 6, 9, etc. Each source plays one of the parts. The second source is closest to the listener, so it dominates the other two. It is also a female voice, while the other two are male, which makes it even more recognizable. This setup corresponds to a speed reading lens of size 3.
If you choose to have 4 sources, you will hear 4 voices speaking their segments of the text at the same time. The text is divided into 4 parts in the same way as above. This time the sources are placed so that two are closer to the listener and the other two are farther away, so the two closer sources dominate the other two. This setup should simulate a speed reading lens of size 4.
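
The way the text is divided among the sources can be sketched as below. This covers only the word assignment, not the OpenAL positioning or the speech output, and the function name split_into_streams is made up for this sketch.

```python
def split_into_streams(text, num_sources):
    """Deal the words of `text` out to `num_sources` streams, round-robin."""
    if num_sources not in (1, 2, 3, 4):
        raise ValueError("only 1, 2, 3, or 4 sources are supported")
    words = text.split()
    # Stream i gets words i+1, i+1+k, i+1+2k, ... (counting words from 1).
    return [words[i::num_sources] for i in range(num_sources)]


streams = split_into_streams("one two three four five six seven", 3)
for number, stream in enumerate(streams, start=1):
    print(f"source {number}: {' '.join(stream)}")
```

With 1 source the single stream is simply the whole text, which matches the "play as normal" case.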

User Evaluations

To evaluate our software, we conducted a pilot study approved by the University of North Carolina's Institutional Review Board. You can read the study proposal, the informed consent form, and the parental consent form.

Four people participated in this study. In addition, three others used our software informally and provided feedback. The protocol, including questions asked as part of this study, can be read here.

Participants were asked to perform three tasks:

In Task 1, participants were asked questions about some text and were asked to find the answers by navigating through the text. In general, participants found this difficult to do. Although in some cases the correct answers were located, this generally took more than a minute. Some participants skipped forward and backward by lines, and often by words, but the hierarchy was rarely used.

In Task 2, participants heard a story read. Each section of the story was read in a different way. After each portion of the text was heard, three questions were asked.

The first part of the story was heard as unaltered text. Participants were asked questions about the story. Out of 12 questions, there were 12 correct responses.
The second part of the story was heard with common words removed. With the text used, 29% of the words were removed. Out of 12 questions, there were 11 correct responses.
The third part of the story was heard with phonemes dropped. With the text used, 29% of the words were modified. Out of 12 questions, there were 9 correct responses. Three of four participants correctly answered all three questions. One participant incorrectly answered all three questions.
The final part of the story was heard with phonemes blended. With the text used, 10% of the words were modified. Out of 12 questions, there were 4 correct responses.

In Task 3, participants heard a story read. Each section of the story was read in a different way. After each portion of the text was heard, three questions were asked.

The first part of the story was heard as unaltered text. Out of 12 questions, there were 11 correct responses.
The second part of the story was heard with two voices speaking at once. Out of 12 questions, there were 6 correct responses. Three people each answered two questions correctly. One person answered none of these questions correctly.
The third part of the story was heard with three voices speaking at once. Out of 12 questions, 3 were answered correctly.
The final part of this story was heard with four voices speaking at once. Out of 12 questions, 2 were answered correctly. These two were answered by the same person.
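
For easier comparison, the counts reported above for Tasks 2 and 3 can be converted to percentages. The short script below only restates the numbers already given (12 questions per condition, from four participants answering three questions each). The condition labels and the helper name percent_correct are ours, and with only four participants these figures should be read as rough indications.

```python
QUESTIONS_PER_CONDITION = 12  # 4 participants x 3 questions

# Correct responses per condition, copied from the results listed above.
correct = {
    "Task 2": {
        "unaltered": 12,
        "common words removed": 11,
        "phonemes dropped": 9,
        "phonemes blended": 4,
    },
    "Task 3": {
        "unaltered": 11,
        "two voices": 6,
        "three voices": 3,
        "four voices": 2,
    },
}


def percent_correct(n_correct, n_questions=QUESTIONS_PER_CONDITION):
    """Percentage of questions answered correctly, rounded to one decimal."""
    return round(100 * n_correct / n_questions, 1)


for task, conditions in correct.items():
    for condition, n in conditions.items():
        print(f"{task}, {condition}: {n}/{QUESTIONS_PER_CONDITION} = {percent_correct(n)}%")
```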

Future Work

Although our experiments began to answer some questions, they raised even more. For example:

How much does the voice selected to read the text matter? Several of our study participants preferred to hear MSMary. Is it because they preferred female voices, or is Mary just clearer? Could voice selection drastically decrease comprehension?
Would our software, especially the hierarchy, be more beneficial to users if they had more time to practice using the software? During our experiments, some participants used the computer very hesitantly. Was this because of a learning curve?
What is the relationship between phoneme dropping (or blending) and the speed at which text is heard? One participant noted that when the text was heard slowly, her brain tried to decipher unclear words. When the text was heard more quickly, her brain did not have time to do this, and her comprehension improved.
What is the role of prior knowledge with regard to skimming? For example, if the user has heard text before and just wants to review it, perhaps hearing it in spatial sound is more effective than hearing text for the first time in spatial sound.
How does skimming relate to simply trying to "find" text (using control-F)? Answers to some questions can be located using a "find" function as part of the software. What types of questions lend themselves more easily to this? For what types of tasks is skimming more effective?

Clearly, there is plenty more to investigate here!

Code for Our Document Skimmer

Other Downloads

Since our program is only a prototype, we do not support a one-link installation process. If you want to download our program and see how it works, here are the things you will need:

BATS NCDemo: Blind Audio Tactile Mapping System
- OpenAL.dll: OpenAL spatial sound library
- MSVCRTD.dll
- pyTTS.dll: Wrapper for Microsoft text-to-speech library
- pyOpenAL.dll: Wrapper for OpenAL spatial sound library
Python: Interpreter, sample code, and editor
wxPython: GUI toolkit for Python
Python Numeric 22.0 Library
Win32 Python library
Microsoft voices

How To Install and Run Our Software

Download and run the installation programs for everything.
Put the DLLs in Win32, with the others that are already there.
Add to the front of the PATH system variable the path to python.exe, the Python interpreter.
In our software, find docSkimmer.py and run it: Double-click it or enter on the command line "python docSkimmer.py".

References

Arons, B., Techniques, Perception, and Applications of Time-Compressed Speech. Proceedings of American Voice I/O Society, 1992: p. 169-177.
Arons, B., SpeechSkimmer: A System for Interactively Skimming Recorded Speech. Proceedings of ACM Symposium on User Interface Software and Technology, 1993: p. 187-196.
Bauwens, B., et al., Increasing access to information for the print disabled through electronic documents in SGML. ACM SIGCAPH Conference on Assistive Technologies, 1994: p. 55-61.
Brewster, S., A. Capriotti, and C. Hall, Using Compound Earcons to Represent Hierarchies. Human Computer Interaction (HCI) Letters, 1998.
Brewster, S.A., V.-P. Raty, and A. Kortekangas, Enhancing scanning input with non-speech sounds. ACM SIGCAPH Conference on Assistive Technologies, 1996: p. 10-14.
Brewster, S.A., P.C. Wright, and A.D.N. Edwards, An evaluation of earcons for use in auditory human-computer interfaces. Proceedings of the SIGCHI conference on Human factors in computing systems, 1993: p. 222-227.
Buzan, T., Speed Reading. Penguin Group, New York, New York: 1991.
Cutler, W. E., Triple your Reading Speed. Arco, Lawrenceville, NJ: 2002.
Janse, E., Time-Compressing Natural and Synthetic Speech. Proceedings of 7th International Conference on Spoken Language Processing, 2002: p. 1645-1648.
Pitt, I.J. and A.D.N. Edwards, Improving the usability of speech-based interfaces for blind users. ACM SIGCAPH Conference on Assistive Technologies, 1996: p. 124-130.
Pontelli, E., et al., A domain specific language framework for non-visual browsing of complex HTML structures. ACM SIGCAPH Conference on Assistive Technologies, 2000: p. 180-187.
Prior, S. M. and K. A. Welling, Read in Your Head: A Vygotskian Analysis of the Transition from Oral to Silent Reading. Reading Psychology 2001: vol 22, p. 1-15.
Raman, T.V. and D. Gries, Interactive Audio Documents. ACM SIGCAPH Conference on Assistive Technologies, 1994: p. 62-68.
Roy, D.K. and C. Schmandt, NewsComm: A Hand-Held Interface for Interactive Access to Structured Audio. ACM Computer-Human Interfaces (CHI), 1996.
Schmandt, C. and A. Mullins, AudioStreamer: Exploiting Simultaneity for Listening. Computer-Human Interaction (CHI), 1995.
