Registration (or alignment) of the synthetic imagery with the real world is crucial in augmented reality (AR) systems. The data from user-input devices, tracking devices, and imaging devices need to be registered spatially and temporally with the user's view of the surroundings. Each device has an associated delay between its observations of the world and the moment when the AR display presented to the user appears to be affected by a change in the data. We call the differences in delay the relative latencies. Relative latency is a source of misregistration and should be reduced. We give general methods for handling multiple data streams with different latency values associated with them in a working AR system. We measure the latency differences (part of the system-dependent set of calibrations), timestamp on host, adjust the moment of sampling, and interpolate or extrapolate data streams. By using these schemes, a more accurate and consistent view is computed and presented to the user.
CR Categories and Subject Descriptors: I.3.7 [Three-Dimensional Graphics and Realism]: Virtual Reality; I.3.1 [Hardware Architecture]: Three-dimensional displays; I.3.6 [Methodology and Techniques]: Interaction Techniques.
Additional Keywords: Augmented Reality, Latency Management, Ultrasound Echography.
Augmented reality (AR) is the term used to describe systems in which the user is presented with an enhanced view of the surroundings. This view is created by compositing computer graphics with a view of the real world. The graphics must be generated in such a way that the user believes that the synthetic objects exist in the environment. Azuma gives an introduction to the field of AR, the technology that can be used to achieve it, existing applications, and the potential of the paradigm.
As has been recognized in previous AR systems, registration (or alignment) of the synthetic imagery with the real is crucial. AR systems have a variety of input streams (tracking devices, real-time imaging devices, user input devices and others), which differ in accuracy, bandwidth, dynamics and frequency. The devices need to be both spatially and temporally registered.
Consider the example of a movie in which sound effects are overdubbed. The illusion suffers if a crash is not heard at the precise time an object is seen to hit the ground. Although both the video and audio signal may have considerable delay, it is the difference in delay that is noticed. The differences in latency between data streams or external signals cause misregistration. In the rest of this paper, we call the difference in latency between two streams the relative latency. We concentrate on minimizing relative latencies since they are sources of misregistration.
We develop methods for measuring relative latency and a variety of techniques for managing latency to reduce the misregistration it causes. We introduce these methods while keeping in mind that we are building a complete system, and thus must focus on our goal: providing a convincing and accurate illusion. We apply these methods to a previously described AR system for real-time ultrasound visualization and demonstrate improved registration and visualization resulting from our latency management techniques.
The following sources of delay have been classified:
Relative latency has its source in off-host delay, computational delay and synchronization delay. The data from separate external devices follow different paths in the system and each path has its own latency. The relative latency between the different data paths causes misregistration. Rendering delay and display delay do not contribute to misregistration because all the data follows one path. Delay in this path will result in a lower frame rate and in higher latency between real-world events and the displayed image, but will not cause any misregistration since the relative latency between streams here is constant. Figure 1 explains the symbols and terms used.
Figure 1: ACQUIRING DATA AND PRODUCING A FRAME.
si is the time a real-world sample was taken by the external device
and ri is the time it arrived on host.
ri - si is the off-host latency (Toffhost) for stream i.
Fstart and Fend are the start and end times for the generation of the frame.
Many authors have examined latency and have tried to reduce its effects. Previous efforts can be categorized as bounding latency, reducing latency, compensating for latency, and achieving registration despite latency.
Time-critical computing is a technique in which quality is traded for speed. The application is constantly aware of time. Reducing latency here means reducing computation accuracy, which might not always be advisable.
Most real-time graphics systems are aimed at high throughput instead of low latency. High throughput is achieved through pipelining, which results in latency. Olano et al. specifically discuss a low-latency rendering system and a technique to reduce errors caused by delays in the display system. Parallelization reduces latency for Tcomp, Trender, and Tsync by increasing throughput. Wloka dedicates a processor to sample the external data streams at a high frequency.
Prediction of future head position and orientation can be used in order to reduce perceived delay. This position and orientation is estimated by extrapolation of current and older values. This predicted value can then be used to generate an image for the time when the image will be displayed. Since Trender is not a constant, it may be difficult to determine the extrapolation interval.
Image generation delay can be reduced by using a post-rendering warp. Rendering starts with an initial guess for the head position. This process results in a data structure from which an image can quickly be computed based on a newer head position. This newer head position can be acquired after rendering or be approximated through prediction. Mark et al.  use a warping mechanism to reduce latency caused by bandwidth limitations. Regan et al.  render the scene on the faces of a cube. A head rotation causes an address offset for the final image to be taken from this cube.
State et al.  and Bajura et al.  synchronize a video stream with head-tracking data by reducing the head-tracking error with the use of videometrically tracked landmarks. Bajura et al.  temporally synchronize the output of a rendering system and a video stream by buffering the lower latency video.
The naive way of writing an AR application is to sample all the data streams at the start of a frame after which the application starts computing and rendering (Figure 2). This implies that those sources whose data is not required immediately will have extra computational delay associated with them, and leads to large relative latency values (si - sj) causing misregistration and a large maximum end-to-end latency (Fend - min(si)).
Figure 2: NAIVE ORDER OF ACQUIRING DATA.
The application samples
all data streams at the beginning of a frame, then computes and
renders the output. This leads to high maximum end-to-end
latency and high relative latency.
If we know the moment in time that the data was sampled (si), or the difference in time between the samples of two streams, we can adapt the program to reduce the effects of the known relative latencies. Unfortunately, most data streams are not timestamped at the source. However, we can measure the relative latency between data streams with experiments (see section 4.1). Timestamping upon arrival on host enables us to measure the dynamic computational delay (Tcomp).
Once we know the relative latencies, we can reduce the effects by adjusting the moment of sampling. Sampling a higher latency stream later than a lower latency stream will reduce the relative latency between the two streams. Another method is to use interpolation and extrapolation to compute a value for a data stream for any moment in time based on previously sampled data values. These techniques reduce relative latency and therefore improve registration.
Perhaps the simplest technique to reduce relative latency is to adjust the moment of sampling the incoming data stream. This requires no computation or special hardware, and proves to be a useful method for reducing relative latency.
One method to do this would be to schedule the polling of input devices at relative intervals that correspond to the relative latency between the various devices (Figure 3), and wait between readings. Processing begins after all input has been received. However, this would increase the end-to-end latency of the frame produced from this data and reduce the frame rate to 1/(max(ri) - min(ri) + Tcomp + Trender).
Figure 3: DEFERRED ACQUISITION OF DATA.
The data streams are
sampled in order of increasing latency. The wait times are
equal to the respective relative latencies. This results in
zero relative latencies, but very high maximum end-to-end latency.
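The deferred scheme can be sketched in a few lines. This is only an illustration under stated assumptions: the off-host latencies are treated as known constants, all numbers are hypothetical values in milliseconds (not measurements from our system), and the function name `deferred_schedule` is invented for this example. The frame-rate expression is the one given above, 1/(max(ri) - min(ri) + Tcomp + Trender).

```python
def deferred_schedule(offhost_latency, t_comp, t_render):
    """Return (poll times r_i, frame rate in Hz) for deferred acquisition.

    offhost_latency maps a stream name to its assumed constant off-host
    latency in milliseconds. Each stream is polled later than the
    lowest-latency stream by exactly its relative latency.
    """
    lowest = min(offhost_latency.values())
    polls = {name: latency - lowest for name, latency in offhost_latency.items()}
    wait = max(polls.values()) - min(polls.values())
    frame_period_ms = wait + t_comp + t_render
    return polls, 1000.0 / frame_period_ms


# Hypothetical latencies in milliseconds, chosen only for illustration.
offhost = {"camera": 10.0, "tracker": 40.0, "ultrasound": 200.0}
polls, fps = deferred_schedule(offhost, t_comp=30.0, t_render=30.0)

# Real-world sample time of each stream: s_i = r_i - off-host latency.
sample_times = {name: polls[name] - offhost[name] for name in polls}
```

With these numbers every stream describes the same real-world instant, so the relative latencies vanish, but the host idles for 190 ms per frame and the frame rate drops to 4 Hz, which is the cost described in the text.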
A compromise that does not increase maximum latency in order to decrease relative latency is just-in-time acquisition of the data while interleaving computation with the data acquisition (Figure 4). This reduces both relative latency and maximum end-to-end latency, since wait time is filled with useful work.
Figure 4: JUST-IN-TIME ACQUISITION OF DATA.
Polling of data streams is delayed until the data is
required for computation to continue or when the relative latency is
This reduces relative latencies and maximum end-to-end latency.
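As a rough comparison of naive and just-in-time acquisition, the sketch below polls each stream at the start of the computation stage that first needs it. The stage durations, the mapping from stream to stage, and the latency values are all hypothetical, and the helper names are invented for this example; the two quantities computed are the maximum relative latency and the maximum end-to-end latency (Fend - min(si)).

```python
def stage_start_times(stage_durations):
    """Return (start time of each stage, end time of the frame)."""
    starts, t = [], 0.0
    for duration in stage_durations:
        starts.append(t)
        t += duration
    return starts, t


def latency_summary(polls, offhost_latency, frame_end):
    """Return (max relative latency, max end-to-end latency) for one frame."""
    samples = [polls[name] - offhost_latency[name] for name in polls]
    return max(samples) - min(samples), frame_end - min(samples)


# Hypothetical values in milliseconds, for illustration only.
offhost = {"camera": 10.0, "tracker": 40.0, "ultrasound": 200.0}
stage_durations = [20.0, 30.0, 30.0]
first_needed = {"camera": 0, "tracker": 1, "ultrasound": 2}  # stage index

starts, frame_end = stage_start_times(stage_durations)
jit_polls = {name: starts[stage] for name, stage in first_needed.items()}
naive_polls = {name: 0.0 for name in offhost}  # everything read at frame start

jit = latency_summary(jit_polls, offhost, frame_end)
naive = latency_summary(naive_polls, offhost, frame_end)
```

In this example the naive order gives (190, 280) and the just-in-time order gives (140, 230): both figures improve without adding any wait time, provided the higher-latency streams are the ones needed later in the frame.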
The next method is to store multiple readings and either interpolate or extrapolate these readings to simulate new readings. The value of a data sample for each moment in time can now be computed.
For example, if a tracking system's acquisition path has lower latency than the video camera's acquisition path, we can buffer readings and interpolate the position and orientation reported.
On the other hand, if a tracking system's acquisition path has higher latency than the video camera's acquisition path, we can use predictive tracking methods  to compensate for the difference.
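A minimal sketch of such a buffer follows. It assumes readings are pushed with strictly increasing timestamps and treats every component linearly, which is adequate for positions but is a simplification for orientations (a real system would typically interpolate rotations with quaternions). The class name `StreamBuffer` and its methods are hypothetical.

```python
from bisect import bisect_left
from collections import deque


class StreamBuffer:
    """Keeps recent timestamped readings and evaluates them at any time."""

    def __init__(self, maxlen=64):
        self.times = deque(maxlen=maxlen)
        self.values = deque(maxlen=maxlen)

    def push(self, t, value):
        # Timestamps are assumed to be strictly increasing.
        self.times.append(t)
        self.values.append(tuple(value))

    def value_at(self, t):
        """Linear interpolation inside the buffer, extrapolation outside it."""
        if len(self.times) < 2:
            raise ValueError("need at least two readings")
        times = list(self.times)
        # Pick the pair of neighbouring readings that brackets t, or the
        # first/last pair when t lies outside the buffered interval.
        i = min(max(bisect_left(times, t), 1), len(times) - 1)
        t0, t1 = times[i - 1], times[i]
        v0, v1 = self.values[i - 1], self.values[i]
        w = (t - t0) / (t1 - t0)
        return tuple(a + w * (b - a) for a, b in zip(v0, v1))


buf = StreamBuffer()
buf.push(0.0, (0.0, 0.0, 0.0))
buf.push(10.0, (10.0, 0.0, 0.0))
buf.push(20.0, (20.0, 10.0, 0.0))

interpolated = buf.value_at(5.0)    # between the first two readings
extrapolated = buf.value_at(30.0)   # beyond the newest reading
```

Requesting a time older than the newest reading corresponds to the buffering case (tracker ahead of the video), and requesting a time newer than the newest reading corresponds to the predictive case (tracker behind the video).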
Our testbed AR system is the ultrasound visualization system currently being developed at the University of North Carolina at Chapel Hill. This system is a video-see-through AR system designed for the medical procedure known as ultrasound-guided needle biopsy. The physician wears a head-mounted display fitted with two video cameras. A Flock of Birds (FOB) magnetic tracker from Ascension Technology Corporation is used to track the head-mounted cameras. Ultrasound image data is acquired from a Pie Medical Scanner 200. A Metrecom IND-01 (Faro) mechanical arm from Faro Technologies, Inc. tracks the ultrasound probe. The system runs on a Silicon Graphics Onyx Reality Engine. It is equipped with a Sirius Video unit for video capture and the Multi-Channel Option for stereo display. The data from the four input streams is captured asynchronously on a per-frame basis from the four input devices under CPU control. The ultrasound images are rendered and merged with a view of the patient.
We first determined by visual tests that the real-world camera video is the lowest latency stream. For the external device trackers, we determined relative latency by rendering a model in the coordinate system of the tracker that should be aligned with the real-world object. It was then easy to visually determine whether the tracker lagged behind the camera video by moving the tracked object and checking whether the virtual or real copy appeared to move first in the AR view. Table 1 summarizes the results.
Table 1: MEASURED RELATIVE LATENCY BETWEEN THE DATA STREAMS.
Relative latency between the different data streams.
These measurements were made at a frame rate of 10Hz.
The tests showed that both the Faro and the FOB have higher latency than the real-world video. We applied linear predictors to compensate for this latency. By adjusting the prediction interval until the rendered coordinate system matches the tracker image in the video, we measured the latency. Plate 1, center, and Plate 4, center, show the result for the Faro tracker without prediction and with partially improved registration via prediction. We could buffer the video over an interval, but video is high bandwidth and expensive to store and move in memory. Also, the end-to-end latency of the video would slow the response to the user's head movements, thereby diminishing the AR illusion.
In order to measure the relative latency between the camera video and the ultrasound video, we used a single live video signal sent into both video paths. By positioning the ultrasound image data in the AR view, we visually aligned the "ultrasound data" with the real world. By imaging a rotating drum with a regular pattern of vertical bars, we determined the latency between the two streams by aligning the vertical bars in one image to be exactly one pattern width behind the other image (Plate 2). We computed the relative latency from the rotational velocity of the drum.
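The arithmetic behind the drum measurement can be written out as follows. This sketch assumes the bars are evenly spaced around the drum and that the drum turns at a constant, known speed; the function name, its parameters, and the example numbers are hypothetical and are not the values used in our experiment.

```python
def drum_relative_latency_ms(bars_per_revolution, revolutions_per_second,
                             patterns_behind=1.0):
    """Relative latency implied by one stream lagging the other by a given
    number of pattern widths on a drum rotating at constant speed."""
    # Time for the drum to advance by one pattern width.
    seconds_per_pattern = 1.0 / (bars_per_revolution * revolutions_per_second)
    return 1000.0 * patterns_behind * seconds_per_pattern


# Hypothetical drum: 20 bars, half a revolution per second.
latency_ms = drum_relative_latency_ms(bars_per_revolution=20,
                                      revolutions_per_second=0.5)
```

For this hypothetical drum, a lag of exactly one pattern width corresponds to about 100 ms. Because the pattern repeats, the rotation speed has to be adjusted from a slow start so that the first alignment found really is a lag of one pattern width and not a multiple of it.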
We next measured the latency between the FOB and the Faro. We did this by rigidly fixing the two to each other via an intermediate plastic rod. The rod reduces magnetic interference of the FOB by the Faro's stainless steel mount plate. Using wooden blocks, we created a track for moving this assembly back and forth. By moving the assembly in an oscillating pattern, we determined the latency between the two trackers by examining a graph of their respective positions versus a global timer. The relative latency between the two streams proved to be small and dependent on . More precisely, the difference in frequencies at which the devices operate caused a variable delay in the signal. Neither the Faro nor the FOB was consistently ahead of the other.
Finally, we measured the relative latency between the ultrasound image data and the mechanical tracker by holding the tracker stationary until the image data catches up, then setting a marker at the location of a bolt being imaged. We then swept the probe in an oscillating pattern, and progressively increased the delay added to the mechanical tracker readings until the ultrasound image data did not appear to waver from the known location of the bolt.
It appeared that the ultrasound image has higher latency than the tracking data. We decided to buffer the tracking data over an adjustable interval. We again made a sweeping motion over the bolt and adjusted the interval until the imaged object did not lag behind and appeared in the same place as the real object. The relative latency between the ultrasound image data and the Faro was 220ms at a frame rate of 10fps. The reason for such a high value is the long video format conversion path from the ultrasound machine to the host. This path could be shortened with hardware solutions currently not available to us.
In addition to knowing and managing the relative latency, we wanted to know the end-to-end latencies of the streams in the system, that is, the latencies between the data streams and the real world. We measured the latency of the camera to the real world. To do this, we used an LED blinking at a rate of 5Hz, set with a pulse generator. We used the LED as one trigger for an oscilloscope. We pointed our video-see-through camera at the LED. We taped a photoelectric sensor to the monitor where the image of the LED appeared, and connected it to the oscilloscope as a second trigger (Figure 5). This allowed us to see the two signals on the oscilloscope screen simultaneously: one with negligible latency that came directly from the pulse generator and one that came from the photoelectric sensor. We measured the latency between the real-world event (the LED blinking) and the time the photosensor "saw" that event by measuring the distance between the two signals on the oscilloscope. For our system, this value was 40ms, ranging between 30ms and 60ms. We took these measurements with computation turned off, so that the latency was Toffhost + Tdisplay + Tsync.
Figure 5: SYSTEM DIAGRAM OF CAMERA LATENCY EXPERIMENT.
We measured the end-to-end latency of the
camera video stream. The pulse generator triggers the LED, which in
turn triggers the oscilloscope. The LED blinks and is seen in the
camera image, which is subsequently seen by the photosensor. The photosensor
also triggers the oscilloscope. This causes two traces to appear on the
screen of the oscilloscope. By reading the scale of the oscilloscope
and measuring the distance between the traces, we measured the latency.
We measured the end-to-end latency of the camera (40ms). Adding the relative latency between the camera and the Faro (30ms) gives roughly 70ms for the Faro, which is consistent with the Faro specification: the off-host latency of the Faro is specified as 67ms.
First, we arranged the order of polling the devices to reflect the latencies measured in the previous section.
By predicting every stream to match the environment the user is in, the system would behave as if it had no latency. Only errors in the prediction can spoil the illusion. However, video streams are hard (or impossible) to predict since they have high bandwidth and are dynamic in behavior.
Our second option is to buffer and interpolate lower latency streams to match the higher latency ultrasound stream. Buffering the real-world video is expensive. We can only store the video directly in the frame-buffer. Also, the perceived delay of the system would grow since the real-world video gives the most visual cues to the user. We therefore decided to define two synchronization points, represented by the ultrasound and real-world video streams. We had to synchronize the tracker information with these two points (sUS, sCamera) (Figure 6).
Figure 6: EXTRAPOLATION, INTERPOLATION AND JUST-IN-TIME SAMPLING.
Just-in-time sampling is used to reduce relative latency.
The FOB is predicted to match the video
after which it is corrected with a vision-based algorithm.
The Faro is buffered and interpolated to match the
ultrasound video stream, and predicted to match the cameras.
We used linear extrapolation based on the last two values for the ultrasound probe's position and orientation (Pfaro). The prediction interval is based on the relative latency between the Faro and the cameras. Both streams are timestamped as they arrive on the host (rFaro, rCamera). Timestamping is done by reading a system clock before and after sampling and averaging these values in order to compensate for the time it takes to sample the stream. The extrapolation interval for the probe relative to the last Faro reading is: rCamera - rFaro + RelativeLatency, where RelativeLatency is the measured latency of the Faro relative to the camera video.
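The two steps described here, midpoint timestamping and two-point linear extrapolation, can be sketched as below. This is an illustration, not our C implementation: `timestamped_read` and `extrapolate` are hypothetical names, `read_device` stands for whatever call samples the stream, and the extrapolation interval is passed in as a parameter in the same time unit as the timestamps.

```python
import time


def timestamped_read(read_device, clock=time.perf_counter):
    """Sample a stream and stamp it with the midpoint of the clock readings
    taken just before and just after the sampling call."""
    before = clock()
    value = read_device()
    after = clock()
    return 0.5 * (before + after), value


def extrapolate(t_prev, v_prev, t_last, v_last, interval):
    """Linear prediction `interval` time units past the last reading,
    using the rate of change between the last two readings."""
    dt = t_last - t_prev
    return [last + (last - prev) / dt * interval
            for prev, last in zip(v_prev, v_last)]


# Hypothetical usage with a fake clock and a fake device reading.
ticks = iter([1.0, 3.0])
stamp, reading = timestamped_read(lambda: [1.0, 2.0],
                                  clock=lambda: next(ticks))
predicted = extrapolate(0.0, [0.0, 0.0], 10.0, [1.0, 2.0], interval=5.0)
```

Componentwise extrapolation as shown applies directly to positions; how orientation components are handled depends on the representation the tracker reports.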
The rendered model of the probe is now registered with the video image of the probe. The model's depth values are used in order to achieve correct occlusion.
We have eliminated a relative latency of 30ms in the tracking of the ultrasound probe. In analyzing registration error, Holloway  found that 1ms of latency can cause 1mm of registration error, so we have eliminated a potential source of up to 30mm of registration error. We of course introduce error in the prediction, but this is a small price.
We also use a linear extrapolation scheme for the camera position and orientation (Pflock, Figure 6) in order to synchronize with the real-world video stream. However, due to the noise and inaccuracy of the magnetic tracker signal, the linear predictor is not precise enough. We therefore apply a vision-based corrector (Cflock, Figure 6) that relies on color-coded landmarks visible in the image. This corrector typically eliminates relative latency (and inaccuracies in the tracking report), since its tracking information comes from the video image, to which we are synchronizing (sCamera) (Plate 1, left: original; Plate 4, left: method applied). We have thus eliminated a relative latency of 30ms in head tracking. Again, this was a potential source of up to 30mm of registration error.
Finally, we need a solution for the relative latency between the ultrasound video stream and the probe tracking data. The latency would otherwise cause misregistration (Plate 1, right and Plate 3). Since the Faro has lower latency than the ultrasound video (and we cannot predict the video), we store and interpolate (Ifaro, Figure 6) the tracking information. Again, upon arrival on host we timestamp the video image and the tracking data. This tracking data is then stored in a buffer. The interpolation interval for the probe, measured back in time from the last Faro reading, is: rFaro - rUltrasound + RelativeLatency, where RelativeLatency is the measured latency of the ultrasound video relative to the Faro.
A linearly interpolated value is computed from the previously stored data for the probe's position at the time the ultrasound image was taken. This computed position and orientation is then used for rendering the ultrasound slices. By applying this interpolation technique, we align ultrasound slices spatially with (the video of) the patient. This detaches the slice from the visible probe (in particular when the probe is moving), but eliminates relative latency between the ultrasound image data and the probe tracking data. This eliminates the relative latency as a source of up to 220mm of registration error between the location of the acquired ultrasound slices and the patient.
We have given general methods for handling multiple data streams. These methods are applicable to AR systems that have multiple input streams with different latencies. By eliminating relative latency, we remove a potential source of registration error. While latency-based misregistration often goes unnoticed or is considered unimportant, performance can be improved with simple synchronization schemes.
Relative latency can be managed by measuring the latency differences (part of the system-dependent set of calibrations), on-host timestamping, adjusting the moment of sampling, and interpolation or extrapolation of these data streams.
In our ultrasound system, synchronizing the probe tracking data with the ultrasound video data and predicting the probe model's position and orientation reduced errors significantly. The vision-based tracking scheme reduces latency-based misregistration by correcting the head-tracking report against landmarks in the video image.
These schemes show the importance of thinking about latency early in the design phase. Sampling a stream once per frame might not be sufficient if prediction is used. Adjusting the moment of sampling requires rearranging code which might be difficult to adjust later in the design phase.
The sampling frequency of the data streams in our system is currently proportional to the frame rate. We would like to have autonomous sampling processes for each stream . Every stream could then have its own (higher) sampling frequency making it possible to have more accurate interpolation and extrapolation functions. For video images this is hard, since video has high bandwidth and moving video around in memory costs time, but for tracking information this is easily feasible. Having more knowledge about the frequency and phase of the external device also holds the potential for further adjusting the moment of sampling.
Our assumption of off-host relative latency being static would not be necessary if manufacturers of the data-gathering devices timestamped the data with a real-world clock value. We think and hope it will be more common in the future for data-gathering devices to have real-world clock timestamping mechanisms. For the video signals, we could have a hardware device inserting a bit pattern at the source, representing the real-world time. This bit pattern could then be read on host. If all devices had timestamping mechanisms, latency measurement experiments and on-host timestamping would not be necessary. This would allow dynamic measurement of relative latency and thus more accurate compensation using the techniques we described.
Our hardware platform uses the Unix operating system. In Unix there are no accurate timing guarantees. The program is written in the C language. An operating system and programming language better suited to real-time operation would be useful. Even on such a machine one could not guarantee constant end-to-end latency due to lack of synchronization between input and output video streams. This could be remedied by applying a generator lock (genlock) to both the input and the output video streams. A similar level of control could be achieved by timestamping the vertical retrace events of input and output video streams.
We would like to express our gratitude to Henry Fuchs, Bill Garrett, Todd Gaul, David Harrison, Gentaro Hirota, Erik Jansen, Kevin Jeffay, Bill Mark, Mark Mine, Etta D. Pisano, Stephen M. Pizer, Frits Post, Paul Rademacher, Allen Sajedi, Mary Whitton, and the anonymous reviewers.
This work was supported in part by the ARPA DABT63-93-C-0048 ("Enabling Technologies and Application Demonstrations for Synthetic Environments"), the NSF Science and Technology Center for Computer Graphics and Scientific Visualization, and PIE Medical Corporation. Approved by ARPA for Public Release-Distribution Unlimited.
Managing Latency in Complex Augmented Reality Systems
The translation was initiated by Marco Jacobs on Wed Apr 16 21:33:55 EDT 1997