\documentstyle[11pt]{article}
\input{epsf}
\input{psfig}
\textheight 8.5in
\addtolength{\oddsidemargin}{-0.8in}
\textwidth 6.8in
\topmargin -0.3in
\headheight 0.1in
\headsep 0.1in
\def\heading#1{\noindent {\bf #1} \sp\sp}
\def\op{\odot}
\def\real{\cal R}
%\pagestyle{empty}

\begin{document}

\begin{center}
{\bf COMP205 Lecture Notes} \\
{\bf Prepared by: Dinesh Manocha} \\
{\bf Lectures 2-3: Finite Precision Computation}
\end{center}

\section{Relative Error and Significant Digits}

Let $\overline{x}$ be an approximation to a real number $x$. The most useful measures of the accuracy of $\overline{x}$ are its absolute error
\[
E_{abs}(\overline{x}) = | x - \overline{x}|,
\]
and its relative error
\[
E_{rel}(\overline{x}) = \frac{| x - \overline{x}|}{|x|},
\]
which is undefined if $x = 0$. An equivalent definition of relative error is $E_{rel}(\overline{x}) = |\delta|$, where $\overline{x} = x (1 + \delta)$. Relative error is connected with the notion of correct significant digits (or correct significant figures). The significant digits in a number are the first nonzero digit and all succeeding digits.

\section{Sources of Error}

There are three main sources of error in numerical computation: rounding, data uncertainty, and truncation.
\begin{itemize}
\item {\em Rounding errors} are an unavoidable consequence of working in finite precision arithmetic. We will deal with these errors in the first half of the course (in the context of polynomial evaluation and solving linear equations).
\item {\em Uncertainty} in the data is always a possibility when we are solving practical problems. It may arise in several ways: from errors in measuring physical quantities, from errors in storing the data on the computer (rounding errors), or, if the data is itself the solution to another problem, from errors in an earlier computation. The effects of errors in the data are generally easier to understand than the effects of rounding errors committed during a computation, because data errors can be analyzed using perturbation theory for the problem at hand, while intermediate rounding errors require an analysis specific to the given method.
\item {\em Truncation or discretization errors} are much harder to analyze. We will deal with them in the second half of the course, in the context of differential equations. Many standard numerical methods (for example, the trapezium rule for quadrature, Euler's method for differential equations, and Newton's method for nonlinear equations) can be derived by taking finitely many terms of a Taylor series. The terms omitted constitute the truncation error, and for many methods the size of this error depends on a parameter (often called the stepsize), whose appropriate value is a compromise between obtaining a small error and a fast computation. For many scientific problems, such approximations are necessary.
\end{itemize}

\section{Floating Point Number System}

A floating point number system $F \subset \real$ is a subset of the real numbers whose elements have the form
\[
y = \pm m \times \beta^{e-t}.
\]
The system $F$ is characterized by four integer parameters:
\begin{itemize}
\item the {\em base} $\beta$ (sometimes called the radix),
\item the {\em precision} $t$, and
\item the {\em exponent range} $e_{min} \leq e \leq e_{max}$.
\end{itemize}
The {\em mantissa} $m$ is an integer satisfying $0 \leq m \leq \beta^t - 1$. To ensure a unique representation for each $y \in F$, it is assumed that $m \geq \beta^{t-1}$ if $y \neq 0$, so that the system is normalized. In other words, the first digit of the mantissa is nonzero.
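For IEEE arithmetic ($\beta = 2$), this normalized representation can be inspected directly: the standard C library routine {\tt frexp} returns the fraction $m/\beta^{t} \in [1/2, 1)$ and the exponent $e$ such that $y = (m/\beta^{t}) \times 2^{e}$. The following minimal sketch prints this decomposition for a few values:
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Write y = (m/2^t) * 2^e with m/2^t in [1/2, 1), i.e. with a
       normalized base-2 mantissa, using the library routine frexp(). */
    double samples[] = { 0.1, 1.0, 3.14159, 1048576.0 };
    int i;

    for (i = 0; i < 4; i++) {
        int e;
        double frac = frexp(samples[i], &e);   /* samples[i] == frac * 2^e */
        printf("%12g = %.17f * 2^%d\n", samples[i], frac, e);
    }
    return 0;
}
\end{verbatim}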
The range of the non-zero floating point numbers in $F$ is given by
\[
\beta^{e_{min} - 1} \leq | y | \leq \beta^{e_{max}}(1 - \beta^{-t}).
\]
It follows that every real number $x$ lying in the range of $F$ can be approximated by an element of $F$ with a {\em relative error} no larger than $\epsilon = \frac{1}{2}\beta^{1-t}$. This quantity is called the {\em machine epsilon} or {\em unit roundoff}. It is the most useful quantity associated with $F$ and is ubiquitous in the world of rounding error analysis. We are mainly interested in the IEEE floating point number system, which has become the standard over the last few years.

\vspace*{0.2in}
\noindent{\bf IEEE Single Precision Arithmetic}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=3.0in\epsffile{single.ps}}
\vspace*{-.1in}
\caption{IEEE Single Precision Arithmetic}
\label{Single}
\end{figure}
\noindent Based on these values, the various parameters are:
\[
\mbox{Relative Error } \epsilon : 2^{-24} \approx 6 \times 10^{-8}
\]
\[
\mbox{Exponent Range} : 2^{-126} \mbox{ to } 2^{127} \ \Rightarrow \ 10^{-38} \leq |\mbox{ non-zero number }| \ \leq 2 \times 10^{38}
\]

\noindent{\bf IEEE Double Precision Arithmetic}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=3.0in\epsffile{double.ps}}
\vspace*{-.1in}
\caption{IEEE Double Precision Arithmetic}
\label{Double}
\end{figure}
\[
\mbox{Relative Error } \epsilon : 2^{-53} \approx 10^{-16}
\]
\[
\mbox{Exponent Range} : 2^{-1022} \mbox{ to } 2^{1023} \Rightarrow 2 \times 10^{-308} \leq |\mbox{ non-zero number }| \leq 10^{308}.
\]

Roundoff error results because the true value of $a \op b$, where $\op \in \{+,-,*,/ \}$, generally cannot be represented exactly and must be rounded. If we round as accurately as possible, and the floating point result is within the exponent range, then
\[
\mbox{fl} (a \op b) = (a \op b) (1 + \delta), \ \ \ |\delta| \leq \epsilon.
\]
We say that fl$(x)$ {\em overflows} if $|\mbox{fl}(x)| > \mbox{max}\{ |y|: y \in F\}$ and {\em underflows} if $0 < |\mbox{fl}(x)| < \mbox{min}\{ |y| : 0 \neq y \in F\}$.

To see the impact of rounding and truncation, let us consider the following C program.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main()
{
    float  f;
    double d, p;
    int    i;

    /* 2^30 + 511 requires 31 significant bits, more than a float's 24 */
    i = 32768 * 32768 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1;
    f = (float) i;
    d = (double) i;
    p = fabs(f - d) / i;
    printf("%d %4.16f %4.16lf %4.16f \n", i, f, d, p);
    return 0;
}
\end{verbatim}
What are the values of $i$, $f$, $d$ and $p$? The errors caused by improper rounding or truncation can be very serious at times.

\section{Patriot Missile Software Problem}

A report from the United States General Accounting Office begins: ``On February 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud. This Scud subsequently hit an Army barracks, killing $28$ Americans." More details can be found in the following reference:

\noindent {\em Patriot missile defense: Software problems led to system failure at Dhahran, Saudi Arabia. Report GAO/IMTEC-92-26, Information Management and Technology Division, US General Accounting Office, Washington DC, Feb. 1992, 16 pp.}

\noindent The report finds that the failure to track the Scud missile was caused by a precision problem in the software. The computer used to control the Patriot missile is based on a 1970s design and uses 24-bit arithmetic.
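The arithmetic behind the failure is explained below and quantified in Table \ref{Army}: the system clock counts in tenths of a second, and the constant $0.1$ used to convert clock ticks to seconds is chopped to $23$ bits after the binary point. The following sketch is an illustrative model written for these notes, not the actual flight code; the $\approx 2000$ m/s closing speed is inferred from Table \ref{Army}. It reproduces the drift and range-gate figures in the table.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* 0.1 chopped to 23 bits after the binary point (24-bit arithmetic) */
    double tenth   = floor(0.1 * 8388608.0) / 8388608.0;  /* 8388608 = 2^23 */
    double err     = 0.1 - tenth;           /* absolute error per clock tick */
    double speed   = 2000.0;                /* approx. closing speed in m/s  */
    double hours[] = { 1, 8, 20, 48, 72, 100 };
    int k;

    for (k = 0; k < 6; k++) {
        double ticks = hours[k] * 3600.0 * 10.0;   /* tenths of a second     */
        double drift = ticks * err;                /* accumulated time error */
        printf("%4.0f hours: error %.4f s, range-gate shift approx. %3.0f m\n",
               hours[k], drift, drift * speed);
    }
    return 0;
}
\end{verbatim}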
The Patriot system tracks its target by measuring the time it takes for radar pulses to bounce back from it. Time is recorded by the system clock in tenths of a second, but is stored as an integer. To enable tracking calculations the time is converted to a $24$-bit floating point number. Rounding errors in the time conversions cause shifts in the system's ``range gate", which is used to track the target.

\begin{table}
\begin{center}
\begin{tabular}{||c|c|c|c|c||} \hline \hline
Hours & Seconds & Calculated Time & Inaccuracy & Approximate Shift in \\
      &         & (seconds)       & (seconds)  & Range Gate (meters)  \\ \hline
0   & 0      & 0           & 0      & 0   \\
1   & 3600   & 3599.9966   & .0034  & 7   \\
8   & 28800  & 28799.9725  & .0275  & 55  \\
20\footnotemark[1]  & 72000  & 71999.9313  & .0687  & 137 \\
48  & 172800 & 172799.8352 & .1648  & 330 \\
72  & 259200 & 259199.7528 & .2472  & 494 \\
100\footnotemark[2] & 360000 & 359999.6567 & .3433  & 687 \\ \hline \hline
\end{tabular}
\caption{Effect of extended run time on Patriot missile operation}
\label{Army}
\end{center}
\end{table}

On Feb. 11, 1991, the Patriot Project Office received field data identifying a $20\%$ shift in the Patriot system's range gate after the system had been running continuously for $8$ hours. This data implied that after $20$ consecutive hours of use the system would fail to track and intercept a Scud. Modified software that compensated for the inaccurate time calculation was released by army officials on Feb. 16. On Feb. 25, Alpha Battery, which was protecting the Dhahran Air Base, had been in continuous operation for over $100$ hours. The inaccurate time calculations caused the range gate to shift so much that the system could not track the incoming Scud. On Feb. 26, the next day, the modified software arrived in Dhahran.

Table \ref{Army}, taken from the report cited above, shows clearly how, with increasing time of operation, the Patriot lost track of its target. Note that the numbers in Table \ref{Army} are consistent with a relative error of $2^{-20}$ in the computer's representation of $0.1$, this constant being used to convert the system clock's count of tenths of a second into seconds ($2^{-20}$ is the relative error introduced by chopping $0.1$ to $23$ bits after the binary point).

\footnotetext[1]{Continuous operation exceeding $20$ hours places the target outside the range gate.}
\footnotetext[2]{Alpha Battery ran continuously for over $100$ hours.}

\begin{figure}[thb]
\centering
\psfig{file=error.ps}
\caption{Backward and forward errors for $y = f(x)$. The thick line corresponds to the exact value and the thin line to the computed value.}
\label{Error}
\end{figure}

\section{Backward and Forward Errors}

Suppose that an approximation $\overline{y}$ to $y = f(x)$ is computed in an arithmetic of precision $\epsilon$, where $f$ is a real scalar function of a real scalar variable. How should we measure the ``quality" of $\overline{y}$? In many cases, we would be happy with a tiny relative error, $E_{rel}(\overline{y}) \approx \epsilon$, but this cannot always be achieved. Instead of focusing on the relative error of $\overline{y}$, we can ask ``for what set of data have we actually solved our problem?", that is, for what $\Delta x$ do we have $\overline{y} = f(x + \Delta x)$? In general, there may be many such $\Delta x$, so we should ask for the smallest one. The value of $|\Delta x|$, possibly divided by $|x|$, is called the {\em backward error}.
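As a concrete illustration, take $f(x) = \sqrt{x}$ and compute $\overline{y} = \mbox{fl}(\sqrt{x})$ in single precision. Since $\overline{y} = \sqrt{x + \Delta x}$ exactly when $\Delta x = \overline{y}^2 - x$, both the relative error of $\overline{y}$ and the relative backward error $|\Delta x|/|x|$ can be computed directly. A minimal sketch, assuming IEEE single and double precision:
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x    = 2.0;
    float  ybar = sqrtf((float) x);          /* computed in single precision */
    double dx   = (double) ybar * ybar - x;  /* ybar = sqrt(x + dx) exactly  */

    printf("relative error of ybar : %e\n", fabs(ybar - sqrt(x)) / sqrt(x));
    printf("relative backward error: %e\n", fabs(dx) / fabs(x));
    return 0;
}
\end{verbatim}
For $x = 2$ both quantities come out on the order of the single precision unit roundoff $2^{-24}$.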
The absolute and relative errors of $\overline{y}$ are called {\em forward errors}. Figure \ref{Error} highlights the relationship between these errors.

The process of bounding the backward error of a computed solution is called {\em backward error analysis}. It interprets rounding errors as being equivalent to perturbations in the data. The data frequently contains uncertainties due to previous computations or errors committed in storing numbers on the computer. If the backward error is no larger than these uncertainties, then the computed solution can hardly be criticized.

A method for computing $y = f(x)$ is called {\em backward stable} if, for any $x$, it produces a computed $\overline{y}$ with a small backward error, that is, $\overline{y} = f(x + \Delta x)$ for some small $\Delta x$. The definition of ``small" is context dependent. Often we require $\Delta x = O(\epsilon)$, where $\epsilon$ is the machine epsilon.

\section{Conditioning}

The relationship between forward and backward error for a problem is governed by the conditioning of the problem, that is, the sensitivity of the solution to perturbations in the data. Using the $y = f(x)$ example, let an approximate solution $\overline{y}$ satisfy $\overline{y} = f(x + \Delta x)$. Then, assuming that $f$ is twice differentiable,
\[
\overline{y} - y = f(x + \Delta x) - f(x) = f^{'}(x) \Delta x + \frac{f^{''}(x + \theta \Delta x)}{2!} (\Delta x)^2, \ \ \ \ \theta \in (0,1),
\]
and we can bound or estimate the right hand side. This expansion leads to the notion of a condition number. Since
\[
\frac{\overline{y} - y}{y} = \left( \frac{x f^{'}(x)}{f(x)} \right) \frac{\Delta x}{x} + O((\Delta x)^2),
\]
the quantity
\[
c(x) = \left| \frac{x f^{'}(x)}{f(x)} \right|
\]
measures, for small $\Delta x$, the relative change in the output for a given relative change in the input, and it is called the {\em condition number} of $f$. If $x$ or $f$ is a vector, then the condition number is defined in a similar way using norms (we will cover them later in the course) and it measures the maximum relative change. In general, the forward error, backward error and condition number satisfy the following rule:
\[
\mbox{ forward error } \leq \mbox{ condition number} \times \mbox{ backward error}.
\]
The condition number of a problem is always greater than or equal to one. A problem is {\em ill-conditioned} if its condition number is high. It is {\em ill-posed} if its condition number is very high (say, infinite).

\section{Error Analysis}

Consider the dot product of two vectors, $s_n = {\bf x}^T {\bf y}$, where ${\bf x}, {\bf y} \in {\real}^n$. Since the order of evaluation of $s_n = x_1 y_1 + \ldots + x_n y_n$ affects the analysis (but not the final error bound), we will assume that the evaluation is from left to right. Let $s_i = x_1 y_1 + \ldots + x_i y_i$ denote the $i$th partial sum. Using the IEEE error model, we have
\[
\overline{s_1} = \mbox{fl}(x_1 y_1) = x_1 y_1 (1 + \delta_1),
\]
\[
\overline{s_2} = \mbox{fl}(\overline{s_1} + x_2 y_2) = (\overline{s}_1 + x_2 y_2 (1 + \delta_2))(1 + \delta_3)
\]
\[
= x_1 y_1 (1 + \delta_1)(1 + \delta_3) + x_2 y_2 (1 + \delta_2) (1 + \delta_3),
\]
where $|\delta_i| \leq \epsilon, \ i = 1:3$. For our analysis we will not differentiate between the different $\delta_i$, so to simplify the expressions let us drop the subscripts on the $\delta_i$ and write $1 + \delta_i \equiv 1 \pm \delta$.
Then
\[
\overline{s_3} = \mbox{ fl}(\overline{s_2} + x_3 y_3) = (\overline{s_2} + x_3 y_3 ( 1 \pm \delta))(1 \pm \delta)
\]
\[
= x_1 y_1 ( 1 \pm \delta)^3 + x_2 y_2 (1 \pm \delta)^3 + x_3 y_3 (1 \pm \delta)^2.
\]
\noindent Based on this pattern, we have
\[
\overline{s_n} = x_1 y_1 (1 \pm \delta)^n + x_2 y_2 ( 1 \pm \delta)^n + x_3 y_3 ( 1 \pm \delta)^{n-1} + \ldots + x_n y_n ( 1 \pm \delta)^2.
\]
There are various ways to simplify this result. In particular, we make use of the following lemma:

\noindent {\bf Lemma} {\em If $|\delta_i| \leq \epsilon$ and $\rho_i = \pm 1$ for $i = 1:n$, and $n \epsilon < 1$, then
\[
\prod_{i=1}^n (1 + \delta_i)^{\rho_i} = 1 + \theta_n,
\]
where
\[
|\theta_n| \leq \frac{n \epsilon}{1 - n \epsilon} \approx n \epsilon \quad \mbox{when } n \epsilon \ll 1.
\]
}

\noindent Applying the lemma to $\overline{s_n}$, we obtain
\[
\overline{s}_n = x_1 y_1 ( 1 + \theta_n) + x_2 y_2 (1 + \theta^{'}_n) + x_3 y_3 (1 + \theta_{n-1}) + \ldots + x_n y_n (1 + \theta_2).
\]
This is a backward error result and may be interpreted as follows: the computed inner product is the exact one for a perturbed set of data $x_1, \ldots, x_n$, $y_1 (1 + \theta_n), y_2 (1 + \theta^{'}_n), \ldots, y_n(1 + \theta_2)$. Each perturbation is ``small".

\section{Polynomial Evaluation}

{\bf Goal}: Evaluate a polynomial $p(x) = \sum^{d}_{i=0} a_i x^i$. We use Horner's rule, given as:
\begin{eqnarray}
p & = & a_d \nonumber \\
\mbox{for } i & = & d-1 \mbox{ down to } 0 \mbox{ do } \nonumber \\
p & = & x * p + a_i
\end{eqnarray}
Let us apply it to $p(x) = (x - 2)^9 = x^9 - 18 x^8 + 144 x^7 - 672 x^6 + 2016 x^5 - 4032 x^4 + 5376 x^3 - 4608 x^2 + 2304 x - 512$.

\noindent What would happen if we tried to find a zero of $p(x)$ using a simple zero finder? The root finder is based on the bisection algorithm:
\begin{verbatim}
function bisect(xl, xu, tol, p)
  /* find a zero of p(x) in [xl, xu], assuming p(xl) * p(xu) < 0 */
  /* stop once the zero is located to within +/- tol             */
  if (xu - xl) <= (2 * tol)
      return (xl + xu)/2
  else
      mid = (xl + xu)/2
      pm  = p(mid)       /* the polynomial p(x) is evaluated at x = mid */
      if p(xl) * pm < 0 then
          return bisect(xl, mid, tol, p)
      else
          return bisect(mid, xu, tol, p)
      endif
  endif
\end{verbatim}
\noindent Let us apply it to $p(x)$ with $xl = 1.96$ and $xu = 2.04$. What will happen? Look at the graphs of $p(x)$ shown in Fig. \ref{Horner1} and Fig. \ref{Horner2}.
\begin{figure}[htb]
\centering
\mbox{\epsfxsize=6.5in\epsffile{hornera.ps}}
\caption{Graph of $p(x)$}
\label{Horner1}
\end{figure}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=6.5in\epsffile{horner2.ps}}
\vspace*{-.1in}
\caption{Graph of $p(x)$ in the neighborhood of $x = 2$, evaluated using IEEE double precision arithmetic}
\label{Horner2}
\end{figure}
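To reproduce the behavior shown in Fig. \ref{Horner2}, the following minimal sketch (assuming IEEE double precision) evaluates $p(x)$ near $x = 2$ both by Horner's rule on the expanded coefficients and directly as $(x-2)^9$. Because of cancellation among the large intermediate terms, the Horner values oscillate in sign about zero, while the factored form passes smoothly through the root.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

/* Horner's rule: p(x) = sum_{i=0}^{d} a[i] x^i, coefficients a[0..d] */
double horner(const double a[], int d, double x)
{
    double p = a[d];
    int i;
    for (i = d - 1; i >= 0; i--)
        p = x * p + a[i];
    return p;
}

int main(void)
{
    /* coefficients of (x-2)^9, lowest degree first */
    double a[] = { -512, 2304, -4608, 5376, -4032, 2016, -672, 144, -18, 1 };
    int k;

    for (k = 0; k <= 8; k++) {
        double x = 1.96 + 0.01 * k;
        printf("x = %5.2f   Horner: % .3e   (x-2)^9: % .3e\n",
               x, horner(a, 9, x), pow(x - 2.0, 9.0));
    }
    return 0;
}
\end{verbatim}

\end{document}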