\documentstyle[11pt]{article}
\input{epsf}
\input{psfig}
\textheight 8.5in
\addtolength{\oddsidemargin}{-0.8in}
\textwidth 6.8in
\topmargin -0.3in
\headheight 0.1in
\headsep 0.1in
\def\heading#1{\noindent {\bf #1} \sp\sp}
\def\op{\odot}
\def\real{\cal R}
%\pagestyle{empty}

\begin{document}

\begin{center}
{\bf COMP205 Lecture Notes} \\
{\bf Prepared by: Dinesh Manocha} \\
{\bf Lectures 2-3: Finite Precision Computation}
\end{center}

\section{Relative Error and Significant Digits}

Let $\overline{x}$ be an approximation to a real number $x$. The most useful measures of the accuracy of $\overline{x}$ are its absolute error
\[
E_{abs}(\overline{x}) = | x - \overline{x}|,
\]
and its relative error
\[
E_{rel}(\overline{x}) = \frac{| x - \overline{x}|}{|x|},
\]
which is undefined if $x = 0$. An equivalent definition of relative error is $E_{rel}(\overline{x}) = |\delta|$, where $\overline{x} = x (1 + \delta)$. Relative error is connected with the notion of correct significant digits (or correct significant figures). The significant digits in a number are the first nonzero digit and all succeeding digits.

\section{Sources of Error}

There are three main sources of error in numerical computation: rounding, data uncertainty, and truncation.
\begin{itemize}
\item {\em Rounding errors} are an unavoidable consequence of working in finite precision arithmetic. We will deal with these errors in the first half of the course (in the context of polynomial evaluation and solving linear equations).
\item {\em Uncertainty} in the data is always a possibility when we are solving practical problems. It may arise in several ways: from errors in measuring physical quantities, from errors in storing the data on the computer (rounding errors), or, if the data is itself the solution to another problem, from errors in an earlier computation. The effects of errors in the data are generally easier to understand than the effects of rounding errors committed during a computation, because data errors can be analyzed using perturbation theory for the problem at hand, while intermediate rounding errors require an analysis specific to the given method.
\item {\em Truncation or discretization errors} are much harder to analyze. We will deal with them in the second half of the course, in the context of differential equations. Many standard numerical methods (for example, the trapezium rule for quadrature, Euler's method for differential equations, and Newton's method for nonlinear equations) can be derived by taking finitely many terms of a Taylor series. The terms omitted constitute the truncation error, and for many methods the size of this error depends on a parameter (often called the stepsize), whose appropriate value is a compromise between obtaining a small error and a fast computation. For many scientific problems, such approximations are necessary.
\end{itemize}

\section{Floating Point Number System}

A floating point number system $F \subset \real$ is a subset of the real numbers whose elements have the form
\[
y = \pm m \times \beta^{e-t}.
\]
The system $F$ is characterized by four integer parameters:
\begin{itemize}
\item the {\em base} $\beta$ (sometimes called the radix),
\item the {\em precision} $t$, and
\item the {\em exponent range} $e_{min} \leq e \leq e_{max}$.
\end{itemize}
The {\em mantissa} $m$ is an integer satisfying $0 \leq m \leq \beta^t - 1$. To ensure a unique representation for each $y \in F$, it is assumed that $m \geq \beta^{t-1}$ if $y \neq 0$, so that the system is normalized. In other words, the first digit of the mantissa is nonzero.
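For IEEE arithmetic ($\beta = 2$), this normalized representation can be inspected directly: the standard C library routine {\tt frexp} returns the fraction $m/\beta^{t} \in [1/2, 1)$ and the exponent $e$ such that $y = (m/\beta^{t}) \times 2^{e}$. The following minimal sketch prints this decomposition for a few values:
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Write y = (m/2^t) * 2^e with m/2^t in [1/2, 1), i.e. with a
       normalized base-2 mantissa, using the library routine frexp(). */
    double samples[] = { 0.1, 1.0, 3.14159, 1048576.0 };
    int i;

    for (i = 0; i < 4; i++) {
        int e;
        double frac = frexp(samples[i], &e);   /* samples[i] == frac * 2^e */
        printf("%12g = %.17f * 2^%d\n", samples[i], frac, e);
    }
    return 0;
}
\end{verbatim}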
The range of the non-zero floating point numbers in $F$ is given by
\[
\beta^{e_{min} - 1} \leq | y | \leq \beta^{e_{max}}(1 - \beta^{-t}).
\]
It follows that every real number $x$ lying in the range of $F$ can be approximated by an element of $F$ with a {\em relative error} no larger than $\epsilon = \frac{1}{2}\beta^{1-t}$. This quantity is called the {\em machine epsilon} or {\em unit roundoff}. It is the most useful quantity associated with $F$ and is ubiquitous in the world of rounding error analysis. We are mainly interested in the IEEE floating point number system, which has become the standard over the last few years.

\vspace*{0.2in}
\noindent{\bf IEEE Single Precision Arithmetic}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=3.0in\epsffile{single.ps}}
\vspace*{-.1in}
\caption{IEEE Single Precision Arithmetic}
\label{Single}
\end{figure}
\noindent Based on these values, the various parameters are:
\[
\mbox{Relative Error } \epsilon : 2^{-24} \approx 6 \times 10^{-8}
\]
\[
\mbox{Exponent Range} : 2^{-126} \mbox{ to } 2^{127} \ \Rightarrow \ 10^{-38} \leq |\mbox{ non-zero number }| \ \leq 2 \times 10^{38}
\]

\noindent{\bf IEEE Double Precision Arithmetic}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=3.0in\epsffile{double.ps}}
\vspace*{-.1in}
\caption{IEEE Double Precision Arithmetic}
\label{Double}
\end{figure}
\[
\mbox{Relative Error } \epsilon : 2^{-53} \approx 10^{-16}
\]
\[
\mbox{Exponent Range} : 2^{-1022} \mbox{ to } 2^{1023} \Rightarrow 2 \times 10^{-308} \leq |\mbox{ non-zero number }| \leq 10^{308}.
\]

Roundoff error results because the true value of $a \op b$, where $\op \in \{+,-,*,/ \}$, generally cannot be represented exactly and must be rounded. If we round as accurately as possible, and the floating point result is within the exponent range, then
\[
\mbox{fl} (a \op b) = (a \op b) (1 + \delta), \ \ \ |\delta| \leq \epsilon.
\]
We say that fl$(x)$ {\em overflows} if $|\mbox{fl}(x)| > \mbox{max}\{ |y|: y \in F\}$ and {\em underflows} if $0 < |\mbox{fl}(x)| < \mbox{min}\{ |y| : 0 \neq y \in F\}$.

To see the impact of rounding and truncation, let us consider the following C program.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main()
{
    float  f;
    double d, p;
    int    i;

    /* 2^30 + 511 requires 31 significant bits, more than a float's 24 */
    i = 32768 * 32768 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1;
    f = (float) i;
    d = (double) i;
    p = fabs(f - d) / i;
    printf("%d %4.16f %4.16lf %4.16f \n", i, f, d, p);
    return 0;
}
\end{verbatim}
What are the values of $i$, $f$, $d$ and $p$? The errors caused by improper rounding or truncation can be very serious at times.

\section{Patriot Missile Software Problem}

A report from the United States General Accounting Office begins: ``On February 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud. This Scud subsequently hit an Army barracks, killing $28$ Americans." More details can be found in the following reference:

\noindent {\em Patriot missile defense: Software problems led to system failure at Dhahran, Saudi Arabia. Report GAO/IMTEC-92-26, Information Management and Technology Division, US General Accounting Office, Washington DC, Feb. 1992, 16 pp.}

\noindent The report finds that the failure to track the Scud missile was caused by a precision problem in the software. The computer used to control the Patriot missile is based on a 1970s design and uses 24-bit arithmetic.
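The arithmetic behind the failure is explained below and quantified in Table \ref{Army}: the system clock counts in tenths of a second, and the constant $0.1$ used to convert clock ticks to seconds is chopped to $23$ bits after the binary point. The following sketch is an illustrative model written for these notes, not the actual flight code; the $\approx 2000$ m/s closing speed is inferred from Table \ref{Army}. It reproduces the drift and range-gate figures in the table.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* 0.1 chopped to 23 bits after the binary point (24-bit arithmetic) */
    double tenth   = floor(0.1 * 8388608.0) / 8388608.0;  /* 8388608 = 2^23 */
    double err     = 0.1 - tenth;           /* absolute error per clock tick */
    double speed   = 2000.0;                /* approx. closing speed in m/s  */
    double hours[] = { 1, 8, 20, 48, 72, 100 };
    int k;

    for (k = 0; k < 6; k++) {
        double ticks = hours[k] * 3600.0 * 10.0;   /* tenths of a second     */
        double drift = ticks * err;                /* accumulated time error */
        printf("%4.0f hours: error %.4f s, range-gate shift approx. %3.0f m\n",
               hours[k], drift, drift * speed);
    }
    return 0;
}
\end{verbatim}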
The Patriot system tracks its target by measuring the time it takes for radar pulses to bounce back from it. Time is recorded by the system clock in tenths of a second, but is stored as an integer. To enable tracking calculations the time is converted to a $24$-bit floating point number. Rounding errors in the time conversions cause shifts in the system's ``range gate", which is used to track the target.

\begin{table}
\begin{center}
\begin{tabular}{||c|c|c|c|c||} \hline \hline
Hours & Seconds & Calculated Time & Inaccuracy & Approximate Shift in \\
      &         & (seconds)       & (seconds)  & Range Gate (meters)  \\ \hline
0   & 0      & 0           & 0      & 0   \\
1   & 3600   & 3599.9966   & .0034  & 7   \\
8   & 28800  & 28799.9725  & .0275  & 55  \\
20\footnotemark[1]  & 72000  & 71999.9313  & .0687  & 137 \\
48  & 172800 & 172799.8352 & .1648  & 330 \\
72  & 259200 & 259199.7528 & .2472  & 494 \\
100\footnotemark[2] & 360000 & 359999.6567 & .3433  & 687 \\ \hline \hline
\end{tabular}
\caption{Effect of extended run time on Patriot missile operation}
\label{Army}
\end{center}
\end{table}

On Feb. 11, 1991, the Patriot Project Office received field data identifying a $20\%$ shift in the Patriot system's range gate after the system had been running continuously for $8$ hours. This data implied that after $20$ consecutive hours of use the system would fail to track and intercept a Scud. Modified software that compensated for the inaccurate time calculation was released by army officials on Feb. 16. On Feb. 25, Alpha Battery, which was protecting the Dhahran Air Base, had been in continuous operation for over $100$ hours. The inaccurate time calculations caused the range gate to shift so much that the system could not track the incoming Scud. On Feb. 26, the next day, the modified software arrived in Dhahran.

Table \ref{Army}, taken from the report cited above, shows clearly how, with increasing time of operation, the Patriot lost track of its target. Note that the numbers in Table \ref{Army} are consistent with a relative error of $2^{-20}$ in the computer's representation of $0.1$, this constant being used to convert the system clock's count of tenths of a second into seconds ($2^{-20}$ is the relative error introduced by chopping $0.1$ to $23$ bits after the binary point).

\footnotetext[1]{Continuous operation exceeding $20$ hours places the target outside the range gate.}
\footnotetext[2]{Alpha Battery ran continuously for over $100$ hours.}

\begin{figure}[thb]
\centering
\psfig{file=error.ps}
\caption{Backward and forward errors for $y = f(x)$. The thick line corresponds to the exact value and the thin line to the computed value.}
\label{Error}
\end{figure}

\section{Backward and Forward Errors}

Suppose that an approximation $\overline{y}$ to $y = f(x)$ is computed in an arithmetic of precision $\epsilon$, where $f$ is a real scalar function of a real scalar variable. How should we measure the ``quality" of $\overline{y}$? In many cases, we would be happy with a tiny relative error, $E_{rel}(\overline{y}) \approx \epsilon$, but this cannot always be achieved. Instead of focusing on the relative error of $\overline{y}$, we can ask ``for what set of data have we actually solved our problem?", that is, for what $\Delta x$ do we have $\overline{y} = f(x + \Delta x)$? In general, there may be many such $\Delta x$, so we should ask for the smallest one. The value of $|\Delta x|$, possibly divided by $|x|$, is called the {\em backward error}.
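As a concrete illustration, take $f(x) = \sqrt{x}$ and compute $\overline{y} = \mbox{fl}(\sqrt{x})$ in single precision. Since $\overline{y} = \sqrt{x + \Delta x}$ exactly when $\Delta x = \overline{y}^2 - x$, both the relative error of $\overline{y}$ and the relative backward error $|\Delta x|/|x|$ can be computed directly. A minimal sketch, assuming IEEE single and double precision:
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x    = 2.0;
    float  ybar = sqrtf((float) x);          /* computed in single precision */
    double dx   = (double) ybar * ybar - x;  /* ybar = sqrt(x + dx) exactly  */

    printf("relative error of ybar : %e\n", fabs(ybar - sqrt(x)) / sqrt(x));
    printf("relative backward error: %e\n", fabs(dx) / fabs(x));
    return 0;
}
\end{verbatim}
For $x = 2$ both quantities come out on the order of the single precision unit roundoff $2^{-24}$.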
The absolute and relative errors of $\overline{y}$ are called {\em forward errors}. Figure \ref{Error} highlights the relationship between these errors.

The process of bounding the backward error of a computed solution is called {\em backward error analysis}. It interprets rounding errors as being equivalent to perturbations in the data. The data frequently contains uncertainties due to previous computations or errors committed in storing numbers on the computer. If the backward error is no larger than these uncertainties, then the computed solution can hardly be criticized.

A method for computing $y = f(x)$ is called {\em backward stable} if, for any $x$, it produces a computed $\overline{y}$ with a small backward error, that is, $\overline{y} = f(x + \Delta x)$ for some small $\Delta x$. The definition of ``small" is context dependent. Often we require $\Delta x = O(\epsilon)$, where $\epsilon$ is the machine epsilon.

\section{Conditioning}

The relationship between forward and backward error for a problem is governed by the conditioning of the problem, that is, the sensitivity of the solution to perturbations in the data. Using the $y = f(x)$ example, let an approximate solution $\overline{y}$ satisfy $\overline{y} = f(x + \Delta x)$. Then, assuming that $f$ is twice differentiable,
\[
\overline{y} - y = f(x + \Delta x) - f(x) = f^{'}(x) \Delta x + \frac{f^{''}(x + \theta \Delta x)}{2!} (\Delta x)^2, \ \ \ \ \theta \in (0,1),
\]
and we can bound or estimate the right hand side. This expansion leads to the notion of a condition number. Since
\[
\frac{\overline{y} - y}{y} = \left( \frac{x f^{'}(x)}{f(x)} \right) \frac{\Delta x}{x} + O((\Delta x)^2),
\]
the quantity
\[
c(x) = \left| \frac{x f^{'}(x)}{f(x)} \right|
\]
measures, for small $\Delta x$, the relative change in the output for a given relative change in the input, and it is called the {\em condition number} of $f$. If $x$ or $f$ is a vector, then the condition number is defined in a similar way using norms (we will cover them later in the course) and it measures the maximum relative change. In general, the forward error, backward error and condition number satisfy the following rule:
\[
\mbox{ forward error } \leq \mbox{ condition number} \times \mbox{ backward error}.
\]
The condition number of a problem is always greater than or equal to one. A problem is {\em ill-conditioned} if its condition number is high. It is {\em ill-posed} if its condition number is very high (say, infinite).

\section{Error Analysis}

Consider the dot product of two vectors, $s_n = {\bf x}^T {\bf y}$, where ${\bf x}, {\bf y} \in {\real}^n$. Since the order of evaluation of $s_n = x_1 y_1 + \ldots + x_n y_n$ affects the analysis (but not the final error bound), we will assume that the evaluation is from left to right. Let $s_i = x_1 y_1 + \ldots + x_i y_i$ denote the $i$th partial sum. Using the IEEE error model, we have
\[
\overline{s_1} = \mbox{fl}(x_1 y_1) = x_1 y_1 (1 + \delta_1),
\]
\[
\overline{s_2} = \mbox{fl}(\overline{s_1} + x_2 y_2) = (\overline{s}_1 + x_2 y_2 (1 + \delta_2))(1 + \delta_3)
\]
\[
= x_1 y_1 (1 + \delta_1)(1 + \delta_3) + x_2 y_2 (1 + \delta_2) (1 + \delta_3),
\]
where $|\delta_i| \leq \epsilon, \ i = 1:3$. For our analysis we will not differentiate between the different $\delta_i$, so to simplify the expressions let us drop the subscripts on the $\delta_i$ and write $1 + \delta_i \equiv 1 \pm \delta$.
Then
\[
\overline{s_3} = \mbox{ fl}(\overline{s_2} + x_3 y_3) = (\overline{s_2} + x_3 y_3 ( 1 \pm \delta))(1 \pm \delta)
\]
\[
= x_1 y_1 ( 1 \pm \delta)^3 + x_2 y_2 (1 \pm \delta)^3 + x_3 y_3 (1 \pm \delta)^2.
\]
\noindent Based on this pattern, we have
\[
\overline{s_n} = x_1 y_1 (1 \pm \delta)^n + x_2 y_2 ( 1 \pm \delta)^n + x_3 y_3 ( 1 \pm \delta)^{n-1} + \ldots + x_n y_n ( 1 \pm \delta)^2.
\]
There are various ways to simplify this result. In particular, we make use of the following lemma:

\noindent {\bf Lemma} {\em If $|\delta_i| \leq \epsilon$ and $\rho_i = \pm 1$ for $i = 1:n$, and $n \epsilon < 1$, then
\[
\prod_{i=1}^n (1 + \delta_i)^{\rho_i} = 1 + \theta_n,
\]
where
\[
|\theta_n| \leq \frac{n \epsilon}{1 - n \epsilon} \approx n \epsilon \quad \mbox{when } n \epsilon \ll 1.
\]
}

\noindent Applying the lemma to $\overline{s_n}$, we obtain
\[
\overline{s}_n = x_1 y_1 ( 1 + \theta_n) + x_2 y_2 (1 + \theta^{'}_n) + x_3 y_3 (1 + \theta_{n-1}) + \ldots + x_n y_n (1 + \theta_2).
\]
This is a backward error result and may be interpreted as follows: the computed inner product is the exact one for a perturbed set of data $x_1, \ldots, x_n$, $y_1 (1 + \theta_n), y_2 (1 + \theta^{'}_n), \ldots, y_n(1 + \theta_2)$. Each perturbation is ``small".

\section{Polynomial Evaluation}

{\bf Goal}: Evaluate a polynomial $p(x) = \sum^{d}_{i=0} a_i x^i$. We use Horner's rule, given as:
\begin{eqnarray}
p & = & a_d \nonumber \\
\mbox{for } i & = & d-1 \mbox{ down to } 0 \mbox{ do } \nonumber \\
p & = & x * p + a_i
\end{eqnarray}
Let us apply it to $p(x) = (x - 2)^9 = x^9 - 18 x^8 + 144 x^7 - 672 x^6 + 2016 x^5 - 4032 x^4 + 5376 x^3 - 4608 x^2 + 2304 x - 512$.

\noindent What would happen if we tried to find a zero of $p(x)$ using a simple zero finder? The root finder is based on the bisection algorithm:
\begin{verbatim}
function bisect(xl, xu, tol, p)
  /* find a zero of p(x) in [xl, xu], assuming p(xl) * p(xu) < 0 */
  /* stop once the zero is located to within +/- tol             */
  if (xu - xl) <= (2 * tol)
      return (xl + xu)/2
  else
      mid = (xl + xu)/2
      pm  = p(mid)       /* the polynomial p(x) is evaluated at x = mid */
      if p(xl) * pm < 0 then
          return bisect(xl, mid, tol, p)
      else
          return bisect(mid, xu, tol, p)
      endif
  endif
\end{verbatim}
\noindent Let us apply it to $p(x)$ with $xl = 1.96$ and $xu = 2.04$. What will happen? Look at the graphs of $p(x)$ shown in Fig. \ref{Horner1} and Fig. \ref{Horner2}.
\begin{figure}[htb]
\centering
\mbox{\epsfxsize=6.5in\epsffile{hornera.ps}}
\caption{Graph of $p(x)$}
\label{Horner1}
\end{figure}
\begin{figure}[htb]
\centering
\vspace*{-.1in}
\mbox{\epsfxsize=6.5in\epsffile{horner2.ps}}
\vspace*{-.1in}
\caption{Graph of $p(x)$ in the neighborhood of $x = 2$, evaluated using IEEE double precision arithmetic}
\label{Horner2}
\end{figure}
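To reproduce the behavior shown in Fig. \ref{Horner2}, the following minimal sketch (assuming IEEE double precision) evaluates $p(x)$ near $x = 2$ both by Horner's rule on the expanded coefficients and directly as $(x-2)^9$. Because of cancellation among the large intermediate terms, the Horner values oscillate in sign about zero, while the factored form passes smoothly through the root.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

/* Horner's rule: p(x) = sum_{i=0}^{d} a[i] x^i, coefficients a[0..d] */
double horner(const double a[], int d, double x)
{
    double p = a[d];
    int i;
    for (i = d - 1; i >= 0; i--)
        p = x * p + a[i];
    return p;
}

int main(void)
{
    /* coefficients of (x-2)^9, lowest degree first */
    double a[] = { -512, 2304, -4608, 5376, -4032, 2016, -672, 144, -18, 1 };
    int k;

    for (k = 0; k <= 8; k++) {
        double x = 1.96 + 0.01 * k;
        printf("x = %5.2f   Horner: % .3e   (x-2)^9: % .3e\n",
               x, horner(a, 9, x), pow(x - 2.0, 9.0));
    }
    return 0;
}
\end{verbatim}

\end{document}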