

Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Kyle T. Mandli
from __future__ import print_function

%matplotlib inline
import matplotlib.pyplot as plt
import numpy

Sources of Error

Error can come from many sources when formulating problems and/or applying numerical methods:

  • Model/Data Error

  • Discretization Error

  • Floating Point Error

  • Convergence Error

Objectives

  • Understand the different sources of error

  • Explore some simple approaches to error analysis

  • Quantify errors

    • absolute error

    • relative error

    • precision

  • Long term goals

    • Use error estimates to control accuracy/reliability of solutions

    • Understand errors so you can believe and justify your solutions

Model and Data Error

Errors in fundamental formulation

  • SIR model

    • simplistic averaged model of interactions

$$\dot{I} = \alpha S I - \beta I$$
    • basic model predicts a single peak

  • Data Error - Inaccuracy in measurement or uncertainties in parameters

Unfortunately we cannot control model and data error directly, but we can use methods that may be more robust in the presence of these types of errors.

Discretization or Truncation Error

Errors arising from approximating a function with a simpler function, e.g., using the approximation $\sin(x) \approx x$ when $|x| \approx 0$.

Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers and arithmetic.

Convergence Error

In some instances an algorithm is developed that takes a current approximation and computes an improvement to it. The errors generated in each individual step can accumulate or become magnified after repeating the algorithm a number of times.

Basic Definitions

Before exploring the different kinds of error, it is important to first define the ways that error is measured. Given a true value of a function $f$ and an approximate solution $F$, define:

Absolute Error:

$$e = |f - F|$$

Relative Error:

$$r = \frac{e}{|f|} = \frac{|f - F|}{|f|}$$

Note: these definitions assume $f$ and $F$ are scalar valued. However, these definitions are readily extended to more complicated objects such as vectors or matrices with appropriate norms.

Decimal Precision

This definition of relative error provides a convenient estimate for the number of digits of decimal precision $p$:

given a relative error $r$, the precision $p$ is the largest integer such that

$$r \leq 5 \times 10^{-p}$$

Example

  • if $r = 0.001 < 5\times10^{-3}$, then $p = 3$ significant digits

  • if $r = 0.006 < 5\times10^{-2}$, then $p = 2$ significant digits (because this error would cause rounding up)

Example

let

$$f = e^1, \quad F = 2.71$$
f = numpy.exp(1.0)
F = 2.71
print("f = {}".format(f))
print("F = {}".format(F))
e = numpy.abs(f - F)
r = e / numpy.abs(f)
print("Absolute Error: {}".format(e))
print("Relative Error: {}".format(r))
p = int(-numpy.log10(r / 5.0))
print("Decimal precision: {}".format(p))

Big-O Notation

In many situations an approximation will have a parameter associated with it, and the value of the parameter is often chosen to ensure that the error is reasonable in a given situation. In such circumstances we often want to know the impact on the error if we change the value of the parameter. This leads to the definition of Big-O notation:

$$f(x) = O(g(x)) \quad \text{as} \quad x \rightarrow a$$

if and only if

$$|f(x)| \leq M |g(x)| \quad \text{whenever} \quad |x - a| < \delta \quad \text{where} \quad M, \delta > 0.$$

In practice we use Big-O notation to say something about how the terms we may have left out of a series might behave. We saw an example of this earlier with the Taylor series approximations.

Example:

Consider approximating a differentiable function $f(x)$ by its Taylor polynomial (truncated Taylor series) expanded around $x_0 = 0$, i.e.

$$F(x) = T_N(x_0 + \Delta x) = \sum^N_{n=0} f^{(n)}(x_0) \frac{\Delta x^n}{n!}$$

where

$$f(x) = \lim_{N\rightarrow\infty} T_N$$

assuming the Taylor series converges

or, for the case $f(x) = \sin(x)$ expanded around $x_0 = 0$,

$$T_N(\Delta x) = \sum^N_{n=0} (-1)^{n} \frac{\Delta x^{2n+1}}{(2n+1)!}$$

For $N = 2$, we can then write $F(x)$ as

$$F(\Delta x) = \Delta x - \frac{\Delta x^3}{6} + \frac{\Delta x^5}{120}$$

so our true function is

$$f(x) = F(\Delta x) + O(\Delta x^7)$$

or the absolute error

$$e = |f - F| \sim O(\Delta x^7)$$
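We can check this claim numerically. A minimal sketch (the helper `sin_approx` is ours, not part of the lecture code) compares the truncated series against `numpy.sin`; the first omitted term is $\Delta x^7 / 7!$, so the ratio of the error to $\Delta x^7$ should be close to $1/5040$:

```python
import numpy

def sin_approx(dx):
    # Truncated Taylor series for sin about x_0 = 0, terms through Delta x^5
    return dx - dx**3 / 6.0 + dx**5 / 120.0

dx = 0.1
error = numpy.abs(numpy.sin(dx) - sin_approx(dx))

# Leading omitted term is Delta x^7 / 7! = Delta x^7 / 5040
ratio = error / dx**7
print(ratio, 1.0 / 5040.0)
```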

We can also develop rules for error propagation based on Big-O notation:

In general, there are two rules, stated here without proof, that hold when the value of $x$ is large:

Let

$$\begin{aligned} f(x) &= p(x) + O(x^n) \\ g(x) &= q(x) + O(x^m) \\ k &= \max(n, m) \end{aligned}$$

then

$$f + g = p + q + O(x^k)$$

and

$$\begin{aligned} f \cdot g &= p \cdot q + p \, O(x^m) + q \, O(x^n) + O(x^{n+m}) \\ &= p \cdot q + O(x^{n+m}) \end{aligned}$$

On the other hand, if we are interested in small values of $x$, say $\Delta x$, the above expressions can be modified as follows:

$$\begin{aligned} f(\Delta x) &= p(\Delta x) + O(\Delta x^n) \\ g(\Delta x) &= q(\Delta x) + O(\Delta x^m) \\ r &= \min(n, m) \end{aligned}$$

then

$$f + g = p + q + O(\Delta x^r)$$

and

$$\begin{aligned} f \cdot g &= p \cdot q + p \cdot O(\Delta x^m) + q \cdot O(\Delta x^n) + O(\Delta x^{n+m}) \\ &= p \cdot q + O(\Delta x^r) \end{aligned}$$

Note: In this case we suppose that at least the polynomial associated with $k = \max(n, m)$ has the following form:

$$p(\Delta x) = 1 + p_1 \Delta x + p_2 \Delta x^2 + \ldots$$

or

$$q(\Delta x) = 1 + q_1 \Delta x + q_2 \Delta x^2 + \ldots$$

so that there is an $\mathcal{O}(1)$ term that guarantees the existence of an $\mathcal{O}(\Delta x^r)$ term in the final product.
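A quick numerical check of the product rule (the functions here are our own choice for illustration): approximate $e^{\Delta x}$ and $1/(1 - \Delta x)$ each by $1 + \Delta x$, so both remainders are $O(\Delta x^2)$, and verify that the product's error is also $O(\Delta x^2)$ with the leading coefficient predicted by the series.

```python
import numpy

dx = 1.0e-3

# f = e^dx = 1 + dx + O(dx^2),  g = 1/(1 - dx) = 1 + dx + O(dx^2)
true_product = numpy.exp(dx) / (1.0 - dx)
approx_product = (1.0 + dx) ** 2

# Series: f*g = 1 + 2 dx + (5/2) dx^2 + ..., while (1 + dx)^2 = 1 + 2 dx + dx^2,
# so the error should be ~ (3/2) dx^2, i.e. O(dx^r) with r = min(2, 2) = 2
ratio = (true_product - approx_product) / dx**2
print(ratio)  # approximately 1.5
```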

To get a sense of why we care most about the power on $\Delta x$ when considering convergence, the following figure shows how different powers on the convergence rate affect how quickly we converge to our solution. Note that here we are plotting the same data two different ways. Plotting the error as a function of $\Delta x$ is a common way to show that a numerical method is doing what we expect and exhibits the correct convergence behavior. Since errors can get small quickly, it is very common to plot these sorts of plots on a log-log scale to easily visualize the results. Note that if a method is truly of order $n$, it will appear as a linear function in log-log space with slope $n$.

Behavior of error as a function of Δx\Delta x

dx = numpy.linspace(1.0, 1e-4, 100)

fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2.0)
axes = []
axes.append(fig.add_subplot(1, 2, 1))
axes.append(fig.add_subplot(1, 2, 2))

for n in range(1, 5):
    axes[0].plot(dx, dx**n, label=r"$\Delta x^%s$" % n)
    axes[1].loglog(dx, dx**n, label=r"$\Delta x^%s$" % n)

axes[0].legend(loc=2)
axes[1].set_xticks([10.0 ** (-n) for n in range(5)])
axes[1].set_yticks([10.0 ** (-n) for n in range(16)])
axes[1].legend(loc=4)
for n in range(2):
    axes[n].set_title(r"Growth of Error vs. $\Delta x^n$")
    axes[n].set_xlabel(r"$\Delta x$")
    axes[n].set_ylabel("Estimated Error")

plt.show()

Discretization Error

Taylor’s Theorem: Let $f(x) \in C^{N+1}[a,b]$ and $x_0 \in [a,b]$, then for all $x \in (a,b)$ there exists a number $c = c(x)$ that lies between $x_0$ and $x$ such that

$$f(x) = T_N(x) + R_N(x)$$

where $T_N(x)$ is the Taylor polynomial approximation

$$T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0) \cdot (x-x_0)^n}{n!}$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}$$

Note

The residual:

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}$$

depends on the $(N+1)$st order derivative of $f$ evaluated at an unknown value $c$ that lies between $x_0$ and $x$.

If we knew the value of $c$ we would know the exact value of $R_N(x)$ and therefore the function $f(x)$. In general we do not know this value, but we can use $R_N(x)$ to put upper bounds on the error and to understand how the error changes as we move away from $x_0$.

Start by replacing $x - x_0$ with $\Delta x$. The primary idea here is that the residual $R_N(x)$ becomes smaller as $\Delta x \rightarrow 0$ (at which point $T_N(x) = f(x_0)$).

$$T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0) \cdot \Delta x^n}{n!}$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot \Delta x^{N+1}}{(N+1)!} \leq M \Delta x^{N+1} = O(\Delta x^{N+1})$$

where $M$ is an upper bound on

$$\frac{f^{(N+1)}(c)}{(N+1)!}$$

Example 1

$f(x) = e^x$ with $x_0 = 0$ on the interval $x \in (-1, 1)$

Using this we can find expressions for the relative and absolute error as a function of $x$ assuming $N = 2$.

Derivatives:

$$\begin{aligned} f'(x) &= e^x \\ f''(x) &= e^x \\ f^{(n)}(x) &= e^x \end{aligned}$$

Taylor polynomials:

$$\begin{aligned} T_N(x) &= \sum^N_{n=0} e^0 \frac{x^n}{n!} \Rightarrow \\ T_2(x) &= 1 + x + \frac{x^2}{2} \end{aligned}$$

Remainders:

$$\begin{aligned} R_N(x) &= e^c \frac{x^{N+1}}{(N+1)!} \\ R_2(x) &= e^c \cdot \frac{x^3}{6} \leq \frac{e^1}{6} \approx 0.5 \end{aligned}$$

Accuracy:

$$\begin{aligned} \exp(1) &= 2.718\ldots \\ T_2(1) &= 2.5 \end{aligned}$$

$$\Rightarrow e \approx 0.2, \quad r \approx 0.08, \quad p = ?$$
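Filling in the question mark, using the same precision formula as the earlier code cell (the `numpy` import from the top of the notebook is assumed):

```python
import numpy

f = numpy.exp(1)  # true value
F = 2.5           # T_2(1)
e = numpy.abs(f - F)
r = e / numpy.abs(f)
# largest integer p such that r <= 5 x 10^{-p}
p = int(-numpy.log10(r / 5.0))
print(e, r, p)
```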

We can also use the package sympy which has the ability to calculate Taylor polynomials built-in!

import sympy

sympy.init_printing(pretty_print=True)
x = sympy.symbols("x")
f = sympy.exp(x)
f.series(x0=0, n=3)

Let’s plot this numerically for a section of $x$.

x = numpy.linspace(-1, 1, 100)
f = numpy.exp(x)
T_N = 1.0 + x + x**2 / 2.0
R_N = numpy.exp(1) * x**3 / 6.0
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
axes.plot(x, numpy.abs(numpy.exp(x) - T_N), "g--")
axes.plot(0.0, 1.0, "o", markersize=10)

axes.grid()
axes.set_xlabel("x", fontsize=16)
axes.set_ylabel("$f(x)$, $T_N(x)$, $|R_N(x)|$", fontsize=16)
axes.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "e(x)"], loc=2)
plt.show()

Example 2

Approximate

$$f(x) = \frac{1}{x}, \quad x_0 = 1,$$

using $x_0 = 1$ to the 3rd Taylor series term on the interval $x \in [1, \infty)$

$$f'(x) = -\frac{1}{x^2}, \quad f''(x) = \frac{2}{x^3}, \quad f'''(x) = -\frac{6}{x^4}, \quad \ldots, \quad f^{(n)}(x) = \frac{(-1)^n n!}{x^{n+1}}$$
$$\begin{aligned} T_N(x) &= \sum^N_{n=0} (-1)^n (x-1)^n \Rightarrow \\ T_2(x) &= 1 - (x - 1) + (x - 1)^2 \end{aligned}$$
$$\begin{aligned} R_N(x) &= \frac{(-1)^{N+1}(x - 1)^{N+1}}{c^{N+2}} \Rightarrow \\ R_2(x) &= \frac{-(x - 1)^{3}}{c^{4}} \end{aligned}$$

Plot this problem:

x = numpy.linspace(0.8, 2, 100)
f = 1.0 / x
T_N = 1.0 - (x - 1) + (x - 1) ** 2
R_N = -((x - 1.0) ** 3) / (1.0**4)
plt.figure(figsize=(8, 6))
plt.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
plt.plot(x, numpy.abs(f - T_N), "g--")
plt.plot(1.0, 1.0, "o", markersize=10)
plt.grid(True)
plt.xlabel("x", fontsize=16)
plt.ylabel("$f(x)$, $T_N(x)$, $R_N(x)$", fontsize=16)
plt.title("$f(x) = 1/x$", fontsize=18)
plt.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "$e(x)$"], loc="best")
plt.show()

Computational Issue #1: Accuracy... how many terms?

Given a Taylor polynomial approximation of an arbitrary function $f(x)$, how do we determine how many terms are required such that $|R_N(x)| < \text{tol}$? And how do we determine the tolerance?
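As a sketch of one answer to the first question (the function, tolerance, and helper here are our own choices for illustration): for $f(x) = e^x$ with $|x| \le 1$ we know $|R_N(x)| \le e \cdot |x|^{N+1} / (N+1)!$, so we can simply add terms until this bound drops below the requested tolerance.

```python
import math

def exp_taylor(x, tol=1e-8):
    """Sum the Taylor series of e^x about x_0 = 0 until the remainder
    bound e * |x|**(N+1) / (N+1)! falls below tol (valid for |x| <= 1)."""
    total = 1.0  # T_0 = 1
    term = 1.0
    N = 0
    while math.e * abs(x) ** (N + 1) / math.factorial(N + 1) >= tol:
        N += 1
        term *= x / N     # next Taylor term x**N / N!
        total += term
    return total, N

approx, N = exp_taylor(1.0, tol=1e-8)
print(approx, math.exp(1.0), N)
```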

Computational Issue #2: Efficiency... Operation counts for polynomial evaluation

Given

$$P_N(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_N x^N$$

or

$$P_N(x) = p_0 x^N + p_1 x^{N-1} + p_2 x^{N-2} + \ldots + p_N$$

what is the most efficient way to evaluate $P_N(x)$? (i.e. minimize the number of floating point operations)

Consider two ways to write $P_3$:

  • The standard way:

$$P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$
  • using nested multiplication (aka Horner’s Method):

$$P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Consider how many operations it takes for each...

$$P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$

$$P_3(x) = \overbrace{p_0 \cdot x \cdot x \cdot x}^3 + \overbrace{p_1 \cdot x \cdot x}^2 + \overbrace{p_2 \cdot x}^1 + p_3$$

Note: here we’re just counting multiplications as they will dominate the flop count

Adding up all the operations we can in general think of this as a pyramid (it’s really the triangle numbers)

$$\sum_{n=1}^N n = \frac{N(N+1)}{2}$$

Thus we can estimate that the algorithm written this way will take approximately $O(N^2 / 2)$ operations to complete.

Looking at nested iteration, however:

$$P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Here we find that the method is $O(N)$ compared to the first evaluation, which is $O(N^2)$ (we usually drop the 2 in these cases). That’s a huge difference for large $N$!
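The multiplication counts above can be tallied explicitly. A minimal sketch (the counting helpers are ours, not part of the lecture code):

```python
def naive_mults(N):
    # degree-n term costs n multiplications (p_n * x * ... * x), summed over terms
    return sum(range(1, N + 1))

def horner_mults(N):
    # one multiplication per nesting level in ((p_0 x + p_1) x + ...) x + p_N
    return N

for N in [3, 10, 100]:
    print(N, naive_mults(N), horner_mults(N))
```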

Algorithm

Fill in the function and implement Horner’s method:

def eval_poly(p, x):
    '''Evaluates a polynomial using Horner's method given coefficients p at x
    
      The polynomial is defined as
    
        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
        
    Parameters:
        p: list or numpy array of coefficients 
        x:  scalar float
        
    returns:
        P(x):  value of the polynomial at point x (float)
    '''
    pass
def eval_poly(p, x):
    """Evaluates a polynomial using Horner's method given coefficients p at x

      The polynomial is defined as

        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]

    Parameters:
        p: list or numpy array of coefficients
        x:  scalar float or numpy array (this version is more robust to floating point error)

    returns:
        P(x):  value of the polynomial at point x, P will return as the same type as x

    """

    if isinstance(x, numpy.ndarray):
        y = p[0] * numpy.ones(x.shape)
    elif isinstance(x, float):
        y = p[0]
    else:
        raise TypeError

    for element in p[1:]:
        y = y * x + element

    return y
# Scalar test

p = [1, 2, 3]
x = 1.0
test = eval_poly(p, x)
answer = numpy.array([x**2, x, 1]).dot(p)

print("test = {} ({}), answer = {} ({})".format(test, type(test), answer, type(answer)))
numpy.testing.assert_allclose(test, answer)
print("success")
# Vectorized test with x a numpy array

p = [1, -3, 10, 4, 5, 5]
x = numpy.linspace(-10, 10, 100)
P = eval_poly(p, x)
print("x: {}, P(x): {}".format(type(x), type(P)))
plt.plot(x, P)
plt.xlabel("x")
plt.ylabel("P(x)")
plt.title("{}th order polynomial, p={}".format(len(p) - 1, p))
plt.grid()
plt.show()

Convergence Error

In some circumstances a formula or algorithm is applied repeatedly as a way to obtain a final approximation. Usually, the errors that occur at each individual step are small. By repeating the algorithm, though, the errors can sometimes grow or become magnified.

An example of this phenomenon is given below, where the values of the terms in a difference equation are calculated:

$$\begin{aligned} y_0 &= 1, \\ y_1 &= \frac{1}{5}, \\ y_{n+1} &= \frac{16}{5} y_n - \frac{3}{5} y_{n-1}. \end{aligned}$$

The true solution to the difference equation is $y_n = \left(\frac{1}{5}\right)^n$, where $n = 0, 1, 2, \ldots$

# Choose the number of iterations
N = 40
y = numpy.empty(N + 1)  # Allocate an empty vector with N+1 entries

# Now use the difference equation to generate the numbers in the sequence
y[0] = 1
y[1] = 1 / 5
for n in range(2, N + 1):
    y[n] = 16 / 5 * y[n - 1] - 3 / 5 * y[n - 2]

And plot the result

# Now plot the result
n = numpy.arange(0, N + 1)
fig = plt.figure(figsize=(10.0, 5.0))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(n, y, "rx", markersize=5, label="$y_n$")
axes.semilogy(n, (1 / 5) ** n, "b.", label="$y_{true}$")
axes.grid()
axes.set_title("Calculated Values Of A Difference Equation", fontsize=18)
axes.set_xlabel("$n$", fontsize=16)
axes.set_ylabel("$y_n$", fontsize=16)
axes.legend(loc="best", shadow=True)
plt.show()

Simply looking at the exact solution, the sequence of numbers generated by the difference equation above should get very close to zero. Instead, the numbers in the sequence initially get closer to zero, but at some point they begin to grow and get larger. An underlying problem is that the computer is not able to store the numbers exactly. The second number in the sequence, $y_1 = \frac{1}{5}$, has a small error, and the computer stores it as $y_1 = \frac{1}{5} + \epsilon$ where $\epsilon$ is some small error.

Each time a new number in the loop is generated, the error is multiplied. For example, after the first iteration $y_2$ is

$$\begin{aligned} y_2 &= \frac{16}{5} \left( \frac{1}{5} + \epsilon \right) - \frac{3}{5} \left( 1 \right), \\ &= \frac{1}{5^2} + \frac{16}{5} \epsilon. \end{aligned}$$

After the second time through the loop, the value of $y_3$ is

$$y_3 = \frac{1}{5^3} + \frac{241}{25}\epsilon$$

Even though the value of $\epsilon$ is very close to zero, every iteration makes the error grow. Repeated multiplication will result in a very large number.
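We can confirm that the growth is entirely a floating point effect: repeating the iteration with exact rational arithmetic (Python's `fractions.Fraction`, our choice for illustration) keeps $\epsilon = 0$ and reproduces $\left(\frac{1}{5}\right)^n$ exactly.

```python
from fractions import Fraction

N = 40
y = [Fraction(1), Fraction(1, 5)]
for n in range(2, N + 1):
    # same recurrence as above, but with no representation error
    y.append(Fraction(16, 5) * y[-1] - Fraction(3, 5) * y[-2])

# with exact arithmetic the recurrence tracks the true solution exactly
print(y[N] == Fraction(1, 5) ** N)
```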

The error associated with the initial representation of the number $\frac{1}{5}$ is a problem with the way a digital computer stores floating point numbers. In most instances the computer cannot represent a number exactly, and the small error in approximating a given number can give rise to other problems.

Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers

$$\pi \approx 3.14$$

or $\frac{1}{3} \approx 0.333333333$ in decimal; this results from having only a finite number of digits available to represent each number.

Floating Point Systems

Numbers in floating point systems are represented as a series of bits that represent different pieces of a number. In normalized floating point systems there are some standard conventions for what these bits are used for. In general the numbers are stored by breaking them down into the form

$$F = \pm d_1 . d_2 d_3 d_4 \ldots d_p \times \beta^E$$

where

  1. ± is a single bit representing the sign of the number

  2. $d_1 . d_2 d_3 d_4 \ldots d_p$ is called the mantissa. Note that technically the decimal point could be moved, but by using scientific notation it can always be placed at this location. The digits $d_2 d_3 d_4 \ldots d_p$ are called the fraction, with $p$ digits of precision. Normalized systems specifically put the decimal point in the front like we have and assume $d_1 \neq 0$ unless the number is exactly 0.

  3. $\beta$ is the base. For binary $\beta = 2$, for decimal $\beta = 10$, etc.

  4. $E$ is the exponent, an integer in the range $[E_{\min}, E_{\max}]$
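Python can show this decomposition directly for IEEE doubles ($\beta = 2$). A small sketch: `math.frexp` splits a float into a mantissa and exponent, and `float.hex` displays the stored sign, fraction digits, and exponent.

```python
import math

x = 0.1
m, E = math.frexp(x)    # x = m * 2**E with 0.5 <= |m| < 1
print(m, E)
print(m * 2.0**E == x)  # the decomposition is exact
print(x.hex())          # sign, hex fraction digits, and binary exponent
```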

The important points about any floating point system are that

  1. There exist a discrete and finite set of representable numbers

  2. These representable numbers are not evenly distributed on the real line

  3. Arithmetic in floating point systems yields different results from infinite precision arithmetic (i.e. “real” math)
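Point 3 is easy to demonstrate with IEEE doubles: neither 0.1 nor 0.2 is representable exactly in base 2, so their rounded sum is not the double nearest 0.3.

```python
s = 0.1 + 0.2
print(s)         # 0.30000000000000004
print(s == 0.3)  # False
```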

Properties of Floating Point Systems

All floating-point systems are characterized by several important numbers

  • Smallest normalized number (underflow if below; related to subnormal numbers around zero)

  • Largest normalized number (overflow if above)

  • Zero

  • Machine $\epsilon$ or $\epsilon_{\text{machine}}$

  • inf and nan, infinity and Not a Number respectively

Example: Toy System

Consider the toy 2-digit precision decimal system (normalized)

$$f = \pm d_1 . d_2 \times 10^E$$

with $E \in [-2, 0]$.

Number and distribution of numbers

  1. How many numbers can we represent with this system?

  2. What is the distribution on the real line?

  3. What are the underflow and overflow limits?

  4. What is the smallest number $\epsilon_{\text{mach}}$ such that $1 + \epsilon_{\text{mach}} > 1$?

How many numbers can we represent with this system?

$$f = \pm d_1 . d_2 \times 10^E \quad \text{with} \quad E \in [-2, 0]$$
  • sign bit: 2

  • $d_1$: 9 (normalized numbers, $d_1 \neq 0$)

  • $d_2$: 10

  • $E$: 3

  • zero: 1

total:

$$2 \times 9 \times 10 \times 3 + 1 = 541$$
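We can verify the count by brute force, enumerating the toy system with exact rational arithmetic so that no two distinct representations collapse due to rounding (a sketch using `fractions.Fraction`):

```python
from fractions import Fraction

values = {Fraction(0)}  # zero is the one special value
for E in (-2, -1, 0):
    for d1 in range(1, 10):   # normalized: d_1 != 0
        for d2 in range(10):
            f = Fraction(10 * d1 + d2, 10) * Fraction(10) ** E
            values.add(f)
            values.add(-f)

print(len(values))  # 541
```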

What is the distribution on the real line?

$$f = \pm d_1 . d_2 \times 10^E \quad \text{with} \quad E \in [-2, 0]$$
d_1_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
E_values = [
    0,
    -1,
    -2,
]

fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)

axes.plot(0.0, 0.0, "|", markersize=20)
axes.plot([-1.0, 1.0], [0.0, 0.0], "k|", markersize=30)

axes.plot([-10.0, 10.0], [0.0, 0.0], "k")

axes.set_title("Distribution of Values $[-10, 10]$")
axes.set_yticks([])
ticks = [i for i in range(-10, 11, 1)]
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-10, 10])
plt.show()
fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)

axes.plot(0.0, 0.0, "+", markersize=20)
axes.plot([-0.1, 0.1], [0.0, 0.0], "k|", markersize=30)
axes.plot([-1, 1], [0.0, 0.0], "k")

axes.set_title("Close up $[-1, 1]$")
axes.set_yticks([])
ticks = numpy.linspace(-1.0, 1.0, 21)
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-1, 1])
# fig.tight_layout(h_pad=1, w_pad=5)

plt.show()

What are the underflow and overflow limits?

  • Smallest number that can be represented is the underflow limit: $1.0 \times 10^{-2} = 0.01$

  • Largest number that can be represented is the overflow limit: $9.9 \times 10^0 = 9.9$

What is the smallest number $\epsilon_{\text{mach}}$ such that $1 + \epsilon_{\text{mach}} > 1$?

  • $\epsilon_{\text{mach}} = 0.1$
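Python's `decimal` module can emulate a 2-significant-digit decimal system directly (ignoring the exponent limits), which makes $\epsilon_{\text{mach}} = 0.1$ easy to see. This demonstration is our own, not part of the original notes:

```python
import decimal

with decimal.localcontext() as ctx:
    ctx.prec = 2  # two significant decimal digits, like the toy system
    one = decimal.Decimal(1)
    small = one + decimal.Decimal("0.04")  # 1.04 rounds back to 1.0
    big = one + decimal.Decimal("0.1")     # 1.1 survives: 1.1 > 1

print(small, big)
```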

Binary Systems

Consider the 2-digit precision base 2 system:

$$f = \pm d_1 . d_2 \times 2^E \quad \text{with} \quad E \in [-1, 1]$$

Number and distribution of numbers

  1. How many numbers can we represent with this system?

  2. What is the distribution on the real line?

  3. What are the underflow and overflow limits?

  4. What is $\epsilon_{\text{mach}}$?

How many numbers can we represent with this system?

$$f = \pm d_1 . d_2 \times 2^E \quad \text{with} \quad E \in [-1, 1]$$

$$2 \times 1 \times 2 \times 3 + 1 = 13$$

What is the distribution on the real line?

d_1_values = [1]
d_2_values = [0, 1]
E_values = [1, 0, -1]

fig = plt.figure(figsize=(10.0, 1.0))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)
            axes.plot(-(d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)

axes.plot(0.0, 0.0, "r+", markersize=20)
axes.plot([-4.5, 4.5], [0.0, 0.0], "k")

axes.set_title("Distribution of Values")
axes.set_yticks([])
axes.set_xticks(numpy.linspace(-4, 4, 9))
axes.set_xlabel("x")
axes.set_ylabel("")
axes.grid()
axes.set_xlim([-5, 5])
plt.show()
  • Smallest number that can be represented is the underflow limit: $1.0 \times 2^{-1} = 0.5$

  • Largest number that can be represented is the overflow limit: $1.1 \times 2^1 = 3$

  • $\epsilon_{\text{mach}} = 0.1 = 2^{-1} = 1/2$

Note: these numbers are in a binary system.

Quick rule of thumb:

$$2^3 \quad 2^2 \quad 2^1 \quad 2^0 \;.\; 2^{-1} \quad 2^{-2} \quad 2^{-3}$$

correspond to 8s, 4s, 2s, 1s . halves, quarters, eighths, ...

Real Systems - IEEE 754 Binary Floating Point Systems

Single Precision

  • Total storage allotted is 32 bits

  • Exponent is 8 bits $\Rightarrow E \in [-126, 128]$

  • Fraction is 23 bits ($p = 24$)

s EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                     31
  • Overflow $= 2^{128} \approx 3.40 \times 10^{38}$

  • Underflow $= 2^{-126} \approx 1.17 \times 10^{-38}$

  • $\epsilon_{\text{machine}} = 2^{-23} \approx 1.19 \times 10^{-7}$

Double Precision

  • Total storage allotted is 64 bits

  • Exponent is 11 bits $\Rightarrow E \in [-1022, 1024]$

  • Fraction is 52 bits ($p = 53$)

s EEEEEEEEEE FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FF
0 1       11 12                                                      63
  • Overflow $= 2^{1024} \approx 1.8 \times 10^{308}$

  • Underflow $= 2^{-1022} \approx 2.2 \times 10^{-308}$

  • $\epsilon_{\text{machine}} = 2^{-52} \approx 2.2 \times 10^{-16}$

Python Access to IEEE Numbers

Access many important parameters, such as machine epsilon:

import numpy
numpy.finfo(float).eps
print(numpy.finfo(numpy.float16))
print(numpy.finfo(numpy.float32))
print(numpy.finfo(float))
print(numpy.finfo(numpy.float128))

Examples

eps = numpy.finfo(float).eps
MAX = numpy.finfo(float).max
print("eps = {}".format(eps))
print("MAX = {}".format(MAX))

Show that $(1 + \epsilon_{\text{mach}}) > 1$ while adding anything smaller than $\epsilon_{\text{mach}}$ leaves 1 unchanged:

print(MAX)
print(MAX * (1 + 0.4 * eps))
print(1 + 0.4 * eps == 1.0)

Why should we care about this?

  • Floating point arithmetic is not associative or distributive

  • Floating point errors compound, do not assume even double precision is enough!

  • Mixing precision can be dangerous

Example 1: Simple Arithmetic

Simple arithmetic with $\delta < \epsilon_{\text{machine}}$.

Compare

$$1 + \delta - 1 \quad \text{vs.} \quad 1 - 1 + \delta$$
eps = numpy.finfo(float).eps
delta = 0.5 * eps
x = 1 + delta - 1
y = 1 - 1 + delta
print("1 + delta - 1 = {}".format(x))
print("1 - 1 + delta = {}".format(y))
print(x == y)

Example 2: Catastrophic cancellation

Let us examine what happens when we add two numbers $x$ and $y$ where $x + y \neq 0$. We can actually estimate these bounds by doing some error analysis. Here we need to introduce the idea that each floating point operation introduces an error such that

$$\text{fl}(x ~\text{op}~ y) = (x ~\text{op}~ y) (1 + \delta)$$

where $\text{fl}(\cdot)$ is a function that returns the floating point representation of the expression enclosed, $\text{op}$ is some operation (e.g. $+, -, \times, /$), and $\delta$ is the floating point error due to $\text{op}$.

Back to our problem at hand. The floating point error due to addition is

$$\text{fl}(x + y) = (x + y) (1 + \delta).$$

Comparing this to the true solution using a relative error we have

$$\frac{|(x + y) - \text{fl}(x + y)|}{|x + y|} = \frac{|(x + y) - (x + y)(1 + \delta)|}{|x + y|} = |\delta|.$$

so that if $\delta = \mathcal{O}(\epsilon_{\text{machine}})$ we are not too concerned.
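We can measure $\delta$ for a single addition by comparing against exact rational arithmetic (a sketch using `fractions.Fraction`); for IEEE doubles, round-to-nearest guarantees $|\delta| \le \epsilon_{\text{machine}} / 2 = 2^{-53}$ when the operands are themselves exact floats.

```python
from fractions import Fraction

x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)  # exact sum of the stored doubles
fl = Fraction(x + y)               # floating point sum, converted exactly
delta = abs(fl - exact) / exact    # relative error of the addition
print(float(delta))
```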

What if instead we consider a floating point error on the representations of $x$ and $y$, $x \neq y$, and say $\delta_x$ and $\delta_y$ are the magnitudes of the errors in their representation. Here we will assume this constitutes the floating point error rather than being associated with the operation itself.

Now consider the difference between the two numbers

$$\begin{aligned} \text{fl}(x - y) &= x (1 + \delta_x) - y (1 + \delta_y) \\ &= x - y + x \delta_x - y \delta_y \\ &= (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right) \end{aligned}$$

Again computing the relative error we then have

$$\begin{aligned} \frac{\left|(x - y) - (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right|}{|x - y|} &= \left|1 - \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right| \\ &= \frac{|x \delta_x - y \delta_y|}{|x - y|} \end{aligned}$$

The important distinction here is that now the error is dependent on the values of $x$ and $y$ and, more importantly, on their difference. Of particular concern is the relative size of $x - y$: as it approaches zero relative to the magnitudes of $x$ and $y$, the error can become arbitrarily large. This is known as catastrophic cancellation.

dx = numpy.array([10 ** (-n) for n in range(1, 16)])
x = 1.0 + dx
y = numpy.ones(x.shape)
error = numpy.abs(x - y - dx) / (dx)
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.loglog(dx, x + y, "o-")
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel("$x + y$")
axes.set_title(r"$\Delta x$ vs. $x+y$")

axes = fig.add_subplot(1, 2, 2)
axes.loglog(dx, error, "o-")
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel(r"$|x + y - \Delta x| / \Delta x$")
axes.set_title("Difference between $x$ and $y$ vs. Relative Error")

plt.show()

Example 3: Function Evaluation

Consider the function

$$f(x) = \frac{1 - \cos x}{x^2}$$

with $x \in [-10^{-4}, 10^{-4}]$.

Taking the limit as $x \rightarrow 0$ we can see what behavior we would expect to see from evaluating this function:

$$\lim_{x \rightarrow 0} \frac{1 - \cos x}{x^2} = \lim_{x \rightarrow 0} \frac{\sin x}{2 x} = \lim_{x \rightarrow 0} \frac{\cos x}{2} = \frac{1}{2}.$$

What does floating point representation do?

x = numpy.linspace(-1e-3, 1e-3, 100, dtype=numpy.float32)
f = 0.5
F = (1.0 - numpy.cos(x)) / x**2
rel_err = numpy.abs((f - F)) / f
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_err, "o")
axes.set_xlabel("x")
axes.grid()
axes.set_ylabel("Relative Error")
axes.set_title("$\\frac{1-\\cos{x}}{x^2} - \\frac{1}{2}$", fontsize=18)
plt.show()

Example 4: Evaluation of a Polynomial

$$f(x) = x^7 - 7x^6 + 21 x^5 - 35 x^4 + 35 x^3 - 21 x^2 + 7x - 1$$

Note: $f(1) = 0$ (and $f$ will be close to zero for $x \approx 1$)

Here we compare polynomial evaluation using naive powers compared to Horner’s method as implemented in eval_poly(p,x) defined above.

x = numpy.linspace(0.988, 1.012, 1000, dtype=numpy.float16)
y = (
    x**7
    - 7.0 * x**6
    + 21.0 * x**5
    - 35.0 * x**4
    + 35.0 * x**3
    - 21.0 * x**2
    + 7.0 * x
    - 1.0
)

# repeat using Horner's method from above
p = numpy.array([1, -7, 21, -35, 35, -21, 7, -1])
yh = eval_poly(p, x)
fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()
def eval_polys(p, x):
    """Evaluates a polynomial using Horner's method given coefficients p at x

      The polynomial is defined as

        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]

    Parameters:
        p: list or numpy array of coefficients
        x:  scalar float or numpy array this version is less careful about input type

    Returns:
        P(x): value of the polynomial at x, returned as the same type as x

    """

    y = p[0]
    for element in p[1:]:
        y = y * x + element

    return y
# repeat using different Horner's method from above
yh = eval_polys(p, x)
fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()

Example 5: Rational Function Evaluation

Compute f(x)=x+1f(x) = x + 1 by the function

F(x)=x21x1F(x) = \frac{x^2 - 1}{x - 1}

Do you expect there to be issues?

x = numpy.linspace(0.5, 1.5, 101, dtype=numpy.float32)
f_hat = (x**2 - 1.0) / (x - 1.0)
f = x + 1.0
fig = plt.figure()
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, numpy.abs(f - f_hat) / numpy.abs(f))
axes.set_xlabel("$x$")
axes.set_ylabel("Relative Error")
axes.grid()
plt.show()

Combination of Errors

In general we need to concern ourselves with the combination of both discretization error and floating point error.

Reminder:

  • Discretization error: Errors arising from approximation of a function, truncation of a series...

$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} + O(x^7)$
  • Floating-point Error: Errors arising from approximating real numbers with finite-precision numbers

π3.14\pi \approx 3.14

or $\frac{1}{3} \approx 0.333333333$: just as $\frac{1}{3}$ has no finite decimal expansion, many simple numbers cannot be represented exactly as binary floating point numbers
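Both kinds of error are easy to see directly in Python. Below is a minimal check using only the standard library: the first part measures the truncation error of the sine series, the second inspects the exact rational value actually stored for 1/3.

```python
import math
from fractions import Fraction

# Discretization error: truncating the sine series after the x**5 term
# leaves a remainder of roughly x**7 / 7! for small x.
x = 0.1
series = x - x**3 / math.factorial(3) + x**5 / math.factorial(5)
print(abs(math.sin(x) - series))  # ~2e-11, on the order of x**7 / 5040

# Floating point error: the binary64 number closest to 1/3 is not 1/3.
third = 1.0 / 3.0
print(Fraction(third))            # the exact rational value actually stored
print(third == Fraction(1, 3))    # False
```

Fraction(third) recovers the stored binary value exactly, which is why the comparison against the true rational 1/3 fails.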

Example 1

Consider a finite difference approximation to the first derivative of a function

f(x)f(x+Δx)f(x)Δxf^\prime(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}

Note: in the limit Δx0\Delta x\rightarrow 0, this is the standard definition of the first derivative. However we’re interested in the error for a finite Δx\Delta x.

Moreover, (as we will see in future notebooks), there are many ways to approximate the first derivative. For example we can write the “centered first derivative” as

f(x)f(x+Δx)f(xΔx)2Δxf^\prime(x) \approx \frac{f(x + \Delta x) - f(x - \Delta x)}{2\Delta x}

Here we will simply compare the error of the two different finite-difference formulas given

f(x)=f(x)=exf(x) = f^\prime(x) = e^x

at $x=1$ for decreasing values of $\Delta x$. We will also introduce the idea of an 'inline' or lambda function in Python.

f = lambda x: numpy.exp(x)
f_prime = lambda x: numpy.exp(x)
delta_x = numpy.array([2.0 ** (-n) for n in range(1, 60)])
x = 1.0

# Forward finite difference approximation to first derivative
f_hat_1 = (f(x + delta_x) - f(x)) / (delta_x)
# Centered finite difference approximation to first derivative
f_hat_2 = (f(x + delta_x) - f(x - delta_x)) / (2.0 * delta_x)
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.loglog(delta_x, numpy.abs(f_hat_1 - f_prime(x)), "o-", label="One-Sided")
axes.loglog(delta_x, numpy.abs(f_hat_2 - f_prime(x)), "s-", label="Centered")
axes.legend(loc=3, fontsize=14)
axes.set_xlabel(r"$\Delta x$", fontsize=16)
axes.set_ylabel("Absolute Error", fontsize=16)
axes.set_title("Finite Difference approximations to $df/dx$", fontsize=18)
axes.grid()
plt.show()

Example 2

Evaluate exe^x with its Taylor series.

ex=n=0xnn!e^x = \sum^\infty_{n=0} \frac{x^n}{n!}

Can we pick $N < \infty$ such that the truncated series approximates $e^x$ over a given range $x \in [a,b]$ with relative error $E$ satisfying $E < 8 \cdot \varepsilon_{\text{machine}}$?

We can try simply evaluating the Taylor polynomial directly for various NN

from scipy.special import factorial


def my_exp(x, N=10):
    value = 0.0
    for n in range(N + 1):
        value += x**n / float(factorial(n))

    return value

And test this

eps = numpy.finfo(numpy.float64).eps

x = numpy.linspace(-2, 50.0, 100, dtype=numpy.float64)
MAX_N = 300
for N in range(1, MAX_N + 1):
    rel_error = numpy.abs((numpy.exp(x) - my_exp(x, N=N)) / numpy.exp(x))
    if numpy.all(rel_error < 8.0 * eps):
        break
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_error / eps)
axes.set_xlabel("x")
axes.set_ylabel("Relative Error/eps")
axes.set_title("N = {} terms".format(N))
axes.grid()
plt.show()

Can we do better?

Note:

the largest value of xx such that ex<e^x < MAX is:

print(numpy.log(numpy.finfo(float).max))

and numpy.exp handles that just fine

print(numpy.exp(709, dtype=numpy.float64))
print(numpy.exp(-709, dtype=numpy.float64))
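Pushing just past that limit shows what happens at the edges of the representable range. A small sketch, using numpy.errstate to silence the expected overflow warning:

```python
import numpy

# exp overflows to inf just past log(float max) ~= 709.78,
# while far in the negative direction it quietly underflows to zero.
with numpy.errstate(over="ignore"):
    print(numpy.exp(710.0))   # inf
print(numpy.exp(-746.0))      # 0.0: below even the smallest subnormal
```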

Your homework: the great Exp Challenge

One final example (optional): How to calculate Relative Error

Say we wanted to compute the relative error between two values xx and yy using xx as the normalizing value. Algebraically the forms

E=xyxE = \frac{x - y}{x}

and

E=1yxE = 1 - \frac{y}{x}

are equivalent. In finite precision what form might be expected to be more accurate and why?

Example based on a blog post by Nick Higham

Using the standard model of floating point arithmetic, in which each operation satisfies $\text{fl}(x \text{ op } y) = (x \text{ op } y)(1 + \delta)$ for some small $\delta$, the original definition contains two floating point operations such that

E1=fl(xyx)=fl(fl(xy)/x)=[(xy)(1+δ1)x](1+δ2)=xyx(1+δ1)(1+δ2)\begin{aligned} E_1 = \text{fl}\left(\frac{x - y}{x}\right) &= \text{fl}(\text{fl}(x - y) / x) \\ &= \left[ \frac{(x - y) (1 + \delta_1)}{x} \right ] (1 + \delta_2) \\ &= \frac{x - y}{x} (1 + \delta_1) (1 + \delta_2) \end{aligned}

For the other formulation we have

E2=fl(1yx)=fl(1fl(yx))=(1yx(1+δ1))(1+δ2)\begin{aligned} E_2 = \text{fl}\left( 1 - \frac{y}{x} \right ) &= \text{fl}\left(1 - \text{fl}\left(\frac{y}{x}\right) \right) \\ &= \left(1 - \frac{y}{x} (1 + \delta_1) \right) (1 + \delta_2) \end{aligned}

If we assume that all op\text{op}s have similar error magnitudes then we can simplify things by letting

δϵ.|\delta_\ast| \le \epsilon.

To compare the two formulations we again use the relative error between the true relative error eie_i and our computed versions EiE_i.

Original definition:

$\begin{aligned} \left|\frac{e - E_1}{e}\right| &= \left|\frac{\frac{x - y}{x} - \frac{x - y}{x} (1 + \delta_1) (1 + \delta_2)}{\frac{x - y}{x}}\right| \\ &= \left|1 - (1 + \delta_1) (1 + \delta_2)\right| \le 2 \epsilon + \epsilon^2 \end{aligned}$

Manipulated definition:

$\begin{aligned} \frac{e - E_2}{e} &= \frac{e - \left[1 - \frac{y}{x}(1 + \delta_1) \right] (1 + \delta_2)}{e} \\ &= \frac{e - \left[e - \frac{y}{x} \delta_1 \right] (1 + \delta_2)}{e} \\ &= \frac{e - \left[e + e\delta_2 - \frac{y}{x} \delta_1 - \frac{y}{x} \delta_1 \delta_2 \right] }{e} \\ &= - \delta_2 + \frac{1}{e} \frac{y}{x} \left(\delta_1 + \delta_1 \delta_2 \right) \\ &= - \delta_2 + \frac{1 - e}{e} \left(\delta_1 + \delta_1 \delta_2 \right) \end{aligned}$

so that

$\left|\frac{e - E_2}{e}\right| \le \epsilon + \left|\frac{1 - e}{e}\right| (\epsilon + \epsilon^2)$

We see then that our floating point error will be dependent on the relative magnitude of ee

Comparison of Relative Errors of estimates of Relative Error ;^)

# Based on the code by Nick Higham
# https://gist.github.com/higham/6f2ce1cdde0aae83697bca8577d22a6e
# Compares relative error formulations using single precision and compared to double precision

N = 501  # Note: Use 501 instead of 500 to avoid the zero value
d = numpy.finfo(numpy.float32).eps * 1e4
a = 3.0
x = a * numpy.ones(N, dtype=numpy.float32)
y = [
    x[i]
    + numpy.multiply(
        (i - numpy.divide(N, 2.0, dtype=numpy.float32)), d, dtype=numpy.float32
    )
    for i in range(N)
]

# Compute errors and "true" error
relative_error = numpy.empty((2, N), dtype=numpy.float32)
relative_error[0, :] = numpy.abs(x - y) / x
relative_error[1, :] = numpy.abs(1.0 - y / x)
x64 = numpy.asarray(x, dtype=numpy.float64)
y64 = numpy.asarray(y, dtype=numpy.float64)
exact = numpy.abs((x64 - y64) / x64)

# Compute differences between error calculations
error = numpy.empty((2, N))
for i in range(2):
    error[i, :] = numpy.abs((relative_error[i, :] - exact) / numpy.abs(exact))

fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(y, error[0, :], ".", markersize=10, label="$|x-y|/|x|$")
axes.semilogy(y, error[1, :], ".", markersize=10, label="$|1-y/x|$")

axes.grid(True)
axes.set_xlabel("y")
axes.set_ylabel("Relative Error")
axes.set_xlim((numpy.min(y), numpy.max(y)))
axes.set_ylim((5e-9, numpy.max(error[1, :])))
axes.set_title("Relative Error Comparison: x,y {}".format(y[0].dtype))
axes.legend()
plt.show()

Future issues with fp64 and High-Performance Computing

The Issues

  • In traditional high-performance computing, IEEE fp64 has become the standard precision for accurate, reproducible calculations across a wide range of scientific applications (e.g. climate models, fusion, solid mechanics)

  • Until recently, the needs of HPC drove the development of chips and hardware, so commodity computers and supercomputers benefited from the same technology.

  • However, with the rise of general-purpose GPUs and AI, the landscape is changing rapidly

A brief history of floating point hardware

Dongarra et al., 2024

  • 1980s: Dedicated separate floating point co-processors (e.g. Intel 8087, Motorola 68881)

  • 1989: Introduction of the Intel 486DX CPU with a built-in floating point unit

  • 1999: Introduction of the Nvidia GeForce 256, the first chip marketed as a "Graphics Processing Unit" (GPU): low-precision, fast parallel graphics

  • mid-2000s: Adoption of programmable general-purpose GPUs for floating point acceleration; addition of fp64 support on GPUs

  • 2006: Introduction of the Nvidia CUDA language for programmable GPUs: a shift to GPUs for high-performance computing and ML/AI

  • ~2020+: ML/AI revolution: Deep learning algorithms driven by matrix multiplications that tolerate low precision

Current fp64 floating point performance for CPUs and GPUs

Current floating point formats for CPUs and GPUs

Near-future NVIDIA GPU floating point roadmap (StorageReview.com)

| Specification | H100 | H200 | B100 | B200 | B300 |
| --- | --- | --- | --- | --- | --- |
| Max Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e |
| Memory Bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | 8 TB/s |
| FP4 Tensor Core | n/a | n/a | 14 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP6 Tensor Core | n/a | n/a | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| FP8 Tensor Core | 3958 TFLOPS (~4 PFLOPS) | 3958 TFLOPS (~4 PFLOPS) | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| INT8 Tensor Core | 3958 TOPS | 3958 TOPS | 7 POPS | 9 POPS | 15 POPS* |
| FP16/BF16 Tensor Core | 1979 TFLOPS (~2 PFLOPS) | 1979 TFLOPS (~2 PFLOPS) | 3.5 PFLOPS | 4.5 PFLOPS | 7.5 PFLOPS* |
| TF32 Tensor Core | 989 TFLOPS | 989 TFLOPS | 1.8 PFLOPS | 2.2 PFLOPS | 3.3 PFLOPS* |
| FP32 (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 Tensor Core (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 (Dense) | 34 TFLOPS | 34 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| Max Power Consumption | 700 W | 700 W | 700 W | 1000 W | Information Unknown |

Beyond Blackwell

Interesting times indeed

...the landscape of high-performance computation is increasingly complex...but there are important classes of problems that still need high-precision floating point. Some Options:

  • fp64 Emulation leveraging low-precision hardware

  • clever mixed precision algorithms
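As a taste of the first option, extended precision can be emulated in software by carrying the rounding error of each operation along explicitly. Below is a minimal sketch of Knuth's two-sum algorithm, the basic building block of "double-double" arithmetic; the function name two_sum is our own choice for illustration.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    a_round = s - b          # the part of s attributable to a
    b_round = s - a_round    # the part of s attributable to b
    err = (a - a_round) + (b - b_round)
    return s, err

# The tiny addend is lost in the rounded sum s,
# but is captured exactly in the error term err.
s, err = two_sum(1.0, 1e-20)
print(s, err)  # 1.0 1e-20
```

Chaining such (s, err) pairs through every operation is how fp64-or-better results can be built out of lower-precision hardware arithmetic.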

Operation Counting

Discretization Error: Why not use more terms in the Taylor series?

Floating Point Error: Why not use the highest precision possible?

Example 1: Matrix-Vector Multiplication

Let A,BRN×NA, B \in \mathbb{R}^{N \times N} and xRNx \in \mathbb{R}^N.

  1. Count the approximate number of operations it will take to compute AxA x.

  2. Do the same for ABA B.

Matrix-vector product: Defining $[A]_i$ as the $i$th row of $A$ and $A_{ij}$ as the $(i, j)$th entry, each entry of the product is

$(A x)_i = [A]_i \cdot x = \sum^N_{j=1} A_{ij} x_j$

Take an explicit case, say N=3N = 3, then the operation count is

$A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \end{bmatrix} = \begin{bmatrix} A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 \\ A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 \\ A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3 \end{bmatrix}$

This leads to 15 operations (6 additions and 9 multiplications).

Take another case, say N=4N = 4, then the operation count is

$A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \\ [A]_4 \cdot x \end{bmatrix} = \begin{bmatrix} A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 + A_{14} \times x_4 \\ A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 + A_{24} \times x_4 \\ A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3 + A_{34} \times x_4 \\ A_{41} \times x_1 + A_{42} \times x_2 + A_{43} \times x_3 + A_{44} \times x_4 \end{bmatrix}$

This leads to 28 operations (12 additions and 16 multiplications).

Generalizing this there are N2N^2 multiplications and N(N1)N (N -1) additions for a total of

operations=N(N1)+N2=O(N2).\text{operations} = N (N - 1) + N^2 = \mathcal{O}(N^2).
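We can check this count empirically with a naive implementation that tallies each floating point operation; matvec_opcount is an illustrative helper, not part of the text above.

```python
import numpy


def matvec_opcount(A, x):
    """Naive matrix-vector product that counts multiplications and additions."""
    N = len(x)
    y = numpy.zeros(N)
    mults, adds = 0, 0
    for i in range(N):
        # first term of the row's inner product: one multiplication
        y[i] = A[i, 0] * x[0]
        mults += 1
        # remaining terms: one multiplication and one addition each
        for j in range(1, N):
            y[i] += A[i, j] * x[j]
            mults += 1
            adds += 1
    return y, mults, adds


N = 4
A = numpy.arange(1.0, N * N + 1).reshape(N, N)
x = numpy.ones(N)
y, mults, adds = matvec_opcount(A, x)
print(mults, adds)  # 16 12: N**2 multiplications and N*(N-1) additions
```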

Matrix-Matrix product ($AB$): Defining $[B]_j$ as the $j$th column of $B$, each entry of the product is the inner product of a row of $A$ with a column of $B$:

$(A B)_{ij} = [A]_i \cdot [B]_j = \sum^N_{k=1} A_{ik} B_{kj}$

The inner product of two vectors is represented by

ab=i=1Naibia \cdot b = \sum^N_{i=1} a_i b_i

leading to $N$ multiplications and $N - 1$ additions, i.e. $\mathcal{O}(N)$ operations per entry. Since there are $N^2$ entries in the resulting matrix we have $\mathcal{O}(N^3)$ operations in total.
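The same style of operation count for the full product confirms the cubic scaling; again, matmul_opcount is an illustrative sketch rather than anything defined earlier.

```python
import numpy


def matmul_opcount(A, B):
    """Naive matrix-matrix product with an explicit operation counter."""
    N = A.shape[0]
    C = numpy.zeros((N, N))
    ops = 0
    for i in range(N):
        for j in range(N):
            # inner product of row i of A with column j of B
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]
                ops += 2  # one multiplication and one (accumulate) addition
    return C, ops


N = 5
A = numpy.eye(N)
B = numpy.arange(1.0, N * N + 1).reshape(N, N)
C, ops = matmul_opcount(A, B)
print(ops)  # 250 = 2 * N**3
```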

There are methods for performing matrix-matrix multiplication faster than $\mathcal{O}(N^3)$. The following figure shows a collection of algorithms over time that have successively lowered the exponent $\omega$ in the operation bound

$\mathcal{O}(N^\omega)$

[Figure: matrix multiplication operation bound over time]

References
  1. Overton, M. L. (2001). Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial and Applied Mathematics. 10.1137/1.9780898718072