

Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Kyle T. Mandli
from __future__ import print_function

%matplotlib inline
import matplotlib.pyplot as plt
import numpy

Sources of Error

Error can come from many sources when formulating problems and/or applying numerical methods:

  • Model/Data Error

  • Discretization Error

  • Floating Point Error

  • Convergence Error

Objectives

  • Understand the different sources of error

  • Explore some simple approaches to error analysis

  • Quantify errors

    • absolute error

    • relative error

    • precision

  • Long term goals

    • Use error estimates to control accuracy/reliability of solutions

    • Understand errors so you can believe and justify your solutions

Model and Data Error

Errors in fundamental formulation

  • SIR model

    • simplistic averaged model of interactions

$$\dot{I} = \alpha S I - \beta I$$
    • basic model predicts a single peak

  • Data Error - Inaccuracy in measurement or uncertainties in parameters

Unfortunately we cannot control model and data error directly, but we can use methods that may be more robust in the presence of these types of errors.

Discretization or Truncation Error

Errors arising from approximating a function with a simpler function, e.g., using the approximation $\sin(x) \approx x$ when $|x| \approx 0$.

Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers and arithmetic.

Convergence Error

In some instances an algorithm is developed that takes a current approximation and computes an improvement to it. The errors generated in each individual step can accumulate or become magnified after repeating the algorithm a number of times.

Basic Definitions

Before exploring the different kinds of error, it is important to first define the ways that error is measured. Given a true value of a function $f$ and an approximate solution $F$, define:

Absolute Error:

$$e = |f - F|$$

Relative Error:

$$r = \frac{e}{|f|} = \frac{|f - F|}{|f|}$$

Note: these definitions assume $f$ and $F$ are scalar valued. However, these definitions are readily extended to more complicated objects such as vectors or matrices with appropriate norms.

Decimal Precision

This definition of relative error provides a convenient estimate for the number of digits of decimal precision $p$:

given a relative error $r$, the precision $p$ is the largest integer such that

$$r \leq 5 \times 10^{-p}$$

Example

  • if $r = 0.001 < 5\times10^{-3}$, then $p = 3$ significant digits

  • if $r = 0.006 < 5\times10^{-2}$, then $p = 2$ significant digits (because this error would cause rounding up)

Example

let

$$f = e^1, \quad F = 2.71$$
f = numpy.exp(1.0)
F = 2.71
print("f = {}".format(f))
print("F = {}".format(F))
e = numpy.abs(f - F)
r = e / numpy.abs(f)
print("Absolute Error: {}".format(e))
print("Relative Error: {}".format(r))
p = int(-numpy.log10(r / 5.0))
print("Decimal precision: {}".format(p))

Big-O Notation

In many situations an approximation will have a parameter associated with it, and the value of the parameter is often chosen to ensure that the error is reasonable in a given situation. In such circumstances we often want to know the impact on the error if we change the value of the parameter. This leads to the definition of Big-O notation:

$$f(x) = O(g(x)) \quad \text{as} \quad x \rightarrow a$$

if and only if

$$|f(x)| \leq M |g(x)| \quad \text{whenever} \quad |x - a| < \delta \quad \text{where} \quad M, \delta > 0.$$

In practice we use Big-O notation to say something about how the terms we may have left out of a series might behave. We saw an example of this earlier with the Taylor series approximations.

Example:

Consider approximating a differentiable function $f(x)$ by its Taylor polynomial (truncated Taylor series) expanded around $x_0 = 0$, i.e.

$$F(x) = T_N(x_0 + \Delta x) = \sum^N_{n=0} f^{(n)}(x_0) \frac{\Delta x^n}{n!}$$

where

$$f(x) = \lim_{N\rightarrow\infty} T_N$$

assuming the Taylor series converges

or, for the case $f(x) = \sin(x)$ expanded around $x_0 = 0$,

$$T_N(\Delta x) = \sum^N_{n=0} (-1)^{n} \frac{\Delta x^{2n+1}}{(2n+1)!}$$

For $N = 2$, we can then write $F(x)$ as

$$F(\Delta x) = \Delta x - \frac{\Delta x^3}{6} + \frac{\Delta x^5}{120}$$

so our true function is

$$f(x) = F(\Delta x) + O(\Delta x^7)$$

or the absolute error

$$e = |f - F| \sim O(\Delta x^7)$$
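We can check this claim numerically. A minimal sketch (the helper `sin_approx` is ours, not part of the lecture code) compares the truncated series against `numpy.sin`; the first omitted term is $\Delta x^7 / 7!$, so the ratio of the error to $\Delta x^7$ should be close to $1/5040$:

```python
import numpy

def sin_approx(dx):
    # Truncated Taylor series for sin about x_0 = 0, terms through Delta x^5
    return dx - dx**3 / 6.0 + dx**5 / 120.0

dx = 0.1
error = numpy.abs(numpy.sin(dx) - sin_approx(dx))

# Leading omitted term is Delta x^7 / 7! = Delta x^7 / 5040
ratio = error / dx**7
print(ratio, 1.0 / 5040.0)
```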

We can also develop rules for error propagation based on Big-O notation:

In general, there are two rules, stated here without proof, that hold when the value of $x$ is large:

Let

$$\begin{aligned} f(x) &= p(x) + O(x^n) \\ g(x) &= q(x) + O(x^m) \\ k &= \max(n, m) \end{aligned}$$

then

$$f + g = p + q + O(x^k)$$

and

$$\begin{aligned} f \cdot g &= p \cdot q + p \, O(x^m) + q \, O(x^n) + O(x^{n+m}) \\ &= p \cdot q + O(x^{n+m}) \end{aligned}$$

On the other hand, if we are interested in small values of $x$, say $\Delta x$, the above expressions can be modified as follows:

$$\begin{aligned} f(\Delta x) &= p(\Delta x) + O(\Delta x^n) \\ g(\Delta x) &= q(\Delta x) + O(\Delta x^m) \\ r &= \min(n, m) \end{aligned}$$

then

$$f + g = p + q + O(\Delta x^r)$$

and

$$\begin{aligned} f \cdot g &= p \cdot q + p \cdot O(\Delta x^m) + q \cdot O(\Delta x^n) + O(\Delta x^{n+m}) \\ &= p \cdot q + O(\Delta x^r) \end{aligned}$$

Note: In this case we suppose that at least the polynomial associated with $k = \max(n, m)$ has the following form:

$$p(\Delta x) = 1 + p_1 \Delta x + p_2 \Delta x^2 + \ldots$$

or

$$q(\Delta x) = 1 + q_1 \Delta x + q_2 \Delta x^2 + \ldots$$

so that there is an $\mathcal{O}(1)$ term that guarantees the existence of an $\mathcal{O}(\Delta x^r)$ term in the final product.
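A quick numerical check of the product rule (the functions here are our own choice for illustration): approximate $e^{\Delta x}$ and $1/(1 - \Delta x)$ each by $1 + \Delta x$, so both remainders are $O(\Delta x^2)$, and verify that the product's error is also $O(\Delta x^2)$ with the leading coefficient predicted by the series.

```python
import numpy

dx = 1.0e-3

# f = e^dx = 1 + dx + O(dx^2),  g = 1/(1 - dx) = 1 + dx + O(dx^2)
true_product = numpy.exp(dx) / (1.0 - dx)
approx_product = (1.0 + dx) ** 2

# Series: f*g = 1 + 2 dx + (5/2) dx^2 + ..., while (1 + dx)^2 = 1 + 2 dx + dx^2,
# so the error should be ~ (3/2) dx^2, i.e. O(dx^r) with r = min(2, 2) = 2
ratio = (true_product - approx_product) / dx**2
print(ratio)  # approximately 1.5
```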

To get a sense of why we care most about the power on $\Delta x$ when considering convergence, the following figure shows how different powers on the convergence rate affect how quickly we converge to our solution. Note that here we are plotting the same data two different ways. Plotting the error as a function of $\Delta x$ is a common way to show that a numerical method is doing what we expect and exhibits the correct convergence behavior. Since errors can get small quickly, it is very common to plot these sorts of plots on a log-log scale to easily visualize the results. Note that if a method is truly of order $n$, it will appear as a linear function in log-log space with slope $n$.

Behavior of error as a function of Δx\Delta x

dx = numpy.linspace(1.0, 1e-4, 100)

fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2.0)
axes = []
axes.append(fig.add_subplot(1, 2, 1))
axes.append(fig.add_subplot(1, 2, 2))

for n in range(1, 5):
    axes[0].plot(dx, dx**n, label=r"$\Delta x^%s$" % n)
    axes[1].loglog(dx, dx**n, label=r"$\Delta x^%s$" % n)

axes[0].legend(loc=2)
axes[1].set_xticks([10.0 ** (-n) for n in range(5)])
axes[1].set_yticks([10.0 ** (-n) for n in range(16)])
axes[1].legend(loc=4)
for n in range(2):
    axes[n].set_title(r"Growth of Error vs. $\Delta x^n$")
    axes[n].set_xlabel(r"$\Delta x$")
    axes[n].set_ylabel("Estimated Error")

plt.show()

Discretization Error

Taylor’s Theorem: Let $f(x) \in C^{N+1}[a,b]$ and $x_0 \in [a,b]$, then for all $x \in (a,b)$ there exists a number $c = c(x)$ that lies between $x_0$ and $x$ such that

$$f(x) = T_N(x) + R_N(x)$$

where $T_N(x)$ is the Taylor polynomial approximation

$$T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0) \cdot (x-x_0)^n}{n!}$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}$$

Note

The residual:

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot (x - x_0)^{N+1}}{(N+1)!}$$

depends on the $(N+1)$st order derivative of $f$ evaluated at an unknown value $c$ that lies between $x_0$ and $x$.

If we knew the value of $c$ we would know the exact value of $R_N(x)$ and therefore the function $f(x)$. In general we do not know this value, but we can use $R_N(x)$ to put upper bounds on the error and to understand how the error changes as we move away from $x_0$.

Start by replacing $x - x_0$ with $\Delta x$. The primary idea here is that the residual $R_N(x)$ becomes smaller as $\Delta x \rightarrow 0$ (at which point $T_N(x) = f(x_0)$).

$$T_N(x) = \sum^N_{n=0} \frac{f^{(n)}(x_0) \cdot \Delta x^n}{n!}$$

and $R_N(x)$ is the residual (the part of the series we left off)

$$R_N(x) = \frac{f^{(N+1)}(c) \cdot \Delta x^{N+1}}{(N+1)!} \leq M \Delta x^{N+1} = O(\Delta x^{N+1})$$

where $M$ is an upper bound on

$$\frac{f^{(N+1)}(c)}{(N+1)!}$$

Example 1

$f(x) = e^x$ with $x_0 = 0$ on the interval $x \in (-1, 1)$

Using this we can find expressions for the relative and absolute error as a function of $x$ assuming $N = 2$.

Derivatives:

$$\begin{aligned} f'(x) &= e^x \\ f''(x) &= e^x \\ f^{(n)}(x) &= e^x \end{aligned}$$

Taylor polynomials:

$$\begin{aligned} T_N(x) &= \sum^N_{n=0} e^0 \frac{x^n}{n!} \Rightarrow \\ T_2(x) &= 1 + x + \frac{x^2}{2} \end{aligned}$$

Remainders:

$$\begin{aligned} R_N(x) &= e^c \frac{x^{N+1}}{(N+1)!} \\ R_2(x) &= e^c \cdot \frac{x^3}{6} \leq \frac{e^1}{6} \approx 0.5 \end{aligned}$$

Accuracy:

$$\begin{aligned} \exp(1) &= 2.718\ldots \\ T_2(1) &= 2.5 \end{aligned}$$

$$\Rightarrow e \approx 0.2, \quad r \approx 0.08, \quad p = ?$$
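Filling in the question mark, using the same precision formula as the earlier code cell (the `numpy` import from the top of the notebook is assumed):

```python
import numpy

f = numpy.exp(1)  # true value
F = 2.5           # T_2(1)
e = numpy.abs(f - F)
r = e / numpy.abs(f)
# largest integer p such that r <= 5 x 10^{-p}
p = int(-numpy.log10(r / 5.0))
print(e, r, p)
```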

We can also use the package sympy which has the ability to calculate Taylor polynomials built-in!

import sympy

sympy.init_printing(pretty_print=True)
x = sympy.symbols("x")
f = sympy.exp(x)
f.series(x0=0, n=3)

Let’s plot this numerically for a section of $x$.

x = numpy.linspace(-1, 1, 100)
f = numpy.exp(x)
T_N = 1.0 + x + x**2 / 2.0
R_N = numpy.exp(1) * x**3 / 6.0
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
axes.plot(x, numpy.abs(numpy.exp(x) - T_N), "g--")
axes.plot(0.0, 1.0, "o", markersize=10)

axes.grid()
axes.set_xlabel("x", fontsize=16)
axes.set_ylabel("$f(x)$, $T_N(x)$, $|R_N(x)|$", fontsize=16)
axes.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "e(x)"], loc=2)
plt.show()

Example 2

Approximate

$$f(x) = \frac{1}{x}, \quad x_0 = 1,$$

using $x_0 = 1$ to the 3rd Taylor series term on the interval $x \in [1, \infty)$

$$f'(x) = -\frac{1}{x^2}, \quad f''(x) = \frac{2}{x^3}, \quad f'''(x) = -\frac{6}{x^4}, \quad \ldots, \quad f^{(n)}(x) = \frac{(-1)^n n!}{x^{n+1}}$$
$$\begin{aligned} T_N(x) &= \sum^N_{n=0} (-1)^n (x-1)^n \Rightarrow \\ T_2(x) &= 1 - (x - 1) + (x - 1)^2 \end{aligned}$$
$$\begin{aligned} R_N(x) &= \frac{(-1)^{N+1}(x - 1)^{N+1}}{c^{N+2}} \Rightarrow \\ R_2(x) &= \frac{-(x - 1)^{3}}{c^{4}} \end{aligned}$$

Plot this problem:

x = numpy.linspace(0.8, 2, 100)
f = 1.0 / x
T_N = 1.0 - (x - 1) + (x - 1) ** 2
R_N = -((x - 1.0) ** 3) / (1.0**4)
plt.figure(figsize=(8, 6))
plt.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
plt.plot(x, numpy.abs(f - T_N), "g--")
plt.plot(1.0, 1.0, "o", markersize=10)
plt.grid(True)
plt.xlabel("x", fontsize=16)
plt.ylabel("$f(x)$, $T_N(x)$, $R_N(x)$", fontsize=16)
plt.title("$f(x) = 1/x$", fontsize=18)
plt.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "$e(x)$"], loc="best")
plt.show()

Computational Issue #1: Accuracy... how many terms?

Given a Taylor polynomial approximation of an arbitrary function $f(x)$, how do we determine how many terms are required such that $|R_N(x)| < \text{tol}$? And how do we determine the tolerance?
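As a sketch of one answer to the first question (the function, tolerance, and helper here are our own choices for illustration): for $f(x) = e^x$ with $|x| \le 1$ we know $|R_N(x)| \le e \cdot |x|^{N+1} / (N+1)!$, so we can simply add terms until this bound drops below the requested tolerance.

```python
import math

def exp_taylor(x, tol=1e-8):
    """Sum the Taylor series of e^x about x_0 = 0 until the remainder
    bound e * |x|**(N+1) / (N+1)! falls below tol (valid for |x| <= 1)."""
    total = 1.0  # T_0 = 1
    term = 1.0
    N = 0
    while math.e * abs(x) ** (N + 1) / math.factorial(N + 1) >= tol:
        N += 1
        term *= x / N     # next Taylor term x**N / N!
        total += term
    return total, N

approx, N = exp_taylor(1.0, tol=1e-8)
print(approx, math.exp(1.0), N)
```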

Computational Issue #2: Efficiency... Operation counts for polynomial evaluation

Given

$$P_N(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_N x^N$$

or

$$P_N(x) = p_0 x^N + p_1 x^{N-1} + p_2 x^{N-2} + \ldots + p_N$$

what is the most efficient way to evaluate $P_N(x)$? (i.e. minimize the number of floating point operations)

Consider two ways to write $P_3$:

  • The standard way:

$$P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$
  • using nested multiplication (aka Horner’s Method):

$$P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Consider how many operations it takes for each...

$$P_3(x) = p_0 x^3 + p_1 x^2 + p_2 x + p_3$$

$$P_3(x) = \overbrace{p_0 \cdot x \cdot x \cdot x}^3 + \overbrace{p_1 \cdot x \cdot x}^2 + \overbrace{p_2 \cdot x}^1 + p_3$$

Note: here we’re just counting multiplications as they will dominate the flop count

Adding up all the operations we can in general think of this as a pyramid (it’s really the triangle numbers)

$$\sum_{n=1}^N n = \frac{N(N+1)}{2}$$

Thus we can estimate that the algorithm written this way will take approximately $O(N^2 / 2)$ operations to complete.

Looking at nested iteration, however:

$$P_3(x) = ((p_0 x + p_1) x + p_2) x + p_3$$

Here we find that the method is $O(N)$ compared to the first evaluation, which is $O(N^2)$ (we usually drop the 2 in these cases). That’s a huge difference for large $N$!
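The multiplication counts above can be tallied explicitly. A minimal sketch (the counting helpers are ours, not part of the lecture code):

```python
def naive_mults(N):
    # degree-n term costs n multiplications (p_n * x * ... * x), summed over terms
    return sum(range(1, N + 1))

def horner_mults(N):
    # one multiplication per nesting level in ((p_0 x + p_1) x + ...) x + p_N
    return N

for N in [3, 10, 100]:
    print(N, naive_mults(N), horner_mults(N))
```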

Algorithm

Fill in the function and implement Horner’s method:

def eval_poly(p, x):
    '''Evaluates a polynomial using Horner's method given coefficients p at x
    
      The polynomial is defined as
    
        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
        
    Parameters:
        p: list or numpy array of coefficients 
        x:  scalar float
        
    returns:
        P(x):  value of the polynomial at point x (float)
    '''
    pass
def eval_poly(p, x):
    """Evaluates a polynomial using Horner's method given coefficients p at x

      The polynomial is defined as

        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]

    Parameters:
        p: list or numpy array of coefficients
        x:  scalar float or numpy array (this version is more robust to floating point error)

    returns:
        P(x):  value of the polynomial at point x, P will return as the same type as x

    """

    if isinstance(x, numpy.ndarray):
        y = p[0] * numpy.ones(x.shape)
    elif isinstance(x, float):
        y = p[0]
    else:
        raise TypeError

    for element in p[1:]:
        y = y * x + element

    return y
# Scalar test

p = [1, 2, 3]
x = 1.0
test = eval_poly(p, x)
answer = numpy.array([x**2, x, 1]).dot(p)

print("test = {} ({}), answer = {} ({})".format(test, type(test), answer, type(answer)))
numpy.testing.assert_allclose(test, answer)
print("success")
# Vectorized test with x a numpy array

p = [1, -3, 10, 4, 5, 5]
x = numpy.linspace(-10, 10, 100)
P = eval_poly(p, x)
print("x: {}, P(x): {}".format(type(x), type(P)))
plt.plot(x, P)
plt.xlabel("x")
plt.ylabel("P(x)")
plt.title("{}th order polynomial, p={}".format(len(p) - 1, p))
plt.grid()
plt.show()

Convergence Error

In some circumstances a formula or algorithm is applied repeatedly as a way to obtain a final approximation. Usually, the errors that occur at each individual step are small. By repeating the algorithm, though, the errors can sometimes grow or become magnified.

An example of this phenomenon is given below, where the values of the terms in a difference equation are calculated:

$$\begin{aligned} y_0 &= 1, \\ y_1 &= \frac{1}{5}, \\ y_{n+1} &= \frac{16}{5} y_n - \frac{3}{5} y_{n-1}. \end{aligned}$$

The true solution to the difference equation is $y_n = \left(\frac{1}{5}\right)^n$, where $n = 0, 1, 2, \ldots$

# Choose the number of iterations
N = 40
y = numpy.empty(N + 1)  # Allocate an empty vector with N+1 entries

# Now use the difference equation to generate the numbers in the sequence
y[0] = 1
y[1] = 1 / 5
for n in range(2, N + 1):
    y[n] = 16 / 5 * y[n - 1] - 3 / 5 * y[n - 2]

And plot the result

# Now plot the result
n = numpy.arange(0, N + 1)
fig = plt.figure(figsize=(10.0, 5.0))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(n, y, "rx", markersize=5, label="$y_n$")
axes.semilogy(n, (1 / 5) ** n, "b.", label="$y_{true}$")
axes.grid()
axes.set_title("Calculated Values Of A Difference Equation", fontsize=18)
axes.set_xlabel("$n$", fontsize=16)
axes.set_ylabel("$y_n$", fontsize=16)
axes.legend(loc="best", shadow=True)
plt.show()

Simply looking at the exact solution, the sequence of numbers generated by the difference equation above should get very close to zero. Instead, the numbers in the sequence initially get closer to zero, but at some point they begin to grow and get larger. An underlying problem is that the computer is not able to store the numbers exactly. The second number in the sequence, $y_1 = \frac{1}{5}$, has a small error, and the computer stores it as $y_1 = \frac{1}{5} + \epsilon$ where $\epsilon$ is some small error.

Each time a new number in the loop is generated, the error is multiplied. For example, after the first iteration $y_2$ is

$$\begin{aligned} y_2 &= \frac{16}{5} \left( \frac{1}{5} + \epsilon \right) - \frac{3}{5} \left( 1 \right), \\ &= \frac{1}{5^2} + \frac{16}{5} \epsilon. \end{aligned}$$

After the second time through the loop, the value of $y_3$ is

$$y_3 = \frac{1}{5^3} + \frac{241}{25}\epsilon$$

Even though the value of $\epsilon$ is very close to zero, every iteration makes the error grow. Repeated multiplication will result in a very large number.
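We can confirm that the growth is entirely a floating point effect: repeating the iteration with exact rational arithmetic (Python's `fractions.Fraction`, our choice for illustration) keeps $\epsilon = 0$ and reproduces $\left(\frac{1}{5}\right)^n$ exactly.

```python
from fractions import Fraction

N = 40
y = [Fraction(1), Fraction(1, 5)]
for n in range(2, N + 1):
    # same recurrence as above, but with no representation error
    y.append(Fraction(16, 5) * y[-1] - Fraction(3, 5) * y[-2])

# with exact arithmetic the recurrence tracks the true solution exactly
print(y[N] == Fraction(1, 5) ** N)
```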

The error associated with the initial representation of the number $\frac{1}{5}$ is a problem with the way a digital computer stores floating point numbers. In most instances the computer cannot represent a number exactly, and the small error in approximating a given number can give rise to other problems.

Floating Point Error

Errors arising from approximating real numbers with finite-precision numbers

$$\pi \approx 3.14$$

or $\frac{1}{3} \approx 0.333333333$ in decimal; this results from having only a finite number of digits available to represent each number.

Floating Point Systems

Numbers in floating point systems are represented as a series of bits that represent different pieces of a number. In normalized floating point systems there are some standard conventions for what these bits are used for. In general the numbers are stored by breaking them down into the form

$$F = \pm d_1 . d_2 d_3 d_4 \ldots d_p \times \beta^E$$

where

  1. ± is a single bit representing the sign of the number

  2. $d_1 . d_2 d_3 d_4 \ldots d_p$ is called the mantissa. Note that technically the decimal point could be moved, but by using scientific notation it can always be placed at this location. The digits $d_2 d_3 d_4 \ldots d_p$ are called the fraction, with $p$ digits of precision. Normalized systems specifically put the decimal point in the front like we have and assume $d_1 \neq 0$ unless the number is exactly 0.

  3. $\beta$ is the base. For binary $\beta = 2$, for decimal $\beta = 10$, etc.

  4. $E$ is the exponent, an integer in the range $[E_{\min}, E_{\max}]$
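Python can show this decomposition directly for IEEE doubles ($\beta = 2$). A small sketch: `math.frexp` splits a float into a mantissa and exponent, and `float.hex` displays the stored sign, fraction digits, and exponent.

```python
import math

x = 0.1
m, E = math.frexp(x)    # x = m * 2**E with 0.5 <= |m| < 1
print(m, E)
print(m * 2.0**E == x)  # the decomposition is exact
print(x.hex())          # sign, hex fraction digits, and binary exponent
```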

The important points about any floating point system are that

  1. There exist a discrete and finite set of representable numbers

  2. These representable numbers are not evenly distributed on the real line

  3. Arithmetic in floating point systems yields different results from infinite precision arithmetic (i.e. “real” math)
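Point 3 is easy to demonstrate with IEEE doubles: neither 0.1 nor 0.2 is representable exactly in base 2, so their rounded sum is not the double nearest 0.3.

```python
s = 0.1 + 0.2
print(s)         # 0.30000000000000004
print(s == 0.3)  # False
```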

Properties of Floating Point Systems

All floating-point systems are characterized by several important numbers

  • Smallest normalized number (underflow if below; related to subnormal numbers around zero)

  • Largest normalized number (overflow if above)

  • Zero

  • Machine $\epsilon$ or $\epsilon_{\text{machine}}$

  • inf and nan, infinity and Not a Number respectively

Example: Toy System

Consider the toy 2-digit precision decimal system (normalized)

$$f = \pm d_1 . d_2 \times 10^E$$

with $E \in [-2, 0]$.

Number and distribution of numbers

  1. How many numbers can we represent with this system?

  2. What is the distribution on the real line?

  3. What are the underflow and overflow limits?

  4. What is the smallest number $\epsilon_{\text{mach}}$ such that $1 + \epsilon_{\text{mach}} > 1$?

How many numbers can we represent with this system?

$$f = \pm d_1 . d_2 \times 10^E \quad \text{with} \quad E \in [-2, 0]$$
  • sign bit: 2

  • $d_1$: 9 (normalized numbers, $d_1 \neq 0$)

  • $d_2$: 10

  • $E$: 3

  • zero: 1

total:

$$2 \times 9 \times 10 \times 3 + 1 = 541$$
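We can verify the count by brute force, enumerating the toy system with exact rational arithmetic so that no two distinct representations collapse due to rounding (a sketch using `fractions.Fraction`):

```python
from fractions import Fraction

values = {Fraction(0)}  # zero is the one special value
for E in (-2, -1, 0):
    for d1 in range(1, 10):   # normalized: d_1 != 0
        for d2 in range(10):
            f = Fraction(10 * d1 + d2, 10) * Fraction(10) ** E
            values.add(f)
            values.add(-f)

print(len(values))  # 541
```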

What is the distribution on the real line?

$$f = \pm d_1 . d_2 \times 10^E \quad \text{with} \quad E \in [-2, 0]$$
d_1_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
E_values = [
    0,
    -1,
    -2,
]

fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)

axes.plot(0.0, 0.0, "|", markersize=20)
axes.plot([-1.0, 1.0], [0.0, 0.0], "k|", markersize=30)

axes.plot([-10.0, 10.0], [0.0, 0.0], "k")

axes.set_title("Distribution of Values $[-10, 10]$")
axes.set_yticks([])
ticks = [i for i in range(-10, 11, 1)]
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-10, 10])
plt.show()
fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)
            axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)

axes.plot(0.0, 0.0, "+", markersize=20)
axes.plot([-0.1, 0.1], [0.0, 0.0], "k|", markersize=30)
axes.plot([-1, 1], [0.0, 0.0], "k")

axes.set_title("Close up $[-1, 1]$")
axes.set_yticks([])
ticks = numpy.linspace(-1.0, 1.0, 21)
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-1, 1])
# fig.tight_layout(h_pad=1, w_pad=5)

plt.show()

What are the underflow and overflow limits?

  • Smallest number that can be represented is the underflow limit: $1.0 \times 10^{-2} = 0.01$

  • Largest number that can be represented is the overflow limit: $9.9 \times 10^0 = 9.9$

What is the smallest number $\epsilon_{\text{mach}}$ such that $1 + \epsilon_{\text{mach}} > 1$?

  • $\epsilon_{\text{mach}} = 0.1$
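Python's `decimal` module can emulate a 2-significant-digit decimal system directly (ignoring the exponent limits), which makes $\epsilon_{\text{mach}} = 0.1$ easy to see. This demonstration is our own, not part of the original notes:

```python
import decimal

with decimal.localcontext() as ctx:
    ctx.prec = 2  # two significant decimal digits, like the toy system
    one = decimal.Decimal(1)
    small = one + decimal.Decimal("0.04")  # 1.04 rounds back to 1.0
    big = one + decimal.Decimal("0.1")     # 1.1 survives: 1.1 > 1

print(small, big)
```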

Binary Systems

Consider the 2-digit precision base 2 system:

$$f = \pm d_1 . d_2 \times 2^E \quad \text{with} \quad E \in [-1, 1]$$

Number and distribution of numbers

  1. How many numbers can we represent with this system?

  2. What is the distribution on the real line?

  3. What are the underflow and overflow limits?

  4. What is $\epsilon_{\text{mach}}$?

How many numbers can we represent with this system?

$$f = \pm d_1 . d_2 \times 2^E \quad \text{with} \quad E \in [-1, 1]$$

$$2 \times 1 \times 2 \times 3 + 1 = 13$$

What is the distribution on the real line?

d_1_values = [1]
d_2_values = [0, 1]
E_values = [1, 0, -1]

fig = plt.figure(figsize=(10.0, 1.0))
axes = fig.add_subplot(1, 1, 1)

for E in E_values:
    for d1 in d_1_values:
        for d2 in d_2_values:
            axes.plot((d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)
            axes.plot(-(d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)

axes.plot(0.0, 0.0, "r+", markersize=20)
axes.plot([-4.5, 4.5], [0.0, 0.0], "k")

axes.set_title("Distribution of Values")
axes.set_yticks([])
axes.set_xticks(numpy.linspace(-4, 4, 9))
axes.set_xlabel("x")
axes.set_ylabel("")
axes.grid()
axes.set_xlim([-5, 5])
plt.show()
  • Smallest number that can be represented is the underflow limit: $1.0 \times 2^{-1} = 0.5$

  • Largest number that can be represented is the overflow limit: $1.1 \times 2^1 = 3$

  • $\epsilon_{\text{mach}} = 0.1 = 2^{-1} = 1/2$

Note: these numbers are in a binary system.

Quick rule of thumb:

$$2^3 \quad 2^2 \quad 2^1 \quad 2^0 \;.\; 2^{-1} \quad 2^{-2} \quad 2^{-3}$$

correspond to 8s, 4s, 2s, 1s . halves, quarters, eighths, ...

Real Systems - IEEE 754 Binary Floating Point Systems

Single Precision

  • Total storage allotted is 32 bits

  • Exponent is 8 bits $\Rightarrow E \in [-126, 128]$

  • Fraction is 23 bits ($p = 24$)

s EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                     31
  • Overflow $= 2^{128} \approx 3.40 \times 10^{38}$

  • Underflow $= 2^{-126} \approx 1.17 \times 10^{-38}$

  • $\epsilon_{\text{machine}} = 2^{-23} \approx 1.19 \times 10^{-7}$

Double Precision

  • Total storage allotted is 64 bits

  • Exponent is 11 bits $\Rightarrow E \in [-1022, 1024]$

  • Fraction is 52 bits ($p = 53$)

s EEEEEEEEEE FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FF
0 1       11 12                                                      63
  • Overflow $= 2^{1024} \approx 1.8 \times 10^{308}$

  • Underflow $= 2^{-1022} \approx 2.2 \times 10^{-308}$

  • $\epsilon_{\text{machine}} = 2^{-52} \approx 2.2 \times 10^{-16}$

Python Access to IEEE Numbers

Access many important parameters, such as machine epsilon:

import numpy
numpy.finfo(float).eps
print(numpy.finfo(numpy.float16))
print(numpy.finfo(numpy.float32))
print(numpy.finfo(float))
print(numpy.finfo(numpy.float128))

Examples

eps = numpy.finfo(float).eps
MAX = numpy.finfo(float).max
print("eps = {}".format(eps))
print("MAX = {}".format(MAX))

Show that $(1 + \epsilon_{\text{mach}}) > 1$ while adding anything smaller than $\epsilon_{\text{mach}}$ leaves 1 unchanged:

print(MAX)
print(MAX * (1 + 0.4 * eps))
print(1 + 0.4 * eps == 1.0)

Why should we care about this?

  • Floating point arithmetic is not associative or distributive

  • Floating point errors compound, do not assume even double precision is enough!

  • Mixing precision can be dangerous

Example 1: Simple Arithmetic

Simple arithmetic with $\delta < \epsilon_{\text{machine}}$.

Compare

$$1 + \delta - 1 \quad \text{vs.} \quad 1 - 1 + \delta$$
eps = numpy.finfo(float).eps
delta = 0.5 * eps
x = 1 + delta - 1
y = 1 - 1 + delta
print("1 + delta - 1 = {}".format(x))
print("1 - 1 + delta = {}".format(y))
print(x == y)

Example 2: Catastrophic cancellation

Let us examine what happens when we add two numbers $x$ and $y$ where $x + y \neq 0$. We can actually estimate these bounds by doing some error analysis. Here we need to introduce the idea that each floating point operation introduces an error such that

$$\text{fl}(x ~\text{op}~ y) = (x ~\text{op}~ y) (1 + \delta)$$

where $\text{fl}(\cdot)$ is a function that returns the floating point representation of the expression enclosed, $\text{op}$ is some operation (e.g. $+, -, \times, /$), and $\delta$ is the floating point error due to $\text{op}$.

Back to our problem at hand. The floating point error due to addition is

$$\text{fl}(x + y) = (x + y) (1 + \delta).$$

Comparing this to the true solution using a relative error we have

$$\frac{|(x + y) - \text{fl}(x + y)|}{|x + y|} = \frac{|(x + y) - (x + y)(1 + \delta)|}{|x + y|} = |\delta|.$$

so that if $\delta = \mathcal{O}(\epsilon_{\text{machine}})$ we are not too concerned.
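We can measure $\delta$ for a single addition by comparing against exact rational arithmetic (a sketch using `fractions.Fraction`); for IEEE doubles, round-to-nearest guarantees $|\delta| \le \epsilon_{\text{machine}} / 2 = 2^{-53}$ when the operands are themselves exact floats.

```python
from fractions import Fraction

x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)  # exact sum of the stored doubles
fl = Fraction(x + y)               # floating point sum, converted exactly
delta = abs(fl - exact) / exact    # relative error of the addition
print(float(delta))
```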

What if instead we consider a floating point error on the representations of $x$ and $y$, $x \neq y$, and say $\delta_x$ and $\delta_y$ are the magnitudes of the errors in their representation. Here we will assume this constitutes the floating point error rather than being associated with the operation itself.

Now consider the difference between the two numbers

$$\begin{aligned} \text{fl}(x - y) &= x (1 + \delta_x) - y (1 + \delta_y) \\ &= x - y + x \delta_x - y \delta_y \\ &= (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right) \end{aligned}$$

Again computing the relative error we then have

$$\begin{aligned} \frac{\left|(x - y) - (x - y) \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right|}{|x - y|} &= \left|1 - \left(1 + \frac{x \delta_x - y \delta_y}{x - y}\right)\right| \\ &= \frac{|x \delta_x - y \delta_y|}{|x - y|} \end{aligned}$$

The important distinction here is that now the error is dependent on the values of $x$ and $y$ and, more importantly, on their difference. Of particular concern is the relative size of $x - y$: as it approaches zero relative to the magnitudes of $x$ and $y$, the error can become arbitrarily large. This is known as catastrophic cancellation.

dx = numpy.array([10 ** (-n) for n in range(1, 16)])
x = 1.0 + dx
y = numpy.ones(x.shape)
error = numpy.abs(x - y - dx) / (dx)
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.loglog(dx, x + y, "o-")
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel("$x + y$")
axes.set_title(r"$\Delta x$ vs. $x+y$")

axes = fig.add_subplot(1, 2, 2)
axes.loglog(dx, error, "o-")
axes.set_xlabel(r"$\Delta x$")
axes.set_ylabel(r"$|x + y - \Delta x| / \Delta x$")
axes.set_title("Difference between $x$ and $y$ vs. Relative Error")

plt.show()

Example 3: Function Evaluation

Consider the function

$$f(x) = \frac{1 - \cos x}{x^2}$$

with $x \in [-10^{-4}, 10^{-4}]$.

Taking the limit as $x \rightarrow 0$ we can see what behavior we would expect to see from evaluating this function:

$$\lim_{x \rightarrow 0} \frac{1 - \cos x}{x^2} = \lim_{x \rightarrow 0} \frac{\sin x}{2 x} = \lim_{x \rightarrow 0} \frac{\cos x}{2} = \frac{1}{2}.$$

What does floating point representation do?

x = numpy.linspace(-1e-3, 1e-3, 100, dtype=numpy.float32)
f = 0.5
F = (1.0 - numpy.cos(x)) / x**2
rel_err = numpy.abs((f - F)) / f
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_err, "o")
axes.set_xlabel("x")
axes.grid()
axes.set_ylabel("Relative Error")
axes.set_title("$\\frac{1-\\cos{x}}{x^2} - \\frac{1}{2}$", fontsize=18)
plt.show()

Example 4: Evaluation of a Polynomial

$$f(x) = x^7 - 7x^6 + 21 x^5 - 35 x^4 + 35 x^3 - 21 x^2 + 7x - 1$$

Note: $f(1) = 0$ (and $f$ will be close to zero for $x \approx 1$)

Here we compare polynomial evaluation using naive powers compared to Horner’s method as implemented in eval_poly(p,x) defined above.

x = numpy.linspace(0.988, 1.012, 1000, dtype=numpy.float16)
y = (
    x**7
    - 7.0 * x**6
    + 21.0 * x**5
    - 35.0 * x**4
    + 35.0 * x**3
    - 21.0 * x**2
    + 7.0 * x
    - 1.0
)

# repeat using Horner's method from above
p = numpy.array([1, -7, 21, -35, 35, -21, 7, -1])
yh = eval_poly(p, x)
fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()
def eval_polys(p, x):
    """Evaluates a polynomial using Horner's method given coefficients p at x

      The polynomial is defined as

        P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]

    Parameters:
        p: list or numpy array of coefficients
        x:  scalar float or numpy array this version is less careful about input type

    Returns:
        P(x): value of the polynomial at x, returned as the same type as x

    """

    y = p[0]
    for element in p[1:]:
        y = y * x + element

    return y
# repeat using different Horner's method from above
yh = eval_polys(p, x)
fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()

axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()

Example 5: Rational Function Evaluation

Compute f(x)=x+1f(x) = x + 1 by the function

F(x)=x21x1F(x) = \frac{x^2 - 1}{x - 1}

Do you expect there to be issues?

x = numpy.linspace(0.5, 1.5, 101, dtype=numpy.float32)
f_hat = (x**2 - 1.0) / (x - 1.0)
f = x + 1.0
fig = plt.figure()
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, numpy.abs(f - f_hat) / numpy.abs(f))
axes.set_xlabel("$x$")
axes.set_ylabel("Relative Error")
axes.grid()
plt.show()

Combination of Errors

In general we need to concern ourselves with the combination of both discretization error and floating point error.

Reminder:

  • Discretization error: Errors arising from approximation of a function, truncation of a series...

$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} + O(x^7)$
  • Floating-point Error: Errors arising from approximating real numbers with finite-precision numbers

π3.14\pi \approx 3.14

or $\frac{1}{3} \approx 0.333333333$: just as $\frac{1}{3}$ has no finite decimal expansion, many simple numbers cannot be represented exactly as binary floating point numbers
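Both kinds of error are easy to see directly in Python. Below is a minimal check using only the standard library: the first part measures the truncation error of the sine series, the second inspects the exact rational value actually stored for 1/3.

```python
import math
from fractions import Fraction

# Discretization error: truncating the sine series after the x**5 term
# leaves a remainder of roughly x**7 / 7! for small x.
x = 0.1
series = x - x**3 / math.factorial(3) + x**5 / math.factorial(5)
print(abs(math.sin(x) - series))  # ~2e-11, on the order of x**7 / 5040

# Floating point error: the binary64 number closest to 1/3 is not 1/3.
third = 1.0 / 3.0
print(Fraction(third))            # the exact rational value actually stored
print(third == Fraction(1, 3))    # False
```

Fraction(third) recovers the stored binary value exactly, which is why the comparison against the true rational 1/3 fails.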

Example 1

Consider a finite difference approximation to the first derivative of a function

f(x)f(x+Δx)f(x)Δxf^\prime(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}

Note: in the limit Δx0\Delta x\rightarrow 0, this is the standard definition of the first derivative. However we’re interested in the error for a finite Δx\Delta x.

Moreover, (as we will see in future notebooks), there are many ways to approximate the first derivative. For example we can write the “centered first derivative” as

f(x)f(x+Δx)f(xΔx)2Δxf^\prime(x) \approx \frac{f(x + \Delta x) - f(x - \Delta x)}{2\Delta x}

Here we will simply compare the error of the two different finite-difference formulas given

f(x)=f(x)=exf(x) = f^\prime(x) = e^x

at $x=1$ for decreasing values of $\Delta x$. We will also introduce the idea of an 'inline' or lambda function in Python.

f = lambda x: numpy.exp(x)
f_prime = lambda x: numpy.exp(x)
delta_x = numpy.array([2.0 ** (-n) for n in range(1, 60)])
x = 1.0

# Forward finite difference approximation to first derivative
f_hat_1 = (f(x + delta_x) - f(x)) / (delta_x)
# Centered finite difference approximation to first derivative
f_hat_2 = (f(x + delta_x) - f(x - delta_x)) / (2.0 * delta_x)
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.loglog(delta_x, numpy.abs(f_hat_1 - f_prime(x)), "o-", label="One-Sided")
axes.loglog(delta_x, numpy.abs(f_hat_2 - f_prime(x)), "s-", label="Centered")
axes.legend(loc=3, fontsize=14)
axes.set_xlabel(r"$\Delta x$", fontsize=16)
axes.set_ylabel("Absolute Error", fontsize=16)
axes.set_title("Finite Difference approximations to $df/dx$", fontsize=18)
axes.grid()
plt.show()

Example 2

Evaluate exe^x with its Taylor series.

ex=n=0xnn!e^x = \sum^\infty_{n=0} \frac{x^n}{n!}

Can we pick $N < \infty$ such that the truncated series approximates $e^x$ over a given range $x \in [a,b]$ with relative error $E$ satisfying $E < 8 \cdot \varepsilon_{\text{machine}}$?

We can try simply evaluating the Taylor polynomial directly for various NN

from scipy.special import factorial


def my_exp(x, N=10):
    value = 0.0
    for n in range(N + 1):
        value += x**n / float(factorial(n))

    return value

And test this

eps = numpy.finfo(numpy.float64).eps

x = numpy.linspace(-2, 50.0, 100, dtype=numpy.float64)
MAX_N = 300
for N in range(1, MAX_N + 1):
    rel_error = numpy.abs((numpy.exp(x) - my_exp(x, N=N)) / numpy.exp(x))
    if numpy.all(rel_error < 8.0 * eps):
        break
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_error / eps)
axes.set_xlabel("x")
axes.set_ylabel("Relative Error/eps")
axes.set_title("N = {} terms".format(N))
axes.grid()
plt.show()

Can we do better?

Note:

the largest value of xx such that ex<e^x < MAX is:

print(numpy.log(numpy.finfo(float).max))

and numpy.exp handles that just fine

print(numpy.exp(709, dtype=numpy.float64))
print(numpy.exp(-709, dtype=numpy.float64))
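Pushing just past that limit shows what happens at the edges of the representable range. A small sketch, using numpy.errstate to silence the expected overflow warning:

```python
import numpy

# exp overflows to inf just past log(float max) ~= 709.78,
# while far in the negative direction it quietly underflows to zero.
with numpy.errstate(over="ignore"):
    print(numpy.exp(710.0))   # inf
print(numpy.exp(-746.0))      # 0.0: below even the smallest subnormal
```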

Your homework: the great Exp Challenge

One final example (optional): How to calculate Relative Error

Say we wanted to compute the relative error between two values xx and yy using xx as the normalizing value. Algebraically the forms

E=xyxE = \frac{x - y}{x}

and

E=1yxE = 1 - \frac{y}{x}

are equivalent. In finite precision what form might be expected to be more accurate and why?

Example based on a blog post by Nick Higham

Using the standard model of floating point arithmetic, in which each operation satisfies $\text{fl}(x \text{ op } y) = (x \text{ op } y)(1 + \delta)$ for some small $\delta$, the original definition contains two floating point operations such that

E1=fl(xyx)=fl(fl(xy)/x)=[(xy)(1+δ1)x](1+δ2)=xyx(1+δ1)(1+δ2)\begin{aligned} E_1 = \text{fl}\left(\frac{x - y}{x}\right) &= \text{fl}(\text{fl}(x - y) / x) \\ &= \left[ \frac{(x - y) (1 + \delta_1)}{x} \right ] (1 + \delta_2) \\ &= \frac{x - y}{x} (1 + \delta_1) (1 + \delta_2) \end{aligned}

For the other formulation we have

E2=fl(1yx)=fl(1fl(yx))=(1yx(1+δ1))(1+δ2)\begin{aligned} E_2 = \text{fl}\left( 1 - \frac{y}{x} \right ) &= \text{fl}\left(1 - \text{fl}\left(\frac{y}{x}\right) \right) \\ &= \left(1 - \frac{y}{x} (1 + \delta_1) \right) (1 + \delta_2) \end{aligned}

If we assume that all op\text{op}s have similar error magnitudes then we can simplify things by letting

δϵ.|\delta_\ast| \le \epsilon.

To compare the two formulations we again use the relative error between the true relative error eie_i and our computed versions EiE_i.

Original definition:

$\begin{aligned} \left|\frac{e - E_1}{e}\right| &= \left|\frac{\frac{x - y}{x} - \frac{x - y}{x} (1 + \delta_1) (1 + \delta_2)}{\frac{x - y}{x}}\right| \\ &= \left|1 - (1 + \delta_1) (1 + \delta_2)\right| \le 2 \epsilon + \epsilon^2 \end{aligned}$

Manipulated definition:

$\begin{aligned} \frac{e - E_2}{e} &= \frac{e - \left[1 - \frac{y}{x}(1 + \delta_1) \right] (1 + \delta_2)}{e} \\ &= \frac{e - \left[e - \frac{y}{x} \delta_1 \right] (1 + \delta_2)}{e} \\ &= \frac{e - \left[e + e\delta_2 - \frac{y}{x} \delta_1 - \frac{y}{x} \delta_1 \delta_2 \right] }{e} \\ &= - \delta_2 + \frac{1}{e} \frac{y}{x} \left(\delta_1 + \delta_1 \delta_2 \right) \\ &= - \delta_2 + \frac{1 - e}{e} \left(\delta_1 + \delta_1 \delta_2 \right) \end{aligned}$

so that

$\left|\frac{e - E_2}{e}\right| \le \epsilon + \left|\frac{1 - e}{e}\right| (\epsilon + \epsilon^2)$

We see then that our floating point error will be dependent on the relative magnitude of ee

Comparison of Relative Errors of estimates of Relative Error ;^)

# Based on the code by Nick Higham
# https://gist.github.com/higham/6f2ce1cdde0aae83697bca8577d22a6e
# Compares relative error formulations using single precision and compared to double precision

N = 501  # Note: Use 501 instead of 500 to avoid the zero value
d = numpy.finfo(numpy.float32).eps * 1e4
a = 3.0
x = a * numpy.ones(N, dtype=numpy.float32)
y = [
    x[i]
    + numpy.multiply(
        (i - numpy.divide(N, 2.0, dtype=numpy.float32)), d, dtype=numpy.float32
    )
    for i in range(N)
]

# Compute errors and "true" error
relative_error = numpy.empty((2, N), dtype=numpy.float32)
relative_error[0, :] = numpy.abs(x - y) / x
relative_error[1, :] = numpy.abs(1.0 - y / x)
x64 = numpy.asarray(x, dtype=numpy.float64)
y64 = numpy.asarray(y, dtype=numpy.float64)
exact = numpy.abs((x64 - y64) / x64)

# Compute differences between error calculations
error = numpy.empty((2, N))
for i in range(2):
    error[i, :] = numpy.abs((relative_error[i, :] - exact) / numpy.abs(exact))

fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(y, error[0, :], ".", markersize=10, label="$|x-y|/|x|$")
axes.semilogy(y, error[1, :], ".", markersize=10, label="$|1-y/x|$")

axes.grid(True)
axes.set_xlabel("y")
axes.set_ylabel("Relative Error")
axes.set_xlim((numpy.min(y), numpy.max(y)))
axes.set_ylim((5e-9, numpy.max(error[1, :])))
axes.set_title("Relative Error Comparison: x,y {}".format(y[0].dtype))
axes.legend()
plt.show()

Future issues with fp64 and High-Performance Computing

The Issues

  • In traditional high-performance computing, IEEE fp64 has become the standard precision for accurate, reproducible calculations across a wide range of scientific applications (e.g. climate models, fusion, solid mechanics)

  • Until recently, the needs of HPC drove the development of chips and hardware, so commodity computers and supercomputers benefited from the same technology.

  • However, with the rise of general-purpose GPUs and AI, the landscape is changing rapidly

A brief history of floating point hardware

Dongarra et al., 2024

  • 1980s: Dedicated separate floating point co-processors (e.g. Intel 8087, Motorola 68881)

  • 1989: Introduction of the Intel 486DX CPU with a built-in floating point unit

  • 1999: Introduction of the Nvidia GeForce 256, the first chip marketed as a "Graphics Processing Unit" (GPU): low-precision, fast parallel graphics

  • mid-2000s: Adoption of programmable general-purpose GPUs for floating point acceleration; addition of fp64 support on GPUs

  • 2006: Introduction of the Nvidia CUDA language for programmable GPUs: a shift to GPUs for high-performance computing and ML/AI

  • ~2020+: ML/AI revolution: Deep learning algorithms driven by matrix multiplications that tolerate low precision

Current fp64 floating point performance for CPUs and GPUs

Current floating point formats for CPUs and GPUs

Near-future NVIDIA GPU floating point roadmap (StorageReview.com)

| Specification | H100 | H200 | B100 | B200 | B300 |
| --- | --- | --- | --- | --- | --- |
| Max Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e |
| Memory Bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | 8 TB/s |
| FP4 Tensor Core | n/a | n/a | 14 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP6 Tensor Core | n/a | n/a | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| FP8 Tensor Core | 3958 TFLOPS (~4 PFLOPS) | 3958 TFLOPS (~4 PFLOPS) | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| INT8 Tensor Core | 3958 TOPS | 3958 TOPS | 7 POPS | 9 POPS | 15 POPS* |
| FP16/BF16 Tensor Core | 1979 TFLOPS (~2 PFLOPS) | 1979 TFLOPS (~2 PFLOPS) | 3.5 PFLOPS | 4.5 PFLOPS | 7.5 PFLOPS* |
| TF32 Tensor Core | 989 TFLOPS | 989 TFLOPS | 1.8 PFLOPS | 2.2 PFLOPS | 3.3 PFLOPS* |
| FP32 (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 Tensor Core (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 (Dense) | 34 TFLOPS | 34 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| Max Power Consumption | 700 W | 700 W | 700 W | 1000 W | Information Unknown |

Beyond Blackwell

Interesting times indeed

...the landscape of high-performance computation is increasingly complex...but there are important classes of problems that still need high-precision floating point. Some Options:

  • fp64 Emulation leveraging low-precision hardware

  • clever mixed precision algorithms
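As a taste of the first option, extended precision can be emulated in software by carrying the rounding error of each operation along explicitly. Below is a minimal sketch of Knuth's two-sum algorithm, the basic building block of "double-double" arithmetic; the function name two_sum is our own choice for illustration.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    a_round = s - b          # the part of s attributable to a
    b_round = s - a_round    # the part of s attributable to b
    err = (a - a_round) + (b - b_round)
    return s, err

# The tiny addend is lost in the rounded sum s,
# but is captured exactly in the error term err.
s, err = two_sum(1.0, 1e-20)
print(s, err)  # 1.0 1e-20
```

Chaining such (s, err) pairs through every operation is how fp64-or-better results can be built out of lower-precision hardware arithmetic.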

Operation Counting

Discretization Error: Why not use more terms in the Taylor series?

Floating Point Error: Why not use the highest precision possible?

Example 1: Matrix-Vector Multiplication

Let A,BRN×NA, B \in \mathbb{R}^{N \times N} and xRNx \in \mathbb{R}^N.

  1. Count the approximate number of operations it will take to compute AxA x.

  2. Do the same for ABA B.

Matrix-vector product: Defining $[A]_i$ as the $i$th row of $A$ and $A_{ij}$ as the $(i, j)$th entry, each entry of the product is

$(A x)_i = [A]_i \cdot x = \sum^N_{j=1} A_{ij} x_j$

Take an explicit case, say N=3N = 3, then the operation count is

$A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \end{bmatrix} = \begin{bmatrix} A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 \\ A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 \\ A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3 \end{bmatrix}$

This leads to 15 operations (6 additions and 9 multiplications).

Take another case, say N=4N = 4, then the operation count is

$A x = \begin{bmatrix} [A]_1 \cdot x \\ [A]_2 \cdot x \\ [A]_3 \cdot x \\ [A]_4 \cdot x \end{bmatrix} = \begin{bmatrix} A_{11} \times x_1 + A_{12} \times x_2 + A_{13} \times x_3 + A_{14} \times x_4 \\ A_{21} \times x_1 + A_{22} \times x_2 + A_{23} \times x_3 + A_{24} \times x_4 \\ A_{31} \times x_1 + A_{32} \times x_2 + A_{33} \times x_3 + A_{34} \times x_4 \\ A_{41} \times x_1 + A_{42} \times x_2 + A_{43} \times x_3 + A_{44} \times x_4 \end{bmatrix}$

This leads to 28 operations (12 additions and 16 multiplications).

Generalizing this there are N2N^2 multiplications and N(N1)N (N -1) additions for a total of

operations=N(N1)+N2=O(N2).\text{operations} = N (N - 1) + N^2 = \mathcal{O}(N^2).
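We can check this count empirically with a naive implementation that tallies each floating point operation; matvec_opcount is an illustrative helper, not part of the text above.

```python
import numpy


def matvec_opcount(A, x):
    """Naive matrix-vector product that counts multiplications and additions."""
    N = len(x)
    y = numpy.zeros(N)
    mults, adds = 0, 0
    for i in range(N):
        # first term of the row's inner product: one multiplication
        y[i] = A[i, 0] * x[0]
        mults += 1
        # remaining terms: one multiplication and one addition each
        for j in range(1, N):
            y[i] += A[i, j] * x[j]
            mults += 1
            adds += 1
    return y, mults, adds


N = 4
A = numpy.arange(1.0, N * N + 1).reshape(N, N)
x = numpy.ones(N)
y, mults, adds = matvec_opcount(A, x)
print(mults, adds)  # 16 12: N**2 multiplications and N*(N-1) additions
```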

Matrix-Matrix product ($AB$): Defining $[B]_j$ as the $j$th column of $B$, each entry of the product is the inner product of a row of $A$ with a column of $B$:

$(A B)_{ij} = [A]_i \cdot [B]_j = \sum^N_{k=1} A_{ik} B_{kj}$

The inner product of two vectors is represented by

ab=i=1Naibia \cdot b = \sum^N_{i=1} a_i b_i

leading to $N$ multiplications and $N - 1$ additions, i.e. $\mathcal{O}(N)$ operations per entry. Since there are $N^2$ entries in the resulting matrix we have $\mathcal{O}(N^3)$ operations in total.
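The same style of operation count for the full product confirms the cubic scaling; again, matmul_opcount is an illustrative sketch rather than anything defined earlier.

```python
import numpy


def matmul_opcount(A, B):
    """Naive matrix-matrix product with an explicit operation counter."""
    N = A.shape[0]
    C = numpy.zeros((N, N))
    ops = 0
    for i in range(N):
        for j in range(N):
            # inner product of row i of A with column j of B
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]
                ops += 2  # one multiplication and one (accumulate) addition
    return C, ops


N = 5
A = numpy.eye(N)
B = numpy.arange(1.0, N * N + 1).reshape(N, N)
C, ops = matmul_opcount(A, B)
print(ops)  # 250 = 2 * N**3
```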

There are methods for performing matrix-matrix multiplication faster than $\mathcal{O}(N^3)$. The following figure shows a collection of algorithms over time that have successively lowered the exponent $\omega$ in the operation bound

$\mathcal{O}(N^\omega)$

[Figure: matrix multiplication operation bound over time]

References
  1. Overton, M. L. (2001). Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial and Applied Mathematics. 10.1137/1.9780898718072