¶

from __future__ import print_function
%matplotlib inline
import matplotlib.pyplot as plt
import numpySources of Error¶
Error can come from many sources when formulating problems and/or applying numerical methods:
Model/Data Error
Discretization Error
Floating Point Error
Convergence Error
Objectives¶
Understand the different sources of error
Explore some simple approaches to error analysis
Quantify errors
absolute error
relative error
precision
Long term goals
Use error estimates to control accuracy/reliability of solutions
Understand errors so you can believe and justify your solutions
Model and Data Error¶
Errors in fundamental formulation
SIR model
simplistic averaged model of interactions
basic model predicts a single peak
Data Error - Inaccuracy in measurement or uncertainties in parameters
Unfortunately we cannot control model and data error directly but we can use methods that may be more robust in the presense of these types of errors.
Discretization or Truncation Error¶
Errors arising from approximating a function with a simpler function, e.g. Using the approximation when .
Floating Point Error¶
Errors arising from approximating real numbers with finite-precision numbers and arithmetic.
Convergence Error¶
In some instances an algorithm is developed that will take a current approximation and then find an improvement on the current approximation. In some instances the errors generated in each indivudal step can accumulate or become magnified after repeating the algorithm a number of times.
Basic Definitions¶
Before exploring the different kinds of error, it is important to first define the ways that error is measured. Given a true value of a function and an approximate solution define:
Absolute Error:
Relative Error:
Note: these definitions assume and are scalar valued. However these definitions are readily extended to more complicate objects such as vectors or matrices with appropriate norms.
Decimal Precision¶
This definition of relative error provides a convenient estimate for the number of digits of decimal precision
given a relative error , the precision is the largest integer such that
Example
if has significant digits
if has significant digits (because this error would cause rounding up)
f = numpy.exp(1.0)
F = 2.71
print("f = {}".format(f))
print("F = {}".format(F))e = numpy.abs(f - F)
r = e / numpy.abs(f)
print("Absolute Error: {}".format(e))
print("Relative Error: {}".format(r))p = int(-numpy.log10(r / 5.0))
print("Decimal precision: {}".format(p))Big-O Notation¶
In many situations an approximation will have a parameter associated with it, and the value of the parameter is often chosen to insure that the error is reasonable in a given situation. In such circumstances we often want to know the impact on the error if we change the value of the parameter. This leads to the definition of Big-O notation:
if and only if
In practice we use Big-O notation to say something about how the terms we may have left out of a series might behave. We saw an example earlier of this with the Taylor’s series approximations.
Example:¶
Consider approximating a differentiable function by its Taylor polynomial (truncated Taylor series) expanded around ., i.e.
or for the case expanded around
or the absolute error
We can also develop rules for error propagation based on Big-O notation:
In general, there are two theorems that do not need proof and hold when the value of x is large:
Let
then
and
On the other hand, if we are interested in small values of x, say , the above expressions can be modified as follows:
then
and
Note: In this case we suppose that at least the polynomial with has the following form:
or
so that there is an term that guarantees the existence of in the final product.
To get a sense of why we care most about the power on when considering convergence the following figure shows how different powers on the convergence rate can effect how quickly we converge to our solution. Note that here we are plotting the same data two different ways. Plotting the error as a function of is a common way to show that a numerical method is doing what we expect and exhibits the correct convergence behavior. Since errors can get small quickly it is very common to plot these sorts of plots on a log-log scale to easily visualize the results. Note that if a method was truly of the order that they will be a linear function in log-log space with slope .
Behavior of error as a function of ¶
dx = numpy.linspace(1.0, 1e-4, 100)
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2.0)
axes = []
axes.append(fig.add_subplot(1, 2, 1))
axes.append(fig.add_subplot(1, 2, 2))
for n in range(1, 5):
axes[0].plot(dx, dx**n, label="$\Delta x^%s$" % n)
axes[1].loglog(dx, dx**n, label="$\Delta x^%s$" % n)
axes[0].legend(loc=2)
axes[1].set_xticks([10.0 ** (-n) for n in range(5)])
axes[1].set_yticks([10.0 ** (-n) for n in range(16)])
axes[1].legend(loc=4)
for n in range(2):
axes[n].set_title("Growth of Error vs. $\Delta x^n$")
axes[n].set_xlabel("$\Delta x$")
axes[n].set_ylabel("Estimated Error")
plt.show()Discretization Error¶
Taylor’s Theorem: Let and , then for all there exists a number that lies between and such that
where is the Taylor polynomial approximation
and is the residual (the part of the series we left off)
If we knew the value of we would know the exact value of and therefore the function . In general we don’t know this value but we can use to put upper bounds on the error and to understand how the error changes as we move away from .
Start by replacing with . The primary idea here is that the residual becomes smaller as (at which point ).
and is the residual (the part of the series we left off)
where is an upper bound on
Example 1¶
with on the interval
Using this we can find expressions for the relative and absolute error as a function of assuming .
Remainders:
We can also use the package sympy which has the ability to calculate Taylor polynomials built-in!
import sympy
sympy.init_printing(pretty_print=True)
x = sympy.symbols("x")
f = sympy.exp(x)
f.series(x0=0, n=3)Lets plot this numerically for a section of .
x = numpy.linspace(-1, 1, 100)
f = numpy.exp(x)
T_N = 1.0 + x + x**2 / 2.0
R_N = numpy.exp(1) * x**3 / 6.0fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
axes.plot(x, numpy.abs(numpy.exp(x) - T_N), "g--")
axes.plot(0.0, 1.0, "o", markersize=10)
axes.grid()
axes.set_xlabel("x", fontsize=16)
axes.set_ylabel("$f(x)$, $T_N(x)$, $|R_N(x)|$", fontsize=16)
axes.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "e(x)"], loc=2)
plt.show()plot this problem¶
x = numpy.linspace(0.8, 2, 100)
f = 1.0 / x
T_N = 1.0 - (x - 1) + (x - 1) ** 2
R_N = -((x - 1.0) ** 3) / (1.0**4)plt.figure(figsize=(8, 6))
plt.plot(x, T_N, "r", x, f, "k", x, numpy.abs(R_N), "b")
plt.plot(x, numpy.abs(f - T_N), "g--")
plt.plot(1.0, 1.0, "o", markersize=10)
plt.grid(True)
plt.xlabel("x", fontsize=16)
plt.ylabel("$f(x)$, $T_N(x)$, $R_N(x)$", fontsize=16)
plt.title("$f(x) = 1/x$", fontsize=18)
plt.legend(["$T_N(x)$", "$f(x)$", "$|R_N(x)|$", "$e(x)$"], loc="best")
plt.show()Computational Issue #1: Accuracy... how many terms?¶
Given a Taylor Polynomial approximation of an arbitrary function , how do we determine how many terms are required such that . And how do we determine the tolerance?
Computational Issue #2 Efficiency... Operation counts for polynomial evaluation¶
Given
or
what is the most efficient way to evaluate ? (i.e. minimize number of floating point operations)
using nested multiplication (aka Horner’s Method):
Consider how many operations it takes for each...
Note: here we’re just counting multiplications as they will dominate the flop count
Adding up all the operations we can in general think of this as a pyramid (it’s really the triangle numbers)

Thus we can estimate that the algorithm written this way will take approximately operations to complete.
Looking at nested iteration, however:
Here we find that the method is compared to the first evaluation which (we usually drop the 2 in these cases). That’s a huge difference for large !
Algorithm¶
Fill in the function and implement Horner’s method:
def eval_poly(p, x):
'''Evaluates a polynomial using Horner's method given coefficients p at x
The polynomial is defined as
P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
Parameters:
p: list or numpy array of coefficients
x: scalar float
returns:
P(x): value of the polynomial at point x (float)
'''
passdef eval_poly(p, x):
"""Evaluates a polynomial using Horner's method given coefficients p at x
The polynomial is defined as
P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
Parameters:
p: list or numpy array of coefficients
x: scalar float or numpy array (this version is more robust to floating point error)
returns:
P(x): value of the polynomial at point x, P will return as the same type as x
"""
if isinstance(x, numpy.ndarray):
y = p[0] * numpy.ones(x.shape)
elif isinstance(x, float):
y = p[0]
else:
raise TypeError
for element in p[1:]:
y = y * x + element
return y# Scalar test
p = [1, 2, 3]
x = 1.0
test = eval_poly(p, x)
answer = numpy.array([x**2, x, 1]).dot(p)
print("test = {} ({}), answer = {} ({})".format(test, type(test), answer, type(answer)))
numpy.testing.assert_allclose(test, answer)
print("success")# Vectorized test with x a numpy array
p = [1, -3, 10, 4, 5, 5]
x = numpy.linspace(-10, 10, 100)
P = eval_poly(p, x)
print("x: {}, P(x): {}".format(type(x), type(P)))plt.plot(x, P)
plt.xlabel("x")
plt.ylabel("P(x)")
plt.title("{}th order polynomial, p={}".format(len(p) - 1, p))
plt.grid()
plt.show()Convergence Error¶
In some circumstances a formula or algorithm is applied repeatedly as a way to obtain a final approximation. Usually, the errors that occur at each individual step are small. By repeating the algorithm, though, the errors can sometimes grow or become magnified.
As example of this phenomena is given below. The values of the terms in a difference equation are calculated, $$
$$
The true solution to the difference equation is , where 0, 1, 2,
# Choose the number of iterations
N = 40
y = numpy.empty(N + 1) # Allocate an empty vector with N+1 entries
# Now use the difference equation to generate the numbers in the sequence
y[0] = 1
y[1] = 1 / 5
for n in range(2, N + 1):
y[n] = 16 / 5 * y[n - 1] - 3 / 5 * y[n - 2]And plot the result
# Now plot the result
n = numpy.arange(0, N + 1)
fig = plt.figure(figsize=(10.0, 5.0))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(n, y, "rx", markersize=5, label="$y_n$")
axes.semilogy(n, (1 / 5) ** n, "b.", label="$y_{true}$")
axes.grid()
axes.set_title("Calculated Values Of A Difference Equation", fontsize=18)
axes.set_xlabel("$n$", fontsize=16)
axes.set_ylabel("$y_n$", fontsize=16)
axes.legend(loc="best", shadow=True)
plt.show()Simply looking at the exact solution, the sequence of numbers generated by the difference equation above should get very close to zero. Instead, the numbers in the sequence initially get closer to zero, but at some point they begin to grow and get larger. An underlying problem is that the computer is not able to store the numbers exactly. The second number in the sequence, has a small error, and the computer stores it as where is some small error.
Each time a new number in the loop is generated, the error is multiplied. For example, after the first iteration is
After the second time through the loop, the value of is
Even though the value of is very close to zero, every iteration makes the error grow. Repeated multiplication will result in a very large number.
The error associated with the initial representation of the number is a problem with the way a digital computer stores floating point numbers. In most instances the computer cannot represent a number exactly, and the small error in approximating a given number can give rise to other problems.
Floating Point Error¶
Errors arising from approximating real numbers with finite-precision numbers
or in decimal, results from a finite number of registers to represent each number.
Floating Point Systems¶
Numbers in floating point systems are represented as a series of bits that represent different pieces of a number. In normalized floating point systems there are some standard conventions for what these bits are used for. In general the numbers are stored by breaking them down into the form
where
± is a single bit representing the sign of the number
is called the mantissa. Note that technically the decimal could be moved but generally, using scientific notation, the decimal can always be placed at this location. The digits are called the fraction with digits of precision. Normalized systems specifically put the decimal point in the front like we have and assume unless the number is exactly 0.
is the base. For binary , for decimal , etc.
is the exponent, an integer in the range
The important points on any floating point system is that
There exist a discrete and finite set of representable numbers
These representable numbers are not evenly distributed on the real line
Arithmetic in floating point systems yield different results from infinite precision arithmetic (i.e. “real” math)
Properties of Floating Point Systems¶
All floating-point systems are characterized by several important numbers
Smalled normalized number (underflow if below - related to subnormal numbers around zero)
Largest normalized number (overflow if above)
Zero
Machine or
infandnan, infinity and Not a Number respectively
Number and distribution of numbers
How many numbers can we represent with this system?
What is the distribution on the real line?
What is the underflow and overflow limits?
What is the smallest number such that ?
How many numbers can we represent with this system?
sign bit: 2
: 9 (normalized numbers )
: 10
: 3
zero: 1
total:
What is the distribution on the real line?
d_1_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
E_values = [
0,
-1,
-2,
]
fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)
for E in E_values:
for d1 in d_1_values:
for d2 in d_2_values:
axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)
axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r|", markersize=20)
axes.plot(0.0, 0.0, "|", markersize=20)
axes.plot([-1.0, 1.0], [0.0, 0.0], "k|", markersize=30)
axes.plot([-10.0, 10.0], [0.0, 0.0], "k")
axes.set_title("Distribution of Values $[-10, 10]$")
axes.set_yticks([])
ticks = [i for i in range(-10, 11, 1)]
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-10, 10])
plt.show()fig = plt.figure(figsize=(10.0, 1.5))
axes = fig.add_subplot(1, 1, 1)
for E in E_values:
for d1 in d_1_values:
for d2 in d_2_values:
axes.plot((d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)
axes.plot(-(d1 + d2 * 0.1) * 10**E, 0.0, "r+", markersize=20)
axes.plot(0.0, 0.0, "+", markersize=20)
axes.plot([-0.1, 0.1], [0.0, 0.0], "k|", markersize=30)
axes.plot([-1, 1], [0.0, 0.0], "k")
axes.set_title("Close up $[-1, 1]$")
axes.set_yticks([])
ticks = numpy.linspace(-1.0, 1.0, 21)
axes.set_xticks(ticks)
axes.set_xlabel("x")
axes.set_ylabel("")
axes.set_xlim([-1, 1])
# fig.tight_layout(h_pad=1, w_pad=5)
plt.show()What is the underflow and overflow limits?
Smallest number that can be represented is the underflow:
Largest number that can be represented is the overflow:
What is the smallest number such that ?
Binary Systems¶
Consider the 2-digit precision base 2 system:
Number and distribution of numbers¶
How many numbers can we represent with this system?
What is the distribution on the real line?
What is the underflow and overflow limits?
What is ?
How many numbers can we represent with this system?
What is the distribution on the real line?
d_1_values = [1]
d_2_values = [0, 1]
E_values = [1, 0, -1]
fig = plt.figure(figsize=(10.0, 1.0))
axes = fig.add_subplot(1, 1, 1)
for E in E_values:
for d1 in d_1_values:
for d2 in d_2_values:
axes.plot((d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)
axes.plot(-(d1 + d2 * 0.5) * 2**E, 0.0, "r+", markersize=20)
axes.plot(0.0, 0.0, "r+", markersize=20)
axes.plot([-4.5, 4.5], [0.0, 0.0], "k")
axes.set_title("Distribution of Values")
axes.set_yticks([])
axes.set_xticks(numpy.linspace(-4, 4, 9))
axes.set_xlabel("x")
axes.set_ylabel("")
axes.grid()
axes.set_xlim([-5, 5])
plt.show()Smallest number that can be represented is the underflow:
Largest number that can be represented is the overflow:
Note: these numbers are in a binary system.
Quick rule of thumb:
correspond to 8s, 4s, 2s, 1s . halves, quarters, eighths, ...
Real Systems - IEEE 754 Binary Floating Point Systems¶
Single Precision¶
Total storage alloted is 32 bits
Exponent is 8 bits
Fraction 23 bits ()
s EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1 8 9 31Overflow
Underflow
Double Precision¶
Total storage alloted is 64 bits
Exponent is 11 bits
Fraction 52 bits ()
s EEEEEEEEEE FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FF
0 1 11 12 63Overflow
Underflow
Python Access to IEEE Numbers¶
Access many important parameters, such as machine epsilon:
import numpy
numpy.finfo(float).epsprint(numpy.finfo(numpy.float16))print(numpy.finfo(numpy.float32))print(numpy.finfo(float))print(numpy.finfo(numpy.float128))Examples¶
eps = numpy.finfo(float).eps
MAX = numpy.finfo(float).max
print("eps = {}".format(eps))
print("MAX = {}".format(MAX))Show that
print(MAX)print(MAX * (1 + 0.4 * eps))print(1 + 0.4 * eps == 1.0)Why should we care about this?¶
Floating point arithmetic is not commutative or associative
Floating point errors compound, do not assume even double precision is enough!
Mixing precision can be dangerous
eps = numpy.finfo(float).eps
delta = 0.5 * eps
x = 1 + delta - 1
y = 1 - 1 + delta
print("1 + delta - 1 = {}".format(x))
print("1 - 1 + delta = {}".format(y))
print(x == y)Example 2: Catastrophic cancellation¶
Let us examine what happens when we add two numbers and where . We can actually estimate these bounds by doing some error analysis. Here we need to introduce the idea that each floating point operation introduces an error such that
where is a function that returns the floating point representation of the expression enclosed, is some operation (e.g. ), and is the floating point error due to .
Back to our problem at hand. The floating point error due to addition is
Comparing this to the true solution using a relative error we have
so that if we are not too concerned.
What if instead we consider a floating point error on the representations of and , , and say and are the magnitude of the errors in their representation. Here we will assume this constitutes the floating point error rather than being associated with the operation itself.
Now consider the difference between the two numbers
Again computing the relative error we then have
The important distinction here is that now the error is dependent on the values of and and more importantly, their difference. Of particular concern is the relative size of . As it approaches zero relative to the magnitudes of and the error could be arbitrarily large. This is known as catastrophic cancellation.
dx = numpy.array([10 ** (-n) for n in range(1, 16)])
x = 1.0 + dx
y = numpy.ones(x.shape)
error = numpy.abs(x - y - dx) / (dx)fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2)
axes = fig.add_subplot(1, 2, 1)
axes.loglog(dx, x + y, "o-")
axes.set_xlabel("$\Delta x$")
axes.set_ylabel("$x + y$")
axes.set_title("$\Delta x$ vs. $x+y$")
axes = fig.add_subplot(1, 2, 2)
axes.loglog(dx, error, "o-")
axes.set_xlabel("$\Delta x$")
axes.set_ylabel("$|x + y - \Delta x| / \Delta x$")
axes.set_title("Difference between $x$ and $y$ vs. Relative Error")
plt.show()Taking the limit as we can see what behavior we would expect to see from evaluating this function:
What does floating point representation do?
x = numpy.linspace(-1e-3, 1e-3, 100, dtype=numpy.float32)
f = 0.5
F = (1.0 - numpy.cos(x)) / x**2
rel_err = numpy.abs((f - F)) / ffig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_err, "o")
axes.set_xlabel("x")
axes.grid()
axes.set_ylabel("Relative Error")
axes.set_title("$\\frac{1-\\cos{x}}{x^2} - \\frac{1}{2}$", fontsize=18)
plt.show()Example 4: Evaluation of a Polynomial¶
Note: (and will be close to zero for )
Here we compare polynomial evaluation using naive powers compared to Horner’s method as implemented in eval_poly(p,x) defined above.
x = numpy.linspace(0.988, 1.012, 1000, dtype=numpy.float16)
y = (
x**7
- 7.0 * x**6
+ 21.0 * x**5
- 35.0 * x**4
+ 35.0 * x**3
- 21.0 * x**2
+ 7.0 * x
- 1.0
)
# repeat using Horner's method from above
p = numpy.array([1, -7, 21, -35, 35, -21, 7, -1])
yh = eval_poly(p, x)fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)
axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()
axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()def eval_polys(p, x):
"""Evaluates a polynomial using Horner's method given coefficients p at x
The polynomial is defined as
P(x) = p[0] x**n + p[1] x**(n-1) + ... + p[n-1] x + p[n]
Parameters:
p: list or numpy array of coefficients
x: scalar float or numpy array this version is less careful about input type
returns:
P(x): value of the polynomial at point x, P will return as the same type as x
"""
y = p[0]
for element in p[1:]:
y = y * x + element
return y# repeat using different Horner's method from above
yh = eval_polys(p, x)fig = plt.figure(figsize=(8, 6))
fig.set_figwidth(fig.get_figwidth() * 2)
axes = fig.add_subplot(1, 2, 1)
axes.plot(x, y, "r", label="naive")
axes.plot(x, yh, "b", label="horner")
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_ylim((-0.1, 0.1))
axes.set_xlim((x[0], x[-1]))
axes.grid()
axes.legend()
axes = fig.add_subplot(1, 2, 2)
axes.plot(x, yh - y, "g")
axes.grid()
axes.set_xlabel("x")
axes.set_ylabel("$f_{horner} - f_n$")
axes.set_title("error")
plt.show()x = numpy.linspace(0.5, 1.5, 101, dtype=numpy.float32)
f_hat = (x**2 - 1.0) / (x - 1.0)
f = x + 1.0fig = plt.figure()
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, numpy.abs(f - f_hat) / numpy.abs(f))
axes.set_xlabel("$x$")
axes.set_ylabel("Relative Error")
axes.grid()
plt.show()Combination of Errors¶
In general we need to concern ourselves with the combination of both discretization error and floating point error.
Reminder:¶
Discretization error: Errors arising from approximation of a function, truncation of a series...
Floating-point Error: Errors arising from approximating real numbers with finite-precision numbers
or in decimal cannot be represented exactly as a binary floating point number
Example 1¶
Consider a finite difference approximation to the first derivative of a function
Note: in the limit , this is the standard definition of the first derivative. However we’re interested in the error for a finite .
Moreover, (as we will see in future notebooks), there are many ways to approximate the first derivative. For example we can write the “centered first derivative” as
Here we will just compare the the error for the two different finite-difference formulas given
at for decreasing values of . We will also introduce the idea of an ‘inline’ or lambda function in python.
f = lambda x: numpy.exp(x)
f_prime = lambda x: numpy.exp(x)delta_x = numpy.array([2.0 ** (-n) for n in range(1, 60)])
x = 1.0
# Forward finite difference approximation to first derivative
f_hat_1 = (f(x + delta_x) - f(x)) / (delta_x)
# Centered finite difference approximation to first derivative
f_hat_2 = (f(x + delta_x) - f(x - delta_x)) / (2.0 * delta_x)fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.loglog(delta_x, numpy.abs(f_hat_1 - f_prime(x)), "o-", label="One-Sided")
axes.loglog(delta_x, numpy.abs(f_hat_2 - f_prime(x)), "s-", label="Centered")
axes.legend(loc=3, fontsize=14)
axes.set_xlabel("$\Delta x$", fontsize=16)
axes.set_ylabel("Absolute Error", fontsize=16)
axes.set_title("Finite Difference approximations to $df/dx$", fontsize=18)
axes.grid()
plt.show()Example 2¶
Evaluate with its Taylor series.
Can we pick that can approximate over a give range such that the relative error satisfies ?
We can try simply evaluating the Taylor polynomial directly for various
from scipy.special import factorial
def my_exp(x, N=10):
value = 0.0
for n in range(N + 1):
value += x**n / float(factorial(n))
return valueAnd test this
eps = numpy.finfo(numpy.float64).eps
x = numpy.linspace(-2, 50.0, 100, dtype=numpy.float64)
MAX_N = 300
for N in range(1, MAX_N + 1):
rel_error = numpy.abs((numpy.exp(x) - my_exp(x, N=N)) / numpy.exp(x))
if numpy.all(rel_error < 8.0 * eps):
breakfig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.plot(x, rel_error / eps)
axes.set_xlabel("x")
axes.set_ylabel("Relative Error/eps")
axes.set_title("N = {} terms".format(N))
axes.grid()
plt.show()print(numpy.log(numpy.finfo(float).max))and numpy.exp handles that just fine
print(numpy.exp(709, dtype=numpy.float64))
print(numpy.exp(-709, dtype=numpy.float64))Your homework: the great Exp Challenge
One final example (optional): How to calculate Relative Error¶
Say we wanted to compute the relative error between two values and using as the normalizing value. Algebraically the forms
and
are equivalent. In finite precision what form might be expected to be more accurate and why?
Example based on a blog post by Nick Higham
Using this model the original definition contains two floating point operations such that
For the other formulation we have
If we assume that all s have similar error magnitudes then we can simplify things by letting
To compare the two formulations we again use the relative error between the true relative error and our computed versions .
Original definition:
Manipulated definition:
We see then that our floating point error will be dependent on the relative magnitude of
Comparison of Relative Errors of estimates of Relative Error ;^)¶
# Based on the code by Nick Higham
# https://gist.github.com/higham/6f2ce1cdde0aae83697bca8577d22a6e
# Compares relative error formulations using single precision and compared to double precision
N = 501 # Note: Use 501 instead of 500 to avoid the zero value
d = numpy.finfo(numpy.float32).eps * 1e4
a = 3.0
x = a * numpy.ones(N, dtype=numpy.float32)
y = [
x[i]
+ numpy.multiply(
(i - numpy.divide(N, 2.0, dtype=numpy.float32)), d, dtype=numpy.float32
)
for i in range(N)
]
# Compute errors and "true" error
relative_error = numpy.empty((2, N), dtype=numpy.float32)
relative_error[0, :] = numpy.abs(x - y) / x
relative_error[1, :] = numpy.abs(1.0 - y / x)
exact = numpy.abs((numpy.float64(x) - numpy.float64(y)) / numpy.float64(x))
# Compute differences between error calculations
error = numpy.empty((2, N))
for i in range(2):
error[i, :] = numpy.abs((relative_error[i, :] - exact) / numpy.abs(exact))
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.semilogy(y, error[0, :], ".", markersize=10, label="$|x-y|/|x|$")
axes.semilogy(y, error[1, :], ".", markersize=10, label="$|1-y/x|$")
axes.grid(True)
axes.set_xlabel("y")
axes.set_ylabel("Relative Error")
axes.set_xlim((numpy.min(y), numpy.max(y)))
axes.set_ylim((5e-9, numpy.max(error[1, :])))
axes.set_title("Relative Error Comparison: x,y {}".format(y[0].dtype))
axes.legend()
plt.show()Some other links that might be helpful regarding IEEE Floating Point:
Future issues with fp64 and High-Performance Computing¶
The Issues¶
In traditional High-Performance computing IEEE fp64 has become the standard precision necessary for accurate, reproducible calculations for a wide range of scientific computing (e.g. climate models, fusion, solid mechanics)
Until recently, the needs for HPC drove the development of Chips/Hardware such that Commodity Computers and Super Computers benefited from the same technology.
However, with the rise of general purpose GPU’s and AI, the landscape is changing rapidly
A brief history of floating point hardware¶
1980’s: Dedicated seperate floating point co-processors (e.g. Intel 8087, Motorola 68881 co-processors)
1987: Introduction of Intel x486 CPU’s with built in floating point registers
1999: Introduction of Nvidia GeForce 256 separate “Graphical Processing Unit” GPU low precision, fast parallel graphics.
mid-2000s: Adoption of programmable General Purpose GPU’s for fp acceleration, addition of fp64 on GPU’s
2006: Introduction of Nvidia CUDA language for programmable GPU’s: shift to GPU’s for high-performance computing and ML/AI
~2020+: ML/AI revolution: Deep learning algorithms driven by matrix multiplications that tolerate low precision
Near Future NVIDIA GPU FloatingPoint roadmap StorageReview.com¶
| Specification | H100 | H200 | B100 | B200 | B300 |
|---|---|---|---|---|---|
| Max Memory | 80 GBs HBM3 | 141 GBs HBM3e | 192 GBs HBM3e | 192 GBs HBM3e | 288 GBs HBM3e |
| Memory Bandwidth | 3.35 TB/s | 4.8TB/s | 8TB/s | 8TB/s | 8TB/s |
| FP4 Tensor Core | – | – | 14 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP6 Tensor Core | – | – | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| FP8 Tensor Core | 3958 TFLOPS (~4 PFLOPS) | 3958 TFLOPS (~4 PFLOPS) | 7 PFLOPS | 9 PFLOPS | 15 PFLOPS* |
| INT 8 Tensor Core | 3958 TOPS | 3958 TOPS | 7 POPS | 9 POPS | 15 PFLOPS* |
| FP16/BF16 Tensor Core | 1979 TFLOPS (~2 PFLOPS) | 1979 TFLOPS (~2 PFLOPS) | 3.5 PFLOPS | 4.5 PFLOPS | 7.5 PFLOPS* |
| TF32 Tensor Core | 989 TFLOPS | 989 TFLOPS | 1.8 PFLOPS | 2.2 PFLOPS | 3.3 PFLOPS* |
| FP32 (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 Tensor Core (Dense) | 67 TFLOPS | 67 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| FP64 (Dense) | 34 TFLOPS | 34 TFLOPS | 30 TFLOPS | 40 TFLOPS | Information Unknown |
| Max Power Consumption | 700W | 700W | 700W | 1000W | Information Unknown |
Interesting times indeed¶
...the landscape of high-performance computation is increasingly complex...but there are important classes of problems that still need high-precision floating point. Some Options:
fp64 Emulation leveraging low-precision hardware
clever mixed precision algorithms
Operation Counting¶
Discretization Error: Why not use more terms in the Taylor series?
Floating Point Error: Why not use the highest precision possible?
Example 1: Matrix-Vector Multiplication¶
Let and .
Count the approximate number of operations it will take to compute .
Do the same for .
Matrix-vector product: Defining as the th row of and as the , th entry then
Take an explicit case, say , then the operation count is
This leads to 15 operations (6 additions and 9 multiplications).
Take another case, say , then the operation count is
This leads to 28 operations (12 additions and 16 multiplications).
Generalizing this there are multiplications and additions for a total of
Matrix-Matrix product (): Defining as the th column of then
The inner product of two vectors is represented by
leading to operations. Since there are entries in the resulting matrix then we would have operations.
There are methods for performing matrix-matrix multiplication faster. In the following figure we see a collection of algorithms over time that have been able to bound the number of operations in certain circumstances. Here

- Overton, M. L. (2001). Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial. 10.1137/1.9780898718072



