
Numerical Linear Algebra

Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Kyle T. Mandli

Note: This material largely follows the text “Numerical Linear Algebra” by Trefethen and Bau (SIAM, 1997) and is meant as a guide and supplement to the material presented there.

%matplotlib inline
%precision 3
import matplotlib.pyplot as plt
import numpy

Numerical Linear Algebra

Numerical methods for linear algebra problems lie at the heart of many numerical approaches and are something we will spend some time on. Roughly, the problems we would like to solve break down into three general categories: solving a system of equations

$$A \mathbf{x} = \mathbf{b},$$

Projection problems or Linear Least Squares

$$A^T A\mathbf{x} = A^T\mathbf{b},$$

and solving the eigenvalue problem

$$A \mathbf{v} = \lambda \mathbf{v}.$$

We examine each of these problems separately, evaluating some of their fundamental properties and the methods for solving them. We will be careful in deciding how to evaluate the results of our calculations and try to gain some understanding of when and how they fail.

Factorizations

For each fundamental problem there is a range of algorithms, but these algorithms can often be succinctly described in terms of a matrix “factorization”: the ability to write an arbitrary matrix $A$ as a product of matrices with special properties (e.g. triangular, diagonal, orthogonal) that make solving the general problem straightforward. For example:

| Problem | Algorithms | "Factorizations" |
| --- | --- | --- |
| $A \mathbf{x} = \mathbf{b}$ | Gaussian Elimination, Gauss-Jordan Elimination | $PA=LU$, $A=ER$ |
| $A^TA\mathbf{x} = A^T\mathbf{b}$ | Orthogonalization algorithms (Gram-Schmidt, Modified GS, Householder, Givens) | $A=QR$ |
| $A \mathbf{v} = \lambda \mathbf{v}$ | Various iterative methods (power, inverse power, $QR$ with shifts, $\ldots$) | $A=S\Lambda S^{-1}$, $A=Q\Lambda Q^T$, $A=MJM^{-1}$ |
| All of the above | Singular Value Decomposition (SVD) | $A = U\Sigma V^T$ |

To develop all of these algorithms in depth would be a course in itself; here we will highlight some of the key factorizations and algorithms used in modern computational linear algebra. In particular we will emphasize issues of accuracy, stability, and computational cost, particularly in the limit of large systems.

Some Example Problems

The number and power of the different tools made available by the study of linear algebra make it an invaluable field of study. Before we dive into numerical approximations we first consider some of the pivotal problems that numerical methods for linear algebra are used to address.

For this discussion we will be using the common notation $m \times n$ to denote the dimensions of a matrix $A$. The $m$ refers to the number of rows and $n$ the number of columns. If a matrix is square, i.e. $m = n$, then we will use the notation that $A$ is $m \times m$.

Systems of Equations

The first type of problem is to find the solution to a linear system of equations. If we have $m$ equations for $m$ unknowns, the system can be written in matrix/vector form,

$$A \mathbf{x} = \mathbf{b}.$$

For this example $A$ is an $m \times m$ matrix, denoted as being in $\mathbb{R}^{m\times m}$, and $\mathbf{x}$ and $\mathbf{b}$ are column vectors with $m$ entries, denoted as $\mathbb{R}^m$.

Example 1: Polynomial Fitting

In our previous work on interpolation we found that the unique interpolating polynomial of order $n$ through the $n+1$ points $(x_i, y_i)$ can be found by solving a linear system of equations

$$V[\phi_j(\mathbf{x})]\mathbf{w} = \mathbf{y}$$

where $V$ is a Vandermonde matrix whose columns are the basis functions $\phi_j(\mathbf{x})$ (e.g. the monomials) evaluated at the points, and $\mathbf{w}\in\mathbb{R}^{n+1}$ are the coefficients (weights) such that the interpolating polynomial is given uniquely by

$${\cal P}_n(x) = \sum_{j=0}^n w_j\phi_j(x)$$
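As a concrete sketch (the choice of four nodes and data sampled from $\sin(\pi x)$ here are illustrative assumptions), the monomial-basis Vandermonde system can be built and solved directly with NumPy:

# Interpolate data at n+1 = 4 points with a cubic polynomial
x_nodes = numpy.linspace(-1.0, 1.0, 4)
y_nodes = numpy.sin(numpy.pi * x_nodes)
# Vandermonde matrix with monomial basis phi_j(x) = x^j
V = numpy.vander(x_nodes, increasing=True)
# Solve V w = y for the interpolating weights
w = numpy.linalg.solve(V, y_nodes)
# The interpolant reproduces the data at the nodes
print(numpy.allclose(numpy.polyval(w[::-1], x_nodes), y_nodes))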

Example 2: Solution of Non-linear equations by Newton’s Method

Given a non-linear system of equations

$$\mathbf{F}(\mathbf{x})=\mathbf{0},$$

Newton’s method provides an iterative method for finding roots, where at every step of the iteration a linear system of equations

$$J(\mathbf{x}_k)\boldsymbol{\delta}_k = -\mathbf{F}(\mathbf{x}_k)$$

must be solved for the update $\boldsymbol{\delta}_k$, where $J$ is the Jacobian of $\mathbf{F}$ and the iterate is updated as $\mathbf{x}_{k+1} = \mathbf{x}_k + \boldsymbol{\delta}_k$.
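As a minimal sketch (the particular $2\times 2$ system, its Jacobian, and the starting guess are illustrative assumptions), each Newton iteration is one linear solve:

# Roots of F(x) = [x_0^2 + x_1^2 - 1, x_0 - x_1] (a circle intersected with a line)
F = lambda x: numpy.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: numpy.array([[2.0 * x[0], 2.0 * x[1]],
                           [1.0, -1.0]])
x_k = numpy.array([1.0, 0.5])
for k in range(10):
    # solve the linear system J(x_k) delta_k = -F(x_k) at each step
    delta = numpy.linalg.solve(J(x_k), -F(x_k))
    x_k = x_k + delta
    if numpy.linalg.norm(delta) < 1.0e-12:
        break
print(x_k)  # converges to (1/sqrt(2), 1/sqrt(2))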

Example 3: Solution of Boundary Value problems by Finite Differences or Finite Elements

Given a general linear differential equation

$${\cal L}u = f$$

where ${\cal L}$ is a linear differential operator, e.g.

$${\cal L} = \frac{d^2}{dx^2} + \alpha\frac{d}{dx} + \beta.$$

Finite difference or finite element discretizations usually reduce the continuous, infinite-dimensional problem to a discrete linear problem of the form

$$A\mathbf{u} = \mathbf{f}$$

where $\mathbf{u}$ and $\mathbf{f}$ are discrete approximations to the solution function and right-hand side, and $A$ is a discrete approximation to ${\cal L}$. Moreover, for many discretizations, $A$ can be very sparse.
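For instance (a sketch assuming second-order centered differences for ${\cal L} = d^2/dx^2$ on the unit interval with homogeneous Dirichlet boundary conditions), the discrete operator is tridiagonal:

# Centered finite differences for u'' = f on (0, 1) with u(0) = u(1) = 0
m = 5                                   # number of interior grid points
dx = 1.0 / (m + 1)
# Tridiagonal approximation to d^2/dx^2 (dense here; sparse storage in practice)
A = (numpy.diag(-2.0 * numpy.ones(m)) +
     numpy.diag(numpy.ones(m - 1), 1) +
     numpy.diag(numpy.ones(m - 1), -1)) / dx**2
print(A)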

Linear least squares

In a similar case as above, say we want to fit a particular function (could be a polynomial) to a given number of data points, except in this case we have more data points than free parameters. In the case of polynomials this could be the same as saying we have $m$ data points but only want to fit an $(n-1)$-th order polynomial (with $n$ coefficients) through the data, where $n \leq m$. One of the common approaches to this problem is to minimize the “least-squares” error between the data and the resulting function:

$$E = \left( \sum^m_{i=1} |y_i - f(x_i)|^2 \right)^{1/2}.$$

E.g. Consider fitting the function

$$f(x) = w_1 + w_2 x + w_3 e^x$$

to data that has random noise added to it.

If our function $f(x)$ can be written as a linear combination of basis functions

$$f(x) = \sum_{j=1}^n w_j\phi_j(x)$$

then the linear system through $m$ points becomes $A\mathbf{w} = \mathbf{y}$ where $A$ is a generalized $m\times n$ Vandermonde matrix

$$A = \begin{bmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_n(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(x_m) & \phi_2(x_m) & \cdots & \phi_n(x_m) \end{bmatrix},$$

which will be over-determined for $m > n$. However, the least-squares solution for the weights $\mathbf{w}$ will satisfy the normal equations

$$A^T A \mathbf{w} = A^T \mathbf{y},$$

which guarantee that the error is minimized in the least-squares sense. (Although we will also show that this is not the most numerically stable way to solve this problem.)

Define the data

N = 20
x = numpy.linspace(-1.0, 1.0, N)
y = x + numpy.random.random((N))

Define basis functions and Vandermonde matrix

# define the basis functions
phi_1 = lambda x: numpy.ones(x.shape)
phi_2 = lambda x: x
phi_3 = lambda x: numpy.exp(x)
# Define the generalized Vandermonde matrix based on our x-values
A = numpy.array([phi_1(x), phi_2(x), phi_3(x)]).T
A.shape
# Determine the weights of our linear function that
# result in the smallest sum of the squares of the residual.
w = numpy.linalg.solve(numpy.dot(A.T, A), numpy.dot(A.T, y))
error = y - A.dot(w)
print("w = {}, ||e|| = {}".format(w, numpy.linalg.norm(error)))
# Do the same using numpy's lstsq routine (which solves a more numerically stable problem)
w = numpy.linalg.lstsq(A, y, rcond=None)[0]
error = y - A.dot(w)
print("w = {}, ||e|| = {}".format(w, numpy.linalg.norm(error)))
# just repeat the calculations to draw the figure
# define data
N = 20
x = numpy.linspace(-1.0, 1.0, N)
y = x + numpy.random.random((N))

# define the basis functions
phi_1 = lambda x: numpy.ones(x.shape)
phi_2 = lambda x: x
phi_3 = lambda x: numpy.exp(x)
# Define the generalized Vandermonde matrix based on our x-values
A = numpy.array([phi_1(x), phi_2(x), phi_3(x)]).T
w = numpy.linalg.lstsq(A, y, rcond=None)[0]


# Plot it out, cuz pictures are fun!
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
f = A.dot(w)
axes.plot(x, y, "ko")
axes.plot(x, f, "r")
axes.set_title("Least Squares Fit to Data")
axes.set_xlabel("$x$")
axes.set_ylabel("$f(x)$ and $y_i$")
axes.grid()

plt.show()

Eigenproblems

Eigenproblems come up in a variety of contexts and are often integral to many problems of scientific and engineering interest. It is such a powerful idea that it is not uncommon for us to take a problem and convert it into an eigenproblem.

One of my favorite examples is the Tacoma Narrows Bridge Collapse.

Another is the original Google Page-Rank algorithm, which essentially finds the dominant eigenvector of an enormous sparse Markov matrix.

Here we introduce the idea and give some examples.

As a review, if $A \in \mathbb{C}^{m\times m}$ (a square matrix with complex values), a non-zero vector $\mathbf{v}\in\mathbb{C}^m$ is an eigenvector of $A$ with a corresponding eigenvalue $\lambda \in \mathbb{C}$ if

$$A \mathbf{v} = \lambda \mathbf{v}.$$

One way to interpret the eigenproblem is that we are attempting to ascertain the “action” of the matrix $A$ on some subspace of $\mathbb{C}^m$ where this action acts like scalar multiplication. This subspace is called an eigenspace.

General idea of EigenProblems

Rewriting the standard eigenproblem $A\mathbf{v}=\lambda\mathbf{v}$ for $A \in \mathbb{C}^{m\times m}$, $\mathbf{v}\in\mathbb{C}^m$ as

$$(A - \lambda I)\mathbf{v} = \mathbf{0},$$

it becomes clear that for $\mathbf{v}$ to be non-trivial (i.e. $\neq \mathbf{0}$) requires that the matrix $(A-\lambda I)$ be singular.

This is equivalent to finding all values of $\lambda$ such that $|A-\lambda I| = 0$ (the determinant of a singular matrix is always zero). However, it can also be shown that

$$|A-\lambda I| = P_m(\lambda),$$

which is an $m$th order polynomial in $\lambda$. Thus $P_m(\lambda)=0$ implies the eigenvalues are the $m$ roots of $P_m$, and the eigenspace corresponding to $\lambda_i$ is just $N(A-\lambda_i I)$.

Solving EigenProblems

The temptation (and what we usually teach in introductory linear algebra classes) is to find the roots of $P_m(\lambda)$ to get the eigenvalues, then find the null-space of $A-\lambda I$.

However, that would be wrong. The best algorithms for finding eigenvalues are completely unrelated to root finding, as we shall see (in fact, the way you find the roots of polynomials is to find the eigenvalues of a “companion matrix”).

The Singular Value Decomposition

One of the most beautiful, and useful, factorizations is the Singular Value Decomposition (or SVD),

$$A = U\Sigma V^T$$

where $A$ is a general $m\times n$ matrix, $U$ and $V$ are unitary (orthogonal) matrices such that $U^TU=I^{m\times m}$, $V^TV = I^{n\times n}$, and $\Sigma$ is a diagonal matrix with real, positive diagonal entries (the singular values)

$$\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_{r} > 0$$

where $r$ is the rank of $A$.
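A quick numerical check (using a random matrix as an illustrative stand-in): numpy.linalg.svd returns the factors, and their product reassembles $A$.

A = numpy.random.uniform(size=(5, 3))
U, sigma, Vt = numpy.linalg.svd(A, full_matrices=False)  # "thin" SVD
print(sigma)                                             # singular values, descending
print(numpy.allclose(A, U.dot(numpy.diag(sigma)).dot(Vt)))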

Applications of the SVD

The SVD combines all the aspects of basic linear algebra and is used in a large number of applications including

  • diagnosing ill-conditioned matrices

  • providing orthonormal bases for the four fundamental subspaces of $A$

  • Solving linear systems and linear least-squares problems

  • Solving ill-conditioned and singular linear systems (Pseudo-inverse)

  • Dimensional reduction in Data analysis (PCA, EOF, POD analysis)

  • and more...

Because of its ubiquity in computational linear algebra and data science, we will spend a bit of time understanding the SVD and its applications (but not the specific algorithms). But first

Fundamentals

Objectives

  • Understand basic linear-algebraic operations and their computational costs

  • Understand Numpy implementation (and performance) for Linear Algebra

  • Understand Singular vs Invertible matrices

  • Understand Orthonormal vectors and $Q$ matrices

  • Understand vector and Matrix norms

  • Understand the Condition Number

Matrix-Vector Multiplication

One of the most basic operations we can perform with matrices is to multiply them by a vector, which maps a vector to a vector. There are multiple ways to interpret and compute the matrix-vector product $A \mathbf{x}$.

index notation

$$b_i = \sum^n_{j=1} a_{ij} x_j \quad \text{where} \quad i = 1, \ldots, m$$

row picture (dot products)

We can also consider matrix-vector multiplication as a sequence of inner products (dot-products) between the rows of $A$ and the vector $\mathbf{x}$:

$$\begin{align} \mathbf{b} &= A \mathbf{x} \\ &= \begin{bmatrix} \mathbf{a}_1^T \mathbf{x} \\ \mathbf{a}_2^T \mathbf{x} \\ \vdots \\ \mathbf{a}_m^T \mathbf{x} \end{bmatrix} \end{align}$$

where $\mathbf{a}_i^T$ is the $i$th row of $A$.

column picture

An alternative (and entirely equivalent) way to write the matrix-vector product is as a linear combination of the columns of $A$, where each column’s weighting is $x_j$:

$$\begin{align} \mathbf{b} &= A \mathbf{x} \\ &= \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \\ &= x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n. \end{align}$$

This view will be useful later when we are trying to interpret various types of matrices.

Operation Counts

No matter how you compute $A\mathbf{x}$, the total number of operations is the same (for a dense matrix $A$). The row view, however, is convenient for calculating the operation count required for $A\mathbf{x}$.

If $A\in\mathbb{C}^{m\times n}$ and $\mathbf{x}\in\mathbb{C}^n$, then just counting the number of multiplications involved to compute $A\mathbf{x}$ gives $O(??)$.

One important property of the matrix-vector product is that it is a linear operation (also known as a linear operator). This means that for any $\mathbf{x}, \mathbf{y} \in \mathbb{C}^n$ and any $c \in \mathbb{C}$ we know that

  1. $A (\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y}$

  2. $A (c\mathbf{x}) = c A \mathbf{x}$

Example: Numerical matrix-vector multiply

Write a matrix-vector multiply function and check it with the appropriate numpy routine. Also verify the linearity of the matrix-vector multiply.

# Mat-vec in index notation (the long way...don't do this)
def matrix_vector_product_index(A, x):
    m, n = A.shape
    b = numpy.zeros(m)
    for i in range(m):
        for j in range(n):
            b[i] += A[i, j] * x[j]
    return b


# Mat-vec by row picture (still don't do this)
def matrix_vector_product_row(A, x):
    m, n = A.shape
    b = numpy.zeros(m)
    # loop on rows
    for i in range(m):
        b[i] = A[i, :].dot(x)
    return b


# Mat-vec by column picture (still don't do this)
def matrix_vector_product_col(A, x):
    m, n = A.shape
    b = numpy.zeros(m)
    # loop over columns
    for j in range(n):
        b += A[:, j] * x[j]
    return b

Check equivalence

First set up some large random matrices and vectors and compare to numpy's built-in dot function.

m = 1000
n = 1000
A = numpy.random.uniform(size=(m, n))
x = numpy.random.uniform(size=(n))
y = numpy.random.uniform(size=(n))
c = numpy.random.uniform()
funcs = [
    matrix_vector_product_index,
    matrix_vector_product_row,
    matrix_vector_product_col,
]
for f in funcs:
    b = f(A, x)
    print("{}(A,x) = A.dot(x)? {}".format(f.__name__, numpy.allclose(b, A.dot(x))))

Check Linearity

for f in funcs:
    print(
        "{}(A,x+y) = Ax + Ay is {}".format(
            f.__name__, numpy.allclose(f(A, (x + y)), f(A, x) + f(A, y))
        )
    )
    print(
        "{}(A,cx) = cAx  is {}\n".format(
            f.__name__, numpy.allclose(f(A, c * x), c * f(A, x))
        )
    )

Check Timing/performance

for f in funcs:
    print(f.__name__, end="\t")
    %timeit f(A,x)

print("numpy.dot(A,x)", end="\t\t\t")
%timeit A.dot(x)

Matrix-Matrix Multiplication

The matrix product with another matrix, $C = AB$, is defined as

$$c_{ij} = \sum^n_{k=1} a_{ik} b_{kj} = \mathbf{a}_i^T\mathbf{b}_j,$$

i.e. each component of $C$ is a dot-product between the $i$th row of $A$ and the $j$th column of $B$.

As with matrix-vector multiplication, matrix-matrix multiplication can be thought of in multiple ways; each view is verified numerically after the list below.

  • $m\times p$ dot products (each with $n$ flops)

  • $A$ multiplying the columns of $B$

    $$C = AB = \begin{bmatrix} A\mathbf{b}_1 & A\mathbf{b}_2 & \ldots & A\mathbf{b}_p \end{bmatrix}$$
  • Linear combinations of the rows of $B$

    $$C = AB = \begin{bmatrix} \mathbf{a}_1^T B \\ \mathbf{a}_2^T B \\ \vdots \\ \mathbf{a}_m^T B \end{bmatrix}$$
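A quick check of the column and row pictures (small random matrices are used as illustrative stand-ins):

A = numpy.random.uniform(size=(4, 3))
B = numpy.random.uniform(size=(3, 5))
C = A.dot(B)
# column picture: the j-th column of C is A times the j-th column of B
print(numpy.allclose(C[:, 0], A.dot(B[:, 0])))
# row picture: the i-th row of C is the i-th row of A times B
print(numpy.allclose(C[0, :], A[0, :].dot(B)))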

Questions

  • What are the dimensions of $A$ and $B$ so that the multiplication works?

  • What are the operation counts for Matrix-Matrix Multiplication?

  • Comment on the product $\mathbf{c}=(AB)\mathbf{x}$ vs. $\mathbf{d} = A(B\mathbf{x})$

Example: Outer Product

The outer product of two vectors $\mathbf{u} \in \mathbb{C}^m$ and $\mathbf{v} \in \mathbb{C}^n$ is an $m \times n$ matrix whose columns are the vector $\mathbf{u}$ multiplied by the corresponding value of $\mathbf{v}$:

$$\begin{align} \mathbf{u} \mathbf{v}^T &= \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \\ &= \begin{bmatrix} v_1 u_1 & \cdots & v_n u_1 \\ \vdots & & \vdots \\ v_1 u_m & \cdots & v_n u_m \end{bmatrix}. \end{align}$$

It is useful to think of these as operations on the column vectors, and an equivalent way to express this relationship is

$$\begin{align} \mathbf{u} \mathbf{v}^T &= \begin{bmatrix} \mathbf{u} \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{u} v_1 & \mathbf{u} v_2 & \cdots & \mathbf{u} v_n \end{bmatrix} \\ &= \begin{bmatrix} v_1 u_1 & \cdots & v_n u_1 \\ \vdots & & \vdots \\ v_1 u_m & \cdots & v_n u_m \end{bmatrix}. \end{align}$$

i.e. each column is just a scalar multiple of $\mathbf{u}$. Alternatively, you can think of each row as just a scalar multiple of $\mathbf{v}$.

rank 1 updates

We call any matrix of the form $\mathbf{u}\mathbf{v}^T$ a “rank one matrix” (because its rank $r = ?$). These sorts of matrix operations are very common in numerical algorithms for orthogonalization, eigenvalues, and the original page-rank algorithm for Google. Again, the order of operations is critical.

Comment on the difference in values and operation counts between

$$\mathbf{y} = (\mathbf{u}\mathbf{v}^T)\mathbf{x}$$

and

$$\tilde{\mathbf{y}} = \mathbf{u}(\mathbf{v}^T\mathbf{x})$$

for $\mathbf{u}$, $\mathbf{v}$, $\mathbf{x}$, $\mathbf{y}$, $\tilde{\mathbf{y}}\in\mathbb{R}^n$.

Check

Set up some random vectors

m = 10000
n = 10000
u = numpy.random.rand(m)
v = numpy.random.rand(n)
x = numpy.random.rand(n)

Check Equivalence of numbers

Show that

$$(\mathbf{u}\mathbf{v}^T)\mathbf{x} = \mathbf{u}(\mathbf{v}^T\mathbf{x})$$
b = numpy.outer(u, v).dot(x)
c = u.dot(v.dot(x))
d = u * v.dot(x)
print("Max Difference = {}".format(numpy.max(numpy.abs(b - c))))
print("Max Difference = {}".format(numpy.max(numpy.abs(b - d))))
print("Max Difference = {}".format(numpy.max(numpy.abs(c - d))))

Check Timing/performance

# timing
print("(uv^T).dot(x)", end="\t\t")
%timeit numpy.outer(u,v).dot(x)
print("\nu*v.dot(x)", end="\t\t")
%timeit u*v.dot(x)

Example: Upper Triangular Multiplication

Consider the multiplication of a matrix $A \in \mathbb{C}^{m\times n}$ and the upper-triangular matrix $R$, defined as the $n \times n$ matrix with entries $r_{ij} = 1$ for $i \leq j$ and $r_{ij} = 0$ for $i > j$. The product can be written as

$$\begin{bmatrix} \mathbf{b}_1 & \cdots & \mathbf{b}_n \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix} \begin{bmatrix} 1 & \cdots & 1 \\ & \ddots & \vdots \\ & & 1 \end{bmatrix}.$$

The columns of $B$ are then

$$\mathbf{b}_j = A \mathbf{r}_j = \sum^j_{k=1} \mathbf{a}_k$$

so that $\mathbf{b}_j$ is the sum of the first $j$ columns of $A$.
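This is easy to verify numerically, since the product with this particular $R$ is just a cumulative sum across the columns of $A$ (a random $A$ is used here as an illustrative assumption):

m, n = 4, 5
A = numpy.random.uniform(size=(m, n))
R = numpy.triu(numpy.ones((n, n)))   # upper-triangular matrix of ones
B = A.dot(R)
# b_j is the sum of the first j columns of A
print(numpy.allclose(B, numpy.cumsum(A, axis=1)))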

Example: Write Matrix-Matrix Multiplication

Write a function that computes matrix-matrix multiplication and demonstrate the following properties:

  1. $A (B + C) = AB + AC$ (for square matrices)

  2. $A (cB) = c AB$ where $c \in \mathbb{C}$

  3. $AB \neq BA$ in general

def matrix_matrix_product(A, B):
    C = numpy.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C


m = 4
n = 4
p = 4
A = numpy.random.uniform(size=(m, n))
B = numpy.random.uniform(size=(n, p))
C = numpy.random.uniform(size=(m, p))
c = numpy.random.uniform()
print(numpy.allclose(matrix_matrix_product(A, B), numpy.dot(A, B)))
print(
    numpy.allclose(
        matrix_matrix_product(A, (B + C)),
        matrix_matrix_product(A, B) + matrix_matrix_product(A, C),
    )
)
print(numpy.allclose(matrix_matrix_product(A, c * B), c * matrix_matrix_product(A, B)))
print(numpy.allclose(matrix_matrix_product(A, B), matrix_matrix_product(B, A)))

NumPy Products

NumPy and SciPy contain routines that are optimized to perform matrix-vector and matrix-matrix multiplication. Given two ndarrays you can take their product by using the dot function.

n = 10
# Matrix vector with identity
I = numpy.identity(n)
x = numpy.random.random(n)
print(x)
print("x = Ix is {}\n".format(numpy.allclose(x, numpy.dot(I, x))))
print("x - I.dot(x) = {}\n".format(x - I.dot(x)))
print("I*x = \n{}\n".format(I * x))
print(I.dot(x))
# Matrix vector product
m = 5
A = numpy.random.random((m, n))
print(numpy.dot(A, x))
# Matrix matrix product
B = numpy.random.random((n, m))
print(numpy.dot(A, B))
print()
print(A.dot(B))

Check non-commutative property of matrix-matrix multiplication

It’s easy to demonstrate the non-commutative nature of matrix-matrix multiplication with almost any two matrices $A$ and $B$. For a clear demonstration, consider the two matrices $D$, a diagonal matrix, and $A$, a matrix of all 1s.

N = 5
D = numpy.diag(numpy.array(range(1, N + 1)))
A = numpy.ones((N, N))
print(D, "\n")
print(A)
print(D.dot(A))
print(A.dot(D))

Range and Null-Space

Range

  • The range of a matrix $A \in \mathbb{R}^{m \times n}$ (similar to any function), denoted as $\text{range}(A)$, is the set of vectors that can be expressed as $A\mathbf{x}$ for $\mathbf{x} \in \mathbb{R}^n$.

  • We can also then say that $\text{range}(A)$ is the space spanned by the columns of $A$. In other words, the linearly independent columns of $A$ provide a basis for $\text{range}(A)$, also called the column space of the matrix $A$.

  • $C(A)$ controls the existence of solutions to $A\mathbf{x}=\mathbf{b}$

Null-Space

  • Similarly, the null-space of a matrix $A$, denoted $\text{null}(A)$, is the set of vectors $\mathbf{x}$ that satisfy $A \mathbf{x} = \mathbf{0}$.

  • $N(A)$ controls the uniqueness of solutions to $A\mathbf{x}=\mathbf{b}$

  • A related concept is the rank of the matrix $A$, denoted as $\text{rank}(A)$, which is the dimension of the column space. A matrix $A$ is said to have full rank if $\text{rank}(A) = \min(m, n)$. This property also implies that the matrix mapping is one-to-one.

Inverse

A non-singular or invertible matrix is characterized as a square matrix with full rank. This is related to why we know that the matrix mapping is one-to-one: we can use it to transform a vector $\mathbf{x}$ and, using the inverse, denoted $A^{-1}$, map it back to the original vector. The familiar manipulation is

$$\begin{align*} A \mathbf{x} &= \mathbf{b}, \\ A^{-1} A \mathbf{x} &= A^{-1} \mathbf{b}, \\ \mathbf{x} &= A^{-1} \mathbf{b}. \end{align*}$$

Since $A$ has full rank, its columns form a basis for $\mathbb{R}^m$ and the vector $\mathbf{b}$ must be in the column space of $A$.

There are a number of important properties of an invertible matrix $A$. Here we list them as the following equivalent statements:

  1. $A$ has a unique inverse $A^{-1}$ such that $A^{-1}A = AA^{-1} = I$

  2. $\text{rank}(A) = m$

  3. $\text{range}(A) = \mathbb{C}^m$

  4. $\text{null}(A) = \{\mathbf{0}\}$

  5. 0 is not an eigenvalue of $A$

  6. $\text{det}(A) \neq 0$

Example: Properties of invertible matrices

Show that, given an invertible matrix, the rest of the properties hold. Make sure to search the numpy packages for relevant functions.

m = 5
# generate a random m x m invertible matrix
for n in range(100):
    A = numpy.random.uniform(size=(m, m))
    if numpy.linalg.det(A) != 0:
        break

print("A^{{-1}}*A = \n\n{}\n".format(numpy.dot(numpy.linalg.inv(A), A)))
print("rank(A) = {}\n".format(numpy.linalg.matrix_rank(A)))
print("N(A)= {}\n".format(numpy.linalg.solve(A, numpy.zeros(m))))
print("Eigenvalues = {}".format(numpy.linalg.eigvals(A)))

Orthogonal Vectors and Matrices

Orthogonality is a very important concept in linear algebra that forms the basis of many of the modern methods used in numerical computations.

Two vectors are said to be orthogonal if their inner-product or dot-product, defined as

$$\langle \mathbf{x}, \mathbf{y} \rangle \equiv (\mathbf{x}, \mathbf{y}) \equiv \mathbf{x}^T\mathbf{y} \equiv \mathbf{x} \cdot \mathbf{y} = \sum^m_{i=1} x_i y_i = 0$$

Here we have shown the various notations you may run into (the inner-product is in fact a general term for a similar operation on mathematical objects such as functions).

If $\langle \mathbf{x},\mathbf{y} \rangle = 0$ then we say $\mathbf{x}$ and $\mathbf{y}$ are orthogonal. The reason we use this terminology is that the inner-product of two vectors can also be written in terms of the angle between them, where

$$\cos \theta = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{||\mathbf{x}||_2~||\mathbf{y}||_2}$$

and $||\mathbf{x}||_2$ is the Euclidean ($\ell^2$) norm of the vector $\mathbf{x}$, which we can interpret as the length of a vector.

We can write the 2-norm or length of a vector in terms of the inner-product as well:

$$||\mathbf{x}||_2^2 = \langle \mathbf{x}, \mathbf{x} \rangle = \mathbf{x}^T\mathbf{x} = \sum^m_{i=1} |x_i|^2, \qquad ||\mathbf{x}||_2 = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$

The generalization of the inner-product to complex vector spaces is defined as

$$\langle \mathbf{x}, \mathbf{y} \rangle = \sum^m_{i=1} x_i^* y_i$$

where $x_i^*$ is the complex conjugate of the value $x_i$.
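In NumPy, numpy.vdot conjugates its first argument, matching this definition (the sample vectors are illustrative):

x = numpy.array([1.0 + 2.0j, 3.0 - 1.0j])
y = numpy.array([2.0 - 1.0j, 1.0 + 1.0j])
print(numpy.vdot(x, y))               # sum of conj(x_i) * y_i
print(numpy.sum(numpy.conj(x) * y))   # the same sum written out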

Orthonormality

Taking this idea one step further, we can say a set of vectors $\mathbf{x} \in X$ are orthogonal to $\mathbf{y} \in Y$ if

$$\langle \mathbf{x}, \mathbf{y} \rangle = 0 \quad \forall\, \mathbf{x}, \mathbf{y},$$

If in addition

$$||\mathbf{x}|| = 1, \quad ||\mathbf{y}|| = 1 \quad \forall\, \mathbf{x}, \mathbf{y}$$

then they are also called orthonormal.

Note that we dropped the 2 subscript from the norm notation. Later we will explore ways of defining a vector norm other than the Euclidean norm defined above.

Another concept related to orthogonality is linear independence. A set of vectors $X$ is linearly independent if no $\mathbf{x} \in X$ can be written as a linear combination of the other vectors in the set $X$.

An equivalent statement is that, given a set of $n$ vectors $\mathbf{x}_i$, the only set of scalars $c_i$ that satisfies

$$\sum_{i=1}^n c_i\mathbf{x}_i = \mathbf{0}$$

is $c_i = 0$ for all $i\in[1,n]$.

Show that if a set of vectors are orthonormal, they must be linearly independent.

This can be related directly through the idea of projection. If we have a set of vectors $X$ we can project another vector $\mathbf{v}$ onto the vectors in $X$ by using the inner-product. This is especially powerful if we have a set of orthonormal vectors $X$, which are said to span a space (or provide a basis for the space), s.t. any vector in the space spanned by $X$ can be expressed as a linear combination of the basis vectors:

$$\mathbf{v} = \sum^n_{i=1} \langle \mathbf{v}, \mathbf{x}_i \rangle \, \mathbf{x}_i.$$

Note if $\mathbf{v} \in X$ that

$$\langle \mathbf{v}, \mathbf{x}_i \rangle = 0 \quad \forall\, \mathbf{x}_i \in X \setminus \mathbf{v}.$$

Looping back to matrices, the column space of a matrix is spanned by its linearly independent columns. Any vector $\mathbf{v}$ in the column space can therefore be expressed via the equation above.
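As a sketch of this expansion (using the orthonormal columns produced by numpy.linalg.qr of a random matrix as the basis, an illustrative assumption):

m = 5
Q, _ = numpy.linalg.qr(numpy.random.uniform(size=(m, m)))  # orthonormal basis of R^m
v = numpy.random.uniform(size=m)
coefficients = Q.T.dot(v)          # <v, q_i> for each basis vector
v_rebuilt = Q.dot(coefficients)    # sum_i <v, q_i> q_i
print(numpy.allclose(v, v_rebuilt))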

Unitary Matrices

A special class of matrices, called unitary matrices when complex-valued and orthogonal matrices when purely real-valued, are those whose columns are orthonormal to each other. Importantly this implies that for a unitary matrix $Q$ we know the following:

  1. $Q^*Q = I$

  2. therefore if $Q$ is square, $Q^* = Q^{-1}$

where $Q^*$ is called the adjoint (or Hermitian transpose) of $Q$. The adjoint is defined as the transpose of the original matrix with each entry replaced by its complex conjugate, as the notation implies. Note, if $Q\in\mathbb{R}^{n\times n}$ then $Q^*=Q^T$.

As an example if we have the matrix

$$\begin{aligned} Q &= \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \\ q_{31} & q_{32} \end{bmatrix} \quad \text{then} \\ Q^* &= \begin{bmatrix} q^*_{11} & q^*_{21} & q^*_{31} \\ q^*_{12} & q^*_{22} & q^*_{32} \end{bmatrix} \end{aligned}$$

The important part of being a unitary matrix is that projection onto the column space of the matrix $Q$ preserves geometry in the Euclidean sense, i.e. it preserves the Cartesian distance.
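For example (using a 2D rotation as an illustrative orthogonal matrix):

theta = 0.3
Q = numpy.array([[numpy.cos(theta), -numpy.sin(theta)],
                 [numpy.sin(theta),  numpy.cos(theta)]])   # rotation matrix
x = numpy.random.uniform(size=2)
print(numpy.allclose(numpy.linalg.norm(Q.dot(x)), numpy.linalg.norm(x)))  # length preserved
print(numpy.allclose(Q.T.dot(Q), numpy.eye(2)))                           # Q^T Q = I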

Vector and Matrix Norms and the condition number

The following sections lay out the definitions and computations required to calculate a range of vector and matrix norms, which provide a measure of the “size” of an object or of distance in some vector space.

In the context of numerical linear algebra, norms are essential for defining a key property of a matrix: the “condition number”.

In infinite precision, a square matrix is either invertible or singular; in finite precision, however, even an invertible matrix can behave poorly if it is “ill-conditioned”, i.e. almost singular.

We need a quantitative measure of how “near-singular” a matrix is.

Why not the determinant $|A|$?

We know that if a matrix $A$ is singular, $|A|=0$. But what if $|A|$ is small?

It turns out that even a perfectly invertible, well-conditioned matrix can have an arbitrarily small determinant.

Consider the two diagonal matrices

$$I = \begin{bmatrix} 1 & & \\ & 1 & \\ & & 1 \end{bmatrix} \quad\text{and}\quad \epsilon I = \begin{bmatrix} \epsilon & & \\ & \epsilon & \\ & & \epsilon \end{bmatrix}$$

By definition, $|I|=1$, but $|\epsilon I| = ??$

More generally, if $I\in\mathbb{R}^{n\times n}$, $|\epsilon I| = ??$

Yet all these matrices are diagonal and easily inverted (i.e. $(\epsilon I)^{-1} = (1/\epsilon) I$). Thus we need something better: the condition number. But to get there requires developing important ideas about vector and matrix norms.
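A quick numerical confirmation (with $n = 10$ and $\epsilon = 10^{-3}$ as illustrative values) that the determinant of $\epsilon I$ is tiny while its condition number is exactly 1:

n = 10
epsilon = 1.0e-3
A = epsilon * numpy.identity(n)
print("det(A)  = {}".format(numpy.linalg.det(A)))    # epsilon^n = 1e-30, arbitrarily small
print("cond(A) = {}".format(numpy.linalg.cond(A)))   # 1.0, perfectly conditioned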

Vector Norms

Norms (and also measures) provide a means of measuring the “size” of or distance in a space. In general a norm is a function, denoted by $||\cdot||$, that maps $\mathbb{C}^m \rightarrow \mathbb{R}$. In other words, we stick in a multi-valued object and get a single, real-valued number out the other end.

All norms satisfy the properties:

  1. $||\mathbf{x}|| \geq 0$

  2. $||\mathbf{x}|| = 0$ only if $\mathbf{x} = \mathbf{0}$

  3. $||\mathbf{x} + \mathbf{y}|| \leq ||\mathbf{x}|| + ||\mathbf{y}||$ (triangle inequality)

  4. $||c \mathbf{x}|| = |c|\,||\mathbf{x}||$ where $c \in \mathbb{C}$

There are a number of relevant norms that we can define beyond the Euclidean norm, also known as the 2-norm or $\ell_2$ norm:

  • $\ell_1$ norm:

    $$||\mathbf{x}||_1 = \sum^m_{i=1} |x_i|,$$
  • $\ell_2$ norm:

    $$||\mathbf{x}||_2 = \left( \sum^m_{i=1} |x_i|^2 \right)^{1/2},$$
  • $\ell_p$ norm:

    $$||\mathbf{x}||_p = \left( \sum^m_{i=1} |x_i|^p \right)^{1/p}, \quad 1 \leq p < \infty,$$
  • $\ell_\infty$ norm:

    $$||\mathbf{x}||_\infty = \max_{1\leq i \leq m} |x_i|,$$
  • weighted $\ell_p$ norm:

    $$||\mathbf{x}||_{W_p} = \left( \sum^m_{i=1} |w_i x_i|^p \right)^{1/p}, \quad 1 \leq p < \infty.$$

These are also related to other norms denoted by capital letters ($L_2$ for instance). In this case we use the lower-case notation to denote finite or discrete versions of the infinite-dimensional counterparts.

Example: Comparisons Between Norms

Compute the norms given some vector x\mathbf{x} and compare their values. Verify the properties of the norm for one of the norms.

def pnorm(x, p):
    """return the vector p norm of a vector
    parameters:
    -----------
    x: numpy array
        vector
    p: float or numpy.inf
        value of p norm such that ||x||_p = (sum(|x_i|^p))^{1/p} for p< inf
        for infinity norm return max(abs(x))
    returns:
    --------
    pnorm: float
        pnorm of x
    """
    if p == numpy.inf:
        norm = numpy.max(numpy.abs(x))
    else:
        norm = numpy.sum(numpy.abs(x) ** p) ** (1.0 / p)
    return norm
m = 10
p = 4
x = numpy.random.uniform(size=m)
ell_1 = pnorm(x, 1)
ell_2 = pnorm(x, 2)
ell_p = pnorm(x, p)
ell_infty = pnorm(x, numpy.inf)

print("x = {}".format(x))
print()
print(
    "L_1   = {}\nL_2   = {}\nL_{}   = {}\nL_inf = {}".format(
        ell_1, ell_2, p, ell_p, ell_infty
    )
)
y = numpy.random.uniform(size=m)
print()
print("Properties of norms:")

print("x = {}".format(x))
print("y = {}\n".format(y))
p = 2
print(
    "||x+y||_{p}         = {nxy}\n||x||_{p} + ||y||_{p} = {nxny}".format(
        p=p, nxy=pnorm(x + y, p), nxny=pnorm(x, p) + pnorm(y, p)
    )
)
c = -0.1
print("||c x||_{} = {}".format(p, pnorm(c * x, p)))
print("|c||x||_{} = {}".format(p, abs(c) * pnorm(x, p)))

Geometric Interpretation

For every $p$-norm we can define a set of ‘unit vectors’: the set of all vectors in $\mathbb{R}^n$ with $||\mathbf{x}||_p = 1$.

For example, in $\mathbb{R}^2$ the unit spheres in the 1-, 2-, and $\infty$-norms look like:

import matplotlib.patches as patches

# Note: that this code is a bit fragile to angles that go beyond pi
# due to the use of arccos.

head_width = 0.1
head_length = 1.5 * head_width


def draw_unit_vectors(axes, A, head_width=0.1):
    head_length = 1.5 * head_width
    image_e = numpy.empty(A.shape)
    angle = numpy.empty(A.shape[0])
    image_e[:, 0] = numpy.dot(A, numpy.array((1.0, 0.0)))
    image_e[:, 1] = numpy.dot(A, numpy.array((0.0, 1.0)))
    for i in range(A.shape[0]):
        angle[i] = numpy.arccos(image_e[0, i] / numpy.linalg.norm(image_e[:, i], ord=2))
        axes.arrow(
            0.0,
            0.0,
            image_e[0, i] - head_length * numpy.cos(angle[i]),
            image_e[1, i] - head_length * numpy.sin(angle[i]),
            head_width=head_width,
            color="b",
            alpha=0.5,
        )


# comparison of norms
# ============
# 1-norm
# Unit-ball
fig = plt.figure(figsize=(8, 6))
# fig.suptitle("1-Norm: $||A||_1 = {}$".format(numpy.linalg.norm(A,ord=1)), fontsize=16)

theta = numpy.linspace(0.0, 2.0 * numpy.pi, 100)
axes = fig.add_subplot(1, 1, 1, aspect="equal")
axes.plot(
    (1.0, 0.0, -1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, -1.0, 0.0),
    "r",
    label="$||\mathbf{x}||_1=1$",
)
axes.plot(numpy.cos(theta), numpy.sin(theta), "g", label="$||\mathbf{x}||_2=1$")
axes.plot(
    (1.0, -1.0, -1.0, 1.0, 1.0),
    (1.0, 1.0, -1.0, -1.0, 1.0),
    "b",
    label="$||\mathbf{x}||_\infty=1$",
)
draw_unit_vectors(axes, numpy.eye(2))
axes.arrow(
    0.0,
    0.0,
    1.0 - head_length * numpy.cos(numpy.pi / 4.0),
    1.0 - head_length * numpy.sin(numpy.pi / 4.0),
    head_width=head_width,
    color="k",
    linestyle="--",
    alpha=0.5,
)
axes.set_title("Unit Ball in 1-,2-,and $\infty$ norm")
axes.set_xlim((-1.1, 1.1))
axes.set_ylim((-1.1, 1.1))
axes.grid(True)
# axes.legend([ '$||\mathbf{x}||_1=1$', '$||\mathbf{x}||_2=1$', '$||\mathbf{x}||_\infty=1$'], loc='best')
axes.legend(loc="upper center", bbox_to_anchor=(0.5, 0.5))
plt.show()

And if $\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ then $||\mathbf{x}||_1 = ??$, $||\mathbf{x}||_2 = ??$, and $||\mathbf{x}||_\infty = ??$

Induced Matrix Norms

The most direct way to construct a matrix norm is to induce one from a vector norm. Given a vector norm, we can define the matrix $p$-norm as the smallest number $C$ that satisfies the inequality

$$||A \mathbf{x}||_{p} \leq C\,||\mathbf{x}||_{p} \quad \forall\, \mathbf{x}\in\mathbb{C}^n$$

or as the supremum of the ratios so that

$$C = ||A||_p = \sup_{\mathbf{x}\in\mathbb{C}^n,~\mathbf{x}\neq\mathbf{0}} \frac{||A \mathbf{x}||_{p}}{||\mathbf{x}||_p}.$$

With no loss of generality, we can restrict $\mathbf{x}$ to the set of all unit vectors $||\mathbf{x}||_p=1$ and interpret the matrix $p$-norm as the maximum distortion of the “unit sphere”.

Properties of all Matrix Norms (induced and non-induced)

In general matrix-norms have the following properties whether they are induced from a vector-norm or not:

  1. $||A|| \geq 0$ and $||A|| = 0$ only if $A = 0$

  2. $||A + B|| \leq ||A|| + ||B||$ (Triangle Inequality)

  3. $||c A|| = |c|\,||A||$

In addition, all induced p-norms satisfy the product rules

  1. $||AB|| \leq ||A||\,||B||$

  2. $||A\mathbf{x}|| \leq ||A||\,||\mathbf{x}||$

Computation of induced matrix p-norms

With a little work it can be shown that

  1. $||A||_1$ = numpy.linalg.norm(A, ord=1) is the maximum 1-norm of the columns of $A$

  2. $||A||_2$ = numpy.linalg.norm(A, ord=2) = max(numpy.linalg.svd(A)[1]) is the maximum singular value of $A$

  3. $||A||_\infty$ = numpy.linalg.norm(A, ord=numpy.inf) is the maximum 1-norm of the rows of $A$ (or $||A^T||_1$)

Examples

The Identity Matrix $||I||$

$$I = \begin{bmatrix} 1 & & \\ & 1 & \\ & & 1 \end{bmatrix}$$

$$||I||_1 = ||I||_2 = ||I||_\infty = ??$$

I = numpy.identity(3)

for ord in [1, 2, numpy.inf]:
    print("||I||_{} = {}".format(ord, numpy.linalg.norm(I, ord=ord)))

Diagonal Matrices $||D||$

$$D = \begin{bmatrix} d_1 & & \\ & d_2 & \\ & & d_3 \end{bmatrix} \quad\text{e.g.}\quad \begin{bmatrix} 2 & & \\ & 1 & \\ & & -3 \end{bmatrix}$$

$$||D||_1 = ||D||_2 = ||D||_\infty = ??$$

D = numpy.diag([2, 1, -3])
for ord in [1, 2, numpy.inf]:
    print("||D||_{} = {}".format(ord, numpy.linalg.norm(D, ord=ord)))

Orthogonal Matrices $||Q||_2$ where $Q^*Q=I$

A fundamental property of $Q$ matrices is that they do not change the Euclidean length of vectors, i.e.

$$||Q\mathbf{x}||_2^2 = \mathbf{x}^*Q^*Q\mathbf{x} = \mathbf{x}^*\mathbf{x} = ||\mathbf{x}||_2^2$$

Therefore

$$||Q||_2 = \sup_{\mathbf{x}\in\mathbb{C}^n,\,||\mathbf{x}||_2=1} ||Q\mathbf{x}||_2 = ??$$

Example: Induced Matrix Norms

Consider the matrix

$$A = \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix}.$$

Compute the induced matrix norm of $A$ for the vector norms $\ell_2$ and $\ell_\infty$.

$\ell^2$: For both of the requested norms, the unit-length vectors $[1, 0]^T$ and $[0, 1]^T$ can be used to give an idea of what the norm might be and provide a lower bound:

$$||A||_2 \geq \max\left( ||A \cdot [1, 0]^T||_2,\; ||A \cdot [0, 1]^T||_2 \right)$$

computing each of the norms we have

$$\begin{aligned} \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} &= \begin{bmatrix} 1 \\ 0 \end{bmatrix} \\ \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} &= \begin{bmatrix} 2 \\ 2 \end{bmatrix} \end{aligned}$$

which translates into the norms $||A \cdot [1, 0]^T||_2 = 1$ and $||A \cdot [0, 1]^T||_2 = 2\sqrt{2}$. This implies that the $\ell_2$ induced matrix norm of $A$ is at least $2\sqrt{2} \approx 2.828427125$.

The exact value of $||A||_2$ can be computed using the spectral radius, defined as

$$\rho(A) = \max_{i} |\lambda_i|,$$

where $\lambda_i$ are the eigenvalues of $A$. With this we can compute the $\ell_2$ norm of $A$ as

$$||A||_2 = \sqrt{\rho(A^\ast A)}.$$

Computing the norm again here we find

$$A^\ast A = \begin{bmatrix} 1 & 0 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 2 & 8 \end{bmatrix},$$

which has eigenvalues

$$\lambda = \frac{1}{2}\left(9 \pm \sqrt{65}\right),$$

so $||A||_2 \approx 2.9208096$.

The actual induced 2-norm of a matrix can be derived using the singular value decomposition (SVD) and is simply the largest singular value $\sigma_1$.

Proof: Given that every matrix $A\in\mathbb{C}^{m\times n}$ can be factored into its SVD (see notebook 10.1):

$$A = U\Sigma V^*$$

where $U\in\mathbb{C}^{m\times m}$ and $V\in\mathbb{C}^{n\times n}$ are unitary matrices with the property $U^*U=I$ and $V^*V=I$ (of their respective sizes), and $\Sigma$ is a real diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n \geq 0$.

Then the 2-norm squared of the matrix is

$$||A||^2_2 = \sup_{\mathbf{x} \in \mathbb{C}^n,~||\mathbf{x}||_2 = 1} ||A \mathbf{x}||_2^2 \quad\text{where}\quad ||A \mathbf{x}||_2^2 = \mathbf{x}^*A^*A\mathbf{x},$$

but $A^*A = V\Sigma^2V^*$, so

$$\begin{align} ||A \mathbf{x}||_2^2 &= \mathbf{x}^*V\Sigma^2V^*\mathbf{x} \\ &= \mathbf{y}^*\Sigma^2\mathbf{y} \quad\text{where}\quad \mathbf{y}=V^*\mathbf{x} \\ &= \sum_{i=1}^n \sigma_i^2|y_i|^2 \\ &\leq \sigma_1^2\sum_{i=1}^n |y_i|^2 = \sigma_1^2\,||\mathbf{y}||_2^2, \end{align}$$

but if $||\mathbf{x}||_2 = 1$ (i.e. $\mathbf{x}$ is a unit vector), then so is $\mathbf{y}$ because unitary matrices don’t change the length of vectors. The bound is attained by taking $\mathbf{y} = \mathbf{e}_1$ (i.e. $\mathbf{x} = \mathbf{v}_1$, the first column of $V$), so it follows that

$$||A||_2 = \sigma_1$$
A = numpy.array([[1, 2], [0, 2]])

# calculate the SVD(A)
U, S, Vt = numpy.linalg.svd(A)

print("Singular_values = {}".format(S))
print("||A||_2 = {}".format(S.max()))
print("||A||_2 = {}".format(numpy.linalg.norm(A, ord=2)))

# more fun facts about the SVD
# print(U.T.dot(U))
# print(Vt.T.dot(Vt))
# print(A - numpy.dot(U,numpy.dot(numpy.diag(S),Vt)))

Some Quick proofs

The induced 1-norm is the max of the 1-norms of the columns of $A$

Given

$$A\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \ldots + x_n\mathbf{a}_n$$

where $||\mathbf{x}||_1 = 1$. Then

$$\begin{align} ||A \mathbf{x}||_1 &= ||x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \ldots + x_n\mathbf{a}_n||_1 \\ &\leq |x_1|\,||\mathbf{a}_1||_1 + |x_2|\,||\mathbf{a}_2||_1 + \ldots + |x_n|\,||\mathbf{a}_n||_1 \quad\text{(triangle inequality)} \\ &\leq \max_{1\leq j\leq n} ||\mathbf{a}_j||_1 \sum_{j=1}^n |x_j| = \max_{1\leq j\leq n} ||\mathbf{a}_j||_1\,||\mathbf{x}||_1 \\ &= \max_{1\leq j\leq n} ||\mathbf{a}_j||_1 \end{align}$$

The induced 2-norm is the maximum singular value (short version...slightly scrappy)

Given

$$A = U\Sigma V^T$$

where $U^TU = V^TV = I$ and $\Sigma$ is a diagonal matrix with diagonal entries $\sigma_1\geq\sigma_2\geq\ldots\geq\sigma_r>0$,

then

$$\begin{align} AV &= U\Sigma \\ ||AV||_2 &= ||U\Sigma||_2 \\ ||A||_2\,||V||_2 &= ||U||_2\,||\Sigma||_2 \\ ||A||_2 &\leq ||\Sigma||_2 = \sigma_1 \end{align}$$

The induced $\infty$-norm is the max of the 1-norms of the rows of $A$

$$||A \mathbf{x}||_\infty = \max_{1 \leq i \leq m} |\mathbf{a}^*_i \mathbf{x}| \leq \max_{1 \leq i \leq m} ||\mathbf{a}^*_i||_1$$

because the maximizing unit vector on the unit sphere in the $\infty$-norm is a vector of 1’s (with signs matching the entries of the row).

Example:

$$A = \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix}.$$

$$||A||_1 = 4, \quad ||A||_\infty = 3$$
# Calculate the 1-norm of A
A = numpy.array([[1, 2], [0, 2]])
normA_1 = numpy.max(numpy.sum(numpy.abs(A), axis=0))
print("||A||_1 = {}".format(normA_1))
print("||A||_1 = {}".format(numpy.linalg.norm(A, ord=1)))
# calculate the 2  norm of A
normA_2 = numpy.max(numpy.linalg.svd(A, compute_uv=False))
print("||A||_2 = {}".format(normA_2))
print("||A||_2 = {}".format(numpy.linalg.norm(A, ord=2)))
print()
U, S, V = numpy.linalg.svd(A)
print(U.dot(numpy.diag(S).dot(V)))
# calculate the infinity norm of A
normA_inf = numpy.max(numpy.sum(numpy.abs(A), axis=1))
print("||A||_inf = {}".format(normA_inf))
print("||A||_inf = {}".format(numpy.linalg.norm(A, ord=numpy.inf)))

The Geometric Picture

One of the most useful ways to think about matrix norms is as a transformation of the unit ball into an ellipse. Depending on the norm in question, the norm will be determined by some feature of the resulting ellipse.

1-Norm: $A = \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix}$

A = numpy.array([[1, 2], [0, 2]])
# ============
# 1-norm
# Unit-ball
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 2)
fig.suptitle("1-Norm:  $||A||_1 = {}$".format(numpy.linalg.norm(A, ord=1)), fontsize=16)

axes = fig.add_subplot(1, 2, 1, aspect="equal")
axes.plot((1.0, 0.0, -1.0, 0.0, 1.0), (0.0, 1.0, 0.0, -1.0, 0.0), "r")
draw_unit_vectors(axes, numpy.eye(2))
axes.set_title("Unit Ball")
axes.set_xlim((-1.1, 1.1))
axes.set_ylim((-1.1, 1.1))
axes.grid(True)

# Image
axes = fig.add_subplot(1, 2, 2, aspect="equal")
axes.plot((1.0, 2.0, -1.0, -2.0, 1.0), (0.0, 2.0, 0.0, -2.0, 0.0), "r")
draw_unit_vectors(axes, A, head_width=0.2)

axes.set_title("Images Under A")
axes.grid(True)

plt.show()

2-Norm: $A = \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix}$

# ============
# 2-norm

# Unit-ball
fig = plt.figure()
fig.suptitle(
    "2-Norm: $||A||_2 = ${:3.4f}".format(numpy.linalg.norm(A, ord=2)), fontsize=16
)
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1, aspect="equal")
axes.add_artist(plt.Circle((0.0, 0.0), 1.0, edgecolor="r", facecolor="none"))
draw_unit_vectors(axes, numpy.eye(2))
axes.set_title("Unit Ball")
axes.set_xlim((-1.1, 1.1))
axes.set_ylim((-1.1, 1.1))
axes.grid(True)

# Image
# Compute some geometry
u, s, v = numpy.linalg.svd(A)
theta = numpy.empty(A.shape[0])
ellipse_axes = numpy.empty(A.shape)
theta[0] = numpy.arccos(u[0, 0] / numpy.linalg.norm(u[:, 0], ord=2))
theta[1] = theta[0] - numpy.pi / 2.0
for i in range(theta.shape[0]):
    ellipse_axes[0, i] = s[i] * numpy.cos(theta[i])
    ellipse_axes[1, i] = s[i] * numpy.sin(theta[i])

axes = fig.add_subplot(1, 2, 2, aspect="equal")
axes.add_artist(
    patches.Ellipse(
        (0.0, 0.0),
        2 * s[0],
        2 * s[1],
        theta[0] * 180.0 / numpy.pi,
        edgecolor="r",
        facecolor="none",
    )
)
for i in range(A.shape[0]):
    axes.arrow(
        0.0,
        0.0,
        ellipse_axes[0, i] - head_length * numpy.cos(theta[i]),
        ellipse_axes[1, i] - head_length * numpy.sin(theta[i]),
        head_width=head_width,
        color="k",
    )
draw_unit_vectors(axes, A, head_width=0.2)
axes.set_title("Images Under A")
axes.set_xlim((-s[0] + 0.1, s[0] + 0.1))
axes.set_ylim((-s[0] + 0.1, s[0] + 0.1))
axes.grid(True)

plt.show()

$\infty$-Norm: $A = \begin{bmatrix} 1 & 2 \\ 0 & 2 \end{bmatrix}$

# ============
# infty-norm
# Unit-ball
fig = plt.figure()
fig.suptitle(
    "$\infty$-Norm: $||A||_\infty = {}$".format(numpy.linalg.norm(A, ord=numpy.inf)),
    fontsize=16,
)
fig.set_figwidth(fig.get_figwidth() * 2)

axes = fig.add_subplot(1, 2, 1, aspect="equal")
axes.plot((1.0, -1.0, -1.0, 1.0, 1.0), (1.0, 1.0, -1.0, -1.0, 1.0), "r")
draw_unit_vectors(axes, numpy.eye(2))
axes.set_title("Unit Ball")
axes.set_xlim((-1.1, 1.1))
axes.set_ylim((-1.1, 1.1))
axes.grid(True)

# Image
# Geometry - Corners are A * ((1, 1), (1, -1), (-1, 1), (-1, -1))
# Symmetry implies we only need two.  Here we just plot two
u = numpy.empty(A.shape)
u[:, 0] = numpy.dot(A, numpy.array((1.0, 1.0)))
u[:, 1] = numpy.dot(A, numpy.array((-1.0, 1.0)))
theta[0] = numpy.arccos(u[0, 0] / numpy.linalg.norm(u[:, 0], ord=2))
theta[1] = numpy.arccos(u[0, 1] / numpy.linalg.norm(u[:, 1], ord=2))

axes = fig.add_subplot(1, 2, 2, aspect="equal")
axes.plot((3, 1, -3, -1, 3), (2, 2, -2, -2, 2), "r")
for i in range(A.shape[0]):
    axes.arrow(
        0.0,
        0.0,
        u[0, i] - head_length * numpy.cos(theta[i]),
        u[1, i] - head_length * numpy.sin(theta[i]),
        head_width=head_width,
        color="k",
    )

draw_unit_vectors(axes, A, head_width=0.2)
axes.set_title("Images Under A")
axes.set_xlim((-4.1, 4.1))
axes.set_ylim((-3.1, 3.1))
axes.grid(True)

plt.show()

Cauchy-Schwarz and Hölder Inequalities

Computing matrix norms where $p \neq 1$, $2$, or $\infty$ is unfortunately more difficult. We have a couple of tools that can be useful, however.

  • Cauchy-Schwarz Inequality: For the special case where $p=q=2$, for any vectors $\mathbf{x}$ and $\mathbf{y}$

    $$|\mathbf{x}^*\mathbf{y}| \leq ||\mathbf{x}||_2\,||\mathbf{y}||_2$$
  • Hölder’s Inequality: It turns out this holds in general for any $p$ and $q$ that satisfy $1/p + 1/q = 1$ with $1 \leq p, q \leq \infty$:

    $$|\mathbf{x}^*\mathbf{y}| \leq ||\mathbf{x}||_p\,||\mathbf{y}||_q.$$

Note: this is essentially what we used in the proof of the $\infty$-norm with $p=1$ and $q=\infty$.

The most widely used matrix norm not induced by a vector norm is the Frobenius norm defined by

$$||A||_F = \left( \sum^m_{i=1} \sum^n_{j=1} |A_{ij}|^2 \right)^{1/2}.$$
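The Frobenius norm is available as numpy.linalg.norm(A, ord='fro'), and it also equals the 2-norm of the vector of singular values (a random matrix is used here as an illustrative check):

A = numpy.random.uniform(size=(4, 3))
frobenius = numpy.sqrt(numpy.sum(numpy.abs(A)**2))
print(numpy.allclose(frobenius, numpy.linalg.norm(A, ord='fro')))
# equivalently, the square root of the sum of the squared singular values
print(numpy.allclose(frobenius,
                     numpy.linalg.norm(numpy.linalg.svd(A, compute_uv=False))))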

Invariance under unitary multiplication

One important property of the matrix 2-norm (and Frobenius norm) is that multiplication by a unitary matrix does not change the norm (kind of like multiplication by 1). In general, for any $A \in \mathbb{C}^{m\times n}$ and unitary matrix $Q \in \mathbb{C}^{m \times m}$ we have

$$\begin{align*} ||Q A||_2 &= ||A||_2 \\ ||Q A||_F &= ||A||_F. \end{align*}$$

The condition number

Finally we have enough machinery to define the condition number $\kappa(A)$ (often written $\mathrm{cond}(A)$), which is simply

$$\kappa(A) = ||A||\,||A^{-1}|| \in [1,\infty)$$

Examples for simple matrices

  • Identity Matrix

    $$\kappa(I) = ||I||\,||I^{-1}|| = ??$$
  • Diagonal Matrix

    $$\kappa(D) = ||D||\,||D^{-1}|| = ??$$
  • 2-norm condition number of a matrix

    $$\kappa_2(A) = ||A||_2\,||A^{-1}||_2 = \frac{\sigma_1}{\sigma_n}$$

If $A$ is singular, $\kappa_2(A) = ??$

Back to our initial example

Let

$$A = \epsilon I \in \mathbb{R}^{n\times n}$$

  • The determinant $|A| = \epsilon^n$ (which can be arbitrarily small)

  • But

    $$\begin{align} \kappa(A) &= ||A||\,||A^{-1}|| \\ &= ||\epsilon I||\,||(1/\epsilon)I|| \\ &\leq |\epsilon|\,|1/\epsilon|\,||I||^2 = 1 \end{align}$$

More generally, it’s easy to show that scaling a matrix does not change its condition number:

$$\kappa(\alpha A) = ||\alpha A||\,||(1/\alpha)A^{-1}|| = \kappa(A)$$

Example

$$A = \begin{bmatrix} 1 & 2 \\ 1+\epsilon & 2 \end{bmatrix}, \quad \epsilon \geq \epsilon_{mach}$$

then

$$A^{-1} = \frac{-1}{2\epsilon} \begin{bmatrix} 2 & -2 \\ -(1+\epsilon) & 1 \end{bmatrix}$$

So

$$\kappa_1(A) = ||A||_1\,||A^{-1}||_1 = \frac{4}{2\epsilon}(3 + \epsilon) \sim \frac{6}{\epsilon},$$

which is very ill-conditioned.
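This estimate is easy to check numerically (taking $\epsilon = 10^{-6}$ as an illustrative value):

epsilon = 1.0e-6
A = numpy.array([[1.0, 2.0], [1.0 + epsilon, 2.0]])
kappa_1 = numpy.linalg.cond(A, p=1)
print("kappa_1(A) = {:g}, 6/epsilon = {:g}".format(kappa_1, 6.0 / epsilon))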

The condition number and error analysis of $A\mathbf{x}=\mathbf{b}$

The condition number is important in many parts of the analysis of numerical linear algebra, but its role is easily illustrated by the behavior of solutions of linear systems.

Assume that $\mathbf{x}$ is a solution to $A\mathbf{x}=\mathbf{b}$ and we want to understand how a small change in the right-hand side $\mathbf{b}$ propagates to errors in $\mathbf{x}$.

Consider the perturbed problem

$$A(\mathbf{x} + \Delta\mathbf{x}) = \mathbf{b} + \Delta\mathbf{b},$$

which, by linearity of matrix-vector multiplication and the fact that $A\mathbf{x}=\mathbf{b}$, implies that

$$A\Delta\mathbf{x} = \Delta\mathbf{b}$$

or

$$\Delta\mathbf{x} = A^{-1}\Delta\mathbf{b}.$$


Taking the norm of both sides implies

$$||\Delta\mathbf{x}|| \leq ||A^{-1}||\,||\Delta\mathbf{b}||.$$

Since $||A\mathbf{x}|| = ||\mathbf{b}||$, it follows that

$$\frac{||\Delta\mathbf{x}||}{||A\mathbf{x}||} \leq ||A^{-1}||\,\frac{||\Delta\mathbf{b}||}{||\mathbf{b}||},$$

and since $||A\mathbf{x}||\leq||A||\,||\mathbf{x}||$, it follows that

$$\frac{||\Delta\mathbf{x}||}{||A||\,||\mathbf{x}||} \leq ||A^{-1}||\,\frac{||\Delta\mathbf{b}||}{||\mathbf{b}||},$$

or

$$\frac{||\Delta\mathbf{x}||}{||\mathbf{x}||} \leq \kappa(A)\,\frac{||\Delta\mathbf{b}||}{||\mathbf{b}||},$$

i.e. the relative error in the solution $\mathbf{x}$ is bounded by the relative error in the RHS $\mathbf{b}$ times the condition number.

Example

Consider the problem

$$\begin{bmatrix} 1 & 2 \\ 1+\alpha & 2 \end{bmatrix}\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

for $\alpha = O(\epsilon_{mach})$.

alpha = numpy.finfo(float).eps
A = numpy.array([[1, 2], [1 + alpha, 2]])

# first version
b = numpy.array([1.0, 1.0])
x = numpy.linalg.solve(A, b)
print("b = {}\nx = {}".format(b, x))
# now perturb b by epsilon
bp = numpy.array([1, 1 - alpha])
xp = numpy.linalg.solve(A, bp)
print("b'={}\nx' = {}".format(bp, xp))
# and calculate relative error and condition number
err_b = numpy.linalg.norm(bp - b) / numpy.linalg.norm(b)
err_x = numpy.linalg.norm(xp - x) / numpy.linalg.norm(x)
condA = numpy.linalg.cond(A)
print("k(A) = {}".format(condA))
print("err_b = {}, err_x = {}, err_x/err_b ={}".format(err_b, err_x, err_x / err_b))