
Understanding the Singular Value Decomposition

Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Kyle T. Mandli and Rajath Kumar Mysore Pradeep Kumar

Note: This material largely follows the text “Numerical Linear Algebra” by Trefethen and Bau (SIAM, 1997) and is meant as a guide and supplement to the material presented there.

from __future__ import print_function

%precision 6 
%matplotlib inline
import matplotlib.pyplot as plt
import numpy
import pandas as pd


Factorizations: A review

Matrix Factorization is a fundamental concept in Linear Algebra and refers to the ability to write general matrices as the products of simpler matrices with special properties.

In particular, all of the great algorithms encountered in numerical linear algebra can be succinctly described by their factorizations. Examples include

The LU decomposition

$$PA = LU$$

  • Algorithm: Gaussian elimination with partial pivoting

  • Factorization: $P$: permutation matrix (from pivoting), $L, U$: lower and upper triangular matrices

  • Application: direct solution of $A\mathbf{x}=\mathbf{b}$ for $A\in\mathbb{R}^{n\times n}$

The QR decomposition

$$A = QR$$

  • Algorithms: Gram-Schmidt orthogonalization, Householder triangularization

  • Factorization: $Q$: orthonormal basis for $C(A)$, $R$: upper triangular matrix

  • Applications: least-squares solutions, projection problems, eigenvalue computation via $QR/RQ$ iteration

EigenProblems

$$AX = X\Lambda, \quad (A\in\mathbb{R}^{n\times n})$$

  • Diagonalizable matrices: $A = X\Lambda X^{-1}$

  • Hermitian/symmetric matrices: $A = Q\Lambda Q^T$

  • Schur factorization: $A = QTQ^T$

  • Algorithms: power method, inverse power method with shifts, $QR/RQ$ iteration

  • Factorization: $X$, $Q$: matrices of eigenvectors, $\Lambda$: diagonal matrix of eigenvalues

  • Applications: solution of dynamical systems, iterative maps, Markov chains, vibrational analysis, quantum mechanics, \ldots

But perhaps the most beautiful factorization, one which contains aspects of all of these problems, is\ldots

Singular Value Decomposition (SVD)

$$A = U\Sigma V^{T}$$

for any $A\in\mathbb{R}^{m\times n}$, where

  • $U \in \mathbb{R}^{m \times m}$ is the orthogonal matrix whose columns are the eigenvectors of $AA^{T}$

  • $V \in \mathbb{R}^{n \times n}$ is the orthogonal matrix whose columns are the eigenvectors of $A^{T}A$

  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with elements $\sigma_{1}, \sigma_{2}, \ldots, \sigma_{r}$, where $r = \mathrm{rank}(A)$, corresponding to the square roots of the eigenvalues of $A^{T}A$. They are called the singular values of $A$ and are positive, arranged in descending order ($\sigma_{1} \geq \sigma_{2} \geq \ldots \geq \sigma_{r} > 0$).

Or a picture is worth a thousand words...

Existence and Uniqueness

Every matrix $A \in \mathbb{C}^{m \times n}$ has a singular value decomposition. Furthermore, the singular values $\{\sigma_{j}\}$ are uniquely determined, and if $A$ is square and the $\sigma_{j}$ are distinct, the left and right singular vectors $\{\mathbf{u}_{j}\}$ and $\{\mathbf{v}_{j}\}$ are uniquely determined up to complex signs (i.e., complex scalar factors of absolute value 1).

Eigenvalue Decomposition vs. SVD Decomposition

Let the matrix $X$ contain the (linearly independent) eigenvectors of $A$; then we can write a decomposition of the matrix $A$ as

$$A = X \Lambda X^{-1}.$$

How does this differ from the SVD?

  • The basis of the SVD representation differs from that of the eigenvalue decomposition

    • The basis vectors are not in general orthogonal for the eigenvalue decomposition, whereas they are for the SVD

    • The SVD effectively contains two basis sets.

  • All matrices have an SVD, whereas not all matrices have eigenvalue decompositions.

Applications

  • Matrix pseudo-inverse for ill-conditioned problems

  • Principal Component analysis

  • Total Least Squares

  • Image Compression

  • Data Analysis for interpretation of high-dimensional data

  • and more\ldots

A conceptual computation of the SVD (not how it’s really done)

Given $A = U\Sigma V^T$, we can form the matrix

$$A^TA = V\Sigma^T\Sigma V^T$$

where $V\in\mathbb{R}^{n\times n}$ is unitary and

$$\Sigma^T\Sigma = \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_n^2 \\ \end{bmatrix}$$

However, because $A^TA$ is clearly symmetric, it also has the eigen-factorization

$$A^TA = Q\Lambda Q^T$$

where $Q$ is unitary and $\Lambda$ is real and diagonal.

Moreover, we can show that $A^T A$ is at least positive semi-definite, so all the eigenvalues are $\geq 0$.

In particular, if $\mathrm{rank}(A)=r$ we know that $r$ eigenvalues are $>0$ and $n-r$ eigenvalues are equal to zero.

Given that

$$\begin{align} A^TA &= V\Sigma^T\Sigma V^T \\ &= Q\Lambda Q^T \\ \end{align}$$

if we simply order the eigenvalues from largest to smallest (and rearrange the columns of $Q$ to match), then we can make the association

$$V = Q, \quad \Sigma^T\Sigma = \Lambda$$

or

$$\sigma_i^2 = \lambda_i\quad\textrm{or}\quad \sigma_i = \sqrt{\lambda_i}$$
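This association can be checked numerically. A minimal sketch (assuming only numpy and a random test matrix): the squared singular values of $A$ match the eigenvalues of $A^TA$.

```python
import numpy

# random test matrix (any shape works)
rng = numpy.random.default_rng(42)
A = rng.standard_normal((5, 3))

# singular values, returned in descending order
sigma = numpy.linalg.svd(A, compute_uv=False)

# eigenvalues of the symmetric matrix A^T A, sorted largest to smallest
lam = numpy.sort(numpy.linalg.eigvalsh(A.T @ A))[::-1]

# sigma_i^2 = lambda_i
print(numpy.allclose(sigma**2, lam))
```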

A similar argument can be made about $U$ and the eigenvectors of $AA^T$; however, a better way to think about the relationship between $U$, $V$ and $\Sigma$ is to rearrange the SVD as

$$AV = U\Sigma$$

or we can look at this column-by-column

$$A\mathbf{v}_i = \sigma_i\mathbf{u}_i$$

(in the same way the eigen-decomposition $AX = X\Lambda$ is equivalent to $A\mathbf{x}_i = \lambda_i\mathbf{x}_i$)

or rearranging

$$\mathbf{u}_i = \frac{A\mathbf{v}_i}{\sigma_i},\quad i=1,2,\ldots,r$$

lets us solve for the first $r$ columns of $U$ given $V$ and $\Sigma$.

As we'll show, we usually don't need to solve for the remaining $m-r$ columns of $U$, for reasons that relate to the "4 fundamental subspaces of $A$".
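The column-by-column relation above translates directly into numpy. A sketch (with a random full-rank matrix, so $r=n$ and every column of the economy $U$ is recovered):

```python
import numpy

rng = numpy.random.default_rng(0)
A = rng.standard_normal((6, 4))  # full column rank with probability 1

U, s, VT = numpy.linalg.svd(A, full_matrices=False)

# u_i = A v_i / sigma_i for all i at once:
# columns of (A @ VT.T) are A v_i; dividing by s scales column i by 1/sigma_i
U_rebuilt = (A @ VT.T) / s

print(numpy.allclose(U_rebuilt, U))
```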

Review: the Singular Value Decomposition (SVD)

$$A = U\Sigma V^{T}$$

for any $A\in\mathbb{R}^{m\times n}$, where

  • $U \in \mathbb{R}^{m \times m}$ is the orthogonal matrix whose columns are the eigenvectors of $AA^{T}$

  • $V \in \mathbb{R}^{n \times n}$ is the orthogonal matrix whose columns are the eigenvectors of $A^{T}A$

  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with elements $\sigma_{1}, \sigma_{2}, \ldots, \sigma_{r}$, where $r = \mathrm{rank}(A)$, corresponding to the square roots of the eigenvalues of $A^{T}A$. They are called the singular values of $A$ and are positive, arranged in descending order ($\sigma_{1} \geq \sigma_{2} \geq \ldots \geq \sigma_{r} > 0$).

But a better way to write and understand the SVD is as

$$AV = U\Sigma$$

where the columns of $U$ and $V$ contain orthonormal bases for the 4 fundamental subspaces of $A$.

The Strangian view of the 4 Subspaces: a quick refresher

The Big Idea: the columns of $U$ and $V$ contain orthonormal bases for the 4 fundamental subspaces.

Returning to $AV = U\Sigma$, it follows that

$$\begin{align} A\mathbf{v}_i &= \sigma_i\mathbf{u}_i,\quad\mathrm{for}\quad i=1,2,\ldots,r \\ A\mathbf{v}_i &= \mathbf{0},\quad\mathrm{for}\quad i=r+1,\ldots,n \\ \end{align}$$

Therefore the last $n-r$ columns of $V$ must be in the null space of $A$, and in fact must form an orthonormal basis for the null space $N(A)$.

Since the first $r$ columns of $V$ are all orthogonal to the last $n-r$ columns, they must form an orthonormal basis for the row space $C(A^T)$.

Likewise, since the first $r$ columns of $U$ satisfy

$$\mathbf{u}_i = \frac{A\mathbf{v}_i}{\sigma_i},\quad i=1,2,\ldots,r$$

they must form an orthonormal basis for the column space $C(A)$,

leaving only the last $m-r$ columns of $U$ as an orthonormal basis for the left null space $N(A^T)$.

Again a picture is worth a lot here

$$A\begin{bmatrix} & & & |& & & \\ & & & |& & & \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r & | &\mathbf{v}_{r+1} & \cdots & \mathbf{v}_n \\ & & & |& & & \\ & & & |& & & \\ \end{bmatrix} = \begin{bmatrix} & & & |& & & \\ & & & |& & & \\ & & & |& & & \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r & | &\mathbf{u}_{r+1} & \cdots & \mathbf{u}_m \\ & & & |& & & \\ & & & |& & & \\ & & & |& & & \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & & | & & \\ & \ddots & & | & \mathbf{0} \\ & & \sigma_r &| & \\ -&- &- & | & - & -\\ & \mathbf{0} & & | & \mathbf{0} \\ & & & | & \\ & & & | & \\ \end{bmatrix}$$

The Economy (or Skinny) SVD

As it turns out, because of all the 0's, all we actually need to reconstruct $A$ is the first $r$ columns of $U$ and $V$, and the square sub-block of $\Sigma$ with just the singular values, i.e.

$$A\begin{bmatrix} & & & \\ & & & \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r \\ & & & \\ & & & \\ \end{bmatrix} = \begin{bmatrix} & & & \\ & & & \\ & & & \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r \\ & & & \\ & & & \\ & & & \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \\ \end{bmatrix}$$

or

$$A = U_r\Sigma_r V_r^T$$

And it is this object (the Economy, or Skinny, or Compact SVD) that helps us understand what the SVD does.
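numpy computes exactly this economy factorization when passed `full_matrices=False`. A minimal sketch with a tall $6\times 3$ matrix:

```python
import numpy

rng = numpy.random.default_rng(0)
A = rng.standard_normal((6, 3))

# economy SVD: only the first n columns of U are computed (here r = n = 3)
Ur, s, VrT = numpy.linalg.svd(A, full_matrices=False)
print(Ur.shape, s.shape, VrT.shape)  # (6, 3) (3,) (3, 3)

# the skinny factors are enough to reconstruct A exactly
print(numpy.allclose(Ur @ numpy.diag(s) @ VrT, A))
```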

Spectral Theorem

The Economy SVD can also be written as

$$A = U_r\Sigma_r V_r^T = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}^T_i$$

which says we can expand $A$ as a series of rank-1 matrices $\mathbf{u}_i\mathbf{v}^T_i$ weighted by the singular values...

and it is this picture that leads to many of the important approximating properties of the SVD.

Matrix Multiplication

Consider the matrix-vector product $A\mathbf{x}$, which maps a vector $\mathbf{x}\in\mathbb{R}^n$ to its image in the column space $C(A)$. In the context of the Economy SVD,

$$A\mathbf{x} = U_r\Sigma_r V_r^T\mathbf{x}$$

or (since $V_r^TV_r = I$)

$$A\mathbf{x} = U_r\Sigma_r (V_r^T V_r) V_r^T\mathbf{x}$$

or

$$\begin{align} A\mathbf{x} &= (U_r\Sigma_r V_r^T) (V_r V_r^T)\mathbf{x} \\ & = A \mathbf{x}^+ \end{align}$$

where

$$\mathbf{x}^+ = (V_r V_r^T)\mathbf{x}$$

is the projection of $\mathbf{x}$ onto the row space $C(A^T)$.

Question: If we wanted it, how would we find the projection of $\mathbf{x}$ onto the null space of $A$?
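Both projections can be demonstrated numerically. A sketch with a rank-1 matrix, so the null space is 2-dimensional (the null-space projection here also answers the question above):

```python
import numpy

# rank-1 matrix: its row space is 1-D, its null space is 2-D
A = numpy.outer([1.0, 2.0, 1.0], [2.0, -1.0, 1.0])

U, s, VT = numpy.linalg.svd(A)
r = int(numpy.sum(s > 1e-12 * s[0]))  # numerical rank
Vr = VT[:r].T                         # first r right singular vectors

x = numpy.array([1.0, 2.0, 3.0])
x_plus = Vr @ (Vr.T @ x)   # projection onto the row space C(A^T)
x_null = x - x_plus        # projection onto the null space N(A)

print(numpy.allclose(A @ x, A @ x_plus))  # A only "sees" the row-space part
print(numpy.allclose(A @ x_null, 0.0))    # the null-space part maps to zero
```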

SVD and the Matrix Inverse

If a matrix is square and invertible, then $m=n=r$ and $U$, $\Sigma$ and $V$ are all square invertible matrices. Therefore if

$$A = U\Sigma V^T$$

then

$$A^{-1} = V\Sigma^{-1} U^T$$

where

$$\Sigma^{-1} = \begin{bmatrix} 1/\sigma_1 & & & \\ & 1/\sigma_2 & & \\ & & \ddots & \\ & & & 1/\sigma_n \\ \end{bmatrix}$$

and clearly

$$A^{-1}\mathbf{b} = \mathbf{x}$$

maps any vector $\mathbf{b}\in C(A)$ back to a unique vector $\mathbf{x}\in C(A^T)$.
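For the square invertible case this is easy to verify; a sketch using the same $3\times 3$ matrix as the worked example later in these notes:

```python
import numpy

A = numpy.array([[2.0, 0.0, 3.0], [5.0, 7.0, 1.0], [0.0, 6.0, 2.0]])

U, s, VT = numpy.linalg.svd(A)
A_inv = VT.T @ numpy.diag(1.0 / s) @ U.T  # A^{-1} = V Sigma^{-1} U^T

print(numpy.allclose(A_inv, numpy.linalg.inv(A)))
```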

SVD and the Matrix Pseudo-Inverse

But suppose $A$ is not square. Then it cannot have an inverse; however, the SVD allows us to define a "pseudo-inverse".

Given the skinny SVD

$$A = U_r\Sigma_r V_r^T$$

we can define

$$A^{+} = V_r \Sigma_r^{-1} U_r^T$$

because $\Sigma_r$ is square and invertible (though its size is set by the rank $r$).

Action of the Pseudo-Inverse

In general, if $A$ is rank deficient ($r<\min(m,n)$), the problem $A\mathbf{x} = \mathbf{b}$ has either no solution or an infinite number of solutions. However, it always has a minimal least squares solution given by

$$\mathbf{x}^{+} = A^{+}\mathbf{b}$$

which maps any vector $\mathbf{b}\in \mathbb{R}^m$ to a unique vector $\mathbf{x}^+\in C(A^T)$.

As we did with $A\mathbf{x}$, we can consider the action of $A^{+}$ on any vector $\mathbf{b}$ as a projection problem:

$$\begin{align} A^{+}\mathbf{b} &= V_r\Sigma^{-1}_r U_r^T\mathbf{b} \\ &= V_r\Sigma^{-1}_r (U_r^T U_r) U_r^T\mathbf{b} \\ & = (V_r\Sigma^{-1}_r U_r^T) (U_r U_r^T)\mathbf{b} \\ & = A^{+} \mathbf{p} \end{align}$$

where

$$\mathbf{p} = (U_r U_r^T)\mathbf{b}$$

is the projection of $\mathbf{b}$ onto $C(A)$.

The minimal Least-Squares solution

It should be clear that

$$\mathbf{x}^+ = A^{+}\mathbf{b}$$

lies entirely in the row space of $A$ (why?).

Therefore it has no component in $N(A)$, and it is the shortest solution to $A\hat{\mathbf{x}}=\mathbf{p}$, which is the least squares problem.
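numpy exposes this minimal least-squares solution in two equivalent ways: `numpy.linalg.pinv` builds $A^+$ from the SVD, and `numpy.linalg.lstsq` returns the same minimum-norm solution for rank-deficient systems. A sketch:

```python
import numpy

# rank-deficient 3 x 2 matrix (second column is zero), so r = 1 < n = 2
A = numpy.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
b = numpy.array([1.0, 1.0, 1.0])

x_pinv = numpy.linalg.pinv(A) @ b                  # x^+ = A^+ b
x_lstsq = numpy.linalg.lstsq(A, b, rcond=None)[0]  # minimum-norm least squares

print(numpy.allclose(x_pinv, x_lstsq))
print(x_pinv)  # second component is 0: no component in N(A)
```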

The Pseudo-Inverse and ill-conditioned problems

Technically, a square matrix is either invertible or singular; however, in many real problems a matrix can be close to singular, or "ill-conditioned".

Again, the condition number of a matrix defined by an induced $p$-norm is

$$\kappa_p(A) = \|A\|_p \|A^{-1}\|_p$$

In particular, for $p=2$, it is easy to show (using the SVD) that for a square matrix

$$\kappa_2(A) =\frac{\sigma_1}{\sigma_n}$$

and thus for a singular matrix with $r<n$, $\kappa_2 =\infty$.
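A quick check that $\kappa_2(A) = \sigma_1/\sigma_n$ agrees with numpy's built-in condition number (which also uses the SVD for $p=2$):

```python
import numpy

A = numpy.array([[2.0, 0.0, 3.0], [5.0, 7.0, 1.0], [0.0, 6.0, 2.0]])

s = numpy.linalg.svd(A, compute_uv=False)  # singular values, descending
print(numpy.isclose(s[0] / s[-1], numpy.linalg.cond(A, p=2)))
```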

ill-conditioned matrices

However, it is possible that while $\sigma_n>0$, it is significantly smaller than $\sigma_1$, leading to a very large condition number. In particular, if the last $j$ singular values are very small, this suggests that the matrix has a "near null space", with a basis defined by the last $j$ columns of $V$. Thus, while the matrix is strictly invertible, if a direct solver is used to solve

$$A\mathbf{x} =\mathbf{b}$$

any component in the near null space will be amplified by the corresponding $1/\sigma_i$ and completely pollute your answer.

However, a useful fix can be to approximate $A$ as a lower rank matrix

$$A_k = U_k\Sigma_kV_k^T$$

which just keeps the first $k$ singular values (i.e. pretends $r=k$). If $k$ is chosen wisely, the approximate solution

$$\mathbf{x}^{+} = A^{+}_k\mathbf{b}$$

can be considerably more accurate as it does not allow the near-null space to contaminate the solution.
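A minimal sketch of this truncated solve, assuming a nearly singular $3\times 3$ matrix like the example below: keeping only the first $k$ singular values discards the near-null-space component instead of amplifying it by $1/\sigma_3$.

```python
import numpy

# nearly singular: the third column is almost the sum of the first two
epsilon = 1.0e-12
A = numpy.array([[1.0, 0.0, 1.0],
                 [1.0, 1.0, 2.0 + epsilon],
                 [0.0, 1.0, 1.0]])
x_true = numpy.array([1.0, 2.0, 3.0])
b = A @ x_true

U, s, VT = numpy.linalg.svd(A)
k = 2  # drop the tiny sigma_3

# x^+ = V_k Sigma_k^{-1} U_k^T b, applied without forming A_k^+ explicitly
x_trunc = VT[:k].T @ ((U[:, :k].T @ b) / s[:k])

# the truncated solution still satisfies the system to high accuracy
print(numpy.linalg.norm(A @ x_trunc - b))
```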

Examples

Full SVD example

Consider the matrix

$$A = \begin{bmatrix} 2 & 0 & 3 \\ 5 & 7 & 1 \\ 0 & 6 & 2 \end{bmatrix}.$$

Confirm the SVD representation using numpy functions as appropriate.

A = numpy.array([[2.0, 0.0, 3.0], [5.0, 7.0, 1.0], [0.0, 6.0, 2.0]])

U, sigma, V_T = numpy.linalg.svd(A, full_matrices=True)
print("U:\n{}\n".format(U))
print("Sigma:\n{}\n".format(numpy.diag(sigma)))
print("V:\n{}\n".format(V_T.T))
print(sigma)
Aprime = numpy.dot(U, numpy.dot(numpy.diag(sigma), V_T))
print("A - USV^T:\n{}".format(A - Aprime))

Example: a rank-1 Matrix

$$A = \mathbf{x}\mathbf{y}^T$$
x = numpy.array([1.0, 2.0, 1.0])
y = numpy.array([2.0, -1.0, 1.0])
A = numpy.outer(x, y)
print("A:\n{}\n".format(A))
U, sigma, V_T = numpy.linalg.svd(A, full_matrices=False)

print("U:\n{}\n".format(U))
print("sigma:\n{}\n".format(sigma))
print("V:\n{}\n".format(V_T.T))
Aprime = numpy.dot(U, numpy.dot(numpy.diag(sigma), V_T))
print("A - USV^T:\n{}".format(A - Aprime))
# rank-1 reconstruction
k = 1
A1 = numpy.dot(U[:, :k], numpy.dot(numpy.diag(sigma[:k]), V_T[:k, :]))
print("A_k =\n{}\n".format(A1))
print("A - A_k:\n{}".format(A - A1))

An ill-conditioned example

Consider the matrix

$$A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 2+\epsilon \\ 0 & 1 & 1\\ \end{bmatrix}.$$

where the last column is almost the sum of the first two.

epsilon = 1.0e-5
A = numpy.array([[1.0, 0.0, 1.0], [1.0, 1.0, 2 + epsilon], [0.0, 1.0, 1.0]])
print("A:\n{}\n".format(A))
U, S, V_T = numpy.linalg.svd(A)

print("U:\n{}\n".format(U))
print("S:\n{}\n".format(S))
print("V:\n{}\n".format(V_T.T))
print(
    "sigma_1/sigma_n = {}, cond_2(A) = {}".format(
        S[0] / S[2], numpy.linalg.cond(A, p=2)
    )
)

Try to solve $A\mathbf{x}=\mathbf{b}$

epsilon = 5 * numpy.finfo(numpy.float64).eps
A = numpy.array([[1.0, 0.0, 1.0], [1.0, 1.0, 2 + epsilon], [0.0, 1.0, 1.0]])

x_true = numpy.array([1.0, 2.0, 3.0])
b = A.dot(x_true)
x = numpy.linalg.solve(A, b)
print("x = {}: cond(A)={}".format(x, numpy.linalg.cond(A, p=2)))

Rank 2 reconstruction

U, S, VT = numpy.linalg.svd(A)
print(S)
k = 2
A2 = numpy.dot(U[:, :k], numpy.dot(numpy.diag(S[:k]), VT[:k, :]))

# Pseudo Inverse
Ap2 = numpy.dot(VT.T[:, :k], numpy.dot(numpy.diag(1.0 / S[:k]), U.T[:k, :]))
xp = Ap2.dot(b)
print("x_true\t= {}".format(x_true))
print("x^+ \t= {}".format(xp))
x = numpy.linalg.solve(A, b)
print("x \t= {}".format(x))
print("rel_err(x)  \t= {}".format(numpy.linalg.norm(x - x_true) / numpy.linalg.norm(x_true)))
print("rel_err(x^+)\t= {}".format(numpy.linalg.norm(xp - x_true) / numpy.linalg.norm(x_true)))

Review: the Singular Value Decomposition (SVD)

$$A = U\Sigma V^{T}$$

for any $A\in\mathbb{R}^{m\times n}$, where

  • $U \in \mathbb{R}^{m \times m}$ is the orthogonal matrix whose columns are the eigenvectors of $AA^{T}$

  • $V \in \mathbb{R}^{n \times n}$ is the orthogonal matrix whose columns are the eigenvectors of $A^{T}A$

  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with elements $\sigma_{1}, \sigma_{2}, \ldots, \sigma_{r}$, where $r = \mathrm{rank}(A)$, corresponding to the square roots of the eigenvalues of $A^{T}A$. They are called the singular values of $A$ and are positive, arranged in descending order ($\sigma_{1} \geq \sigma_{2} \geq \ldots \geq \sigma_{r} > 0$).

Again a picture is worth a lot here

$$A\begin{bmatrix} & & & |& & & \\ & & & |& & & \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r & | &\mathbf{v}_{r+1} & \cdots & \mathbf{v}_n \\ & & & |& & & \\ & & & |& & & \\ \end{bmatrix} = \begin{bmatrix} & & & |& & & \\ & & & |& & & \\ & & & |& & & \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r & | &\mathbf{u}_{r+1} & \cdots & \mathbf{u}_m \\ & & & |& & & \\ & & & |& & & \\ & & & |& & & \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & & | & & \\ & \ddots & & | & \mathbf{0} \\ & & \sigma_r &| & \\ -&- &- & | & - & -\\ & \mathbf{0} & & | & \mathbf{0} \\ & & & | & \\ & & & | & \\ \end{bmatrix}$$

$U$ and $V$ contain orthonormal bases for the 4 subspaces of $A$:

  • The first $r$ columns of $U$ form an orthonormal basis for $C(A)\subset\mathbb{R}^m$

  • The first $r$ columns of $V$ form an orthonormal basis for $C(A^T)\subset\mathbb{R}^n$

  • The last $n-r$ columns of $V$ form an orthonormal basis for $N(A)\subset\mathbb{R}^n$

  • The last $m-r$ columns of $U$ form an orthonormal basis for $N(A^T)\subset\mathbb{R}^m$

The Economy (or Skinny) SVD

As it turns out, because of all the 0's, all we actually need to reconstruct $A$ is the first $r$ columns of $U$ and $V$, and the square sub-block of $\Sigma$ with just the singular values, i.e.

$$A\begin{bmatrix} & & & \\ & & & \\ \mathbf{v}_1 & \cdots & \mathbf{v}_r \\ & & & \\ & & & \\ \end{bmatrix} = \begin{bmatrix} & & & \\ & & & \\ & & & \\ \mathbf{u}_1 & \cdots & \mathbf{u}_r \\ & & & \\ & & & \\ & & & \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \\ \end{bmatrix}$$

  • $V_rV_r^T$ is an orthogonal projector onto $C(A^T)$

  • $(I - V_rV_r^T)$ projects onto $N(A)$

  • $U_rU_r^T$ is an orthogonal projector onto $C(A)$

  • $(I - U_rU_r^T)$ projects onto $N(A^T)$

The Pseudo-Inverse

Given the SVD, every matrix $A\in\mathbb{R}^{m\times n}$ has a "pseudo-inverse"

$$A^{+} = V_r\Sigma_r^{-1}U_r^T$$

with the properties

  • if $A\in\mathbb{R}^{n\times n}$ and $r=n$, then $A^{+}=A^{-1}$

  • $\mathbf{x}^+ = A^{+}\mathbf{b}$ is always the shortest least squares solution, s.t. $(I - V_rV_r^T)\mathbf{x}^+=\mathbf{0}$

  • $AA^+$ projects $\mathbf{b}$ onto ______?

  • $A^+A$ projects $\mathbf{x}$ onto ______?

Matrix Properties via the SVD

  • $\mathrm{rank}(A) = r$, where $r$ is the number of non-zero singular values.

  • $\mathrm{range}(A) = \mathrm{span}\langle u_1, \ldots , u_r\rangle$ and $\mathrm{null}(A) = \mathrm{span}\langle v_{r+1}, \ldots , v_n\rangle$.

  • $\| A \|_2 = \sigma_1$ and $\|A\|_F = \sqrt{\sigma_{1}^{2}+\sigma_{2}^{2}+\ldots+\sigma_{r}^{2}}$.

  • The nonzero singular values of $A$ are the square roots of the nonzero eigenvalues of $A^{T}A$ or $AA^{T}$.

  • If $A = A^{T}$, then the singular values of $A$ are the absolute values of the eigenvalues of $A$.

  • For $A \in \mathbb{C}^{m \times m}$, $|\det(A)| = \prod_{i=1}^{m} \sigma_{i}$.
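Several of these properties are straightforward to verify numerically; a sketch with a random square matrix:

```python
import numpy

rng = numpy.random.default_rng(7)
A = rng.standard_normal((4, 4))

s = numpy.linalg.svd(A, compute_uv=False)

print(numpy.isclose(numpy.linalg.norm(A, 2), s[0]))            # ||A||_2 = sigma_1
print(numpy.isclose(numpy.linalg.norm(A, 'fro'),
                    numpy.sqrt(numpy.sum(s**2))))              # ||A||_F
print(numpy.isclose(abs(numpy.linalg.det(A)), numpy.prod(s)))  # |det A| = prod of sigma_i
```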

Low-Rank Approximations

  • $A$ is the sum of $r$ rank-one matrices:

    $$A = U \Sigma V^T = \sum_{j=1}^{r} \sigma_{j}u_{j}v_{j}^{T}$$

  • For any $k$ with $0 \leq k \leq r$, define

    $$A_{k} = \sum_{j=1}^{k} \sigma_{j}u_{j}v_{j}^{T}$$

then

$$\|A - A_{k}\|_{2} = \left\|\sum_{j=k+1}^{r} \sigma_{j}u_{j}v_{j}^{T}\right\|_{2} = \sigma_{k+1}$$

It can then be shown (Eckart-Young-Mirsky theorem) that for all matrices $B\in \mathbb{C}^{m \times n}$ with $\mathrm{rank}(B)\leq k$,

$$\|A - A_{k}\|_{2} \leq \| A-B\|_{2}$$

i.e., $A_k$ is the best rank-$k$ approximation to $A$ in the 2-norm.

A similar result can be shown for the Frobenius norm:

  • For any $k$ with $0 \leq k \leq r$, the matrix $A_{k}$ also satisfies

$$\|A - A_{k}\|_{F} = \inf_{\substack{B \in \mathbb{C}^{m \times n} \\ \mathrm{rank}(B)\leq k}} \|A-B\|_{F} = \sqrt{\sigma_{k+1}^{2} + \ldots + \sigma_{r}^{2}}$$
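Both error formulas can be checked directly; a sketch with a random matrix and $k=2$:

```python
import numpy

rng = numpy.random.default_rng(3)
A = rng.standard_normal((6, 5))

U, s, VT = numpy.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ numpy.diag(s[:k]) @ VT[:k, :]  # rank-k truncation

# 2-norm error is the first discarded singular value
print(numpy.isclose(numpy.linalg.norm(A - Ak, 2), s[k]))

# Frobenius error is the root-sum-square of the discarded singular values
print(numpy.isclose(numpy.linalg.norm(A - Ak, 'fro'),
                    numpy.sqrt(numpy.sum(s[k:] ** 2))))
```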

Example: Image compression

How does this work in practice?

The original matrix (from Dürer's Melencolia I)

from matplotlib import image

data = image.imread("images/melancholia-magic-square.png")
m, n = data.shape
print("{}x{} pixel image".format(m, n))
plt.figure(figsize=(8, 6))
plt.imshow(data, cmap="gray")
plt.show()
u, s, vt = numpy.linalg.svd(data)
fig = plt.figure(figsize=(16, 6))
axes = fig.add_subplot(1, 2, 1)
axes.semilogy(s, "bo-")
axes.set_ylabel("$\sigma$", fontsize=16)
axes.grid()
axes.set_title("Spectrum of Singular Values")

axes = fig.add_subplot(1, 2, 2)
g = numpy.cumsum(s * s) / numpy.sum(s * s)
axes.plot(g, "b-")

axes.set_title("% cumulative percent variance explained", fontsize=18)
axes.grid()
plt.show()

First 4 Modes

where mode ii is the rank-1 matrix σiuiviT\sigma_i\mathbf{u}_i\mathbf{v}_i^T

fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 4)
fig.set_figheight(fig.get_figheight() * 1)
for i in range(4):
    mode = s[i] * numpy.outer(u[:, i], vt[i, :])

    axes = fig.add_subplot(1, 4, i + 1)
    mappable = axes.imshow(mode, cmap="gray")
    axes.set_title("Mode = {}, $\sigma$={:3.3f}".format(i + 1, s[i]), fontsize=18)

plt.show()

Compressed Reconstructions

We can now view compressed reconstructions that sum the first $k$ modes

$$A_k = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^T = U_k\Sigma_kV_k^T$$

The amount of memory required to store $A_k$ is $O(k(m + n + 1))$, whereas the storage of $A$ is $O(mn)$, so the relative compression is

$$\frac{k(m+n+1)}{mn}$$

The question is how many modes are required to adequately reconstruct the original image?

fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 3)
fig.set_figheight(fig.get_figheight() * 2)
for i, k in enumerate([1, 10, 20]):
    Ak = u[:, :k].dot(numpy.diag(s[:k]).dot(vt[:k, :]))
    storage = 100.0 * k * (m + n + 1) / (m * n)
    axes = fig.add_subplot(1, 3, i + 1)
    mappable = axes.imshow(Ak, vmin=0.0, vmax=1.0, cmap="gray")
    axes.set_title(
        "$A_{{{}}}$, $storage={:2.2f}\%, \% var ={:2.2f}$".format(
            k, storage, 100 * g[k]
        ),
        fontsize=16,
    )

plt.show()
fig = plt.figure()
fig.set_figwidth(fig.get_figwidth() * 3)
fig.set_figheight(fig.get_figheight() * 2)
for i, k in enumerate([50, 100, 200]):
    Ak = u[:, :k].dot(numpy.diag(s[:k]).dot(vt[:k, :]))
    storage = 100.0 * k * (m + n + 1) / (m * n)
    axes = fig.add_subplot(1, 3, i + 1)
    mappable = axes.imshow(Ak, vmin=0.0, vmax=1.0, cmap="gray")
    axes.set_title(
        "$A_{{{}}}$, $storage={:2.2f}\%, \% var ={:2.2f}$".format(
            k, storage, 100 * g[k]
        ),
        fontsize=16,
    )

plt.show()

Other Applications

  • Total Least-squares

  • PCA

Total Least Squares

GOAL: Demonstrate the use of the SVD to calculate a total least squares regression and compare it to the classical least squares problem that assumes errors only in $y$.

Random data:

We start by constructing a random data set that approximates a straight line but has random errors in both the $x$ and $y$ coordinates.

# npoints uniformly randomly distributed points in the interval [0,3]
npnts = 100
x = numpy.random.uniform(0.0, 3.0, npnts)
# set y = mx + b plus random noise of size err
slope = 2.0
intercept = 1.0
err = 0.5

y = slope * x + intercept
y += numpy.random.normal(loc=0.0, scale=err, size=y.shape)

# add some random noise to the x variable as well
x += numpy.random.normal(loc=0.0, scale=err, size=x.shape)

# And plot out the data
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.scatter(x, y)
axes.set_xlabel("x", fontsize=16)
axes.set_ylabel("y", fontsize=16)
axes.set_title("Data", fontsize=18)
axes.grid()
plt.show()

Classical Least Squares:

We first calculate the best fit straight line, assuming all the error is in the $y$ variable, using a least-squares solve of the Vandermonde matrix $\begin{bmatrix} 1 & \mathbf{x}\\\end{bmatrix}$

# Vandermonde matrix
A = numpy.array([numpy.ones(x.shape), x]).T

# solve Ac = y in the least-squares sense using numpy's lstsq
c_ls, res, rank, s = numpy.linalg.lstsq(A, y, rcond=None)
print("Best fit Linear Least Squares:")
print("    slope={}".format(c_ls[1]))
print("    intercept={}".format(c_ls[0]))
# dummy variables
t_ls = numpy.linspace(0, x.max())

# And plot out the data
fig = plt.figure(figsize=(10, 8))
axes = fig.add_subplot(1, 1, 1)
axes.scatter(x, y)
axes.set_xlabel("x")
axes.set_ylabel("y")
# plot the least squares solution
axes.plot(t_ls, c_ls[0] + t_ls * c_ls[1], "r-", label="Least Squares")
axes.legend()
axes.grid()
plt.show()

Total Least Squares:

Suppose we wanted to find the line, defined by its unit normal vector $\mathbf{u}$, that instead minimizes the sum of squared orthogonal distances between the data and the line, i.e.

# make up a line through the center of the data
X = numpy.array([x, y]).T
X_mean = numpy.mean(X, 0)
v = numpy.array([1, 2])
v = v / numpy.linalg.norm(v, ord=2)
u = numpy.array([-v[1], v[0]])
# print(u.dot(v))
t = numpy.linspace(0, x.max())
t = 2 * (t - numpy.mean(t))

fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1, 1, 1)
axes.scatter(x, y)
l = X_mean + numpy.outer(t, v)
lu = X_mean + u
axes.plot(l[:, 0], l[:, 1], "g")
axes.plot([X_mean[0], X_mean[0] + u[0]], [X_mean[1], X_mean[1] + u[1]], "r")
axes.text(X_mean[0] + 1.3 * u[0], X_mean[1] + 1.3 * u[1], "$\mathbf{u}$", fontsize=14)
axes.set_aspect("equal")
axes.set_xlabel("x", fontsize=16)
axes.set_ylabel("y", fontsize=16)
axes.set_title("Data", fontsize=18)

axes.grid()
plt.show()

This is equivalent to finding the $\mathbf{u}$ that minimizes

$$\|M\mathbf{u}\|_2$$

where

$$M = X - \bar{X} = \begin{bmatrix} (x_1 - \bar{x}) & (y_1 - \bar{y}) \\ (x_2 - \bar{x}) & (y_2 - \bar{y}) \\ (x_3 - \bar{x}) & (y_3 - \bar{y}) \\ \vdots & \vdots \\ (x_n - \bar{x}) & (y_n - \bar{y}) \\ \end{bmatrix}$$

is the matrix of de-meaned data,

but

$$\|M\mathbf{u}\|^2_2 = \mathbf{u}^TM^TM\mathbf{u}$$

or, substituting in the SVD of $M=U\Sigma V^T$, the problem becomes: find the $\mathbf{u}$ that minimizes

$$(\mathbf{u}^TV) \Sigma^T\Sigma (V^T\mathbf{u})$$

Let $\mathbf{y} = V^T\mathbf{u} \in \mathbb{R}^2$, where $\|\mathbf{y}\|_2 =1$;

then we seek the $\mathbf{y}$ that minimizes

$$\mathbf{y}^T\Sigma^T\Sigma \mathbf{y} = \sigma_1^2 y_1^2 + \sigma_2^2 y_2^2 \geq \sigma_2^2(y_1^2 + y_2^2) = \sigma_2^2$$

because

$$\|\mathbf{y}\|^2_2 = y_1^2 + y_2^2 = 1$$

Hopefully it's clear that the vector

$$\mathbf{y} = \begin{bmatrix} 0 \\ 1 \\\end{bmatrix}$$

provides the minimum value.

But $\mathbf{u} = V\mathbf{y} = \mathbf{v}_2$,

so the best fit line is the one whose normal is $\mathbf{v}_2$, the singular vector corresponding to the smallest singular value.

The line we seek is orthogonal to $\mathbf{v}_2$, so it must lie in the direction of $\mathbf{v}_1$.

Total Least-Squares (a recipe)

  • Form a matrix $M$ whose rows are the de-meaned data

  • Find the SVD of $M = U\Sigma V^T$

  • Set $\mathbf{u} = \mathbf{v}_2$

  • In 2-D, the best fit line is parallel to $\mathbf{v}_1$ and goes through the mean of the data

# Prepare the data matrix
X = numpy.array([x, y]).T
print("Shape of data Matrix: {}".format(X.shape))
# and remove the mean
X_mean = numpy.mean(X, 0)
print("Mean of data matrix={}".format(X_mean))
M = X - X_mean
# now calculate the SVD of the de-meaned data matrix
U, S, VT = numpy.linalg.svd(M, full_matrices=False)
V = VT.T
print("Singular values = {}".format(S))
print("First Right singular vector V_1= {}".format(V[:, 0]))

Now plot and compare the two solutions

# dummy variables
t_ls = numpy.linspace(0, x.max())
t_svd = 2 * (t_ls - numpy.mean(t_ls))

# make figure
plt.figure(figsize=(10, 8))
# plot data
plt.scatter(x, y)
# plot the least squares solution
plt.plot(t_ls, c_ls[0] + t_ls * c_ls[1], "r-", label="Least Squares")

# plot the total least Squares solution
# plot the mean
plt.plot(X_mean[0], X_mean[1], "go", markersize=12, label="X_mean")
# calculate a line through the mean with the first principal component as a basis
L_tls = X_mean + numpy.outer(t_svd, V[:, 0])
plt.plot(L_tls[:, 0], L_tls[:, 1], "c-", label="Total Least Squares")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Comparison Least Squares vs Total Least Squares")
plt.legend(loc="best")
plt.grid()
plt.show()

Notes

This technique can be extended into higher dimensions. For example, we could find a best fit plane through a 3-D cloud of data.

For this problem, we would again find the SVD of the de-meaned data matrix $M$; the best fit plane would be normal to the last singular vector $\mathbf{v}_3$, i.e., the plane is $\mathrm{span}\langle \mathbf{v}_1, \mathbf{v}_2\rangle$ (through the mean of the data).

In general, for a de-meaned data matrix $M$, the columns of the matrix $V$ form an orthonormal basis for the row space of $M$ and describe the axes of the best fit ellipsoid describing the data.

Principal Component/EOF analysis

GOAL: Demonstrate the use of the SVD to calculate principal components or “Empirical Orthogonal Functions” in a geophysical data set. This example is modified from a paper by Chris Small (LDEO)

Small, C., 1994. A global analysis of mid-ocean ridge axial topography. Geophys J Int 116, 64–84. doi:10.1111/j.1365-246X.1994.tb02128.x

The Data

Here we will consider a set of topography profiles taken across the global mid-ocean ridge system where the Earth’s tectonic plates are spreading apart.

The data consists of 156 profiles spanning a range of spreading rates. Each profile contains 79 samples, so each is in effect a vector in $\mathbb{R}^{79}$.

# read the data from the csv file
data = pd.read_csv("data/m80_data.csv", header=None).values
data_mean = numpy.mean(data, 0)
# and plot out a few profiles and the mean depth.
plt.figure(figsize=(10, 8))
rows = [9, 59, 99]
labels = ["slow", "medium", "fast"]
for i, row in enumerate(rows):
    plt.plot(data[row, :], label=labels[i])
plt.plot(data_mean, "k--", label="mean")
plt.xlabel("Distance across axis (km)")
plt.ylabel("Relative Elevation (m)")
plt.legend(loc="best")
plt.title("Example cross-axis topography of mid-ocean ridges")
plt.grid()
plt.show()

EOF analysis

While each profile lives in a 79-dimensional space, we would like to see if we can classify the variability in fewer components. To begin, we form a de-meaned data matrix $X$ where each row is a profile.

X = data - data_mean
plt.figure(figsize=(6, 8))
plt.imshow(X, vmin=-1500, vmax=2000)
plt.xlabel("Distance across axis (Km)", fontsize=16)
plt.ylabel("Relative Spreading Rate", fontsize=16)
plt.title("De-meaned Data", fontsize=18)
cbar = plt.colorbar()
cbar.set_label("Relative Depth (m)", fontsize=16)
plt.show()

Applying the SVD

We now use the SVD to factor the data matrix as X=UΣVTX = U\Sigma V^T

# now calculate the SVD of the de-meaned data matrix
U, S, Vt = numpy.linalg.svd(X, full_matrices=False)

We begin by looking at the spectrum of singular values $\Sigma$. Defining the variance as $\Sigma^2$, we can also calculate the cumulative contribution to the total variance as

$$g_k = \frac{\sum_{i=0}^k \sigma_i^2}{\sum_{i=0}^n \sigma_i^2}$$

Plotting both $\Sigma$ and $g$ shows that roughly 80% of the total variance can be explained by the first 4-5 components.

# plot the singular values
fig = plt.figure(figsize=(16, 6))
axes = fig.add_subplot(1, 2, 1)
axes.semilogy(S, "bo")
axes.grid()
axes.set_title("Singular Values", fontsize=18)

# and cumulative percent of variance
axes = fig.add_subplot(1, 2, 2)
g = numpy.cumsum(S * S) / numpy.sum(S * S)
axes.plot(g, "bx-")

axes.set_title("Cumulative percent variance explained", fontsize=18)
axes.grid()

plt.show()

Plotting the first 3 right singular vectors (rows of $V^T$) shows that they reflect some commonly occurring patterns in the data.

num_EOFs = 3
fig = plt.figure(figsize=(10, 8))

axes = fig.add_subplot(1, 1, 1)
for row in range(num_EOFs):
    axes.plot(Vt[row, :], label="EOF{}".format(row + 1))

axes.plot([0, X.shape[1]], [0.0, 0.0], "k--")

axes.grid()
axes.set_xlabel("Distance (km)")
axes.set_title("First {} EOFs ".format(num_EOFs))
axes.legend(loc="best")


plt.show()

For example, the first EOF is primarily a symmetric pattern with an axial high flanked by two off-axis troughs (or an axial low with two flanking highs; the EOFs are just unit basis vectors for the row space and can enter with either a positive or negative coefficient). The second EOF is broader and all of one sign, while the third EOF encodes asymmetry.
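The parenthetical remark above reflects a general fact: the SVD only determines each singular-vector pair up to a simultaneous sign flip. A minimal sketch with synthetic data (not the ridge profiles) verifies this:

```python
import numpy

rng = numpy.random.default_rng(0)
X_demo = rng.standard_normal((8, 5))  # synthetic stand-in for a data matrix
U, S, Vt = numpy.linalg.svd(X_demo, full_matrices=False)

# Flip the sign of the first left AND right singular vectors together
U2 = U.copy()
Vt2 = Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1

# The product U Sigma V^T is unchanged, so the sign of each EOF is arbitrary
print(numpy.allclose((U * S) @ Vt, (U2 * S) @ Vt2))
```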

Reconstruction

Using the SVD we can also decompose each profile into a weighted linear combination of EOFs, i.e.

$$X = U\Sigma V^T = C V^T$$

where $C = U\Sigma$ is a matrix of coefficients that describes how each data row is decomposed into the basis vectors. We can then produce a rank-$k$ truncated representation of the data as

$$X_k = C_k V_k^T$$

where $C_k$ contains the first $k$ columns of $C$ and $V_k$ the first $k$ EOFs.
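By the Eckart-Young theorem, this rank-$k$ truncation is the best rank-$k$ approximation to $X$ in the 2-norm, and its error is exactly the first discarded singular value. A small sketch with synthetic data (not the ridge profiles) confirms this:

```python
import numpy

rng = numpy.random.default_rng(1)
X_demo = rng.standard_normal((10, 7))  # synthetic stand-in for a data matrix
U, S, Vt = numpy.linalg.svd(X_demo, full_matrices=False)

k = 3
Xk = (U[:, :k] * S[:k]) @ Vt[:k, :]   # rank-k truncation C_k V_k^T

# The 2-norm truncation error equals the first discarded singular value
err = numpy.linalg.norm(X_demo - Xk, 2)
print(numpy.isclose(err, S[k]))
```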

Here we show the original data alongside the data reconstructed using the first 5 EOFs.

# reconstruct the data using the first 5 EOFs
k = 5
Ck = numpy.dot(U[:, :k], numpy.diag(S[:k]))
Vtk = Vt[:k, :]
data_k = data_mean + numpy.dot(Ck, Vtk)
fig = plt.figure(figsize=(12, 8))
axes = fig.add_subplot(1, 2, 1)
im = axes.imshow(data_k, vmin=-2000, vmax=2000)
fig.colorbar(im)
axes.set_title("Reconstructed data")

axes = fig.add_subplot(1, 2, 2)
im = axes.imshow(data, vmin=-2000, vmax=2000)
axes.set_title("Original data")
fig.colorbar(im)
plt.show()

We can also compare a few reconstructed profiles with the original data.

# show the three original profiles and their reconstructed values using the first k EOFs
fig = plt.figure(figsize=(24, 8))

for i, row in enumerate(rows):
    axes = fig.add_subplot(1, 3, i + 1)
    h = axes.plot(data_k[row, :])
    h += axes.plot(data[row, :])
    axes.grid()
    Cstring = ["{:3.0f},  ".format(Ck[row, j]) for j in range(k)]
    axes.set_title(
        "Reconstruction profile {}:\n C_{}=".format(row, k) + "".join(Cstring)
    )
    axes.legend(["k={}".format(k), "original data"], loc="best")

plt.show()

Projection of data onto a subspace

We can also use the principal components to view the projection of the data onto a lower-dimensional space, since the coefficients $C$ are simply the coordinates of the data along each principal component. For example, we can view the data in the two-dimensional space defined by the first 2 EOFs by simply plotting $C_1$ against $C_2$.

# plot the data in the plane defined by the first two principal components
fig = plt.figure(figsize=(16, 6))
axes = fig.add_subplot(1, 2, 1)
axes.scatter(Ck[:, 0], Ck[:, 1])
axes.set_xlabel("$C_1$")
axes.set_ylabel("$C_2$")
axes.grid()
axes.set_title("Projection onto the first two principal components")

# Or consider the degree of asymmetry (EOF 3) as a function of spreading rate
axes = fig.add_subplot(1, 2, 2)
axes.plot(Ck[:, 2], "o")
axes.set_xlabel("Relative spreading rate")
axes.set_ylabel("$C_3$")
axes.grid()
axes.set_title("Degree of asymmetry")
plt.show()