Expressions

  • Scalar:

    $$\lambda \in \mathbb{R}$$
  • Vector (column vectors by convention!):

    $$v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n$$
  • Vector of ones:

    $$1_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n$$
  • Matrix:

    $$A = \begin{pmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{np} \end{pmatrix} = (a_{ij}) \in \mathbb{R}^{n \times p}$$
  • Identity matrix:

    $$I_n = \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{n \times n}$$
  • Diagonal matrix:

    $$D = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix} = \text{diag}(d_1, \dots, d_n) \in \mathbb{R}^{n \times n}$$

Matrix operations

Given @@A = (a_{ij}) \in \mathbb{R}^{n \times p}@@

  • Transpose: @@A^T = (a_{ji}) \in \mathbb{R}^{p \times n}@@

  • Column Space: @@C(A) = \{ Av : v \in \mathbb{R}^p \}@@

    Here @@Av = \sum_{j=1}^{p} v_j a_j@@ is a linear combination of the columns of @@A = (a_1 \dots a_p)@@

  • Rank: @@rk(A)@@ is the number of linearly independent columns in A

    • i.e. @@rk(A) = \text{dim}(C(A))@@
  • Frobenius norm:

    $$||A||_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} a_{ij}^2$$
  • Given @@A = (a_{ij}) \in \mathbb{R}^{n \times n}@@

    • Inverse: @@A^{-1}@@ is such that @@A \cdot A^{-1} = A^{-1} \cdot A = I_n@@

    • Square root: @@A^{1/2}@@ is such that @@A^{1/2} \cdot A^{1/2} = A@@ (constructed below for PSD @@A@@)

    • Trace:

      $$tr(A) = \sum_{i=1}^{n} a_{ii}$$
    • Determinant:

      $$det(A) = \sum_{\sigma \text{ permutation of } \{1, \dots, n\}} \text{sgn}(\sigma) \cdot a_{1, \sigma(1)} \cdots a_{n, \sigma(n)}$$
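
As a quick numeric sanity check of these definitions (a minimal NumPy sketch; the 2×2 matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(np.trace(A))                                   # 2 + 3 = 5
print(np.linalg.det(A))                              # 2*3 - 1*1 = 5
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # True: A A^{-1} = I_2
print(np.linalg.norm(A, 'fro')**2)                   # 4 + 1 + 1 + 9 = 15
```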

Properties

  • @@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times p}@@ then @@(AB)^T = B^T A^T@@

  • @@A, B \in \mathbb{R}^{n \times n}@@ both invertible then @@(AB)^{-1} = B^{-1} A^{-1}@@

  • @@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times n}@@ then @@tr(AB) = tr(BA)@@

  • @@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times p}@@, @@C \in \mathbb{R}^{p \times n}@@ then @@tr(ABC) = tr(BCA) = tr(CAB)@@

  • @@A, B \in \mathbb{R}^{n \times n}@@ then @@det(AB) = det(A) det(B)@@

  • @@A \in \mathbb{R}^{n \times n}@@ is invertible @@\Leftrightarrow det(A) \neq 0@@

  • @@A \in \mathbb{R}^{n \times n}@@ invertible then @@det(A^{-1}) = \frac{1}{det(A)}@@

  • @@A \in \mathbb{R}^{n \times p}@@ then @@C(A^T)@@ is the row-space of A

  • @@\text{dim}(C(A)) = \text{dim}(C(A^T)) = \text{rank}(A)@@
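
A minimal NumPy sketch verifying a few of these properties on random matrices (the shapes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# tr(AB) = tr(BA), even though AB is 3x3 and BA is 4x4
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True

S = rng.standard_normal((3, 3))
T = rng.standard_normal((3, 3))
print(np.isclose(np.linalg.det(S @ T),
                 np.linalg.det(S) * np.linalg.det(T)))   # det(ST) = det(S)det(T)
print(np.allclose(np.linalg.inv(S @ T),
                  np.linalg.inv(T) @ np.linalg.inv(S)))  # (ST)^{-1} = T^{-1}S^{-1}
```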


Vector product

  • Product of @@a@@ with a scalar: given @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@ and @@d \in \mathbb{R}@@, @@d \cdot a = \begin{pmatrix} d a_1 \\ \vdots \\ d a_n \end{pmatrix} \in \mathbb{R}^n@@

  • Inner product of @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@ and @@b = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} \in \mathbb{R}^n@@ is

    $$a^T b = \sum_{i=1}^{n} a_i b_i = \langle a, b \rangle$$
    • @@\langle a, a \rangle = \sum_{i=1}^{n} a_i^2 = ||a||^2@@

    • @@\cos \theta = \frac{\langle a, b \rangle}{||a|| ||b||}@@

    • @@a@@ and @@b@@ are orthogonal if @@a^T b = 0@@

  • Outer product of @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@ and @@b = \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix} \in \mathbb{R}^p@@ is

    $$a b^T = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_p \\ \vdots & \ddots & \vdots \\ a_n b_1 & \cdots & a_n b_p \end{pmatrix} \in \mathbb{R}^{n \times p}$$
  • Examples: Inner vs. Outer Product

Let’s use two column vectors @@a \in \mathbb{R}^2@@ and @@b \in \mathbb{R}^2@@:

$$a = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$$
  • Inner Product (Result is a Scalar)

The inner product is @@a^T b@@. This multiplies a (@@1 \times 2@@) row vector by a (@@2 \times 1@@) column vector, resulting in a (@@1 \times 1@@) scalar.

$$a^T b = \underset{(1 \times 2)}{\begin{pmatrix} 1 & 2 \end{pmatrix}} \underset{(2 \times 1)}{\begin{pmatrix} 3 \\ 4 \end{pmatrix}} = (1 \cdot 3) + (2 \cdot 4) = 3 + 8 = 11$$
  • Outer Product (Result is a Matrix)

The outer product is @@a b^T@@. This multiplies a (@@2 \times 1@@) column vector by a (@@1 \times 2@@) row vector, resulting in a (@@2 \times 2@@) matrix.

$$a b^T = \underset{(2 \times 1)}{\begin{pmatrix} 1 \\ 2 \end{pmatrix}} \underset{(1 \times 2)}{\begin{pmatrix} 3 & 4 \end{pmatrix}} = \begin{pmatrix} 1 \cdot 3 & 1 \cdot 4 \\ 2 \cdot 3 & 2 \cdot 4 \end{pmatrix} = \begin{pmatrix} 3 & 4 \\ 6 & 8 \end{pmatrix}$$
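
The same example in NumPy (a minimal sketch; `@` computes the inner product of 1-D arrays and `np.outer` forms @@ab^T@@):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(a @ b)           # 11.0  (inner product, a scalar)
print(np.outer(a, b))  # [[3. 4.]
                       #  [6. 8.]]  (outer product, a 2x2 matrix)
```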

Matrix product

Given @@A \in \mathbb{R}^{n \times m}@@ and @@B \in \mathbb{R}^{m \times p}@@ there are two ways to view the product @@A \cdot B \in \mathbb{R}^{n \times p}@@

“inner product view”

$$A = \begin{pmatrix} a_1^T \\ \vdots \\ a_n^T \end{pmatrix}, \quad B = \begin{pmatrix} b_1 & \cdots & b_p \end{pmatrix}$$

$$AB = \begin{pmatrix} a_1^T b_1 & \cdots & a_1^T b_p \\ \vdots & \ddots & \vdots \\ a_n^T b_1 & \cdots & a_n^T b_p \end{pmatrix}$$

“outer product view”

$$A = \begin{pmatrix} a_1 & \cdots & a_m \end{pmatrix}, \quad B = \begin{pmatrix} b_1^T \\ \vdots \\ b_m^T \end{pmatrix}$$

$$AB = \sum_{j=1}^{m} a_j b_j^T \quad (\text{a sum of } m \text{ matrices, each of size } n \times p)$$
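
A minimal NumPy sketch confirming that both views give the same product (random shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 3, 2, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, p))

# inner product view: (AB)_{ij} = a_i^T b_j (row i of A with column j of B)
inner_view = np.array([[A[i] @ B[:, j] for j in range(p)] for i in range(n)])

# outer product view: AB = sum_j a_j b_j^T (column j of A with row j of B)
outer_view = sum(np.outer(A[:, j], B[j]) for j in range(m))

print(np.allclose(inner_view, A @ B))  # True
print(np.allclose(outer_view, A @ B))  # True
```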

Types of matrices

Given @@A \in \mathbb{R}^{n \times p}@@ it is

  • Column-orthogonal if @@A^T A = I_p@@ (@@n \ge p@@)

    If @@A = (a_1 \dots a_p)@@, then @@a_i \perp a_j@@ for @@i \neq j@@ and @@||a_i|| = 1@@

Given @@A \in \mathbb{R}^{n \times n}@@ it is

  • Orthogonal if @@A^T A = A A^T = I_n@@

    If @@A = (a_1 \dots a_n)@@, then @@a_i \perp a_j@@ for @@i \neq j@@ and @@||a_i|| = 1@@

    and if @@A = \begin{pmatrix} - a_1^T - \\ \vdots \\ - a_n^T - \end{pmatrix}@@, then the rows @@a_i@@ are orthonormal as well

  • Symmetric if @@A = A^T@@

Given symmetric @@A \in \mathbb{R}^{n \times n}@@ it is

  • positive definite (PD) if @@v^T A v > 0 \quad \forall v \in \mathbb{R}^n \quad (v \neq 0)@@
  • positive semidefinite (PSD) if @@v^T A v \ge 0 \quad \forall v \in \mathbb{R}^n@@
  • Idempotent if @@A^2 = A@@
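
A minimal NumPy sketch of these definitions; the rotation matrix and the Gram matrix @@A^T A@@ are illustrative choices not taken from the notes above (any Gram matrix is symmetric PSD, since @@v^T A^T A v = ||Av||^2 \ge 0@@; the eigenvalue check used below is proved later in these notes):

```python
import numpy as np

# a rotation matrix is orthogonal: Q^T Q = Q Q^T = I
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))  # True

# for any A, the Gram matrix A^T A is symmetric PSD
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
G = A.T @ A
print(np.allclose(G, G.T))                      # True: symmetric
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))  # True: eigenvalues >= 0
```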

Rank factorization

Matrix @@A \in \mathbb{R}^{n \times p}@@ with @@\text{rank}(A) = r@@ can be decomposed as @@A = CF^T@@ where

  • @@C \in \mathbb{R}^{n \times r}@@ is full column rank
  • @@F \in \mathbb{R}^{p \times r}@@ is full column rank

For @@C = (c_1 \dots c_r)@@ and @@F = (f_1 \dots f_r)@@

$$A = C F^T = \underbrace{c_1 f_1^T + \dots + c_r f_r^T}_{\text{r matrices of rank 1}}$$
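
A minimal NumPy sketch: building a rank-@@r@@ matrix as @@CF^T@@ and checking its rank (random Gaussian @@C@@ and @@F@@ have full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 5, 4, 2
C = rng.standard_normal((n, r))  # full column rank (w.p. 1)
F = rng.standard_normal((p, r))

A = C @ F.T  # = c_1 f_1^T + c_2 f_2^T, a sum of r rank-1 matrices
print(np.linalg.matrix_rank(A))  # 2
```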

QR decomposition

Matrix @@A \in \mathbb{R}^{n \times p}@@ can be decomposed as @@A = QR@@ where

  • @@Q@@ is column orthogonal (@@Q^T Q = I_p@@)
  • @@R@@ is upper-triangular

To find @@Q@@ and @@R@@, the Gram-Schmidt procedure is applied to the columns @@a_1, \dots, a_p@@ of @@A@@:

  • @@e_1 = a_1 \rightarrow q_1 = \frac{e_1}{||e_1||}@@
  • @@e_2 = a_2 - \langle a_2, q_1 \rangle q_1 \rightarrow q_2 = \frac{e_2}{||e_2||}@@
  • @@e_k = a_k - \sum_{i=1}^{k-1} \langle a_k, q_i \rangle q_i \rightarrow q_k = \frac{e_k}{||e_k||}@@
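
A minimal sketch of this procedure in NumPy; `gram_schmidt_qr` is an illustrative helper, and classical Gram-Schmidt as written above is less numerically stable than the Householder-based `np.linalg.qr`:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of A (assumes full column rank)."""
    n, p = A.shape
    Q = np.zeros((n, p))
    R = np.zeros((p, p))
    for k in range(p):
        e = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]  # <a_k, q_i>
            e -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(e)
        Q[:, k] = e / R[k, k]           # normalize e_k to get q_k
    return Q, R

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))            # True
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: column-orthogonal
print(np.allclose(R, np.triu(R)))       # True: upper-triangular
```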

Eigen (Spectral) decomposition

Consider symmetric @@A \in \mathbb{R}^{n \times n}@@. Then @@A = U \Lambda U^T@@ where

  • @@U \in \mathbb{R}^{n \times n}@@ is orthogonal (@@U^T U = U U^T = I_n@@)
  • $$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}$$ is diagonal (@@\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n@@)

@@U = (u_1 \dots u_n)@@

  • @@u_i@@ are eigenvectors
  • @@\lambda_i@@ are eigenvalues

Sometimes it is useful to view ED as

$$A = \sum_{i=1}^{n} \lambda_i u_i u_i^T$$

where

$$\underbrace{u_i u_i^T}_{\text{rank 1 matrix of size } n \times n}$$
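
A minimal NumPy sketch (`np.linalg.eigh` is the routine for symmetric matrices; note it returns eigenvalues in ascending order, whereas the notes sort them in decreasing order):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # symmetrize an arbitrary matrix

lam, U = np.linalg.eigh(A)
print(np.allclose(U @ np.diag(lam) @ U.T, A))      # A = U Lambda U^T
print(np.allclose(A @ U[:, 0], lam[0] * U[:, 0]))  # A u = lambda u

# A as a sum of rank-1 matrices lambda_i u_i u_i^T
A_sum = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(4))
print(np.allclose(A_sum, A))  # True
```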

Properties of Symmetric Matrix

Assume @@A \in \mathbb{R}^{n \times n}@@ is symmetric.

  • @@Au_j = \lambda_j u_j@@, i.e. mapping A doesn’t rotate @@u_j@@.

    @@A = \sum_{i=1}^{n} \lambda_i u_i u_i^T \Rightarrow Au_j = \sum_{i=1}^{n} \lambda_i u_i (u_i^T u_j) = \lambda_j u_j@@, since @@u_i^T u_j = 0@@ for @@i \neq j@@ and @@u_j^T u_j = 1@@

  • ED is not unique.

    @@U = (u_1 \dots u_n)@@ and @@\tilde{U} = (-u_1, \dots, u_n)@@ both give valid decompositions, since flipping the sign of an eigenvector leaves @@u_i u_i^T@@ unchanged

  • Rank of A is equal to the number of non-zero @@\lambda_i@@

    @@A = \sum_{i=1}^{n} \lambda_i u_i u_i^T@@ will have @@\text{rk}(A)@@ non-zero terms


Properties of PSD Matrix

  • A is PSD @@\Leftrightarrow \lambda_i \ge 0@@

  • A is PD @@\Leftrightarrow \lambda_i > 0@@

    (⇒) A is PD means @@v^T A v > 0 \quad \forall v \neq 0@@.

    Let @@v = u_i@@; then @@u_i^T A u_i = \lambda_i u_i^T u_i = \lambda_i ||u_i||^2 > 0@@, hence @@\lambda_i > 0@@.

    (⇐) If all @@\lambda_i > 0@@, then @@v^T A v = v^T U \Lambda U^T v = \underbrace{(U^T v)^T}_{y^T} \Lambda \underbrace{(U^T v)}_{y} = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2 > 0@@, since @@y = U^T v \neq 0@@ whenever @@v \neq 0@@ (@@U@@ is invertible).

  • If A is full-rank then

    $$A^{-1} = U \Lambda^{-1} U^T = U \begin{pmatrix} 1/\lambda_1 & & 0 \\ & \ddots & \\ 0 & & 1/\lambda_n \end{pmatrix} U^T$$

    @@A A^{-1} = (U \Lambda U^T) (U \Lambda^{-1} U^T) = U \Lambda (U^T U) \Lambda^{-1} U^T = U \Lambda \Lambda^{-1} U^T = U I U^T = U U^T = I_n@@

    @@A^{-1} A = \dots@@

  • If A is PSD then

    $$A^{1/2} = U \Lambda^{1/2} U^T = U \begin{pmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_n} \end{pmatrix} U^T$$

    @@A^{1/2} A^{1/2} = (U \Lambda^{1/2} U^T) (U \Lambda^{1/2} U^T) = U \Lambda^{1/2} (U^T U) \Lambda^{1/2} U^T = U \Lambda^{1/2} \Lambda^{1/2} U^T = U \Lambda U^T = A@@
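
A minimal NumPy sketch constructing @@A^{-1}@@ and @@A^{1/2}@@ from the eigendecomposition (the PD matrix @@B B^T + I@@ is an arbitrary construction):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 2))
A = B @ B.T + np.eye(4)  # symmetric PD by construction

lam, U = np.linalg.eigh(A)
A_half = U @ np.diag(np.sqrt(lam)) @ U.T
print(np.allclose(A_half @ A_half, A))    # True: A^{1/2} A^{1/2} = A

A_inv = U @ np.diag(1.0 / lam) @ U.T
print(np.allclose(A_inv @ A, np.eye(4)))  # True: A^{-1} A = I_n
```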


Projection

A symmetric (@@P^T = P@@), idempotent (@@P^2 = P@@) matrix @@P@@ is called an orthogonal projection matrix.

Properties: if @@P@@ is an orthogonal projection, then

  • Eigenvalues are 0 or 1 (so it is PSD)

    Using the spectral decomposition @@P = U \Lambda U^T@@ and the idempotent property @@P^2 = P@@:

    $$P^2 = (U \Lambda U^T)(U \Lambda U^T) = U \Lambda (U^T U) \Lambda U^T = U \Lambda^2 U^T$$

    Since @@P^2 = P@@, we have @@U \Lambda^2 U^T = U \Lambda U^T@@, which implies @@\Lambda^2 = \Lambda@@. This means each diagonal entry (eigenvalue) @@\lambda_i@@ must satisfy @@\lambda_i^2 = \lambda_i@@, so @@\lambda_i = 0@@ or @@\lambda_i = 1@@.

Examples:

  • Given @@A \in \mathbb{R}^{n \times p}@@ with full column rank (so that @@A^T A@@ is invertible), the matrix @@P = A(A^T A)^{-1} A^T@@ projects onto @@C(A)@@.

  • Projecting @@b \in \mathbb{R}^n@@ onto @@C(A)@@ is equivalent to finding @@v \in \mathbb{R}^p@@ such that @@||Av - b||@@ is minimal. This is least-squares regression with response @@b@@ and design matrix @@A@@, whose solution is @@v = (A^T A)^{-1} A^T b@@; the projection is therefore @@A(A^T A)^{-1} A^T b = Pb@@.

  • If @@U \in \mathbb{R}^{n \times p}@@ is column-orthogonal then @@P = U U^T@@ projects on @@C(U)@@.

  • If @@U \in \mathbb{R}^{n \times p}@@ is column-orthogonal then @@P = I_n - U U^T@@ projects onto @@C(U_{\perp})@@.

    Here the columns of @@U_{\perp} \in \mathbb{R}^{n \times (n-p)}@@ form an orthonormal basis of the orthogonal complement of @@C(U)@@, i.e. @@U_{\perp}^T U_{\perp} = I_{n-p}@@ and @@U^T U_{\perp} = O_{p \times (n-p)}@@.

    Let @@\tilde{U} = (U, U_{\perp})@@, then @@\tilde{U}^T \tilde{U} = \tilde{U} \tilde{U}^T = I_n@@.

    $$\tilde{U} \tilde{U}^T = (U \quad U_{\perp}) \begin{pmatrix} U^T \\ U_{\perp}^T \end{pmatrix} = U U^T + U_{\perp} U_{\perp}^T = I_n$$

    Thus @@U_{\perp} U_{\perp}^T = I_n - U U^T@@.
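
A minimal NumPy sketch of the projection properties and the regression example (random @@A@@ and @@b@@; `np.linalg.lstsq` returns the least-squares solution @@v@@):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 2))  # full column rank with probability 1
b = rng.standard_normal(6)

P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P, P.T))                 # True: symmetric
print(np.allclose(P @ P, P))               # True: idempotent
print(np.round(np.linalg.eigvalsh(P), 8))  # four ~0s and two ~1s

# projecting b onto C(A) gives the fitted values of the regression of b on A
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(P @ b, A @ v))  # True
```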


Centering operator

$$C = I_n - \frac{1}{n} 1_n 1_n^T \in \mathbb{R}^{n \times n}$$

is the centering operator.

  • Let @@v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}@@ then

    $$Cv = \begin{pmatrix} v_1 - \bar{v} \\ \vdots \\ v_n - \bar{v} \end{pmatrix}$$

    with @@\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i@@.

  • In particular, the mean of @@Cv@@ is 0.

    $$Cv = (I_n - \frac{1}{n} 1_n 1_n^T) v = v - \frac{1}{n} 1_n (1_n^T v) = v - 1_n \bar{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} - \begin{pmatrix} \bar{v} \\ \vdots \\ \bar{v} \end{pmatrix}$$

    $$\frac{1}{n} \sum_{i=1}^{n} (v_i - \bar{v}) = \frac{1}{n} \sum_{i=1}^{n} v_i - \frac{n \bar{v}}{n} = \bar{v} - \bar{v} = 0$$
  • Let @@X = (f_1 \dots f_p)@@ (i.e. a matrix with columns @@f_j \in \mathbb{R}^n@@)

  • then @@CX = C(f_1 \dots f_p) = (Cf_1 \dots Cf_p)@@

  • thus the mean of each column is zero.
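
A minimal NumPy sketch of the centering operator (the vector and matrix values are arbitrary examples):

```python
import numpy as np

n = 5
v = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

C = np.eye(n) - np.ones((n, n)) / n  # I_n - (1/n) 1_n 1_n^T
print(np.allclose(C @ v, v - v.mean()))  # True
print(np.isclose((C @ v).mean(), 0))     # True: mean of Cv is 0

# centering a data matrix column by column
rng = np.random.default_rng(8)
X = rng.standard_normal((n, 3))
print(np.allclose((C @ X).mean(axis=0), 0))  # True: each column mean is ~0
```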


Singular Value Decomposition (SVD)

Matrix @@A \in \mathbb{R}^{n \times p}@@ can be decomposed as @@A = U D V^T@@ where

  • @@U \in \mathbb{R}^{n \times n}@@ and @@V \in \mathbb{R}^{p \times p}@@ are orthogonal
  • @@D \in \mathbb{R}^{n \times p}@@ is diagonal with non-negative entries, sorted in decreasing order

Full form (Case @@n > p@@)

  • Left singular vectors: @@U = (u_1 \dots u_n) \in \mathbb{R}^{n \times n}@@
  • Singular values: @@d_1 \ge \dots \ge d_p \ge 0@@ (these are the diagonal entries of @@D@@)
  • Right singular vectors: @@V = (v_1 \dots v_p) \in \mathbb{R}^{p \times p}@@
  • Properties:
    • @@U^T U = U U^T = I_n@@
    • @@V^T V = V V^T = I_p@@

Compact form

(Note: This “Thin SVD” form is often used when @@n > p@@. Here @@U@@ is @@n \times p@@ and @@D@@ is @@p \times p@@.)

  • Left singular vectors: @@U = (u_1 \dots u_p) \in \mathbb{R}^{n \times p}@@
  • Singular values: @@d_1 \ge \dots \ge d_p \ge 0@@
  • Right singular vectors: @@V = (v_1 \dots v_p) \in \mathbb{R}^{p \times p}@@
  • Properties:
    • @@U^T U = I_p@@ (but @@U U^T \neq I_n@@)
    • @@V^T V = V V^T = I_p@@

Note that A can be written as a sum of rank 1 matrices:

$$A = \sum_{i=1}^{\min(n,p)} \underbrace{d_i u_i v_i^T}_{\text{rank 1 matrices}}$$
  • Note: The “importance” of each rank 1 matrix is determined by its singular value @@d_i@@.
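
A minimal NumPy sketch (`np.linalg.svd` returns @@V^T@@ rather than @@V@@; the `full_matrices` flag switches between the full and compact forms):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 3))  # n = 5 > p = 3

# full form: U is 5x5, s holds the p singular values, Vt = V^T is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=True)
# compact ("thin") form: U is 5x3
Uc, sc, Vct = np.linalg.svd(A, full_matrices=False)
print(np.allclose(Uc @ np.diag(sc) @ Vct, A))  # True

# A as a sum of rank-1 matrices d_i u_i v_i^T
A_sum = sum(s[i] * np.outer(Uc[:, i], Vct[i]) for i in range(3))
print(np.allclose(A_sum, A))  # True
```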

Properties (SVD)

  • SVD is a “generalization” of ED
  • Every @@A@@ has SVD.
  • SVD is not unique
  • If @@\text{rank}(A) = r@@ then @@D = \text{diag}(d_1, \dots, d_r, 0, \dots, 0)@@ with @@d_1 \ge \dots \ge d_r > 0@@
  • If @@A \in \mathbb{R}^{n \times n}@@ is PSD then the ED and the SVD coincide.

    $$A = U \Lambda U^T, \quad U \text{ is orthogonal}, \quad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}, \quad \lambda_i \ge 0$$

    Set @@V=U@@, @@D=\Lambda@@ then @@A = U D V^T@@. (The eigenvalues @@\lambda_i@@ of a PSD matrix are all non-negative, just like singular values @@d_i@@, so @@\Lambda@@ and @@D@@ are the same matrix.)

  • If @@A \in \mathbb{R}^{n \times p}@@ (@@n > p@@) then @@v_i@@ are eigenvectors of @@A^T A@@ and @@d_i^2@@ are the eigenvalues.

    $$A = U D V^T \Rightarrow A^T A = (U D V^T)^T (U D V^T) = V D^T (U^T U) D V^T$$

    Since @@U@@ is orthogonal (@@U^T U = I_n@@):

    $$A^T A = V (D^T D) V^T = V \begin{pmatrix} d_1^2 & & 0 \\ & \ddots & \\ 0 & & d_p^2 \end{pmatrix} V^T$$

    This is the Eigen Decomposition of @@A^T A@@.

  • If @@A \in \mathbb{R}^{n \times p}@@ (@@n > p@@) then @@u_i@@ are eigenvectors of @@A A^T@@ and @@d_i^2@@ are the eigenvalues (plus extra zeroes).

    $$A = U D V^T \Rightarrow A A^T = (U D V^T) (U D V^T)^T = U D (V^T V) D^T U^T$$

    Since @@V@@ is orthogonal (@@V^T V = I_p@@):

    $$A A^T = U (D D^T) U^T = U \text{diag}(d_1^2, \dots, d_p^2, 0, \dots, 0) U^T$$

    (Note: the result is an @@n \times n@@ matrix with @@p@@ non-zero eigenvalues and @@n - p@@ zero eigenvalues; see the numeric check below.)
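
A minimal NumPy sketch of the last two properties; only eigenvalues are compared, since eigenvectors and singular vectors are each determined only up to sign:

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((5, 3))  # n = 5, p = 3

_, s, _ = np.linalg.svd(A)
lam = np.linalg.eigvalsh(A.T @ A)  # ascending order

# eigenvalues of A^T A are the squared singular values of A
print(np.allclose(np.sort(s**2), np.sort(lam)))  # True

# A A^T has the same non-zero eigenvalues, plus n - p zeros
lam_big = np.linalg.eigvalsh(A @ A.T)
print(np.round(np.sort(lam_big), 6))  # two ~0 entries, then the d_i^2
```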


Gradients

Vector Gradients

If @@f: \mathbb{R}^n \rightarrow \mathbb{R}@@, i.e. it takes @@x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n@@ and maps it to a number, then

$$\nabla_x f(x) = \begin{pmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{pmatrix} \in \mathbb{R}^n$$

Examples:

  • @@f(x) = \langle a, x \rangle@@ for @@a \in \mathbb{R}^n@@ then @@\nabla_x f(x) = a@@

    @@f(x) = \sum_{i=1}^{n} a_i x_i@@ then @@\frac{\partial f(x)}{\partial x_i} = a_i@@ and @@\nabla_x f(x) = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} = a@@

  • @@f(x) = ||x||^2@@ then @@\nabla_x f(x) = 2x@@

    @@f(x) = \sum_{i=1}^{n} x_i^2@@ then @@\frac{\partial f(x)}{\partial x_i} = 2x_i@@ and @@\nabla_x f(x) = \begin{pmatrix} 2x_1 \\ \vdots \\ 2x_n \end{pmatrix} = 2x@@
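
A minimal sketch checking both examples against a central finite-difference approximation (the helper `num_grad` is an illustrative utility, not a library function):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(11)
a = rng.standard_normal(4)
x = rng.standard_normal(4)

print(np.allclose(num_grad(lambda x: a @ x, x), a))      # grad <a, x> = a
print(np.allclose(num_grad(lambda x: x @ x, x), 2 * x))  # grad ||x||^2 = 2x
```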

Matrix Gradients

If @@f: \mathbb{R}^{n \times p} \rightarrow \mathbb{R}@@, i.e. it takes @@A = \begin{pmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{np} \end{pmatrix} \in \mathbb{R}^{n \times p}@@ and maps it to a number, then

$$\nabla_A f(A) = \begin{pmatrix} \frac{\partial f(A)}{\partial a_{11}} & \cdots & \frac{\partial f(A)}{\partial a_{1p}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial a_{n1}} & \cdots & \frac{\partial f(A)}{\partial a_{np}} \end{pmatrix} \in \mathbb{R}^{n \times p}$$

Example:

  • @@f(A) = ||A||_F^2@@ then @@\nabla_A f(A) = 2A@@
    $$f(A) = \sum_{i=1}^{n} \sum_{j=1}^{p} a_{ij}^2 \Rightarrow \frac{\partial f(A)}{\partial a_{ij}} = 2 a_{ij}$$

    $$\Rightarrow \nabla_A f(A) = \begin{pmatrix} 2a_{11} & \cdots & 2a_{1p} \\ \vdots & \ddots & \vdots \\ 2a_{n1} & \cdots & 2a_{np} \end{pmatrix} = 2A$$
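
The same finite-difference check for the matrix case (a minimal sketch; the shape is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.standard_normal((3, 2))
f = lambda M: (M**2).sum()  # f(A) = ||A||_F^2

# entry-wise central differences
h = 1e-6
G = np.zeros_like(A)
for i in range(3):
    for j in range(2):
        E = np.zeros_like(A)
        E[i, j] = h
        G[i, j] = (f(A + E) - f(A - E)) / (2 * h)

print(np.allclose(G, 2 * A))  # True: grad of ||A||_F^2 is 2A
```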