Expressions
Scalar:
$$\lambda \in \mathbb{R}$$
Vector: (column vectors!)
$$v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n$$
Vector of ones:
$$1_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n$$
Matrix:
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{np} \end{pmatrix} = (a_{ij}) \in \mathbb{R}^{n \times p}$$
Identity matrix:
$$I_n = \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{n \times n}$$
Diagonal matrix:
$$D = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix} = \text{diag}(d_1, \dots, d_n) \in \mathbb{R}^{n \times n}$$
Matrix operations
Given @@A = (a_{ij}) \in \mathbb{R}^{n \times p}@@
Transpose: @@A^T = (a_{ji}) \in \mathbb{R}^{p \times n}@@
Column Space: @@C(A) = \{ Av : v \in \mathbb{R}^p \}@@
@@Av = \sum_{j=1}^{p} v_j a_j@@, linear combination of columns in @@A = (a_1 \dots a_p)@@
Rank: @@rk(A)@@ is the number of linearly independent columns in A
- i.e. @@rk(A) = \text{dim}(C(A))@@
Frobenius norm:
$$||A||_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} a_{ij}^2$$
Given @@A = (a_{ij}) \in \mathbb{R}^{n \times n}@@
Inverse: @@A^{-1}@@ is such that @@A \cdot A^{-1} = A^{-1} \cdot A = I_n@@
Square root: @@A^{1/2}@@ is such that @@A^{1/2} \cdot A^{1/2} = A@@
Trace:
$$tr(A) = \sum_{i=1}^{n} a_{ii}$$
Determinant:
$$det(A) = \sum_{\sigma \text{ is a permutation on } \{1..n\}} \text{sgn}(\sigma) \cdot a_{1, \sigma(1)} \dots a_{n, \sigma(n)}$$
Properties
@@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times p}@@ then @@(AB)^T = B^T A^T@@
@@A, B \in \mathbb{R}^{n \times n}@@ both invertible then @@(AB)^{-1} = B^{-1} A^{-1}@@
@@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times n}@@ then @@tr(AB) = tr(BA)@@
@@A \in \mathbb{R}^{n \times m}@@, @@B \in \mathbb{R}^{m \times p}@@, @@C \in \mathbb{R}^{p \times n}@@ then @@tr(ABC) = tr(BCA) = tr(CAB)@@
@@A, B \in \mathbb{R}^{n \times n}@@ then @@det(AB) = det(A) det(B)@@
@@A \in \mathbb{R}^{n \times n}@@ is invertible @@\Leftrightarrow det(A) \neq 0@@
If @@A \in \mathbb{R}^{n \times n}@@ is invertible, then @@det(A^{-1}) = \frac{1}{det(A)}@@
@@A \in \mathbb{R}^{n \times p}@@ then @@C(A^T)@@ is the row-space of A
@@\text{dim}(C(A)) = \text{dim}(C(A^T)) = \text{rank}(A)@@
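These identities are easy to sanity-check numerically. Below is a minimal NumPy sketch (the matrices and their sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# det(AB) = det(A) det(B), for square matrices
S, T = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
assert np.isclose(np.linalg.det(S @ T), np.linalg.det(S) * np.linalg.det(T))

# dim C(A) = dim C(A^T) = rank(A)
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)
```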
Vector product
Product of @@a@@ with a scalar @@d \in \mathbb{R}@@: given @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@, then @@d \cdot a = \begin{pmatrix} d a_1 \\ \vdots \\ d a_n \end{pmatrix} \in \mathbb{R}^n@@
Inner product of @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@ and @@b = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} \in \mathbb{R}^n@@ is
$$a^T b = \sum_{i=1}^{n} a_i b_i = \langle a, b \rangle$$
@@\langle a, a \rangle = \sum_{i=1}^{n} a_i^2 = ||a||^2@@
@@\cos \theta = \frac{\langle a, b \rangle}{||a|| ||b||}@@
@@a@@ and @@b@@ are orthogonal if @@a^T b = 0@@
Outer product of @@a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n@@ and @@b = \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix} \in \mathbb{R}^p@@ is
$$a b^T = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_p \\ \vdots & \ddots & \vdots \\ a_n b_1 & \cdots & a_n b_p \end{pmatrix} \in \mathbb{R}^{n \times p}$$
Examples: Inner vs. Outer Product
Let’s use two column vectors @@a \in \mathbb{R}^2@@ and @@b \in \mathbb{R}^2@@:
$$a = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$$
- Inner Product (Result is a Scalar)
The inner product is @@a^T b@@. This multiplies a (@@1 \times 2@@) row vector by a (@@2 \times 1@@) column vector, resulting in a (@@1 \times 1@@) scalar.
$$a^T b = \underset{(1 \times 2)}{\begin{pmatrix} 1 & 2 \end{pmatrix}} \underset{(2 \times 1)}{\begin{pmatrix} 3 \\ 4 \end{pmatrix}} = (1 \cdot 3) + (2 \cdot 4) = 3 + 8 = 11$$
- Outer Product (Result is a Matrix)
The outer product is @@a b^T@@. This multiplies a (@@2 \times 1@@) column vector by a (@@1 \times 2@@) row vector, resulting in a (@@2 \times 2@@) matrix.
$$a b^T = \underset{(2 \times 1)}{\begin{pmatrix} 1 \\ 2 \end{pmatrix}} \underset{(1 \times 2)}{\begin{pmatrix} 3 & 4 \end{pmatrix}} = \begin{pmatrix} 1 \cdot 3 & 1 \cdot 4 \\ 2 \cdot 3 & 2 \cdot 4 \end{pmatrix} = \begin{pmatrix} 3 & 4 \\ 6 & 8 \end{pmatrix}$$
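The same example as a NumPy sketch (assuming `numpy` is available; `@` computes the inner product for 1-D arrays, `np.outer` the outer product):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(a @ b)           # inner product: scalar 11
print(np.outer(a, b))  # outer product: 2x2 matrix [[3 4], [6 8]]
```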
Matrix product
Given @@A \in \mathbb{R}^{n \times m}@@ and @@B \in \mathbb{R}^{m \times p}@@ there are two ways to view the product @@A \cdot B \in \mathbb{R}^{n \times p}@@
“inner product view”
$$A = \begin{pmatrix} a_1^T \\ \vdots \\ a_n^T \end{pmatrix}, \quad B = \begin{pmatrix} b_1 & \cdots & b_p \end{pmatrix}$$
$$AB = \begin{pmatrix} a_1^T b_1 & \cdots & a_1^T b_p \\ \vdots & \ddots & \vdots \\ a_n^T b_1 & \cdots & a_n^T b_p \end{pmatrix}$$
“outer product view”
$$A = \begin{pmatrix} a_1 & \cdots & a_m \end{pmatrix}, \quad B = \begin{pmatrix} b_1^T \\ \vdots \\ b_m^T \end{pmatrix}$$
$$AB = \sum_{j=1}^{m} a_j b_j^T = \underbrace{a_1 b_1^T}_{n \times p} + \dots + \underbrace{a_m b_m^T}_{n \times p} \quad (m \text{ matrices})$$
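A minimal NumPy check that the outer-product view reproduces the usual product (sizes are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))  # n x m
B = rng.standard_normal((3, 4))  # m x p

# AB as a sum of m outer products: (column j of A) x (row j of B)
AB = sum(np.outer(A[:, j], B[j, :]) for j in range(A.shape[1]))
assert np.allclose(AB, A @ B)
```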
Types of matrices
Given @@A \in \mathbb{R}^{n \times p}@@ it is
- Column-orthogonal if @@A^T A = I_p@@ (@@n > p@@)
If @@A = (a_1 \dots a_p)@@, then @@a_i \perp a_j@@ for @@i \neq j@@ and @@||a_i|| = 1@@
Given @@A \in \mathbb{R}^{n \times n}@@ it is
Orthogonal if @@A^T A = A A^T = I_n@@
If @@A = (a_1 \dots a_n)@@, then @@a_i \perp a_j@@ for @@i \neq j@@
and if @@A = \begin{pmatrix} - a_1^T - \\ \vdots \\ - a_n^T - \end{pmatrix}@@, then @@a_i \perp a_j@@ for @@i \neq j@@
Symmetric if @@A = A^T@@
Given symmetric @@A \in \mathbb{R}^{n \times n}@@ it is
- positive definite (PD) if @@v^T A v > 0 \quad \forall v \in \mathbb{R}^n \quad (v \neq 0)@@
- positive semidefinite (PSD) if @@v^T A v \ge 0 \quad \forall v \in \mathbb{R}^n@@
- Idempotent if @@A^2 = A@@
Rank factorization
Matrix @@A \in \mathbb{R}^{n \times p}@@ with @@\text{rank}(A) = r@@ can be decomposed as @@A = CF^T@@ where
- @@C \in \mathbb{R}^{n \times r}@@ is full column rank
- @@F \in \mathbb{R}^{p \times r}@@ is full column rank
For @@C = (c_1 \dots c_r)@@ and @@F = (f_1 \dots f_r)@@
$$A = C F^T = \underbrace{c_1 f_1^T + \dots + c_r f_r^T}_{\text{r matrices of rank 1}}$$
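One concrete way to obtain such a factorization is from the compact SVD; a NumPy sketch (the rank-2 matrix below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank 2
r = np.linalg.matrix_rank(A)

# one possible rank factorization: C = U_r D_r, F = V_r
U, d, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :r] * d[:r]  # n x r, full column rank
F = Vt[:r, :].T       # p x r, full column rank

assert np.allclose(A, C @ F.T)
assert np.allclose(A, sum(np.outer(C[:, i], F[:, i]) for i in range(r)))
```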
QR decomposition
Matrix @@A \in \mathbb{R}^{n \times p}@@ can be decomposed as @@A = QR@@ where
- @@Q \in \mathbb{R}^{n \times p}@@ is column-orthogonal (@@Q^T Q = I_p@@)
- @@R \in \mathbb{R}^{p \times p}@@ is upper-triangular
To find @@Q@@ and @@R@@, the Gram-Schmidt procedure is applied to the columns @@a_1, \dots, a_p@@ of @@A@@:
- @@e_1 = a_1 \rightarrow q_1 = \frac{e_1}{||e_1||}@@
- @@e_2 = a_2 - \langle a_2, q_1 \rangle q_1 \rightarrow q_2 = \frac{e_2}{||e_2||}@@
- @@e_k = a_k - \sum_{i=1}^{k-1} \langle a_k, q_i \rangle q_i \rightarrow q_k = \frac{e_k}{||e_k||}@@
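A direct NumPy implementation of the procedure (a teaching sketch: it assumes @@A@@ has full column rank, and classical Gram-Schmidt is numerically fragile, so in practice one would call `np.linalg.qr`):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Thin QR of an n x p full-column-rank A via classical Gram-Schmidt."""
    n, p = A.shape
    Q, R = np.zeros((n, p)), np.zeros((p, p))
    for k in range(p):
        e = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]  # <a_k, q_i>
            e -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(e)
        Q[:, k] = e / R[k, k]
    return Q, R

A = np.random.default_rng(3).standard_normal((5, 3))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)            # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))  # Q column-orthogonal
```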
Eigen (Spectral) decomposition
Consider symmetric @@A \in \mathbb{R}^{n \times n}@@. Then @@A = U \Lambda U^T@@ where
- @@U \in \mathbb{R}^{n \times n}@@ is orthogonal (@@U^T U = U U^T = I_n@@)
- $$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}$$ is diagonal (@@\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n@@)
@@U = (u_1 \dots u_n)@@
- @@u_i@@ are eigenvectors
- @@\lambda_i@@ are eigenvalues
Sometimes it is useful to view ED as
$$A = \sum_{i=1}^{n} \lambda_i u_i u_i^T$$
where
$$\underbrace{u_i u_i^T}_{\text{rank 1 matrix of size } n \times n}$$
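A NumPy sketch of the decomposition and its rank-1 view (`np.linalg.eigh` handles symmetric matrices and returns eigenvalues in increasing order, so they are flipped here to match the convention above):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2  # symmetric

lam, U = np.linalg.eigh(A)      # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]  # reorder so lambda_1 >= ... >= lambda_n

assert np.allclose(U @ np.diag(lam) @ U.T, A)
# A = sum_i lambda_i u_i u_i^T
assert np.allclose(sum(l * np.outer(u, u) for l, u in zip(lam, U.T)), A)
```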
Properties of Symmetric Matrix
Assume @@A \in \mathbb{R}^{n \times n}@@ is symmetric.
@@Au_j = \lambda_j u_j@@, i.e. the mapping @@A@@ doesn't rotate @@u_j@@, it only scales it by @@\lambda_j@@.
@@A = \sum_{i=1}^{n} \lambda_i u_i u_i^T \Rightarrow Au_j = \sum_{i=1}^{n} \lambda_i u_i (u_i^T u_j) = \lambda_j u_j@@
ED is not unique.
@@U = (u_1 \dots u_n)@@ and @@\tilde{U} = (-u_1, \dots, u_n)@@ are both valid decompositions
Rank of A is equal to the number of non-zero @@\lambda_i@@
@@A = \sum_{i=1}^{n} \lambda_i u_i u_i^T@@ will have @@\text{rk}(A)@@ non-zero terms
Properties of PSD Matrix
A is PSD @@\Leftrightarrow \lambda_i \ge 0@@
A is PD @@\Leftrightarrow \lambda_i > 0@@
(@@\Rightarrow@@) If @@A > 0@@ (PD), then @@v^T A v > 0 \quad \forall v \neq 0@@.
Let @@v = u_i@@; then @@u_i^T A u_i = \lambda_i u_i^T u_i = \lambda_i ||u_i||^2 > 0@@, so @@\lambda_i > 0@@.
(@@\Leftarrow@@) If @@\lambda_i > 0@@ for all @@i@@, then for @@v \neq 0@@: @@v^T A v = v^T U \Lambda U^T v = \underbrace{(U^T v)^T}_{y^T} \Lambda \underbrace{(U^T v)}_{y} = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2 > 0@@ since @@y = U^T v \neq 0@@.
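This characterization gives a practical definiteness test; a small NumPy sketch (the tolerance is an arbitrary choice, and `eigvalsh` assumes its input is symmetric):

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Test positive semidefiniteness of symmetric A via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

B = np.random.default_rng(5).standard_normal((4, 4))
assert is_psd(B @ B.T)                     # Gram matrices are PSD
assert not is_psd(-(B @ B.T) - np.eye(4))  # negative definite example
```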
If A is full-rank then
$$A^{-1} = U \Lambda^{-1} U^T = U \begin{pmatrix} 1/\lambda_1 & & 0 \\ & \ddots & \\ 0 & & 1/\lambda_n \end{pmatrix} U^T$$
@@A A^{-1} = (U \Lambda U^T) (U \Lambda^{-1} U^T) = U \Lambda (U^T U) \Lambda^{-1} U^T = U \Lambda \Lambda^{-1} U^T = U I U^T = U U^T = I_n@@
@@A^{-1} A = \dots@@
If A is PSD then
$$A^{1/2} = U \Lambda^{1/2} U^T = U \begin{pmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_n} \end{pmatrix} U^T$$
@@A^{1/2} A^{1/2} = (U \Lambda^{1/2} U^T) (U \Lambda^{1/2} U^T) = U \Lambda^{1/2} (U^T U) \Lambda^{1/2} U^T = U \Lambda^{1/2} \Lambda^{1/2} U^T = U \Lambda U^T = A@@
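Both constructions are immediate once the eigendecomposition is available; a minimal NumPy sketch (the matrix is made PD so that @@A^{-1}@@ and @@A^{1/2}@@ both exist):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)  # symmetric PD

lam, U = np.linalg.eigh(A)
A_inv = U @ np.diag(1 / lam) @ U.T        # U Lambda^{-1} U^T
A_half = U @ np.diag(np.sqrt(lam)) @ U.T  # U Lambda^{1/2} U^T

assert np.allclose(A_inv @ A, np.eye(4))
assert np.allclose(A_half @ A_half, A)
```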
Projection
A symmetric (@@P^T = P@@), idempotent (@@P^2 = P@@) matrix @@P@@ is called an orthogonal projection matrix.
Properties: if @@P@@ is an orthogonal projection, then
- Eigenvalues are 0 or 1 (so it is PSD)
Using the spectral decomposition @@P = U \Lambda U^T@@ and the idempotent property @@P^2 = P@@:
$$P^2 = (U \Lambda U^T)(U \Lambda U^T) = U \Lambda (U^T U) \Lambda U^T = U \Lambda^2 U^T$$
Since @@P^2 = P@@, we have @@U \Lambda^2 U^T = U \Lambda U^T@@, which implies @@\Lambda^2 = \Lambda@@. This means each diagonal entry (eigenvalue) @@\lambda_i@@ must satisfy @@\lambda_i^2 = \lambda_i@@, so @@\lambda_i = 0@@ or @@\lambda_i = 1@@.
Examples:
Given @@A \in \mathbb{R}^{n \times p}@@ with full column rank, the matrix @@P = A(A^T A)^{-1} A^T@@ projects onto @@C(A)@@.
Projecting @@b \in \mathbb{R}^n@@ onto @@C(A)@@ is equivalent to finding @@v \in \mathbb{R}^p@@ such that @@||Av - b||@@ is minimal. This is regression with response @@b@@ and feature matrix @@A@@. The least-squares solution is @@v = (A^T A)^{-1} A^T b@@, so the projection is @@A(A^T A)^{-1} A^T b = Pb@@.
If @@U \in \mathbb{R}^{n \times p}@@ is column-orthogonal then @@P = U U^T@@ projects on @@C(U)@@.
If @@U \in \mathbb{R}^{n \times p}@@ is column-orthogonal then @@P = I_n - U U^T@@ projects onto @@C(U_{\perp})@@.
Here @@U_{\perp} \in \mathbb{R}^{n \times (n-p)}@@ is an orthogonal complement of @@U@@, i.e. @@U_{\perp}^T U_{\perp} = I_{n-p}@@ and @@U^T U_{\perp} = O_{p \times (n-p)}@@.
Let @@\tilde{U} = (U, U_{\perp})@@, then @@\tilde{U}^T \tilde{U} = \tilde{U} \tilde{U}^T = I_n@@.
$$\tilde{U} \tilde{U}^T = (U \quad U_{\perp}) \begin{pmatrix} U^T \\ U_{\perp}^T \end{pmatrix} = U U^T + U_{\perp} U_{\perp}^T = I_n$$
Thus @@U_{\perp} U_{\perp}^T = I_n - U U^T@@.
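A NumPy sketch of these projection facts (random full-column-rank @@A@@; `np.linalg.qr` supplies a column-orthogonal @@U@@ with the same column space):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3))       # full column rank (a.s.)
P = A @ np.linalg.inv(A.T @ A) @ A.T  # projects onto C(A)

assert np.allclose(P, P.T)    # symmetric
assert np.allclose(P @ P, P)  # idempotent

U, _ = np.linalg.qr(A)        # column-orthogonal, C(U) = C(A)
assert np.allclose(U @ U.T, P)

# projecting b onto C(A) = fitted values of least-squares regression
b = rng.standard_normal(6)
v, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(P @ b, A @ v)
```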
Centering operator
$$C = I_n - \frac{1}{n} 1_n 1_n^T \in \mathbb{R}^{n \times n}$$
is the centering operator.
Let @@v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}@@ then
$$Cv = \begin{pmatrix} v_1 - \bar{v} \\ \vdots \\ v_n - \bar{v} \end{pmatrix}$$
with @@\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i@@.
In particular, the mean of @@Cv@@ is 0.
$$Cv = (I_n - \frac{1}{n} 1_n 1_n^T) v = v - \frac{1}{n} 1_n (1_n^T v) = v - 1_n \bar{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} - \begin{pmatrix} \bar{v} \\ \vdots \\ \bar{v} \end{pmatrix}$$
$$\frac{1}{n} \sum_{i=1}^{n} (v_i - \bar{v}) = \frac{1}{n} \sum_{i=1}^{n} v_i - \frac{n \bar{v}}{n} = \bar{v} - \bar{v} = 0$$
Let @@X = (f_1 \dots f_p)@@ (i.e. a matrix with columns @@f_j \in \mathbb{R}^n@@)
then @@CX = C(f_1 \dots f_p) = (Cf_1 \dots Cf_p)@@
thus the mean of each column is zero.
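The centering operator in NumPy (a sketch; note that @@C@@ is itself an orthogonal projection):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 5, 3
C = np.eye(n) - np.ones((n, n)) / n  # I_n - (1/n) 1_n 1_n^T
X = rng.standard_normal((n, p))

assert np.allclose(C @ X, X - X.mean(axis=0))  # subtracts column means
assert np.allclose((C @ X).mean(axis=0), 0)    # each column has mean 0
assert np.allclose(C @ C, C)                   # idempotent (a projection)
```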
Singular Value Decomposition (SVD)
Matrix @@A \in \mathbb{R}^{n \times p}@@ can be decomposed as @@A = U D V^T@@ where
- @@U \in \mathbb{R}^{n \times n}@@ and @@V \in \mathbb{R}^{p \times p}@@ are orthogonal
- @@D \in \mathbb{R}^{n \times p}@@ is diagonal with non-negative entries, sorted in decreasing order
Full form (Case @@n > p@@)
- Left singular vectors: @@U = (u_1 \dots u_n) \in \mathbb{R}^{n \times n}@@
- Singular values: @@d_1 \ge \dots \ge d_p \ge 0@@ (these are the diagonal entries of @@D@@)
- Right singular vectors: @@V = (v_1 \dots v_p) \in \mathbb{R}^{p \times p}@@
- Properties:
- @@U^T U = U U^T = I_n@@
- @@V^T V = V V^T = I_p@@
Compact form
(Note: This “Thin SVD” form is often used when @@n > p@@. Here @@U@@ is @@n \times p@@ and @@D@@ is @@p \times p@@.)
- Left singular vectors: @@U = (u_1 \dots u_p) \in \mathbb{R}^{n \times p}@@
- Singular values: @@d_1 \ge \dots \ge d_p \ge 0@@
- Right singular vectors: @@V = (v_1 \dots v_p) \in \mathbb{R}^{p \times p}@@
- Properties:
- @@U^T U = I_p@@ (but @@U U^T \neq I_n@@)
- @@V^T V = V V^T = I_p@@
Note that A can be written as a sum of rank 1 matrices:
$$A = \sum_{i=1}^{\min(n,p)} \underbrace{d_i u_i v_i^T}_{\text{rank 1 matrices}}$$
- Note: The “importance” of each rank 1 matrix is determined by its singular value @@d_i@@.
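Truncating this sum after the @@k@@ largest singular values gives the best rank-@@k@@ approximation of @@A@@ in Frobenius norm (Eckart-Young theorem); a NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((6, 4))
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# exact reconstruction as a sum of rank-1 matrices
assert np.allclose(sum(d[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(d))), A)

# rank-2 truncation keeps the two most "important" terms
k = 2
A_k = sum(d[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
print(np.linalg.norm(A - A_k, "fro"))  # equals sqrt(d_3^2 + d_4^2)
```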
Properties (SVD)
- SVD is a “generalization” of ED
- Every @@A@@ has SVD.
- SVD is not unique
- If @@\text{rank}(A) = r@@ then @@D = \text{diag}(d_1, \dots, d_r, 0, \dots, 0)@@
Link between SVD and ED
If @@A \in \mathbb{R}^{n \times n}@@ is PSD then the decompositions are the same.
$$A = U \Lambda U^T, \quad U \text{ is orthogonal}, \quad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}, \quad \lambda_i \ge 0$$
Set @@V=U@@, @@D=\Lambda@@ then @@A = U D V^T@@. (The eigenvalues @@\lambda_i@@ of a PSD matrix are all non-negative, which is exactly the property required of singular values @@d_i@@, so @@\Lambda@@ serves as @@D@@.)
If @@A \in \mathbb{R}^{n \times p}@@ (@@n > p@@) then @@v_i@@ are eigenvectors of @@A^T A@@ and @@d_i^2@@ are the eigenvalues.
$$A = U D V^T \Rightarrow A^T A = (U D V^T)^T (U D V^T) = V D^T (U^T U) D V^T$$
Since @@U@@ is orthogonal (@@U^T U = I_n@@):
$$A^T A = V (D^T D) V^T = V \begin{pmatrix} d_1^2 & & 0 \\ & \ddots & \\ 0 & & d_p^2 \end{pmatrix} V^T$$
This is the Eigen Decomposition of @@A^T A@@.
If @@A \in \mathbb{R}^{n \times p}@@ (@@n > p@@) then @@u_i@@ are eigenvectors of @@A A^T@@ and @@d_i^2@@ are the eigenvalues (plus extra zeroes).
$$A = U D V^T \Rightarrow A A^T = (U D V^T) (U D V^T)^T = U D (V^T V) D^T U^T$$
Since @@V@@ is orthogonal (@@V^T V = I_p@@):
$$A A^T = U (D D^T) U^T = U \text{diag}(d_1^2, \dots, d_p^2, 0, \dots, 0) U^T$$
(Note: the result is an @@n \times n@@ matrix with @@p@@ eigenvalues @@d_1^2, \dots, d_p^2@@ and @@n-p@@ additional zero eigenvalues.)
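A NumPy check of this link (the eigenvalues of @@A^T A@@ are compared in descending order; eigenvectors are only determined up to sign, so @@V@@ is checked via the diagonalization instead):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((6, 4))  # n > p
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# eigenvalues of A^T A are the squared singular values
evals = np.linalg.eigvalsh(A.T @ A)[::-1]  # descending order
assert np.allclose(evals, d**2)

# the right singular vectors diagonalize A^T A
assert np.allclose(Vt @ (A.T @ A) @ Vt.T, np.diag(d**2))
```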
Gradients
Vector Gradients
If @@f: \mathbb{R}^n \rightarrow \mathbb{R}@@, so it takes @@x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n@@ and maps it to a number then
$$\nabla_x f(x) = \begin{pmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{pmatrix} \in \mathbb{R}^n$$
Examples:
@@f(x) = \langle a, x \rangle@@ for @@a \in \mathbb{R}^n@@ then @@\nabla_x f(x) = a@@
@@f(x) = \sum_{i=1}^{n} a_i x_i@@ then @@\frac{\partial f(x)}{\partial x_i} = a_i@@ and @@\nabla_x f(x) = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} = a@@
@@f(x) = ||x||^2@@ then @@\nabla_x f(x) = 2x@@
@@f(x) = \sum_{i=1}^{n} x_i^2@@ then @@\frac{\partial f(x)}{\partial x_i} = 2x_i@@ and @@\nabla_x f(x) = \begin{pmatrix} 2x_1 \\ \vdots \\ 2x_n \end{pmatrix} = 2x@@
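These gradients can be verified against finite differences; a small NumPy sketch (the step size `h` is an arbitrary choice):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f: R^n -> R at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(11)
a, x = rng.standard_normal(4), rng.standard_normal(4)

assert np.allclose(num_grad(lambda z: a @ z, x), a, atol=1e-5)      # grad <a,x> = a
assert np.allclose(num_grad(lambda z: z @ z, x), 2 * x, atol=1e-5)  # grad ||x||^2 = 2x
```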
Matrix Gradients
If @@f: \mathbb{R}^{n \times p} \rightarrow \mathbb{R}@@, so it takes @@A = \begin{pmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{np} \end{pmatrix} \in \mathbb{R}^{n \times p}@@ and maps it to a number then
$$\nabla_A f(A) = \begin{pmatrix} \frac{\partial f(A)}{\partial a_{11}} & \cdots & \frac{\partial f(A)}{\partial a_{1p}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial a_{n1}} & \cdots & \frac{\partial f(A)}{\partial a_{np}} \end{pmatrix} \in \mathbb{R}^{n \times p}$$
Example:
- @@f(A) = ||A||_F^2@@ then @@\nabla_A f(A) = 2A@@
$$f(A) = \sum_{i=1}^{n} \sum_{j=1}^{p} a_{ij}^2 \Rightarrow \frac{\partial f(A)}{\partial a_{ij}} = 2 a_{ij}$$
$$\Rightarrow \nabla_A f(A) = \begin{pmatrix} 2a_{11} & \cdots & 2a_{1p} \\ \vdots & \ddots & \vdots \\ 2a_{n1} & \cdots & 2a_{np} \end{pmatrix} = 2A$$
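The same finite-difference check, entry by entry, for the matrix case (a sketch reusing the idea above):

```python
import numpy as np

def num_grad_matrix(f, A, h=1e-6):
    """Central finite-difference gradient of f: R^{n x p} -> R at A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = h
        G[idx] = (f(A + E) - f(A - E)) / (2 * h)
    return G

A = np.random.default_rng(12).standard_normal((3, 2))
frob_sq = lambda M: np.sum(M ** 2)  # f(A) = ||A||_F^2
assert np.allclose(num_grad_matrix(frob_sq, A), 2 * A, atol=1e-5)
```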