Cocktail party problem
$$\begin{aligned} X_1 &= \ell_{11} Z_1 + \dots + \ell_{1p} Z_p \\ &\vdots \\ X_p &= \ell_{p1} Z_1 + \dots + \ell_{pp} Z_p \end{aligned}$$
$$\begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} \ell_{11} & \dots & \ell_{1p} \\ \vdots & & \vdots \\ \ell_{p1} & \dots & \ell_{pp} \end{pmatrix} \begin{pmatrix} Z_1 \\ \vdots \\ Z_p \end{pmatrix}$$
$$\underset{p \times 1}{X} = \underset{p \times p}{L} \cdot \underset{p \times 1}{Z}$$
- $X$ is observed mixture vector
- $L$ is unknown mixing matrix
- $Z$ is unobserved source vector
We assume that $Z_1 \dots Z_p$ are independent and that each $Z_i$ is non-Gaussian.
Goal: find unmixing matrix $W = L^{-1}$
$$ \begin{aligned} Z &= WX \quad \text{can recover the original signals} \\ \begin{pmatrix} Z_1 \\ \vdots \\ Z_p \end{pmatrix} &= \begin{pmatrix} w_1^T \\ \vdots \\ w_p^T \end{pmatrix} X = \begin{pmatrix} w_1^T X \\ \vdots \\ w_p^T X \end{pmatrix} \\ &\Rightarrow Z_i = w_i^T X \end{aligned} $$
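A tiny numerical illustration of the model (sources, mixing matrix, and sample size are all hypothetical choices): mixing two sources with $L$ and applying $W = L^{-1}$ recovers them exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# two non-Gaussian sources (hypothetical choices for illustration)
Z = np.vstack([np.sin(np.linspace(0, 20, n)),                 # a deterministic sine "signal"
               rng.uniform(-np.sqrt(3), np.sqrt(3), n)])      # uniform noise, mean 0, variance 1

L = np.array([[1.0, 0.5],   # hypothetical mixing matrix L
              [0.3, 1.0]])

X = L @ Z                   # observed mixtures: X = L Z
W = np.linalg.inv(L)        # unmixing matrix W = L^{-1} (unknown in practice; ICA estimates it)
Z_hat = W @ X               # Z = W X recovers the sources

print(np.allclose(Z, Z_hat))  # True
```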
Identifiability
- We cannot determine the variances (or signs) of the independent sources
One way to fix this is to assume $var(Z) = I_p$
- There is no fixed order of $z_1 \dots z_p$
ICA preprocessing
$X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}$ with $E(X) = \mu$ and $Var(X) = \Sigma$
Centering + Whitening (Sphering): $X := \Sigma^{-1/2}(X-\mu)$ with $E(X)=0$ and $var(X)=I_p$
ICA model: $X = L \cdot Z$, unmixing $Z = WX$
- we can then assume that $L \in \mathbb{R}^{p \times p}$ is orthogonal, since $I_p = var(X) = L \, var(Z) \, L^T = L L^T$
- the unmixing matrix $W = L^{-1} = L^T \in \mathbb{R}^{p \times p}$ is then also orthogonal
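A minimal sketch of this preprocessing step for a data matrix with observations in rows (the symmetric inverse square root $\Sigma^{-1/2}$ is built from an eigendecomposition here; this particular construction is an implementation choice, not something the notes prescribe):

```python
import numpy as np

def whiten(X):
    """Center and sphere the rows of X (n observations x p variables)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = np.cov(Xc, rowvar=False)                         # sample covariance (p x p)
    vals, vecs = np.linalg.eigh(Sigma)                       # Sigma = V diag(vals) V^T
    Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # Sigma^{-1/2}
    return Xc @ Sigma_inv_sqrt                               # mean 0, covariance ~ I_p

# after whitening, np.cov(whiten(X), rowvar=False) is approximately the identity
```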
Non-Gaussian assumption
A Gaussian source distribution would create additional identifiability issues:
$$ \begin{aligned} &Z \sim N_p(0, I_p) \quad \text{and } Q \in \mathbb{R}^{p \times p} \text{ orthogonal} \\ &\tilde{Z} = QZ \sim N_p(0, QQ^T) \equiv N_p(0, I_p) \\ &\text{components in both } Z \text{ and } QZ \text{ are independent} \end{aligned} $$
Thm (Kac) If $Z = \binom{Z_1}{Z_2}$ is a random vector with independent $Z_1$ and $Z_2$, and for some orthogonal $Q$ with non-zero entries $\tilde{Z} = QZ = \binom{\tilde{Z}_1}{\tilde{Z}_2}$ also has independent $\tilde{Z}_1$ and $\tilde{Z}_2$, then $Z$ follows a normal distribution.
More intuition: CLT
Linear combinations of independent non-Gaussian variables are more Gaussian than the original variables.
$$ \begin{aligned} &z_1 \dots z_n \sim \text{Unif}[-\sqrt{3}, \sqrt{3}], \quad E(z_i)=0, \, var(z_i)=1 \\ &\bar{z} = \frac{1}{n} (z_1 + \dots + z_n) \simeq N(0, \tfrac{1}{n}) \quad \text{for large } n \end{aligned} $$
[Diagram: uniform densities (box shapes) converging to a Gaussian (bell shape)]
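A small simulation of this effect (replication count chosen arbitrarily): the excess kurtosis of $\bar{z}$, which is $-1.2$ for a single uniform variable, shrinks toward the Gaussian value $0$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3

for n in [1, 2, 10, 50]:
    # 100000 replications of the mean of n Unif[-sqrt(3), sqrt(3)] variables
    z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, n))
    zbar = z.mean(axis=1)
    print(n, round(excess_kurtosis(zbar), 3))
# excess kurtosis of a uniform variable is -1.2; it moves toward 0 (Gaussian) as n grows
```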
Idea
Given $x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix}$, each unmixing vector $w_1 \dots w_p$ should maximize non-Gaussianity of $w_i^T x$. (Subject to orthogonality constraints).
$$ x = Lz \Rightarrow w^T x = \underbrace{w^T L}_{\alpha^T} z = \alpha^T z $$If $z_1 \dots z_p$ are non-Gaussian, then $\alpha_1 z_1 + \dots + \alpha_p z_p$ is least Gaussian when $\alpha = e_i = \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix}$ ($1$ in the $i$-th entry), in which case $\alpha^T z = z_i$ is a single source; any other $\alpha$ mixes several sources and is "more Gaussian" by the CLT intuition above.
This would not be true if the $z_i$ were Gaussian, since then $\alpha^T z \sim N(0, \alpha^T \alpha)$ for any $\alpha$.
Thm (Comon)
If $Z = \begin{pmatrix} z_1 \\ \vdots \\ z_p \end{pmatrix}$ has $E(Z)=0$ and $Var(Z)=I_p$, and $z_1 \dots z_p$ are independent and non-Gaussian, then the unmixing matrix $W$ can be identified up to sign changes and permutation.
Non-Gaussianity via Kurtosis
For a random variable $y \sim (\mu, \sigma^2)$, its $k$-th Standardized Moment is defined as:
$$ E\left[ \left( \frac{y-\mu}{\sigma} \right)^k \right] $$Depending on the value of $k$, we have the following properties:
- $k=1$ (Mean): $$E\left( \frac{y-\mu}{\sigma} \right) = 0$$
- $k=2$ (Variance): $$E\left[ \left( \frac{y-\mu}{\sigma} \right)^2 \right] = \frac{E[(y-\mu)^2]}{\sigma^2} = 1$$
- $k=3$ (Skewness): $$\text{skew}(y) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^3 \right]$$ Note: Skewness measures how symmetric $p_y(y)$ is.
- $k=4$ (Kurtosis): $$\text{kurt}(y) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^4 \right]$$ Note: Kurtosis measures how heavy the tails in $p_y(y)$ are.
Moments of a Normal Distribution
If $y \sim \mathcal{N}(\mu, \sigma^2)$, then its standardized $p$-th moments are as follows:
$$ E\left[ \left( \frac{y-\mu}{\sigma} \right)^p \right] = \begin{cases} 0 & \text{if } p \text{ is odd} \\ (p-1)!! & \text{if } p \text{ is even} \end{cases} $$where $n!!$ denotes the Double Factorial, defined as $n!! = n(n-2)(n-4)\dots$.
Examples:
Odd moments:
$$E\left( \frac{y-\mu}{\sigma} \right) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^3 \right] = \dots = 0$$(Since the Gaussian distribution is symmetric about the mean, all standardized odd moments are 0)
Even moments:
$$ \begin{aligned} E\left[ \left( \frac{y-\mu}{\sigma} \right)^2 \right] &= 1!! = 1 \\ E\left[ \left( \frac{y-\mu}{\sigma} \right)^4 \right] &= 3!! = 3 \\ E\left[ \left( \frac{y-\mu}{\sigma} \right)^6 \right] &= 5!! = 5 \cdot 3 = 15 \end{aligned} $$
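A quick Monte Carlo check of these values (sample size chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)   # already standardized: mu = 0, sigma = 1

for p in range(1, 7):
    print(p, round(np.mean(y**p), 2))
# odd moments ~ 0; even moments ~ 1, 3, 15 (the double factorials)
```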
Excess Kurtosis and Linear Combinations
For a random variable $y$, Excess Kurtosis is defined as:
$$ \mathcal{K}(y) = \text{kurt}(y) - 3 $$Subtracting 3 here makes the excess kurtosis of a Gaussian Distribution 0.
Assume $z_1, z_2$ are Independent random variables with Mean Zero and Unit Variance. We want to investigate the kurtosis properties of the linear combination $\alpha_1 z_1 + \alpha_2 z_2$.
$$ \mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2) = \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) \qquad \text{(for } \alpha_1^2 + \alpha_2^2 = 1\text{)} $$Derivation
First, calculate the variance and fourth moment of the combination:
Variance:
$$ \text{Var}(\alpha_1 z_1 + \alpha_2 z_2) = \alpha_1^2 \text{Var}(z_1) + \alpha_2^2 \text{Var}(z_2) = \alpha_1^2 + \alpha_2^2 $$(Assuming unit variance)
Fourth Moment Expansion: By definition, $\mathcal{K}(y) = E(y^4) - 3$ (assuming variance is normalized). We need to expand $E[(\alpha_1 z_1 + \alpha_2 z_2)^4]$:
$$ \begin{aligned} E[(\alpha_1 z_1 + \alpha_2 z_2)^4] &= \alpha_1^4 E(z_1^4) + \alpha_2^4 E(z_2^4) + 4\alpha_1^3 \alpha_2 E(z_1^3 z_2) \\ &+ 4\alpha_1 \alpha_2^3 E(z_1 z_2^3) + 6\alpha_1^2 \alpha_2^2 E(z_1^2 z_2^2) \end{aligned} $$Using the properties of independence ($z_1 \perp z_2$) and zero mean ($E(z)=0$):
- Cross term $E(z_1^3 z_2) = E(z_1^3)E(z_2) = 0$
- Cross term $E(z_1 z_2^3) = E(z_1)E(z_2^3) = 0$
- Term $E(z_1^2 z_2^2) = E(z_1^2)E(z_2^2) = 1 \cdot 1 = 1$
Substituting $E(z^4) = \mathcal{K}(z) + 3$:
$$ \begin{aligned} E[(\dots)^4] &= \alpha_1^4 [\mathcal{K}(z_1) + 3] + \alpha_2^4 [\mathcal{K}(z_2) + 3] + 6\alpha_1^2 \alpha_2^2 \\ &= \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) + 3(\alpha_1^4 + \alpha_2^4 + 2\alpha_1^2 \alpha_2^2) \\ &= \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) + 3(\alpha_1^2 + \alpha_2^2)^2 \end{aligned} $$Finally, substituting back into the definition of excess kurtosis (assuming $\alpha_1^2 + \alpha_2^2 = 1$):
$$ \mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2) = E[(\dots)^4] - 3(\alpha_1^2 + \alpha_2^2)^2 = \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) $$
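A numerical check of this identity, using two assumed non-Gaussian sources: a uniform variable ($\mathcal{K} = -1.2$) and a unit-variance Laplace variable ($\mathcal{K} = 3$).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def K(y):                                   # empirical excess kurtosis of a standardized sample
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3

z1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)      # K(z1) = -1.2
z2 = rng.laplace(0, 1 / np.sqrt(2), n)            # unit variance, K(z2) = 3

a1, a2 = 0.6, 0.8                                 # a1^2 + a2^2 = 1
lhs = K(a1 * z1 + a2 * z2)
rhs = a1**4 * K(z1) + a2**4 * K(z2)
print(round(lhs, 3), round(rhs, 3))               # approximately equal
```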
Maximization of Non-Gaussianity
Assume the constraint $\alpha_1^2 + \alpha_2^2 = 1$ holds. We prove that $|\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)|$ achieves its maximum when $\alpha_1=1, \alpha_2=0$ or $\alpha_1=0, \alpha_2=1$.
Proof:
$$ \begin{aligned} |\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)| &= |\alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2)| \\ &\leq \alpha_1^4 |\mathcal{K}(z_1)| + \alpha_2^4 |\mathcal{K}(z_2)| \quad (\text{Triangle Inequality}) \\ &\leq (\alpha_1^4 + \alpha_2^4) \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|) \end{aligned} $$Considering the constraint $\alpha_1^2 + \alpha_2^2 = 1 \implies \alpha_i^2 \leq 1$:
$$ \alpha_i^4 \leq \alpha_i^2 \implies \alpha_1^4 + \alpha_2^4 \leq \alpha_1^2 + \alpha_2^2 = 1 $$Therefore:
$$ |\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)| \leq \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|) $$Equality Conditions: The conditions for equality (i.e., kurtosis maximization) are:
- $\alpha_1=0, \alpha_2=1$ (when $|\mathcal{K}(z_2)| \geq |\mathcal{K}(z_1)|$)
- $\alpha_1=1, \alpha_2=0$ (when $|\mathcal{K}(z_1)| \geq |\mathcal{K}(z_2)|$)
Note: This result can be extended to $p$ variables $z_1 \dots z_p$ and coefficients $\alpha_1 \dots \alpha_p$.
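To illustrate the two-variable case numerically, the sketch below sweeps $\alpha = (\cos\theta, \sin\theta)$ over the unit circle (the same assumed uniform and Laplace sources as above, redefined here so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def K(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3

z1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)      # K ~ -1.2
z2 = rng.laplace(0, 1 / np.sqrt(2), n)            # K ~ 3

for t in np.linspace(0, np.pi / 2, 7):
    a1, a2 = np.cos(t), np.sin(t)
    print(f"theta={t:.2f}  |K|={abs(K(a1 * z1 + a2 * z2)):.3f}")
# |K| is maximized at theta = pi/2 here (the source with the larger |K|);
# in general the maximum over the unit circle is attained at a coordinate axis
```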
Non-Gaussianity via Negentropy
For a random variable $y$ with density $f_y(y)$, the differential entropy is
$$H(y) = E [ -\log f_y(y) ]$$Thm For a given variance, differential entropy is maximized by the normal distribution.
Negentropy is
$$J(y) = H(y_{gauss}) - H(y)$$where $y_{gauss}$ is a Gaussian random variable with the same covariance as $y$.
$$J(y) \geq 0, \quad \text{with } J(y) = 0 \quad \text{iff} \quad y \text{ is Gaussian}$$In practice, we don't know $f_y(y)$, so we need to approximate $J(y)$:
$$J(y) \simeq \sum_{i=1}^m a_i ( E(g_i(y)) - E(g_i(y_{gauss})) )^2$$where $g_i$’s are some non-quadratic functions.
Examples:
- $g_1(y) = y^4$: if $m=1$ and $E(y)=0$, $Var(y)=1$, then $$E(g_1(y_{gauss})) = 3 \quad \text{and} \quad E(g_1(y)) = \text{kurt}(y)$$ so $$J(y) \propto (\mathcal{K}(y))^2$$
- $g_1(y) = \log \cosh (y) \qquad g_2(y) = -e^{-\frac{y^2}{2}} $
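A sketch of this approximation using the two contrast functions above, with the weights $a_i$ set to 1 purely for illustration: a non-Gaussian input gives a clearly positive value, while a Gaussian input gives a value near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

g1 = lambda y: np.log(np.cosh(y))
g2 = lambda y: -np.exp(-y**2 / 2)

def negentropy_approx(y, gauss):
    # J(y) ~ sum_i a_i (E g_i(y) - E g_i(y_gauss))^2, with a_i = 1 (illustrative choice)
    return sum((np.mean(g(y)) - np.mean(g(gauss)))**2 for g in (g1, g2))

gauss = rng.standard_normal(n)
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)            # non-Gaussian, mean 0, variance 1

print(negentropy_approx(unif, gauss))                     # clearly positive
print(negentropy_approx(rng.standard_normal(n), gauss))   # near 0 for Gaussian input
```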
Implications for ICA:
maximizing $|\mathcal{K}(w^T x)|$ will presumably find $w$ such that $w^T x = \alpha^T z = \alpha_1 z_1 + \dots + \alpha_p z_p = z_i$ for some $i$, i.e. $w$ will be a proper unmixing vector.
ICA VS other methods
ICA vs PCA:
- no dimension reduction
- focus on independence as opposed to maximizing variance
ICA vs FA:
- no noise $\quad \epsilon \equiv 0 \quad (\Psi \equiv 0)$
- assume independence of $z_1 \dots z_p$ (stronger than $\text{var}(z) = I_p$)
- no dimension reduction $(r=p)$
- $z_1 \dots z_p$ are non-Gaussian
FastICA
$ICA_1$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 $$$ICA_2$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 \text{ and } w^T w_1 = 0 $$…
$ICA_p$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 \text{ and } w^T w_1 = \dots = w^T w_{p-1} = 0 $$
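A minimal deflation-style sketch of this scheme for whitened data (a sketch, not prescribed by the notes): it maximizes the kurtosis contrast with the standard fixed-point step $w \leftarrow E[x\,(w^T x)^3] - 3w$, re-orthogonalizes against the vectors already found, and normalizes. Any non-quadratic contrast from the negentropy approximation could be used instead.

```python
import numpy as np

def fastica_deflation(X, n_iter=200, seed=0):
    """Kurtosis-based FastICA by deflation (a sketch, not the notes' algorithm verbatim).

    X: whitened data of shape (n_samples, p), i.e. mean 0 and identity covariance.
    Returns the estimated unmixing matrix W (rows w_1 ... w_p) and the sources X @ W.T.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.zeros((p, p))
    for i in range(p):
        w = rng.standard_normal(p)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            # fixed-point step for the kurtosis contrast: w <- E[x (w^T x)^3] - 3 w
            w_new = (X * (X @ w)[:, None] ** 3).mean(axis=0) - 3 * w
            # enforce w^T w_1 = ... = w^T w_{i-1} = 0 by projecting out earlier directions
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-9   # converged up to sign
            w = w_new
            if converged:
                break
        W[i] = w
    return W, X @ W.T

# usage sketch: center and whiten the raw data first (see the whitening snippet above),
# then the rows of W estimate the unmixing vectors.
```

The recovered sources come back only up to sign and permutation, consistent with the identifiability discussion above.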