Cocktail party problem
$$\begin{aligned} X_1 &= \ell_{11} Z_1 + \dots + \ell_{1p} Z_p \\ &\vdots \\ X_p &= \ell_{p1} Z_1 + \dots + \ell_{pp} Z_p \end{aligned}$$
$$\begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} \ell_{11} & \dots & \ell_{1p} \\ \vdots & & \vdots \\ \ell_{p1} & \dots & \ell_{pp} \end{pmatrix} \begin{pmatrix} Z_1 \\ \vdots \\ Z_p \end{pmatrix}$$
$$\underset{p \times 1}{X} = \underset{p \times p}{L} \cdot \underset{p \times 1}{Z}$$
- $X$ is observed mixture vector
- $L$ is unknown mixing matrix
- $Z$ is unobserved source vector
We assume that $Z_1 \dots Z_p$ are independent and that each $Z_i$ is non-Gaussian.
Goal: find unmixing matrix $W = L^{-1}$
$$ \begin{aligned} Z &= WX \quad \text{can recover the original signals} \\ \begin{pmatrix} Z_1 \\ \vdots \\ Z_p \end{pmatrix} &= \begin{pmatrix} w_1^T \\ \vdots \\ w_p^T \end{pmatrix} X = \begin{pmatrix} w_1^T X \\ \vdots \\ w_p^T X \end{pmatrix} \\ &\Rightarrow Z_i = w_i^T X \end{aligned} $$
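A tiny numerical illustration of the model (sources, mixing matrix, and sample size are all hypothetical choices): mixing two sources with $L$ and applying $W = L^{-1}$ recovers them exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# two non-Gaussian sources (hypothetical choices for illustration)
Z = np.vstack([np.sin(np.linspace(0, 20, n)),                 # a deterministic sine "signal"
               rng.uniform(-np.sqrt(3), np.sqrt(3), n)])      # uniform noise, mean 0, variance 1

L = np.array([[1.0, 0.5],   # hypothetical mixing matrix L
              [0.3, 1.0]])

X = L @ Z                   # observed mixtures: X = L Z
W = np.linalg.inv(L)        # unmixing matrix W = L^{-1} (unknown in practice; ICA estimates it)
Z_hat = W @ X               # Z = W X recovers the sources

print(np.allclose(Z, Z_hat))  # True
```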
Identifiability
- We cannot determine the variances (or signs) of the independent sources
One way to fix this is to assume $var(Z) = I_p$
- There is no fixed order of $z_1 \dots z_p$
ICA preprocessing
$X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}$ with $E(X) = \mu$ and $Var(X) = \Sigma$
Centering + Whitening (Sphering): $X := \Sigma^{-1/2}(X-\mu)$ with $E(X)=0$ and $var(X)=I_p$
ICA model: $X = L \cdot Z$, unmixing $Z = WX$
- we can then assume that $L \in \mathbb{R}^{p \times p}$ is orthogonal, since $I_p = var(X) = L \, var(Z) \, L^T = L L^T$
- the unmixing matrix $W = L^{-1} = L^T \in \mathbb{R}^{p \times p}$ is then also orthogonal
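A minimal sketch of this preprocessing step for a data matrix with observations in rows (the symmetric inverse square root $\Sigma^{-1/2}$ is built from an eigendecomposition here; this particular construction is an implementation choice, not something the notes prescribe):

```python
import numpy as np

def whiten(X):
    """Center and sphere the rows of X (n observations x p variables)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = np.cov(Xc, rowvar=False)                         # sample covariance (p x p)
    vals, vecs = np.linalg.eigh(Sigma)                       # Sigma = V diag(vals) V^T
    Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # Sigma^{-1/2}
    return Xc @ Sigma_inv_sqrt                               # mean 0, covariance ~ I_p

# after whitening, np.cov(whiten(X), rowvar=False) is approximately the identity
```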
Non-Gaussian assumption
A Gaussian source distribution would create additional identifiability issues:
$$ \begin{aligned} &Z \sim N_p(0, I_p) \quad \text{and } Q \in \mathbb{R}^{p \times p} \text{ orthogonal} \\ &\tilde{Z} = QZ \sim N_p(0, QQ^T) \equiv N_p(0, I_p) \\ &\text{components in both } Z \text{ and } QZ \text{ are independent} \end{aligned} $$
Thm (Kac) If $Z = \binom{Z_1}{Z_2}$ is a random vector with independent $Z_1$ and $Z_2$, and for some orthogonal $Q$ with non-zero entries $\tilde{Z} = QZ = \binom{\tilde{Z}_1}{\tilde{Z}_2}$ also has independent $\tilde{Z}_1$ and $\tilde{Z}_2$, then $Z$ follows a normal distribution.
More intuition: CLT
Linear combinations of independent non-Gaussian variables are more Gaussian than the original variables.
$$ \begin{aligned} &z_1 \dots z_n \sim \text{Unif}[-\sqrt{3}, \sqrt{3}], \quad E(z_i)=0, \, var(z_i)=1 \\ &\bar{z} = \frac{1}{n} (z_1 + \dots + z_n) \simeq N(0, \tfrac{1}{n}) \quad \text{for large } n \end{aligned} $$
[Diagram: uniform densities (box shapes) converging to a Gaussian (bell shape)]
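A small simulation of this effect (replication count chosen arbitrarily): the excess kurtosis of $\bar{z}$, which is $-1.2$ for a single uniform variable, shrinks toward the Gaussian value $0$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3

for n in [1, 2, 10, 50]:
    # 100000 replications of the mean of n Unif[-sqrt(3), sqrt(3)] variables
    z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, n))
    zbar = z.mean(axis=1)
    print(n, round(excess_kurtosis(zbar), 3))
# excess kurtosis of a uniform variable is -1.2; it moves toward 0 (Gaussian) as n grows
```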
Idea
Given $x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix}$, each unmixing vector $w_1 \dots w_p$ should maximize non-Gaussianity of $w_i^T x$. (Subject to orthogonality constraints).
$$ x = Lz \Rightarrow w^T x = \underbrace{w^T L}_{\alpha^T} z = \alpha^T z $$If $z_1 \dots z_p$ are non-Gaussian, then $\alpha_1 z_1 + \dots + \alpha_p z_p$ is least Gaussian when $\alpha = e_i = \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix}$ ($1$ in the $i$-th entry), in which case $\alpha^T z = z_i$ is a single source; any other $\alpha$ mixes several sources and is "more Gaussian" by the CLT intuition above.
This would not be true if the $z_i$ were Gaussian, since then $\alpha^T z \sim N(0, \alpha^T \alpha)$ for any $\alpha$.
Thm (Comon)
If $Z = \begin{pmatrix} z_1 \\ \vdots \\ z_p \end{pmatrix}$ has $E(Z)=0$ and $Var(Z)=I_p$, and $z_1 \dots z_p$ are independent and non-Gaussian, then the unmixing matrix $W$ can be identified up to sign changes and permutation.
Non-Gaussianity via Kurtosis
For a random variable $y \sim (\mu, \sigma^2)$, its $k$-th Standardized Moment is defined as:
$$ E\left[ \left( \frac{y-\mu}{\sigma} \right)^k \right] $$Depending on the value of $k$, we have the following properties:
- $k=1$ (Mean): $$E\left( \frac{y-\mu}{\sigma} \right) = 0$$
- $k=2$ (Variance): $$E\left[ \left( \frac{y-\mu}{\sigma} \right)^2 \right] = \frac{E[(y-\mu)^2]}{\sigma^2} = 1$$
- $k=3$ (Skewness): $$\text{skew}(y) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^3 \right]$$ Note: Skewness measures how symmetric $p_y(y)$ is.
- $k=4$ (Kurtosis): $$\text{kurt}(y) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^4 \right]$$ Note: Kurtosis measures how heavy the tails in $p_y(y)$ are.
Moments of a Normal Distribution
If $y \sim \mathcal{N}(\mu, \sigma^2)$, then its standardized $p$-th moments are as follows:
$$ E\left[ \left( \frac{y-\mu}{\sigma} \right)^p \right] = \begin{cases} 0 & \text{if } p \text{ is odd} \\ (p-1)!! & \text{if } p \text{ is even} \end{cases} $$where $n!!$ denotes the Double Factorial, defined as $n!! = n(n-2)(n-4)\dots$.
Examples:
Odd moments:
$$E\left( \frac{y-\mu}{\sigma} \right) = E\left[ \left( \frac{y-\mu}{\sigma} \right)^3 \right] = \dots = 0$$(Since the Gaussian distribution is symmetric about the mean, all standardized odd moments are 0)
Even moments:
$$ \begin{aligned} E\left[ \left( \frac{y-\mu}{\sigma} \right)^2 \right] &= 1!! = 1 \\ E\left[ \left( \frac{y-\mu}{\sigma} \right)^4 \right] &= 3!! = 3 \\ E\left[ \left( \frac{y-\mu}{\sigma} \right)^6 \right] &= 5!! = 5 \cdot 3 = 15 \end{aligned} $$
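A quick Monte Carlo check of these values (sample size chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)   # already standardized: mu = 0, sigma = 1

for p in range(1, 7):
    print(p, round(np.mean(y**p), 2))
# odd moments ~ 0; even moments ~ 1, 3, 15 (the double factorials)
```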
Excess Kurtosis and Linear Combinations
For a random variable $y$, Excess Kurtosis is defined as:
$$ \mathcal{K}(y) = \text{kurt}(y) - 3 $$Subtracting 3 here makes the excess kurtosis of a Gaussian Distribution 0.
Assume $z_1, z_2$ are Independent random variables with Mean Zero and Unit Variance. We want to investigate the kurtosis properties of the linear combination $\alpha_1 z_1 + \alpha_2 z_2$.
$$ \mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2) = \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) \qquad \text{(for } \alpha_1^2 + \alpha_2^2 = 1\text{)} $$Derivation
First, calculate the variance and fourth moment of the combination:
Variance:
$$ \text{Var}(\alpha_1 z_1 + \alpha_2 z_2) = \alpha_1^2 \text{Var}(z_1) + \alpha_2^2 \text{Var}(z_2) = \alpha_1^2 + \alpha_2^2 $$(Assuming unit variance)
Fourth Moment Expansion: By definition, $\mathcal{K}(y) = E(y^4) - 3$ (assuming variance is normalized). We need to expand $E[(\alpha_1 z_1 + \alpha_2 z_2)^4]$:
$$ \begin{aligned} E[(\alpha_1 z_1 + \alpha_2 z_2)^4] &= \alpha_1^4 E(z_1^4) + \alpha_2^4 E(z_2^4) + 4\alpha_1^3 \alpha_2 E(z_1^3 z_2) \\ &+ 4\alpha_1 \alpha_2^3 E(z_1 z_2^3) + 6\alpha_1^2 \alpha_2^2 E(z_1^2 z_2^2) \end{aligned} $$Using the properties of independence ($z_1 \perp z_2$) and zero mean ($E(z)=0$):
- Cross term $E(z_1^3 z_2) = E(z_1^3)E(z_2) = 0$
- Cross term $E(z_1 z_2^3) = E(z_1)E(z_2^3) = 0$
- Term $E(z_1^2 z_2^2) = E(z_1^2)E(z_2^2) = 1 \cdot 1 = 1$
Substituting $E(z^4) = \mathcal{K}(z) + 3$:
$$ \begin{aligned} E[(\dots)^4] &= \alpha_1^4 [\mathcal{K}(z_1) + 3] + \alpha_2^4 [\mathcal{K}(z_2) + 3] + 6\alpha_1^2 \alpha_2^2 \\ &= \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) + 3(\alpha_1^4 + \alpha_2^4 + 2\alpha_1^2 \alpha_2^2) \\ &= \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) + 3(\alpha_1^2 + \alpha_2^2)^2 \end{aligned} $$Finally, substituting back into the definition of excess kurtosis (assuming $\alpha_1^2 + \alpha_2^2 = 1$):
$$ \mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2) = E[(\dots)^4] - 3(\alpha_1^2 + \alpha_2^2)^2 = \alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2) $$
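A numerical check of this identity, using two assumed non-Gaussian sources: a uniform variable ($\mathcal{K} = -1.2$) and a unit-variance Laplace variable ($\mathcal{K} = 3$).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def K(y):                                   # empirical excess kurtosis of a standardized sample
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3

z1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)      # K(z1) = -1.2
z2 = rng.laplace(0, 1 / np.sqrt(2), n)            # unit variance, K(z2) = 3

a1, a2 = 0.6, 0.8                                 # a1^2 + a2^2 = 1
lhs = K(a1 * z1 + a2 * z2)
rhs = a1**4 * K(z1) + a2**4 * K(z2)
print(round(lhs, 3), round(rhs, 3))               # approximately equal
```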
Maximization of Non-Gaussianity
Assume the constraint $\alpha_1^2 + \alpha_2^2 = 1$ holds. We prove that $|\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)|$ achieves its maximum when $\alpha_1=1, \alpha_2=0$ or $\alpha_1=0, \alpha_2=1$.
Proof:
$$ \begin{aligned} |\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)| &= |\alpha_1^4 \mathcal{K}(z_1) + \alpha_2^4 \mathcal{K}(z_2)| \\ &\leq \alpha_1^4 |\mathcal{K}(z_1)| + \alpha_2^4 |\mathcal{K}(z_2)| \quad (\text{Triangle Inequality}) \\ &\leq (\alpha_1^4 + \alpha_2^4) \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|) \end{aligned} $$Considering the constraint $\alpha_1^2 + \alpha_2^2 = 1 \implies \alpha_i^2 \leq 1$:
$$ \alpha_i^4 \leq \alpha_i^2 \implies \alpha_1^4 + \alpha_2^4 \leq \alpha_1^2 + \alpha_2^2 = 1 $$Therefore:
$$ |\mathcal{K}(\alpha_1 z_1 + \alpha_2 z_2)| \leq \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|) $$Equality Conditions: The conditions for equality (i.e., kurtosis maximization) are:
- $\alpha_1=0, \alpha_2=1$ (when $|\mathcal{K}(z_2)| \geq |\mathcal{K}(z_1)|$)
- $\alpha_1=1, \alpha_2=0$ (when $|\mathcal{K}(z_1)| \geq |\mathcal{K}(z_2)|$)
Note: This result can be extended to $p$ variables $z_1 \dots z_p$ and coefficients $\alpha_1 \dots \alpha_p$.
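To illustrate the two-variable case numerically, the sketch below sweeps $\alpha = (\cos\theta, \sin\theta)$ over the unit circle (the same assumed uniform and Laplace sources as above, redefined here so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def K(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3

z1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)      # K ~ -1.2
z2 = rng.laplace(0, 1 / np.sqrt(2), n)            # K ~ 3

for t in np.linspace(0, np.pi / 2, 7):
    a1, a2 = np.cos(t), np.sin(t)
    print(f"theta={t:.2f}  |K|={abs(K(a1 * z1 + a2 * z2)):.3f}")
# |K| is maximized at theta = pi/2 here (the source with the larger |K|);
# in general the maximum over the unit circle is attained at a coordinate axis
```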
Non-Gaussianity via Negentropy
For a random variable $y$ with density $f_y(y)$, the differential entropy is
$$H(y) = E [ -\log f_y(y) ]$$Thm For a given variance, differential entropy is maximized by the normal distribution.
Negentropy is
$$J(y) = H(y_{gauss}) - H(y)$$where $y_{gauss}$ is a Gaussian random variable with the same covariance as $y$.
$$J(y) \geq 0, \quad \text{with } J(y) = 0 \quad \text{iff} \quad y \text{ is Gaussian}$$In practice, we don't know $f_y(y)$, so we need to approximate $J(y)$:
$$J(y) \simeq \sum_{i=1}^m a_i ( E(g_i(y)) - E(g_i(y_{gauss})) )^2$$where $g_i$’s are some non-quadratic functions.
Examples:
- $g_1(y) = y^4$: if $m=1$ and $E(y)=0$, $Var(y)=1$, then $$E(g_1(y_{gauss})) = 3 \quad \text{and} \quad E(g_1(y)) = \text{kurt}(y)$$ so $$J(y) \propto (\mathcal{K}(y))^2$$
- $g_1(y) = \log \cosh (y) \qquad g_2(y) = -e^{-\frac{y^2}{2}} $
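A sketch of this approximation using the two contrast functions above, with the weights $a_i$ set to 1 purely for illustration: a non-Gaussian input gives a clearly positive value, while a Gaussian input gives a value near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

g1 = lambda y: np.log(np.cosh(y))
g2 = lambda y: -np.exp(-y**2 / 2)

def negentropy_approx(y, gauss):
    # J(y) ~ sum_i a_i (E g_i(y) - E g_i(y_gauss))^2, with a_i = 1 (illustrative choice)
    return sum((np.mean(g(y)) - np.mean(g(gauss)))**2 for g in (g1, g2))

gauss = rng.standard_normal(n)
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)            # non-Gaussian, mean 0, variance 1

print(negentropy_approx(unif, gauss))                     # clearly positive
print(negentropy_approx(rng.standard_normal(n), gauss))   # near 0 for Gaussian input
```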
Implications for ICA:
maximizing $|\mathcal{K}(w^T x)|$ will presumably find $w$ such that $w^T x = \alpha^T z = \alpha_1 z_1 + \dots + \alpha_p z_p = z_i$ for some $i$, i.e. $w$ will be a proper unmixing vector.
ICA VS other methods
ICA vs PCA:
- no dimension reduction
- focus on independence as opposed to maximizing variance
ICA vs FA:
- no noise $\quad \epsilon \equiv 0 \quad (\Psi \equiv 0)$
- assume independence of $z_1 \dots z_p$ (stronger than $\text{var}(z) = I_p$)
- no dimension reduction $(r=p)$
- $z_1 \dots z_p$ are non-Gaussian
FastICA
$ICA_1$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 $$$ICA_2$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 \text{ and } w^T w_1 = 0 $$…
$ICA_p$
$$ \underset{w \in \mathbb{R}^p}{\text{maximize}} \quad |J(w^T x)| \quad \text{subject to } \|w\|=1 \text{ and } w^T w_1 = \dots = w^T w_{p-1} = 0 $$
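A minimal deflation-style sketch of this scheme for whitened data (a sketch, not prescribed by the notes): it maximizes the kurtosis contrast with the standard fixed-point step $w \leftarrow E[x\,(w^T x)^3] - 3w$, re-orthogonalizes against the vectors already found, and normalizes. Any non-quadratic contrast from the negentropy approximation could be used instead.

```python
import numpy as np

def fastica_deflation(X, n_iter=200, seed=0):
    """Kurtosis-based FastICA by deflation (a sketch, not the notes' algorithm verbatim).

    X: whitened data of shape (n_samples, p), i.e. mean 0 and identity covariance.
    Returns the estimated unmixing matrix W (rows w_1 ... w_p) and the sources X @ W.T.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.zeros((p, p))
    for i in range(p):
        w = rng.standard_normal(p)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            # fixed-point step for the kurtosis contrast: w <- E[x (w^T x)^3] - 3 w
            w_new = (X * (X @ w)[:, None] ** 3).mean(axis=0) - 3 * w
            # enforce w^T w_1 = ... = w^T w_{i-1} = 0 by projecting out earlier directions
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-9   # converged up to sign
            w = w_new
            if converged:
                break
        W[i] = w
    return W, X @ W.T

# usage sketch: center and whiten the raw data first (see the whitening snippet above),
# then the rows of W estimate the unmixing vectors.
```

The recovered sources come back only up to sign and permutation, consistent with the identifiability discussion above.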