Related to matrix transpose operations

The derivation rests entirely on the basic algebraic rules of the matrix transpose. For any real matrix $X$, the product $P = XX^\top$ is a symmetric matrix.

Derivation goal

We want to prove $P^\top = P$, where $P = XX^\top$.

Detailed steps

$$ P^\top = (XX^\top)^\top $$

Step 1: Apply the product rule for transposes (the reverse-order law). A core rule of matrix algebra states that the transpose of a product of two matrices equals the product of their transposes in reverse order:

$$ (AB)^\top = B^\top A^\top $$

We apply this rule to our equation:

  • Let $A = X$
  • Let $B = X^\top$

Then $(X \cdot X^\top)^\top$ becomes the transpose of $B$ times the transpose of $A$:

$$ (XX^\top)^\top = (X^\top)^\top X^\top $$

Step 2: Apply the involution law of transposition. This step handles $(X^\top)^\top$. The rule is simple: transposing a matrix twice returns the original matrix. The formula is:

$$ (M^\top)^\top = M $$

Apply to the previous result:

$$ (X^\top)^\top X^\top = X X^\top $$

Step 3: Return to the definition. The result is now $XX^\top$, which by the original definition equals $P$. So:

$$ XX^\top = P $$

Summary

That is why the image shows:

$$ P^\top = (XX^\top)^\top = (X^\top)^\top X^\top = XX^\top = P $$

This proves that $P$ is a Symmetric Matrix.
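As a quick numerical sanity check of the symmetry proof (a NumPy sketch; the matrix size and seed are arbitrary choices, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))  # any real matrix; columns need NOT be orthonormal
P = X @ X.T

# The derivation predicts P^T == P for every real X
print(np.allclose(P.T, P))  # True
```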

Supplementary explanation: About the use of “column orthogonal”

Since you mentioned that $X$ is column-orthogonal, it’s worth noting that this property ($X^\top X = I_p$) plays a key role in the following:

$$ P^2 = X(\underbrace{X^\top X}_{I_p})X^\top = X I_p X^\top = XX^\top = P $$

As for the symmetry we just derived: even if $X$ is not column-orthogonal, $XX^\top$ is always symmetric.
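To see the role of column orthogonality numerically, here is a small sketch: building a column-orthonormal $X$ via QR (an illustrative construction, not from the problem) makes $P = XX^\top$ idempotent as well as symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
# A column-orthonormal X (X^T X = I_p), built via QR decomposition
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))
P = X @ X.T

print(np.allclose(X.T @ X, np.eye(3)))  # X^T X = I_p by construction
print(np.allclose(P @ P, P))            # idempotence: this needs X^T X = I_p
```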

Is this explained clearly? If you need a more detailed expansion of the “idempotence” of the first line, please feel free to tell me.

Distribution of the response variable in linear regression

Core Concept: The linear regression model is defined as $y = X\beta + \epsilon$. Here $y$ is a linear transformation of the random variable $\epsilon$ (plus a constant vector $X\beta$).

Derivation steps:

  1. Expectation (Mean):

    $$ \mathbb{E}[y] = \mathbb{E}[X\beta + \epsilon] = X\beta + \mathbb{E}[\epsilon] $$

    Because $\epsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$, therefore $\mathbb{E}[\epsilon] = 0$.

    $$ \mathbb{E}[y] = X\beta $$
  2. Variance:

    $$ \text{Var}(y) = \text{Var}(X\beta + \epsilon) $$

    Because $X\beta$ is a constant vector, it does not affect the variance:

    $$ \text{Var}(y) = \text{Var}(\epsilon) = \sigma^2 I_n $$

Conclusion: by the properties of the multivariate normal distribution, a linear transformation of a normal random vector is still normal.

$$y \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$$
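The mean and variance can be checked by simulation (a sketch with illustrative values of $n$, $p$, $\sigma$, and $\beta$; none of these numbers come from the problem):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 4, 2, 0.5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative design matrix
beta = np.array([1.0, -2.0])                      # illustrative true coefficients

# Simulate many draws of y = X beta + eps with eps ~ N(0, sigma^2 I_n)
eps = sigma * rng.standard_normal((200_000, n))
y = X @ beta + eps

print(np.allclose(y.mean(axis=0), X @ beta, atol=0.01))                       # E[y] = X beta
print(np.allclose(np.cov(y, rowvar=False), sigma**2 * np.eye(n), atol=0.01))  # Var(y) = sigma^2 I_n
```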

OLS least squares method

Core Concept: the fitted values $\hat{y}$ are defined as $X\hat{\beta}$, so we first need to find $\hat{\beta}$.

Derivation steps:

  1. Find $\hat{\beta}$ (the OLS estimator): the general formula is $\hat{\beta} = (X^\top X)^{-1} X^\top y$. Key point: the problem states that $X$ is column-orthogonal, that is, $X^\top X = I_p$. Substituting into the formula:

    $$ \hat{\beta} = (I_p)^{-1} X^\top y = I_p X^\top y = X^\top y $$

    (This is why the solution writes $\hat{\beta} = X^\top y$ directly.)

  2. Find $\hat{y}$:

    $$ \hat{y} = X\hat{\beta} = X(X^\top y) = (XX^\top)y $$
  3. Define $P$: from the formula above, the linear operator (matrix) $P$ is:

    $$P = XX^\top$$

Geometric meaning: $P$ projects the vector $y$ onto the subspace spanned by the column vectors of $X$ (the Column Space, $\mathcal{C}(X)$).

  • Verify it is a projection matrix:
  • Idempotence: $P^2 = (XX^\top)(XX^\top) = X(X^\top X)X^\top = X(I_p)X^\top = XX^\top = P$.
  • Symmetry: $P^\top = (XX^\top)^\top = (X^\top)^\top X^\top = XX^\top = P$.
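The simplified estimator and the projection checks above can be verified numerically (a sketch; the dimensions and seed are illustrative, and `np.linalg.lstsq` stands in for the general OLS solution):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # X^T X = I_p, as in the exercise
y = rng.standard_normal(n)

beta_hat = X.T @ y   # OLS estimator simplifies to X^T y here
P = X @ X.T          # projection onto C(X)
y_hat = P @ y

# Agrees with the general least-squares solution (X^T X)^{-1} X^T y
beta_general, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_general))            # True
print(np.allclose(P @ P, P) and np.allclose(P.T, P))  # projection matrix checks
```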

Linear transformation properties of the multivariate normal distribution

Core Concept: $\hat{y} = Py$ is a linear transformation of $y$. If $y \sim \mathcal{N}(\mu, \Sigma)$, then $Ay \sim \mathcal{N}(A\mu, A\Sigma A^\top)$.

Derivation steps:

  1. Expectation:

    $$ \mathbb{E}[\hat{y}] = P\mathbb{E}[y] = (XX^\top)(X\beta) $$

    Use the associative law $X(X^\top X)\beta = X(I_p)\beta = X\beta$.

    $$ \mathbb{E}[\hat{y}] = X\beta $$

    (This also shows that $\hat{y}$ is an unbiased estimate of $X\beta$).

  2. Variance:

    $$ \text{Var}(\hat{y}) = P \text{Var}(y) P^\top $$

    Substitute $\text{Var}(y) = \sigma^2 I_n$ and $P^\top = P$:

    $$ \text{Var}(\hat{y}) = P (\sigma^2 I_n) P = \sigma^2 P^2 $$

    Taking advantage of idempotence $P^2 = P$:

    $$ \text{Var}(\hat{y}) = \sigma^2 P = \sigma^2 XX^\top $$

Conclusion:

$$\hat{y} \sim \mathcal{N}_n(X\beta, \sigma^2 XX^\top)$$
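A simulation check of this conclusion (a sketch with illustrative sizes, seed, and coefficients; only the identities $E[\hat{y}] = X\beta$ and $\text{Var}(\hat{y}) = \sigma^2 P$ come from the derivation):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 4, 2, 1.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative column-orthonormal design
beta = np.array([0.5, 1.5])
P = X @ X.T

# Many draws of y, then y_hat = P y applied row-wise
y = X @ beta + sigma * rng.standard_normal((200_000, n))
y_hat = y @ P.T

print(np.allclose(y_hat.mean(axis=0), X @ beta, atol=0.02))               # E[y_hat] = X beta
print(np.allclose(np.cov(y_hat, rowvar=False), sigma**2 * P, atol=0.02))  # Var(y_hat) = sigma^2 P
```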

Distribution of Residual

Derivation steps:

  1. Expectation:

    $$ \mathbb{E}[r] = (I_n - P)\mathbb{E}[y] = (I_n - XX^\top)X\beta $$

    Expand: $X\beta - XX^\top X \beta = X\beta - X(I_p)\beta = X\beta - X\beta = \mathbf{0}$.

    $$ \mathbb{E}[r] = \mathbf{0} $$
  2. Variance:

    $$ \text{Var}(r) = P_\perp \text{Var}(y) P_\perp^\top = (I_n - P)(\sigma^2 I_n)(I_n - P) $$

    $$ = \sigma^2 (I_n - P)^2 $$

    Taking advantage of idempotence $(I_n - P)^2 = I_n - P$:

    $$ \text{Var}(r) = \sigma^2 (I_n - XX^\top) $$

Conclusion:

$$r \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 (I_n - XX^\top))$$
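The variance collapse used above, $P_\perp (\sigma^2 I_n) P_\perp^\top = \sigma^2 P_\perp$, can be verified exactly (a sketch; dimensions, seed, and $\sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 5, 2, 2.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative column-orthonormal X
P = X @ X.T
P_perp = np.eye(n) - P

# The sandwich P_perp (sigma^2 I) P_perp^T collapses to sigma^2 P_perp by idempotence and symmetry
var_r = P_perp @ (sigma**2 * np.eye(n)) @ P_perp.T
print(np.allclose(var_r, sigma**2 * P_perp))  # True
```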

$I-P$ is itself also a projection matrix. Although the proof was already written in part (a) of Q1, it is repeated here.

In linear algebra, the defining test for a projection matrix is idempotency: the matrix must satisfy $A^2 = A$.

In the context of statistics or least squares (as in your previous question), we usually mean the orthogonal projection matrix, which must also satisfy symmetry: $A^\top = A$.

Let’s prove that $I-P$ satisfies these two properties respectively.

1. Prove idempotency (Idempotency)

Goal: prove $(I - P)^2 = I - P$. Known: $P$ is a projection matrix, so $P^2 = P$.

Derivation:

$$ \begin{aligned} (I - P)^2 &= (I - P)(I - P) \\ &= I - P - P + P^2 \\ &= I - 2P + P^2 \end{aligned} $$

At this time, use the known condition $P^2 = P$ to substitute:

$$ \begin{aligned} &= I - 2P + P \\ &= I - P \end{aligned} $$

Conclusion: because $(I - P)^2 = I - P$, $I - P$ is also a projection matrix.


2. Prove symmetry (Symmetry)

*Note: this step is only required for orthogonal projection. If $P$ is merely an oblique projection, this step is not needed, but in the context of your regression analysis it is required.*

Goal: prove $(I - P)^\top = I - P$. Known: $P$ is symmetric, i.e. $P^\top = P$ (and $I$ is also symmetric).

Derivation:

$$ \begin{aligned} (I - P)^\top &= I^\top - P^\top && \text{(transpose distributes over subtraction)} \\ &= I - P^\top && \text{(identity matrix transpose unchanged)} \\ &= I - P && \text{(using known } P^\top = P\text{)} \end{aligned} $$

Conclusion: $I - P$ is also a symmetric matrix.


3. Geometric Intuition

To give you a clearer understanding, you can look at it this way:

Any vector $v$ can be decomposed into two parts:

$$ v = Pv + (I - P)v $$
  • $Pv$ is the projection (shadow) of $v$ on some space $S$.
  • $(I-P)v$ is the projection of $v$ onto the orthogonal complement of $S$.

If you do another projection on $(I-P)v$:

$$ (I-P) [ (I-P)v ] $$

Because $(I-P)v$ already lies in the complementary space, projecting it again leaves it unchanged. This is the geometric reason why $(I-P)^2$ must equal $I-P$.
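This decomposition and "project twice changes nothing" picture can be checked directly (a sketch; the dimensions, seed, and test vector are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 5, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative X with orthonormal columns
P = X @ X.T
Q = np.eye(n) - P
v = rng.standard_normal(n)

print(np.allclose(v, P @ v + Q @ v))    # decomposition v = Pv + (I-P)v
print(np.allclose(Q @ (Q @ v), Q @ v))  # projecting again changes nothing
```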


Calculation formula and derivation of Covariance matrix in MVN

The prototype of the most basic and important formula when dealing with linear transformation of random vectors in Multivariate Statistics is as follows:

$$ \text{Cov}(Ay, By) = A \text{Var}(y) B^\top $$

Let’s break down its source and derivation process in detail.

1. Derivation from First Principles

To understand this formula, we need to go back to the original definition of Covariance Matrix.

Suppose $y$ is a random vector with mean $\mu_y = E[y]$ and variance (covariance matrix) $\Sigma_y = \text{Var}(y) = E[(y-\mu_y)(y-\mu_y)^\top]$.

Now we have two new random vectors, both linear transformations of $y$:

  1. $u = Ay$ (corresponding to $\hat{y}$ in the question, with $A=P$)
  2. $v = By$ (corresponding to $r$ in the question, with $B=I-P$)

We want to calculate the covariance $\text{Cov}(u, v)$ between $u$ and $v$.

Derivation steps:

Step 1: Definition According to the definition of covariance matrix:

$$ \text{Cov}(u, v) = E\left[ (u - E[u]) (v - E[v])^\top \right] $$

Step 2: Substitute the linear transformations. Since the expectation operator $E$ is linear, $E[u] = E[Ay] = A E[y]$. Therefore:

$$ u - E[u] = Ay - A E[y] = A(y - E[y]) $$

Same reason:

$$ v - E[v] = B(y - E[y]) $$

Step 3: Substitute the definition

$$ \begin{aligned} \text{Cov}(Ay, By) &= E\left[ \Big( A(y - E[y]) \Big) \Big( B(y - E[y]) \Big)^\top \right] \end{aligned} $$

Step 4: Use the transpose rule. Remember $(CD)^\top = D^\top C^\top$? Apply it to the second factor:

$$ \Big( B(y - E[y]) \Big)^\top = (y - E[y])^\top B^\top $$

Now the formula becomes:

$$ \text{Cov}(Ay, By) = E\left[ A \underbrace{(y - E[y]) (y - E[y])^\top}_{\text{core part}} B^\top \right] $$

Step 5: Pull out the constant matrices. Inside the expectation $E[\cdot]$, $A$ and $B$ are constant matrices and can be moved outside:

$$ \text{Cov}(Ay, By) = A \cdot E\left[ (y - E[y]) (y - E[y])^\top \right] \cdot B^\top $$

Step 6: Identify the core part. The middle factor $E\left[ (y - E[y]) (y - E[y])^\top \right]$ is exactly the covariance matrix $\text{Var}(y)$ of $y$.

Final conclusion:

$$ \text{Cov}(Ay, By) = A \text{Var}(y) B^\top $$
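The final formula can be checked by Monte Carlo (a sketch; the matrices $A$, $B$, $\Sigma$, $\mu$ below are made-up illustrative values, not from the problem):

```python
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
B = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu = np.array([1.0, -1.0, 0.0])

y = rng.multivariate_normal(mu, Sigma, size=500_000)
u, v = y @ A.T, y @ B.T  # row-wise u = Ay, v = By

# Empirical cross-covariance vs the formula A Sigma B^T
emp = (u - u.mean(0)).T @ (v - v.mean(0)) / (len(y) - 1)
print(np.allclose(emp, A @ Sigma @ B.T, atol=0.05))
```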

2. Return to your topic (Application)

After understanding the general formula above, let’s see how it is applied to question (f):

  • $y$: That is the original observation value, we know $\text{Var}(y) = \sigma^2 I_n$.
  • $\hat{y} = Py$: So here is $A = P$.
  • $r = (I_n - P)y$: So here is $B = (I_n - P)$.

Apply the formula directly:

$$ \begin{aligned} \text{Cov}(\hat{y}, r) &= P \cdot \text{Var}(y) \cdot (I_n - P)^\top \\ &= P (\sigma^2 I_n) (I_n - P^\top) \\ &= \sigma^2 P (I_n - P) && \text{(because } P \text{ and } I_n \text{ are symmetric)} \\ &= \sigma^2 (P - P^2) \\ &= \sigma^2 (P - P) && \text{(because } P \text{ is idempotent)} \\ &= \mathbf{0} \end{aligned} $$

3. Why is the result 0? (intuitive understanding)

This goes back to the geometric concept I told you in the previous question (d):

  • $\hat{y}$ lives in the column space of $X$.
  • $r$ lives in the orthogonal complement of the column space of $X$.
  • They are perpendicular to each other.

Answer to (f)

For jointly normal vectors, orthogonal (perpendicular) components are uncorrelated, and uncorrelated jointly normal vectors are independent. This is the final conclusion that question (f) asks us to prove.

Core Concept: for the multivariate normal distribution (MVN), uncorrelated is equivalent to independent. We need to show that their covariance matrix is 0.

Derivation steps:

$$ \begin{aligned} \text{Cov}(\hat{y}, r) &= \text{Cov}(Py, (I_n-P)y) \\ &= P \text{Var}(y) (I_n - P)^\top \quad \text{(using } \text{Cov}(Ax, Bx) = A \text{Var}(x) B^\top \text{)} \\ &= P (\sigma^2 I_n) (I_n - P) \\ &= \sigma^2 (P - P^2) \end{aligned} $$

Since $P$ is a projection matrix, $P^2 = P$ is satisfied:

$$ \text{Cov}(\hat{y}, r) = \sigma^2 (P - P) = \mathbf{0}_{n \times n} $$

Conclusion: because $\hat{y}$ and $r$ are jointly normal and uncorrelated (their covariance is 0), they are independent.
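The key algebraic step, $P(I_n - P) = \mathbf{0}$, can be verified exactly (a sketch; the dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 6, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative column-orthonormal X
P = X @ X.T

# Cov(y_hat, r) is proportional to P(I - P) = P - P^2, which vanishes by idempotence
print(np.allclose(P @ (np.eye(n) - P), np.zeros((n, n))))  # True
```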


Joint/Conditional Distribution

1. From Univariate to Bivariate

Joint Distribution

Imagine we have two random variables $X$ and $Y$.

  • Univariate: Only care about $X$, its distribution is a bell curve (Bell Curve).
  • Joint: Care about both $X$ and $Y$. This means we have to look at where the point $(X, Y)$ falls.
  • If you draw a picture, it is no longer a line, but a three-dimensional hill (3D Surface).
  • Composition: It is determined by two elements:
  1. Center Location: $(\mu_x, \mu_y)$.
  2. Shape (covariance): this determines whether the hill is round ($X, Y$ independent) or a flattened ellipse ($X, Y$ dependent).

Conditional Distribution

This is the core of Bayesian inference: given $Y=y_0$, infer the distribution of $X$.

  • Geometric Action: This is equivalent to holding a knife and making a cut perpendicular to the $Y$ axis at the $Y=y_0$ position.
  • Slice: The cut section, after normalization, is still a normal distribution (bell-shaped curve).

Intuition for scalar formulas (remember this form): If $Y=y$ is known, what happens to the expectation of $X$?

$$ E[X|Y=y] = \mu_x + \underbrace{\rho \frac{\sigma_x}{\sigma_y}}_{\text{coefficient}} (y - \mu_y) $$
  • Intuition: Without looking at $Y$, we guess $X$ is $\mu_x$. Now that we see $Y$, we need to correct the prediction for $X$ based on how much $Y$ deviates from $\mu_y$, multiplied by a “correlation coefficient ratio”.
  • Variance: $$\text{Var}(X|Y=y) = \sigma_x^2 (1 - \rho^2)$$
  • Intuition: Knowing $Y$, the uncertainty (variance) of $X$ becomes smaller.
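Both scalar formulas can be checked by simulation: draw from a bivariate normal, keep only samples with $Y$ very close to $y_0$, and compare the slice's mean and variance to the formulas (a sketch; all parameter values are illustrative, and the narrow-slab conditioning is a crude approximation):

```python
import numpy as np

rng = np.random.default_rng(9)
mu_x, mu_y = 1.0, -2.0
sx, sy, rho = 2.0, 1.0, 0.8  # illustrative parameters
cov = np.array([[sx**2,     rho*sx*sy],
                [rho*sx*sy, sy**2    ]])

xy = rng.multivariate_normal([mu_x, mu_y], cov, size=2_000_000)
y0 = -1.5
slab = xy[np.abs(xy[:, 1] - y0) < 0.01, 0]  # crude conditioning on Y close to y0

pred_mean = mu_x + rho * (sx / sy) * (y0 - mu_y)  # conditional mean formula
pred_var = sx**2 * (1 - rho**2)                   # conditional variance formula

print(abs(slab.mean() - pred_mean) < 0.1)
print(abs(slab.var() - pred_var) < 0.2)
```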

2. Advanced: Matrix form of multivariate normal distribution (MVN)

Now we generalize to vectors $x$ and $y$ (which may have $p$ and $q$ components respectively).

Joint Distribution - “Stacking”

The joint distribution essentially means “putting” two vectors together to form a larger vector. Assume $x \sim \mathcal{N}_p$ and $y \sim \mathcal{N}_q$.

We stack them into a $(p+q)$ dimensional vector $z$:

$$ z = \begin{pmatrix} x \\ y \end{pmatrix} $$

What does this joint distribution consist of? It still consists of a mean vector and a covariance matrix, but now written in block-matrix form:

$$ \begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}_{p+q} \left( \underbrace{\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}}_{\text{joint mean}}, \quad \underbrace{\begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}}_{\text{joint covariance matrix}} \right) $$

Disassemble this covariance matrix:

  • $\Sigma_{xx}$ (Top-Left): $x$’s own variance-covariance matrix (dimension $p \times p$).
  • $\Sigma_{yy}$ (Bottom-Right): $y$’s own variance-covariance matrix (dimension $q \times q$).
  • $\Sigma_{xy}$ (Off-Diagonal): This is the key. It describes the relationship between $x$ and $y$ (dimension $p \times q$). If these two blocks are $\mathbf{0}$, it means $x$ and $y$ are independent.
  • $\Sigma_{yx}$: It is the transpose of $\Sigma_{xy}$ ($\Sigma_{xy}^\top$).

Conditional Distribution - “Projection and Correction”

Now, we have observed the specific value of vector $y$ (Given $y$), and we want to find the distribution $p(x|y)$ of $x$.

This uses the famous MVN conditional distribution formula, which is exactly the matrix version of the scalar formula above.

$x|y$ still obeys the multivariate normal distribution:

$$ x|y \sim \mathcal{N}_p (\mu_{x|y}, \Sigma_{x|y}) $$

Let’s look at how the mean and variance come from:

1. Conditional Mean - The core logic here is “regression”

$$ \mu_{x|y} = \mu_x + \underbrace{\Sigma_{xy} \Sigma_{yy}^{-1}}_{\text{regression coefficient}} (y - \mu_y) $$
  • $\mu_x$: Our a priori guess (Base line).
  • $y - \mu_y$: The “surprise” (Innovation/Error) brought by the observed $y$, that is, how much $y$ deviates from the expectation.
  • $\Sigma_{xy} \Sigma_{yy}^{-1}$: Similar to $\rho \frac{\sigma_x}{\sigma_y}$ in scalar. This is actually the linear regression coefficient matrix (Regression Coefficient Matrix)!
  • It “translates/maps” the deviation of $y$ into the deviation of $x$.

2. Conditional Covariance - The core logic here is the "Schur complement"

$$ \Sigma_{x|y} = \Sigma_{xx} - \underbrace{\Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}}_{\text{information gain}} $$
  • $\Sigma_{xx}$: Uncertainty of original $x$.
  • Minus Term: Because we know $y$, we get some information about $x$, so the uncertainty must be reduced.
  • Schur Complement: This structure $\Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$ is called Schur complement in linear algebra and is specially used to handle the inverse and conditional properties of block matrices.
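The two conditional formulas translate into a few lines of NumPy (a sketch; all the block matrices and the observed value below are made-up illustrative numbers):

```python
import numpy as np

# Illustrative block pieces of a joint covariance for (x, y); all numbers are invented
mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0, -1.0])
S_xx = np.array([[2.0, 0.3], [0.3, 1.0]])
S_yy = np.array([[1.5, 0.2], [0.2, 1.0]])
S_xy = np.array([[0.4, 0.1], [0.0, 0.3]])
y_obs = np.array([2.5, -0.5])

coef = S_xy @ np.linalg.inv(S_yy)       # "regression coefficient" matrix Sigma_xy Sigma_yy^{-1}
mu_cond = mu_x + coef @ (y_obs - mu_y)  # conditional mean
Sigma_cond = S_xx - coef @ S_xy.T       # Schur complement S_xx - S_xy S_yy^{-1} S_yx

print(np.allclose(Sigma_cond, Sigma_cond.T))       # conditional covariance is symmetric
print(np.all(np.linalg.eigvalsh(Sigma_cond) > 0))  # and positive definite here
```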

Summary

  1. Joint distribution: just stack the vectors and combine the variances into one large matrix.
  • Key point: look at the off-diagonal block $\Sigma_{xy}$. If it is 0, they are independent; if it is nonzero, they are correlated.
  2. Conditional distribution: it uses the observed value $y$ to correct $x$.
  • The mean is corrected: a correlation-based offset is added.
  • The variance is reduced: the certainty gained from the known information is subtracted.

Returning to your earlier exercise, question (g): why is this a joint distribution? Because it directly stacks $\hat{y}$ and $r$ together:

$$ \begin{pmatrix} \hat{y} \\ r \end{pmatrix} $$

And by proving $\Sigma_{12} = \text{Cov}(\hat{y}, r) = \mathbf{0}$, we immediately know that this Joint Distribution can be decomposed into two independent distributions.

The matrix logic here is very elegant. Alex, if you want to substitute this formula into the specific blocks asked about in (g) (for example, to find $E[\hat{y}|r]$), we can give it a try.

(g) Joint distribution of $(\hat{y}, r)^\top$.

Core Concept: We “stack” two vectors. Since they are independent, the off-diagonal blocks of the covariance matrix are 0.

Expansion in matrix form:

$$ \begin{pmatrix} \hat{y} \\ r \end{pmatrix} \sim \mathcal{N}_{2n} \left( \begin{pmatrix} X\beta \\ \mathbf{0} \end{pmatrix}, \sigma^2 \begin{pmatrix} XX^\top & \mathbf{0} \\ \mathbf{0} & I_n - XX^\top \end{pmatrix} \right) $$

Here I replaced $P$ with the standard form $XX^\top$ and $P_\perp$ with $I_n - XX^\top$.


(h) What is the conditional distribution of $r | \hat{y}$?

Core Concept: this question is a "trap" question (or a free-points question). Usually the formula for the conditional distribution $y|x$ is complex (involving $\mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x-\mu_x)$, etc.). However, we showed in (f) that $r$ and $\hat{y}$ are independent.

Derivation: If variables $A$ and $B$ are independent, then the conditional distribution of $A$ given $B$ is the marginal distribution (Marginal Distribution) of $A$ itself.

$$ f(r | \hat{y}) = f(r) $$

Conclusion:

$$r | \hat{y} \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 (I_n - XX^\top))$$

(i.e. exactly the same distribution as in (e)).


Summary: this question demonstrates the beautiful properties of an orthogonal design matrix ($X^\top X = I$):

  1. The calculation is extremely simplified (no need to find the inverse matrix).
  2. The fitted values and residuals are geometrically perfectly separated by the orthogonal projection matrices $XX^\top$ and $I - XX^\top$.
  3. They are not only geometrically orthogonal, but also statistically independent.

The orthogonal complement of the column space, $\mathcal{C}(X)^\perp$

The complete meaning of this statement is: **the matrix $P_\perp$ projects a vector onto the orthogonal complement of the column space of $X$.**

Let me break down this concept in detail, which is very critical in the geometric interpretation of linear regression.

1. Symbol disassembly

  • $C(X)$: represents Column Space of X (the column space of $X$).
  • This is the space that can be linearly combined with all $X$ column vectors.
  • In linear regression, our fitted value $\hat{y} = X\beta$ is always in this space.
  • $\perp$: stands for Orthogonal / Perpendicular (orthogonal/perpendicular).
  • $C(X)^\perp$ (handwritten note is written as $C(X_\perp)$): represents Orthogonal Complement of the Column Space (orthogonal complement of column space).

2. Geometric meaning: What is “orthogonal complement”?

Imagine you are in a three-dimensional room ($n=3$):

  • $C(X)$ (desktop): Suppose $X$ has two columns, and they form a plane (such as the desktop of a table). The model believes that the truth is on the table.
  • $y$ (vector): The observation data is an arrow flying in the air, it is not on the desktop.
  • $P$ (projection): The function of the $P$ matrix is to drop a pebble vertically onto the table from the tip of the arrow of $y$. The landing point is $\hat{y}$.
  • $r$ (residual): The vertical distance from the landing point $\hat{y}$ to $y$ is the residual vector.

Here comes the key point: The residual vector $r$ is perpendicular to the table. This means that $r$ belongs to the “orthogonal complement space of the desktop”.

  • $P$ projects $y$ onto the desktop ($\mathcal{C}(X)$).
  • $P_\perp$ is $(I-P)$, which projects $y$ onto the line perpendicular to the desktop ($\mathcal{C}(X)^\perp$).

3. Verification of mathematical definitions

Since $r = P_\perp y$, to prove that $r$ really lies in the orthogonal complement of the column space of $X$, we only need to show that $r$ is perpendicular to every column of $X$.

Mathematically, if $u$ and $v$ are perpendicular, then $u^\top v = 0$. So we verify that $X^\top r$ is equal to 0:

$$ \begin{aligned} X^\top r &= X^\top (y - \hat{y}) \\ &= X^\top (y - X\hat{\beta}) \\ &= X^\top y - X^\top X \hat{\beta} \end{aligned} $$

Recall the normal equations, $X^\top X \hat{\beta} = X^\top y$, whose solution is $\hat{\beta} = (X^\top X)^{-1}X^\top y$ (or $\hat{\beta} = X^\top y$ in this particular orthogonal problem). Either way, we have:

$$ X^\top X \hat{\beta} = X^\top y $$

so:

$$ X^\top r = X^\top y - X^\top y = 0 $$

Conclusion: the residual $r$ is orthogonal (perpendicular) to all columns of $X$, so the vectors produced by $P_\perp$ do fall in the orthogonal complement of $C(X)$.
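The orthogonality $X^\top r = 0$ can be confirmed numerically (a sketch; the dimensions and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 6, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # illustrative column-orthonormal design
y = rng.standard_normal(n)

y_hat = X @ (X.T @ y)  # fitted values (beta_hat = X^T y in the orthonormal case)
r = y - y_hat          # residuals

print(np.allclose(X.T @ r, np.zeros(p)))  # r is orthogonal to every column of X
```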

Summary

$C(X_\perp)$ in the handwritten note actually says: “The space in which the residual is located is the space composed of all vectors perpendicular to the column vector of $X$.”

This also corresponds to the Left Null Space in linear algebra, that is, $\mathcal{N}(X^\top)$.