Why is $\sigma^2$ a scalar in the first question?
This is a very perceptive question! Your intuition is right: in general, the error structure is described by a matrix, not a scalar.
The reason $\sigma^2$ is a scalar here is because this is one of the strongest assumptions of Ordinary Least Squares (OLS).
Let me break down what this means, and explain when it does become a matrix.
1. Your inference is completely correct
"Does this mean that this linear regression assumes exactly the same variance for every sample?"
**Yes, that's exactly what it means.**
In the most general case, the covariance matrix of the error vector $\epsilon$ is a full $n \times n$ matrix:
$$ \Sigma = \text{Cov}(\epsilon) = \begin{bmatrix} \text{Var}(\epsilon_1) & \text{Cov}(\epsilon_1, \epsilon_2) & \cdots & \text{Cov}(\epsilon_1, \epsilon_n) \\ \text{Cov}(\epsilon_2, \epsilon_1) & \text{Var}(\epsilon_2) & \cdots & \text{Cov}(\epsilon_2, \epsilon_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(\epsilon_n, \epsilon_1) & \text{Cov}(\epsilon_n, \epsilon_2) & \cdots & \text{Var}(\epsilon_n) \end{bmatrix} $$
Writing $\Sigma = \sigma^2 I_n$, as in the figure, actually imposes two very strict assumptions on this matrix:
- Equal diagonal (Homoscedasticity): the variance $\text{Var}(\epsilon_i)$ is the same value $\sigma^2$ for every sample. In other words, the noise level of the 1st sample is the same as that of the 100th sample.
- Zero off-diagonal (No Autocorrelation): the errors of different samples are uncorrelated, $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.
So the matrix becomes like this:
$$ \Sigma = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \sigma^2 I_n $$
**This is why $\sigma^2$ can be pulled out as a scalar.**
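As a quick sanity check, here is a minimal NumPy sketch (the simulation setup is my own, not from the question): draw many i.i.d. error vectors with a common variance $\sigma^2$ and verify that their sample covariance matrix is close to $\sigma^2 I_n$.

```python
import numpy as np

# Assumed toy setup: n = 4 errors per sample, true variance sigma^2 = 2
rng = np.random.default_rng(0)
n, sigma2, reps = 4, 2.0, 200_000

# Each row is one error vector epsilon = (epsilon_1, ..., epsilon_n),
# all entries i.i.d. with the same variance sigma^2 (homoscedastic, uncorrelated)
eps = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

# Empirical n x n covariance matrix of the error vector
Sigma_hat = np.cov(eps, rowvar=False)

print(np.round(Sigma_hat, 2))  # close to sigma^2 * I_n: ~2 on the diagonal, ~0 elsewhere
assert np.allclose(Sigma_hat, sigma2 * np.eye(n), atol=0.05)
```

The diagonal entries converge to $\sigma^2$ and the off-diagonal entries to zero, which is exactly the $\Sigma = \sigma^2 I_n$ structure above.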
2. What happens if the variances are inconsistent?
If, as your intuition suggests, each sample may have a different variance, then $\sigma^2$ cannot be pulled out as a scalar in front of the identity matrix.
This situation is called Heteroscedasticity in statistics.
In that case, the covariance matrix $\Sigma$ has different values on its diagonal (still assuming the samples are independent):
$$ \Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \neq \sigma^2 I_n $$
In this case:
- You cannot write $\Sigma$ as $\sigma^2 I_n$.
- You should not use ordinary least squares (OLS): because OLS weights all data points equally, its estimates remain unbiased but are no longer efficient, and the usual standard errors are wrong.
- You need Weighted Least Squares (WLS) or Generalized Least Squares (GLS), which give higher weight to samples with smaller variance.
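The WLS idea can be sketched in a few lines of NumPy. This is a toy simulation under assumptions of my own choosing (the design matrix, the true coefficients, and the per-sample $\sigma_i$ are all made up for illustration): when each sample has its own variance $\sigma_i^2$, WLS solves $(X^\top W X)\beta = X^\top W y$ with weights $w_i = 1/\sigma_i^2$.

```python
import numpy as np

# Assumed toy setup: intercept + one feature, known true coefficients
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta_true = np.array([1.0, 2.0])

# Heteroscedastic noise: each sample gets its own standard deviation sigma_i
sigma_i = rng.uniform(0.5, 5.0, n)
y = X @ beta_true + rng.normal(0.0, sigma_i)

# OLS: (X'X)^{-1} X'y -- ignores the unequal variances
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# WLS: (X'WX)^{-1} X'Wy with W = diag(1 / sigma_i^2),
# so low-noise samples count more than high-noise ones
W = np.diag(1.0 / sigma_i**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print("OLS:", np.round(beta_ols, 3))
print("WLS:", np.round(beta_wls, 3))  # typically closer to beta_true
```

Both estimators are unbiased here, but across repeated simulations the WLS estimates scatter less around `beta_true`, which is exactly the efficiency gain from down-weighting the noisy samples.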
3. Summary
The derivation you saw covers the most basic, ideal linear regression setting (the i.i.d. assumption: errors are independent and identically distributed).
- If it's a matrix: that's the generalized case ($\Sigma$).
- If it's a scalar: that's a special case ($\sigma^2 I_n$), which implicitly assumes the "noise level" is the same for every data point.