00 Mathematical Foundation
Overview
This document covers essential mathematical and statistical foundations for understanding machine learning algorithms. Topics include linear algebra, probability theory, random variables, statistical distributions, and basic inference methods.
Linear Algebra
Linear algebra is fundamental to machine learning, as most ML algorithms operate on vectors and matrices.
Vectors
Vector: An ordered array of numbers representing a point or direction in space.
Column vector: $\mathbf{x}=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}$
Operations:
Addition: $\mathbf{x}+\mathbf{y}=\begin{bmatrix}x_1+y_1\\x_2+y_2\\\vdots\\x_n+y_n\end{bmatrix}$
Scalar multiplication: $c\mathbf{x}=\begin{bmatrix}cx_1\\cx_2\\\vdots\\cx_n\end{bmatrix}$
Dot product: $\mathbf{x}^T\mathbf{y}=\sum_{i=1}^nx_iy_i$
Norm: $\|\mathbf{x}\|_2=\sqrt{\sum_{i=1}^nx_i^2}$ (Euclidean norm)
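A quick NumPy sketch of these operations (the array values are arbitrary examples, not from the text):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(x + y)              # elementwise addition
print(2.0 * x)            # scalar multiplication
print(x @ y)              # dot product: sum of x_i * y_i
print(np.linalg.norm(x))  # Euclidean (L2) norm
```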
Matrices
Matrix: A 2D array of numbers with m rows and n columns.
Operations:
Addition: $(\mathbf{A}+\mathbf{B})_{ij}=a_{ij}+b_{ij}$
Scalar multiplication: $(c\mathbf{A})_{ij}=ca_{ij}$
Matrix multiplication: $(\mathbf{AB})_{ik}=\sum_{j=1}^na_{ij}b_{jk}$
Transpose: $(\mathbf{A}^T)_{ij}=a_{ji}$
Properties:
Matrix multiplication is associative: $(\mathbf{AB})\mathbf{C}=\mathbf{A}(\mathbf{BC})$
Matrix multiplication is distributive: $\mathbf{A}(\mathbf{B}+\mathbf{C})=\mathbf{AB}+\mathbf{AC}$
Matrix multiplication is not commutative: $\mathbf{AB}\neq\mathbf{BA}$ (in general)
$(\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T$
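A minimal NumPy check of these properties on small example matrices (the values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.array([[2.0, 0.0], [0.0, 2.0]])

print(np.allclose((A @ B) @ C, A @ (B @ C)))    # associativity: True
print(np.allclose(A @ (B + C), A @ B + A @ C))  # distributivity: True
print(np.allclose((A @ B).T, B.T @ A.T))        # (AB)^T = B^T A^T: True
print(np.allclose(A @ B, B @ A))                # commutativity: False in general
```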
Special Matrices
Identity matrix ($\mathbf{I}$): square matrix with ones on the diagonal and zeros elsewhere
Property: $\mathbf{AI}=\mathbf{IA}=\mathbf{A}$
Diagonal matrix: All off-diagonal elements are zero
Symmetric matrix: $\mathbf{A}=\mathbf{A}^T$
Orthogonal matrix: $\mathbf{Q}^T\mathbf{Q}=\mathbf{QQ}^T=\mathbf{I}$
Matrix Inverse
For a square matrix $\mathbf{A}$, the inverse $\mathbf{A}^{-1}$ (if it exists) satisfies: $\mathbf{A}\mathbf{A}^{-1}=\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}$
Properties:
$(\mathbf{A}^{-1})^{-1}=\mathbf{A}$
$(\mathbf{AB})^{-1}=\mathbf{B}^{-1}\mathbf{A}^{-1}$
$(\mathbf{A}^T)^{-1}=(\mathbf{A}^{-1})^T$
A matrix is invertible (non-singular) if its determinant is non-zero.
Determinant
For a 2×2 matrix: $\det\begin{bmatrix}a&b\\c&d\end{bmatrix}=ad-bc$
Properties:
$\det(\mathbf{AB})=\det(\mathbf{A})\det(\mathbf{B})$
$\det(\mathbf{A}^T)=\det(\mathbf{A})$
$\det(\mathbf{A}^{-1})=\frac{1}{\det(\mathbf{A})}$
Trace
The trace of a square matrix is the sum of its diagonal elements: $\text{tr}(\mathbf{A})=\sum_{i=1}^na_{ii}$
Properties:
$\text{tr}(\mathbf{A}+\mathbf{B})=\text{tr}(\mathbf{A})+\text{tr}(\mathbf{B})$
$\text{tr}(c\mathbf{A})=c\cdot\text{tr}(\mathbf{A})$
$\text{tr}(\mathbf{AB})=\text{tr}(\mathbf{BA})$
Eigenvalues and Eigenvectors
For a square matrix $\mathbf{A}$, if there exists a scalar $\lambda$ and a non-zero vector $\mathbf{v}$ such that: $\mathbf{A}\mathbf{v}=\lambda\mathbf{v}$
Then $\lambda$ is an eigenvalue and $\mathbf{v}$ is an eigenvector.
Characteristic equation: $\det(\mathbf{A}-\lambda\mathbf{I})=0$
Properties:
Sum of eigenvalues = trace of matrix
Product of eigenvalues = determinant of matrix
Symmetric matrices have real eigenvalues and orthogonal eigenvectors
Eigendecomposition (for a diagonalizable square matrix): $\mathbf{A}=\mathbf{Q}\Lambda\mathbf{Q}^{-1}$
where $\mathbf{Q}$ contains the eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues.
For symmetric matrix: $\mathbf{A}=\mathbf{Q}\Lambda\mathbf{Q}^T$ (spectral theorem)
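A short NumPy sketch of the spectral theorem on an arbitrary symmetric example matrix; `np.linalg.eigh` is the standard routine for symmetric matrices:

```python
import numpy as np

# Symmetric example matrix, so the spectral theorem applies: A = Q Λ Q^T
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, Q = np.linalg.eigh(A)   # eigh handles symmetric matrices
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))                # reconstruction: True
print(np.isclose(eigvals.sum(), np.trace(A)))       # sum of eigenvalues = trace
print(np.isclose(eigvals.prod(), np.linalg.det(A))) # product of eigenvalues = det
```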
Singular Value Decomposition (SVD)
Any matrix $\mathbf{A}$ (m×n) can be decomposed as: $\mathbf{A}=\mathbf{U}\Sigma\mathbf{V}^T$
where:
$\mathbf{U}$ (m×m): left singular vectors (orthogonal)
$\Sigma$ (m×n): diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq0$
$\mathbf{V}$ (n×n): right singular vectors (orthogonal)
Applications in ML:
Principal Component Analysis (PCA)
Matrix approximation
Dimensionality reduction
Recommender systems
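A minimal sketch of SVD-based low-rank approximation with NumPy (the matrix is random and `k` is an arbitrary target rank):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                     # arbitrary 6x4 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # target rank for the approximation
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation

# Eckart-Young: the spectral-norm error equals the first discarded singular value
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))
```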
Matrix Rank
The rank of a matrix is the maximum number of linearly independent rows (or columns).
Properties:
$\text{rank}(\mathbf{A})\leq\min(m,n)$ for m×n matrix
$\text{rank}(\mathbf{AB})\leq\min(\text{rank}(\mathbf{A}),\text{rank}(\mathbf{B}))$
Full rank: rank equals the smallest dimension
Vector Spaces
Linear independence: Vectors $\mathbf{v}_1,\mathbf{v}_2,...,\mathbf{v}_k$ are linearly independent if: $c_1\mathbf{v}_1+c_2\mathbf{v}_2+\cdots+c_k\mathbf{v}_k=\mathbf{0}$ implies $c_1=c_2=\cdots=c_k=0$
Span: The set of all possible linear combinations of a set of vectors.
Basis: A set of linearly independent vectors that span a vector space.
Dimension: The number of vectors in a basis.
Matrix Calculus
Gradient of a scalar function $f(\mathbf{x})$ with respect to a vector: $\nabla_{\mathbf{x}}f=\begin{bmatrix}\frac{\partial f}{\partial x_1}\\\frac{\partial f}{\partial x_2}\\\vdots\\\frac{\partial f}{\partial x_n}\end{bmatrix}$
Useful derivatives:
$\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x})=\mathbf{a}$
$\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x})=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$
$\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{x})=2\mathbf{x}$
Hessian (matrix of second derivatives): $\mathbf{H}_{ij}=\frac{\partial^2f}{\partial x_i\partial x_j}$
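A finite-difference sanity check of the quadratic-form gradient identity listed above; the matrix and evaluation point are arbitrary example values:

```python
import numpy as np

# Finite-difference check of grad_x(x^T A x) = (A + A^T) x
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, -2.0])

f = lambda v: v @ A @ v
eps = 1e-6
numeric_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
analytic_grad = (A + A.T) @ x

print(np.allclose(numeric_grad, analytic_grad))   # True up to finite-difference error
```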
Norms
Vector norms:
$L^1$ norm: $\|\mathbf{x}\|_1=\sum_{i=1}^n|x_i|$
$L^2$ norm (Euclidean): $\|\mathbf{x}\|_2=\sqrt{\sum_{i=1}^nx_i^2}$
$L^\infty$ norm: $\|\mathbf{x}\|_\infty=\max_i|x_i|$
Matrix norms:
Frobenius norm: $\|\mathbf{A}\|_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^na_{ij}^2}$
Positive Definite Matrices
A symmetric matrix $\mathbf{A}$ is positive definite if: $\mathbf{x}^T\mathbf{A}\mathbf{x}>0$ for all $\mathbf{x}\neq\mathbf{0}$
Properties:
All eigenvalues are positive
Invertible
Important for optimization (convex functions)
Positive semi-definite: $\mathbf{x}^T\mathbf{A}\mathbf{x}\geq0$ for all $\mathbf{x}$
Linear Systems and Least Squares
Linear system: $\mathbf{Ax}=\mathbf{b}$
Least squares solution (when the system is overdetermined): minimize $\|\mathbf{Ax}-\mathbf{b}\|_2^2$
Normal equations: $\mathbf{A}^T\mathbf{A}\mathbf{x}=\mathbf{A}^T\mathbf{b}$
Solution: $\hat{\mathbf{x}}=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}$
The matrix $(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ is called the pseudo-inverse of $\mathbf{A}$ (when $\mathbf{A}$ has full column rank); a numeric sketch follows below.
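A small NumPy sketch comparing the normal-equations solution with `np.linalg.lstsq` on a made-up overdetermined system:

```python
import numpy as np

# Overdetermined system: more equations (rows) than unknowns (columns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.1, 1.9, 3.2, 3.9])

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # solve the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # numerically preferred in practice

print(np.allclose(x_normal, x_lstsq))             # both give the least-squares solution
```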
Probability Fundamentals
Sample Space and Events
Sample space: set of all possible outcomes of an experiment
Event: any subset of the sample space
Set operations:
Intersection: $E \cap F$
Union: $E \cup F$
Complement: $E^c$
Mutually exclusive: $E \cap F=\emptyset$

Conditional Probability
General formula: $P(E|F)=\frac{P(E\cap F)}{P(F)}$
Bayes' formula: $P(F|E)=\frac{P(E|F)P(F)}{P(E)}=\frac{P(E|F)P(F)}{P(E|F)P(F)+P(E|F^c)P(F^c)}$
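A minimal numeric illustration of Bayes' formula; the sensitivity, false-positive rate, and prevalence below are made-up values:

```python
# Made-up numbers: a test with 95% sensitivity and a 10% false-positive rate,
# applied to a condition with 1% prevalence.
p_f = 0.01                 # P(F): prior probability of the condition
p_e_given_f = 0.95         # P(E|F): probability of a positive test given the condition
p_e_given_not_f = 0.10     # P(E|F^c): false-positive rate

p_e = p_e_given_f * p_f + p_e_given_not_f * (1 - p_f)   # total probability P(E)
p_f_given_e = p_e_given_f * p_f / p_e                   # Bayes' formula: P(F|E)
print(p_f_given_e)                                      # roughly 0.088
```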
Independence
Two events E and F are independent if: $P(E\cap F)=P(E)P(F)$
Random Variables
Basic Concepts
Random variable: quantity of interest determined by the result of an experiment
Discrete random variables: countable set of possible values
Continuous random variables: continuum of possible values
Cumulative distribution function: $F(x)=P(X\le x)$
Probability mass function (discrete): $p(a)=P(X=a)$
Probability density function (continuous): $f(x)$ such that $F(a)=P(X\le a)=\int_{-\infty}^af(x)\,dx$
Expectation and Variance
Expectation:
Discrete: $E[X]=\sum_{i=1}^n x_ip(x_i)$
Continuous: $E[X]=\int xf(x)dx$
Linearity: $E[X+Y]=E[X]+E[Y]$
Variance: $Var(X)=E[(X-E[X])^2]=E[X^2]-(E[X])^2$
Covariance and Correlation
Covariance: $Cov(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$
Correlation: $\rho(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$
Properties:
$Var(X+Y)=Var(X)+Var(Y)+2Cov(X, Y)$
If X and Y are independent, then $Cov(X,Y)=0$ (the converse does not hold in general)
Important Distributions
Bernoulli Distribution
Binary outcome (success/failure): $P(X=1)=p$, $P(X=0)=1-p$, with $E[X]=p$ and $Var(X)=p(1-p)$
Binomial Distribution
Number of successes in n independent Bernoulli trials: $P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}$, with $E[X]=np$ and $Var(X)=np(1-p)$
Poisson Distribution
Number of events occurring in a fixed interval: $P(X=k)=\frac{\lambda^ke^{-\lambda}}{k!}$, with $E[X]=Var(X)=\lambda$

Uniform Distribution
Equally likely values on an interval $[a,b]$: $f(x)=\frac{1}{b-a}$ for $a\le x\le b$, with $E[X]=\frac{a+b}{2}$ and $Var(X)=\frac{(b-a)^2}{12}$
Normal (Gaussian) Distribution
Density: $f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, with $E[X]=\mu$ and $Var(X)=\sigma^2$
Standard normal: $\mu=0, \sigma=1$
Exponential Distribution
Time between events in a Poisson process: $f(x)=\lambda e^{-\lambda x}$ for $x\ge0$, with $E[X]=\frac{1}{\lambda}$ and $Var(X)=\frac{1}{\lambda^2}$
Memoryless property: $P(X>s+t|X>t)=P(X>s)$
Statistical Inference
Law of Large Numbers
For i.i.d. random variables $X_1,X_2,...,X_n$ with mean $\mu$ and finite variance: $\bar X_n=\frac{1}{n}\sum_{i=1}^nX_i\stackrel{p}{\longrightarrow}\mu$ as $n\to\infty$
Central Limit Theorem
For i.i.d. random variables with mean $\mu$ and variance $\sigma^2$: $\frac{\bar X_n-\mu}{\sigma/\sqrt n}\stackrel{d}{\longrightarrow}N(0,1)$ as $n\to\infty$
The sample mean of a large sample (typically n ≥ 30) is approximately normally distributed, regardless of the underlying distribution.
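A simulation sketch of the CLT using exponential (skewed) data; the sample size, number of repetitions, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 10_000
# Exponential(1) is skewed, with mean 1 and variance 1
samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # close to the true mean 1
print(sample_means.std())    # close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.141
```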
Sample Statistics
Sample mean: $\bar X=\frac{X_1+...+X_n}{n}$
Sample variance: $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$
Sample standard deviation: $s=\sqrt{s^2}$
Properties:
$E[\bar X]=\mu$
$Var(\bar X)=\frac{\sigma^2}{n}$
Parameter Estimation
Maximum Likelihood Estimation (MLE)
Given i.i.d. sample $X_1,...,X_n$ with density $f(x;\theta)$:
Likelihood function: $L(\theta)=\prod_{i=1}^nf(X_i;\theta)$
MLE: $\hat\theta=\arg\max_\theta L(\theta)=\arg\max_\theta\sum_{i=1}^n\log f(X_i;\theta)$
Properties:
Consistent: $\hat \theta_n\stackrel{p}{\longrightarrow}\theta_0$
Asymptotically normal: $\sqrt n(\hat \theta_n-\theta_0)\stackrel{d}{\longrightarrow}N(0,I(\theta_0)^{-1})$
Examples:
Normal distribution: $\hat\mu = \bar X$, $\hat\sigma^2=\frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2$
Uniform$(0,\theta)$ distribution: $\hat \theta=\max(X_1, X_2, ..., X_n)$
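A quick check of the normal-distribution MLE formulas above on simulated data (the true parameters are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)   # simulated data: mu = 5, sigma = 2

mu_hat = x.mean()                                # MLE of the mean
sigma2_hat = np.mean((x - mu_hat) ** 2)          # MLE of the variance (1/n, not 1/(n-1))

print(mu_hat, sigma2_hat)                        # close to 5 and 4
```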
Confidence Intervals
A $(1-\alpha)$ confidence interval for parameter $\theta$ satisfies: $P\big(L(X_1,...,X_n)\le\theta\le U(X_1,...,X_n)\big)=1-\alpha$
Common confidence intervals:
Normal mean (known variance): $\bar X \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
Normal mean (unknown variance): $\bar X \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$
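A sketch of the t-interval on simulated normal data using `scipy.stats`; the true mean, standard deviation, and sample size are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=40)     # simulated sample, true mean 10

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% critical value
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)      # covers 10 in ~95% of repetitions
```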
Hypothesis Testing
Basic Concepts
Null hypothesis (H₀): default hypothesis to be tested
Alternative hypothesis (H₁): competing hypothesis
Type I error: rejecting H₀ when it is true (significance level α)
Type II error: failing to reject H₀ when it is false (probability β; power = 1 - β)
p-value: probability of observing the data (or more extreme) under H₀
Decision rule:
If p-value < α: reject H₀
If p-value ≥ α: fail to reject H₀
Common Tests
Mean of a normal population (known variance): z-test with $Z=\frac{\bar X-\mu_0}{\sigma/\sqrt n}$, compared to $N(0,1)$ under H₀
Mean of a normal population (unknown variance): t-test with $T=\frac{\bar X-\mu_0}{s/\sqrt n}$, compared to $t_{n-1}$ under H₀ (see the sketch below)
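A one-sample t-test sketch with `scipy.stats.ttest_1samp` on simulated data (the null value and data-generating parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=10.5, scale=2.0, size=30)     # simulated data

# One-sample t-test of H0: mu = 10 against the two-sided alternative
t_stat, p_value = stats.ttest_1samp(x, popmean=10.0)
print(t_stat, p_value)
# Decision at alpha = 0.05: reject H0 if p_value < 0.05
```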
Linear Regression
Simple Linear Regression
Model: $Y_i=\beta_0+\beta_1x_i+\epsilon_i$
Least squares estimates: $\hat\beta_1=\frac{\sum_{i=1}^n(x_i-\bar x)(Y_i-\bar Y)}{\sum_{i=1}^n(x_i-\bar x)^2}$, $\hat\beta_0=\bar Y-\hat\beta_1\bar x$
Coefficient of determination: $R^2=1-\frac{\sum_{i=1}^n(Y_i-\hat Y_i)^2}{\sum_{i=1}^n(Y_i-\bar Y)^2}$
$R^2$ measures the proportion of variance in Y explained by X.
Multiple Linear Regression
Matrix form: $\mathbf{Y}=\mathbf{X}\beta+\epsilon$
Least squares solution: $\hat\beta=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ (a numeric sketch follows the properties below)
Properties:
Unbiased: $E[\hat\beta]=\beta$
Variance: $Var(\hat\beta)=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$
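A numeric sketch of the closed-form OLS estimate on simulated data; the design matrix and true coefficients are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # (X^T X)^{-1} X^T Y
print(beta_hat)                                  # close to [1.0, 2.0, -0.5]
```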
Multivariate Normal Distribution
Density function: $f(\mathbf{x})=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right)$ for $\mathbf{x}\in\mathbb{R}^d$
MLE estimates: $\hat\mu=\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i$, $\hat\Sigma=\frac{1}{n}\sum_{i=1}^n(\mathbf{X}_i-\hat\mu)(\mathbf{X}_i-\hat\mu)^T$
Central Limit Theorem (multivariate): $\sqrt n(\bar{\mathbf{X}}-\mu)\stackrel{d}{\longrightarrow}N(\mathbf{0},\Sigma)$
Key Concepts for Machine Learning
Bias-Variance Trade-off
Model error can be decomposed as: $E[(y-\hat f(x))^2]=\text{Bias}[\hat f(x)]^2+\text{Var}[\hat f(x)]+\sigma^2$
Bias: error from incorrect assumptions
Variance: error from sensitivity to training data
Irreducible error: inherent noise in data
Regularization
L1 Regularization (Lasso): $\min_\beta\|\mathbf{y}-\mathbf{X}\beta\|_2^2+\lambda\|\beta\|_1$
L2 Regularization (Ridge): $\min_\beta\|\mathbf{y}-\mathbf{X}\beta\|_2^2+\lambda\|\beta\|_2^2$
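A sketch of the standard closed form for the ridge objective above, $\hat\beta=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, on simulated data; Lasso has no closed form and is usually solved iteratively:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

print(ridge(X, y, lam=1.0))   # coefficients shrunk toward zero relative to OLS
```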
Information Theory Basics
Entropy: $H(X)=-\sum_xp(x)\log p(x)$
Cross-Entropy: $H(p,q)=-\sum_xp(x)\log q(x)$
KL Divergence: $D_{KL}(p\|q)=\sum_xp(x)\log\frac{p(x)}{q(x)}$
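These three quantities computed in NumPy for small discrete distributions (the probability vectors are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # arbitrary discrete distributions
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log(p))           # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl = np.sum(p * np.log(p / q))             # D_KL(p || q) = H(p, q) - H(p)
print(entropy, cross_entropy, kl)
```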
References
Course Materials:
STAT GR5701 Probability and Statistics for Data Science - Columbia University
STAT GR5703 Statistical Inference and Modeling - Columbia University