Intelligence is positioning
InfoNCE loss:

$$L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$$
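The loss above is a softmax over (negative, temperature-scaled) squared distances, with the positive $p_1$ in slot 0. A minimal numpy sketch (function name and array layout are my own; row 0 of `ps` is assumed to hold the positive $f(p_1)$):

```python
import numpy as np

def infonce_gaussian(q, ps, tau=0.5):
    """InfoNCE with a Gaussian kernel:
    L = -log exp(-||f(q)-f(p_1)||^2/(2*tau)) / sum_i exp(-||f(q)-f(p_i)||^2/(2*tau)).
    `q` is the query embedding f(q); `ps[i]` is f(p_{i+1}), with the positive in row 0."""
    d2 = np.sum((ps - q) ** 2, axis=1)       # squared distance to each p_i
    logits = -d2 / (2.0 * tau)
    logits -= logits.max()                   # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_prob[0]                      # negative log-probability of the positive
```

Pulling the positive closer to the query (relative to the negatives) lowers the loss, which is exactly the contrastive pressure the formula encodes.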
Learn $Z=f(x)$: map the original data points into a space where semantic similarity is captured naturally.
Suppose we are given a similarity matrix $\pi$ over the dataset in advance, where $\pi_{i,j}$ is the similarity between data points $i$ and $j$. We want the similarity matrix $K_Z$ of the embeddings $f(x)$ to match the manually specified $\pi$ on the raw data $x$. Letting $W_X\sim \pi$ and $W_Z\sim K_Z$, we want these two neighbor samples to agree.
Minimize the cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot ;\pi)}[\log P(W_Z=W_X;K_Z)]$
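One concrete way to realize this objective: row-normalize both kernels into neighbor distributions and take the cross-entropy between them, averaged over anchor points. This is a sketch under that assumption (the function name is mine; the source does not fix the normalization):

```python
import numpy as np

def neighbor_cross_entropy(pi, K_Z, eps=1e-12):
    """H = -E_{W_X ~ P(.;pi)}[log P(W_Z = W_X; K_Z)].
    Each row of `pi` / `K_Z` is turned into a neighbor distribution over the
    other points; the loss is the average per-anchor cross-entropy."""
    P = pi / pi.sum(axis=1, keepdims=True)     # target neighbor distribution (data space)
    Q = K_Z / K_Z.sum(axis=1, keepdims=True)   # model neighbor distribution (embedding space)
    return -(P * np.log(Q + eps)).sum(axis=1).mean()
```

By Gibbs' inequality the loss is minimized (per row) when $K_Z$ induces the same neighbor distribution as $\pi$, which is the matching condition stated above.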
Data visualization: map data into a low-dimensional space (e.g. 2D).
SNE: same as NCA, want $q_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\sim \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$
Crowding problem: a low-dimensional map has too little volume for moderately distant pairs, so the Gaussian $q$ squeezes them together.
Solved by t-SNE: let $q_{i,j}\sim (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution)
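The heavy-tailed $q$ is easy to compute in closed form. A minimal sketch of the low-dimensional affinities (function name mine; normalization over all pairs, as in symmetric t-SNE):

```python
import numpy as np

def tsne_q(Y):
    """Low-dimensional affinities q_{ij} proportional to (1 + ||y_i - y_j||^2)^(-1),
    i.e. a Student t-kernel with one degree of freedom, normalized over all pairs.
    The heavy tail lets moderately distant pairs sit farther apart, easing crowding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)   # q_{ii} is defined as 0
    return num / num.sum()
```

Compared with the Gaussian in SNE, $(1+d^2)^{-1}$ decays polynomially, so a pair at moderate distance keeps non-negligible probability mass and the optimizer is not forced to crush it into the center of the map.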