Introduction to PCA

Background knowledge for data analysis

Hugo, based on the webslides template

Holibut, 2019.11

Summary

What - Orthogonality?

What - What is PCA?

Why - Why PCA?

When - When to use PCA?

How - How does PCA work?

How to - Hands-on practice

What is orthogonality?

Orthogonality (Wikipedia)

[ˌɔːθɒɡəˈnæləti] n. orthogonality; the state of being orthogonal

Two vectors, x and y, in an inner product space, V, are orthogonal if their inner product ⟨ x , y ⟩ is zero. This relationship is denoted x ⊥ y .

In geometry, two Euclidean vectors are orthogonal if they are perpendicular, i.e., they form a right angle.

Orthogonality means the dot product is zero:

a · b = ‖a‖ ‖b‖ cos θ
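As a quick illustration (my addition, not from the original slides), orthogonality can be checked numerically with NumPy:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([-2.0, 1.0])   # chosen to be perpendicular to a in the plane

# Dot product directly, and via the formula a · b = ‖a‖ ‖b‖ cos θ (θ = 90°)
print(np.dot(a, b))                                               # 0.0 -> orthogonal
print(np.linalg.norm(a) * np.linalg.norm(b) * np.cos(np.pi / 2))  # ~0.0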

Linear algebra review

Scalars, vectors, and matrices

Computer science

Orthogonality is a system design property which guarantees that modifying the technical effect produced by a component of a system neither creates nor propagates side effects to other components of the system. Typically this is achieved through the separation of concerns and encapsulation, and it is essential for feasible and compact designs of complex systems. The emergent behavior of a system consisting of components should be controlled strictly by formal definitions of its logic and not by side effects resulting from poor integration, i.e., non-orthogonal design of modules and interfaces. Orthogonality reduces testing and development time because it is easier to verify designs that neither cause side effects nor depend on them.

An instruction set is said to be orthogonal if it lacks redundancy (i.e., there is only a single instruction that can be used to accomplish a given task).

Chemistry and biochemistry

DNA has two orthogonal pairs: cytosine and guanine form a base-pair, and adenine and thymine form another base-pair, but other base-pair combinations are strongly disfavored.

C - cytosine G - guanine A - adenine T - thymine

cytosine [ˈsaɪtoʊsiːn], guanine [ˈɡwɑːniːn], adenine [ˈædənɪn], and thymine [ˈθaɪmiːn] - the four DNA bases (fundamental components of nucleic acids)

What is PCA?

PCA @ Wikipedia

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.

What Is Principal Component Analysis (PCA)?

Principal component analysis (PCA) is a dimensionality reduction technique that lets you identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without losing important information.

The main idea behind PCA is to figure out patterns and correlations among various features in the data set. On finding a strong correlation between different variables, a final decision is made about reducing the dimensions of the data in such a way that the significant data is still retained.
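To make this concrete, here is a minimal sketch (my addition, assuming scikit-learn is available; the slides later build PCA by hand with NumPy) that reduces two strongly correlated features to a single component:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # second feature is almost a copy of the first
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)                    # keep only the first principal component
X_reduced = pca.fit_transform(X)             # shape (200, 1)
print(pca.explained_variance_ratio_)         # close to 1: one component captures most variance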

PCA - Overview

• It is a mathematical tool from applied linear algebra.

• It is a simple, non-parametric method of extracting relevant information from confusing datasets.

• It provides a roadmap for how to reduce a complex dataset to a lower dimension.

Why PCA?

How can we tell the difference in ability between two students?

Note: the points are all one color because each point carries two values: a physics score and a statistics score. The black points are simply overlapping points. The horizontal axis (the physics score) varies much more, so it better reflects the differences between students. So when judging differences in ability, the physics score alone is enough: the student with the higher physics score is the stronger one. For the original question of comparing two students, those at the two extremes are easy to compare, while those in the middle are harder; but since the plot tells us that physics alone suffices, the comparison becomes easy: a higher physics score means a stronger student.

What about the second plot? Students at the two ends are still easy to rank: the ones on the right are clearly better. But comparing students in the middle is no longer straightforward: a higher physics score alone does not mean the student is stronger overall.

With PCA we find a projection axis: every original point (a student's scores) is mapped onto a point on that line. The comparison then becomes a one-dimensional comparison along a line, which is easy: points to the left on the line are weaker than points to the right. That is the core idea of PCA: dimensionality reduction.

Step By Step Computation Of PCA

1. Standardization of the data

2. Computing the covariance matrix

3. Calculating the eigenvectors and eigenvalues

4. Computing the Principal Components

5. Reducing the dimensions of the data set

1. Standardization of the data

Centering the data (X → X_centered):

mean_vec = np.mean(X, axis=0)
X_centered = X - mean_vec
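Note that the snippet above only centers the data (subtracts the column means). When features are on very different scales, full standardization also divides by the standard deviation; a minimal sketch of that variant (my addition, with made-up numbers):

import numpy as np

X = np.array([[1.0, 200.0],      # toy data: two features on very different scales
              [2.0, 180.0],
              [3.0, 240.0]])

mean_vec = np.mean(X, axis=0)
std_vec = np.std(X, axis=0)
X_standardized = (X - mean_vec) / std_vec          # zero mean, unit variance per feature
print(X_standardized.mean(axis=0), X_standardized.std(axis=0))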

Covariance matrix

Standard Deviation (SD)

In statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ for the population standard deviation or the Latin letter s for the sample standard deviation) is a measure of the amount of variation or dispersion of a set of values.

Variance

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

Covariance [kəʊˈveərɪəns]: a measure of the joint variability of two random variables.

Standard deviation and variance operate on a single dimension; covariance is always measured between two dimensions.

2. Computing the covariance matrix

cov_mat = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)  # sample covariance (n - 1 in the denominator)
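As a sanity check (my addition), the same matrix can be obtained with np.cov, which also divides by n - 1 by default:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # stand-in data for the slides' X
X_centered = X - X.mean(axis=0)

cov_manual = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)
cov_numpy = np.cov(X_centered, rowvar=False)       # rows = observations, columns = variables
print(np.allclose(cov_manual, cov_numpy))          # True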

3. Calculating the eigenvectors and eigenvalues

eigen [ˈaɪɡən]: from German, meaning "own" or "characteristic"

An eigenvector is a vector whose direction is unchanged by the transformation; only its magnitude is scaled, and the scaling factor is called the eigenvalue.

eigVals, eigVecs = np.linalg.eig(X_centered.T.dot(X_centered))

With np.linalg.eig, do we even need the covariance matrix? Not really: dividing by n - 1 only rescales the eigenvalues, while the eigenvectors (and hence the principal directions) are identical.
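A small sketch of this step (my addition; the data is made up, variable names follow the slides), verifying the eigenvector definition and sorting the components by eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.8 * x1 + 0.3 * rng.normal(size=200)])
X_centered = X - X.mean(axis=0)

cov_mat = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)
eigVals, eigVecs = np.linalg.eig(cov_mat)

# Definition check: cov_mat · v = λ · v for every eigenpair (columns of eigVecs).
for lam, v in zip(eigVals, eigVecs.T):
    assert np.allclose(cov_mat @ v, lam * v)

# Sort by decreasing eigenvalue so that the first column is the first principal direction.
order = np.argsort(eigVals)[::-1]
eigVals, eigVecs = eigVals[order], eigVecs[:, order]
print(eigVals)   # the first value is the variance captured along PC1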

4. Computing the Principal Components

We can see that the blue vector direction corresponds to the oblique shape of our data. The idea is that if you project the data points on the line corresponding to the blue vector direction you will end up with the largest variance. This vector has the direction that maximizes variance of projected data. Have a look at the following figure:

plotVectors(eigVecs.T, [orange, blue])

Projection of the data point: this line direction is the one with the largest variance

Projecting the data points onto the pink line yields the most variance: this line has the direction that maximizes the variance of the projected points. The same holds for the figure above: our blue vector points along the line where the projected data has the highest variance. The second eigenvector is orthogonal to the first.

Rotation

It worked! The rotation transformed our dataset so that most of the variance now lies along one of the basis axes. You could keep only this dimension and still have a fairly good representation of the data.

This rotation amounts to the same thing as the projection: after projecting, we still want a coordinate axis (we treated the eigenvector as an axis, so in the end that vector is rotated onto one of the coordinate axes). So the step above performs both operations at once: projection and change of basis. As the figure shows, only the Y axis matters now: to compare two points we only need their Y coordinates, which is exactly a reduction by one dimension.

X_new = eigVecs.T.dot(X_centered.T)  # project the centered data onto the eigenvector basis

plt.plot(X_new[0, :], X_new[1, :], '*')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()
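As a cross-check (my addition, assuming scikit-learn is available), the manual projection should agree with sklearn's PCA up to the ordering and sign of the components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.6], [0.0, 0.4]])   # correlated 2-D data
X_centered = X - X.mean(axis=0)

eigVals, eigVecs = np.linalg.eig(X_centered.T.dot(X_centered))
order = np.argsort(eigVals)[::-1]                        # sort by decreasing eigenvalue
X_new = eigVecs[:, order].T.dot(X_centered.T)            # manual projection, shape (2, n)

X_sklearn = PCA(n_components=2).fit_transform(X_centered)
# Eigenvector signs are arbitrary, so compare absolute values.
print(np.allclose(np.abs(X_sklearn), np.abs(X_new.T)))   # True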

Detailed explanation and code

hadrienj/deepLearningBook-Notes @github

The seven steps

1. Standardize the d-dimensional dataset.

2. Construct the covariance matrix.

3. Decompose the covariance matrix into its eigenvectors and eigenvalues.

4. Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors.

5. Select k eigenvectors which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).

6. Construct a projection matrix W from the “top” k eigenvectors.

7. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace (a consolidated sketch of all seven steps follows).
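A compact NumPy sketch that strings these seven steps together (my own consolidation; the data and k = 1 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.7 * x1 + 0.3 * rng.normal(size=200)])   # d = 2
k = 1                                                              # target dimensionality

# 1. Standardize the d-dimensional dataset.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Construct the covariance matrix.
cov_mat = np.cov(X_std, rowvar=False)

# 3. Decompose it into eigenvectors and eigenvalues.
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# 4. Sort the eigenvalues in decreasing order.
order = np.argsort(eig_vals)[::-1]

# 5./6. Select the top-k eigenvectors and build the projection matrix W (d × k).
W = eig_vecs[:, order[:k]]

# 7. Project the data onto the new k-dimensional feature subspace.
X_pca = X_std.dot(W)
print(X_pca.shape)   # (200, 1)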

PCA vs linear regression

1. In linear regression the two coordinates have different, non-interchangeable meanings: x and y play different roles.

y is the quantity being predicted and is treated completely differently from x.

In two dimensions, PCA works with x1 and x2, and these two attributes are treated as equals; neither is used to predict the other.

2. Both methods minimize an error, but the errors are not the same.

In linear regression, the error for a point is the difference between its observed y and the y value on the fitted line, i.e. a vertical segment (the residual is always measured parallel to the y axis). In PCA, the error is the perpendicular distance from each point to the line it is projected onto. So the two error measures are different.
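A small illustrative sketch (my addition, with made-up 2-D data) of the two error measures: vertical residuals for linear regression versus perpendicular distances to the first principal axis for PCA:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + 0.2 * rng.normal(size=100)
Xc = np.column_stack([x, y])
Xc = Xc - Xc.mean(axis=0)

# Linear regression: minimizes vertical distances y - (a*x + b).
a, b = np.polyfit(x, y, deg=1)
vertical_residuals = y - (a * x + b)

# PCA: the "error" is the perpendicular distance of each point to the first principal axis.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
v1 = eig_vecs[:, np.argmax(eig_vals)]                     # first principal direction
orthogonal_residuals = Xc - np.outer(Xc.dot(v1), v1)      # component perpendicular to v1

print(np.sum(vertical_residuals ** 2), np.sum(orthogonal_residuals ** 2))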

High-dimensional data

Handwritten digits: the data consists of 8×8 pixel images, meaning that they are 64-dimensional.
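A hedged sketch (my addition, assuming scikit-learn's bundled digits dataset and matplotlib) of projecting the 64-dimensional digits down to two dimensions so they can be plotted:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                              # digits.data has shape (1797, 64)
X_2d = PCA(n_components=2).fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='digit')
plt.show()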

Questions

How should the principal components be interpreted?

How many principal components should we keep?
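For the second question, one common heuristic (my addition, not prescribed in the slides) is to keep enough components to explain a target share of the variance, e.g. 95%:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
pca = PCA().fit(X)                                     # keep all 64 components
cum_var = np.cumsum(pca.explained_variance_ratio_)     # cumulative explained variance
k = int(np.argmax(cum_var >= 0.95)) + 1                # smallest k reaching 95%
print(k, cum_var[k - 1])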

Summary

-> The definition and idea of orthogonality

-> PCA is a dimensionality reduction method

-> By reducing dimensionality with PCA, high-dimensional data that cannot be visualized can be made visualizable (e.g. reduced to two or three dimensions)

-> PCA can be implemented in steps: standardize the data, compute the covariance matrix, compute the eigenvectors, and project

-> PCA is different from linear regression

References

Thank you, and goodbye