Introduction to PCA

Hugo, based on the webslides template

Holibut, 2019.11

# Orthogonality (Wikipedia)

orthogonality [ˌɔːθɒɡəˈnælɪtɪ] n. the property or state of being orthogonal.

## Orthogonal means the dot product is zero

a · b = ‖a‖ ‖b‖ cos θ
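A quick check with NumPy (the vectors `a` and `b` here are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([-2.0, 1.0])  # perpendicular to a

# a . b = ||a|| * ||b|| * cos(theta); orthogonal vectors give 0
dot = np.dot(a, b)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)        # 0.0
print(cos_theta)  # 0.0
```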

# Computer science

Orthogonality is a system design property which guarantees that modifying the technical effect produced by a component of a system neither creates nor propagates side effects to other components of the system. Typically this is achieved through the separation of concerns and encapsulation, and it is essential for feasible and compact designs of complex systems. The emergent behavior of a system consisting of components should be controlled strictly by formal definitions of its logic and not by side effects resulting from poor integration, i.e., non-orthogonal design of modules and interfaces. Orthogonality reduces testing and development time because it is easier to verify designs that neither cause side effects nor depend on them.

An instruction set is said to be orthogonal if it lacks redundancy, i.e., there is only a single instruction that can be used to accomplish a given task.

# Chemistry and biochemistry

DNA has two orthogonal pairs: cytosine and guanine form a base-pair, and adenine and thymine form another base-pair, but other base-pair combinations are strongly disfavored.

The four bases:

- C - cytosine [ˈsaɪtoʊˌsiːn]
- G - guanine [ˈɡwɑːniːn] (a basic component of nucleic acids)
- A - adenine [ˈædənɪn]
- T - thymine [ˈθaɪmiːn]

# What is PCA?

## PCA (Wikipedia)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.

# What Is Principal Component Analysis (PCA)?

Principal component analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without losing the important information.

The main idea behind PCA is to find patterns and correlations among the various features in the data set. When strong correlations are found between variables, the dimensionality of the data is reduced in a way that still retains the significant information.
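The four steps the next slides walk through can be sketched end to end in NumPy (the data here is synthetic; `eigh` is used because the covariance matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # correlated features

X_centered = X - X.mean(axis=0)             # 1. center the data
cov_mat = np.cov(X_centered.T)              # 2. covariance matrix
eigVals, eigVecs = np.linalg.eigh(cov_mat)  # 3. eigendecomposition (ascending)
components = eigVecs[:, ::-1][:, :2]        # 4. top-2 principal directions
X_reduced = X_centered @ components         # project: 3-D -> 2-D

print(X_reduced.shape)  # (100, 2)
```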

# 1. Standardization of the data

Center the data matrix `X` by subtracting the per-feature mean:

```python
mean_vec = np.mean(X, axis=0)  # per-feature mean
X_centered = X - mean_vec      # shift each feature to zero mean
```
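The heading says standardization, but the snippet above only centers the data. Full standardization also divides by each feature's standard deviation; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 3.0], size=(100, 2))  # synthetic data

mean_vec = np.mean(X, axis=0)
X_centered = X - mean_vec              # zero mean, as in the slide

std_vec = np.std(X_centered, axis=0)
X_standardized = X_centered / std_vec  # unit variance per feature

print(X_standardized.mean(axis=0))  # ~0 for each feature
print(X_standardized.std(axis=0))   # ~1 for each feature
```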

# covariance matrix

## Standard Deviation

In statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ for the population standard deviation or the Latin letter s for the sample standard deviation) is a measure of the amount of variation or dispersion of a set of values.

## Variance

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

## Covariance [ˌkoʊˈveəriəns]

Standard deviation and variance operate on a single dimension; covariance is always measured between two dimensions.

# 2. Computing the covariance matrix

```python
# unbiased sample covariance: divide by n - 1
cov_mat = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)
```
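As a sanity check, this formula should agree with NumPy's built-in `np.cov`, which expects variables in rows by default (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X_centered = X - X.mean(axis=0)

cov_mat = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)

# np.cov puts variables in rows by default, hence the transpose
print(np.allclose(cov_mat, np.cov(X_centered.T)))  # True
```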

# 3. Calculating the eigenvectors and eigenvalues

eigen [ˈaɪɡən], a German prefix meaning "own" or "characteristic"; hence eigenvalue and eigenvector.

```python
eigVals, eigVecs = np.linalg.eig(cov_mat)
```

Note: decomposing the unscaled scatter matrix `X_centered.T.dot(X_centered)` instead gives the same eigenvectors; only the eigenvalues are larger by a factor of n - 1.
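A quick check of the defining property Cv = λv, plus the usual sort by decreasing eigenvalue (synthetic data; for a symmetric matrix `np.linalg.eigh` would also work):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
X_centered = X - X.mean(axis=0)
cov_mat = X_centered.T.dot(X_centered) / (X_centered.shape[0] - 1)

eigVals, eigVecs = np.linalg.eig(cov_mat)

# each column v of eigVecs satisfies cov_mat @ v == lambda * v
for lam, v in zip(eigVals, eigVecs.T):
    assert np.allclose(cov_mat @ v, lam * v)

# sort components by decreasing eigenvalue (explained variance)
order = np.argsort(eigVals)[::-1]
eigVals, eigVecs = eigVals[order], eigVecs[:, order]
print(eigVals[0] >= eigVals[1])  # True
```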

# 4. Computing the Principal Components

We can see that the blue vector direction corresponds to the oblique shape of our data. The idea is that if you project the data points on the line corresponding to the blue vector direction you will end up with the largest variance. This vector has the direction that maximizes variance of projected data. Have a look at the following figure:

```python
plotVectors(eigVecs.T, [orange, blue])  # plotVectors: helper from deepLearningBook-Notes
```

*Projection of the data points: this line's direction is the one with the largest variance.*

When you project the data points onto the pink line, the variance of the projections is larger than for any other direction. The same holds for the figure above: the blue vector points along the line where the projected data has the highest variance. The second eigenvector is orthogonal to the first.
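This claim is easy to verify numerically: the variance of the projection onto the first eigenvector is at least that of the projection onto any other direction (synthetic oblique data):

```python
import numpy as np

rng = np.random.default_rng(3)
# correlated 2-D cloud with an oblique shape
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X_centered = X - X.mean(axis=0)

eigVals, eigVecs = np.linalg.eig(np.cov(X_centered.T))
v1 = eigVecs[:, np.argmax(eigVals)]         # direction of largest variance

proj_pc1 = X_centered @ v1                  # projection onto the first component
proj_x = X_centered @ np.array([1.0, 0.0])  # projection onto an arbitrary axis

print(proj_pc1.var() >= proj_x.var())  # True: PC1 maximizes variance
```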

# Rotation

It worked! The rotation transformed our dataset so that most of the variance now lies along one of the basis axes. You could keep only that dimension and still have a fairly good representation of the data.

```python
X_new = eigVecs.T.dot(X_centered.T)
plt.plot(X_new[0, :], X_new[1, :], '*')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()
```
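After the rotation the coordinates are uncorrelated and their variances are exactly the eigenvalues, which is easy to check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2)) @ np.array([[2.0, 0.0], [1.8, 0.3]])
X_centered = X - X.mean(axis=0)

eigVals, eigVecs = np.linalg.eig(np.cov(X_centered.T))
X_new = eigVecs.T.dot(X_centered.T)  # rotate into the eigenbasis

# covariance of the rotated data is diagonal, with the eigenvalues on it
print(np.allclose(np.cov(X_new), np.diag(eigVals)))  # True
```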

# PCA vs. linear regression

## 1. In linear regression the two axes mean different things: x and y are not interchangeable

y is the value being predicted, so it plays a completely different role from x.

In two dimensions PCA has x1 and x2, and these two variables are treated identically: both are ordinary features, and neither is used to predict the other.
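This asymmetry shows up numerically: swapping x and y changes a regression fit, but merely permutes PCA's components (synthetic data; `np.polyfit` is used here for the least-squares slope):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(scale=0.5, size=300)

# regression is asymmetric: the slope of y~x is not the inverse of x~y
slope_yx = np.polyfit(x, y, 1)[0]
slope_xy = np.polyfit(y, x, 1)[0]
print(np.isclose(slope_yx, 1.0 / slope_xy))  # False: the two fits differ

# PCA is symmetric: swapping the two columns leaves the
# eigenvalues (explained variances) unchanged
X = np.column_stack([x, y])
vals1 = np.linalg.eigvalsh(np.cov(X.T))
vals2 = np.linalg.eigvalsh(np.cov(X[:, ::-1].T))
print(np.allclose(vals1, vals2))  # True
```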

# References

## Internet misc

deepLearningBook-Notes