Machine Learning for Hackers Chapter 8: Principal Components Analysis
The code for Chapter 8 has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining PCA well is kinda hard. If any experts reading feel like I’ve described something imprecisely (and have a better description), I’m very open to suggestions.
Introduction
Chapter 8 is about Principal Components Analysis (PCA), which the authors perform on data with time series of prices for 24 stocks. In very broad terms, PCA is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. Principal components are selected in order to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data, which are all highly correlated, we can project them down to just a few principal components—-i.e., the high correlation between them can be imagined as coming from an underlying factor that drives all of them, with some other less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.
As you might imagine, PCA can be a very effective ...