{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Principle Component Analysis\n", "\n", "Author: [Pili Hu](http://hupili.net)\n", " \n", "Let's try to visualize the legco data set via dimensionality reduction.\n", "We only show PCA in this notebook.\n", "If you are interested in more dimensionality reduction techniques, please find pointers in reference section.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Principal Component Analysis\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
20121017-120121017-1020121017-220121017-320121017-420121017-520121017-620121017-720121017-820121017-9...20140219-220140219-2020140219-2120140219-320140219-420140219-520140219-620140219-720140219-820140219-9
member
Abraham SHEKNoNoNoNoNoNoNoNoNoNo...AbstainYesNoNoYesYesYesNoAbsentAbstain
Alan LEONGYesYesYesYesYesYesYesNoNoNo...AbstainYesAbsentNoNoNoYesYesYesAbsent
Albert CHANYesYesYesYesYesYesAbstainNoNoNo...NoAbsentAbstainAbsentNoYesYesYesAbsentAbsent
Albert HOYesYesYesYesYesYesYesNoNoNo...YesYesAbsentNoNoYesYesYesYesYes
Alice MAKNoYesNoNoYesYesAbstainNoNoAbstain...YesYesYesNoNoNoNoYesNoNo
\n", "

5 rows × 1031 columns

\n", "
" ], "text/plain": [ " 20121017-1 20121017-10 20121017-2 20121017-3 20121017-4 \\\n", "member \n", "Abraham SHEK No No No No No \n", "Alan LEONG Yes Yes Yes Yes Yes \n", "Albert CHAN Yes Yes Yes Yes Yes \n", "Albert HO Yes Yes Yes Yes Yes \n", "Alice MAK No Yes No No Yes \n", "\n", " 20121017-5 20121017-6 20121017-7 20121017-8 20121017-9 \\\n", "member \n", "Abraham SHEK No No No No No \n", "Alan LEONG Yes Yes No No No \n", "Albert CHAN Yes Abstain No No No \n", "Albert HO Yes Yes No No No \n", "Alice MAK Yes Abstain No No Abstain \n", "\n", " ... 20140219-2 20140219-20 20140219-21 20140219-3 \\\n", "member ... \n", "Abraham SHEK ... Abstain Yes No No \n", "Alan LEONG ... Abstain Yes Absent No \n", "Albert CHAN ... No Absent Abstain Absent \n", "Albert HO ... Yes Yes Absent No \n", "Alice MAK ... Yes Yes Yes No \n", "\n", " 20140219-4 20140219-5 20140219-6 20140219-7 20140219-8 20140219-9 \n", "member \n", "Abraham SHEK Yes Yes Yes No Absent Abstain \n", "Alan LEONG No No Yes Yes Yes Absent \n", "Albert CHAN No Yes Yes Yes Absent Absent \n", "Albert HO No Yes Yes Yes Yes Yes \n", "Alice MAK No No No Yes No No \n", "\n", "[5 rows x 1031 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#file_prefix = 'https://course.ie.cuhk.edu.hk/~engg4030/tutorial/tutorial7/'\n", "file_prefix = ''\n", "import pandas as pd\n", "df = pd.io.parsers.read_csv(file_prefix + 'votes-matrix.csv', index_col='member')\n", "df[:5]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(70, 1031)\n", "[['No' 'No' 'No' 'No' 'No']\n", " ['Yes' 'Yes' 'Yes' 'Yes' 'Yes']\n", " ['Yes' 'Yes' 'Yes' 'Yes' 'Yes']\n", " ['Yes' 'Yes' 'Yes' 'Yes' 'Yes']\n", " ['No' 'Yes' 'No' 'No' 'Yes']]\n" ] } ], "source": [ "X = df.as_matrix()\n", "# This is member-by-vote matrix\n", "print X.shape\n", "# A corner of the matrix\n", "print X[:5, :5]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Convert string value to numeric value\n", "X[(X!='Yes') & (X!='No')] = 0\n", "X[X=='Yes'] = 1\n", "X[X=='No'] = -1\n", "X = matrix(X.astype('float'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to one friend, the meaning of labels are as follows:\n", "\n", " * Yes: Yes\n", " * No: No\n", " * Absent: Did not come\n", " * Abstain: Did not vote\n", " * Present: Came at first, but was abset when vote this topic\n", " \n", "Correct if the interpretation is wrong.\n", "For the educational purpose, we simply give anything other than \"Yes\" or \"No\" a value `0`.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "28471\n", "10939\n", "32760\n", "72170\n", "72170\n", "[[-1. -1. -1. -1. -1.]\n", " [ 1. 1. 1. 1. 1.]\n", " [ 1. 1. 1. 1. 1.]\n", " [ 1. 1. 1. 1. 1.]\n", " [-1. 1. -1. -1. 1.]]\n" ] } ], "source": [ "# Check whether all cells are transformed to numerical values.\n", "print len(find(X == -1))\n", "print len(find(X == 1))\n", "print len(find(X == 0))\n", "print len(find(X == -1)) + len(find(X == 0)) + len(find(X == 1))\n", "print size(X)\n", "print X[:5, :5]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.8 -1.24285714 -0.78571429 -0.8 -1.21428571]\n", " [ 1.2 0.75714286 1.21428571 1.2 0.78571429]\n", " [ 1.2 0.75714286 1.21428571 1.2 0.78571429]\n", " [ 1.2 0.75714286 1.21428571 1.2 0.78571429]\n", " [-0.8 0.75714286 -0.78571429 -0.8 0.78571429]]\n" ] } ], "source": [ "# Centering!\n", "# This is a often neglectd step!\n", "# If you wonder what happens without centering, see the remarks section\n", "X = X - mean(X, 0)\n", "print X[:5, :5]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1031, 1031)\n" ] } ], "source": [ "# The covariance matrix\n", "C = X.T * X\n", "print C.shape" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2, 70)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:11: ComplexWarning: Casting complex values to real discards the imaginary part\n", "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:12: ComplexWarning: Casting complex values to real discards the imaginary part\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# EVD and get Principle **Axis**\n", "# In previous version of this note, we use eig to get all eigen values/vectors.\n", "# It suffers from compatibility problems and is inefficient on large matrix.\n", "# scipy.sparse.linalg.eigs is a more advanced interface to get partial eigen values/vectors.\n", "import scipy.sparse.linalg\n", "PA = scipy.sparse.linalg.eigs(C, k=2, which='LM', return_eigenvectors=True)[1]\n", "\n", "# Project data points onto principle axis\n", "X_2D = PA.T * X.T\n", "print X_2D.shape\n", "x = array(X_2D[0, :]).astype('float')\n", "y = array(X_2D[1, :]).astype('float')\n", "scatter(x, y)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using matplotlib backend: MacOSX\n", "Populating the interactive namespace from numpy and matplotlib\n", "(3, 70)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:17: ComplexWarning: Casting complex values to real discards the imaginary part\n", "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:18: ComplexWarning: Casting complex values to real discards the imaginary part\n", "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:19: ComplexWarning: Casting complex values to real discards the imaginary part\n" ] }, { "data": { "text/plain": [ "9" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the following one to pop up the 3D plot in another window.\n", "# Good for interactive exploration.\n", "# May or may not be available due to your desktop.\n", "%pylab\n", "# Use the following one to plot inline (embedded in this notebook).\n", "#%pylab inline\n", "\n", "# Try 3D\n", "PA = scipy.sparse.linalg.eigs(C, k=3, which='LM', return_eigenvectors=True)[1]\n", "# Project data points onto principle axis\n", "X_3D = PA.T * X.T\n", "print X_3D.shape\n", "# We intentionally add some disturbance for better visualization.\n", "# Or else, some of the nodes are located in the same place.\n", "# (Those who vote exactly the same)\n", "X_3D = X_3D + randn(*tuple(X_3D.shape)) * 0.3\n", "x = array(X_3D[0, :]).astype('float')\n", "y = array(X_3D[1, :]).astype('float')\n", "z = array(X_3D[2, :]).astype('float')\n", "\n", "from mpl_toolkits.mplot3d import Axes3D\n", "fig = figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "ax.scatter(x, y, z, picker=True, s=100)\n", "title('HK Legco 2012.10 ~ 2015.6 Votes (3 PCs)')\n", "ax.set_xlabel('Principal Component 1')\n", "ax.set_ylabel('Principal Component 2')\n", "ax.set_zlabel('Principal Component 3')\n", "\n", "def onpick(event):\n", " if hasattr(onpick, '_label'):\n", " #pass\n", " onpick._label.remove()\n", " thisline = event.artist\n", " ind = event.ind\n", " #print type(ind)\n", " #print X_3D[0, ind]\n", " names = df.iloc[ind].index.values\n", " label = ('\\n'.join(names))\n", " pos = (X_3D[0, ind[0]], X_3D[1, ind[0]], X_3D[2, ind[0]])\n", " #onpick._label = ax.set_title(label\n", " onpick._label = ax.text(*pos, s=label)\n", " fig.canvas.draw()\n", " \n", "fig.canvas.mpl_connect('pick_event', onpick)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can change \"%pylab inline\" to \"%pylab\"i n the above cell to pop the 3D plot in another window.\n", "Pick some points and you can see who are those people.\n", "You can also zoom/pan/rotate the plot to explore in details.\n", "Following is one example:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url=(file_prefix + 'screen-legco-pca-3d-labels.png'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE**:\n", "We have used PCA to visualize **what members vote alike**.\n", "Try to visualize **what topics are voted alike**.\n", "You may try the following two methods:\n", "\n", " * Simply transpose the original data matrix $X$, and do the same numeric computation.\n", " * Re-use the above l ($[\\lambda_1,\\ldots,\\lambda_n]$) and U ($[U_1,\\ldots,U_n]$).\n", " Compute what you need (principle axis, principle component) by matrix multiplication, instead of decomposition.\n", "\n", "I strongly encourage you to try both methods.\n", "By deriving the math relation between $U,V,\\Lambda,\\Sigma$, you will get a deeper understanding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The energy (optional)\n", "\n", "What's the energy preserved by principal axes?\n", "i.e. sum of squared singular values or sum of eigen values.\n", "In a previous version of this note, we use `eig` to get all eigen values.\n", "Now we only get the largest eigen values.\n", "Without a full decomposition, what can we do?\n", "\n", "Recall the [trace](http://en.wikipedia.org/wiki/Trace_%28linear_algebra%29) you learned in linear algebra:\n", "\n", "$$Tr[X] = Tr[U \\Lambda U^T] = Tr[U^T U \\Lambda] = Tr[\\Lambda]$$\n", "\n", "So the sum of diagonal is equal to the sum of eigen values.\n", "We can use this to compute the percentage of captured energy by principal axes." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eval= 614.335088097 %= 0.022421479598\n", "eval= 631.805969672 %= 0.023059116976\n", "eval= 1035.41967009 %= 0.0377898665697\n", "eval= 1868.59401075 %= 0.0681983551007\n", "eval= 15788.2125691 %= 0.576224755619\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/hupili/Desktop/initium/hk_legco/third/hupili-engg4030/src/t7-pca/venv/lib/python2.7/site-packages/IPython/kernel/__main__.py:3: ComplexWarning: Casting complex values to real discards the imaginary part\n", " app.launch_new_instance()\n" ] } ], "source": [ "l = scipy.sparse.linalg.eigs(C, k=5, which='LM', return_eigenvectors=False)\n", "for i in l:\n", " print 'eval=', float(i), '%=', float(i) / trace(C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only ~70% is captured by first three principal axes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remarks (optional)\n", " \n", "Mis-uses/mis-interpretations of PCA:\n", " \n", " * What happens without centering?\n", " You can directly decompose the matrix without centering.\n", " It still gives you certain good visualization.\n", " Note that it is not PCA anymore!\n", " However, it is a valid spectral embedding technique (see reference).\n", " * Principal component and principal axis.\n", " Depending on the convention of $X$ (row is data point or column is data point), the interpretation of $U$ and $V$ are different.\n", " Besides, \"principal component\" is used for \"principal axis\" in some texts.\n", " We adopt the following convention: ($n$ data points; reduce from $d$ dimension to $k$ dimension)\n", " * Principal axes: They are $k$ $d$-dimensional vectors.\n", " i.e. they are in the original space. \n", " i.e. they span a subspace in the original space.\n", " * Principal components: They are $n$ $k$-dimensional vectors.\n", " i.e. they are the coordinates of original data points in the new dimension-reduced space.\n", " * Suppose data points are organized in columns.\n", " Given $X=U\\Sigma V^T$ (e.g. by decomposing $X X^T$), we truncate to the large $k$ singular values: $X \\approx U_k\\Sigma_k V_k^T$\n", " We know that $ U_k$ is principal axis and $\\Sigma_k V_k^T$ is the principal component.\n", " We need to plot $\\Sigma_k V_k^T$ to get a lower dimension embedding.\n", " However, people sometimes decompose $X^T X$ to obtain $V_k^T$ and plot it direclty.\n", " $V_k^T$ and $\\Sigma_k V_k^T$ are of same dimension, so there is no problem.\n", " Again, this is not PCA but it is within the scope of spectral embedding techniques and gives nice visualization sometimes.\n", "\n", "How to interpret the above \"Mis-uses/mis-interpretations\"?\n", "It is here to remind you that there are rigorous math formulations and derivations behind PCA.\n", "It is very easy to mis-use the algorithm and mis-interpret the result.\n", "However, it does not say the variants are incorrect.\n", "On the contrary, PCA is not ideal in many cases and the variants turn out to be useful sometimes.\n", "Some of the variants also have rigorous math justifications but some of them do not have.\n", "Some of them have been formalized and named but some are not.\n", "It is very common to see people get \"right result\" using \"wrong method\".\n", "Just stay open-mineded.\n", "In the world of data mining, there are no one-size-fit-all strong algorithms but there are strong miners.\n", "**Pre-processing** and **post-processing** may sometimes be even more important than the algorithm." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", " * Lecture notes _Dimensionality Reduction Part 1_ : https://course.ie.cuhk.edu.hk/~engg4030/lecture/Dimension1.pdf\n", " * **Jolliffe, Ian.** _Principal component analysis_. John Wiley & Sons, Ltd, 2005. APA.\n", " * PCA is just one of the many **spectral** dimensionality reduction methods.\n", " If you want to explore more, you can refer to my\n", " [notes on spectral clustering](http://project.hupili.net/tutorial/hu2012-spectral/hu2012-spectral.html).\n", " * The [techreport](http://project.hupili.net/tutorial/hu2012-spectral/hu2012spectral-survey.pdf) \n", " surveys some common linear and non-linear DR methods.\n", " * The [slides, p21-26](http://project.hupili.net/tutorial/hu2012-spectral/hu2013spectral-slides.pdf) \n", " has a quick overview of PCA from formulation to some operational nitty-gritties (e.g. principle component v.s. principle axis).\n", " * PCA (and spectral embedding methods) may not be the best way to visualize this data set.\n", " Given the voting matrix, one can extract a \"social graph\" on legco members.\n", " [Force-directed layout](http://en.wikipedia.org/wiki/Force-directed_graph_drawing)\n", " is a popular family of methods to visualize this type of data.\n", " (depending on time availability, this may be covered during graph mining chapters)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }