Principal Component Analysis
See the IPython Notebooks:
- Linear Algebra in IPython Notebook: online viewer, source
- Dimensionality Reduction: online viewer, source
Data Preparation (optional)
NOTE: We have already downloaded and preprocessed the data set, and the URLs are included in the IPython notebook above. This section is only for those who are interested in the whole process.
LegcoHK releases voting records as open data starting from Sept 2013. We use the English version.
Here is a one-liner, run from the Chrome console, that prints the list of XML files:
$('.vote a').filter(function(){return $(this).attr('href').substr(-3,3)=='xml';}).map(function(){console.log($(this).attr('href'))})
We save it as: list-of-xml.txt
Batch download those files using the following commands:
mkdir legco-xml
cd legco-xml
cat ../list-of-xml.txt | xargs -P 10 -I{} wget http://www.legco.gov.hk/{}
We need to pre-process the XML files into a single CSV for further mining:
- Legco-Preprocessing: online viewer, source
The same method works for the 2013-2014 records. We have collected all records from 20121017 to 20140219.
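As a rough illustration of the XML-to-CSV step, here is a minimal sketch using only the Python standard library. The XML layout, the tag names (`vote`, `member`), the attribute names (`id`, `name`) and the helper functions `xml_to_rows` and `rows_to_csv` are all hypothetical and chosen for illustration; the real LegCo files may be structured differently, and the Legco-Preprocessing notebook remains the reference for the actual conversion.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only; the real LegCo files
# may use different tag and attribute names.
SAMPLE = """<votes>
  <vote id="1">
    <member name="A">Yes</member>
    <member name="B">No</member>
  </vote>
  <vote id="2">
    <member name="A">No</member>
    <member name="B">Yes</member>
  </vote>
</votes>"""


def xml_to_rows(xml_text):
    """Flatten one XML document into (vote_id, member, choice) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for vote in root.iter("vote"):
        for member in vote.iter("member"):
            choice = (member.text or "").strip()
            rows.append((vote.get("id"), member.get("name"), choice))
    return rows


def rows_to_csv(rows):
    """Serialize the flattened rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["vote_id", "member", "choice"])
    writer.writerows(rows)
    return buf.getvalue()


rows = xml_to_rows(SAMPLE)
csv_text = rows_to_csv(rows)
```

To combine many files into a single CSV, one could call `xml_to_rows` on each downloaded file and concatenate the resulting lists before writing.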
Outcome of This Tutorial
- Get a feel for how a practical problem is tackled.
We emphasize the mining part in this course, but it is worth knowing the whole story:
- web scraping,
- data conversion,
- data cleaning,
- data transformation,
- data mining,
- visualization,
- ...
- Be able to manipulate and decompose matrices in Python. Matrix decomposition is the fundamental routine behind many advanced spectral embedding techniques.
- Have a clear idea of how to use the decomposed sub-matrices in PCA.
- Be aware of some common pitfalls.
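To make the link between matrix decomposition and PCA concrete, here is a minimal sketch that computes PCA through the singular value decomposition with NumPy. It is one common way to do it, not necessarily the approach taken in the notebooks. The function name `pca_svd` and the toy `votes` matrix are assumptions made up for illustration; they are not the real LegCo data.

```python
import numpy as np


def pca_svd(X, k):
    """Project the rows of X onto the top-k principal components via SVD."""
    X = np.asarray(X, dtype=float)
    # Center each column; PCA is defined on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # rows are the principal directions
    scores = U[:, :k] * s[:k]            # same as Xc @ components.T
    explained_var = s[:k] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_var


# Toy voting matrix (hypothetical): rows = members, columns = motions,
# +1 = yes, -1 = no, 0 = absent or abstained.
votes = np.array([
    [ 1,  1, -1, -1],
    [ 1,  1, -1,  0],
    [-1, -1,  1,  1],
    [-1,  0,  1,  1],
])

scores, components, explained_var = pca_svd(votes, k=2)
```

Here `scores` holds the low-dimensional coordinates of each member, `components` the directions in motion space, and `explained_var` the variance captured along each direction. In this toy matrix the first two members and the last two members fall on opposite sides of the first component.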