Tutorial Notes on Web-Scale Information Analytics

Principal Component Analysis

See the IPython Notebooks:

Data Preparation (optional)

NOTE: We have already downloaded and preprocessed the data set, and its URLs are included in the IPython notebooks above. This section is only for readers interested in the whole process.

LegcoHK has released voting records as open data since Sept 2013. We use the English version.

Here is a one-liner that prints the list of XML links from the Chrome console (it relies on jQuery being loaded on the page):

$('.vote a[href$="xml"]').each(function(){ console.log($(this).attr('href')); })

We save the output as list-of-xml.txt.

Batch download those files using the following commands:

mkdir legco-xml
cd legco-xml
cat ../list-of-xml.txt | xargs -P 10 -I{} wget http://www.legco.gov.hk/{}

We need to pre-process the XML files into a single CSV for further mining:
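The conversion script itself is not reproduced here, so the following is only a minimal sketch of what this step could look like, using the Python standard library. The XML layout it expects is an assumption, not something confirmed by these notes: a `meeting` element with a `start-date` attribute, containing `vote` elements with a `number` attribute, each holding `individual-votes/member` entries with a `name-en` attribute and a nested `vote` text. The function names, the CSV column names, and the output file name `votes.csv` are likewise hypothetical; adjust them to the actual files.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

# Hypothetical CSV columns: one row per (motion, member) pair.
FIELDS = ["date", "vote_number", "member", "vote"]


def rows_from_root(root):
    """Flatten one parsed XML document into a list of row dicts.

    ASSUMED layout (check against the real files):
    <meeting start-date="..."> / <vote number="..."> /
    <individual-votes> / <member name-en="..."><vote>Yes</vote></member>
    """
    rows = []
    for meeting in root.iter("meeting"):
        date = meeting.get("start-date", "")
        # Only direct children, so the nested per-member <vote> is not matched.
        for motion in meeting.findall("vote"):
            number = motion.get("number", "")
            for member in motion.findall("individual-votes/member"):
                rows.append({
                    "date": date,
                    "vote_number": number,
                    "member": member.get("name-en", ""),
                    "vote": (member.findtext("vote") or "").strip(),
                })
    return rows


def write_csv(rows, out):
    """Write the rows to an open text file object, with a header line."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def convert(xml_dir, csv_path):
    """Parse every *.xml file in xml_dir and write one combined CSV."""
    all_rows = []
    for path in sorted(glob.glob(os.path.join(xml_dir, "*.xml"))):
        all_rows.extend(rows_from_root(ET.parse(path).getroot()))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        write_csv(all_rows, out)
    return len(all_rows)
```

Under these assumptions, calling `convert("legco-xml", "votes.csv")` from the parent directory of `legco-xml` would produce a long-format CSV that can then be pivoted into a member-by-motion matrix for the analysis.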

You can use the same method to process the 2013-2014 records. We have collected all records from 20121017 to 20140219.

Outcome of This Tutorial
