Hadoop and GraphLab in IPython Notebook
Summary of Current Work
The materials of this course focus on hands-on training. Let's summarize what we have done in the tutorials and homework so far:
- Tutorials 1, 2, 3, 4:
  - Systems, targeting "big data"
  - Linux, Hadoop, Python, Shell
- Tutorials 5, 6, 7, 9, 10:
  - Algorithms, working with "small data"
  - LSH, K-means, PCA, RecSys, Graph
  - Synthetic data (Tut5, Tut6); real data (Tut7, Tut9, Tut10); data collection/pre-processing (Tut7)
  - From-scratch implementations (Tut5, Tut6, Tut7); packages (Tut9, Tut10)
  - Python scientific computing environment
- HW1, HW2:
  - Hadoop and the MapReduce programming model
  - Linux, Hadoop, Python, Shell
- HW3:
  - Mahout, a machine learning package on Hadoop
  - Linux, Mahout
- HW4:
  - GraphLab (via existing C++ packages)
  - Linux, GraphLab, Mahout
In the systems part, you touched almost all of the underlying details. That is essential training, but it does not make your day-to-day work more efficient. In the algorithms part, we worked in a unified scientific computing environment; it is convenient, but we only played with small data. Many distributed computation platforms offer (official or third-party) Python bindings, which bridge the two worlds and let you get the best of both ends. That is the goal of this tutorial.
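To make the contrast concrete, the two scripts below sketch the kind of "underlying details" you handle yourself in a raw Hadoop Streaming word count: reading stdin, writing tab-separated key/value pairs to stdout, and relying on Hadoop's sort between the two phases. The homework code was not necessarily identical, so treat this as an illustrative sketch; the bindings introduced next hide most of this plumbing.

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word)
```

```python
# reducer.py -- Hadoop sorts mapper output by key, so all counts
# for the same word arrive on consecutive lines of stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
```

Such scripts are then submitted through the Hadoop Streaming jar (e.g. `hadoop jar .../hadoop-streaming-*.jar -mapper mapper.py -reducer reducer.py ...`), with the exact jar path depending on your installation.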
Hadoop bindings in Python
- Hadoop in IPython Notebook: online viewer, source
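To give a flavor of the binding before you open the notebook, a minimal mrjob job looks like the sketch below (the canonical word-count example); the code in the linked notebook may differ in detail.

```python
# wordcount_mrjob.py -- a minimal word-count job with mrjob.
# mrjob lets you write mapper/reducer as plain Python methods and
# takes care of the Hadoop Streaming plumbing for you.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # The input key is unused; the value is one line of text.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All counts for the same word are grouped together.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```

Run it locally with `python wordcount_mrjob.py input.txt`, or on a configured Hadoop cluster with `python wordcount_mrjob.py -r hadoop <input path>`.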
GraphLab bindings in Python
- GraphLab in IPython Notebook: online viewer, source
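Similarly, the short GraphLab Create sketch below builds an SGraph from a toy edge list (made up here for illustration) and runs the PageRank toolkit; the linked notebook may use different data and toolkits.

```python
# A small GraphLab Create sketch: build a graph from a toy edge list
# and run the PageRank toolkit. (Edge data is made up for illustration.)
import graphlab as gl

# Edge list as an SFrame, GraphLab Create's tabular data structure.
edges = gl.SFrame({'src': ['a', 'b', 'c', 'c'],
                   'dst': ['b', 'c', 'a', 'b']})

# Build a directed graph from the edge table.
g = gl.SGraph(edges=edges, src_field='src', dst_field='dst')

# Run the built-in PageRank toolkit and inspect per-vertex scores.
pr = gl.pagerank.create(g)
print(pr['pagerank'])
```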
Outcome of This Tutorial
- Master shell integration in IPython Notebook, which enables programmatic invocation of many Linux tools (see the sketch after this list).
- Get a feel for language-specific "bindings": many open-source projects are not written in your language of choice, but bindings often make them usable from it.
- Get hands-on experience with one Hadoop binding (mrjob) and one GraphLab binding (GraphLab Create).
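As a quick illustration of the shell integration mentioned in the first point, the cell below is IPython Notebook syntax, not plain Python: the `!` prefix runs a shell command, its output can be captured into a Python variable, and `{}` interpolates Python values into the command. The Hadoop command and the file path are placeholders for whatever you actually want to run.

```python
# IPython Notebook cell syntax (will not run as a plain .py script).

# Run a shell command directly from a notebook cell:
!hadoop fs -ls /

# Capture a command's output as a Python list of lines:
notebooks = !ls *.ipynb
print(notebooks)

# Interpolate a Python variable into a shell command with {}:
input_path = 'data/input.txt'  # placeholder path, for illustration
!wc -l {input_path}
```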