Single Node Hadoop Setup
Hadoop rolled out version 2 last year. Since version 1 is still mainstream, this tutorial shows how to set up hadoop-1.2.1, the latest version 1 release.
System
We work on the virtual machine created in the last tutorial, which runs the Ubuntu 12.04 LTS distribution. The following material should apply to many other Linux distributions; only the specific commands may differ slightly.
Environment and Dependencies
Java
Hadoop requires a Java Runtime Environment. We install openjdk-6 for this tutorial. See the Hadoop wiki for a list of compatible Java versions.
sudo apt-get update
sudo apt-get install openjdk-6-jdk
Verify that Java 6 was installed successfully:
azureuser@test-hpl:~$ java -version
java version "1.6.0_31"
OpenJDK Runtime Environment (IcedTea6 1.13.3) (6b31-1.13.3-1ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
Hosts
Note: in order for Hadoop nodes to contact each other, they need to resolve their own names. A freshly launched VM on Azure (as of this writing) cannot resolve its own name. You can modify /etc/hosts so that it looks like the following:
azureuser@test-hpl:/opt/hadoop$ cat /etc/hosts
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
127.0.0.1 test-hpl
Note: change test-hpl to your own hostname, which appears at the beginning of the command-line prompt azureuser@test-hpl.
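If you prefer not to edit the file by hand, the line can be appended from the shell. A minimal sketch, assuming the hostname command prints the name shown in your prompt:
# Append a loopback entry for this VM's own hostname
echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts
# Verify that the name now resolves
ping -c 1 "$(hostname)"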
Password-less SSH
First, generate your key pair:
azureuser@test-hpl:~$ cd .ssh
azureuser@test-hpl:~/.ssh$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/azureuser/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/azureuser/.ssh/id_rsa.
Your public key has been saved in /home/azureuser/.ssh/id_rsa.pub.
The key fingerprint is:
b2:5c:35:2f:04:7a:47:81:b1:23:5e:40:55:67:df:6e azureuser@test-hpl
The key's randomart image is:
+--[ RSA 2048]----+
| .o.+++.o |
| o.+ o . . |
| o = = . .|
| . + = o . |
| o S . . E|
| . + . . |
| o |
| |
| |
+-----------------+
Put the public key in the authorized key list:
azureuser@test-hpl:~/.ssh$ cat id_rsa.pub >> authorized_keys
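If SSH still asks for a password after this, file permissions are a common culprit: sshd rejects keys kept in group- or world-writable files. A hedged fix:
# Tighten permissions so sshd accepts the key
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys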
Check the setup:
azureuser@test-hpl:~/.ssh$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 04:35:ad:f4:a1:cf:0d:c4:5e:c9:e4:65:6f:52:36:68.
Are you sure you want to continue connecting (yes/no)? yes
You will find that you have logged in to the same machine again and a new shell is allocated. Remember to exit after the test.
The first time you SSH to a machine, it prompts you to decide whether to accept the server's public key. After you type yes, the information is recorded in ~/.ssh/known_hosts.
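You can confirm that the entry was recorded; ssh-keygen -F looks up a host in known_hosts:
azureuser@test-hpl:~$ ssh-keygen -F localhost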
EXERCISE: You can skip this section at first and see what happens when you start the cluster. You'll find that the password-less SSH setup saves you some typing, which is especially useful when you manage a big cluster.
EXERCISE: Explore the ~/.ssh/ folder.
Download and Install Hadoop Package
We install Hadoop under /opt/.
First create the directory and give our user ownership:
azureuser@test-hpl:~$ sudo mkdir -p /opt
azureuser@test-hpl:~$ sudo chown azureuser:azureuser /opt
azureuser@test-hpl:~$ cd /opt/
Verify that we are in /opt/ and have read/write permissions.
azureuser@test-hpl:/opt$ pwd
/opt
azureuser@test-hpl:/opt$ ls -al .
total 8
drwxr-xr-x 2 azureuser azureuser 4096 Apr 28 07:05 .
drwxr-xr-x 23 root root 4096 May 2 03:18 ..
Download the Hadoop package from the official repository:
azureuser@test-hpl:/opt$ wget 'http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz'
A good habit is to verify a downloaded package before installing it. We first use curl to fetch the official message digests for hadoop-1.2.1.tar.gz. We can then use md5sum and sha1sum to check the downloaded file.
azureuser@test-hpl:/opt$ curl http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz.mds
hadoop-1.2.1.tar.gz: MD5 = 8D 79 04 80 56 17 C1 6C B2 27 D1 CC BF E9 38 5A
hadoop-1.2.1.tar.gz: SHA1 = B07B 88CA 658D C9D3 38AA 84F5 C68C 809E B7C7 0964
hadoop-1.2.1.tar.gz: RMD160 = 6330 DED6 043A 1C8D D859 7910 E77F 3DED F249 A807
hadoop-1.2.1.tar.gz: SHA224 = 3500FE1F 513A32D7 AD3EEBA1 F177710C 3D678534
CD6DA4F4 13C8188E
hadoop-1.2.1.tar.gz: SHA256 = 94A11817 71F173BD B55C8F90 17228258 66396091
F0516BDD 12B34DC3 DE1706A1
hadoop-1.2.1.tar.gz: SHA384 = 2ABAF8DB 781FB3EA 11621937 1847445C 44B5C7D7
48EC8410 43D96C9A 9D6DD978 F300CE18 F02D9FC8
ED6B8176 D08E5B62
hadoop-1.2.1.tar.gz: SHA512 = 79C6423D 1E0E2835 98442DCF FD63DA52 BF3AB53B
80957243 7BAF8DA8 38B592E1 E776430E A67DB53E
52B78112 5BCAB225 DC222632 63CDF185 7D2A7A46
A4966DA8
azureuser@test-hpl:/opt$ md5sum hadoop-1.2.1.tar.gz
8d7904805617c16cb227d1ccbfe9385a hadoop-1.2.1.tar.gz
azureuser@test-hpl:/opt$ sha1sum hadoop-1.2.1.tar.gz
b07b88ca658dc9d338aa84f5c68c809eb7c70964 hadoop-1.2.1.tar.gz
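Instead of comparing the digests by eye, you can normalize the published MD5 and compare mechanically. A sketch, assuming the .mds format shown above (uppercase, space-separated hex):
# Print the published MD5, lowercased with the spaces stripped
curl -s http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz.mds \
  | awk -F'= ' '/MD5/ {gsub(/ /, "", $2); print tolower($2)}'
# Print the computed MD5; the two lines should match exactly
md5sum hadoop-1.2.1.tar.gz | awk '{print $1}'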
Uncompress the downloaded .tar.gz archive. Instead of operating on the directory hadoop-1.2.1 directly, it is better to create a soft link to it named hadoop. This way, you don't have to modify other programs when you upgrade your Hadoop version.
azureuser@test-hpl:/opt$ tar -xzvf hadoop-1.2.1.tar.gz
...
azureuser@test-hpl:/opt$ ln -s hadoop-1.2.1 hadoop
azureuser@test-hpl:/opt$ ls
hadoop hadoop-1.2.1 hadoop-1.2.1.tar.gz
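The soft link pays off at upgrade time. For instance, with a hypothetical future release hadoop-1.2.2 (an assumed example version), only the link would change:
# Unpack the new release, then repoint the symlink in place
# (-n replaces the link itself instead of descending into its target)
tar -xzf hadoop-1.2.2.tar.gz
ln -sfn hadoop-1.2.2 hadoop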
Configuration
Environment Variables
Export the environment variables in your ~/.bashrc. You can use vim to edit the file, or use cat >> ~/.bashrc followed by an input stream.
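For example, with the cat approach (type the lines, then press ctrl+d on an empty line to finish the input):
azureuser@test-hpl:~$ cat >> ~/.bashrc
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin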
Check your configuration:
azureuser@test-hpl:/opt/hadoop/conf$ tail ~/.bashrc
# sources /etc/bash.bashrc).
if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
. /etc/bash_completion
fi
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
You can activate those environment variables with source ~/.bashrc. This configuration will also be loaded every time you log in.
Now issue the hadoop command. You should see the following usage message:
azureuser@test-hpl:/opt/hadoop/conf$ hadoop
Warning: $HADOOP_HOME is deprecated.
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
...
TIP: You may see the warning "Warning: $HADOOP_HOME is deprecated.". In Hadoop 1.x the HADOOP_HOME variable is deprecated in favor of HADOOP_PREFIX, the new recommended environment variable; the warning is harmless. (Robin Lee)
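If the warning bothers you, Hadoop 1.x also honors a suppression variable; check bin/hadoop-config.sh in your release to confirm before relying on it:
# Silence the HADOOP_HOME deprecation warning (Hadoop 1.x)
export HADOOP_HOME_WARN_SUPPRESS=1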
TIP: You may see a warning that the plain value localhost:9000 for fs.default.name is deprecated. Use the full URI hdfs://localhost:9000 to solve it. (Gao Ruohan)
Hadoop Configurations
Hadoop configuration files are in /opt/hadoop/conf:
azureuser@test-hpl:/opt/hadoop/conf$ ls
capacity-scheduler.xml hadoop-metrics2.properties mapred-site.xml taskcontroller.cfg
configuration.xsl hadoop-policy.xml masters task-log4j.properties
core-site.xml hdfs-site.xml slaves
fair-scheduler.xml log4j.properties ssl-client.xml.example
hadoop-env.sh mapred-queue-acls.xml ssl-server.xml.example
Modify the configuration as follows. The output below is in Git diff format: a line starting with - was removed, and a line starting with + was added. Before each content diff, two lines show the file name, e.g. core-site.xml.
diff --git a/core-site.xml b/core-site.xml
index 970c8fe..317d9ba 100644
--- a/core-site.xml
+++ b/core-site.xml
@@ -5,4 +5,13 @@
<configuration>
+ <property>
+ <name>hadoop.tmp.dir</name>
+ <value>/opt/hadoop-tmp</value>
+ <description>A base for other temporary directories.</description>
+ </property>
+ <property>
+ <name>fs.default.name</name>
+ <value>hdfs://localhost:9000</value>
+ </property>
</configuration>
diff --git a/hadoop-env.sh b/hadoop-env.sh
index 01654b9..97ccb79 100644
--- a/hadoop-env.sh
+++ b/hadoop-env.sh
@@ -6,7 +6,7 @@
# remote nodes.
# The java implementation to use. Required.
-# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
+export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
diff --git a/hdfs-site.xml b/hdfs-site.xml
index 970c8fe..4e52fa4 100644
--- a/hdfs-site.xml
+++ b/hdfs-site.xml
@@ -5,4 +5,8 @@
<configuration>
+ <property>
+ <name>dfs.replication</name>
+ <value>1</value>
+ </property>
</configuration>
diff --git a/mapred-site.xml b/mapred-site.xml
index 970c8fe..5d8b379 100644
--- a/mapred-site.xml
+++ b/mapred-site.xml
@@ -5,4 +5,9 @@
<configuration>
+ <property>
+ <name>mapred.job.tracker</name>
+ <value>localhost:9001</value>
+ </property>
</configuration>
+
TIP: Use ctrl+d to end the input when you use the cat > approach to create a file.
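For example, a sketch that writes the smallest of the files above in one go, using a heredoc instead of interactive input (the two header lines are the stock boilerplate shipped in the release):
cat > /opt/hadoop/conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF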
TIP: A crash course in Vim: start it with vim file-to-edit; press i to enter insert mode; move the cursor with the arrow keys; make your edits; press <ESC> to leave insert mode; type :wq to save and quit the editor. We suggest you learn more about Vim after the tutorial.
Test HDFS
Before you start the cluster for the first time, you need to prepare some data structures for HDFS's namenode. This is done with the namenode -format command shown below. You can then check /opt/hadoop-tmp/ to see that some directories and files have been created.
azureuser@test-hpl:/opt/hadoop$ hadoop namenode -format
...
azureuser@test-hpl:/opt/hadoop$ ls /opt/hadoop-tmp/
dfs
Once ready, you can start HDFS using the start-dfs.sh script. By default this launches a single-node cluster. Check whether NameNode, SecondaryNameNode, and DataNode are running:
azureuser@test-hpl:/opt/hadoop$ start-dfs.sh
...
azureuser@test-hpl:/opt/hadoop$ jps
12549 NameNode
12939 SecondaryNameNode
12741 DataNode
13012 Jps
If you know the Linux ps command, jps is the analogue of ps for Java processes. You can also use the system's ps to check Java processes, e.g. ps aux | grep java, but the output is too long to comprehend.
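A related trick: jps -l prints the fully qualified main class of each JVM, which helps when the short names are ambiguous:
azureuser@test-hpl:/opt/hadoop$ jps -l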
If the cluster is running, you can operate on HDFS using the command hadoop dfs <subcommand>. Type hadoop dfs to see the list of subcommands. The names closely resemble other Linux commands, e.g. -ls, -mv, ...
Test creating and deleting a directory under the root:
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -mkdir /testdir
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
Found 1 items
drwxr-xr-x - azureuser supergroup 0 2014-05-02 04:42 /testdir
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -rmr /testdir
Deleted hdfs://localhost:9000/testdir
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
Upload a test file:
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -copyFromLocal README.txt /README.txt
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -tail /README.txt
try, of
encryption software. BEFORE using any encryption software, please
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to
see if this is permitted. See <http://www.wassenaar.org/> for more
information.
...
EXERCISE: Check out -copyToLocal to download a file.
You can use stop-dfs.sh to stop the HDFS cluster. Let's keep it running for now, because Hadoop MapReduce runs on top of HDFS.
NOTE: Work step by step. If some component is not running, e.g. DataNode, check the logs, usually located at $HADOOP_PREFIX/logs.
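For example, to diagnose a missing DataNode, a sketch (the log file name embeds your user name and hostname, so adjust it to match your own files):
ls $HADOOP_PREFIX/logs/
tail -n 50 $HADOOP_PREFIX/logs/hadoop-azureuser-datanode-test-hpl.log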
Test Example MapReduce Job
Start Hadoop MapReduce and check the running processes with jps. There are now two more processes: JobTracker and TaskTracker.
azureuser@test-hpl:/opt/hadoop$ start-mapred.sh
starting jobtracker, logging to /opt/hadoop-1.2.1/libexec/../logs/hadoop-azureuser-jobtracker-test-hpl.out
localhost: starting tasktracker, logging to /opt/hadoop-1.2.1/libexec/../logs/hadoop-azureuser-tasktracker-test-hpl.out
azureuser@test-hpl:/opt/hadoop$ jps
12549 NameNode
12939 SecondaryNameNode
13834 Jps
12741 DataNode
13600 JobTracker
13785 TaskTracker
There is an example suite in the hadoop-1.2.1 package. Find the list of examples as follows:
azureuser@test-hpl:/opt/hadoop$ hadoop jar hadoop-examples-1.2.1.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
We try the wordcount example. First we need to know how to pass parameters:
azureuser@test-hpl:/opt/hadoop$ hadoop jar hadoop-examples-1.2.1.jar wordcount
Usage: wordcount <in> <out>
Then we run a wordcount on the README.txt we uploaded to HDFS in the last section.
azureuser@test-hpl:/opt/hadoop$ hadoop jar hadoop-examples-1.2.1.jar wordcount /README.txt /output/
14/05/02 05:25:11 INFO input.FileInputFormat: Total input paths to process : 1
14/05/02 05:25:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/02 05:25:11 WARN snappy.LoadSnappy: Snappy native library not loaded
14/05/02 05:25:12 INFO mapred.JobClient: Running job: job_201405020524_0001
14/05/02 05:25:13 INFO mapred.JobClient: map 0% reduce 0%
14/05/02 05:25:21 INFO mapred.JobClient: map 100% reduce 0%
14/05/02 05:25:30 INFO mapred.JobClient: map 100% reduce 33%
14/05/02 05:25:31 INFO mapred.JobClient: map 100% reduce 100%
14/05/02 05:25:33 INFO mapred.JobClient: Job complete: job_201405020524_0001
...
Check the output after the job is finished:
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /
Found 3 items
-rw-r--r-- 1 azureuser supergroup 1366 2014-05-02 04:54 /README.txt
drwxr-xr-x - azureuser supergroup 0 2014-05-02 05:24 /opt
drwxr-xr-x - azureuser supergroup 0 2014-05-02 05:25 /output
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -ls /output/
Found 3 items
-rw-r--r-- 1 azureuser supergroup 0 2014-05-02 05:25 /output/_SUCCESS
drwxr-xr-x - azureuser supergroup 0 2014-05-02 05:25 /output/_logs
-rw-r--r-- 1 azureuser supergroup 1306 2014-05-02 05:25 /output/part-r-00000
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -tail /output/part-r-00000
ty 1
License 1
Number 1
Regulations, 1
SSL 1
Section 1
Security 1
See 1
Software 2
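To pull the whole result down to the local filesystem, one option is -getmerge, which concatenates the part files into a single local file (a sketch; /tmp/wordcount.txt is an assumed destination):
azureuser@test-hpl:/opt/hadoop$ hadoop dfs -getmerge /output /tmp/wordcount.txt
azureuser@test-hpl:/opt/hadoop$ head /tmp/wordcount.txt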
More
Start/Stop the Cluster
You can use start-all.sh and stop-all.sh to start and stop the entire cluster, including HDFS and MapReduce.
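For example:
stop-all.sh    # stops the MapReduce daemons, then HDFS
start-all.sh   # starts HDFS, then the MapReduce daemons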
Check Job History
You can review the statistics of a finished job with hadoop job -history, pointing it at the job's output directory:
azureuser@test-hpl:/opt/hadoop$ hadoop job -history /output/
Hadoop job: 0001_1399008311838_azureuser
=====================================
Job tracker host name: job
job tracker start time: Thu May 20 01:50:20 UTC 1976
Outcome of This Tutorial
- Have a basic idea of the Hadoop package.
- Have a basic idea of the workflow of running a Hadoop MapReduce job.