How to Configure Apache Hadoop in Standalone Mode
Apache Hadoop is an open source framework for storing and distributed batch processing of huge datasets on clusters of commodity hardware. Hadoop can be used on a single machine (Standalone Mode) as well as on a cluster of machines (Distributed Mode – Pseudo & Fully). One of the striking features of Hadoop is that it efficiently distributes large amounts of work across a cluster of machines/commodity hardware.
Through this tutorial I will try and throw light on how to configure Apache Hadoop in Standalone Mode.
Before I get to that, it is important to understand that Hadoop can be run in any of the following three modes: Standalone Mode, Pseudo-Distributed Mode and Fully Distributed Mode.
Standalone Mode- In standalone mode, we will configure Hadoop on a single machine (e.g. an Ubuntu machine on the host VM). The configuration in standalone mode is quite straightforward and does not require major changes.
Pseudo-Distributed Mode- In a pseudo distributed environment, we will configure more than one machine, one of these to act as a master and the rest as slave machines/node. In addition we will have more than one Ubuntu machine playing on the host VM.
Fully Distributed Mode- It is quite similar to a pseudo distributed environment with the exception that instead of VM the machines/node will be on a real distributed environment.
Following are some of the prerequisites for configuring Hadoop:
Hadoop requires Java 1.5+ installation. However, using Java 1.6 is recommended for running Hadoop. It can be run on both Windows & Unix but Linux/Unix best support the production environment. Working with Hadoop on Windows also requires Cygwin installation.
Installing & Configuring Hadoop in Standalone Mode
You might want to create a dedicated user for running Apache Hadoop but it is not a prerequisite. In our demonstration, we will be using a default user for running Hadoop.
Environment
Ubuntu 10.10
JDK 6 or above
Hadoop-1.1.2 (Any stable release)
Follow these steps for installing and configuring Hadoop on a single node:
Step-1. Install Java
In this tutorial, we will use Java 1.6 therefore describing the installation of Java 1.6 in detail.
Use the below command to begin the installation of Java
$ sudo apt-get install openjdk-6-jdk
or
$ sudo apt-get install sun-java6-jdk
This will install the full JDK under /usr/lib/jvm/java-6-sundirectory.
Step-2. Verify Java installation
You can verify java installation using the following command
$ java -version
On executing this command, you should see output similar to the following:
java version “1.6.0_27”
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
Step-3. SSH configuration
- Install SSH using the command.
sudo apt-get install ssh
- Generate ssh key
ssh -keygen -t rsa -P “” (press enter when asked for a file name; this will generate a passwordless ssh file)
- Now copy the public key (id_rsa.pub) of current machine to authorized_keysBelow command copies the generated public key in the .ssh/authorized_keys file:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- Verify ssh configuration using the command
ssh localhost
Pressing yes will add localhost to known hosts
Step-4. Download Hadoop
Download the latest stable release of Apache Hadoop from http://hadoop.apache.org/ releases.html.
Unpack the release tar – zxvf hadoop-1.0.3.tar.gz
Save the extracted folder to an appropriate location, HADOOP_HOME will be pointing to this directory.
Step-5. Verify Hadoop
Check if the following directories exist under HADOOP_HOME: bin, conf, lib, bin
Use the following command to create an environment variable that points to the Hadoop installation directory (HADOOP_HOME)
export HADOOP_HOME=/home/user/hadoop
Now place the Hadoop binary directory on your command-line path by executing the command
export PATH=$PATH:$HADOOP_HOME/bin
Use this command to verify your Hadoop installation:
hadoop version
The o/p should be similar to below one
Hadoop 1.1.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/ branches/branch-0.20 -r911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
Step-6. Configure JAVA_HOME
Hadoop requires Java installation path to work on, for this we will be setting JAVA_HOME environment variable and this will point to our Java installation dir.
Java_Home can be configured in ~/.bash_profile or ~/.bashrc file. Alternatively you can also let hadoop know this by setting Java_Home in hadoop conf/hadoop-env.sh file.
Use the below command to set JAVA_HOME on Ubuntu
export JAVA_HOME=/usr/lib/jvm/java-6-sun
JAVA_HOME can be verified by command
echo $JAVA_HOME
Step-7. Create Data Directory for Hadoop
An advantage of using Hadoop is that with just a limited number of directories you can set it up to work correctly. Let us create a directory with the name hdfs and three sub-directories name, data and tmp.
Since a Hadoop user would require to read-write to these directories you would need to change the permissions of above directories to 755 or 777 for Hadoop user.
Step-8. Configure Hadoop XML files
Next, we will configure Hadoop XML file. Hadoop configuration files are in the HADOOP_HOME/conf dir.
conf/core-site.xml
>
<! -- Putting site-specific property overrides the file. -->
fs.default.name
hdfs://localhost:9000
hadoop.temp.dir
/home/girish/hdfs/temp
conf/hdfs-site.xml
<! -- Putting site specific property overrides in the file. --> dfs.name.dir /home/girish/hdfs/name dfs.data.dir /home/girish/hdfs/data dfs.replication 1
conf/mapred-site.xml
<! -- Putting site-specific property overrides this file. --> mapred.job.tracker localhost:9001
conf/masters
Not required in single node cluster.
conf/slaves
Not required in single node cluster.
Step-9. Format Hadoop Name Node-
Execute the below command from hadoop home directory
$ ~/hadoop/bin/hadoop namenode -format
The following image gives an overview of a Hadoop Distributed File System Architecture.
Step-10. Start Hadoop daemons
$ ~/hadoop/bin/start-all.sh
Step-11. Verify the daemons are running
$ jps (if jps is not in path, try /usr/java/latest/bin/jps)
output will look similar to this
9316 SecondaryNameNode
9203 DataNode
9521 TaskTracker
9403 JobTracker
9089 NameNode
Now we have all the daemons running:
Note: If your master server fails to start due to the dfs safe mode issue, execute this on the Hadoop command line:
hadoop dfsadmin -safemode leave
Also make sure to format the namenode again if you make changes to your configuration.
Step-12. Verify UIs by namenode & job tracker
Open a browser window and type the following URLs:
namenode UI: http://machine_host_name:50070
job tracker UI: http://machine_host_name:50030
substitute ‘machine host name’ with the public IP of your node e.g: http://localhost:50070
Now you have successfully installed and configured Hadoop on a single node.
Basic Hadoop Admin Commands
(Source: Getting Started with Hadoop):
The ~/hadoop/bin directory contains some scripts used to launch Hadoop DFS and Hadoop Map/Reduce daemons. These are:
- start-all.sh – Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and tasktrackers.
- stop-all.sh – Stops all Hadoop daemons.
- start-mapred.sh – Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
- stop-mapred.sh – Stops the Hadoop Map/Reduce daemons.
- start-dfs.sh – Starts the Hadoop DFS daemons, the namenode and datanodes.
- stop-dfs.sh – Stops the Hadoop DFS daemons.
Executing WordCount Example in Hadoop standalone mode
When you download Hadoop, it comes with some existing demonstration programs and WordCount is one of them.
Step-1. Creating a working directory for your data
create a directory and name it dft.
$ mkdir dft $ cd dft ~/dft$
Step-2. Creating a working directory for data
To process our text file we will have to provide this file to Hadoop File System (HDFS) afterwards Hadoop namenode and datanode willl be sharing this file from HDFS.
1. Creating a local copy
Create your own text file with some commonly used words.
Let us give it a file name of MyTextFile.txt
2. Copy Data File to HDFS
Copy the data file MyTextFile.txt to the Hadoop File System (HDFS):
syntax: hadoop dfs -copyFromLocal
$ .bin/hadoop dfs -copyFromLocal /home/hadoop/dft dft
Follow the below steps, if you encounter any issues e.g. “Cannot access dft: No such file or directory.”
Check the Hadoop dfs directory to see if the file already exists.
user@ubuntu:/opt/hadoop-1.2.1$ ./bin/hadoop dfs -ls dft
Found 1 items
-rw-r–r– 1 username supergroup 1573078 2013-10-09 00:32
/user/username/dft/MyTextFile.txt
If the file already exists it needs to be deleted first
user@ubuntu:/opt/hadoop-1.2.1$ ./bin/hadoop dfs -rmr dft
Deleted hdfs://localhost:9000/user/username/dft
Re run the -copyFromLocal command
user@ubuntu:/opt/hadoop-1.2.1$ .bin/hadoop dfs -copyFromLocal /home/hadoop/dft dft
3. Confirm Data File is available at HDFS
$ hadoop dfs -ls
Found x items … drwxr-xr-x – hadoop supergroup 0 2010-03-16 11:36 /user/hadoop/dft
Verify that your directory is now in the Hadoop File System, as indicated above.
4. Check the contents of your directory:
$ hadoop dfs -ls dft
Found 1 items -rw-r–r– 2 hadoop supergroup 1573044 2010-03-16 11:36
/user/hadoop/dft/MyTextFile.txt
Verify that the file MyTextFile.txt exists.
Step-3. WordCount.java Map-Reduce Program
The program has several sections:
The Map section
public static class MapClass extends MapReduceBase implements Mapper<longWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } }
Hadoop breaks the text file in parts which now becomes the input for mapper. Then Hadoop will tokenize each line and tags each word as a datagram <”word”, 1> which indicates this word has appeared once, so if a particular word appeared 10 time, the datagram <”word”, 1> will be appear 10 times as well to reflect the repetition.
The Reduce section
public static class Reduce extends MapReduceBase implements Reducer<text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator values, OutputCollector<text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
The reducer collects the datagram as a pair <word, word count> of word and its frequency from each data node and creates another datagram as a pair of word and its total frequency from all nodes.
map-reduce organization
conf.setMapperClass(); //In our case MapClass.class
conf.setCombinerClass(); //In our case Reduce.class
conf.setReducerClass(); //In our case again Reduce.class
Combiner and Reducer take the reduce class parameters because ultimately they are doing the same work at different levels.
datagram pair
conf.setOutputKeyClass(); // Since it is text in our case so Text.class
conf.setOutputValueClass(); // Since it is int in our case so IntWritable.class
Step-4. Running WordCount
Now you are ready to execute the WordCount example.
To run this example you should be inside the example directory of Hadoop:
Use the below syntax to execute any of the example,
$hadoop jar /home/hadoop/hadoop--examples.jar dft dft-output
Execute the below command to run WordCount example:
$hadoop jar /home/hadoop/hadoop/hadoop-1.1.12-examples.jar wordcount dft dft-output
The output of command should appear like this:
Step-4. Getting the final output
Execute this Hadoop command to check the content in hadoop dfs directory
$ hadoop dfs -ls
The output may appear like below
Found x items drwxr-xr-x – hadoop supergroup 0 2010-03-16 11:36 /user/hadoop/dft drwxr-xr-x – hadoop supergroup 0 2010-03-16 11:41 /user/hadoop/dft-output.
You must see cross verify if the directory with -output at the end of your identifier (dft in our case) has been created or not.
Checking the contents of output directory:
Execute this command to check the contents of output directory
$ hadoop dfs -ls dft-output
The output should appear like below.
Found 2 items drwxr-xr-x – hadoop supergroup 0 2010-03-16 11:40 /user/hadoop/dft-output/_logs -rw-r–r– 2 hadoop supergroup 518532 2013-10-31 10:31 /user/hadoop/dft-output/part-00000
To get the frequency count of each word we need to explore the file part-00000, using the following command to check the file contents:
$ hadoop dfs -cat dft-output/part-00000 | less
Note: Output file is created in the HDFS and not on your local storage. So if you want to copy the output file to your local storage, follow these simple steps.
$ cd ~/dft $ hadoop dfs -copyToLocal $ hadoop dfs -copyToLocal dft-output/part-00000 . $ hadoop dfs -copyToLocal dft-output/part-00000
Check the current directory to see the copied file.
$ ls
You should now be able to see
MyTextFile.txt part-00000
To remove the output directory (recursively going through directories if necessary):
$ hadoop dfs -rmr dft-output
It is important to note Hadoop WordCount program will not run again if the output directory already exists. One of the prerequisite for the program to run successfully is that it should always create a new output directory, so that you do not have to delete or remove the output directory after each job execution.
Now you have successfully installed and configured Hadoop in standalone mode. In my second part, I will talk about how to configure Apache Hadoop in Pseudo Distributed Mode.
Recent blog posts
Stay in Touch
Keep your competitive edge – subscribe to our newsletter for updates on emerging software engineering, data and AI, and cloud technology trends.