Setting up a Hadoop minicluster on Mac OS X Snow Leopard

Hadoop is an increasingly popular framework for distributed, data-parallel computing (the same operation applied to many pieces of data). At the risk of dramatic oversimplification: a heap of data is partitioned between a number of machines (known collectively as a Hadoop cluster; the individual machines are called nodes) and a uniform operation is applied to the portion of data (the shard) assigned to each machine. The result of the operation is returned as a set of key-value pairs. This stage is known as mapping, where "map" refers to a data structure consisting of a collection of key-value pairs.

At the next stage the set of key-value pairs is reduced in size: typically, for each key, a smaller set of results (ideally a single key-value pair) is produced. This step is known as reduction.

The algorithm is, unsurprisingly, called MapReduce.

The complexity arises because machines may fail or take radically different times to execute their part of the work, and data may be lost in transit.
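To get a feel for the two stages without any of that distributed machinery, the classic word-count job can be sketched as an ordinary Unix pipeline on a single machine (books.txt here is just a placeholder name for any local text file):

cat books.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c

The tr stage plays the role of the mapper (emitting one word per line), sort stands in for the shuffle that groups identical keys together, and uniq -c is the reducer that produces one count per word. Hadoop does the same kind of thing, but spread across many machines and with the failure handling built in.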

For developers a single-machine minicluster with a very small dataset is useful for testing and debugging programs. Here (approximately, since everything changes) is how to set up a test and development environment on OS X (Snow Leopard). Most instructions below assume OS X's Terminal utility is open in the Hadoop installation directory. There is a lot that could be done to make the installation more elegant and easier to use, but everyone will have their own ideas about how this could and should be done.

SSH

SSH is a secure connection protocol that comes with OS X, but it needs to be enabled and set up for passwordless login; otherwise Hadoop may ask for your password at the wrong time.

Open terminal and type

ssh localhost

You should get a reply like

Last login: Sun Jul 15 01:15:19 2012

Alternatively (for the curious) look at the .ssh directory (note the dot) in your home folder. You should see something like this.

ls .ssh

authorized_keys id_rsa id_rsa.pub known_hosts

If you cannot ssh to localhost:

  1. Go to System Preferences → Internet and Wireless → Sharing and check Remote Login.

  2. Open Terminal in your home directory and type

    1. ssh-keygen -t rsa -P ""

    2. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

  3. Try ssh localhost again.

Step 2 generates a public and private key pair for ssh to use and appends the public key to a file of public keys belonging to authorised users, so that you can log in to localhost without a password.
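A quick way to confirm that passwordless login now works is the following (the BatchMode option makes ssh fail rather than prompt for a password):

ssh -o BatchMode=yes localhost echo "passwordless ssh is working"

If the message comes back without any prompt, Hadoop will be able to open connections to localhost unattended.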

Getting Hadoop

This changes from time to time. Google "Hadoop download" and look around till you find the latest stable version (masochists can download an alpha version if they wish). You want the hadoop-*-bin.tar.gz archive or, if you want to look at the source as well, the hadoop-*.tar.gz archive. The version used here was 1.0.3.

Download it, unpack it, and put the resulting folder wherever you want. I keep it in a folder called /Local.

Configuration


You need to edit some files, all in the conf directory of the distribution (all names below are relative to the directory unpacked from the tar file).

conf/hadoop-env.sh

  • uncomment the export JAVA_HOME line and set it to /Library/Java/Home

  • uncomment the export HADOOP_HEAPSIZE line and keep it at 2000

  • You may need to add or uncomment the following line, which fixes a problem with Lion:

  • export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

The last modification is there to overcome a problem with Mac OS X Lion; I found I needed it with the version I downloaded.
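Putting those edits together, the relevant lines of conf/hadoop-env.sh should end up looking roughly like this (values exactly as above):

# Java location on OS X
export JAVA_HOME=/Library/Java/Home
# Maximum heap for the Hadoop daemons, in MB
export HADOOP_HEAPSIZE=2000
# Workaround for a Kerberos-related problem on OS X Lion
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"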

conf/core-site.xml should look something like this.

(It sets the location where Hadoop stores temporary data and HDFS, the Hadoop Distributed File System. It also sets the default name for the filesystem, which is a port on your local machine. Of course you can make the value of hadoop.tmp.dir whatever you want.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/${user.name}/hadoop-store</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

conf/hdfs-site.xml should look like this. The replication property determines how many copies of each file should be stored. As this will run on a single machine it is set to 1.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml should look like this:

(You are specifying the job tracker address and the maximum number of map and reduce tasks the task tracker will run simultaneously.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>


Format the NameNode

This needs to be done only once. From the Hadoop install directory type

bin/hadoop namenode -format

You should see a load of lines, the last ending with

SHUTDOWN_MSG: Shutting down NameNode

You should not need to format the namenode again. It creates the hadoop.tmp.dir directory and populates it.

IF YOU REFORMAT THE NAMENODE LATER, DELETE THE hadoop.tmp.dir DIRECTORY FIRST.

If you used the default above it will be hadoop-store in your home directory. If you do not delete it things will no longer work: applications will either hang or throw exceptions, and you will see odd messages in the logs when starting HDFS.
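If you ever do need to reformat, a safe sequence (assuming the default hadoop.tmp.dir location above) is:

bin/stop-all.sh               # stop all the Hadoop daemons
rm -rf ~/hadoop-store         # delete the old hadoop.tmp.dir
bin/hadoop namenode -format   # reformat the namenode
bin/start-all.sh              # start everything again (see the next section)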

Start hdfs

Again, from the Hadoop install directory

bin/start-all.sh

The response should be like this

starting namenode, logging to......

localhost: starting datanode, logging to …......................

localhost: starting secondarynamenode, logging to ….............................

starting jobtracker, logging to …...............................

localhost: starting tasktracker, logging to …...............

Take a note of the log file locations and names. If there are no error messages or exception stack traces you can continue. Otherwise revisit the instructions above. Another smoke test is to type

ps ax | grep hadoop | wc -l

The expected result should be 6
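If the JDK's jps tool is on your PATH another quick check is

jps

which should list the five Hadoop daemons: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.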

Explore and populate HDFS

You can access many Unix filesystem facilities as options of the dfs command of hadoop, thus

bin/hadoop dfs -ls

will show all the folders in your area of the DFS; bin/hadoop dfs -rmr removes a directory from the DFS, and so on.

One notable exception is copying from your local filesystem to the DFS, which is done like this

bin/hadoop dfs -copyFromLocal from to

from is a full pathname in your local system and to is where you want it in DFS.

There is a similar option in the other direction.

bin/hadoop dfs -copyToLocal output myoutput

Try this a couple of times with files and directories till you are happy you know what is happening. Use the -ls and -cat options of the hadoop dfs command to check everything is going well.

In particular you will want at least one directory in dfs for input data.
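As a concrete example, a minimal session that creates an input directory and loads a text file into it might look like this (/usr/share/dict/words is used purely because it is a handy text file that ships with OS X; any text file will do):

bin/hadoop dfs -mkdir input                                 # create the input directory in the DFS
bin/hadoop dfs -copyFromLocal /usr/share/dict/words input   # copy a local text file into it
bin/hadoop dfs -ls input                                    # confirm the file arrived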

Running the examples

The final test is to run some of the canned examples that come with the install. The generic command for this is

bin/hadoop jar <jar file name> <example name> <optional parameters>

thus

bin/hadoop jar hadoop-examples-1.0.3.jar pi 100 10

computes an approximation to pi. Since you are running on a single node cluster you can expect it to take a while.

Replacing "pi 100 10" with "wordcount input output" runs a word-count program that counts the words in the text files in the directory input in the DFS and writes the results to a newly created directory called output.

First run the pi example, then

bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output

NOTE: Hadoop will not overwrite an existing directory, and if you do not either name a new output directory or delete the old one you will get an error and the job will not run.
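So before re-running wordcount either choose a new output directory name or delete the old one first:

bin/hadoop dfs -rmr output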

Checking the results

You can verify the results by looking at the output directory

bin/hadoop dfs -ls output

should show three files. The one called part-r-00000 should not be empty. See what is in it by typing

bin/hadoop dfs -cat output/part*

Notes

Namenode info: http://localhost:50070/dfshealth.jsp

Jobtracker: http://localhost:50030

Starting hadoop cluster: bin/start-all.sh

Stop hadoop cluster: bin/stop-all.sh

Verify hadoop started properly: Use ps ax | grep hadoop | wc -l and make sure you see 6 as output.

There are 5 processes associated with Hadoop and one belonging to the grep in the command itself.

When formatting the namenode, at least note where the hadoop.tmp.dir directory is located and where the logs are. The logs are your friend. Sometimes your only friend.

The Wrap

This note has shown a cheap and cheerful way to set up a single-node Hadoop cluster on OS X. The process for setting up a multi-node cluster is more involved, and in a corporate environment may be further complicated by security restrictions involving ssh. Nevertheless this setup lets developers learn by experimentation on very small datasets, and test and debug programs on the same datasets before deploying them in production. It will also be useful for experimenting with PIG, the Hadoop scripting language.

The references below proved invaluable. However the authors perhaps assumed too much knowledge on the part of the readers and did not detail traps that might occur. This is common, and I have doubtless missed some things you will find useful.

This is not the only way to experiment with Hadoop on OS X. If you are only interested in coding you can run Hadoop programs, with rather less hassle, in Eclipse. But this note is long enough already.

Comments (3)

AlexK2009 3 years ago from Edinburgh, Scotland Author

Thanks Bob and Arvind. I have been meaning to upgrade my minicluster but decided to wait till YARN/Hadoop 2 was properly released, since I do not want to go through the pain of installation and configuration more often than I have to.

I find the minicluster most useful for running PIG scripts these days, as I can run prototypes under Eclipse.


Bob Freitas 3 years ago

A great post! It provided me with a good starting point. Too bad there is not much info available on the MiniCluster. To extend things I put together a follow-on article giving a little more detail and fully working project in GitHub, http://www.lopakalogic.com/articles/hadoop-article...


Arvindsaxena 3 years ago

hi

Nice and valuable blog about the Hadoop introduction. Please see a more understandable video at http://www.youtube.com/watch?v=ApR6pHN-m6M
