
Setting up a Hadoop minicluster on Mac OS X Snow Leopard

Updated on June 8, 2013

Hadoop is an increasingly popular framework for distributed SIMD (single instruction, multiple data) parallel computing. At the risk of dramatic oversimplification: a heap of data is partitioned between a number of processors (known collectively as a Hadoop cluster; the individual processors are called nodes) and a uniform operation is applied to the portion of data (the shard) assigned to each machine. The result of the operation is returned as a set of key-value pairs. This stage is known as mapping, where “Map” refers to a data structure consisting of a collection of key-value pairs.

At the next stage the set of key-value pairs is reduced in size: typically a smaller set of results is produced for each key, ideally a single key-value pair. This step is known as reduction.

The algorithm is unsurprisingly called Map-Reduce.
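To get a feel for the shape of the computation before installing anything, the classic word count job can be mimicked with an ordinary shell pipeline. This is only a local analogy, not how Hadoop actually runs a job, and sometext.txt is just a placeholder file name: tr plays the mapper and emits one word per line, sort plays the shuffle and groups identical keys, and uniq -c plays the reducer and produces one count per key.

# Rough local analogy of map-reduce word count (not a Hadoop job).
# "map": tr emits one word (a key) per line of sometext.txt (any text file)
# "shuffle": sort groups identical keys together
# "reduce": uniq -c produces one count per key
tr -s '[:space:]' '\n' < sometext.txt | sort | uniq -c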

The complexity arises because processors may fail or take radically different times to execute their part of the work, and data may be lost in transit.

For developers a single-machine minicluster with a very small dataset is useful for testing and debugging programs. Here (approximately, since everything changes) is how to set up a test and development environment on OS X (Snow Leopard). Most instructions here assume OS X's Terminal utility is open in the Hadoop installation directory. There is a lot that could be done to make the installation more elegant and easier to use, but everyone will have their own ideas about how this could and should be done.

SSH

SSH is a secure connection protocol that comes with OS X, but it is best to enable it up front; otherwise Hadoop may ask for your password at the wrong time.

Open terminal and type

ssh localhost

You should get a reply like

Last login: Sun Jul 15 01:15:19 2012

Alternatively (for the curious) look at your .ssh directory (note the dot) in your home folder. You should see something like the following files.

ls .ssh

authorized_keys id_rsa id_rsa.pub known_hosts

If you cannot ssh to localhost:

  1. Go to System Preferences → Internet and Wireless → Sharing and check Remote Login.

  2. Open Terminal in your home directory and type

    1. ssh-keygen -t rsa -P ""

    2. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

  3. Try ssh localhost again.

Step two generates a public and private key pair for ssh to use and appends the public key to a file of PUBLIC keys belonging to authorised cluster users.
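As a quick check that passwordless login really works (so Hadoop's scripts will not stall at a password prompt), something like the following should print OK without prompting; BatchMode makes ssh fail rather than ask for a password.

# Should print OK with no password prompt; if not, redo the key steps above.
ssh -o BatchMode=yes localhost 'echo OK'
# On OS X you can also confirm the ssh server (Remote Login) is switched on:
sudo systemsetup -getremotelogin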

Getting Hadoop

This changes from time to time. Google “Hadoop download” and look around until you find the latest stable version (masochists can download an alpha version if they wish). You want the hadoop-*-bin.tar.gz version or, if you want to look at the source, the plain *.tar.gz version. The version used here was 1.0.3.

Download it, unpack it, and put the folder where you want. I keep mine in a folder called /Local.

Configuration


You need to edit some files (all names are relative to the directory unpacked from the tar file), all in the conf directory of the distribution.

conf/hadoop-env.sh

  • uncomment the export JAVA_HOME line and set it to /Library/Java/Home

  • uncomment the export HADOOP_HEAPSIZE line and keep it at 2000

  • You may need to add or uncomment the following line, which fixes a problem with OS X Lion

  • export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

The last modification is to overcome a problem with Mac OS Lion. I found it necessary with the version I downloaded.

conf/core-site.xml should look something like this.

(It sets the location where Hadoop stores its temporary directories and HDFS, the Hadoop Distributed File System. It also sets the default name for the filesystem, which is a port on your local machine. Of course you can make the value of hadoop.tmp.dir whatever you want.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/${user.name}/hadoop-store</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

conf/hdfs-site.xml should look like this. The replication property determines how many copies of each block of a file are stored. As this will run on a single machine it is set to 1.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml should look like this:

(You are specifying the job tracker address and the maximum number of map and reduce tasks the task tracker will run at once.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>


Format the NameNode

This needs to be done only once. From the Hadoop install directory type

bin/hadoop namenode -format

You should see a load of lines, the last ending with

SHUTDOWN_MSG: Shutting down NameNode

You should not need to format the namenode again. Formatting creates the hadoop.tmp.dir directory and populates it.

IF YOU REFORMAT THE NAMENODE LATER, DELETE THE hadoop.tmp.dir DIRECTORY FIRST.

If you used the default above it will be hadoop-store in your home directory. If you do not delete it things will no longer work: applications will either hang or throw exceptions, and you will see odd messages in the logs when starting HDFS.
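If you ever do need a clean slate, a safe order of operations (assuming the default hadoop.tmp.dir location above) is roughly the following; note that it destroys everything stored in HDFS.

# Stop the daemons, remove the old store, reformat, then restart.
bin/stop-all.sh
rm -rf ~/hadoop-store
bin/hadoop namenode -format
bin/start-all.sh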

Start Hadoop

Again from the Hadoop install directory, type

bin/start-all.sh

The response should look something like this

starting namenode, logging to ...

localhost: starting datanode, logging to ...

localhost: starting secondarynamenode, logging to ...

starting jobtracker, logging to ...

localhost: starting tasktracker, logging to ...

Take note of the log file locations and names. If there are no error messages or exception stack traces you can continue; otherwise revisit the instructions above. Another smoke test is to type

ps ax | grep hadoop | wc -l

The expected result is 6
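If the JDK's jps tool is on your path, it gives a more readable version of the same check by listing the running Java processes by name.

# Expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself).
jps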

Explore and populate HDFS

You can access many Unix filesystem facilities as options of the dfs command of hadoop, thus

bin/hadoop dfs -ls

will show all the folders in your area of the dfs; bin/hadoop dfs -rmr removes a directory from dfs, and so on.

One operation with no direct Unix counterpart is copying from your local filesystem to dfs, which is done like this

bin/hadoop dfs -copyFromLocal from to

Here from is a full pathname on your local system and to is where you want it in DFS.

There is a similar option in the other direction.

bin/hadoop dfs -copyToLocal output myoutput

Try this a couple of times with files and directories until you are happy you know what is happening. Use the -ls and -cat options of the hadoop dfs command to check everything is going well.

In particular you will want at least one directory in dfs for input data.
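For example, one way to create that input directory and load a few local text files into it (the local path here is only a placeholder) would be:

# Make an input directory in HDFS and copy some local text files into it.
# Replace /Users/yourname/sometext with wherever your text files live.
bin/hadoop dfs -mkdir input
bin/hadoop dfs -copyFromLocal /Users/yourname/sometext/*.txt input
bin/hadoop dfs -ls input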

Running the examples

The final test is to run some of the canned examples that come with the install. The generic command for this is

bin/hadoop jar <jar file name> <example name> <optional parameters>

thus

bin/hadoop jar hadoop-examples-1.0.3.jar pi 100 10

computes an approximation to pi. Since you are running on a single node cluster you can expect it to take a while.

Replacing “pi 100 10” with “wordcount input output” runs a word count program that counts the words in the text files in the dfs directory input and writes the results to a newly created directory called output.

First run the pi example, then

bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output

NOTE: Hadoop will not overwrite an existing output directory; if you do not either name a new output directory or delete the old one you will get an error and the job will not run.
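A repeatable way to rerun the word count is therefore to remove the previous output first (the -rmr step will complain harmlessly the first time, when there is nothing to delete):

# Remove any previous output, then run word count over the input directory.
bin/hadoop dfs -rmr output
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output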

Checking the results

You can verify the results by looking at the output directory

bin/hadoop dfs -ls output

should show three files. The one called part-r-00000 should not be empty. See what is in it by typing

bin/hadoop dfs -cat output/part*
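Since the full word list can be long, piping the result through head is a handy way to take a quick look (again assuming the output directory used above):

# Show just the first twenty key-value pairs of the result.
bin/hadoop dfs -cat output/part-r-00000 | head -20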

Notes

Namenode info: http://localhost:50070/dfshealth.jsp

Jobtracker: http://localhost:50030

Start hadoop cluster: bin/start-all.sh

Stop hadoop cluster: bin/stop-all.sh

Verify hadoop started properly: use ps ax | grep hadoop | wc -l and make sure you see 6 as output.

There are 5 processes associated with hadoop; the sixth line counted is the grep command itself.

When formatting the namenode, at least look at where the hadoop.tmp.dir directory is located and where the logs are. The logs are your friend. Sometimes your only friend.

The Wrap

This note has shown a cheap and cheerful way to set up a single-node hadoop cluster on OS X. The process for setting up a multi-node cluster is more complicated, and in a corporate environment may be further complicated by security restrictions involving ssh. Nevertheless this setup lets developers learn by experimentation on very small datasets, and test and debug programs on those datasets before deploying them in production. It will also be useful for experimenting with Pig, the Hadoop scripting language.

The references below proved invaluable. However, the authors perhaps assumed too much knowledge on the part of their readers and did not detail traps that might occur. This is common, and I have doubtless missed some things you will find useful.

This is not the only way to experiment with Hadoop on OS X. If you are only interested in coding you can run Hadoop programs, with rather less hassle, in Eclipse. But this note is long enough already.
