// Tutorial //

How To Spin Up a Hadoop Cluster with DigitalOcean Droplets

Published on April 4, 2018
By Jeremy Morris
Developer and author at DigitalOcean.

Introduction

This tutorial will cover setting up a Hadoop cluster on DigitalOcean. The Hadoop software library is an Apache framework that lets you process large data sets in a distributed way across clusters of servers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines. It also detects and handles failures at the application layer, allowing it to deliver a highly available service.

There are four important modules that we will be working with in this tutorial:

  • Hadoop Common is the collection of common utilities and libraries necessary to support other Hadoop modules.
  • The Hadoop Distributed File System (HDFS), as stated by the Apache organization, is a highly fault-tolerant, distributed file system, specifically designed to run on commodity hardware to process large data sets.
  • Hadoop YARN is the framework used for job scheduling and cluster resource management.
  • Hadoop MapReduce is a YARN-based system for parallel processing of large data sets.

In this tutorial, we will be setting up and running a Hadoop cluster on four DigitalOcean Droplets.

Prerequisites

This tutorial will require the following:

  • Four Ubuntu 16.04 Droplets with non-root sudo users set up. If you do not have this set up, follow along with steps 1-4 of the Initial Server Setup with Ubuntu 16.04. This tutorial will assume that you are using an SSH key from a local machine. Per Hadoop’s language, we’ll refer to these Droplets by the following names:

    • hadoop-master
    • hadoop-worker-01
    • hadoop-worker-02
    • hadoop-worker-03
  • Additionally, you may want to use DigitalOcean Snapshots after the initial server setup and the completion of Steps 1 and 2 (below) on your first Droplet.

With these prerequisites in place, you will be ready to begin setting up a Hadoop cluster.

Step 1 — Installation Setup for Each Droplet

We’re going to be installing Java and Hadoop on each of our four Droplets. If you don’t want to repeat each step on each Droplet, you can use DigitalOcean Snapshots at the end of Step 2 in order to replicate your initial installation and configuration.

First, we’ll update Ubuntu with the latest software patches available:

  1. sudo apt-get update && sudo apt-get -y dist-upgrade

Next, let’s install the headless version of Java for Ubuntu on each Droplet. “Headless” refers to the software that is capable of running on a device without a graphical user interface.

  1. sudo apt-get -y install openjdk-8-jdk-headless
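You can confirm that the JDK installed correctly by checking the Java version (the exact output will vary with the patch release):

  1. java -version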

To install Hadoop on each Droplet, let’s make the directory where Hadoop will be installed. We can call it my-hadoop-install and then move into that directory.

  1. mkdir my-hadoop-install && cd my-hadoop-install

Once we’ve created the directory, let’s download the most recent binary from the Hadoop releases list. At the time of this tutorial, the most recent is Hadoop 3.0.1.

Note: Keep in mind that these downloads are distributed via mirror sites, and it is recommended that you first check the file for tampering using either GPG or SHA-256.

When you are satisfied with the download you have selected, you can use the wget command with the binary link you have chosen, such as:

  1. wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
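If you would like to verify the archive before unpacking it, you can compare its SHA-256 hash with the value published by Apache. This is only a sketch; the checksum URL below is an assumption, so confirm the exact filename on the Apache downloads page:

  1. wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz.sha256
  2. sha256sum hadoop-3.0.1.tar.gz

The hash that sha256sum prints should match the value in the downloaded checksum file.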

Once your download is complete, extract the file’s contents using tar, a file archiving tool:

  1. tar xvzf hadoop-3.0.1.tar.gz

We’re now ready to start our initial configuration.

Step 2 — Update Hadoop Environment Configuration

For each Droplet node, we’ll need to set up JAVA_HOME. Open the following file with nano or another text editor of your choice so that we can update it:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh

Update the following section, where JAVA_HOME is located:

hadoop-env.sh
...
###
# Generic settings for HADOOP
###

# Technically, the only required environment variable is JAVA_HOME.
# All others are optional.  However, the defaults are probably not
# preferred.  Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
# export JAVA_HOME=

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...

To look like this:

hadoop-env.sh
...
###
# Generic settings for HADOOP
###

# Technically, the only required environment variable is JAVA_HOME.
# All others are optional.  However, the defaults are probably not
# preferred.  Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...
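If you are unsure of the JVM path on your Droplet, you can resolve it by following the java symlink (a quick check that works with Ubuntu’s OpenJDK packages):

  1. readlink -f /usr/bin/java

The directory portion of the output, up through java-8-openjdk-amd64, is the value that JAVA_HOME should be set to.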

We’ll also need to add some environment variables to run Hadoop and its modules. They should be added to the bottom of the file so it looks like the following, where sammy would be your sudo non-root user’s username.

Note: If you are using a different username across your cluster Droplets, you will need to edit this file in order to reflect the correct username for each specific Droplet.

hadoop-env.sh
...
#
# To prevent accidents, shell commands be (superficially) locked
# to only allow certain users to execute certain subcommands.
# It uses the format of (command)_(subcommand)_USER.
#
# For example, to limit who can execute the namenode command,
export HDFS_NAMENODE_USER="sammy"
export HDFS_DATANODE_USER="sammy"
export HDFS_SECONDARYNAMENODE_USER="sammy"
export YARN_RESOURCEMANAGER_USER="sammy"
export YARN_NODEMANAGER_USER="sammy"

At this point, you can save and exit the file. Next, run the following command to apply our exports:

  1. source ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh
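To confirm that the exports took effect in your current shell, you can print one of the variables you just set:

  1. echo $JAVA_HOME

This should return the OpenJDK path you configured above.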

With the hadoop-env.sh script updated and sourced, we need to create a data directory for the Hadoop Distributed File System (HDFS) to store all relevant HDFS files.

  1. sudo mkdir -p /usr/local/hadoop/hdfs/data

Set the ownership of this directory to your respective user. Remember, if you have different usernames on each Droplet, be sure to allow your respective sudo user to have these permissions:

  1. sudo chown -R sammy:sammy /usr/local/hadoop/hdfs/data
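You can then double-check the result (the sammy name here follows the example user above):

  1. ls -ld /usr/local/hadoop/hdfs/data

The output should show your sudo user as both the owner and group of the data directory.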

If you would like to use a DigitalOcean Snapshot to replicate these commands across your Droplet nodes, you can create your Snapshot now and create new Droplets from this image. For guidance on this, you can read An Introduction to DigitalOcean Snapshots.

When you have completed the steps above across all four Ubuntu Droplets, you can move on to completing this configuration across nodes.

Step 3 — Complete Initial Configuration for Each Node

At this point, we need to update the core-site.xml file for all four of your Droplet nodes. Within each individual Droplet, open the following file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/core-site.xml

You should see the following lines:

core-site.xml
...
<configuration>
</configuration>

Change the file to look like the following XML, entering the master node’s IP address where we have server-ip written. The fs.defaultFS property tells each node where to reach the NameNode, so all four Droplets, workers included, should point to hadoop-master’s IP here. If you are using a firewall, you’ll need to open port 9000.

core-site.xml
...
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://server-ip:9000</value>
    </property>
</configuration>

Repeat this edit on all four of your servers, writing in the master node’s IP each time.
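If you are using UFW as your firewall (an assumption; adapt the command to your firewall of choice), opening the port could look like this:

  1. sudo ufw allow 9000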

Now all of the general Hadoop settings should be updated for each server node, and we can continue onto connecting our nodes via SSH keys.

Step 4 — Set Up SSH for Each Node

In order for Hadoop to work properly, we need to set up passwordless SSH between the master node and the worker nodes (the language of master and worker is Hadoop’s language to refer to primary and secondary servers).

For this tutorial, the master node will be hadoop-master and the worker nodes will be collectively referred to as hadoop-worker, but you’ll have three of them in total (referred to as -01, -02, and -03). We first need to create a public-private key-pair on the master node, which will be the node with the IP address belonging to hadoop-master.

While on the hadoop-master Droplet, run the following command. You’ll press enter to use the default for the key location, then press enter twice to use an empty passphrase:

  1. ssh-keygen

For each of the worker nodes, we need to take the master node’s public key and copy it into each of the worker nodes’ authorized_keys file.

Get the public key from the master node by running cat on the id_rsa.pub file located in your .ssh folder, to print it to the console:

  1. cat ~/.ssh/id_rsa.pub
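As an alternative to the manual copy described next, if password authentication is still enabled on your worker Droplets (an assumption; key-only logins are common), ssh-copy-id can append the key for you:

  1. ssh-copy-id sammy@hadoop-worker-01-server-ip

Repeat this for the other two workers, or continue with the manual steps below.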
Now log into each worker node Droplet, and open the authorized_keys file:

  1. nano ~/.ssh/authorized_keys

You’ll copy the master node’s public key — which is the output you generated from the cat ~/.ssh/id_rsa.pub command on the master node — into each Droplet’s respective ~/.ssh/authorized_keys file. Be sure to save each file before closing.

When you are finished updating the 3 worker nodes, also copy the master node’s public key into its own authorized_keys file by issuing the same command:

  1. nano ~/.ssh/authorized_keys
On hadoop-master, you should set up the ssh configuration to include each of the hostnames of the related nodes. Open the configuration file for editing, using nano:

  1. nano ~/.ssh/config

You should modify the file to look like the following, with relevant IPs and usernames added:

config
Host hadoop-master-server-ip
    HostName hadoop-master-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-01-server-ip
    HostName hadoop-worker-01-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-02-server-ip
    HostName hadoop-worker-02-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-03-server-ip
    HostName hadoop-worker-03-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Save and close the file.
From hadoop-master, SSH into each node:

  1. ssh sammy@hadoop-worker-01-server-ip

Since it’s your first time logging into each node with the current system set up, it will ask you the following:

Output
Are you sure you want to continue connecting (yes/no)?

Reply to the prompt with yes. This will be the only time it needs to be done, but it is required for each worker node for the initial SSH connection. Finally, log out of each worker node to return to hadoop-master:

  1. logout

Be sure to repeat these steps for the remaining two worker nodes.

Now that we have successfully set up passwordless SSH for each worker node, we can continue on to configure the master node.
Step 5 — Configure the Master Node

For our Hadoop cluster, we need to configure the HDFS properties on the master node Droplet.

While on the master node, edit the following file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml

Edit the configuration section to look like the XML below:

hdfs-site.xml
...
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

Save and close the file.
We’ll next configure the MapReduce properties on the master node. Open mapred-site.xml with nano or another text editor:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/mapred-site.xml

Then update the file so that it looks like this, with your master node’s IP address reflected below:

mapred-site.xml
...
<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>hadoop-master-server-ip:54311</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Save and close the file. If you are using a firewall, be sure to open port 54311.
Next, set up YARN on the master node. Again, we are updating the configuration section of another XML file, so let’s open the file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/yarn-site.xml

Now update the file, being sure to input your current server’s IP address:

yarn-site.xml
...
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master-server-ip</value>
    </property>
</configuration>
Finally, let’s configure Hadoop’s point of reference for what the master and worker nodes should be. First, open the masters file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/masters

Into this file, you’ll add your current server’s IP address:

masters
hadoop-master-server-ip

Now, open and edit the workers file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/workers

Here, you’ll add the IP addresses of each of your worker nodes, underneath where it says localhost:

workers
localhost
hadoop-worker-01-server-ip
hadoop-worker-02-server-ip
hadoop-worker-03-server-ip

After finishing the configuration of the MapReduce and YARN properties, we can now finish configuring the worker nodes.
Step 6 — Configure the Worker Nodes

We’ll now configure the worker nodes so that they each have the correct reference to the data directory for HDFS.

On each worker node, edit this XML file:

  1. nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml

Replace the configuration section with the following:

hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

Save and close the file. Be sure to replicate this step on all three of your worker nodes.

At this point, our worker node Droplets are pointing to the data directory for HDFS, which will allow us to run our Hadoop cluster.
Step 7 — Run the Hadoop Cluster

We have reached a point where we can start our Hadoop cluster. Before we start it up, we need to format the HDFS on the master node. While on the master node Droplet, change directories to where Hadoop is installed:

  1. cd ~/my-hadoop-install/hadoop-3.0.1/

Then run the following command to format HDFS:

  1. sudo ./bin/hdfs namenode -format

A successful formatting of the namenode will result in a lot of output, consisting mostly of INFO statements. At the bottom you will see the following, confirming that you’ve successfully formatted the storage directory.

Output
...
2018-01-28 17:58:08,323 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/data has been successfully formatted.
2018-01-28 17:58:08,346 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 using no compression
2018-01-28 17:58:08,490 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 of size 389 bytes saved in 0 seconds.
2018-01-28 17:58:08,505 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-01-28 17:58:08,519 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-example-node/127.0.1.1
************************************************************/
Now, start the Hadoop cluster by running the following scripts (be sure to check the scripts before running them, using the less command):

  1. sudo ./sbin/start-dfs.sh

You’ll then see output that contains the following:

Output
Starting namenodes on [hadoop-master-server-ip]
Starting datanodes
Starting secondary namenodes [hadoop-master]

Then run YARN, using the following script:

  1. ./sbin/start-yarn.sh

The following output will appear:

Output
Starting resourcemanager
Starting nodemanagers

Once you run those commands, you should have Hadoop daemons running on the master node and on each of the worker nodes.

We can check the daemons by running the jps command to check for Java processes:

  1. jps
After running the jps command, you will see that the NodeManager, SecondaryNameNode, Jps, NameNode, ResourceManager, and DataNode are running. Something similar to the following output will appear:

Output
9810 NodeManager
9252 SecondaryNameNode
10164 Jps
8920 NameNode
9674 ResourceManager
9051 DataNode
This verifies that we’ve successfully created a cluster and that the Hadoop daemons are running.
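As an optional functional check, you can try a simple HDFS operation from the Hadoop installation directory on the master node (a sketch; the /user/sammy path is just an illustrative choice):

  1. ./bin/hdfs dfs -mkdir -p /user/sammy
  2. ./bin/hdfs dfs -ls /

If the directory is created and listed without errors, HDFS is accepting reads and writes across the cluster.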
In a web browser of your choice, you can get an overview of the health of your cluster by navigating to:

http://hadoop-master-server-ip:9870

If you have a firewall, be sure to open port 9870. You’ll see something that looks similar to the following:

![Hadoop Health Verification](https://assets.digitalocean.com/articles/hadoop-cluster/hadoop-verification.png)

From here, you can navigate to the Datanodes item in the menu bar to see the node activity.
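If you’d like to watch YARN and MapReduce do some work end to end, you can submit one of the example jobs bundled with the release (a sketch; double-check the exact jar name under share/hadoop/mapreduce/ in your installation):

  1. ./bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.1.jar pi 2 5

This runs the bundled pi estimator with 2 map tasks of 5 samples each; a closing line such as Estimated value of Pi is ... indicates that the cluster scheduled and completed the job.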
Conclusion

In this tutorial, we went over how to set up and configure a Hadoop multi-node cluster using DigitalOcean Ubuntu 16.04 Droplets. You can also now monitor and check the health of your cluster using Hadoop’s DFS Health web interface.

To get an idea of possible projects you can work on to utilize your newly configured cluster, check out Apache’s long list of projects powered by Hadoop (https://wiki.apache.org/hadoop/PoweredBy).
