Installing Hadoop on Ubuntu 20.04

Apache Hadoop is a Java-based, open-source software platform for storing and analyzing large datasets across clusters of machines. It stores its data in the Hadoop Distributed File System (HDFS) and processes it with MapReduce. Hadoop is widely used for machine learning and data mining workloads, and for managing fleets of dedicated servers; most major industries have adopted it as a standard framework for processing and storing big data. Hadoop is designed to be deployed across networks of hundreds or even thousands of dedicated servers, all of which work together to handle the massive volume and variety of incoming datasets. A fully developed Hadoop platform also includes a collection of tools that extend the core framework. Understanding this basic architecture helps you make sense of what you are configuring below.

Step 1: Create user for Hadoop environment

Hadoop should have its own dedicated user account on your system. Open a terminal (Ctrl + Alt + T) and create one; you will be prompted to set a password for the account. You are free to use any username and password you see fit. I'm adding a user named hadoop:

sudo adduser hadoop

Step 2: Installing Java

The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment (JRE) and Java Development Kit (JDK). Update your package index before starting a new installation:

sudo apt update

Install OpenJDK 8, a Java version that Hadoop supports well:

sudo apt install openjdk-8-jdk -y

Once installed, verify the Java version:

java -version

Step 3: Install OpenSSH on Ubuntu

Install the OpenSSH server and client:

sudo apt install openssh-server openssh-client -y

Switch to the newly created user:

sudo su - hadoop

Generate a public/private key pair:

ssh-keygen -t rsa

Append the generated public key from id_rsa.pub to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Restrict the permissions of the authorized_keys file:

chmod 640 ~/.ssh/authorized_keys

Verify that password-less SSH is functional:

ssh localhost

Step 4: Install Apache Hadoop

Download a stable release of Hadoop. To find the latest version, go to the official Apache Hadoop download page; this tutorial uses 3.3.2:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz

Extract the downloaded archive:

tar -xvzf hadoop-3.3.2.tar.gz

Rename the extracted directory to something shorter:

mv hadoop-3.3.2 hadoop

Finally, check where Java is installed so you can set the JAVA_HOME variable in Step 5:

dirname $(dirname $(readlink -f $(which java)))

With OpenJDK 8 this command may return the bundled jre directory; Step 5b shows how to derive the JDK path from javac instead.
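If you want to repeat the download and unpacking on another machine, the commands above can be collected into one script. This is a minimal sketch, assuming the hadoop user from Step 1 exists, OpenJDK 8 is already installed, and you want the same 3.3.2 release; adjust HADOOP_VERSION for a different release:

#!/usr/bin/env bash
# Minimal sketch: download and unpack Hadoop into the current user's home
# directory, then print the path to use for JAVA_HOME in Step 5.
# Run this as the hadoop user created in Step 1.
set -euo pipefail

HADOOP_VERSION=3.3.2   # assumption: same release as the wget command above

cd "$HOME"
wget "https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
tar -xvzf "hadoop-${HADOOP_VERSION}.tar.gz"
mv "hadoop-${HADOOP_VERSION}" hadoop

# Derive JAVA_HOME from javac (as in Step 5b) so the path points at the JDK
# root rather than the bundled JRE.
JAVA_HOME_DIR="$(dirname "$(dirname "$(readlink -f "$(which javac)")")")"
echo "Use this as JAVA_HOME: ${JAVA_HOME_DIR}"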
Step 5: Configure Hadoop

Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single Java process.

A Hadoop environment is configured by editing a set of configuration files: ~/.bashrc, hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Apart from ~/.bashrc, they can be found in the etc/hadoop directory inside the newly created hadoop folder.

Step 5a: Configure Hadoop Environment Variables (bashrc)

Edit the ~/.bashrc file to configure the Hadoop environment variables:

nano ~/.bashrc

Add the following lines to the file, then save and close it. HADOOP_HOME should point to the hadoop directory you extracted in Step 4; adjust it if you unpacked Hadoop somewhere else.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Activate the environment variables:

source ~/.bashrc

Step 5b: Edit hadoop-env.sh File

The hadoop-env.sh file serves as a master file for configuring YARN, HDFS, MapReduce, and Hadoop-related project settings. When setting up a single-node Hadoop cluster, you need to define which Java installation should be used. Use the $HADOOP_HOME variable you just created to open the file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the JAVA_HOME line (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system. If you installed the same version as in Step 2, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The path needs to match the location of the Java installation on your system. If you need help locating the correct Java path, run the following command in your terminal window:

which javac

The resulting output provides the path to the Java binary. Use that path to find the OpenJDK directory:

readlink -f /usr/bin/javac

The section of the path just before /bin/javac is what needs to be assigned to the JAVA_HOME variable.

Step 5c: Edit core-site.xml File

The core-site.xml file defines HDFS and Hadoop core properties. To set up Hadoop in pseudo-distributed mode, you need to specify the URL for your NameNode and, optionally, the temporary directory Hadoop uses for the map and reduce process. Open the file in a text editor:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration inside the <configuration> element to replace the default local file system setting with your HDFS URL:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

This example uses values specific to the local system. You should use values that match your system's requirements, and keep them consistent throughout the configuration process.
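The local directories referenced in the rest of Step 5 must exist and be writable by the hadoop user. Below is a minimal sketch that creates them up front: the hadoopdata paths match the hdfs-site.xml values used in the next step, while the tmpdata path is only an assumed location you could point hadoop.tmp.dir at if you decide to override Hadoop's temporary directory in core-site.xml.

# Run as the hadoop user. Creates the NameNode and DataNode directories used
# in Step 5d, plus an assumed location for hadoop.tmp.dir if you set it.
mkdir -p /home/hadoop/hadoopdata/hdfs/namenode \
         /home/hadoop/hadoopdata/hdfs/datanode \
         /home/hadoop/tmpdata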
Step 5d: Edit hdfs-site.xml File

The properties in the hdfs-site.xml file govern where node metadata, the fsimage file, and the edit log are stored. Configure the file by defining the NameNode and DataNode storage directories. Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single-node setup.

Open the hdfs-site.xml file for editing:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration inside the <configuration> element and, if needed, adjust the NameNode and DataNode directories to your custom locations:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>

If necessary, create the specific directories you defined for the dfs.name.dir and dfs.data.dir values (see the sketch after Step 5c).

Step 5e: Edit mapred-site.xml File

Open the mapred-site.xml file to define the MapReduce values:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name to yarn:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Step 5f: Edit yarn-site.xml File

The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the NodeManager, ResourceManager, containers, and the Application Master. Open the file in a text editor:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

Step 5g: Format HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time:

hdfs namenode -format

A shutdown notification at the end of the output signifies that the NameNode format process has completed.

Step 6: Start Hadoop Cluster

Start the NameNode and DataNode:

start-dfs.sh

Start the YARN resource and node managers:

start-yarn.sh

Verify all the running components:

jps

The system takes a few moments to initiate the necessary nodes. If everything is working as intended, the resulting list of running Java processes contains all the HDFS and YARN daemons.

Step 7: Access Hadoop UI from Browser

Use your preferred browser and navigate to your localhost URL or IP. The default port 9870 gives you access to the Hadoop NameNode UI:

http://localhost:9870

The NameNode user interface provides a comprehensive overview of the entire cluster. The default port 9864 is used to access individual DataNodes directly from your browser:

http://localhost:9864

The YARN Resource Manager is accessible on port 8088:

http://localhost:8088

The Resource Manager is an invaluable tool for monitoring all running processes in your Hadoop cluster.

Conclusion

In this tutorial, you've installed Hadoop in pseudo-distributed (single-node) mode and verified the setup by starting the HDFS and YARN daemons and checking their web interfaces. As a final smoke test, you can run one of the example MapReduce jobs that ship with Hadoop, as sketched below.
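This is a minimal sketch of such a test, assuming Hadoop 3.3.2 installed as above; the examples jar ships with Hadoop, but adjust the version in its file name if you installed a different release.

# Optional smoke test: run the bundled "pi" MapReduce example on YARN
# with 2 map tasks and 10 samples per map.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar pi 2 10

If the job completes and prints an estimated value of Pi, your single-node cluster is working end to end.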