Install Hadoop 2.5 on Ubuntu 14.04, along with RHadoop.





Install Hadoop/YARN 2.5.0 on Ubuntu (VirtualBox)

This article describes the step-by-step approach to install Hadoop/YARN 2.5.0 on Ubuntu and its derivatives (LinuxMint, Kubuntu etc.). I personally use a virtual machine for testing out different big-data software (Hadoop, Spark, Hive, etc.) and I've used LinuxMint 16 on VirtualBox 4.3.10 for the purpose of this blog post.

Install JDK 7

$ sudo apt-get install openjdk-7-jdk
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ cd /usr/lib/jvm
$ ln -s java-7-openjdk-amd64 jdk

Install OpenSSH Server

$ sudo apt-get install openssh-server
$ ssh-keygen -t rsa
Hit enter at all prompts, i.e. accept all defaults, including an empty passphrase. Next, to prevent password prompts, append this machine's public key to the authorized keys file (Hadoop services use SSH to talk among themselves, even on a single-node cluster).
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
SSH to localhost to test the SSH server and to add localhost to the list of known hosts. The next time you ssh to localhost, there will be no prompts.
$ ssh localhost

Download Hadoop

Note 1: You should use a mirror URL from the official downloads page.
Note 2: parambirs is my user name as well as my group name on the Ubuntu machine. Please replace it with your own user/group name.
$ cd Downloads/
$ wget www.webhostingreviewjam.com/mirror/apache/hadoop/common/stable/hadoop-2.5.0.tar.gz
$ tar zxvf hadoop-2.5.0.tar.gz
$ sudo mv hadoop-2.5.0 /usr/local/
$ cd /usr/local
$ sudo mv hadoop-2.5.0 hadoop
$ sudo chown -R parambirs:parambirs hadoop

Environment Configuration

$ cd ~
$ vim .bashrc
Add the following to the end of the .bashrc file:
#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
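If you script the setup, the exports above can be appended to .bashrc idempotently by guarding on the marker comment, so rerunning the script does not duplicate them. A minimal sketch (same paths as above; the BASHRC override is only for testing):

```shell
# Append the Hadoop variables to ~/.bashrc only if the marker
# line "#Hadoop variables" is not already present (idempotent).
BASHRC="${BASHRC:-$HOME/.bashrc}"
if ! grep -q '^#Hadoop variables' "$BASHRC" 2>/dev/null; then
  cat >> "$BASHRC" <<'EOF'
#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
EOF
fi
```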
Modify hadoop-env.sh
$ cd /usr/local/hadoop/etc/hadoop
$ vim hadoop-env.sh
 
#modify JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk/
Verify hadoop installation
$ source ~/.bashrc   # refresh the shell to pick up the configuration changes
$ hadoop version
Hadoop 2.5.0
...
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.5.0.jar

Hadoop Configuration

$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode

core-site.xml

$ cd /usr/local/hadoop/etc/hadoop/
$ vim core-site.xml
Add the following between the <configuration></configuration> elements (fs.default.name is the deprecated but still-working alias of fs.defaultFS in Hadoop 2.x):
<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:9000</value>
</property>

yarn-site.xml

$ vim yarn-site.xml
Add the following between the <configuration></configuration> elements
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml
$ vim mapred-site.xml
Add the following between the <configuration></configuration> elements
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>

hdfs-site.xml

$ vim hdfs-site.xml
Add the following between the <configuration></configuration> elements. Replace /home/parambirs with your own home directory.
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/parambirs/mydata/hdfs/namenode</value>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/parambirs/mydata/hdfs/datanode</value>
</property>
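Since hdfs-site.xml is the only file here that embeds your home directory, it is the easiest one to get wrong. A sketch that generates it with the current user's home substituted (TARGET defaults to a scratch location for illustration; on a real install point it at /usr/local/hadoop/etc/hadoop/hdfs-site.xml):

```shell
# Write hdfs-site.xml with the current user's home directory filled in,
# and create the namenode/datanode directories it references.
TARGET="${TARGET:-$HOME/hdfs-site.xml}"
mkdir -p "$HOME/mydata/hdfs/namenode" "$HOME/mydata/hdfs/datanode"
cat > "$TARGET" <<EOF
<?xml version="1.0"?>
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
 <property>
  <name>dfs.namenode.name.dir</name>
  <value>file:$HOME/mydata/hdfs/namenode</value>
 </property>
 <property>
  <name>dfs.datanode.data.dir</name>
  <value>file:$HOME/mydata/hdfs/datanode</value>
 </property>
</configuration>
EOF
```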

Running Hadoop

Format the namenode
$ hdfs namenode -format
Start hadoop
$ start-dfs.sh
$ start-yarn.sh
Verify all services are running
$ jps
5037 SecondaryNameNode
4690 NameNode
5166 ResourceManager
4777 DataNode
5261 NodeManager
5293 Jps
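The check above can also be scripted: compare the jps output against the set of daemons a single-node cluster should run. The helper below is a sketch; on a live machine call it as check_daemons "$(jps)".

```shell
# check_daemons: report which expected Hadoop daemons are missing from
# the given jps output; prints "all daemons running" when none are.
check_daemons() {
  missing=""
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    printf '%s\n' "$1" | grep -q "$d" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "all daemons running"
  else
    echo "missing:$missing"
  fi
}

# Demo with the sample output shown above (use "$(jps)" on a real cluster):
check_daemons "5037 SecondaryNameNode
4690 NameNode
5166 ResourceManager
4777 DataNode
5261 NodeManager"
```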
Check the web interfaces of the different services: the NameNode UI at http://localhost:50070 and the ResourceManager UI at http://localhost:8088.

Run a hadoop example MR job

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar pi 2 5
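The pi example estimates pi by sampling points in the unit square, here across 2 map tasks with 5 samples each (hence a rough result; the real job uses a quasi-random sequence rather than rand()). The underlying idea, shrunk to a local awk one-liner for illustration:

```shell
# Monte Carlo estimate of pi: the fraction of random points in the unit
# square that land inside the quarter circle, multiplied by 4.
awk 'BEGIN {
  srand(42); n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1) inside++
  }
  printf "pi ~= %.2f\n", 4 * inside / n
}'
```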
 
Prerequisites
RStudio Server v0.98 requires Debian version 6 (or higher) or Ubuntu version 10.04 (or higher).
Installing R
RStudio requires a previous installation of R version 2.11.1 or higher. To install the latest version of R, first add the CRAN repository to your system (see the CRAN instructions for Debian/Ubuntu), then install R using the following command:
$ sudo apt-get install r-base

NOTE: if you do not add the CRAN Debian or Ubuntu repository as described above this command will install the version of R corresponding to your current system version. Since this version of R may be a year or two old it is strongly recommended that you add the CRAN repositories so you can run the most up to date version of R.
Additional Prerequisites for Debian 7
If you are running Debian 7 (as opposed to an earlier version of Debian or Ubuntu) then you’ll also need to install OpenSSL version 0.9.8 prior to installing RStudio Server. See these instructions for installing OpenSSL v0.9.8 on Debian 7 for more details.
Download and Install
To download and install RStudio Server open a terminal window and execute the following commands (corresponding to the 32 or 64-bit version as appropriate). Note that the gdebi-core package is installed first so that gdebi can be used to install RStudio and all of its dependencies. Also note that the libapparmor1 dependency is required for Ubuntu only, not Debian.

32bit
Size:  39.7 MB
MD5: 9e2a546071ec463d1521f89115ceb3f8
Version:  0.98.1028
Released:  2014-08-14
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.1028-i386.deb
$ sudo gdebi rstudio-server-0.98.1028-i386.deb
64bit
Size:  41.2 MB
MD5: c296b687cb8eced037d6eaa169521d99
Version:  0.98.1028
Released:  2014-08-14
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.1028-amd64.deb
$ sudo gdebi rstudio-server-0.98.1028-amd64.deb

NOTE: If you are running Debian 7 please ensure that you’ve installed OpenSSL v0.9.8 prior to installing RStudio Server.

From Browser:
127.0.0.1:8787
install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava"))
Download the latest rmr2*.tar.gz and rhdfs*.tar.gz from the RHadoop downloads page.
install.packages(c("dplyr","R.methodsS3"))
install.packages(c("Hmisc"))
install.packages(c("caTools"))
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")

Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.0.jar")
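Sys.setenv only lasts for the current R session. To make these variables stick across sessions, they can go in ~/.Renviron, which R reads at startup; a sketch (jar path as above, adjust to your Hadoop version):

```shell
# Persist the RHadoop environment variables in ~/.Renviron so every
# new R session (including RStudio) picks them up automatically.
cat >> "$HOME/.Renviron" <<'EOF'
HADOOP_CMD=/usr/local/hadoop/bin/hadoop
HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.0.jar
EOF
```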
Install rmr2, rhdfs, plyrmr and rhbase in RStudio via Tools -> Install Packages, choose "Package Archive File", then browse to the downloaded archives and install them.
To test that the RHadoop setup is working:
> library(rmr2)
> library(rhdfs)
> hdfs.init()
> rmr.options(backend="local")
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)

$val
         v   
  [1,]   1   2
  [2,]   2   4
  [3,]   3   6
  [4,]   4   8
  [5,]   5  10
.....
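Under the hood, rmr2 runs the map function through Hadoop Streaming: mappers read records from stdin and emit key/value lines on stdout. The same doubling map expressed as a plain streaming-style pipeline (a local sketch, not rmr2 itself):

```shell
# Streaming-style equivalent of map = function(k, v) cbind(v, 2*v):
# read one integer per line, emit "v<TAB>2v"; show the first few lines.
seq 1 100 | awk '{ print $1 "\t" $1 * 2 }' | head -n 3
```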

 
