Tuesday, 11 December 2012

Hadoop tutorial, hands-on guide


                         Hadoop Developer Tutorial
Software required
1. VMWare Player 4.0.4
2. Linux version – Ubuntu 12.04
3. PuTTY
4. WinSCP
5. User name: *******
6. Password: *******

This tutorial is written against a VM that can run on any Windows machine using VMware Player; all required software is already installed on the VM.
The VM can be downloaded from:
Alternatively, the tutorial can be followed on any machine with access to a Unix/Linux OS.
  
Lab 1. Preparation for lab
(Not required if not working on VM provided)
1. Unzip the VM image at any location on the Windows machine
2. Open VMware Player and
     File -> Open a Virtual Machine and select
     -> image folder -> virtual machine file
     \ubuntu-server-12.04-amd64\ubuntu-server-12.04-amd64 file
3. Press Ctrl+G and make a note of the IP address; the same can be used to
     log in via PuTTY and WinSCP
4. Open PuTTY -> log in via the IP address ->
     username: ******, password: ********
5. The VM window can now be minimized; PuTTY can be used from here on

Lab 2. Setting Hadoop
1. Untar the Hadoop tar file
   a. Go to lab/software
   b. Untar the Hadoop tar file into the software folder
   c. tar -xvf ../../downloads/hadoop-1.0.3.tar.gz

2. Set up environment variables
  a. Open .bash_profile, i.e. vi .bash_profile
  b. Enter the following
      1. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
      2. export HADOOP_INSTALL=/home/notroot/lab/software/hadoop-1.0.3
      3. export HADOOP_HOME=/home/notroot/lab/software/hadoop-1.0.3
      4. export PATH=$PATH:$HADOOP_INSTALL/bin
          save and exit, i.e. type :wq and press Enter

 c.  Check the installations
java -version

hadoop version
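If either command is not found, the PATH export above likely did not take effect. A quick sanity check can be scripted (a sketch; works in any POSIX shell):

```shell
#!/bin/sh
# Confirm java and hadoop are reachable on PATH before continuing.
for tool in java hadoop; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: not on PATH - re-check .bash_profile and run '. .bash_profile'"
  fi
done
```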

3. Configuring Hadoop/HDFS/MapReduce

cd $HADOOP_HOME/conf
  reference Link: http://hadoop.apache.org/docs/stable/cluster_setup.html

Modify    core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
    <final>true</final>
  </property>
</configuration>


Modify    hdfs-site.xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/notroot/lab/hdfs/namenodep,/home/notroot/lab/hdfs/namenodes</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/notroot/lab/hdfs/datan1,/home/notroot/lab/hdfs/datan2</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/home/notroot/lab/hdfs/checkp</value>
    <final>true</final>
  </property>
</configuration>


Modify    mapred-site.xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/notroot/lab/mapred/local1,/home/notroot/lab/mapred/local2</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/notroot/lab/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>


Create directories under lab/hdfs
1.      mkdir namenodep
2.      mkdir namenodes
3.      mkdir datan1
4.      mkdir datan2
5.      mkdir checkp

Change permission on folders
1.      chmod 755 datan1
2.      chmod 755 datan2

Create directories under lab/mapred
1.      mkdir local1
2.      mkdir local2
3.      mkdir system
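The directory setup above can be scripted in one go. This is a sketch; BASE defaults to the home directory here and is /home/notroot/lab on the tutorial VM, matching the paths in hdfs-site.xml and mapred-site.xml:

```shell
#!/bin/sh
# Create the local directories referenced in hdfs-site.xml and mapred-site.xml.
# On the tutorial VM, run with BASE=/home/notroot/lab.
BASE=${BASE:-$HOME/lab}

mkdir -p "$BASE"/hdfs/namenodep "$BASE"/hdfs/namenodes \
         "$BASE"/hdfs/datan1    "$BASE"/hdfs/datan2 \
         "$BASE"/hdfs/checkp

# The DataNode refuses to start if its data directories are group- or
# world-writable; the expected permission is 755.
chmod 755 "$BASE"/hdfs/datan1 "$BASE"/hdfs/datan2

mkdir -p "$BASE"/mapred/local1 "$BASE"/mapred/local2 "$BASE"/mapred/system
```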


Format the namenode (only once)
Cmd: hadoop namenode -format

Starting HDFS and MapReduce services
1)      cd $HADOOP_HOME/conf
2)      edit hadoop-env.sh and set JAVA_HOME
a.      export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
3)      start HDFS services
a.      cd $HADOOP_HOME/bin
b.      exec: ./start-dfs.sh
4)      start MapReduce services
a.      cd $HADOOP_HOME/bin
b.      exec: ./start-mapred.sh

Run jps and check which processes are running



HDFS services: DataNode, NameNode and SecondaryNameNode
MapReduce services: TaskTracker and JobTracker
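The daemon check can be scripted against the jps output. The sample output below is simulated for illustration (process IDs will differ); on the VM, replace the echo of the sample with a real jps call:

```shell
#!/bin/sh
# Verify the five Hadoop 1.x daemons are up by parsing `jps`-style output.
# `sample` is simulated here; on the VM use: sample=$(jps)
sample='1201 NameNode
1359 DataNode
1502 SecondaryNameNode
1645 JobTracker
1790 TaskTracker
1900 Jps'

for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
  if echo "$sample" | awk '{print $2}' | grep -qx "$daemon"; then
    echo "$daemon: running"
  else
    echo "$daemon: MISSING - check logs under \$HADOOP_HOME/logs"
  fi
done
```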

Lab 3: HDFS Lab
 1) Create input and output directories under HDFS for input and output files
            - hadoop fs -mkdir input
            - hadoop fs -mkdir output

2) Check the directories
            - hadoop fs -ls

3) Copy files from the local system to HDFS and check that they copied
            - hadoop fs -copyFromLocal  /home/notroot/lab/data/txns  input/
              - Check the files: hadoop fs -ls input/

4) Copy from HDFS back to the local system
            - hadoop fs -copyToLocal  input/txns  /home/notroot/lab/data/txndatatemp

Go to datan1 and datan2 and check how the file is split into multiple blocks

Lab 4: MapReduce Word Count
1. First we will focus on writing the Java program using Eclipse
2. Eclipse lab (most of you know this)
    a. Untar the Hadoop tar file locally (say under c:\softwares\)
    b. Create a new Java project (MRLab) with package lab.samples
    c. Add the Hadoop jar files to the project
        i. Jars under c:\softwares\hadoop-1.0.3
       ii. Jars under c:\softwares\hadoop-1.0.3\lib
    d. Time to write the Map and Reduce functions
    e. We will write three classes and package them together in a jar file
      i. Map class
     ii. Reduce class
    iii. Driver class (Hadoop will call the main function of this class)

link for sample code:

    f. Compile the code and create the jar file
       i. Right-click on the project folder -> Export -> JAR file
   g. Transfer the jar file from the local machine to the virtual machine; use WinSCP for this
   h. Copy the jar file to /home/notroot/lab/programs (on the virtual machine)

At this point, we have the MapReduce job (jar file) on the virtual machine, and all the processes (HDFS, JobTracker, TaskTracker, ...) are running there as well.

Run the MapReduce job as
hadoop   jar <jar file name>.jar   DriverClass    <input path>    <output path>
hadoop   jar <jar file name>.jar   lab.samples.WordCount    input/words     output/wcount

The output file can be checked with: hadoop fs -cat output/wcount/part-r-00000
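What the Map and Reduce classes compute can be sketched with plain coreutils: the map step emits one word per line, and the reduce step counts occurrences per word:

```shell
#!/bin/sh
# Word count in shell: map = one word per line, reduce = count per word.
# Output format mirrors the job's part-r-00000 file: word<TAB>count.
printf 'to be or not to be\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
# prints: be 2, not 1, or 1, to 2 (tab-separated, one pair per line)
```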

Lab 6: Hive Configuration

Install MySQL on the virtual machine
1.      sudo apt-get install mysql-server
2.      sudo apt-get install mysql-client-core-5.5

a.      Untar the Hive tar file
-  Go to lab/software
-  Untar the Hive files into the software folder
tar -xvf ../../downloads/hive-0.9.0-bin.tar.gz
-  Browse through the directories and check which subdirectory contains which files

b.      Set up .bash_profile
-  Open the .bash_profile file under the home directory
Enter the following settings
export HIVE_INSTALL=/home/notroot/lab/software/hive-0.9.0-bin
export PATH=$PATH:$HIVE_INSTALL/bin
-  Save and exit .bash_profile
-  Run the following command
. .bash_profile
-  Verify whether the variables are defined by typing export at the command prompt

c.       Check Hive
-  Run hive and verify that it enters the hive shell
hive
-  Check databases and tables
show databases;
show tables;

Lab 7: Hive Programming

Create a database
create database retail;

Use the database
use retail;

Create a table for storing transactional records
create table txnrecords(txnno INT, txndate STRING, custno INT,
amount DOUBLE, category STRING, product STRING, city STRING,
state STRING, spendBy STRING)
row format delimited
fields terminated by ','
stored as textfile;

Load the data into the table
LOAD DATA LOCAL INPATH '/home/notroot/lab/data/txnns'
OVERWRITE INTO TABLE txnrecords;

Describe the metadata (schema) of the table
describe txnrecords;

Count the number of records
select count(*) from txnrecords;

Count total spending by category of product
select category, sum(amount) from txnrecords group by category;

Top 10 customers by spending (note the order by; without it, limit 10 returns 10 arbitrary customers)
select custno, sum(amount) as total from txnrecords group by custno order by total desc limit 10;
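The group-by aggregation Hive performs here can be illustrated in shell with awk over comma-delimited rows. The sample rows below are invented for illustration; real rows follow the txnrecords column order (amount is column 4, category is column 5):

```shell
#!/bin/sh
# Sum spending per category, as in the Hive `group by category` query.
# Sample rows are invented; columns: txnno,txndate,custno,amount,category
printf '%s\n' \
  '1,2012-01-01,1001,10.50,Games' \
  '2,2012-01-02,1002,5.25,Games' \
  '3,2012-01-03,1001,20.00,Fitness' \
| awk -F, '{sum[$5] += $4} END {for (c in sum) printf "%s\t%.2f\n", c, sum[c]}' \
| sort
# prints: Fitness 20.00 and Games 15.75 (tab-separated)
```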








