We know Hadoop has HDFS (its file system) and the MapReduce framework, but this blog will try to explain how these two individual components come together to solve business problems.
A common problem faced by organizations across domains is processing BIG DATA.
What is BIG DATA?
BIG DATA is a relative term, i.e. for some start-up company (say CompUS), 2GB of data is BIG DATA because their machine spec reads something like Dual Core.
CompUS (the company in reference) has to buy a bigger machine to process 2GB of data every day.
Imagine CompUS grows and needs to process 4GB/day after 2 months, 10GB/day after 6 months, and 1TB/day after 1 year. To solve this, CompUS buys a new, more powerful machine every time its data increases. That means the company tried to solve the problem with vertical scaling.
Vertical scaling is not good because every time the company buys a new machine, it adds infrastructure and maintenance cost to the company.
Hadoop solves this problem by horizontal scaling. How?
Hadoop runs processing on parallel machines, and every time data increases, companies can add one more machine to the already existing fleet. Hadoop processing doesn't depend on the spec of individual machines, i.e. a newly added machine doesn't have to be powerful or expensive; it can be a commodity machine.
If CompUS uses Hadoop over vertical scaling, it gains the following:
1) Low infrastructure cost.
2) A system that can be easily scaled in the future without affecting existing infrastructure.
Let's dig into user/business cases with Hadoop in perspective.
Financial firms dealing with stock/equity trading generate huge data every day.
The data in hand contains information about various traders, brokers, orders, etc., and the company wants to generate metrics (say, per trader).
Infrastructure in hand:
11 machines running in parallel under Hadoop, controlled by 1 NameNode (the Hadoop master/controlling node); the remaining 10 machines act as slaves and get commands from the master node.
1) Import data into HDFS (Hadoop Distributed File System):
a. HDFS will split the data across all machines in small blocks, i.e. 10GB of data will get divided among 10 machines and each machine gets roughly 1GB of data.
b. HDFS also replicates each block of data across multiple machines to handle machine failure at runtime.
c. At this point, the data is simply divided into smaller blocks, which means information for "trader A" can be on machine 1/2/3… and the same goes for all other traders.
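As a rough sketch, step 1 could be done with Hadoop's FileSystem API; the file paths and class name below are hypothetical placeholders, not part of this use case:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ImportTrades {
        public static void main(String[] args) throws Exception {
            // Standard Hadoop configuration; picks up the NameNode address
            // and the replication factor from core-site.xml / hdfs-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy the local trade file into HDFS. HDFS transparently splits
            // it into blocks, spreads the blocks across the 10 slave machines,
            // and replicates each block for fault tolerance.
            fs.copyFromLocalFile(new Path("/local/trades.csv"),
                                 new Path("/data/trades.csv"));
            fs.close();
        }
    }

The same import can also be triggered from the command line with the hdfs dfs -put command; either way, the block splitting and replication described in a. and b. happen automatically.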
2) Write the MapReduce functions:
a. The Map function will create "key/value" pairs, somewhat like the sketch below.
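A minimal Mapper sketch, assuming each input line is one trade record with the trader's name as the first comma-separated field (the record layout and class name are illustrative assumptions, not from this use case):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (traderName, tradeRecord) pairs so that all records for the same
    // trader end up at one Reducer, no matter which machine stored the block.
    public class TraderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: traderName,broker,orderId,...
            String[] fields = line.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                context.write(new Text(fields[0]), line);
            }
        }
    }

Keying on the trader name is what lets Hadoop regroup the scattered blocks: even though "trader A" records sit on machines 1/2/3…, all pairs with key "trader A" are routed to the same Reducer for metric computation.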