Analytics on Big Data

Introduction to Hadoop Core / File Systems

2013-02-26T10:48:00.001-08:00

The Apache Hadoop framework forms the kernel of an operating system for big data: it lets users share cluster resources and manages permissions and allocations.

MapReduce Layer:

The Task Tracker on each node spawns a separate Java Virtual Machine (JVM) process to run the work, so that the Task Tracker itself does not fail if the running job crashes its JVM.
The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible.
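
The "keep the work close to the data" idea can be illustrated with a minimal sketch. This is not Hadoop's scheduler: the function `assign_task` and the host names are hypothetical, and the sketch only distinguishes "node holds the data" from "node does not", ignoring rack-level locality.

```python
def assign_task(block_hosts, free_trackers):
    """Pick a tracker for a task, preferring one that holds the input block.

    block_hosts: set of hosts storing a replica of the task's input block.
    free_trackers: list of hosts that currently have a free task slot.
    Returns (host, is_local), or (None, False) if no tracker is free.
    """
    for host in free_trackers:
        if host in block_hosts:
            return host, True           # data-local: input is read from local disk
    if free_trackers:
        return free_trackers[0], False  # fall back to any free tracker
    return None, False


# Hypothetical cluster state, for illustration only.
local_choice = assign_task({"node2", "node5"}, ["node1", "node2", "node3"])
remote_choice = assign_task({"node7"}, ["node1", "node3"])
print(local_choice, remote_choice)
```

In the first call a free tracker holds a replica, so the task runs there; in the second call none does, so the input would have to travel over the network.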

Crux of MapReduce Architecture:

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
A Reducer reduces a set of intermediate values that share a key to a smaller set of values.
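
The map and reduce roles described above can be sketched in plain Python, without Hadoop, using the classic word count. The names `map_words`, `reduce_counts`, and `run_job` are made up for this illustration; `run_job` stands in, very roughly, for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_words(offset, line):
    """Map: one (offset, line) input pair -> zero or more (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce: all values sharing a key -> a smaller set (here, a single sum)."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    # Group intermediate values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Input keys are byte offsets (int), intermediate keys are words (str):
# the intermediate type differs from the input type.
lines = [(0, "big data on Hadoop"), (19, ""), (20, "Big clusters big data")]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts)
```

The empty line shows an input pair that maps to zero output pairs, while each non-empty line maps to several.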

HDFS Layer:

The NameNode is the single point for storage and management of metadata. This can become a bottleneck when supporting a huge number of files, especially a large number of small files.
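
A back-of-the-envelope calculation shows why small files hurt. The sketch below assumes roughly 150 bytes of NameNode memory per file or block object and a 64 MiB block size; both numbers are assumptions chosen for illustration, not measurements, and `namenode_bytes` is a hypothetical helper.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not a measured value
BLOCK_SIZE = 64 * 1024 ** 2   # assumed 64 MiB block size


def namenode_bytes(num_files, file_size):
    """Rough metadata estimate: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


one_tib = 1024 ** 4
# The same 1 TiB of data, stored as 1 MiB files versus 1 GiB files.
small = namenode_bytes(one_tib // 1024 ** 2, 1024 ** 2)
large = namenode_bytes(one_tib // 1024 ** 3, 1024 ** 3)
print(small, large, round(small / large))
```

Under these assumptions the same terabyte costs the NameNode about 300 MiB of metadata as one-megabyte files but only about 2.5 MiB as one-gigabyte files.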
DataNodes talk to each other to rebalance data, move copies around, and keep the replication of data high.
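
To make "keeping the replication high" concrete, here is a simplified sketch of planning new copies after a node is lost. It is not HDFS's actual placement logic: `plan_re_replication`, the block IDs, and the node names are hypothetical, the target of three replicas is taken as an assumed setting, and new hosts are picked alphabetically rather than by load or rack.

```python
def plan_re_replication(block_locations, live_nodes, target=3):
    """Return {block: [new_hosts]} listing the copies needed to reach `target`."""
    plan = {}
    for block, hosts in sorted(block_locations.items()):
        alive = [h for h in hosts if h in live_nodes]
        missing = target - len(alive)
        if missing <= 0 or not alive:
            continue  # fully replicated, or no surviving copy to read from
        candidates = [n for n in sorted(live_nodes) if n not in alive]
        plan[block] = candidates[:missing]
    return plan


# Hypothetical state: node n3 has died, so both blocks are one replica short.
locations = {"blk_1": ["n1", "n2", "n3"], "blk_2": ["n2", "n3", "n4"]}
plan = plan_re_replication(locations, live_nodes={"n1", "n2", "n4", "n5"})
print(plan)
```

Each under-replicated block gets one new host that does not already hold a copy; the surviving replicas are the source of the data.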

Hadoop Architecture

2013-02-26T10:39:00.001-08:00

Hadoop is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.
Hadoop distributes a job, split into tasks, across a cluster of machines. The machines share a distributed file system (HDFS) whose data is stored on the nodes' own disks rather than on a SAN.

Introduction to Big Data

2013-02-26T10:36:00.001-08:00

As of 2012, the total volume of data stored electronically was about 2.7 zettabytes, according to Forbes.com.

(A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.)
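
The unit conversions can be checked with a few lines of integer arithmetic. The variable names are chosen for this illustration, and decimal (power-of-ten) units are assumed throughout.

```python
ZETTABYTE = 10 ** 21  # bytes, decimal units

units = {"exabytes": 10 ** 18, "petabytes": 10 ** 15, "terabytes": 10 ** 12}
per_zettabyte = {name: ZETTABYTE // size for name, size in units.items()}

# 2.7 zettabytes expressed in terabytes, using integers to avoid float rounding.
stored_2012_tb = 27 * ZETTABYTE // 10 // units["terabytes"]
print(per_zettabyte, stored_2012_tb)
```

So the 2.7 zettabytes quoted above correspond to 2.7 billion terabytes.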

Statistical Facts on Big Data:

$300 billion: potential annual value to US health care.
$250 billion: potential value to Europe's public sector administration.
$600 billion: potential annual consumer surplus from using personal location data globally.
140,000-190,000: additional deep analytical talent positions open for data-savvy managers in the USA during 2011.