Chapter1 Big Data Concept

Big Data

What is Big Data

Data - The most valuable Resource
Big Data is a term applied to data sets that cannot be captured, managed, and processed within a tolerable time frame by commonly used software tools.

Characteristics of Big Data Technology

  • Volume
  • Variety
  • Value density
  • Velocity

The Birth of Big Data

The 21st century is an era of great development of data and information.

Cumulative amount of data generated

Transition from IT to DT

DT = Data Technology.

Usage of Big Data

  • Precision advertisement delivery
  • Medical and health systems
  • Personalized education
  • Personal Credit Reporting

The Big Data ecosystem mainly includes the following tools

  • Hadoop
    Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Its creator, Doug Cutting, was inspired by MapReduce and the Google File System (GFS), both developed at Google. The core of Hadoop is the MapReduce programming model and the HDFS distributed file system.
    The MapReduce framework decomposes an application into many parallel computing tasks and runs them over very large data sets across a large number of computing nodes. Map splits the large data set and processes the pieces, and Reduce merges the results of the Map computation.
    HDFS (Hadoop Distributed File System) provides storage services for massive data sets and large files. Files larger than the block size (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x) are divided into blocks and stored on multiple nodes. HDFS offers high throughput and fault tolerance.

  • HBase
    HBase is an open-source key-value database from Apache. It is built on HDFS and provides a highly reliable, high-performance, column-oriented, scalable database system with real-time reads and writes. It sits between NoSQL and RDBMS: data can be retrieved only by row key or by a range of row keys, and only single-row transactions are supported (multi-table joins and other complex operations can be realized through Hive). It is mainly used to store loosely structured data, both unstructured and semi-structured.
    Sparse: null columns take up no storage space, so tables can be designed to be very sparse.

  • Storm
    Apache Storm is a free and open source distributed real-time computing system, which simplifies the reliable processing of stream data.
    Storm features: easy to scale; processing of every message is guaranteed; simple cluster management; fault tolerance (a failed node does not affect the application); low latency.
  • ZooKeeper
    ZooKeeper is a high-performance, open-source coordination service for distributed applications.
    ZooKeeper characteristics: sequential consistency; atomicity; uniformity (single system image); reliability; timeliness.
    ZooKeeper use cases: data publishing and subscription; naming service; distributed notification/coordination; distributed locks; cluster management.
  • Sqoop
    Sqoop is an Apache project that lets users extract data from relational databases into Hadoop for further processing.
  • Mahout
    Mahout is a powerful data mining tool and a set of distributed machine learning algorithms, including implementations of distributed collaborative filtering, classification, and clustering. It greatly increases the data volume these algorithms can handle and improves their processing performance.

  • Hive
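
Returning to the HBase entry in the list above: the two properties it describes (lookups only by row key or row-key range, and null columns costing no storage) can be illustrated with a small in-memory model. This is only a sketch in plain Python and does not use the HBase client API; the class name `SparseTable`, its methods, and the sample row keys and column names are all made up for illustration.

```python
from bisect import bisect_left


class SparseTable:
    """Toy model of an HBase-style table (hypothetical, not the HBase API)."""

    def __init__(self):
        self._rows = {}  # row key -> {column: value}; unset columns are simply absent
        self._keys = []  # row keys kept sorted so range scans are possible

    def put(self, row_key, column, value):
        """Store one cell; a new row key is inserted at its sorted position."""
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup by row key; returns only the columns that were stored."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Range scan over row keys in [start_key, stop_key)."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_left(self._keys, stop_key)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


table = SparseTable()
table.put("user#001", "info:name", "Ann")
table.put("user#001", "info:email", "ann@example.com")
table.put("user#002", "info:name", "Bob")  # no email cell is stored for this row
table.put("video#001", "meta:title", "Intro")

print(table.get("user#002"))
print([key for key, _ in table.scan("user#", "user#~")])
```

Because rows are ordered by row key, choosing a key prefix such as `user#` is what makes the range scan useful; there is no way to query by column value in this model, which mirrors the access restriction described above.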

Introduction to Hadoop Application Status

  • Alibaba
  • Baidu
  • Tencent
  • Qihu 360
  • China Mobile
  • Yahoo
  • Facebook

Introduction to Hadoop Version Development

  • 2004: The original versions of HDFS and MapReduce were implemented by Doug Cutting and Mike Cafarella.

Chapter2 Hadoop Installation

Chapter3 Linux Basics

Chapter4 Hadoop Components

YARN Model Overview

YARN controls the entire cluster and manages the allocation of applications to the underlying computing resources. The ResourceManager assigns resources (compute, memory, bandwidth, etc.) to the underlying NodeManagers (the per-node agents of YARN). The ResourceManager also works with each ApplicationMaster to allocate resources, and with the NodeManagers to start and monitor the underlying applications.


In HDFS, the NameNode provides namespace management, block management, and related services for the entire HDFS file system. All metadata services (the directory structure of the entire file system, which files are in each directory, which blocks make up each file, and which DataNode stores each block) are provided by the NameNode process.
The NameNode keeps the following files (stored on the local Linux file system, not in HDFS):

  • fsimage: the metadata image file.
  • edits: the operation log file.
  • fstime: the time at which the last checkpoint was saved.
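
How an image file and an operation log work together can be sketched as follows: the current namespace is the last saved image plus every logged operation replayed in order, and a checkpoint folds the log back into a new image. This is a simplified illustration, not the NameNode's actual on-disk format; the function names and the two operations (`create`, `delete`) are assumptions made for the example.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edits):
    """Rebuild the current namespace: start from the image, then replay the log."""
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Fold the log into a new image; the log then starts empty again."""
    return sorted(load_namespace(fsimage, edits)), []


fsimage = ["/data", "/data/a.txt"]
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]

print(sorted(load_namespace(fsimage, edits)))
new_image, new_edits = checkpoint(fsimage, edits)
print(new_image, new_edits)
```

Keeping a log next to the image means each metadata change only needs a small append rather than a rewrite of the whole image.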

QJM (Quorum Journal Manager)

QJM was introduced to make reads and writes highly available, which makes true high availability possible for HDFS. QJM uses 2N+1 nodes to store data; a write is considered successful once at least N+1 nodes have written it successfully, which keeps the data highly available. QJM relies on the following auxiliary processes:

  • JournalNode: The storage-side process of QJM, providing log reads and writes, storage, repair, and other services.
  • DFSZKFailoverController: The ZooKeeper-based failover controller, mainly composed of ActiveStandbyElector and HealthMonitor. ActiveStandbyElector interacts with ZooKeeper to acquire a global lock that determines whether a master is in the standby or active state. HealthMonitor monitors the status of each master and triggers a switch according to that status.
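
The 2N+1 / N+1 rule described for QJM above can be written out as a short check. This is a minimal sketch of the majority arithmetic only, not of the JournalNode protocol; the function names are made up for illustration.

```python
def quorum_size(total_nodes):
    """Number of successful writes needed among 2N+1 nodes, i.e. N+1."""
    if total_nodes < 1 or total_nodes % 2 == 0:
        raise ValueError("expected an odd number of nodes (2N+1)")
    return total_nodes // 2 + 1


def write_succeeds(acks):
    """acks holds one boolean per node: True if that node stored the write."""
    return sum(acks) >= quorum_size(len(acks))


print(quorum_size(3))                         # N = 1, so 2 nodes are needed
print(quorum_size(5))                         # N = 2, so 3 nodes are needed
print(write_succeeds([True, True, False]))    # one of three nodes is down
print(write_succeeds([True, False, False]))   # two of three nodes are down
```

With 2N+1 nodes the system tolerates the loss of up to N nodes while still accepting writes.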


NodeManager (NM) is the agent on each node in YARN. It manages a single computing node in the Hadoop cluster, which includes maintaining communication with the ResourceManager, supervising the life cycle of Containers, monitoring the resource utilization (memory, CPU, etc.) of each Container, tracking the health status of the node, and managing logs and the auxiliary services used by different applications.


The DataNode provides storage services for the actual file data. When a DataNode starts, it reports the block information it stores; after the NameNode receives this information, it updates its mapping tables (DataNode and Block).

  • File Block: The most basic storage unit.
  • The default HDFS block size is 64 MB (in Hadoop 1.x; Hadoop 2.x defaults to 128 MB), so a 256 MB file is split into 256/64 = 4 blocks.
  • Unlike ordinary file systems, a file smaller than the block size does not occupy a full block's worth of storage in HDFS.
  • Replication: Multiple copies. The default is three, which can be configured through configuration files.
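
The block arithmetic in the list above can be checked with a few lines of code. The sketch below assumes the 64 MB default block size and a replication factor of three mentioned in these notes; the function names are illustrative and do not come from any Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the length of each block a file of file_size bytes is split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds what is left
    return blocks


def raw_storage(file_size, replication=3):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


print(len(split_into_blocks(256 * MB)))                 # 256 / 64 = 4 blocks
print([b // MB for b in split_into_blocks(10 * MB)])    # one 10 MB block, not 64 MB
print(raw_storage(256 * MB) // MB)                      # 3 replicas of 256 MB
```

The second call shows the point about small files: a 10 MB file yields a single 10 MB block rather than consuming a full 64 MB.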


In YARN, the ResourceManager is responsible for the unified management and allocation of all resources in the cluster. It receives resource reports from each node (NodeManager) and allocates these resources to the applications according to a certain policy.

  • NodeManager follows instructions from ResourceManager to manage available resources on a single node.
  • ApplicationMaster is responsible for negotiating resources with ResourceManager and starting containers in cooperation with NodeManagers.

Structure of ResourceManager:
  • User Service (user interaction module)
    • ClientRMService
    • AdminService
    • RMWebApp
  • Manage NMs (NodeManager management module)
    • NMLivelinessMonitor
    • NodesListManager
    • ResourceTrackerService
  • AM Management module
    • AMLivelinessMonitor
    • ApplicationMasterService
    • ApplicationMasterLauncher
  • Manage Apps (application management module)
    • ApplicationACLManager
    • RMAppManager
    • ContainerAllocationExpirer
  • State Machine Management Module
    • The RM maintains four kinds of state machines: RMApp, RMAppAttempt, RMContainer, and RMNode.
    • RMApp: Maintains the entire life cycle of an application, from start-up to the end of its run.
    • RMAppAttempt: Maintains one run attempt of an application; an application may be started more than once, and each start is a separate attempt.
    • RMContainer: Maintains the life cycle of a Container, from creation to completion.
    • RMNode: Maintains the life cycle of a NodeManager, from start to finish.
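
A state machine such as RMApp can be pictured as a table of allowed transitions. The sketch below is a deliberately simplified illustration: the state and event names are assumptions chosen for readability, and the real RMApp state machine in YARN has more states and events than shown here.

```python
# Simplified, hypothetical transition table: (current state, event) -> next state
TRANSITIONS = {
    ("NEW", "submit"): "SUBMITTED",
    ("SUBMITTED", "accept"): "ACCEPTED",
    ("ACCEPTED", "attempt_started"): "RUNNING",
    ("RUNNING", "attempt_finished"): "FINISHED",
    ("RUNNING", "attempt_failed"): "FAILED",
    ("SUBMITTED", "kill"): "KILLED",
    ("ACCEPTED", "kill"): "KILLED",
    ("RUNNING", "kill"): "KILLED",
}


class AppStateMachine:
    """Tracks one application from start-up to the end of its run."""

    def __init__(self, app_id):
        self.app_id = app_id
        self.state = "NEW"
        self.history = ["NEW"]

    def handle(self, event):
        """Move to the next state, rejecting events not allowed in the current one."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {self.state}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state


app = AppStateMachine("app_0001")
for event in ["submit", "accept", "attempt_started", "attempt_finished"]:
    app.handle(event)
print(app.history)
```

Encoding the transitions as data makes invalid sequences, such as finishing an application that was never accepted, fail loudly instead of silently corrupting the state.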


Each application submitted by a user contains an ApplicationMaster (AM). Its main functions are to negotiate with the RM scheduler to obtain resources, assign the obtained resources to its internal tasks, work with the NMs to start and stop tasks, and monitor the running status of each task.

  • Negotiate with the ResourceManager to obtain resources (Containers).
  • Assign the obtained resources to its internal tasks.
  • Communicate with the NodeManagers to start or stop tasks (Map or Reduce).


A Container is the resource abstraction in YARN. It encapsulates the resources (such as memory and CPU) available on a node.


The resource scheduler allocates resources to each application according to constraints such as capacity and queue limits.
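
Allocation under queue capacity limits can be illustrated with a toy model in which each queue owns a fixed share of cluster memory and requests are served in arrival order until the queue's share is used up. This is only a sketch of the idea, not the behavior of any actual YARN scheduler; the function name, queue names, and numbers are made up for the example.

```python
def allocate(total_memory_mb, queue_capacity, requests):
    """Grant memory per application without exceeding each queue's share.

    queue_capacity maps a queue name to its fraction of the cluster.
    requests is a list of (app, queue, memory_mb) tuples in arrival order.
    """
    remaining = {
        queue: int(total_memory_mb * share)
        for queue, share in queue_capacity.items()
    }
    granted = {}
    for app, queue, memory_mb in requests:
        if queue not in remaining:
            raise KeyError(f"unknown queue: {queue}")
        grant = min(memory_mb, remaining[queue])  # never exceed the queue's share
        remaining[queue] -= grant
        granted[app] = grant
    return granted


grants = allocate(
    total_memory_mb=10240,
    queue_capacity={"prod": 0.75, "dev": 0.25},
    requests=[
        ("etl", "prod", 4096),
        ("report", "prod", 4096),
        ("test", "dev", 1024),
    ],
)
print(grants)
```

Here the `prod` queue owns 7680 MB, so the second `prod` request receives only the 3584 MB left after the first, while the `dev` request is unaffected by the contention in `prod`.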

Hadoop 1.0 and Hadoop 2.0 Differences


MapReduce is a parallel programming model that implements basic parallel computing tasks using Map and Reduce functions.

The two main phases of MapReduce are Map and Reduce:
Mapper: Maps are the individual tasks that transform input records into intermediate records.
Shuffling & Sorting: The process of transferring the intermediate outputs of the map tasks to the reducers that need them. Sorting is done by key, so each reducer receives its input grouped by key.

Reducer: Reduces a set of intermediate values that share a key to a smaller set of values. All of the values with the same key are presented to a single reducer together.
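
The three steps above (map, shuffle and sort, reduce) can be imitated on a single machine with a word count. This is a minimal sketch in plain Python rather than the Hadoop MapReduce API; the function names are made up for illustration, and everything runs in one process instead of across nodes.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map: turn one input record into intermediate (word, 1) records."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: order the pairs by key and group values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Reduce: collapse all values that share a key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence over the input lines."""
    intermediate = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle_and_sort(intermediate))


counts = run_job(["big data big ideas", "data moves fast"])
print(counts)
```

In a real cluster the mapper calls run in parallel on different nodes and the grouped values travel over the network to the reducers, but the data flow is the same as in this sketch.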

Have fun.