Bigdata and Hadoop Bootcamp

Dipanjan Mukherjee

What Is Bigdata
Big data is a collection of datasets so large that
they cannot be processed using traditional
computing techniques. Big data is not merely
data; it has become a complete subject that
involves various tools, techniques, and
frameworks.

Big Data Perspective And Volume
▪ The big data growth we’ve been witnessing is only natural: we constantly generate
data. On Google alone, we submit 40,000 search queries per second, which amounts
to roughly 1.2 trillion searches a year.
▪ Each minute, 300 hours of new video are uploaded to YouTube, which is why its
servers hold more than 1 billion gigabytes (1 exabyte) of data.
▪ People share more than 100 terabytes of data on Facebook daily. Every minute,
users send 31 million messages and view 2.7 million videos.
▪ Big data usage statistics indicate that people take about 80% of photos on their
smartphones. With over 1.4 billion devices expected to ship worldwide this year
alone, this percentage can only be expected to grow.
▪ Smart devices (for example, fitness trackers, sensors, Amazon Echo) produce 5
quintillion bytes of data daily. In 5 years, the number of these gadgets is
expected to exceed 50 billion.
▪ Big data statistics indicate that more than 30% of data will be uploaded to the
cloud by next year.
▪ Huge companies like Google use distributed computing to meet their customers’
needs. About 1,000 computers are involved in answering every query.
▪ The market for Hadoop, the most popular open-source framework for distributed
computing, has a compound annual growth rate of 58% and is projected to surpass
$1 billion by 2020.

What Is Hadoop
Hadoop is a free, open-source, Java-based
programming framework that supports the
processing of large data sets in a distributed
computing environment. It is an Apache project
sponsored by the Apache Software Foundation.

Hadoop Components
❖ MapReduce
❖ HDFS
❖ Hadoop Common
❖ YARN (Yet Another Resource Negotiator)
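
MapReduce, the first component listed above, processes data in three phases: map, shuffle, and reduce. The following is a minimal single-machine sketch of that model in plain Python, using word counting as the job. It does not use the Hadoop API; the function names and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input
lines = ["big data needs big tools", "hadoop stores big data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves data across the network; here everything runs in one process to keep the idea visible.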

HDFS Architecture
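[Figure: HDFS architecture. The NameNode holds the metadata (name, replicas, …), for example /home/foo/data, 3, …. Clients send metadata ops to the NameNode and read from and write to DataNodes, which are spread across Rack 1 and Rack 2. Block ops and replication take place between the DataNodes.]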

HDFS Benefits
HDFS is distributed across hundreds or even thousands of servers, with each node storing a part of the file system.
Since the storage runs on commodity hardware, nodes are more likely to fail, and data can be lost with them.
HDFS overcomes that problem by storing multiple copies (replicas) of the same data.
HDFS works quite well for data loads that arrive in a streaming format, so it is better suited to batch processing
applications than to interactive use. It is important to note that HDFS is designed for high throughput rather
than low latency.
HDFS works especially well for large datasets, where a typical dataset is anywhere between gigabytes and
terabytes in size. It provides high aggregate data bandwidth and can scale to hundreds of nodes in a single
cluster, so a single instance can support millions of files.
Data coherency is extremely important. Files in HDFS typically follow a write-once, read-many-times pattern,
so the data stays the same after it is written and can be accessed multiple times without any data coherency
issues.
HDFS works on the assumption that moving computation is much easier, faster, and cheaper than moving
very large volumes of data, which can create network congestion and lead to longer overall turnaround times.
HDFS lets applications process data at the place where it is located.
HDFS is highly cost-effective in the sense that it runs on different types of commodity hardware without
compatibility issues. Hence, it is very well suited to taking advantage of cheap and readily available
commodity hardware components.
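
To make the replication idea concrete, here is a minimal sketch that splits a file into blocks, places several replicas of each block on different nodes, and checks whether every block is still readable after one node fails. The 128 MB block size, the replication factor of 3, the node names, and the round-robin placement are assumptions for illustration; real HDFS uses a rack-aware placement policy that this sketch does not model.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement also takes racks into account.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical five-node cluster storing a 1000 MB file
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_replicas(split_into_blocks(1000), nodes)
print(len(placement), "blocks")
print("readable after dn2 fails:", readable_after_failure(placement, "dn2"))
```

With a single replica per block, the same failure would make some blocks unreadable, which is the data-loss risk that replication is meant to remove.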

HDFS Limitations
❑ Issue with small files
❑ Slow processing speed
❑ Latency
❑ Security
❑ No real-time data processing
❑ Support for batch processing only
❑ Uncertainty
❑ Lengthy code
❑ No caching
❑ No ease of use
❑ No delta iteration

Apache Spark Overview
Spark is a cluster computing framework for large-scale data processing.
Spark offers a set of libraries in several languages (Java, Scala, Python, R)
for its unified computing engine. What does this definition actually mean?
▪ Unified: with Spark, there is no need to piece together an application
out of multiple APIs or systems. Spark provides enough built-in APIs to
get the job done.
▪ Computing engine: Spark loads data from various file systems and runs
computations on it, but does not store any data itself permanently.
Spark keeps working data in memory wherever possible, which makes it
much faster than disk-based processing for many workloads.
▪ Libraries: Spark includes a series of libraries built for data science
tasks: SQL (Spark SQL), machine learning (MLlib), stream processing
(Spark Streaming and Structured Streaming), and graph analytics
(GraphX).

Apache Hadoop MR VS Apache Spark
Spark vs Hadoop MapReduce
Factor          | Spark                                            | Hadoop MapReduce
Speed           | Up to 100x faster than MapReduce                 | Faster than traditional systems
Written in      | Scala                                            | Java
Data processing | Batch, real-time, iterative, interactive, graph  | Batch processing
Ease of use     | More compact and easier to use than Hadoop       | Complex and lengthy
Caching         | Caches data in memory, which improves performance | Does not support caching of data
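
The caching row matters most for iterative jobs that read the same data many times. The sketch below illustrates the idea in plain Python by counting how often the data has to be loaded with and without an in-memory cache. It is not Spark's API; the `Dataset` class, `slow_loader`, and `iterate` are hypothetical names invented for this example.

```python
class Dataset:
    """Hypothetical dataset wrapper that can keep loaded data in memory."""

    def __init__(self, loader, cache=False):
        self._loader = loader
        self._cache_enabled = cache
        self._cached = None
        self.loads = 0  # how many times the loader was actually called

    def read(self):
        """Return the data, reusing the in-memory copy when caching is on."""
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._cache_enabled:
            self._cached = data
        return data


def slow_loader():
    """Stand-in for an expensive read from disk or over the network."""
    return list(range(1, 6))


def iterate(dataset, passes):
    """Simulate an iterative job that reads the full dataset on each pass."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.read())
    return total


uncached = Dataset(slow_loader)
cached = Dataset(slow_loader, cache=True)
print(iterate(uncached, 5), "computed with", uncached.loads, "loads")
print(iterate(cached, 5), "computed with", cached.loads, "loads")
```

Both runs compute the same result, but the cached dataset is loaded only once, which is why in-memory caching helps the iterative and interactive workloads listed in the table.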

Bigdata on cloud
Aspect            | Cloud computing | Big data
Definition        | Provides resources (storage, computing, databases, monitoring tools, etc.) on demand | Provides a way to handle huge volumes of data and generate insights
Reference         | Refers to internet services ranging from SaaS and PaaS to IaaS | Refers to data, which can be structured, semi-structured, or unstructured
How they are used | Uses a wide network of cloud servers over the internet to analyze data and information | Can be deployed on-premises or in the cloud to discover hidden patterns and generate actionable insights
Formats           | Cloud computing is a new paradigm for delivering computing resources | Consists of all kinds of data in many different formats
Used for          | Used to store data and information on remote servers | Used to describe huge volumes of data and information
