Hadoop Introduction
Training (Day 1) 
Introduction
Big-data 
Four parameters: 
–Velocity: Streaming data and large volume data movement. 
–Volume: Scale from terabytes to zettabytes. 
–Variety: Manage the complexity of multiple relational and non-relational data types and schemas. 
–Voracity: Produced data has to be consumed fast before it becomes meaningless.
Not just internet companies 
Big Data shouldn’t be a silo. It must be an integrated part of the enterprise information architecture.
Data >> Information >> Business Value 
Retail–By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues. 
Financial Services–By combining data across various groups and services like financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets. 
Government–By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies. 
Healthcare–Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
Processing Granularity
Hardware scale: 
–Single-core: single-core, single processor; single-core, multi-processor 
–Multi-core: multi-core, single processor; multi-core, multi-processor 
–Cluster: cluster of processors (single- or multi-core) with shared memory; cluster of processors with distributed memory 
–Grid of clusters 
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file-system, hosted by a SAN. 
Related paradigms: embarrassingly parallel processing; MapReduce with a distributed file system; cloud computing. 
Levels of parallelism (data size: small to large): pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level). 
Reference: Bina Ramamurthy, 2011
How to Process Big Data? 
Need to process large datasets (>100 TB) 
–Just reading 100 TB of data can be overwhelming 
–Takes ~11 days to read on a standard computer 
–Takes a day across a 10 Gbit link (a very high-end storage solution) 
–On a single node (@50 MB/s): ~23 days 
–On a 1000-node cluster: ~33 min (a quick sanity check of these figures follows)
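A back-of-the-envelope check of those read-time figures, as a sketch; the 100 TB dataset size, ~50 MB/s per-node read rate, and 1,000-node cluster are the assumptions stated on this slide, not measurements:

```java
// Back-of-the-envelope check of the read-time figures above.
// Assumptions: 100 TB dataset, ~50 MB/s sustained read per node, data spread evenly.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double datasetMB = 100e6;      // 100 TB expressed in MB (1 TB = 1e6 MB)
        double perNodeMBps = 50.0;     // sustained read rate of a single disk/node

        double singleNodeSeconds = datasetMB / perNodeMBps;
        System.out.printf("Single node: %.1f days%n", singleNodeSeconds / 86400);    // ~23 days

        int nodes = 1000;              // 1,000-node cluster reading in parallel
        double clusterSeconds = datasetMB / (perNodeMBps * nodes);
        System.out.printf("1000-node cluster: %.1f minutes%n", clusterSeconds / 60); // ~33 minutes
    }
}
```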
Examples 
•Web logs; 
•RFID; 
•sensor networks; 
•social networks; 
•social data (due to the social data revolution), 
•Internet text and documents; 
•Internet search indexing; 
•call detail records; 
•astronomy, 
•atmospheric science, 
•genomics, 
•biogeochemical, 
•biological, and 
•other complex and/or interdisciplinary scientific research; 
•military surveillance; 
•medical records; 
•photography archives; 
•video archives; and 
•large-scale e-commerce.
Not so easy… 
Moving data from storage cluster to computation cluster is not feasible 
In large clusters 
–Failure is expected, rather than exceptional. 
–In large clusters, computers fail every day 
–Data is corrupted or lost 
–Computations are disrupted 
–The number of nodes in a cluster may not be constant. 
–Nodes can be heterogeneous. 
Very expensive to build reliability into each application 
–A programmer worries about errors, data motion, communication… 
–Traditional debugging and performance tools don’t apply 
Need a common infrastructure and standard set of tools to handle this complexity 
–Efficient, scalable, fault-tolerant and easy to use
Why are Hadoop and MapReduce needed? 
The answer to this question comes from another trend in disk drives: 
–seek time is improving more slowly than transfer rate. 
Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. 
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. 
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
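To make the seek-versus-streaming argument concrete, here is a small sketch with illustrative numbers; the ~10 ms seek time, ~100 MB/s transfer rate, 1 TB dataset and 100 KB record size are assumptions typical of commodity disks, not figures from the slides:

```java
// Illustrative comparison of seek-dominated vs. streaming access.
// Assumed disk: ~10 ms average seek, ~100 MB/s sustained transfer (typical commodity figures).
public class SeekVsStream {
    public static void main(String[] args) {
        double seekMs = 10.0;          // one random seek
        double transferMBps = 100.0;   // sequential transfer rate
        double datasetMB = 1_000_000;  // 1 TB dataset
        double recordKB = 100;         // scattered 100 KB records, each needing its own seek

        // Streaming through the whole 1 TB once:
        double streamSeconds = datasetMB / transferMBps;               // ~10,000 s

        // Seeking to just 1% of the data as scattered records:
        double records = (datasetMB * 0.01) * 1024 / recordKB;         // ~100,000 records
        double seekSeconds = records * (seekMs / 1000.0
                + (recordKB / 1024.0) / transferMBps);                 // ~1,100 s for that 1%

        System.out.printf("Stream 100%% of data: %.0f s%n", streamSeconds);
        System.out.printf("Seek+read 1%% of data: %.0f s%n", seekSeconds);
        // Random access wins for a small fraction of records, but with these assumed
        // figures the cost crosses over around ~10% of the data: beyond that, streaming
        // the whole dataset at the transfer rate is already cheaper -- the slide's point.
    }
}
```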
Why are Hadoop and MapReduce needed? 
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. 
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. 
MapReduce can be seen as a complement to an RDBMS. 
MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
Why are Hadoop and MapReduce needed?
Hadoop distributions 
Apache™ Hadoop™ 
Apache Hadoop-based Services for Windows Azure 
Cloudera’s Distribution Including Apache Hadoop (CDH) 
Hortonworks Data Platform 
IBM InfoSphere BigInsights 
Platform Symphony MapReduce 
MapR Hadoop Distribution 
EMC Greenplum MR (using MapR’s M5 Distribution) 
Zettaset Data Platform 
SGI Hadoop Clusters (uses the Cloudera distribution) 
Grand Logic JobServer 
OceanSync Hadoop Management Software 
Oracle Big Data Appliance (uses the Cloudera distribution)
What’s up with the names? 
When naming software projects, Doug Cutting seems to have been inspired by his family. 
Lucene is his wife’s middle name, and her maternal grandmother’s first name. 
His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. 
Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
Hadoop features 
A distributed framework for processing and storing data, generally on commodity hardware. 
Completely open source. 
Written in Java 
–Runs on Linux, Mac OS X, Windows, and Solaris. 
–Client apps can be written in various languages. 
•Scalable: store and process petabytes; scale by adding hardware 
•Economical: 1000s of commodity machines 
•Efficient: run tasks where the data is located 
•Reliable: data is replicated, failed tasks are rerun 
•Primarily used for batch data processing, not real-time / user-facing applications
Components of Hadoop 
•HDFS (Hadoop Distributed File System) 
–Modeled on GFS 
–Reliable, high-bandwidth file system that can store TBs and PBs of data. 
•Map-Reduce 
–Uses the Map/Reduce metaphor from the Lisp language 
–A distributed processing framework paradigm that processes the data stored on HDFS as key-value pairs. 
Diagram: two clients (Client 1, Client 2) work against the DFS and the processing framework; input data is fed to several Map tasks, passes through Shuffle & Sort, and is aggregated by Reduce tasks into the output data.
HDFS 
•Very Large Distributed File System 
–10K nodes, 100 million files, 10 PB 
–Linearly scalable 
–Supports large files (in GBs or TBs) 
•Economical 
–Uses commodity hardware 
–Nodes fail every day. Failure is expected, rather than exceptional. 
–The number of nodes in a cluster is not constant. 
•Optimized for Batch Processing
HDFS Goals 
•Highly fault-tolerant 
–runs on commodity HW, which can fail frequently 
•High throughput of data access 
–Streaming access to data 
•Large files 
–Typical file is gigabytes to terabytes in size 
–Support for tens of millions of files 
•Simple coherency 
–Write-once-read-many access model
HDFS: Files and Blocks 
•Data Organization 
–Data is organized into files and directories 
–Files are divided into uniform sized large blocks 
–Typically 128MB 
–Blocks are distributed across cluster nodes 
•Fault Tolerance 
–Blocks are replicated (default 3) to handle hardware failure 
–Replication based on Rack-Awareness for performance and fault tolerance 
–Keeps checksums of data for corruption detection and recovery 
–Client reads both checksum and data from DataNode. If checksum fails, it tries other replicas
HDFS: Files and Blocks 
•High Throughput: 
–Client talks to both the NameNode and the DataNodes 
–Data is not sent through the NameNode. 
–Throughput of file system scales nearly linearly with the number of nodes. 
•HDFS exposes block placement so that computation can be migrated to data
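As a sketch of how that block placement is visible programmatically, using the standard org.apache.hadoop.fs client API; the path /user/demo/big.log is an illustrative name, not one from the slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the NameNode where the blocks of a file live, so a scheduler
// (or your own code) can move computation to the data rather than the reverse.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/big.log");  // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```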
HDFS Components 
•NameNode 
–Manages file namespace operations such as opening, creating, and renaming 
–File name to list of blocks + location mapping 
–File metadata 
–Authorization and authentication 
–Collects block reports from DataNodes on block locations 
–Replicates missing blocks 
–Keeps the entire namespace in memory, plus checkpoints & journal 
•DataNode 
–Handles block storage on multiple volumes and data integrity. 
–Clients access the blocks directly from data nodes for read and write 
–Data nodes periodically send block reports to NameNode 
–Block creation, deletion and replication upon instruction from the NameNode.
HDFS Architecture
Diagram: the NameNode (the master) holds the metadata, e.g. name:/users/joeYahoo/myFile - blocks:{1,3} and name:/users/bobYahoo/someData.gzip - blocks:{2,4,5}; the DataNodes (the slaves) store the replicated blocks 1–5; the client obtains metadata from the NameNode and performs block I/O directly against the DataNodes.
Hadoop DFS Interface 
Simple commands: hdfs dfs -ls, -du, -rm, -rmr 
Uploading files: hdfs dfs -copyFromLocal foo mydata/foo 
Downloading files: hdfs dfs -moveToLocal mydata/foo foo, hdfs dfs -cat mydata/foo 
Admin: hdfs dfsadmin -report
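The same operations are available from Java through the FileSystem API. A minimal sketch, assuming the local file foo and HDFS directory mydata/ from the shell examples above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic counterparts of the hdfs dfs commands shown above.
public class HdfsDfsExamples {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hdfs dfs -ls mydata
        for (FileStatus st : fs.listStatus(new Path("mydata"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }

        // hdfs dfs -copyFromLocal foo mydata/foo
        fs.copyFromLocalFile(new Path("foo"), new Path("mydata/foo"));

        // hdfs dfs -copyToLocal mydata/foo foo  (non-destructive counterpart of -moveToLocal)
        fs.copyToLocalFile(new Path("mydata/foo"), new Path("foo"));

        // hdfs dfs -rm mydata/foo
        fs.delete(new Path("mydata/foo"), false);   // false = not recursive

        fs.close();
    }
}
```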
Map-Reduce – Introduction 
•Parallel job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides: 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin based framework for extensibility
Map-Reduce 
•MapReduce programs are executed in two main phases, called 
–mapping and 
–reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
Map-Reduce Program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operate on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: takes the intermediate values associated with the same intermediate key and merges them: reduce(k2, list(v2)) -> list(k3, v3) (a Java sketch of these two functions follows)
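As a concrete instance of the map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(k3, v3) signatures, here is the canonical word-count pair of functions, sketched against the org.apache.hadoop.mapreduce API; the class names are illustrative, and the two classes would normally live in separate files:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): (byte offset, line of text) -> (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit one (word, 1) pair per token
            }
        }
    }
}

// reduce(k2, list(v2)) -> list(k3, v3): (word, [1,1,...]) -> (word, count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // final (word, total) pair
    }
}
```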
Map-Reduce on Hadoop
Hadoop and its elements 
Diagram: input files (File 1 … File N) stored in HDFS are divided into splits (Split 1 … Split M); on each machine (Machine-1 … Machine-M) a Record Reader feeds a split to a Map task (Map 1 … Map M); the emitted (key, value) pairs may be locally aggregated by a Combiner (Combiner 1 … Combiner C) and are then divided into partitions (Partition 1 … Partition P) by the Partitioner; the Reducers (Reducer 1 … Reducer R, on machine-x) consume the partitions and write the output files (File 1 … File O) back to HDFS.
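A sketch of how the elements in that picture are wired together in a driver program; WordCountMapper and WordCountReducer are the illustrative classes sketched earlier, and HashPartitioner is Hadoop's default partitioner, set explicitly here only to make the diagram's partitioning step visible:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Driver: maps the diagram's elements (splits -> mappers -> combiner -> partitioner -> reducers)
// onto the Job API. Input and output HDFS paths come from the command line.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);        // one Map task per input split
        job.setCombinerClass(WordCountReducer.class);     // local pre-aggregation on the map side
        job.setPartitionerClass(HashPartitioner.class);   // decides which reducer gets each key
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(4);                         // R reducers -> R output files in HDFS

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```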
Hadoop Eco-system 
•Hadoop Common: The common utilities that support the other Hadoop subprojects. 
•Hadoop Distributed File System (HDFS™): A distributed file system that provides high- throughput access to application data. 
•Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. 
•Other Hadoop-related projects at Apache include: 
–Avro™: A data serialization system. 
–Cassandra™: A scalable multi-master database with no single points of failure. 
–Chukwa™: A data collection system for managing large distributed systems. 
–HBase™: A scalable, distributed database that supports structured data storage for large tables. 
–Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. 
–Mahout™: A scalable machine learning and data mining library. 
–Pig™: A high-level data-flow language and execution framework for parallel computation. 
–ZooKeeper™: A high-performance coordination service for distributed applications.
Exercise –task 
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last one month of data is accessed most frequently, some analytics algorithms build models using historical data as well. 
•Task: 
–Provide an architecture for such a system to meet the following goals 
–Fast 
–Available 
–Fair 
–Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) on 3 months' worth of this data set. 
•Group / individual presentation
End of session 
Day 1: Introduction
