Darko Marjanović, CEO @ Things Solver
darko@thingsolver.com
How to use Big Data and Data
Lake concept in business using
Hadoop and Spark
About me
• CEO and Co-Founder @ Things Solver
• Co-Founder @ Data Science Serbia
• Big Data, Machine Learning
• Hadoop, Spark, Python
Agenda
• Big Data
• Data Lake
• Data Lake vs Data Warehouse
• Hadoop, Spark, Hive
• Big Data application and Lambda architecture
• Examples
• Data Science Lab
Big Data
• Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
• Anything that Won't Fit in Excel :)
Big Data
Volume
The quantity of generated and stored data.
Variety
The type and nature of the data.
Velocity
In this context, the speed at which the data is generated and
processed to meet the demands and challenges that lie in the
path of growth and development.
Veracity
The quality of captured data can vary greatly, affecting
accurate analysis.
Big Data
• Email, HTML, Click Stream...
• Facebook, Twitter...
• Video, Pictures…
• Logs...
• Sensor Data...
• Relational Databases...
Data Lake
“A data lake is a storage repository that holds a vast amount of raw
data in its native format, including structured, semi-structured, and
unstructured data. The data structure and requirements are not
defined until the data is needed.”
Data Lake - James Dixon, Pentaho chief technology officer
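The last sentence of the definition is often described as schema-on-read. A minimal Python sketch of the idea, assuming a hypothetical list of raw JSON lines standing in for files in the lake; the field names and the `read_with_schema` helper are illustrative and not part of any library:

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on write).
RAW_LAKE = [
    '{"user": "ana", "amount": "12.50", "channel": "web"}',
    '{"user": "marko", "amount": "7", "extra": {"device": "ios"}}',
    '{"user": "ivan"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema only at read time: pick fields, cast types, fill gaps with None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Two consumers impose two different structures on the same raw data.
billing_view = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
marketing_view = read_with_schema(RAW_LAKE, {"user": str, "channel": str})
```

Nothing is discarded or reshaped at write time, so a consumer that appears later can still define its own view over the same raw records.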
Data Lake
• Retain All Data
• Support All Data types
• Support All Users
• Adapt Easily to Changes
• Provide Faster Insights
Data Lake Cons
• Data storage alone has no impact on the effectiveness of business
decisions
• Inexpensive storage is not infinite or limitless
Data Warehouse
Wikipedia defines data warehouses as:
“…central repositories of integrated data from one or more disparate
sources. They store current and historical data and are used for
creating trending reports for senior management reporting such as
annual and quarterly comparisons.”
Data Warehouse
Problems:
• New Data Sources, Data Types
• Real Time Reports
• Streaming Data
• Software Price
• Infrastructure Price
Data Lake vs Data Warehouse
• ETL
  • ETL and BI projects are by nature investments in evolving processes; they have no distinct end point and remain an ongoing process of improvement and re-targeting.
  • ETL works backwards from the output, so only relevant data is extracted and processed.
  • Future requirements that need additional data cannot be foreseen and defined in the original design.
• ELT
  • Isolating loading from transforming lets projects be broken down into specific chunks that are more isolated and more manageable.
  • ELT is an emergent approach to data warehouse design and development that requires a change in mentality and design approach compared to traditional ETL.
  • Future requirements can easily be incorporated into the warehouse structure because all data is pulled into the Data Lake in its raw format.
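The difference in ordering can be sketched in a few lines of Python. This is a toy illustration only: the `SOURCE` rows, the field names, and the dictionaries standing in for a warehouse and a lake are all assumptions made for the example.

```python
# Hypothetical source rows; field names are illustrative only.
SOURCE = [
    {"order_id": 1, "country": "RS", "total": 40.0, "coupon": "SPRING"},
    {"order_id": 2, "country": "DE", "total": 15.5, "coupon": None},
    {"order_id": 3, "country": "RS", "total": 9.5, "coupon": None},
]

def revenue_by_country(rows):
    """Transform step: aggregate order totals per country."""
    result = {}
    for row in rows:
        result[row["country"]] = result.get(row["country"], 0.0) + row["total"]
    return result

# ETL: transform first, then load only the shaped output. The raw fields are not kept.
etl_warehouse = {"revenue_by_country": revenue_by_country(SOURCE)}

# ELT: load the raw rows first, transform later inside the store.
elt_lake = {"raw_orders": [dict(row) for row in SOURCE]}
elt_lake["revenue_by_country"] = revenue_by_country(elt_lake["raw_orders"])

# A requirement that appears later (coupon usage) needs no new extraction under ELT.
elt_lake["coupon_orders"] = [
    row["order_id"] for row in elt_lake["raw_orders"] if row["coupon"]
]
```

In the ETL variant the coupon question could only be answered by going back to the source system, because the `coupon` field never reached the warehouse.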
Hadoop
• The Apache Hadoop software library is a framework that allows the
distributed processing of large data sets across clusters of computers
using simple programming models.
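MapReduce is the best-known of these programming models. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process for a word count. It is not the Hadoop API; the function names are chosen for the example, and on a real cluster each input split would be processed on a different node.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document plays the role of one input split.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data lake", "data lake vs data warehouse"])
```

The developer writes only the map and reduce functions; distribution, grouping, and retries are left to the framework, which is what makes the model simple to program against.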
Hadoop
• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • All data types.
  • Analytical queries.
  • Integration with traditional systems.
• Cons
  • Implementation.
  • MapReduce ease of use.
  • Intense calculations with little data.
  • In memory.
  • Real time analytics.
Apache Spark
• Apache Spark is a fast and general engine for big data processing,
with built-in modules for streaming, SQL, machine learning and graph
processing.
Apache Spark
• Pros
  • Up to 100x faster than MapReduce for in-memory workloads.
  • Ease of use.
  • Streaming, MLlib, GraphX and SQL.
  • Pricing and licensing.
  • In memory.
  • Integration with Hadoop.
  • Machine learning.
• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC).
  • Configuration.
Apache Spark
• Resilient Distributed Datasets (RDDs) are the basic units of abstraction in Spark.
• An RDD is an immutable, partitioned set of objects.
• RDDs are lazily evaluated.
• RDDs are fault-tolerant. Lost data can be recovered using the lineage graph of RDDs (by rerunning operations on the input data).
• RDD operations:
  • Transformations: lazily evaluated (executed only when an action is called, which improves pipelining)
    • map, filter, groupByKey, join, ...
  • Actions: run immediately (to return a value to the application or to storage)
    • count, collect, reduce, save, ...
• Don’t forget to cache() RDDs that are reused.
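The interplay of lazy transformations, actions, lineage, and caching can be modeled in plain Python. `ToyRDD` below is a hypothetical single-machine stand-in written for this illustration; it is not Spark's RDD class and it ignores partitioning entirely.

```python
class ToyRDD:
    """Toy model of an RDD: immutable, lazy, and recomputed from its lineage."""

    def __init__(self, source, lineage=()):
        self._source = source        # input data, never modified
        self._lineage = lineage      # recorded transformations, not yet executed
        self._keep = False           # set by cache()
        self._cached = None
        self.computations = 0        # how many times the lineage was replayed

    # Transformations: return a new ToyRDD and run nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def cache(self):
        self._keep = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._keep:
            self._cached = data
        return data

    # Actions: trigger the computation and return a value.
    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


even_squares = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
runs_before_action = even_squares.computations   # still 0: nothing has executed
even_squares.cache()
n = even_squares.count()        # first action replays the lineage
items = even_squares.collect()  # second action is served from the cache
runs_after_actions = even_squares.computations
```

Without the `cache()` call the second action would replay the whole lineage again, which is the cost the last bullet warns about.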
Apache Spark
• DataFrames are a common abstraction across languages; they represent a table, or a two-dimensional array with columns and rows.
• Spark DataFrames are distributed DataFrames. They allow querying structured data using SQL or a DSL (for example in Python or Scala).
• Like RDDs, DataFrames are immutable structures.
• Operations on them are executed in parallel.
• val df = sqlContext.read.json("pathToMyFile.json")
Hive
• Apache Hive is a data warehouse infrastructure for querying, analyzing and managing large datasets residing in distributed storage.
Hive
• Pros
  • Writing ad hoc queries on large volumes of data.
  • Imposing a structure on a variety of data formats.
  • Interactive SQL queries over large datasets residing in Hadoop.
  • SQL-like data access.
  • Accessing Hadoop data from a traditional DWH environment.
• Cons
  • Code efficiency can be lower than in traditional MapReduce.
  • Hive performs poorly on OLTP tasks; it is not designed for them.
Ecosystem
• Collecting Data
  • Kafka, Flume…
• Managing Data
  • Pig, Spark, Hive, Flink, MapReduce
• Resource Manager
  • YARN, Mesos
• Administration
  • Ambari, Bigtop
Big Data Application
Lambda Architecture
• Lambda Architecture is a useful framework to think about designing
big data applications. Nathan Marz designed this
generic architecture addressing common requirements for big data
based on his experience working on distributed data processing
systems at Twitter.
Lambda Architecture
• Data
• Batch Layer
• Serving Layer
• Speed Layer
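How the three layers cooperate can be sketched as a toy page-view counter in Python. The `LambdaPipeline` class and its method names are assumptions made for this illustration; a real deployment would use separate systems for each layer rather than one object in memory.

```python
from collections import Counter

class LambdaPipeline:
    """Toy model: a batch view plus a real-time view, merged at query time."""

    def __init__(self):
        self.master_dataset = []        # Data: append-only record of all events
        self.batch_view = Counter()     # Serving layer: precomputed by the batch layer
        self.realtime_view = Counter()  # Speed layer: events newer than the last batch run

    def ingest(self, page):
        # New data is sent to both the master dataset and the speed layer.
        self.master_dataset.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch over all data seen so far.
        self.batch_view = Counter(self.master_dataset)
        # The speed layer only has to cover what the batch view does not yet include.
        self.realtime_view = Counter()

    def query(self, page):
        # Queries merge the batch view with the real-time increment.
        return self.batch_view[page] + self.realtime_view[page]


pipeline = LambdaPipeline()
for page in ["home", "pricing", "home"]:
    pipeline.ingest(page)
before_batch = pipeline.query("home")  # answered by the speed layer alone
pipeline.run_batch()
pipeline.ingest("home")
after_batch = pipeline.query("home")   # batch view plus one recent event
```

Recomputing the batch view from the full master dataset keeps it simple and correctable, while the speed layer covers the gap until the next batch run finishes.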
Social Media Analysis
IoT Big Data Application
Planning and Optimizing Data Lake Architecture
• Tomorrow, 12h, Big Data Track
• Data Lake Architecture in Practice
• Optimizing Hive and Spark for Data Lakes
Data Science Lab
datascience.rs
