How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic

Darko Marjanović, CEO @ Things Solver
darko@thingsolver.com

About me
• CEO and Co-founder @ Things Solver
• Co-founder @ Data Science Serbia
• Big Data, Machine Learning
• Hadoop, Spark, Python

Agenda
• Big Data
• Data Lake
• Data Lake vs Data Warehouse
• Hadoop, Spark, Hive
• Big Data application and Lambda architecture
• Examples
• Data Science Lab

Big Data
• Big data is a term for data sets that are so large or complex that
traditional data processing applications are inadequate.
• Anything that won't fit in Excel :)

Big Data
Volume
The quantity of generated and stored data.
Variety
The type and nature of the data.
Velocity
In this context, the speed at which the data is generated and
processed to meet the demands and challenges that lie in the
path of growth and development.
Veracity
The quality of captured data can vary greatly, affecting
accurate analysis.

Big Data
• Email, HTML, Click Stream...
• Facebook, Twitter...
• Video, Pictures…
• Logs...
• Sensor Data...
• Relational Databases...

Data Lake
“A data lake is a storage repository that holds a vast amount of raw
data in its native format, including structured, semi-structured, and
unstructured data. The data structure and requirements are not
defined until the data is needed.”
Data Lake - James Dixon, Pentaho chief technology officer
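
The phrase "not defined until the data is needed" is often called schema-on-read. Below is a minimal plain-Python sketch of that idea, not a real data lake: the records, field names and the `read_purchases` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
raw_store = [
    '{"user": "ana", "action": "click", "page": "/home"}',
    '{"user": "marko", "action": "purchase", "amount": 42.5}',
    'not even JSON: 2016-05-10 sensor-7 temp=21.3',
]

def read_purchases(store):
    """Apply a structure only at read time: keep records that parse as purchases."""
    purchases = []
    for line in store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # This reader cannot interpret the record; it stays in the store untouched.
            continue
        if isinstance(record, dict) and record.get("action") == "purchase":
            purchases.append({"user": record["user"], "amount": float(record["amount"])})
    return purchases

purchases = read_purchases(raw_store)
print(purchases)
```

The store keeps all three records regardless of format; a different reader with a different question could later give the sensor line its own structure.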

Data Lake
• Retain All Data
• Support All Data types
• Support All Users
• Adapt Easily to Changes
• Provide Faster Insights

Data Lake Cons
• Data storage alone has no impact on the effectiveness of business
decisions
• Inexpensive storage is not infinite or limitless

Data Warehouse
Wikipedia defines data warehouses as:
“…central repositories of integrated data from one or more disparate
sources. They store current and historical data and are used for
creating trending reports for senior management reporting such as
annual and quarterly comparisons.”

Data Warehouse
Problems:
• New Data Sources, Data Types
• Real Time Reports
• Streaming Data
• Software Price
• Infrastructure Price

Data Lake vs Data Warehouse
• ETL
• ETL and BI projects are by nature investments in evolving processes: they have no
distinct end point and continue as an ongoing cycle of improvement and re-targeting.
• ETL works backwards from the required output, so only the relevant data is extracted and processed.
• Future requirements that need other data cannot be foreseen and defined in the original ETL design.
• ELT
• Isolating Loading and Transforming enables projects to be broken down into specific chunks
that are more isolated and become more manageable.
• ELT is an emergent approach to data warehouse design and development requiring a change
in mentality and design approach compared to traditional ETL.
• Future requirements can easily be incorporated into the warehouse structure as all data is
pulled into the Data Lake in its raw format.
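
The difference between the two orderings can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a production pipeline: the rows, field names and the `revenue_by` helper are hypothetical.

```python
# Hypothetical source rows; the field names are illustrative only.
source = [
    {"customer": "ana", "country": "RS", "amount": 120.0, "channel": "web"},
    {"customer": "marko", "country": "RS", "amount": 80.0, "channel": "store"},
    {"customer": "eva", "country": "DE", "amount": 200.0, "channel": "web"},
]

def revenue_by(rows, key):
    """Transform step: total the amount per value of the given field."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0.0) + row["amount"]
    return totals

# ETL: transform before loading, so only the shape the original report
# needed (revenue per country) ends up in the warehouse.
etl_warehouse = revenue_by(source, "country")

# ELT: load everything raw first, then transform on demand.
elt_lake = list(source)
elt_by_country = revenue_by(elt_lake, "country")

# A requirement that appears later (revenue per channel) can still be answered
# from the raw data; the ETL output no longer contains the channel field.
elt_by_channel = revenue_by(elt_lake, "channel")

print(etl_warehouse, elt_by_channel)
```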

Hadoop
• The Apache Hadoop software library is a framework that allows the
distributed processing of large data sets across clusters of computers
using simple programming models.
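
The best-known of these programming models is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) for a word count in a single Python process; it does not use the Hadoop API, and the function names and input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could live on a different machine.
lines = ["big data big lake", "data lake"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, a framework can run both phases in parallel across machines.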

Hadoop
• Pros
• Linear scalability.
• Commodity hardware.
• Pricing and licensing.
• All data types.
• Analytical queries.
• Integration with traditional systems.
• Cons
• Implementation.
• MapReduce ease of use.
• Intense calculations with little data.
• In memory.
• Real time analytics.

Apache Spark
• Apache Spark is a fast and general engine for big data processing,
with built-in modules for streaming, SQL, machine learning and graph
processing.

Apache Spark
• Pros
• Up to 100x faster than MapReduce (in-memory workloads).
• Ease of use.
• Streaming, MLlib, GraphX and SQL.
• Pricing and licensing.
• In memory.
• Integration with Hadoop.
• Machine learning.
• Cons
• Integration with traditional systems.
• Limited memory per machine (GC).
• Configuration.

Apache Spark
• Resilient Distributed Datasets
(RDDs) are the basic units of
abstraction in Spark.
• RDD is an immutable, partitioned
set of objects.
• RDDs are lazily evaluated.
• RDDs are fully fault-tolerant. Lost
data can be recovered using the
lineage graph of RDDs (by
rerunning operations on the input
data).
• RDD operations:
• Transformations - lazily evaluated
(executed only when an action is
called, which improves pipelining)
• map, filter, groupByKey, join, ...
• Actions - run immediately (to
return a value to the application
or write to storage)
• count, collect, reduce, save, ...
• Don’t forget to cache()
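
The behaviour described above (lazy transformations, eager actions, lineage, caching) can be imitated with a toy class in plain Python. This is a sketch for intuition only: `MiniRDD` is a hypothetical stand-in, not the Spark API, and it runs in one process with no partitioning or fault tolerance.

```python
class MiniRDD:
    """Toy stand-in for an RDD: source data plus a recorded lineage of transformations."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # input data, never modified
        self._lineage = tuple(lineage)  # transformations recorded so far
        self._use_cache = False
        self._cached = None
        self.runs = 0                   # how many times the lineage was executed

    # Transformations: only record the step and return a new MiniRDD.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def cache(self):
        """Mark this dataset so the first computed result is kept for later actions."""
        self._use_cache = True
        return self

    def _compute(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        # Replay the lineage on the input data (this is also how lost data is rebuilt).
        self.runs += 1
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return data

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


numbers = MiniRDD(range(1, 7))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
runs_before_action = evens_squared.runs  # nothing has been executed yet

result = evens_squared.collect()  # first action: the lineage is executed
total = evens_squared.count()     # second action: served from the cache
print(result, total, evens_squared.runs)
```

Without the `cache()` call, the second action would replay the whole lineage again, which is the cost the last bullet warns about.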

Apache Spark
• DataFrames are a common abstraction across languages; they
represent a table, or a two-dimensional array with columns and rows.
• Spark DataFrames are distributed DataFrames. They allow querying
structured data using SQL or a DSL (for example in Python or Scala).
• Like RDDs, DataFrames are an immutable structure.
• They are executed in parallel.
• val df = sqlContext.read.json("pathToMyFile.json")

Hive
• Apache Hive is a data
warehouse infrastructure for
querying, analyzing and
managing large datasets
residing in distributed storage.

Hive
• Pros
• Writing ad hoc queries on large
volumes of data.
• Imposing a structure on a variety of
data formats.
• Interactive SQL queries over large
datasets residing in Hadoop.
• SQL-like data access.
• Accessing Hadoop data from
traditional DWH environment.
• Cons
• Code efficiency can be lower than
in traditional MapReduce.
• Apache Hive has terrible
performance for OLTP tasks.

Ecosystem
• Collecting Data
• Kafka, Flume…
• Managing Data
• Pig, Spark, Hive, Flink, MapReduce
• Resource Manager
• YARN, Mesos
• Administration
• Ambari, Bigtop

Lambda Architecture
• Lambda Architecture is a useful framework to think about designing
big data applications. Nathan Marz designed this
generic architecture addressing common requirements for big data
based on his experience working on distributed data processing
systems at Twitter.

Lambda Architecture
• Data
• Batch Layer
• Serving Layer
• Speed Layer

Planning and Optimizing Data Lake
Architecture
• Tomorrow, 12h, Big Data Track
• Data Lake Architecture in Practice
• Optimizing Hive and Spark for Data Lakes

Data Science Lab
datascience.rs
