Opensource Frameworks and BigData Processing


The document discusses using open-source technologies to build a big data processing platform on commodity machines. It outlines the challenges of big data including the volume, velocity and variety of data being created. It then describes the Hadoop ecosystem as a solution, including its use of MapReduce and various Apache projects for tasks like storage, transfer, search, messaging, logging, stream processing and machine learning.


Big-Data Processing utilizing
Open-Source Technology Stack
By
Amir Sedighi
http://www.linkedin.com/in/amirsedighi
@amirsedighi
Linux and Ubuntu 14.10 Release Conf 1

Data Explosion

● Big-Data: everything we do increasingly leaves a digital
trace that we (or others) can gather, use and analyze.
    – Data Providers
        ● Businesses
        ● People

Volume, Velocity, Variety
● “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.” (Eric Schmidt)

Big-Data Processing

How to provide a
Big-Data processing platform
using commodity machines?

Vertical or Horizontal?

Scale Up vs Scale Out
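
Scaling out means spreading data and work over many commodity machines instead of buying one larger machine. A minimal sketch of that idea, assuming simple hash partitioning; the names node_for and partition are hypothetical and do not come from any framework in this deck:

```python
import hashlib

def node_for(key, num_nodes):
    """Pick a node index for a key using a stable hash (same key, same node)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition(keys, num_nodes):
    """Distribute keys over num_nodes buckets, one bucket per machine."""
    nodes = [[] for _ in range(num_nodes)]
    for key in keys:
        nodes[node_for(key, num_nodes)].append(key)
    return nodes

# Hypothetical keys spread over a hypothetical 3-node cluster.
keys = ["user-%d" % i for i in range(12)]
cluster = partition(keys, 3)
```

Adding capacity then means raising num_nodes. Note that plain modulo hashing moves most keys when the node count changes; real systems often use other placement schemes to limit that.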

Big-Data Processing
Open-Source Technology Stack

Map-Reduce
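
A minimal single-process sketch of the map-reduce pattern, using word count as the example. This is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the input lines are assumptions made for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data", "Big clusters process big data"])
```

In a distributed framework the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process to keep the pattern visible.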

Hadoop Framework

Apache Hadoop Main Projects

Data Stores
● Data Stores
    – Key-Value
    – Graph
    – Columnar
    – Document Store
    – In-Memory
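
To make the data models above concrete, here is a rough sketch that lays out the same two records in key-value, document, columnar and graph form using plain Python structures. The records and field names are hypothetical, and no particular product's storage format is implied:

```python
import json

# Two hypothetical user records used in every layout below.
users = [
    {"id": "u1", "name": "Ada", "follows": ["u2"]},
    {"id": "u2", "name": "Linus", "follows": []},
]

# Key-value: an opaque value fetched by a single key.
key_value = {u["id"]: json.dumps(u) for u in users}

# Document: the value keeps its structure, so fields can be queried.
documents = {u["id"]: u for u in users}

# Columnar: one list per column, which suits scans over a single field.
columnar = {
    "id": [u["id"] for u in users],
    "name": [u["name"] for u in users],
}

# Graph: relationships stored explicitly as (source, label, target) edges.
graph_edges = [
    (u["id"], "follows", target) for u in users for target in u["follows"]
]
```

All four structures live in memory here; an in-memory store is a deployment choice that can apply to any of these models.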

Data Transfer
● Apache Flume
● Apache Sqoop

Search
● Elasticsearch
● Apache Solr
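
Both search engines above build on inverted indexes. A minimal sketch of that data structure, assuming whitespace tokenization and AND semantics; the function names and sample documents are hypothetical and this is not either engine's API:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents keyed by id.
docs = {
    1: "Hadoop stores big data",
    2: "Spark processes big data streams",
    3: "Kafka moves messages",
}
index = build_index(docs)
hits = search(index, "big data")
```

Real engines add analysis (stemming, stop words), relevance scoring and distribution of the index over shards, all of which this sketch leaves out.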

Messaging and Queuing
● Apache Kafka
● ZeroMQ
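
The idea shared by these tools is decoupling producers from consumers through a queue. A minimal in-process sketch using only the Python standard library; it does not use the Kafka or ZeroMQ APIs, and the message names are hypothetical:

```python
import queue
import threading

def producer(q, messages):
    """Put each message on the queue, then a None sentinel to signal the end."""
    for message in messages:
        q.put(message)
    q.put(None)

def consumer(q, received):
    """Take messages off the queue in FIFO order until the sentinel arrives."""
    while True:
        message = q.get()
        if message is None:
            break
        received.append(message)

q = queue.Queue()
received = []
threads = [
    threading.Thread(target=producer, args=(q, ["login", "click", "logout"])),
    threading.Thread(target=consumer, args=(q, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer never calls the consumer directly, so either side can be slowed down, restarted or replaced without the other knowing.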

Log Management
● ELK (Elasticsearch, Logstash, Kibana)
● Logstash
● Fluentd

Stream Processing
● Apache Storm
● Apache Samza
● Apache Spark
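
Stream processors handle events as they arrive rather than in one batch. A minimal sketch of one common operation, counting events per fixed (tumbling) time window. It assumes events arrive in timestamp order, and the function name and sample events are hypothetical rather than taken from any of the frameworks above:

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size window closes.

    events is an iterable of (timestamp_seconds, key) pairs that is
    assumed to be ordered by timestamp.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, still-open window when the input ends.
        yield current_start, dict(counts)

# Hypothetical (timestamp, key) events.
events = [(0, "a"), (3, "b"), (9, "a"), (10, "a"), (14, "b"), (21, "a")]
windows = list(tumbling_window_counts(events, 10))
```

Because the function is a generator, results are produced while the input is still being consumed, which is the essential difference from a batch job.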

Machine Learning
● Apache Mahout
● MLlib
● GraphX

Questions?
