Case Studies on Big-Data Processing and Data Streaming
By: Amir Sedighi
LinkedIn: http://linkedin.com/in/amirsedighi
Twitter: @amirsedighi
JUG - A.Sedighi - 2015 2 / 48
Background
● BS and MS degrees in Software Engineering
● Senior Software Engineer
– 20+ Years of Programming Experience
● Cross-platform Software Development
– 4+ Years of Big-Data Processing and Machine-Learning Experience
● Log Management and Forensics
● Big-Data Visualization
● Data Warehouse using Big-Data Technologies
● Recommender Systems
● Analytical Real-Time Search Engines
● Integrating Fedora Digital Library with HDFS
● Next Generation Event Processing
● Online Resume
– http://linkedin.com/in/amirsedighi
Outline
● An Introduction to Big-Data Processing
● Big-Data Processing and Data Streaming
– Data Processing
1. +TB Scale Data Warehouse
2. Analytical Real-Time Search Solution and BI
3. Scalable Recommender System
4. Integrating Fedora Digital Library with HDFS
– Stream and Event Processing
1. Super Fast Scalable Log Management, Forensics and BI
2. Super Fast Scalable Fraud Detection
What Is Big-Data?
● Every 2 Days Humans Create As Much Information As We Did
Up To 2003 - Eric Schmidt
Big-Data Characteristics
● Volume
● Variety
● Velocity
You're a Part of It Every Day
● We have the ability to store anything
● Companies and people are generating data like
never before in history
– Social Networks
– Online Web Portals
– Log Writers - Our Digital Footprint!
You're a Part of It Every Day
● Big-Data is whatever people do in the digital world,
including the footprint (logs) of what people, companies,
devices and services do, as well as traditional
tabular data stores.
As a Manager, You're Still a Part of It
● “Over half of the business leaders today realize they
don't have access to the insights they need to do their
job.” - IBM
Vertical or Horizontal?
Scale Up or Scale Out
Linear Scalability
Big-Data Processing Solutions
Q: How to Scale Linearly on Commodity Machines?
A: MapReduce
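The MapReduce idea named above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or file block) could be mapped on a different
    # machine; here the same steps simply run in a local loop.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data is big", "data streams"])
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, both steps can be spread over more machines as the data grows, which is where the linear scalability comes from.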
Q: How to store big data on commodity machines?
A: Distributed File System
Replication → Fault Tolerance
Replication → Data Locality → Utilization
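A toy model may help show why replication gives both properties. The sketch below assumes a simple round-robin placement, which is not how HDFS actually chooses nodes, and all names (`place_blocks`, `readable_blocks`, `pick_local_worker`, the node ids) are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the replicas land on distinct nodes.
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Fault tolerance: blocks that still have a replica on a live node."""
    return [b for b, holders in placement.items()
            if any(node not in failed for node in holders)]


def pick_local_worker(placement, block, busy):
    """Data locality: prefer an idle node that already stores the block."""
    for node in placement[block]:
        if node not in busy:
            return node
    return None  # a real scheduler would fall back to a remote read


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(4, nodes)
print(readable_blocks(placement, failed={"n1", "n2"}))
print(pick_local_worker(placement, 0, busy={"n1"}))
```

With three replicas, losing two of the four nodes still leaves every block readable, and the scheduler has three candidate machines per block where the computation can run next to the data, which keeps the cluster busy.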
Big-Data Processing, Most Popular Technologies
● Apache Hadoop Ecosystem
● NoSQL Databases
– HBase
– Cassandra
– MongoDB
– Neo4j
● Elasticsearch
– Lucene
– Solr
● Java
1. +TB Scale Data Warehouse
DW Solution
● SQL
● ETL
– RDBMS
– NoSQL
– File System
● REST API
REST Admin Panel
Features
● Extensible Capacity for Data Warehousing
● Building Very Large Integrated Databases from Different
Technologies/Schemas
– DB2, Oracle, MS-SQL …
– Different Schemas Such as HRMS, Banking, Sales...
– Building Small, Dense Integrated RDBMSs
● SQL Language Interface
● Linear Scalability
Main Technologies and Frameworks
● Apache Hadoop
– Sqoop
– YARN/HDFS
– Hive or Drill or Impala
● Microservices Architecture
– Java 1.7
– Spring Boot
2. Analytical Real-Time Scalable Search Solution and BI
+TB Scale RT Searching
● Indexing Incoming Data on-the-fly
● Highly Scalable and Reliable
● Simple or Complex Queries
● REST API
● Schema Agnostic
● Customizable GUI and BI
Business Intelligence
Rich GUI
Main Technologies and Frameworks
● Elasticsearch
– Apache Lucene
– REST
● Kibana
3. Scalable Recommender System
Recommender System
● Value-added Service (Loyalty Services)
● Machine-Learning
– Clustering Through Thousands of Nodes
● Apache Mahout
● Super Fast
How Does It Work?
Technologies and Frameworks
● Microservices Architecture
● Java 1.6
● Apache Mahout
● Redis
4. Fedora Digital Library and HDFS Integration
Migrating from Expensive Servers to Commodity Machines
● Using HDFS as Fedora Digital Library Storage
– Research and Development
– Hadoop 1.2, Later Hadoop YARN 2.2
– Integrating with Solr over HDFS
● Java 1.7
● Fedora
– Islandora
– GSearch
Data Streaming
Big-Data Streaming, Most Popular Technologies
● Piping and Messaging
– Kafka, Flume, Fluentd and ZeroMQ
● Stream Processing
– Storm, Samza and Spark
● Machine Learning
– MLlib and Mahout
● Persisting
– NoSQL DBs
– HDFS
1. Log Management, Forensics and BI
● Every Digital Device and Service Writes to Log Files
– Log Files Are Streams of Data
– Log Files Are Messy
– Log Files Arrive Very Fast, in an Unpredictable Manner
– Log Files Are About Everything within Your Business
● Log Files Are Full of Insight for Those
– Who Can Hold Them For a Reasonable Period of Time
– Who Can Search Them Rapidly
– Who Can Visualize Them Easily (BI)
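To make the "messy log lines in, searchable events out" step concrete, here is a small sketch using only Python's standard library. The log layout (timestamp, level, service, message) and the sample lines are invented for illustration; a real pipeline would delegate this parsing to a shipper such as Logstash or Flume.

```python
import re
from collections import Counter

# Hypothetical line layout: "<timestamp> <LEVEL> <service>: <message>"
LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s*(?P<message>.*)$"
)


def parse(lines):
    """Turn raw lines into structured events, silently skipping junk."""
    for raw in lines:
        match = LINE.match(raw.strip())
        if match:
            yield match.groupdict()


sample = [
    "2015-05-01T10:00:00 INFO auth: login ok user=42",
    "  2015-05-01T10:00:01   ERROR payments: timeout",
    "### garbage line ###",
    "2015-05-01T10:00:02 ERROR auth: bad password user=7",
]

events = list(parse(sample))
errors_by_service = Counter(e["service"] for e in events if e["level"] == "ERROR")
print(errors_by_service)
```

Once every line is a dictionary with named fields, holding, searching and visualizing the events becomes a matter of indexing those fields rather than scanning raw text.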
Network Topology
[Diagram labels: LB, Masters, Data]
Main Technologies and Frameworks
● Logstash
– Flume
● Elasticsearch
● Kibana
Snapshot
2. Fraud Detection
Inputs & Outputs
● Inputs: One or multiple sources generate data continuously, in
real time
– Sensor Networks
– Transaction Logs
– Text Streams such as News
– Network Traffic Analysis
● Outputs: Up-to-date Answers generated continuously or
periodically
Data Processing
● Transient Query
– Issued once, then forgotten
● Persistent Data
– Stored until deleted by user or apps
Stream Processing
● Transient Data
– Deleted as Window Slides Forward
● Persistent Queries
– Generate up-to-date answers as time goes on
● Windows
– Time-Based
– Count-Based
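The two window types can be sketched with a `deque`. This is a minimal single-process illustration, not the windowing API of Spark or Storm; the class names are made up here, and the time-based variant assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps only the last `size` events."""

    def __init__(self, size):
        self.events = deque(maxlen=size)  # the oldest event drops automatically

    def add(self, value):
        self.events.append(value)
        return list(self.events)


class TimeWindow:
    """Time-based window: keeps events newer than `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.events = deque()  # (timestamp, value) pairs in arrival order

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Slide forward: transient data that fell out of the window is deleted.
        while self.events[0][0] <= timestamp - self.span:
            self.events.popleft()
        return [v for _, v in self.events]


cw = CountWindow(3)
for v in [1, 2, 3, 4]:
    last_three = cw.add(v)

tw = TimeWindow(10)
tw.add(0, "a")
tw.add(5, "b")
recent = tw.add(12, "c")
print(last_three, recent)
```

A persistent query is then just a function that is re-evaluated on the window contents every time `add` returns.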
Features
● Scalability
● Real-Time (At Most 1 Second of Delay)
● Super Fast Decision Making
● Implementing Complex Fraud Scenarios Is As Easy as Defining
Queries
● Uniform API for Processing Old or Early Events
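As a rough idea of what "a fraud scenario defined as a query" could look like, the sketch below expresses one hypothetical rule (too many transactions on one card within a time window) as a small function and runs it over any iterable of events. The event fields `card` and `ts`, the thresholds, and the function names are assumptions for this example and are not taken from the system described in the talk.

```python
from collections import defaultdict, deque


def make_velocity_rule(max_count, span_seconds):
    """Build a persistent query: flag a card with more than `max_count`
    transactions inside any `span_seconds` window."""
    recent = defaultdict(deque)  # card id -> timestamps still inside the window

    def rule(event):
        timestamps = recent[event["card"]]
        timestamps.append(event["ts"])
        # Drop timestamps that slid out of the window (assumes ordered input).
        while timestamps[0] <= event["ts"] - span_seconds:
            timestamps.popleft()
        return len(timestamps) > max_count

    return rule


def detect(events, rule):
    """Same code path for a stored list of old events or a live generator."""
    for event in events:
        if rule(event):
            yield event


history = [
    {"card": "A", "ts": 0},
    {"card": "A", "ts": 1},
    {"card": "B", "ts": 1},
    {"card": "A", "ts": 2},
    {"card": "A", "ts": 30},
]
flagged = list(detect(history, make_velocity_rule(max_count=2, span_seconds=10)))
print(flagged)
```

Since `detect` only needs an iterable, replaying archived events and consuming a live feed use the same rule definition; a fresh rule is built per run because it carries window state.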
Main Technologies and Frameworks
● Java 1.7, Scala 2.11
● Apache Flume
● Apache Kafka
● Apache Spark
Where To Start?
● You Need a Large Amount of Data
● You Need to Change Your Mindset
– Rack Space and Number of Servers, IO and Process Limitations
● You Need to Understand the Fundamentals
– Linux (Bash Script)
– Java is a Must, Python Works and Scala is an Advantage
– SQL and ETL
– MapReduce, Resource Management and Serialization Frameworks
– Apache Hadoop Ecosystem and Successors
Thank You! Questions?
http://slideshare.net/amirsedighi

Case Studies on Big-Data Processing and Streaming - Iranian Java User Group
