HADOOP INFRASTRUCTURE AND
SOFTSERVE EXPERIENCE
 Pacemaker BigData, Lviv
Agenda
•Business needs
•Hadoop Infrastructure
•Hadoop Distributions
•SoftServe Experience
Presentation drivers
• Hadoop competence development
• Hadoop isn’t only MapReduce
• Components for solution building
• Case studies
Big Analytics Engineering Challenges
Data Discovery
Business Reporting
Real Time Intelligence
Business Users
Intelligent Agents/Consumers
How to achieve low latency for personalized customer experience in real time?
Data Scientists/Analysts
How to improve system performance for the Data Science/Analytics team?
How to implement self-service with high data quality over terabytes and petabytes?
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
HDFS
A distributed file system
• Files are split into blocks
• Each block is stored as 3 replicas by default
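The block-and-replica model above can be turned into a quick sizing estimate. The following is a minimal sketch, not an HDFS API: the function name `hdfs_footprint` is hypothetical, the 128 MiB block size is an assumption (a common default in Hadoop 2.x; the slide does not state it), and the replication factor of 3 is taken from the slide.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Estimate (block count, raw bytes stored cluster-wide) for one file.

    Assumptions: 128 MiB blocks and 3 replicas per block. The last block
    only occupies the bytes it actually holds, so raw storage is simply
    file size times the replication factor.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


# A 1 GiB file splits into 8 blocks and occupies 3 GiB of raw disk space.
blocks, raw_bytes = hdfs_footprint(1024**3)
print(blocks, raw_bytes / 1024**3)
```

Under these assumptions, every terabyte of logical data needs roughly three terabytes of raw cluster capacity.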
MapReduce
A distributed computing framework
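To make the programming model concrete, here is a minimal single-process sketch of the classic word-count job. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop is not only MapReduce", "MapReduce runs on Hadoop"])
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the shuffle across the network.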
Apache YARN
A resource manager (Yet Another Resource Negotiator)
A more complex resource management
Apache Hive
An SQL interpreter for MapReduce
Apache Pig
A scripting language for querying data in HDFS
Impala
Real-Time Queries in Apache Hadoop
Apache Spark
Runs Everywhere
Engine for large-scale data processing. Can be used with Java, Scala, and Python
Apache Sqoop
SQL to Hadoop: a tool for transferring data between an RDBMS and Hadoop
Other Databases on top of Hadoop
Column oriented Key-Value datastore
Graph oriented Database
Apache Flume
A distributed service for collecting, aggregating, transforming, and moving large amounts of log data
Apache Storm
Distributed, real-time computation service. Can be used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more
Apache Zookeeper
Distributed Service for:
• maintaining configuration information
• naming
• providing distributed synchronization
• providing group services
Service is fault tolerant:
• Zookeeper cluster is called “ensemble”
• There is one “leader” in an “ensemble”
• If the “leader” goes down, a new “leader” is elected by a quorum
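The quorum rule explains why ensembles are usually sized with an odd number of servers. Below is a minimal sketch of the arithmetic, assuming simple majority quorums (ZooKeeper's default); the function names are hypothetical and this is not ZooKeeper code.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule a 4-server ensemble tolerates no more failures than a 3-server one, while 5 servers tolerate two.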
Apache Kafka
Distributed messaging service
• Large amount of data
• Scalable
• Durable (messages are persisted on disk)
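The durability property can be illustrated with a toy append-only log. This is only a sketch of the idea (messages appended to a file, consumers reading from an offset); it is not the Kafka API, and the class `FileLog` and its methods are hypothetical.

```python
import json
import os
import tempfile


class FileLog:
    """Toy append-only message log stored as one JSON document per line."""

    def __init__(self, path):
        self.path = path
        open(path, "a").close()  # create the file if it does not exist yet

    def append(self, message):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk before returning

    def read_from(self, offset):
        """Return all messages whose position in the log is >= offset."""
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for i, line in enumerate(f) if i >= offset]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.log")
    log = FileLog(path)
    for i in range(3):
        log.append({"event": i})

    # A new reader (for example, after a restart) still sees the old messages.
    reopened = FileLog(path)
    tail = reopened.read_from(1)
    print(tail)
```

Because messages stay in the log after being read, each consumer only needs to remember its own offset.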
Popular Distributions
The latest architecture trends
Lambda Architecture
http://lambda-architecture.net/
SoftServe Lambda Architecture Accelerator
• Lambda Architecture is a highly scalable and reliable data processing architecture based on Twitter's successful experience in Big Data and Analytics
• Supports the majority of use cases: real-time analytics, data discovery, and business reports
• SoftServe’s pre-built Lambda Architecture stack accelerates the customer’s time to market by 15-20+ man-months
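The core idea of the Lambda Architecture is that a query merges a precomputed batch view with a small real-time view covering data the batch layer has not processed yet. The following is a minimal in-memory sketch of that idea; the event shape and all function names are hypothetical and do not describe SoftServe's accelerator.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full master dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch[page] + speed[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent = [{"page": "/home"}]

total_home = query("/home", batch_view(master), speed_view(recent))
total_cart = query("/cart", batch_view(master), speed_view(recent))
print(total_home, total_cart)
```

In a stack like the ones in the case studies, the batch view would typically be built with MapReduce over HDFS and the speed view with a stream processor such as Storm.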
25
Business Goals:
• Build a centralized platform for log data analysis that collects data from ~270-300 web servers
• Provide Online Monitoring to answer the question: “What is going on with the systems now?”
• Provide Retrospective Analytics: strategic management, capacity management/planning, root cause analysis, ad-hoc analysis
Business Area:
Retail industry. A leading travel site in the world
Big Data Lab: Log Management
Log Data Analysis Platform
Details
26
Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet and OS Service Logs
• ~500K events per minute
• 150GB of data per day
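These figures allow a rough back-of-the-envelope estimate. The sketch below assumes the ~500K events per minute is sustained around the clock (it may well be a peak figure), that 150GB means 150 × 10^9 bytes, and that data is stored in HDFS with a replication factor of 3; none of these assumptions is stated on the slide.

```python
EVENTS_PER_MINUTE = 500_000      # from the slide (assumed to be sustained)
BYTES_PER_DAY = 150 * 10**9      # 150 GB per day, decimal gigabytes assumed
REPLICATION = 3                  # assumed HDFS replication factor

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * 60 * 24
avg_event_bytes = BYTES_PER_DAY / events_per_day
raw_bytes_per_day = BYTES_PER_DAY * REPLICATION

print(f"{events_per_second:,.0f} events/sec")
print(f"{events_per_day:,} events/day")
print(f"{avg_event_bytes:.0f} bytes per event on average")
print(f"{raw_bytes_per_day / 10**9:.0f} GB of raw storage per day")
```

Under these assumptions the platform handles about 8,300 events per second at roughly 200 bytes each, and consumes about 450 GB of raw disk per day.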
Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana
• MicroStrategy Analytics platform
Solution Architecture
27
28
Business Goals:
• Build an in-house Analytics Platform for ROI measurement and performance analysis of every product and feature delivered by the e-commerce platform
• Provide the ability to understand how end users interact with service content, products, and features on sites
• Do clickstream analysis
• Perform A/B testing
Business Area:
Retail. A platform for e-commerce and collecting customer feedback
Case Study #1: Clickstream for retail website
Architectural Decisions
29
Architecture Drivers:
▪ Volume (45 TB)
▪ Sources (Semi-structured - JSON)
▪ Throughput (> 20K/sec)
▪ Latency (1 hour/real-time)
▪ Extensibility (Custom tags)
▪ Data Quality (Not critical)
▪ Reliability (24/7)
▪ Security (Multitenancy)
▪ Self-Service (Canned reports, Data science)
▪ Cost (The less the better)
▪ Constraints (Public Cloud)
Technology Stack:
Lambda Architecture
• Apache Kafka
• Apache Storm
• Amazon S3
• Hadoop/HDFS, MapReduce (CDH 5)
• HBase
• Oozie, ZooKeeper
• Cloudera Manager
Solution Architecture
30
31
Business Goals:
• In-house Web Analytics Platform for Conversion Funnel Analysis, marketing campaign optimization, user behavior analytics (based on server logs analysis, page tagging, external data)
• Perform A/B Testing, platform feature usage analysis
Business Area:
Retail. The world's largest digital coupon
marketplace. The company owns the largest
coupon sites in the US, UK, Germany,
Netherlands, France
Case Study #2: Coupon Marketplace
Coupon Marketplace: Project
Details
32
Project Facts:
• 500 million visits a year
• 25TB+ HP Vertica Data Warehouse
• 50TB+ Hadoop Cluster
• Near-Real time data visualization
Technology Stack:
• Hadoop Cluster (Amazon EMR)/Hive/Hue/MapReduce/Flume/Spark
• HP Vertica, MySQL
• Python
• Tableau
Major Activities:
• Near-Real time data integration processes design and implementation
• Hadoop cluster optimization
• Data Warehouse re-design and optimization
• Data Science algorithms design
Coupon Web Analytics Platform
33
Diagram labels: Coupon Web-Site, JS Libs, Web Logs, Operational databases, 3rd Party API, MPP Data Warehouse Cluster, Raw Data Hadoop Cluster, ETL, Additional Data Stores, Data Scientists, BI/Marketing Team, REST/SOAP
34
Business Goals:
• Insights and optimization of all web, mobile, and social channels
• Optimization of recommendations for each visitor
• High return on online marketing investments
Business Area:
A web analytics platform from a Fortune 100 company that stores and analyzes data on visitors' digital journeys
Case Study #3: Online Analytics Platform
Online Analytics Platform
Details
35
Key Facts:
• Big Data > 1PB
• 10+ GB per customer/day
• 10+ Hadoop Clusters
• 15+ Aster Data Clusters
Technologies:
• Hadoop/HBase/HiveQL
• Aster Data
• Oracle
• Java/Flex
Solution Architecture
36
Diagram labels: Customer Marketing Team, Customer Web Server Environment, Web Analytics Platform, Web Analytics Data, Offerings, Business Rules, Schedule, Recommendation, Rule Engine
Further learning
http://bigdatauniversity.com/
http://blog.cloudera.com/blog/
http://hortonworks.com/blog/
https://www.mapr.com/blog
Hadoop: The Definitive Guide, 3rd Edition
Any questions, Dude?

Editor's Notes

  • #26: Client: Our client is a leading travel site in the world. Engagement: Partnering with SoftServe, the combined teams developed and implemented a Hadoop cluster that collects log data from ~270-300 web servers, including HTTPD Access and Error logs as well as Application Server Servlet and OS Service Logs, for further operational and retrospective analysis. Result: The client has decreased its time to react to issues with the web servers and gained more insight into ROI analysis for marketing campaigns, which enabled the company to increase the number of visitors.
  • #32: Clickstream Data: Google Analytics; Site Catalyst, a SaaS app from Adobe (prev. Omniture); Apache web logs; Beacon JavaScript library. Financial Data: data provided by affiliate networks through API, FTP, etc. Marketing Data: Kenshoo, used as a platform to analyze the effectiveness of pay-per-click Google Ad campaigns. The Kenshoo Conversion Feed provides sales and commission data to measure ROI on campaigns.
  • #36: Tools & Technologies Extended List: SaaS, Hadoop/HDFS, Hadoop/HBase, Aster Data, Java/Flex, J2EE, JavaScript, Scape SSH/SFTP library, Velocity, Linux, Bash RDL, SQL, XSL Java, XML, Oracle database, JMS, Java Servlet, JDBC, JBoss, Flash RDL, Macromedia Flash.
  • #37: Hadoop/HiveQL: raw data about website user behavior; aggregated information for historical analytics; customized scheduled reports. HBase: online queries for immediate data access: user geographic and demographic information; recent user purchase, search, and unsubscribe activities.