Logging Infrastructure for Microservices Using StreamSets Data Collector
Presenter:
Virag Kothari
Software Engineer at StreamSets
Open-Source Continuous Ingest
© 2015 StreamSets, Inc. All rights reserved.
About StreamSets
● Headquartered in San Francisco, CA
● Deep expertise in enterprise data management and integration
○ Girish Pancha, CEO (Formerly Chief Product Officer at Informatica)
○ Arvind Prabhakar, CTO (Formerly Director, Engineering for Integration at Cloudera)
○ Team includes Apache PMC members for Flume, Sqoop, Hadoop, Oozie, Hive, Storm
Containerized services
Run batch jobs, application jobs, microservices
Logging is key in dynamic environments
Diagram components: Application in Docker containers, Flume/Logstash, Kafka, and the stores HBase/Cassandra, HDFS/S3, Elasticsearch
Challenges
Semi-structured logs
Semantic drift
-> Schema changes
-> Malformed records
Infrastructure drift
-> New apps with their own log format
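One way to picture the semantic-drift problem is a parser that tolerates new fields and sets malformed records aside instead of failing. The sketch below is only an illustration: the JSON log format, the `EXPECTED_FIELDS` set, and the `_drift` key are assumptions made for the example, not anything the slides specify.

```python
import json

# Hypothetical expected fields; the slides do not name a schema.
EXPECTED_FIELDS = {"ts", "level", "msg"}


def parse_log_line(line):
    """Parse one JSON log line and return (record, error).

    Unknown fields are kept and listed under the "_drift" key, so a
    schema change does not break parsing. Lines that are not JSON
    objects, or that lack an expected field, come back as an error
    dict instead of raising.
    """
    try:
        record = json.loads(line)
    except ValueError as exc:
        return None, {"raw": line, "reason": str(exc)}
    if not isinstance(record, dict):
        return None, {"raw": line, "reason": "not a JSON object"}
    missing = sorted(EXPECTED_FIELDS - record.keys())
    if missing:
        return None, {"raw": line, "reason": "missing fields: " + ", ".join(missing)}
    extra = sorted(record.keys() - EXPECTED_FIELDS)
    if extra:
        record["_drift"] = extra
    return record, None
```

A caller would forward the first element downstream when the second is `None`, and otherwise send the error dict to a separate error stream.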
StreamSets Data Collector (SDC) Pipeline
Diagram: Application (Docker container) -> Origin (Log Source) -> Processor -> Destination (Kafka) on success; Kafka/Write to File on error
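The origin, processor, destination shape with a separate error path can be sketched in a few lines. This is a toy model and not the SDC API: `run_pipeline`, `parse_status`, and the list-based sinks are hypothetical stand-ins for the stages in the diagram.

```python
def run_pipeline(origin, processor, destination, error_sink):
    """Push each record from origin through processor.

    Records that process cleanly are appended to destination; records
    whose processing raises are appended to error_sink together with
    the error message. Returns (success_count, error_count).
    """
    ok = failed = 0
    for record in origin:
        try:
            result = processor(record)
        except Exception as exc:
            error_sink.append({"record": record, "error": str(exc)})
            failed += 1
        else:
            destination.append(result)
            ok += 1
    return ok, failed


def parse_status(line):
    # Hypothetical processor: expects "<service> <http status>".
    service, status = line.split()
    return {"service": service, "status": int(status)}


# Lists stand in for the Kafka destination and the error file.
kafka_topic, error_file = [], []
counts = run_pipeline(["auth 200", "cart 500", "garbage"],
                      parse_status, kafka_topic, error_file)
```

Keeping the failed record next to its error message means the error stream can be replayed once the processor has been fixed.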
Handle semantic and infrastructure drift
● Built-in transformations
● Scripting support
● Troubleshoot using snapshots
● Rules and alerting
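As a rough illustration of the rules-and-alerting idea, the sketch below fires when the share of error records in a sliding window crosses a threshold. The class name, window size, and threshold are all assumptions for the example; SDC's own rule configuration is not shown in the slides.

```python
from collections import deque


class ErrorRateRule:
    """Fire when the error share of the last `window` records exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.1):
        # deque with maxlen drops the oldest observation automatically.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def error_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def observe(self, is_error):
        """Record one outcome and return True if the rule fires."""
        self.window.append(bool(is_error))
        return self.error_rate() > self.threshold
```

Calling `observe` once per record keeps the check cheap, since each call only touches the bounded window.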
Data at scale
● Streaming/Batch Cluster deployments
● Batch - MapReduce
● Streaming - Spark Streaming on Mesos and YARN
● Storm, Samza and others?
Cluster pipeline
Diagram: Kafka -> Spark executors on YARN/Mesos (each task runs an SDC instance) -> HDFS/S3, HBase/Cassandra, Hive, Solr
Spark Streaming + Kafka
Direct Approach
One-to-one mapping between Kafka and RDD partitions
Allocate executors equal to Kafka partitions
Multiple tasks within executor
Diagram: Kafka partition -> RDD partition -> SDC
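The one-to-one mapping can be modelled without Spark: each batch is planned as one offset range per Kafka partition, and each range becomes one RDD partition. The sketch below is plain Python under that assumption; `OffsetRange` here is a stand-in namedtuple and `plan_batch` is a hypothetical helper, not Spark's API.

```python
from collections import namedtuple

# Stand-in for the offset range idea in the direct approach;
# this is not Spark's own OffsetRange class.
OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")


def plan_batch(topic, last_consumed, latest_available):
    """Return one offset range per Kafka partition for the next batch.

    Each range maps to exactly one RDD partition, so the batch has as
    many partitions (and parallel tasks) as the topic has partitions.
    Partitions never consumed before start at offset 0.
    """
    return [
        OffsetRange(topic, p, last_consumed.get(p, 0), latest_available[p])
        for p in sorted(latest_available)
    ]
```

For a topic with three partitions, `plan_batch` returns three ranges, which is why matching the executor count to the partition count keeps every range busy in parallel.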
Spark on YARN
Client vs Cluster mode
Fault-tolerant driver
Jars available through Distributed Cache
Classloader isolation due to conflicting libraries
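To make the client/cluster choice concrete, the sketch below assembles a `spark-submit` command line for YARN. The flags used (`--master`, `--deploy-mode`, `--class`, `--num-executors`, `--jars`) are standard `spark-submit` options; the helper name, jar paths, and class name are hypothetical, and the slides do not show how SDC actually launches its jobs.

```python
def spark_submit_args(app_jar, main_class, deploy_mode, kafka_partitions, extra_jars=()):
    """Build a spark-submit argument list for a YARN deployment.

    deploy_mode is "client" (driver runs where the command is issued)
    or "cluster" (driver runs inside YARN). The executor count is set
    to the number of Kafka partitions, following the sizing rule on
    the previous slide. Extra jars are passed with --jars so YARN
    ships them to the containers.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        "--num-executors", str(kafka_partitions),
    ]
    if extra_jars:
        args += ["--jars", ",".join(extra_jars)]
    args.append(app_jar)
    return args
```

Returning a list rather than a single string lets the caller hand it directly to `subprocess.run` without shell quoting.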
Spark on Mesos
Mesos is not a framework manager
REST endpoint provided by Spark to manage the Mesos framework
No Distributed Cache
Fault-tolerance through pipeline-level retries
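Pipeline-level retries can be pictured as a wrapper that restarts the whole pipeline after a failure. The sketch below is a generic retry loop with exponential backoff; `run_with_retries`, its parameters, and the backoff policy are assumptions for illustration, since the slides do not describe the retry policy SDC uses.

```python
import time


def run_with_retries(start_pipeline, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call start_pipeline until it succeeds or attempts run out.

    Waits base_delay * 2 ** (attempt - 1) seconds between attempts and
    re-raises the last error if every attempt fails. The sleep function
    is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return start_pipeline()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying at the pipeline level trades finer-grained recovery for simplicity: the whole job restarts, so the origin needs to resume from a known offset to avoid losing data.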
Thank you
We’re hiring: http://streamsets.com/careers/
https://github.com/streamsets
