Logging infrastructure for Microservices using StreamSets Data Collector


This document discusses using StreamSets Data Collector (SDC) to build a logging infrastructure for microservices. SDC can ingest logs from microservices running in containers and handle issues like schema changes and new log formats. It processes and transforms the logs, sending them to destinations like Kafka. SDC pipelines can run on Spark clusters on Yarn and Mesos to handle large volumes of log data and load it into systems like HDFS, HBase and Elasticsearch for analysis.

Logging Infrastructure for microservices using StreamSets Data Collector
Presenter: Virag Kothari
Software Engineer at StreamSets

© 2015 StreamSets, Inc. All rights reserved.
About StreamSets
● Headquartered in San Francisco, CA
● Deep expertise in enterprise data management and integration
○ Girish Pancha, CEO (Formerly Chief Product Officer at Informatica)
○ Arvind Prabhakar, CTO (Formerly Director, Engineering for Integration at Cloudera)
○ Team includes Apache PMC members for Flume, Sqoop, Hadoop, Oozie, Hive, Storm

Containerized services
Run batch jobs, application jobs, microservices
Logging is key in dynamic environments
(Diagram labels: Application, Docker Container (x2), Flume/Logstash, Kafka, HBase/Cassandra, HDFS/S3, Elasticsearch)

Challenges
Semi-structured logs
Semantic drift
-> Schema changes
-> Malformed records
Infrastructure drift
-> New apps with their own log format

StreamSets Data Collector (SDC) Pipeline
Application (Docker container) -> Origin (Log Source) -> Processor -> Destination
On success: Destination (Kafka)
On error: Kafka/Write to File
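The origin, processor, and destination stages, with a separate path for error records, can be sketched in plain Python. This is not SDC code: `parse_line`, `run_pipeline`, and the JSON log format are hypothetical, chosen only for illustration, and in-memory lists stand in for the Kafka destination and the error sink.

```python
import json


def parse_line(line):
    """Processor stage: parse one JSON log line and require a 'level' field."""
    record = json.loads(line)  # raises ValueError on malformed JSON
    if not isinstance(record, dict) or "level" not in record:
        raise ValueError("missing 'level' field")
    return record


def run_pipeline(lines):
    """Route each line to the success list or to the error list."""
    destination, errors = [], []  # stand-ins for Kafka and the error sink
    for line in lines:
        try:
            destination.append(parse_line(line))
        except ValueError as exc:
            # Keep the raw line and the reason so the record can be inspected later.
            errors.append({"raw": line, "reason": str(exc)})
    return destination, errors


lines = [
    '{"level": "INFO", "msg": "started"}',
    "not json at all",
    '{"msg": "no level"}',
]
ok, bad = run_pipeline(lines)
print(len(ok), "delivered,", len(bad), "sent to the error path")
```

Keeping the raw line alongside the failure reason means a malformed record is diverted rather than dropped, so it can be replayed once the parser is fixed.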

Handle semantic and infrastructure drift
● Built-in transformations
● Scripting support
● Troubleshoot using snapshots
● Rules and alerting
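As a rough illustration of what a scripted transformation and an alerting rule might do about schema drift, here is a minimal Python sketch. It does not use the SDC scripting API; `FIELD_ALIASES`, `normalize`, `drift_alert`, the field names, and the 0.5 threshold are all assumptions made for the example.

```python
# Hypothetical aliases: different services name the same field differently.
FIELD_ALIASES = {"ts": "timestamp", "time": "timestamp", "lvl": "level"}


def normalize(record):
    """Rename known aliases to canonical names; leave other fields untouched."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def drift_alert(records, required=("timestamp", "level"), threshold=0.5):
    """Return True when the share of records missing a required field exceeds threshold."""
    if not records:
        return False
    missing = sum(1 for r in records if any(f not in r for f in required))
    return missing / len(records) > threshold


batch = [
    {"ts": 1, "lvl": "INFO"},
    {"timestamp": 2, "level": "WARN"},
    {"time": 3, "msg": "no level here"},
]
normalized = [normalize(r) for r in batch]
print(normalized[0])
print("alert before normalizing:", drift_alert(batch))
print("alert after normalizing:", drift_alert(normalized))
```

In this toy batch the raw records trip the rule and the normalized ones do not, which is the kind of signal a rule could use to flag a new or changed log format.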

Data at scale
● Streaming/Batch Cluster deployments
● Batch - MapReduce
● Streaming - Spark Streaming on Mesos and Yarn
● Storm, Samza and others?

Cluster pipeline
(Diagram labels: Kafka; Spark executor containing two Tasks, each with an SDC; Yarn/Mesos; HDFS/S3, HBase/Cassandra, Hive, Solr)

Spark Streaming + Kafka
Direct Approach
One-to-one mapping between Kafka and RDD partitions
Allocate as many executors as there are Kafka partitions
Multiple tasks within an executor
Kafka partition -> RDD partition -> SDC
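The one-to-one mapping can be modelled without Spark. The sketch below is a simplified model, not Spark's scheduler: `plan_batch` and `assign_to_executors` are hypothetical helpers, the offset ranges are made up, and round-robin placement is an assumption (real scheduling also depends on locality and free cores).

```python
def plan_batch(offset_ranges):
    """Build one RDD partition per Kafka partition offset range.

    offset_ranges maps a Kafka partition id to (from_offset, until_offset).
    """
    return [
        {"rdd_partition": index, "kafka_partition": kp, "from": start, "until": end}
        for index, (kp, (start, end)) in enumerate(sorted(offset_ranges.items()))
    ]


def assign_to_executors(rdd_partitions, num_executors):
    """Spread tasks over executors round-robin (a simplification)."""
    assignment = {executor: [] for executor in range(num_executors)}
    for part in rdd_partitions:
        executor = part["rdd_partition"] % num_executors
        assignment[executor].append(part["kafka_partition"])
    return assignment


ranges = {0: (100, 150), 1: (80, 120), 2: (0, 40), 3: (10, 30)}
parts = plan_batch(ranges)
print(assign_to_executors(parts, 4))  # one Kafka partition per executor
print(assign_to_executors(parts, 2))  # fewer executors: several tasks each
```

With as many executors as Kafka partitions, each executor handles one partition per batch; with fewer, each executor runs several tasks.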

Spark on Yarn
Client vs Cluster mode
Fault tolerant driver
Jars available through Distributed Cache
Classloader isolation due to conflicting libraries

Spark on Mesos
Mesos is not a framework manager
REST endpoint provided by Spark to manage the Mesos framework
No Distributed Cache
Fault-tolerance through pipeline-level retries
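The slides do not say how the retries are implemented, so the following is only a generic sketch of retrying a whole pipeline start. `run_with_retries`, `FlakyPipeline`, the attempt limit, and the exponential backoff are all assumptions for illustration.

```python
import time


def run_with_retries(start_pipeline, max_attempts=3, base_delay=0.01):
    """Call start_pipeline until it succeeds or the attempts run out.

    Returns the number of attempts used; re-raises the last error on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            start_pipeline()
            return attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise
            # Exponential backoff between attempts (an assumed policy).
            time.sleep(base_delay * 2 ** (attempt - 1))


class FlakyPipeline:
    """Hypothetical pipeline that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("driver lost")


pipeline = FlakyPipeline(failures=2)
attempts_used = run_with_retries(pipeline)
print("succeeded after", attempts_used, "attempts")
```

Retrying at the pipeline level treats the whole job as the unit of recovery, which fits an environment where the cluster manager does not restart a failed driver on its own.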

Thank you
http://streamsets.com/careers/
We’re hiring...
https://github.com/streamsets

2. Open-Source Continuous Ingest
