Building a data pipeline to ingest
data into Hadoop in minutes
using Streamsets Data Collector
Guglielmo Iozzia,
Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area my team belongs to (cloud applications), there were many
questions to answer. They related to:
● Defect analysis
● Outage analysis
● Cyber-Security
“Data is the second
most important
thing in analytics”
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
… and so many tools available to get the data
What are we going to do with all those data?
Issues
● The need to collect data from multiple sources introduces redundancy, which
costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in
general) in managing Big Data tools.
● Low budget.
Alternatives
#1 Panic
Alternatives
#2 Cloning team members
Alternatives
#3 Find a smart way to simplify the data ingestion
process
A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automate error handling and alerting.
● Be easy for everyone to use.
● Allow zero-downtime infrastructure upgrades, thanks to the logical isolation
of each flow stage.
● Be open source.
… something like this
Streamsets Data Collector
Streamsets Data Collector: supported origins
Streamsets Data Collector: available destinations
Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
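To make the role of these processors concrete, here is a minimal conceptual sketch in plain Python. It is not the Data Collector API: the function names only borrow the processor names from the list above, and the record layout (a dict with a hypothetical `raw` field holding a JSON log line) is an assumption made for illustration. It shows records being parsed, enriched, routed to a lane, and diverted to an error list when a stage fails.

```python
import json


def json_parser(record, field="raw"):
    """Parse the JSON string held in `field` and merge its keys into the record."""
    parsed = json.loads(record[field])
    return {**record, **parsed}


def expression_evaluator(record):
    """Derive a new field from existing ones (here: a boolean error flag)."""
    return {**record, "is_error": record.get("level") == "ERROR"}


def stream_selector(record):
    """Pick an output lane for the record based on a condition."""
    return "errors" if record["is_error"] else "default"


def run_pipeline(records):
    """Push every record through the stages; failing records are set aside."""
    lanes = {"errors": [], "default": []}
    bad_records = []
    for record in records:
        try:
            enriched = expression_evaluator(json_parser(record))
            lanes[stream_selector(enriched)].append(enriched)
        except (KeyError, TypeError, ValueError) as exc:
            # Missing field, non-object JSON, or malformed JSON: keep the
            # record and the reason instead of stopping the whole flow.
            bad_records.append((record, str(exc)))
    return lanes, bad_records


if __name__ == "__main__":
    sample = [
        {"raw": '{"level": "ERROR", "msg": "disk full"}'},
        {"raw": '{"level": "INFO", "msg": "started"}'},
        {"raw": "not json"},
    ]
    lanes, bad_records = run_pipeline(sample)
    print(len(lanes["errors"]), len(lanes["default"]), len(bad_records))
```

In the real tool these stages are configured graphically rather than coded; the sketch only illustrates the record-by-record flow and the separate handling of bad records.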
Streamsets Data Collector
Demo
Streamsets DC: performance and reliability
● Two execution modes are available: standalone and cluster.
● Implemented in Java, so the usual performance best practices and
recommendations for Java applications apply.
● REST services are available for performance monitoring.
● Rules and alerts, for both metrics and data.
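The idea of polling a monitoring REST service and evaluating metric rules can be sketched as follows. This is a hedged illustration, not the Data Collector's own API: the endpoint URL is left to the caller, the flat `{name: value}` layout of the metrics document is an assumption, and the metric names (`error_records`, `input_records`) are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def fetch_metrics(url, timeout=10):
    """GET a JSON metrics document from a monitoring endpoint chosen by the caller."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def triggered_alerts(metrics, rules):
    """Return a message for every rule whose metric exceeds its threshold.

    `metrics` is assumed to be a flat dict of numeric values and `rules`
    a list of (metric_name, threshold) pairs; both shapes are hypothetical.
    """
    alerts = []
    for name, threshold in rules:
        value = metrics.get(name)
        if value is not None and value > threshold:
            alerts.append(f"{name}={value} exceeds {threshold}")
    return alerts


if __name__ == "__main__":
    # A hard-coded snapshot stands in for fetch_metrics(<your endpoint>).
    snapshot = {"error_records": 12, "input_records": 1000}
    rules = [("error_records", 10), ("input_records", 5000)]
    for message in triggered_alerts(snapshot, rules):
        print(message)
```

Keeping the rule evaluation separate from the HTTP call makes the thresholds easy to test without a running instance.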
Streamsets Data Collector: security
● User accounts can be authenticated against LDAP.
● Authorization: the Data Collector provides several roles (admin, manager,
creator, guest).
● Kerberos authentication can be used to connect to origin and destination
systems.
● Follow the usual security best practices (iptables, networking, etc.) for
Java web applications running on Linux machines.
Useful Links
Streamsets Data Collector:
https://streamsets.com/product/
Thanks!
My contacts:
LinkedIn: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia