Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector
Guglielmo Iozzia,
Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area my team belongs to (cloud applications), many questions
needed answers. They related to:
● Defect analysis
● Outage analysis
● Cyber-Security
“Data is the second most important thing in analytics”
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
… and so many tools available to get the data
What are we going to do with all those data?
Issues
● The need to collect data from multiple sources introduces redundancy, which
costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in
general) in managing Big Data tools.
● Low budget.
Alternatives
#1 Panic
#2 Cloning team members
#3 Find a smart way to simplify the data ingestion process
A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy for everyone to use.
● Zero downtime when upgrading the infrastructure, thanks to the logical
isolation of each flow stage.
● Open Source
… something like this
Streamsets Data Collector
Streamsets Data Collector: supported origins
Streamsets Data Collector: available destinations
Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
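To give a feel for what processors such as the JSON Parser and the Stream Selector do to a record, here is a minimal sketch in plain Python. It does not use the Data Collector or any of its APIs: the function names, the `raw` and `severity` fields, and the lane names are all assumptions made up for illustration. Records that fail in a stage are set aside instead of stopping the flow, and a counter is kept per stage.

```python
import json
from collections import Counter


def json_parser(record, field="raw"):
    """Parse the JSON string held in `field` and merge it into the record."""
    parsed = json.loads(record[field])  # raises ValueError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return {**record, **parsed}


def stream_selector(record):
    """Pick a lane for the record; the first matching condition wins."""
    if record.get("severity") == "ERROR":
        return "defects"
    return "default"


def run_pipeline(records):
    """Push each record through the stages, counting outcomes per stage."""
    lanes = {"defects": [], "default": []}
    errors = []          # records that failed in a stage, kept for inspection
    metrics = Counter()  # simple per-stage counters
    for record in records:
        metrics["origin.output"] += 1
        try:
            parsed = json_parser(record)
        except ValueError as exc:
            metrics["json_parser.error"] += 1
            errors.append({"record": record, "error": str(exc)})
            continue
        metrics["json_parser.output"] += 1
        lane = stream_selector(parsed)
        lanes[lane].append(parsed)
        metrics[f"stream_selector.{lane}"] += 1
    return lanes, errors, metrics


if __name__ == "__main__":
    sample = [
        {"raw": '{"severity": "ERROR", "msg": "disk full"}'},
        {"raw": '{"severity": "INFO", "msg": "started"}'},
        {"raw": "not json"},
    ]
    lanes, errors, metrics = run_pipeline(sample)
    print(dict(metrics))
```

In the real tool these stages are configured in the pipeline designer rather than written by hand; the sketch only mirrors the idea of isolated stages with their own counters and error records.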
Streamsets Data Collector
Demo
Streamsets DC: performance and reliability
● Two available execution modes: standalone or cluster
● Implemented in Java, so any performance best practice or recommendation for
Java applications applies here
● REST services are available for performance monitoring
● Rules and alerts (both metric and data)
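The difference between a metric rule and a data rule can be sketched in a few lines of plain Python. This is not the Data Collector's rule engine: the class names, the `error_records` metric, and the `host` field are hypothetical and exist only for this example. A metric rule compares a pipeline counter with a threshold, while a data rule checks what share of a sample of records matches a condition.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MetricRule:
    """Alert when a named counter exceeds a threshold."""
    name: str
    metric: str
    threshold: float

    def triggered(self, metrics: dict) -> bool:
        return metrics.get(self.metric, 0) > self.threshold


@dataclass
class DataRule:
    """Alert when too large a share of sampled records matches a condition."""
    name: str
    condition: Callable[[dict], bool]
    max_fraction: float

    def triggered(self, sample: list) -> bool:
        if not sample:
            return False
        matching = sum(1 for record in sample if self.condition(record))
        return matching / len(sample) > self.max_fraction


def evaluate(metric_rules, data_rules, metrics, sample):
    """Return the names of all rules that fire for this snapshot."""
    alerts = [rule.name for rule in metric_rules if rule.triggered(metrics)]
    alerts += [rule.name for rule in data_rules if rule.triggered(sample)]
    return alerts


if __name__ == "__main__":
    metric_rules = [MetricRule("too many errors", "error_records", 10)]
    data_rules = [DataRule("missing host", lambda r: "host" not in r, 0.25)]
    metrics = {"error_records": 12}
    sample = [{"host": "a"}, {"host": "b"}, {}, {}]
    print(evaluate(metric_rules, data_rules, metrics, sample))
```

Keeping the two kinds of rule separate reflects the distinction on the slide: one looks at numbers about the flow, the other looks at the records themselves.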
Streamsets Data Collector: security
● User accounts can be authenticated against LDAP
● Authorization: the Data Collector provides several roles (admin, manager,
creator, guest)
● You can use Kerberos authentication to connect to origin and destination
systems
● Follow the usual security best practices (iptables, networking, etc.) for
Java web applications running on Linux machines.
Useful Links
Streamsets Data Collector:
https://streamsets.com/product/
Thanks!
My contacts:
LinkedIn: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia