Building Data Pipelines with Cask Hydrator
Jon Gray
CEO, Cask
July 9th, 2016
PROPRIETARY & CONFIDENTIAL
Web Analytics and Reporting Use Case
✦ Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts

✦ Not enough personnel with expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)

✦ Hard to debug and validate, resulting in frequent failures in the production environment

The Challenge: Transfer web log data from S3 to a Hadoop cluster every hour for backup, run analytics on it, and enable realtime reporting of metrics such as the number of successful and failed responses and the most popular web pages.
Demo Example
Load Log Files from S3 to HDFS and perform aggregations/analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs in HDFS as Avro files
• Parse the access log lines into individual fields
• Calculate the total number of requests by IP and status code
• Find the IPs that received the most successful status codes and the most error codes
Sample web access log (Combined Log Format):
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer, Client Info
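The parse and aggregate steps of the demo can be sketched outside Hydrator in a few lines of Python. This is only an illustration of the logic, not how the Hydrator plugins are implemented: the regular expression, the function names, and every log line except the first are assumptions made for the example, and "successful" is taken to mean a 2xx status while "error" is taken to mean 400 and above.

```python
import re
from collections import Counter

# Combined Log Format:
#   host ident user [time] "method uri protocol" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)

# The sample line from the slide.
SAMPLE_LINE = (
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] '
    '"GET /ajax/planStatusHistory HTTP/1.1" 200 508 '
    '"http://builds.cask.co/log" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit '
    'Chrome/38.0.2125.122 Safari/537.36"'
)

# Hypothetical extra lines, invented only to make the aggregation visible.
EXTRA_LINES = [
    '69.181.160.120 - - [08/Feb/2015:04:36:41 +0000] "GET /missing HTTP/1.1" 404 0 "-" "curl/7.37.1"',
    '69.181.160.120 - - [08/Feb/2015:04:36:42 +0000] "GET /admin HTTP/1.1" 403 0 "-" "curl/7.37.1"',
    '10.0.0.7 - - [08/Feb/2015:04:36:43 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "curl/7.37.1"',
    '10.0.0.7 - - [08/Feb/2015:04:36:44 +0000] "GET /about.html HTTP/1.1" 200 - "-" "curl/7.37.1"',
    '10.0.0.7 - - [08/Feb/2015:04:36:45 +0000] "POST /login HTTP/1.1" 500 12 "-" "curl/7.37.1"',
]


def parse_line(line):
    """Split one access log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_ip_and_status(records):
    """Total number of requests for each (IP, status code) pair."""
    return Counter((r["ip"], r["status"]) for r in records)


def top_ip(counts, status_filter):
    """IP with the most requests whose status passes status_filter, or None."""
    totals = Counter()
    for (ip, status), n in counts.items():
        if status_filter(status):
            totals[ip] += n
    return totals.most_common(1)[0][0] if totals else None


if __name__ == "__main__":
    parsed = [parse_line(line) for line in [SAMPLE_LINE] + EXTRA_LINES]
    records = [r for r in parsed if r is not None]
    counts = count_by_ip_and_status(records)
    for (ip, status), n in sorted(counts.items()):
        print(ip, status, n)
    print("most successes:", top_ip(counts, lambda s: 200 <= s < 300))
    print("most errors:", top_ip(counts, lambda s: s >= 400))
```

Lines that do not match the pattern are dropped here; a production pipeline would more likely route them to an error dataset so that malformed input stays visible.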
INGEST any data from any source in real-time and batch
BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS any data to any destination in real-time and batch
Data Pipeline
provides the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations on the data, writing it to one or more data sinks and deriving/
Stack of Data Enablers
Hydrator Studio
✦ Drag-and-drop GUI for visual data pipeline creation

✦ Rich library of pre-built sources, transforms, and sinks for data ingestion and ETL use cases

✦ Separation of pipeline creation from execution framework (MapReduce, Spark, Spark Streaming, etc.)

✦ Hadoop-native and Hadoop distribution agnostic
Hydrator Data Pipeline
✦ Captures metadata, audit, and lineage information, visualized using Cask Tracker

✦ Notifications, centralized metrics and log collection for ease of operability

✦ Simple Java API to build your own sources, transforms, and sinks with complete class-loading isolation

✦ SparkML-based plugins and Python transforms for data scientists
Out-of-the-box Integrations
✦ Elasticsearch, SFTP, Cassandra, Kafka, JMS and many more sources and sinks
Custom Plugins
✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API
Pipeline Implementation
[Diagram labels: Logical, Physical, MR/Spark Executions, Planner, CDAP]
✦ Planner converts the logical pipeline to a physical execution plan

✦ Optimizes and bundles functions into one or more MR/Spark jobs

✦ CDAP is the runtime environment where all the components of the data pipeline are executed

✦ CDAP provides centralized log and metrics collection, transactions, lineage and audit information
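To make the idea of bundling stages into jobs concrete, here is a toy planner in Python. It is a sketch under stated assumptions and not Hydrator's actual planner: it assumes a strictly linear pipeline, and it assumes that a job can hold at most one stage that needs a shuffle (such as a group-by), so a second shuffle stage starts a new job. The `Stage` type, the `needs_shuffle` flag, `plan_jobs`, and the stage names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One node of a logical pipeline (hypothetical model for illustration)."""
    name: str
    needs_shuffle: bool = False  # assumed True for aggregations such as group-by


def plan_jobs(stages: List[Stage]) -> List[List[str]]:
    """Bundle a linear list of stages into jobs with at most one shuffle each."""
    jobs: List[List[str]] = []
    current: List[str] = []
    shuffled = False
    for stage in stages:
        # A second shuffle cannot fit in the current job, so close it first.
        if stage.needs_shuffle and shuffled:
            jobs.append(current)
            current, shuffled = [], False
        current.append(stage.name)
        shuffled = shuffled or stage.needs_shuffle
    if current:
        jobs.append(current)
    return jobs


# Hypothetical logical pipeline modeled on the web log demo.
LOGICAL_PIPELINE = [
    Stage("S3Source"),
    Stage("LogParser"),
    Stage("GroupByIpAndStatus", needs_shuffle=True),
    Stage("TopIps", needs_shuffle=True),
    Stage("HdfsAvroSink"),
]

if __name__ == "__main__":
    for number, job in enumerate(plan_jobs(LOGICAL_PIPELINE), start=1):
        print(f"job {number}: {' -> '.join(job)}")
```

With the example pipeline the two aggregations end up in separate jobs, while the source and parser are bundled with the first aggregation and the sink with the second.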

CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and Running Data Applications on Hadoop
Integrates the Latest Big Data Technologies
Supports All Major Hadoop Distributions
Fully Open Source and Highly Extensible
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Abstraction and Integration Layer
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Hydrator, Tracker
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Hydrator, Tracker
CASK DATA APP PLATFORM
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Self-Service Data Ingestion and ETL for Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
Hydrator Roadmap
✦ Join across multiple data sources (CDAP-5588)

✦ Macro substitutions

✦ Pre-actions in pipelines, similar to post-run notifications

✦ Spark Streaming support for realtime pipelines
Thank You!
cdap-user@googlegroups.com
Twitter @CaskData
Questions?
Hydrating your Data Lake
Data Lake
Enterprise-wide data management platforms for analyzing disparate sources of data in its native format - Gartner
Hydrator
Self-service, Hadoop-native, drag-and-drop open source framework to develop, run and operate data
Data Lake Challenges
Manual processes requiring hand-coding and reliance on command-line tools
Hard to find data and its lineage for data discovery and exploration
Coupling of ingestion and processing drives architecture decisions
Operationalizing processes for production and to maintain SLAs
Ensuring data is in canonical forms with a shared schema usable by others
Coding or filing tickets often required to perform new ingestion and processing tasks
Multiple architectures and technologies used by different teams on different clusters
Guaranteeing compliance in a system that is designed for schema-on-read and raw data
Sharing infrastructure in a multi-tenant environment without low-level QoS support
[Diagram labels: Data Reservoir, Data Pond, Data Lake]
Data Lakes on CDAP
Hydrator framework with templates and plugins enables production workflows in minutes
Never lose data by ensuring all ingested data is tracked with metadata and lineage
Separation of ingestion and processing to support any type, format and rate
Operationalize workflows using scheduling and SLA monitoring with time / partition awareness
Using common transformations and a shared system for defining and exposing schema
Reference architecture ensures a common platform across teams, orgs, ops and security
Multi-tenant namespacing provides data and app isolation, tying together infrastructure
Ensure compliance by requiring the use of specific transformations and validation
Self-service access through Cask Hydrator for the discovery, ingest and exploration of data
[Diagram labels: Data Reservoir, Data Pond, Data Lake]

Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data
