Hadoop at Ayasdi
Mohit Jaggi and Huang Xia, Zhen Li, Ajith Warrier
Overview
- HDFS for storage
- YARN for integration into Hadoop data lake
- Parquet as the file format
- bigdf based on Spark for feature engineering, data wrangling
Architecture
[Diagram labels: Hadoop Data Lake, bigdf, Resource Broker, Apache Spark, YARN, Algorithms, UI/API handler]
!! Audience Poll !!
1. How many data scientists?
2. How many backend engineers?
3. UI/frontend engineers?
4. Using Hadoop in production?
5. Using Spark in production?
6. Personally worked on data bigger than 100GB? 1TB? 10TB? 1PB?
HDFS
HDFS - Motivation
- installed base, large community
- ecosystem to connect to most other data sources
- commodity cluster
- experiments with distributed NAS didn't show enough promise to justify the additional cost and complexity
HDFS - Usage
- used as distributed storage
- jobs dispatched to least loaded node
- run Spark jobs
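The "least loaded node" rule above can be sketched roughly as follows. The slides do not say how load is measured, so this sketch assumes it is the number of jobs running on a node; `Node` and `dispatch` are hypothetical names, not part of the Ayasdi scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    jobs: list[str] = field(default_factory=list)  # jobs currently running here

    @property
    def load(self) -> int:
        # Assumption: load is simply the number of running jobs.
        return len(self.jobs)


def dispatch(nodes: list[Node], job: str) -> Node:
    """Assign the job to the node with the lowest load (first node wins ties)."""
    target = min(nodes, key=lambda n: n.load)
    target.jobs.append(job)
    return target


nodes = [Node("node-a", ["j0"]), Node("node-b"), Node("node-c", ["j1", "j2"])]
placements = [dispatch(nodes, job).name for job in ("job-1", "job-2", "job-3")]
print(placements)  # ['node-b', 'node-a', 'node-b']
```

A real scheduler would likely weigh CPU, memory, and data locality as well; the count-based metric is only the simplest stand-in.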
Fishing for insights...
[Image labels: big data]
YARN - Motivation
- Ayasdi scheduler
- maximize throughput for batch jobs
- minimize latency for interactive "tasklets"
- wanted to deploy in existing Hadoop data lakes
- integrated in-house scheduler with YARN
- "tasklets" get a long-running container
- batch jobs get a container on demand
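The two container policies above can be illustrated with a toy model. It does not call the YARN API; `ContainerPool`, the `group` key, and the integer container ids are all hypothetical, chosen only to show the difference between reusing a container and requesting one per job.

```python
from itertools import count


class ContainerPool:
    """Toy model of the two allocation policies; it does not talk to YARN."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self.long_running: dict[str, int] = {}      # tasklet group -> container id
        self.released: list[tuple[str, int]] = []   # (batch job, container) after completion

    def run_tasklet(self, group: str) -> int:
        # Tasklets reuse one container that stays allocated between calls.
        if group not in self.long_running:
            self.long_running[group] = next(self._next_id)
        return self.long_running[group]

    def run_batch(self, job: str) -> int:
        # Each batch job requests a fresh container and gives it back when done.
        container = next(self._next_id)
        self.released.append((job, container))
        return container


pool = ContainerPool()
t1 = pool.run_tasklet("interactive")
t2 = pool.run_tasklet("interactive")
b1 = pool.run_batch("job-a")
b2 = pool.run_batch("job-b")
print(t1, t2, b1, b2)  # 1 1 2 3
```

In this model the tasklet pays the allocation step once, while every batch job pays it again, which is one plausible reading of why small batch jobs saw extra latency.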
YARN - Challenges
- increased latency observable for small batch jobs
- early adopter pains
- sparse documentation
- not the best API design
[Image labels: big data store, compressed data]
Parquet - Motivation
- legacy: data stored in both row-major and column-major layouts
- requires expensive transpose on ingestion
- we were designing a "tiled file format" when we discovered Parquet
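To make the transpose cost concrete, here is a minimal in-memory sketch of converting between the two layouts. The sample records and function names are made up for illustration; this is not the legacy ingestion code.

```python
# Row-major: one tuple per record (hypothetical sample data).
rows = [
    ("alice", 34, 1.5),
    ("bob", 27, 2.0),
    ("carol", 45, 0.5),
]


def to_column_major(records):
    """Transpose records into one list per column; every value is touched once."""
    return [list(column) for column in zip(*records)]


def to_row_major(columns):
    """Inverse transpose: rebuild one tuple per record."""
    return [tuple(record) for record in zip(*columns)]


columns = to_column_major(rows)
ages = columns[1]  # a whole column is now a single contiguous list
print(ages)  # [34, 27, 45]
```

Keeping both layouts means doing this full pass over the data on every ingestion, which is the step a single columnar file format avoids.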
Parquet - Challenges
- early adopter challenges
- sparse documentation
- needed to access package private APIs
bigdf - Motivation
- born out of experience using Spark for feature engineering
- classes created for RDDs were not reusable across projects
- SQL not expressive enough
bigdf - details
- open source since Sep 2014
- predates Spark DataFrame, so built on the spark-core engine
- experimenting with Catalyst using Spark DataFrame APIs, looks promising
- Python and Scala APIs
- feature engineering library [not open source :-( ]
- fast CSV reader (and other features) contributed to spark-csv
bigdf - future
- wrapper around Spark DF
- to protect from API changes
- to add features, e.g. "sparse column set", since the "round-trip time" for pull requests into large open source projects is high
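The wrapper idea can be sketched without Spark. `BackendV1` and `BackendV2` below are hypothetical stand-ins for two versions of an engine API that renamed a call; `Frame` is a made-up wrapper name and is not the bigdf API.

```python
class BackendV1:
    """Stand-in for an older engine API (hypothetical)."""

    def __init__(self, data):
        self.data = data

    def select(self, *names):
        return BackendV1({n: self.data[n] for n in names})


class BackendV2:
    """Stand-in for a newer engine API that renamed the call (hypothetical)."""

    def __init__(self, data):
        self.data = data

    def project(self, names):
        return BackendV2({n: self.data[n] for n in names})


class Frame:
    """Stable wrapper: callers use columns() whichever backend is underneath."""

    def __init__(self, backend):
        self._backend = backend

    def columns(self, *names):
        backend = self._backend
        if hasattr(backend, "select"):
            return Frame(backend.select(*names))
        return Frame(backend.project(list(names)))

    def to_dict(self):
        return dict(self._backend.data)


data = {"a": [1, 2], "b": [3, 4]}
old = Frame(BackendV1(data)).columns("a").to_dict()
new = Frame(BackendV2(data)).columns("a").to_dict()
print(old == new)  # True
```

Only the wrapper changes when the backend API changes, and extra features can be added to the wrapper without waiting on an upstream pull request.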
Thanks!
www.ayasdi.com
http://engineering.ayasdi.com
https://github.com/AyasdiOpenSource/bigdf
http://www.ayasdi.com/company/careers/
