Zoosk !
Big Data Architecture
MARCH 1, 2016
Marilson Campos
Sr. Data Architect – Data Engineering
Zoosk. Inc. |
About Zoosk
•  Online dating site & mobile apps
•  Uses behavioral data to match couples.
•  80 Countries.
•  25 languages.
About Me !
Email: marilsonc@zoosk.com
-  Led “Large data” projects since 1998.
-  Hadoop since 2009
-  Focus on:
•  Complex data pipeline design.
•  Scaling machine learning processing.
-  Data pipeline projects
•  #1 search engine in Latin America (acquired by Yahoo).
•  Indexing all blogs in the US (Buzzlogic).
•  Machine learning pipelines at scale (Rocketfuel).
•  Machine learning & enterprise data bus (Zoosk).
Agenda
•  Big data @ Zoosk.
•  Data pipelines.
•  Leveraging Impala.
•  Impala and Hive trade-offs.
•  Q & A.
1. Big Data @ Zoosk
2. Data as a product
3. Usable data sets vs. temporary data
4. Hive Streaming
5. Two cluster configuration
6. Impala performance 
Impala has the largest impact.
Parquet adds:
- +30% performance on Hive.
- 4x performance on Impala.
Some Parquet tables will be a lot smaller than in the standard format, in some cases 1/6th of the size.
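To make the Parquet figures concrete, here is a small arithmetic sketch. It assumes that "+30% performance" and "4x performance" mean throughput gains, so query time is divided by 1.3 and 4. The 120-second baseline and the 600 GB table are hypothetical numbers chosen only for illustration, and the same baseline is used for both engines purely to show the arithmetic.

```python
def with_parquet(seconds, speedup):
    """Return the query time after a throughput speedup (1.3 means +30%)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return seconds / speedup


# Hypothetical baseline: a query that takes 120 s on a standard-format table.
BASELINE_SECONDS = 120.0

hive_seconds = with_parquet(BASELINE_SECONDS, 1.3)    # +30% on Hive
impala_seconds = with_parquet(BASELINE_SECONDS, 4.0)  # 4x on Impala

# Hypothetical 600 GB table shrinking to 1/6th of its size (the best case cited).
parquet_gb = 600.0 / 6

print(round(hive_seconds, 1), impala_seconds, parquet_gb)
```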
7. Impala vs. Hive (again!)
System type
- Hive: Batch processing.
- Impala: Interactive.
Resource utilization
- Hive: Conservative.
- Impala: Aggressive.
Predictable time completion
- Hive: Yes. Almost linear. Works well with complex queries.
- Impala: Not always. Generally very fast, but can degenerate in queries with lots of joins.
Allow code injection
- Hive: Yes. Injection of M/R functions with Hive streaming.
- Impala: No. Does not rely on Map/Reduce.
User defined functions
- Hive: Yes.
- Impala: Yes.
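As an illustration of the code injection row, Hive streaming pipes rows to an external script as tab-separated lines on standard input and reads result rows back from standard output. The sketch below is a minimal example of such a script, not Zoosk's actual code; the two-column schema (user_id, event_name) and the normalization step are assumptions made for the example.

```python
import sys


def transform(lines):
    """Yield one tab-separated output row per well-formed input row.

    Hypothetical input schema: user_id <TAB> event_name.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole job
        user_id, event_name = fields
        # Example transformation: normalize the event name.
        yield f"{user_id}\t{event_name.strip().lower()}"


# Only consume stdin when rows are actually piped in, as Hive would do.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in transform(sys.stdin):
        print(row)
```

In Hive, a script like this is typically shipped with `ADD FILE` and invoked through a `SELECT TRANSFORM (...) USING '...' AS (...)` clause; the script and table names would be whatever the pipeline uses.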
8. Where to use Impala or Hive?
Impala:
- We consider it a premium service.
- Use for queries performing aggregations where immediate response brings value. (*)
- In some cases, it makes sense for apps to call it.
Hive:
- Standard service.
- Use for processing the data pipeline.
- Use integrated with Java, Python and R.
- Use for complex queries on very large sets.
(*) Everybody wants their queries to run faster. This is not a valid reason. :)
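The guidance above can be sketched as a small routing rule. This is only one way to encode it: the `Query` fields and the `choose_engine` function are hypothetical names, and the precedence (pipeline work and complex queries on very large sets go to Hive first) is an assumption rather than something the slide spells out.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical description of a query to be routed."""
    is_aggregation: bool
    needs_immediate_response: bool
    is_pipeline_step: bool = False
    is_complex_on_very_large_set: bool = False


def choose_engine(q: Query) -> str:
    """Return "impala" only when an immediate answer brings value."""
    # Pipeline processing and complex queries on very large sets stay on Hive.
    if q.is_pipeline_step or q.is_complex_on_very_large_set:
        return "hive"
    # Impala is the premium service: aggregations that need a fast response.
    if q.is_aggregation and q.needs_immediate_response:
        return "impala"
    # "Everybody wants faster queries" is not a reason by itself.
    return "hive"


print(choose_engine(Query(is_aggregation=True, needs_immediate_response=True)))
```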
Q & A

Impala use case @ Zoosk
