Stratio Meta 
An efficient distributed datahub with batch and 
streaming query capabilities 
Daniel Higuero 
Alvaro Agea 
dhiguero@stratio.com 
alvaro@stratio.com 
#CassandraSummit 2014
Stratio Crossdata 
An efficient distributed datahub with batch and 
streaming query capabilities 
Who are we? 
STRATIO 
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
We love… 
Cassandra 
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
Agenda
• Introduction
• Crossdata architecture
• Metadata management
• Streaming sources
• Full text search
• Spark and Crossdata
• ODBC
• The future
Introduction 
o Big Data analysis is commonly associated with batch processing
  • Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o Users buy Big Data platforms, but
  • How do I start?
  • What is my entry point to the platform?
What do our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
✓ Integration with BI tools
✓ Join operations
✓ Support for streaming sources
✓ Integration with other data stores
✓ Ability to query data without thinking about the schema (non-indexed data)
Crossdata 
o A new technology that:
  • Is not limited by the underlying datastore capabilities
  • Leverages Spark to perform non-natively supported operations
  • Supports batch and streaming queries
  • Supports multiple clusters and technologies
Our architecture 
Connecting to the outside world 
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector defines its capabilities
Our planner will choose the best connector for each query
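The selection step can be sketched as follows. This is a minimal illustration, not Crossdata's planner: the `Connector` class, the operation names, and the numeric `cost` field are all hypothetical, and the only idea taken from the slides is that each connector declares its capabilities and the planner picks among the connectors able to run the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical connector descriptor (not the real IConnector interface)."""
    name: str
    capabilities: frozenset  # operations this connector can execute
    cost: int                # assumed relative cost; lower is preferred


def choose_connector(required_ops, connectors):
    """Return the cheapest connector that supports every required operation."""
    required = set(required_ops)
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative registry: a cheap native connector with few capabilities and
# a more expensive Spark-backed connector that covers the remaining ones.
CONNECTORS = [
    Connector("native-cassandra", frozenset({"select", "filter_pk"}), cost=1),
    Connector(
        "spark-cassandra",
        frozenset({"select", "filter_pk", "join", "group_by"}),
        cost=10,
    ),
]

simple_choice = choose_connector({"select", "filter_pk"}, CONNECTORS)
join_choice = choose_connector({"select", "join"}, CONNECTORS)
```

Under these assumptions a simple primary-key lookup goes to the native connector, while a join, which only the Spark-backed connector declares, goes to Spark.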
Query execution 
[Figure: query pipeline with the stages Parsing, Validation, Planning and Execution; execution is dispatched to Connector1, Connector2 or Connector3, with C* as a datastore]
Multi-cluster support 
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • Each table is stored in exactly one datastore
Logical and physical mapping 
SELECT * FROM app.users;
[Figure: the App catalog groups the Users, Test and old_users tables, which are stored across C* production, C* development and other datastores]
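A minimal sketch of the logical-to-physical lookup, assuming a plain dictionary as the catalog. The cluster names and the pairing of each table with a cluster are illustrative assumptions; the slide only shows that tables of one catalog can live in different datastores.

```python
# Hypothetical catalog: logical "catalog.table" name -> (cluster, physical table).
# The specific pairings below are assumed for illustration only.
CATALOG = {
    "app.users": ("cassandra-production", "users"),
    "app.test": ("cassandra-development", "test"),
    "app.old_users": ("other-datastore", "old_users"),
}


def resolve(qualified_name):
    """Map a logical table name to the single cluster that stores it."""
    try:
        return CATALOG[qualified_name]
    except KeyError:
        raise LookupError(f"unknown table: {qualified_name}") from None


cluster, physical_table = resolve("app.users")
```

The user writes `app.users` and never names a cluster; the lookup decides where the query is sent.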
Metadata management
Metadata in the era of Schemaless NoSQL datastores 
o Some datastores are schemaless, but our applications are not!
  • Flexible schemas vs. schemaless
  • Crossdata provides a Metadata manager that stores schemas for any datasource
    - Remember ODBC and those BI tools
Metadata management 
[Figure: the Metadata Manager sits between the Connector (C* production) and the Metadata Store (Infinispan)]
1. If the connector does not support metadata operations, those are skipped
2. Updated metadata information is maintained among Crossdata servers using Infinispan
Streaming sources 
Managing streaming sources 
o Today's use cases expect some type of streaming datasource
  • Streaming data has an ephemeral nature
  • In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables
[Figure: a streaming source described by {schema:{col1:…},…} is exposed as an ephemeral table with columns col1:text, col2:int, col3:int, col4:text, which is read by Streaming_query0 … Streaming_queryn]
Streaming queries 
o Streaming queries are infinite by definition
  • A time window is defined to create a batch-like view of the rows ingested by the system in that period
  • The user launches queries specifying a processing time window
    - Crossdata provides methods to list and stop running streaming queries
Streaming queries: windows syntax 
SELECT fieldGroup,avg(Field2) 
FROM eph_table 
WITH WINDOW 5 minutes 
WHERE field1=100 AND field2>100 
GROUP BY fieldGroup;
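What such a windowed query computes can be sketched in plain Python. This is an illustration under stated assumptions, not Crossdata's implementation: it assumes fixed, non-overlapping (tumbling) windows, which the slides do not specify, and the function name and row layout are hypothetical.

```python
from collections import defaultdict


def windowed_avg(rows, window_seconds):
    """Average field2 per fieldGroup inside fixed, non-overlapping windows.

    rows: iterable of (timestamp_seconds, field_group, field1, field2).
    Returns {window_start: {field_group: average_of_field2}}.
    """
    # window_start -> field_group -> [running_sum, row_count]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for ts, group, field1, field2 in rows:
        if field1 == 100 and field2 > 100:  # mirrors the WHERE clause
            window_start = (ts // window_seconds) * window_seconds
            acc = sums[window_start][group]
            acc[0] += field2
            acc[1] += 1
    return {
        start: {g: total / count for g, (total, count) in groups.items()}
        for start, groups in sums.items()
    }


rows = [
    (10, "a", 100, 150),
    (20, "a", 100, 250),
    (30, "b", 100, 50),    # dropped by the filter: field2 is not > 100
    (310, "a", 100, 300),  # lands in the next 5-minute window
]
result = windowed_avg(rows, window_seconds=300)
# result == {0: {"a": 200.0}, 300: {"a": 300.0}}
```

Each window yields a small, finite result set, which is what makes an otherwise infinite streaming query look like a batch query.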
Joining batch and streaming 
SELECT * FROM demo.temporal
WITH WINDOW 10 secs
INNER JOIN demo.users
ON users.name = temporal.name;
[Figure: the query is split into a streaming part (SELECT * FROM demo.temporal WITH WINDOW 10 secs) and a batch part (SELECT * FROM demo.users), which are then combined by INNER JOIN ON users.name = temporal.name]
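The combination step can be sketched as a hash join between one window of streaming rows and a static table. This is a sketch only: the function name, the dictionary row format, and the sample data are hypothetical, and Crossdata may execute the join differently (for example inside Spark).

```python
def join_window_with_table(window_rows, table_rows, key):
    """Inner-join one window of streaming rows with a static table on `key`."""
    # Build a hash index over the batch side once per window.
    index = {}
    for row in table_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for event in window_rows:
        # Events without a matching table row are dropped (inner join).
        for match in index.get(event[key], []):
            joined.append({**match, **event})
    return joined


# Illustrative data standing in for demo.users and one 10-second window
# of demo.temporal.
users = [
    {"name": "ana", "city": "Madrid"},
    {"name": "bob", "city": "Seville"},
]
window = [
    {"name": "ana", "clicks": 3},
    {"name": "eve", "clicks": 1},
]
result = join_window_with_table(window, users, "name")
# result == [{"name": "ana", "city": "Madrid", "clicks": 3}]
```

The same function would be called once per window, so the batch side can be re-read or cached between windows depending on how fresh it needs to be.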
Full text search 
Full text search with 
o Clients request the ability to perform full text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
Stratio Lucene 2i 
[Figure: a ring of C* nodes, each with its own local Lucene index]
Full text search queries 
o With Crossdata, we simplify:
  • The creation syntax
  • The query syntax using the match operator
CREATE FULLTEXT INDEX ON app.users(name,email);
SELECT * FROM app.users
WHERE email MATCH '*@stratio.com';
Spark & Stratio Crossdata
Why Spark? 
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
  • In-memory processing
  • RDD abstraction
  • Simpler API
  • Increased flexibility (e.g., no need for identity mapping)
What about Spark SQL? 
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
Query approach 
[Figure: SparkSQL approach: SparkSQL, then Spark, then Cassandra. Crossdata approach: Stratio Crossdata, then either Spark or the native driver, then Cassandra]
Our Cassandra-Spark integration 
o Project started in June 2013
  - With the objective of providing a method to interact with Cassandra from Spark
  - Initial approach based on the HadoopInputFormat interface
  - Current version uses the native DataStax Java driver
https://github.com/Stratio/stratio-deep
Our Cassandra-Spark integration 
o Benchmark in progress comparing our solution with the DataStax Spark driver
  • Results highly influenced by the split size
  • Initial results are promising for Stratio Spark Integration using DataStax default values
  • Group by: up to 40% faster
  • Join: up to 17% faster
  • Stay tuned for the benchmark publication!
Spark vs Lucene 2i 
[Figure: plot of Time against Records/node for Spark and Lucene 2i]
ODBC 
Stratio Crossdata ODBC 
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it for Crossdata using the Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
The future 
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
We are looking for an Apache Champion 
Can you help us?
A wish list for Cassandra 
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)

Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities
