Version 1.0
Spark Graph Operations with
DSEGraphFrames Scala API
Scala libraries for interacting with and processing data from
graph databases like DSE Graph.
Obioma Anomnachi
Engineer @ Anant
DSE Graph
● DSE Graph is a distributed graph database built on top of Cassandra that is part of DataStax
Enterprise (DSE)
○ It maintains many of the advantages of using Cassandra/DSE, including potentially global distribution, zero
downtime, and DSE security protection
○ It also gains many of the benefits of being a graph database, namely in the storage and analysis of complex and
interrelated data sets
● Can be combined with DSE's included Search and Analytics capabilities
● Integrates with DSE support tools like OpsCenter and DataStax Studio
DSE Graph Analytics
● Most graph traversals (operations that follow the adjacency of nodes and edges within a graph)
can be done in real time without making use of DSE Analytics (Spark) resources
● Two kinds of queries are the exception
○ Deep queries are traversals on a graph with extremely high density or branching factor (nodes are on average
connected to a large number of other nodes)
○ Scan queries traverse whole graphs or large parts of graphs
○ Either of these can require memory or computational resources beyond what the normal processing of graph
queries can provide
■ In these cases we can get better performance by having these queries run via DSE Analytics
● There are two methods for performing analytical queries on DSE Graph instances
○ OLAP queries use an alternate traversal source that uses the SparkGraphComputer to run queries on the
DSE Analytics nodes
○ The DSEGraphFrames library supports a subset of the Gremlin graph traversal language for use in Java and
Scala applications running on Spark
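To make the cost of deep queries concrete, here is a minimal sketch in plain Scala collections. It does not use DSE Graph or Spark; the `TraversalCost` object, the adjacency-map representation, and the sample `tree` are assumptions made only for illustration. It shows how the set of vertices a traversal touches grows with depth and branching factor.

```scala
object TraversalCost {
  // Hypothetical toy graph: vertex id -> ids of adjacent vertices.
  type Graph = Map[Int, Set[Int]]

  // Average number of neighbours per vertex (the branching factor).
  def averageBranchingFactor(g: Graph): Double =
    if (g.isEmpty) 0.0 else g.values.map(_.size).sum.toDouble / g.size

  // All vertices a breadth-first traversal touches within `depth` hops of `start`.
  def reachableWithin(g: Graph, start: Int, depth: Int): Set[Int] = {
    @annotation.tailrec
    def loop(frontier: Set[Int], seen: Set[Int], hopsLeft: Int): Set[Int] =
      if (hopsLeft == 0 || frontier.isEmpty) seen
      else {
        // Expand the frontier by one hop, dropping vertices already visited.
        val next = frontier.flatMap(v => g.getOrElse(v, Set.empty[Int])) -- seen
        loop(next, seen ++ next, hopsLeft - 1)
      }
    loop(Set(start), Set(start), depth)
  }

  // Sample data: a complete binary tree with 7 vertices, edges pointing to children.
  val tree: Graph = Map(
    1 -> Set(2, 3),
    2 -> Set(4, 5),
    3 -> Set(6, 7),
    4 -> Set.empty[Int],
    5 -> Set.empty[Int],
    6 -> Set.empty[Int],
    7 -> Set.empty[Int]
  )
}
```

Calling `TraversalCost.reachableWithin(TraversalCost.tree, 1, 2)` visits every vertex of the sample tree after only two hops. On a real graph with a much higher branching factor, the visited set grows far faster, which is the situation where moving the query to DSE Analytics can pay off.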
OLAP Queries
● Normal DSE Graph queries use Online Transactional Processing (OLTP)
○ Consists of a large number of short transactions for processing queries quickly
○ Used primarily for data entry and retrieval
○ Uses filters and subgraphs to speed up access to data in specific parts of the larger graph
● Online Analytical Processing (OLAP) is a Spark-backed method for performing multidimensional
data analysis
○ Takes longer than OLTP queries
○ Works by interpreting the graph as a sequence of “star graphs”, each centered on a single vertex
○ Intended for queries that process the entire graph or at least large portions of it
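The "star graph" idea can be sketched in plain Scala as follows. This is an illustrative model only, not the SparkGraphComputer implementation; the `StarGraphs` object, the `Edge` and `StarGraph` case classes, and both functions are hypothetical names. Each star graph holds one vertex together with its incident edges, so a whole-graph computation can be expressed as independent work per star.

```scala
object StarGraphs {
  final case class Edge(src: Int, label: String, dst: Int)

  // One vertex plus the edges that leave it and the edges that enter it.
  final case class StarGraph(center: Int, outEdges: Seq[Edge], inEdges: Seq[Edge])

  // Split a graph, given as a vertex list and an edge list, into one star graph per vertex.
  def toStarGraphs(vertices: Seq[Int], edges: Seq[Edge]): Seq[StarGraph] = {
    val bySource = edges.groupBy(_.src)
    val byTarget = edges.groupBy(_.dst)
    vertices.map { v =>
      StarGraph(v, bySource.getOrElse(v, Seq.empty), byTarget.getOrElse(v, Seq.empty))
    }
  }

  // A scan-style computation done star by star: the degree of every vertex.
  def degrees(stars: Seq[StarGraph]): Map[Int, Int] =
    stars.map(s => s.center -> (s.outEdges.size + s.inEdges.size)).toMap
}
```

Because each star graph is self-contained, the per-vertex work in `degrees` needs no shared state, which is what makes this representation a natural fit for a distributed engine such as Spark.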
DSE GraphFrame
● Spark API for analytics operations on DSE Graph
○ Inspired by Databricks’ GraphFrame library
○ Supports a subset of the Gremlin graph traversal language
○ Faster than OLAP queries for doing filtering and counts
● Graph represented as two virtual tables
○ V() method for the vertex DataFrame
○ E() method for the edge DataFrame
● Can be used to import/export graphs
● Also supports a subset of Apache TinkerPop traversals
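The two-table view can be sketched with plain Scala collections standing in for the vertex and edge DataFrames. This is not the DSEGraphFrames API; the `TwoTableGraph` object, the row case classes, the sample data, and both helper functions are assumptions for illustration. It shows why filters and counts become simple table operations once the graph is laid out this way.

```scala
object TwoTableGraph {
  // Hypothetical row shapes for the vertex table and the edge table.
  final case class VertexRow(id: Int, label: String, name: String)
  final case class EdgeRow(src: Int, dst: Int, label: String)

  val vertices: Seq[VertexRow] = Seq(
    VertexRow(1, "person", "alice"),
    VertexRow(2, "person", "bob"),
    VertexRow(3, "software", "spark")
  )

  val edges: Seq[EdgeRow] = Seq(
    EdgeRow(1, 2, "knows"),
    EdgeRow(1, 3, "uses"),
    EdgeRow(2, 3, "uses")
  )

  // Filter and count on the vertex table alone.
  def countByLabel(label: String): Int = vertices.count(_.label == label)

  // One outgoing hop, expressed as a join: vertex table -> edge table -> vertex table.
  def outNames(name: String, edgeLabel: String): Seq[String] =
    for {
      v <- vertices if v.name == name
      e <- edges if e.src == v.id && e.label == edgeLabel
      w <- vertices if w.id == e.dst
    } yield w.name
}
```

Here `TwoTableGraph.countByLabel("person")` touches only the vertex table, while `TwoTableGraph.outNames("alice", "uses")` joins the two tables to follow one hop.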
Demo
● https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/quickStart/graphQSTOC.html#QuickStartGraphschema
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Cassandra Lunch #95: Spark Graph Operations with DSEGraphFrames Scala API
