Developing Apache Spark Jobs
in .NET using Mobius
Kaarthik Sivashanmugam
@kaarthikss
dotnetfringe 2016
Apache Spark
• General purpose cluster computing system for big data processing
and analytics
• Ease of programming
• High performance
• Unified API to solve a diverse set of complex data problems
• API in Scala, Java, Python & R
Apache Spark Key Concepts
• Data
• RDD – Resilient Distributed Dataset
• Transformation & Action
• DataFrame
• DStream
• Cluster
• Driver
• Executor
Mobius: C# API for Spark
• Enable organizations invested deeply in .NET to build Apache Spark
applications in C#
• Reuse of existing .NET libraries in Spark applications
.NET & Spark
Scala/Java API
SparkR PySpark
Mobius: C# API
Apache Spark
Spark Apps in .NET
Word Count in Spark using RDD
Scala
RDD of lines in the file
RDD of words in the file
RDD of tuple - (word, 1)
RDD of tuple - (word, count)
Action that triggers job
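The code on this slide did not survive extraction, only its annotations. As a rough illustration of those five steps, here is a plain-Python sketch with no Spark dependency: generators stand in for lazy RDD transformations, and materializing the result stands in for the action. The helper names (`flat_map`, `reduce_by_key`, `word_count`) are hypothetical and are not the Spark or Mobius API.

```python
def flat_map(func, items):
    """Lazily apply func to each item and flatten the results."""
    for item in items:
        yield from func(item)


def reduce_by_key(func, pairs):
    """Lazily merge the values of equal keys; runs only when iterated."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    yield from totals.items()


def word_count(lines):
    """Chain the transformations; nothing is read from `lines` yet."""
    words = flat_map(str.split, lines)                  # words in the file
    pairs = ((word, 1) for word in words)               # tuples (word, 1)
    counts = reduce_by_key(lambda a, b: a + b, pairs)   # tuples (word, count)
    return counts


if __name__ == "__main__":
    lines = ["to be or not to be", "to do"]             # stands in for the file
    counts = word_count(lines)                          # still lazy
    print(dict(counts))                                 # the "action"
```

Building `counts` does no work; the input is consumed only when `dict(counts)` iterates it, which mirrors the slide's point that an action triggers the job.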
Word Count in Spark using RDD
C#
Scala
F#
Develop & Launch Mobius Applications
Spark Client
A Get Mobius release
B Get Mobius driver and dependencies
1 Add reference to Mobius package in NuGet
2 Develop, debug, test Mobius driver application
3 Build Mobius driver
C Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)
Example: sparkclr-submit.cmd
--master spark://IP:PORT
--total-executor-cores 200
--executor-memory 12g
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=hdfs://nn/path/to/eventlog
--exe Pi.exe D:\Mobius\examples\Pi
Demo
Implementing a simple Mobius driver program using DataFrame
Structured Data in Mobius using DataFrame
JSON Cassandra
Note – Dataset is replacing DataFrame in Spark
Mobius & Spark
Driver: C# Driver (CLR) <-> IPC sockets <-> SparkContext (JVM)
Workers (three shown): C# Worker (CLR) <-> IPC sockets <-> SparkExecutor (JVM)
Mobius can be used with any existing Spark cluster (Standalone, YARN) on Windows & Linux
Mobius in Linux
• Mono is used to run Mobius with Spark on Linux
• Mobius project CI (build, unit & functional tests) runs on Ubuntu
• Mobius validated on Ubuntu, CentOS, OS X
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
Kafka Message Processing in Mobius using DStream
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
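The streaming code for this slide is not in the transcript. To make "count logs by loglevel within a time window" concrete, the following plain-Python sketch simulates it without Spark or Kafka. It assumes that each message begins with its log level and that a list of micro-batches stands in for the Kafka DStream; `parse_level` and `windowed_level_counts` are hypothetical names, not Mobius API.

```python
from collections import Counter, deque


def parse_level(message):
    """Assumed log format: the level is the first token, e.g. 'ERROR disk full'."""
    return message.split(" ", 1)[0]


def windowed_level_counts(batches, window_batches):
    """For each micro-batch, yield level counts over the last `window_batches` batches."""
    window = deque(maxlen=window_batches)   # the oldest batch drops out automatically
    for batch in batches:
        window.append(Counter(parse_level(m) for m in batch))
        total = Counter()
        for counts in window:
            total.update(counts)            # sum the per-batch counts in the window
        yield dict(total)


if __name__ == "__main__":
    batches = [
        ["ERROR disk full", "INFO started"],
        ["INFO heartbeat"],
        ["ERROR timeout", "ERROR retry failed"],
    ]
    for counts in windowed_level_counts(batches, window_batches=2):
        print(counts)                       # stands in for "save log count"
```

In Spark Streaming a windowed reduce-by-key produces this kind of result; here the window is measured in batches rather than seconds to keep the sketch small.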
Internals of Driver & Worker
Driver-side Interop
1 Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2 CSharpBackend: launch Netty server creating proxy for JVM calls
3 C# Driver (CLR): launch C# process using port number from CSharpBackend
C# side (CLR): SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD …
Interop Components: create and manage proxies for JVM objects
JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, … CSharpRDD
CSharpBackend: mirror C#-side operations, invoke JVM methods
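The proxy idea on this slide can be sketched in a few lines of plain Python. This is not Mobius's wire protocol: JSON lines over a local socket pair, and the `Backend`, `Proxy`, and `FakeSparkConf` classes, are all assumptions made for illustration. The backend owns the real objects and the driver side holds only an object id, forwarding each method call.

```python
import json
import socket
import threading


class Backend:
    """Stand-in for the JVM-side backend: owns the real objects, keyed by id."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        object_id = len(self._objects) + 1
        self._objects[object_id] = obj
        return object_id

    def serve(self, conn):
        # One JSON request per line in, one JSON reply per line out.
        with conn, conn.makefile("r", encoding="utf-8") as requests, \
                conn.makefile("w", encoding="utf-8") as replies:
            for line in requests:
                call = json.loads(line)
                target = self._objects[call["id"]]
                result = getattr(target, call["method"])(*call["args"])
                replies.write(json.dumps({"result": result}) + "\n")
                replies.flush()


class Proxy:
    """Driver-side stand-in: holds only an object id and forwards calls."""

    def __init__(self, conn, object_id):
        self._conn = conn
        self._rfile = conn.makefile("r", encoding="utf-8")
        self._wfile = conn.makefile("w", encoding="utf-8")
        self._id = object_id

    def call(self, method, *args):
        request = {"id": self._id, "method": method, "args": list(args)}
        self._wfile.write(json.dumps(request) + "\n")
        self._wfile.flush()
        return json.loads(self._rfile.readline())["result"]

    def close(self):
        self._rfile.close()
        self._wfile.close()
        self._conn.close()


class FakeSparkConf:
    """Hypothetical backend-side object; not a Spark class."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return True

    def get(self, key):
        return self._settings[key]


def run_demo():
    backend = Backend()
    conf_id = backend.register(FakeSparkConf())
    driver_end, backend_end = socket.socketpair()
    threading.Thread(target=backend.serve, args=(backend_end,), daemon=True).start()
    conf = Proxy(driver_end, conf_id)
    conf.call("set", "spark.app.name", "demo")
    value = conf.call("get", "spark.app.name")
    conf.close()
    return value


if __name__ == "__main__":
    print(run_demo())
```

The slide describes the C# process connecting by a port number obtained from CSharpBackend; the sketch uses `socket.socketpair()` instead so that it runs in a single process.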
Worker-side Interop
Spark Worker: Executor (JVM) with CSharpRDD; CSharpWorker.exe (CLR)
1 Compute
2 Launch CSharpWorker.exe
3 Read bytes
4 Execute C# operation
5 Write bytes
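The read, execute, write loop on the worker side can be sketched as follows. The framing here (a 4-byte big-endian length prefix) and the function names are assumptions for illustration, not the format Mobius uses; a local socket pair stands in for the connection between the executor and the worker process.

```python
import socket
import struct


def send_frame(sock, payload):
    """Write a length-prefixed frame (assumed framing: 4-byte big-endian length)."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes or raise if the peer closes early."""
    buffer = b""
    while len(buffer) < size:
        chunk = sock.recv(size - len(buffer))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buffer += chunk
    return buffer


def recv_frame(sock):
    """Read one length-prefixed frame."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def worker_step(sock, operation):
    """One worker iteration: read bytes, execute the operation, write bytes."""
    data = recv_frame(sock)       # 3 Read bytes
    result = operation(data)      # 4 Execute the user operation
    send_frame(sock, result)      # 5 Write bytes


def run_demo():
    executor_side, worker_side = socket.socketpair()
    with executor_side, worker_side:
        send_frame(executor_side, b"error warn error")
        worker_step(worker_side, lambda payload: payload.upper())
        return recv_frame(executor_side)


if __name__ == "__main__":
    print(run_demo())
```

The payload is small enough to fit in the socket buffers, so the demo can run both ends in one thread; a real worker would run as a separate process and loop over many frames.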
Mobius Project Info
• https://github.com/Microsoft/Mobius
• MIT license
• Discussions
• StackOverflow: tag “SparkCLR”
• Gitter: https://gitter.im/Microsoft/Mobius
• Twitter: @MobiusForSpark
Mobius Project Status
• Past Releases
• v1.5.200 (Spark 1.5.2)
• v1.6.100 (Spark 1.6.1)
• Upcoming Releases
• v1.6.200 (Spark 1.6.2)
• v2.0.000 (Spark 2.0.0)
• Work planned/in progress
• Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?)
• Exploration of support for ML scenarios
• Idiomatic F# API (?)
• Support for .NET Core
Thank you
Mobius is production-ready & cloud-ready
Use Mobius to build Apache Spark jobs in .NET
Contribute to github.com/Microsoft/Mobius
@MobiusForSpark

Editor's Notes

  • #3: RDD: a fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel. An RDD can be persisted in memory so that it is reused efficiently across parallel operations, and RDDs recover automatically from node failures. Transformations create a new dataset from an existing one and are lazy; actions return a value to the driver program after running a computation on the dataset. A DataFrame is a distributed collection of data organized into named columns. A DStream represents a continuous stream of data.
  • #9: Exe icon credit – Icon made by Freepik from www.flaticon.com