MOBIUS: C# BINDING FOR SPARK
Kaarthik Sivashanmugam
Microsoft
@kaarthikss
Quick Background
• Business Scenario: Next-gen near real-time
processing of Bing.com logs
– Size of raw logs: TBs per hour
– C# library for processing ~ in use for several years
• Yesterday’s talk “Five Lessons Learned in Building Streaming
Applications at Microsoft Bing Scale” covers this scenario
& challenges
C# API - Motivations
• Enable organizations invested deeply in .NET to
build Apache Spark applications in C#
• Reuse of existing .NET libraries in Spark
applications
Why Yet Another Language Binding
Spark Survey 2015 Results [charts: “Fastest growing areas from 2014 to 2015”, “Most important aspects of Spark”]
Popularity of C#
• StackOverflow.com Developer Survey
• RedMonk Programming Language Rankings
.NET ecosystem ~ enabling languages like F#
C# API - Goal
Make C# a first-class language for building
Apache Spark applications
Word Count Example in C#
[Code listings: Scala, C#]
Kafka Example in C#
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
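The core of the steps above, counting logs by log level within a time window, can be illustrated without Spark. This is a minimal plain-Python sketch, not the Mobius or Spark Streaming API; the record layout `(timestamp_seconds, log_level)`, the function name `count_by_level_in_window`, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def count_by_level_in_window(records, window_start, window_seconds):
    """Count records per log level inside [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    counts = Counter()
    for timestamp, level in records:
        if window_start <= timestamp < window_end:
            counts[level] += 1
    return dict(counts)

# Hypothetical sample: (timestamp in seconds, log level)
sample_logs = [(0, "INFO"), (3, "ERROR"), (7, "INFO"), (12, "WARN"), (14, "ERROR")]
window_counts = count_by_level_in_window(sample_logs, window_start=0, window_seconds=10)
```

In a streaming job the same grouping would run once per window on each batch of records, with the result saved after every window.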
Mobius: C# API for Spark
[Diagram: Apache Spark exposes the Scala/Java API, SparkR, PySpark, and the C# API; Spark apps in C# are built on the C# API.]
Develop & Launch Mobius Applications
• Build the driver
– 1: Add reference to Mobius package in NuGet
– 2: Develop, debug, test Mobius driver application
– 3: Build Mobius driver
• Spark Client
– A: Get Mobius release
– B: Get Mobius driver and dependencies
– C: Run sparkclr-submit.cmd or sparkclr-submit.sh ~ runs Spark job
Example: sparkclr-submit.cmd
--master spark://IP:PORT
--total-executor-cores 200
--executor-memory 12g
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=hdfs://nn/path/to/eventlog
--exe Pi.exe D:\Mobius\examples\Pi
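The example invocation can also be assembled programmatically, which helps when the same job is launched with different settings. This is a sketch only: `build_submit_args` is a hypothetical helper, not part of Mobius, and it uses only the flags that appear in the example above. Writing the driver directory with backslashes is an assumption about the path in the example.

```python
def build_submit_args(master, exe, driver_dir, total_executor_cores=None,
                      executor_memory=None, conf=None):
    """Return the argument list for a sparkclr-submit invocation (hypothetical helper)."""
    args = ["--master", master]
    if total_executor_cores is not None:
        args += ["--total-executor-cores", str(total_executor_cores)]
    if executor_memory is not None:
        args += ["--executor-memory", executor_memory]
    # Each Spark setting becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    # The driver executable and its directory come last, as in the example.
    args += ["--exe", exe, driver_dir]
    return args

submit_args = build_submit_args(
    master="spark://IP:PORT",
    exe="Pi.exe",
    driver_dir=r"D:\Mobius\examples\Pi",  # assumed path layout
    total_executor_cores=200,
    executor_memory="12g",
    conf={
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "hdfs://nn/path/to/eventlog",
    },
)
command_line = " ".join(["sparkclr-submit.cmd"] + submit_args)
```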
Mobius & Spark
[Diagram: on the driver, the C# Driver (CLR) talks to SparkContext (JVM) over IPC sockets; on each worker, a C# Worker (CLR) talks to a SparkExecutor (JVM) over IPC sockets.]
• Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux
Mobius in Linux
• Mono (open source implementation of .NET framework) used for
C# with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Users reported using Mobius in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in
Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub
Project Info
• https://github.com/Microsoft/Mobius
Contributions welcome!
• MIT license
• Discussions
– StackOverflow: tag “SparkCLR”
– Gitter: https://gitter.im/Microsoft/Mobius
– Twitter: @MobiusForSpark
Project Status
• Past Releases
– v1.5.200 (Spark 1.5.2)
– v1.6.100 (Spark 1.6.1)
• Upcoming Release
– v2.0.000 (Spark 2.0.0)
• Work in progress
– Support for interactive scenarios (Zeppelin/Jupyter integration)
– Exploration of support for ML scenarios
– Idiomatic F# API
UNDER THE HOOD
CSharpRDD
• C# operations use CSharpRDD which needs CLR to execute
– If no C# transformation or UDF, CLR is not needed ~ execution is
entirely JVM-based
• RDD<byte[]>
– Data is stored as serialized objects and sent to C# worker process
• Transformations are pipelined when possible
– Avoids unnecessary serialization & deserialization within a stage
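Why pipelining matters can be shown by counting serialization round trips. This is a toy model, not Mobius code: `pickle` stands in for whatever serializer is actually used, and `SerDeCounter`, `run_unpipelined`, and `run_pipelined` are names invented for this sketch.

```python
import pickle

class SerDeCounter:
    """Counts how often data crosses the simulated JVM <-> worker boundary."""
    def __init__(self):
        self.serialize_calls = 0
        self.deserialize_calls = 0

    def dumps(self, rows):
        self.serialize_calls += 1
        return pickle.dumps(rows)

    def loads(self, payload):
        self.deserialize_calls += 1
        return pickle.loads(payload)

def run_unpipelined(payload, functions, counter):
    """Each transformation does its own deserialize -> apply -> serialize round trip."""
    for fn in functions:
        rows = counter.loads(payload)
        payload = counter.dumps([fn(row) for row in rows])
    return payload

def run_pipelined(payload, functions, counter):
    """Transformations are chained, so the partition is deserialized and serialized once."""
    rows = counter.loads(payload)
    for fn in functions:
        rows = [fn(row) for row in rows]
    return counter.dumps(rows)

transformations = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
partition = pickle.dumps([1, 2, 3])

unpipelined_counter = SerDeCounter()
unpipelined_result = pickle.loads(run_unpipelined(partition, transformations, unpipelined_counter))

pipelined_counter = SerDeCounter()
pipelined_result = pickle.loads(run_pipelined(partition, transformations, pipelined_counter))
```

Both runs produce the same rows; the pipelined run pays for one deserialize and one serialize regardless of how many transformations are chained.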
Driver-side Interop
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
Interop Components
• CLR side (SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, …): create and manage proxies for JVM objects; invoke JVM methods
• JVM side (SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD): mirror C#-side operations
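The proxy idea on the driver side can be sketched in a few lines. This is an illustration under stated assumptions, not the Mobius implementation: the real calls cross a socket to the backend, while here everything runs in one process, and `FakeBackend`, `RemoteConf`, and `ConfProxy` are hypothetical names.

```python
import itertools

class FakeBackend:
    """Stands in for the JVM-side backend: owns the real objects, addressed by id."""
    def __init__(self):
        self._objects = {}
        self._ids = itertools.count(1)

    def register(self, obj):
        object_id = next(self._ids)
        self._objects[object_id] = obj
        return object_id

    def call_method(self, object_id, method_name, *args):
        # Look up the real object and invoke the named method on it.
        target = self._objects[object_id]
        return getattr(target, method_name)(*args)

class RemoteConf:
    """The 'real' object living on the backend side (a toy key/value configuration)."""
    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return True

    def get(self, key):
        return self._settings[key]

class ConfProxy:
    """Driver-side proxy: holds only an id and forwards every call to the backend."""
    def __init__(self, backend, object_id):
        self._backend = backend
        self._object_id = object_id

    def set(self, key, value):
        self._backend.call_method(self._object_id, "set", key, value)
        return self  # returning the proxy allows chained calls

    def get(self, key):
        return self._backend.call_method(self._object_id, "get", key)

backend = FakeBackend()
conf = ConfProxy(backend, backend.register(RemoteConf()))
conf.set("spark.app.name", "Pi").set("spark.eventLog.enabled", "true")
app_name = conf.get("spark.app.name")
```

The proxy never holds state of its own; the object id is the only thing the driver side needs to keep.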
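Worker-side Interop
[Diagram: inside a Spark Worker, the Executor (JVM) runs CSharpRDD, which exchanges bytes with CSharpWorker.exe (CLR).]
1. Compute (CSharpRDD in the Executor)
2. Launch CSharpWorker.exe (CLR)
3. Read bytes
4. Execute C# operation
5. Write bytes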
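The read bytes, execute, write bytes exchange on the worker side can be sketched with a local socket pair. This is a toy model, not the Mobius wire protocol: the 4-byte length prefix, the use of `pickle`, and all function names are assumptions made for illustration.

```python
import pickle
import socket
import struct
import threading

def send_frame(sock, payload):
    """Write a 4-byte big-endian length followed by the payload bytes."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, size):
    """Read exactly `size` bytes or raise if the peer closes early."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("socket closed before the full frame arrived")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)

def recv_frame(sock):
    """Read one length-prefixed frame."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def worker_loop(sock, operation):
    """Worker side: read bytes, execute the operation on each row, write bytes."""
    rows = pickle.loads(recv_frame(sock))
    send_frame(sock, pickle.dumps([operation(row) for row in rows]))
    sock.close()

# The executor side and the worker side share a connected socket pair.
executor_side, worker_side = socket.socketpair()
worker = threading.Thread(target=worker_loop, args=(worker_side, str.upper))
worker.start()

send_frame(executor_side, pickle.dumps(["info", "error"]))
processed = pickle.loads(recv_frame(executor_side))
worker.join()
executor_side.close()
```

Length-prefixed framing is used here because a stream socket does not preserve message boundaries; any framing scheme with the same property would do.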
Worker Optimization Options
• Multi-threaded ~ to avoid expensive fork-process when executing a Task
– [Diagram: one CSharpWorker.exe (CLR) running Thread 1, Thread 2, …, Thread n]
• Multi-proc ~ for higher throughput in executing Tasks
– [Diagram: two Spark Workers hosting several CSharpWorker.exe processes, each with its own CLR and threads T1 … Tn]
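The multi-threaded option can be illustrated with a thread pool inside one long-lived process. This is a sketch of the general idea only, not how CSharpWorker.exe is implemented; `run_task` and the pool size are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    """Pretend to execute one Task and report which process handled it."""
    return task_id, task_id * task_id, os.getpid()

# One long-lived process serves many Tasks on a pool of threads,
# so no new process is started per Task.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, range(8)))

process_ids = {pid for _, _, pid in results}
squares = [square for _, square, _ in results]
```

Every Task reports the same process id, which is the point of the multi-threaded mode. A multi-process variant would run several such pools side by side, one per process.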
Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of
data ~ impacts performance
– C# operations are pipelined when possible ~ minimizes Ser/De
– Persistence is handled by JVM ~ checkpoint/cache on a RDD impacts pipelining for
CLR operations
• DataFrame operations without C# UDFs do not require Ser/De
– Perf will be same as native Scala-based Spark application
– Execution plan optimization & code generation perf improvements in Spark leveraged
THANK YOU.
• Mobius is production-ready
• Use Mobius to build Apache Spark jobs in .NET
• Contribute to github.com/Microsoft/Mobius
• @MobiusForSpark
