Mobius: C# Language Binding For Spark

MOBIUS: C# BINDING FOR SPARK
Kaarthik Sivashanmugam
Microsoft
@kaarthikss

Quick Background
• Business Scenario: Next-gen near real-time processing of Bing.com logs
– Size of raw logs: TBs per hour
– C# library for processing ~ in use for several years
• Yesterday’s talk “Five Lessons Learned in Building Streaming Applications at Microsoft Bing Scale” covers this scenario & challenges

C# API - Motivations
• Enable organizations invested deeply in .NET to build Apache Spark applications in C#
• Reuse of existing .NET libraries in Spark applications

Why Yet Another Language Binding
Spark Survey 2015 Results
[Charts: “Fastest growing areas from 2014 to 2015” and “Most important aspects of Spark”]
Popularity of C#
• StackOverflow.com Developer Survey
• RedMonk Programming Language Rankings
.NET ecosystem ~ enabling languages like F#

C# API - Goal
Make C# a first-class language for building Apache Spark applications

Word Count Example in C#
Scala
C#

Kafka Example in C#
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
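The streaming code for these steps is not preserved in the transcript. The counting step can be illustrated locally with a sketch that assumes fixed, non-overlapping windows and a hypothetical record shape (timestamp in seconds plus log level). It uses plain LINQ, not the Mobius streaming API, and Spark windows can also slide, which this sketch does not model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local sketch only: plain LINQ over an in-memory list, not the Mobius streaming API.
public static class WindowedLogCountSketch
{
    // Hypothetical record shape: seconds since some start time, plus a log level.
    public struct LogRecord
    {
        public long TimestampSeconds;
        public string Level;
    }

    // Counts records per (window start, log level) using fixed, non-overlapping windows.
    // Assumes non-negative timestamps and windowSeconds > 0.
    public static Dictionary<(long WindowStart, string Level), int> CountByLevel(
        IEnumerable<LogRecord> records, long windowSeconds)
    {
        return records
            // Integer division rounds each timestamp down to the start of its window.
            .GroupBy(r => (WindowStart: r.TimestampSeconds / windowSeconds * windowSeconds, r.Level))
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        var records = new[]
        {
            new LogRecord { TimestampSeconds = 1, Level = "ERROR" },
            new LogRecord { TimestampSeconds = 30, Level = "ERROR" },
            new LogRecord { TimestampSeconds = 45, Level = "INFO" },
            new LogRecord { TimestampSeconds = 61, Level = "ERROR" },
        };

        foreach (var entry in CountByLevel(records, 60))
        {
            Console.WriteLine($"window {entry.Key.WindowStart}s, {entry.Key.Level}: {entry.Value}");
        }
    }
}
```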

Mobius: C# API for Spark
• Apache Spark
• Scala/Java API
• SparkR, PySpark, C# API
• Spark Apps in C#

Develop & Launch Mobius Applications
1. Add Reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver
Spark Client
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)
Example:
sparkclr-submit.cmd --master spark://IP:PORT --total-executor-cores 200 --executor-memory 12g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://nn/path/to/eventlog --exe Pi.exe D:\Mobius\examples\Pi

Mobius & Spark
• Driver: C# Driver (CLR) <-> IPC Sockets <-> SparkContext (JVM)
• Workers (three shown): C# Worker (CLR) <-> IPC Sockets <-> SparkExecutor (JVM)
• Mobius can be used with any existing Spark cluster (Standalone, YARN) in Windows & Linux

Mobius in Linux
• Mono (open source implementation of .NET framework) used for C# with Spark in Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Users reported using Mobius in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub

Project Info
• https://github.com/Microsoft/Mobius
Contributions welcome!
• MIT license
• Discussions
– StackOverflow: tag “SparkCLR”
– Gitter: https://gitter.im/Microsoft/Mobius
– Twitter: @MobiusForSpark

Project Status
• Past Releases
– v1.5.200 (Spark 1.5.2)
– v1.6.100 (Spark 1.6.1)
• Upcoming Release
– v2.0.000 (Spark 2.0.0)
• Work in progress
– Support for interactive scenarios (Zeppelin/Jupyter integration)
– Exploration of support for ML scenarios
– Idiomatic F# API

CSharpRDD
• C# operations use CSharpRDD which needs CLR to execute
– If no C# transformation or UDF, CLR is not needed ~ execution is entirely JVM-based
• RDD<byte[]>
– Data is stored as serialized objects and sent to C# worker process
• Transformations are pipelined when possible
– Avoids unnecessary serialization & deserialization within a stage

Driver-side Interop
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
• C# side (CLR): SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, … ~ create and manage proxies for JVM objects
• JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD
• Interop Components: mirror C#-side operations, invoke JVM methods
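The proxy idea can be sketched as follows: a C# object holds only an identifier for its JVM counterpart and forwards each operation as a named method call. The `IJvmBridge` interface, the `RecordingBridge` fake, and the `ConfProxy` class are hypothetical names introduced for this illustration; the fake records calls in memory instead of talking to a JVM over a socket, and the real Mobius interop layer may be organized differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface standing in for the socket connection to the JVM-side backend.
public interface IJvmBridge
{
    string CreateObject(string className);
    object CallMethod(string objectId, string methodName, params object[] args);
}

// In-memory fake so the sketch runs without a JVM; it only records the calls it receives.
public sealed class RecordingBridge : IJvmBridge
{
    public readonly List<string> Calls = new List<string>();
    private int nextId;

    public string CreateObject(string className)
    {
        string id = "obj-" + (++nextId);
        Calls.Add($"new {className} -> {id}");
        return id;
    }

    public object CallMethod(string objectId, string methodName, params object[] args)
    {
        Calls.Add($"{objectId}.{methodName}({string.Join(", ", args)})");
        return null; // a real bridge would return the JVM call's result
    }
}

// C#-side proxy: holds only an id for the JVM object and forwards each operation to it.
public sealed class ConfProxy
{
    private readonly IJvmBridge bridge;

    public string JvmObjectId { get; }

    public ConfProxy(IJvmBridge bridge)
    {
        this.bridge = bridge;
        JvmObjectId = bridge.CreateObject("org.apache.spark.SparkConf");
    }

    // Mirrors the C#-side operation by invoking the matching JVM method.
    public ConfProxy Set(string key, string value)
    {
        bridge.CallMethod(JvmObjectId, "set", key, value);
        return this;
    }
}

public static class DriverProxySketch
{
    public static void Main()
    {
        var bridge = new RecordingBridge();
        new ConfProxy(bridge).Set("spark.app.name", "Pi");

        foreach (string call in bridge.Calls)
        {
            Console.WriteLine(call);
        }
    }
}
```

Keeping the bridge behind an interface is a design choice of this sketch: it lets the C#-side proxy be exercised without a running JVM.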

Worker-side Interop
• Spark Worker: Executor (JVM) with CSharpRDD; CSharpWorker.exe (CLR)
1. Compute (CSharpRDD in the Executor)
2. Launch CSharpWorker.exe
3. Read bytes
4. Execute C# operation
5. Write bytes
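The read bytes, execute C# operation, write bytes cycle can be sketched as a loop over two streams. The framing here (a 4-byte length prefix per record and a length of -1 as terminator, in `BinaryReader`'s little-endian layout) is an assumption made for this illustration, not the documented Mobius wire format, and all names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

// Sketch of a worker loop, not Mobius code. The length-prefixed framing is an assumption.
public static class WorkerLoopSketch
{
    // Reads length-prefixed records until a negative length arrives, applies the
    // operation to each record, and writes the results back in the same framing.
    // Assumes the sender always writes the -1 terminator.
    public static void Run(Stream input, Stream output, Func<byte[], byte[]> operation)
    {
        var reader = new BinaryReader(input);
        var writer = new BinaryWriter(output);

        while (true)
        {
            int length = reader.ReadInt32();           // read bytes: record length
            if (length < 0) break;                     // terminator
            byte[] record = reader.ReadBytes(length);  // read bytes: record body
            byte[] result = operation(record);         // execute C# operation
            writer.Write(result.Length);               // write bytes: result length
            writer.Write(result);                      // write bytes: result body
        }

        writer.Write(-1);
        writer.Flush();
    }

    // Helper that frames UTF-8 strings the way Run expects them.
    public static MemoryStream Frame(params string[] records)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        foreach (string record in records)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(record);
            writer.Write(bytes.Length);
            writer.Write(bytes);
        }
        writer.Write(-1);
        writer.Flush();
        stream.Position = 0;
        return stream;
    }

    public static void Main()
    {
        var output = new MemoryStream();
        Run(Frame("info", "error"), output,
            bytes => Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes).ToUpperInvariant()));

        output.Position = 0;
        var reader = new BinaryReader(output);
        for (int length = reader.ReadInt32(); length >= 0; length = reader.ReadInt32())
        {
            Console.WriteLine(Encoding.UTF8.GetString(reader.ReadBytes(length)));
        }
    }
}
```

In the setup the slide describes, the two streams would be the IPC socket between the JVM executor and CSharpWorker.exe; memory streams stand in for it here.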

Worker Optimization Options
• Multi-threaded ~ to avoid expensive fork-process when executing a Task
– One CSharpWorker.exe (CLR) running Thread 1 … Thread n
• Multi-proc ~ for higher throughput in executing Tasks
– Several CSharpWorker.exe processes per Spark Worker, each in its own CLR and each running threads T1 … Tn

Performance Considerations
• Map & Filter RDD operations in C# require serialization & deserialization of data ~ impacts performance
– C# operations are pipelined when possible ~ minimizes Ser/De
– Persistence is handled by JVM ~ checkpoint/cache on an RDD impacts pipelining for CLR operations
• DataFrame operations without C# UDFs do not require Ser/De
– Perf will be same as native Scala-based Spark application
– Execution plan optimization & code generation perf improvements in Spark leveraged

THANK YOU.
• Mobius is production-ready
• Use Mobius to build Apache Spark jobs in .NET
• Contribute to github.com/Microsoft/Mobius
• @MobiusForSpark
