Developing Apache Spark Jobs in .NET Using Mobius

The document discusses Mobius, a C# API for developing Apache Spark applications within the .NET framework, enabling organizations to leverage existing .NET libraries with Spark's high-performance computing capabilities. It covers key concepts of Spark, a detailed process for building and launching Mobius applications, as well as integration with various data sources. Additionally, it provides information on the project status, upcoming releases, and community engagement platforms.

Developing Apache Spark Jobs
in .NET using Mobius
Kaarthik Sivashanmugam
@kaarthikss
dotnetfringe 2016

Apache Spark
• General purpose cluster computing system for big data processing
and analytics
• Ease of programming
• High performance
• Unified API to solve a diverse set of complex data problems
• API in Scala, Java, Python & R

Apache Spark Key Concepts
• Data
• RDD – Resilient Distributed Dataset
• Transformation & Action
• DataFrame
• DStream
• Cluster
• Driver
• Executor

Mobius: C# API for Spark
• Enable organizations invested deeply in .NET to build Apache Spark
applications in C#
• Reuse of existing .NET libraries in Spark applications

.NET & Spark
• Apache Spark
• APIs on top of Spark: Scala/Java API, SparkR, PySpark, Mobius: C# API
• Spark Apps in .NET are built on Mobius: C# API

Word Count in Spark using RDD
Scala
RDD of lines in the file
RDD of words in the file
RDD of tuple - (word, 1)
RDD of tuple - (word, count)
Action that triggers job
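
The annotations above describe a chain of transformations that ends in one action. The sketch below mimics that chain in plain Python on an in-memory list so it runs without a cluster. It is not the Spark or Mobius API: the helper names (`flat_map`, `reduce_by_key`, `collect`) and the sample lines are made up for illustration. Generators stand in for lazy transformations, so nothing is computed until `collect` is called.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def flat_map(func: Callable[[str], Iterable[str]], items: Iterable[str]) -> Iterator[str]:
    """Lazily apply func to each item and flatten the results."""
    for item in items:
        yield from func(item)


def reduce_by_key(func: Callable[[int, int], int],
                  pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Lazily combine the values that share a key (a real cluster would shuffle here)."""
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    yield from totals.items()


def collect(dataset: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """The 'action': pulling the results forces the whole chain to run."""
    return list(dataset)


def word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    words = flat_map(str.split, lines)                   # words in the file
    pairs = ((word, 1) for word in words)                # (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)      # (word, count)


# Hypothetical input standing in for the lines of a file.
lines = ["to be or not to be", "to do is to be"]
result = dict(collect(word_count(lines)))
print(result)
```

In the real API the same five steps are method calls on an RDD, and the work is distributed across executors rather than run in one process.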

Word Count in Spark using RDD
C#
Scala
F#

Develop & Launch Mobius Applications
1. Add reference to Mobius package in NuGet
2. Develop, debug, test Mobius driver application
3. Build Mobius driver
Spark Client
A. Get Mobius release
B. Get Mobius driver and dependencies
C. Run sparkclr-submit.cmd or sparkclr-submit.sh (runs Spark job)
Example: sparkclr-submit.cmd
--master spark://IP:PORT
--total-executor-cores 200
--executor-memory 12g
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=hdfs://nn/path/to/eventlog
--exe Pi.exe D:\Mobius\examples\Pi
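
When the submit step is scripted, it can help to assemble the command as an argument list rather than one long string. The following is a minimal sketch under that assumption: `build_submit_args` is a hypothetical helper, not part of Mobius, and it only uses the options shown in the example above. The application directory is written with backslashes on the assumption that the slide shows a Windows path.

```python
from typing import Dict, List


def build_submit_args(script: str, master: str, exe: str, app_dir: str,
                      options: Dict[str, str], conf: Dict[str, str]) -> List[str]:
    """Assemble a sparkclr-submit command line as a list of arguments."""
    args = [script, "--master", master]
    for name, value in options.items():
        args += [f"--{name}", value]             # e.g. --executor-memory 12g
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]     # one --conf flag per setting
    args += ["--exe", exe, app_dir]              # driver executable and its folder
    return args


args = build_submit_args(
    script="sparkclr-submit.cmd",
    master="spark://IP:PORT",                    # placeholder, as in the slide
    exe="Pi.exe",
    app_dir=r"D:\Mobius\examples\Pi",            # assumed Windows path
    options={"total-executor-cores": "200", "executor-memory": "12g"},
    conf={
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "hdfs://nn/path/to/eventlog",
    },
)
print(" ".join(args))
```

A list like this can be handed to `subprocess.run` without extra shell quoting.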

Demo
Implementing a simple Mobius driver program using DataFrame

Structured Data in Mobius using DataFrame
JSON Cassandra
Note – Dataset is replacing DataFrame in Spark

Mobius & Spark
• Driver: C# Driver (CLR) communicates with SparkContext (JVM) over IPC sockets
• Workers: each C# Worker (CLR) communicates with a SparkExecutor (JVM) over IPC sockets
• Mobius can be used with any existing Spark cluster (Standalone, YARN) on Windows & Linux

Mobius in Linux
• Mono is used to run Mobius with Spark on Linux
• Mobius project CI (build, unit & functional tests) in Ubuntu
• Mobius validated in Ubuntu, CentOS, OSX
• Mobius validated with Spark clusters in
Azure HDInsight and Amazon Web Services EMR
• More info at linux-instructions.md @ GitHub

Kafka Message Processing in Mobius using DStream
Initialize StreamingContext & Checkpoint
Create Kafka DStream
Use DStream transformations to count logs by loglevel within a time window
Save log count
Start stream processing
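
The core of the steps above is counting log messages per log level inside a time window. Here is a plain-Python sketch of that one idea, with no Kafka or Spark Streaming involved. The log format (level as the first token), the timestamps, and the names `log_level` and `count_by_level` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

LogEvent = Tuple[float, str]  # (arrival time in seconds, raw log line)


def log_level(line: str) -> str:
    """Assume the level is the first whitespace-separated token, e.g. 'ERROR disk full'."""
    return line.split(maxsplit=1)[0]


def count_by_level(events: Iterable[LogEvent], window_end: float,
                   window_seconds: float) -> Counter:
    """Count lines per level whose arrival time is in (window_end - window_seconds, window_end]."""
    window_start = window_end - window_seconds
    return Counter(
        log_level(line)
        for arrived_at, line in events
        if window_start < arrived_at <= window_end
    )


# Hypothetical messages standing in for what a Kafka topic would deliver.
events = [
    (1.0, "INFO service started"),
    (12.0, "ERROR disk full"),
    (18.0, "INFO retrying"),
    (25.0, "ERROR disk full"),
    (29.0, "WARN slow response"),
]

# A 20-second window evaluated at t=30 covers arrivals in (10, 30].
counts = count_by_level(events, window_end=30.0, window_seconds=20.0)
print(dict(counts))
```

A streaming engine re-evaluates such a window on every slide interval and keeps state across batches, which is why the checkpoint is initialized first.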

Driver-side Interop
1. Launch: sparkclr-submit.cmd or sparkclr-submit.sh starts CSharpRunner (JVM)
2. CSharpBackend: launch Netty server creating proxy for JVM calls
3. C# Driver (CLR): launch C# process using port number from CSharpBackend
• C# side (CLR): SparkConf, SparkContext, RDD, DataFrame, DStream, PipelinedRDD, …; creates and manages proxies for JVM objects
• Interop components: invoke JVM methods
• JVM side: SparkConf, SparkContext, RDD, DataFrame, DStream, …, CSharpRDD; mirrors C#-side operations
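
The proxy idea on this slide can be illustrated in a few lines: the driver holds a lightweight handle, and each method call on it is forwarded to a backend that owns the real object. The sketch below is only an analogy. JSON messages, the `obj-N` identifiers, and the classes `FakeBackend`, `Proxy`, and `Conf` are all assumptions; the slides do not describe the actual message format used between the CLR and the JVM, and no sockets are involved here.

```python
import json
from typing import Any, Callable, Dict


class FakeBackend:
    """Stands in for the JVM-side server: owns the real objects, keyed by id."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}
        self._next_id = 0

    def register(self, obj: Any) -> str:
        object_id = f"obj-{self._next_id}"
        self._next_id += 1
        self._objects[object_id] = obj
        return object_id

    def handle(self, request: str) -> str:
        """Decode a call, invoke it on the real object, encode the result."""
        call = json.loads(request)
        target = self._objects[call["id"]]
        result = getattr(target, call["method"])(*call["args"])
        return json.dumps({"result": result})


class Proxy:
    """Driver-side handle: forwards every method call to the backend."""

    def __init__(self, backend: FakeBackend, object_id: str) -> None:
        self._backend = backend
        self._object_id = object_id

    def __getattr__(self, method: str) -> Callable[..., Any]:
        def invoke(*args: Any) -> Any:
            request = json.dumps(
                {"id": self._object_id, "method": method, "args": list(args)}
            )
            return json.loads(self._backend.handle(request))["result"]
        return invoke


class Conf:
    """Hypothetical 'real' object living on the backend side."""

    def __init__(self) -> None:
        self._settings: Dict[str, str] = {}

    def set(self, key: str, value: str) -> bool:
        self._settings[key] = value
        return True

    def get(self, key: str) -> str:
        return self._settings[key]


backend = FakeBackend()
conf = Proxy(backend, backend.register(Conf()))
conf.set("spark.app.name", "demo")
name = conf.get("spark.app.name")
print(name)
```

The driver never touches the `Conf` instance directly; it only knows the identifier, which is what lets the two sides live in different runtimes.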

Worker-side Interop
1. Compute (CSharpRDD in the Executor on the Spark Worker, JVM)
2. Launch CSharpWorker.exe (CLR)
3. Read bytes
4. Execute C# operation
5. Write bytes
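
The read bytes, execute operation, write bytes cycle on the worker side can be sketched as a simple loop over a byte stream. This is a minimal illustration, not Mobius's wire protocol: the 4-byte big-endian length prefix and the names `write_record`, `read_record`, and `run_worker` are assumptions, and in-memory buffers stand in for the sockets.

```python
import io
import struct
from typing import BinaryIO, Callable, Optional


def write_record(stream: BinaryIO, payload: bytes) -> None:
    """Write one record as a 4-byte big-endian length followed by the payload."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_record(stream: BinaryIO) -> Optional[bytes]:
    """Read one length-prefixed record, or return None at end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">i", header)
    return stream.read(length)


def run_worker(inbound: BinaryIO, outbound: BinaryIO,
               operation: Callable[[bytes], bytes]) -> int:
    """Read each record, apply the user operation, write the result back."""
    processed = 0
    while True:
        record = read_record(inbound)
        if record is None:
            break
        write_record(outbound, operation(record))
        processed += 1
    return processed


# The executor side would send serialized rows; two byte strings stand in here.
inbound = io.BytesIO()
for word in (b"spark", b"mobius"):
    write_record(inbound, word)
inbound.seek(0)

outbound = io.BytesIO()
handled = run_worker(inbound, outbound, bytes.upper)

outbound.seek(0)
results = [read_record(outbound), read_record(outbound)]
print(handled, results)
```

Keeping the worker this thin means the user's operation is the only part that varies between jobs; the framing loop stays the same.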

Mobius Project Info
• https://github.com/Microsoft/Mobius
• MIT license
• Discussions
• StackOverflow: tag “SparkCLR”
• Gitter: https://gitter.im/Microsoft/Mobius
• Twitter: @MobiusForSpark

Mobius Project Status
• Past Releases
• v1.5.200 (Spark 1.5.2)
• v1.6.100 (Spark 1.6.1)
• Upcoming Releases
• v1.6.200 (Spark 1.6.2)
• v2.0.000 (Spark 2.0.0)
• Work planned/in progress
• Support for interactive scenarios (Zeppelin/Jupyter integration – IfSharp?)
• Exploration of support for ML scenarios
• Idiomatic F# API (?)
• Support for .NET Core

Thank you
Mobius is production-ready & cloud-ready
Use Mobius to build Apache Spark jobs in .NET
Contribute to github.com/Microsoft/Mobius
@MobiusForSpark
