Lessons: Porting a Streaming Pipeline from Scala to Rust
2023 Scale by the Bay
Evan Chan
Principal Engineer - Conviva
http://velvia.github.io/presentations/2023-conviva-scala-to-rust
1 / 38
Conviva
2 / 38
Massive Real-time Streaming Analytics
5 trillion events processed per day
800-2000GB/hour (not peak!!)
Started with custom Java code, then went through Spark Streaming and Flink iterations
Most backend data components in production are written in Scala
Today: 420 pods running custom Akka Streams processors
3 / 38
Data World is Going Native and Rust
Going native: Python, end of Moore's Law, cloud compute
Safe, fast, and high-level abstractions
Functional data patterns - map, fold, pattern matching, etc.
Static dispatch and no allocations by default
PyO3 - Rust is the best way to write native Python extensions
JVM                    | Rust projects
Spark, Hive            | DataFusion, Ballista, Amadeus
Flink                  | Arroyo, RisingWave, Materialize
Kafka/KSQL             | Fluvio
ElasticSearch / Lucene | Toshi, MeiliDB
Cassandra, HBase       | Skytable, Sled, Sanakirja...
Neo4J                  | TerminusDB, IndraDB
4 / 38
About our Architecture
Sensors --> Gateways --> Kafka --> Streaming Data Pipeline --> Metrics Database --> Dashboards
5 / 38
What We Are Porting to Rust
Sensors (highlighted) --> Gateways (highlighted) --> Kafka --> Streaming Data Pipeline (highlighted) --> Metrics Database --> Dashboards
Sensors: consolidate fragmented code base
Gateway: improve on JVM and Go
Pipeline: improve efficiency, new operator architecture
6 / 38
Our Journey to Rust
From Hackathon to Multiple Teams
Data Pipeline
- Hackathon (small Kafka ingestion project): 2022-11, 30 days
- Scala prototype: 2023-02, 6 weeks
- Initial Rust port (small team): 2023-04, 45 days
- Bring on more people: 2023-07, 8 weeks
- 20-25 people, 4 teams: 2023-11
Gateway
- Go port: 2023-07, 6 weeks
- Rust port: 2023-09, 4 weeks
“I like that if it compiles, I know it will work, so it gives confidence.”
7 / 38
Promising Rust Hackathon
Kafka --> Rust Deserializer --> Rust Actors (Lightweight Processing)
Measurement      | Improvement over Scala/Akka
Throughput (CPU) | 2.6x more
Memory used      | 12x less
Mostly I/O-bound lightweight deserialization and processing workload
Found out Actix does not work well with Tokio
8 / 38
Performance Results - Gateway
9 / 38
Key Lessons or Questions
What matters for a Rust port?
The 4 P's
People      | How do we bring developers onboard?
Performance | How do I get performance? Data structures? Static dispatch?
Patterns    | What coding patterns port well from Scala? Async?
Project     | How do I build? Tooling, IDEs?
10 / 38
People
How do we bring developers onboard?
11 / 38
A Phased Rust Bringup
We ported our main data pipeline in two phases:
Phase  | Team                      | Rust Expertise            | Work
First  | 3-5, very senior          | 1-2 with significant Rust | Port core project components
Second | 10-15, mixed, distributed | Most with zero Rust       | Smaller, broken down tasks
Have organized list of learning resources
2-3 weeks to learn Rust and come up to speed
12 / 38
Overcoming Challenges
Difficulties:
- Lifetimes
- Compiler errors
- Porting previous patterns
- Ownership and async
- etc.
How we helped:
- Good docs
- Start with tests
- ChatGPT!
- Rust Book
- Office hours
- Lots of detailed reviews
- Split project into async and sync cores
13 / 38
Performance
Data structures, static dispatch, etc.
"I enjoy the fact that the default route is performant. It makes you write
performant code, and if you go out the way, it becomes explicit (e.g., with dyn,
Boxed, or clone etc). "
14 / 38
Porting from Scala: Huge Performance Win
Sensors --> Gateways --> Kafka --> Streaming Data Pipeline (highlighted) --> Metrics Database --> Dashboards
CPU-bound, programmable, heavy data processing
Neither the Rust nor the Scala version is productionized or optimized
Same architecture and same input/outputs
Scala version was not designed for speed, lots of objects
Rust: we chose static dispatch and minimizing allocations
Type of comparison                           | Improvement over Scala
Throughput, end to end                       | 22x
Throughput, single-threaded microbenchmark   | >= 40x
15 / 38
Building a Flexible Data Pipeline
Raw Events --(List of numbers)--> Extract1 --> DoSomeMath --> Filter1
Raw Events --> Extract2 --> TransformSomeFields --> Filter1
Filter1 --> MoreProcessing
An interpreter passes time-ordered data between the operators of a flexible DAG.
Span1
Start time: 1000
End time: 1100
Events: ["start", "click"]
Span2
Start time: 1100
End time: 1300
Events: ["ad_load"]
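To make the span data concrete, here is a minimal sketch of time-ordered spans flowing through a fixed chain of operators. It is an illustration only: the talk describes an interpreter over a configurable DAG, while this hard-codes one path, and `Span`, `map_value`, `run_pipeline`, and the arithmetic are all assumed for the example.

```rust
/// A value that holds over a time interval (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Span<V> {
    pub start: u64,
    pub end: u64,
    pub value: V,
}

impl<V> Span<V> {
    /// Transform the payload while keeping the time interval.
    pub fn map_value<W>(self, f: impl FnOnce(V) -> W) -> Span<W> {
        Span { start: self.start, end: self.end, value: f(self.value) }
    }
}

/// One fixed path: Extract -> DoSomeMath -> Filter (stage names mirror the diagram).
pub fn run_pipeline(input: Vec<Span<Vec<&str>>>) -> Vec<Span<f64>> {
    input
        .into_iter()
        // Extract: number of events in each span
        .map(|s| s.map_value(|events| events.len() as f64))
        // DoSomeMath: an arbitrary scaling chosen for the example
        .map(|s| s.map_value(|n| n * 10.0))
        // Filter: keep only spans above an arbitrary threshold
        .filter(|s| s.value > 10.0)
        .collect()
}
```

Fed the two spans above, only Span1 (two events) passes the filter.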
16 / 38
Data Structures: Scala vs Rust
Scala: Object Graph on Heap
Timeline (Seq) --> Array[Span]
Array[Span] --> Span(start, end, Payload)   (Span1)
Array[Span] --> Span(start, end, Payload)   (Span2)
Span1 --> Events(Seq[A]) --> Array[A]
Rust: mostly stack based / 0 alloc:
Timeline
  DataType
  OutputSpans
    Span1
      TimeInterval
      Events: EvA, EvB
    Span2
      Time2
      Events2
17 / 38
Rust: Using Enums and Avoiding Boxing
pub enum Timeline {
    EventNumber(OutputSpans<EventsAtEnd<f64>>),
    EventBoolean(OutputSpans<EventsAtEnd<bool>>),
    EventString(OutputSpans<EventsAtEnd<DataString>>),
}
// Up to 2 spans are stored inline before spilling to the heap
type OutputSpans<V> = SmallVec<[Span<V>; 2]>;
pub struct Span<SV: SpanValue> {
    pub time: TimeInterval,
    pub value: SV,
}
// Up to 1 event is stored inline (the common case)
pub struct EventsAtEnd<V>(SmallVec<[V; 1]>);
In the above, the Timeline enum can fit entirely on the stack and avoid all
boxing and allocations, if:
- The number of spans is very small, below the limit set in code
- The number of events in each span is very small (1 in this case, which is
  the common case)
- The base type is a primitive, or a string which is below a certain length
18 / 38
Avoiding Allocations using SmallVec and SmallString
SmallVec is something like this:
pub enum SmallVec<T, const N: usize> {
    Stack([T; N]),
    Heap(Vec<T>),
}
The enum can hold up to N items inline in an array with no allocations, but
switches to the Heap variant if the number of items exceeds N.
There are various crates for small strings and other data structures.
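The inline-then-spill behaviour can be sketched with the standard library alone. This is a simplified illustration, not the API or implementation of the real `smallvec` crate: the name `TinyVec`, the explicit `len` field, and the `Copy + Default` bound (used so the inline array can be pre-filled without unsafe code) are all assumptions made for the example.

```rust
/// Simplified small-vector sketch: inline storage for up to N items.
#[derive(Debug)]
pub enum TinyVec<T: Copy + Default, const N: usize> {
    /// Items live inline; `len` tracks how many slots are in use.
    Stack { buf: [T; N], len: usize },
    /// Spilled to the heap once more than N items were pushed.
    Heap(Vec<T>),
}

impl<T: Copy + Default, const N: usize> TinyVec<T, N> {
    pub fn new() -> Self {
        TinyVec::Stack { buf: [T::default(); N], len: 0 }
    }

    pub fn push(&mut self, item: T) {
        let spilled = match self {
            TinyVec::Stack { buf, len } => {
                if *len < N {
                    // Still room inline: no allocation happens here.
                    buf[*len] = item;
                    *len += 1;
                    return;
                }
                // Inline storage is full: copy everything into a Vec.
                let mut v = Vec::with_capacity(N + 1);
                v.extend_from_slice(&buf[..]);
                v.push(item);
                v
            }
            TinyVec::Heap(v) => {
                v.push(item);
                return;
            }
        };
        // Switch variants only after the borrow of the inline buffer ends.
        *self = TinyVec::Heap(spilled);
    }

    pub fn as_slice(&self) -> &[T] {
        match self {
            TinyVec::Stack { buf, len } => &buf[..*len],
            TinyVec::Heap(v) => v.as_slice(),
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, TinyVec::Stack { .. })
    }
}
```

With `N = 2`, the first two pushes stay inline and the third moves the contents to the heap.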
19 / 38
Static vs Dynamic Dispatch
Often one will need to work with many different structs that implement a Trait
-- for us, different operator implementations supporting different types. Static
dispatch and inlined code are much faster than dynamic dispatch.
1. Monomorphisation using generics
   fn execute_op<O: Operator>(op: O) -> Result<...>
   - Compiler creates a new instance of execute_op for every different O
   - Only works when you know in advance what Operator to pass in
2. Use Enums and enum_dispatch
   fn execute_op(op: OperatorEnum) -> Result<...>
3. Dynamic dispatch
   fn execute_op(op: Box<dyn Operator>) -> Result<...>
   fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation)
4. Function wrapping
   - Embedding functions in a generic struct
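A small sketch contrasting option 1 (generics) with option 3 (trait objects) may help. The `Operator` trait shape, the `AddOne` and `Double` operators, and the function names below are hypothetical stand-ins, since the slides do not show the real trait.

```rust
/// Hypothetical operator trait for illustration.
pub trait Operator {
    fn apply(&self, x: f64) -> f64;
}

pub struct AddOne;
pub struct Double;

impl Operator for AddOne {
    fn apply(&self, x: f64) -> f64 { x + 1.0 }
}

impl Operator for Double {
    fn apply(&self, x: f64) -> f64 { x * 2.0 }
}

/// Option 1: static dispatch. The compiler emits one copy of this
/// function per concrete `O`, so `apply` can be inlined.
pub fn execute_static<O: Operator>(op: &O, x: f64) -> f64 {
    op.apply(x)
}

/// Option 3: dynamic dispatch through a borrowed trait object.
/// No allocation, but each call goes through a vtable.
pub fn execute_dyn(op: &dyn Operator, x: f64) -> f64 {
    op.apply(x)
}

/// Boxed trait objects allow a list of operators chosen at runtime,
/// at the cost of one heap allocation per operator.
pub fn run_chain(ops: &[Box<dyn Operator>], x: f64) -> f64 {
    ops.iter().fold(x, |acc, op| op.apply(acc))
}
```

The trade-off is flexibility: the generic version needs the concrete operator type at compile time, while the boxed chain can be assembled from configuration.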
20 / 38
enum_dispatch
Suppose you have
trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}
struct LinearKnob {
    position: f64,
}
struct LogarithmicKnob {
    position: f64,
}
impl KnobControl for LinearKnob...
enum_dispatch lets you do this:
#[enum_dispatch]
trait KnobControl {
    //...
}
21 / 38
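For readers unfamiliar with the crate, the following hand-written code approximates what `enum_dispatch` saves you from writing: an enum with one variant per implementor, `From` conversions, and a trait impl that forwards each method with a `match`. The `Knob` enum, its variant names, and the method bodies (including the logarithmic mapping) are assumptions for illustration, not the macro's exact output.

```rust
pub trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}

pub struct LinearKnob {
    pub position: f64,
}

pub struct LogarithmicKnob {
    pub position: f64,
}

impl KnobControl for LinearKnob {
    fn set_position(&mut self, value: f64) { self.position = value; }
    fn get_value(&self) -> f64 { self.position }
}

impl KnobControl for LogarithmicKnob {
    fn set_position(&mut self, value: f64) { self.position = value; }
    // Assumed mapping for the example: ln(1 + position).
    fn get_value(&self) -> f64 { (self.position + 1.0).ln() }
}

/// Closed set of knob types; no Box and no vtable needed.
pub enum Knob {
    Linear(LinearKnob),
    Logarithmic(LogarithmicKnob),
}

impl From<LinearKnob> for Knob {
    fn from(k: LinearKnob) -> Self { Knob::Linear(k) }
}

impl From<LogarithmicKnob> for Knob {
    fn from(k: LogarithmicKnob) -> Self { Knob::Logarithmic(k) }
}

/// Each trait method forwards to the active variant via `match`.
impl KnobControl for Knob {
    fn set_position(&mut self, value: f64) {
        match self {
            Knob::Linear(k) => k.set_position(value),
            Knob::Logarithmic(k) => k.set_position(value),
        }
    }

    fn get_value(&self) -> f64 {
        match self {
            Knob::Linear(k) => k.get_value(),
            Knob::Logarithmic(k) => k.get_value(),
        }
    }
}
```

A `Vec<Knob>` can then hold mixed knob types inline and call trait methods on them without dynamic dispatch.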
Function wrapping
Static function wrapping - no generics
pub struct OperatorWrapper {
    name: String,
    func: fn(input: &Data) -> Data,
}
Need a generic - but accepts closures
pub struct OperatorWrapper<F>
where
    F: Fn(&Data) -> Data,
{
    name: String,
    func: F,
}
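As a rough sketch of how the two wrappers differ in use: a plain `fn` pointer cannot capture state, while the generic version can hold a closure that does. `Data` is modelled here as a hypothetical one-field struct, and the two wrappers get distinct names (`FnOperator`, `ClosureOperator`) so they can coexist in one file; neither name comes from the talk.

```rust
/// Hypothetical payload type for the example.
pub struct Data(pub f64);

/// Static wrapping: a plain function pointer, no generics.
pub struct FnOperator {
    pub name: String,
    pub func: fn(&Data) -> Data,
}

impl FnOperator {
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

/// Generic wrapping: `F` may be a closure that captures state.
pub struct ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub name: String,
    pub func: F,
}

impl<F> ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

fn double(input: &Data) -> Data {
    Data(input.0 * 2.0)
}

pub fn make_double() -> FnOperator {
    FnOperator { name: "double".to_string(), func: double }
}

/// The closure captures `factor`, which a bare `fn` pointer could not do.
pub fn make_scale(factor: f64) -> ClosureOperator<impl Fn(&Data) -> Data> {
    ClosureOperator {
        name: format!("scale_by_{}", factor),
        func: move |d: &Data| Data(d.0 * factor),
    }
}
```

Because `F` is a type parameter, calls through `ClosureOperator` are statically dispatched; the cost is that each distinct closure produces a distinct wrapper type.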
22 / 38
Patterns
Async, Type Classes, etc.
23 / 38
Rust Async: Different Paradigms
"Async: It is well designed... Yes, it is still pretty complicated piece of code, but
the logic or the framework is easier to grasp compared to other languages."
Having to use Arc: Data Structures are not Thread-safe by default!
Scala                               | Rust
Futures                             | futures, async functions
??                                  | async-await
Actors (Akka)                       | Actix, Bastion, etc.
Async streams                       | Tokio streams
Reactive (Akka streams, Monix, ZIO) | reactive_rs, rxRust, etc.
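The note about `Arc` can be illustrated with a short sketch using only the standard library. It uses OS threads rather than an async runtime to stay self-contained, and the function name `count_events` and the way work is split between workers are assumptions for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts events by name using several threads that share one map.
pub fn count_events(
    events: Vec<&'static str>,
    workers: usize,
) -> HashMap<&'static str, usize> {
    // A plain HashMap cannot be mutated from several threads.
    // Arc provides shared ownership, Mutex provides exclusive access.
    let counts: Arc<Mutex<HashMap<&'static str, usize>>> =
        Arc::new(Mutex::new(HashMap::new()));
    let events = Arc::new(events);
    let workers = workers.max(1);

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let counts = Arc::clone(&counts);
            let events = Arc::clone(&events);
            thread::spawn(move || {
                // Worker `w` takes every `workers`-th event, starting at index `w`.
                for name in events.iter().skip(w).step_by(workers) {
                    *counts.lock().unwrap().entry(*name).or_insert(0) += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Copy the map out while holding the lock, then release it.
    let result = counts.lock().unwrap().clone();
    result
}
```

Without the `Arc<Mutex<...>>` wrapper the compiler rejects the program, which is the "not thread-safe by default" behaviour the slide refers to.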
24 / 38
Replacing Akka: Actors in Rust
Actix threading model doesn't mix well with Tokio
We moved to tiny-tokio-actor, then wrote our own
pub struct AnomalyActor {}
#[async_trait]
impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor {
    async fn handle(
        &mut self,
        msg: Anomaly,
        ctx: &mut ActorContext<Anomaly>,
    ) -> Result<(), Report<AnomalyActorError>> {
        use Anomaly::*;
        match msg {
            // Overflow anomalies are accepted and ignored here
            QuantityOverflowAnomaly {
                ctx: _, ts: _, qual: _,
                qty: _, cnt: _, data: _,
            } => {}
            // A PoisonPill message stops the actor
            PoisonPill => {
                ctx.stop();
            }
        }
        Ok(())
    }
}
25 / 38
Other Patterns to Learn
Old Pattern                         | New Pattern
No inheritance                      | Use composition: compose data structures, compose small Traits
No exceptions                       | Use Result and ?
Data structures are not thread safe | Learn to use Arc etc.
Returning Iterators                 | Don't return things that borrow other things. This makes life difficult.
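For developers coming from exceptions, a small sketch of the `Result` and `?` pattern may be useful. The `name=value` line format, the `EventError` type, and both function names are invented for the example.

```rust
use std::fmt;
use std::num::ParseFloatError;

/// Hypothetical error type: every failure mode is a visible variant.
#[derive(Debug)]
pub enum EventError {
    MissingField(&'static str),
    BadNumber(ParseFloatError),
}

impl fmt::Display for EventError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            EventError::MissingField(name) => write!(f, "missing field: {}", name),
            EventError::BadNumber(e) => write!(f, "bad number: {}", e),
        }
    }
}

impl std::error::Error for EventError {}

/// Lets `?` convert a parse failure into our error type automatically.
impl From<ParseFloatError> for EventError {
    fn from(e: ParseFloatError) -> Self {
        EventError::BadNumber(e)
    }
}

/// Parses a "name=value" line; errors are returned, never thrown.
pub fn parse_event(line: &str) -> Result<(String, f64), EventError> {
    let (name, raw) = line
        .split_once('=')
        .ok_or(EventError::MissingField("value"))?;
    let value: f64 = raw.trim().parse()?;
    Ok((name.trim().to_string(), value))
}

/// `?` returns early on the first failing line.
pub fn sum_events(lines: &[&str]) -> Result<f64, EventError> {
    let mut total = 0.0;
    for line in lines {
        total += parse_event(line)?.1;
    }
    Ok(total)
}
```

Each `?` is an explicit early return, so the error paths stay visible in the function signature and body.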
26 / 38
Type Classes
In Rust, type classes (Traits) are smaller and more compositional.
pub trait Inhale {
    fn sniff(&self);
}
You can implement new Traits for existing types, and have different impls for
different types.
impl Inhale for String {
    fn sniff(&self) {
        println!("I sniffed {}", self);
    }
}
// Only implemented for MyStruct<N> where N implements Numeric
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) {
        ....
    }
}
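A compilable variant of the idea is sketched below. It differs from the slide in two labelled ways: `sniff` returns a `String` instead of printing, so the result can be checked, and `Numeric` and `MyStruct` are hypothetical definitions since the slide elides them.

```rust
pub trait Inhale {
    fn sniff(&self) -> String;
}

/// Hypothetical stand-in for the slide's `Numeric` bound.
pub trait Numeric: Copy + Into<f64> {}
impl Numeric for f64 {}
impl Numeric for i32 {}
impl Numeric for u8 {}

/// Hypothetical generic struct for the example.
pub struct MyStruct<N> {
    pub value: N,
}

/// A new trait implemented for an existing standard-library type.
impl Inhale for String {
    fn sniff(&self) -> String {
        format!("I sniffed {}", self)
    }
}

/// Conditional impl: only `MyStruct<N>` with a `Numeric` N gets `sniff`.
/// `MyStruct<&str>`, for example, does not implement `Inhale`.
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) -> String {
        let v: f64 = self.value.into();
        format!("I sniffed the number {}", v)
    }
}
```

Calling `sniff` on a `MyStruct<&str>` is a compile error rather than a runtime failure, which is the main practical difference from an inheritance-based design.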
27 / 38
Project
Build, IDE, Tooling
28 / 38
"Cargo is the best build tool ever"
Almost no dependency conflicts due to multiple dep versioning
Configuration by convention - common directory/file layouts for example
Really simple .toml - no need for XML, functional Scala, etc.
Rarely need code to build anything, even for large projects
[package]
name = "telemetry-subscribers"
version = "0.3.0"
license = "Apache-2.0"
description = "Library for common telemetry and observability functionality"
[dependencies]
console-subscriber = { version = "0.1.6", optional = true }
crossterm = "0.25.0"
once_cell = "1.13.0"
opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true }
29 / 38
IDEs, CI, and Tooling
IDEs/Editors     | VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer
Code Coverage    | VSCode inline, grcov/lcov, Tarpaulin (Linux only)
Slow build times | Caching: cargo-chef, rust-cache
Slow test times  | cargo-nextest
Property Testing | proptest
Benchmarking     | Criterion
https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/
VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH.
30 / 38
Rust Resources and Projects
https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust
projects and learning resources
https://github.com/rust-unofficial/awesome-rust
https://www.arewelearningyet.com - ML focused
31 / 38
What do we miss from Scala?
More mature libraries - in some cases: HDFS, etc.
Good streaming libraries - like Monix, Akka Streams etc.
I guess all of Akka
"Less misleading compiler messages"
Rust error messages read better from the CLI, IMO (not an IDE)
32 / 38
Takeaways
It's a long journey but Rust is worth it.
Structuring a project for successful onramp is really important
Think about data structure design early on
Allow plenty of time to ramp up on Rust patterns, tools
We are hiring across multiple roles/levels!
33 / 38
https://velvia.github.io/about
https://github.com/velvia
@evanfchan
IG: @platypus.arts
Thank You Very Much!
34 / 38
Extra slides
35 / 38
Data World is Going Native (from JVM)
The rise of Python and Data Science
Led to AnyScale, Dask, and many other Python-oriented data frameworks
Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.)
Migration from Hadoop/HDFS to more cloud-based data architectures
Apache Arrow and other data interchange formats
Hardware architecture trends - end of Moore's Law, rise of GPUs etc
36 / 38
Why We Went with our Own Actors
1. Initial Hackathon prototype used Actix
   - Actix has its own event-loop / threading model, using Arbiters
   - Difficult to co-exist with Tokio and configure both
2. Moved to tiny-tokio-actor
   - Really thin layer on top of Tokio
   - 25% improvement over rdkafka + Tokio + Actix
3. Ultimately wrote our own, 100-line mini Actor framework
   - tiny-tokio-actor required messages to be Clone so we could not, for
     example, send OneShot channels for other actors to reply
   - Wanted ActorRef<MessageType> instead of ActorRef<ActorType, MessageType>
   - Supports tell() and ask() semantics
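To show what tell and ask semantics with an `ActorRef<MessageType>` can look like, here is a minimal sketch built on standard-library threads and channels. It is not the framework described above, which runs on Tokio: the `Msg` enum, `spawn_counter`, and `ask_total` are hypothetical, and `std::sync::mpsc::Sender` stands in for a one-shot reply channel (unlike a Tokio oneshot sender, it happens to be `Clone`).

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Messages for a hypothetical counter actor.
pub enum Msg {
    /// tell: fire-and-forget
    Add(u64),
    /// ask: carries a channel on which the actor replies
    GetTotal(Sender<u64>),
    /// stops the actor
    PoisonPill,
}

/// Handle typed only by the message type, not by the actor type.
pub struct ActorRef<M> {
    tx: Sender<M>,
}

impl<M> Clone for ActorRef<M> {
    fn clone(&self) -> Self {
        ActorRef { tx: self.tx.clone() }
    }
}

impl<M> ActorRef<M> {
    /// Returns false if the actor has already stopped.
    pub fn tell(&self, msg: M) -> bool {
        self.tx.send(msg).is_ok()
    }
}

impl ActorRef<Msg> {
    /// Sends a request and blocks until the actor replies.
    pub fn ask_total(&self) -> Option<u64> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(Msg::GetTotal(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}

/// Spawns the actor loop on its own thread; it owns its state exclusively.
pub fn spawn_counter() -> (ActorRef<Msg>, JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut total = 0u64;
        for msg in rx {
            match msg {
                Msg::Add(n) => total += n,
                Msg::GetTotal(reply) => {
                    let _ = reply.send(total);
                }
                Msg::PoisonPill => break,
            }
        }
        total
    });
    (ActorRef { tx }, handle)
}
```

Because messages from one sender are handled in order, an `ask_total` issued after two `tell` calls observes both additions.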
37 / 38
Scala: Object Graphs and Any
class Timeline extends BufferedIterator[Span[Payload]]
final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) {
  def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload))
}
type Event[+A] = Span[EventsAtSpanEnd[A]]
@newtype final case class EventsAtSpanEnd[+A](events: Iterable[A])
- BufferedIterator must be on the heap
- Each Span Payload is also boxed and on the heap, even for numbers
- To be dynamically interpretable, we need BufferedIterator[Span[Any]] in many places :(
- Yes, specialization is possible, at the cost of complexity
38 / 38

More Related Content

What's hot (20)

PDF
Kafka Connect:Iceberg Sink Connectorを使ってみる
MicroAd, Inc.(Engineer)
 
PDF
Introdution to Dataops and AIOps (or MLOps)
Adrien Blind
 
PDF
AlmaLinux と Rocky Linux の誕生経緯&比較
beyond Co., Ltd.
 
PDF
Logging, Metrics, and APM: The Operations Trifecta (P)
Elasticsearch
 
PDF
MLOps Bridging the gap between Data Scientists and Ops.
Knoldus Inc.
 
PPTX
Azure Functions with terraform
Tomokazu Tochi
 
PDF
The A-Z of Data: Introduction to MLOps
DataPhoenix
 
PDF
Using MLOps to Bring ML to Production/The Promise of MLOps
Weaveworks
 
PDF
How to Monitoring the SRE Golden Signals (E-Book)
Siglos
 
PDF
僕とヤフーと時々Teradata #prestodb
Yahoo!デベロッパーネットワーク
 
PDF
OpenSC: eID interoperability through open source software
Martin Paljak
 
PDF
Oracle Cloud Infrastructure:2023年2月度サービス・アップデート
オラクルエンジニア通信
 
PDF
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Platform - Japan
 
PDF
The Ensemble Logical Model (by Remco Broekmans)
Patrick Van Renterghem
 
PDF
Machine Learning and Deep Learning Applied to Real Time with Apache Kafka Str...
confluent
 
PDF
JCBの Payment as a Service 実現にむけたゼロベースの組織変革とテクニカル・イネーブラー(NTTデータ テクノロジーカンファレンス ...
NTT DATA Technology & Innovation
 
PDF
SMTPのSTARTTLSにおけるTLSバージョンについて
Sparx Systems Japan
 
PDF
Azure Kubernetes Service Overview
Takeshi Fukuhara
 
PDF
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
Preferred Networks
 
PPTX
Oracle Forms Modernization Roadmap
Kai-Uwe Möller
 
Kafka Connect:Iceberg Sink Connectorを使ってみる
MicroAd, Inc.(Engineer)
 
Introdution to Dataops and AIOps (or MLOps)
Adrien Blind
 
AlmaLinux と Rocky Linux の誕生経緯&比較
beyond Co., Ltd.
 
Logging, Metrics, and APM: The Operations Trifecta (P)
Elasticsearch
 
MLOps Bridging the gap between Data Scientists and Ops.
Knoldus Inc.
 
Azure Functions with terraform
Tomokazu Tochi
 
The A-Z of Data: Introduction to MLOps
DataPhoenix
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Weaveworks
 
How to Monitoring the SRE Golden Signals (E-Book)
Siglos
 
僕とヤフーと時々Teradata #prestodb
Yahoo!デベロッパーネットワーク
 
OpenSC: eID interoperability through open source software
Martin Paljak
 
Oracle Cloud Infrastructure:2023年2月度サービス・アップデート
オラクルエンジニア通信
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Platform - Japan
 
The Ensemble Logical Model (by Remco Broekmans)
Patrick Van Renterghem
 
Machine Learning and Deep Learning Applied to Real Time with Apache Kafka Str...
confluent
 
JCBの Payment as a Service 実現にむけたゼロベースの組織変革とテクニカル・イネーブラー(NTTデータ テクノロジーカンファレンス ...
NTT DATA Technology & Innovation
 
SMTPのSTARTTLSにおけるTLSバージョンについて
Sparx Systems Japan
 
Azure Kubernetes Service Overview
Takeshi Fukuhara
 
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
Preferred Networks
 
Oracle Forms Modernization Roadmap
Kai-Uwe Möller
 

Similar to Porting a Streaming Pipeline from Scala to Rust (20)

PDF
Rust: Reach Further (from QCon Sao Paolo 2018)
nikomatsakis
 
PDF
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
ScyllaDB
 
PDF
Practical Rust 1x Cookbook Rustacean Team
mavukahimota22
 
PDF
Rust in Action Systems programming concepts and techniques 1st Edition Tim Mc...
paaolablan
 
PDF
Reactive Software Systems
Behrad Zari
 
PDF
Intro to Rust 2019
Timothy Bess
 
PDF
The Rust Programming Language 2nd Edition Second Converted Steve Klabnik Caro...
cizekchingbj
 
PDF
The Rust Programming Language
Mario Alexandro Santini
 
PPTX
Introduction to Rust (Presentation).pptx
Knoldus Inc.
 
PDF
Rust All Hands Winter 2011
Patrick Walton
 
ODP
Rust Primer
Knoldus Inc.
 
PPTX
Indic threads pune12-typesafe stack software development on the jvm
IndicThreads
 
PDF
State of Akka 2017 - The best is yet to come
Konrad Malawski
 
PPT
Devoxx
Martin Odersky
 
PDF
Rust: Systems Programming for Everyone
C4Media
 
PDF
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
PPTX
Building reactive systems with Akka
Kristof Jozsa
 
PDF
The Rust Programming Language Second Edition 2 Converted Steve Klabnik
nuovochady2s
 
PDF
The Rust Programming Language 2nd Edition Steve Klabnik
urbuuawara
 
PDF
SE 20016 - programming languages landscape.
Ruslan Shevchenko
 
Rust: Reach Further (from QCon Sao Paolo 2018)
nikomatsakis
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
ScyllaDB
 
Practical Rust 1x Cookbook Rustacean Team
mavukahimota22
 
Rust in Action Systems programming concepts and techniques 1st Edition Tim Mc...
paaolablan
 
Reactive Software Systems
Behrad Zari
 
Intro to Rust 2019
Timothy Bess
 
The Rust Programming Language 2nd Edition Second Converted Steve Klabnik Caro...
cizekchingbj
 
The Rust Programming Language
Mario Alexandro Santini
 
Introduction to Rust (Presentation).pptx
Knoldus Inc.
 
Rust All Hands Winter 2011
Patrick Walton
 
Rust Primer
Knoldus Inc.
 
Indic threads pune12-typesafe stack software development on the jvm
IndicThreads
 
State of Akka 2017 - The best is yet to come
Konrad Malawski
 
Rust: Systems Programming for Everyone
C4Media
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Building reactive systems with Akka
Kristof Jozsa
 
The Rust Programming Language Second Edition 2 Converted Steve Klabnik
nuovochady2s
 
The Rust Programming Language 2nd Edition Steve Klabnik
urbuuawara
 
SE 20016 - programming languages landscape.
Ruslan Shevchenko
 
Ad

More from Evan Chan (17)

PDF
Time-State Analytics: MinneAnalytics 2024 Talk
Evan Chan
 
PDF
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan
 
PDF
Histograms at scale - Monitorama 2019
Evan Chan
 
PDF
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
PDF
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
PDF
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
PDF
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
PDF
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
 
PDF
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
PDF
Productionizing Spark and the Spark Job Server
Evan Chan
 
PDF
Akka in Production - ScalaDays 2015
Evan Chan
 
PDF
MIT lecture - Socrata Open Data Architecture
Evan Chan
 
PDF
OLAP with Cassandra and Spark
Evan Chan
 
PDF
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
PDF
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan
 
PDF
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan
 
PDF
Real-time Analytics with Cassandra, Spark, and Shark
Evan Chan
 
Time-State Analytics: MinneAnalytics 2024 Talk
Evan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan
 
Histograms at scale - Monitorama 2019
Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
 
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 
Productionizing Spark and the Spark Job Server
Evan Chan
 
Akka in Production - ScalaDays 2015
Evan Chan
 
MIT lecture - Socrata Open Data Architecture
Evan Chan
 
OLAP with Cassandra and Spark
Evan Chan
 
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Evan Chan
 
Ad

Recently uploaded (20)

PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PDF
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
PPTX
Java Native Memory Leaks: The Hidden Villain Behind JVM Performance Issues
Tier1 app
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PDF
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
PPTX
3uTools Full Crack Free Version Download [Latest] 2025
muhammadgurbazkhan
 
PPTX
MailsDaddy Outlook OST to PST converter.pptx
abhishekdutt366
 
PPTX
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
PDF
Understanding the Need for Systemic Change in Open Source Through Intersectio...
Imma Valls Bernaus
 
PPTX
Platform for Enterprise Solution - Java EE5
abhishekoza1981
 
PDF
Powering GIS with FME and VertiGIS - Peak of Data & AI 2025
Safe Software
 
PPTX
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 
PPTX
Tally software_Introduction_Presentation
AditiBansal54083
 
PPTX
MiniTool Power Data Recovery Full Crack Latest 2025
muhammadgurbazkhan
 
PDF
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
PPTX
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PPTX
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
PPTX
Migrating Millions of Users with Debezium, Apache Kafka, and an Acyclic Synch...
MD Sayem Ahmed
 
PDF
Executive Business Intelligence Dashboards
vandeslie24
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
Java Native Memory Leaks: The Hidden Villain Behind JVM Performance Issues
Tier1 app
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
3uTools Full Crack Free Version Download [Latest] 2025
muhammadgurbazkhan
 
MailsDaddy Outlook OST to PST converter.pptx
abhishekdutt366
 
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
Understanding the Need for Systemic Change in Open Source Through Intersectio...
Imma Valls Bernaus
 
Platform for Enterprise Solution - Java EE5
abhishekoza1981
 
Powering GIS with FME and VertiGIS - Peak of Data & AI 2025
Safe Software
 
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 
Tally software_Introduction_Presentation
AditiBansal54083
 
MiniTool Power Data Recovery Full Crack Latest 2025
muhammadgurbazkhan
 
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
Migrating Millions of Users with Debezium, Apache Kafka, and an Acyclic Synch...
MD Sayem Ahmed
 
Executive Business Intelligence Dashboards
vandeslie24
 

Porting a Streaming Pipeline from Scala to Rust

  • 1. Lessons: Porting a Streaming Pipeline from Scala to Rust 2023 Scale by the Bay Evan Chan Principal Engineer - Conviva http://velvia.github.io/presentations/2023-conviva-scala-to-rust 1 / 38
  • 3. Massive Real-time Streaming Analytics 5 trillion events processed per day 800-2000GB/hour (not peak!!) Started with custom Java code went through Spark Streaming and Flink iterations Most backend data components in production are written in Scala Today: 420 pods running custom Akka Streams processors 3 / 38
  • 4. Data World is Going Native and Rust Going native: Python, end of Moore's Law, cloud compute Safe, fast, and high-level abstractions Functional data patterns - map, fold, pattern matching, etc. Static dispatch and no allocations by default PyO3 - Rust is the best way to write native Python extensions JVM Rust projects Spark, Hive DataFusion, Ballista, Amadeus Flink Arroyo, RisingWave, Materialize Kafka/KSQL Fluvio ElasticSearch / Lucene Toshi, MeiliDB Cassandra, HBase Skytable, Sled, Sanakirja... Neo4J TerminusDB, IndraDB 4 / 38
  • 5. About our Architecture graph LR; SAE(Streaming Data Pipeline) Sensors --> Gateways Gateways --> Kafka Kafka --> SAE SAE --> DB[(Metrics Database)] DB --> Dashboards 5 / 38
  • 6. What We Are Porting to Rust graph LR; classDef highlighted fill:#99f,stroke:#333,stroke-width:4px SAE(Streaming Data Pipeline) Sensors:::highlighted --> Gateways:::highlighted Gateways --> Kafka Kafka --> SAE:::highlighted SAE --> DB[(Metrics Database)] DB --> Dashboards graph LR; Notes1(Sensors: consolidate fragmented code base) Notes2(Gateway: Improve on JVM and Go) Notes3(Pipeline: Improve efficiency New operator architecture) Notes1 ~~~ Notes2 Notes2 ~~~ Notes3 6 / 38
  • 7. Our Journey to Rust gantt title From Hackathon to Multiple Teams dateFormat YYYY-MM axisFormat %y-%b section Data Pipeline Hackathon :Small Kafka ingestion project, 2022-11, 30d Scala prototype :2023-02, 6w Initial Rust Port : small team, 2023-04, 45d Bring on more people :2023-07, 8w 20-25 people 4 teams :2023-11, 1w section Gateway Go port :2023-07, 6w Rust port :2023-09, 4w “I like that if it compiles, I know it will work, so it gives confidence.” 7 / 38
  • 8. Promising Rust Hackathon graph LR; Kafka --> RustDeser(Rust Deserializer) RustDeser --> RA(Rust Actors - Lightweight Processing) Measurement Improvement over Scala/Akka Throughput (CPU) 2.6x more Memory used 12x less Mostly I/O-bound lightweight deserialization and processing workload Found out Actix does not work well with Tokio 8 / 38
  • 9. Performance Results - Gateway 9 / 38
  • 10. Key Lessons or Questions What matters for a Rust port? The 4 P's ? People How do we bring developers onboard? Performance How do I get performance? Data structures? Static dispatch? Patterns What coding patterns port well from Scala? Async? Project How do I build? Tooling, IDEs? 10 / 38
  • 11. People How do we bring developers onboard? 11 / 38
  • 12. A Phased Rust Bringup We ported our main data pipeline in two phases: Phase Team Rust Expertise Work First 3-5, very senior 1-2 with significant Rust Port core project components Second 10-15, mixed, distributed Most with zero Rust Smaller, broken down tasks Have organized list of learning resources 2-3 weeks to learn Rust and come up to speed 12 / 38
  • 13. Difficulties: Lifetimes Compiler errors Porting previous patterns Ownership and async etc. How we helped: Good docs Start with tests ChatGPT! Rust Book Office hours Lots of detailed reviews Split project into async and sync cores Overcoming Challenges 13 / 38
  • 14. Performance Data structures, static dispatch, etc. "I enjoy the fact that the default route is performant. It makes you write performant code, and if you go out the way, it becomes explicit (e.g., with dyn, Boxed, or clone etc). " 14 / 38
  • 15. Porting from Scala: Huge Performance Win graph LR; classDef highlighted fill:#99f,stroke:#333,stroke-width:4px SAE(Streaming Data Pipeline) Sensors --> Gateways Gateways --> Kafka Kafka --> SAE:::highlighted SAE --> DB[(Metrics Database)] DB --> Dashboards CPU-bound, programmable, heavy data processing Neither Rust nor Scala is productionized nor optimized Same architecture and same input/outputs Scala version was not designed for speed, lots of objects Rust: we chose static dispatch and minimizing allocations Type of comparison Improvement over Scala Throughput, end to end 22x Throughput, single-threaded microbenchmark >= 40x 15 / 38
  • 16. Building a Flexible Data Pipeline graph LR; RawEvents(Raw Events) RawEvents -->| List of numbers | Extract1 RawEvents --> Extract2 Extract1 --> DoSomeMath Extract2 --> TransformSomeFields DoSomeMath --> Filter1 TransformSomeFields --> Filter1 Filter1 --> MoreProcessing An interpreter passes time-ordered data between flexible DAG of operators. Span1 Start time: 1000 End time: 1100 Events: ["start", "click"] Span2 Start time: 1100 End time: 1300 Events: ["ad_load"] 16 / 38
  • 17. Scala: Object Graph on Heap graph TB; classDef default font- size:24px ArraySpan["`Array[Span]`"] TL(Timeline - Seq) --> ArraySpan ArraySpan --> Span1["`Span(start, end, Payload)`"] ArraySpan --> Span2["`Span(start, end, Payload)`"] Span1 --> EventsAtSpanEnd("`Events(Seq[A])`") EventsAtSpanEnd --> ArrayEvent["`Array[A]`"] Rust: mostly stack based / 0 alloc: flowchart TB; subgraph Timeline subgraph OutputSpans subgraph Span1 subgraph Events EvA ~~~ EvB end TimeInterval ~~~ Events end subgraph Span2 Time2 ~~~ Events2 end Span1 ~~~ Span2 end DataType ~~~ OutputSpans end Data Structures: Scala vs Rust 17 / 38
  • 18. Rust: Using Enums and Avoiding Boxing pub enum Timeline { EventNumber(OutputSpans<EventsAtEnd<f64>>), EventBoolean(OutputSpans<EventsAtEnd<bool>>), EventString(OutputSpans<EventsAtEnd<DataString>>), } type OutputSpans<V> = SmallVec<[Spans<V>; 2]>; pub struct Span<SV: SpanValue> { pub time: TimeInterval, pub value: SV, } pub struct EventsAtEnd<V>(SmallVec<[V; 1]>); In the above, the Timeline enum can fit entirely in the stack and avoid all boxing and allocations, if: The number of spans is very small, below limit set in code The number of events in each span is very small (1 in this case, which is the common case) The base type is a primitive, or a string which is below a certain length 18 / 38
  • 19. Avoiding Allocations using SmallVec and SmallString SmallVec is something like this: pub enum SmallVec<T, const N: usize> { Stack([T; N]), Heap(Vec<T>), } The enum can hold up to N items inline in an array with no allocations, but switches to the Heap variant if the number of items exceeds N. There are various crates for small strings and other data structures. 19 / 38
  • 20. Static vs Dynamic Dispatch Often one will need to work with many different structs that implement a Trait -- for us, different operator implementations supporting different types. Static dispatch and inlined code is much faster. 1. Monomorphisation using generics fn execute_op<O: Operator>(op: O) -> Result<...> Compiler creates a new instance of execute_op for every different O Only works when you know in advance what Operator to pass in 2. Use Enums and enum_dispatch fn execute_op(op: OperatorEnum) -> Result<...> 3. Dynamic dispatch fn execute_op(op: Box<dyn Operator>) -> Result<...> fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation) 4. Function wrapping Embedding functions in a generic struct 20 / 38
  • 21. enum_dispatch
    Suppose you have
        trait KnobControl {
            fn set_position(&mut self, value: f64);
            fn get_value(&self) -> f64;
        }
        struct LinearKnob {
            position: f64,
        }
        struct LogarithmicKnob {
            position: f64,
        }
        impl KnobControl for LinearKnob...
    enum_dispatch lets you do this:
        #[enum_dispatch]
        trait KnobControl {
            //...
        }
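To make the idea concrete, here is roughly the boilerplate that the enum_dispatch macro saves you from writing, spelled out by hand with no external crate. The `Knob` enum, its variant names, and the bodies of the two impls (including the logarithmic formula) are assumptions for illustration; the slide elides them.

```rust
pub trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}

pub struct LinearKnob {
    pub position: f64,
}

pub struct LogarithmicKnob {
    pub position: f64,
}

impl KnobControl for LinearKnob {
    fn set_position(&mut self, value: f64) { self.position = value; }
    fn get_value(&self) -> f64 { self.position }
}

impl KnobControl for LogarithmicKnob {
    fn set_position(&mut self, value: f64) { self.position = value; }
    // Assumed behaviour: a log-scaled reading of the position.
    fn get_value(&self) -> f64 { (self.position + 1.0).log2() }
}

/// Hand-written wrapper enum: one variant per implementor.
pub enum Knob {
    Linear(LinearKnob),
    Logarithmic(LogarithmicKnob),
}

/// The trait is implemented on the enum by forwarding to the inner value.
/// Each arm is a static call, so no vtable is involved.
impl KnobControl for Knob {
    fn set_position(&mut self, value: f64) {
        match self {
            Knob::Linear(k) => k.set_position(value),
            Knob::Logarithmic(k) => k.set_position(value),
        }
    }
    fn get_value(&self) -> f64 {
        match self {
            Knob::Linear(k) => k.get_value(),
            Knob::Logarithmic(k) => k.get_value(),
        }
    }
}

impl From<LinearKnob> for Knob {
    fn from(k: LinearKnob) -> Self { Knob::Linear(k) }
}

impl From<LogarithmicKnob> for Knob {
    fn from(k: LogarithmicKnob) -> Self { Knob::Logarithmic(k) }
}
```

Code that holds a `Knob` can call `set_position` and `get_value` without knowing the variant, much as it would with `Box<dyn KnobControl>`, but without the allocation.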
  • 22. Function wrapping
    Static function wrapping - no generics
        pub struct OperatorWrapper {
            name: String,
            func: fn(input: &Data) -> Data,
        }
    Need a generic - but accepts closures
        pub struct OperatorWrapper<F>
        where
            F: Fn(&Data) -> Data,
        {
            name: String,
            func: F,
        }
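The two wrappers from the slide above differ mainly in what they can hold: a plain `fn` pointer cannot capture state, while a generic `F: Fn` can. A runnable sketch follows; the `Data` newtype, the struct names `FnOperator` and `ClosureOperator`, and the `run` and `name` methods are all hypothetical.

```rust
/// Hypothetical payload type standing in for the pipeline's Data.
#[derive(Debug, Clone, PartialEq)]
pub struct Data(pub f64);

/// Static function wrapping: a plain function pointer, no generics.
pub struct FnOperator {
    name: String,
    func: fn(&Data) -> Data,
}

impl FnOperator {
    pub fn new(name: &str, func: fn(&Data) -> Data) -> Self {
        FnOperator { name: name.to_string(), func }
    }
    pub fn name(&self) -> &str { &self.name }
    pub fn run(&self, input: &Data) -> Data { (self.func)(input) }
}

/// Generic wrapping: F can be a closure that captures its environment.
pub struct ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    name: String,
    func: F,
}

impl<F> ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub fn new(name: &str, func: F) -> Self {
        ClosureOperator { name: name.to_string(), func }
    }
    pub fn name(&self) -> &str { &self.name }
    pub fn run(&self, input: &Data) -> Data { (self.func)(input) }
}

/// A free function usable with either wrapper.
pub fn double(input: &Data) -> Data {
    Data(input.0 * 2.0)
}
```

Note that each distinct closure gives `ClosureOperator<F>` a distinct type, so a collection of mixed operators still needs an enum or `Box<dyn Fn>`.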
  • 24. Rust Async: Different Paradigms
    "Async: It is well designed... Yes, it is still pretty complicated piece of code, but the logic or the framework is easier to grasp compared to other languages."
    Having to use Arc: Data Structures are not Thread-safe by default!
    Scala | Rust
    Futures | futures, async functions
    ?? | async-await
    Actors (Akka) | Actix, Bastion, etc.
    Async streams | Tokio streams
    Reactive (Akka streams, Monix, ZIO) | reactive_rs, rxRust, etc.
  • 25. Replacing Akka: Actors in Rust
    - Actix threading model doesn't mix well with Tokio
    - We moved to tiny-tokio-actor, then wrote our own
        pub struct AnomalyActor {}

        #[async_trait]
        impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor {
            async fn handle(
                &mut self,
                msg: Anomaly,
                ctx: &mut ActorContext<Anomaly>,
            ) -> Result<(), Report<AnomalyActorError>> {
                use Anomaly::*;
                match msg {
                    QuantityOverflowAnomaly {
                        ctx: _,
                        ts: _,
                        qual: _,
                        qty: _,
                        cnt: _,
                        data: _,
                    } => {}
                    PoisonPill => {
                        ctx.stop();
                    }
                }
                Ok(())
            }
        }
  • 26. Other Patterns to Learn
    Old Pattern | New Pattern
    No inheritance | Use composition! Compose data structures; compose small Traits
    No exceptions | Use Result and ?
    Data structures are not thread-safe | Learn to use Arc etc.
    Returning Iterators | Don't return things that borrow other things. This makes life difficult.
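Two of the rows above, "Use Result and ?" and "Learn to use Arc", are easiest to see in code. The sketch below uses only the standard library; the function names `parse_sum` and `count_in_parallel` are invented for illustration, and plain threads are used where the talk's pipeline would use async tasks.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;
use std::sync::{Arc, Mutex};
use std::thread;

/// Instead of throwing, each fallible step returns a Result;
/// `?` returns early with the error if a parse fails.
pub fn parse_sum(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x: i64 = a.trim().parse()?;
    let y: i64 = b.trim().parse()?;
    Ok(x + y)
}

/// A HashMap is not shareable across threads by itself: wrap it in a
/// Mutex for exclusive access and an Arc for shared ownership.
pub fn count_in_parallel(words: Vec<&'static str>) -> HashMap<&'static str, usize> {
    let counts: Arc<Mutex<HashMap<&'static str, usize>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = words
        .into_iter()
        .map(|word| {
            // Each thread gets its own handle to the same map.
            let counts = Arc::clone(&counts);
            thread::spawn(move || {
                *counts.lock().unwrap().entry(word).or_insert(0) += 1;
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Copy the final map out from behind the lock.
    let result = counts.lock().unwrap().clone();
    result
}
```

Without the Arc and Mutex, the closure passed to `thread::spawn` would be rejected at compile time, which is the "not thread-safe by default" lesson in practice.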
  • 27. Type Classes
    In Rust, type classes (Traits) are smaller and more compositional.
        pub trait Inhale {
            fn sniff(&self);
        }
    You can implement new Traits for existing types, and have different impl's for different types.
        impl Inhale for String {
            fn sniff(&self) {
                println!("I sniffed {}", self);
            }
        }

        // Only implemented for specific N subtypes of MyStruct
        impl<N: Numeric> Inhale for MyStruct<N> {
            fn sniff(&self) {
                ....
            }
        }
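A complete, compilable variant of the slide's example, under a few assumptions: `Numeric` is defined here as a hypothetical marker trait, `MyStruct` is a one-field tuple struct, and `sniff` returns a `String` rather than printing so the result can be checked.

```rust
use std::fmt::Display;

pub trait Inhale {
    fn sniff(&self) -> String;
}

/// A new trait implemented for an existing standard-library type.
impl Inhale for String {
    fn sniff(&self) -> String {
        format!("I sniffed {}", self)
    }
}

/// Hypothetical marker trait: only types that opt in count as Numeric.
pub trait Numeric: Display {}
impl Numeric for i32 {}
impl Numeric for f64 {}

pub struct MyStruct<N>(pub N);

/// Implemented only when the type parameter is Numeric, so
/// MyStruct<String> has no sniff method at all.
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) -> String {
        format!("I sniffed the number {}", self.0)
    }
}
```

Calling `sniff` on a `MyStruct<String>` is a compile-time error rather than a runtime one, since no impl exists for that instantiation.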
  • 29. "Cargo is the best build tool ever"
    - Almost no dependency conflicts due to multiple dep versioning
    - Configuration by convention - common directory/file layouts for example
    - Really simple .toml - no need for XML, functional Scala, etc.
    - Rarely need code to build anything, even for large projects
        [package]
        name = "telemetry-subscribers"
        version = "0.3.0"
        license = "Apache-2.0"
        description = "Library for common telemetry and observability functionality"

        [dependencies]
        console-subscriber = { version = "0.1.6", optional = true }
        crossterm = "0.25.0"
        once_cell = "1.13.0"
        opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true }
  • 30. IDEs, CI, and Tooling
    IDEs/Editors | VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer
    Code Coverage | VSCode inline, grcov/lcov, Tarpaulin (Linux only)
    Slow build times | Caching: cargo-chef, rust-cache
    Slow test times | cargo-nextest
    Property Testing | proptest
    Benchmarking | Criterion
    https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/
    VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH.
  • 31. Rust Resources and Projects
    - https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust projects and learning resources
    - https://github.com/rust-unofficial/awesome-rust
    - https://www.arewelearningyet.com - ML focused
  • 32. What do we miss from Scala?
    - More mature libraries - in some cases: HDFS, etc.
    - Good streaming libraries - like Monix, Akka Streams etc.
    - I guess all of Akka
    - "Less misleading compiler messages"
      - Rust error messages read better from the CLI, IMO (not an IDE)
  • 33. Takeaways
    - It's a long journey but Rust is worth it.
    - Structuring a project for a successful onramp is really important
    - Think about data structure design early on
    - Allow plenty of time to ramp up on Rust patterns and tools
    We are hiring across multiple roles/levels!
  • 36. Data World is Going Native (from JVM)
    - The rise of Python and Data Science
      - Led to AnyScale, Dask, and many other Python-oriented data frameworks
    - Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.)
    - Migration from Hadoop/HDFS to more cloud-based data architectures
    - Apache Arrow and other data interchange formats
    - Hardware architecture trends - end of Moore's Law, rise of GPUs etc.
  • 37. Why We Went with our Own Actors
    1. Initial Hackathon prototype used Actix
       - Actix has its own event-loop / threading model, using Arbiters
       - Difficult to co-exist with Tokio and configure both
    2. Moved to tiny-tokio-actor
       - Really thin layer on top of Tokio
       - 25% improvement over rdkafka + Tokio + Actix
    3. Ultimately wrote our own, 100-line mini Actor framework
       - tiny-tokio-actor required messages to be Clone, so we could not, for example, send OneShot channels for other actors to reply
       - Wanted ActorRef<MessageType> instead of ActorRef<ActorType, MessageType>
       - Supports tell() and ask() semantics
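The shape of such a mini actor layer can be sketched in a few dozen lines. This is not the framework from the talk: it uses an OS thread and `std::sync::mpsc` where the original is built on Tokio, and `CounterMsg`, `spawn_counter`, and the exact `tell`/`ask` signatures are invented for illustration. It does show the two points from the slide: `ActorRef` is generic over the message type only, and messages can carry a reply channel because they need not be `Clone`.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical message type. The Get variant carries a reply channel,
/// playing the role of a oneshot sender.
pub enum CounterMsg {
    Add(u64),
    Get(Sender<u64>),
    PoisonPill,
}

/// A handle parameterised only by the message type.
pub struct ActorRef<M> {
    tx: Sender<M>,
}

impl<M> ActorRef<M> {
    /// Fire-and-forget send; returns false if the actor has stopped.
    pub fn tell(&self, msg: M) -> bool {
        self.tx.send(msg).is_ok()
    }

    /// Request/response: build a message around a fresh reply channel,
    /// send it, and block until the actor answers (None if it stopped).
    pub fn ask<R>(&self, make_msg: impl FnOnce(Sender<R>) -> M) -> Option<R> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(make_msg(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}

/// Spawns an actor that owns a running total and handles one
/// message at a time, so its state needs no locking.
pub fn spawn_counter() -> (ActorRef<CounterMsg>, JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut total = 0u64;
        for msg in rx {
            match msg {
                CounterMsg::Add(n) => total += n,
                CounterMsg::Get(reply) => {
                    // Ignore the error if the asker went away.
                    let _ = reply.send(total);
                }
                CounterMsg::PoisonPill => break,
            }
        }
        total
    });
    (ActorRef { tx }, handle)
}
```

A caller would use `actor.tell(CounterMsg::Add(2))` to update state and `actor.ask(CounterMsg::Get)` to read it back; an async version would swap the channel types and make `ask` an `async fn`.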
  • 38. Scala: Object Graphs and Any
        class Timeline extends BufferedIterator[Span[Payload]]

        final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) {
          def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload))
        }

        type Event[+A] = Span[EventsAtSpanEnd[A]]

        @newtype final case class EventsAtSpanEnd[+A](events: Iterable[A])
    - BufferedIterator must be on the heap
    - Each Span Payload is also boxed and on the heap, even for numbers
    - To be dynamically interpretable, we need BufferedIterator[Span[Any]] in many places :(
    - Yes, specialization is possible, at the cost of complexity