Connect S3 with Kafka
leveraging Akka Streams
Who am I?
Seiya Mizuno (@Saint1991)
Developing a data processing platform
Agenda
• Introduction to Akka Streams (HERE!)
• Components of Akka Streams
• Glance at GraphStage
• Connect S3 with Kafka using Alpakka
Introduction to Akka Streams
What is Akka Streams?
• A toolkit for processing data streams on top of Akka actors
• Describes a processing pipeline as a graph
• Makes complex pipelines easy to define
Example graph: Source → Broadcast → Flow / Flow → Merge → Sink
Input (Source)
• Generating stream elements
• Fetching stream elements from outside
Processing (Flow)
• Processing stream elements sent from upstream one by one
Output (Sink)
• To a file
• To external resources
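The three roles above can be tried in a few lines. What follows is a minimal sketch, assuming the Akka 2.5-era API used throughout these slides (`ActorMaterializer`); the object name `LinearPipeline` and the bucket name are illustrative only.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object: one Source, one Flow, one Sink in a straight line.
object LinearPipeline {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("linear-pipeline")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val source = Source(List("key1", "key2"))                  // input
      val toUrl  = Flow[String].map(key => s"s3://bucketA/$key") // processing
      val sink   = Sink.seq[String]                              // output (collects elements)
      Await.result(source.via(toUrl).runWith(sink), 10.seconds)
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```

`Sink.seq` is used here only so that the result can be inspected; a real pipeline would write to a file or an external system instead.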
Sample code!
import akka.Done
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.Future

implicit val system = ActorSystem()
implicit val dispatcher = system.dispatcher
implicit val mat = ActorMaterializer()

// stream elements
val s3Keys = List("key1", "key2")
// a sink that prints received stream elements
val sinkForeach = Sink.foreach[String](println)

val blueprint: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(GraphDSL.create(sinkForeach) {
  implicit builder: GraphDSL.Builder[Future[Done]] =>
  sink: Sink[String, Future[Done]]#Shape =>
    import GraphDSL.Implicits._
    // a source that sends the elements defined above
    val src = Source(s3Keys)
    // a flow that maps a received element to the URL of bucket A
    val flowA = Flow[String].map(key => s"s3://bucketA/$key")
    // a flow that maps a received element to the URL of bucket B
    val flowB = Flow[String].map(key => s"s3://bucketB/$key")
    // a junction that broadcasts received elements to 2 outlets
    val broadcast = builder.add(Broadcast[String](2))
    // a junction that merges received elements from 2 inlets
    val merge = builder.add(Merge[String](2))
    // THIS IS THE GREAT FUNCTIONALITY OF GraphDSL: the graph is easy to describe
    src ~> broadcast ~> flowA ~> merge ~> sink
           broadcast ~> flowB ~> merge
    ClosedShape
})

// Run the graph!!! Terminate the actor system when the graph is completed
blueprint.run().onComplete(_ => system.terminate())
Easy to use without knowing the details of Akka actors
GOOD!
Akka Streams does everything implicitly
The implicit values at the top of the sample code do the work:
implicit val system = ActorSystem()
implicit val dispatcher = system.dispatcher  // dispatches threads to actors
implicit val mat = ActorMaterializer()       // creates actors
When RunnableGraph#run is called, the materializer creates Akka actors based on the blueprint, and processing starts!!!
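Because the blueprint is only a description, it can be run more than once, and each run is materialized independently. A minimal sketch of that behavior, assuming the same Akka 2.5-era API as the slides (the object name `ReusableBlueprint` is hypothetical):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical example object: one blueprint, two independent materializations.
object ReusableBlueprint {
  def run(): (Int, Int) = {
    implicit val system: ActorSystem = ActorSystem("reusable-blueprint")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      // Nothing runs here: this value only describes "sum the numbers 1 to 3".
      val blueprint: RunnableGraph[Future[Int]] =
        Source(1 to 3).toMat(Sink.fold[Int, Int](0)(_ + _))(Keep.right)

      // Each call to run() creates fresh actors and a fresh result.
      val first  = Await.result(blueprint.run(), 10.seconds)
      val second = Await.result(blueprint.run(), 10.seconds)
      (first, second)
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```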
Conclusion
• Build a graph with Source, Flow, Sink, etc.
• Declare a materializer as an implicit value
RunnableGraph → ActorMaterializer → Actors
The graph then runs on actors almost automatically!!!
Tips
• To return a materialized value from GraphDSL, the graph component that creates the materialized value has to be passed to GraphDSL#create. So it must be defined outside the GraphDSL builder… orz (This is why sinkForeach in the sample code is defined before GraphDSL.create(sinkForeach).)
• If no materialized value is defined, the blueprint does not return a completion future…
• The process does not exit until the ActorSystem is terminated. Don't forget to terminate it!!!
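The second tip also applies outside GraphDSL. As a small sketch under the same Akka 2.5-era assumptions (the object name `MaterializedValueTip` is made up for illustration): `to` keeps the source's materialized value, so no completion future is available, while `toMat` with `Keep.right` keeps the sink's.

```scala
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical example object contrasting the two ways of attaching a sink.
object MaterializedValueTip {
  def run(): Done = {
    implicit val system: ActorSystem = ActorSystem("materialized-value-tip")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val source = Source(1 to 3)
      val sink   = Sink.foreach[Int](_ => ())

      // `to` keeps the left (source) value: there is nothing to wait on.
      val noCompletion: RunnableGraph[NotUsed] = source.to(sink)

      // `toMat(...)(Keep.right)` keeps the sink's Future[Done].
      val withCompletion: RunnableGraph[Future[Done]] = source.toMat(sink)(Keep.right)

      Await.result(withCompletion.run(), 10.seconds)
    } finally {
      // Without this the JVM process would keep running.
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```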
Glance at GraphStage
Remarkable features of Akka Streams are…
• Asynchronous message passing
• Efficient use of CPU
• Back pressure
Source ← Sink: ① Request the next element
Source → Sink: ② Send an element
Upstream stages send elements only when they have received a request from downstream, so the downstream buffer does not overflow.
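One visible consequence of this request-driven protocol is that an infinite source is safe to use: elements are produced only in response to downstream demand, and the stream stops once downstream needs nothing more. A minimal sketch, again assuming the Akka 2.5-era API (the object name `DemandDriven` is illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object: an infinite source consumed on demand.
object DemandDriven {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("demand-driven")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      // 1, 2, 3, ... without end; never computed eagerly.
      val naturals = Source.fromIterator(() => Iterator.from(1))

      // take(3) cancels upstream after three elements, so the stream completes.
      Await.result(naturals.take(3).runWith(Sink.seq), 10.seconds)
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```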
What is GraphStage?
Every graph component is a GraphStage!!
Source ← Sink: ① Request the next element
Source → Sink: ② Send an element
Not found in the Akka Streams standard library? But want back pressure??? Implement a custom GraphStage!!!
A source stage that emits Fibonacci numbers
import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}

class FibonacciSource(to: Int) extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("Fibonacci.out")
  // Define the shape of the graph: a SourceShape with one outlet that emits Int elements
  override val shape = SourceShape(out)

  // A new GraphStageLogic instance is created every time RunnableGraph#run is called,
  // so mutable state must be initialized within the GraphStageLogic
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var fn_2 = 0
      var fn_1 = 0
      var n = 0
      setHandler(out, new OutHandler {
        // called every time a request is received from downstream (back pressure)
        override def onPull(): Unit = {
          val fn =
            if (n == 0) 0
            else if (n == 1) 1
            else fn_2 + fn_1
          if (fn >= to) completeStage() // terminate this stage with completion
          else push(out, fn)            // send an element to the downstream
          fn_2 = fn_1
          fn_1 = fn
          n += 1
        }
      })
    }
}
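The same handler pattern extends from a source to a flow: a flow stage has an inlet as well as an outlet, pulls upstream when downstream asks, and pushes downstream when upstream delivers. The sketch below follows the one-to-one mapping stage from the Akka documentation; `MapStage` and `MapStageDemo` are illustrative names, not part of the original slides.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative custom flow: applies f to every element, one at a time.
class MapStage[A, B](f: A => B) extends GraphStage[FlowShape[A, B]] {
  val in: Inlet[A]   = Inlet("MapStage.in")
  val out: Outlet[B] = Outlet("MapStage.out")
  override val shape: FlowShape[A, B] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        // upstream delivered an element: transform it and pass it on
        override def onPush(): Unit = push(out, f(grab(in)))
      })
      setHandler(out, new OutHandler {
        // downstream requested an element: forward the request upstream
        override def onPull(): Unit = pull(in)
      })
    }
}

object MapStageDemo {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("map-stage-demo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(Source(1 to 3).via(new MapStage[Int, Int](_ * 2)).runWith(Sink.seq), 10.seconds)
    finally Await.ready(system.terminate(), 10.seconds)
  }
}
```

Because the stage only pulls when it has been pulled, back pressure propagates through it without any extra code.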
Connect S3 with Kafka
Diagram labels: Docker Container, Direct connect
Put 2.5 TB/day!!! Must be scalable
Our architecture
① Notify created events
② Receive object keys to ingest
③ Download (over Direct connect)
④ Produce
Object keys are distributed across the containers (the queue works as a load balancer)
Amazon SQS
At least once = sometimes duplicates
• Once an event is read, it becomes invisible, and basically no consumer receives the same event until the visibility timeout has passed
Load balancing
• Elements are not deleted until an Ack is sent
• Processing is retriable: do not send the Ack when a failure occurs
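The interplay of visibility timeout and Ack can be modelled in a few lines of plain Scala. This is a toy model for illustration only, with a logical clock passed in by the caller; `InMemoryQueue` and `AtLeastOnceDemo` are hypothetical names and do not correspond to any SQS or Alpakka API.

```scala
// Toy model of a queue with visibility timeout and explicit Ack (not the SQS API).
final class InMemoryQueue(visibilityTimeout: Long) {
  // message body -> logical time from which the message is visible again
  private var visibleFrom = Map.empty[String, Long]

  def send(body: String): Unit = visibleFrom += (body -> 0L)

  // Returns a visible message, if any, and hides it for `visibilityTimeout` ticks.
  def receive(now: Long): Option[String] = {
    val next = visibleFrom.collectFirst { case (body, from) if from <= now => body }
    next.foreach(body => visibleFrom += (body -> (now + visibilityTimeout)))
    next
  }

  // Only an Ack deletes the message.
  def ack(body: String): Unit = visibleFrom -= body
}

object AtLeastOnceDemo {
  def run(): List[Option[String]] = {
    val queue = new InMemoryQueue(visibilityTimeout = 30L)
    queue.send("key1")
    val first  = queue.receive(now = 0L)  // delivered, then hidden until t = 30
    val hidden = queue.receive(now = 10L) // still invisible to every consumer
    val retry  = queue.receive(now = 30L) // no Ack was sent, so it is delivered again
    retry.foreach(queue.ack)              // processing succeeded this time
    val afterAck = queue.receive(now = 100L)
    List(first, hidden, retry, afterAck)
  }
}
```

The redelivery at t = 30 is exactly the "sometimes duplicates" case: a consumer that is merely slow, rather than failed, will see its message handed out a second time.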
Alpakka (implementations of GraphStages)
Various connector libraries!!
SQS Connector
• Read events from SQS
• Ack
https://github.com/akka/alpakka/tree/master/sqs
S3 Connector
• Download the content of an S3 object
https://github.com/akka/alpakka/tree/master/s3
Reactive Kafka
• Produce content to Kafka
https://github.com/akka/reactive-kafka
S3 → Kafka
// Alpakka S3 connector
val src: Source[ByteString, NotUsed] =
  S3Client().download(bucket, key)

// a built-in flow to decompress gzipped content
val decompress: Flow[ByteString, ByteString, NotUsed] =
  Compression.gunzip()

// a built-in flow to divide file content into lines
val lineFraming: Flow[ByteString, ByteString, NotUsed] =
  Framing.delimiter(delimiter = ByteString("\n"),
    maximumFrameLength = 65536, allowTruncation = false)

// Reactive Kafka producer sink
val sink: Sink[ProducerMessage.Message[Array[Byte], Array[Byte], Any], Future[Done]] =
  Producer.plainSink(producerSettings)

val blueprint: RunnableGraph[Future[String]] = src
  .via(decompress)
  .via(lineFraming)
  // convert binary to a ProducerRecord of Kafka
  .via(Flow[ByteString]
    .map(_.toArray)
    .map { record => ProducerMessage.Message[Array[Byte], Array[Byte], Null](
      new ProducerRecord[Array[Byte], Array[Byte]](conf.topic, record), null
    )})
  .toMat(sink)(Keep.right)
  // return a future of the completed object key when blueprint.run() is called
  .mapMaterializedValue { done =>
    done.map(_ => objectLocation)
  }
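The decompress-and-frame part of this pipeline can be tried without S3 or Kafka. The sketch below swaps in an in-memory gzipped `ByteString` for the S3 download and `Sink.seq` for the Kafka producer; those substitutions, the helper `gzip`, and the object name `GunzipLines` are assumptions made for illustration.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Compression, Framing, Sink, Source}
import akka.util.ByteString

import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object: gunzip + line framing on in-memory data.
object GunzipLines {
  // Helper that stands in for a gzipped S3 object.
  private def gzip(text: String): ByteString = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    out.write(text.getBytes("UTF-8"))
    out.close()
    ByteString(buffer.toByteArray)
  }

  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("gunzip-lines")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val lines = Source.single(gzip("line1\nline2\n"))
        .via(Compression.gunzip())
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 65536, allowTruncation = false))
        .map(_.utf8String)
        .runWith(Sink.seq)
      Await.result(lines, 10.seconds)
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```

With `allowTruncation = false`, input that does not end with the delimiter fails the stream, so the last line of each object needs a trailing newline.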
Overall
implicit val mat: Materializer = ActorMaterializer(
  ActorMaterializerSettings(system).withSupervisionStrategy( ex => ex match {
    case ex: Throwable =>
      system.log.error(ex, "an error occurs - skip and resume")
      Supervision.Resume
  })
)

val src = SqsSource(queueUrl)    // Alpakka SqsSource
val sink = SqsAckSink(queueUrl)  // Alpakka SqsAckSink

val blueprint: RunnableGraph[Future[Done]] =
  src
    .via(Flow[Message]
      // parse an SQS message into the keys of the S3 objects to consume
      .map(parse)
      .mapAsyncUnordered(concurrency) { case (msg, events) =>
        Future.sequence(
          events.collect {
            case event: S3Created =>
              // run the S3 -> Kafka graph, then delete the successfully produced file
              S3KafkaGraph(event.location).run() map { completedLocation =>
                s3.deleteObject(completedLocation.bucket, completedLocation.key)
              }
          }
        ) map (_ => msg -> Ack()) // Ack a successfully handled message
      })
    .toMat(sink)(Keep.right)

Workaround for duplication in SQS: with the Resume supervision strategy, the app keeps going and ignores failed messages.
(Such messages become visible again after the visibility timeout and are deleted after the retention period.)
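The skip-and-resume behavior can be observed in isolation. In this sketch the S3 and SQS calls are replaced by a `Future` that fails for one hypothetical key, and the supervision strategy is attached through stream attributes rather than materializer settings; both choices, and the name `ResumeOnFailure`, are assumptions made to keep the example self-contained.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorAttributes, ActorMaterializer, Supervision}
import akka.stream.scaladsl.{Flow, Sink, Source}

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical example object: a failing element is dropped, the stream continues.
object ResumeOnFailure {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("resume-on-failure")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    import system.dispatcher // ExecutionContext for the Futures below
    try {
      val ingest = Flow[String]
        .mapAsyncUnordered(parallelism = 2) { key =>
          Future {
            if (key == "broken") throw new IllegalStateException(s"cannot ingest $key")
            key
          }
        }
        // Resume: drop the element whose Future failed and keep processing.
        .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))

      val done = Source(List("key1", "broken", "key2")).via(ingest).runWith(Sink.seq)
      // mapAsyncUnordered does not preserve order, so sort before returning.
      Await.result(done, 10.seconds).sorted
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```

In the slides' pipeline the dropped element never reaches the Ack sink, which is what leaves the SQS message to reappear after its visibility timeout.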
Efficiency
Handles 3 TB/day of data with 24 cores!!
(Architecture diagram: ① Notify created events, ② Receive object locations to ingest, ③ Download over Direct connect, ④ Produce)
Conclusion
• Stream processing with high resource efficiency and back pressure is easy to implement, even if you are not familiar with Akka actors!
• Connecting to external resources is easy thanks to the Alpakka connectors!!!
Gists
• Sample code of GraphDSL (first example)
• FibonacciSource
• FlowStage with Buffer (not in these slides)
https://gist.github.com/Saint1991/d2737721551bc908f48b08e15f0b12d4
https://gist.github.com/Saint1991/2aa5841eea5669e8b86a5eb2df8ecb15
https://gist.github.com/Saint1991/29d097f83942d52b598cda20372ad671



Editor's Notes

  • #7: For reference, the execution result is as follows: s3://bucketA/key1 s3://bucketB/key1 s3://bucketA/key2 s3://bucketB/key2