Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016
Joe MÉSZÁROS
software engineer
@joemesz
joemeszaros
Who are we?
Our vision is to revolutionize the KPIs and metrics the online
advertisement industry currently uses. With our products,
Antifraud, Brandsafety and Viewability, we provide actionable
data to our customers.
Agenda
● What we do
● How we do it - the enbrite.ly data platform
● Real-world antifraud example
● Lessons learned + Spark at scale: pros and cons
What we do
[Platform diagram: DATA COLLECTION, DATA PROCESSING, ANALYZE; products: ANTI FRAUD, VIEWABILITY, BRAND SAFETY; REPORT + API]
How we do it: DATA COLLECTION
How we do it: DATA PROCESSING
Amazon EMR
● Runs on AWS, the most popular cloud service provider
● Part of the Amazon big data ecosystem
● Applications: Hadoop, Spark, Hive, …
● Scaling is easy
● Do not trust the BIG guys (API problem)
● Spark applications on EMR run on YARN (the cluster manager)
For more information: https://aws.amazon.com/elasticmapreduce/
Tools we use
https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors
A workflow engine that helps you build complex data pipelines
of batch jobs. Created by Spotify’s engineering team.
Your friendly plumber that glues your Hadoop, Spark, … jobs
together with simple dependency definitions and failure management.
import luigi

class SparkMeetupTask(luigi.Task):
    param = luigi.Parameter(default=42)

    def requires(self):
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')

    def output(self):
        return luigi.LocalTarget('/meetup/message')

if __name__ == '__main__':
    luigi.run()
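A task like this is typically started from the command line, e.g. python spark_meetup_task.py SparkMeetupTask --local-scheduler for local testing (the file name is assumed here); the web interface shown below is served by Luigi's central scheduler, luigid.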
Web interface
Let me tell you a short story...
Tools we created GABO LUIGI
Luigi + enbrite.ly extensions = Gabo Luigi
● Dynamic task configuration + dependencies
● Reshaped web interface
● Define reusable data pipeline templates
● Monitoring for each task
Tools we created GABO LUIGI
We plan to release it into the wild and make it open
source as part of Spotify’s Luigi! If you are
interested, our doors are open :-)
Tools we created GABO MARATHON
Motivation: Testing with large data sets and slow batch jobs is
boring and wasteful!
Tools we created GABO MARATHON
Graphite
Real world example
You are fighting against robots and want to humanize
the ad tech era. You have a simple idea for detecting bot traffic
that saves the world. Let’s implement it!
Real world example
THE IDEA: Analyse events which are too hasty and deviate
from regular, humanlike profiles: too many clicks in a defined
timeframe.
INPUT: Load balancer access log files on S3
OUTPUT: Print invalid sessions
How to solve it?
Step 1: convert access log files to events
Step 2: sessionize events
Step 3: detect too many clicks
The way to the access log
1. Click event attributes (created by the JS tracker):
{
  "session_id": "spark_meetup_jsmmmoq",
  "timestamp": 1456080915621,
  "type": "click"
}
2. Base64-encoded event:
eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=
3. Access log format:
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
Step 1: log to event
To simplify: the log files are on local storage and contain only click events.
SparkConf conf = new SparkConf().setAppName("LogToEvent");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
// 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
Step 1: log to event
// Take the request URL field of each log line ("\\s+" splits on whitespace)
JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);
// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

// Pull out the base64-encoded event query parameter
JavaRDD<String> eventParameter = rawUrls
    .map(u -> parseUrl(u).get("event"));
// eyJzZXNzaW9uX2lkIj…

// Decode it back to the original JSON payload
JavaRDD<String> base64Decoded = eventParameter
    .map(e -> new String(Base64.getDecoder().decode(e)));
// {"session_id": "spark_meetup_jsmmmoq",
//  "timestamp": 1456080915621, "type": "click"}

IoUtil.saveAsJsonGzipped(base64Decoded);
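parseUrl is not shown in the deck; a minimal sketch of such a helper, assuming the event parameter is URL-encoded by the tracker (hypothetical, not the actual enbrite.ly code):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: extracts query parameters from a raw request URL.
static Map<String, String> parseUrl(String rawUrl) {
    Map<String, String> params = new HashMap<>();
    String query = rawUrl.substring(rawUrl.indexOf('?') + 1);
    for (String pair : query.split("&")) {
        String[] kv = pair.split("=", 2);
        try {
            params.put(kv[0], kv.length > 1 ? URLDecoder.decode(kv[1], "UTF-8") : "");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
    return params;
}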
Step 2: event to session
SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<ClickEvent> clickEvents = jsonEvents
.map(e -> readJsonObject(e));
SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<ClickEvent> clickEvents = jsonEvents
.map(e -> readJsonObject(e));
JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
clickEvents.groupBy(e -> e.getSessionId());
JavaPairRDD<String, Session> sessions = grouped
.flatMapValues(sessionizer);
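readJsonObject is another helper the slides leave out; a minimal sketch using Gson (an assumption, any JSON library would do) that maps the snake_case payload onto a ClickEvent:

import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;

// Hypothetical event class matching the tracker's JSON payload.
class ClickEvent implements java.io.Serializable {
    @SerializedName("session_id") private String sessionId;
    private long timestamp;
    private String type;

    public String getSessionId() { return sessionId; }
    public long getTimestamp() { return timestamp; }
}

static ClickEvent readJsonObject(String json) {
    return new Gson().fromJson(json, ClickEvent.class);
}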
Step 2: event to session
// Sessionizer: builds one Session from the ordered click events of a visitor
public Session call(Iterable<ClickEvent> clickEvents) {
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    Session session = new Session();
    for (ClickEvent event : ordered) {
        session.addClick(event);
    }
    return session;
}
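sortByTimestamp is not shown either; a minimal sketch, assuming ClickEvent exposes getTimestamp() as above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Copies the grouped events into a list and orders them by timestamp.
static List<ClickEvent> sortByTimestamp(Iterable<ClickEvent> events) {
    List<ClickEvent> ordered = new ArrayList<>();
    events.forEach(ordered::add);
    ordered.sort(Comparator.comparingLong(ClickEvent::getTimestamp));
    return ordered;
}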
Step 2: event to session
class Session implements java.io.Serializable {
    public boolean isBad = false;
    public List<Long> clickTimestamps = new ArrayList<>(); // initialized to avoid an NPE on addClick

    public void addClick(ClickEvent e) {
        clickTimestamps.add(e.getTimestamp());
    }

    public boolean isBad() { return isBad; }
    public void badify() { this.isBad = true; }
}
Step 3: detect bad sessions
// Mark sessions that exceed the click threshold, then keep only the bad ones
JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<Session> markedSessions = sessions
    .map(s -> { if (s.clickTimestamps.size() > THRESHOLD) s.badify(); return s; });
JavaRDD<Session> badSessions = markedSessions
    .filter(s -> s.isBad());
badSessions.collect().forEach(System.out::println);
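The check above only counts clicks per session; the stated idea ("too many clicks in a defined timeframe") calls for a sliding window over the sorted timestamps. A minimal sketch of that refinement, with THRESHOLD and TIMEFRAME_MS as assumed constants:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Returns true if more than `threshold` clicks fall into any window of `timeframeMs` milliseconds.
static boolean tooManyClicksInWindow(List<Long> timestamps, int threshold, long timeframeMs) {
    List<Long> sorted = new ArrayList<>(timestamps);
    Collections.sort(sorted);
    int start = 0;
    for (int end = 0; end < sorted.size(); end++) {
        while (sorted.get(end) - sorted.get(start) > timeframeMs) {
            start++; // shrink the window from the left
        }
        if (end - start + 1 > threshold) {
            return true;
        }
    }
    return false;
}

// Usage in the marking step:
// .map(s -> { if (tooManyClicksInWindow(s.clickTimestamps, THRESHOLD, TIMEFRAME_MS)) s.badify(); return s; })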
Congratulations!
MISSION COMPLETED
YOU just saved the world with a
simple idea within ~10 minutes.
Using Spark pros
● Working with Spark is fun: great community and tools
● Easy to get started with
● Language support: Python, Scala, Java, R
● Unified stack: batch, streaming, SQL, ML
Using Spark cons
● You need memory, and then more memory
● Distributed applications are hard to debug
● Hard to optimize
Lessons learned
● Do not use the default config, always tune it! (a sketch follows this list)
● Eliminate technical debt + automate
● Failures happen: use monitoring from the very first breath + a fault-tolerant implementation
● Spark is fun, but it is not a hammer for every nail
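As an illustration of moving away from the defaults, a minimal sketch of explicit tuning (the values are placeholders, not recommendations; the right numbers depend on your cluster and data):

SparkConf conf = new SparkConf()
    .setAppName("TunedJob")
    .set("spark.executor.memory", "4g")        // the default executor memory is small for real workloads
    .set("spark.executor.cores", "4")
    .set("spark.default.parallelism", "200")   // match the partition count to the cluster size
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); // faster than Java serialization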
Data platform future
● Would like to play with Redshift
● Change the data format (Avro, Parquet, …)
● Would like to play with streaming
● Would like to play with Spark 2.0
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
WE ARE HIRING!
… is our mood manager, Bigyó :)
Joe MÉSZÁROS
software engineer
joe@enbrite.ly
@joemesz
@enbritely
joemeszaros
enbritely
THANK YOU!
QUESTIONS?