Benchmarking of distributed linked data streaming systems

This project has received funding from the European Union's H2020 research and innovation action program under grant agreement number 688227.
The project runtime is December 2015 until November 2018.
The HOBBIT project
Pavel Smirnov
AGT International
Stream Reasoning Workshop
January 17, 2018

Overview
• The HOBBIT project
• DEBS challenges
• Available benchmarks overview
• Summary

The HOBBIT project. Overview
Goal
To abolish the barriers in the adoption and deployment of Big Linked Data by European companies by:
• The deployment of benchmarks on data that reflects reality within realistic settings.
• The provision of corresponding industry-relevant key performance indicators (KPIs).
• The computation of comparable results on standardized hardware.
• The institution of an independent and thus bias-free organization to conduct regular benchmarks and provide the European industry with up-to-date performance results.
Deliverables:
• The benchmarking platform (the HOBBIT platform)
• The set of benchmarks with KPIs
• Benchmarking association
http://project-hobbit.eu

The HOBBIT platform. Business logic
Customer
• Requires ranking of alternative solutions by some KPI
• Submits benchmarks
Solution provider (vendor)
• E.g. DB, streaming platforms, ML frameworks, etc.
• Submits systems
The HOBBIT platform (online or local instance)
Provides:
1. Automatic benchmark executions
2. Leaderboards (online or private)
Main advantages:
1. Streaming fashion
2. Docker virtualization
3. RDF-enabled
http://github.com/hobbit-project/platform

The HOBBIT platform. Architecture
The data pipeline:
1. Send raw/initial data (optional)
2. Send raw tuples
3.1 Send tasks (task = {tuple, id})
3.2 Send expected results per task
4. Send actual results per task
5. Send the “expected-actual” pairs
6. Send KPIs back to the controller
7. Send KPIs back to the platform
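The pipeline steps can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: the real platform runs each component in its own Docker container and exchanges messages over queues, whereas here every step is a plain function call. The names `Task`, `run_pipeline`, `expected_fn` and `system_fn` are hypothetical and are not part of the HOBBIT API, and the "accuracy" KPI is just one example of a KPI a benchmark might report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Task:
    """Step 3.1: a task is a raw tuple plus an id (hypothetical type)."""
    task_id: int
    payload: str


def run_pipeline(
    raw_tuples: List[str],
    expected_fn: Callable[[str], str],
    system_fn: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    # Steps 2 and 3.1: wrap every raw tuple into a task with an id.
    tasks = [Task(i, t) for i, t in enumerate(raw_tuples)]

    # Step 3.2: the benchmark records the expected result per task id.
    expected = {task.task_id: expected_fn(task.payload) for task in tasks}

    # Step 4: the system under test (a black box) answers each task.
    actual = {task.task_id: system_fn(task.payload) for task in tasks}

    # Step 5: join expected and actual results by task id.
    pairs: List[Tuple[str, Optional[str]]] = [
        (expected[i], actual.get(i)) for i in sorted(expected)
    ]

    # Steps 6 and 7: compute KPIs and hand them back to the caller,
    # which stands in for the controller and the platform here.
    correct = sum(1 for exp, act in pairs if exp == act)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"tasks": len(pairs), "correct": correct, "accuracy": accuracy}
```

For example, `run_pipeline(["a", "b", "c"], str.upper, my_system)` compares the answers of a hypothetical `my_system` function with upper-cased inputs and returns the number of tasks, the number of correct answers and their ratio.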
Diagram legend (arrows 1 to 6 correspond to the pipeline steps above):
• Benchmark components (customer’s application)
• System components (black box for customers)
• Platform components
The online platform:
http://master.project-hobbit.eu/
Cluster: 6 nodes, each with 2 × 64-bit Intel Xeon E5-2630v3 (8 cores, 2.4 GHz, HT, 20 MB cache per processor), 256 GB RAM, 1 Gb Ethernet
Nodes (benchmark/system): 3/3
https://github.com/hobbit-project/platform/wiki/Overview

The HOBBIT platform. Technologies
https://github.com/hobbit-project/platform/wiki/Overview
Platform communication channel (RabbitMQ only)
Data transportation channel (app-specific: RabbitMQ, Kafka, Netty, Akka…)
Platform-side:
1. Java
2. RabbitMQ
3. Docker+Swarm
4. GitLab
5. Redis
6. Virtuoso (RDF)
7. NodeJS
8. KeyCloak
App-side (defaults):
1. Java
2. RabbitMQ

Design and upload to HOBBIT
Preparation:
• Create an account at https://master.project-hobbit.eu
• Create a project at https://git.project-hobbit.eu
Design (the standard HOBBIT way):
• Design components using the manuals: "Develop a benchmark component in Java", "Develop a component in Java", "Develop a system adapter", "Develop a system adapter in Java"
• Create Docker files using details (manual)
• Build images (manual)
- Lots of understanding and manual work
- Impossible to debug locally *
- Upload non-tested images *
- No logs from the online platform, only GUI *
Design (alternative using the Java SDK):
• Clone and extend the basic code: https://github.com/hobbit-project/java-sdk-example
• Run tests locally as pure Java code
• Debug Docker images by running tests
• Update ttl-files for your project
+ Clone and extend standard classes with your logic
+ Test and debug your code from IDE
+ Build Docker images on demand from IDE
+ Run your images from IDE, check all internal logs
+ Upload fully tested images
Upload:
• Configure remote project details
• Upload Docker images to https://git.project-hobbit.eu
• Find your benchmark or system at https://master.project-hobbit.eu
* Unless you have a local HOBBIT deployment

Example: single benchmark run

Example: challenges & leaderboards

Challenges: DEBS GC 2017
DEBS Grand Challenge 2017 successfully completed
Anomaly detection for injection molding machines over RDF-streams.
• 14 teams registered
• 7 teams passed the correctness check
• 2 were awarded (main and audience award)
StreaML Open Challenge is open; prize: 500 €
The main result:
For the first time we can objectively quantify the performance of a distributed stream processing pipeline running analytics algorithms
https://project-hobbit.eu/challenges/debs-grand-challenge/
https://project-hobbit.eu/open-challenges/streaml-open-challenge/
The anomaly detector (starts after at least W time units):
1. Find cluster centers over W time units
2. Train Markov model over the last W time units
3. Apply Markov model for anomaly detection
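The three stages of the anomaly detector (cluster centers, Markov model training, Markov-model anomaly detection) can be sketched in a few lines. This is a minimal illustration under assumptions that the slide does not state: values are one-dimensional, clustering is a plain k-means seeded with the first k distinct values, and a window is flagged when the probability of its last few state transitions falls below a threshold. All function and parameter names (`k`, `n_transitions`, `threshold`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def nearest(centers: Sequence[float], value: float) -> int:
    """Index of the cluster center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 10) -> List[float]:
    """Stage 1: find cluster centers over the window (1-D k-means)."""
    centers: List[float] = []
    for v in values:  # assumed seeding: first k distinct values
        if v not in centers:
            centers.append(v)
        if len(centers) == k:
            break
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centers]
        for v in values:
            groups[nearest(centers, v)].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def transition_probs(states: Sequence[int]) -> Dict[Tuple[int, int], float]:
    """Stage 2: train a first-order Markov model on the state sequence."""
    pair_counts = Counter(zip(states, states[1:]))
    from_counts = Counter(states[:-1])
    return {pair: n / from_counts[pair[0]] for pair, n in pair_counts.items()}


def sequence_probability(window: Sequence[float], k: int, n_transitions: int) -> float:
    """Stage 3: probability of the last n_transitions state transitions."""
    centers = kmeans_1d(window, k)
    states = [nearest(centers, v) for v in window]
    probs = transition_probs(states)
    last = states[-(n_transitions + 1):]
    p = 1.0
    for a, b in zip(last, last[1:]):
        p *= probs.get((a, b), 0.0)
    return p


def is_anomaly(window: Sequence[float], k: int, n_transitions: int, threshold: float) -> bool:
    return sequence_probability(window, k, n_transitions) < threshold
```

With a window of eight readings of 1.0 followed by a single 9.0 and `k=2`, the final transition has probability 1/8, so `is_anomaly(window, 2, 1, 0.2)` flags it.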

Challenges: DEBS GC 2018
DEBS Grand Challenge 2018 has just started
https://project-hobbit.eu/challenges/debs2018-grand-challenge/
Prediction of arrival times and ports on marine traffic data.
Prize: 1000 € + publication in the DEBS proceedings (the conference will be held in New Zealand)
DEBS Grand Challenge 2017:
• Synthetically generated data
• Predefined algorithms
• True RDF-streaming benchmark
• Focus: correctness check, throughput, latency
DEBS Grand Challenge 2018:
• Real annotated data
• No predefined approach
• True ML-benchmark
• Focus: prediction accuracy, performance
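Throughput and latency, named as focus KPIs for the 2017 challenge, can be derived from per-task timestamps. The sketch below is one plausible way to compute them and is not the challenge's official definition: it assumes each task has a send time and, if answered, a receive time in seconds. The function name and the dictionary layout are hypothetical.

```python
from statistics import mean
from typing import Dict


def throughput_and_latency(
    sent_at: Dict[int, float], received_at: Dict[int, float]
) -> Dict[str, float]:
    """Compute KPIs from timestamps keyed by task id (seconds)."""
    # Only tasks that were both sent and answered contribute.
    answered = [task_id for task_id in sent_at if task_id in received_at]
    if not answered:
        return {"throughput": 0.0, "mean_latency": 0.0}

    # Latency: time between sending a task and receiving its result.
    latencies = [received_at[i] - sent_at[i] for i in answered]

    # Throughput: answered tasks per second over the whole run.
    duration = max(received_at[i] for i in answered) - min(sent_at[i] for i in answered)
    throughput = len(answered) / duration if duration > 0 else float("inf")

    return {"throughput": throughput, "mean_latency": mean(latencies)}
```

Unanswered tasks are ignored here; a stricter benchmark could instead count them as failures in the correctness check.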

Available benchmarks overview
Versioning Benchmark
• Benchmark for assessing the ability of versioning systems to efficiently manage evolving datasets and queries
Data Storage Benchmark
• Benchmark for RDF data storage solutions against an interactive workload in a real-world scenario, using various dataset sizes
Linking Benchmark
• Benchmark for assessing the performance of instance matching tools that implement string-based approaches
Faceted Browsing Benchmark
• Benchmark for systems which support browsing through linked data by iterative transitions performed by an intelligent user
ODIN Benchmark
• Benchmark for data extraction solutions for structured data
• Simulates the ingestion, storage and retrieval of streams of RDF data
Spatial Benchmark
• Benchmark for systems which deal with topological relations proposed in the state-of-the-art DE-9IM model
Question Answering Benchmark
• Benchmark for ranking question answering systems based on their performance and accuracy
GERBIL Benchmark
• Benchmark for entity annotation and disambiguation tools
• 9 annotators, 11 RDF datasets
Stream Machine Learning Benchmark
• Benchmark for assessing the performance of anomaly detection for injection molding machines over RDF-streams
Stream Machine Learning Benchmark v2
• Benchmark for assessing the accuracy of prediction over a stream of marine traffic data
http://github.com/hobbit-project

Summary
The HOBBIT platform
• Ability to benchmark heterogeneous distributed systems in a streaming fashion
• A set of benchmarks to compare relevant Linked Data technologies and solutions
• We apply the HOBBIT platform to rank machine-learning pipelines over RDF-streams
• The platform may serve as a basis for benchmarking stream-reasoning solutions

Q&A
Thank you for your attention!
psmirnov@agtinternational.com
http://twitter.com/smirnp
http://twitter.com/AGTIntl
