Lambda Architecture:
from zero to One
Serhiy Masyutin
Me
• Staff Engineer @ Lohika
• Passionate Developer
• Father
• Mountain Biker
Agenda
• Project Overview
• Architecture Evolution
• What is Lambda Architecture?
• Cluster Evolution
• What Did We Achieve?
Project Overview
Project Goals
• Portfolio-driven R&D project
• Focus on Technology
• Focus on Knowledge
• Focus on a new remote Team
Service designed to offload highly
concurrent scenario of live voting
• User puts a vote
• User requests results on campaign
• Manager requests reports on campaigns
• Admin controls the system
Architecture Goals
• SaaS Solution
• High Throughput
• Scalability
• Low Latency
Essential Data Model
• campaign { startDate, endDate }
• vote { user, campaign, timestamp }
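The two entities above could be modeled roughly as follows. This is a minimal sketch, not the project's code: the class names, the `id` field, and the rule that a vote counts only while its campaign is open are assumptions made for illustration.

```java
import java.time.Instant;

// Hypothetical value class for: campaign { startDate, endDate }
final class Campaign {
    final String id;          // assumed identifier, not shown on the slide
    final Instant startDate;
    final Instant endDate;

    Campaign(String id, Instant startDate, Instant endDate) {
        if (endDate.isBefore(startDate)) {
            throw new IllegalArgumentException("endDate must not precede startDate");
        }
        this.id = id;
        this.startDate = startDate;
        this.endDate = endDate;
    }

    // Assumed rule: the campaign accepts votes within [startDate, endDate].
    boolean isOpenAt(Instant moment) {
        return !moment.isBefore(startDate) && !moment.isAfter(endDate);
    }
}

// Hypothetical value class for: vote { user, campaign, timestamp }
final class Vote {
    final String user;
    final Campaign campaign;
    final Instant timestamp;

    Vote(String user, Campaign campaign, Instant timestamp) {
        this.user = user;
        this.campaign = campaign;
        this.timestamp = timestamp;
    }

    // A vote is valid only if it was cast while its campaign was open.
    boolean isValid() {
        return campaign.isOpenAt(timestamp);
    }
}
```

Keeping both classes immutable fits the append-only treatment of votes that the later slides rely on.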
Architecture Evolution
Votes
Start Simple
Reports
Start Simple
Java 8
Spring Boot 1.2.5
MariaDB 5.5
Angularjs 1.4
Benchmark it!
• Simple throughput scenario:
user.vote()
user.request(results)
• Stop tests when error rate rises above 5%
• Benchmark tool runs locally, targeting cloud server
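The stop rule above can be sketched as a ramp-up loop. This is only an illustration under assumptions: `RampUp`, `lastHealthyLoad`, and the `errorRateAt` callback are hypothetical names, and in the real setup the error rate would come from the benchmark tool's report rather than from a function.

```java
import java.util.function.IntToDoubleFunction;

final class RampUp {
    // Threshold from the slide: stop once more than 5% of requests fail.
    static final double MAX_ERROR_RATE = 0.05;

    // Increases the number of concurrent users step by step and returns the
    // highest load whose error rate stayed at or below the threshold (0 if none).
    static int lastHealthyLoad(int start, int step, int max, IntToDoubleFunction errorRateAt) {
        int lastHealthy = 0;
        for (int users = start; users <= max; users += step) {
            if (errorRateAt.applyAsDouble(users) > MAX_ERROR_RATE) {
                break; // error rate rose above 5%: stop the test run
            }
            lastHealthy = users;
        }
        return lastHealthy;
    }
}
```

For example, `RampUp.lastHealthyLoad(2000, 2000, 12000, measure)` would walk the same user counts that appear on the benchmark charts, where `measure` is whatever hypothetical hook returns the observed error rate.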
Gatling
• An open-source load testing framework based on
Scala, Akka and Netty
• High performance
• Out-of-box HTTP support
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
http://gatling.io
Gatling
import io.gatling.core.Predef._
import io.gatling.http.Predef._

scenario("Throughput simulation").repeat(repeatCount) {
  feed(voteFeeder())
    // cast a vote
    .exec(http("Vote")
      .post(voteLink)
      .headers(sentHeaders).header("Authorization", token)
      .body(StringBody("${vote}"))
      .check(status.is(200)).asJSON)
    // then request the report for the same voting schema
    .exec(http("Report")
      .get(reportByOptionLink + "/${votingSchemaId}")
      .headers(sentHeaders).header("Authorization", token)
      .check(status.is(200)).asJSON)
}
Gatling
Benchmark!
[Chart: requests per second (100 to 1000) vs. number of concurrent users (2000 to 12000); series: Initial, Initial no-joins]
Kafka
• Publisher-subscriber
• Distributed by design
• Scalable
• Fast
• Durable
http://kafka.apache.org
Incoming Queue
Votes Reports
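The idea of the incoming queue is that the request path only enqueues a vote and returns, while a separate consumer applies votes to storage at its own pace. The sketch below illustrates that decoupling with an in-process `BlockingQueue` as a stand-in; it is not the Kafka client API, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class IncomingQueueSketch {
    // In-process stand-in for a Kafka topic (assumption for illustration only).
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>();
    // Stand-in for the reporting store, keyed by campaign id.
    private final Map<String, Long> votesPerCampaign = new HashMap<>();

    // Request path: buffer the vote and return immediately.
    void acceptVote(String campaignId) {
        topic.add(campaignId);
    }

    // Consumer path: drain everything buffered so far and apply it to the store.
    // Returns the number of votes processed in this pass.
    int drain() {
        int processed = 0;
        String campaignId;
        while ((campaignId = topic.poll()) != null) {
            votesPerCampaign.merge(campaignId, 1L, Long::sum);
            processed++;
        }
        return processed;
    }

    long count(String campaignId) {
        return votesPerCampaign.getOrDefault(campaignId, 0L);
    }
}
```

Note that reports lag behind accepted votes until the consumer has drained the queue, which is why latency becomes a metric worth measuring once the system is asynchronous.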
Benchmark!
[Chart: requests per second (100 to 1000) vs. number of concurrent users (2000 to 12000); series: Initial, Initial no-joins, Incoming Queue]
Redis
• In-memory data structure store (set, map, etc)
• Easy leader board implementation
• HyperLogLog is its native data structure
http://redis.io
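A leaderboard boils down to "increment a member's score, then read the top N". The sketch below shows that shape with plain in-memory Java collections as a stand-in; it does not use Redis or any Redis client, and the class and method names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

final class LeaderboardSketch {
    // Option name -> vote count; a stand-in for a score-ordered set in an in-memory store.
    private final Map<String, Long> scores = new ConcurrentHashMap<>();

    // Records one vote and returns the option's new score.
    long vote(String option) {
        return scores.merge(option, 1L, Long::sum);
    }

    // Top-n options by score, highest first; ties are broken by name for determinism.
    List<String> top(int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

This version sorts on every read, which is fine for a sketch; a store that keeps members ordered by score avoids that cost.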
In-memory Storage
Votes
Reports
Benchmark!
[Chart: requests per second (100 to 1000) vs. number of concurrent users (2000 to 12000); series: Initial, Initial no-joins, Incoming Queue, In-memory Storage]
Spark
• A fast and general engine for large-scale data processing
http://spark.apache.org
Scalable Processing
Votes
Reports
TODO: Benchmarks
• Processing latency
• Latency vs Data Volume
TODO: Scalable Storage
Reports
Votes
Architecture Goals Met
• High Throughput
• Scalable Storage
• Scalable Processing
• Extensible Processing
• Low Latency Reads & Updates
Lambda Architecture
A Single Picture
http://lambda-architecture.net/img/la-overview_small.png
A Single Picture
QUERY = f_query(batch_view, realtime_view)
batch_view = f_batch(all_data)
realtime_view = f_speed(new_data, realtime_view)
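The three formulas above can be sketched for the voting case as follows. This is a minimal illustration under assumptions: views are plain per-campaign vote counts, the realtime view is assumed to cover only votes not yet absorbed by the batch view, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LambdaSketch {
    // batch_view = f_batch(all_data): recompute the view from scratch over all votes.
    static Map<String, Long> batchView(List<String> allVotes) {
        Map<String, Long> view = new HashMap<>();
        for (String campaignId : allVotes) {
            view.merge(campaignId, 1L, Long::sum);
        }
        return view;
    }

    // realtime_view = f_speed(new_data, realtime_view): fold one new vote into
    // the existing view incrementally, without touching historical data.
    static Map<String, Long> speedUpdate(String newVote, Map<String, Long> realtimeView) {
        Map<String, Long> updated = new HashMap<>(realtimeView);
        updated.merge(newVote, 1L, Long::sum);
        return updated;
    }

    // QUERY = f_query(batch_view, realtime_view): merge both views at read time.
    // Assumes the two views never count the same vote twice.
    static long query(String campaignId, Map<String, Long> batch, Map<String, Long> realtime) {
        return batch.getOrDefault(campaignId, 0L) + realtime.getOrDefault(campaignId, 0L);
    }
}
```

The batch function is a full recomputation, so an error in it can be fixed by rerunning it over all data, while the speed function only ever sees the newest votes.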
Batch Layer
• Immutable append-only data store
• Batch computations produce batch views
Serving Layer
• Random reads/queries on batch views
• Batch updates from batch layer
• No need for random writes
Batch + Serving Layer
• Robustness and fault tolerance
• Scalability
• Generalization
• Extensibility
• Minimal maintenance
• Debuggability
Speed Layer
• Low latency reads and updates
• Incremental computation (different from batch one)
• Scalability
• Fault tolerance
• Minimal amount of stored data
Goals
• Robustness and fault tolerance
• Scalability
• Generalization
• Extensibility
• Minimal maintenance
• Debuggability
• Low latency reads and updates
Lambda Architecture
http://lambda-architecture.net/img/la-overview_small.png
Cluster Evolution
Start Simple
single box
Optimization:
Tomcat Connector
• Start with a single machine
• Number of threads matters, benchmark it
• Fine-tuning can be OS specific
Benchmark!
[Chart: requests per second (100 to 1000) vs. number of concurrent users (2000 to 12000); series: Initial, Initial no-joins, Incoming Queue, In-memory Storage, ???]
Haproxy
• The Reliable, High Performance TCP/HTTP Load
Balancer
• A single-process program
http://haproxy.org
A cluster of 10 servers
Optimization:
Load Balancing
[Chart: requests per second (0 to 40000) and gain by adding another server (scaling factor, 0 to 1.25) vs. number of servers (dev, 1, 2, 3, 6)]
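One way to read a scaling factor off such measurements is to compare the measured throughput with ideal linear scaling from a single server. The definition below is an assumption, since the slide does not state how its scaling factor was computed, and the names are hypothetical.

```java
final class ScalingSketch {
    // Assumed definition: measured requests per second divided by what perfectly
    // linear scaling from one server would give. 1.0 means ideal scaling;
    // lower values mean each added server contributes less.
    static double scalingFactor(int servers, double rps, double singleServerRps) {
        if (servers < 1 || singleServerRps <= 0) {
            throw new IllegalArgumentException("need at least one server and a positive baseline");
        }
        return rps / (servers * singleServerRps);
    }
}
```

With made-up numbers, three servers delivering 24000 requests per second against a 10000 single-server baseline give a factor of 0.8. Tracking this value as servers are added is one way to decide when adding more stops paying off.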
When to Stop?
Component   CPU %   Memory (GB)
haproxy     95      2.5
tomcat      397     6
kafka       1       1.3
redis       55      3.5
What Did We Achieve?
Experience
• Lambda Architecture: we have One
• Cluster Scaling & Optimization
• Excellent team
Technology
Java 8
Spring Boot 1.2.5
Spring Data 1.2.5
Tomcat 8
MariaDB 5.5
Haproxy 1.5.14
Kafka 0.8
Redis 2.8
Spark 1.4
HDFS 2.6
Gatling 2.2
Angularjs 1.4
Things That Matter
• Small steps make a huge difference
• Choose the right metrics
• Benchmark
• Optimize!
Q/A
Thank You!

Editor's Notes

  • #6: Work on the project is ongoing, although activity has been low lately.
  • #8: A service to support a high-load scenario of simultaneous voting.
  • #9: Software as a Service
  • #12: MariaDB 5.5.44 (no second-level cache in our persistence, all tests started with an empty DB, connection pooling)
  • #13: Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can "just run". Spring Data's mission is to provide a familiar and consistent, Spring-based programming model for data access while still retaining the special traits of the underlying data store. MariaDB: an enhanced, drop-in replacement for MySQL. MariaDB 5.5 is a stable (GA) release of MariaDB. It is MariaDB 5.3 + MySQL 5.5.
  • #14: Gives visibility into progress: whether a change was right or wrong, and whether it made any difference at all. 1000 campaigns, 15-100k votes.
  • #18: Once we got rid of 3-4 joins, the system could withstand more load, although throughput did not increase. The fluctuations in the chart are due to the instability of the virtual machines we used. We can see that the system has a clear performance limit. Something has to be done about it. The usual approach in this situation is to buffer requests, that is, to put them in a queue. A queue: I play the queue pattern with my kids. I put toys on the rug, and they sort them from the rug into boxes :) We know of a cool technology that is popular right now…
  • #19: There is a great presentation by Kolya Alimenkov from this year's JEEConf. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  • #20: Kafka 0.8.2.1 (Scala 2.9.1, only one partition). The system became asynchronous, so latency should be measured as well. We keep that a secret :) Whoever notices gets a prize.
  • #21: 1.5-2.1x. This is already quite fast, but for now we serve data from disk. Computations are performed in database queries, which quite often are not the fastest and do not scale. Serving data from memory would give us a performance gain…
  • #22: We chose it because we originally planned to use leaderboards. In real use, HyperLogLog can serve as an efficient approximate count of votes per campaign. In the Redis implementation it only uses 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely). // key == set
  • #24: ~1.6x performance boost. But there are limits on the algorithms for building reports: it would be good to have a general-purpose tool for implementing them.
  • #25: There is a very good presentation on Spark by Taras; if you have not heard it, I recommend watching the video on the Morning website. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
  • #26: The main goal of this step is extensibility of the system with respect to data-processing algorithms. Spark Streaming could also be used on the live layer as a generalization of incremental computations. Perhaps this will be implemented in one of the next stages. Adding one more distributed component will most likely not improve throughput; at this stage the measurements are of latency. The limitation of the system at this stage becomes the ability to store a large volume of data: although the database has coped with this job so far, it is not the best candidate. There is an old, well-known technology… but more on that in a few slides.
  • #28: Work is under way to replace MariaDB with HDFS.
  • #29: So we have arrived at the picture of the final architecture. Data flows in as a large stream. We can store large data sets. We can process large data sets. We can answer a range of questions quickly; more complex questions can be computed exactly with some delay. They say this architecture already has a name.
  • #32: ELECTIONS IN UKRAINE: ballots == all_data, Central Election Commission == batch_view, exit poll == realtime_view
  • #33: Hadoop The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views. Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures.
  • #34: Simple. Robust. Predictable. Easy to configure and operate. Cassandra/HBase/ElephantDB; can also be Hive/Impala. Output from the batch and speed layers are stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data. Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. Dedicated stores used in the serving layer include Apache Cassandra or Apache HBase for speed-layer output, and Elephant DB or Cloudera Impala for batch-layer output.
  • #35: But something is missing…
  • #36: Complexity isolation — complexity is pushed to layer whose results are temporary . Eventual accuracy — eventually all the results will be taken from serving layer. Speed layer might use approximate algorithms like HyperLogLog and BloomFilters for computations. Storm/Spark Streaming The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available. Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL databases.
  • #38: High-load meets big data
  • #41: Java NIO2 connector, 500 threads http://techblog.netflix.com/2015/07/tuning-tomcat-for-high-throughput-fail.html
  • #42: We saw that there is a certain load limit common to all variants. It is probably Tomcat :) Whoever suggests that gets a prize. The system now handles the load quite well. The consistent drop around 12k is interesting…
  • #44: Ubuntu 14.04 (virtual machines, 2 cores, 8 GB RAM). Mention HAProxy vs nginx. Mention the 12k limit in all tests: it was Tomcat. It needs optimizing.
  • #45: Bars. Scaling factor!!!
  • #46: As is typical for high-load solutions, I expected the bottleneck to be data storage or data processing; for example, my first task was "let's make Kafka the bottleneck", but we hit the load balancer instead. And 30k RPS is a decent load anyway.
  • #49: A zoo