Casual mass parallel data
processing in Java
Alexey Ragozin

Mar 2014
Building new bicycle …
Build Vs. Buy
Build
• No dedicated team to
support infrastructure
• Very specific tasks
• Exclusive use of
infrastructure
• Reasonable scale

Buy
• Product can be bought as a
service (internal or external)
• Large scale
• Multi tenancy
• You are going to use
advanced features
(e.g. map/reduce)
“Casual” computing
• Small computation farms (< 100 servers)
• Team owns both application and grid
• Java platform
• Reasonably short batches (< 24 hours)
• Reasonably small data sets (< 10 TiB)
Simple master slave topology
[Diagram: a master process (scheduler and task queue) connected to three slaves; arrows labelled "Advertise", "Task" and "Report".]
Simple master slave topology
Control plane
• RMI

Queue / scheduler
• Simple in-memory queue
• May be more complex than just a task queue

Data plane
…
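The "simple in-memory queue" can be sketched in plain Java. This is only an illustration under stated assumptions: threads stand in for slave processes, the class name `InMemoryMaster` and the squaring "task" are hypothetical, and in a real grid the take/report calls would travel over the control plane (e.g. RMI) rather than stay inside one JVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one master, N slaves, one task queue, one report queue.
class InMemoryMaster {
    // Reserved value telling a slave to stop; real tasks must not use it.
    private static final Integer STOP = Integer.MIN_VALUE;

    private final BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
    private final BlockingQueue<Long> reports = new LinkedBlockingQueue<>();

    // Queues the batch, starts the slaves, and sums up one report per task.
    long runBatch(List<Integer> batch, int slaveCount) {
        tasks.addAll(batch);
        for (int i = 0; i < slaveCount; i++) {
            tasks.add(STOP); // one stop marker per slave, queued after the real tasks
        }
        List<Thread> slaves = new ArrayList<>();
        for (int i = 0; i < slaveCount; i++) {
            Thread t = new Thread(this::slaveLoop, "slave-" + i);
            slaves.add(t);
            t.start();
        }
        try {
            long sum = 0;
            for (int i = 0; i < batch.size(); i++) {
                sum += reports.take(); // master consumes exactly one report per task
            }
            for (Thread t : slaves) {
                t.join();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("batch interrupted", e);
        }
    }

    // Slave side: take a task, do the work, report the result.
    private void slaveLoop() {
        try {
            while (true) {
                Integer n = tasks.take();
                if (n.equals(STOP)) {
                    return;
                }
                reports.put((long) n * n); // the "work" here is just squaring
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long total = new InMemoryMaster().runBatch(Arrays.asList(1, 2, 3, 4), 3);
        System.out.println("sum of squares = " + total); // prints "sum of squares = 30"
    }
}
```

Only small task descriptors and reports pass through the queues; bulk data is meant to travel separately, over the data plane.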
Data plane
Never, ever, try to send data over RMI

File system
• Avoid network mounts!

In-memory key-value
• Client side sharding works best

Disk database (RDBMS or NoSQL)
• Consider prefetch of data

Direct socket streaming
…
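Client-side sharding for an in-memory key-value store can be as small as the sketch below. It is an illustration, not any particular product's API: `ShardRouter` and the node addresses are hypothetical, and plain modulo hashing is assumed for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the client itself decides which key-value node owns a key,
// so no proxy or coordinator sits on the data path.
class ShardRouter {
    private final List<String> nodes;

    ShardRouter(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = nodes;
    }

    // Maps a key to a node. floorMod keeps the index non-negative
    // even when hashCode() is negative.
    String nodeFor(String key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public static void main(String[] args) {
        // Node addresses are made up for the example.
        ShardRouter router = new ShardRouter(Arrays.asList("kv-1:6379", "kv-2:6379", "kv-3:6379"));
        for (String key : Arrays.asList("user:1", "user:2", "order:77")) {
            System.out.println(key + " -> " + router.nodeFor(key));
        }
    }
}
```

Modulo hashing remaps most keys when the node list changes; if the farm is resized during a batch, consistent hashing is the usual alternative.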
Distributed objects revised
Pitfalls of CORBA/RMI
• IDL – functional contract
• IDL – protocol

Separating concerns
• Functional contract – wrapper object
• Protocol – hidden remote interface
Distributed objects revised
Renewed distributed objects paradigm
Strong
• Polymorphism
• Encapsulation
  • Network protocol, caching aspects etc.

Weak
• Homogeneous code base required
• Synchronous network communications
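The separation of functional contract and protocol might look like the sketch below. All type names (`PriceSource`, `RemotePriceSource`, `PriceSourceWrapper`) are hypothetical, and a lambda stands in for a real exported RMI object so the example runs in one JVM.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Functional contract: what application code sees. No RMI types leak out.
interface PriceSource {
    double priceOf(String symbol);
}

// Protocol: the remote interface, kept as an implementation detail.
interface RemotePriceSource extends Remote {
    double fetchPrice(String symbol) throws RemoteException;
}

// Wrapper object: implements the contract and encapsulates the protocol,
// including a caching aspect the caller never has to know about.
class PriceSourceWrapper implements PriceSource {
    private final RemotePriceSource remote;
    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    PriceSourceWrapper(RemotePriceSource remote) {
        this.remote = remote;
    }

    @Override
    public double priceOf(String symbol) {
        return cache.computeIfAbsent(symbol, this::callRemote);
    }

    // Translates the protocol-level failure into an unchecked exception.
    private Double callRemote(String symbol) {
        try {
            return remote.fetchPrice(symbol);
        } catch (RemoteException e) {
            throw new IllegalStateException("remote call failed for " + symbol, e);
        }
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        // In-process stand-in for an exported RMI object (assumption for the demo).
        RemotePriceSource fake = symbol -> symbol.length() * 1.5;
        PriceSource source = new PriceSourceWrapper(fake);
        System.out.println(source.priceOf("ACME")); // prints 6.0
    }
}
```

Because callers depend only on `PriceSource`, the wrapper can change its remote interface, batching or caching without touching application code. The call itself remains synchronous, which is the weakness listed above.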
Deployment problem
Brute force
• Build / package
• Deploy / SCP
• Restart slaves
• Start batch
• Change code, repeat

Computation grid software
• Compile and run batch
• Behind the scenes
  • Your classes are collected
  • Associated with the batch
  • Deployed on participating slaves
Central scheduler topology
[Diagram: two batch controllers add tasks to and consume reports from a queue server holding the task queue; three slaves are connected to the queue server by arrows labelled "Pull", "Task" and "Report".]
Or more elaborate
Flavors of parallel processing
Flow organized tasks
• Input data available before
task starts
• e.g. Map/Reduce

Collaborative tasks
• Tasks communicate
intermediate results to each
other
• e.g. physics simulations
Get back to data plane
Rules of thumb
• Insert / delete – never update
• Write locally (reducing risks)
• Read remotely (retry on error)
• Store input as is
  • File system
  • Document / column oriented NoSQL
• Input and temporary data are different
  • Choose the right store for each
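Two of the rules above, "insert, never update" and "read remotely (retry on error)", can be sketched as follows. This is a minimal illustration under assumptions: `DataPlaneRules`, `IoRead` and the file names are hypothetical, and a local temp directory stands in for the remote store.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DataPlaneRules {

    // A read that may fail with an I/O error (hypothetical helper type).
    interface IoRead<T> {
        T read() throws IOException;
    }

    // Insert, never update: write a temp file next to the target, then publish
    // it under its final name. Files.move without REPLACE_EXISTING fails if the
    // target already exists (the temp file is then left behind for cleanup).
    static void insert(Path target, String content) {
        try {
            Path tmp = Files.createTempFile(target.getParent(), "part-", ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Retry on error: repeat the read up to 'attempts' times, rethrow the last failure.
    static <T> T readWithRetry(IoRead<T> source, int attempts) {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return source.read();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new UncheckedIOException(last);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("batch-");
        Path result = dir.resolve("task-0001.out");
        insert(result, "42");
        String text = readWithRetry(
                () -> new String(Files.readAllBytes(result), StandardCharsets.UTF_8), 3);
        System.out.println(text); // prints 42
    }
}
```

Retrying is safe here precisely because published data is never updated: a repeated read cannot observe a half-modified record.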
Exploiting file system
Avoid network file systems
• The file system concept was not designed to be distributed
• A good network file system cannot exist
• Use simple remote file access protocols
  • SCP (unencrypted data transfer options added by CERN guys)
  • HTTP (if you really do not want SCP)

A cheap SAN could be built from open source components
Algorithmic optimization
Parallel computing
• N times speed up will increase
your OPEX and CAPEX cost by N*lg(N)
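
To get a feel for the N*lg(N) figure, here is a small worked comparison. It assumes "lg" means the base-2 logarithm and that cost grows in proportion to N lg N; both are readings of the slide, not something it states explicitly.

```latex
\[
\text{cost}(N) \propto N \lg N, \qquad
\frac{\text{cost}(64)}{\text{cost}(8)}
  = \frac{64 \cdot \lg 64}{8 \cdot \lg 8}
  = \frac{64 \cdot 6}{8 \cdot 3}
  = \frac{384}{24}
  = 16 .
\]
```

Under these assumptions, going from an 8-times to a 64-times speedup (8 times faster again) costs 16 times as much.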

Algorithmic optimization
• Up-front costs only
• Orders of magnitude optimization opportunities
• Exciting coding
• Ecological way of computing
Streaming algorithms
Finding N most frequent elements
• Count-Min sketch

Estimating number of unique values
• HyperLogLog

Distribution histograms
https://github.com/addthis/stream-lib

https://github.com/rwl/ParallelColt
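
To show what a streaming frequency estimator looks like, here is a minimal Count-Min sketch written from scratch. It is not stream-lib's implementation or API: the class `TinyCountMin`, its parameters and the hash mixing are assumptions made for illustration, and the simple hashing does not carry the formal error bounds of a production implementation.

```java
import java.util.Random;

// Minimal Count-Min sketch: approximate per-item counts in fixed memory.
// Estimates never undercount; hash collisions can only inflate them.
class TinyCountMin {
    private final long[][] table; // depth rows of width counters
    private final int[] seeds;    // one hash seed per row
    private final int width;

    TinyCountMin(int depth, int width, long seed) {
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Picks the counter for an item in the given row.
    private int bucket(int row, Object item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16); // simple bit mixing, illustrative only
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Records one occurrence: increment one counter in every row.
    void add(Object item) {
        for (int r = 0; r < table.length; r++) {
            table[r][bucket(r, item)]++;
        }
    }

    // Estimated count: the minimum over rows is the least-inflated counter.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < table.length; r++) {
            min = Math.min(min, table[r][bucket(r, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        TinyCountMin sketch = new TinyCountMin(4, 256, 1L);
        for (int i = 0; i < 100; i++) {
            sketch.add("GET /index.html");
        }
        for (int i = 0; i < 5; i++) {
            sketch.add("GET /about.html");
        }
        System.out.println("index ~ " + sketch.estimate("GET /index.html"));
        System.out.println("about ~ " + sketch.estimate("GET /about.html"));
    }
}
```

Memory use is depth × width counters regardless of how many distinct items pass through, which is what makes this kind of algorithm attractive before reaching for more servers. Finding the N most frequent elements additionally needs a small candidate set (for example a heap) kept alongside the sketch.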
NanoCloud – drastically simplified
coding for computing clusters
As easy as …

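import java.net.InetAddress;
import java.util.concurrent.Callable;
import org.junit.Test;
// Cloud and CloudFactory come from the NanoCloud library (see Links).

public class HelloRemoteWorldTest {

    @Test
    public void hello_remote_world() {
        Cloud cloud = CloudFactory.createSimpleSshCloud();
        // The Callable below is executed on the remote node, not locally
        cloud.node("myserver.acme.com").exec(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                String localhost = InetAddress.getLocalHost().toString();
                System.out.println("Hi! I'm running on " + localhost);
                return null;
            }
        });
    }
}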
All you need is …
NanoCloud requirements
• SSHd
• Java (1.6 and above) present
• Works through NAT and firewalls
• Works on Amazon EC2
• Works everywhere SSH works
Master – slave communications

[Diagram: the master process is connected over SSH (a single TCP connection) to an agent on the slave host; labels: multiplexed slave streams (std out, std err, std in, diag), slave controllers, RMI (TCP), slaves.]
Links
NanoCloud
• https://code.google.com/p/gridkit/wiki/NanoCloudTutorial
• Maven Central: org.gridkit.lab:telecontrol-ssh:0.7.23
• http://blog.ragozin.info/2013/01/remote-code-execution-in-java-made.html

ANT task
• https://github.com/gridkit/gridant
Thank you
http://blog.ragozin.info
- my articles
http://code.google.com/p/gridkit
http://github.com/gridkit
- my open source code
http://aragozin.timepad.ru
- community events in Moscow

Alexey Ragozin
alexey.ragozin@gmail.com
