Leveraging Open Source for Large Scale
Analytics on HPC Systems
Rob Vesse, Software Engineer, Cray Inc
Overview
● Background
● Challenges
● Packaging and Deployment
● Input/Output
● Scaling Analytics
● Python Data Science
● Machine Learning
Slides: https://cray.box.com/v/sw-data-july-2018
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to
change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publically announced
for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising,
promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the
approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware
or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL,
CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL,
THREADSTORM. The following system family marks, and associated model number marks, are trademarks of
Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a
sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other
trademarks used in this document are the property of their respective owners.
Background
● About Me
● Software Engineer in the Analytics R&D Group
● Develop hardware and software solutions across Cray's product portfolio
● Primarily focused on integrating open source software into a coherent, user-friendly product
● Involved in open source for ~15 years, committer at Apache Software Foundation
since 2012, and member since 2015
● Definition - High Performance Computing (HPC)
● Any sufficiently large high performance computer
● Typically $500,000 plus
● As small as 10s of nodes up to 10,000s of nodes
● Creates some interesting scaling and implementation challenges for analytics
● Why analytics on HPC Systems?
● Scale
● Productivity
● Utilization
Packaging and Deployment
● Challenges
● HPC Systems are highly
controlled environments
● Users are granted the
minimum permissions
possible
● Many open source packages
have extensive dependencies
or expect users to bring in
their own
Solution - Containers
● An easy solution right?
● HPC Sysadmins are really paranoid
● Docker still considered insecure by many
● NERSC Shifter
● An HPC-centric containerizer, used on our top-end systems
● Designed to scale out massively
● Forces the containerized process to run as the launching user's UID (the sketch after this slide shows the Docker equivalent)
● Can consume Docker images but has own image gateway and
format
● Docker
● Currently used for our cluster systems
● Eventually will be used on our next generation supercomputers
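To make the UID handling concrete, below is a minimal sketch (not our production tooling) of launching an analytics container as the submitting user via the Docker SDK for Python; the image name and mount paths are hypothetical placeholders.

import os
import docker  # Docker SDK for Python

client = docker.from_env()

# Run the container as the launching user's UID:GID (mirroring Shifter's behaviour)
# and bind-mount a shared file system path into the container. Paths/image are hypothetical.
client.containers.run(
    "example.com/analytics/spark:latest",
    "spark-submit /work/job.py",
    user=f"{os.getuid()}:{os.getgid()}",
    volumes={"/lus/scratch/project": {"bind": "/work", "mode": "rw"}},
    remove=True,
)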
Containers - Shifter vs Docker
● Both are open source so why choose Docker?
● https://github.com/NERSC/shifter
● https://github.com/docker
● Docker has a far more vibrant community
● Many of its shortcomings for HPC have or are being addressed
● E.g. Container access to hardware devices like GPUs
● NVidia Docker - https://github.com/NVIDIA/nvidia-docker
● It's Open Container Initiative (OCI) compliant
● Docker can be used with other key technologies e.g.
Kubernetes
Orchestration
● For distributed applications we need something to tie the
containers together
● Also want to support multi-tenant isolation
● Kubernetes
● Fastest growing container orchestrator out there
● Open APIs and highly extensible
● Declaratively specify complex applications and self-service configuration via APIs (see the sketch below)
● E.g. Deploying Apache Spark on Kubernetes using Bloomberg's
Kerberos support mods
● Biggest problem for us is networking!
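For a flavour of the "self-service configuration via APIs" point, here is a hedged sketch using the official Kubernetes Python client to create a single driver pod in a tenant namespace; the image, namespace and command are placeholders, and a real Spark deployment would normally go through spark-submit's Kubernetes support rather than raw pod objects.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
api = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="spark-driver", labels={"app": "spark"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="driver",
            image="example.com/analytics/spark:latest",  # hypothetical image
            command=["spark-submit", "--master",
                     "k8s://https://kubernetes.default.svc", "/work/job.py"],
        )],
    ),
)
api.create_namespaced_pod(namespace="tenant-a", body=pod)  # hypothetical tenant namespace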
Kubernetes Cluster Networking
● Kubernetes has a networking model that supports
customizable network providers
● Differing capabilities, from bare networking through to network traffic policy management
● E.g. isolating Tenant A from Tenant B
● Different providers use different approaches e.g.
● Flannel and Weave use VXLAN
● Cilium uses eBPF
● Calico and Romana use static routing
● Our Aries network doesn't support VLANs and our kernel
doesn't support eBPF!
● Therefore we chose Romana
Input/Output Challenges
● Lots of analytics frameworks, e.g. Apache Hadoop Map/Reduce and Apache Spark, rely on local storage
● E.g. temporary scratch space
● BUT many HPC systems
have no local storage
[Diagram: Apache Spark shuffle — the Spark scheduler coordinates map task threads, which write shuffle data and metadata via the block manager to local disk, while reduce task threads request shuffle reads back over TCP]
Virtual Local Storage
● tmpfs/ramfs
● Standard temporary file system for *nix OSes
● Stored in RAM
● tmpfs is preferred as it can be capped at a maximum size
● BUT competes with your analytics frameworks for memory
● Use the system's parallel file system, e.g. Lustre
● Unfortunately these aren't designed for small file IO
● Deadlocks the metadata servers causing significant slowdown for
everyone!
● We use Linux loopback mounts to solve this (see the sketch below)
● Short-lived files never leave the OS disk cache, i.e. they stay in memory
● The OS can flush its disk cache as needed
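As a rough sketch of the loopback approach (assuming the necessary privileges at node set-up time, with all paths and sizes hypothetical), a file-backed ext4 image is created on the parallel file system and mounted as node-local scratch; small, short-lived files then hit the local mount and the OS page cache instead of the Lustre metadata servers.

import subprocess

backing_file = "/lus/scratch/loopback/node-1234.img"  # hypothetical path on the parallel FS
mount_point = "/tmp/spark-local"                      # hypothetical node-local mount point

# Create a sparse 50 GB backing file, format it as ext4 and loopback-mount it.
subprocess.run(["truncate", "-s", "50G", backing_file], check=True)
subprocess.run(["mkfs.ext4", "-F", backing_file], check=True)
subprocess.run(["mkdir", "-p", mount_point], check=True)
subprocess.run(["mount", "-o", "loop", backing_file, mount_point], check=True)

Frameworks such as Spark are then pointed at the mount point (e.g. via spark.local.dir) so their scratch traffic stays off the metadata servers.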
Python Data Science
● Challenges
● Managing dependencies
● Compute nodes typically have
no external network
connectivity
● Distributed computation
● Maximising hardware
utilization for performance
Dependency Management
● Using Anaconda to solve this
● Have to resolve the environments up front
● Compute nodes can't access external network
● Also need to project environments onto compute nodes
as needed
● For containers, use volume mounts and environment variable injection into the container (see the sketch below)
● For standard jobs need to store environments on a file system
visible to compute nodes
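A minimal sketch of what a container entry point might do to pick up a volume-mounted Conda environment; the mount path and the CONDA_ENV_DIR variable name are hypothetical conventions, not a documented interface.

import os
import subprocess
import sys

# Hypothetical: the launcher bind-mounts a pre-built environment (resolved offline
# on a login node) into the container and injects its location via CONDA_ENV_DIR.
env_dir = os.environ.get("CONDA_ENV_DIR", "/conda-envs/analytics")

# Prepend the environment's bin directory so its python, dask-worker, etc.
# shadow whatever is baked into the image.
os.environ["PATH"] = os.path.join(env_dir, "bin") + os.pathsep + os.environ["PATH"]

# Hand off to the command the user asked the container to run.
sys.exit(subprocess.call(sys.argv[1:]))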
Distributed Computation - Dask
● Distributed work
scheduling library for
Python
● Integrates with
common data science
libraries
● Numpy, Pandas,
SciKit-Learn
● Familiar Pythonic API
for scaling out
workloads
● Can be installed as part
of the Conda
environment
>>> from dask.distributed import Client
>>> client = Client(scheduler_file='/path/to/scheduler.json')
>>> def square(x):
...     return x ** 2
>>> def neg(x):
...     return -x
>>> A = client.map(square, range(10))
>>> B = client.map(neg, A)
>>> total = client.submit(sum, B)
>>> total  # Function hasn't yet completed
<Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
>>> total.result()
-285
>>> client.gather(A)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Dask - Scheduler & Environment Setup
● Using Dask requires running scheduler and worker
processes on our compute resources
● We don't necessarily know the set of physical nodes we will get
ahead of time
● Dask provides a scheduler file mechanism for this
● Need to start a scheduler and worker on each physical
node
● We use the entry point scripts of our container images to do this (sketched below)
● Also need to integrate with the user's Conda environment
● MUST activate the volume mounted environments prior to
starting Dask
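Assuming a file system visible to every node, the entry point logic boils down to something like the sketch below: the first rank starts dask-scheduler with a scheduler file, every node starts a dask-worker pointing at the same file, and user code later connects with Client(scheduler_file=...). The path and the rank-detection variable (e.g. SLURM_PROCID) are hypothetical for this sketch.

import os
import subprocess

SCHEDULER_FILE = "/lus/scratch/user/scheduler.json"  # must be globally visible

# Hypothetical: the workload manager exposes this process's rank.
rank = int(os.environ.get("SLURM_PROCID", "0"))

procs = []
if rank == 0:
    # One scheduler for the whole job; it writes its address into the scheduler file.
    procs.append(subprocess.Popen(["dask-scheduler", "--scheduler-file", SCHEDULER_FILE]))

# One worker per physical node, all discovering the scheduler through the same file.
procs.append(subprocess.Popen(["dask-worker", "--scheduler-file", SCHEDULER_FILE]))

for p in procs:
    p.wait()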
Maximising Performance
● To fully take advantage of HPC hardware we need to use appropriately optimized libraries
● Option 1 - Custom Anaconda Channels
● E.g. Intel Distribution for Python
● Uses Intel AVX and MKL (Math Kernel Library) underneath popular
libraries
● Option 2 - ABI Injection
● Where a library uses a defined ABI, e.g. mpi4py, ensure it is compiled against the generic ABI (see the example below)
● At runtime use volume mounts to mount the platform specific
ABI implementation at the appropriate location
● E.g. Cray MPICH, Open MPI, Intel MPI
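The payoff of ABI injection is that user code is completely unchanged: a script like the one below is built once against the generic MPI ABI, and whichever platform-specific library is volume-mounted at the expected location at runtime is what actually carries the traffic. The script itself is just an illustrative allreduce, not part of any product.

from mpi4py import MPI  # linked against the generic ABI; the real MPI library is mounted at runtime

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank computes a partial sum; the reduction runs over whichever MPI
# implementation backs the ABI on this particular system.
local = sum(range(rank, 1000, size))
total = comm.allreduce(local, op=MPI.SUM)

if rank == 0:
    print(f"sum(0..999) computed across {size} ranks = {total}")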
Machine Learning
● Challenges
● How do we take advantage of
both GPUs and CPUs?
● Efficiently scale out onto
distributed systems
GPUs vs CPUs
● GPUs typically best suited
to training models
● More time and resource
intensive
● CPUs typically best suited
to inference
● i.e. Make predictions using a
trained model
● Need different hardware optimisations for each
● Don't necessarily know where our code will run ahead of time
● Therefore compile separately for each environment and select the desired build via the container entry point script (sketched below)
● This requires a container runtime that supports GPUs e.g. Shifter or
NVidia Docker
● NB - We're trading off image size for performance
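A hedged sketch of such an entry point: the build locations and the GPU check are illustrative assumptions, but the idea is simply to detect the hardware at launch and put the matching build on the import path before handing control to the user's command.

#!/usr/bin/env python3
import os
import shutil
import subprocess
import sys

def gpu_available() -> bool:
    # Simple heuristic: an NVIDIA device node or nvidia-smi on the PATH.
    return os.path.exists("/dev/nvidia0") or shutil.which("nvidia-smi") is not None

# Hypothetical locations baked into the image, one build per target.
build = "/opt/builds/tensorflow-gpu" if gpu_available() else "/opt/builds/tensorflow-cpu-avx"
os.environ["PYTHONPATH"] = build + os.pathsep + os.environ.get("PYTHONPATH", "")

# Hand off to whatever command the scheduler asked the container to run.
sys.exit(subprocess.call(sys.argv[1:]))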
Distributed Training
● Framework support for
distributed training is not
well optimized
● Typically TCP/IP based
protocols e.g. gRPC
● Esoteric to configure
● Want to utilize full
capabilities of the network
● Uber's Horovod
● https://github.com/uber/horovod
● Uses MPI to better leverage the network (InfiniBand/RoCE)
● Minor changes needed to your ML scripts (see the example below)
● Interleaves computation and
communication
● Uses more efficient MPI
collectives where possible
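The "minor changes" amount to a handful of lines; the sketch below follows Horovod's documented TensorFlow usage of the time (TensorFlow 1.x-style APIs), with the optimizer and learning rate as placeholders.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                       # initialise Horovod (MPI) once per process

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())   # pin one GPU per rank

opt = tf.train.AdagradOptimizer(0.01 * hvd.size())               # scale learning rate by worker count
opt = hvd.DistributedOptimizer(opt)                              # average gradients with MPI allreduce
hooks = [hvd.BroadcastGlobalVariablesHook(0)]                    # sync initial weights from rank 0

The job is then launched with mpirun (e.g. mpirun -np 16 python train.py) so that gradient exchange flows over the MPI fabric rather than gRPC.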
Horovod vs gRPC Performance
https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
Conclusions
● Scaling open source analytics has some non-obvious
gotchas
● Often assumes a traditional cluster environment
● Most challenges revolve around IO and Networking
● There are some promising open source efforts to solve these more thoroughly
● Our Roadmap
● Looking to have stock Docker running on next generation
systems
● Leverage more of Kubernetes' features to provide a cloud-like self-service HPC model
Questions?
rvesse@cray.com
https://cray.box.com/v/sw-data-july-2018
References - Containers
Tool Project Homepage/Repository
NERSC Shifter https://github.com/NERSC/shifter
Docker https://docker.com
NVidia Docker https://github.com/NVIDIA/nvidia-docker
Kubernetes https://kubernetes.io
Flannel https://coreos.com/flannel
Weave https://www.weave.works
Cilium https://cilium.io
Calico https://www.projectcalico.org
Romana https://romana.io
References - Analytics & Data Science
Tool Project Homepage/Repository
Apache Hadoop https://hadoop.apache.org
Anaconda https://conda.io/docs/
Dask http://dask.pydata.org/en/latest/
NumPy http://www.numpy.org
xarray http://xarray.pydata.org/en/stable/
SciPy https://www.scipy.org
Pandas https://pandas.pydata.org
mpi4py http://mpi4py.scipy.org/docs/
Intel Distribution of Python https://software.intel.com/en-us/distribution-for-python
References - Machine Learning
Tool Project Homepage/Repository
TensorFlow https://www.tensorflow.org
gRPC https://grpc.io
Horovod https://github.com/uber/horovod