The Why and How of HPC-Cloud Hybrids
with OpenStack
OpenStack Australia Day Melbourne
June, 2017
Lev Lafayette, HPC Support and Training Officer, University of Melbourne
lev.lafayette@unimelb.edu.au
1.0 Management Layer
1.1 HPC for performance
High-performance computing (HPC) is any computer system whose architecture allows for
above-average performance. The main use case is a compute cluster with a strict separation
between the head node and worker nodes and a high-speed interconnect, operating as a
single system.
1.2 Clouds for flexibility.
The precursor is virtualised hardware. Cloud VMs always have lower performance than bare-metal
HPC; the question is whether the flexibility is worth the overhead.
1.3 Hybrid HPC/Clouds.
University of Melbourne model, "the chimera": Cloud VMs deployed as HPC nodes. Freiburg
University model, "the cyborg": HPC nodes deploying Cloud VMs.
1.4 Reviewing user preferences and usage.
Users always want more of 'x'; the real issue identified was queue times. Usage indicated a high
proportion of single-node jobs.
1.5 Review and Architecture.
The review discussed whether UoM needed HPC at all; the chosen architecture was to use the
existing NeCTAR Research Cloud, with an expansion of general cloud compute provisioning and
use of a smaller "true HPC" system on bare-metal nodes.
2.0 Physical Layer
2.1 Physical Partitions.
"Real" HPC is a mere c276 cores, 21 GB per core. 2 socket Intel E5-2643 v3 E5-2643,
3.4GHz CPU with 6-core per socket, 192GB memory, 2x 1.2TB SAS drives, 2x 40GbE
network. “Cloud” partitions is almost 400 virtual machines with over 3,000 2.3GHz Haswell
cores with 8GB per core and . There is also a GPU partition with Dual Nvidia Tesla K80s (big
expansion this year), and departmental partitions (water and ashley). Management and login
nodes are VMs as is I/O for transferring data.
2.2 Network.
System network: cloud nodes use Cisco Nexus 10GbE with TCP/IP at 60 µsec latency (MPI
ping-pong); bare-metal nodes use Mellanox 2100 switches running Cumulus Linux at 40GbE, with
TCP/IP at 6.85 µsec latency and RDMA Ethernet at 1.15 µsec latency.
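The latency figures above come from MPI ping-pong measurements. As a hedged illustration (the partition, module, and benchmark binary names are assumptions, not Spartan's actual ones), a two-node point-to-point latency test with the OSU micro-benchmarks might look like:

    # Illustrative two-node MPI latency measurement; names below are placeholders.
    module load OpenMPI
    srun --partition=physical --nodes=2 --ntasks=2 --ntasks-per-node=1 ./osu_latency
    # osu_latency performs an MPI ping-pong and reports latency (usec) per message size.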
2.3 Storage.
Mountpoints for the home, project, and applications directories across all nodes (/home and
/project for user data and scripts; NetApp SAS aggregate, 70TB usable). Additional mountpoints
to VicNode Aspera Shares. The applications directory is currently on the management node and
needs to be decoupled. Bare-metal nodes have /scratch shared storage for MPI jobs (Dell R730
with 14x 800GB mixed-use SSDs providing 8TB of usable storage, NFS over RDMA) and
/var/local/tmp for single-node jobs (1.6TB PCIe SSD).
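As an illustration of the NFS-over-RDMA arrangement (the server name, export path, and mount options are assumptions, not the production configuration), a shared scratch mount might be expressed as:

    # Illustrative NFS-over-RDMA mount for shared scratch; names and paths are placeholders.
    mount -t nfs -o proto=rdma,port=20049,vers=3 scratch-server:/export/scratch /scratch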
3.0 Operating System and Scheduler Layer
3.1 Red Hat Linux.
Scalable FOSS operating system, high performance, very well suited for research
applications. In the November 2016 Top 500 list of supercomputers worldwide, every single
machine used a "UNIX-like" operating system, and 99.6% used Linux.
3.2 Slurm Workload Manager.
Job schedulers and resource managers allow for unattended background tasks expressed as
batch jobs among the available resources; they allow multicore, multinode, array, dependency,
and interactive submissions. The scheduler provides for parameterisation of computer
resources, automatic submission of execution tasks, and a notification system for incidents.
Slurm (originally the Simple Linux Utility for Resource Management), developed by Lawrence
Livermore et al., is FOSS and used by the majority of the world's top systems. It is scalable,
and offers many optional plugins, power-saving features, accounting features, etc. It is divided
into logical partitions which correlate with hardware partitions.
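A minimal sketch of a multi-node batch submission on such a system (the partition, module, and program names are illustrative, not Spartan's actual defaults):

    #!/bin/bash
    # Minimal multi-node Slurm batch script; partition and module names are illustrative.
    #SBATCH --partition=cloud
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=01:00:00
    #SBATCH --job-name=example

    module load OpenMPI
    srun ./my_mpi_program        # launched across all allocated tasks

Submit with "sbatch job.slurm"; array jobs use "sbatch --array=1-10" and dependent jobs use "sbatch --dependency=afterok:<jobid>".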
3.3 Git, Gerrit, and Puppet.
Version control, paired systems administration, configuration management.
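A sketch of the review-before-merge workflow this combination enables (the repository, host, and branch names are placeholders):

    # Propose a Puppet configuration change for peer review in Gerrit; names are placeholders.
    git clone ssh://gerrit.example.org:29418/hpc-puppet && cd hpc-puppet
    git checkout -b slurm-epilog-fix
    # ...edit manifests...
    git commit -a -m "Fix Slurm epilog permissions"
    git push origin HEAD:refs/for/master   # Gerrit creates a change for review; Puppet applies it after merge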
3.4 OpenStack Node Deployment.
Significant use of Nova (compute) service for provisioning and decommissioning of virtual
machines on demand.
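A hedged example of what provisioning and decommissioning a compute-node VM looks like with the OpenStack CLI (the flavor, image, network, key, and node names are assumptions):

    # Provision a compute-node VM; all names below are placeholders.
    openstack server create \
        --flavor uom.general.8c32g \
        --image compute-node-image \
        --network cluster-net \
        --key-name hpc-admin \
        node0123

    # Decommission once the node has been drained in Slurm.
    openstack server delete node0123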
4.0 Application Layer
4.1 Source Code and EasyBuild.
Source code provides better control over security updates, integration, development, and
much better performance. It is absolutely essential for reproducibility in a research environment.
EasyBuild makes source software installs easier with build scripts that specify compilation
blocks (e.g., ConfigureMake, CMake, etc.) and toolchains (GCC, Intel, etc.), tied to
environment modules (Lmod). Modulefiles allow for dynamic changes to a user's environment
and ease the use of multiple versions of software applications on a system.
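A minimal sketch of the build-and-load cycle, assuming an illustrative easyconfig name:

    # Build OpenMPI from source against the GCC toolchain; --robot resolves dependencies.
    eb OpenMPI-2.1.1-GCC-6.2.0.eb --robot
    # EasyBuild generates the corresponding modulefile, which Lmod then exposes to users.
    module avail OpenMPI
    module load OpenMPI/2.1.1-GCC-6.2.0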
4.2 Compilers, Scripting Languages, and Applications.
Usual range of suspects: Intel and GCC for compilers (and a little bit of PGI); Python, Ruby,
and Perl for scripting languages; OpenMPI wrappers. Major applications include MATLAB,
Gaussian, NAMD, R, OpenFOAM, Octave, etc.
Almost 1,000 applications/versions installed from source, plus packages.
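Typical use from a user's perspective, with illustrative module names, is to load a toolchain and compile through the MPI wrapper:

    # Compile an MPI application with the OpenMPI compiler wrapper (module name illustrative).
    module load OpenMPI/2.1.1-GCC-6.2.0
    mpicc -O2 -o my_mpi_program my_mpi_program.c
    # The same source can be rebuilt against the Intel-based toolchain by loading that module instead.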
4.3 Containers with Singularity.
A container in a cloud virtual machine on an HPC! Wait, what?
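In practice this means a user can pull an image and run it inside a Slurm job; a hedged sketch (the module name and the resulting image file name are assumptions):

    # Run a containerised command inside a Slurm allocation; names are illustrative.
    module load Singularity
    singularity pull docker://ubuntu:16.04          # fetches and converts the image locally
    srun singularity exec ubuntu-16.04.simg cat /etc/os-release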
5.0 User Layer
5.1 Karaage.
Spartan uses its own LDAP authentication that is tied to the university's Security Assertion
Markup Language (SAML) system. Users on Spartan must belong to a project. Projects must be led
by a University of Melbourne researcher (the "Principal Investigator") and are subject to
approval by the Head of Research Compute Services. Participants in a project can be
researchers or research support staff from anywhere. Karaage is a Django-based application for
user, project, and cluster reporting and management.
5.2 Freshdesk.
OMG Users!
5.3 Online Instructions and Training.
Many users (even post-doctoral researchers) require basic training in the Linux command line, a
requisite skill for HPC use. An extensive training programme for researchers is available, using
andragogical methods, including day-long courses in “Introduction to Linux and HPC Using
Spartan”, “Linux Shell Scripting for High Performance Computing”, and “Parallel Programming
On Spartan”.
Documentation is online (GitHub, website, and man pages) and there are plenty of Slurm
examples on the system.
6.0 Future Developments
6.1 Cloudbursting with Azure.
Slurm allows cloudbursting via the power-save feature; successful experiments (and bug
discovery) within the NeCTAR research cloud.
About to add Azure through the same login node. Azure nodes do not mount the applications
directory; necessary data is wrapped for transfer in the job script.
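An illustrative slurm.conf excerpt for power-save-based cloudbursting (node names, counts, timings, and helper script paths are placeholders, not the production values):

    # slurm.conf excerpt (illustrative): burst nodes are created on resume, destroyed on suspend.
    SuspendProgram=/usr/local/sbin/burst_terminate.sh
    ResumeProgram=/usr/local/sbin/burst_provision.sh
    SuspendTime=600                  # seconds idle before a burst node is released
    ResumeTimeout=900                # seconds allowed for a burst node to boot and join
    NodeName=azure[001-100] State=CLOUD Weight=100
    PartitionName=cloudburst Nodes=azure[001-100] MaxTime=24:00:00 State=UP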
6.2 GPU Expansion.
Plans for a significant increase in the GPU allocation.
6.3 Test cluster (Thespian).
Everyone has a test environment; some people are lucky enough to also have a separate
production environment.
Test nodes already exist for the Cloud and Physical partitions; the management and login
nodes will also be replicated.
6.4 New Architectures
New architectures can be added to the system with a separate build node (another VM) and
software built for that architecture; an entirely new system is not needed.