Scalable Parallel Programming in Python with Parsl
Kyle Chard (chard@uchicago.edu)
Yadu Babuji, Anna Woodard, Ben Clifford, Zhuozhao Li, Mike Wilde, Dan Katz, Ian Foster
http://parsl-project.org
Composition and parallelism
Scientific software is increasingly assembled rather than written
– High-level language to integrate and wrap components from many sources
Parallel and distributed computing is ubiquitous
– Increasing data sizes combined with plateauing sequential processing power
Python (and the SciPy ecosystem) is the de facto standard language for science
– Libraries, tools, Jupyter, etc.
Parsl allows for the natural expression of parallelism in Python:
– Programs express opportunities for parallelism
– These opportunities are realized at execution time, using different execution models on different parallel platforms
Parsl: parallel programming in Python
Apps define opportunities for parallelism
• Python apps call Python functions
• Bash apps call external applications
Apps return “futures”: a proxy for a result that might not yet be available
Apps run concurrently, respecting data dependencies. Natural parallel programming!
Parsl scripts are independent of where they run. Write once, run anywhere!
pip install parsl
Try Parsl: https://mybinder.org/v2/gh/Parsl/parsl-tutorial/master
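For illustration, a minimal runnable sketch of this model (the local thread executor and the app body are assumptions for the example, not from the slides):

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# Load a local-threads configuration so apps run in this process
parsl.load(Config(executors=[ThreadPoolExecutor()]))

@python_app
def double(x):
    # Hypothetical app: any Python function can become an app
    return 2 * x

future = double(21)     # returns immediately with a future
print(future.result())  # blocks until the app completes; prints 42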
Data-driven example: parallel geospatial analysis
A land-use image processing pipeline for the MODIS remote sensor
[Pipeline diagram with stages: Analyze, Landuse, Colorize, Mark, Assemble]
Expressing a many-task workflow in Parsl
1) Wrap the science applications as Parsl Apps:
@bash_app
def simulate(outputs=[]):
    # Parsl substitutes {outputs[0]} in the returned command template
    return './simulation_app.exe {outputs[0]}'

@bash_app
def merge(inputs=[], outputs=[]):
    i = inputs; o = outputs
    # Inputs may be File objects, so convert each to its path string
    return './merge {1} {0}'.format(' '.join(map(str, i)), o[0])

@python_app
def analyze(inputs=[]):
    return analysis_package(inputs)
Expressing a many-task workflow in Parsl
2) Execute the parallel workflow by calling Apps:
sims = []
for i in range(nsims):
    sims.append(simulate(outputs=['sim-%s.txt' % i]))

# merge runs only once every simulation's output file is ready
all_sims = merge(inputs=[i.outputs[0] for i in sims],
                 outputs=['all.txt'])

result = analyze(inputs=[all_sims.outputs[0]])
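Blocking on the final future is how the script retrieves the pipeline's output; a one-line usage sketch:

print(result.result())  # blocks until analyze and all upstream tasks finish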
Decomposing dynamic parallel execution into a task-dependency graph
[Diagram: Parsl translates the script into a dynamic task-dependency graph]
Parsl scripts are execution provider independent
The same script can be run locally, on grids, clouds, or supercomputers
Growing support for various schedulers and cloud vendors
Separation of code and execution
Choose execution environment at runtime. Parsl will direct tasks to the configured execution environment(s).
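A hedged sketch of this separation, using one of Parsl's shipped example configurations; only the loaded configuration changes, never the app code:

import parsl
from parsl.configs.local_threads import config as local_config

# Swap in a cluster or cloud configuration here (e.g., one built on a
# Slurm or AWS provider) and the same apps run there unchanged.
parsl.load(local_config)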
Authentication and authorization
Authn/z is hard…
– 2FA, X.509, GSISSH, etc.
Integration with Globus Auth to support native-app access to Globus (and other) services
Uses scoped access tokens, refresh tokens, and delegation support
Parsl provides transparent (wide area) data management
Implicit data movement to/from repositories, laptops, and supercomputers
Globus for third-party, high-performance, and reliable data transfer
– Support for site-specific DTNs (data transfer nodes)
HTTP/FTP direct data staging
parsl_file = File('globus://EP/path/file')
www.globus.org
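As a hedged usage sketch, a Globus-addressed file is passed to an app (e.g., the analyze app above) like any local file; the endpoint name 'EP' and path are placeholders from the slide:

from parsl.data_provider.files import File

remote = File('globus://EP/path/file')  # staged to the execution site automatically
result = analyze(inputs=[remote])       # the app sees it as a normal input file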
Parallel applications differ widely
High-throughput workloads
– Protein docking, image processing, materials reconstructions
– Requirements: 1000s of tasks, 100s of nodes, reliability, usability, monitoring, elasticity, etc.
Extreme-scale workloads
– Cosmology simulations, imaging the Arctic, genomics analysis
– Requirements: millions of tasks, 1000s of nodes (100,000s of cores), capacity
Interactive and real-time workloads
– Materials science, cosmic ray shower analysis, machine learning inference
– Requirements: 10s of nodes, rapid response, pipelining
Parsl implements an extensible executor interface
High-throughput executor (HTEX)
– Pilot-job-based model with a multi-threaded manager deployed on worker nodes
– Designed for ease of use, fault tolerance, etc.
– <2000 nodes (~60K workers), millions of tasks, task duration/nodes > 0.01
Extreme-scale executor (EXEX)
– A distributed MPI job manages execution; the manager rank communicates the workload directly to the other, worker, ranks
– Designed for extreme-scale execution on supercomputers
– >1000 nodes (>30K workers), millions of tasks, task duration >1 minute
Low-latency executor (LLEX)
– Direct socket communication to workers, fixed resource pool, limited features
– 10s of nodes, <1M tasks, task duration <1 minute
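A hedged configuration sketch for HTEX on a Slurm cluster (the partition name, walltime, and block counts are illustrative values, not recommendations):

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(executors=[
    HighThroughputExecutor(
        label='htex',
        provider=SlurmProvider(
            partition='debug',   # illustrative partition name
            nodes_per_block=2,   # nodes per scheduler job ("block")
            init_blocks=1,
            max_blocks=4,        # allows elastic scaling up to 4 blocks
            walltime='00:30:00',
        ),
    ),
])
parsl.load(config)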
Parsl executors scale to 2M tasks/256K workers
Weak scaling (10 tasks per worker):
● HTEX and EXEX outperform other Python-based approaches and scale beyond ~2M tasks
● HTEX and EXEX scale to 2K nodes (~65K workers) and 8K nodes (~262K workers), respectively, with >1K tasks/s
[Weak-scaling plots for 0 s and 1 s tasks]
Interactive supercomputing in Jupyter notebooks
Monitoring and visualization
[Screenshots: workflow view and task view]
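Monitoring is enabled through the configuration; a minimal sketch, assuming the MonitoringHub API (the interval value is illustrative):

import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
from parsl.monitoring.monitoring import MonitoringHub
from parsl.addresses import address_by_hostname

config = Config(
    executors=[ThreadPoolExecutor(label='local_threads')],
    monitoring=MonitoringHub(
        hub_address=address_by_hostname(),
        resource_monitoring_interval=10,  # seconds between resource samples
    ),
)
parsl.load(config)
# After a run, the parsl-visualize tool serves the workflow and task views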
Other functionality provided by Parsl
Globus. Delegated authentication and wide-area data management
Fault tolerance. Support for retries, checkpointing, and memoization (see the sketch after this list)
Containers. Sandboxed execution environments for workers and tasks
Data management. Automated staging with HTTP, FTP, and Globus
Multi-site. Combining executors/providers for execution across different resources
Elasticity. Automated resource expansion/retraction based on workload
Monitoring. Workflow and resource monitoring and visualization
Reproducibility. Capture of workflow provenance in the task graph
Jupyter integration. Seamless description and management of workflows
Resource abstraction. Block-based model overlaying different providers and resources
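A hedged sketch of the fault-tolerance knobs (values are illustrative): retries and checkpointing are Config options, while memoization is enabled per app:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

config = Config(
    executors=[ThreadPoolExecutor(label='local_threads')],
    retries=2,                    # re-run a failed task up to two more times
    checkpoint_mode='task_exit',  # checkpoint each task as it completes
)
parsl.load(config)

@python_app(cache=True)           # memoize: identical calls reuse prior results
def expensive(x):
    return x ** 2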
Parsl is being used in a wide range of scientific applications
• Machine learning to predict stopping power in materials
• Protein and biomolecule structure and interaction
• Weak lensing using sky surveys
• Cosmic ray showers as part of QuarkNet
• Information extraction to classify image types in papers
• Materials science at the Advanced Photon Source
• Machine learning and data analytics (DLHub)
Parsl provides simple, safe, scalable, and flexible parallelism in Python
Simple: Python with minimal new constructs (integrated with the growing SciPy ecosystem and other scientific services)
Safe: deterministic parallel programs through immutable input/output objects, a dependency task graph, etc.
Scalable: efficient execution from laptops to the largest supercomputers
Flexible: programs composed from existing components and then applied to different resources/workloads
Parsl is an open-source Python project
https://github.com/Parsl/parsl
Questions?
U.S. Department of Energy
http://parsl-project.org
https://mybinder.org/v2/gh/Parsl/parsl-tutorial/master