National Center for Supercomputing Applications
University of Illinois at Urbana–Champaign
Expressing and sharing workflows
Daniel S. Katz
Assistant Director for Scientific Software & Applications, NCSA
Research Associate Professor, CS
Research Associate Professor, ECE
Research Associate Professor, iSchool
dskatz@illinois.edu, d.katz@ieee.org
@danielskatz
What’s a workflow?
• A set of tasks and dependencies between them
• Perhaps expressed as a data structure, e.g. graph (DAG or cyclic)
• How is this different from a computer program?
• The tasks are more well-defined (inputs, outputs)
• The tasks are longer (running time O(sec) – O(hr))
• Why express it differently?
• Program (script) is a natural way of expressing a workflow
• Examples: shell scripts, programs in Swift/Parsl
• YesWorkflow annotations to help in understanding scripts
• Swift/Parsl: functions used to identify components
• Expressing it as data corresponds to the compiled (assembly)
version of the workflow
• Useful for a lot of things, but not for understanding
http://swift-lang.org, http://parsl-project.org
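As a minimal sketch of the "workflow as data" view above, the tasks and their dependencies can be written as a plain dictionary and ordered with Python's standard-library graphlib module (Python 3.9+). The task names (fetch, clean, analyze, summarize, report) are hypothetical and are not taken from the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow expressed as data: each key is a task and each
# value is the set of tasks it depends on. Together they form a DAG.
workflow = {
    "fetch": set(),
    "clean": {"fetch"},
    "analyze": {"clean"},
    "summarize": {"clean"},
    "report": {"analyze", "summarize"},
}

# static_order() yields the tasks so that every task appears after all of
# its dependencies. Independent tasks may appear in either order.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The dictionary is easy for a scheduler to process, but it says nothing about what each task does, which is the sense in which the data form is useful for many things other than understanding.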
YesWorkflow (YW)
• Name: “Yes, scripts are (can be) workflows, too!”
• But, workflow (dataflow) usually hidden in the script
• Idea: let the script author reveal the structure by
declaring tasks (steps) and dataflow between tasks.
• This is a modeling step
• very coarse (workflow: one big black box w/ inputs & outputs)
• or rather fine (workflow has many steps, linked by dataflow)
• => a language to explain (graphically) the concepts
(relevant steps, relevant data) you want to share
• => this conceptual YW model can itself be queried; linked
with runtime observables, provenance
Credit: Bertram Ludäscher
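The idea of declaring steps and dataflow inside a script might look like the sketch below, which uses YesWorkflow's @begin, @end, @in and @out comment tags. The script, its step names (drop_invalid, average) and its data are hypothetical. The YesWorkflow tool is not invoked here; the annotations are ordinary comments and the script runs without it.

```python
# @begin clean_and_average
# @in readings
# @out mean_value

# @begin drop_invalid
# @in readings
# @out valid_readings
def drop_invalid(readings):
    # Keep only readings that are present and non-negative.
    return [r for r in readings if r is not None and r >= 0]
# @end drop_invalid

# @begin average
# @in valid_readings
# @out mean_value
def average(valid_readings):
    # Arithmetic mean of the remaining readings.
    return sum(valid_readings) / len(valid_readings)
# @end average

# Hypothetical input data, for illustration only.
readings = [1.0, None, 3.0, -1.0, 5.0]
mean_value = average(drop_invalid(readings))
print(mean_value)

# @end clean_and_average
```

This is a fairly fine-grained model: two steps linked by valid_readings. A coarse model would keep only the outer block, with readings as input and mean_value as output.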
Parsl
• A Python-based parallel scripting library (http://parsl-project.org),
based on ideas in Swift (http://swift-lang.org)
• Tasks exposed as functions (Python or bash)
@App('bash', data_flow_kernel)
def echo(message, outputs=[]):
    return 'echo {0} &> {outputs[0]}'

@App('python', data_flow_kernel)
def cat(inputs=[]):
    with open(inputs[0]) as f:
        return f.readlines()
• Return values are futures
• Other tasks can be called that depend on these futures
• Will not run until futures are satisfied/filled
• Main code used to glue functions together
hello = echo("Hello World!", outputs=['hello1.txt'])
message = cat(inputs=[hello.outputs[0]])
• Fairly easy to understand
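The futures behaviour described above can be illustrated without Parsl, using only the standard-library concurrent.futures module. This is a stand-in for the concept, not Parsl's API: the echo and cat functions below are plain Python re-creations of the slide's two tasks, and the dependency is handled by explicitly waiting on the upstream future.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def echo(message, path):
    # Upstream task: write the message to a file and return the file path.
    with open(path, "w") as f:
        f.write(message + "\n")
    return path


def cat(path_future):
    # Downstream task: result() blocks until the upstream future is filled,
    # so the file is guaranteed to exist before it is read.
    path = path_future.result()
    with open(path) as f:
        return f.readlines()


with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a future, not the final value.
    hello = pool.submit(echo, "Hello World!", os.path.join(tmp, "hello1.txt"))
    message = pool.submit(cat, hello)
    lines = message.result()

print(lines)
```

In the slide's Parsl code the waiting is implicit: passing hello.outputs[0] to cat is enough for the library to hold cat back until echo has finished.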
How to promote/share workflows
• How do we share general software?
• Libraries (units of execution with well-defined APIs)
• Source code (fork model)
• Source code repositories (GitHub), packaging
systems/repositories (PyPI, CRAN)
• How do we share data?
• Repositories (Dryad)
• For workflows
• Libraries -> sub-workflows, defined to provide well-specified
functionality
• Source code -> source code (scripts), may still be hard to
understand
• Data -> data repository for workflows (MyExperiment)
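The "libraries -> sub-workflows" analogy can be sketched as follows: two small tasks are composed into one function with a well-specified interface, which other workflows could then import and reuse. All names here (normalize, threshold, select_strong_signals) are hypothetical illustrations, not part of any workflow system.

```python
def normalize(values):
    """Task: scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def threshold(values, cutoff):
    """Task: keep only values at or above the cutoff."""
    return [v for v in values if v >= cutoff]


def select_strong_signals(values, cutoff=0.5):
    """Sub-workflow: normalize, then threshold.

    Callers only need to know the inputs and the output, in the same way
    that they would use a library function.
    """
    return threshold(normalize(values), cutoff)


result = select_strong_signals([2, 4, 6, 10])
print(result)
```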
www.myexperiment.org
De Roure, D., Goble, C., Stevens, R. (2009) The Design
and Realisation of the myExperiment Virtual Research
Environment for Social Sharing of Workflows. Future
Generation Computer Systems 25, pp. 561-7.
• A workflow commons for workflow sharing,
designed using Web 2.0 principles
• Launched open beta in November 2007, still
actively used
• Largest public collection of workflows, for
multiple workflow systems
• 2400+ entries in Google Scholar refer to
myExperiment
• Open source, REST API, part of Open Linked
Data cloud (66k triples) - lod-cloud.net
• Introduced “packs” which led to Research
Objects – www.researchobject.org
• Workflow collection studied in scientific
workflow and e-Science communities
• Service maintained by Manchester and Oxford
universities. Informs design of other workflow
sharing systems.
• Content stats: 10591 members, 393 groups,
3876 workflows, 1233 files, 477 packs
Credit: Carole Goble
GitHub
• Widely used for sharing software, and socially working
on/with software (and many other types of documents)
• GitHub is used for sharing workflows today
• Both scripts and data
• Borrowing from “Software vs. data in the context of
citation”
• A workflow as a program or a script is code, a creative work
• Appropriate license: OSI-approved open source (e.g., BSD)
• A workflow as a DAG is data?
• Appropriate license: Creative Commons (e.g., CC-BY)?
• So, let’s keep workflows as programs/scripts
• Use YesWorkflow with scripts
• Use GitHub to share
Katz DS, Niemeyer KE, et al. (2016) Software vs. data in the context of citation.
PeerJ Preprints 4:e2630v1 doi: 10.7287/peerj.preprints.2630v1
