Blastn + Jupyter on Docker
Examples from Bioinformatics
Samantha & Lynn Langit
Jupyter - Inspired by Mathematica
Thanks, Stephen Wolfram
If you can SEE it (your data and code), you can work with it better
@lynnlangit
Next terminal <- a better Python REPL
• Fernando Perez in 2001
• IPython (interactive)
• Modeled on Mathematica Notebooks
• IP(y): Notebook -> in a browser
• 2014: IPython Notebook -> Jupyter Notebook
Enter Jupyter Notebooks
Jupyter Notebooks support the ML lifecycle
1. Collect Data
• Retrieve Files
• Query SQL Databases
• Call Web Services
• “Scrape” Web Pages
2. Prepare Data
• Explore Data
• Validate Data
• Clean Data
• Features / Data
3. Train Model
• Prepare Training Set
• Experiment
• Test Model
• Visualize
4. Evaluate Model
• Test Performance
• Compare Models
• Validate Model
• Visualize
5. Deploy Model
• Export Model File
• Prepare Job
• Deploy Container
• Re-package Model
Execute code blocks:
- Python, R… code
- SQL queries
- Shell commands
Write Documentation:
- Markdown language
Visualize Data
- Viz tools…
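A minimal sketch of the "Collect" and "Prepare" steps as they might appear in one notebook cell. The table name, column names, and sample values are made up for illustration; only the Python standard library is used.

```python
import sqlite3


def collect_and_clean():
    """Query a (hypothetical) SQL table, then drop rows with missing values."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE samples (id TEXT, gc_content REAL)")
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [("s1", 0.41), ("s2", None), ("s3", 0.55)],  # illustrative data only
    )
    # Collect: query the SQL database
    rows = conn.execute(
        "SELECT id, gc_content FROM samples ORDER BY id"
    ).fetchall()
    conn.close()
    # Prepare: keep only rows where the measurement is present
    return [(sample_id, gc) for sample_id, gc in rows if gc is not None]


print(collect_and_clean())
```

In a notebook, the returned rows would typically feed the next cell (exploration, plotting, or training).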
Jupyter Visualizations –
so many possibilities
Notebook Customizations
Multiple
Runtimes
Languages
Share output
Code or
Equations
LaTeX
Math
Comments
Markdown
Wiki-like
Graphics
Visualizations
Charting
Results
LIVE
DOCUMENTATION
Reproducible
Research
Example
Jupyter locally
Mathematica evolved…
Jupyter Notebook
• Market leader
• Started for single-user use
• Academic community
• GitHub integration
• Added JupyterHub for collaboration
Zeppelin Notebook
• Started for collaboration
• Enterprise
• Security
Vendor Notebook
• Databricks for Apache Spark
• Jupyter-like, but proprietary format
Running Notebooks
Desktop
• Install and run
Local Server
• Can use JupyterHub for groups
Cloud
• Large number of options
Docker
• Start a container
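A hedged sketch of the Docker option: a small helper that builds (but does not run) a `docker run` command. It assumes the image serves Jupyter on port 8888, which matches the localhost:8888 address in the Editor's Notes but is not confirmed for every image. The image name below is the one linked in the Editor's Notes.

```python
import shlex


def docker_run_command(image, host_port=8888, container_port=8888):
    """Build the argument list for starting a notebook container.

    Assumption: the image exposes Jupyter on `container_port`.
    """
    return [
        "docker", "run",
        "--rm",                                   # remove the container on exit
        "-p", f"{host_port}:{container_port}",    # publish the notebook port
        image,
    ]


cmd = docker_run_command("lynnlangit/blastn-jupyter-docker")
print(shlex.join(cmd))  # copy-paste friendly form of the command
```

To actually start the container, the list could be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed.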
Extending, Refactoring Open Notebooks
• Write functions in one notebook
• Link to another notebook
• Write extensions (nbextensions.com)
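One possible way to reuse functions written in another notebook, sketched with the standard library only. It relies on notebooks being JSON files with a top-level "cells" list (nbformat 4). The helper names are hypothetical, and cells containing IPython magics (lines starting with % or !) would not run this way.

```python
import json


def load_notebook_code(path):
    """Return the source of all code cells in a .ipynb file as one string."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # nbformat allows either a list of lines or a single string
            sources.append("".join(src) if isinstance(src, list) else src)
    return "\n".join(sources)


def import_notebook(path):
    """Execute the notebook's code cells and return the resulting namespace."""
    namespace = {}
    exec(load_notebook_code(path), namespace)
    return namespace
```

Usage, assuming a notebook `helpers.ipynb` that defines a function `gc`: `helpers = import_notebook("helpers.ipynb")`, then `helpers["gc"]("GGCA")`. Only run notebooks you trust, since every code cell is executed.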
Up the bar
Personalized medicine via genomic analysis
Reproducible Research – Experiments as Code
What is Blastn?
The Basic Local Alignment Search Tool (BLAST) finds regions of similarity
between biological sequences. The program compares nucleotide or
protein sequences to sequence databases and calculates the
statistical significance of matches. blastn is the BLAST program for
nucleotide-to-nucleotide searches.
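To make "regions of similarity" concrete, here is a small sketch of local alignment scoring (Smith-Waterman dynamic programming). This is not how BLAST itself works: BLAST uses a faster seed-and-extend heuristic and adds significance statistics. The scoring values (match, mismatch, gap) are arbitrary choices for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (Smith-Waterman).

    Scores are floored at 0, so an alignment can start and end anywhere
    in either sequence: that is what makes it "local".
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(
                0,                   # start a new alignment here
                diag,                # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("GGACGTCC", "TTACGTAA"))
```

The two example sequences share the region ACGT, so the score reflects four matching bases even though the flanking bases differ.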
Cloud-based Jupyter
PaaS
• AWS SageMaker
• Azure Notebooks
• Google Colab
Tools for Jupyter
• Binder for GitHub
• Point to your GitHub Repo
• Jupyter Notebooks
• requirements.txt
• It builds a Docker image
• You can run your Notebooks
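A small sketch of "point to your GitHub repo": building a mybinder.org launch link from a repository owner, name, and git ref. The helper name is hypothetical, and the example repository is used only for illustration.

```python
from urllib.parse import quote


def binder_launch_url(owner, repo, ref="HEAD"):
    """Return a mybinder.org launch URL for a public GitHub repository.

    `ref` may be a branch, tag, or commit; it is percent-encoded so that
    names containing "/" stay a single path segment.
    """
    parts = (quote(owner, safe=""), quote(repo, safe=""), quote(ref, safe=""))
    return "https://mybinder.org/v2/gh/{}/{}/{}".format(*parts)


print(binder_launch_url("binder-examples", "requirements"))
```

Opening the printed link asks Binder to build an image from the repository (using files such as requirements.txt) and then start a notebook server from it.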
Example
Binder
Example - GT-Scan2
Jupyter for Genomics Research
Future of Jupyter for Research
Academic Institutions and Research Labs
UC Berkeley, Davis, San Diego
Cal Poly San Luis Obispo
Clemson University
CU Boulder
U of Illinois, Minnesota, Missouri, Rochester, Texas
MIT
Michigan State U
Texas A & M

Editor's Notes

  • #4: History talk from Cristian Prieto (NDC Oslo 2016) -- https://vimeo.com/223984769 http://blog.fperez.org/2012/01/ipython-notebook-historical.html
  • #10: Local install: pip install ipython[all], or use Anaconda, which installs Jupyter notebooks by default; pip install jupyter[all], and you can also install R. You can use Docker (the 2.1 GB image contains all libraries), or you can use Azure Notebooks or AWS SageMaker notebooks. Only Python 2 is installed by default; you can install other runtimes. Start and run in a local browser (no database, uses local .json files). IPython notebook -> localhost:8888/tree. Uses GitHub-flavored Markdown by default. https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
  • #12: https://github.com/ipython-contrib/jupyter_contrib_nbextensions pip install jupyter_contrib_nbextensions -OR- conda install -c conda-forge jupyter_contrib_nbextensions
  • #14: https://github.com/Microsoft/Elevation/blob/master/notebooks/aggregation.ipynb https://www.microsoft.com/en-us/research/project/crispr/
  • #15: https://blast.ncbi.nlm.nih.gov/Blast.cgi
  • #16: https://hub.docker.com/r/lynnlangit/blastn-jupyter-docker/
  • #17: https://medium.com/@lynnlangit/aws-sagemaker-for-bioinformatics-b8e8a96479d8 Jupyter on GCE VM -- https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52
  • #19: https://mybinder.org/ -ALSO- https://nbviewer.jupyter.org/ - allows you to view rendered notebooks stored in GitHub
  • #22: http://jupyterhub-tutorial.readthedocs.io/en/latest/ https://github.com/jupyterhub/jupyterhub-tutorial/blob/master/JupyterHub.pdf http://jupyterhub.readthedocs.io/en/latest/gallery-jhub-deployments.html