Understanding Jupyter notebooks using bioinformatics examples

The next terminal – Jupyter
With examples from Bioinformatics
@lynnlangit

How often do you use the terminal?

Terminal Customizations
Prompt Output Aesthetics Code Comments Graphics

What does this Code do?

But it’s not good enough
Why not?

Machine Learning
Too much data to process? Or too much code? Can you ‘see’ what is happening?

What does this Code do?
Which algorithm?

Visualizing Data Processing ML Code
Which algorithm?

Now – more data, much more…
IoT increases data volume and complexity exponentially

Inspired by Mathematica
Thanks, Stephen Wolfram
If you can SEE it (your data and code), you can work with it better

Next terminal -> a better Python REPL
• Fernando Perez in 2001
• IPython (interactive)
• Modeled on Mathematica notebooks
• IP[y]: Notebook -> in a browser
• 2014 IPython -> Jupyter Notebook

Enter Jupyter Notebooks

Jupyter Notebooks support the ML lifecycle
1. Collect Data
• Retrieve Files
• Query SQL Databases
• Call Web Services
• “Scrape” Web Pages
2. Prepare Data
• Explore Data
• Validate Data
• Clean Data
• Features / Data
3. Train Model
• Prepare Training Set
• Experiment
• Test Model
• Visualize
4. Evaluate Model
• Test Performance
• Compare Models
• Validate Model
• Visualize
5. Deploy Model
• Export Model File
• Prepare Job
• Deploy Container
• Re-package Model
Across all steps, a notebook lets you:
• Execute code blocks: Python, R… code, SQL queries, shell commands
• Write documentation: Markdown language
• Visualize data: viz tools…
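As a minimal sketch of what steps 1 and 2 can look like in a single notebook code cell, the example below queries a SQL database and then cleans the result. The table name, column names, and values are invented for illustration, and an in-memory SQLite database stands in for a real data source.

```python
import sqlite3
import statistics

# Step 1 (collect): query a SQL database. An in-memory SQLite table with
# made-up rows stands in for a real data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample_id TEXT, read_depth REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("s1", 30.0), ("s2", None), ("s3", 28.5), ("s4", 31.5)],
)
rows = conn.execute(
    "SELECT sample_id, read_depth FROM samples ORDER BY sample_id"
).fetchall()
conn.close()

# Step 2 (prepare): validate and clean by dropping rows with missing values.
clean = [(sid, depth) for sid, depth in rows if depth is not None]

# Explore: a quick summary statistic before any training step.
mean_depth = statistics.mean(depth for _, depth in clean)
print(f"{len(clean)} of {len(rows)} rows kept, mean read depth {mean_depth:.1f}")
```

In a notebook, the printed summary appears directly under the cell, next to any Markdown notes that explain why rows were dropped.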

Jupyter Visualizations –
so many possibilities

Notebook Customizations
• Multiple runtimes / languages
• Share output: code or equations
• LaTeX math
• Comments: Markdown, wiki-like
• Graphics: visualizations, charting
• Results: LIVE DOCUMENTATION, reproducible research

Example
Jupyter locally

Mathematica evolved…
Jupyter Notebook
• Market leader
• Started for single use
• Academic community
• GitHub integration
• Added JupyterHub for collaboration
Zeppelin Notebook
• Started for collaboration
• Enterprise
• Security
Vendor Notebook
• Databricks for Apache Spark
• Jupyter-like, but proprietary format

Running Notebooks
• Desktop: install and run
• Local server: can use JupyterHub for groups
• Cloud: large number of options

Extending, Refactoring Open Notebooks
• Write functions in one notebook
• Link to another notebook
• Write extensions (nbextensions.com)
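One way to picture "write functions in one notebook, link to another": a notebook file is a JSON document whose code cells can be read and executed from elsewhere. The sketch below is a simplified illustration using only the standard library, not a Jupyter API. The notebook content and the `gc_content` helper are made up, and the loader assumes the code cells contain plain Python (no IPython magics).

```python
import json

# A minimal notebook document in the nbformat 4 JSON layout, built inline so
# the sketch is self-contained. The cell contents are hypothetical.
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Helpers"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": [
                "def gc_content(seq):\n",
                "    seq = seq.upper()\n",
                "    return (seq.count('G') + seq.count('C')) / len(seq)\n",
            ],
        },
    ],
})

def load_notebook_namespace(text):
    """Execute every code cell of a notebook and return the resulting names."""
    namespace = {}
    for cell in json.loads(text)["cells"]:
        if cell["cell_type"] == "code":
            source = cell["source"]
            # nbformat allows source as a single string or a list of lines.
            code = source if isinstance(source, str) else "".join(source)
            exec(code, namespace)
    return namespace

# "Link" to the helper notebook and call a function defined there.
helpers = load_notebook_namespace(notebook_json)
print(helpers["gc_content"]("ATGC"))
```

With a real file, the same loader would take the text read from the `.ipynb` file instead of the inline JSON string.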

Raising the bar
Personalized medicine via genomic analysis

Reproducible Research – Experiments as Code

Bioinformatics | Denis C. Bauer | @allPowerde
Genomic Research Tools: Two Examples
• GT-Scan2: How can genome engineering be made more effective?
• VariantSpark: How to find disease genes in population-size cohorts?

Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Machine learning…
on 1.7 Trillion data points
https://www.projectmine.com/about/

VariantSpark - Parallelize Random Forest for scalability
• Spark ML’s RF was designed for ‘Big’ low-dimensional data.
• The full genome-wide profile does NOT fit into an executor’s memory
“Cursed” BigData: e.g. Genomics
Moderate number of samples with many features
Feature set too large to be handled by a single executor

Firas Abuzaid (Spark Summit 2016), Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark
Flip the matrix: partition by column
VariantSpark - Parallelize RF to scale with features
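To make "flip the matrix: partition by column" concrete, here is a small pure-Python sketch of finding the best decision-tree split when each worker holds whole feature columns rather than whole rows. It is an illustration of the idea only, not VariantSpark's or Yggdrasil's actual implementation: the workers are simulated with list slices, the genotype data is invented, and Gini impurity is an assumed split criterion.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(column, labels, threshold):
    """Weighted Gini impurity after splitting one feature column at threshold."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split_in_partition(partition, labels):
    """Best (impurity, feature_index, threshold) among the columns one worker holds."""
    candidates = []
    for feature_index, column in partition:
        # Every distinct value except the largest is a usable threshold.
        for threshold in sorted(set(column))[:-1]:
            impurity = split_impurity(column, labels, threshold)
            candidates.append((impurity, feature_index, threshold))
    return min(candidates) if candidates else None

# Rows are samples, columns are variants (genotypes coded 0/1/2). Made-up data.
rows = [
    [0, 2, 1, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 2],
    [2, 0, 0, 1],
]
labels = [1, 1, 0, 0]

# Flip the matrix: one entry per feature column, each holding all samples.
columns = list(enumerate(zip(*rows)))

# Partition by column: each (simulated) worker gets a subset of whole columns.
n_workers = 2
partitions = [columns[w::n_workers] for w in range(n_workers)]

# Each worker finds its local best split; the driver keeps the global best.
local_best = [best_split_in_partition(p, labels) for p in partitions]
best = min(b for b in local_best if b is not None)
print(best)
```

Because every worker sees all samples for its columns, it can score its features without exchanging raw data; only one small tuple per worker travels back to the driver, and only the labels need to be shared with every worker.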

Wide RF scalable with features and samples

New -- Python API for VariantSpark

# imports (assumes the varspark package is installed and a SparkContext `sc`
# already exists, as it does in a PySpark notebook)
from pyspark.sql import SparkSession
from varspark import VariantsContext

# set up context and input parameters
spark = SparkSession(sc)
vc = VariantsContext(spark)
label = vc.load_label('dius/data/chr22-labels.csv', 'col_name')
features = vc.import_vcf('dius/data/chr22_1000.vcf')

# instantiate analysis (parameters are type-checked)
imp_analysis = features.importance_analysis(label)

# get significant factors as both a tuple list and a dataframe
imp_vars = imp_analysis.important_variables(20)
most_imp_var = imp_vars[0][0]
imp_df = imp_analysis.variable_importance()
oob_error = imp_analysis.oob_error()

# convert to work with common Python tools
pandas_imp_df = imp_df.toPandas()
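The `oob_error()` call above reports a random forest's out-of-bag (OOB) error. As a rough sketch of the underlying idea, and not of VariantSpark's internals, each tree trains on a bootstrap sample drawn with replacement, and the samples that were never drawn are "out of bag" for that tree and can serve as its test set. The function name and sample count below are illustrative.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so reruns of the cell give the same split
in_bag, oob = bootstrap_split(10, rng)
print("in bag:", sorted(in_bag))
print("out of bag:", oob)
```

The OOB error is then the misclassification rate when each sample is predicted only by the trees for which it was out of bag, which gives a validation estimate without a separate held-out set.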

Demo VariantSpark
Jupyter for Genomics Research

Cloud-based Jupyter
PaaS
• AWS SageMaker
• Azure Notebooks
• Others…

Example - GT-Scan2
Jupyter for Genomics Research

Tools for Jupyter
• Binder for GitHub
• Point to your GitHub repo containing:
- Jupyter notebooks
- requirements.txt
• It builds a Docker image
• You can run your notebooks
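"Point to your GitHub repo" amounts to handing Binder a URL that names the repository. The helper below is a small sketch that assumes the public mybinder.org service and its `/v2/gh/<owner>/<repo>/<ref>` launch-link layout. The function name and the repository shown are hypothetical placeholders.

```python
from urllib.parse import quote

def binder_url(owner, repo, ref="HEAD"):
    """Build a mybinder.org launch URL for a public GitHub repository.

    Assumes the /v2/gh/<owner>/<repo>/<ref> layout of mybinder.org links.
    """
    # Encode each part so branch names containing "/" stay a single path segment.
    parts = [quote(part, safe="") for part in (owner, repo, ref)]
    return "https://mybinder.org/v2/gh/" + "/".join(parts)

# Hypothetical repository, used only to show the shape of the link.
print(binder_url("example-user", "example-notebooks"))
```

Opening such a link asks Binder to build an image from the repository's requirements.txt and then start a notebook server on it.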

Future of Jupyter for Research
Academic institutions and research labs
• UC Berkeley, Davis, San Diego
• Cal Poly San Luis Obispo
• Clemson University
• CU Boulder
• U of Illinois, Minnesota, Missouri, Rochester, Texas
• MIT
• Michigan State U
• Texas A&M
