From algorithms to advancing care:
genomics data drives progress
Jack DiGiovanna, PhD
Festival of Genomics Boston
24 June 2025
Signal drives insights
Purposefully high-level, as you work across diverse therapy areas, drug modalities, data types, & stages of drug discovery
Image credits: ideogram.ai
Create VERY LARGE datasets
Create HIGH-SIGNAL datasets
There’s an art to designing the signal
Pathways, mechanisms?
Receptors, targets?
Data type(s)?
Analytics?
Time?
Lessons learned in data management
Confidential – for internal use only 6
From experimentalist to analyst to partner
Data is often unused because it’s not findable
Worked closely with a Top 20 pharma across multiple therapy areas, locations,
and analysis teams. Trying to incorporate population-scale datasets and
in-house data to accelerate target identification.
“Isn’t it expensive that one group collects the same data when a different group has already done this?”
“Yes, but it’s less expensive than making the data findable.”
Findable, Accessible, Interoperable, & Reusable data would be great!
Significant efforts across academia and life sciences since the
publication of the FAIR guidelines (Wilkinson et al, Scientific Data, 2016)
The concepts seem simple, but they create ~15 criteria to meet.
“FAIR is a goal, not an end state”
Multiple pharma companies have incorporated FAIR principles into the design of
Data as a Product. This may be local or org-wide.
It’s challenging to make data FAIR… for myself
Reuse the data & analytics? (Feb 2024)
• Search usual external hard drives
• EPFL-linked Evernote and NAS
• Search personal Google Drive
• Do my old laptops still boot up?
• Random walk through other hard drives
• Wait, didn't I use Dropbox as a post-doc?
• FOUND THE DATA & CODE!!!
There is a spectrum of FAIRness and ROI is crucial
There are multiple levels of searchable information, and endless debates in defining "data" vs "metadata"
• Dataset level: … dataset with AML patients, and both transcriptomics and EDoH data
• Sample level: ... samples from male patients with prior aggressive lymphoma treatment within 15 years prior to AML diagnosis
• Variant or feature level: ... subjects where there is a missense mutation in (ALK or KRAS), TMB > N, environmental feature, and non-response to first-line therapy
• Insight level: ... subjects with variants also common in NSCLC, eligible for clinical trials within 100 miles, and likely to respond to immunotherapy
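To make the difference between the first levels concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `Dataset` and `Sample` classes, their field names, and the toy catalog are all hypothetical and do not describe any real system from the talk. It shows that a dataset-level search needs only catalog metadata, while sample- and variant-level searches need per-sample annotations, which is where the curation cost (and the ROI question) grows. The insight level is not modeled here.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# All names and fields below are hypothetical, for illustration only.

@dataclass
class Sample:
    sample_id: str
    sex: str
    genes_with_missense: set[str] = field(default_factory=set)
    tmb: float = 0.0

@dataclass
class Dataset:
    name: str
    indication: str
    modalities: set[str]
    samples: list[Sample] = field(default_factory=list)

def find_datasets(catalog, indication, modalities):
    """Dataset-level search: uses only catalog-level metadata."""
    return [d for d in catalog
            if d.indication == indication and modalities <= d.modalities]

def find_samples(datasets, sex=None, genes=None, min_tmb=None):
    """Sample- and variant-level search: needs per-sample annotations."""
    hits = []
    for d in datasets:
        for s in d.samples:
            if sex is not None and s.sex != sex:
                continue
            if genes is not None and not (genes & s.genes_with_missense):
                continue
            if min_tmb is not None and s.tmb <= min_tmb:
                continue
            hits.append((d.name, s.sample_id))
    return hits

# Toy catalog with made-up values.
catalog = [
    Dataset("cohort_a", "AML", {"transcriptomics", "EDoH"}, [
        Sample("a1", "male", {"KRAS"}, 12.0),
        Sample("a2", "female", {"ALK"}, 3.0),
    ]),
    Dataset("cohort_b", "AML", {"transcriptomics"}, [
        Sample("b1", "male", {"KRAS"}, 15.0),
    ]),
]

aml = find_datasets(catalog, "AML", {"transcriptomics", "EDoH"})
hits = find_samples(aml, sex="male", genes={"ALK", "KRAS"}, min_tmb=10.0)
print([d.name for d in aml], hits)
```

The dataset-level query can be answered from a lightweight catalog, whereas the second query only works if every sample was annotated consistently, which is one way to read the point about a spectrum of FAIRness.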
Incentives are crucial to create FAIR data
NIH Data Management & Sharing Policy (2023) requires sharing, but with limited:
• Funding or Time
• Attribution (though the s-index challenge is a step in the right direction)
• Consequences
• Sustainability
Mixture of human and technology problems
Top-down mandates may be necessary but are not sufficient
Similarly, many large pharma have taken a top-down approach:
1. Top-down mandates across pharma that data must be shared
2. Data generators want to maintain control of their shared data and
describe their data for primary use
3. Lack of a centralized system allowing decentralized governance
4. Metadata may not be sufficient for secondary use
Mixture of human (incentives) and technology problems
“Permissions are the biggest problem today,
structure & system consistently a blocker”
Strategy & Portfolio Lead: Analytics & Insight
“Governance is always an issue and the biggest
challenge that they face”
VP Data Science
~20 FTEs in Data Governance, FAIR data and
analytics are important
Exec Dir: Data Science
Data integration is a consistent pain point
Detailed interviews with Top 20 (n=3) and Mid-sized
(n=3) pharma in 2025 to understand pain points with
current solutions
Data integration avg score = 9.3
Data management avg score = 8
Two of the top five pains reported for both groups
Image credit: https://labs.openai.com/s/VtVExoasoSJuPuAl6cTTsND8
Pain adapted from: Belbury - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=92568509
Build vs buy considerations lean towards buy
Engage with large data vendors that likely have relevant data, e.g., Tempus,
Truveta, Flatiron, ConcertAI, IQVIA, Komodo, SciBrite
Data arrives assembled and often with more permissive Data Use
Agreements
Apply for access to nation-scale research datasets, e.g., UK Biobank,
FinnGen, All of Us
"The cost for my org to assemble data internally
is incalculable... I think it's infinite"
Pharma SVP at BioTechX EU 2023
(on why his org aggressively buys data)
“We want to find target engagement biomarkers
and use them to guide our phase 1-2 trials. Shorten
the research time to the first human dose which
is ~1500 days and we target 500 days. It’s a
big ambition … so we are heavily investing in
cohort datasets for relevant indications.”
Head of Research Solutions
(on why she needs data to achieve Corp goals)
Solving data governance, mgmt, & integration leads to great things
• Calibration of a mixture-of-experts TMB estimator to rescue an ICI candidate after a failed stage with oncology datasets
• Optimization of neoantigen workflow and epitope prioritization leveraging Parker Institute datasets
• More effective and efficient target identification using population-scale datasets
• Indication expansion, market access using claims and EMR data
• Leverage >10x more datapoints to guide diagnosis and treatment of aggressive pediatric cancers
“With only ~200 new cases of high-risk paediatric cancer per year, it is imperative to aggregate Australian data with global data to develop strategies to effectively treat high-risk childhood cancer”
Mark Cowley, ~2019
Image [2019] credit: https://www.zerochildhoodcancer.org.au/about/research---clinical-partners
ZERO has cutting edge somatic multi-omic workflows
Those portable workflows can also be brought to CBTN data
Bringing compute to data increased samples used for treatment by 10x
“The workflows are running considerably faster & cheaper. We have more data. The seamless connection with NetApp is an absolute game-changer.”
- Mark Cowley, Genomics and Bioinformatics Lead, ZERO
[Figure: project time]
Key lessons learned from cases where data drove progress
FAIR is a good direction but is more complex than it seems.
Understand who your secondary users are – this can dramatically
affect metadata requirements and models.
Prospective changes are much less effort than retrospective ones,
so spending time analyzing the ROI of historical data is important.
Invest early in solid Data Governance.
Treat your data (and Analytics) as a Product.
Leveraging the international standards community can be helpful for
scalability and sustainability.
Our mission is to
accelerate drug discovery
Introducing the Global Data Network (GDN)
• The GDN empowers pharma and biotech to tackle rising costs
and long timelines by leveraging fit-for-purpose data,
reducing traditional data acquisition from years to weeks.
• Previously impossible due to the lack of federated
technology to bridge the gap between providers and allow the
systemic pooling or stacking of data.
• This globally diverse provider network includes private and
public data companies/entities such as healthcare
organizations, academic research institutions, biobanks,
biospecimen providers, commercial data vendors, health
nonprofits, and national research initiatives.
• Approximately 40% of data is derived from non-US patients.
The world's largest federated data network spread across over 175M patients, 50 data providers
and hundreds of locations globally
GDN provides unique, diverse data that is curated and assembled
Data Sources + Geographies | Data Modalities | Therapeutic Areas
• Access data from 175M patients globally.
• 50+ global data providers including healthcare orgs, biobanks, research institutions.
• 40% non-US sources including Europe, Asia, Latin America, and more.
• Access to 48 PB of public datasets through Seven Bridges platforms.
• Typical GDN datasets are composed of longitudinal EMR data paired with linked additional modalities (e.g., whole-exome seq, single-cell RNA, & imaging).
• Regulatory-grade, meticulously validated data adheres to GDPR, HIPAA, and global standards. Delivered in a secure, compliant manner.
• All major disease areas covered, with a focus on Oncology, Immunology & Inflammation, Neurology, & Cardiometabolic data.
• Work with a team of experts to define data criteria quickly and efficiently to expedite delivery.
GDN provides global data for many use cases with lifetime consent
Prioritization of modalities and specific consent
mechanisms, including:
✓ Lifetime consent: Follow patients for their entire journey
✓ Direct to patient access: Recontact patients for additional data
✓ Prospective data collection: If needed
Regions: North America, South America, Asia-Pacific, Europe-Middle East-Africa
Novel Target Identification and Disease Subtyping
Patient Cohort Identification and Trial Feasibility
Synthetic Control Arms and External Comparators
Biomarker Discovery and Validation
Real-World Evidence (RWE) for Regulatory or Label Expansion
Post-Market Safety and Effectiveness Monitoring
Digital twins
AI/ML Model Training and Validation
…
Our approach returns the entire dataset if available instead of
piecewise subsets
The challenges of typical approaches lead to 70% of healthcare searches going unmet
Bring us your last unmet data request
• Define Needs: Share your use case, data modalities, and cohort requirements. We iterate with you quickly and efficiently to refine criteria for optimal outcomes.
• Data Sourcing: We tap into 175M+ diverse (40% non-US) patient records to deliver existing datasets or collect bespoke cohorts.
• Harmonization: GDN harmonizes data subsets to meet your needs, connecting securely while respecting consent and DUAs.
• Deliver Insights: Leverage the Velsera Seven Bridges Platform for secure, collaborative analysis to derive actionable insights for your R&D pipeline.
80% of requests fulfilled within a few weeks of initial inquiry into GDN
Feasibility analysis is not charged
Reduce your data acquisition time from years to weeks!
Find out more about Velsera & Global Data Network at booth #44
Delighted to talk data, diagnostics, or discovery
Somehow there’s also elephants…
hello@velsera.com
Join our reception at Citrus & Salt tonight from 19:30-21:00
