IIBMP2020 Poster
Generating annotation texts of HLA sequences with
antigen classes by a T5 (Text-to-Text Transfer
Transformer) model using International Nucleotide
Sequence Database
Eli Kaminuma1,2,3, Takatomo Fujisawa2, Osamu Ogasawara2, Masanori Arita2,3,
Yasukazu Nakamura2,4
(1. Tokyo Medical and Dental University 2. National Institute of Genetics, 3. RIKEN CSRS,
4. Kazusa DNA Research Institute)
■ International Nucleotide Sequence Database Collaboration (INSDC)
DNA Data Bank of Japan (DDBJ) collects
nucleotide sequences as a member of INSDC.
A problem of high labor costs for manual sequence annotation
in the data submission stage to INSDC
■ Problem: the sequence annotations that INSDC requires
incur high manual labor costs
DDBJ ANNOTATION HELP
DNASmartTagger :
A proposed machine learning tool for DNA sequence annotations
Accacactggtactgagacacggaccaga
ctcctacgggaggcagcagtgaggaatatt
ggacaatggagggaactctgatccagcca
tgccgcgtgcaggaagactgccctatgggt
tgtaaactgcttttatacaagaagaataag
agatacgtgtatcttgatgacggtattgtaa
gaataagcaccggctaactccgtgccagc
agccgcggtaatacggagggtgcaagcgt
tatccggaatcattgggtttaaagggtccgt
aggcggattaataagtcagtggtgaaagtc
tgcagcttaactgtagaattgccattgatac
tgttagtcttgaattattatgaagtagttag
aatatgtagtgtagcggtgaaatgcataga
tattaca
Input: DNA sequence (e.g. INSDC flat file format)
Output: annotation tags
DNASmartTagger data resources:
- BioSample: 452 attribute tags
- INSDC: 132 attribute tags
Machine learning models
Other annotations (132 attribute tags)
Our previous study of DNASmartTagger (2018): Predicting ecological
values of BioSample attribute tags from DNA sequences using deep learning
■ Retrieving INSDC Training Data
- Extracting the attribute tag "/altitude":
  5,431 sequences with annotation for PLN division,
  keyword "Fungi" (retrieved from DDBJ ARSA)
- Categorizing attribute values (/altitude)

ZONE attribute value    Altitude        Zone code
ALPINE ZONE             1500 m --       Z3
MONTANE ZONE            800--1500 m     Z2
LOWLAND ZONE            0--800 m        Z1

■ Building Deep Learning Models
- Evaluating machine learning models to infer
  attribute values (evaluation metric: accuracy)

Input feature       SVM (CV kfold=10)    CNN (CV kfold=10)
k-mer frequency     0.77                 0.80
5' end fragment     0.72                 0.73

Best combination: deep learning model (CNN) with k-mer frequency as the input parameter
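The two preprocessing ideas on this slide (k-mer frequency features and altitude categories) can be sketched as follows. This is only an illustration, not the DNASmartTagger code: the function names, the choice k=3, and the handling of the exact boundary values 800 m and 1500 m are assumptions, since the poster does not state them.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    seq = sequence.lower()
    kmers = ["".join(p) for p in product("acgt", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. "n") are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def altitude_zone(altitude_m):
    """Map an altitude in metres to a zone code (boundary handling is assumed)."""
    if altitude_m >= 1500:
        return "Z3"  # alpine zone
    if altitude_m >= 800:
        return "Z2"  # montane zone
    return "Z1"      # lowland zone


features = kmer_frequencies("accacactggtactgagacacggaccaga", k=3)
print(len(features))        # one value per possible 3-mer
print(altitude_zone(1200))
```

A fixed k-mer order gives every sequence a feature vector of the same length, which is what a classifier such as an SVM or CNN needs as input.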
■ An example annotation of an INSDC sequence for a human leukocyte antigen (HLA) allele.
LOCUS MG021788 3079 bp DNA linear HUM 04-SEP-2018
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds.
:
FEATURES Location/Qualifiers
source 1..3079
/organism="Homo sapiens"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
gene <1..>3079
/gene="HLA-F"
/allele="HLA-F*01:01:02var"
:
:
BASE COUNT 601 a 866 c 951 g 661 t
ORIGIN
1 gtgtcgccgc agttcccagg ttctaaagtc ccacgcaccc cgcgggactc atatttttcc
61 cagacgcgga ggttggggtc atggcgcccc gaagcctcct cctgctgctc tcaggggccc
121 tggccctgac cgatacttgg gcaggtgagt gcggggtcca gagagaaacg gcctctgtgg
181 ggaggagtga ggggcccgcc cggtgggggc gcaggactca gggagccgcg cccggaggag
241 ggtctggcgg gtctcagccc ctcctcgccc ccaggctccc actccttgag gtatttcagc
301 accgctgtgt cgcggcccgg ccgcggggag ccccgctaca tcgccgtgga gtacgtagac
A proposed sequence-to-text generation model to annotate
INSDC DEFINITION attributes from input sequences
HLA nomenclature (Xie et al, PMID: 21172045)
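As a rough illustration of where the target text comes from, the DEFINITION field can be read out of a flat-file record with a few lines of plain Python. The helper name `extract_definition` is hypothetical, the sample record is abbreviated, and the ACCESSION line is added here only to mark the end of the field.

```python
def extract_definition(flatfile_text):
    """Return the DEFINITION field of one flat-file record as a single string."""
    parts = []
    in_definition = False
    for line in flatfile_text.splitlines():
        if line.startswith("DEFINITION"):
            in_definition = True
            parts.append(line[len("DEFINITION"):].strip())
        elif in_definition and line.startswith(" "):
            parts.append(line.strip())  # indented continuation line
        elif in_definition:
            break  # the next top-level keyword ends the field
    return " ".join(parts)


record = """LOCUS       MG021788  3079 bp  DNA  linear  HUM 04-SEP-2018
DEFINITION  Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var
            allele, complete cds.
ACCESSION   MG021788
"""
definition = extract_definition(record)
print(definition)
```

For real submissions an established parser would be a safer choice than this hand-written loop.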
T5 (Text-To-Text Transfer Transformer) is one of the
available NLP deep learning models
Reference:
https://github.com/google-research/text-to-text-transfer-transformer
■ T5 model (Raffel et al.; arXiv:1910.10683): a text-to-text deep learning model
from Google AI for handling a wide variety of NLP tasks.
*SuperGLUE Leaderboard (2020/8/27)
■ T5 model structure: an
encoder-decoder transformer model.
■ C4 TensorFlow dataset
745 GB (cf. Wikipedia 16 GB)
■ T5 characteristic: large model sizes
T5-Small (60 million params)
T5-Base (220 million params)
T5-Large (770 million params)
T5-3B (3 billion params)
T5-11B (11 billion params)
■ Preparing reference texts of HLA sequences from the INSDC database
- 1,100 sequence annotations (train 1,000 + test 100).
- Selected the top 6 gene names (HLA-B, -A, -C, -DQB1, -G, -DQA1) in INSDC HLA data.
- Deleted HLA allelic descriptions from DEFINITION texts.
- Used amino acid sequences, not nucleotide sequences, as model inputs.
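The gene selection and allele removal steps might look like the sketch below. The regular expressions are assumptions based on the example DEFINITION line in this poster (an allele written as "HLA-F*01:01:02var allele"); the actual cleaning rules are not given in the poster.

```python
import re

# Top 6 gene names listed above; longer names come first so DQB1 is not read as B.
GENE_PATTERN = re.compile(r"HLA-(?:DQB1|DQA1|A|B|C|G)\b")
# Assumed shape of an allelic description inside a DEFINITION text.
ALLELE_PATTERN = re.compile(r",\s*HLA-\w+\*[\w:]+\s+allele")


def gene_name(definition):
    """Return the selected gene name found in the text, or None."""
    match = GENE_PATTERN.search(definition)
    return match.group(0) if match else None


def strip_allele(definition):
    """Remove the allelic description from a DEFINITION text."""
    return ALLELE_PATTERN.sub("", definition)


text_f = "Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds."
text_a = "Homo sapiens MHC class I protein HLA-A gene, complete cds"
print(strip_allele(text_f))
print(gene_name(text_a), gene_name(text_f))
```

Records whose `gene_name` is None (HLA-F in this example) would be dropped before the train/test split.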
■ Fine-tuning methods
- Fine-tuned a pre-trained T5-Small (60 million params) model,
provided by Hugging Face, to generate annotation
text from amino acid (AA) sequences.
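A minimal sketch of such a fine-tuning loop with Hugging Face Transformers is given below. It is not the authors' training script: the task prefix "annotate: ", the space-separated residues, the batch size of 1, and all hyperparameters are assumptions chosen to keep the example short.

```python
def format_example(aa_sequence, definition, prefix="annotate: "):
    """Build one (source, target) text pair; the task prefix is a hypothetical choice."""
    # Separating residues with spaces is an assumption about input formatting.
    return prefix + " ".join(aa_sequence.upper()), definition


def fine_tune(pairs, epochs=1, lr=1e-4, max_source_len=512, max_target_len=64):
    """Fine-tune t5-small on (source, target) text pairs, one pair per step."""
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for source, target in pairs:
            enc = tokenizer(source, return_tensors="pt",
                            truncation=True, max_length=max_source_len)
            labels = tokenizer(target, return_tensors="pt",
                               truncation=True, max_length=max_target_len).input_ids
            # Passing labels makes the model return the cross-entropy loss.
            loss = model(input_ids=enc.input_ids,
                         attention_mask=enc.attention_mask,
                         labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer


pair = format_example("mavmaprtlv", "Homo sapiens MHC class I protein HLA-A gene, complete cds")
print(pair[0])
```

Calling `fine_tune([pair])` downloads the pre-trained weights; generation afterwards would go through `model.generate` on a tokenized source text.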
■ Hardware and software
- NVIDIA Tesla K80 GPU (12 GB) assigned on Google Colaboratory
- Python with the PyTorch 1.5.1 deep learning library.
■ Evaluation metrics
① BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002)
② Accuracy for classifying key labels.
Experimental conditions to build T5 transfer learning models
Hugging Face Transformers: https://github.com/huggingface/transformers
BLEU notation:
Pn : n-gram precisions up to length N
wn : positive weights
c : length of the candidate translation
r : effective reference corpus length
https://www.aclweb.org/anthology/P02-1040.pdf
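The symbols listed above belong to the BLEU definition in the cited paper; the formula itself did not survive in this transcript. For reference, it is restated here in LaTeX, with p_n written for Pn and w_n for wn:

```latex
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r \\
  e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

The brevity penalty BP lowers the score of candidates that are shorter than the reference, which matters for truncated outputs.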
Result(1):
Output texts generated by the proposed sequence-to-text T5 model
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var
allele, complete cds.
Basic Format: [organism name] [gene name] gene, [allelic information], complete/partial cds.
Input sequence -> T5 -> output annotation texts
(Input sequences are shown as the first 10 AA codes and the last 10 AA codes)

Example 1
Input sequence: DFVFQFKGMC ... VAFRGILQRR
MT output: Homo sapiens clone DQB1_111313_00
Reference: Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, exon 2 and partial cds

Example 2
Input sequence: MAVMAPRTLV ... SDMSLTACKV
MT output: MHC class I antigen HLA-A gene, complete cds1 gene
Reference: Homo sapiens MHC class I protein HLA-A gene, complete cds
Result(2): Evaluation of the proposed model using generated texts
and references
■ BLEU: a popular metric for text generation
BLEU score = 0.28 (100 test sequences)
https://cloud.google.com/translate/automl/docs/evaluate
■ Accuracy for classifying key labels
(gene names and complete/partial cds types)

Key label           Classes          Accuracy    Accuracy (only test data including labels)
gene name           6 classes ※1     0.42        0.95
cds completeness    2 classes ※2     0.35        0.83

※1: HLA-B, -A, -C, -DQB1, -G, -DQA1
※2: complete, partial (exon etc.)
We will collect more suitable reference datasets
and investigate training conditions of T5 models.
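One possible way to compute such key-label accuracies is sketched below, using the two output/reference pairs shown in Result(1). The label extraction patterns are assumptions, and reading the value in parentheses as "accuracy over outputs that contain a label at all" is an interpretation of the table, not a statement from the authors.

```python
import re

GENE_PATTERN = re.compile(r"HLA-(?:DQB1|DQA1|A|B|C|G)\b")
CDS_PATTERN = re.compile(r"\b(complete|partial) cds")


def gene_label(text):
    match = GENE_PATTERN.search(text)
    return match.group(0) if match else None


def cds_label(text):
    match = CDS_PATTERN.search(text)
    return match.group(1) if match else None


def label_accuracy(outputs, references, extract):
    """Return (accuracy over all pairs, accuracy over pairs whose output has a label)."""
    pairs = [(extract(o), extract(r)) for o, r in zip(outputs, references)]
    correct = sum(1 for o, r in pairs if o is not None and o == r)
    labelled = sum(1 for o, _ in pairs if o is not None)
    overall = correct / len(pairs) if pairs else 0.0
    labelled_only = correct / labelled if labelled else 0.0
    return overall, labelled_only


outputs = [
    "Homo sapiens clone DQB1_111313_00",
    "MHC class I antigen HLA-A gene, complete cds1 gene",
]
references = [
    "Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, exon 2 and partial cds",
    "Homo sapiens MHC class I protein HLA-A gene, complete cds",
]
gene_acc = label_accuracy(outputs, references, gene_label)
cds_acc = label_accuracy(outputs, references, cds_label)
print(gene_acc, cds_acc)
```

In this toy case the first output is cut off before any label appears, so it counts as an error in the overall accuracy but is excluded from the labelled-only figure.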
Acknowledgements
■ We thank the following members for their support.
・ TMDU : Hiroshi Tanaka, Kazuki Hashimoto
・ DDBJ : Jun Mashima, Yuichi Kodama, Kosuge Takehide
・ DBCLS : Yasunori Yamamoto
・ AIST AIRC : Jun Sese, Motoko Tsuji, Yukiko Ochi
■ This work is partially supported by the following grants.
・ NIG Research Collaboration Grants 3A2019 / 55A2020
・ JST CREST Grant Number JPMJCR1501
Future work
Issues for future work
- Clean reference datasets.
- Text generation for HLA allelic codes (large reference datasets).
- Multi-modal integration (k-mer frequency vs sequence fragments).
(Diagram labels: NLP + Computer Vision; NLP)

[2020-09-01] IIBMP2020 Generating annotation texts of HLA sequences with antigen classes by a T5 model using INSDC
