[2020-09-01] IIBMP2020 Generating annotation texts of HLA sequences with antigen classes by a T5 model using INSDC

IIBMP2020 Poster
Generating annotation texts of HLA sequences with antigen classes by a T5 (Text-to-Text Transfer Transformer) model using International Nucleotide Sequence Database
Eli Kaminuma (1,2,3), Takatomo Fujisawa (2), Osamu Ogasawara (2), Masanori Arita (2,3), Yasukazu Nakamura (2,4)
(1. Tokyo Medical and Dental University, 2. National Institute of Genetics, 3. RIKEN CSRS, 4. Kazusa DNA Research Institute)

■ International Nucleotide Sequence Database (INSDC)
DNA Data Bank of Japan (DDBJ) collects nucleotide sequences as a member of INSDC.
Problem: high labor costs for manual sequence annotation at the data submission stage to INSDC
■ Problem: the sequence annotations required by INSDC appear to incur high labor costs
DDBJ ANNOTATION HELP

DNASmartTagger :
A proposed machine learning tool for DNA sequence annotations
Accacactggtactgagacacggaccaga
ctcctacgggaggcagcagtgaggaatatt
ggacaatggagggaactctgatccagcca
tgccgcgtgcaggaagactgccctatgggt
tgtaaactgcttttatacaagaagaataag
agatacgtgtatcttgatgacggtattgtaa
gaataagcaccggctaactccgtgccagc
agccgcggtaatacggagggtgcaagcgt
tatccggaatcattgggtttaaagggtccgt
aggcggattaataagtcagtggtgaaagtc
tgcagcttaactgtagaattgccattgatac
tgttagtcttgaattattatgaagtagttag
aatatgtagtgtagcggtgaaatgcataga
tattaca
Input: DNA sequence (e.g. INSDC flat file format)
Output: annotation tags
DNASmartTagger data resources: BioSample (452 attribute tags), INSDC (132 attribute tags)
Machine learning models
Other annotations (132 attribute tags)

Our previous study of DNASmartTagger (2018): predicting ecological values of BioSample attribute tags from DNA sequences using deep learning
■ Retrieving INSDC Training Data
- Extracting the attribute tag "/altitude": 5,431 annotated sequences from the PLN division with the keyword "Fungi" (retrieved from DDBJ ARSA)
- Categorizing attribute values (/altitude)

Zone            Attribute value (altitude)   Zone code
ALPINE ZONE     1500m --                     Z3
MONTANE ZONE    800m -- 1500m                Z2
LOWLAND ZONE    0 -- 800m                    Z1

■ Building Deep Learning Models
- Evaluating machine learning models to infer attribute values (evaluation metric: accuracy)

Input             SVM (CV kfold=10)   CNN (CV kfold=10)
k-mer freq        0.77                0.80
5' end fragment   0.72                0.73

- Deep learning model (CNN) + k-mer frequency as the input parameter
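
As a rough illustration of the two ingredients named on this slide, the k-mer frequency input and the /altitude zone categories, the following Python sketch may help. It is not the DNASmartTagger code: the function names, the choice of k=3, and the handling of the exact boundary values (800 m, 1500 m) are assumptions, since the poster does not state them.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=3):
    """Relative frequency of every DNA k-mer, in a fixed alphabetical order.

    k=3 is an assumed default; the poster does not give the k that was used.
    """
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("acgt", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with ambiguous bases are ignored
    return [counts[m] / total if total else 0.0 for m in kmers]

def altitude_zone(altitude_m):
    """Map an altitude in metres to zone codes Z1-Z3.

    Assumption: each boundary value belongs to the higher zone.
    """
    if altitude_m < 0:
        raise ValueError("altitudes below 0 m are outside the zone table")
    if altitude_m < 800:
        return "Z1"  # LOWLAND ZONE
    if altitude_m < 1500:
        return "Z2"  # MONTANE ZONE
    return "Z3"      # ALPINE ZONE

features = kmer_frequencies("accacactggtactgagacacggaccaga")
print(len(features), altitude_zone(1200))
```

A fixed k-mer ordering gives every sequence a feature vector of the same length (4^k), which is what makes it usable as input for both an SVM and a CNN.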

■ An example INSDC sequence annotation for a human leukocyte antigen (HLA) allele.
LOCUS MG021788 3079 bp DNA linear HUM 04-SEP-2018
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var allele, complete cds.
:
FEATURES Location/Qualifiers
source 1..3079
/organism="Homo sapiens"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
gene <1..>3079
/gene="HLA-F"
/allele="HLA-F*01:01:02var"
:
:
BASE COUNT 601 a 866 c 951 g 661 t
ORIGIN
1 gtgtcgccgc agttcccagg ttctaaagtc ccacgcaccc cgcgggactc atatttttcc
61 cagacgcgga ggttggggtc atggcgcccc gaagcctcct cctgctgctc tcaggggccc
121 tggccctgac cgatacttgg gcaggtgagt gcggggtcca gagagaaacg gcctctgtgg
181 ggaggagtga ggggcccgcc cggtgggggc gcaggactca gggagccgcg cccggaggag
241 ggtctggcgg gtctcagccc ctcctcgccc ccaggctccc actccttgag gtatttcagc
301 accgctgtgt cgcggcccgg ccgcggggag ccccgctaca tcgccgtgga gtacgtagac
A proposed sequence-to-text generation model to annotate
INSDC DEFINITION attributes from input sequences
HLA nomenclature (Xie et al, PMID: 21172045)
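
To use records like the one above as training references, the DEFINITION text and the feature qualifiers first have to be pulled out of the flat file. The sketch below shows one minimal way to do that in Python. It is an illustration only: `parse_flatfile` is a hypothetical helper, the sample record is abbreviated with an illustrative column layout, and the parser assumes that continuation lines are indented and that qualifier values are quoted.

```python
import re

def parse_flatfile(text):
    """Collect the DEFINITION text and feature qualifiers from one record."""
    definition_lines = []
    in_definition = False
    for line in text.splitlines():
        if line.startswith("DEFINITION"):
            in_definition = True
            definition_lines.append(line[len("DEFINITION"):].strip())
        elif in_definition and line.startswith(" "):
            definition_lines.append(line.strip())  # indented continuation line
        else:
            in_definition = False
    qualifiers = {}
    for key, value in re.findall(r'/(\w+)="([^"]*)"', text):
        qualifiers.setdefault(key, value)  # keep the first occurrence of each key
    return " ".join(definition_lines), qualifiers

# Abbreviated record; the spacing is illustrative, not copied from INSDC.
record = (
    "LOCUS       MG021788  3079 bp  DNA  linear  HUM 04-SEP-2018\n"
    "DEFINITION  Homo sapiens MHC class I antigen (HLA-F) gene,\n"
    "            HLA-F*01:01:02var allele, complete cds.\n"
    "FEATURES             Location/Qualifiers\n"
    "     source          1..3079\n"
    '                     /organism="Homo sapiens"\n'
    "     gene            <1..>3079\n"
    '                     /gene="HLA-F"\n'
    '                     /allele="HLA-F*01:01:02var"\n'
)

definition, qualifiers = parse_flatfile(record)
print(definition)
print(qualifiers["gene"], qualifiers["allele"])
```

For production use, an established flat-file parser would be a safer choice than a regular expression.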

The T5 (Text-To-Text Transfer Transformer) model is one of the available NLP deep learning models
Reference:
https://github.com/google-research/text-to-text-transfer-transformer
■ T5 model (Raffel et al.; arXiv:1910.10683) = a text-to-text deep learning model from Google AI for handling a wide variety of NLP tasks.
* SuperGLUE Leaderboard (2020/8/27)
■ T5 model structure = a type of encoder-decoder transformer model.
■ C4 TensorFlow dataset
745GB (cf. Wikipedia 16GB)
■ T5 characteristic = large model size
T5-Small (60 million params)
T5-Base (220 million params)
T5-Large (770 million params)
T5-3B (3 billion params)
T5-11B (11 billion params)

Experimental conditions to build T5 transfer learning models
■ Preparing reference texts of HLA sequences from the INSDC database
- 1,100 (train 1,000 + test 100) sequence annotations.
- Selecting the top 6 gene names (HLA-B, -A, -C, -DQB1, -G, -DQA1) in the INSDC HLA data.
- Deleting the HLA allelic description from the DEFINITION texts.
- Preparing amino acid sequences, not nucleotide sequences, as model inputs.
■ Fine-tuning methods
- Fine-tuning a pre-trained T5-Small (60 million params) model provided by Hugging Face (https://github.com/huggingface/transformers) to generate annotation texts from amino acid (AA) sequences.
■ Hardware and software
- NVIDIA Tesla K80 GPU (12GB) assigned on Google Colaboratory
- Python with the PyTorch 1.5.1 deep learning library.
■ Evaluation metrics
① BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002; https://www.aclweb.org/anthology/P02-1040.pdf)
   BLEU = BP * exp( sum_{n=1..N} w_n log p_n ), where BP = 1 if c > r, otherwise exp(1 - r/c)
   p_n : n-gram precisions up to length N
   w_n : positive weights
   c : length of the candidate translation
   r : effective reference corpus length
② Accuracy for classifying key labels.
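
The BLEU terms listed above (n-gram precisions, weights, candidate length c, reference length r) can be made concrete with a short pure-Python sketch. This is not the evaluation script used for the poster: it assumes whitespace tokenization, one reference per candidate, uniform weights w_n = 1/N with N = 4, and no smoothing, none of which the poster specifies.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate and uniform weights."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # candidate n-gram counts, summed over the corpus
    c = r = 0              # total candidate and reference lengths
    for cand, ref in zip(candidates, references):
        cand_tok, ref_tok = cand.split(), ref.split()
        c += len(cand_tok)
        r += len(ref_tok)
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = ngrams(cand_tok, n), ngrams(ref_tok, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if c == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_p = sum((1.0 / max_n) * math.log(m / t) for m, t in zip(matches, totals))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(log_p)

score = corpus_bleu(["a b c d"], ["a b c d e"])
print(round(score, 4))  # all precisions are 1, so this is exp(1 - 5/4) = 0.7788
```

Because no smoothing is applied, short outputs that share no 4-gram with their reference score 0, so corpus-level aggregation over all test sequences is the more meaningful number.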

Result(1):
Output texts generated by the proposed sequence-to-text T5 model
DEFINITION Homo sapiens MHC class I antigen (HLA-F) gene, HLA-F*01:01:02var
allele, complete cds.
Basic Format: [organism name] [gene name] gene, [allelic information], complete/partial cds.
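
Given the basic format above, removing the allelic-information field from a DEFINITION text could look roughly like the following Python sketch. The regular expression and the name `strip_allele` are assumptions made for illustration. The poster does not describe how the allelic descriptions were actually deleted, and real DEFINITION lines may need additional patterns.

```python
import re

# Assumed pattern: ", <HLA allele name> allele" sitting between the gene
# phrase and the cds phrase of the basic format.
ALLELE_FIELD = re.compile(r",\s*HLA-[A-Za-z0-9]+\*\S+\s+allele(?=,)")

def strip_allele(definition):
    """Remove the allelic-information field from a DEFINITION text."""
    return ALLELE_FIELD.sub("", definition)

example = ("Homo sapiens MHC class I antigen (HLA-F) gene, "
           "HLA-F*01:01:02var allele, complete cds.")
print(strip_allele(example))
# Homo sapiens MHC class I antigen (HLA-F) gene, complete cds.
```

Texts without an allele field pass through unchanged, so the function can be applied to every reference text.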
Input sequence -> T5 -> Output annotation texts
(Input sequences are shown as the first 10 and last 10 AA codes)

Example 1
Input sequence: DFVFQFKGMC ... VAFRGILQRR
MT output: Homo sapiens clone DQB1_111313_00
Reference: Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, exon 2 and partial cds

Example 2
Input sequence: MAVMAPRTLV ... SDMSLTACKV
MT output: MHC class I antigen HLA-A gene, complete cds1 gene
Reference: Homo sapiens MHC class I protein HLA-A gene, complete cds

Result(2): Evaluating the proposed model using generated texts and references
■ BLEU: a popular metric for text generation
BLEU score = 0.28 (100 test sequences)
https://cloud.google.com/translate/automl/docs/evaluate
■ Accuracy for classifying key labels (gene names and complete/partial cds types)

Key label          Classes         Accuracy (only test data including labels)
gene name          6 classes ※1    0.42 (0.95)
cds completeness   2 classes ※2    0.35 (0.83)

※1: HLA-B, -A, -C, -DQB1, -G, -DQA1
※2: complete, partial (exon etc.)

We will collect more suitable reference datasets and investigate training conditions of T5 models.
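
The key-label accuracies can be illustrated with a small Python sketch that extracts a gene name or cds type from each generated text and compares it with the reference. This is one possible reading of the table, not the authors' script: it assumes the value in parentheses is computed only over test items whose generated text contains a label, and the regular expressions and function names are hypothetical. The two output/reference pairs are the ones shown under Result(1).

```python
import re

# Hypothetical label patterns for the two key-label types.
GENE = re.compile(r"HLA-(DQB1|DQA1|A|B|C|G)\b")
CDS = re.compile(r"\b(complete|partial)\s+cds")

def extract(pattern, text):
    """Return the first label found in the text, or None if there is none."""
    m = pattern.search(text)
    return m.group(1) if m else None

def label_accuracy(outputs, references, pattern):
    """Return (accuracy over all pairs, accuracy over pairs whose output has a label)."""
    correct = labelled = 0
    for out, ref in zip(outputs, references):
        out_label, ref_label = extract(pattern, out), extract(pattern, ref)
        if out_label is not None:
            labelled += 1
            if out_label == ref_label:
                correct += 1
    overall = correct / len(outputs) if outputs else 0.0
    on_labelled = correct / labelled if labelled else 0.0
    return overall, on_labelled

outputs = [
    "Homo sapiens clone DQB1_111313_00",
    "MHC class I antigen HLA-A gene, complete cds1 gene",
]
references = [
    "Homo sapiens clone DQB1_110918_01032 MHC class II antigen HLA-DQB1 gene, "
    "exon 2 and partial cds",
    "Homo sapiens MHC class I protein HLA-A gene, complete cds",
]

print(label_accuracy(outputs, references, GENE))  # (0.5, 1.0)
print(label_accuracy(outputs, references, CDS))   # (0.5, 1.0)
```

The first output contains neither a gene name nor a cds type, so it counts as an error in the overall accuracy but is excluded from the second figure. Under this reading, that mechanism would account for the large gap between the two columns of the table.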

Acknowledgements
■ We are thankful to the following members for their support.
・ TMDU : Hiroshi Tanaka, Kazuki Hashimoto
・ DDBJ : Jun Mashima, Yuichi Kodama, Kosuge Takehide
・ DBCLS : Yasunori Yamamoto
・ AIST AIRC : Jun Sese, Motoko Tsuji, Yukiko Ochi
■ This work is partially supported by the following grants.
・ NIG Research Collaboration Grants 3A2019 / 55A2020
・ JST CREST Grant Number JPMJCR1501
Future work
- Clean reference datasets.
- Text generation for HLA allelic codes (large reference datasets).
- Multi-modal integration (k-mer frequency vs sequence fragments).
NLP+Computer Vision NLP
