International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
DOI:10.5121/ijcsa.2015.5402
K-Mer Index Of DNA Sequence Based On Hash Algorithm
Jinlin Liu1, Qiang Chen2 and Chen Zhang3
1 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China.
2 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China.
3 School of Management, Shanghai University of Engineering Science, Shanghai 201620, China.
ABSTRACT
K-mer frequency statistics of biological sequences are an important problem in biological
information processing. This paper addresses the problem of building a k-mer index over large-scale
DNA sequence reads within limited memory space and time. A hash algorithm is used to establish the
index: the index model is set up for base matching, so that the statistical information for k-mers
of a given length can be obtained quickly without searching all of the indexed sequences. The
program also uses a hash table to establish the index and build the search model, and resolves
address conflicts with the zipper method (separate chaining). Time complexity analysis and
experimental results show that, compared with traditional indexing methods, the algorithm improves
performance noticeably and is well suited to cases where the k-mer length varies over a wide range.
KEYWORDS
K-mer index; hash algorithm; DNA detecting; index model;
1.INTRODUCTION
With the rapid development of DNA sequencing technology in recent years, massive amounts of
biological sequence data have been generated, and these data need to be analyzed and processed by
effective computational means. K-mer analysis is one of the many problems in biological sequence
analysis and processing: a k-mer of a biological sequence is a short subsequence of k consecutive
bases of a DNA sequence. When the value of k is appropriate, the k-mer frequency distribution of a
sequence contains all the information in the genome and is equivalent to the sequence itself. We can
therefore learn about the base distribution characteristics, functions, structures and evolution of
biological sequences by analyzing the k-mer distribution of DNA sequences and the information
carried by different k-mers.
2.QUESTIONS
This paper aims to solve the problem of building a k-mer index of DNA sequences. For a given k, an
index is established over 1 million DNA sequences: the program reads every fragment of length k from
the start to the end of each sequence, then moves on to the next sequence and reads again, until the
positions at which each k-mer appears in the sequences have all been recorded. Because DNA
sequencing produces fragments in very large numbers, the data set has to be handled with limited
memory and disk space, and the space complexity and computational complexity should be optimized as
far as possible. The following problems therefore have to be solved.
Q1.
Establish the index for the given k, then search every sequence. A hash algorithm encodes the bases
of each sequence; the specific k-base fragment given as input is converted into a decimal number and
then matched against the 1 million sequences. Finally, the program outputs the row and column at
which the base fragment occurs.
Q2.
After the index is established, a hash table is built in memory, and during each traversal the
frequency and the positions of every k-mer are stored in the hash table. In this way the 1 million
DNA sequences can be traversed within the limited memory space.
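As an illustration of Q2, the following minimal sketch stores the positions of every k-mer in a hash table and derives the frequency from the number of stored positions. Python's built-in dict (itself a hash table) stands in for the program's own table; the function name build_index, the 1-based (row, column) positions and the toy reads are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict

def build_index(reads, k):
    """Map each k-mer to the list of (row, column) positions where it occurs.

    Rows and columns are 1-based (an assumption of this sketch).
    """
    index = defaultdict(list)
    for row, read in enumerate(reads, start=1):
        # A read of length n contains n - k + 1 windows of length k.
        for col in range(len(read) - k + 1):
            index[read[col:col + k]].append((row, col + 1))
    return index

reads = ["ATCGATCG", "GGATCG"]  # toy data, not from the paper
index = build_index(reads, 4)

# The frequency of a k-mer is the number of positions recorded for it.
frequency = {kmer: len(positions) for kmer, positions in index.items()}
```

Keeping the positions and deriving the frequency from them avoids storing a separate counter, at the cost of one list per distinct k-mer.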
3.PROBLEM ANALYSIS
3.1.Problem abstraction
The data set consists of 1 million gene sequences, each of length 100, so it is equivalent to a
two-dimensional array A whose rows are numbered 1 to 1000000 and whose columns are numbered 1 to
100. The problem is thus abstracted to the analysis of the matrix A[i][j],
i=1,2,...,1000000; j=1,2,...,100.
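Under this matrix view, reading every fragment of length k amounts to sliding a window along each row. The sketch below shows one way to enumerate the windows together with their row and column numbers; the generator name iter_kmers and the short toy rows are hypothetical, and 1-based numbering is chosen only to match the A[i][j] convention above.

```python
from typing import Iterator, List, Tuple

def iter_kmers(rows: List[str], k: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (i, j, fragment) for every length-k window of every row.

    i is the 1-based row number and j the 1-based column of the first base.
    """
    for i, row in enumerate(rows, start=1):
        # A row of length n yields n - k + 1 windows.
        for j in range(len(row) - k + 1):
            yield i, j + 1, row[j:j + k]

rows = ["ATCGA", "TTCG"]  # toy rows; the paper's rows have length 100
windows = list(iter_kmers(rows, 4))
```

For a row of length 100 the inner loop runs 100-k+1 times, so the whole data set produces 1000000 × (100-k+1) windows.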
3.2.Method solution
The sequence contains four kinds of bases: A, C, G and T. Using the hash algorithm, the four bases
are converted into the four base-4 digits A=0, C=1, G=2, T=3, and the resulting base-4 number is
converted into a decimal number for the matching query. The hash value of one character is
Hash(value) = value × 4^(k-m-1), where value is the digit assigned to the character, k is the length
of the k-mer, and m is the position of the character in the string, ranging over [0, k-1]; the hash
value of the k-mer is the sum of the hash values of its characters. For example, for k=4 the
sequence ATCG is converted into the decimal number
ATCG = 0×(4^3) + 3×(4^2) + 1×(4^1) + 2×(4^0) = 54. Each row of length 100 can therefore be converted
into 100-k+1 decimal numbers. The same principle applies to all 1 million rows, and the resulting
decimal numbers are stored in the two-dimensional array A[i][j]. When a decimal number matches the
query, the program converts it back into its base-4 form of length k, which gives the base fragment
as in the ATCG example, and then prints the row and column numbers at which the fragment occurs.
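The encoding and its inverse can be sketched as follows. The mapping A=0, C=1, G=2, T=3 and the formula come from the text; the function names encode and decode are illustrative assumptions.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
DIGIT_TO_BASE = "ACGT"

def encode(kmer: str) -> int:
    """Sum value * 4^(k-m-1) over the characters at positions m = 0..k-1."""
    k = len(kmer)
    return sum(BASE_TO_DIGIT[base] * 4 ** (k - m - 1) for m, base in enumerate(kmer))

def decode(code: int, k: int) -> str:
    """Convert a decimal code back into its base-4 form of length k."""
    bases = []
    for _ in range(k):
        code, digit = divmod(code, 4)  # peel off the lowest base-4 digit
        bases.append(DIGIT_TO_BASE[digit])
    return "".join(reversed(bases))

code = encode("ATCG")      # 0*64 + 3*16 + 1*4 + 2*1
fragment = decode(code, 4)
```

Because the code is the exact base-4 value of the fragment, two fragments of the same length have the same code only if they are identical, so the decoding step is lossless.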
After the index is established, the division method is used to build the hash table in memory and
to determine each address in the hash table. Every time a k-mer occurs, the label of its sequence
and the corresponding position are stored in the hash table. This improves the efficiency of
searching the 1 million DNA sequences within the limited memory space.
4.MODEL ESTABLISHMENT AND SOLUTION
A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length;
this shorter binary value is called the hash value.
In this paper, following the principle of the hash algorithm, the four bases A, C, G and T are
assigned the digits 0, 1, 2 and 3 respectively, so that a base fragment becomes a base-4 number,
which is then converted into a decimal number. The decimal number of the query fragment is matched
against the 100-k+1 decimal numbers of the first row and then of each following row in turn;
whenever a match is found, the program outputs the corresponding row and column numbers.
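A minimal sketch of this matching step is given below. It compares the decimal code of the query with the code of every window in every row. Updating the code incrementally from one window to the next is an optimization assumed here, not something the paper states; the names find_matches and DIGIT and the toy rows are likewise hypothetical.

```python
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Base-4 value of a fragment, computed digit by digit."""
    code = 0
    for base in kmer:
        code = code * 4 + DIGIT[base]
    return code

def find_matches(rows, query):
    """Return the 1-based (row, column) of every occurrence of query."""
    k = len(query)
    target = encode(query)
    top = 4 ** (k - 1)  # weight of the leading base in a window
    hits = []
    for i, row in enumerate(rows, start=1):
        if len(row) < k:
            continue
        code = encode(row[:k])
        if code == target:
            hits.append((i, 1))
        for j in range(1, len(row) - k + 1):
            # Drop the leading base, shift by one digit, append the next base.
            code = (code - DIGIT[row[j - 1]] * top) * 4 + DIGIT[row[j + k - 1]]
            if code == target:
                hits.append((i, j + 1))
    return hits

hits = find_matches(["ATCGATCG", "GGATCG"], "ATCG")
```

Each window is handled with a constant number of arithmetic operations, so one row of length 100 costs on the order of 100 steps regardless of k.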
Flow chart as shown below:
4.1.Model two: search model based on hash table
The main task in this model is to design the hash function and to build the hash table with the
k-mer as the keyword.
There are many methods of constructing a hash function, such as the digit analysis method, the
direct addressing method and the random number method; the random number method is usually used
when the keyword lengths differ. This paper selects the division method: the hash value obtained for
the nucleotide sequence is divided by 1000 and the remainder is taken as the address in the hash
table. All values with the same remainder go into the same bucket, so the values within one bucket
share the same remainder but have different quotients. This produces address conflicts, which must
be resolved.
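The division method can be illustrated with a few codes. The divisor 1000 is the one given in the text; the example codes and the helper name address are made up for the illustration.

```python
TABLE_SIZE = 1000  # divisor stated in the text

def address(code: int) -> int:
    """Division method: the remainder is the address in the hash table."""
    return code % TABLE_SIZE

codes = [54, 1054, 2054, 216]  # arbitrary example codes

buckets = {}
for code in codes:
    buckets.setdefault(address(code), []).append(code)

# Codes in one bucket share the remainder but differ in the quotient.
quotients = [code // TABLE_SIZE for code in buckets[54]]
```

Here 54, 1054 and 2054 all land at address 54, which is exactly the kind of address conflict that has to be resolved.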
The zipper method (separate chaining) resolves the conflicts: all nodes whose keywords are synonyms,
that is, whose keywords map to the same address, are linked into the same singly linked list. If the
selected hash table length is m, the hash table can be defined as an array of m pointers T[0..m-1].
Every node whose hash address is i is inserted into the singly linked list whose head pointer is
T[i]. The initial value of every component of T should be the null pointer. With the zipper method
the load factor α can be greater than 1, but in general α ≤ 1 is chosen.
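The structure just described can be sketched as an array T of head pointers, with None playing the role of the null pointer. The class names, the choice to insert new nodes at the head of a list, and the idea of storing a list of positions in each node are assumptions of this sketch rather than details given in the paper.

```python
class Node:
    """One entry of a chain: a key, its positions, and a link to the next node."""
    def __init__(self, key, positions, next_node=None):
        self.key = key
        self.positions = positions
        self.next = next_node

class ChainedTable:
    def __init__(self, m):
        self.m = m
        self.T = [None] * m   # every component starts as the null pointer
        self.count = 0        # number of distinct keys stored

    def insert(self, key, position):
        i = key % self.m      # division method gives the address
        node = self.T[i]
        while node is not None:
            if node.key == key:            # key already present: add the position
                node.positions.append(position)
                return
            node = node.next
        # New key: link a new node at the head of the list T[i].
        self.T[i] = Node(key, [position], self.T[i])
        self.count += 1

    def load_factor(self):
        return self.count / self.m

table = ChainedTable(1000)
table.insert(54, (1, 1))
table.insert(1054, (1, 2))   # collides with 54 at address 54
table.insert(54, (2, 3))
```

Since every list can grow without bound, the table keeps working when count exceeds m, which is why the load factor may be greater than 1.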
Hash search: the k-mer is taken as the keyword, and the program first uses the hash function to
calculate its address. The nodes of the linked list at that address are then compared with the
searched base sequence one by one: if the base arrangement of a node is the same as the searched
sequence, all the information stored in the node is output; if it is not, the search continues with
the next node.
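The search procedure can be sketched as a walk along one chain. To keep the example small, a table of length 7 is used instead of 1000, and the chain is built by hand; the table length, the node layout and the stored positions are all hypothetical.

```python
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

class Node:
    def __init__(self, kmer, positions, next_node=None):
        self.kmer = kmer
        self.positions = positions
        self.next = next_node

def encode(kmer):
    code = 0
    for base in kmer:
        code = code * 4 + DIGIT[base]
    return code

def search(table, kmer):
    """Return the positions stored for kmer, or None if it is not in the table."""
    node = table[encode(kmer) % len(table)]   # compute the address first
    while node is not None:
        if node.kmer == kmer:                 # same base arrangement: found
            return node.positions
        node = node.next                      # otherwise continue along the chain
    return None

# Hand-built table of length 7: AACC (code 5) and ATCG (code 54) both map to address 5.
table = [None] * 7
table[5] = Node("AACC", [(2, 7)], Node("ATCG", [(1, 1), (3, 9)]))

found = search(table, "ATCG")
missing = search(table, "TTTT")
```

The comparison inside the loop is what distinguishes synonyms that share an address, so a lookup costs one address computation plus one step per node in the chain.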
4.2.Model three: analysis of the memory space occupied by the hash table
Data definition analysis: the int keyword denotes a 32-bit integer whose range is -2147483648 to
+2147483647 (both ends included). The number of bytes occupied per int type is 4B. The char type
holds a single 8-bit character code; unsigned, its values range from 0 to 255.
The number of bytes occupied per char type is 1B.
Overall data analysis:
Row: values up to 1000000, defined as an int type variable (4 Byte)
Column: values up to 100, defined as a char type variable (1 Byte)
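As a rough check of this analysis, the sketch below assumes that each index entry stores one int row number (4 B) and one char column number (1 B), and that the index reserves one such entry for each of the 4^k possible k-mers. The second assumption is a reading of the figures reported for this model, and the function name is illustrative.

```python
INT_BYTES = 4    # size of the row number
CHAR_BYTES = 1   # size of the column number

def index_memory_gb(k: int) -> float:
    """Estimated index size in GB: (4 + 1) bytes for each of the 4^k k-mers."""
    entry_bytes = INT_BYTES + CHAR_BYTES
    return entry_bytes * 4 ** k / (1024 * 1024 * 1024)

for k in (1, 8, 12, 16):
    print(f"k={k:2d}  {index_memory_gb(k):.8f} GB")
```

The estimate grows by a factor of 4 with every increment of k, so it passes 1 GB at k=14 and reaches 20 GB at k=16.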
In theory, the memory space taken up by the index for a given k is $5 \times 4^{k}$ (B), which can
also be converted into the memory occupancy in GB:
$$\frac{5 \times 4^{k}}{1024 \times 1024 \times 1024}\ \text{(GB)}$$
For different values of k, the memory space corresponding to each index is shown in the table below.
Table4.1 The Memory Space
K     Memory Space (GB)
1     0.00000002
2     0.00000007
3     0.00000030
4     0.00000119
5     0.00000477
6     0.00001907
7     0.00007629
8     0.00030518
9     0.00122070
10    0.00488281
11    0.01953125
12    0.07812500
13    0.31250000
14    1.25000000
15    5.00000000
16    20.00000000
5.RUN RESULTS SHOW
5.1.The interface
Figure5.1 The interface
5.2.Search interface
Figure5.2 the search interface
5.3.File generated results
The K_mer.txt file is shown in Figure 5.3.
Figure5.3 the text file shown
5.4.The results output interface
Figure5.4 the output interface
5.5.The complexity of the algorithm
(1) Complexity analysis of establishing the index
Time complexity: O(1) + O(m), where m is the length of the zipper (chain) when a conflict occurs,
that is, its depth.
Space complexity: O( )
(2) Complexity analysis of using the index
Time complexity: O(1)
Space complexity: O(1)
6.CONCLUSIONS
To solve the problem of building a k-mer index of DNA sequences, three models are proposed: the hash
algorithm index model, the hash table query model, and the memory analysis model of the hash table.
The design uses the visual2010 software to carry out the traversal; the memory occupancy is small,
the traversal efficiency is high and the results are accurate. This provides a good solution to the
problem of k-mer indexing of DNA sequences.
REFERENCES
[1] Singh, M.; Garg, D., "Choosing Best Hashing Strategies and Hash Functions," Advance Computing
Conference, 2009. IACC 2009. IEEE International , vol., no., pp.50,55, 6-7 March 2009
[2] Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory
usage[J].Bioinformatics, 2013, 29(5): 652-653
[3] Deorowicz S, Debudaj-Grabysz A, Grabowski S. Disk-based k-mer counting on a PC[J].BMC
bioinfonnatics, 2013, 14(1): 160.
[4] Roy K S, Bhattacharya D, Schliep A. Turtle: Identifying frequent k-mers with cache-efficient
algorithms[J]. arXiv preprint arXiv:1305.1861,2013.
[5] Chor B, Horn D, Goldman N, et al. Genomic DNA k-mer spectra: models and modalities[J].Genome
Biol, 2009, 10(10): 8108.
[6] Hao B, Lee H C, Zhang S. Fractals related to long DNA sequences and complete
genomes[J].Chaos,Solitions&Fractals,2000,11(6):825-836.
[7] Yang Xu; Lei Ma; Zhaobo Liu; Chao, H.J., "A Multi-dimensional Progressive Perfect Hashing for
High-Speed String Matching," Architectures for Networking and Communications Systems (ANCS),
2011 Seventh ACM/IEEE Symposium on , vol., no., pp.167,177, 3-4 Oct. 2011
[8] Yasuda, K.; Miura, T.; Shioya, I., "Distributed Processes on Tree Hash," Computer Software and
Applications Conference, 2006. COMPSAC '06. 30th Annual International , vol.2, no., pp.10,13, 17-
21 Sept. 2006
[9] Bradford, P.G.; Gavrylyako, O.V., "Hash chains with diminishing ranges for sensors," Parallel
Processing Workshops, 2004. ICPP 2004 Workshops. Proceedings. 2004 International Conference
on , vol., no., pp.77,83, 18-18 Aug. 2004
[10] Jian-Wei Fan; Chao-Wen Chan; Ya-Fen Chang, "A random increasing sequence hash chain and
smart card-based remote user authentication scheme," Information, Communications and Signal
Processing (ICICS) 2013 9th International Conference on , vol., no., pp.1,5, 10-13 Dec. 2013
Authors
Jinlin Liu is currently studying Mechanical and Electronic Engineering at Shanghai University of
Engineering Science, China, where he is working towards a Master's degree. His current research
interests include FPGAs and the design and development of embedded systems.

More Related Content

What's hot (20)

PPTX
Datastructures using c++
Gopi Nath
 
PPTX
Bsc cs ii dfs u-1 introduction to data structure
Rai University
 
PDF
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren任力偉
 
DOCX
Datastructures and algorithms prepared by M.V.Brehmanada Reddy
Malikireddy Bramhananda Reddy
 
PPTX
Data Structure & Algorithms | Computer Science
Transweb Global Inc
 
PDF
M v bramhananda reddy dsa complete notes
Malikireddy Bramhananda Reddy
 
PPT
Ch17 Hashing
leminhvuong
 
PDF
Introduction to Data Structure
Prof Ansari
 
PDF
Data structure
Shahariar limon
 
DOCX
Mit203 analysis and design of algorithms
smumbahelp
 
PDF
Searching and Sorting Techniques in Data Structure
Balwant Gorad
 
PPTX
C programming
Karthikeyan A K
 
PDF
IRJET- A Survey on Different Searching Algorithms
IRJET Journal
 
PDF
Ii pu cs practical viva voce questions
Prof. Dr. K. Adisesha
 
PPSX
Lecture 1 an introduction to data structure
Dharmendra Prasad
 
DOCX
Bc0038– data structure using c
hayerpa
 
PDF
UNIT I LINEAR DATA STRUCTURES – LIST
Kathirvel Ayyaswamy
 
PDF
Data structures Basics
DurgaDeviCbit
 
PPTX
Efficient Sparse Coding Algorithms
Anshu Dipit
 
PPT
Binary Search
kunj desai
 
Datastructures using c++
Gopi Nath
 
Bsc cs ii dfs u-1 introduction to data structure
Rai University
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren任力偉
 
Datastructures and algorithms prepared by M.V.Brehmanada Reddy
Malikireddy Bramhananda Reddy
 
Data Structure & Algorithms | Computer Science
Transweb Global Inc
 
M v bramhananda reddy dsa complete notes
Malikireddy Bramhananda Reddy
 
Ch17 Hashing
leminhvuong
 
Introduction to Data Structure
Prof Ansari
 
Data structure
Shahariar limon
 
Mit203 analysis and design of algorithms
smumbahelp
 
Searching and Sorting Techniques in Data Structure
Balwant Gorad
 
C programming
Karthikeyan A K
 
IRJET- A Survey on Different Searching Algorithms
IRJET Journal
 
Ii pu cs practical viva voce questions
Prof. Dr. K. Adisesha
 
Lecture 1 an introduction to data structure
Dharmendra Prasad
 
Bc0038– data structure using c
hayerpa
 
UNIT I LINEAR DATA STRUCTURES – LIST
Kathirvel Ayyaswamy
 
Data structures Basics
DurgaDeviCbit
 
Efficient Sparse Coding Algorithms
Anshu Dipit
 
Binary Search
kunj desai
 

Viewers also liked (20)

PDF
A countermeasure for flooding
ijcsa
 
PDF
Handling ambiguities and unknown words in named entity recognition using anap...
ijcsa
 
PDF
Energy efficient sensor selection in visual sensor networks based on multi ob...
ijcsa
 
PDF
Quantifying the impact of flood attack on
ijcsa
 
PDF
INTELLIGENT QUERY PROCESSING IN MALAYALAM
ijcsa
 
PDF
INVESTIGATION OF NONLINEAR DYNAMICS IN THE BOOST CONVERTER: EFFECT OF CAPACIT...
ijcsa
 
PDF
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
ijcsa
 
PDF
SCHEDULING IN GRID TO MINIMIZE THE IMPOSED OVERHEAD ON THE SYSTEM AND TO INC...
ijcsa
 
PDF
tScene classification using pyramid histogram of
ijcsa
 
PDF
Theta θ(g,x) and pi π(g,x) polynomials of hexagonal trapezoid system tb,a
ijcsa
 
PDF
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
ijcsa
 
PDF
A LOCATION-BASED RECOMMENDER SYSTEM FRAMEWORK TO IMPROVE ACCURACY IN USERBASE...
ijcsa
 
PDF
Application of Taguchi Experiment Design for Decrease of Cogging Torque in P...
ijcsa
 
PDF
PORTFOLIO SELECTION BY THE MEANS OF CUCKOO OPTIMIZATION ALGORITHM
ijcsa
 
PDF
COUPLER, POWER DIVIDER AND CIRCULATOR IN V-BAND SUBSTRATE INTEGRATED WAVEGUID...
ijcsa
 
PDF
A COMPARATIVE PERFORMANCE STUDY OF OFDM SYSTEM WITH THE IMPLEMENTATION OF COM...
ijcsa
 
PDF
Data analysis by using machine
ijcsa
 
PDF
Automatic rectification of perspective distortion from a single image using p...
ijcsa
 
DOCX
JAVA 2013 IEEE IMAGEPROCESSING PROJECT Query adaptive image search with hash ...
IEEEGLOBALSOFTTECHNOLOGIES
 
PDF
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
IJTET Journal
 
A countermeasure for flooding
ijcsa
 
Handling ambiguities and unknown words in named entity recognition using anap...
ijcsa
 
Energy efficient sensor selection in visual sensor networks based on multi ob...
ijcsa
 
Quantifying the impact of flood attack on
ijcsa
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
ijcsa
 
INVESTIGATION OF NONLINEAR DYNAMICS IN THE BOOST CONVERTER: EFFECT OF CAPACIT...
ijcsa
 
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
ijcsa
 
SCHEDULING IN GRID TO MINIMIZE THE IMPOSED OVERHEAD ON THE SYSTEM AND TO INC...
ijcsa
 
tScene classification using pyramid histogram of
ijcsa
 
Theta θ(g,x) and pi π(g,x) polynomials of hexagonal trapezoid system tb,a
ijcsa
 
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL
ijcsa
 
A LOCATION-BASED RECOMMENDER SYSTEM FRAMEWORK TO IMPROVE ACCURACY IN USERBASE...
ijcsa
 
Application of Taguchi Experiment Design for Decrease of Cogging Torque in P...
ijcsa
 
PORTFOLIO SELECTION BY THE MEANS OF CUCKOO OPTIMIZATION ALGORITHM
ijcsa
 
COUPLER, POWER DIVIDER AND CIRCULATOR IN V-BAND SUBSTRATE INTEGRATED WAVEGUID...
ijcsa
 
A COMPARATIVE PERFORMANCE STUDY OF OFDM SYSTEM WITH THE IMPLEMENTATION OF COM...
ijcsa
 
Data analysis by using machine
ijcsa
 
Automatic rectification of perspective distortion from a single image using p...
ijcsa
 
JAVA 2013 IEEE IMAGEPROCESSING PROJECT Query adaptive image search with hash ...
IEEEGLOBALSOFTTECHNOLOGIES
 
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
IJTET Journal
 
Ad

Similar to K mer index of dna sequence based on hash (20)

PDF
Text encryption
tayseer Karam alshekly
 
PDF
Symmetric Key Generation Algorithm in Linear Block Cipher Over LU Decompositi...
ijtsrd
 
PDF
Computational intelligence based simulated annealing guided key generation in...
ijitjournal
 
PDF
A new dna based approach of generating keydependentmixcolumns
IJCNCJournal
 
PDF
A design of parity check matrix for short irregular ldpc codes via magic
IAEME Publication
 
PDF
Design of ternary sequence using msaa
Editor Jacotech
 
PDF
Design and Analysis of an Improved Nucleotide Sequences Compression Algorithm...
IJAAS Team
 
PDF
Truncated boolean matrices for dna
IJCSEA Journal
 
PDF
C6 agramakrishnan1
Jasline Presilda
 
PDF
A MODIFIED DNA COMPUTING APPROACH TO TACKLE THE EXPONENTIAL SOLUTION SPACE OF...
ijfcstjournal
 
PDF
Loss less DNA Solidity Using Huffman and Arithmetic Coding
IJERA Editor
 
PPTX
BCS304 Module 5 slides DSA notes 3rd sem
ticonah393
 
PDF
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
PDF
Combining text and pattern preprocessing in an adaptive dna pattern matcher
IAEME Publication
 
PDF
Analytical Study of AES and Proposed Variant with Enhance Block Length and Ke...
International Journal of Science and Research (IJSR)
 
PDF
A Novel Design For Generating Dynamic Length Message Digest To Ensure Integri...
IRJET Journal
 
PDF
A Cryptographic Hardware Revolution in Communication Systems using Verilog HDL
idescitation
 
PDF
I1803014852
IOSR Journals
 
PPT
Advance algorithm hashing lec II
Sajid Marwat
 
Text encryption
tayseer Karam alshekly
 
Symmetric Key Generation Algorithm in Linear Block Cipher Over LU Decompositi...
ijtsrd
 
Computational intelligence based simulated annealing guided key generation in...
ijitjournal
 
A new dna based approach of generating keydependentmixcolumns
IJCNCJournal
 
A design of parity check matrix for short irregular ldpc codes via magic
IAEME Publication
 
Design of ternary sequence using msaa
Editor Jacotech
 
Design and Analysis of an Improved Nucleotide Sequences Compression Algorithm...
IJAAS Team
 
Truncated boolean matrices for dna
IJCSEA Journal
 
C6 agramakrishnan1
Jasline Presilda
 
A MODIFIED DNA COMPUTING APPROACH TO TACKLE THE EXPONENTIAL SOLUTION SPACE OF...
ijfcstjournal
 
Loss less DNA Solidity Using Huffman and Arithmetic Coding
IJERA Editor
 
BCS304 Module 5 slides DSA notes 3rd sem
ticonah393
 
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Combining text and pattern preprocessing in an adaptive dna pattern matcher
IAEME Publication
 
Analytical Study of AES and Proposed Variant with Enhance Block Length and Ke...
International Journal of Science and Research (IJSR)
 
A Novel Design For Generating Dynamic Length Message Digest To Ensure Integri...
IRJET Journal
 
A Cryptographic Hardware Revolution in Communication Systems using Verilog HDL
idescitation
 
I1803014852
IOSR Journals
 
Advance algorithm hashing lec II
Sajid Marwat
 
Ad

Recently uploaded (20)

PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PDF
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
PPTX
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PDF
July Patch Tuesday
Ivanti
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
PPTX
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
PDF
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
July Patch Tuesday
Ivanti
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 

K mer index of dna sequence based on hash

  • 1. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 DOI:10.5121/ijcsa.2015.5402 19 K-Mer Index Of DNA Sequence Based On Hash Algorithm Jinlin Liu1 , Qiang Chen2 and Chen Zhang3 ] 1 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620,China. 2 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620,China. 3 School of Management, Shanghai University of Engineering Science Shanghai, 201620, China. ABSTRACT K-mer frequency statistics of biological sequences is a very important and important problem in biological information processing. This paper addresses the problem of index k-mer for large scale data reading DNA sequences in a limited memory space and time. Using the hash algorithm to establish index, the index model is set up to base pairing, and get the length of k-mer statistic information quickly, so as to avoid searching all the sequences of the index. At the same time, the program uses hash table to establish index and build search model, and uses the zipper method to resolve the conflict in the case of address conflict. Algorithm of time complexity analysis and experimental results show that compared with the traditional indexing methods, the algorithm of the performance improvement is obvious, and very suitable for to be used in the k-mer length change with a wide range . KEYWORDS K-mer index; hash algorithm; DNA detecting; index model; 1.INTRODUCTION With the rapid development of DNA sequencing technology in recent years, human generated massive biological sequence data, and we need to analyze and process through effective calculation means. Among the numerous biological sequence analysis and processing problems, the k-mer of biological sequence data is a short sequence of DNA sequences of k sequences. 
When the K value is appropriate, sequence k-mer frequency distribution contains all the information in the genome constituting equivalent sequences .So we can learn biological sequences of base distribution characteristics, functions, structures and evolution information by analyzing DNA sequence k-mer distribution and different k-mer information 2.QUESTIONS This paper aims to solve the problem of k-mer index of DNA sequence.According to the given K, 100 million DNA sequences will establish index, Then the computer will read every K length DNA from the start to end for each sequence. Then move on to the next sequence to read again, until the positions of the individual K-mer appeared in the sequence were recorded. Because
  • 2. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 20 DNA sequencing fragments, large scale of data, so we have to handle large data sets under the condition of limited memory and disk space, and make the space complexity and computational complexity as much as possible has been optimized. So we have to solve these problems. Q1. According to the given K to establish index, then search every sequence. Each sequence uses a hash algorithm to encode the base, and then convert the input specific K base fragment into the decimal data, and then match in the 100 million sequence. In the end, the computer output line and column base fragment. Q2. After the index is established, we build the hash table in memory, and every time we traverse, we store the frequency and the position of the k-mer in the hash table. Under the limited memory space, we can traverse a million DNA sequences. 3.PROBLEM ANALYSIS 3.1.problem abstraction. First according to the 100 million genetic sequence, because the length of each gene sequence is 100, so gene sequence is equivalent to a two bit matrix array a, corresponding to the rows of a as: 1-1 000000, it is listed as the 1-100. The problem is abstracted from the matrix A[i][j] analysis, i=1,2... 1000000; j=1,2,... 100. 3.2.Method solution The base species of the sequence are: C, A, G, T. Using the hash algorithm, the four bases are converted into four binary digits, and then the conversion sequence is converted, which is set A=0, C=1, G=2, T=3,and then convert the four numbers to decimal digits in the matching query .Hash value algorithm formula is Hash(value)=value*[4^(k-m-1)], value represents the corresponding value of the character, K represents the length of M, and k-mer represents the position range of the character in the string [0- (m-1)].For example, the sequence k=4 of a given ATCG is converted into the corresponding decimal ATCG=[0* (4^3) +3* (4^2) +2* (4^1) +1* (4^0)]=54. 
The base sequence of each row length of 100 can be converted to a 100-k+1 decimal number. The same principle can be used for the same 1 million line base sequence, you can get the corresponding decimal number and then stored in the two-dimensional array A[i][j].when the same decimal number is matched, the program converts decimal conversion into a four - band form of a corresponding length of K, like the example ATCG form. Then program will print base fragment corresponding row and column labels mark. After the establishment of the index, we use division method to build hash tables in memory, and determine the address of the hash table. The column headers and corresponding location is stored in the hash table every k-mer occurs. The search efficiency of the query million DNA sequences is improved under the limited memory space.
  • 3. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 21 4.MODEL ESTABLISHMENT AND SOLUTION Hash algorithm is the binary value of arbitrary length is mapped into a shorter fixed length of the binary value, this small binary value called hash value. In this paper, according to the principle of hash algorithm, the identity of the four bases of the ACGT respectively 0123, converted to four hexadecimal number is then transformed into a decimal number, let base conversion of decimal number and the first line of 100-k+1 to a decimal number to match, if the base sequence matching, the program will output the row and column label mark. Flow chart as shown below:
  • 4. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 22 4.1.Model two: search model based on hash table The main requirement of this paper is to design hash function, according to the keyword k-mer to build hash table. There are a lot of methods of constructing hash function, digital analysis method, the direct method of definite value, random numbers, random number method is usually used in the key word length, this paper selects division method. The obtained nucleotide sequence of hash values divided by 1000 to take over, get the number as the address of the hash table. All to take over the business of the same number into the bucket, and in each bucket will remainder exists is not the same, but business the same. Therefore, in order to solve the address conflict. The method of the zipper is to resolve the conflict: the nodes of all keywords are synonymous with the same single linked list.. If the selected hash table length is m, the hash table can be defined as an array of pointers consisting of a m pointer T[0..M-1]. All the hash address for the node of I, are inserted into the single T[i] pointer to the single chain table. The initial values of each component in T should be null pointer. In the zipper method, the load factor can be greater than 1, but generally take α less than 1. Hash search: first of all, k-mer as the keyword, and program needs to use the hash function to calculate the address. If the base arrangement is the same as the base sequence of the searched sequence, if the same output of the node is all the information, if the relative should be found, then returns continue to search.
  • 6. 4.2. Model three: analysis of the memory space occupied by the hash table Data definition analysis: the int keyword denotes a 32-bit integer whose range is -2147483648 to +2147483647 (inclusive); each int occupies 4 bytes. The char type holds an 8-bit character code whose unsigned values range from 0 to 255; each char occupies 1 byte. Overall data analysis: rows, 1000000, with the row label defined as an int variable (4 bytes); columns, 100, with the column label defined as a char variable (1 byte). In theory, the index for a given k occupies $5 \times 4^{k}$ bytes, which can also be converted into $\frac{5 \times 4^{k}}{1024 \times 1024 \times 1024}$ GB. The memory space corresponding to the index for different values of k is shown in the table below.
    Table 4.1 The memory space
    K     Memory space (GB)
    1     0.00000002
    2     0.00000007
    3     0.00000030
    4     0.00000119
    5     0.00000477
    6     0.00001907
    7     0.00007629
    8     0.00030518
    9     0.00122070
    10    0.00488281
    11    0.01953125
    12    0.07812500
    13    0.31250000
    14    1.25000000
    15    5.00000000
    16    20.00000000
  • 7. 5. RUN RESULTS 5.1. The interface Figure 5.1 The interface
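The figures in Table 4.1 can be checked with a short calculation. The sketch below assumes the reading that each of the 4^k possible k-mers costs 5 bytes (one 4-byte int plus one 1-byte char), which reproduces the tabulated values; the function names are hypothetical.

```python
BYTES_PER_ENTRY = 4 + 1  # assumed: one int (4 bytes) plus one char (1 byte)


def index_bytes(k):
    """Theoretical index size in bytes for a given k: 5 * 4**k."""
    return BYTES_PER_ENTRY * 4 ** k


def index_gb(k):
    """The same size converted to GB by dividing by 1024 * 1024 * 1024."""
    return index_bytes(k) / (1024 * 1024 * 1024)


for k in range(1, 17):
    print(f"{k:<5d} {index_gb(k):.8f}")
```

Since 4^15 equals 1024^3, the size is exactly 5 GB at k = 15 and grows by a factor of 4 for each further increment of k.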
  • 8. 5.2. Search interface Figure 5.2 The search interface 5.3. File generated results The generated K_mer.txt file is shown in Figure 5.3. Figure 5.3 The text file
  • 9. 5.4. Results: the output interface Figure 5.4 The output interface 5.5. The complexity of the algorithm (1) Complexity analysis of establishing the index: time complexity O(1) + O(m), where m is the length of the zipper when a conflict occurs, that is, its depth; space complexity O( ). (2) Complexity analysis of using the index: time complexity O(1); space complexity O(1). 6. CONCLUSIONS In order to solve the problem of k-mer indexing of DNA, three kinds of models are proposed: the hash algorithm index model, the hash table query model, and the memory analysis model of the hash table. The design uses the visual2010 software to traverse for the optimal results; the memory occupancy is small, the traversal efficiency is high and the result is accurate. This provides a good solution for the problem of k-mer indexing of DNA.
  • 10. Authors Jinlin Liu is currently studying Mechanical and Electronic Engineering at Shanghai University of Engineering Science, China, where he is working towards a Master's degree. His current research interests include FPGAs and the design and development of embedded systems.