Seqkit: extracting gene sequences with a GTF/GFF file

Latest recommended article published on 2025-06-26 13:42:43

绶卿

License: CC 4.0 BY-SA

Tags: bioinformatics

Article link: https://blog.csdn.net/weixin_45044758/article/details/118097119

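Seqkit is a powerful bioinformatics tool that can quickly extract FASTA sequences from a genome using the gene coordinates in a GTF or BED file. A single command such as `seqkit subseq --gtf gtf_file.gtf genome_file.fa` does the job (the genome FASTA is passed as a positional argument), which greatly simplifies the workflow.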

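GFF/GTF annotation files contain the positions and structure of genes, but how do you quickly turn that position information into a FASTA file? I strongly recommend seqkit: one line of code solves the problem!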

seqkit

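Installation

Install directly with conda

conda install seqkit -c bioconda

Usage

seqkit bundles many functions into one tool; today I only introduce subseq, which is used here to extract gene sequences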

Usage:
  seqkit subseq [flags]

Flags:
      --bed string        by tab-delimited BED file
      --chr strings       select limited sequence with sequence IDs when using --gtf or --bed (multiple value supported, case ignored)
  -d, --down-stream int   down stream length
      --feature strings   select limited feature types (multiple value supported, case ignored, only works with GTF)
      --gtf string        by GTF (version 2.2) file
      --gtf-tag string    output this tag as sequence comment (default "gene_id")
  -h, --help              help for subseq
  -f, --only-flank        only return up/down stream sequence
  -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
  -u, --up-stream int     up stream length

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
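
The `-r/--region` syntax described in the help text is easiest to understand on a tiny input. The following is a minimal sketch, assuming seqkit is on your PATH; `toy.fa` and its 24-base `chr1` record are hypothetical and exist only for illustration. The three regions are the ones quoted in the help text itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy input: a single 24-base record named chr1.
printf '>chr1\nACGTACGTACGTTTTTGGGGCCCC\n' > toy.fa

if command -v seqkit >/dev/null 2>&1; then
    # First 12 bases of each record.
    seqkit subseq -r 1:12 toy.fa
    # Last 12 bases of each record.
    seqkit subseq -r -12:-1 toy.fa
    # Everything after the first 12 bases.
    seqkit subseq -r 13:-1 toy.fa
else
    # Guard so the script still finishes where seqkit is not installed.
    echo "seqkit not found on PATH; skipping the subseq calls" >&2
fi
```

Positions are 1-based and inclusive, and negative numbers count from the end of the sequence, so the same command works on records of different lengths.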


# Extract gene sequences using a BED or GTF file (use one or the other; both write gene.fa)
seqkit subseq --bed bedfile.bed -o gene.fa genomefile.fa
seqkit subseq --gtf gtffile.gtf -o gene.fa genomefile.fa
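
A GTF file usually holds several feature types per gene (gene, transcript, exon, CDS, ...), and as far as I understand subseq extracts every row unless `--feature` restricts it. The sketch below shows one way to keep only `gene` rows and to pull upstream flanking sequence with `-u` and `-f`. All file names and the toy annotation (`toy_genome.fa`, `toy.gtf`, `geneA`) are hypothetical; only the flags come from the help text above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy genome: one 32-base record named chr1.
printf '>chr1\nAAAACCCCGGGGTTTTACGTACGTACGTACGT\n' > toy_genome.fa

# Hypothetical toy annotation: one gene (positions 5-12) and one exon inside it.
# GTF columns are tab-separated: seqname, source, feature, start, end,
# score, strand, frame, attributes.
printf 'chr1\ttoy\tgene\t5\t12\t.\t+\t.\tgene_id "geneA";\n' > toy.gtf
printf 'chr1\ttoy\texon\t5\t8\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";\n' >> toy.gtf

if command -v seqkit >/dev/null 2>&1; then
    # Keep only rows whose feature column is "gene".
    seqkit subseq --gtf toy.gtf --feature gene -o toy_gene.fa toy_genome.fa
    # Only the 4 bases upstream of each gene, without the gene body itself.
    seqkit subseq --gtf toy.gtf --feature gene -u 4 -f -o toy_upstream.fa toy_genome.fa
else
    # Guard so the script still finishes where seqkit is not installed.
    echo "seqkit not found on PATH; skipping the subseq calls" >&2
fi
```

Writing each result to its own output file avoids one command silently overwriting the other. The sequence names in the GTF (first column) have to match the record IDs in the genome FASTA, otherwise nothing is extracted.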