gbk2tsv.py: tabulating genomic features in GenBank files
I finally got some time this morning to write a Python script, gbk2tsv.py, which converts one or more GenBank files into tab-delimited feature tables (plain-text files with the extension “.tsv”). It is useful when we need to summarise genome annotations or extract the nucleotide and protein sequences of particular genomic features. Although the Holt Lab, where I did my PhD, has an in-house script that does a similar job, it would be inappropriate for me to use or share that intellectual property for projects outside the Holt Lab without specific permission. I therefore decided to write a script from scratch after a discussion on genome annotation with Hao Luo, a PhD student at Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.
Usage
usage: gbk2tsv.py [-h] -g GBKS [GBKS ...] [-o OUTDIR] [-f FEATURES] [-n] [-p]
Convert GenBank files to tab-delimited text files
optional arguments:
  -h, --help            show this help message and exit
  -g GBKS [GBKS ...], --gbk GBKS [GBKS ...]
                        Input GenBank files
  -o OUTDIR, --outdir OUTDIR
                        Output directory (no backslash or forward slash)
  -f FEATURES, --features FEATURES
                        Comma-separated features to store (default
                        CDS,tRNA,rRNA)
  -n, --nucl_seq        Turn on this option to print nucleotide sequences of
                        features
  -p, --prot_seq        Turn on this option to print protein sequences of CDS
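For readers curious how such a command line can be declared, here is a minimal argparse sketch that mirrors the help message above. It is a reconstruction for illustration, not the script's actual source: the function name build_parser and the default output directory "." are my assumptions, since the help text does not state a default for --outdir.

```python
import argparse


def build_parser():
    """Build a parser resembling the gbk2tsv.py help message (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="gbk2tsv.py",
        description="Convert GenBank files to tab-delimited text files",
    )
    # dest="gbks" makes argparse display the metavar as GBKS, as in the help text
    parser.add_argument("-g", "--gbk", dest="gbks", nargs="+", required=True,
                        help="Input GenBank files")
    # Assumption: the real default is not shown in the help text; "." is a guess
    parser.add_argument("-o", "--outdir", default=".",
                        help="Output directory (no backslash or forward slash)")
    parser.add_argument("-f", "--features", default="CDS,tRNA,rRNA",
                        help="Comma-separated features to store (default CDS,tRNA,rRNA)")
    parser.add_argument("-n", "--nucl_seq", action="store_true",
                        help="Turn on this option to print nucleotide sequences of features")
    parser.add_argument("-p", "--prot_seq", action="store_true",
                        help="Turn on this option to print protein sequences of CDS")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained
args = build_parser().parse_args(["--gbk", "demo1.gbk", "demo2.gbk", "-p"])
feature_types = args.features.split(",")  # the comma-separated string becomes a list
print(args.gbks, feature_types, args.nucl_seq, args.prot_seq)
```

Because --gbk uses nargs="+", it collects every file name that follows it, which is what allows the same option to take one file, several named files, or a shell-expanded wildcard.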
The script accepts three forms of input file names:
1. Single GenBank file
python gbk2tsv.py --gbk demo.gbk
2. Multiple GenBank files with known names
python gbk2tsv.py --gbk demo1.gbk demo2.gbk demo3.gbk
3. Multiple GenBank files matched to a wildcard
python gbk2tsv.py --gbk *.gbk
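In the third form, the wildcard is normally expanded by the shell (for example bash) before Python starts, so the script simply receives a list of file names. Shells that pass the pattern through unchanged, such as the Windows command prompt, would need the expansion to happen inside the script. The sketch below shows one way to do that with the standard glob module; expand_inputs is a hypothetical helper of my own and I am not claiming gbk2tsv.py contains it.

```python
import glob


def expand_inputs(patterns):
    """Expand wildcard patterns into a de-duplicated list of paths, keeping order."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # If nothing matches, keep the argument as given so that a later
        # attempt to open it reports the missing file by name
        candidates = matches if matches else [pattern]
        for path in candidates:
            if path not in paths:
                paths.append(path)
    return paths
```

Calling expand_inputs(["*.gbk"]) in a directory holding demo1.gbk and demo2.gbk would return both names, while a plain file name is passed through untouched.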
For details on using Biopython to process GenBank files, readers may wish to see a post by Peter Cock from the University of Warwick.
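To give a flavour of the Biopython side, here is a rough sketch of how features of selected types might be turned into tab-delimited rows. It is not the code of gbk2tsv.py: the column layout, the names BASE_COLUMNS, feature_rows and gbk_to_tsv, and the choice of the locus_tag, product and translation qualifiers are all assumptions made for illustration. It relies only on Bio.SeqIO.parse, SeqFeature.location, SeqFeature.qualifiers and SeqFeature.extract from Biopython.

```python
import csv

# Hypothetical column layout; the real script may use different columns
BASE_COLUMNS = ["record", "type", "locus_tag", "start", "end", "strand", "product"]


def feature_rows(record, wanted=("CDS", "tRNA", "rRNA"), nucl_seq=False, prot_seq=False):
    """Yield one list of strings per feature of a wanted type in a SeqRecord."""
    for feature in record.features:
        if feature.type not in wanted:
            continue
        quals = feature.qualifiers  # maps a qualifier name to a list of values
        row = [
            record.id,
            feature.type,
            quals.get("locus_tag", [""])[0],
            str(int(feature.location.start) + 1),  # Biopython starts are 0-based
            str(int(feature.location.end)),
            {1: "+", -1: "-"}.get(feature.location.strand, "."),
            quals.get("product", [""])[0],
        ]
        if nucl_seq:
            # extract() slices the parent sequence and reverse-complements if needed
            row.append(str(feature.extract(record.seq)))
        if prot_seq:
            # Only CDS features normally carry a translation; others get an empty cell
            row.append(quals.get("translation", [""])[0])
        yield row


def gbk_to_tsv(gbk_path, tsv_path, wanted=("CDS", "tRNA", "rRNA"),
               nucl_seq=False, prot_seq=False):
    """Write the selected features of one GenBank file to a TSV file."""
    from Bio import SeqIO  # Biopython is only needed when a file is actually parsed

    header = BASE_COLUMNS + (["nucl_seq"] if nucl_seq else []) \
        + (["prot_seq"] if prot_seq else [])
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for record in SeqIO.parse(gbk_path, "genbank"):
            writer.writerows(feature_rows(record, wanted, nucl_seq, prot_seq))
```

With Biopython installed, a call such as gbk_to_tsv("demo.gbk", "demo.tsv", prot_seq=True) would write one row per CDS, tRNA and rRNA feature, with the protein sequence in the last column where a translation qualifier is present.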