Linux one-liners for converting GFF3 files to FASTA files of contigs
Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting the DNA sequences of contigs from a GFF3 file and storing them in a multi-FASTA file. EMBOSS seqret is only able to pull out the last contig from the GFF3 file, whereas other tools aim to extract the DNA sequence per feature. Therefore, I developed two Linux one-liners in this post for extracting contig sequences from a GFF3 file and writing them to a FASTA file.
The procedure is simple and does not need sophisticated code: when a GFF3 file embeds contig sequences, they are stored at the end of the file under the section header ##FASTA, so the command line determines the line number of this header and extracts all content thereafter (that is, from line number + 1 onwards). Assuming the name of every GFF3 file follows the format [genome name].velvet.gff, for a particular genome g01 we can convert it to a FASTA file using the one-liner:
g='g01'; echo "$(( $(grep -n '^##FASTA' "${g}.velvet.gff" | grep -o -P '^[^:]+') + 1 ))" | xargs -I {} tail -n +{} "${g}.velvet.gff" > "${g}.fasta"
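The same idea (skip everything up to and including the ##FASTA line, keep the rest) can also be written without computing a line number. The following is only a minimal sketch of an alternative, not the one-liner above: it uses awk instead of grep and tail, the function name gff_to_fasta is hypothetical, and the toy GFF3 content is made up for illustration.

```shell
# Sketch: print every line that follows the first "##FASTA" line of a GFF3 file.
# gff_to_fasta is a hypothetical helper name, not part of any published tool.
gff_to_fasta() {
    # The print rule runs before the flag is set, so the header line itself
    # is skipped and only the lines after it are printed.
    awk 'found { print } /^##FASTA/ { found = 1 }' "$1"
}

# Build a toy GFF3 file with two contigs (made-up data).
demo_dir=$(mktemp -d)
{
    printf '##gff-version 3\n'
    printf 'contig1\tdemo\tgene\t1\t8\t.\t+\t.\tID=gene1\n'
    printf '##FASTA\n'
    printf '>contig1\nACGTACGT\n'
    printf '>contig2\nTTGGCCAA\n'
} > "${demo_dir}/toy.velvet.gff"

# Convert it and show the resulting multi-FASTA content.
gff_to_fasta "${demo_dir}/toy.velvet.gff" > "${demo_dir}/toy.fasta"
cat "${demo_dir}/toy.fasta"
```

If the input has no ##FASTA line, this sketch simply produces an empty file, so it is worth checking the size of the output before using it.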
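Furthermore, given a list of paths to GFF3 files that are stored under a series of subdirectories, we can run the following one-liner*, which writes each FASTA file into a directory of the same name under the current working directory (./dir1, ./dir2 and ./dir3 must already exist):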
for d in dir1 dir2 dir3; do for gff in $(grep "$d" gff3Files.txt); do g=$(basename "$gff" '.velvet.gff'); echo "$(( $(grep -n '^##FASTA' "$gff" | grep -o -P '^[^:]+') + 1 ))" | xargs -I {} tail -n +{} "$gff" > "./${d}/${g}.fasta"; done; done
This assumes that the file gff3Files.txt contains:
./annotation/dir1/g01.velvet.gff
./annotation/dir1/g02.velvet.gff
./annotation/dir2/g03.velvet.gff
./annotation/dir2/g04.velvet.gff
./annotation/dir3/g05.velvet.gff
./annotation/dir3/g06.velvet.gff
In practice, you can adapt these two command lines to your own system environment and directory layout.
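As one possible adaptation, the sketch below reads the path list line by line and derives both the subdirectory name and the genome name from each path, so the directory names do not have to be typed into the loop. This is an illustration under assumptions, not the method used above: the function name convert_list and its out_root argument are hypothetical, it extracts the sequences with awk rather than grep and tail, and the toy files contain made-up sequences.

```shell
# Sketch of a batch conversion driven by a path list (one GFF3 path per line).
# convert_list, list and out_root are hypothetical names for illustration.
convert_list() {
    list=$1       # text file with one GFF3 path per line
    out_root=$2   # directory that receives one subdirectory per input directory
    while IFS= read -r gff; do
        [ -n "$gff" ] || continue              # skip blank lines
        d=$(basename "$(dirname "$gff")")      # parent directory, e.g. dir1
        g=$(basename "$gff" .velvet.gff)       # genome name, e.g. g01
        mkdir -p "${out_root}/${d}"
        # Keep only the lines after the ##FASTA section header.
        awk 'found { print } /^##FASTA/ { found = 1 }' "$gff" > "${out_root}/${d}/${g}.fasta"
    done < "$list"
}

# Toy demonstration with made-up sequences in a temporary directory.
work=$(mktemp -d)
mkdir -p "${work}/annotation/dir1" "${work}/annotation/dir2"
printf '##gff-version 3\n##FASTA\n>c1\nACGT\n' > "${work}/annotation/dir1/g01.velvet.gff"
printf '##gff-version 3\n##FASTA\n>c1\nGGCC\n>c2\nTTAA\n' > "${work}/annotation/dir2/g03.velvet.gff"
printf '%s\n' "${work}/annotation/dir1/g01.velvet.gff" \
              "${work}/annotation/dir2/g03.velvet.gff" > "${work}/gff3Files.txt"

convert_list "${work}/gff3Files.txt" "${work}/out"
```

Reading the list with while read instead of expanding it in a for loop keeps paths that contain spaces intact, and mkdir -p removes the need to create the output directories beforehand.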
Footnote
* This one-liner is easier to read when it is rearranged into the following block:
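for d in dir1 dir2 dir3; do
    # Select the GFF3 paths that belong to this subdirectory
    for gff in $(grep "$d" gff3Files.txt); do
        # Genome name = file name without the .velvet.gff suffix
        g=$(basename "$gff" '.velvet.gff')
        # Print everything after the ##FASTA header line into the FASTA file
        echo "$(( $(grep -n '^##FASTA' "$gff" | grep -o -P '^[^:]+') + 1 ))" | xargs -I {} tail -n +{} "$gff" > "./${d}/${g}.fasta"
    done
done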