Linux one-liners converting GFF3 to FASTA files of contigs

2020-03-25

Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting the DNA sequences of contigs from GFF3 files and storing them in a multi-FASTA file. EMBOSS seqret is only able to pull out the last contig from a GFF3 file, whereas other tools aim to extract the DNA sequence of each feature. Therefore, in this post I develop two Linux one-liners for extracting contig sequences from a GFF3 file and transferring them to a FASTA file.

The procedure is simple and does not need sophisticated code: since contig sequences, when embedded in a GFF3 file, are stored at the end of the file under the section header ##FASTA, the command line determines the line number of the section header and extracts all content thereafter (that is, from line number + 1 onwards). Assuming the name of every GFF3 file follows the format [genome name].velvet.gff, we can convert the file of a particular genome g01 to a FASTA file using the one-liner:

g='g01'; echo "$((`grep -n '##FASTA' ${g}.velvet.gff | grep -o -P '^[^:]+\s*'` + 1))" | xargs -I {} tail -n +{} ${g}.velvet.gff > ${g}.fasta
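The same idea (keep everything after the ##FASTA header) can also be sketched with awk, which avoids computing the line number explicitly. This is a minimal alternative under stated assumptions rather than the command used above: the function name gff_to_fasta and the tiny demonstration file are hypothetical and only serve as illustration.

```shell
# Hypothetical helper (not part of the original one-liner):
# print every line that follows the first line starting with "##FASTA".
gff_to_fasta() {
    awk 'found { print } /^##FASTA/ { found = 1 }' "$1"
}

# Made-up miniature GFF3 file, for demonstration only
demo=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'contig1\t.\tgene\t1\t9\t.\t+\t.\tID=gene1\n'
    printf '##FASTA\n>contig1\nACGTACGTA\n'
} > "$demo"

# Prints the FASTA record of contig1 (header line, then sequence line)
gff_to_fasta "$demo"
rm -f "$demo"
```

For a real genome this would be called as gff_to_fasta g01.velvet.gff > g01.fasta. If a file has no ##FASTA line, this sketch prints nothing, so an empty output file signals that the GFF3 file did not embed any sequences.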

Furthermore, given a list of paths to GFF3 files that are stored under a series of subdirectories, we can run the following one-liner*:

for d in dir1 dir2 dir3; do for gff in `grep "$d" gff3Files.txt`; do g=`basename $gff '.velvet.gff'`; echo "$((`grep -n '##FASTA' $gff | grep -o -P '^[^:]+\s*'` + 1))" | xargs -I {} tail -n +{} $gff > ./${d}/${g}.fasta; done; done

supposing the content of file gff3Files.txt is:

./annotation/dir1/g01.velvet.gff
./annotation/dir1/g02.velvet.gff
./annotation/dir2/g03.velvet.gff
./annotation/dir2/g04.velvet.gff
./annotation/dir3/g05.velvet.gff
./annotation/dir3/g06.velvet.gff
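As a variation on the batch conversion, the list can be read line by line and the output subdirectory derived from each path, so the directory names do not have to be typed out. The sketch below is only one possible arrangement and rests on assumptions that are not in the one-liner above: the function name convert_listed_gffs is hypothetical, the parent directory of each GFF3 file (dir1, dir2, ...) is taken as the name of the output directory, and that directory is created when missing.

```shell
# Hypothetical helper (an assumption, not the original one-liner):
# convert every GFF3 file listed in a text file, one path per line,
# writing ./<parent directory>/<genome name>.fasta for each of them.
convert_listed_gffs() {
    list=$1
    while IFS= read -r gff; do
        [ -n "$gff" ] || continue              # skip blank lines
        d=$(basename "$(dirname "$gff")")      # e.g. dir1
        g=$(basename "$gff" .velvet.gff)       # e.g. g01
        mkdir -p "./$d"                        # create the output directory if needed
        # keep everything after the ##FASTA section header
        awk 'found { print } /^##FASTA/ { found = 1 }' "$gff" > "./$d/$g.fasta"
    done < "$list"
}

# Usage, assuming the listing above is saved as gff3Files.txt:
#   convert_listed_gffs gff3Files.txt
```

Reading the list with while read, instead of expanding it inside a for loop, keeps paths that contain spaces intact.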

In practice, readers can adapt these two command lines to their specific system environments and data structures; for example, the -P option used above requires GNU grep.

Footnote

* This one-liner is easier to read when its commands are rearranged into the following block:

for d in dir1 dir2 dir3; do
    for gff in `grep "$d" gff3Files.txt`; do
        g=`basename $gff '.velvet.gff'`
        echo "$((`grep -n '##FASTA' $gff | grep -o -P '^[^:]+\s*'` + 1))" | xargs -I {} tail -n +{} $gff > ./${d}/${g}.fasta
    done
done