A complete list of my software
This page introduces the software that I have developed and published for the research community since 2015.
1. Population genomics
1.1. Detection of horizontal gene co-transfer between bacteria
GeneMates
Latest version: v0.2.2, released on 21 March 2020. (Documentation)
This R package implements my network approach for the detection of intra-species horizontal gene co-transfer (HGcoT) between bacteria. A manuscript describing it is in preparation. GeneMates takes as input bacterial whole-genome sequencing (WGS) data (in the form of short reads and/or genome assemblies) and creates networks showing evidence of HGcoT at the allele level. The package can also be used to test for allele-to-allele associations while controlling for bacterial population structure. The following list introduces the GeneMates functions that I use most often.
- findPhysLink is the main function, which carries out the detection of HGcoT. GeneMates includes a number of other functions for processing its output.
- ringPlotPAM creates a ring plot for antimicrobial resistance gene (ARG) alleles to compare their presence-absence and co-occurrence to a phylogenetic tree, with user-specified tree tips and/or branches highlighted. It is particularly useful for validation and interpretation of results.
- heatMapPAM creates a binary heat map from an allelic or genetic presence-absence matrix (PAM) and aligns rows of the heat map to a phylogenetic tree. It improves on the gheatmap function of ggtree and can be used for inspecting the distribution of ARG alleles.
- showGeneContent draws bubble plots and bar charts for summarising frequencies of ARGs and their alleles.
- mkNetwork converts results of the function findPhysLink into a network object that can be processed by other GeneMates functions for network analysis and exported to Cytoscape for network visualisation.
- mkCoocurNetwork creates temporal co-occurrence networks over a given list of years in accordance with edges in an association network or a linkage network.
- tempNet compiles temporal co-occurrence networks into a dynamic network that can be displayed as an animation using R package ndtv. This function can be used for determining the earliest co-occurrence events of specific ARG alleles in a given collection of bacterial genomes.
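The idea of finding the earliest co-occurrence of allele pairs, mentioned for tempNet above, can be illustrated with a minimal Python sketch. It is not GeneMates code: the function name earliest_cooccurrence, the dictionary-based inputs and the example allele names are all assumptions made for illustration.

```python
from itertools import combinations


def earliest_cooccurrence(alleles_by_sample, year_by_sample):
    """Return the earliest year in which each pair of alleles co-occurs.

    alleles_by_sample: dict mapping a sample name to the set of alleles it carries
    (hypothetical input layout, not the GeneMates data structure).
    year_by_sample: dict mapping a sample name to its isolation year.
    """
    earliest = {}
    for sample, alleles in alleles_by_sample.items():
        year = year_by_sample[sample]
        # Sort so that each pair has a single canonical key (a, b) with a < b.
        for pair in combinations(sorted(alleles), 2):
            if pair not in earliest or year < earliest[pair]:
                earliest[pair] = year
    return earliest


if __name__ == "__main__":
    pam = {"s1": {"sul1.1", "tetA.2"}, "s2": {"sul1.1", "tetA.2", "dfrA.3"}}
    years = {"s1": 2005, "s2": 2001}
    print(earliest_cooccurrence(pam, years))
```

A dynamic network adds edge onset and termination times on top of this, but the per-pair minimum year is the quantity of interest when dating the first co-occurrence.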
alleleClusterLocator
This tool extracts and clusters the shortest sequences that contain a given set of allele clusters. Specifically, it was developed to resolve the genetic structures underlying ARG allele clusters: it locates every cluster of co-occurring ARG alleles in contigs, extracts the shortest nucleotide sequence containing all the alleles from each contig, and launches CD-HIT-EST to group the shortest sequences per allele cluster under user-specified criteria.
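The extraction step described above can be sketched in Python as follows. This is a simplified illustration rather than the code of alleleClusterLocator: the function name shortest_spanning_region is hypothetical, and the sketch assumes one hit per allele per contig with 1-based inclusive coordinates.

```python
def shortest_spanning_region(contig_seq, hits, cluster):
    """Return the shortest stretch of a contig that contains every allele of a cluster.

    contig_seq: nucleotide sequence of the contig.
    hits: dict mapping an allele name to its (start, end) position on this contig,
          1-based and inclusive (assumed layout, one hit per allele).
    cluster: iterable of allele names that make up the cluster.
    Returns None when the contig does not carry all alleles of the cluster.
    """
    cluster = list(cluster)
    if not cluster or any(allele not in hits for allele in cluster):
        return None
    start = min(hits[allele][0] for allele in cluster)
    end = max(hits[allele][1] for allele in cluster)
    # Convert 1-based inclusive coordinates to a Python slice.
    return contig_seq[start - 1:end]


if __name__ == "__main__":
    contig = "AACCGGTTAACC"
    positions = {"x": (3, 4), "y": (7, 8)}
    print(shortest_spanning_region(contig, positions, ["x", "y"]))
```

The extracted regions from all contigs would then be written to a FASTA file and passed to a clustering program such as CD-HIT-EST.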
1.2. Core-genome SNP analysis
cgSNPs
The analysis of core-genome SNPs (cgSNPs) identified in bacterial genomes of the same species plays a pivotal role in the detection of HGcoT. My scripts developed for analysing cgSNPs implement contamination detection and SNP imputation:
- contamination_assessment: nine R or Linux Bash scripts have been developed for detecting DNA contamination in Illumina read sets. The scripts take as input results of RedDog, extract coordinates of homozygous and heterozygous SNP calls in a reference genome from RedDog VCF files (extractInfoFromVCF.sh), generate a scatter plot of fold coverages and MAFs per bacterial strain (hetSNP_depthPlot.R), and perform several statistical analyses.
- imputation: two Python scripts have been created under this directory to process outputs of ClonalFrameML (Didelot and Wilson, 2015) after SNP imputation. Specifically, script clonalFrameML2Fasta.py creates pseudo-alignments of imputed SNPs and saves them as a FASTA file; script fasta2csv.py then converts the FASTA file into a comma-delimited SNP table that is required by GeneMates and GEMMA.
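The conversion of a SNP pseudo-alignment into a comma-delimited table can be sketched as below. This is not the code of fasta2csv.py: the function names, the "sample" and "site1", "site2", ... column headers, and the one-row-per-sample orientation are assumptions, and the exact table layout expected by GeneMates and GEMMA should be taken from their documentation.

```python
import csv
import io


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of sequence name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the identifier, drop any description
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {seq_name: "".join(parts) for seq_name, parts in records.items()}


def fasta_to_csv(text):
    """Convert an alignment in FASTA format into CSV text (one row per sample)."""
    seqs = parse_fasta(text)
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) > 1:
        raise ValueError("sequences in an alignment must have equal lengths")
    n_sites = lengths.pop() if lengths else 0
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical header: the real column names depend on the downstream tool.
    writer.writerow(["sample"] + [f"site{i}" for i in range(1, n_sites + 1)])
    for seq_name, seq in seqs.items():
        writer.writerow([seq_name] + list(seq))
    return out.getvalue()


if __name__ == "__main__":
    print(fasta_to_csv(">s1\nAC\nG\n>s2\nATG\n"))
```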
1.3. Read simulation
readSimulator
This software generates synthetic Illumina reads from complete bacterial genomes, which is a fundamental step for evaluating the accuracy of allelic physical distance (APD) measurements used by GeneMates. The package consists of three Python scripts, amongst which readSimulator.py implements the key functionality of read simulation. Advantages of readSimulator over some widely used tools (such as Heng Li's wgsim and NIH's ART) include support for the circular topology of bacterial genomes, flexible settings for fold coverages and error profiles, and parallel processing of multiple genomes.
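Support for circular topology means that a read may start near the end of the sequence record and continue at its beginning. The following Python sketch illustrates only that idea together with a target fold coverage. It is not the code of readSimulator.py: the function names are hypothetical, reads are single-end, and no error profile is applied.

```python
import random


def sample_circular_read(genome, read_len, rng):
    """Draw one error-free read from a circular genome, wrapping around the origin."""
    if read_len > len(genome):
        raise ValueError("read length must not exceed genome length")
    start = rng.randrange(len(genome))
    end = start + read_len
    if end <= len(genome):
        return genome[start:end]
    # The read crosses the end of the record: join the tail to the head.
    return genome[start:] + genome[:end - len(genome)]


def reads_for_coverage(genome, read_len, coverage, seed=0):
    """Draw enough reads to reach the requested fold coverage (rounded)."""
    rng = random.Random(seed)
    n_reads = round(coverage * len(genome) / read_len)
    return [sample_circular_read(genome, read_len, rng) for _ in range(n_reads)]


if __name__ == "__main__":
    for read in reads_for_coverage("ACGTACGTAA", read_len=5, coverage=4):
        print(read)
```

Treating the genome as linear would leave the regions next to the record boundaries under-covered, which matters when the alleles of interest sit close to the origin of the sequence record.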
1.4. Detection of antimicrobial resistance genes
The detection of ARGs at the allele level generates one of the key inputs for GeneMates. My code developed for this task implements two utilities: obtaining ARG profiles and consensus allele sequences from genome assemblies in an SRST2-compatible format, and assigning allele identifiers across ARG profiles based on nucleotide sequence identity.
geneDetector
This code package carries out parallel targeted ARG-detection jobs for contig-level genome assemblies and produces SRST2-compatible gene profiles and consensus allele sequences. The Python script detector.py is the core of the tool's five scripts.
PAMmaker
This pipeline takes as input gene profiles and consensus allele sequences from SRST2 and geneDetector, assesses the reliability and uniqueness of allele calls, calls CD-HIT-EST to perform sequence clustering, assigns allele identifiers based on sequence clusters, and produces an allelic PAM across bacterial samples. This matrix is a required input of GeneMates.
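The last two steps, assigning allele identifiers and building an allelic PAM, can be sketched in Python as follows. This is not PAMmaker code. As a loudly stated simplification, exact sequence identity stands in for CD-HIT-EST clustering, and the function name, the input tuples and the "gene.number" identifier scheme are assumptions.

```python
def build_allelic_pam(calls):
    """Assign allele identifiers and build a binary presence-absence matrix.

    calls: list of (sample, gene, consensus_sequence) tuples (assumed input layout).
    Sequences of the same gene that are identical (ignoring case) share an allele
    identifier; a real pipeline would use sequence clusters instead.
    Returns (samples, alleles, matrix) with matrix[i][j] = 1 when samples[i]
    carries alleles[j].
    """
    allele_ids = {}    # (gene, sequence) -> allele identifier
    next_index = {}    # gene -> number of alleles seen so far
    presence = {}      # sample -> set of allele identifiers
    for sample, gene, seq in calls:
        key = (gene, seq.upper())
        if key not in allele_ids:
            next_index[gene] = next_index.get(gene, 0) + 1
            allele_ids[key] = f"{gene}.{next_index[gene]}"
        presence.setdefault(sample, set()).add(allele_ids[key])
    samples = sorted(presence)
    alleles = sorted(set(allele_ids.values()))
    matrix = [[1 if a in presence[s] else 0 for a in alleles] for s in samples]
    return samples, alleles, matrix


if __name__ == "__main__":
    example = [
        ("s1", "sul1", "ACGT"),
        ("s2", "sul1", "ACGA"),
        ("s2", "tetA", "GGCC"),
        ("s3", "sul1", "acgt"),
    ]
    samples, alleles, matrix = build_allelic_pam(example)
    print(alleles)
    for sample, row in zip(samples, matrix):
        print(sample, row)
```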
SRST2_toolkit
This is a package of Python scripts that process SRST2 outputs (directory Genotyping) and curate an SRST2-compatible reference database (directory Db_curation).
1.5. Measurement and evaluation of allelic physical distances
APDtools
This code package runs Bandage (distance) to measure APDs in genome assemblies (dist_from_graphs.py and compile_dists.py), calculates true APDs in complete genomes (calc_dr.R), pairs APD measurements with true APDs (merge_dr_ds.R), prioritises APD measurements based on their sources (prioritise_dists.R), and evaluates their accuracy given a maximum node number or distance (accuracy_vs_nodes.R and accuracy_vs_dist.R). In addition, I provide a script dist2cluster.py in this package to obtain the minimum physical distance between two ARG allele clusters in each genome in which they are co-localised.
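The minimum distance between two allele clusters per genome can be sketched in Python as below. This is an illustration of the idea rather than the code of dist2cluster.py: the function name and the (genome, allele1, allele2, distance) record layout are assumptions.

```python
def min_cluster_distances(records, cluster_a, cluster_b):
    """Return the minimum distance between two allele clusters for each genome.

    records: iterable of (genome, allele1, allele2, distance_bp) tuples, one per
             measured allele pair (hypothetical input layout).
    cluster_a, cluster_b: iterables of allele names.
    Genomes without any measured pair linking the two clusters are omitted.
    """
    set_a, set_b = set(cluster_a), set(cluster_b)
    best = {}
    for genome, x, y, dist in records:
        # Keep only pairs with one allele in each cluster, in either order.
        links_clusters = (x in set_a and y in set_b) or (x in set_b and y in set_a)
        if not links_clusters:
            continue
        if genome not in best or dist < best[genome]:
            best[genome] = dist
    return best


if __name__ == "__main__":
    measurements = [
        ("g1", "a1", "b1", 5000),
        ("g1", "a2", "b1", 1200),
        ("g1", "a1", "a2", 300),   # within cluster A, ignored
        ("g2", "b2", "a1", 800),
    ]
    print(min_cluster_distances(measurements, ["a1", "a2"], ["b1", "b2"]))
```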
1.6. General bioinformatic practice
BINF_toolkit
This is the most starred of my GitHub repositories. It provides scripts for routine bioinformatic practice. In particular, I highlight my script gbk2tbl.py, which converts a GenBank file (.gbk or .gb) into a Sequin feature table (.tbl) that is required by GenBank's tbl2asn for sequence submission.
2. Statistics
2.1. General application
handyR
This repository consists of R functions that perform common statistical tests and data processing tasks.