Team III Functional Annotation Group

From Compgenomics 2020

Group Members - Allison, Gulay Bengu Ulukaya, Cheng, Pallavi Misra

Introduction

Initial Pipeline

Clustering

USEARCH

We clustered the genes by sequence similarity to reduce redundant annotation of near-identical genes. Clustering was performed with the UCLUST algorithm. UCLUST builds each cluster around a single centroid sequence; every other sequence in the cluster must meet a minimum sequence identity to that centroid to be considered a part of the cluster. This identity threshold can be thought of as the radius of the cluster.
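The centroid idea can be illustrated with a minimal sketch. This is not UCLUST itself: the `identity` function below is a toy position-by-position comparison without alignment, and the sequences and function names are hypothetical, chosen only to show how the threshold acts as a cluster radius.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: matching positions divided by the longer length (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_centroid_clusters(seqs: dict, threshold: float) -> dict:
    """Assign each sequence to the first centroid it matches at >= threshold;
    a sequence that matches no centroid becomes the centroid of a new cluster."""
    clusters = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, 20 nt each
toy = {
    "geneA": "ATGGCTAAAGGTCTGACCGA",
    "geneB": "ATGGCTAAAGGTCTGACCGT",  # one mismatch vs geneA (95% identical)
    "geneC": "ATGTTTCCCGGGAAATTTCC",  # unrelated
}
clusters_95 = greedy_centroid_clusters(toy, 0.95)
clusters_97 = greedy_centroid_clusters(toy, 0.97)
print(clusters_95)
print(clusters_97)
```

In this toy, geneB joins geneA's cluster at a 95% threshold but becomes its own centroid at 97%, which is the same trade-off between cluster size and specificity discussed for the real data.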

For our analysis we chose an identity threshold of 97% because it gave larger average cluster sizes and a relatively low proportion of singletons (1%).

The number of clusters rises steeply as the identity threshold increases and levels off as the threshold decreases. Based on this trend, 97% appeared to be a suitable choice for our analyses. Clustering reduced the number of genes needing annotation from approximately 150,000 to 11,720.

Although 95% identity gave a larger average cluster size, we wanted a higher degree of specificity in our results.
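The statistics used to compare thresholds (number of clusters, average cluster size, share of singletons) could be computed along these lines. This sketch assumes the clustering was written in USEARCH's tab-separated .uc format, where each `C` record describes one cluster and its third column is the cluster size; the page does not state which output file was used. The singleton share is taken here as a fraction of clusters, which is also an assumption, and the toy records are hypothetical.

```python
import io


def summarize_uc(handle):
    """Summarise cluster sizes from the 'C' (cluster) records of a .uc file."""
    sizes = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "C":              # one C record per cluster
            sizes.append(int(fields[2]))  # third column: cluster size
    n = len(sizes)
    singletons = sum(1 for s in sizes if s == 1)
    return {
        "clusters": n,
        "mean_size": sum(sizes) / n if n else 0.0,
        "singleton_fraction": singletons / n if n else 0.0,
    }


# Hypothetical .uc content: two clusters, one of them a singleton
toy_uc = io.StringIO(
    "S\t0\t900\t*\t*\t*\t*\t*\tgeneA\t*\n"
    "H\t0\t900\t98.0\t+\t0\t0\t900M\tgeneB\tgeneA\n"
    "S\t1\t450\t*\t*\t*\t*\t*\tgeneC\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tgeneA\t*\n"
    "C\t1\t1\t*\t*\t*\t*\t*\tgeneC\t*\n"
)
summary = summarize_uc(toy_uc)
print(summary)
```

Running the same summary on the output of each candidate threshold would give directly comparable numbers.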

Homology Tools

eggNOG mapper

eggNOG-mapper is a tool for functional annotation based on precomputed orthologous clusters in the eggNOG database. Annotation proceeds in three steps:

1) Sequence mapping using HMMER3 or DIAMOND. For our analyses we used DIAMOND because of the size of our input and because it is recommended over HMMER3 when annotating organisms with close relatives among the species covered by eggNOG.

2) Orthology assignment.

3) Functional annotation, which is restricted to the closest orthologs to reduce false positives.

CARD-RGI

The Comprehensive Antibiotic Resistance Database (CARD) is a rigorously curated collection of characterized, peer-reviewed antibiotic resistance genes that is updated monthly. The Resistance Gene Identifier (RGI) is a toolkit based on CARD for annotating antimicrobial resistance genes.

VFDB

The Virulence Factor Database (VFDB) is an integrated and comprehensive online resource for curated information about virulence factors of bacterial pathogens (most recently updated in 2019). The database contains information such as structural features of the virulence factors, their functions, and the mechanisms pathogens use to circumvent host defenses and cause disease. We downloaded the core dataset of DNA sequences from the VFDB website, which includes only genes associated with experimentally verified virulence factors. We built a BLAST database from the downloaded dataset and searched it with BLASTN.
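Downstream handling of the BLASTN hits might look like the sketch below. It assumes the search was run with tabular output (`-outfmt 6`, the standard 12-column layout), which is not part of the command shown on this page, and it keeps the highest-scoring hit per query under an E-value cutoff. The function name and the subject identifiers in the toy data are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: subject} keeping the highest-bitscore hit per query
    among hits with evalue <= max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits: gene_1 has two candidates, gene_2 fails the E-value cutoff
toy_hits = io.StringIO(
    "gene_1\tVF_example_2\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-40\t180\n"
    "gene_1\tVF_example_1\t99.1\t300\t2\t0\t1\t300\t1\t300\t2e-150\t540\n"
    "gene_2\tVF_example_3\t70.0\t50\t15\t0\t1\t50\t1\t50\t1e-3\t40\n"
)
hits = best_hits(toy_hits)
print(hits)
```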

Ab-initio Tools

PILERCR

CRISPRs are a family of DNA sequences found in the genomes of prokaryotic organisms (bacteria and archaea). They are derived from DNA fragments of viruses that previously infected the prokaryote, and they play a major role in the antiviral defense system. PILERCR identifies CRISPR repeats by using BLAST to find their fragmented or degraded copies. A CRISPR array is reported when a set of CRISPR repeats is separated by unique intervening sequences known as spacers. The program provides fast identification and classification of CRISPR repeats with both high sensitivity and high specificity.
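The array criterion (repeats separated by unique spacers) can be illustrated with a toy check. This is not PILERCR's algorithm: it assumes the repeat sequence is already known and matches exactly, and the `min_repeats` parameter, function names, and sequences are hypothetical.

```python
def spacers_between(array_seq: str, repeat: str) -> list:
    """Return the sequences lying between consecutive exact copies of the repeat."""
    parts = array_seq.split(repeat)
    return parts[1:-1]  # drop the flanking sequence before the first and after the last repeat


def looks_like_crispr_array(array_seq: str, repeat: str, min_repeats: int = 3) -> bool:
    """True if there are enough repeats and every spacer is non-empty and unique."""
    if array_seq.count(repeat) < min_repeats:
        return False
    spacers = spacers_between(array_seq, repeat)
    return all(spacers) and len(set(spacers)) == len(spacers)


# Hypothetical repeat and spacers
repeat = "GTTTTAGAGCTATGCT"
spacer_1 = "AAACCCGGGTTT"
spacer_2 = "TTGACCATGCAA"

good_array = repeat + spacer_1 + repeat + spacer_2 + repeat
duplicated = repeat + spacer_1 + repeat + spacer_1 + repeat

print(spacers_between(good_array, repeat))
print(looks_like_crispr_array(good_array, repeat))
print(looks_like_crispr_array(duplicated, repeat))
```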

HMMTOP

HMMTOP (Hidden Markov Model for Topology Prediction) is a transmembrane topology prediction tool. It predicts the location of helical transmembrane segments, including their start and end positions in the sequence, and the overall topology of transmembrane proteins. The method is based on the hypothesis that the topology of transmembrane proteins is determined by the maximum divergence of amino acid composition of sequence segments.

SIGNALP

Signal peptides are short amino acid sequences of newly synthesized proteins that target proteins into, or across, membranes. SignalP 5.0 distinguishes three types of signal peptides in prokaryotes: Sec substrates cleaved by SPase I (Sec/SPI), Sec substrates cleaved by SPase II (Sec/SPII), and Tat substrates cleaved by SPase I (Tat/SPI). SignalP consists of two different predictors based on neural network and hidden Markov model algorithms. In order to predict potential signal peptides of proteins, the D-score from the SignalP output is used for discrimination of signal peptide versus non-signal peptide.

Merge Results

We merged our final results by first converting any tool output that was not already in .gff format to .gff, and then mapping the annotations from each tool onto every gene of every sample to produce merged .gff files. From these merged .gff files we used bedtools getfasta to extract FASTA files and EMBOSS transeq to translate the nucleotide sequences into protein sequences.
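A minimal sketch of the merge idea follows. It assumes annotations were made on cluster centroids and are carried over to each gene through a gene-to-centroid lookup; the data structures, the `merged` source label, the attribute names, and the function names are all hypothetical and are not the group's actual scripts. The `translate` helper only illustrates frame-1 translation with the standard genetic code and does not replace EMBOSS transeq.

```python
from urllib.parse import quote

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def annotations_for_gene(gene_id, cluster_of, annotations_by_tool):
    """Collect each tool's annotation of the gene's centroid (if any)."""
    centroid = cluster_of.get(gene_id, gene_id)
    return {tool: hits[centroid]
            for tool, hits in annotations_by_tool.items() if centroid in hits}


def gff3_line(seqid, start, end, strand, gene_id, annotations):
    """Format one 9-column GFF3 feature line with percent-encoded attribute values."""
    attrs = ["ID=" + quote(gene_id, safe="")]
    for tool, value in sorted(annotations.items()):
        attrs.append(tool + "=" + quote(value, safe=" "))
    return "\t".join([seqid, "merged", "CDS", str(start), str(end),
                      ".", strand, "0", ";".join(attrs)])


def translate(nt):
    """Translate frame 1; '*' marks stops, 'X' marks codons with ambiguous bases."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


# Hypothetical inputs
cluster_of = {"gene_2": "gene_1"}  # gene_2 belongs to the cluster whose centroid is gene_1
annotations_by_tool = {
    "eggnog": {"gene_1": "COG0001"},
    "card": {"gene_1": "beta-lactamase"},
    "vfdb": {},
}
merged = annotations_for_gene("gene_2", cluster_of, annotations_by_tool)
line = gff3_line("contig_1", 10, 300, "+", "gene_2", merged)
print(line)
print(translate("ATGGCCTAA"))
```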

Final Pipeline

Results

eggNOG-mapper

$ download_eggnog_data.py bact
$ emapper.py -i ../USEARCH/All_Centroid.fasta -m diamond --translate -d bact -o eggNog_annotations

CARD-RGI

$ rgi -i <input_file> -o <output_file>

VFDB

$ makeblastdb -in <input_db> -parse_seqids -blastdb_version 5 -dbtype nucl -out <name_db>
$ blastn -db <name_db> -query <input_file> -out <output_file> -max_hsps 1 -max_target_seqs 1 -num_threads 4 -evalue 1e-5

PILERCR

$ ./pilercr -in <input_file> -out <output_file>

HMMTOP

$ ./hmmtop -if=<input_file> -of=<output_file>

SIGNALP

$ ./signalp -fasta <input_file> -org gram+ -format short -gff3 -prefix <output_file_prefix>

References

Huerta-Cepas, Jaime, et al. "Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper." Molecular biology and evolution 34.8 (2017): 2115-2122.

Edgar, Robert C. "Search and clustering orders of magnitude faster than BLAST." Bioinformatics 26.19 (2010): 2460-2461.

Buchfink, Benjamin, Chao Xie, and Daniel H. Huson. "Fast and sensitive protein alignment using DIAMOND." Nature methods 12.1 (2015): 59.

Barrangou, Rodolphe. "The roles of CRISPR-Cas systems in adaptive immunity and beyond." Current opinion in immunology 32 (2015): 36-41. doi:10.1016/j.coi.2014.12.008

Edgar, Robert C. "PILER-CR: fast and accurate identification of CRISPR repeats." BMC bioinformatics 8.1 (2007): 18.

Alcock, Brian P., et al. "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic acids research 48.D1 (2020): D517-D525.

Liu, Bo, et al. "VFDB 2019: a comparative pathogenomic platform with an interactive web interface." Nucleic acids research 47.D1 (2019): D687-D692.

Almagro Armenteros, José Juan, et al. "SignalP 5.0 improves signal peptide predictions using deep neural networks." Nature biotechnology 37.4 (2019): 420-423. www.nature.com/articles/s41587-019-0036-z

Tusnády, Gábor E., and István Simon. "The HMMTOP transmembrane topology prediction server." Bioinformatics 17.9 (2001): 849-850. academic.oup.com/bioinformatics/article/17/9/849/206573