Team I Functional Annotation Group

From Compgenomics 2020
Revision as of 23:53, 21 April 2020 by Mvegesna3 (talk | contribs) (→‎TMHMM Results)

Team 1 Functional Annotation

Team members: Maria Ahmad, Shuheng Gan, Manasa Vegesna, Hyeonjeong Cheon, and Kenji Gerhardt

Introduction

Background

Functional annotation is the process of associating predicted genes with their functional role in a cell. This includes the type of gene (translated/untranslated), its location within the cell, and its chemical and biological roles. These can be derived either by comparison to similar, already annotated genes with known (or anticipated) functions, or through ab initio annotation, which relies on existing models but not on databases of other annotated genes.

Objective

Our goal in this step is to functionally annotate the genes supplied to us by the gene prediction group.

Data

The genomes supplied to us were identified during assembly as Escherichia coli. E. coli is the most highly studied microorganism: at present, NCBI contains 1072 complete E. coli genomes in its genome database, along with thousands of additional chromosomes, contigs, and scaffolds. This depth of study extends to its genes, where multiple pathogenic strains are identified and characterized well.

Pipeline

[Figure: pipeline]

Clustering

Protein clustering is the task of reducing a large number of sequences down to a smaller set of cluster representatives. A cluster representative is a single sequence whose functional characteristics are the same as, or very similar to, those of the other sequences in its cluster. The point of clustering is to produce a smaller set of representative sequences that is computationally easier and quicker to annotate, and whose annotation can then be applied to the other members of the cluster each one represents.

To cluster sequences, the tool we selected for our final pipeline (USearch) applies a boundary based on percent identity between sequences in a dataset. This boundary is the minimum identity that two sequences must share to be included in the same cluster; sequences that fall below the boundary will either be clustered with a different sequence or will become their own cluster if they fail to cluster with any other sequence. The ideal boundary therefore captures sequences that share a function within single clusters and does not place sequences with different functions in the same cluster.

In practice, such an ideal boundary will always be beset with problems: very similar sequences may have different functions, very disparate sequences may share function, and the similarity required to share function will vary by protein. Further, due to the degeneracy of the translation between nucleotides and amino acids, sequences which are similar at the protein level may be highly divergent at the nucleotide level (untranslated genes excepted). As a result, performance will be imperfect, and a decision between accuracy of clustering results and efficacy of clustering has to be made.

For the purposes of our work, we decided to cluster amino acid sequences rather than nucleotide sequences for the majority of our tasks. Protein function is ultimately dictated by amino acid sequence, at least insofar as the data that annotation tools receive from a user. Although most tools are designed to work with both nucleotide and amino acid sequences, clustering nucleotide sequences with the aim of grouping by protein function is a largely misguided effort.

NCBI guidelines [28] indicate that protein functional similarity is very likely at 40% amino acid identity. However, since our task is performed in the context of 50 samples of the same species, we decided upon a much more conservative 70% amino acid identity boundary to cluster our sequences. This produces more clusters than is likely necessary, but is less likely to cluster proteins with disparate functions than a more lenient standard.

USearch

The tool our pipeline uses is called USearch. Among other functions, USearch performs sequence clustering using a greedy agglomerative algorithm, meaning it creates clusters and assigns cluster identity in the order of input sequences. Identity is calculated through the use of kmers in a heuristic manner. This approach is approximate, but performs well in most cases and is widely utilized by other tools.

While it is possible that this order-dependent behavior may result in a sequence being assigned to a cluster that it does not ideally join, the conservative cutoff for sequence similarity that we selected makes it unlikely that this will affect function. More probably, a sequence will be assigned to one of a few similar clusters which in a perfect case scenario should be joined, as the multiple clusters share the same function.
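The order dependence described above can be illustrated with a toy sketch. This is not USearch's implementation: the identity measure here is a naive position-by-position comparison (the hypothetical helper `naive_identity`), whereas USearch relies on k-mer heuristics. The sketch only shows how assigning sequences in input order determines which sequences become cluster representatives.

```python
def naive_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the longer length (toy measure, no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.7):
    """Assign each sequence, in input order, to the first representative it matches.

    seqs: list of (name, sequence) tuples.
    Returns a dict mapping representative name -> list of member names.
    """
    representatives = []  # (name, sequence) of each cluster seed, in creation order
    clusters = {}
    for name, seq in seqs:
        for rep_name, rep_seq in representatives:
            if naive_identity(seq, rep_seq) >= threshold:
                clusters[rep_name].append(name)
                break
        else:
            # No representative was similar enough: this sequence seeds a new cluster.
            representatives.append((name, seq))
            clusters[name] = [name]
    return clusters


toy = [
    ("p1", "MKTAYIAKQR"),
    ("p2", "MKTAYIAKQK"),  # 9 of 10 positions match p1
    ("p3", "GGGGGGGGGG"),  # shares no positions with p1
]
print(greedy_cluster(toy))
```

Reversing the input order would make `p2` the representative of the first cluster instead of `p1`, which is the order-dependent behavior discussed above.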

The result of USearch is a pair of files: a tab-separated file listing each cluster representative, the sequences belonging to that representative, and their identity to the representative; and a second file containing just the representative sequences in FASTA format, which is friendly to annotation tools.
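A minimal sketch of how the tab-separated file could be read back into a mapping from representative to members, so that annotations can later be propagated. It assumes the file follows the USEARCH `.uc` layout (record type in column 1, percent identity in column 4, query label in column 9, representative label in column 10); the function name `parse_uc` and the example labels are hypothetical.

```python
def parse_uc(lines):
    """Map each representative label to a list of (member label, percent identity)."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record_type = fields[0]
        if record_type == "S":
            # Seed record: the query itself is the cluster representative.
            clusters.setdefault(fields[8], [])
        elif record_type == "H":
            # Hit record: the query joins the cluster of the target representative.
            clusters.setdefault(fields[9], []).append((fields[8], float(fields[3])))
        # Other record types (e.g. "C" cluster summaries) repeat information and are skipped.
    return clusters


example = [
    "S\t0\t120\t*\t*\t*\t*\t*\tgeneA\t*",
    "H\t0\t118\t95.8\t+\t0\t0\t118M\tgeneB\tgeneA",
    "S\t1\t200\t*\t*\t*\t*\t*\tgeneC\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tgeneA\t*",
]
print(parse_uc(example))
```

With a real file, the same function can be called on an open file handle, since it only iterates over lines.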

[Figure: usearch]

Ab-initio Approach

Ab initio tools predict and annotate different regions of the prokaryotic genome using 1) sequence composition, 2) likelihoods within the gene models, 3) gene content and 4) signal detection. The ab initio approach can be used for finding new genes, and no external data or evidence is needed for the prediction. However, it is limited by false positives in the predicted data as well as by over-prediction of small genes.

Prediction of Signal Peptides

A signal peptide is a short peptide (typically 16-30 amino acids long) present at the N-terminus of newly synthesized proteins that are destined for the secretory pathway. In prokaryotes, signal peptides are known to direct the synthesized protein to specific protein channels.

Ab-Initio tools take advantage of their common structure to predict the presence of signal peptides in the given protein sequences. The signal peptide structure is described as a positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.

SignalP 5.0

The SignalP 5.0 tool predicts the presence of signal peptides and the location of their cleavage sites in proteins from Archaea, Gram-positive Bacteria, Gram-negative Bacteria, and Eukarya.

For Bacteria and Archaea, SignalP 5.0 discriminates between three types of signal peptides: Sec/SPI (standard signal peptide), Sec/SPII (lipoprotein signal peptide) and Tat/SPI (Tat signal peptide).

SignalP 5.0 is based on a deep convolutional and recurrent neural network architecture including a conditional random field.

Input file: FASTA file
Output file: gff3 file

Command used:

signalp -fasta <input file path> -org gram- -format short -gff3 -prefix <output_file_path>
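A minimal sketch for tallying the predicted signal peptide types from a SignalP 5.0 run. It assumes the short-format summary file is tab-separated, with comment lines starting with `#`, the sequence ID in column 1 and the prediction label in column 2; the function name `count_predictions` and the example rows are hypothetical.

```python
from collections import Counter


def count_predictions(lines):
    """Count how many sequences received each prediction label (column 2)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[1]] += 1
    return counts


example = [
    "# SignalP-5.0\tOrganism: gram-",
    "# ID\tPrediction\tSP(Sec/SPI)\tTAT(Tat/SPI)\tLIPO(Sec/SPII)\tOTHER\tCS Position",
    "gene1\tSP(Sec/SPI)\t0.98\t0.00\t0.01\t0.01\tCS pos: 21-22",
    "gene2\tOTHER\t0.01\t0.00\t0.00\t0.99\t",
    "gene3\tLIPO(Sec/SPII)\t0.02\t0.00\t0.97\t0.01\tCS pos: 18-19",
    "gene4\tOTHER\t0.00\t0.00\t0.00\t1.00\t",
]
print(count_predictions(example))
```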

Prediction of Transmembrane Proteins

A transmembrane protein is an integral membrane protein that spans the entirety of the cell membrane. Transmembrane proteins are crucial components of cell-cell signaling and mediate the transport of ions and solutes across the membrane. Transmembrane helices are a basic structural element of transmembrane proteins.

TMHMM

TMHMM is a membrane protein topology prediction tool that focuses on the prediction of transmembrane helices in proteins with high accuracy. Its accuracy can be compromised in the presence of signal peptides, whose hydrophobic region can be mistaken for a transmembrane helix.

Input file: FASTA file
Output file: gff file

Command used:

cat <input file path> | tmhmm > <output file path>
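A minimal sketch for summarizing the TMHMM output produced by the command above. It assumes the default (long) output, in which each sequence has a comment line of the form `# <id> Number of predicted TMHs:  <n>`; the function name `count_tm_helices` and the example lines are hypothetical.

```python
import re

# Matches the per-sequence summary line of the assumed long output format.
TMH_LINE = re.compile(r"^#\s+(\S+)\s+Number of predicted TMHs:\s+(\d+)")


def count_tm_helices(lines):
    """Return {sequence id: number of predicted transmembrane helices}."""
    helices = {}
    for line in lines:
        match = TMH_LINE.match(line)
        if match:
            helices[match.group(1)] = int(match.group(2))
    return helices


example = [
    "# geneA Length: 250",
    "# geneA Number of predicted TMHs:  2",
    "# geneA Exp number of AAs in TMHs: 44.1",
    "geneA\tTMHMM2.0\tinside\t     1    10",
    "# geneB Length: 100",
    "# geneB Number of predicted TMHs:  0",
]
counts = count_tm_helices(example)
transmembrane = [seq_id for seq_id, n in counts.items() if n > 0]
print(counts, transmembrane)
```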

Prediction of CRISPR Regions

CRT

  • Predicts CRISPR regions using a k-mer-based approach
  • Fast and memory efficient, with a high recall rate and quality
  • Faster for genomes containing a larger number of repeats

PILER-CR

  • Predicts CRISPR regions based on identified repeats. Its default options are: a minimum of 3 repeats per array, a minimum repeat length of 16, a maximum repeat length of 64, a minimum spacer length of 8, a maximum spacer length of 64, a minimum repeat ratio of 0.9, and a minimum spacer ratio of 0.75.
  • High sensitivity and fast: completes a 5 Mb genome in 5 seconds
  • Higher precision than the CRISPR Recognition Tool (CRT), so fewer false positives
  • For this reason, we chose PILER-CR rather than CRT.

Command used:

pilercr -in <input file> -out <output file> -minrepeat <n> -minspacer <n>
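To make the default options listed above concrete, the following sketch checks whether a candidate array would satisfy them. It is only an illustration of the thresholds, not PILER-CR's detection algorithm. It assumes that the repeat and spacer ratios mean shortest length divided by longest length; the function name `passes_defaults` and its parameter names are hypothetical.

```python
def passes_defaults(repeat_lengths, spacer_lengths,
                    min_array=3, min_repeat=16, max_repeat=64,
                    min_spacer=8, max_spacer=64,
                    min_repeat_ratio=0.9, min_spacer_ratio=0.75):
    """Return True if the repeat and spacer lengths of a candidate array meet every threshold."""
    if len(repeat_lengths) < min_array:
        return False
    if any(not (min_repeat <= r <= max_repeat) for r in repeat_lengths):
        return False
    if any(not (min_spacer <= s <= max_spacer) for s in spacer_lengths):
        return False
    # Assumed definition: ratio = shortest length / longest length.
    if min(repeat_lengths) / max(repeat_lengths) < min_repeat_ratio:
        return False
    if spacer_lengths and min(spacer_lengths) / max(spacer_lengths) < min_spacer_ratio:
        return False
    return True


print(passes_defaults([29, 29, 29], [32, 33]))  # three uniform repeats, similar spacers
print(passes_defaults([29, 29], [32]))          # only two repeats
print(passes_defaults([29, 29, 29], [32, 60]))  # spacer lengths too uneven
```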

Homology-based Tools

Homology between genes means they share ancestry. Homologous genes that have recently diverged usually share function. We aim to transfer annotations from known genes to our predicted genes by finding homologous genes. Gene databases are collections of annotated genes. When we search a gene against a database, the search looks for homology between our gene sequences and those in the database in order to infer what our genes' functions are likely to be.

Homology-based tools are more accurate and reliable than Ab-initio tools, and can be targeted for specific purposes, e.g. antibiotic resistance genes. However, homology tools are dependent on existing annotations and on what databases are being searched.

[Figure: homology]

EggNOG-mapper

  • Performs fast functional annotation of genes and proteins
  • Uses precomputed orthologous groups and phylogenies from the eggNOG database, which consists of Orthologous Groups (OGs) of proteins at different taxonomic levels.
  • Offers two search modes, DIAMOND and HMM. DIAMOND mode, recommended for large datasets, searches for the best seed ortholog in the eggNOG protein database and is faster than HMM mode. HMM mode searches against a collection of precompiled hidden Markov models.

To download bacteria databases:

python2.7 download_eggnog_data.py bact

Command:

python emapper.py -i <input_file> --output <output_file> --output_dir <output directory> -m diamond -d bact

[Figure: eggnog]

InterProScan

  • InterProScan is a functional annotation tool that scans sequences against the InterPro protein signature databases. It classifies proteins into families and predicts domains and important sites.
  • Signatures are predictive models constructed from multiple sequence alignments that can be used to classify proteins. Using protein signatures is often a more sensitive way of identifying protein function than pairwise sequence similarity searches, such as BLAST.
  • InterPro combines four types of protein signatures (patterns, profiles, fingerprints and HMMs) from multiple databases.

Download (https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload)

# download InterProScan and most databases
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.42-78.0/interproscan-5.42-78.0-64-bit.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.42-78.0/interproscan-5.42-78.0-64-bit.tar.gz.md5
md5sum -c interproscan-5.42-78.0-64-bit.tar.gz.md5

# download Panther
cd [InterProScan5 home]/data/
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz.md5
md5sum -c panther-data-14.1.tar.gz.md5

Requirement (https://github.com/ebi-pf-team/interproscan/wiki/InstallationRequirements)

- Perl 5.22
- Python 3
- Java 11

Command

interproscan.sh -i <input_file_name> -dp -d <output_directory> -appl <databases_you_choose> -f <output_format> -t <sequence_type>
parameters:
-d: optional, default current directory [./]
-appl: optional, default all 15 analyses listed here [CDD-3.17,Coils-2.2.1,Gene3D-4.2.0,Hamap-2020_01,MobiDBLite-2.0,PANTHER-14.1,Pfam-32.0,PIRSF-3.02,
PRINTS-42.0,ProSitePatterns-2019_11,ProSiteProfiles-2019_11,SFLD-4,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0]
-f: optional, default [gff3, tsv, and xml for protein | gff3 and xml for nucleotide]
-t: optional, n/p, default protein [p]
runtime of default setting: ~12h
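A minimal sketch for summarizing the tsv output, counting how many distinct proteins each analysis (member database) annotated. It relies on the InterProScan tsv layout in which column 1 is the protein accession and column 4 is the analysis name; the function name `proteins_per_analysis` and the example rows are hypothetical.

```python
from collections import defaultdict


def proteins_per_analysis(lines):
    """Return {analysis name: number of distinct proteins with at least one hit}."""
    hits = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        protein, analysis = fields[0], fields[3]
        hits[analysis].add(protein)
    return {analysis: len(proteins) for analysis, proteins in hits.items()}


example = [
    "gene1\tmd5a\t300\tPfam\tPF00005\tABC transporter\t10\t150\t1.2E-20\tT\t21-04-2020",
    "gene1\tmd5a\t300\tPfam\tPF00664\tABC transporter TM region\t160\t290\t3.0E-15\tT\t21-04-2020",
    "gene2\tmd5b\t120\tPfam\tPF00072\tResponse regulator\t5\t110\t4.1E-25\tT\t21-04-2020",
    "gene2\tmd5b\t120\tSMART\tSM00448\tREC\t3\t112\t2.0E-30\tT\t21-04-2020",
]
print(proteins_per_analysis(example))
```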

DeepARG

DeepARG is a deep learning tool that annotates antibiotic resistance genes (ARGs) in metagenomes. It is composed of two models for two types of input: DeepARG-SS for short sequence reads from Next Generation Sequencing (NGS) technologies such as Illumina, and DeepARG-LS for long gene-like sequences from assembled samples.

Download

git clone https://bitbucket.org/gusphdproj/deeparg-ss

Requirement

- Python 2.7

Command

python ./deepARG.py --align --genes --type prot --input <gene-like_sequences_fasta_file> --out <output_file_name>
output:
- output_file_name.align.daa, DIAMOND alignment archive
- output_file_name.align.daa.tsv, tabular format. Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
- output_file_name.mapping.ARG, contains the sequences with a probability >= --prob (0.8 default)
- output_file_name.mapping.potential.ARG, contains the sequences with a probability < --prob (0.8 default)
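A minimal sketch for counting how many sequences in a `mapping.ARG` file fall into each antibiotic class. It assumes the file is tab-separated with a header row and that the class is stored in a column named `predicted_ARG-class`; that column name is an assumption and is exposed as a parameter so it can be changed. The function name `count_arg_classes` and the reduced example table are hypothetical.

```python
import csv
import io
from collections import Counter


def count_arg_classes(handle, class_column="predicted_ARG-class"):
    """Count rows per antibiotic class in a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[class_column] for row in reader)


# Reduced, made-up table used only to exercise the function.
example = io.StringIO(
    "#ARG\tread_id\tpredicted_ARG-class\tprobability\n"
    "MDTH\tgene1\tmultidrug\t0.99\n"
    "ACRB\tgene2\tmultidrug\t0.98\n"
    "BACA\tgene3\tbacitracin\t0.97\n"
)
class_counts = count_arg_classes(example)
print(class_counts.most_common())
```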

Results

Clustering

Because each of our samples contained a single E. coli genome, we concatenated the amino acid sequences produced by gene prediction across samples and clustered the resulting file. The samples were combined so that genes shared between different strains could be clustered and thereby annotated together. Although genes may have multiple copies within a single sample, it is more likely given the structure of our data that clusters would emerge between, rather than within, samples.

The concatenation of the samples produced 231894 protein sequences, which when clustered at 70% amino acid identity were reduced to only 7361 representative sequences.
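The size of this reduction can be worked out directly from the two counts above. The variable names in this short calculation are illustrative only.

```python
total_sequences = 231894   # protein sequences after concatenating all samples
representatives = 7361     # cluster representatives at 70% amino acid identity

reduction_factor = total_sequences / representatives
fraction_kept = representatives / total_sequences

print(f"{reduction_factor:.1f}-fold reduction; "
      f"{fraction_kept:.1%} of sequences need to be annotated directly")
```

In other words, each representative stands in for roughly 31 to 32 sequences on average, so only about 3% of the original sequences have to be passed to the annotation tools.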

Ab-initio Approach

SignalP v.5 Results

We determined signal peptides using the SignalP v.5 tool on both clustered sequences and unclustered amino acid sequences.

Results for Clustered Sequences:

Results for Unclustered Sequences:

Types of Signal Peptides Identified in Clustered Sequences:

Types of Signal Peptides Identified in Unclustered Sequences:

TMHMM Results

We predicted transmembrane proteins using TMHMM on both clustered and unclustered amino acid sequences.

Results for Clustered Sequences:

Results for Unclustered Sequences:

PilerCR Results

Homology Approach

EggNOG-mapper Results

Results

Total: 226422 / Average: 4717.13 / Maximum: 4340 / Minimum: 4340

[Figure: eggnogresult]

InterProScan Results

On average, 94% of genes (4452) received annotations in each sample.

Runtime: ~12h for all databases


Interproscan Annotation Result

DeepARG Results

On average, 106 genes per sample were annotated as ARGs, covering 83 kinds of ARG.

Runtime: ~46s

DeepARG Annotation Result

The 16 antibiotic classes found in the DeepARG results:

  1. aminoglycoside
  2. aminoglycoside:aminocoumarin
  3. bacitracin
  4. beta-lactam
  5. diaminopyrimidine
  6. fluoroquinolone
  7. fosmidomycin
  8. glycopeptide
  9. MLS
  10. multidrug
  11. peptide
  12. phenicol
  13. pleuromutilin
  14. rifamycin
  15. sulfonamide
  16. tetracycline

In-Class Presentations

References

[1] Joshi, T., & Xu, D. (2007). Quantitative assessment of relationship between sequence similarity and function similarity. BMC genomics, 8(1), 222.

[2] Zou, Q., Lin, G., Jiang, X., Liu, X., & Zeng, X. (2020). Sequence clustering in bioinformatics: an empirical study. Briefings in bioinformatics, 21(1), 1-10.

[3] Nielsen, H., Tsirigos, K. D., Brunak, S., & von Heijne, G. (2019). A brief history of protein sorting prediction. The protein journal, 38(3), 200-216.

[4] Viklund, H., & Elofsson, A. (2004). Best α‐helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Science, 13(7), 1908-1917.

[5] Romine, M. F. (2011). Genome-wide protein localization prediction strategies for gram negative bacteria. BMC genomics, 12(S1), S1.

[6] Yu, N. Y., Wagner, J. R., Laird, M. R., Melli, G., Rey, S., Lo, R., ... & Brinkman, F. S. (2010). PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics, 26(13), 1608-1615.

[7] Armenteros, J. J. A., Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., ... & Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature biotechnology, 37(4), 420-423.

[8] Juncker, A. S., Willenbrock, H., Von Heijne, G., Brunak, S., Nielsen, H., & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram‐negative bacteria. Protein Science, 12(8), 1652-1662.

[9] Bendtsen, J. D., Nielsen, H., Widdick, D., Palmer, T., & Brunak, S. (2005). Prediction of twin-arginine signal peptides. BMC bioinformatics, 6(1), 167.

[10] Käll, L., Krogh, A., & Sonnhammer, E. L. (2004). A combined transmembrane topology and signal peptide prediction method. Journal of molecular biology, 338(5), 1027-1036.

[11] Krogh, A., Larsson, B., Von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of molecular biology, 305(3), 567-580.

[12] Couvin, D., Bernheim, A., Toffano-Nioche, C., Touchon, M., Michalik, J., Néron, B., ... & Pourcel, C. (2018). CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic acids research, 46(W1), W246-W251.

[13] Mao, F., Dam, P., Chou, J., Olman, V., & Xu, Y. (2009). DOOR: a database for prokaryotic operons. Nucleic acids research, 37(Database issue), D459–D463. https://doi.org/10.1093/nar/gkn757

[14] Chen, L., Yang, J., Yu, J., Yao, Z., Sun, L., Shen, Y., & Jin, Q. (2005). VFDB: a reference database for bacterial virulence factors. Nucleic acids research, 33(Database issue), D325–D328. https://doi.org/10.1093/nar/gki008

[15] Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., Rattei, T., Mende, D. R., Sunagawa, S., Kuhn, M., Jensen, L. J., von Mering, C., & Bork, P. (2016). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293. https://doi.org/10.1093/nar/gkv1248

[16] McArthur et al. 2013. The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57, 3348-57.

[17] Nucleic Acids Res. 2009 Jan;37(Database issue):D443-7. doi: 10.1093/nar/gkn656. Epub 2008 Oct 2.

[18] Arndt, D., Grant, J., Marcu, A., Sajed, T., Pon, A., Liang, Y., Wishart, D.S. (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res., 2016 May 3.

[19] Arango-Argoty, G., Garner, E., Pruden, A. et al. DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome 6, 23 (2018). https://doi.org/10.1186/s40168-018-0401-z

[20] Ashburner et al. Gene ontology: tool for the unification of biology. Nat Genet. May 2000;25(1):25-9.

[21] Juncker, A. S., Willenbrock, H., von Heijne, G., Nielsen, H., Brunak, S., & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Science, 12(8), 1652-1662.

[22] Huerta-Cepas, J., Forslund, K., Coelho, L. P., Szklarczyk, D., Jensen, L. J., Von Mering, C., & Bork, P. (2017). Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Molecular biology and evolution, 34(8), 2115-2122.

[23] Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068-2069.

[24] Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., & Zhang, L. (2018). DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6(1), 1-15.

[25] Finn, R. D., Attwood, T. K., Babbitt, P. C., Bateman, A., Bork, P., Bridge, A. J., ... & Gough, J. (2017). InterPro in 2017—beyond protein family and domain annotations. Nucleic acids research, 45(D1), D190-D199.

[26] Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+

[27] Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature methods, 12(1), 59.

[28] Pearson W. R. (2013). An introduction to sequence similarity ("homology") searching. Current protocols in bioinformatics, Chapter 3, Unit3.1. https://doi.org/10.1002/0471250953.bi0301s42

[29] Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats, BMC Bioinformatics, Jan 20;8:18.

[30] Bland, C., Ramsey, T. L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N. C., & Hugenholtz, P. (2007). CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC bioinformatics, 8, 209. https://doi.org/10.1186/1471-2105-8-209